[00:36:46] FIRING: SystemdUnitFailed: logrotate.service on moss-be1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:46:25] (03PS2) 10Brennen Bearnes: logspam-watch.sh: fix or suppress various shellcheck warnings [puppet] - 10https://gerrit.wikimedia.org/r/1035018 (https://phabricator.wikimedia.org/T364083) [01:36:24] FIRING: [3x] HelmReleaseBadStatus: Helm release device-analytics/main on k8s@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [01:36:31] FIRING: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:41:46] FIRING: JobUnavailable: Reduced availability for job gerrit in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:43:56] RESOLVED: JobUnavailable: Reduced availability for job gerrit in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:45:13] FIRING: [2x] CertAlmostExpired: Certificate for service sessionstore:8081 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#sessionstore:8081 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [01:46:31] RESOLVED: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - 
https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:51:47] (03PS1) 10RLazarus: admin_ng: RBAC to allow mw-script user to attach to pods [deployment-charts] - 10https://gerrit.wikimedia.org/r/1035070 (https://phabricator.wikimedia.org/T341553) [01:58:46] (03CR) 10RLazarus: "I know this isn't what deployExtraClusterRoles was originally for, but it seems to already have some fairly diverse uses -- can I get away" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1035070 (https://phabricator.wikimedia.org/T341553) (owner: 10RLazarus) [02:31:39] (03PS3) 10Jdlrobson: Always use desktop watchlist HTML on mobile [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1034584 (https://phabricator.wikimedia.org/T109277) [02:36:46] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:01:46] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [04:18:56] FIRING: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:36:46] FIRING: SystemdUnitFailed: logrotate.service on moss-be1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:46:46] RESOLVED: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:57:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1226 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P62937 and previous config saved to /var/cache/conftool/dbconfig/20240523-045722-root.json [05:00:56] (03PS1) 10Marostegui: es2025: Remove package declaration [puppet] - 10https://gerrit.wikimedia.org/r/1035075 [05:01:21] (03CR) 10Marostegui: [C:03+2] es2025: Remove package declaration [puppet] - 10https://gerrit.wikimedia.org/r/1035075 (owner: 10Marostegui) [05:06:27] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1174', diff saved to https://phabricator.wikimedia.org/P62938 and previous config saved to /var/cache/conftool/dbconfig/20240523-050626-root.json [05:08:13] !log Install 10.6.18 on db1174 T365338 [05:08:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:08:20] T365338: MariaDB 10.6.18 released - https://phabricator.wikimedia.org/T365338 [05:09:14] 06SRE-OnFire, 14SRE-Sprint-Week-Sustainability-March2023, 06DBA, 07Schema-change-in-production, 10Sustainability (Incident Followup): Adjust the field type of globalblocks timestamp columns to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T307501#9824183 (10Marostegui) [05:09:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1174 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P62939 and previous config saved to /var/cache/conftool/dbconfig/20240523-050950-root.json [05:10:24] 06SRE-OnFire, 14SRE-Sprint-Week-Sustainability-March2023, 06DBA, 07Schema-change-in-production, 10Sustainability (Incident Followup): Adjust the field 
type of globalblocks timestamp columns to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T307501#9824185 (10Marostegui) As part of {T365338... [05:12:17] (03PS1) 10Marostegui: db1155: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1035076 (https://phabricator.wikimedia.org/T365557) [05:12:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1226 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P62940 and previous config saved to /var/cache/conftool/dbconfig/20240523-051228-root.json [05:12:48] (03CR) 10Marostegui: [C:03+2] db1155: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1035076 (https://phabricator.wikimedia.org/T365557) (owner: 10Marostegui) [05:17:27] !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host db1155.eqiad.wmnet with OS bookworm [05:24:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1174 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P62941 and previous config saved to /var/cache/conftool/dbconfig/20240523-052456-root.json [05:27:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1226 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P62942 and previous config saved to /var/cache/conftool/dbconfig/20240523-052734-root.json [05:32:14] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1155.eqiad.wmnet with reason: host reimage [05:35:13] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1155.eqiad.wmnet with reason: host reimage [05:36:25] FIRING: [3x] HelmReleaseBadStatus: Helm release device-analytics/main on k8s@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [05:40:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1174 (re)pooling @ 10%: Repooling', diff saved 
to https://phabricator.wikimedia.org/P62944 and previous config saved to /var/cache/conftool/dbconfig/20240523-054002-root.json [05:42:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1226 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P62945 and previous config saved to /var/cache/conftool/dbconfig/20240523-054240-root.json [05:44:41] RECOVERY - MegaRAID on es2022 is OK: OK: optimal, 1 logical, 12 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [05:45:13] FIRING: [2x] CertAlmostExpired: Certificate for service sessionstore:8081 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#sessionstore:8081 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [05:45:43] (03PS2) 10Fabfur: benthos:cache: drop part of haproxy internal messages [puppet] - 10https://gerrit.wikimedia.org/r/1035029 (https://phabricator.wikimedia.org/T359627) [05:47:27] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on 33 hosts with reason: Primary switchover s4 T363689 [05:47:31] T363689: Switchover s4 master (db1238 -> db1160) - https://phabricator.wikimedia.org/T363689 [05:47:55] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 33 hosts with reason: Primary switchover s4 T363689 [05:48:17] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Set db1160 with weight 0 T363689', diff saved to https://phabricator.wikimedia.org/P62946 and previous config saved to /var/cache/conftool/dbconfig/20240523-054816-arnaudb.json [05:49:27] (03CR) 10Fabfur: [C:03+2] benthos:cache: drop part of haproxy internal messages [puppet] - 10https://gerrit.wikimedia.org/r/1035029 (https://phabricator.wikimedia.org/T359627) (owner: 10Fabfur) [05:54:50] (03CR) 10VolkerE: [C:03+1] Always use desktop watchlist HTML on mobile [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1034584 (https://phabricator.wikimedia.org/T109277) (owner: 10Jdlrobson) 
[05:55:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1174 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P62947 and previous config saved to /var/cache/conftool/dbconfig/20240523-055508-root.json [05:55:16] (03CR) 10VolkerE: [C:03+1] Drop unused config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031459 (https://phabricator.wikimedia.org/T301212) (owner: 10Jdlrobson) [05:56:00] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1155.eqiad.wmnet with OS bookworm [05:57:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1226 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P62948 and previous config saved to /var/cache/conftool/dbconfig/20240523-055747-root.json [05:59:32] (03PS1) 10Marostegui: db1155: Remove package declaration [puppet] - 10https://gerrit.wikimedia.org/r/1035082 (https://phabricator.wikimedia.org/T365557) [06:00:04] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240523T0600) [06:00:04] kormat, marostegui, Amir1, and arnaudb: OwO what's this, a deployment window?? Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240523T0600). nyaa~ [06:00:14] (03CR) 10Marostegui: [C:03+2] db1155: Remove package declaration [puppet] - 10https://gerrit.wikimedia.org/r/1035082 (https://phabricator.wikimedia.org/T365557) (owner: 10Marostegui) [06:05:32] (03CR) 10Slyngshede: [C:03+1] "LGTM" [software/netbox-deploy] (wmf-next) - 10https://gerrit.wikimedia.org/r/1034963 (owner: 10Ayounsi) [06:07:37] (03CR) 10Slyngshede: [C:04-1] "Ah, I didn't see the other patch." 
[software/netbox-deploy] (wmf-next) - 10https://gerrit.wikimedia.org/r/1034963 (owner: 10Ayounsi) [06:10:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1174 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P62949 and previous config saved to /var/cache/conftool/dbconfig/20240523-061014-root.json [06:13:05] !log Starting s4 eqiad failover from db1238 to db1160 - T363689 [06:13:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:13:09] T363689: Switchover s4 master (db1238 -> db1160) - https://phabricator.wikimedia.org/T363689 [06:14:09] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Set s4 eqiad as read-only for maintenance - T363689', diff saved to https://phabricator.wikimedia.org/P62950 and previous config saved to /var/cache/conftool/dbconfig/20240523-061408-arnaudb.json [06:15:25] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Promote db1160 to s4 primary and set section read-write T363689', diff saved to https://phabricator.wikimedia.org/P62951 and previous config saved to /var/cache/conftool/dbconfig/20240523-061524-arnaudb.json [06:16:45] !log jelto@cumin1002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1003.wikimedia.org with reason: Upgrade GitLab Replica to new version [06:17:21] (03CR) 10Arnaudb: [C:03+2] mariadb: Promote db1160 to s4 master [puppet] - 10https://gerrit.wikimedia.org/r/1024756 (https://phabricator.wikimedia.org/T363689) (owner: 10Gerrit maintenance bot) [06:20:22] (03PS1) 10Arnaudb: mariadb: hotfix s4 config [puppet] - 10https://gerrit.wikimedia.org/r/1034931 (https://phabricator.wikimedia.org/T363689) [06:20:47] (03CR) 10Marostegui: [C:03+1] mariadb: hotfix s4 config [puppet] - 10https://gerrit.wikimedia.org/r/1034931 (https://phabricator.wikimedia.org/T363689) (owner: 10Arnaudb) [06:21:20] (03CR) 10Arnaudb: [C:03+2] mariadb: hotfix s4 config [puppet] - 10https://gerrit.wikimedia.org/r/1034931 (https://phabricator.wikimedia.org/T363689) (owner: 10Arnaudb) 
[06:24:08] !log jelto@cumin1002 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1003.wikimedia.org with reason: Upgrade GitLab Replica to new version [06:24:13] (03CR) 10Arnaudb: [C:03+2] wmnet: Update s4-master alias [dns] - 10https://gerrit.wikimedia.org/r/1024757 (https://phabricator.wikimedia.org/T363689) (owner: 10Gerrit maintenance bot) [06:24:58] !log jelto@cumin1002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1004.wikimedia.org with reason: Upgrade GitLab Replica to new version [06:25:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1174 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P62952 and previous config saved to /var/cache/conftool/dbconfig/20240523-062521-root.json [06:26:24] (03PS1) 10Arnaudb: wmnet: Update s4-master alias [dns] - 10https://gerrit.wikimedia.org/r/1034932 (https://phabricator.wikimedia.org/T363689) [06:26:37] (03CR) 10Marostegui: [C:03+1] wmnet: Update s4-master alias [dns] - 10https://gerrit.wikimedia.org/r/1034932 (https://phabricator.wikimedia.org/T363689) (owner: 10Arnaudb) [06:27:44] (03CR) 10Arnaudb: [C:03+2] wmnet: Update s4-master alias [dns] - 10https://gerrit.wikimedia.org/r/1034932 (https://phabricator.wikimedia.org/T363689) (owner: 10Arnaudb) [06:30:26] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depool db1238 T363689', diff saved to https://phabricator.wikimedia.org/P62953 and previous config saved to /var/cache/conftool/dbconfig/20240523-063025-arnaudb.json [06:30:30] T363689: Switchover s4 master (db1238 -> db1160) - https://phabricator.wikimedia.org/T363689 [06:31:56] !log jelto@cumin1002 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1004.wikimedia.org with reason: Upgrade GitLab Replica to new version [06:32:13] (03CR) 10DCausse: [C:03+1] cirrus: Keep archive writes running through cirrus [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1035061 (owner: 10Ebernhardson) [06:40:10] (03CR) 10Muehlenhoff: 
[C:03+2] profile::parsoid::mediawiki: Don't hardcode the PHP version [puppet] - 10https://gerrit.wikimedia.org/r/1034535 (owner: 10Muehlenhoff) [06:40:27] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1174 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P62954 and previous config saved to /var/cache/conftool/dbconfig/20240523-064027-root.json [06:45:27] !log dcausse@deploy1002 Started deploy [airflow-dags/search@49369da]: search: automate graph split and n3 dump generation [06:45:46] !log dcausse@deploy1002 Finished deploy [airflow-dags/search@49369da]: search: automate graph split and n3 dump generation (duration: 00m 19s) [06:55:01] (03CR) 10JMeybohm: [C:03+1] kask: checksum tls certs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1035010 (https://phabricator.wikimedia.org/T363996) (owner: 10Hnowlan) [07:00:05] Amir1 and Urbanecm: Time to snap out of that daydream and deploy UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240523T0700). [07:00:05] dcausse: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[07:00:23] o/ [07:00:45] I can deploy [07:01:12] (03CR) 10DCausse: [C:03+2] extension registration: Fix handling of null default values [core] (wmf/1.43.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1034989 (https://phabricator.wikimedia.org/T365190) (owner: 10DCausse) [07:01:46] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dcausse@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1035061 (owner: 10Ebernhardson) [07:02:27] (03Merged) 10jenkins-bot: cirrus: Keep archive writes running through cirrus [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1035061 (owner: 10Ebernhardson) [07:02:32] (03CR) 10Jelto: [C:03+2] gitlab.runners: Add *.toolforge.org to allowed services [puppet] - 10https://gerrit.wikimedia.org/r/1034971 (https://phabricator.wikimedia.org/T365561) (owner: 10BryanDavis) [07:03:21] !log dcausse@deploy1002 Started scap: Backport for [[gerrit:1035061|cirrus: Keep archive writes running through cirrus]] [07:06:08] !log dcausse@deploy1002 ebernhardson and dcausse: Backport for [[gerrit:1035061|cirrus: Keep archive writes running through cirrus]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [07:08:30] !log dcausse@deploy1002 ebernhardson and dcausse: Continuing with sync [07:13:03] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on db1172 - https://phabricator.wikimedia.org/T365346#9824345 (10Marostegui) @Jclark-ctr @VRiley-WMF do you happen to have spare disks? 
[07:14:01] (03PS4) 10Stevemunene: dns: provision datahub-next subdomain [dns] - 10https://gerrit.wikimedia.org/r/1034887 (https://phabricator.wikimedia.org/T365576) [07:16:00] (03PS1) 10Marostegui: Revert "db1155: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1035147 [07:16:10] (03PS2) 10Ayounsi: Add ApereoSocialPipeline [software/netbox-deploy] (wmf-next) - 10https://gerrit.wikimedia.org/r/1034962 (https://phabricator.wikimedia.org/T308002) [07:16:10] (03PS2) 10Ayounsi: Update requirements [software/netbox-deploy] (wmf-next) - 10https://gerrit.wikimedia.org/r/1034963 [07:16:12] (03CR) 10CI reject: [V:04-1] Revert "db1155: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1035147 (owner: 10Marostegui) [07:16:26] (03PS5) 10Stevemunene: dns: provision datahub-next subdomain [dns] - 10https://gerrit.wikimedia.org/r/1034887 (https://phabricator.wikimedia.org/T365576) [07:17:07] (03CR) 10Ayounsi: Update requirements (031 comment) [software/netbox-deploy] (wmf-next) - 10https://gerrit.wikimedia.org/r/1034963 (owner: 10Ayounsi) [07:17:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [07:17:27] (03PS10) 10Stevemunene: provision datahub service records [dns] - 10https://gerrit.wikimedia.org/r/1032393 (https://phabricator.wikimedia.org/T363299) [07:17:29] (03PS1) 10Marostegui: db1155: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1035267 [07:17:37] (03Abandoned) 10Marostegui: Revert "db1155: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1035147 (owner: 10Marostegui) [07:17:45] (03CR) 10Slyngshede: [C:03+1] "LGTM" [software/netbox-deploy] (wmf-next) - 10https://gerrit.wikimedia.org/r/1034963 (owner: 10Ayounsi) [07:18:02] (03CR) 10Marostegui: [C:03+2] db1155: Enable notifications [puppet] - 
10https://gerrit.wikimedia.org/r/1035267 (owner: 10Marostegui) [07:18:10] (03CR) 10Ayounsi: [C:03+2] sre.hosts.rename: initial commit [cookbooks] - 10https://gerrit.wikimedia.org/r/1008818 (owner: 10Ayounsi) [07:18:26] (03CR) 10Ayounsi: sre.hosts.rename: initial commit [cookbooks] - 10https://gerrit.wikimedia.org/r/1008818 (owner: 10Ayounsi) [07:20:40] !log dcausse@deploy1002 Finished scap: Backport for [[gerrit:1035061|cirrus: Keep archive writes running through cirrus]] (duration: 17m 19s) [07:21:59] (03Merged) 10jenkins-bot: extension registration: Fix handling of null default values [core] (wmf/1.43.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1034989 (https://phabricator.wikimedia.org/T365190) (owner: 10DCausse) [07:22:08] (03PS1) 10Stevemunene: trafficserver: add datahub and datahub-next redirects [puppet] - 10https://gerrit.wikimedia.org/r/1035268 (https://phabricator.wikimedia.org/T365668) [07:22:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [07:22:31] (03CR) 10Ayounsi: [C:03+1] Rename kubernetes2023 to wikikube-worker2001 [puppet] - 10https://gerrit.wikimedia.org/r/1034976 (https://phabricator.wikimedia.org/T365571) (owner: 10JMeybohm) [07:23:19] (03CR) 10Ayounsi: [C:03+1] Add wikikube-worker config [puppet] - 10https://gerrit.wikimedia.org/r/1034956 (https://phabricator.wikimedia.org/T365571) (owner: 10JMeybohm) [07:25:30] !log dcausse@deploy1002 Started scap: Backport for [[gerrit:1034989|extension registration: Fix handling of null default values (T365190)]] [07:25:36] T365190: Cannot provide empty array to wikis as $wgCirrusSearchWriteClusters - https://phabricator.wikimedia.org/T365190 [07:26:14] (03CR) 10JMeybohm: [C:03+2] Rename kubernetes2023 to wikikube-worker2001 [puppet] - 10https://gerrit.wikimedia.org/r/1034976 (https://phabricator.wikimedia.org/T365571) (owner: 
10JMeybohm) [07:26:17] (03CR) 10JMeybohm: [C:03+2] Add wikikube-worker config [puppet] - 10https://gerrit.wikimedia.org/r/1034956 (https://phabricator.wikimedia.org/T365571) (owner: 10JMeybohm) [07:28:10] !log dcausse@deploy1002 dcausse: Backport for [[gerrit:1034989|extension registration: Fix handling of null default values (T365190)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [07:30:18] !log dcausse@deploy1002 dcausse: Continuing with sync [07:32:15] (03CR) 10Brouberol: [C:03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/1032775 (owner: 10Muehlenhoff) [07:33:30] (03PS2) 10Muehlenhoff: cloudweb: Enable profile::auto_restarts::service for FPM [puppet] - 10https://gerrit.wikimedia.org/r/1026453 (https://phabricator.wikimedia.org/T135991) [07:33:47] (03CR) 10Brouberol: dse-k8s: add new airflow service to k8s cluster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1034961 (https://phabricator.wikimedia.org/T363001) (owner: 10Bking) [07:35:09] (03CR) 10Brouberol: [C:03+1] "Thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/1034892 (owner: 10Muehlenhoff) [07:35:12] (03PS3) 10Muehlenhoff: cloudweb: Enable profile::auto_restarts::service for FPM [puppet] - 10https://gerrit.wikimedia.org/r/1026453 (https://phabricator.wikimedia.org/T135991) [07:35:52] (03CR) 10Muehlenhoff: "I've added a new wmflib::wmf_php_version() function and updated the patch to use it." 
[puppet] - 10https://gerrit.wikimedia.org/r/1026453 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [07:36:07] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1026453 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [07:36:26] (03CR) 10Brouberol: dse-k8s: add airflow namespace (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1035015 (https://phabricator.wikimedia.org/T363001) (owner: 10Bking) [07:39:25] FIRING: [4x] SystemdUnitFailed: httpbb_kubernetes_mw-api-ext_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:42:24] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [07:42:27] !log dcausse@deploy1002 Finished scap: Backport for [[gerrit:1034989|extension registration: Fix handling of null default values (T365190)]] (duration: 16m 56s) [07:42:35] T365190: Cannot provide empty array to wikis as $wgCirrusSearchWriteClusters - https://phabricator.wikimedia.org/T365190 [07:44:25] FIRING: [5x] SystemdUnitFailed: httpbb_kubernetes_mw-api-ext_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:46:21] (03CR) 10Jelto: [C:03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/1035050 (https://phabricator.wikimedia.org/T365395) (owner: 10JHathaway) [07:48:49] !log ayounsi@cumin1002 START - Cookbook sre.hosts.rename from kubernetes2023 to wikikube-worker2001 [07:49:09] !log ayounsi@cumin1002 END (FAIL) - Cookbook sre.hosts.rename (exit_code=99) from 
kubernetes2023 to wikikube-worker2001 [07:51:55] hashar, andre while deploying the train it is possible that you see errors like "Received cirrusSearchElasticaWrite job with page updates for an unwritable cluster eqiad", these are expected and should not last long [07:52:06] (03PS2) 10Gerrit maintenance bot: wmnet: Update s8-master alias [dns] - 10https://gerrit.wikimedia.org/r/1028940 (https://phabricator.wikimedia.org/T364541) [07:54:47] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1238.eqiad.wmnet with reason: reimage [07:55:00] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1238.eqiad.wmnet with reason: reimage [07:55:06] (03PS1) 10Marostegui: mariadb: Promote db1192 to master [puppet] - 10https://gerrit.wikimedia.org/r/1035315 (https://phabricator.wikimedia.org/T364541) [07:55:47] (03PS8) 10Ayounsi: sre.hosts.rename: initial commit [cookbooks] - 10https://gerrit.wikimedia.org/r/1008818 [07:56:04] (03CR) 10Brouberol: "I have a couple of suggestions and a lot of toughts." [dns] - 10https://gerrit.wikimedia.org/r/1032393 (https://phabricator.wikimedia.org/T363299) (owner: 10Stevemunene) [07:56:13] (03CR) 10Marostegui: [C:04-2] "Not yet" [puppet] - 10https://gerrit.wikimedia.org/r/1035315 (https://phabricator.wikimedia.org/T364541) (owner: 10Marostegui) [07:56:35] !log arnaudb@cumin1002 START - Cookbook sre.hosts.reimage for host db1238.eqiad.wmnet with OS bookworm [07:56:43] (03CR) 10Volans: [C:03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/1008818 (owner: 10Ayounsi) [07:57:04] dcausse: Thanks for the info! 
There are already quite a few in Logstash but I assumed they come from the backport window [07:57:17] (03CR) 10JMeybohm: [C:03+1] sre.hosts.rename: initial commit [cookbooks] - 10https://gerrit.wikimedia.org/r/1008818 (owner: 10Ayounsi) [07:57:30] !log ayounsi@cumin1002 START - Cookbook sre.hosts.rename from kubernetes2023 to wikikube-worker2001 [07:57:36] !log ayounsi@cumin1002 START - Cookbook sre.dns.netbox [07:57:59] dcausse, I guess you've finished backporting? [07:59:41] andre: yes, sorry forgot to mention that [07:59:51] nah, all good. thank [07:59:54] !log ayounsi@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes2023 to wikikube-worker2001 - ayounsi@cumin1002" [08:00:05] andre and hashar: Deploy window MediaWiki train - Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240523T0800) [08:01:16] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes2023 to wikikube-worker2001 - ayounsi@cumin1002" [08:01:16] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [08:01:16] !log ayounsi@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2001 [08:01:36] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2001 [08:02:15] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from kubernetes2023 to wikikube-worker2001 [08:03:56] FIRING: RdfStreamingUpdaterFlinkJobUnstable: WDQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=rdf-streaming-updater&var-helm_release=wikidata - 
https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [08:04:25] FIRING: [6x] SystemdUnitFailed: httpbb_kubernetes_mw-api-ext_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:04:40] FIRING: [6x] SystemdUnitFailed: httpbb_kubernetes_mw-api-ext_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:04:50] (03CR) 10Filippo Giunchedi: [C:03+2] pki: add temporary profile for prometheus + k8s [puppet] - 10https://gerrit.wikimedia.org/r/1034048 (https://phabricator.wikimedia.org/T343529) (owner: 10Filippo Giunchedi) [08:05:46] (03PS1) 10David Caro: Reapply "openstack::bobcat: apply cloud yaml patch"" [puppet] - 10https://gerrit.wikimedia.org/r/1035148 [08:05:53] (03CR) 10Stevemunene: [C:03+2] dns: provision datahub-next subdomain [dns] - 10https://gerrit.wikimedia.org/r/1034887 (https://phabricator.wikimedia.org/T365576) (owner: 10Stevemunene) [08:07:44] !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2001.codfw.wmnet with OS bullseye [08:08:56] RESOLVED: RdfStreamingUpdaterFlinkJobUnstable: WDQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=rdf-streaming-updater&var-helm_release=wikidata - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [08:09:08] (03CR) 10David Caro: "yay 🎉" [puppet] - 10https://gerrit.wikimedia.org/r/1034971 (https://phabricator.wikimedia.org/T365561) (owner: 10BryanDavis) [08:09:25] FIRING: [6x] SystemdUnitFailed: 
httpbb_kubernetes_mw-api-ext_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:10:34] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:10:39] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1238.eqiad.wmnet with reason: host reimage [08:10:44] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:11:13] this is me and XioNoX ^ [08:11:54] !log klausman@cumin2002 START - Cookbook sre.hosts.reboot-single for host ml-serve2001.codfw.wmnet [08:13:20] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1238.eqiad.wmnet with reason: host reimage [08:15:43] !log klausman@cumin2002 START - Cookbook sre.hosts.reboot-single for host ml-staging2001.codfw.wmnet [08:16:06] (03PS1) 10TrainBranchBot: group2 wikis to 1.43.0-wmf.6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1035320 (https://phabricator.wikimedia.org/T361400) [08:16:08] (03CR) 10TrainBranchBot: [C:03+2] group2 wikis to 1.43.0-wmf.6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1035320 (https://phabricator.wikimedia.org/T361400) (owner: 10TrainBranchBot) [08:16:49] (03Merged) 10jenkins-bot: group2 wikis to 1.43.0-wmf.6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1035320 (https://phabricator.wikimedia.org/T361400) (owner: 10TrainBranchBot) [08:18:21] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db2116.codfw.wmnet [08:20:47] !log klausman@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single 
(exit_code=0) for host ml-staging2001.codfw.wmnet [08:21:59] (03PS1) 10Muehlenhoff: Switch db2116 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1035321 (https://phabricator.wikimedia.org/T349619) [08:23:11] !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2001.codfw.wmnet with reason: host reimage [08:25:51] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2001.codfw.wmnet with reason: host reimage [08:29:23] (03CR) 10Muehlenhoff: [C:03+2] Switch db2116 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1035321 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [08:29:25] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-api-ext_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:31:40] !log aklapper@deploy1002 rebuilt and synchronized wikiversions files: group2 wikis to 1.43.0-wmf.6 refs T361400 [08:31:44] T361400: 1.43.0-wmf.6 deployment blockers - https://phabricator.wikimedia.org/T361400 [08:31:54] PROBLEM - Host ml-serve2001 is DOWN: PING CRITICAL - Packet loss = 100% [08:32:43] (03Abandoned) 10Isabelle Hurbain-Palatin: Fix serialization errors in PageBundle extensiondata [core] (wmf/1.43.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1032807 (https://phabricator.wikimedia.org/T365036) (owner: 10C. 
Scott Ananian) [08:32:46] RECOVERY - Host ml-serve2001 is UP: PING OK - Packet loss = 0%, RTA = 30.36 ms [08:32:51] FIRING: KubernetesCalicoDown: ml-serve2001.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2001.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [08:33:39] !log jiji@cumin2002 START - Cookbook sre.hosts.reimage for host mc1054.eqiad.wmnet with OS bookworm [08:33:41] !log jiji@cumin1002 START - Cookbook sre.hosts.reimage for host mc2054.codfw.wmnet with OS bookworm [08:34:34] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1238.eqiad.wmnet with OS bookworm [08:34:38] (03CR) 10Filippo Giunchedi: [C:03+2] prometheus: use 'prometheus' profile for k8s certs [puppet] - 10https://gerrit.wikimedia.org/r/1034050 (https://phabricator.wikimedia.org/T343529) (owner: 10Filippo Giunchedi) [08:34:47] (03PS2) 10Filippo Giunchedi: prometheus: use 'prometheus' profile for k8s certs [puppet] - 10https://gerrit.wikimedia.org/r/1034050 (https://phabricator.wikimedia.org/T343529) [08:34:56] (03CR) 10Filippo Giunchedi: [V:03+2 C:03+2] prometheus: use 'prometheus' profile for k8s certs [puppet] - 10https://gerrit.wikimedia.org/r/1034050 (https://phabricator.wikimedia.org/T343529) (owner: 10Filippo Giunchedi) [08:35:27] !log klausman@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-serve2001.codfw.wmnet [08:36:34] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db2116.codfw.wmnet [08:36:46] FIRING: SystemdUnitFailed: logrotate.service on moss-be1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:37:51] RESOLVED: 
KubernetesCalicoDown: ml-serve2001.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2001.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [08:38:09] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db2130.codfw.wmnet [08:39:25] RESOLVED: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-api-ext_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:39:59] (03PS1) 10Muehlenhoff: Switch db2130 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1035326 (https://phabricator.wikimedia.org/T349619) [08:42:24] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [08:42:50] (03CR) 10Muehlenhoff: [C:03+2] Switch db2130 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1035326 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [08:43:04] (03PS1) 10Slyngshede: Docker: Allow configuration of LDAP authentication. 
[software/bitu] - 10https://gerrit.wikimedia.org/r/1035327 [08:44:55] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2001.codfw.wmnet with OS bullseye [08:48:22] !log jiji@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mc1054.eqiad.wmnet with reason: host reimage [08:48:35] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1238 (re)pooling @ 1%: post reimage repool', diff saved to https://phabricator.wikimedia.org/P62959 and previous config saved to /var/cache/conftool/dbconfig/20240523-084834-arnaudb.json [08:49:19] (03CR) 10Slyngshede: [C:03+2] Docker: Allow configuration of LDAP authentication. [software/bitu] - 10https://gerrit.wikimedia.org/r/1035327 (owner: 10Slyngshede) [08:49:34] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db2130.codfw.wmnet [08:49:44] 10SRE-tools, 10Spicerack: Redfish _get_dummy_response() should return empty json - https://phabricator.wikimedia.org/T365680 (10ayounsi) 03NEW p:05Triage→03Low [08:50:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1238', diff saved to https://phabricator.wikimedia.org/P62960 and previous config saved to /var/cache/conftool/dbconfig/20240523-085023-root.json [08:50:48] (03Merged) 10jenkins-bot: Docker: Allow configuration of LDAP authentication. 
[software/bitu] - 10https://gerrit.wikimedia.org/r/1035327 (owner: 10Slyngshede) [08:50:55] (03CR) 10Ayounsi: [C:03+2] sre.hosts.rename: initial commit [cookbooks] - 10https://gerrit.wikimedia.org/r/1008818 (owner: 10Ayounsi) [08:51:10] (03PS1) 10Effie Mouzeli: mediawiki::memcached: switch to running as user memcache mcX050-mcX054 [puppet] - 10https://gerrit.wikimedia.org/r/1035328 (https://phabricator.wikimedia.org/T273950) [08:51:19] !log Deploy schema change on s4 eqiad old master db1238 dbmaint T356166 [08:51:20] !log jiji@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc1054.eqiad.wmnet with reason: host reimage [08:51:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:51:23] T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166 [08:51:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1214 (T364299)', diff saved to https://phabricator.wikimedia.org/P62961 and previous config saved to /var/cache/conftool/dbconfig/20240523-085137-marostegui.json [08:51:41] !log jiji@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mc2054.codfw.wmnet with reason: host reimage [08:51:42] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [08:52:25] (03PS2) 10JMeybohm: Rename kubernetes2032 to wikikube-worker2002 [puppet] - 10https://gerrit.wikimedia.org/r/1034977 (https://phabricator.wikimedia.org/T365571) [08:52:58] [TRAIN] Finished the wmf.6 deployment to group2. All seems fine. 
[08:53:55] (03CR) 10Muehlenhoff: [C:03+2] kafka::mirror: Drop support for non PKI configs [puppet] - 10https://gerrit.wikimedia.org/r/1034892 (owner: 10Muehlenhoff) [08:54:15] (03CR) 10Ayounsi: [C:03+1] Rename kubernetes2032 to wikikube-worker2002 [puppet] - 10https://gerrit.wikimedia.org/r/1034977 (https://phabricator.wikimedia.org/T365571) (owner: 10JMeybohm) [08:54:28] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc2054.codfw.wmnet with reason: host reimage [08:54:39] (03Merged) 10jenkins-bot: sre.hosts.rename: initial commit [cookbooks] - 10https://gerrit.wikimedia.org/r/1008818 (owner: 10Ayounsi) [08:55:57] (03CR) 10Effie Mouzeli: [C:03+2] "PCC OK https://puppet-compiler.wmflabs.org/output/1035328/2601/" [puppet] - 10https://gerrit.wikimedia.org/r/1035328 (https://phabricator.wikimedia.org/T273950) (owner: 10Effie Mouzeli) [08:58:27] (03CR) 10Ayounsi: [C:03+1] sre.hosts.reimage: add support for VLAN move (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1007652 (https://phabricator.wikimedia.org/T350152) (owner: 10Volans) [08:58:31] (03PS1) 10Muehlenhoff: kafka::mirror: Remove obsolete class parameter [puppet] - 10https://gerrit.wikimedia.org/r/1035329 [09:01:43] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1035329 (owner: 10Muehlenhoff) [09:02:15] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1032775 (owner: 10Muehlenhoff) [09:04:11] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/device-analytics: apply [09:04:30] !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/device-analytics: apply [09:05:42] (03CR) 10Muehlenhoff: "The PCC output for kafka/jumbo is a PCC glitch (race in updating the hosts data in the worker nodes)" [puppet] - 10https://gerrit.wikimedia.org/r/1035329 (owner: 10Muehlenhoff) [09:06:10] FIRING: [3x] HelmReleaseBadStatus: Helm release device-analytics/main on 
k8s@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [09:06:38] (03CR) 10Muehlenhoff: [C:03+2] druid: Switch the Zookeeper firewall settings to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1032775 (owner: 10Muehlenhoff) [09:06:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P62962 and previous config saved to /var/cache/conftool/dbconfig/20240523-090645-marostegui.json [09:07:07] !log btullis@cumin1002 START - Cookbook sre.discovery.service-route depool device-analytics in codfw: maintenance [09:08:53] !log jiji@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc1054.eqiad.wmnet with OS bookworm [09:12:11] !log btullis@cumin1002 END (PASS) - Cookbook sre.discovery.service-route (exit_code=0) depool device-analytics in codfw: maintenance [09:12:48] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc2054.codfw.wmnet with OS bookworm [09:13:54] (03PS1) 10Muehlenhoff: Remove profile::zookeeper::firewall::srange [puppet] - 10https://gerrit.wikimedia.org/r/1035334 [09:13:55] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 447, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:13:59] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 523, down: 2, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:15:07] PROBLEM - Host mc2054 is DOWN: PING CRITICAL - Packet loss = 100% [09:15:35] RECOVERY - Host mc2054 is UP: PING OK - Packet loss = 0%, RTA = 30.28 ms [09:17:40] (03CR) 10Hnowlan: [C:03+2] kask: checksum tls certs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1035010 (https://phabricator.wikimedia.org/T363996) (owner: 10Hnowlan) [09:17:44] (03CR) 10Muehlenhoff: "check experimental" 
[puppet] - 10https://gerrit.wikimedia.org/r/1035334 (owner: 10Muehlenhoff) [09:17:49] !log btullis@deploy1002 helmfile [codfw] START helmfile.d/services/device-analytics: apply [09:18:08] !log btullis@deploy1002 helmfile [codfw] DONE helmfile.d/services/device-analytics: apply [09:18:45] (03Merged) 10jenkins-bot: kask: checksum tls certs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1035010 (https://phabricator.wikimedia.org/T363996) (owner: 10Hnowlan) [09:18:50] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db2145.codfw.wmnet [09:19:42] !log btullis@cumin1002 START - Cookbook sre.discovery.service-route pool device-analytics in codfw: maintenance [09:21:09] 06SRE, 10Release-Engineering-Team (Radar): scap train failure due to earlier host rename - https://phabricator.wikimedia.org/T365683 (10Aklapper) 03NEW [09:21:10] FIRING: [3x] HelmReleaseBadStatus: Helm release device-analytics/main on k8s@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [09:21:55] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P62963 and previous config saved to /var/cache/conftool/dbconfig/20240523-092153-marostegui.json [09:22:14] (03PS1) 10Muehlenhoff: Switch db2145 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1035344 (https://phabricator.wikimedia.org/T349619) [09:22:23] (03CR) 10Jelto: [V:03+1 C:03+2] prometheus::ops: scrape custom gitlab exporter [puppet] - 10https://gerrit.wikimedia.org/r/1029169 (https://phabricator.wikimedia.org/T354656) (owner: 10Jelto) [09:23:21] (03CR) 10Muehlenhoff: [C:03+2] Switch db2145 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1035344 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [09:24:46] !log btullis@cumin1002 END (PASS) - Cookbook sre.discovery.service-route 
(exit_code=0) pool device-analytics in codfw: maintenance [09:25:55] PROBLEM - Host mc1054 is DOWN: PING CRITICAL - Packet loss = 100% [09:27:23] RECOVERY - Host mc1054 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms [09:27:41] (03PS1) 10Btullis: Add a new partman reuser recipe for stat1008 [puppet] - 10https://gerrit.wikimedia.org/r/1035346 (https://phabricator.wikimedia.org/T329360) [09:28:02] (03PS2) 10Btullis: Add a new partman reuse recipe for stat1008 [puppet] - 10https://gerrit.wikimedia.org/r/1035346 (https://phabricator.wikimedia.org/T329360) [09:28:44] 06SRE, 06Infrastructure-Foundations, 10netops: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - https://phabricator.wikimedia.org/T348977#9824816 (10cmooney) We also now have the issue from T365204 that we can resolve with an upgrade of JunOS. Not essential in eqiad but still I think we need to stop proc... [09:29:52] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/sessionstore: apply [09:30:02] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/sessionstore: apply [09:30:15] (03CR) 10Brouberol: [C:03+1] "Looks good!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1035346 (https://phabricator.wikimedia.org/T329360) (owner: 10Btullis) [09:30:40] !log installing zeromq3 bugfix updates from Bullseye point release [09:30:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:31:44] (03CR) 10CI reject: [V:04-1] Add a new partman reuse recipe for stat1008 [puppet] - 10https://gerrit.wikimedia.org/r/1035346 (https://phabricator.wikimedia.org/T329360) (owner: 10Btullis) [09:32:18] (03PS3) 10Btullis: Add a new partman reuse recipe for stat1008 [puppet] - 10https://gerrit.wikimedia.org/r/1035346 (https://phabricator.wikimedia.org/T329360) [09:32:52] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db2145.codfw.wmnet [09:36:36] (03CR) 10Btullis: [C:03+2] Add a new partman reuse recipe for stat1008 [puppet] - 10https://gerrit.wikimedia.org/r/1035346 (https://phabricator.wikimedia.org/T329360) (owner: 10Btullis) [09:36:44] 06SRE, 06Infrastructure-Foundations: Integrate Bullseye 11.9 point update - https://phabricator.wikimedia.org/T357144#9824899 (10MoritzMuehlenhoff) [09:37:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1214 (T364299)', diff saved to https://phabricator.wikimedia.org/P62965 and previous config saved to /var/cache/conftool/dbconfig/20240523-093703-marostegui.json [09:37:06] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1216.eqiad.wmnet with reason: Maintenance [09:37:07] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [09:37:08] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1216.eqiad.wmnet with reason: Maintenance [09:37:08] !log btullis@cumin1002 START - Cookbook sre.discovery.service-route depool device-analytics in eqiad: maintenance [09:37:11] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1226.eqiad.wmnet with 
reason: Maintenance [09:37:13] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1226.eqiad.wmnet with reason: Maintenance [09:37:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1226 (T364299)', diff saved to https://phabricator.wikimedia.org/P62966 and previous config saved to /var/cache/conftool/dbconfig/20240523-093720-marostegui.json [09:38:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1226 (T364299)', diff saved to https://phabricator.wikimedia.org/P62967 and previous config saved to /var/cache/conftool/dbconfig/20240523-093830-marostegui.json [09:39:39] FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1006-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [09:39:45] (03CR) 10Ayounsi: [C:03+1] sre.hosts.move-vlan: add new cookbook (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/981472 (https://phabricator.wikimedia.org/T350152) (owner: 10Ayounsi) [09:40:13] (03PS24) 10Ayounsi: sre.hosts.move-vlan: add new cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/981472 (https://phabricator.wikimedia.org/T350152) [09:42:12] !log btullis@cumin1002 END (PASS) - Cookbook sre.discovery.service-route (exit_code=0) depool device-analytics in eqiad: maintenance [09:42:37] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db2146.codfw.wmnet [09:43:39] (03CR) 10CI reject: [V:04-1] sre.hosts.move-vlan: add new cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/981472 (https://phabricator.wikimedia.org/T350152) (owner: 10Ayounsi) [09:44:17] (03PS1) 10Muehlenhoff: Switch db2146 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1035347 (https://phabricator.wikimedia.org/T349619) 
[09:44:55] PROBLEM - BFD status on cr2-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [09:45:13] FIRING: [2x] CertAlmostExpired: Certificate for service sessionstore:8081 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#sessionstore:8081 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [09:45:21] (03PS25) 10Ayounsi: sre.hosts.move-vlan: add new cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/981472 (https://phabricator.wikimedia.org/T350152) [09:45:57] PROBLEM - OSPF status on cr1-esams is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:46:32] (03CR) 10Muehlenhoff: [C:03+2] Switch db2146 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1035347 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [09:46:55] RECOVERY - BFD status on cr2-eqiad is OK: UP: 20 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [09:46:57] RECOVERY - OSPF status on cr1-esams is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:47:23] !log btullis@cumin1002 START - Cookbook sre.hosts.reimage for host stat1008.eqiad.wmnet with OS bullseye [09:49:58] !log btullis@deploy1002 helmfile [eqiad] START helmfile.d/services/device-analytics: apply [09:50:26] !log btullis@deploy1002 helmfile [eqiad] DONE helmfile.d/services/device-analytics: apply [09:51:10] RESOLVED: HelmReleaseBadStatus: Helm release device-analytics/main on k8s@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s&var-namespace=device-analytics - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [09:51:20] 06SRE, 10Release-Engineering-Team (Radar): scap train failure due to earlier host rename - 
https://phabricator.wikimedia.org/T365683#9824942 (10hashar) When looking at deploy1002 `/etc/ssh/ssh_known_hosts` was last modified at 5:56 The host got renamed around ~ 08:02 and kubernetes2003 was still showing in `... [09:53:26] 06SRE, 06Infrastructure-Foundations: Integrate Bullseye 11.9 point update - https://phabricator.wikimedia.org/T357144#9824957 (10MoritzMuehlenhoff) [09:53:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P62968 and previous config saved to /var/cache/conftool/dbconfig/20240523-095338-marostegui.json [09:54:33] 06SRE, 06Infrastructure-Foundations: Integrate Bullseye 11.9 point update - https://phabricator.wikimedia.org/T357144#9824958 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff This is complete [09:55:11] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db2146.codfw.wmnet [09:56:27] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1226 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P62969 and previous config saved to /var/cache/conftool/dbconfig/20240523-095627-root.json [09:57:30] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2097.codfw.wmnet with reason: Maintenance [09:57:43] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2097.codfw.wmnet with reason: Maintenance [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240523T1000) [10:00:58] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on es2022 - https://phabricator.wikimedia.org/T365213#9824966 (10ABran-WMF) RAID is fully rebuilt @Marostegui {F54214313} [10:01:14] !log btullis@cumin1002 START - Cookbook sre.discovery.service-route pool device-analytics in eqiad: maintenance [10:02:22] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on 
es2022 - https://phabricator.wikimedia.org/T365213#9824970 (10Marostegui) Awesome thanks! [10:04:06] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2130.codfw.wmnet with reason: reimage [10:04:20] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2130.codfw.wmnet with reason: reimage [10:04:30] (03CR) 10David Caro: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2602/co" [puppet] - 10https://gerrit.wikimedia.org/r/1035148 (owner: 10David Caro) [10:04:39] FIRING: [2x] CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1006-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [10:04:53] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depool db2130 T364290', diff saved to https://phabricator.wikimedia.org/P62970 and previous config saved to /var/cache/conftool/dbconfig/20240523-100452-arnaudb.json [10:04:58] T364290: Upgrade s1 to MariaDB 10.6 - https://phabricator.wikimedia.org/T364290 [10:06:18] !log btullis@cumin1002 END (PASS) - Cookbook sre.discovery.service-route (exit_code=0) pool device-analytics in eqiad: maintenance [10:06:28] !log arnaudb@cumin1002 START - Cookbook sre.hosts.reimage for host db2130.codfw.wmnet with OS bookworm [10:07:41] (03PS1) 10Btullis: Fix the stat1008 partman-reuse recipe [puppet] - 10https://gerrit.wikimedia.org/r/1035348 (https://phabricator.wikimedia.org/T329360) [10:08:25] (03CR) 10David Caro: [V:03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2603/co" [puppet] - 10https://gerrit.wikimedia.org/r/1035148 (owner: 10David Caro) [10:08:50] !log 
btullis@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host stat1008.eqiad.wmnet with OS bullseye [10:11:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1226 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P62971 and previous config saved to /var/cache/conftool/dbconfig/20240523-101133-root.json [10:12:58] (03PS1) 10Effie Mouzeli: mediawiki::memcached: set number of memcached threads mcX050-mcX054 [puppet] - 10https://gerrit.wikimedia.org/r/1035349 (https://phabricator.wikimedia.org/T273950) [10:14:07] (03CR) 10Brouberol: [C:03+1] "Oh right, I forgot about this naming convention. Good spot." [puppet] - 10https://gerrit.wikimedia.org/r/1035348 (https://phabricator.wikimedia.org/T329360) (owner: 10Btullis) [10:17:31] (03CR) 10Btullis: [C:03+2] Fix the stat1008 partman-reuse recipe [puppet] - 10https://gerrit.wikimedia.org/r/1035348 (https://phabricator.wikimedia.org/T329360) (owner: 10Btullis) [10:18:15] (03PS2) 10Effie Mouzeli: mediawiki::memcached: set number of memcached threads mcX050-mcX054 [puppet] - 10https://gerrit.wikimedia.org/r/1035349 (https://phabricator.wikimedia.org/T273950) [10:18:37] (03CR) 10CI reject: [V:04-1] mediawiki::memcached: set number of memcached threads mcX050-mcX054 [puppet] - 10https://gerrit.wikimedia.org/r/1035349 (https://phabricator.wikimedia.org/T273950) (owner: 10Effie Mouzeli) [10:24:47] 10SRE-tools, 06Infrastructure-Foundations, 10Spicerack: Redfish _get_dummy_response() should return empty json - https://phabricator.wikimedia.org/T365680#9825003 (10Volans) I guess we could use something like: `lang=python >>> a = requests.Response() >>> a.status_code = 200 >>> a.raw = BytesIO(b'{}') >>> a... 
[10:25:03] !log btullis@cumin1002 START - Cookbook sre.hosts.reimage for host stat1008.eqiad.wmnet with OS bullseye [10:25:41] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2130.codfw.wmnet with reason: host reimage [10:26:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1226 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P62972 and previous config saved to /var/cache/conftool/dbconfig/20240523-102639-root.json [10:26:45] (03PS1) 10Muehlenhoff: maps: Add option to use PKI (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/1035351 (https://phabricator.wikimedia.org/T360778) [10:27:07] (03PS2) 10David Caro: Reapply "openstack::bobcat: apply cloud yaml patch"" [puppet] - 10https://gerrit.wikimedia.org/r/1035148 [10:28:21] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2130.codfw.wmnet with reason: host reimage [10:28:46] (03CR) 10David Caro: [V:03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2606/co" [puppet] - 10https://gerrit.wikimedia.org/r/1035148 (owner: 10David Caro) [10:29:01] (03PS3) 10Effie Mouzeli: mediawiki::memcached: set number of memcached threads mcX050-mcX054 [puppet] - 10https://gerrit.wikimedia.org/r/1035349 (https://phabricator.wikimedia.org/T273950) [10:32:23] (03CR) 10Phuedx: [C:03+1] "Re-applying my +1 after reviewing PS19-PS29. Nice work, @Ottomata." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/985023 (https://phabricator.wikimedia.org/T353817) (owner: 10Ottomata) [10:36:12] (03PS4) 10Effie Mouzeli: mediawiki::memcached: increase number of threads mcX050-mcX054 [puppet] - 10https://gerrit.wikimedia.org/r/1035349 (https://phabricator.wikimedia.org/T273950) [10:37:46] (03PS1) 10Btullis: Remove trailing slash from stat1008 partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/1035353 (https://phabricator.wikimedia.org/T329360) [10:37:48] (03CR) 10Effie Mouzeli: [C:03+2] mediawiki::memcached: increase number of threads mcX050-mcX054 [puppet] - 10https://gerrit.wikimedia.org/r/1035349 (https://phabricator.wikimedia.org/T273950) (owner: 10Effie Mouzeli) [10:38:01] (03CR) 10Volans: "reply inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/981472 (https://phabricator.wikimedia.org/T350152) (owner: 10Ayounsi) [10:38:40] (03CR) 10Btullis: [C:03+2] Remove trailing slash from stat1008 partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/1035353 (https://phabricator.wikimedia.org/T329360) (owner: 10Btullis) [10:39:20] !log hnowlan@cumin1002 conftool action : set/pooled=yes:weight=10; selector: name=wikikube-worker2001.codfw.wmnet [10:40:46] !log btullis@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host stat1008.eqiad.wmnet with OS bullseye [10:41:13] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1035351 (https://phabricator.wikimedia.org/T360778) (owner: 10Muehlenhoff) [10:41:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1226 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P62973 and previous config saved to /var/cache/conftool/dbconfig/20240523-104145-root.json [10:42:01] !log hnowlan@cumin1002 conftool action : set/pooled=no; selector: name=wikikube-worker2001.codfw.wmnet [10:42:07] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db2170.codfw.wmnet [10:43:06] (03PS1) 
10Muehlenhoff: Switch db2170 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1035354 (https://phabricator.wikimedia.org/T349619) [10:44:33] !log jiji@cumin1002 START - Cookbook sre.hosts.reimage for host mc1053.eqiad.wmnet with OS bookworm [10:44:44] !log jiji@cumin2002 START - Cookbook sre.hosts.reimage for host mc2053.codfw.wmnet with OS bookworm [10:45:17] !log btullis@cumin1002 START - Cookbook sre.hosts.reimage for host stat1008.eqiad.wmnet with OS bullseye [10:45:40] (03CR) 10Muehlenhoff: [C:03+2] Switch db2170 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1035354 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [10:47:14] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2114.codfw.wmnet with reason: Maintenance [10:47:27] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2114.codfw.wmnet with reason: Maintenance [10:48:52] (03PS1) 10Hnowlan: sessionstore: update certs in advance of expiry [deployment-charts] - 10https://gerrit.wikimedia.org/r/1035357 (https://phabricator.wikimedia.org/T363996) [10:48:52] (03CR) 10EoghanGaffney: [C:03+1] phabricator: Move outbound email to mx-out{1001,2001}.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1035050 (https://phabricator.wikimedia.org/T365395) (owner: 10JHathaway) [10:50:32] (03PS3) 10Effie Mouzeli: x-wikimedia-debug: add datacenter options for k8s [puppet] - 10https://gerrit.wikimedia.org/r/1034514 (https://phabricator.wikimedia.org/T365478) [10:52:11] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2130.codfw.wmnet with OS bookworm [10:56:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1226 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P62974 and previous config saved to /var/cache/conftool/dbconfig/20240523-105651-root.json [10:57:39] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host 
(exit_code=0) for host db2170.codfw.wmnet [10:57:57] !log jiji@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mc1053.eqiad.wmnet with reason: host reimage [11:01:12] (03PS2) 10Muehlenhoff: maps: Add option to use PKI [puppet] - 10https://gerrit.wikimedia.org/r/1035351 (https://phabricator.wikimedia.org/T360778) [11:02:49] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2130 (re)pooling @ 1%: post reimage repool', diff saved to https://phabricator.wikimedia.org/P62975 and previous config saved to /var/cache/conftool/dbconfig/20240523-110249-arnaudb.json [11:02:54] !log jiji@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mc2053.codfw.wmnet with reason: host reimage [11:02:58] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc1053.eqiad.wmnet with reason: host reimage [11:05:33] (03PS1) 10Effie Mouzeli: x-wikimedia-debug: add datacenter options for k8s [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1035361 (https://phabricator.wikimedia.org/T365478) [11:05:44] (03CR) 10EoghanGaffney: [C:03+1] wikitech: Add credentials for GitLab account blocking [puppet] - 10https://gerrit.wikimedia.org/r/1034532 (https://phabricator.wikimedia.org/T316418) (owner: 10BryanDavis) [11:06:16] !log jiji@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc2053.codfw.wmnet with reason: host reimage [11:08:28] !log btullis@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on stat1008.eqiad.wmnet with reason: host reimage [11:11:34] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on stat1008.eqiad.wmnet with reason: host reimage [11:11:36] (03PS1) 10Muehlenhoff: tlsproxy::localssl: Remove support for OCSP handling [puppet] - 10https://gerrit.wikimedia.org/r/1035362 [11:11:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1226 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P62976 and previous config 
saved to /var/cache/conftool/dbconfig/20240523-111157-root.json [11:14:17] (03CR) 10Fabfur: [C:03+1] x-wikimedia-debug: add datacenter options for k8s [puppet] - 10https://gerrit.wikimedia.org/r/1034514 (https://phabricator.wikimedia.org/T365478) (owner: 10Effie Mouzeli) [11:14:23] (03CR) 10Fabfur: [C:03+1] x-wikimedia-debug: Drop old 'k8s-experimental' alias label [puppet] - 10https://gerrit.wikimedia.org/r/1034108 (https://phabricator.wikimedia.org/T362662) (owner: 10Jforrester) [11:17:55] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2130 (re)pooling @ 2%: post reimage repool', diff saved to https://phabricator.wikimedia.org/P62977 and previous config saved to /var/cache/conftool/dbconfig/20240523-111755-arnaudb.json [11:18:54] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc1053.eqiad.wmnet with OS bookworm [11:22:39] (03PS2) 10Hnowlan: sessionstore: update certs in advance of expiry [deployment-charts] - 10https://gerrit.wikimedia.org/r/1035357 (https://phabricator.wikimedia.org/T363996) [11:23:23] 06SRE, 06Infrastructure-Foundations, 07Epic: Tracking task for Bullseye migrations in production - https://phabricator.wikimedia.org/T291916#9825128 (10BTullis) [11:23:24] 06SRE, 06Data-Engineering, 10Dumps-Generation, 10Data-Platform-SRE (2024.05.06 - 2024.05.26): Migrate Dumps Snapshot hosts from Buster to Bullseye - https://phabricator.wikimedia.org/T325228#9825127 (10BTullis) [11:23:32] (03CR) 10Effie Mouzeli: [C:03+1] sessionstore: update certs in advance of expiry [deployment-charts] - 10https://gerrit.wikimedia.org/r/1035357 (https://phabricator.wikimedia.org/T363996) (owner: 10Hnowlan) [11:24:07] (03CR) 10Effie Mouzeli: [C:03+2] x-wikimedia-debug: Drop old 'k8s-experimental' alias label [puppet] - 10https://gerrit.wikimedia.org/r/1034108 (https://phabricator.wikimedia.org/T362662) (owner: 10Jforrester) [11:24:28] !log jiji@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc2053.codfw.wmnet with 
OS bookworm [11:25:08] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1035362 (owner: 10Muehlenhoff) [11:26:10] 06SRE, 06Infrastructure-Foundations, 07Epic: Tracking task for Bullseye migrations in production - https://phabricator.wikimedia.org/T291916#9825133 (10BTullis) [11:27:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1226 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P62978 and previous config saved to /var/cache/conftool/dbconfig/20240523-112704-root.json [11:32:01] (03PS1) 10Muehlenhoff: Switch Typha firewall config to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1035365 (https://phabricator.wikimedia.org/T365687) [11:33:01] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2130 (re)pooling @ 5%: post reimage repool', diff saved to https://phabricator.wikimedia.org/P62979 and previous config saved to /var/cache/conftool/dbconfig/20240523-113301-arnaudb.json [11:34:17] (03CR) 10JMeybohm: cache: fix and improve the code in the s3 module that allows a proxy (031 comment) [software/tegola] (wmf/v0.19.x) - 10https://gerrit.wikimedia.org/r/1035006 (https://phabricator.wikimedia.org/T344324) (owner: 10Elukey) [11:35:13] 06SRE, 10Icinga, 10observability, 10Observability-Alerting, 10Scap: expose hosts in maintenance state so we can prevent scap from running on them - https://phabricator.wikimedia.org/T100777#9825144 (10hashar) 05Open→03Declined I filed this subtask to express an idea filed in the parent T78319. Th... [11:36:46] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1035365 (https://phabricator.wikimedia.org/T365687) (owner: 10Muehlenhoff) [11:37:15] 06SRE, 10Release-Engineering-Team (Radar): scap train failure due to earlier host rename - https://phabricator.wikimedia.org/T365683#9825148 (10hashar) I think that is essentially the same as T78319 which asks for a way to filter out some hosts from the dsh groups. 
[11:40:18] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host stat1008.eqiad.wmnet with OS bullseye [11:42:38] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2124.codfw.wmnet with reason: Maintenance [11:42:52] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2124.codfw.wmnet with reason: Maintenance [11:43:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2124 (T364299)', diff saved to https://phabricator.wikimedia.org/P62980 and previous config saved to /var/cache/conftool/dbconfig/20240523-114259-marostegui.json [11:43:06] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [11:43:13] !log jiji@cumin1002 START - Cookbook sre.hosts.reimage for host mc1052.eqiad.wmnet with OS bookworm [11:43:25] !log jiji@cumin2002 START - Cookbook sre.hosts.reimage for host mc2052.codfw.wmnet with OS bookworm [11:44:07] (03PS4) 10Muehlenhoff: cloudweb: Enable profile::auto_restarts::service for FPM [puppet] - 10https://gerrit.wikimedia.org/r/1026453 (https://phabricator.wikimedia.org/T135991) [11:44:26] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1026453 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [11:44:35] (03PS4) 10Effie Mouzeli: x-wikimedia-debug: add datacenter options for k8s [puppet] - 10https://gerrit.wikimedia.org/r/1034514 (https://phabricator.wikimedia.org/T365478) [11:47:22] (03PS1) 10Jelto: sre: add alert for trusted gitlab-runner config [alerts] - 10https://gerrit.wikimedia.org/r/1035370 (https://phabricator.wikimedia.org/T354656) [11:47:44] (03PS1) 10Ayounsi: Enable BFD on Telxius transit [homer/public] - 10https://gerrit.wikimedia.org/r/1035371 (https://phabricator.wikimedia.org/T362421) [11:48:08] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2130 (re)pooling @ 10%: post reimage repool', diff saved to https://phabricator.wikimedia.org/P62981 and 
previous config saved to /var/cache/conftool/dbconfig/20240523-114807-arnaudb.json [11:48:58] (03CR) 10CI reject: [V:04-1] sre: add alert for trusted gitlab-runner config [alerts] - 10https://gerrit.wikimedia.org/r/1035370 (https://phabricator.wikimedia.org/T354656) (owner: 10Jelto) [11:49:05] (03CR) 10JMeybohm: [C:03+1] Switch Typha firewall config to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1035365 (https://phabricator.wikimedia.org/T365687) (owner: 10Muehlenhoff) [11:49:12] (03PS1) 10Vgutierrez: depool upload@esams before enabling IPIP encapsulation [dns] - 10https://gerrit.wikimedia.org/r/1035373 (https://phabricator.wikimedia.org/T357257) [11:51:47] (03PS2) 10Jelto: sre: add alert for trusted gitlab-runner config [alerts] - 10https://gerrit.wikimedia.org/r/1035370 (https://phabricator.wikimedia.org/T354656) [11:52:25] (03CR) 10Vgutierrez: [C:03+2] depool upload@esams before enabling IPIP encapsulation [dns] - 10https://gerrit.wikimedia.org/r/1035373 (https://phabricator.wikimedia.org/T357257) (owner: 10Vgutierrez) [11:52:32] !log depool upload@esams before enabling IPIP encapsulation - T357257 [11:52:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:52:38] T357257: Use IPIP encapsulation on lvs<-->upload cluster - https://phabricator.wikimedia.org/T357257 [11:53:01] (03CR) 10CI reject: [V:04-1] sre: add alert for trusted gitlab-runner config [alerts] - 10https://gerrit.wikimedia.org/r/1035370 (https://phabricator.wikimedia.org/T354656) (owner: 10Jelto) [11:55:24] (03PS3) 10Jelto: sre: add alert for trusted gitlab-runner config [alerts] - 10https://gerrit.wikimedia.org/r/1035370 (https://phabricator.wikimedia.org/T354656) [11:56:31] 10SRE-tools, 06Infrastructure-Foundations, 10netbox, 10Spicerack: Cookbooks: move Netbox IP allocation to spicerack module - https://phabricator.wikimedia.org/T365694 (10ayounsi) 03NEW p:05Triage→03Low [11:56:32] !log jiji@cumin1002 START - Cookbook sre.hosts.downtime for 
2:00:00 on mc1052.eqiad.wmnet with reason: host reimage [11:57:16] (03CR) 10Ayounsi: sre.hosts.move-vlan: add new cookbook (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/981472 (https://phabricator.wikimedia.org/T350152) (owner: 10Ayounsi) [11:57:59] (03CR) 10Jelto: "I tested this with the gitlab replica and changed the config for the Trusted runners:" [alerts] - 10https://gerrit.wikimedia.org/r/1035370 (https://phabricator.wikimedia.org/T354656) (owner: 10Jelto) [11:59:08] (03CR) 10Majavah: [C:03+1] cloudweb: Enable profile::auto_restarts::service for FPM [puppet] - 10https://gerrit.wikimedia.org/r/1026453 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [11:59:59] (03PS1) 10Marostegui: db1238: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1035374 [12:00:04] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240523T1200) [12:00:37] (03CR) 10Marostegui: [C:03+2] db1238: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1035374 (owner: 10Marostegui) [12:01:29] !log jiji@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mc2052.codfw.wmnet with reason: host reimage [12:01:32] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc1052.eqiad.wmnet with reason: host reimage [12:02:45] (03CR) 10Vgutierrez: [V:03+1 C:03+2] hiera: Enable IPIP on high-traffic2@esams [puppet] - 10https://gerrit.wikimedia.org/r/1034968 (https://phabricator.wikimedia.org/T357257) (owner: 10Vgutierrez) [12:03:14] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2130 (re)pooling @ 25%: post reimage repool', diff saved to https://phabricator.wikimedia.org/P62982 and previous config saved to /var/cache/conftool/dbconfig/20240523-120313-arnaudb.json [12:04:29] !log jiji@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc2052.codfw.wmnet with reason: host reimage [12:06:18] (03PS26) 10Ayounsi: 
sre.hosts.move-vlan: add new cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/981472 (https://phabricator.wikimedia.org/T350152) [12:10:29] (03CR) 10Vgutierrez: [V:03+1 C:03+2] hiera: Enable IPIP on esams@upload [puppet] - 10https://gerrit.wikimedia.org/r/1034969 (https://phabricator.wikimedia.org/T357257) (owner: 10Vgutierrez) [12:11:29] (03PS6) 10Klausman: charts/kserve: Switch to using k8s service targets for NP [deployment-charts] - 10https://gerrit.wikimedia.org/r/1034492 (https://phabricator.wikimedia.org/T365479) [12:12:05] (03CR) 10EoghanGaffney: [V:03+1 C:03+2] lists: Add lists role to list2001 [puppet] - 10https://gerrit.wikimedia.org/r/1025741 (https://phabricator.wikimedia.org/T331706) (owner: 10EoghanGaffney) [12:15:32] (03PS1) 10Ayounsi: Arelion IPv6 renumbering [homer/public] - 10https://gerrit.wikimedia.org/r/1035376 (https://phabricator.wikimedia.org/T365697) [12:17:34] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc1052.eqiad.wmnet with OS bookworm [12:18:20] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2130 (re)pooling @ 50%: post reimage repool', diff saved to https://phabricator.wikimedia.org/P62983 and previous config saved to /var/cache/conftool/dbconfig/20240523-121819-arnaudb.json [12:21:41] !log jiji@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc2052.codfw.wmnet with OS bookworm [12:22:38] (03PS2) 10TChin: datasets-config: Add volume for configmap [deployment-charts] - 10https://gerrit.wikimedia.org/r/1034581 (https://phabricator.wikimedia.org/T357434) [12:24:46] (03CR) 10TChin: datasets-config: Add volume for configmap (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1034581 (https://phabricator.wikimedia.org/T357434) (owner: 10TChin) [12:28:53] (03CR) 10Volans: "From the cookbook structure PoV LGTM." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1031933 (https://phabricator.wikimedia.org/T349069) (owner: 10DCausse) [12:29:35] (03CR) 10Volans: wdqs.data-reload: support HDFS as a source (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1031933 (https://phabricator.wikimedia.org/T349069) (owner: 10DCausse) [12:31:00] (03CR) 10Volans: "forgot to mention teh runtime_description" [cookbooks] - 10https://gerrit.wikimedia.org/r/1032544 (owner: 10DCausse) [12:31:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2124 (T364299)', diff saved to https://phabricator.wikimedia.org/P62984 and previous config saved to /var/cache/conftool/dbconfig/20240523-123145-marostegui.json [12:33:26] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2130 (re)pooling @ 75%: post reimage repool', diff saved to https://phabricator.wikimedia.org/P62985 and previous config saved to /var/cache/conftool/dbconfig/20240523-123325-arnaudb.json [12:34:07] (03PS1) 10Muehlenhoff: Remove access for kormat [puppet] - 10https://gerrit.wikimedia.org/r/1035409 [12:35:19] (03CR) 10Muehlenhoff: [C:03+2] Remove access for kormat [puppet] - 10https://gerrit.wikimedia.org/r/1035409 (owner: 10Muehlenhoff) [12:36:38] !log jmm@cumin2002 START - Cookbook sre.idm.logout Logging Kormat out of all services on: 2199 hosts [12:36:46] FIRING: SystemdUnitFailed: logrotate.service on moss-be1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:37:22] !log jmm@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Kormat out of all services on: 2199 hosts [12:37:36] (03PS1) 10Stevemunene: Add datahub-next missing values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1035411 (https://phabricator.wikimedia.org/T365674) [12:40:59] (03PS1) 10David Caro: horizon: remove openstack client packages [puppet] - 
10https://gerrit.wikimedia.org/r/1035412 [12:41:56] (03CR) 10Hashar: [C:03+1] "I don't see the point in keeping it given it was a transient role. It is easy to recreate if we need again one day." [puppet] - 10https://gerrit.wikimedia.org/r/1034955 (https://phabricator.wikimedia.org/T358237) (owner: 10Dzahn) [12:45:26] (03CR) 10Btullis: "We do need the service provided by datahub-gms to be accessible from outside the cluster, but unlike the datahub-frontend server, it is no" [puppet] - 10https://gerrit.wikimedia.org/r/1032399 (https://phabricator.wikimedia.org/T363450) (owner: 10Stevemunene) [12:45:34] (03CR) 10Hashar: "acknowledging a note I have made to myself" [software/gerrit] (deploy/wmf/stable-3.8) - 10https://gerrit.wikimedia.org/r/1026135 (https://phabricator.wikimedia.org/T363918) (owner: 10Paladox) [12:46:08] (03CR) 10David Caro: [V:03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2608/co" [puppet] - 10https://gerrit.wikimedia.org/r/1035412 (owner: 10David Caro) [12:46:55] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2124', diff saved to https://phabricator.wikimedia.org/P62986 and previous config saved to /var/cache/conftool/dbconfig/20240523-124654-marostegui.json [12:48:30] (03CR) 10Majavah: [C:03+1] horizon: remove openstack client packages [puppet] - 10https://gerrit.wikimedia.org/r/1035412 (owner: 10David Caro) [12:48:33] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2130 (re)pooling @ 100%: post reimage repool', diff saved to https://phabricator.wikimedia.org/P62987 and previous config saved to /var/cache/conftool/dbconfig/20240523-124832-arnaudb.json [12:49:07] (03CR) 10Majavah: [C:03+1] "We should ideally re-image the cloudweb servers after applying this one." 
[puppet] - 10https://gerrit.wikimedia.org/r/1035412 (owner: 10David Caro) [12:50:04] !log rolling restart of pybal on lvs3010 and lvs3009 - T357257 [12:50:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:50:08] T357257: Use IPIP encapsulation on lvs<-->upload cluster - https://phabricator.wikimedia.org/T357257 [12:51:14] (03PS5) 10Reedy: interwiki.php: Remove duplicates [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1035389 (https://phabricator.wikimedia.org/T365679) [12:53:33] (03PS2) 10Elukey: cache: fix and improve the code in the s3 module that allows a proxy [software/tegola] (wmf/v0.19.x) - 10https://gerrit.wikimedia.org/r/1035006 (https://phabricator.wikimedia.org/T344324) [12:53:48] (03CR) 10Elukey: cache: fix and improve the code in the s3 module that allows a proxy (031 comment) [software/tegola] (wmf/v0.19.x) - 10https://gerrit.wikimedia.org/r/1035006 (https://phabricator.wikimedia.org/T344324) (owner: 10Elukey) [12:56:13] (03PS1) 10Vgutierrez: Revert "depool upload@esams before enabling IPIP encapsulation" [dns] - 10https://gerrit.wikimedia.org/r/1035390 (https://phabricator.wikimedia.org/T357257) [12:56:42] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depool db2116 T364290', diff saved to https://phabricator.wikimedia.org/P62988 and previous config saved to /var/cache/conftool/dbconfig/20240523-125641-arnaudb.json [12:56:47] T364290: Upgrade s1 to MariaDB 10.6 - https://phabricator.wikimedia.org/T364290 [12:56:50] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2116.codfw.wmnet with reason: reimage [12:57:03] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2116.codfw.wmnet with reason: reimage [12:57:19] (03CR) 10Vgutierrez: [C:03+2] Revert "depool upload@esams before enabling IPIP encapsulation" [dns] - 10https://gerrit.wikimedia.org/r/1035390 (https://phabricator.wikimedia.org/T357257) (owner: 10Vgutierrez) [12:57:28] !log repool 
upload@esams with IPIP encapsulation enabled - T357257 [12:57:30] !log jiji@cumin1002 START - Cookbook sre.hosts.reimage for host mc1051.eqiad.wmnet with OS bookworm [12:57:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:57:32] T357257: Use IPIP encapsulation on lvs<-->upload cluster - https://phabricator.wikimedia.org/T357257 [12:57:44] eoghan, XioNoX ^^ [12:57:48] (03CR) 10Btullis: "Quick answers to suggestions 1 & 2." [dns] - 10https://gerrit.wikimedia.org/r/1032393 (https://phabricator.wikimedia.org/T363299) (owner: 10Stevemunene) [12:58:20] !log jiji@cumin2002 START - Cookbook sre.hosts.reimage for host mc2051.codfw.wmnet with OS bookworm [12:58:40] (03PS1) 10Stevemunene: idp-test: Change datahub staging url [puppet] - 10https://gerrit.wikimedia.org/r/1035414 (https://phabricator.wikimedia.org/T365674) [12:59:20] !log arnaudb@cumin1002 START - Cookbook sre.hosts.reimage for host db2116.codfw.wmnet with OS bookworm [13:00:04] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: Time to snap out of that daydream and deploy UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240523T1300). [13:00:04] No Gerrit patches in the queue for this window AFAICS. [13:00:34] 06SRE, 10MW-on-K8s, 06serviceops, 06Traffic, 10Release-Engineering-Team (Seen): Rename X-Wikimedia-Debug k8s-experimental option - https://phabricator.wikimedia.org/T362662#9825484 (10Jdforrester-WMF) 05In progress→03Resolved [13:00:56] (03CR) 10Muehlenhoff: [C:03+2] cloudweb: Enable profile::auto_restarts::service for FPM [puppet] - 10https://gerrit.wikimedia.org/r/1026453 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [13:01:22] vgutierrez: good job! 
[13:02:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2124', diff saved to https://phabricator.wikimedia.org/P62989 and previous config saved to /var/cache/conftool/dbconfig/20240523-130202-marostegui.json [13:05:16] (03PS4) 10Jelto: sre: add alert for trusted gitlab-runner config [alerts] - 10https://gerrit.wikimedia.org/r/1035370 (https://phabricator.wikimedia.org/T354656) [13:07:32] (03CR) 10Volans: [C:03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/981472 (https://phabricator.wikimedia.org/T350152) (owner: 10Ayounsi) [13:08:22] (03PS1) 10Reedy: interwiki.php: Alphasort keys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1035417 [13:09:02] (03PS1) 10Jelto: gitlab: bump exporter version to v1.0.6 [puppet] - 10https://gerrit.wikimedia.org/r/1035418 (https://phabricator.wikimedia.org/T354656) [13:09:17] (03PS1) 10Reedy: interwiki-labs.php: Rebuild and alphasort as per production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1035419 [13:10:05] (03CR) 10Jelto: [C:03+2] gitlab: bump exporter version to v1.0.6 [puppet] - 10https://gerrit.wikimedia.org/r/1035418 (https://phabricator.wikimedia.org/T354656) (owner: 10Jelto) [13:10:45] (03CR) 10CI reject: [V:04-1] interwiki-labs.php: Rebuild and alphasort as per production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1035419 (owner: 10Reedy) [13:10:45] !log jiji@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mc1051.eqiad.wmnet with reason: host reimage [13:11:22] (03PS2) 10Reedy: interwiki-labs.php: Update [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1035419 [13:12:56] (03CR) 10Btullis: "> we redirect https://datahub-next.wikimedia.org to https://datahub-next.svc.eqiad.wmnet:30443 at the ATS level" [dns] - 10https://gerrit.wikimedia.org/r/1032393 (https://phabricator.wikimedia.org/T363299) (owner: 10Stevemunene) [13:13:00] (03CR) 10CI reject: [V:04-1] interwiki-labs.php: Update [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1035419 (owner: 10Reedy) [13:13:22] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc1051.eqiad.wmnet with reason: host reimage [13:13:49] (03PS3) 10Reedy: interwiki-labs.php: Update [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1035419 [13:13:49] (03PS1) 10Reedy: phpcs.xml: Add interwiki-labs.php to exclude-pattern [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1035420 [13:13:49] (03PS1) 10Reedy: interwiki-labs.php: De-duplicate [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1035421 [13:13:50] (03PS1) 10Reedy: interwiki-labs.php: Alphasort keys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1035422 [13:13:53] jouncebot: nowandnext [13:13:53] For the next 0 hour(s) and 46 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240523T1300) [13:13:54] In 2 hour(s) and 46 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240523T1600) [13:14:44] (03PS2) 10Reedy: phpcs.xml: Add interwiki-labs.php to exclude-pattern [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1035420 [13:14:44] (03PS4) 10Reedy: interwiki-labs.php: Update [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1035419 [13:14:44] (03PS2) 10Reedy: interwiki-labs.php: De-duplicate [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1035421 [13:14:45] (03PS2) 10Reedy: interwiki-labs.php: Alphasort keys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1035422 [13:15:09] (03CR) 10CI reject: [V:04-1] interwiki-labs.php: Update [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1035419 (owner: 10Reedy) [13:15:14] (03PS3) 10Reedy: phpcs.xml: Add interwiki-labs.php to exclude-pattern [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1035420 [13:15:26] (03CR) 10Reedy: [C:03+2] phpcs.xml: Add interwiki-labs.php to exclude-pattern [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1035420 
(owner: 10Reedy) [13:16:18] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2116.codfw.wmnet with reason: host reimage [13:16:23] (03Merged) 10jenkins-bot: phpcs.xml: Add interwiki-labs.php to exclude-pattern [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1035420 (owner: 10Reedy) [13:16:38] !log jiji@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mc2051.codfw.wmnet with reason: host reimage [13:17:08] 10ops-eqiad, 06SRE, 06cloud-services-team, 10Cloud-VPS, and 2 others: Degraded RAID on cloudcephosd1031 - https://phabricator.wikimedia.org/T364060#9825583 (10dcaro) ` root@cloudcephosd1031:~# cat /proc/mdstat Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10] md0 : act... [13:17:11] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2124 (T364299)', diff saved to https://phabricator.wikimedia.org/P62990 and previous config saved to /var/cache/conftool/dbconfig/20240523-131710-marostegui.json [13:17:13] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2151.codfw.wmnet with reason: Maintenance [13:17:16] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [13:17:27] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2151.codfw.wmnet with reason: Maintenance [13:17:27] (03CR) 10Reedy: [C:04-1] "Well, that isn't working right..." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1035419 (owner: 10Reedy) [13:17:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2151 (T364299)', diff saved to https://phabricator.wikimedia.org/P62991 and previous config saved to /var/cache/conftool/dbconfig/20240523-131734-marostegui.json [13:18:55] (03Abandoned) 10Reedy: interwiki-labs.php: Update [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1035419 (owner: 10Reedy) [13:18:59] (03Abandoned) 10Reedy: interwiki-labs.php: De-duplicate [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1035421 (owner: 10Reedy) [13:19:03] (03Abandoned) 10Reedy: interwiki-labs.php: Alphasort keys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1035422 (owner: 10Reedy) [13:19:19] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2116.codfw.wmnet with reason: host reimage [13:20:21] (03CR) 10Brouberol: [C:03+1] idp-test: Change datahub staging url [puppet] - 10https://gerrit.wikimedia.org/r/1035414 (https://phabricator.wikimedia.org/T365674) (owner: 10Stevemunene) [13:22:45] !log jiji@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc2051.codfw.wmnet with reason: host reimage [13:22:52] 10ops-eqiad, 06SRE, 06cloud-services-team, 10Cloud-VPS, and 2 others: Degraded RAID on cloudcephosd1031 - https://phabricator.wikimedia.org/T364060#9825627 (10dcaro) The support assist logs are on google drive https://drive.google.com/file/d/1tS2cy8EF5AgsTLpdK2r8dTR0YQ06ntIZ/view?usp=drive_link (phabricat... [13:25:17] (03CR) 10Filippo Giunchedi: [C:03+1] "Nice! 
LGTM" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1034978 (https://phabricator.wikimedia.org/T365626) (owner: 10CDanis) [13:25:50] (03CR) 10Stevemunene: "Yes it does, so for this change we can just introduce `datahub-frontend/gms-next.svc.eqiad.wmnet` then do the rest as we remove them from " [dns] - 10https://gerrit.wikimedia.org/r/1032393 (https://phabricator.wikimedia.org/T363299) (owner: 10Stevemunene) [13:25:57] (03PS1) 10Ssingh: P:lvs::configuration: remove obsolete $lvs_classes [puppet] - 10https://gerrit.wikimedia.org/r/1035424 [13:26:14] (03CR) 10Btullis: dse-k8s: add airflow namespace (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1035015 (https://phabricator.wikimedia.org/T363001) (owner: 10Bking) [13:26:53] 10ops-eqiad, 06DC-Ops, 06serviceops: Relabel eqiad Kubernetes hosts - https://phabricator.wikimedia.org/T365711 (10hnowlan) 03NEW [13:28:53] (03CR) 10CI reject: [V:04-1] P:lvs::configuration: remove obsolete $lvs_classes [puppet] - 10https://gerrit.wikimedia.org/r/1035424 (owner: 10Ssingh) [13:29:12] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc1051.eqiad.wmnet with OS bookworm [13:29:38] (03CR) 10Btullis: dse-k8s: add new airflow service to k8s cluster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1034961 (https://phabricator.wikimedia.org/T363001) (owner: 10Bking) [13:30:55] 10ops-codfw, 06DC-Ops, 06serviceops: Relabel codfw Kubernetes hosts - https://phabricator.wikimedia.org/T365712 (10hnowlan) 03NEW [13:32:05] (03PS1) 10Reedy: interwiki-labs.php: Update [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1035426 [13:32:32] (03CR) 10Reedy: [C:03+2] interwiki-labs.php: Update [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1035426 (owner: 10Reedy) [13:33:14] (03Merged) 10jenkins-bot: interwiki-labs.php: Update [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1035426 (owner: 10Reedy) [13:36:59] !log arnaudb@cumin1002 END (ERROR) - Cookbook 
sre.hosts.reimage (exit_code=97) for host db2116.codfw.wmnet with OS bookworm [13:37:56] (03CR) 10Muehlenhoff: [C:03+2] Switch Typha firewall config to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1035365 (https://phabricator.wikimedia.org/T365687) (owner: 10Muehlenhoff) [13:38:07] (03Abandoned) 10Ssingh: P:lvs::configuration: remove obsolete $lvs_classes [puppet] - 10https://gerrit.wikimedia.org/r/1035424 (owner: 10Ssingh) [13:41:10] !log jiji@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc2051.codfw.wmnet with OS bookworm [13:44:25] FIRING: SystemdUnitFailed: ferm.service on kubernetes1061:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:45:13] FIRING: [2x] CertAlmostExpired: Certificate for service sessionstore:8081 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#sessionstore:8081 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [13:46:16] !log arnaudb@cumin1002 START - Cookbook sre.hosts.reimage for host db2116.codfw.wmnet with OS bookworm [13:47:15] (03PS3) 10Elukey: [WIP] Initial import of ceph-csi-rbd chart for inspection [deployment-charts] - 10https://gerrit.wikimedia.org/r/1028931 (https://phabricator.wikimedia.org/T364472) (owner: 10Btullis) [13:47:15] (03CR) 10Elukey: "Left some comments! 
Overall the chart seems having a good quality, I am concerned about the usage of priviledged pods in two use cases, ma" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1028931 (https://phabricator.wikimedia.org/T364472) (owner: 10Btullis) [13:49:25] FIRING: [10x] SystemdUnitFailed: ferm.service on kubernetes1016:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:49:58] !log reedy@deploy1002 Synchronized wmf-config/interwiki-labs.php: (no justification provided) (duration: 16m 30s) [13:53:26] (03Restored) 10Reedy: interwiki-labs.php: De-duplicate [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1035421 (owner: 10Reedy) [13:53:33] (03Restored) 10Reedy: interwiki-labs.php: Alphasort keys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1035422 (owner: 10Reedy) [13:53:49] (03PS3) 10Reedy: interwiki-labs.php: Remove duplicates [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1035421 [13:54:00] (03CR) 10CI reject: [V:04-1] interwiki-labs.php: Remove duplicates [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1035421 (owner: 10Reedy) [13:55:13] (03PS6) 10Reedy: interwiki.php: Remove duplicates [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1035389 (https://phabricator.wikimedia.org/T365679) [13:55:13] (03PS2) 10Reedy: interwiki.php: Alphasort keys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1035417 [13:55:13] (03PS4) 10Reedy: interwiki-labs.php: Remove duplicates [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1035421 [13:55:14] (03PS3) 10Reedy: interwiki-labs.php: Alphasort keys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1035422 [13:56:28] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kafka-main1009.mgmt.eqiad.wmnet with reboot policy FORCED [14:03:18] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 
on db2116.codfw.wmnet with reason: host reimage [14:04:54] FIRING: [2x] CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1006-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [14:05:35] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on db1172 - https://phabricator.wikimedia.org/T365346#9825785 (10Jclark-ctr) @marostegui I do have a spare disk can I swap it at anytime [14:05:37] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on db1172 - https://phabricator.wikimedia.org/T365346#9825787 (10Jclark-ctr) a:03Jclark-ctr [14:06:18] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on db1172 - https://phabricator.wikimedia.org/T365346#9825788 (10Marostegui) @Jclark-ctr thanks - you can proceed whenever you like [14:06:26] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2116.codfw.wmnet with reason: host reimage [14:11:52] (03PS2) 10Jsn.sherman: CommonSettings: Load AutoModerator extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1034131 (https://phabricator.wikimedia.org/T361643) [14:11:52] (03PS3) 10Jsn.sherman: InitializeSettings: testwiki enable AutoModerator [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1034132 (https://phabricator.wikimedia.org/T361643) [14:12:13] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on db1172 - https://phabricator.wikimedia.org/T365346#9825813 (10Jclark-ctr) Replaced drive [14:13:14] (03PS4) 10Jsn.sherman: InitializeSettings: testwiki enable AutoModerator [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1034132 (https://phabricator.wikimedia.org/T361643) [14:13:26] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Relabel eqiad Kubernetes hosts - https://phabricator.wikimedia.org/T365711#9825819 (10Jclark-ctr) a:03Jclark-ctr 
[14:13:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2151 (T364299)', diff saved to https://phabricator.wikimedia.org/P62992 and previous config saved to /var/cache/conftool/dbconfig/20240523-141334-marostegui.json [14:13:44] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [14:14:06] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on db1172 - https://phabricator.wikimedia.org/T365346#9825828 (10Marostegui) Thanks - I can see it rebuilding [14:14:25] FIRING: [11x] SystemdUnitFailed: ferm.service on kubernetes1016:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:18:13] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1246 crashed - https://phabricator.wikimedia.org/T363119#9825838 (10Jclark-ctr) The replacement cable did just arrive yesterday. After multiple back and forth with dell Can we leave this open for 1 more week make sure error will not return. leave server running and... [14:18:49] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1246 crashed - https://phabricator.wikimedia.org/T363119#9825840 (10Marostegui) Sounds good thanks [14:19:24] (03CR) 10Ayounsi: [C:03+1] Set DHCP relay for EVPN switches in codfw to 'forward-only' mode [homer/public] - 10https://gerrit.wikimedia.org/r/1035019 (https://phabricator.wikimedia.org/T365204) (owner: 10Cathal Mooney) [14:19:25] FIRING: [12x] SystemdUnitFailed: ferm.service on kubernetes1016:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:19:44] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1246 crashed - https://phabricator.wikimedia.org/T363119#9825844 (10Marostegui) @ABran-WMF can you coordinate with @Jclark-ctr to schedule downtime for this host whenever he needs it? 
[14:20:25] (03PS1) 10Slyngshede: LDAP Authentication: Allow more flexibility in configuration. [software/bitu] - 10https://gerrit.wikimedia.org/r/1035438 [14:21:19] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes2042 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [14:22:57] (03CR) 10Ayounsi: [C:03+1] "nice!" [homer/public] - 10https://gerrit.wikimedia.org/r/1034889 (https://phabricator.wikimedia.org/T365169) (owner: 10Cathal Mooney) [14:24:07] (03CR) 10Slyngshede: [C:03+2] LDAP Authentication: Allow more flexibility in configuration. [software/bitu] - 10https://gerrit.wikimedia.org/r/1035438 (owner: 10Slyngshede) [14:24:25] FIRING: [14x] SystemdUnitFailed: ferm.service on kubernetes1016:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:25:35] (03Merged) 10jenkins-bot: LDAP Authentication: Allow more flexibility in configuration. [software/bitu] - 10https://gerrit.wikimedia.org/r/1035438 (owner: 10Slyngshede) [14:25:43] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'configure' for AS: 5769 [14:26:02] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 5769 [14:26:39] (03PS1) 10Aklapper: Add .gitignore [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1035439 (https://phabricator.wikimedia.org/T365716) [14:26:58] I stuck a testwiki deployment for the automoderator extension on today's utc late backport. I've asked several folks about how this should work but haven't heard back. I'm willing to do the deploy and I'm open to feedback. 
Erring on the side of obnoxiousness to the listed deployers RoanKattouw: Urbanecm: cjming: TheresNoTime: kindrobot: does this seem okay? can anybody be around to support in case things go boom? [14:27:01] (03PS1) 10Fabfur: benthos:cache: switch to rfc5424 format [puppet] - 10https://gerrit.wikimedia.org/r/1035440 (https://phabricator.wikimedia.org/T365718) [14:28:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2151', diff saved to https://phabricator.wikimedia.org/P62993 and previous config saved to /var/cache/conftool/dbconfig/20240523-142843-marostegui.json [14:28:48] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2116.codfw.wmnet with OS bookworm [14:29:01] JSherman: It looks like it'll be fine as the prep has been done [14:30:01] JSherman: looking at https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1034131 and https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1034132 it looks reasonable - is there anything specific you're concerned about? 
I'll likely be around for that window anyway [14:30:25] FIRING: SystemdUnitFailed: ferm.service on ml-serve1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:31:55] (03CR) 10Elukey: [C:03+1] kafka::mirror: Remove obsolete class parameter [puppet] - 10https://gerrit.wikimedia.org/r/1035329 (owner: 10Muehlenhoff) [14:32:13] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2116 (re)pooling @ 1%: post reimage repool', diff saved to https://phabricator.wikimedia.org/P62994 and previous config saved to /var/cache/conftool/dbconfig/20240523-143213-arnaudb.json [14:34:20] (03PS26) 10Hashar: Allow users to recheck tests in checkers [software/gerrit] (deploy/wmf/stable-3.8) - 10https://gerrit.wikimedia.org/r/1026135 (https://phabricator.wikimedia.org/T363918) (owner: 10Paladox) [14:34:33] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host cloudvirt1041.eqiad.wmnet with OS bookworm [14:35:06] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host stat1008.eqiad.wmnet [14:35:17] (03PS1) 10Elukey: amd-pytorch: refactor the common bits to DRY the Dockerfiles [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1035441 [14:35:39] (03PS1) 10Aklapper: Remove src/.phutil_module_cache from repository [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1035442 (https://phabricator.wikimedia.org/T365716) [14:35:59] (03PS2) 10Elukey: amd-pytorch: refactor the common bits to DRY the Dockerfiles [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1035441 [14:36:10] 10ops-eqiad, 06SRE, 06cloud-services-team, 06DC-Ops: cloudvirt1041: can't boot after reimage - https://phabricator.wikimedia.org/T364984#9825953 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host cloudvirt1041.eqiad.wmnet with OS bookworm 
[14:36:13] (03PS1) 10Muehlenhoff: Switch stat1008 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1035443 (https://phabricator.wikimedia.org/T349619) [14:36:20] TheresNoTime: and Reedy: We don't have db tables or anything, so I'm hoping this will be low impact. I think we're in pretty good shape. We don't have translations from translatewiki yet, and we haven't completed the performance review yet, but otherwise I think we're exactly where we are expected to be. We found that we could only do limited testing on beta due to the fact that the api we call can only look at [14:36:21] revisions in production tables. [14:37:12] (03PS3) 10Elukey: amd-pytorch: refactor the common bits to DRY the Dockerfiles [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1035441 [14:37:43] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depool db1235 T364290', diff saved to https://phabricator.wikimedia.org/P62995 and previous config saved to /var/cache/conftool/dbconfig/20240523-143742-arnaudb.json [14:37:48] (03CR) 10David Caro: [V:03+1] "There's some cleanup needed here before applying, there's some hosts that need to get the python3-openstack* packages upgraded, and a few " [puppet] - 10https://gerrit.wikimedia.org/r/1035148 (owner: 10David Caro) [14:37:48] T364290: Upgrade s1 to MariaDB 10.6 - https://phabricator.wikimedia.org/T364290 [14:38:01] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1235.eqiad.wmnet with reason: reimage [14:38:14] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1235.eqiad.wmnet with reason: reimage [14:38:28] (03CR) 10Dzahn: [C:03+2] role: delete ci_test role, not used anymore [puppet] - 10https://gerrit.wikimedia.org/r/1034955 (https://phabricator.wikimedia.org/T358237) (owner: 10Dzahn) [14:38:55] PROBLEM - Check whether ferm is active by checking the default input chain on mw1451 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not 
have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [14:39:01] (03CR) 10Jforrester: "Let's put these first in the list, so people pick them ahead of the legacy bare-metal servers? Also, are we going to retain the general k8" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1035361 (https://phabricator.wikimedia.org/T365478) (owner: 10Effie Mouzeli) [14:39:07] (03CR) 10Elukey: [C:03+2] "Going forward with the build+deploy to test if it works fine on k8s staging." [software/tegola] (wmf/v0.19.x) - 10https://gerrit.wikimedia.org/r/1035006 (https://phabricator.wikimedia.org/T344324) (owner: 10Elukey) [14:39:16] (03CR) 10Muehlenhoff: [C:03+2] Switch stat1008 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1035443 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [14:39:20] (03Abandoned) 10Dzahn: base: add a firewall alias for the default docker network [puppet] - 10https://gerrit.wikimedia.org/r/1017367 (owner: 10Dzahn) [14:39:25] FIRING: [10x] SystemdUnitFailed: ferm.service on kubernetes1033:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:39:47] !log arnaudb@cumin1002 START - Cookbook sre.hosts.reimage for host db1235.eqiad.wmnet with OS bookworm [14:39:52] (03Merged) 10jenkins-bot: cache: fix and improve the code in the s3 module that allows a proxy [software/tegola] (wmf/v0.19.x) - 10https://gerrit.wikimedia.org/r/1035006 (https://phabricator.wikimedia.org/T344324) (owner: 10Elukey) [14:41:11] PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 40 probes of 799 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [14:41:55] (03CR) 10Dzahn: [C:03+2] logspam-watch.sh: fix or suppress various 
shellcheck warnings [puppet] - 10https://gerrit.wikimedia.org/r/1035018 (https://phabricator.wikimedia.org/T364083) (owner: 10Brennen Bearnes) [14:42:22] (03CR) 10Pppery: [C:03+1] Add .gitignore [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1035439 (https://phabricator.wikimedia.org/T365716) (owner: 10Aklapper) [14:42:39] PROBLEM - Check whether ferm is active by checking the default input chain on ml-serve1002 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [14:43:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2151', diff saved to https://phabricator.wikimedia.org/P62996 and previous config saved to /var/cache/conftool/dbconfig/20240523-144351-marostegui.json [14:44:00] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Relabel codfw Kubernetes hosts - https://phabricator.wikimedia.org/T365712#9826025 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm completed [14:44:21] (03PS1) 10Elukey: services: update Tegola's Docker settings in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1035445 (https://phabricator.wikimedia.org/T344324) [14:44:25] FIRING: [12x] SystemdUnitFailed: ferm.service on kubernetes1033:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:45:33] (03CR) 10Elukey: [C:03+2] services: update Tegola's Docker settings in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1035445 (https://phabricator.wikimedia.org/T344324) (owner: 10Elukey) [14:46:05] RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 29 probes of 799 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts 
https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [14:46:52] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host stat1008.eqiad.wmnet [14:47:19] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2116 (re)pooling @ 2%: post reimage repool', diff saved to https://phabricator.wikimedia.org/P62997 and previous config saved to /var/cache/conftool/dbconfig/20240523-144719-arnaudb.json [14:47:21] (03CR) 10Ilias Sarantopoulos: "I like the idea, makes it simpler!" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1035441 (owner: 10Elukey) [14:47:23] (03CR) 10David Caro: [V:03+1] "cloudbackup* have only the clientpackages (does not include repos)" [puppet] - 10https://gerrit.wikimedia.org/r/1035148 (owner: 10David Caro) [14:47:39] (03PS3) 10David Caro: Reapply "openstack::bobcat: apply cloud yaml patch"" [puppet] - 10https://gerrit.wikimedia.org/r/1035148 (https://phabricator.wikimedia.org/T365640) [14:48:07] (03PS1) 10Scott French: aqs-http-gateway: add securityContext to all containers (attempt 2) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1035466 (https://phabricator.wikimedia.org/T362978) [14:49:25] FIRING: [12x] SystemdUnitFailed: ferm.service on kubernetes1033:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:49:26] (03CR) 10Samtar: [C:03+1] CommonSettings: Load AutoModerator extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1034131 (https://phabricator.wikimedia.org/T361643) (owner: 10Jsn.sherman) [14:49:34] (03CR) 10Samtar: [C:03+1] InitializeSettings: testwiki enable AutoModerator [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1034132 (https://phabricator.wikimedia.org/T361643) (owner: 10Jsn.sherman) [14:50:12] (03CR) 10DCausse: [C:03+2] cirrus: add alerts on fetch error rates [alerts] - 10https://gerrit.wikimedia.org/r/1031522 
(https://phabricator.wikimedia.org/T364837) (owner: 10DCausse) [14:51:04] !log elukey@deploy1002 helmfile [staging] START helmfile.d/services/tegola-vector-tiles: sync [14:51:19] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes2042 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [14:51:21] !log elukey@deploy1002 helmfile [staging] DONE helmfile.d/services/tegola-vector-tiles: sync [14:51:23] (03Merged) 10jenkins-bot: cirrus: add alerts on fetch error rates [alerts] - 10https://gerrit.wikimedia.org/r/1031522 (https://phabricator.wikimedia.org/T364837) (owner: 10DCausse) [14:51:24] (03CR) 10Reedy: InitializeSettings: testwiki enable AutoModerator (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1034132 (https://phabricator.wikimedia.org/T361643) (owner: 10Jsn.sherman) [14:51:29] (03CR) 10Zabe: [C:03+2] beta: Remove password config override [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1034922 (owner: 10Zabe) [14:52:15] PROBLEM - BGP status on cr1-esams is CRITICAL: BGP CRITICAL - AS1299/IPv4: Idle - Telia, AS1299/IPv6: Connect - Telia https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:52:40] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1235.eqiad.wmnet with reason: host reimage [14:53:09] (03Merged) 10jenkins-bot: beta: Remove password config override [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1034922 (owner: 10Zabe) [14:54:25] FIRING: [13x] SystemdUnitFailed: ferm.service on kubernetes1033:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:55:37] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1235.eqiad.wmnet with reason: host reimage [14:57:20] (03PS4) 10Elukey: amd-pytorch: refactor the 
common bits to DRY the Dockerfiles [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1035441 [14:57:24] (03CR) 10Elukey: amd-pytorch: refactor the common bits to DRY the Dockerfiles (032 comments) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1035441 (owner: 10Elukey) [14:58:24] (03PS4) 10David Caro: Reapply "openstack::bobcat: apply cloud yaml patch"" [puppet] - 10https://gerrit.wikimedia.org/r/1035148 (https://phabricator.wikimedia.org/T365640) [14:58:24] (03PS2) 10David Caro: horizon: remove openstack client packages [puppet] - 10https://gerrit.wikimedia.org/r/1035412 [14:58:35] (03CR) 10David Caro: "I suspect that there's some more to cleanup, will recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1035412 (owner: 10David Caro) [14:58:42] (03PS5) 10Jsn.sherman: InitialiseSettings: testwiki enable AutoModerator [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1034132 (https://phabricator.wikimedia.org/T361643) [14:58:50] (03CR) 10BryanDavis: [C:04-2] "I have been thinking about my reaction here and have decided that waiting for a clear future Redis replacement is being too conservative." 
[docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/1012797 (https://phabricator.wikimedia.org/T360378) (owner: 10BryanDavis) [14:58:57] (03CR) 10BryanDavis: Add redis image [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/1012797 (https://phabricator.wikimedia.org/T360378) (owner: 10BryanDavis) [14:58:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2151 (T364299)', diff saved to https://phabricator.wikimedia.org/P62998 and previous config saved to /var/cache/conftool/dbconfig/20240523-145858-marostegui.json [14:59:01] (03CR) 10Jsn.sherman: InitialiseSettings: testwiki enable AutoModerator (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1034132 (https://phabricator.wikimedia.org/T361643) (owner: 10Jsn.sherman) [14:59:02] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2158.codfw.wmnet with reason: Maintenance [14:59:03] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [14:59:15] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2158.codfw.wmnet with reason: Maintenance [14:59:17] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2187.codfw.wmnet with reason: Maintenance [14:59:30] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2187.codfw.wmnet with reason: Maintenance [14:59:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2158 (T364299)', diff saved to https://phabricator.wikimedia.org/P62999 and previous config saved to /var/cache/conftool/dbconfig/20240523-145938-marostegui.json [15:00:13] (03CR) 10Hnowlan: [C:03+1] aqs-http-gateway: add securityContext to all containers (attempt 2) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1035466 (https://phabricator.wikimedia.org/T362978) (owner: 10Scott French) [15:00:25] RESOLVED: SystemdUnitFailed: ferm.service on 
ml-serve1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:01:11] (03CR) 10Eevans: [C:03+1] sessionstore: update certs in advance of expiry [deployment-charts] - 10https://gerrit.wikimedia.org/r/1035357 (https://phabricator.wikimedia.org/T363996) (owner: 10Hnowlan) [15:02:26] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2116 (re)pooling @ 5%: post reimage repool', diff saved to https://phabricator.wikimedia.org/P63000 and previous config saved to /var/cache/conftool/dbconfig/20240523-150225-arnaudb.json [15:03:34] (03PS1) 10Lucas Werkmeister (WMDE): Remove statistics::wmde::wdcm [puppet] - 10https://gerrit.wikimedia.org/r/1035468 (https://phabricator.wikimedia.org/T364965) [15:04:24] (03CR) 10Lucas Werkmeister (WMDE): "Note: this doesn’t mark the git clone as `absent`, so IIUC, it will still continue to exist on stat1011 (it’s at `/srv/analytics-wmde/wdcm" [puppet] - 10https://gerrit.wikimedia.org/r/1035468 (https://phabricator.wikimedia.org/T364965) (owner: 10Lucas Werkmeister (WMDE)) [15:06:08] (03CR) 10Pppery: interwiki.php: Remove duplicates (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1035389 (https://phabricator.wikimedia.org/T365679) (owner: 10Reedy) [15:06:28] (03CR) 10Hnowlan: [C:03+2] sessionstore: update certs in advance of expiry [deployment-charts] - 10https://gerrit.wikimedia.org/r/1035357 (https://phabricator.wikimedia.org/T363996) (owner: 10Hnowlan) [15:07:19] (03Merged) 10jenkins-bot: sessionstore: update certs in advance of expiry [deployment-charts] - 10https://gerrit.wikimedia.org/r/1035357 (https://phabricator.wikimedia.org/T363996) (owner: 10Hnowlan) [15:07:38] 07Puppet, 10Wikidata, 06Wikidata Dev Team, 10wmde-wikidata-tech, and 2 others: Remove the WDCM clone (stats1007) - https://phabricator.wikimedia.org/T351072#9826085 (10Lucas_Werkmeister_WMDE) >>! 
In T351072#9817102, @AndrewTavis_WMDE wrote: > So basically removing the wdcm.pp related file on GitHub and its... [15:08:55] RECOVERY - Check whether ferm is active by checking the default input chain on mw1451 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [15:09:06] (03CR) 10Cathal Mooney: [C:03+1] Enable BFD on Telxius transit [homer/public] - 10https://gerrit.wikimedia.org/r/1035371 (https://phabricator.wikimedia.org/T362421) (owner: 10Ayounsi) [15:09:25] FIRING: [4x] SystemdUnitFailed: ferm.service on kubernetes1033:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:12:39] RECOVERY - Check whether ferm is active by checking the default input chain on ml-serve1002 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [15:16:35] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1235.eqiad.wmnet with OS bookworm [15:17:32] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2116 (re)pooling @ 10%: post reimage repool', diff saved to https://phabricator.wikimedia.org/P63001 and previous config saved to /var/cache/conftool/dbconfig/20240523-151731-arnaudb.json [15:18:12] !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/sessionstore: apply [15:18:20] (03PS1) 10Aklapper: Remove src/.phutil_module_cache from repository [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1035470 (https://phabricator.wikimedia.org/T365716) [15:18:32] !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/sessionstore: apply [15:18:40] (03CR) 10Scott French: [C:03+2] aqs-http-gateway: add securityContext to all containers (attempt 2) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1035466 (https://phabricator.wikimedia.org/T362978) (owner: 10Scott 
French) [15:18:53] (03Abandoned) 10Aklapper: Remove src/.phutil_module_cache from repository [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1035442 (https://phabricator.wikimedia.org/T365716) (owner: 10Aklapper) [15:19:25] RESOLVED: [3x] SystemdUnitFailed: ferm.service on mw1455:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:19:58] FIRING: [2x] CertAlmostExpired: Certificate for service sessionstore:8081 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#sessionstore:8081 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [15:20:04] (03Merged) 10jenkins-bot: aqs-http-gateway: add securityContext to all containers (attempt 2) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1035466 (https://phabricator.wikimedia.org/T362978) (owner: 10Scott French) [15:20:23] 06SRE, 06Infrastructure-Foundations, 10netops: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - https://phabricator.wikimedia.org/T348977#9826145 (10cmooney) [15:21:33] (03PS1) 10JHathaway: jhathaway: update dotfiles [puppet] - 10https://gerrit.wikimedia.org/r/1035471 [15:22:37] !log swfrench@deploy1002 helmfile [staging] START helmfile.d/services/device-analytics: apply [15:22:39] !log swfrench@deploy1002 helmfile [staging] DONE helmfile.d/services/device-analytics: apply [15:24:10] (03CR) 10JHathaway: [C:03+2] jhathaway: update dotfiles [puppet] - 10https://gerrit.wikimedia.org/r/1035471 (owner: 10JHathaway) [15:24:30] (03CR) 10Aklapper: [C:03+2] Re-extract i18n to pick up latest changes [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1032094 (https://phabricator.wikimedia.org/T363188) (owner: 10Pppery) [15:24:31] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1235 (re)pooling @ 1%: post reimage repool', diff saved to 
https://phabricator.wikimedia.org/P63002 and previous config saved to /var/cache/conftool/dbconfig/20240523-152431-arnaudb.json [15:24:50] (03CR) 10Aklapper: [V:03+2 C:03+2] "Tested locally and applies cleanly. Thanks!" [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1032094 (https://phabricator.wikimedia.org/T363188) (owner: 10Pppery) [15:25:24] (03PS1) 10Sergio Gimeno: [GrowthExperiments] Disable personalized praise in eswiki labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1035473 (https://phabricator.wikimedia.org/T359038) [15:25:54] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/sessionstore: apply [15:25:55] !log swfrench@deploy1002 helmfile [staging] START helmfile.d/services/device-analytics: apply [15:26:10] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/sessionstore: apply [15:26:36] !log swfrench@deploy1002 helmfile [staging] DONE helmfile.d/services/device-analytics: apply [15:26:50] (03CR) 10EoghanGaffney: [C:03+2] wikitech: Add credentials for GitLab account blocking [puppet] - 10https://gerrit.wikimedia.org/r/1034532 (https://phabricator.wikimedia.org/T316418) (owner: 10BryanDavis) [15:28:07] !log swfrench@deploy1002 helmfile [staging] START helmfile.d/services/edit-analytics: apply [15:28:28] (03CR) 10Ilias Sarantopoulos: [C:03+2] ml-services: increase min replicas for ruwiki-goodfaith [deployment-charts] - 10https://gerrit.wikimedia.org/r/1034114 (https://phabricator.wikimedia.org/T362503) (owner: 10Ilias Sarantopoulos) [15:28:39] !log swfrench@deploy1002 helmfile [staging] DONE helmfile.d/services/edit-analytics: apply [15:29:27] (03Merged) 10jenkins-bot: ml-services: increase min replicas for ruwiki-goodfaith [deployment-charts] - 10https://gerrit.wikimedia.org/r/1034114 (https://phabricator.wikimedia.org/T362503) (owner: 10Ilias Sarantopoulos) [15:29:43] !log swfrench@deploy1002 helmfile [staging] START helmfile.d/services/editor-analytics: apply [15:29:58] RESOLVED: 
[2x] CertAlmostExpired: Certificate for service sessionstore:8081 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#sessionstore:8081 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [15:30:09] !log moving phabricator outbound email to postfix based mx-out{1001,2001} [15:30:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:30:18] !log swfrench@deploy1002 helmfile [staging] DONE helmfile.d/services/editor-analytics: apply [15:30:20] (03CR) 10JHathaway: [C:03+2] phabricator: Move outbound email to mx-out{1001,2001}.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1035050 (https://phabricator.wikimedia.org/T365395) (owner: 10JHathaway) [15:30:36] (03CR) 10David Caro: "Did this break puppet for cloudwebs?" [puppet] - 10https://gerrit.wikimedia.org/r/1034532 (https://phabricator.wikimedia.org/T316418) (owner: 10BryanDavis) [15:31:25] (03CR) 10David Caro: "(I'm guessing that the secret was added only to the really "secret" repo xd)" [puppet] - 10https://gerrit.wikimedia.org/r/1034532 (https://phabricator.wikimedia.org/T316418) (owner: 10BryanDavis) [15:32:11] !log swfrench@deploy1002 helmfile [staging] START helmfile.d/services/geo-analytics: apply [15:32:14] JSherman: I'm unavailable unfortunately [15:32:38] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2116 (re)pooling @ 25%: post reimage repool', diff saved to https://phabricator.wikimedia.org/P63003 and previous config saved to /var/cache/conftool/dbconfig/20240523-153237-arnaudb.json [15:32:44] (03CR) 10Ilias Sarantopoulos: [C:03+1] amd-pytorch: refactor the common bits to DRY the Dockerfiles (032 comments) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1035441 (owner: 10Elukey) [15:32:55] !log swfrench@deploy1002 helmfile [staging] DONE helmfile.d/services/geo-analytics: apply [15:33:22] FIRING: SessionStoreOnNonDedicatedHost: Sessionstore k8s pods are running on non-dedicated hosts - TODO - TODO - 
https://alerts.wikimedia.org/?q=alertname%3DSessionStoreOnNonDedicatedHost [15:33:53] (03CR) 10Pppery: "I personally think the current ordering makes more sense than alphabetical - the current ordering is first the Meta interwiki map, then la" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1035417 (owner: 10Reedy) [15:33:57] (03CR) 10BryanDavis: "We ended up reusing an existing secret, but it is very possible nobody ever added a placeholder for `profile::gitlab::ldap_group_sync_bot_" [puppet] - 10https://gerrit.wikimedia.org/r/1034532 (https://phabricator.wikimedia.org/T316418) (owner: 10BryanDavis) [15:34:22] !log swfrench@deploy1002 helmfile [staging] START helmfile.d/services/media-analytics: apply [15:34:36] JSherman: But those patches look straightforward and routine, I don't expect any trouble unless there's some sort of terrible bug in the AutoModerator code. And if that happens you can turn it back off [15:34:54] !log swfrench@deploy1002 helmfile [staging] DONE helmfile.d/services/media-analytics: apply [15:35:05] (03CR) 10David Caro: "👍 I can take a look :)" [puppet] - 10https://gerrit.wikimedia.org/r/1034532 (https://phabricator.wikimedia.org/T316418) (owner: 10BryanDavis) [15:36:37] !log swfrench@deploy1002 helmfile [staging] START helmfile.d/services/page-analytics: apply [15:37:03] !log swfrench@deploy1002 helmfile [staging] DONE helmfile.d/services/page-analytics: apply [15:37:05] (03PS2) 10BryanDavis: wikitech: Add dummy GitLab API token [labs/private] - 10https://gerrit.wikimedia.org/r/1034533 (https://phabricator.wikimedia.org/T316418) [15:37:23] (03PS1) 10Ssingh: P:cumin: add support for aliasing LVS host classes [puppet] - 10https://gerrit.wikimedia.org/r/1035474 [15:37:49] (03CR) 10EoghanGaffney: [C:03+1] wikitech: Add dummy GitLab API token [labs/private] - 10https://gerrit.wikimedia.org/r/1034533 (https://phabricator.wikimedia.org/T316418) (owner: 10BryanDavis) [15:38:12] (03CR) 10EoghanGaffney: [C:03+2] wikitech: Add dummy GitLab 
API token [labs/private] - 10https://gerrit.wikimedia.org/r/1034533 (https://phabricator.wikimedia.org/T316418) (owner: 10BryanDavis) [15:38:16] (03PS1) 10Ilias Sarantopoulos: ml-services: update hf image and remove nllb [deployment-charts] - 10https://gerrit.wikimedia.org/r/1035476 (https://phabricator.wikimedia.org/T357986) [15:38:17] (03PS3) 10David Caro: wikitech: Add dummy GitLab API token [labs/private] - 10https://gerrit.wikimedia.org/r/1034533 (https://phabricator.wikimedia.org/T316418) (owner: 10BryanDavis) [15:38:23] (03CR) 10RLazarus: [C:03+2] tegola-vector-tiles: Add securityContext and update dependencies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1032524 (https://phabricator.wikimedia.org/T362978) (owner: 10RLazarus) [15:38:53] (03CR) 10BryanDavis: "https://gerrit.wikimedia.org/r/c/labs/private/+/1034533 has been updated. I think Eoghan is on top of merging it." [puppet] - 10https://gerrit.wikimedia.org/r/1034532 (https://phabricator.wikimedia.org/T316418) (owner: 10BryanDavis) [15:39:15] (03Merged) 10jenkins-bot: tegola-vector-tiles: Add securityContext and update dependencies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1032524 (https://phabricator.wikimedia.org/T362978) (owner: 10RLazarus) [15:39:22] (03CR) 10EoghanGaffney: [C:03+2] wikitech: Add dummy GitLab API token [labs/private] - 10https://gerrit.wikimedia.org/r/1034533 (https://phabricator.wikimedia.org/T316418) (owner: 10BryanDavis) [15:39:23] (03CR) 10EoghanGaffney: [V:03+2 C:03+2] wikitech: Add dummy GitLab API token [labs/private] - 10https://gerrit.wikimedia.org/r/1034533 (https://phabricator.wikimedia.org/T316418) (owner: 10BryanDavis) [15:39:39] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1235 (re)pooling @ 2%: post reimage repool', diff saved to https://phabricator.wikimedia.org/P63004 and previous config saved to /var/cache/conftool/dbconfig/20240523-153937-arnaudb.json [15:40:04] (03PS5) 10Btullis: Migrate AQS2 services and image-suggestions to calico 
network policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1033405 (https://phabricator.wikimedia.org/T359423) [15:40:23] !log rzl@deploy1002 helmfile [staging] START helmfile.d/services/tegola-vector-tiles: apply [15:40:54] !log swfrench@deploy1002 helmfile [staging] START helmfile.d/services/image-suggestion: apply [15:41:02] !log rzl@deploy1002 helmfile [staging] DONE helmfile.d/services/tegola-vector-tiles: apply [15:41:31] !log swfrench@deploy1002 helmfile [staging] DONE helmfile.d/services/image-suggestion: apply [15:41:58] (03CR) 10Vgutierrez: [C:03+1] "TLS material looks good on both endpoints:" [puppet] - 10https://gerrit.wikimedia.org/r/1021383 (https://phabricator.wikimedia.org/T357434) (owner: 10Brouberol) [15:42:22] (03CR) 10Brouberol: [C:03+2] trafficserver: Add CDN config for datasets-config.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1021383 (https://phabricator.wikimedia.org/T357434) (owner: 10Brouberol) [15:44:20] (03PS5) 10David Caro: Reapply "openstack::bobcat: apply cloud yaml patch"" [puppet] - 10https://gerrit.wikimedia.org/r/1035148 (https://phabricator.wikimedia.org/T365640) [15:44:20] (03PS3) 10David Caro: horizon: remove openstack client packages [puppet] - 10https://gerrit.wikimedia.org/r/1035412 [15:45:33] (03CR) 10Ssingh: "PCC looks good https://puppet-compiler.wmflabs.org/output/1035474/2615/cumin1002.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1035474 (owner: 10Ssingh) [15:46:31] RoanKattouw: ack; thanks! 
[15:47:21] !log rzl@deploy1002 helmfile [codfw] START helmfile.d/services/tegola-vector-tiles: apply [15:47:44] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2116 (re)pooling @ 50%: post reimage repool', diff saved to https://phabricator.wikimedia.org/P63006 and previous config saved to /var/cache/conftool/dbconfig/20240523-154743-arnaudb.json [15:47:51] !log rzl@deploy1002 helmfile [codfw] DONE helmfile.d/services/tegola-vector-tiles: apply [15:50:34] !log swfrench@deploy1002 helmfile [codfw] START helmfile.d/services/device-analytics: apply [15:51:21] !log swfrench@deploy1002 helmfile [codfw] DONE helmfile.d/services/device-analytics: apply [15:53:04] (03PS1) 10BryanDavis: wikitech: Fix missing param for '::openstack::wikitech::web' [puppet] - 10https://gerrit.wikimedia.org/r/1035478 [15:53:36] !log swfrench@deploy1002 helmfile [codfw] START helmfile.d/services/edit-analytics: apply [15:53:36] 06SRE, 10Language-Technical Support, 06serviceops, 10Wikimedia-Site-requests, 13Patch-For-Review: Raise limit of $wgMaxArticleSize for Hebrew Wikisource - https://phabricator.wikimedia.org/T275319#9826396 (10cscott) So, having written the above two patches to replace byte-size limits with character-size... [15:53:42] (03CR) 10David Caro: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1035478 (owner: 10BryanDavis) [15:54:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2158 (T364299)', diff saved to https://phabricator.wikimedia.org/P63008 and previous config saved to /var/cache/conftool/dbconfig/20240523-155413-marostegui.json [15:54:16] !log swfrench@deploy1002 helmfile [codfw] DONE helmfile.d/services/edit-analytics: apply [15:54:18] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [15:54:45] FIRING: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_search_eqiad in eqiad (k8s): ... 
[15:54:45] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [15:54:45] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1235 (re)pooling @ 5%: post reimage repool', diff saved to https://phabricator.wikimedia.org/P63009 and previous config saved to /var/cache/conftool/dbconfig/20240523-155444-arnaudb.json [15:55:17] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2617/console" [puppet] - 10https://gerrit.wikimedia.org/r/1035478 (owner: 10BryanDavis) [15:55:37] (03CR) 10Majavah: [V:03+1 C:03+2] wikitech: Fix missing param for '::openstack::wikitech::web' [puppet] - 10https://gerrit.wikimedia.org/r/1035478 (owner: 10BryanDavis) [15:55:41] !log swfrench@deploy1002 helmfile [codfw] START helmfile.d/services/editor-analytics: apply [15:55:42] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1041.eqiad.wmnet with OS bookworm [15:55:52] 10ops-eqiad, 06SRE, 06cloud-services-team, 06DC-Ops: cloudvirt1041: can't boot after reimage - https://phabricator.wikimedia.org/T364984#9826417 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host cloudvirt1041.eqiad.wmnet with OS bookworm executed with errors:... 
[15:56:08] !log swfrench@deploy1002 helmfile [codfw] DONE helmfile.d/services/editor-analytics: apply [15:57:55] !log swfrench@deploy1002 helmfile [codfw] START helmfile.d/services/geo-analytics: apply [15:58:20] (03CR) 10David Caro: [V:03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2619/co" [puppet] - 10https://gerrit.wikimedia.org/r/1035148 (https://phabricator.wikimedia.org/T365640) (owner: 10David Caro) [15:58:32] !log swfrench@deploy1002 helmfile [codfw] DONE helmfile.d/services/geo-analytics: apply [15:59:45] FIRING: [2x] CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s): fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [15:59:49] !log swfrench@deploy1002 helmfile [codfw] START helmfile.d/services/media-analytics: apply [16:00:03] (03PS1) 10Majavah: openstack: wikitech: Do not log file diff [puppet] - 10https://gerrit.wikimedia.org/r/1035479 [16:00:05] jhathaway and rzl: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240523T1600). [16:00:05] No Gerrit patches in the queue for this window AFAICS. 
[16:00:39] !log swfrench@deploy1002 helmfile [codfw] DONE helmfile.d/services/media-analytics: apply [16:00:41] (03CR) 10Cathal Mooney: [C:03+2] Enable BFD on Telxius transit [homer/public] - 10https://gerrit.wikimedia.org/r/1035371 (https://phabricator.wikimedia.org/T362421) (owner: 10Ayounsi) [16:00:48] (03CR) 10David Caro: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1035479 (owner: 10Majavah) [16:00:49] 10ops-eqiad, 06SRE, 06cloud-services-team, 06DC-Ops: cloudvirt1041: can't boot after reimage - https://phabricator.wikimedia.org/T364984#9826440 (10Jclark-ctr) @aborrero I am stuck right now i did attempt to reimage with no luck. Unsure what version of grub we have installed but looks like the same as thi... [16:00:56] (03CR) 10Majavah: [C:03+2] openstack: wikitech: Do not log file diff [puppet] - 10https://gerrit.wikimedia.org/r/1035479 (owner: 10Majavah) [16:00:57] (03CR) 10David Caro: [V:03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2620/co" [puppet] - 10https://gerrit.wikimedia.org/r/1035412 (owner: 10David Caro) [16:01:47] !log swfrench@deploy1002 helmfile [codfw] START helmfile.d/services/page-analytics: apply [16:02:23] !log rzl@deploy1002 helmfile [eqiad] START helmfile.d/services/tegola-vector-tiles: apply [16:02:34] !log swfrench@deploy1002 helmfile [codfw] DONE helmfile.d/services/page-analytics: apply [16:02:50] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2116 (re)pooling @ 75%: post reimage repool', diff saved to https://phabricator.wikimedia.org/P63010 and previous config saved to /var/cache/conftool/dbconfig/20240523-160249-arnaudb.json [16:02:57] !log rzl@deploy1002 helmfile [eqiad] DONE helmfile.d/services/tegola-vector-tiles: apply [16:03:55] (03Merged) 10jenkins-bot: Enable BFD on Telxius transit [homer/public] - 10https://gerrit.wikimedia.org/r/1035371 (https://phabricator.wikimedia.org/T362421) (owner: 10Ayounsi) [16:03:59] !log 
swfrench@deploy1002 helmfile [codfw] START helmfile.d/services/image-suggestion: apply [16:04:16] !log elukey@deploy1002 helmfile [staging] START helmfile.d/services/tegola-vector-tiles: sync [16:04:19] !log elukey@deploy1002 helmfile [staging] DONE helmfile.d/services/tegola-vector-tiles: sync [16:04:39] !log swfrench@deploy1002 helmfile [codfw] DONE helmfile.d/services/image-suggestion: apply [16:05:10] (03PS1) 10Hnowlan: api-gateway: add normalise_paths option, enable in api-gateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/1035481 (https://phabricator.wikimedia.org/T365439) [16:05:12] elukey: are we both deploying tegola-vector-tiles? :) [16:05:23] (03CR) 10David Caro: [V:03+1] "PCC SUCCESS (CORE_DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2621/co" [puppet] - 10https://gerrit.wikimedia.org/r/1035148 (https://phabricator.wikimedia.org/T365640) (owner: 10David Caro) [16:05:33] !log enabling BFD on transit circuit to telxius in magru [16:05:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:05:58] rzl: ahem sorryyyy I was doing a quick hack/test in staging, lemme revert [16:06:02] (03CR) 10Brennen Bearnes: [V:03+2 C:03+2] Remove src/.phutil_module_cache from repository [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1035470 (https://phabricator.wikimedia.org/T365716) (owner: 10Aklapper) [16:06:12] no no you're fine! 
I just wrapped up, just wanted to make sure we aren't at cross purposes [16:06:26] (03CR) 10Brennen Bearnes: [V:03+2 C:03+2] Add .gitignore [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1035439 (https://phabricator.wikimedia.org/T365716) (owner: 10Aklapper) [16:06:33] I was deploying https://gerrit.wikimedia.org/r/1032524, should be unimpactful [16:06:46] rzl: nono I am testing the usage of the local sidecar for swift [16:08:39] !log elukey@deploy1002 helmfile [staging] START helmfile.d/services/tegola-vector-tiles: sync [16:08:42] !log elukey@deploy1002 helmfile [staging] DONE helmfile.d/services/tegola-vector-tiles: sync [16:09:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2158', diff saved to https://phabricator.wikimedia.org/P63011 and previous config saved to /var/cache/conftool/dbconfig/20240523-160921-marostegui.json [16:09:45] RESOLVED: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s): ... [16:09:45] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [16:09:51] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1235 (re)pooling @ 10%: post reimage repool', diff saved to https://phabricator.wikimedia.org/P63012 and previous config saved to /var/cache/conftool/dbconfig/20240523-160951-arnaudb.json [16:10:55] (03CR) 10Ssingh: "This is a comment-only change so feel free to pick up and merge whenever." 
[homer/public] - 10https://gerrit.wikimedia.org/r/1032522 (owner: 10Ssingh) [16:12:57] !log swfrench@deploy1002 helmfile [codfw] START helmfile.d/services/device-analytics: apply [16:13:00] !log swfrench@deploy1002 helmfile [codfw] DONE helmfile.d/services/device-analytics: apply [16:13:12] !log swfrench@deploy1002 helmfile [eqiad] START helmfile.d/services/device-analytics: apply [16:13:27] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: codfw row C/D upgrade racking task - https://phabricator.wikimedia.org/T360789#9826511 (10Papaul) [16:13:53] 10ops-eqiad, 06SRE, 06cloud-services-team, 10Cloud-VPS, and 2 others: Degraded RAID on cloudcephosd1031 - https://phabricator.wikimedia.org/T364060#9826509 (10Jclark-ctr) a:03Jclark-ctr You have successfully submitted request SR191070960. Ordered replacement drive. will update when arrives [16:14:17] !log swfrench@deploy1002 helmfile [eqiad] DONE helmfile.d/services/device-analytics: apply [16:15:09] !log swfrench@deploy1002 helmfile [eqiad] START helmfile.d/services/edit-analytics: apply [16:15:42] !log swfrench@deploy1002 helmfile [eqiad] DONE helmfile.d/services/edit-analytics: apply [16:16:51] !log swfrench@deploy1002 helmfile [eqiad] START helmfile.d/services/editor-analytics: apply [16:17:16] !log swfrench@deploy1002 helmfile [eqiad] DONE helmfile.d/services/editor-analytics: apply [16:17:56] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2116 (re)pooling @ 100%: post reimage repool', diff saved to https://phabricator.wikimedia.org/P63013 and previous config saved to /var/cache/conftool/dbconfig/20240523-161755-arnaudb.json [16:18:01] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Relabel eqiad Kubernetes hosts - https://phabricator.wikimedia.org/T365711#9826540 (10Jclark-ctr) 05Open→03Resolved relabled servers [16:18:19] !log swfrench@deploy1002 helmfile [eqiad] START helmfile.d/services/geo-analytics: apply [16:19:17] !log swfrench@deploy1002 helmfile [eqiad] DONE 
helmfile.d/services/geo-analytics: apply [16:20:51] !log swfrench@deploy1002 helmfile [eqiad] START helmfile.d/services/media-analytics: apply [16:21:15] FIRING: [2x] CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s): fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [16:21:30] RESOLVED: [2x] CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s): fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [16:21:39] !log swfrench@deploy1002 helmfile [eqiad] DONE helmfile.d/services/media-analytics: apply [16:23:59] !log swfrench@deploy1002 helmfile [eqiad] START helmfile.d/services/page-analytics: apply [16:24:31] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2158', diff saved to https://phabricator.wikimedia.org/P63014 and previous config saved to /var/cache/conftool/dbconfig/20240523-162430-marostegui.json [16:24:53] !log swfrench@deploy1002 helmfile [eqiad] DONE helmfile.d/services/page-analytics: apply [16:24:58] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1235 (re)pooling @ 25%: post reimage repool', diff saved to https://phabricator.wikimedia.org/P63015 and previous config saved to /var/cache/conftool/dbconfig/20240523-162457-arnaudb.json [16:26:01] !log swfrench@deploy1002 helmfile [eqiad] START helmfile.d/services/image-suggestion: apply [16:26:46] 06SRE, 06Infrastructure-Foundations, 10netops: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - https://phabricator.wikimedia.org/T348977#9826583 (10cmooney) [16:27:09] !log swfrench@deploy1002 helmfile [eqiad] DONE helmfile.d/services/image-suggestion: apply [16:31:55] (03PS2) 10Ssingh: P:cumin: add support for aliasing LVS 
host classes [puppet] - 10https://gerrit.wikimedia.org/r/1035474 [16:34:01] (03CR) 10Ssingh: "https://puppet-compiler.wmflabs.org/output/1035474/2623/cumin1002.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1035474 (owner: 10Ssingh) [16:36:21] 06SRE, 10Language-Technical Support, 06serviceops, 10Wikimedia-Site-requests, 13Patch-For-Review: Raise limit of $wgMaxArticleSize for Hebrew Wikisource - https://phabricator.wikimedia.org/T275319#9826632 (10Fuzzy) @cscott, thank you very much for your work on this issue. I completely agree that changing... [16:36:47] FIRING: SystemdUnitFailed: logrotate.service on moss-be1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:37:33] !log destroying all blubberoid deployments as part of its decommissioning (T318289) [16:37:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:37:37] T318289: Deprecate Blubber's CLI and microservice (blubberoid) interfaces - https://phabricator.wikimedia.org/T318289 [16:39:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2158 (T364299)', diff saved to https://phabricator.wikimedia.org/P63016 and previous config saved to /var/cache/conftool/dbconfig/20240523-163938-marostegui.json [16:39:41] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2169.codfw.wmnet with reason: Maintenance [16:39:43] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [16:39:54] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2169.codfw.wmnet with reason: Maintenance [16:40:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2169 (T364299)', diff saved to https://phabricator.wikimedia.org/P63017 and previous config saved to 
/var/cache/conftool/dbconfig/20240523-164002-marostegui.json [16:40:11] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1235 (re)pooling @ 50%: post reimage repool', diff saved to https://phabricator.wikimedia.org/P63018 and previous config saved to /var/cache/conftool/dbconfig/20240523-164010-arnaudb.json [16:40:23] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - blubberoid_4666: Servers parse1011.eqiad.wmnet, mw1433.eqiad.wmnet, kubernetes1041.eqiad.wmnet, kubernetes1025.eqiad.wmnet, mw1470.eqiad.wmnet, parse1009.eqiad.wmnet, mw1484.eqiad.wmnet, mw1405.eqiad.wmnet, mw1425.eqiad.wmnet, kubernetes1030.eqiad.wmnet, mw1391.eqiad.wmnet, mw1424.eqiad.wmnet, parse1005.eqiad.wmnet, mw1408.eqiad.wmnet, mw1370.eqiad. [16:40:23] 1389.eqiad.wmnet, kubernetes1050.eqiad.wmnet, kubernetes1014.eqiad.wmnet, mw1483.eqiad.wmnet, kubernetes1048.eqiad.wmnet, mw1469.eqiad.wmnet, kubernetes1058.eqiad.wmnet, kubernetes1038.eqiad.wmnet, mw1356.eqiad.wmnet, kubernetes1018.eqiad.wmnet, mw1369.eqiad.wmnet, mw1371.eqiad.wmnet, mw1468.eqiad.wmnet, kubernetes1028.eqiad.wmnet, kubernetes1015.eqiad.wmnet, kubernetes1008.eqiad.wmnet, kubernetes1019.eqiad.wmnet, kubernetes1024.eqiad.wm [16:40:23] 39.eqiad.wmnet, mw1464.eqiad.wmnet, mw1381.eqiad.wmnet, parse1021.eqiad.wmnet, kubernetes1042.eqiad.wmnet, parse1022.eqiad.wmnet, kubernetes1035.eqiad.wmnet, mw1379.eqiad.wmnet, kuberne https://wikitech.wikimedia.org/wiki/PyBal [16:40:25] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - blubberoid_4666: Servers kubernetes1010.eqiad.wmnet, parse1013.eqiad.wmnet, mw1380.eqiad.wmnet, parse1014.eqiad.wmnet, parse1007.eqiad.wmnet, mw1457.eqiad.wmnet, mw1419.eqiad.wmnet, mw1476.eqiad.wmnet, mw1458.eqiad.wmnet, mw1432.eqiad.wmnet, kubernetes1022.eqiad.wmnet, mw1478.eqiad.wmnet, mw1349.eqiad.wmnet, mw1384.eqiad.wmnet, mw1479.eqiad.wmnet, k [16:40:25] 1023.eqiad.wmnet, mw1378.eqiad.wmnet, kubernetes1021.eqiad.wmnet, 
mw1462.eqiad.wmnet, mw1430.eqiad.wmnet, mw1459.eqiad.wmnet, mw1388.eqiad.wmnet, kubernetes1044.eqiad.wmnet, mw1449.eqiad.wmnet, mw1495.eqiad.wmnet, mw1492.eqiad.wmnet, kubernetes1047.eqiad.wmnet, kubernetes1030.eqiad.wmnet, mw1435.eqiad.wmnet, mw1424.eqiad.wmnet, mw1454.eqiad.wmnet, parse1010.eqiad.wmnet, mw1408.eqiad.wmnet, mw1370.eqiad.wmnet, mw1477.eqiad.wmnet, mw1496.e [16:40:25] t, kubernetes1060.eqiad.wmnet, kubernetes1050.eqiad.wmnet, kubernetes1020.eqiad.wmnet, mw1397.eqiad.wmnet, kubernetes1033.eqiad.wmnet, mw1394.eqiad.wmnet, mw1385.eqiad.wmnet, mw1483.eqi https://wikitech.wikimedia.org/wiki/PyBal [16:40:44] erk [16:40:47] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - blubberoid_4666: Servers kubernetes2046.codfw.wmnet, mw2396.codfw.wmnet, mw2334.codfw.wmnet, mw2420.codfw.wmnet, parse2006.codfw.wmnet, mw2312.codfw.wmnet, kubernetes2024.codfw.wmnet, parse2009.codfw.wmnet, mw2421.codfw.wmnet, parse2003.codfw.wmnet, kubernetes2059.codfw.wmnet, mw2435.codfw.wmnet, parse2017.codfw.wmnet, kubernetes2050.codfw.wmnet, mw [16:40:47] w.wmnet, mw2427.codfw.wmnet, mw2384.codfw.wmnet, kubernetes2055.codfw.wmnet, mw2419.codfw.wmnet, kubernetes2006.codfw.wmnet, mw2359.codfw.wmnet, kubernetes2007.codfw.wmnet, kubernetes2025.codfw.wmnet, kubernetes2030.codfw.wmnet, kubernetes2053.codfw.wmnet, kubernetes2054.codfw.wmnet, mw2434.codfw.wmnet, mw2353.codfw.wmnet, kubernetes2020.codfw.wmnet, mw2449.codfw.wmnet, mw2397.codfw.wmnet, mw2394.codfw.wmnet, mw2316.codfw.wmnet, mw2401.c [16:40:47] t, mw2440.codfw.wmnet, kubernetes2042.codfw.wmnet, mw2387.codfw.wmnet, mw2382.codfw.wmnet, mw2304.codfw.wmnet, kubernetes2036.codfw.wmnet, mw2296.codfw.wmnet, mw2388.codfw.wmnet, kubern https://wikitech.wikimedia.org/wiki/PyBal [16:40:55] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - blubberoid_4666: Servers kubernetes2046.codfw.wmnet, mw2396.codfw.wmnet, mw2334.codfw.wmnet, mw2420.codfw.wmnet, 
kubernetes2056.codfw.wmnet, parse2006.codfw.wmnet, mw2378.codfw.wmnet, mw2322.codfw.wmnet, mw2321.codfw.wmnet, mw2294.codfw.wmnet, mw2375.codfw.wmnet, mw2447.codfw.wmnet, kubernetes2048.codfw.wmnet, mw2435.codfw.wmnet, mw2315.codfw.wmnet, [16:40:55] 4.codfw.wmnet, kubernetes2050.codfw.wmnet, mw2427.codfw.wmnet, mw2384.codfw.wmnet, kubernetes2055.codfw.wmnet, mw2407.codfw.wmnet, mw2359.codfw.wmnet, kubernetes2007.codfw.wmnet, mw2302.codfw.wmnet, kubernetes2025.codfw.wmnet, kubernetes2030.codfw.wmnet, parse2013.codfw.wmnet, kubernetes2039.codfw.wmnet, kubernetes2054.codfw.wmnet, mw2434.codfw.wmnet, mw2397.codfw.wmnet, mw2316.codfw.wmnet, mw2356.codfw.wmnet, mw2314.codfw.wmnet, kuberne [16:40:55] odfw.wmnet, mw2419.codfw.wmnet, kubernetes2013.codfw.wmnet, mw2399.codfw.wmnet, mw2293.codfw.wmnet, mw2444.codfw.wmnet, mw2267.codfw.wmnet, kubernetes2033.codfw.wmnet, kubernetes2044.co https://wikitech.wikimedia.org/wiki/PyBal [16:41:01] dduvall: ^ [16:41:01] ehh [16:41:03] oh [16:41:17] yikes [16:41:37] ok, redeploying, sorry. can anyone point me at a checklist for this process? [16:42:15] !log dduvall@deploy1002 helmfile [eqiad] START helmfile.d/services/blubberoid: apply [16:42:16] dduvall: https://wikitech.wikimedia.org/wiki/Kubernetes/Remove_a_service [16:42:27] !log dduvall@deploy1002 helmfile [eqiad] DONE helmfile.d/services/blubberoid: apply [16:42:28] ugh. sorry about that. 
my searches failed me [16:42:31] i should have asked [16:42:38] ah sorry [16:42:42] that's not actually the full story [16:42:50] !log dduvall@deploy1002 helmfile [codfw] START helmfile.d/services/blubberoid: apply [16:42:54] https://wikitech.wikimedia.org/wiki/LVS#Remove_a_load_balanced_service [16:43:01] !log dduvall@deploy1002 helmfile [codfw] DONE helmfile.d/services/blubberoid: apply [16:43:07] !log dduvall@deploy1002 helmfile [staging] START helmfile.d/services/blubberoid: apply [16:43:14] !log dduvall@deploy1002 helmfile [staging] DONE helmfile.d/services/blubberoid: apply [16:43:40] hnowlan: k, thank you. again, sorry about that. i only know how to deploy things :D [16:43:54] no worries, it's not a very intuitive process! [16:44:49] (03CR) 10Michael Große: [C:03+1] [GrowthExperiments] Disable personalized praise in eswiki labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1035473 (https://phabricator.wikimedia.org/T359038) (owner: 10Sergio Gimeno) [16:45:27] dduvall: is it worth trying to locate someone to help you then? [16:46:23] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [16:46:24] i'll give the docs a go first and come back with questions if i run into them. thank you! 
[16:46:25] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [16:46:39] *as* i run into them [16:46:47] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [16:46:49] LVS is probably not something you want to go wrong [16:46:55] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [16:49:02] yeah, and i'm stuck at step one of https://wikitech.wikimedia.org/wiki/LVS#Remove_a_load_balanced_service :D so yeah i'll need some help from someone [16:49:47] topranks: you seem to be around [16:49:59] Cc cwhite / arnoldokoth as on callers [16:50:05] yep [16:50:30] topranks: can you help dduvall remove an lvs service [16:50:56] I can possibly help with some of it, it's not something I've done before (a little more traffic team than netops) [16:51:24] it also doesn't have to be now if you'd rather i schedule something or ping the right person on the task [16:52:21] dduvall: have you redeployed blubberoid so it's no longer in a bad state? [16:52:33] topranks: do you know who from traffic might be around? 
[16:52:40] yep, it should be back in eqiad/codfw/staging [16:52:49] Good [16:52:53] Just wanted to make it clear [16:53:39] (03PS1) 10BryanDavis: developer-portal: Bump container to 2024-05-23-122516-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1035527 [16:54:03] i'll create a subtask for the service removal and assign it to someone in traffic [16:54:37] (03CR) 10Reedy: "Maybe we to actually just sort the sub-sections then" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1035417 (owner: 10Reedy) [16:54:38] kwakuofori: ^^ for your awareness [16:54:43] (03PS1) 10BryanDavis: toolhub: Bump container version to 2024-05-23-122249-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1035528 [16:54:59] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on db1172 - https://phabricator.wikimedia.org/T365346#9826726 (10Marostegui) 05Open→03Resolved RAID back to optimal [16:55:17] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1235 (re)pooling @ 75%: post reimage repool', diff saved to https://phabricator.wikimedia.org/P63019 and previous config saved to /var/cache/conftool/dbconfig/20240523-165516-arnaudb.json [16:56:45] (03CR) 10BryanDavis: [C:03+2] developer-portal: Bump container to 2024-05-23-122516-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1035527 (owner: 10BryanDavis) [16:57:16] (03CR) 10BryanDavis: [C:03+2] toolhub: Bump container version to 2024-05-23-122249-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1035528 (owner: 10BryanDavis) [16:57:33] (03Merged) 10jenkins-bot: developer-portal: Bump container to 2024-05-23-122516-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1035527 (owner: 10BryanDavis) [16:58:14] (03Merged) 10jenkins-bot: toolhub: Bump container version to 2024-05-23-122249-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1035528 (owner: 10BryanDavis) [17:00:05] bd808: May I have your attention please! 
Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240523T1700) [17:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240523T1700) [17:01:37] dduvall: so in the absence of anyone else I can maybe point you in the right direction [17:01:48] you can silence an alert in alertmanager [17:01:51] process is here: https://wikitech.wikimedia.org/wiki/Alertmanager#Silences_&_acknowledgements [17:03:21] !log bd808@deploy1002 helmfile [staging] START helmfile.d/services/developer-portal: apply [17:03:40] !log bd808@deploy1002 helmfile [staging] DONE helmfile.d/services/developer-portal: apply [17:03:41] topranks: right on. i'll have a look [17:03:48] !log bd808@deploy1002 helmfile [codfw] START helmfile.d/services/developer-portal: apply [17:03:52] you click the 'bell' icon in the top right corner of alertmanager [17:03:55] to create a new silence [17:04:15] !log bd808@deploy1002 helmfile [codfw] DONE helmfile.d/services/developer-portal: apply [17:04:19] and you select "instance" as label name, then the value is the service value [17:04:30] dduvall: please contact traffic on -traffic for help [17:04:45] !log bd808@deploy1002 helmfile [eqiad] START helmfile.d/services/developer-portal: apply [17:04:58] cwhite: thanks for the heads up :) [17:05:06] !log bd808@deploy1002 helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply [17:06:29] !log bd808@deploy1002 helmfile [staging] START helmfile.d/services/toolhub: apply [17:06:58] kwakuofori: will do [17:07:20] topranks: thanks for your help. i don't see blubberoid in the list of instances, but i will head over to -traffic for more help [17:09:37] It sounds like the original change has been backed out so presumably there's no need to create the silence right now until ready to start ripping out the service (e.g. 
tmrw after having conferred w/ traffic) [17:09:57] hmmmm... the helm deploy for toolhub in staging seems to be stuck or at least very unexpectedly slow [17:10:23] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1235 (re)pooling @ 100%: post reimage repool', diff saved to https://phabricator.wikimedia.org/P63020 and previous config saved to /var/cache/conftool/dbconfig/20240523-171022-arnaudb.json [17:15:26] the toolhub container is broken apparently. `exec: "/usr/local/bin/poetry": stat /usr/local/bin/poetry: no such file or directory: unknown'`. [17:16:38] !log bd808@deploy1002 helmfile [staging] DONE helmfile.d/services/toolhub: apply [17:17:05] (03CR) 10Nikerabbit: "Phabricator/core/qqq.json is no longer valid json:" [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1032094 (https://phabricator.wikimedia.org/T363188) (owner: 10Pppery) [17:21:06] ugh. I see what happened. No Toolhub deploy today. I will submit a revert of the deployment-chart patch and file a bug and see where that takes me. 
[17:22:48] (03PS1) 10BryanDavis: Revert "toolhub: Bump container version to 2024-05-23-122249-production" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1035509 [17:24:38] (03CR) 10BryanDavis: [C:03+2] Revert "toolhub: Bump container version to 2024-05-23-122249-production" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1035509 (owner: 10BryanDavis) [17:25:29] (03Merged) 10jenkins-bot: Revert "toolhub: Bump container version to 2024-05-23-122249-production" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1035509 (owner: 10BryanDavis) [17:31:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2169 (T364299)', diff saved to https://phabricator.wikimedia.org/P63021 and previous config saved to /var/cache/conftool/dbconfig/20240523-173106-marostegui.json [17:31:14] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [17:46:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P63022 and previous config saved to /var/cache/conftool/dbconfig/20240523-174614-marostegui.json [17:53:17] !log jiji@cumin1002 START - Cookbook sre.hosts.reimage for host mc1050.eqiad.wmnet with OS bookworm [17:53:20] !log jiji@cumin2002 START - Cookbook sre.hosts.reimage for host mc2050.codfw.wmnet with OS bookworm [17:54:57] (03PS1) 10Dduvall: Remove blubberoid wmnet and wikimedia.org records [dns] - 10https://gerrit.wikimedia.org/r/1035533 (https://phabricator.wikimedia.org/T365742) [17:56:21] (03CR) 10Dduvall: "Currently following https://wikitech.wikimedia.org/wiki/LVS#Remove_a_load_balanced_service" [dns] - 10https://gerrit.wikimedia.org/r/1035533 (https://phabricator.wikimedia.org/T365742) (owner: 10Dduvall) [17:56:25] (03PS1) 10Bking: elasticsearch: enable CPU performance governor [puppet] - 10https://gerrit.wikimedia.org/r/1035534 (https://phabricator.wikimedia.org/T362922) [18:00:53] PROBLEM - OSPF status on cr4-ulsfo is 
CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:01:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P63023 and previous config saved to /var/cache/conftool/dbconfig/20240523-180122-marostegui.json [18:01:53] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:02:35] (03PS1) 10CDobbins: purged: set use_pki to true for drmrs [puppet] - 10https://gerrit.wikimedia.org/r/1035538 (https://phabricator.wikimedia.org/T360506) [18:04:54] FIRING: [2x] CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1006-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [18:06:07] (03PS1) 10Herron: trafficserver: point pyrra to thanos discovery record [puppet] - 10https://gerrit.wikimedia.org/r/1035541 (https://phabricator.wikimedia.org/T356386) [18:06:22] !log jiji@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mc1050.eqiad.wmnet with reason: host reimage [18:08:37] (03CR) 10Pppery: "Sorry, I have no idea how that happened. Will submit a fix patch soon." 
[phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1032094 (https://phabricator.wikimedia.org/T363188) (owner: 10Pppery) [18:09:47] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc1050.eqiad.wmnet with reason: host reimage [18:09:55] (03CR) 10BCornwall: [C:03+1] Remove blubberoid wmnet and wikimedia.org records [dns] - 10https://gerrit.wikimedia.org/r/1035533 (https://phabricator.wikimedia.org/T365742) (owner: 10Dduvall) [18:11:30] !log jiji@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mc2050.codfw.wmnet with reason: host reimage [18:13:14] (03CR) 10BCornwall: [C:03+2] Remove blubberoid wmnet and wikimedia.org records [dns] - 10https://gerrit.wikimedia.org/r/1035533 (https://phabricator.wikimedia.org/T365742) (owner: 10Dduvall) [18:13:35] 06SRE, 10Wikimedia-Mailing-lists, 07Datacenter-Switchover: Make mailman3 work in the standby host (lists2001.wikimedia.org) - https://phabricator.wikimedia.org/T283615#9827097 (10Dzahn) [18:13:36] 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists: Migrate Mailman/lists to Bullseye/Bookworm - https://phabricator.wikimedia.org/T331706#9827098 (10Dzahn) [18:13:50] 06SRE, 10Wikimedia-Mailing-lists, 07Datacenter-Switchover: Make mailman3 work in the standby host (lists2001.wikimedia.org) - https://phabricator.wikimedia.org/T283615#9827099 (10Dzahn) 05Invalid→03Open [18:14:00] 06SRE, 10Wikimedia-Mailing-lists, 07Datacenter-Switchover: Make mailman3 work in the standby host (lists2001.wikimedia.org) - https://phabricator.wikimedia.org/T283615#9827101 (10Dzahn) 05Open→03In progress [18:14:12] 06SRE, 10Wikimedia-Mailing-lists, 07Datacenter-Switchover: Make mailman3 work in the standby host (lists2001.wikimedia.org) - https://phabricator.wikimedia.org/T283615#9827108 (10Dzahn) [18:14:19] 06SRE, 10Wikimedia-Mailing-lists, 07Datacenter-Switchover: Make mailman3 work in the standby host (lists2001.wikimedia.org) -
https://phabricator.wikimedia.org/T283615#9827110 (10Dzahn) [18:14:34] !log jiji@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc2050.codfw.wmnet with reason: host reimage [18:15:43] (03PS1) 10Dduvall: service: Remove probes for blubberoid [puppet] - 10https://gerrit.wikimedia.org/r/1035543 (https://phabricator.wikimedia.org/T365742) [18:16:31] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2169 (T364299)', diff saved to https://phabricator.wikimedia.org/P63024 and previous config saved to /var/cache/conftool/dbconfig/20240523-181630-marostegui.json [18:16:33] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2180.codfw.wmnet with reason: Maintenance [18:16:35] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2180.codfw.wmnet with reason: Maintenance [18:16:35] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [18:16:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2180 (T364299)', diff saved to https://phabricator.wikimedia.org/P63025 and previous config saved to /var/cache/conftool/dbconfig/20240523-181643-marostegui.json [18:16:57] 06SRE, 10SRE-Access-Requests: Requesting access to analytics for rickijay - https://phabricator.wikimedia.org/T365574#9827127 (10Dzahn) @darthmon_wmde or @jon_amar-WMDE Assuming you are both managers, we'll need approval for this access requests from one of you. Thanks! 
[18:17:31] 06SRE, 10SRE-Access-Requests: Requesting access to analytics for rickijay - https://phabricator.wikimedia.org/T365574#9827130 (10Dzahn) 05Open→03In progress p:05Triage→03High [18:17:34] 06SRE, 10SRE-Access-Requests: Requesting access to analytics for rickijay - https://phabricator.wikimedia.org/T365574#9827132 (10Dzahn) a:03RickiJay-WMDE [18:18:30] 06SRE, 06serviceops-radar, 10Release-Engineering-Team (Radar): scap train failure due to earlier host rename - https://phabricator.wikimedia.org/T365683#9827137 (10Dzahn) [18:19:24] (03CR) 10BCornwall: [C:03+1] service: Remove probes for blubberoid [puppet] - 10https://gerrit.wikimedia.org/r/1035543 (https://phabricator.wikimedia.org/T365742) (owner: 10Dduvall) [18:20:38] (03Abandoned) 10Ssingh: reverse-proxy: use larger subnets for eqiad/codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1020202 (owner: 10Ssingh) [18:21:26] (03CR) 10Ssingh: "recheck" [software/conftool] - 10https://gerrit.wikimedia.org/r/1005694 (owner: 10Ssingh) [18:22:00] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1238.eqiad.wmnet with reason: Maintenance [18:22:02] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1238.eqiad.wmnet with reason: Maintenance [18:23:20] (03CR) 10Ssingh: purged: set use_pki to true for drmrs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1035538 (https://phabricator.wikimedia.org/T360506) (owner: 10CDobbins) [18:25:38] (03CR) 10Ssingh: "I didn't do anything to fix the CI failures but I noticed other related changes so tried again.
It works now so there's that 😊" [software/conftool] - 10https://gerrit.wikimedia.org/r/1005694 (owner: 10Ssingh) [18:26:21] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc1050.eqiad.wmnet with OS bookworm [18:30:22] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Traffic: Q4: install PCIe NVMe SSDs into ulsfo text cp40(3[789]|4[01234] - https://phabricator.wikimedia.org/T364891#9827152 (10BCornwall) [18:32:17] !log jiji@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc2050.codfw.wmnet with OS bookworm [18:34:16] (03PS4) 10Jdlrobson: Always use desktop watchlist HTML on mobile [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1034584 (https://phabricator.wikimedia.org/T109277) [18:34:58] (03PS5) 10Jdlrobson: Always use desktop watchlist HTML on mobile [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1034584 (https://phabricator.wikimedia.org/T109277) [18:36:36] (03PS1) 10Dzahn: admin: convert mareikeheuer to analytics-privatedata with shell [puppet] - 10https://gerrit.wikimedia.org/r/1035545 (https://phabricator.wikimedia.org/T364715) [18:37:25] (03CR) 10CI reject: [V:04-1] admin: convert mareikeheuer to analytics-privatedata with shell [puppet] - 10https://gerrit.wikimedia.org/r/1035545 (https://phabricator.wikimedia.org/T364715) (owner: 10Dzahn) [18:38:59] (03PS2) 10Dzahn: admin: convert mareikeheuer to analytics-privatedata with shell [puppet] - 10https://gerrit.wikimedia.org/r/1035545 (https://phabricator.wikimedia.org/T364715) [18:39:01] (03CR) 10Volans: [C:03+1] "The approach is not bad given the current limitations of the current puppetization of the LVS configuration part.
LGTM with one nit inline" [puppet] - 10https://gerrit.wikimedia.org/r/1035474 (owner: 10Ssingh) [18:40:00] (03PS1) 10Pppery: Fix mangled JSON, redo export [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1035546 [18:40:55] (03PS2) 10Pppery: Fix mangled JSON, redo export [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1035546 (https://phabricator.wikimedia.org/T363188) [18:42:39] (03PS3) 10Ssingh: P:cumin: add support for aliasing LVS host classes [puppet] - 10https://gerrit.wikimedia.org/r/1035474 [18:43:05] (03CR) 10Ssingh: P:cumin: add support for aliasing LVS host classes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1035474 (owner: 10Ssingh) [18:45:32] (03CR) 10Ssingh: "https://puppet-compiler.wmflabs.org/output/1035474/2627/cumin1002.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1035474 (owner: 10Ssingh) [18:46:23] (03PS3) 10Pppery: Fix mangled JSON, redo export [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1035546 (https://phabricator.wikimedia.org/T363188) [18:47:02] (03PS4) 10Pppery: Fix mangled JSON, redo export [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1035546 (https://phabricator.wikimedia.org/T363188) [18:47:30] (03PS1) 10JHathaway: Revert "phabricator: Move outbound email to mx-out{1001,2001}.wikimedia.org" [puppet] - 10https://gerrit.wikimedia.org/r/1035510 [18:48:19] !log T365626 helmfile destroy'd all opentelemetry-collector releases [18:48:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:48:23] T365626: move k8s opentelemetry-collector from services to admin_ng - https://phabricator.wikimedia.org/T365626 [18:48:23] (03CR) 10JHathaway: [C:03+2] Revert "phabricator: Move outbound email to mx-out{1001,2001}.wikimedia.org" [puppet] - 10https://gerrit.wikimedia.org/r/1035510 (owner: 10JHathaway) [18:48:28] (03CR) 10CDanis: [C:03+2] Move opentelemetry-collector to admin_ng 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1034978 (https://phabricator.wikimedia.org/T365626) (owner: 10CDanis) [18:51:32] (03Merged) 10jenkins-bot: Move opentelemetry-collector to admin_ng [deployment-charts] - 10https://gerrit.wikimedia.org/r/1034978 (https://phabricator.wikimedia.org/T365626) (owner: 10CDanis) [18:51:38] (03CR) 10Pppery: "OK, I see the problem. I ran the export script with an uncommitted local hack to avoid dirty diffs due to T349989, and that hack broke thi" [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1032094 (https://phabricator.wikimedia.org/T363188) (owner: 10Pppery) [18:51:58] (03CR) 10Volans: [C:03+1] "LGTM" [software/conftool] - 10https://gerrit.wikimedia.org/r/1005694 (owner: 10Ssingh) [18:53:25] 10SRE-swift-storage: Storage request for datasets published by research team - https://phabricator.wikimedia.org/T294380#9827233 (10fkaelin) 05Open→03Resolved a:03fkaelin Closing this task as resolved as the storage request was handled. [18:53:26] (03CR) 10Volans: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1035474 (owner: 10Ssingh) [18:54:56] !log cdanis@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'. [18:55:08] !log cdanis@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'. [18:55:32] !log cdanis@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'. [18:55:42] !log cdanis@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [18:57:28] (03CR) 10Ssingh: "Thanks for the review!"
[software/conftool] - 10https://gerrit.wikimedia.org/r/1005694 (owner: 10Ssingh) [18:57:36] (03CR) 10Ssingh: [C:03+2] Revert "tests: add schema for dnsbox" [software/conftool] - 10https://gerrit.wikimedia.org/r/1005694 (owner: 10Ssingh) [18:57:56] (03CR) 10Krinkle: [C:03+1] Create eventlogging-processor legacy converter to proxy to eventgate for mediawiki.org (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/985023 (https://phabricator.wikimedia.org/T353817) (owner: 10Ottomata) [18:58:05] (03CR) 10Ssingh: [C:03+2] P:cumin: add support for aliasing LVS host classes [puppet] - 10https://gerrit.wikimedia.org/r/1035474 (owner: 10Ssingh) [19:01:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2180 (T364299)', diff saved to https://phabricator.wikimedia.org/P63026 and previous config saved to /var/cache/conftool/dbconfig/20240523-190136-marostegui.json [19:01:42] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [19:12:32] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q4:rack/setup/install kafka-main100[6789] and kafka-main1010 - https://phabricator.wikimedia.org/T363212#9827314 (10VRiley-WMF) After a very rigorous amount of troubleshooting, Dell will be sending out a replacement motherboard for kafka-main1009. [19:16:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P63027 and previous config saved to /var/cache/conftool/dbconfig/20240523-191644-marostegui.json [19:24:11] (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1035541 (https://phabricator.wikimedia.org/T356386) (owner: 10Herron) [19:31:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P63028 and previous config saved to /var/cache/conftool/dbconfig/20240523-193152-marostegui.json [19:33:22] FIRING: SessionStoreOnNonDedicatedHost: Sessionstore k8s pods are running on non-dedicated hosts - TODO - TODO - https://alerts.wikimedia.org/?q=alertname%3DSessionStoreOnNonDedicatedHost [19:38:00] !log cdanis@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'. [19:38:13] !log cdanis@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [19:47:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2180 (T364299)', diff saved to https://phabricator.wikimedia.org/P63029 and previous config saved to /var/cache/conftool/dbconfig/20240523-194659-marostegui.json [19:47:02] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2193.codfw.wmnet with reason: Maintenance [19:47:05] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [19:47:15] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2193.codfw.wmnet with reason: Maintenance [19:47:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2193 (T364299)', diff saved to https://phabricator.wikimedia.org/P63030 and previous config saved to /var/cache/conftool/dbconfig/20240523-194723-marostegui.json [19:47:37] (03CR) 10Alexandros Kosiaris: [C:03+1] datasets-config: Add volume for configmap [deployment-charts] - 10https://gerrit.wikimedia.org/r/1034581 (https://phabricator.wikimedia.org/T357434) (owner: 10TChin) [19:49:20] (03CR) 10Pppery: [C:04-1] "It looks like the duplicates are deleted in the wrong order here - this deletes the last instance of each whereas the first instance is th" [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1035389 (https://phabricator.wikimedia.org/T365679) (owner: 10Reedy) [19:57:15] (03CR) 10Krinkle: [C:04-1] Migrate `wmfstatic` metrics to Prometheus store (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1029664 (https://phabricator.wikimedia.org/T359255) (owner: 10Andrea Denisse) [19:57:38] (03PS1) 10BryanDavis: toolhub: Bump container version to 2024-05-23-193216-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1035555 (https://phabricator.wikimedia.org/T365654) [20:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240523T2000). [20:00:05] JSherman: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:12] (03PS3) 10Jsn.sherman: CommonSettings: Load AutoModerator extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1034131 (https://phabricator.wikimedia.org/T361643) [20:00:12] (03PS6) 10Jsn.sherman: InitialiseSettings: testwiki enable AutoModerator [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1034132 (https://phabricator.wikimedia.org/T361643) [20:00:15] i'm here [20:00:20] JSherman: wanna lead the deployment? [20:00:21] me too [20:00:31] yep, happy to do so [20:00:31] * TheresNoTime is also here :D [20:00:56] hi TheresNoTime :) [20:00:58] 06SRE, 10Wikimedia-Mailing-lists, 07Datacenter-Switchover: Make mailman3 work in the standby host (lists2001.wikimedia.org) - https://phabricator.wikimedia.org/T283615#9827465 (10eoghan) a:03eoghan [20:01:01] o/ [20:01:10] JSherman: ping me if i'm needed then :) [20:01:20] urbanecm: will do! 
[20:01:52] (03CR) 10BryanDavis: [C:03+2] toolhub: Bump container version to 2024-05-23-193216-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1035555 (https://phabricator.wikimedia.org/T365654) (owner: 10BryanDavis) [20:02:37] (03CR) 10Ottomata: Create eventlogging-processor legacy converter to proxy to eventgate for mediawiki.org (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/985023 (https://phabricator.wikimedia.org/T353817) (owner: 10Ottomata) [20:02:43] (03Merged) 10jenkins-bot: toolhub: Bump container version to 2024-05-23-193216-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1035555 (https://phabricator.wikimedia.org/T365654) (owner: 10BryanDavis) [20:03:19] okay, shelled in and have my browser tabs open [20:04:14] sounds good [20:04:21] It seems like I could just run my changes in one go with scap, no? [20:04:29] yea [20:04:34] !log bd808@deploy1002 helmfile [staging] START helmfile.d/services/toolhub: apply [20:04:54] !log bd808@deploy1002 helmfile [staging] DONE helmfile.d/services/toolhub: apply [20:05:10] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host kafka-main1010.eqiad.wmnet with OS bullseye [20:05:16] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q4:rack/setup/install kafka-main100[6789] and kafka-main1010 - https://phabricator.wikimedia.org/T363212#9827478 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host kafka-main1010.eqiad.wmnet with OS bullseye [20:05:30] !log bd808@deploy1002 helmfile [codfw] START helmfile.d/services/toolhub: apply [20:05:32] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jsn@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1034131 (https://phabricator.wikimedia.org/T361643) (owner: 10Jsn.sherman) [20:05:33] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jsn@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1034132 
(https://phabricator.wikimedia.org/T361643) (owner: 10Jsn.sherman) [20:06:22] !log bd808@deploy1002 helmfile [codfw] DONE helmfile.d/services/toolhub: apply [20:06:27] (03Merged) 10jenkins-bot: CommonSettings: Load AutoModerator extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1034131 (https://phabricator.wikimedia.org/T361643) (owner: 10Jsn.sherman) [20:06:29] (03Merged) 10jenkins-bot: InitialiseSettings: testwiki enable AutoModerator [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1034132 (https://phabricator.wikimedia.org/T361643) (owner: 10Jsn.sherman) [20:06:48] !log jsn@deploy1002 Started scap: Backport for [[gerrit:1034131|CommonSettings: Load AutoModerator extension (T361643)]], [[gerrit:1034132|InitialiseSettings: testwiki enable AutoModerator (T361643)]] [20:06:53] T361643: Deploy the AutoModerator extension to production (testwiki, idwiki) - https://phabricator.wikimedia.org/T361643 [20:07:06] !log bd808@deploy1002 helmfile [eqiad] START helmfile.d/services/toolhub: apply [20:08:00] !log bd808@deploy1002 helmfile [eqiad] DONE helmfile.d/services/toolhub: apply [20:09:35] !log jsn@deploy1002 jsn: Backport for [[gerrit:1034131|CommonSettings: Load AutoModerator extension (T361643)]], [[gerrit:1034132|InitialiseSettings: testwiki enable AutoModerator (T361643)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:10:36] (03PS1) 10CDanis: Rename admin_ng otelcol to include 'main' prefix [deployment-charts] - 10https://gerrit.wikimedia.org/r/1035559 (https://phabricator.wikimedia.org/T365626) [20:11:55] proceeding with sync [20:12:03] !log jsn@deploy1002 jsn: Continuing with sync [20:12:28] special:Version sounds positive [20:13:02] (03CR) 10Andrea Denisse: Migrate `wmfstatic` metrics to Prometheus store (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1029664 (https://phabricator.wikimedia.org/T359255) (owner: 10Andrea Denisse) [20:16:40] (03PS2) 10CDanis: Rename admin_ng otelcol to include 
'main' prefix [deployment-charts] - 10https://gerrit.wikimedia.org/r/1035559 (https://phabricator.wikimedia.org/T365626) [20:18:01] (03PS12) 10Andrea Denisse: Migrate `wmfstatic` metrics to Prometheus store [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1029664 (https://phabricator.wikimedia.org/T359255) [20:19:18] (03CR) 10Andrea Denisse: Migrate `wmfstatic` metrics to Prometheus store (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1029664 (https://phabricator.wikimedia.org/T359255) (owner: 10Andrea Denisse) [20:20:59] !log cdanis@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'. [20:21:06] !log cdanis@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [20:23:53] !log cdanis@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'. [20:24:07] !log cdanis@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'. [20:24:19] !log jsn@deploy1002 Finished scap: Backport for [[gerrit:1034131|CommonSettings: Load AutoModerator extension (T361643)]], [[gerrit:1034132|InitialiseSettings: testwiki enable AutoModerator (T361643)]] (duration: 17m 30s) [20:24:23] T361643: Deploy the AutoModerator extension to production (testwiki, idwiki) - https://phabricator.wikimedia.org/T361643 [20:24:55] okay, double checking without using the debug extension. [20:25:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2193 (T364299)', diff saved to https://phabricator.wikimedia.org/P63031 and previous config saved to /var/cache/conftool/dbconfig/20240523-202520-marostegui.json [20:25:22] looks good. [20:25:24] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [20:26:01] looks like autmoderator is currently inactive, which is how it should be. No funny business that I can see in logstash [20:26:01] (03CR) 10CDanis: [C:03+2] "Discussed with swfrench and rzl and we agreed that this seemed like the least-bad thing to do for now." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1035559 (https://phabricator.wikimedia.org/T365626) (owner: 10CDanis) [20:28:34] nice work! :D [20:29:10] (03Merged) 10jenkins-bot: Rename admin_ng otelcol to include 'main' prefix [deployment-charts] - 10https://gerrit.wikimedia.org/r/1035559 (https://phabricator.wikimedia.org/T365626) (owner: 10CDanis) [20:29:19] thank you! [20:29:27] (03CR) 10Scott French: [C:03+1] Remove profile::zookeeper::firewall::srange [puppet] - 10https://gerrit.wikimedia.org/r/1035334 (owner: 10Muehlenhoff) [20:29:59] (03CR) 10Cwhite: [C:03+1] "Thank you!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1029664 (https://phabricator.wikimedia.org/T359255) (owner: 10Andrea Denisse) [20:32:39] sorry i lost track of time TheresNoTime [20:33:11] and also apparently added to wrong backport window JSherman [20:33:38] JSherman: would you be able to help me backport 1034584 (deploy commands) Always use desktop watchlist HTML on mobile [20:34:12] Jdlrobson: yep, I finished up my backport and I still have everything open [20:34:37] the awesome [20:34:42] i just added it to right place in calendar [20:34:57] (03PS6) 10Jdlrobson: Always use desktop watchlist HTML on mobile [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1034584 (https://phabricator.wikimedia.org/T109277) [20:35:00] https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1034584 [20:35:11] last changelist mobiel special page! [20:36:27] !log cdanis@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'. [20:36:35] !log cdanis@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. 
[20:36:47] FIRING: SystemdUnitFailed: logrotate.service on moss-be1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:37:42] looks straightforward enough [20:38:28] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jsn@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1034584 (https://phabricator.wikimedia.org/T109277) (owner: 10Jdlrobson) [20:38:36] (03CR) 10Krinkle: [C:03+1] "LGTM. Deploy anytime!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1029664 (https://phabricator.wikimedia.org/T359255) (owner: 10Andrea Denisse) [20:39:14] (03Merged) 10jenkins-bot: Always use desktop watchlist HTML on mobile [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1034584 (https://phabricator.wikimedia.org/T109277) (owner: 10Jdlrobson) [20:39:32] !log jsn@deploy1002 Started scap: Backport for [[gerrit:1034584|Always use desktop watchlist HTML on mobile (T109277)]] [20:39:38] T109277: [EPIC]: Use core watchlist code for mobile experience - https://phabricator.wikimedia.org/T109277 [20:40:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P63032 and previous config saved to /var/cache/conftool/dbconfig/20240523-204028-marostegui.json [20:42:03] !log jsn@deploy1002 jdlrobson and jsn: Backport for [[gerrit:1034584|Always use desktop watchlist HTML on mobile (T109277)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:42:43] Jdlrobson: please test [20:43:40] on it [20:43:57] JSherman: LGTM please sync! [20:44:03] !log jsn@deploy1002 jdlrobson and jsn: Continuing with sync [20:50:21] thx JSherman im so happy to see the back of this code. Over 1000 lines of code can now be removed! 
https://gerrit.wikimedia.org/r/c/mediawiki/extensions/MobileFrontend/+/1034590 [20:50:59] Jdlrobson: 🎉🗑️🎉 [20:51:14] this has been a long time coming! [20:53:27] everything is synced up we're just waiting on php restarts [20:55:37] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P63033 and previous config saved to /var/cache/conftool/dbconfig/20240523-205536-marostegui.json [20:55:56] !log jsn@deploy1002 Finished scap: Backport for [[gerrit:1034584|Always use desktop watchlist HTML on mobile (T109277)]] (duration: 16m 23s) [20:56:00] T109277: [EPIC]: Use core watchlist code for mobile experience - https://phabricator.wikimedia.org/T109277 [20:56:21] Jdlrobson: you should be good to go! [20:56:45] FIRING: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_search_eqiad in eqiad (k8s): ... [20:56:45] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [20:59:29] I'm not seeing errors that seem related at all to today's backports. I think we're okay here. [21:02:09] sounds perfect! [21:03:16] urbanecm: and TheresNoTime: thanks for being around just in case things went sideways! [21:03:23] any time :) [21:03:57] jouncebot: nowandnext [21:03:57] No deployments scheduled for the next 8 hour(s) and 56 minute(s) [21:03:57] In 8 hour(s) and 56 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240524T0600) [21:04:48] JSherman: thanks <3 [21:05:47] Jdlrobson: happy to oblige :-) [21:06:45] RESOLVED: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_search_eqiad in eqiad (k8s): ...
[21:06:45] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [21:07:11] JSherman: any chance i could get a +2 on https://gerrit.wikimedia.org/r/c/mediawiki/extensions/MobileFrontend/+/1035569 too? I like to keep local values synced with production. [21:09:09] Jdlrobson: i'm not Jason, but for posterity, could that crosslink the production patch (both to clarify what changed why and to clarify which production we're referring to)? [21:10:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2193 (T364299)', diff saved to https://phabricator.wikimedia.org/P63034 and previous config saved to /var/cache/conftool/dbconfig/20240523-211044-marostegui.json [21:10:47] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2197.codfw.wmnet with reason: Maintenance [21:10:49] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [21:11:00] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2197.codfw.wmnet with reason: Maintenance [21:14:43] !log cdanis@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'. [21:14:51] !log cdanis@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [21:15:52] urbanecm: done! [21:16:56] 10ops-eqsin, 06DC-Ops, 06Traffic: Q#:rack/setup/install X - https://phabricator.wikimedia.org/T365763 (10RobH) 03NEW [21:17:06] Jdlrobson: +2ed [21:17:26] 10ops-eqsin, 06DC-Ops, 06Traffic: Q4: install PCIe NVMe SSDs into eqsin text cp50(1[789]|2[01234] - https://phabricator.wikimedia.org/T365763#9827683 (10RobH) [21:18:46] !log cdanis@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'. 
[21:20:14] 06SRE, 10SRE-tools, 07SRE-Unowned, 06Infrastructure-Foundations: Provide an utility script to replace a failed device in raid 0 array - https://phabricator.wikimedia.org/T350492#9827705 (10Dzahn) [21:20:36] 06SRE, 10SRE-tools, 07SRE-Unowned, 06Infrastructure-Foundations: Provide an utility script to replace a failed device in raid 0 array - https://phabricator.wikimedia.org/T350492#9827708 (10Dzahn) Is this SRE-tools? or datacenter-ops? or really unowned? [21:23:12] (03PS1) 10BryanDavis: wikitech: (Un)block GitLab accounts when (un)blocked on wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1035576 (https://phabricator.wikimedia.org/T316418) [21:23:51] (03CR) 10CI reject: [V:04-1] wikitech: (Un)block GitLab accounts when (un)blocked on wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1035576 (https://phabricator.wikimedia.org/T316418) (owner: 10BryanDavis) [21:25:20] 10ops-eqsin, 06DC-Ops, 06Traffic: Q4: install PCIe NVMe SSDs into eqsin text cp50(1[789]|2[01234] - https://phabricator.wikimedia.org/T365763#9827741 (10RobH) [21:25:32] 10ops-eqsin, 06DC-Ops, 06Traffic: Q4: install PCIe NVMe SSDs into eqsin text cp50(1[789]|2[01234] - https://phabricator.wikimedia.org/T365763#9827744 (10RobH) [21:25:36] (03PS2) 10BryanDavis: wikitech: (Un)block GitLab accounts when (un)blocked on wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1035576 (https://phabricator.wikimedia.org/T316418) [21:26:47] !log cdanis@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. 
[21:27:58] (03PS8) 10Dzahn: peopleweb: (WIP) warn about large user home dirs [puppet] - 10https://gerrit.wikimedia.org/r/989577 [21:27:58] (03CR) 10Dzahn: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/989577 (owner: 10Dzahn) [21:28:16] (03CR) 10BryanDavis: [C:03+1] "This has been manually applied on the wikitech hosts (cloudweb1003 & cloudweb1004) and tested by blocking and unblocking [[wikitech:User:G" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1035576 (https://phabricator.wikimedia.org/T316418) (owner: 10BryanDavis) [21:28:39] jouncebot: now and next [21:28:39] No deployments scheduled for the next 8 hour(s) and 31 minute(s) [21:28:51] hello thcipriani :) [21:29:24] ohai bd808 [21:30:17] let's sling this gitlab blocking/unblocking patch out [21:30:28] just getting all my windows in order [21:31:36] I should re-up my deployment knowledge soon, but I am grateful for folks who let me just toss code at gerrit and then watch as they do the scary bits [21:32:34] (03PS9) 10Dzahn: peopleweb: (WIP) warn about large user home dirs [puppet] - 10https://gerrit.wikimedia.org/r/989577 [21:32:44] I was about to say: not too many scary bits these days, but I didn't want to jinx myself [21:32:56] (03CR) 10CI reject: [V:04-1] peopleweb: (WIP) warn about large user home dirs [puppet] - 10https://gerrit.wikimedia.org/r/989577 (owner: 10Dzahn) [21:33:52] (03CR) 10TrainBranchBot: [C:03+2] "Approved by thcipriani@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1035576 (https://phabricator.wikimedia.org/T316418) (owner: 10BryanDavis) [21:34:16] I had a helm deploy go pear shaped earlier today and was pretty happy to see how it rolled back automagically. I really do need to get familiar with the magic of `scap backport` and friends these days. 
[21:34:32] (03Merged) 10jenkins-bot: wikitech: (Un)block GitLab accounts when (un)blocked on wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1035576 (https://phabricator.wikimedia.org/T316418) (owner: 10BryanDavis) [21:34:47] !log thcipriani@deploy1002 Started scap: Backport for [[gerrit:1035576|wikitech: (Un)block GitLab accounts when (un)blocked on wikitech (T316418)]] [21:35:11] entire command I typed to deploy this: scap backport 1035576 [21:35:23] (after opening a bunch of stuff to monitor everything :)) [21:35:50] thcipriani: the only tricky part happens when something goes horribly wrong :) [21:35:59] but I dream of a day when the monitoring part is somewhat ambient [21:36:02] (03PS10) 10Dzahn: peopleweb: (WIP) warn about large user home dirs [puppet] - 10https://gerrit.wikimedia.org/r/989577 [21:36:24] (03CR) 10CI reject: [V:04-1] peopleweb: (WIP) warn about large user home dirs [puppet] - 10https://gerrit.wikimedia.org/r/989577 (owner: 10Dzahn) [21:36:42] urbanecm: nothing ever goes horribly wrong (don't scare away bd808 :)) [21:36:56] lol. I know we are good at happy path automation. The human in the middle is mostly about recovery when things get weird [21:37:18] !log thcipriani@deploy1002 thcipriani and bd808: Backport for [[gerrit:1035576|wikitech: (Un)block GitLab accounts when (un)blocked on wikitech (T316418)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:37:59] now it's prompting me to continue since it's on testservers, but since this is a wikitech change we can't test it there. bd808 anywhere you want to "scap pull" before I tell it to go ahead? [21:38:58] thcipriani: I can manually pull to the 2 wikitech boxen I suppose. Do you think that will test anything important?
[21:39:20] 10ops-eqsin, 06SRE, 06DC-Ops, 06Traffic: Q4: install PCIe NVMe SSDs into eqsin text cp50(1[789]|2[01234] - https://phabricator.wikimedia.org/T365763#9827789 (10RobH) [21:39:55] The change is on both of them already via manual patching I did before uploading to gerrit [21:40:01] bd808: sounds like you tested the important bits on other machines right prior to this, so that'll probably only test the deploy [21:40:12] so I'll go ahead and y [21:40:17] coolio [21:40:24] !log thcipriani@deploy1002 thcipriani and bd808: Continuing with sync [21:42:06] 10ops-eqsin, 06SRE, 06DC-Ops, 06Traffic: Q4: install PCIe NVMe SSDs into eqsin text cp50(1[789]|2[01234] - https://phabricator.wikimedia.org/T365763#9827795 (10RobH) [21:42:45] I was really impressed with how scap warned me away from doing something destructive last week and the docs were up to date enough for me to figure out the right thing to do instead [21:44:28] JSherman: scap has gotten better at that kinda thing over many years of breaking things ;) [21:45:54] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2214.codfw.wmnet with reason: Maintenance [21:46:07] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2214.codfw.wmnet with reason: Maintenance [21:46:16] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2214 (T364299)', diff saved to https://phabricator.wikimedia.org/P63035 and previous config saved to /var/cache/conftool/dbconfig/20240523-214614-marostegui.json [21:46:20] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [21:50:05] only 91 physical hosts left [21:50:37] urbanecm: I'm sorry I +2ed a patch out from under you earlier; I haven't figured out a good multitasking cadence with irc yet [21:51:20] JSherman: not sure i follow, but if it's about Jon's patch, no worries at all. 
i was about to +2 it myself when i saw you did :) [21:52:48] !log thcipriani@deploy1002 Finished scap: Backport for [[gerrit:1035576|wikitech: (Un)block GitLab accounts when (un)blocked on wikitech (T316418)]] (duration: 18m 01s) [21:52:54] ^ bd808 all done [21:53:14] let's test once more just to be confident in the fix [21:54:23] urbanecm: Yeah, that's what I was talking about; I didn't catch the conversation y'all were having until after I +2ed. [21:54:46] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Traffic: Q4: install PCIe NVMe SSDs into ulsfo text cp40(3[789]|4[01234] - https://phabricator.wikimedia.org/T364891#9827823 (10RobH) SSDs confirmed onsite by shipping, so I can go onsite whenever we schedule to take and install the SSD upgrades. [21:54:49] thcipriani: worked like a champ! Thanks for your help along the way for this one [21:54:54] \o/ [21:55:34] bd808: thanks for that addition, having blocks cascade to gitlab is a great feature to have working <3 [22:03:27] (03CR) 10Cwhite: [C:03+1] "Patch LGTM" [alerts] - 10https://gerrit.wikimedia.org/r/1035370 (https://phabricator.wikimedia.org/T354656) (owner: 10Jelto) [22:04:54] FIRING: [2x] CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1006-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [22:06:45] FIRING: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_search_eqiad in eqiad (k8s): ...
[22:06:45] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [22:11:45] RESOLVED: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_search_eqiad in eqiad (k8s): ... [22:11:45] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [22:21:02] (03PS2) 10Zabe: Stop writing to af_user(_text)/afh_user(_text) in group0 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1034919 (https://phabricator.wikimedia.org/T337920) [22:23:19] jouncebot: nowandnext [22:23:19] No deployments scheduled for the next 7 hour(s) and 36 minute(s) [22:23:20] In 7 hour(s) and 36 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240524T0600) [22:23:36] (03CR) 10Zabe: [C:03+2] Stop writing to af_user(_text)/afh_user(_text) in group0 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1034919 (https://phabricator.wikimedia.org/T337920) (owner: 10Zabe) [22:24:10] (03Merged) 10jenkins-bot: Stop writing to af_user(_text)/afh_user(_text) in group0 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1034919 (https://phabricator.wikimedia.org/T337920) (owner: 10Zabe) [22:24:36] !log zabe@deploy1002 Started scap: Backport for [[gerrit:1034919|Stop writing to af_user(_text)/afh_user(_text) in group0 wikis (T337920)]] [22:24:40] T337920: Stop writing to af_user(_text)/afh_user(_text) - 
https://phabricator.wikimedia.org/T337920 [22:27:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2214 (T364299)', diff saved to https://phabricator.wikimedia.org/P63036 and previous config saved to /var/cache/conftool/dbconfig/20240523-222714-marostegui.json [22:27:15] !log zabe@deploy1002 zabe: Backport for [[gerrit:1034919|Stop writing to af_user(_text)/afh_user(_text) in group0 wikis (T337920)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [22:27:17] (03PS2) 10Dduvall: service: Remove probes for blubberoid [puppet] - 10https://gerrit.wikimedia.org/r/1035543 (https://phabricator.wikimedia.org/T365742) [22:27:17] (03PS1) 10Dduvall: service: Remove blubberoid from backend servers and load balancers [puppet] - 10https://gerrit.wikimedia.org/r/1035589 (https://phabricator.wikimedia.org/T365742) [22:27:20] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [22:30:18] !log zabe@deploy1002 zabe: Continuing with sync [22:37:29] (03PS11) 10Dzahn: peopleweb: (WIP) warn about large user home dirs [puppet] - 10https://gerrit.wikimedia.org/r/989577 [22:42:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2214', diff saved to https://phabricator.wikimedia.org/P63037 and previous config saved to /var/cache/conftool/dbconfig/20240523-224222-marostegui.json [22:43:15] !log zabe@deploy1002 Finished scap: Backport for [[gerrit:1034919|Stop writing to af_user(_text)/afh_user(_text) in group0 wikis (T337920)]] (duration: 18m 39s) [22:43:19] T337920: Stop writing to af_user(_text)/afh_user(_text) - https://phabricator.wikimedia.org/T337920 [22:45:00] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db1238 (re)pooling @ 10%: Maint over', diff saved to https://phabricator.wikimedia.org/P63038 and previous config saved to /var/cache/conftool/dbconfig/20240523-224459-ladsgroup.json [22:57:31] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance 
db2214', diff saved to https://phabricator.wikimedia.org/P63039 and previous config saved to /var/cache/conftool/dbconfig/20240523-225730-marostegui.json [23:00:06] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db1238 (re)pooling @ 25%: Maint over', diff saved to https://phabricator.wikimedia.org/P63040 and previous config saved to /var/cache/conftool/dbconfig/20240523-230005-ladsgroup.json [23:00:39] (03PS4) 10Jdlrobson: Drop unused config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031459 (https://phabricator.wikimedia.org/T301212) [23:02:05] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 122, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:02:23] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 213, down: 0, dormant: 0, excluded: 1, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:05:45] FIRING: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s): ... 
[23:05:45] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [23:06:52] (03CR) 10Zabe: [C:03+2] "In order to test this on beta" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1032586 (https://phabricator.wikimedia.org/T112359) (owner: 10Zabe) [23:06:59] (03CR) 10CI reject: [V:04-1] Deploy configuration for wrapping B type passwords with encrypted Argon2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1032586 (https://phabricator.wikimedia.org/T112359) (owner: 10Zabe) [23:07:09] (03PS4) 10Zabe: Deploy configuration for wrapping B type passwords with encrypted Argon2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1032586 (https://phabricator.wikimedia.org/T112359) [23:07:17] (03CR) 10Zabe: [C:03+2] Deploy configuration for wrapping B type passwords with encrypted Argon2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1032586 (https://phabricator.wikimedia.org/T112359) (owner: 10Zabe) [23:07:58] (03Merged) 10jenkins-bot: Deploy configuration for wrapping B type passwords with encrypted Argon2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1032586 (https://phabricator.wikimedia.org/T112359) (owner: 10Zabe) [23:08:18] !log zabe@deploy1002 Started scap: Backport for [[gerrit:1032586|Deploy configuration for wrapping B type passwords with encrypted Argon2 (T112359)]] [23:10:45] RESOLVED: [2x] CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s): fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [23:10:51] !log zabe@deploy1002 zabe: Backport for [[gerrit:1032586|Deploy 
configuration for wrapping B type passwords with encrypted Argon2 (T112359)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [23:11:15] !log zabe@deploy1002 zabe: Continuing with sync [23:12:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2214 (T364299)', diff saved to https://phabricator.wikimedia.org/P63041 and previous config saved to /var/cache/conftool/dbconfig/20240523-231238-marostegui.json [23:12:41] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2217.codfw.wmnet with reason: Maintenance [23:12:44] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [23:12:54] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2217.codfw.wmnet with reason: Maintenance [23:13:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2217 (T364299)', diff saved to https://phabricator.wikimedia.org/P63042 and previous config saved to /var/cache/conftool/dbconfig/20240523-231302-marostegui.json [23:15:12] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db1238 (re)pooling @ 75%: Maint over', diff saved to https://phabricator.wikimedia.org/P63043 and previous config saved to /var/cache/conftool/dbconfig/20240523-231511-ladsgroup.json [23:22:45] FIRING: [2x] CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s): fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [23:24:18] !log zabe@deploy1002 Finished scap: Backport for [[gerrit:1032586|Deploy configuration for wrapping B type passwords with encrypted Argon2 (T112359)]] (duration: 16m 00s) [23:27:45] RESOLVED: [2x] CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s): fetch error (rerenders) rate too high - 
https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [23:30:18] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db1238 (re)pooling @ 100%: Maint over', diff saved to https://phabricator.wikimedia.org/P63044 and previous config saved to /var/cache/conftool/dbconfig/20240523-233017-ladsgroup.json [23:33:37] FIRING: SessionStoreOnNonDedicatedHost: Sessionstore k8s pods are running on non-dedicated hosts - TODO - TODO - https://alerts.wikimedia.org/?q=alertname%3DSessionStoreOnNonDedicatedHost [23:38:24] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1034937 [23:38:24] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1034937 (owner: 10TrainBranchBot) [23:52:19] (03PS1) 10Scott French: Release 3.0.0 [software/conftool] - 10https://gerrit.wikimedia.org/r/1035596 (https://phabricator.wikimedia.org/T365123) [23:58:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2217 (T364299)', diff saved to https://phabricator.wikimedia.org/P63045 and previous config saved to /var/cache/conftool/dbconfig/20240523-235817-marostegui.json [23:58:22] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [23:59:31] (03CR) 10Scott French: "If you think I'm being too conservative with the major version bump, I'm happy to reconsider and go with 2.4.0 (which was my original plan" [software/conftool] - 10https://gerrit.wikimedia.org/r/1035596 (https://phabricator.wikimedia.org/T365123) (owner: 10Scott French) [23:59:43] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1034937 (owner: 10TrainBranchBot)