[00:08:52] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1132206
[00:08:52] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1132206 (owner: TrainBranchBot)
[00:09:41] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10692389 (phaultfinder)
[00:29:08] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1132206 (owner: TrainBranchBot)
[00:29:38] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10692390 (phaultfinder)
[00:54:37] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10692393 (phaultfinder)
[01:07:02] FIRING: [2x] DatasourceNoData: - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData
[01:27:02] RESOLVED: [2x] DatasourceNoData: - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData
[01:41:25] FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:44:37] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10692404 (phaultfinder)
[02:07:26] FIRING: [10x] ProbeDown: Service ripe-atlas-codfw:0 has failed probes (icmp_ripe_atlas_codfw_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:19:43] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10692410 (phaultfinder)
[02:59:41] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10692443 (phaultfinder)
[03:02:06] FIRING: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures
[03:02:26] FIRING: [12x] ProbeDown: Service ripe-atlas-codfw:0 has failed probes (icmp_ripe_atlas_codfw_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:03:53] FIRING: [2x] SessionStoreErrorRateHigh: Session storage error rates (5xx) in codfw are elevated #page - TODO - https://alerts.wikimedia.org/?q=alertname%3DSessionStoreErrorRateHigh
[03:04:37] got a user report from someone that they got logged out with a session hijacking error, and now can't log back in
[03:07:23] multiple reports now
[03:09:42] AntiComposite: thanks for the heads-up! investigating
[03:09:47] !incidents
[03:09:48] 5917 (UNACKED) [2x] SessionStoreErrorRateHigh data-persistence ()
[03:09:48] 5916 (RESOLVED) HaproxyUnavailable cache_upload global sre (thanos-rule)
[03:09:48] 5915 (RESOLVED) VarnishUnavailable global sre (varnish-upload thanos-rule)
[03:09:48] 5914 (RESOLVED) ProbeDown sre (10.2.2.27 ip4 swift-https:443 probes/service http_swift-https_ip4 eqiad)
[03:09:48] 5913 (RESOLVED) HaproxyUnavailable cache_upload global sre (thanos-rule)
[03:09:49] 5912 (RESOLVED) VarnishUnavailable global sre (varnish-upload thanos-rule)
[03:09:58] !ack 5917
[03:09:58] 5917 (ACKED) [2x] SessionStoreErrorRateHigh data-persistence ()
[03:11:33] https://phabricator.wikimedia.org/T390512 was just created
[03:14:59] SRE, MediaWiki-User-login-and-signup, Wikimedia-Incident: I can't edit "to prevent session hijacking" and log out - https://phabricator.wikimedia.org/T390512#10692466 (AntiCompositeNumber)
[03:16:33] FIRING: KubernetesCalicoDown: wikikube-worker1039.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s&var-instance=wikikube-worker1039.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[03:16:43] SRE, MediaWiki-User-login-and-signup, Wikimedia-Incident: I can't edit "to prevent session hijacking" and log out - https://phabricator.wikimedia.org/T390512#10692470 (Jarekt) I get similar message when trying to login: {F58950593}
[03:18:01] SRE, MediaWiki-User-login-and-signup, Wikimedia-Incident: I can't edit "to prevent session hijacking" and log out - https://phabricator.wikimedia.org/T390512#10692472 (Benwing2)
[03:23:07] SRE, MediaWiki-User-login-and-signup, Wikimedia-Incident: I can't edit "to prevent session hijacking" and log out - https://phabricator.wikimedia.org/T390512#10692490 (Jarekt) The log-in trouble described above, is when using Firefox. In Chrome I am logged in, but can not edit due to error: Sorry! W...
[03:26:52] SRE, MediaWiki-User-login-and-signup, Wikimedia-Incident: I can't edit "to prevent session hijacking" and log out - https://phabricator.wikimedia.org/T390512#10692498 (DuncanHill) EnWIki user here. Using Edge on Win11 I get the same errors as Jarekt. Same with Chrome on my mobile. Also, "email this u...
[03:27:26] FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-eqsin:xe-0/1/3 (Peering: SGIX (103.16.102.187) {#1152}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqsin:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[03:29:08] SRE, MediaWiki-User-login-and-signup, Wikimedia-Incident: I can't edit "to prevent session hijacking" and log out - https://phabricator.wikimedia.org/T390512#10692500 (Quiddity) Devs are looking into it. https://www.wikimediastatus.net/incidents/1w3rq4d2zljj
[03:30:27] SRE, MediaWiki-User-login-and-signup, Wikimedia-Incident: I can't edit "to prevent session hijacking" and log out - https://phabricator.wikimedia.org/T390512#10692503 (Jacobolus) I'm running into the same issue. Couldn't edit (with a message about needing to log out), then couldn't log out ("invalid...
[03:31:25] FIRING: [2x] SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:35:00] SRE, MediaWiki-User-login-and-signup, Wikimedia-Incident: "Invalid CSRF token" on any actions by registered users - https://phabricator.wikimedia.org/T390512#10692506 (MBH)
[03:35:47] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10692508 (phaultfinder)
[03:37:26] FIRING: [12x] ProbeDown: Service ripe-atlas-codfw:0 has failed probes (icmp_ripe_atlas_codfw_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:37:36] Sounds like logged-in edits are starting to go through now
[03:41:25] FIRING: [2x] SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:42:06] RESOLVED: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures
[03:42:26] FIRING: [12x] ProbeDown: Service ripe-atlas-codfw:0 has failed probes (icmp_ripe_atlas_codfw_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:42:51] AntiComposite: thanks for confirming! yes, we believe we've mitigated the underlying issue, and I'm seeing edit rate graphs recovering
[03:43:53] RESOLVED: [2x] SessionStoreErrorRateHigh: Session storage error rates (5xx) in codfw are elevated #page - TODO - https://alerts.wikimedia.org/?q=alertname%3DSessionStoreErrorRateHigh
[03:51:00] SRE, MediaWiki-User-login-and-signup, Wikimedia-Incident: "Invalid CSRF token" on any actions by registered users - https://phabricator.wikimedia.org/T390512#10692512 (MBH) Open→Resolved a:MBH Fixed.
[03:51:25] RESOLVED: [2x] SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:59:12] SRE, MediaWiki-User-login-and-signup, Wikimedia-Incident: "Invalid CSRF token" on any actions by registered users - https://phabricator.wikimedia.org/T390512#10692515 (Jacobolus) Thanks for your quick work!
[04:08:27] ops-codfw, SRE, DC-Ops: ManagementSSHDown - https://phabricator.wikimedia.org/T387829#10692516 (phaultfinder)
[04:09:38] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10692517 (phaultfinder)
[04:25:43] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10692518 (phaultfinder)
[04:32:25] FIRING: SystemdUnitFailed: prometheus-dpkg-success-textfile.service on seaborgium:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:42:21] SRE: Identify and remediate large increase in sessionstore Cassandra disk usage - https://phabricator.wikimedia.org/T390514 (Scott_French) NEW
[04:42:40] SRE: Identify and remediate large increase in sessionstore Cassandra disk usage - https://phabricator.wikimedia.org/T390514#10692533 (Scott_French) p:Triage→High
[05:06:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:12:02] FIRING: [2x] DatasourceNoData: - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData
[05:32:02] RESOLVED: [2x] DatasourceNoData: - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData
[05:41:29] (PS1) Kevin Bazira: ml-services: update article-country image [deployment-charts] - https://gerrit.wikimedia.org/r/1132293 (https://phabricator.wikimedia.org/T389768)
[05:54:49] SRE, MediaWiki-Platform-Team: Identify and remediate large increase in sessionstore Cassandra disk usage - https://phabricator.wikimedia.org/T390514#10692572 (Krinkle)
[05:56:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:00:41] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10692578 (phaultfinder)
[06:07:27] (PS1) Kosta Harlan: extension-list: Add EmailAuth [mediawiki-config] - https://gerrit.wikimedia.org/r/1132302 (https://phabricator.wikimedia.org/T390437)
[06:16:13] (CR) Muehlenhoff: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1131975 (https://phabricator.wikimedia.org/T389869) (owner: Ahmon Dancy)
[06:22:25] (CR) Muehlenhoff: [C:+2] Create insetup role for SRE o11y with nftables and rename existing one [puppet] - https://gerrit.wikimedia.org/r/1131930 (https://phabricator.wikimedia.org/T389825) (owner: Muehlenhoff)
[06:28:23] (CR) Muehlenhoff: [C:+2] Add cumin1003 to site.pp [puppet] - https://gerrit.wikimedia.org/r/1131933 (https://phabricator.wikimedia.org/T389380) (owner: Muehlenhoff)
[06:35:55] (PS1) Kosta Harlan: EmailAuth: Prepare config for enabling [mediawiki-config] - https://gerrit.wikimedia.org/r/1132408 (https://phabricator.wikimedia.org/T390437)
[06:36:44] (CR) CI reject: [V:-1] EmailAuth: Prepare config for enabling [mediawiki-config] - https://gerrit.wikimedia.org/r/1132408 (https://phabricator.wikimedia.org/T390437) (owner: Kosta Harlan)
[06:39:50] (PS2) Kosta Harlan: EmailAuth: Prepare config for enabling [mediawiki-config] - https://gerrit.wikimedia.org/r/1132408 (https://phabricator.wikimedia.org/T390437)
[06:40:36] (CR) CI reject: [V:-1] EmailAuth: Prepare config for enabling [mediawiki-config] - https://gerrit.wikimedia.org/r/1132408 (https://phabricator.wikimedia.org/T390437) (owner: Kosta Harlan)
[06:41:54] !log upload openjdk-21 21.0.6+7-1~deb12u1 to apt.wikimedia.org component/jdk21 (backport of latest Java 21 security updates for Bookworm)
[06:45:03] ops-esams, SRE, DC-Ops: InboundInterfaceErrors - https://phabricator.wikimedia.org/T389874#10692615 (phaultfinder)
[06:46:08] (PS2) Hashar: Fix removal of Gerrit json prefix [software/bitu] - https://gerrit.wikimedia.org/r/1131991
[06:49:20] (CR) Hashar: Fix removal of Gerrit json prefix (1 comment) [software/bitu] - https://gerrit.wikimedia.org/r/1131991 (owner: Hashar)
[06:52:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-parsoid releases routed via main at eqiad: 9.646% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[06:52:41] (PS5) Hashar: Simplify invocation of clients integrations [software/bitu] - https://gerrit.wikimedia.org/r/1131460
[06:54:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 14.15s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[06:54:41] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10692622 (phaultfinder)
[06:55:15] FIRING: [8x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[06:57:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-parsoid releases routed via main at eqiad: 3.214% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[06:57:35] (PS3) Kosta Harlan: EmailAuth: Prepare config for enabling [mediawiki-config] - https://gerrit.wikimedia.org/r/1132408 (https://phabricator.wikimedia.org/T390437)
[06:59:16] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 14.15s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[06:59:50] (CR) Hashar: "I have rewrote the integration with Gerrit to handle the different status codes." [software/bitu] - https://gerrit.wikimedia.org/r/1131471 (owner: Hashar)
[07:00:05] Amir1, Urbanecm, and awight: #bothumor I � Unicode. All rise for UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250331T0700).
[07:00:05] dcausse and Superpes: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:00:15] RESOLVED: [8x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[07:01:06] o/
[07:02:10] I can deploy
[07:02:39] (CR) Ilias Sarantopoulos: [C:+1] ml-services: update article-country image [deployment-charts] - https://gerrit.wikimedia.org/r/1132293 (https://phabricator.wikimedia.org/T389768) (owner: Kevin Bazira)
[07:03:24] Superpes: are you around?
[07:04:36] (CR) Kevin Bazira: [C:+2] ml-services: update article-country image [deployment-charts] - https://gerrit.wikimedia.org/r/1132293 (https://phabricator.wikimedia.org/T389768) (owner: Kevin Bazira)
[07:04:45] (PS1) Ilias Sarantopoulos: ml-services: change the edit check staging deployment model name [deployment-charts] - https://gerrit.wikimedia.org/r/1132419 (https://phabricator.wikimedia.org/T388269)
[07:05:53] (PS1) Ilias Sarantopoulos: ml-services: change the edit check staging deployment model name [deployment-charts] - https://gerrit.wikimedia.org/r/1132420 (https://phabricator.wikimedia.org/T388269)
[07:06:07] !log deploying https://gerrit.wikimedia.org/r/c/operations/puppet/+/1131705 on A:cp-magru (T384227)
[07:06:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:06:13] T384227: Private TLS material (TLS keys) should be stored in volatile storage only - https://phabricator.wikimedia.org/T384227
[07:06:21] (Merged) jenkins-bot: ml-services: update article-country image [deployment-charts] - https://gerrit.wikimedia.org/r/1132293 (https://phabricator.wikimedia.org/T389768) (owner: Kevin Bazira)
[07:07:25] RESOLVED: SystemdUnitFailed: prometheus-dpkg-success-textfile.service on seaborgium:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:07:53] (Abandoned) Ilias Sarantopoulos: ml-services: change the edit check staging deployment model name [deployment-charts] - https://gerrit.wikimedia.org/r/1132420 (https://phabricator.wikimedia.org/T388269) (owner: Ilias Sarantopoulos)
[07:07:56] (Abandoned) Ilias Sarantopoulos: ml-services: change the edit check staging deployment model name [deployment-charts] - https://gerrit.wikimedia.org/r/1132419 (https://phabricator.wikimedia.org/T388269) (owner: Ilias Sarantopoulos)
[07:08:34] !log kevinbazira@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'article-models' for release 'main' .
[07:09:13] (PS4) Ilias Sarantopoulos: ml-services: change the edit check staging deployment model name [deployment-charts] - https://gerrit.wikimedia.org/r/1131323 (https://phabricator.wikimedia.org/T388269)
[07:12:28] !log upgraded python3-wmflib to v1.3.1 on cumin[12]002
[07:12:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:12:53] (CR) TrainBranchBot: [C:+2] "Approved by dcausse@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1131335 (https://phabricator.wikimedia.org/T389971) (owner: DCausse)
[07:13:39] (Merged) jenkins-bot: cirrus: use only deployment-cirrussearch*.deployment-prep [mediawiki-config] - https://gerrit.wikimedia.org/r/1131335 (https://phabricator.wikimedia.org/T389971) (owner: DCausse)
[07:14:32] (CR) Fabfur: [C:+2] hiera: enable TLS on volatile storage in magru [puppet] - https://gerrit.wikimedia.org/r/1131705 (https://phabricator.wikimedia.org/T384227) (owner: Fabfur)
[07:14:43] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10692651 (phaultfinder)
[07:15:17] (PS1) Muehlenhoff: Create insetup role for Data Platformo11y with nftables and merge DE/Search [puppet] - https://gerrit.wikimedia.org/r/1132422 (https://phabricator.wikimedia.org/T389825)
[07:16:12] (PS3) Aaron Schulz: services: update codfw changeprop Docker image to one using node 20 [deployment-charts] - https://gerrit.wikimedia.org/r/1126215 (https://phabricator.wikimedia.org/T381588)
[07:16:33] FIRING: KubernetesCalicoDown: wikikube-worker1039.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s&var-instance=wikikube-worker1039.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[07:17:08] (CR) Brouberol: [C:+2] Fix: prevent the stubprovider from locking indefinitely [dumps] - https://gerrit.wikimedia.org/r/1131781 (https://phabricator.wikimedia.org/T390059) (owner: Brouberol)
[07:17:15] !log fabfur@cumin1002 conftool action : set/pooled=no; selector: name=cp7002.magru.wmnet
[07:17:46] (PS1) DCausse: Translate: fix elasticsearch cluster setup [mediawiki-config] - https://gerrit.wikimedia.org/r/1132423 (https://phabricator.wikimedia.org/T390244)
[07:18:15] !log kevinbazira@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-models' for release 'main' .
[07:18:44] (PS2) Muehlenhoff: Create insetup role for Data Platform with nftables and merge DE/Search roles [puppet] - https://gerrit.wikimedia.org/r/1132422 (https://phabricator.wikimedia.org/T389825)
[07:19:14] !log upgrade python3-wmflib to v1.3.1 fleetwide
[07:19:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:19:37] adding https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1132423 to the backport window
[07:19:45] !log fabfur@cumin1002 conftool action : set/pooled=no; selector: name=cp7010.magru.wmnet
[07:19:58] (CR) Kevin Bazira: [C:+1] ml-services: change the edit check staging deployment model name [deployment-charts] - https://gerrit.wikimedia.org/r/1131323 (https://phabricator.wikimedia.org/T388269) (owner: Ilias Sarantopoulos)
[07:20:06] (PS2) DCausse: Translate: fix elasticsearch cluster setup [mediawiki-config] - https://gerrit.wikimedia.org/r/1132423 (https://phabricator.wikimedia.org/T390244)
[07:21:18] !log brouberol@deploy1003 Started scap build-images: T390059 - Prevent stub provider from going in an infinite loop
[07:21:23] T390059: Large wiki dump is getting stuck when running in airflow - https://phabricator.wikimedia.org/T390059
[07:22:46] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for @Ben.buchenau - https://phabricator.wikimedia.org/T386904#10692673 (Ben.buchenau) Good morning @MoritzMuehlenhoff - just wanted to work with the data and noticed I have login issues to Phabricator due to missing permission...
[07:22:53] (CR) TrainBranchBot: [C:+2] "Approved by dcausse@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1132423 (https://phabricator.wikimedia.org/T390244) (owner: DCausse)
[07:22:59] (PS1) Marostegui: installserver: Do not reimage db1257 [puppet] - https://gerrit.wikimedia.org/r/1132492
[07:23:18] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for @Ben.buchenau - https://phabricator.wikimedia.org/T386904#10692678 (Ben.buchenau) Resolved→Open
[07:23:37] (Merged) jenkins-bot: Translate: fix elasticsearch cluster setup [mediawiki-config] - https://gerrit.wikimedia.org/r/1132423 (https://phabricator.wikimedia.org/T390244) (owner: DCausse)
[07:24:16] !log fabfur@cumin1002 conftool action : set/pooled=yes; selector: name=cp7002.magru.wmnet
[07:24:22] !log fabfur@cumin1002 conftool action : set/pooled=yes; selector: name=cp7010.magru.wmnet
[07:24:39] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10692681 (phaultfinder)
[07:27:18] !log fabfur@cumin1002 conftool action : set/pooled=no; selector: name=cp7003.magru.wmnet
[07:27:26] FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-eqsin:xe-0/1/3 (Peering: SGIX (103.16.102.187) {#1152}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqsin:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[07:27:26] !log fabfur@cumin1002 conftool action : set/pooled=no; selector: name=cp7011.magru.wmnet
[07:27:42] (PS1) Ilias Sarantopoulos: api-gateway: enable anonymous reqs to edit check staging and update name [deployment-charts] - https://gerrit.wikimedia.org/r/1132534 (https://phabricator.wikimedia.org/T388269)
[07:27:53] (CR) Ilias Sarantopoulos: [C:+2] ml-services: change the edit check staging deployment model name [deployment-charts] - https://gerrit.wikimedia.org/r/1131323 (https://phabricator.wikimedia.org/T388269) (owner: Ilias Sarantopoulos)
[07:28:46] !log kevinbazira@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-models' for release 'main' .
[07:29:05] brouberol: o/ are you running scap?
[07:29:24] (Merged) jenkins-bot: ml-services: change the edit check staging deployment model name [deployment-charts] - https://gerrit.wikimedia.org/r/1131323 (https://phabricator.wikimedia.org/T388269) (owner: Ilias Sarantopoulos)
[07:30:16] (CR) Marostegui: [C:+2] installserver: Do not reimage db1257 [puppet] - https://gerrit.wikimedia.org/r/1132492 (owner: Marostegui)
[07:30:16] seeing "scap build-images: T390059" couple minutes ago
[07:30:17] T390059: Large wiki dump is getting stuck when running in airflow - https://phabricator.wikimedia.org/T390059
[07:31:05] scap waiting on "concurrent prep is locked by brouberol (pid 2366246) on Mon Mar 31 07:21:18 2025"
[07:31:33] dcausse: I am, to build a restricted/mediawiki-multiversion-cli. Should I stop it?
[07:31:54] *to build a restricted/mediawiki-multiversion-cli image
[07:31:54] brouberol: depends how long it take, it's the backport window now
[07:32:16] !log brouberol@deploy1003 Finished scap build-images: T390059 - Prevent stub provider from going in an infinite loop (duration: 10m 57s)
[07:32:20] oh sorry, it is usually pretty fast (<60s)
[07:32:24] there we go
[07:32:25] !log dcausse@deploy1003 Started scap sync-world: Backport for [[gerrit:1132423|Translate: fix elasticsearch cluster setup (T390244)]]
[07:32:30] T390244: InvalidArgumentException: Default TTM service eqiad cannot be write only - https://phabricator.wikimedia.org/T390244
[07:32:32] brouberol: np! thanks!
[07:33:28] although this time it took 10 minutes. I might have been mistaken when I thought it took ~60s
[07:33:48] !log fabfur@cumin1002 conftool action : set/pooled=yes; selector: name=cp7011.magru.wmnet
[07:33:56] !log fabfur@cumin1002 conftool action : set/pooled=yes; selector: name=cp7003.magru.wmnet
[07:34:17] !log fabfur@cumin1002 conftool action : set/pooled=no; selector: name=cp7004.magru.wmnet
[07:34:28] !log fabfur@cumin1002 conftool action : set/pooled=no; selector: name=cp7012.magru.wmnet
[07:35:57] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply
[07:36:00] !log isaranto@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' .
[07:36:15] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/mediawiki-dumps-legacy: apply
[07:36:43] (CR) Volans: sanitarium_restart.py: restart Sanitarium hosts (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/1131954 (https://phabricator.wikimedia.org/T363665) (owner: Federico Ceratto)
[07:38:47] !log fabfur@cumin1002 conftool action : set/pooled=yes; selector: name=cp7012.magru.wmnet
[07:40:15] FIRING: HttpdUnreachable: httpd unavailable for deployment mw-debug/next at codfw - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=257&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-debug&var-release=next - https://alerts.wikimedia.org/?q=alertname%3DHttpdUnreachable
[07:40:21] !log fabfur@cumin1002 conftool action : set/pooled=yes; selector: name=cp7004.magru.wmnet
[07:41:46] !log fabfur@cumin1002 conftool action : set/pooled=no; selector: name=cp7005.magru.wmnet
[07:41:54] !log fabfur@cumin1002 conftool action : set/pooled=no; selector: name=cp7013.magru.wmnet
[07:41:59] (CR) Elukey: [C:+2] services: update codfw changeprop Docker image to one using node 20 [deployment-charts] - https://gerrit.wikimedia.org/r/1126215 (https://phabricator.wikimedia.org/T381588) (owner: Aaron Schulz)
[07:42:09] (CR) Filippo Giunchedi: [C:+2] search-grafana-dashboards: format results as markdown, and add --json [software] - https://gerrit.wikimedia.org/r/1129242 (owner: Filippo Giunchedi)
[07:42:26] FIRING: [4x] ProbeDown: Service ripe-atlas-codfw:0 has failed probes (icmp_ripe_atlas_codfw_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:43:02] scap stuck on "Started sync-testservers-k8s" since 8mins, no progress at all (ok: 0; fail: 0; left: 12)
[07:43:36] * akosiaris looking
[07:44:27] Failed to pull image "docker-registry.discovery.wmnet/restricted/mediawiki-webserver:2025-03-31-072141-webserver": rpc error: code = FailedPrecondition desc = failed to pull and unpack image "docker-registry.discovery.wmnet/restricted/mediawiki-webserver:2025-03-31-072141-webserver": failed commit on ref
[07:44:28] "layer-sha256:65b5b2cdb1e2f6ff09fcd1220ef4ee83f70e5929ff07e9267fb69e72f2f55ceb": unexpected commit digest sha256:623c817ae5954e1d72151a130af13ced997b918754c214b07a4f26659ec647fa, expected sha256:65b5b2cdb1e2f6ff09fcd1220ef4ee83f70e5929ff07e9267fb69e72f2f55ceb: failed precondition
[07:44:44] this stinks of the issue that Scott looked into on Friday
[07:44:55] registry having a corrupt blob
[07:45:15] FIRING: [6x] HttpdUnreachable: httpd unavailable for deployment mw-debug/next at codfw - https://wikitech.wikimedia.org/wiki/Application_servers - https://alerts.wikimedia.org/?q=alertname%3DHttpdUnreachable
[07:45:23] dcausse: probably this one: https://phabricator.wikimedia.org/T390251
[07:45:28] wanna add a comment?
[07:45:32] it just faile
[07:45:34] d
[07:45:45] !log fabfur@cumin1002 conftool action : set/pooled=yes; selector: name=cp7013.magru.wmnet
[07:45:46] (PS1) Slyngshede: Release v0.1.8 [software/bitu] - https://gerrit.wikimedia.org/r/1132536
[07:46:07] dcausse: and rollback, right?
[07:46:19] seeing "07:44:20 Rollback completed" yes
[07:46:48] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4006.ulsfo.wmnet
[07:46:49] (PS1) Kevin Bazira: ml-services: update outlink predictor image [deployment-charts] - https://gerrit.wikimedia.org/r/1132537 (https://phabricator.wikimedia.org/T389768)
[07:46:53] !log fabfur@cumin1002 conftool action : set/pooled=yes; selector: name=cp7005.magru.wmnet
[07:46:59] SRE, Ganeti, Infrastructure-Foundations: Update Ganeti servers in ulsfo to Bookworm - https://phabricator.wikimedia.org/T382511#10692711 (ops-monitoring-bot) Draining ganeti4006.ulsfo.wmnet of running VMs
[07:47:34] but the logs are too big to fit in my tmux window buffer, seeing helm diff mainly
[07:47:44] (CR) Filippo Giunchedi: [C:+2] prometheus: add function to replace prometheus_all_nodes [puppet] - https://gerrit.wikimedia.org/r/1128779 (https://phabricator.wikimedia.org/T389170) (owner: Filippo Giunchedi)
[07:48:09] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4006.ulsfo.wmnet
[07:48:18] !log fabfur@cumin1002 conftool action : set/pooled=no; selector: name=cp7006.magru.wmnet
[07:48:31] !log fabfur@cumin1002 conftool action : set/pooled=no; selector: name=cp7014.magru.wmnet
[07:49:15] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4006.ulsfo.wmnet
[07:49:53] akosiaris: should I just retry?
[07:50:15] RESOLVED: [6x] HttpdUnreachable: httpd unavailable for deployment mw-debug/next at codfw - https://wikitech.wikimedia.org/wiki/Application_servers - https://alerts.wikimedia.org/?q=alertname%3DHttpdUnreachable [07:50:28] dcausse: per that task, it might just work to retry [07:50:33] ack [07:50:35] not sure what is going on yet tbh [07:50:42] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update Ganeti servers in ulsfo to Bookworm - https://phabricator.wikimedia.org/T382511#10692718 (10ops-monitoring-bot) Draining ganeti4006.ulsfo.wmnet of running VMs [07:50:47] elukey: the registry is a gift that keeps on giving apparently per https://phabricator.wikimedia.org/T390251 [07:50:53] !log dcausse@deploy1003 Started scap sync-world: Backport for [[gerrit:1132423|Translate: fix elasticsearch cluster setup (T390244)]] [07:50:58] T390244: InvalidArgumentException: Default TTM service eqiad cannot be write only - https://phabricator.wikimedia.org/T390244 [07:51:24] akosiaris: there was another image built concurrently apparently but scap properly waited on a lock [07:51:45] 10SRE-swift-storage, 10API Platform, 06Commons, 10MediaWiki-File-management, and 3 others: Commons: UploadChunkFileException: Error storing file: backend-fail-internal; local-swift-codfw - https://phabricator.wikimedia.org/T328872#10692731 (10MatthewVernon) >>! In T328872#10689851, @Mike_Peel wrote: > "The... [07:52:27] akosiaris: wasn't aware of that sigh [07:52:42] I saw "concurrent prep is locked by brouberol (pid 2366246) on Mon Mar 31 07:21:18 2025" but no clue if that could be in way related [07:53:49] sorry, that was me (obv) and I was unaware that my scap command would collide with the ongoing backport, as well as T390251 (I was OOO on friday). [07:53:49] T390251: docker-registry.wikimedia.org was serving a bad blob - https://phabricator.wikimedia.org/T390251 [07:54:01] !log fabfur@cumin1002 conftool action : set/pooled=yes; selector: name=cp7014.magru.wmnet [07:54:01] how can I help? 
[07:54:09] !log fabfur@cumin1002 conftool action : set/pooled=yes; selector: name=cp7006.magru.wmnet [07:54:43] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10692732 (10phaultfinder) [07:54:57] brouberol: sorry for the ping, I'm not sure this is related to what's happening... [07:55:11] brouberol: unsure tbh. For now, I 'd say just be aware that you might see problems regarding the registry? [07:55:25] it's unclear what caused this [07:55:30] it's not looking better this time either :/ [07:55:42] Understood, and indeed, the scap logs mention `Pushing docker-registry.discovery.wmnet/restricted/mediawiki-multiversion-cli:2025-03-31-072139-publish` and it's not appearing in the image tags [07:56:13] Failed to pull image "docker-registry.discovery.wmnet/restricted/mediawiki-webserver:2025-03-31-072141-webserver": rpc error: code = FailedPrecondition desc = failed to pull and unpack image "docker-registry.discovery.wmnet/restricted/mediawiki-webserver:2025-03-31-072141-webserver": failed commit on ref [07:56:13] "layer-sha256:65b5b2cdb1e2f6ff09fcd1220ef4ee83f70e5929ff07e9267fb69e72f2f55ceb": unexpected commit digest sha256:623c817ae5954e1d72151a130af13ced997b918754c214b07a4f26659ec647fa, expected sha256:65b5b2cdb1e2f6ff09fcd1220ef4ee83f70e5929ff07e9267fb69e72f2f55ceb: failed precondition [07:56:16] yup, same error [07:56:26] I think to figure out which layer this is. [07:57:15] FIRING: HttpdUnreachable: httpd unavailable for deployment mw-debug/next at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=257&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-debug&var-release=next - https://alerts.wikimedia.org/?q=alertname%3DHttpdUnreachable [07:59:18] akosiaris: could it be the registry picking up something bad from Redis? 
[07:59:20] !log fabfur@cumin1002 conftool action : set/pooled=no; selector: name=cp7007.magru.wmnet [07:59:34] !log fabfur@cumin1002 conftool action : set/pooled=no; selector: name=cp7015.magru.wmnet [08:01:56] elukey: it's not caching blobs though, right? [08:02:15] FIRING: [6x] HttpdUnreachable: httpd unavailable for deployment mw-debug/next at codfw - https://wikitech.wikimedia.org/wiki/Application_servers - https://alerts.wikimedia.org/?q=alertname%3DHttpdUnreachable [08:02:19] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [08:02:20] I mean blobs of that size. Those layers are like GBs [08:02:49] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [08:02:56] what's the backend storage? I can't remember if it's thanos-swift or ms-swift [08:02:57] jouncebot: now [08:02:57] No deployments scheduled for the next 1 hour(s) and 57 minute(s) [08:03:19] (03PS1) 10Giuseppe Lavagetto: statograph: fix edit count metric [puppet] - 10https://gerrit.wikimedia.org/r/1132538 [08:03:50] (I ask because if it's ms-swift you might have written different objects to the different ms-swift clusters) [08:03:51] !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host cumin1003.eqiad.wmnet [08:03:53] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [08:04:44] !log fabfur@cumin1002 conftool action : set/pooled=yes; selector: name=cp7015.magru.wmnet [08:04:53] (03CR) 10Volans: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1132538 (owner: 10Giuseppe Lavagetto) [08:04:53] !log fabfur@cumin1002 conftool action : set/pooled=yes; selector: name=cp7007.magru.wmnet [08:05:30] (03CR) 10Filippo Giunchedi: [C:03+1] statograph: fix edit count metric [puppet] - 10https://gerrit.wikimedia.org/r/1132538 (owner: 10Giuseppe Lavagetto) [08:05:44] !log fabfur@cumin1002 conftool action : set/pooled=no; selector: name=cp7008.magru.wmnet [08:05:52] !log fabfur@cumin1002 conftool action : set/pooled=no; selector: 
name=cp7016.magru.wmnet [08:06:04] akosiaris: yes yes but it caches metadata before hitting swift IIRC, see https://phabricator.wikimedia.org/T375645#10217676 [08:06:04] (03CR) 10Ilias Sarantopoulos: [C:03+1] ml-services: update outlink predictor image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1132537 (https://phabricator.wikimedia.org/T389768) (owner: 10Kevin Bazira) [08:07:14] (03CR) 10Ilias Sarantopoulos: [C:03+1] "Let's also update the transformer image just to have an up to date deployment?" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1132537 (https://phabricator.wikimedia.org/T389768) (owner: 10Kevin Bazira) [08:07:15] RESOLVED: [6x] HttpdUnreachable: httpd unavailable for deployment mw-debug/next at codfw - https://wikitech.wikimedia.org/wiki/Application_servers - https://alerts.wikimedia.org/?q=alertname%3DHttpdUnreachable [08:10:20] !log fabfur@cumin1002 conftool action : set/pooled=yes; selector: name=cp7016.magru.wmnet [08:10:22] (03CR) 10Ayounsi: [C:03+1] "ship it!" 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1130135 (owner: 10Ssingh) [08:11:21] !log fabfur@cumin1002 conftool action : set/pooled=yes; selector: name=cp7008.magru.wmnet [08:11:40] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM cumin1003.eqiad.wmnet - jmm@cumin2002" [08:11:46] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM cumin1003.eqiad.wmnet - jmm@cumin2002" [08:11:46] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [08:11:46] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache cumin1003.eqiad.wmnet on all recursors [08:11:50] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cumin1003.eqiad.wmnet on all recursors [08:12:23] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM cumin1003.eqiad.wmnet - jmm@cumin2002" [08:12:28] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM cumin1003.eqiad.wmnet - jmm@cumin2002" [08:15:10] on the one hand, it's very nice that we can't deploy using corrupted artifacts. 
On the other hand, why are the artifacts corrupted [08:15:30] elukey: I see that indeed redis has nothing for the blob itself per your comment [08:15:52] (03CR) 10Giuseppe Lavagetto: [C:03+2] statograph: fix edit count metric [puppet] - 10https://gerrit.wikimedia.org/r/1132538 (owner: 10Giuseppe Lavagetto) [08:16:08] (03CR) 10Brouberol: "Should we add an optional flag to the rolling operation cookbook that would make sure these reimage flags are added, or are we ok with the" [cookbooks] - 10https://gerrit.wikimedia.org/r/1131446 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [08:16:09] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host cumin1003.eqiad.wmnet with OS bookworm [08:16:25] (03CR) 10Brouberol: [C:03+2] airflow-analytics-test: fix dependency on extra_dag_folders [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131723 (owner: 10Aqu) [08:17:42] (03CR) 10Brouberol: [C:03+1] Add a cleanup timer for old dumps webrequest logs [puppet] - 10https://gerrit.wikimedia.org/r/1131965 (https://phabricator.wikimedia.org/T390123) (owner: 10Btullis) [08:17:45] akosiaris: one thing that I am wondering is where the wrong digest gets listed, is it in the image's list of layers? [08:18:09] I am trying to get the list of layers from https://docker-registry.wikimedia.org/restricted/mediawiki-webserver:2025-03-31-072141-webserver but I never tried restricted before [08:18:55] that one requires a password [08:19:10] but otherwise, it's the same thing as anything else [08:19:39] elukey: I am refreshing my memory as well. 
[08:19:43] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [software/bitu] - 10https://gerrit.wikimedia.org/r/1132536 (owner: 10Slyngshede) [08:20:41] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10692772 (10phaultfinder) [08:20:49] (03PS1) 10Muehlenhoff: Failover to idp2004 [dns] - 10https://gerrit.wikimedia.org/r/1132540 [08:21:09] akosiaris: yep same, let's see who is quicker :D [08:21:50] (03CR) 10Brouberol: [C:03+2] Update webrequest_sampled_live druid deep-storage [puppet] - 10https://gerrit.wikimedia.org/r/1131778 (https://phabricator.wikimedia.org/T385198) (owner: 10Joal) [08:27:11] 06SRE, 06SRE Observability: Statograph referencing empty/nonexisting metrics goes unnoticed - https://phabricator.wikimedia.org/T390520 (10fgiunchedi) 03NEW [08:27:49] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cumin1003.eqiad.wmnet with reason: host reimage [08:29:14] elukey: got the command to fetch the manifest right. I see the blobSums now [08:30:33] (03CR) 10Slyngshede: [C:03+2] Release v0.1.8 [software/bitu] - 10https://gerrit.wikimedia.org/r/1132536 (owner: 10Slyngshede) [08:31:18] 06SRE, 10SRE-Access-Requests: Remove production data access for NDA expired user mobrovac - https://phabricator.wikimedia.org/T388030#10692822 (10Jelto) @KFrancis are you able to check the NDA status? [08:31:33] akosiaris: so IIUC from the task description, 65b5b2cdb1e2f6ff09fcd1220ef4ee83f70e5929ff07e9267fb69e72f2f55ceb should indeed be correctly listed, but somehow the registry returns a layer that has its sha256 different (623c817ae5954e1d72151a130af13ced997b918754c214b07a4f26659ec647fa). Is my understanding right? 
[08:31:52] I am trying to reason about the same thing [08:32:02] I can try to check on swift to see what it is stored in the meantime [08:32:22] curl https://docker-registry.wikimedia.org/v2/restricted/mediawiki-webserver/blobs/sha256:65b5b2cdb1e2f6ff09fcd1220ef4ee83f70e5929ff07e9267fb69e72f2f55ceb | sha256sum returns the proper hash btw [08:32:44] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cumin1003.eqiad.wmnet with reason: host reimage [08:33:09] (03Merged) 10jenkins-bot: Release v0.1.8 [software/bitu] - 10https://gerrit.wikimedia.org/r/1132536 (owner: 10Slyngshede) [08:33:11] well almost [08:33:31] it needs to be > and then sha256sum outfile [08:33:37] yep yep [08:33:55] it is consistent for both registries... I wonder [08:34:05] I was about to ask the same [08:35:21] (03PS1) 10Joal: Remove druid webrequest_sampled_128 purge timer [puppet] - 10https://gerrit.wikimedia.org/r/1132541 (https://phabricator.wikimedia.org/T385198) [08:35:44] (03CR) 10CI reject: [V:04-1] Remove druid webrequest_sampled_128 purge timer [puppet] - 10https://gerrit.wikimedia.org/r/1132541 (https://phabricator.wikimedia.org/T385198) (owner: 10Joal) [08:35:45] so if the sha is correct, this means that in theory a new deploy should succeed [08:35:58] because for $reasons the registry is now consistent [08:36:09] (03PS2) 10Joal: Remove druid webrequest_sampled_128 purge timer [puppet] - 10https://gerrit.wikimedia.org/r/1132541 (https://phabricator.wikimedia.org/T385198) [08:36:16] elukey: should I try? [08:36:39] elukey: and indeed we now see the published tags from this morning [08:36:50] brouberol@registry1004:~$ ./latest-image-tags.py restricted/mediawiki-multiversion-cli [08:36:50] - 2025-03-31-075104-publish-81 [08:36:50] - 2025-03-31-075104-publish [08:36:50] - 2025-03-31-073236-publish-81 [08:36:50] - 2025-03-31-073236-publish [08:37:13] ? 
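(Editor's note: the digest check discussed above, fetching a blob and comparing its sha256 with the digest named in the request, can be sketched in Python as below. This is a minimal illustration, not the tooling used during the incident. The function names `blob_digest` and `check_blob` are hypothetical, and the demo uses in-memory bytes instead of a real registry response. In practice the stream would be an HTTP response body from the `/v2/<name>/blobs/<digest>` endpoint, and the `restricted/` namespace would additionally need credentials.)

```python
import hashlib
import io


def blob_digest(stream, chunk_size=1024 * 1024):
    """Return 'sha256:<hex>' for a binary stream, reading it in chunks.

    Layers can be gigabytes in size, so the stream is never loaded
    into memory as a whole.
    """
    hasher = hashlib.sha256()
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        hasher.update(chunk)
    return "sha256:" + hasher.hexdigest()


def check_blob(stream, expected):
    """Compare the digest of `stream` with `expected`.

    Returns a (matches, actual_digest) tuple so that a mismatch can be
    reported in the same "unexpected ... expected ..." style as the
    containerd error.
    """
    actual = blob_digest(stream)
    return actual == expected, actual


if __name__ == "__main__":
    # Hypothetical stand-in for a layer blob; not real registry data.
    good = b"layer contents"
    expected = "sha256:" + hashlib.sha256(good).hexdigest()

    for payload in (good, b"corrupted contents"):
        ok, actual = check_blob(io.BytesIO(payload), expected)
        if ok:
            print("digest matches:", actual)
        else:
            print("unexpected digest", actual, "expected", expected)
```

(Running the same check against every registry host, and against both the registry port and the nginx frontend, is what distinguishes a bad object in storage from a single frontend serving inconsistent content.)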
[08:37:16] ok this is weird [08:37:26] I concur with what elukey is seeing [08:37:53] 06SRE, 06serviceops, 10Wikidata, 10Wikimedia-Site-requests, and 2 others: Increase entityAccessLimit for WikibaseClient wikis - https://phabricator.wikimedia.org/T384455#10692848 (10seanleong-WMDE) [08:37:56] all 4 hosts are responding with a blob that has the proper hash [08:38:03] dcausse: I suppose it won't hurt [08:38:05] go ahead [08:38:07] ack [08:38:14] but this is... a nice heisenbug [08:38:26] !log dcausse@deploy1003 Started scap sync-world: Backport for [[gerrit:1132423|Translate: fix elasticsearch cluster setup (T390244)]] [08:38:31] T390244: InvalidArgumentException: Default TTM service eqiad cannot be write only - https://phabricator.wikimedia.org/T390244 [08:38:31] I hope we haven't ended up with the registry pooled in both DCs or something [08:38:39] shouldn't, but doublechecking [08:38:40] 06SRE, 06serviceops, 10Wikidata, 10Wikimedia-Site-requests, and 2 others: Increase entityAccessLimit for WikibaseClient wikis - https://phabricator.wikimedia.org/T384455#10692855 (10seanleong-WMDE) a:03seanleong-WMDE [08:38:45] maybe the registry was ashamed by the amount of scrutiny it received [08:39:09] docker-registry Active/Passive pooled [08:39:15] nope, it's as it should be [08:40:40] is there some caching at the k8s level when doing docker_pull? 
[08:41:03] Failed to pull image "docker-registry.discovery.wmnet/restricted/mediawiki-webserver:2025-03-31-072141-webserver": rpc error: code = FailedPrecondition desc = failed to pull and unpack image "docker-registry.discovery.wmnet/restricted/mediawiki-webserver:2025-03-31-072141-webserver": failed commit on ref [08:41:03] "layer-sha256:65b5b2cdb1e2f6ff09fcd1220ef4ee83f70e5929ff07e9267fb69e72f2f55ceb": unexpected commit digest sha256:a4b0a271884c01327ca3db66f761d13ffe1a24173ad59537c29cad23417cf3aa, expected sha256:65b5b2cdb1e2f6ff09fcd1220ef4ee83f70e5929ff07e9267fb69e72f2f55ceb: failed precondition [08:41:04] again [08:41:34] (03PS1) 10Marostegui: mariadb: Productionize db1256 [puppet] - 10https://gerrit.wikimedia.org/r/1132542 (https://phabricator.wikimedia.org/T381475) [08:41:34] dcausse: it gets stored on the local node after being pulled successfully [08:41:43] but it hasn't been pulled successfully [08:41:48] akosiaris: where did you get that error? [08:41:54] kubectl get events -w [08:42:02] (03PS3) 10Joal: Remove druid webrequest_sampled_128 purge timer [puppet] - 10https://gerrit.wikimedia.org/r/1132541 (https://phabricator.wikimedia.org/T385198) [08:42:03] when I pasted it was 14s old [08:42:18] (03CR) 10Marostegui: [C:03+2] mariadb: Productionize db1256 [puppet] - 10https://gerrit.wikimedia.org/r/1132542 (https://phabricator.wikimedia.org/T381475) (owner: 10Marostegui) [08:42:31] -n mw-debug btw [08:43:04] (03CR) 10Joal: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1132541 (https://phabricator.wikimedia.org/T385198) (owner: 10Joal) [08:43:07] (03PS2) 10Kevin Bazira: ml-services: update outlink predictor and transformer images [deployment-charts] - 10https://gerrit.wikimedia.org/r/1132537 (https://phabricator.wikimedia.org/T389768) [08:44:07] an interesting thing that I see on registry2005 is [08:44:09] 10.192.32.55 - kubernetes [31/Mar/2025:08:39:55 +0000] "GET 
/v2/restricted/mediawiki-webserver/blobs/sha256:65b5b2cdb1e2f6ff09fcd1220ef4ee83f70e5929ff07e9267fb69e72f2f55ceb?ns=docker-registry.discovery.wmnet HTTP/1.1" 206 1 [08:44:17] that one is the dragonfly supernode [08:44:40] (03CR) 10Federico Ceratto: [C:03+1] sre.mysql.upgrade: wait to remove downtime [cookbooks] - 10https://gerrit.wikimedia.org/r/1130977 (owner: 10Volans) [08:44:42] (03CR) 10Federico Ceratto: [C:03+2] sre.mysql.upgrade: wait to remove downtime [cookbooks] - 10https://gerrit.wikimedia.org/r/1130977 (owner: 10Volans) [08:44:48] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10692895 (10phaultfinder) [08:44:52] in codfw [08:45:15] FIRING: [3x] HttpdUnreachable: httpd unavailable for deployment mw-debug/pinkunicorn at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://alerts.wikimedia.org/?q=alertname%3DHttpdUnreachable [08:45:17] (03CR) 10Kevin Bazira: "okok... I have included the transformer image too." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1132537 (https://phabricator.wikimedia.org/T389768) (owner: 10Kevin Bazira) [08:46:16] ah, could it be the supernode is corrupted? [08:46:47] I am wondering the same, that one is 2001 but for the sha I don't see anything suspicious in the logs [08:46:50] (03CR) 10Brouberol: [C:03+1] Remove druid webrequest_sampled_128 purge timer [puppet] - 10https://gerrit.wikimedia.org/r/1132541 (https://phabricator.wikimedia.org/T385198) (owner: 10Joal) [08:46:55] like $horror-while-doing-xyz [08:46:57] we did have an unresponsive node in eqiad [08:47:05] wikikube-worker1039 [08:47:24] but it's dead now, so it's probably unrelated. 
[08:47:30] (03CR) 10Brouberol: [C:03+2] Remove druid webrequest_sampled_128 purge timer [puppet] - 10https://gerrit.wikimedia.org/r/1132541 (https://phabricator.wikimedia.org/T385198) (owner: 10Joal) [08:47:34] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cumin1003.eqiad.wmnet with OS bookworm [08:47:34] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host cumin1003.eqiad.wmnet [08:48:15] !log marostegui@cumin1002 START - Cookbook sre.mysql.clone of db1211.eqiad.wmnet onto db1256.eqiad.wmnet [08:48:18] !log marostegui@cumin1002 START - Cookbook sre.mysql.depool db1211 - Depool db1211.eqiad.wmnet to then clone it to db1256.eqiad.wmnet - marostegui@cumin1002 [08:48:23] akosiaris: is it worth to roll restart the supernodes as test? [08:48:48] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db1211 - Depool db1211.eqiad.wmnet to then clone it to db1256.eqiad.wmnet - marostegui@cumin1002 [08:49:03] (03PS1) 10Slyngshede: Upgrade IDM to Bitu 0.1.8 [dns] - 10https://gerrit.wikimedia.org/r/1132545 [08:49:07] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply [08:49:32] elukey: sure, I don't see why not [08:49:38] actually, wait [08:49:38] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/mediawiki-dumps-legacy: apply [08:49:43] gimme a sec to try out something first [08:50:15] FIRING: [6x] HttpdUnreachable: httpd unavailable for deployment mw-debug/next at codfw - https://wikitech.wikimedia.org/wiki/Application_servers - https://alerts.wikimedia.org/?q=alertname%3DHttpdUnreachable [08:50:35] yes yes [08:51:31] dcausse: o/ to reply to your question, there is indeed something between the wikikube workers and the registry: https://wikitech.wikimedia.org/wiki/Dragonfly [08:51:45] we are trying to figure it out if it plays a role [08:51:55] (03CR) 10Ilias Sarantopoulos: "Thanks!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1132537 (https://phabricator.wikimedia.org/T389768) (owner: 10Kevin Bazira) [08:51:59] (03CR) 10Ilias Sarantopoulos: [C:03+1] ml-services: update outlink predictor and transformer images [deployment-charts] - 10https://gerrit.wikimedia.org/r/1132537 (https://phabricator.wikimedia.org/T389768) (owner: 10Kevin Bazira) [08:52:23] but it may be a red herring, since https://phabricator.wikimedia.org/T390251 explicitly mentions that it is the registry at fault [08:52:26] at least, the last time [08:52:41] because they tried to pull directly from it, no intermediaries [08:53:16] (03CR) 10Kevin Bazira: [C:03+2] ml-services: update outlink predictor and transformer images [deployment-charts] - 10https://gerrit.wikimedia.org/r/1132537 (https://phabricator.wikimedia.org/T389768) (owner: 10Kevin Bazira) [08:53:43] but this case is weird, since Alex just tested that the registries are consistent [08:54:02] 10ops-eqiad, 06DC-Ops, 10Data-Platform-SRE (2025.03.22 - 2025.04.11): Upgrade an-worker hard drives from 4TB to 8TB (group 1 - rack F7) - https://phabricator.wikimedia.org/T390168#10692929 (10BTullis) I can now confirm that these three nodes are showing as decommissioned, with no under-replicated blocks on t... 
[08:54:16] elukey: ack [08:54:38] (03Merged) 10jenkins-bot: ml-services: update outlink predictor and transformer images [deployment-charts] - 10https://gerrit.wikimedia.org/r/1132537 (https://phabricator.wikimedia.org/T389768) (owner: 10Kevin Bazira) [08:55:09] randomly searching I see https://github.com/containerd/containerd/pull/5921 but that's a wild guess [08:55:15] RESOLVED: [6x] HttpdUnreachable: httpd unavailable for deployment mw-debug/next at codfw - https://wikitech.wikimedia.org/wiki/Application_servers - https://alerts.wikimedia.org/?q=alertname%3DHttpdUnreachable [08:55:19] (03PS1) 10Brouberol: Fix typo in import causing an import error [dumps] - 10https://gerrit.wikimedia.org/r/1132548 (https://phabricator.wikimedia.org/T390059) [08:55:55] (03CR) 10Btullis: [C:03+1] Fix typo in import causing an import error [dumps] - 10https://gerrit.wikimedia.org/r/1132548 (https://phabricator.wikimedia.org/T390059) (owner: 10Brouberol) [08:56:31] (03CR) 10Ayounsi: [C:03+2] gNMIc: subscribe to alerts states [puppet] - 10https://gerrit.wikimedia.org/r/1131306 (https://phabricator.wikimedia.org/T388641) (owner: 10Ayounsi) [08:56:53] (03CR) 10Brouberol: [C:03+2] Fix typo in import causing an import error [dumps] - 10https://gerrit.wikimedia.org/r/1132548 (https://phabricator.wikimedia.org/T390059) (owner: 10Brouberol) [08:56:55] (03PS1) 10Muehlenhoff: Switch ganeti4006 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1132549 [08:57:19] (03PS2) 10Muehlenhoff: Switch ganeti4006 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1132549 [08:58:43] elukey: I managed to fetch the image on a wikikube-worker1048 with a sudo ctr -n k8s.io image pull -u username:password [08:58:50] and it's not complaining [08:59:00] it's not clear to me if it indeed used though dfget [08:59:09] the configuration seems to imply that it should [08:59:15] in theory yes [08:59:23] but it also implied that I shouldn't need to pass the username/password pair and I had to [09:00:28] 
(03PS1) 10Volans: CHANGELOG: add changelogs for release v10.0.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1132550 [09:00:55] elukey: I am running out of ideas. Wanna restart the supernodes and we retry? [09:03:33] akosiaris: lemme backtrack a second, because I don't want us to chase a rabbit hole - from the task's description Scott and Ahmon were seeing the incosistency while curling directly the registry endpoint, meanwhile in our case we tested that the inconsistency is gone on all registry hosts but we see some warnings on mw-debug related to a failed pull. [09:04:18] could it be a different version of the bug, maybe because the supernode got affected? [09:04:22] yup. Wanna retrace our steps to at least make sure we see something similar? [09:04:50] yep yep [09:04:57] cumin1002:~$ sudo cumin --force 'registry*' 'curl -s -k http://localhost:5000/v2/restricted/mediawiki-webserver/blobs/sha256:65b5b2cdb1e2f6ff09fcd1220ef4ee83f70e5929ff07e9267fb69e72f2f55ceb | sha256sum' [09:05:13] that returns indeed 65b5b2cdb1e2f6ff09fcd1220ef4ee83f70e5929ff07e9267fb69e72f2f55ceb [09:05:35] which is the same hash, so at least on that level it appears to be ok [09:05:47] let me redo this on the nginx level, just to rule that one out [09:06:23] and the error is layer-sha256:65b5b2cdb1e2f6ff09fcd1220ef4ee83f70e5929ff07e9267fb69e72f2f55ceb": unexpected commit digest sha256:a4b0a271884c01327ca3db66f761d13ffe1a24173ad59537c29cad23417cf3aa, expected sha256:65b5b2cdb1e2f6ff09fcd1220ef4ee83f70e5929ff07e9267fb69e72f2f55ceb [09:06:36] so we seem to be at least targetting the correct layer/hash [09:06:56] at least unless my eyes are deceiving me [09:07:08] (03PS1) 10Brouberol: Move import statements to CommandsInParallel.__init__ to avoid circular import [dumps] - 10https://gerrit.wikimedia.org/r/1132555 (https://phabricator.wikimedia.org/T390059) [09:07:50] (03CR) 10Btullis: [C:03+1] Move import statements to CommandsInParallel.__init__ to avoid circular import [dumps] - 
10https://gerrit.wikimedia.org/r/1132555 (https://phabricator.wikimedia.org/T390059) (owner: 10Brouberol) [09:10:08] akosiaris: wait a sec, so you get the error hitting nginx right? [09:10:12] elukey: ok, unless I am wrong, nginx on registry2004 returns a different result [09:10:19] it's the only 1 btw [09:10:25] lol [09:10:48] ok so my smart idea of bypassing it to test wasn't that smart [09:10:52] (03CR) 10Btullis: [V:03+1 C:03+2] Add a cleanup timer for old dumps webrequest logs [puppet] - 10https://gerrit.wikimedia.org/r/1131965 (https://phabricator.wikimedia.org/T390123) (owner: 10Btullis) [09:12:08] (03CR) 10Brouberol: [C:03+2] Move import statements to CommandsInParallel.__init__ to avoid circular import [dumps] - 10https://gerrit.wikimedia.org/r/1132555 (https://phabricator.wikimedia.org/T390059) (owner: 10Brouberol) [09:12:31] elukey: the 2 codfw nodes are at 95% filesystem [09:12:40] even if it hasn't caused an issue yet, it will at some point [09:13:41] !log `apt-get clean` on registry200[4,5] to free some space [09:13:43] (03CR) 10Volans: [C:03+2] CHANGELOG: add changelogs for release v10.0.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1132550 (owner: 10Volans) [09:13:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:13:51] elukey: and an nginx restart later it no longer returns weird stuff [09:14:08] wow [09:14:21] :/ [09:14:23] I have no idea, just did it on a hunch, I 'll look at logs now [09:14:31] dcausse: wanna try another deploy? 
[09:14:35] sure [09:14:41] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10692967 (10phaultfinder) [09:15:00] !log dcausse@deploy1003 Started scap sync-world: Backport for [[gerrit:1132423|Translate: fix elasticsearch cluster setup (T390244)]] [09:15:04] T390244: InvalidArgumentException: Default TTM service eqiad cannot be write only - https://phabricator.wikimedia.org/T390244 [09:16:44] progressing now [09:16:58] 3s Normal Pulled pod/mw-debug.eqiad.next-5b87d68897-lnz28 Successfully pulled image "docker-registry.discovery.wmnet/restricted/mediawiki-webserver:2025-03-31-072141-webserver" in 5.302419257s [09:17:01] yup [09:17:02] FIRING: [2x] DatasourceNoData: - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData [09:17:36] nice! [09:19:05] !log restart of nginx on registry2004 (by akosiaris) - only instance returning inconsistent responses for a given layer request - T390251 [09:19:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:19:10] T390251: docker-registry.wikimedia.org was serving a bad blob - https://phabricator.wikimedia.org/T390251 [09:19:15] for tracking [09:20:07] (03CR) 10Slyngshede: [C:03+2] Upgrade IDM to Bitu 0.1.8 [dns] - 10https://gerrit.wikimedia.org/r/1132545 (owner: 10Slyngshede) [09:20:15] !log slyngshede@dns1004 START - running authdns-update [09:22:15] FIRING: [3x] HttpdUnreachable: httpd unavailable for deployment mw-debug/next at codfw - https://wikitech.wikimedia.org/wiki/Application_servers - https://alerts.wikimedia.org/?q=alertname%3DHttpdUnreachable [09:22:21] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for @Ben.buchenau - https://phabricator.wikimedia.org/T386904#10692979 (10Jelto) I double checked all groups, @Ben.buchenau is not member of the NDA Phabricator group. I added Ben. Also your email address is not verified. Can you... 
[09:22:30] !log slyngshede@dns1004 END - running authdns-update [09:23:03] (03Merged) 10jenkins-bot: CHANGELOG: add changelogs for release v10.0.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1132550 (owner: 10Volans) [09:26:20] (03PS1) 10Volans: Upstream release v10.0.0 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/1132559 [09:26:42] akosiaris: failed again but at 50% [09:26:46] (03PS1) 10Joal: Fix webrequest_sampled_live webrequest_source value [puppet] - 10https://gerrit.wikimedia.org/r/1132560 (https://phabricator.wikimedia.org/T390029) [09:27:14] (03CR) 10Volans: [C:03+2] Upstream release v10.0.0 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/1132559 (owner: 10Volans) [09:27:26] it's rolling back now [09:27:41] elukey, fabfur --^ [09:29:36] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10693000 (10phaultfinder) [09:29:37] dcausse: same error? :( [09:30:22] joal: ahhh so it is on the benthos-webrequest side! [09:30:29] elukey: from scap I don't see the root cause [09:30:31] fabfur: my bad then, didn't remember it [09:30:50] (03CR) 10Elukey: [C:03+2] Fix webrequest_sampled_live webrequest_source value [puppet] - 10https://gerrit.wikimedia.org/r/1132560 (https://phabricator.wikimedia.org/T390029) (owner: 10Joal) [09:31:18] joal: rolling it out now [09:31:24] elukey: <3 [09:31:35] !log kevinbazira@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . 
[09:32:04] akosiaris: in the meantime I found https://phabricator.wikimedia.org/T390251#10692999 that is a little weird, I didn't know about those caches [09:32:15] RESOLVED: [3x] HttpdUnreachable: httpd unavailable for deployment mw-debug/next at codfw - https://wikitech.wikimedia.org/wiki/Application_servers - https://alerts.wikimedia.org/?q=alertname%3DHttpdUnreachable [09:37:02] RESOLVED: [2x] DatasourceNoData: - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData [09:37:40] dcausse: did it tell what namespace failed? [09:37:54] it's test servers [09:37:55] (03Merged) 10jenkins-bot: Upstream release v10.0.0 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/1132559 (owner: 10Volans) [09:38:37] !log kevinbazira@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . [09:41:49] dcausse: anything on logstash etc..? I am very ignorant about debugging scap failures for mw-on-k8s, but in the namespaces I don't see any more weird warnings like before [09:42:15] like, could it be something wrong on the patch itself? [09:42:52] !log kevinbazira@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . 
[09:44:43] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10693031 (10phaultfinder) [09:49:18] (03PS4) 10Ayounsi: Add transit/peering in/out port saturation alert - try 2 [alerts] - 10https://gerrit.wikimedia.org/r/1130625 (https://phabricator.wikimedia.org/T384052) [09:49:18] (03PS2) 10Ayounsi: CloudCoreBGPDown: set severity to critical + scope network [alerts] - 10https://gerrit.wikimedia.org/r/1131011 (https://phabricator.wikimedia.org/T388641) [09:49:18] (03PS1) 10Ayounsi: Add alerts for network alarms [alerts] - 10https://gerrit.wikimedia.org/r/1132563 (https://phabricator.wikimedia.org/T388641) [09:52:14] 06SRE, 06Infrastructure-Foundations, 10Mail, 10MediaWiki-Watchlist, and 2 others: Notifications about changes by Oznamovatel sent to Janbery doesn't seem to be reliable - https://phabricator.wikimedia.org/T245762#10693047 (10matej_suchanek) [09:53:52] (03CR) 10Cathal Mooney: [C:03+1] "LGTM" [alerts] - 10https://gerrit.wikimedia.org/r/1132563 (https://phabricator.wikimedia.org/T388641) (owner: 10Ayounsi) [09:54:39] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10693049 (10phaultfinder) [09:55:16] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4006.ulsfo.wmnet [09:58:16] elukey: sorry was in a meeting, looking close at kubectl get events [09:58:28] !log fabfur@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on cp4047.ulsfo.wmnet with reason: HW errors [09:58:33] dcausse: here as well, I can help [09:58:33] 10ops-ulsfo, 06SRE, 06DC-Ops: cp4047 flapped (host went down) - https://phabricator.wikimedia.org/T387238#10693050 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=50155bba-c03f-4da4-a7ab-39982fc57c53) set by fabfur@cumin1002 for 3 days, 0:00:00 on 1 host(s) and their services with reason:... 
[09:59:23] so from scap it failed with "Error: UPGRADE FAILED: release pinkunicorn failed, and has been rolled back due to atomic being set: timed out waiting for the condition" [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250331T1000) [10:01:37] jouncebot: now and next [10:01:37] For the next 0 hour(s) and 58 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250331T1000) [10:03:32] elukey: I see some "MountVolume.SetUp failed for volume "mediawiki-pinkunicorn-mail" : failed to sync configmap cache: timed out waiting for the condition" [10:04:14] going to retry [10:04:50] dcausse: but that is mw-debug right? You said it reached 50%, I thought it passed the mw-debug step [10:04:57] yes [10:05:02] no it did not [10:05:17] !log dcausse@deploy1003 Started scap sync-world: Backport for [[gerrit:1132423|Translate: fix elasticsearch cluster setup (T390244)]] [10:05:22] T390244: InvalidArgumentException: Default TTM service eqiad cannot be write only - https://phabricator.wikimedia.org/T390244 [10:05:41] 50% here means only 1 pod got upgraded iirc [10:05:46] it did only 6 pods of the mw-debug ns [10:05:57] ahhh right [10:06:25] in mw-debug eqiad I don't see signs of the registry acting funny [10:06:28] that is a good thing [10:07:37] (03CR) 10Ayounsi: [C:03+2] Add transit/peering in/out port saturation alert - try 2 [alerts] - 10https://gerrit.wikimedia.org/r/1130625 (https://phabricator.wikimedia.org/T384052) (owner: 10Ayounsi) [10:08:08] same, stalling at 50% [10:08:22] perhaps pods in eqiad? 
[10:08:33] no sorry, reached 7 now [10:08:50] (03Merged) 10jenkins-bot: Add transit/peering in/out port saturation alert - try 2 [alerts] - 10https://gerrit.wikimedia.org/r/1130625 (https://phabricator.wikimedia.org/T384052) (owner: 10Ayounsi) [10:10:04] ah no snap registry error again [10:10:13] Failed to pull image "docker-registry.discovery.wmnet/restricted/mediawiki-webserver:2025-03-31-072141-webserver": rpc error: code = FailedPrecondition desc = failed to pull and unpack image "docker-registry.discovery.wmnet/restricted/mediawiki-webserver:2025-03-31-072141-webserver": failed commit on ref "layer-sha256:65b5b2cdb1e2f6ff09fcd1220ef4ee83f70e5929ff07e9267fb69e72f2f55ceb": [10:10:19] unexpected commit digest sha256:a4b0a271884c01327ca3db66f761d13ffe1a24173ad59537c29cad23417cf3aa, expected sha256:65b5b2cdb1e2f6ff09fcd1220ef4ee83f70e5929ff07e9267fb69e72f2f55ceb: failed precondition [10:10:24] lemme restart nginx on 2005 as well [10:10:32] Damnit [10:10:54] Is there any ongoing work with regard to bast1003.wikimedia.org at the moment? I can SSH into other bastions, but not this one? Just curious. Don't want to distract from the work on the registry. [10:10:57] Same image though? 
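Note on the pull error above: the "unexpected commit digest ..., expected ..." message means the bytes received for the layer did not hash to the digest named in the image manifest. A minimal sketch of that comparison, assuming the blob bytes are already in memory; the function names and sample bytes are illustrative and not taken from containerd or the registry:

```python
import hashlib


def blob_digest(data: bytes) -> str:
    """Return the content digest in the 'sha256:<hex>' form seen in the error."""
    return "sha256:" + hashlib.sha256(data).hexdigest()


def verify_blob(data: bytes, expected: str) -> bool:
    """True when the received bytes hash to the digest the manifest expects."""
    return blob_digest(data) == expected


# Hypothetical data: a layer and a copy altered by one byte,
# standing in for a bad blob served from a cache.
good = b"example layer bytes"
expected = blob_digest(good)
corrupted = good + b"!"

print(verify_blob(good, expected))
print(verify_blob(corrupted, expected))
```

Running the same check against a blob fetched directly from each registry backend would show which host serves the mismatching bytes.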
[10:11:02] !log restart nginx on registry2005 - T390251 [10:11:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:11:07] (03PS2) 10Ayounsi: Add alerts for network alarms [alerts] - 10https://gerrit.wikimedia.org/r/1132563 (https://phabricator.wikimedia.org/T388641) [10:11:07] T390251: docker-registry.wikimedia.org was serving a bad blob - https://phabricator.wikimedia.org/T390251 [10:11:14] Thats very weird [10:12:07] happened on wikikube-worker1106.eqiad.wmnet and wikikube-worker1118.eqiad.wmnet [10:12:15] FIRING: HttpdUnreachable: httpd unavailable for deployment mw-misc/main at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=257&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-misc&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DHttpdUnreachable [10:12:45] (03PS1) 10Btullis: Upgrade PHP on the misc dumps worker - snapshot1016 [puppet] - 10https://gerrit.wikimedia.org/r/1132568 (https://phabricator.wikimedia.org/T382484) [10:12:59] tried to manually delete one pod after the nginx restart [10:13:55] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5172/co" [puppet] - 10https://gerrit.wikimedia.org/r/1132568 (https://phabricator.wikimedia.org/T382484) (owner: 10Btullis) [10:14:15] and the other error on wikikube-worker2081.codfw.wmnet [10:15:01] nope, errors again [10:15:30] (03PS1) 10Kevin Bazira: ml-services: update RRML image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1132572 (https://phabricator.wikimedia.org/T389768) [10:15:44] !log jmm@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on ganeti4006.ulsfo.wmnet with reason: remove from cluster for reimage [10:15:49] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update Ganeti servers in ulsfo to Bookworm 
- https://phabricator.wikimedia.org/T382511#10693083 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=8b39a21c-2178-4c2d-85ec-b458f3c9ab46) set by jmm@cumin2002 for 1 day, 0:00:00 on 1 host(s) and the... [10:16:48] (03CR) 10Muehlenhoff: [C:03+2] Switch ganeti4006 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1132549 (owner: 10Muehlenhoff) [10:16:51] dcausse: spicy deployment morning today :D [10:16:56] yes :) [10:17:15] FIRING: [4x] HttpdUnreachable: httpd unavailable for deployment mw-debug/next at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://alerts.wikimedia.org/?q=alertname%3DHttpdUnreachable [10:17:29] codfw had all its pods finally started, only one left in eqiad but just timed out [10:17:35] (03CR) 10Filippo Giunchedi: "LGTM overall, see inline" [alerts] - 10https://gerrit.wikimedia.org/r/1132563 (https://phabricator.wikimedia.org/T388641) (owner: 10Ayounsi) [10:18:47] FIRING: HelmReleaseBadStatus: Helm release mw-debug/pinkunicorn on k8s@codfw in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=mw-debug - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [10:19:21] I'll stop making attempts for now [10:19:27] akosiaris: could it be anything specific for /var/cache/nginx-docker-registry? 
[10:19:41] I see files having today's timestamp, so it seems actually used [10:22:15] RESOLVED: [4x] HttpdUnreachable: httpd unavailable for deployment mw-debug/next at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://alerts.wikimedia.org/?q=alertname%3DHttpdUnreachable [10:23:20] (03CR) 10Brouberol: [C:03+1] "Nice" [puppet] - 10https://gerrit.wikimedia.org/r/1132568 (https://phabricator.wikimedia.org/T382484) (owner: 10Btullis) [10:23:41] (03CR) 10Btullis: [V:03+1 C:03+2] Upgrade PHP on the misc dumps worker - snapshot1016 [puppet] - 10https://gerrit.wikimedia.org/r/1132568 (https://phabricator.wikimedia.org/T382484) (owner: 10Btullis) [10:23:47] RESOLVED: HelmReleaseBadStatus: Helm release mw-debug/pinkunicorn on k8s@codfw in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=mw-debug - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [10:25:50] (03PS1) 10Muehlenhoff: Make cumin1003 a Cumin node [puppet] - 10https://gerrit.wikimedia.org/r/1132577 (https://phabricator.wikimedia.org/T389380) [10:26:01] !log jmm@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ganeti4006.ulsfo.wmnet [10:27:23] I don't know tbh, it's weird [10:28:09] after reading https://nginx.org/en/docs/http/ngx_http_proxy_module.html#proxy_cache_path it could be possible to construct the key and see if it is wrongly stored in that cache [10:28:27] but maybe we could simply stop depool - stop nginx - wipe - start nginx - pool [10:29:14] (03CR) 10Volans: "LGTM, but bare in mind that the spicerack debian package is not available. Not sure if easier to wait until available." 
[puppet] - 10https://gerrit.wikimedia.org/r/1132577 (https://phabricator.wikimedia.org/T389380) (owner: 10Muehlenhoff) [10:41:09] (03PS1) 10Giuseppe Lavagetto: mediawiki-global: add alerts for too many login attempts [alerts] - 10https://gerrit.wikimedia.org/r/1132580 [10:42:20] (03CR) 10CI reject: [V:04-1] mediawiki-global: add alerts for too many login attempts [alerts] - 10https://gerrit.wikimedia.org/r/1132580 (owner: 10Giuseppe Lavagetto) [10:43:12] (03PS1) 10Filippo Giunchedi: alertmanager: test per-summary tasks for network alerts [puppet] - 10https://gerrit.wikimedia.org/r/1132581 (https://phabricator.wikimedia.org/T388641) [10:43:39] (03PS1) 10Elukey: role::docker_registry_had::registry: disable nginx cache [puppet] - 10https://gerrit.wikimedia.org/r/1132582 (https://phabricator.wikimedia.org/T390251) [10:43:56] (03CR) 10Elukey: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1132582 (https://phabricator.wikimedia.org/T390251) (owner: 10Elukey) [10:45:21] (03PS1) 10Slyngshede: Permission log: Improve speed of permission log [software/bitu] - 10https://gerrit.wikimedia.org/r/1132583 [10:45:50] (03CR) 10Muehlenhoff: "No rush at all! 
This can wait until Spicerack is done" [puppet] - 10https://gerrit.wikimedia.org/r/1132577 (https://phabricator.wikimedia.org/T389380) (owner: 10Muehlenhoff) [10:46:27] akosiaris: I'd honestly try https://gerrit.wikimedia.org/r/c/operations/puppet/+/1132582 [10:46:50] (03PS1) 10Btullis: Update the PHP version in the other dumps configuration file [puppet] - 10https://gerrit.wikimedia.org/r/1132584 (https://phabricator.wikimedia.org/T319432) [10:47:06] (03PS1) 10Hnowlan: api-gateway: use rest-gateway for wikifeeds calls to restbase [deployment-charts] - 10https://gerrit.wikimedia.org/r/1132585 (https://phabricator.wikimedia.org/T390317) [10:47:31] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (CORE_DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5173/co" [puppet] - 10https://gerrit.wikimedia.org/r/1132582 (https://phabricator.wikimedia.org/T390251) (owner: 10Elukey) [10:47:51] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5174/co" [puppet] - 10https://gerrit.wikimedia.org/r/1132584 (https://phabricator.wikimedia.org/T319432) (owner: 10Btullis) [10:48:16] (03CR) 10Btullis: [V:03+1 C:03+2] Update the PHP version in the other dumps configuration file [puppet] - 10https://gerrit.wikimedia.org/r/1132584 (https://phabricator.wikimedia.org/T319432) (owner: 10Btullis) [10:50:02] 06SRE, 06Infrastructure-Foundations, 10Mail, 10MediaWiki-Watchlist, and 2 others: Notifications about changes by Oznamovatel sent to Janbery doesn't seem to be reliable - https://phabricator.wikimedia.org/T245762#10693151 (10Samwalton9-WMF) 05Open→03Resolved Given the lack of evidence of a problem... 
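Note on the cache inspection idea from 10:28:09: per the nginx proxy_cache_path documentation, a cache file is named after the MD5 of the cache key, and the levels parameter takes directory names from the end of that hash. A rough sketch of locating the file for a given key; the levels value and the example key are assumptions, since the registry's actual proxy_cache_key and levels settings are not shown in this log:

```python
import hashlib
from pathlib import PurePosixPath


def nginx_cache_path(cache_root: str, cache_key: str, levels=(1, 2)) -> str:
    """Build the on-disk path nginx would use for cache_key.

    levels mirrors 'levels=1:2' in proxy_cache_path: each level directory
    is cut from the tail of the MD5 hex digest, working backwards.
    """
    digest = hashlib.md5(cache_key.encode("utf-8")).hexdigest()
    parts = []
    end = len(digest)
    for width in levels:
        parts.append(digest[end - width:end])
        end -= width
    return str(PurePosixPath(cache_root, *parts, digest))


# Hypothetical key in the default $scheme$proxy_host$request_uri shape;
# the real key used on the registry hosts is an assumption here.
key = "httpregistry-backend.example/v2/some/image/blobs/sha256:abc"
print(nginx_cache_path("/var/cache/nginx-docker-registry", key))
```

Hashing the file found at that path and comparing it with the digest in the request URI would confirm whether the cached copy is the bad one.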
[10:51:00] (03PS2) 10Giuseppe Lavagetto: mediawiki-global: add alerts for too many login attempts [alerts] - 10https://gerrit.wikimedia.org/r/1132580 [10:52:11] (03CR) 10CI reject: [V:04-1] mediawiki-global: add alerts for too many login attempts [alerts] - 10https://gerrit.wikimedia.org/r/1132580 (owner: 10Giuseppe Lavagetto) [10:54:51] 06SRE-OnFire, 06MediaWiki-Engineering, 06serviceops, 10Sustainability (Incident Followup): Reduce the amount of messages sent through channel:Memcached during failures - https://phabricator.wikimedia.org/T390529 (10jijiki) 03NEW [10:56:51] (03PS1) 10Kamila Součková: hiera: add aux-k8s-codfw to deployment_server [puppet] - 10https://gerrit.wikimedia.org/r/1132587 (https://phabricator.wikimedia.org/T381417) [10:58:32] (03CR) 10Elukey: [C:03+1] hiera: add aux-k8s-codfw to deployment_server [puppet] - 10https://gerrit.wikimedia.org/r/1132587 (https://phabricator.wikimedia.org/T381417) (owner: 10Kamila Součková) [10:58:51] (03CR) 10Kamila Součková: [C:03+2] hiera: add aux-k8s-codfw to deployment_server [puppet] - 10https://gerrit.wikimedia.org/r/1132587 (https://phabricator.wikimedia.org/T381417) (owner: 10Kamila Součková) [10:59:01] (03PS2) 10Muehlenhoff: Failover to idp2004 [dns] - 10https://gerrit.wikimedia.org/r/1132540 [11:01:15] jouncebot: next [11:01:15] In 1 hour(s) and 58 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250331T1300) [11:01:42] (03CR) 10Clément Goubert: [C:03+1] role::docker_registry_had::registry: disable nginx cache [puppet] - 10https://gerrit.wikimedia.org/r/1132582 (https://phabricator.wikimedia.org/T390251) (owner: 10Elukey) [11:01:51] makes sense to try without cache [11:02:26] claime: thanks, ok to roll it out? [11:02:34] namely, I'll do it but is it ok for you? 
[11:02:45] I think so yeah [11:02:49] (03PS1) 10Tiziano Fogli: tests: Add guidelines to avoid rebuilding the container on every change [alerts] - 10https://gerrit.wikimedia.org/r/1132590 [11:02:52] I mean I'm not planning on deploying anything [11:03:28] ack, so I'll stop puppet on the registry nodes [11:03:34] merge, deploy in eqiad, test quickly [11:03:37] and proceed to codfw [11:03:40] (03CR) 10Elukey: [V:03+1 C:03+2] role::docker_registry_had::registry: disable nginx cache [puppet] - 10https://gerrit.wikimedia.org/r/1132582 (https://phabricator.wikimedia.org/T390251) (owner: 10Elukey) [11:05:32] !log remove docker registry nginx cache settings from registry* - T390251 [11:05:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:05:38] T390251: docker-registry.wikimedia.org was serving a bad blob - https://phabricator.wikimedia.org/T390251 [11:06:59] (03CR) 10Jelto: [C:03+1] "thank you" [puppet] - 10https://gerrit.wikimedia.org/r/1132587 (https://phabricator.wikimedia.org/T381417) (owner: 10Kamila Součková) [11:07:36] !log elukey@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on registry1004.eqiad.wmnet with reason: maintenance [11:07:58] (03PS2) 10Tiziano Fogli: tests: Add guidelines to avoid rebuilding the container on every change [alerts] - 10https://gerrit.wikimedia.org/r/1132590 [11:08:01] (03PS3) 10Ayounsi: CloudCoreBGPDown: set severity to critical + scope network [alerts] - 10https://gerrit.wikimedia.org/r/1131011 (https://phabricator.wikimedia.org/T388641) [11:08:01] (03PS3) 10Ayounsi: Add alerts for network alarms [alerts] - 10https://gerrit.wikimedia.org/r/1132563 (https://phabricator.wikimedia.org/T388641) [11:08:01] (03PS1) 10Ayounsi: Make the interface error alert less sensitive [alerts] - 10https://gerrit.wikimedia.org/r/1132591 (https://phabricator.wikimedia.org/T388641) [11:08:47] (03CR) 10Kamila Součková: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131955 
(https://phabricator.wikimedia.org/T350794) (owner: 10Jelto) [11:08:58] of course it doesn't work [11:09:04] crap [11:09:14] not at all, or it's not disabling the cache? [11:09:28] (03CR) 10CI reject: [V:04-1] Make the interface error alert less sensitive [alerts] - 10https://gerrit.wikimedia.org/r/1132591 (https://phabricator.wikimedia.org/T388641) (owner: 10Ayounsi) [11:09:41] so in location = /auth/basic we use proxy_cache auth [11:09:47] so the flag didn't fix it [11:09:53] (03CR) 10CI reject: [V:04-1] Add alerts for network alarms [alerts] - 10https://gerrit.wikimedia.org/r/1132563 (https://phabricator.wikimedia.org/T388641) (owner: 10Ayounsi) [11:09:56] (03CR) 10CI reject: [V:04-1] CloudCoreBGPDown: set severity to critical + scope network [alerts] - 10https://gerrit.wikimedia.org/r/1131011 (https://phabricator.wikimedia.org/T388641) (owner: 10Ayounsi) [11:10:08] augh [11:10:24] so we can probably split the cache in two, auth cache (ok to keep) and blob cache [11:10:27] sending a patch now [11:10:48] wdyt claime ? [11:10:53] yep [11:10:57] sgtm [11:11:10] are you around for 10 mins? [11:11:14] yeah [11:12:08] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti4006.ulsfo.wmnet [11:12:56] (03CR) 10Muehlenhoff: [C:03+2] Failover to idp2004 [dns] - 10https://gerrit.wikimedia.org/r/1132540 (owner: 10Muehlenhoff) [11:13:07] !log jmm@dns1004 START - running authdns-update [11:13:59] 10ops-eqiad, 06DC-Ops, 10Data-Platform-SRE (2025.03.22 - 2025.04.11): Upgrade an-worker hard drives from 4TB to 8TB (group 1 - rack F7) - https://phabricator.wikimedia.org/T390168#10693231 (10BTullis) I have done the following on each of the three hosts: * Disabled puppet * Stopped the `hadoop-hdfs-datanode... 
[11:14:07] (03PS3) 10Tiziano Fogli: tests: Add guidelines to avoid rebuilding the container on every change [alerts] - 10https://gerrit.wikimedia.org/r/1132590 [11:14:51] 10ops-eqiad, 06DC-Ops, 10Data-Platform-SRE (2025.03.22 - 2025.04.11): Upgrade an-worker hard drives from 4TB to 8TB (group 1 - rack F7) - https://phabricator.wikimedia.org/T390168#10693232 (10BTullis) [11:15:04] (03CR) 10Filippo Giunchedi: [C:03+1] tests: Add guidelines to avoid rebuilding the container on every change [alerts] - 10https://gerrit.wikimedia.org/r/1132590 (owner: 10Tiziano Fogli) [11:15:21] !log jmm@dns1004 END - running authdns-update [11:15:59] (03CR) 10Tiziano Fogli: [C:03+2] "Thank you" [alerts] - 10https://gerrit.wikimedia.org/r/1132590 (owner: 10Tiziano Fogli) [11:16:33] FIRING: KubernetesCalicoDown: wikikube-worker1039.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s&var-instance=wikikube-worker1039.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [11:17:10] (03Merged) 10jenkins-bot: tests: Add guidelines to avoid rebuilding the container on every change [alerts] - 10https://gerrit.wikimedia.org/r/1132590 (owner: 10Tiziano Fogli) [11:19:06] 10ops-eqiad, 06DC-Ops, 10Data-Platform-SRE (2025.03.22 - 2025.04.11): Upgrade an-worker hard drives from 4TB to 8TB (group 1 - rack F7) - https://phabricator.wikimedia.org/T390168#10693235 (10BTullis) a:05BTullis→03VRiley-WMF @Jclark-ctr @VRiley-WMF - These three hosts are ready for a hard drive swap, wh... 
[11:20:35] 10ops-eqiad, 06DC-Ops, 10Data-Platform-SRE (2025.03.22 - 2025.04.11): Upgrade an-worker hard drives from 4TB to 8TB (group 1 - rack F7) - https://phabricator.wikimedia.org/T390168#10693242 (10Jclark-ctr) Acknowledge @btullis [11:20:54] (03PS4) 10Ayounsi: CloudCoreBGPDown: set severity to critical + scope network [alerts] - 10https://gerrit.wikimedia.org/r/1131011 (https://phabricator.wikimedia.org/T388641) [11:20:54] (03PS4) 10Ayounsi: Add alerts for network alarms [alerts] - 10https://gerrit.wikimedia.org/r/1132563 (https://phabricator.wikimedia.org/T388641) [11:20:54] (03PS2) 10Ayounsi: Make the interface error alert less sensitive [alerts] - 10https://gerrit.wikimedia.org/r/1132591 (https://phabricator.wikimedia.org/T388641) [11:21:08] elukey: the alert we received on -traffic about `FIRING: FermMSS: Unexpected MSS value on 10.2.2.44:443 @ registry1004 - https://wikitech.wikimedia.org/wiki/LVS#LVSRealserverMSS_alert - https://grafana.wikimedia.org/d/Y9-MQxNSk/ipip-encapsulated-services?orgId=1&viewPanel=4&var-site=eqiad&var-cluster=misc - https://alerts.wikimedia.org/?q=alertname%3DFermMSS` could be related to your work? 
[11:22:09] yeah probably, nginx is not happy [11:22:14] I am prepping a patch [11:22:17] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4006.ulsfo.wmnet [11:22:19] !log jmm@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ganeti4006.ulsfo.wmnet [11:24:26] (03PS5) 10Ayounsi: Add alerts for network alarms [alerts] - 10https://gerrit.wikimedia.org/r/1132563 (https://phabricator.wikimedia.org/T388641) [11:24:26] (03PS3) 10Ayounsi: Make the interface error alert less sensitive [alerts] - 10https://gerrit.wikimedia.org/r/1132591 (https://phabricator.wikimedia.org/T388641) [11:24:56] claime: patch incoming, a little more complicated than expected [11:25:09] elukey: no problem, lmk if you need my help [11:25:18] (03PS1) 10Elukey: docker_registry_ha: split blob and auth cache [puppet] - 10https://gerrit.wikimedia.org/r/1132595 (https://phabricator.wikimedia.org/T390251) [11:25:40] running pcc now [11:26:49] !log wikikube-worker1039.eqiad.wmnet - powercycling from idrac [11:26:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:27:26] FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-eqsin:xe-0/1/3 (Peering: SGIX (103.16.102.187) {#1152}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqsin:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [11:27:26] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (CORE_DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5175/co" [puppet] - 10https://gerrit.wikimedia.org/r/1132595 (https://phabricator.wikimedia.org/T390251) (owner: 10Elukey) [11:28:55] (03CR) 10Filippo Giunchedi: [C:03+1] Add alerts for network alarms [alerts] - 10https://gerrit.wikimedia.org/r/1132563 
(https://phabricator.wikimedia.org/T388641) (owner: 10Ayounsi) [11:29:12] (03PS2) 10Elukey: docker_registry_ha: split blob and auth cache [puppet] - 10https://gerrit.wikimedia.org/r/1132595 (https://phabricator.wikimedia.org/T390251) [11:31:04] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (CORE_DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5176/co" [puppet] - 10https://gerrit.wikimedia.org/r/1132595 (https://phabricator.wikimedia.org/T390251) (owner: 10Elukey) [11:31:22] (03CR) 10Ayounsi: [C:03+2] CloudCoreBGPDown: set severity to critical + scope network [alerts] - 10https://gerrit.wikimedia.org/r/1131011 (https://phabricator.wikimedia.org/T388641) (owner: 10Ayounsi) [11:31:33] RESOLVED: KubernetesCalicoDown: wikikube-worker1039.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s&var-instance=wikikube-worker1039.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [11:32:31] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node check for host wikikube-worker1039.eqiad.wmnet [11:32:31] claime: ready, I have some trouble with carriage returns but I will stop fighting with erb :D [11:32:32] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) check for host wikikube-worker1039.eqiad.wmnet [11:32:32] (03Merged) 10jenkins-bot: CloudCoreBGPDown: set severity to critical + scope network [alerts] - 10https://gerrit.wikimedia.org/r/1131011 (https://phabricator.wikimedia.org/T388641) (owner: 10Ayounsi) [11:35:44] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti4006.ulsfo.wmnet with OS bookworm [11:35:52] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update Ganeti servers in ulsfo to Bookworm - https://phabricator.wikimedia.org/T382511#10693261 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage 
was started by jmm@cumin2002 for host ganeti4006.ulsfo.wmnet with OS bookworm [11:36:36] (03CR) 10Ayounsi: [C:03+1] "I'm no expert but logic sounds good" [puppet] - 10https://gerrit.wikimedia.org/r/1132581 (https://phabricator.wikimedia.org/T388641) (owner: 10Filippo Giunchedi) [11:37:31] 06SRE-OnFire, 06Release-Engineering-Team, 06serviceops, 10Sustainability (Incident Followup): Should scap be able to update helmfile-defaults when -Dbuild_mw_container_image:False ? - https://phabricator.wikimedia.org/T390531 (10jijiki) 03NEW [11:37:34] (03CR) 10Clément Goubert: [C:03+1] docker_registry_ha: split blob and auth cache [puppet] - 10https://gerrit.wikimedia.org/r/1132595 (https://phabricator.wikimedia.org/T390251) (owner: 10Elukey) [11:38:01] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: InboundInterfaceErrors - https://phabricator.wikimedia.org/T389884#10693273 (10ayounsi) 05Open→03Invalid The alert was too sensitive, I made https://gerrit.wikimedia.org/r/c/operations/alerts/+/1132591 to improve it. [11:38:12] (03CR) 10Elukey: [V:03+1 C:03+2] docker_registry_ha: split blob and auth cache [puppet] - 10https://gerrit.wikimedia.org/r/1132595 (https://phabricator.wikimedia.org/T390251) (owner: 10Elukey) [11:38:22] 06SRE-OnFire, 06Release-Engineering-Team, 06serviceops, 10Sustainability (Incident Followup): Should scap be able to update helmfile-defaults when -Dbuild_mw_container_image:False ? 
- https://phabricator.wikimedia.org/T390531#10693275 (10jijiki) [11:38:30] (03CR) 10Ayounsi: [C:03+2] "Thx" [alerts] - 10https://gerrit.wikimedia.org/r/1132563 (https://phabricator.wikimedia.org/T388641) (owner: 10Ayounsi) [11:38:39] (03CR) 10Filippo Giunchedi: [C:03+2] alertmanager: test per-summary tasks for network alerts [puppet] - 10https://gerrit.wikimedia.org/r/1132581 (https://phabricator.wikimedia.org/T388641) (owner: 10Filippo Giunchedi) [11:39:40] (03CR) 10Filippo Giunchedi: [C:03+1] Make the interface error alert less sensitive [alerts] - 10https://gerrit.wikimedia.org/r/1132591 (https://phabricator.wikimedia.org/T388641) (owner: 10Ayounsi) [11:39:41] (03Merged) 10jenkins-bot: Add alerts for network alarms [alerts] - 10https://gerrit.wikimedia.org/r/1132563 (https://phabricator.wikimedia.org/T388641) (owner: 10Ayounsi) [11:40:04] 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: No Juniper alarms in SNMP for MX204 - https://phabricator.wikimedia.org/T241105#10693284 (10ayounsi) Closing this task as we now have alerting for all the MX running a not too old Junos (and we're upgrading Junos in T364092). 
[11:40:08] 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: No Juniper alarms in SNMP for MX204 - https://phabricator.wikimedia.org/T241105#10693287 (10ayounsi) 05Stalled→03Resolved a:03ayounsi [11:41:16] (03CR) 10Ayounsi: [C:03+2] Make the interface error alert less sensitive [alerts] - 10https://gerrit.wikimedia.org/r/1132591 (https://phabricator.wikimedia.org/T388641) (owner: 10Ayounsi) [11:41:52] (03PS4) 10Ayounsi: gNMIc: collect BFD states [puppet] - 10https://gerrit.wikimedia.org/r/1131320 (https://phabricator.wikimedia.org/T388641) [11:42:05] (03PS1) 10Marostegui: db1256: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1132601 [11:42:27] (03Merged) 10jenkins-bot: Make the interface error alert less sensitive [alerts] - 10https://gerrit.wikimedia.org/r/1132591 (https://phabricator.wikimedia.org/T388641) (owner: 10Ayounsi) [11:42:31] FIRING: [4x] ProbeDown: Service ripe-atlas-codfw:0 has failed probes (icmp_ripe_atlas_codfw_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:42:34] (03CR) 10Marostegui: [C:03+2] db1256: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1132601 (owner: 10Marostegui) [11:42:58] (03PS1) 10Marostegui: Revert "db1256: Enable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1132602 [11:43:04] (03CR) 10Marostegui: [C:04-2] "Not yet" [puppet] - 10https://gerrit.wikimedia.org/r/1132602 (owner: 10Marostegui) [11:48:49] (03PS1) 10Filippo Giunchedi: alertmanager: fixup dcops-task-network task title [puppet] - 10https://gerrit.wikimedia.org/r/1132607 (https://phabricator.wikimedia.org/T388641) [11:49:12] (03CR) 10Filippo Giunchedi: [V:03+2 C:03+2] alertmanager: fixup dcops-task-network task title [puppet] - 10https://gerrit.wikimedia.org/r/1132607 (https://phabricator.wikimedia.org/T388641) (owner: 
10Filippo Giunchedi) [11:49:23] !log mvernon@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on ms-be1066.eqiad.wmnet with reason: vacuum overlarge container dbs [11:49:27] (03CR) 10Slyngshede: [C:03+1] "LGTM" [software/bitu] - 10https://gerrit.wikimedia.org/r/1132019 (owner: 10Hashar) [11:49:30] 06SRE, 10SRE-swift-storage: Disk near-full warnings on ms swift backends for container filesystems due to some bloated sqlite files - https://phabricator.wikimedia.org/T377827#10693301 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=971d7032-000d-493f-b033-a0b4543e34c6) set by mvernon@cumin... [11:51:23] ok so I have updated registry100* and I tested authentication for /restricted/, all works [11:51:45] !log VACUUM large container dbs on ms-be1066 T377827 [11:51:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:51:50] T377827: Disk near-full warnings on ms swift backends for container filesystems due to some bloated sqlite files - https://phabricator.wikimedia.org/T377827 [11:52:08] (03PS1) 10Filippo Giunchedi: alertmanager: fixup #2 dcops-task-network receiver [puppet] - 10https://gerrit.wikimedia.org/r/1132608 (https://phabricator.wikimedia.org/T388641) [11:53:52] 06SRE-OnFire, 06MediaWiki-Engineering, 06serviceops-radar, 10Sustainability (Incident Followup): Reduce the amount of messages sent through channel:Memcached during failures - https://phabricator.wikimedia.org/T390529#10693326 (10jijiki) [11:53:59] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti4006.ulsfo.wmnet with reason: host reimage [11:54:04] (03CR) 10Slyngshede: [C:03+1] "Looks good. We're on Python 3.11, so removeprefix is available." 
[software/bitu] - 10https://gerrit.wikimedia.org/r/1131991 (owner: 10Hashar) [11:54:33] (03PS2) 10Filippo Giunchedi: alertmanager: fixup #2 dcops-task-network receiver [puppet] - 10https://gerrit.wikimedia.org/r/1132608 (https://phabricator.wikimedia.org/T388641) [11:54:44] (03CR) 10Filippo Giunchedi: [V:03+2 C:03+2] alertmanager: fixup #2 dcops-task-network receiver [puppet] - 10https://gerrit.wikimedia.org/r/1132608 (https://phabricator.wikimedia.org/T388641) (owner: 10Filippo Giunchedi) [11:55:31] dcausse: o/ [11:55:42] if you have time we can retry the deployment [11:56:29] (03CR) 10Slyngshede: [C:03+1] "Looks good, that does read a lot better." [software/bitu] - 10https://gerrit.wikimedia.org/r/1131460 (owner: 10Hashar) [11:56:44] (03CR) 10Jelto: "thank you Kamila for looking into the CI issue!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131955 (https://phabricator.wikimedia.org/T350794) (owner: 10Jelto) [11:56:54] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti4006.ulsfo.wmnet with reason: host reimage [12:02:15] !log marostegui@cumin1002 END (ERROR) - Cookbook sre.mysql.clone (exit_code=97) of db1211.eqiad.wmnet onto db1256.eqiad.wmnet [12:02:17] 10ops-eqiad, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390535#10693360 (10ayounsi) [12:02:29] (03CR) 10Marostegui: Revert "db1256: Enable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1132602 (owner: 10Marostegui) [12:02:29] (03CR) 10Marostegui: [C:03+2] Revert "db1256: Enable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1132602 (owner: 10Marostegui) [12:05:02] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10693365 (10Clement_Goubert) >>! In T384970#10688038, @Jhancock.wm wrote: > @Clement_Goubert i finished all but one server (2331). 
Luca is trying to... [12:05:42] jouncebot: nowandnext [12:05:42] No deployments scheduled for the next 0 hour(s) and 54 minute(s) [12:05:42] In 0 hour(s) and 54 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250331T1300) [12:05:45] FIRING: [2x] CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [12:08:15] !log Deploying 1131037 mw::periodic_job: Migrate blameStartupRegistry.php - T388540 [12:08:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:08:19] T388540: Migrate "startupregistrystats" maintenance script to k8s-mw-cron (mediawiki-platform-team) - https://phabricator.wikimedia.org/T388540 [12:08:47] (03CR) 10Clément Goubert: [C:03+2] mw::periodic_job: Migrate blameStartupRegistry.php [puppet] - 10https://gerrit.wikimedia.org/r/1131037 (https://phabricator.wikimedia.org/T388540) (owner: 10Clément Goubert) [12:10:41] elukey: I'm around, will try to finish deploying the change from this morning [12:10:45] FIRING: [3x] CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [12:11:39] !log set "graceful sender" option on cr2-drmrs to drain for JunOS upgrade T364092 [12:11:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:11:44] T364092: Upgrade core routers to Junos 23.4R2 - https://phabricator.wikimedia.org/T364092 [12:12:18] !log dcausse@deploy1003 Started scap sync-world: Backport for [[gerrit:1132423|Translate: fix elasticsearch cluster setup (T390244)]] [12:12:23] 
T390244: InvalidArgumentException: Default TTM service eqiad cannot be write only - https://phabricator.wikimedia.org/T390244 [12:13:29] (03CR) 10Klausman: [C:03+2] api-gateway: enable anonymous reqs to edit check staging and update name [deployment-charts] - 10https://gerrit.wikimedia.org/r/1132534 (https://phabricator.wikimedia.org/T388269) (owner: 10Ilias Sarantopoulos) [12:14:37] (03PS1) 10Clément Goubert: mw::periodic_job: Fix blameStartupRegistry.php timing [puppet] - 10https://gerrit.wikimedia.org/r/1132614 (https://phabricator.wikimedia.org/T388540) [12:15:07] (03Merged) 10jenkins-bot: api-gateway: enable anonymous reqs to edit check staging and update name [deployment-charts] - 10https://gerrit.wikimedia.org/r/1132534 (https://phabricator.wikimedia.org/T388269) (owner: 10Ilias Sarantopoulos) [12:15:07] (03PS1) 10Gergő Tisza: OATHAuth: Mark centralnoticeadmin as requiring 2FA [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1132615 (https://phabricator.wikimedia.org/T208113) [12:15:38] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 31 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1132615 (https://phabricator.wikimedia.org/T208113) (owner: 10Gergő Tisza) [12:15:38] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti4006.ulsfo.wmnet with OS bookworm [12:15:43] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update Ganeti servers in ulsfo to Bookworm - https://phabricator.wikimedia.org/T382511#10693414 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti4006.ulsfo.wmnet with OS bookworm completed: - ganeti4006 (**PASS*... 
[12:15:45] RESOLVED: [3x] CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [12:15:48] sync-testservers-k8s worked [12:16:17] (03CR) 10Alexandros Kosiaris: [C:03+1] role::docker_registry_had::registry: disable nginx cache [puppet] - 10https://gerrit.wikimedia.org/r/1132582 (https://phabricator.wikimedia.org/T390251) (owner: 10Elukey) [12:16:36] !log dcausse@deploy1003 dcausse: Backport for [[gerrit:1132423|Translate: fix elasticsearch cluster setup (T390244)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [12:16:52] dcausse: \o/ [12:17:14] (03CR) 10Clément Goubert: [C:03+2] mw::periodic_job: Fix blameStartupRegistry.php timing [puppet] - 10https://gerrit.wikimedia.org/r/1132614 (https://phabricator.wikimedia.org/T388540) (owner: 10Clément Goubert) [12:17:26] (03CR) 10Alexandros Kosiaris: [C:03+1] docker_registry_ha: split blob and auth cache [puppet] - 10https://gerrit.wikimedia.org/r/1132595 (https://phabricator.wikimedia.org/T390251) (owner: 10Elukey) [12:17:47] dcausse: good to know, thanks! 
[12:18:08] !log dcausse@deploy1003 dcausse: Continuing with sync [12:19:07] (03PS1) 10Federico Ceratto: clone.py: skip dbctl addition on --nopool [cookbooks] - 10https://gerrit.wikimedia.org/r/1132618 (https://phabricator.wikimedia.org/T390217) [12:19:57] (03CR) 10Federico Ceratto: "(still to be tested with dry run)" [cookbooks] - 10https://gerrit.wikimedia.org/r/1132618 (https://phabricator.wikimedia.org/T390217) (owner: 10Federico Ceratto) [12:20:12] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1132619 [12:20:25] !log klausman@deploy1003 helmfile [staging] START helmfile.d/services/api-gateway: apply [12:20:44] !log klausman@deploy1003 helmfile [staging] DONE helmfile.d/services/api-gateway: apply [12:23:29] !log cgoubert@deploy1003 helmfile [codfw] START helmfile.d/services/mw-cron: apply [12:23:32] !log cgoubert@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-cron: apply [12:23:38] !log cgoubert@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-cron: apply [12:23:41] !log cgoubert@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply [12:25:20] !log ayounsi@cumin1002 START - Cookbook sre.dns.netbox [12:25:56] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti4006.ulsfo.wmnet [12:27:57] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:28:16] !log dcausse@deploy1003 Finished scap sync-world: Backport for [[gerrit:1132423|Translate: fix elasticsearch cluster setup (T390244)]] (duration: 15m 57s) [12:28:20] \o/ [12:28:20] T390244: InvalidArgumentException: Default TTM service eqiad cannot be write only - https://phabricator.wikimedia.org/T390244 [12:28:47] all good, thanks for investigating and fixing this issue akosiaris, elukey! 
:) [12:29:00] !log cmooney@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 8 hosts with reason: Upgrade cr1-drmrs JunOS [12:29:12] (03PS1) 10Marostegui: zarcillo.sql: Add zarcillo schema [software] - 10https://gerrit.wikimedia.org/r/1132626 [12:30:03] (03CR) 10Marostegui: [C:03+2] zarcillo.sql: Add zarcillo schema [software] - 10https://gerrit.wikimedia.org/r/1132626 (owner: 10Marostegui) [12:30:32] (03Merged) 10jenkins-bot: zarcillo.sql: Add zarcillo schema [software] - 10https://gerrit.wikimedia.org/r/1132626 (owner: 10Marostegui) [12:33:26] RESOLVED: CoreRouterInterfaceDown: Core router interface down - cr2-eqsin:xe-0/1/3 (Peering: SGIX (103.16.102.187) {#1152}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqsin:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [12:33:46] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4006.ulsfo.wmnet [12:34:49] (03PS1) 10Clément Goubert: mediawiki: Fix CronJob definitions [deployment-charts] - 10https://gerrit.wikimedia.org/r/1132627 (https://phabricator.wikimedia.org/T341555) [12:37:43] (03CR) 10Clément Goubert: [C:03+2] mediawiki: Fix CronJob definitions [deployment-charts] - 10https://gerrit.wikimedia.org/r/1132627 (https://phabricator.wikimedia.org/T341555) (owner: 10Clément Goubert) [12:39:36] !log jmm@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ganeti4005 [12:39:52] !log jmm@cumin2002 END (FAIL) - Cookbook sre.network.configure-switch-interfaces (exit_code=99) for host ganeti4005 [12:39:56] !log jmm@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ganeti4006 [12:40:37] !log jmm@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ganeti4006 [12:41:04] !log aqu@deploy1003 
Started deploy [airflow-dags/analytics_test@040c3ab]: Update artifacts for analytics_test [12:41:20] !log aqu@deploy1003 Finished deploy [airflow-dags/analytics_test@040c3ab]: Update artifacts for analytics_test (duration: 00m 16s) [12:41:54] !log cgoubert@deploy1003 Started scap sync-world: Deploy mediawiki chart 0.8.5 [12:42:15] log cgoubert@deploy1003 Started scap sync-world: Deploy mediawiki chart 0.8.6 [12:42:19] !log cgoubert@deploy1003 Started scap sync-world: Deploy mediawiki chart 0.8.6 [12:42:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:42:43] (03PS2) 10Federico Ceratto: clone.py: skip dbctl addition on --nopool [cookbooks] - 10https://gerrit.wikimedia.org/r/1132618 (https://phabricator.wikimedia.org/T390217) [12:42:44] !log cgoubert@deploy1003 cgoubert: Deploy mediawiki chart 0.8.5 synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [12:42:49] !log cgoubert@deploy1003 cgoubert: Continuing with sync [12:43:02] (deployment is a no-op for everything but mw-cron) [12:43:59] !log cgoubert@deploy1003 Finished scap sync-world: Deploy mediawiki chart 0.8.5 (duration: 02m 19s) [12:44:12] !log cgoubert@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-cron: apply [12:44:19] (03CR) 10DCausse: [C:03+2] cirrus-streaming-updater: stop consuming from legacy streams [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124486 (https://phabricator.wikimedia.org/T375821) (owner: 10DCausse) [12:44:25] !log cgoubert@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply [12:44:35] !log klausman@deploy1003 helmfile [codfw] START helmfile.d/services/api-gateway: apply [12:44:41] !log cgoubert@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-cron: apply [12:44:45] !log cgoubert@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply [12:45:03] !log klausman@deploy1003 helmfile [codfw] DONE helmfile.d/services/api-gateway: apply [12:45:40] (03Merged) 10jenkins-bot: 
cirrus-streaming-updater: stop consuming from legacy streams [deployment-charts] - 10https://gerrit.wikimedia.org/r/1124486 (https://phabricator.wikimedia.org/T375821) (owner: 10DCausse) [12:45:49] !log mvernon@cumin1002 START - Cookbook sre.hosts.remove-downtime for ms-be1066.eqiad.wmnet [12:45:50] !log mvernon@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for ms-be1066.eqiad.wmnet [12:47:32] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti4006.ulsfo.wmnet to cluster ulsfo and group 1 [12:48:00] 06SRE, 10SRE-swift-storage: Disk near-full warnings on ms swift backends for container filesystems due to some bloated sqlite files - https://phabricator.wikimedia.org/T377827#10693498 (10MatthewVernon) ms-be1066 alerted for disk-near-full again; I took a broader approach to vacuuming this time: `lang=bash set... [12:48:23] !log dcausse@deploy1003 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [12:48:40] !log dcausse@deploy1003 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [12:51:12] (03CR) 10Federico Ceratto: [C:03+1] "Thanks!" [software] - 10https://gerrit.wikimedia.org/r/1132626 (owner: 10Marostegui) [12:52:01] !log reboot cr2-drmrs to upgrade JunOS T364092 [12:52:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:52:05] T364092: Upgrade core routers to Junos 23.4R2 - https://phabricator.wikimedia.org/T364092 [12:52:10] (03PS1) 10Clément Goubert: mediawiki: Add labels to CronJobs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1132630 (https://phabricator.wikimedia.org/T341555) [12:53:17] (03CR) 10CI reject: [V:04-1] mediawiki: Add labels to CronJobs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1132630 (https://phabricator.wikimedia.org/T341555) (owner: 10Clément Goubert) [12:54:05] (03CR) 10Marostegui: [C:03+2] "Merged already!" 
[software] - 10https://gerrit.wikimedia.org/r/1132626 (owner: 10Marostegui) [12:54:15] (03PS4) 10Esanders: VE: Enable mobile insert menu everywhere except top 10 wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128374 [12:54:24] (03CR) 10Marostegui: "I will test this as I need to clone a host that won't be pooled." [cookbooks] - 10https://gerrit.wikimedia.org/r/1132618 (https://phabricator.wikimedia.org/T390217) (owner: 10Federico Ceratto) [12:54:53] (03PS5) 10Esanders: VE: Enable mobile insert menu everywhere except top 10 wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128374 (https://phabricator.wikimedia.org/T388604) [12:55:17] (03PS2) 10Clément Goubert: mediawiki: Add labels to CronJobs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1132630 (https://phabricator.wikimedia.org/T341555) [12:55:24] (03PS6) 10Esanders: VE: Enable mobile insert menu everywhere except top 10 wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128374 (https://phabricator.wikimedia.org/T388604) [12:55:52] !log dcausse@deploy1003 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [12:56:06] !log dcausse@deploy1003 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [12:57:38] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti4006.ulsfo.wmnet to cluster ulsfo and group 1 [12:57:57] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update Ganeti servers in ulsfo to Bookworm - https://phabricator.wikimedia.org/T382511#10693509 (10MoritzMuehlenhoff) [12:58:09] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4007.ulsfo.wmnet [12:58:19] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update Ganeti servers in ulsfo to Bookworm - https://phabricator.wikimedia.org/T382511#10693511 (10ops-monitoring-bot) Draining ganeti4007.ulsfo.wmnet of running VMs [12:59:43] !log jmm@cumin2002 END (PASS) - Cookbook 
sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4007.ulsfo.wmnet [13:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: Time to snap out of that daydream and deploy UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250331T1300). [13:00:05] bpirkle, stephanebisson, and tgr: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:12] I'm here [13:00:12] o/ [13:00:13] !log klausman@deploy1003 helmfile [eqiad] START helmfile.d/services/api-gateway: apply [13:00:25] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4007.ulsfo.wmnet [13:00:38] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update Ganeti servers in ulsfo to Bookworm - https://phabricator.wikimedia.org/T382511#10693515 (10ops-monitoring-bot) Draining ganeti4007.ulsfo.wmnet of running VMs [13:00:45] !log dcausse@deploy1003 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [13:00:49] !log klausman@deploy1003 helmfile [eqiad] DONE helmfile.d/services/api-gateway: apply [13:00:49] !log dcausse@deploy1003 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [13:02:07] !log dcausse@deploy1003 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [13:02:11] !log dcausse@deploy1003 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [13:04:57] Is a deployer available? 
[13:05:40] o/ [13:06:14] !log dcausse@deploy1003 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [13:06:54] !log dcausse@deploy1003 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [13:07:10] I suppose I can deploy [13:07:49] (03PS5) 10Ayounsi: gNMIc: collect BFD states [puppet] - 10https://gerrit.wikimedia.org/r/1131320 (https://phabricator.wikimedia.org/T388641) [13:08:11] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131384 (https://phabricator.wikimedia.org/T389407) (owner: 10BPirkle) [13:08:25] (03CR) 10Ayounsi: "Thanks, updated. And paste updated as well." [puppet] - 10https://gerrit.wikimedia.org/r/1131320 (https://phabricator.wikimedia.org/T388641) (owner: 10Ayounsi) [13:08:49] 06SRE, 06Infrastructure-Foundations, 10netops: Upgrade core routers to Junos 23.4R2 - https://phabricator.wikimedia.org/T364092#10693542 (10cmooney) [13:09:00] (03Merged) 10jenkins-bot: REST: Enable REST Sandbox on an initial set of production wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131384 (https://phabricator.wikimedia.org/T389407) (owner: 10BPirkle) [13:09:13] !log tgr@deploy1003 Started scap sync-world: Backport for [[gerrit:1131384|REST: Enable REST Sandbox on an initial set of production wikis (T389407)]] [13:09:19] T389407: Release REST API Sandbox on 6 initial wikis - https://phabricator.wikimedia.org/T389407 [13:10:13] (03PS4) 10Federico Ceratto: upgrade.py: Depool, repool, update Phabricator [cookbooks] - 10https://gerrit.wikimedia.org/r/1130644 (https://phabricator.wikimedia.org/T389805) [13:12:56] FIRING: CirrusConsumerCloudelasticFlinkJobNotRunning: ... 
[13:13:02] cirrus_streaming_updater_cloudelastic_consumer in eqiad (k8s) is not running - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerCloudelasticFlinkJobNotRunning [13:13:17] FIRING: NELHigh: Elevated Network Error Logging events (tcp.timed_out) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh [13:13:33] <_joe_> !incidents [13:13:33] 5918 (UNACKED) NELHigh sre (thanos-rule tcp.timed_out) [13:13:34] 5917 (RESOLVED) [2x] SessionStoreErrorRateHigh data-persistence () [13:13:40] <_joe_> !ack 5918 [13:13:40] 5918 (ACKED) NELHigh sre (thanos-rule tcp.timed_out) [13:13:44] <_joe_> looking [13:14:12] _joe_: france, related to the router upgrade in drmrs maybe ? 
(cc topranks) [13:14:15] !log dcausse@deploy1003 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [13:14:16] !log tgr@deploy1003 bpirkle, tgr: Backport for [[gerrit:1131384|REST: Enable REST Sandbox on an initial set of production wikis (T389407)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:14:23] T389407: Release REST API Sandbox on 6 initial wikis - https://phabricator.wikimedia.org/T389407 [13:14:31] !log dcausse@deploy1003 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [13:15:21] FIRING: [5x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:xe-1/1/1:0 (Transport: cr4-ulsfo:xe-0/1/1 (Lumen, 442550294) {#12252_12295-1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [13:15:28] XioNoX: not impossible, seemed to go quite smooth though [13:15:39] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr2-codfw and Lumen (2001:1900:2100::4b41) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [13:15:43] _joe_: could that be a longer term thing? 
We have seen a few complaints over the weekend about people losing their sessions all the time [13:15:45] FIRING: [2x] CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s): fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [13:15:59] topranks: anyway, it already went down [13:16:02] <_joe_> XioNoX: i think it was related [13:16:12] <_joe_> looking at the provenance of the nels [13:16:31] (although that could also be SUL3 or just CentralAuth sessions being slightly unreliable in general) [13:16:51] <_joe_> tgr_: sorry looking at a page rn :) but yes [13:16:54] !log dcausse@deploy1003 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [13:16:59] (03CR) 10CI reject: [V:04-1] upgrade.py: Depool, repool, update Phabricator [cookbooks] - 10https://gerrit.wikimedia.org/r/1130644 (https://phabricator.wikimedia.org/T389805) (owner: 10Federico Ceratto) [13:17:14] !log dcausse@deploy1003 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [13:17:45] FIRING: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnsta [13:17:56] RESOLVED: CirrusConsumerCloudelasticFlinkJobNotRunning: ... 
[13:18:02] cirrus_streaming_updater_cloudelastic_consumer in eqiad (k8s) is not running - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerCloudelasticFlinkJobNotRunning [13:18:08] yeah it seems related in terms of countries [13:18:17] RESOLVED: NELHigh: Elevated Network Error Logging events (tcp.timed_out) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh [13:19:05] seems like it matches timewise with when the router came back online and we re-established BGP sessions [13:19:14] <_joe_> amazing, i can say for once "it was a network blip" [13:19:15] !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster relforge: apply updated master config - bking@cumin2002 - T390100 [13:19:16] !log bking@cumin2002 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster relforge: apply updated master config - bking@cumin2002 - T390100 [13:19:20] T390100: Build and deploy updated opensearch plugins deb - https://phabricator.wikimedia.org/T390100 [13:19:36] bpirkle: do you need to test the sandbox? 
[13:19:37] which was at 13:06 UTC [13:19:57] Looks fine, please proceed [13:20:18] !log tgr@deploy1003 bpirkle, tgr: Continuing with sync [13:20:39] _joe_: yeah, sorry for the noise [13:20:45] RESOLVED: [2x] CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s): fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [13:21:13] <_joe_> topranks: well better this way than if it was an outage :) [13:22:02] FIRING: [2x] DatasourceNoData: - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData [13:22:30] !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Banning hosts: relforge1003* for ban relforge1003 prior to service restart - bking@cumin2002 - T390100 [13:22:31] !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: relforge1003* for ban relforge1003 prior to service restart - bking@cumin2002 - T390100 [13:24:39] FIRING: CirrusSearchUpdaterKafkaMessagesInTooLow: ... [13:24:39] The summed message update rate of `(eqiad|codfw).cirrussearch.update_pipeline.update.v1` is too low - TODO - https://grafana.wikimedia.org/d/000000234/kafka-by-topic?orgId=1&var-dc=codfw%2520prometheus%252Fops&var-kafka_cluster=main-eqiad&var-kafka_broker=All&from=now-1h&to=now&refresh=5m&var-topic=codfw.cirrussearch.update_pipeline.update.v1&var-topic=eqiad.cirrussearch.update_pipeline.update.v1&viewPanel=6 - ... 
[13:24:39] https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchUpdaterKafkaMessagesInTooLow [13:24:56] FIRING: CirrusProducerFlinkJobNotRunning: cirrus_streaming_updater_producer in eqiad (k8s) is not running - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=producer - https://alerts.wikimedia.org/?q=alertname%3DCirrusProducerFlinkJobNotRunning [13:24:56] 10ops-codfw, 06SRE, 06DC-Ops: InboundInterfaceErrors - https://phabricator.wikimedia.org/T390008#10693608 (10Jhancock.wm) @papaul this one keeps re-alerting for brief periods of time. [13:25:32] !log dcausse@deploy1003 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [13:25:52] !log dcausse@deploy1003 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [13:26:14] !log dcausse@deploy1003 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [13:26:23] !log dcausse@deploy1003 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [13:27:15] !log dcausse@deploy1003 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [13:27:27] !log dcausse@deploy1003 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [13:27:34] !log tgr@deploy1003 Finished scap sync-world: Backport for [[gerrit:1131384|REST: Enable REST Sandbox on an initial set of production wikis (T389407)]] (duration: 18m 21s) [13:27:36] (03PS3) 10Clément Goubert: mediawiki: Add labels to CronJobs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1132630 (https://phabricator.wikimedia.org/T341555) [13:27:36] (03PS1) 10Clément Goubert: mediawiki: Fix jobConfig scope rewrite [deployment-charts] - 10https://gerrit.wikimedia.org/r/1132636 (https://phabricator.wikimedia.org/T341555) [13:27:39] T389407: Release REST API Sandbox on 6 initial wikis - 
https://phabricator.wikimedia.org/T389407 [13:28:29] tgr_: thank you [13:28:42] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1132020 (https://phabricator.wikimedia.org/T384220) (owner: 10Gergő Tisza) [13:28:42] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1132615 (https://phabricator.wikimedia.org/T208113) (owner: 10Gergő Tisza) [13:28:42] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131990 (https://phabricator.wikimedia.org/T390300) (owner: 10Nik Gkountas) [13:29:39] FIRING: [2x] CirrusSearchUpdaterKafkaMessagesInTooLow: The summed message update rate of `(eqiad|codfw).cirrussearch.update_pipeline.update.v1` is too low - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchUpdaterKafkaMessagesInTooLow [13:29:40] (03PS1) 10Filippo Giunchedi: alertmanager: open dcops tasks with title as summary [puppet] - 10https://gerrit.wikimedia.org/r/1132637 (https://phabricator.wikimedia.org/T388641) [13:29:40] FIRING: SLOMetricAbsent: - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [13:29:42] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390535#10693617 (10phaultfinder) [13:29:50] RESOLVED: CirrusProducerFlinkJobNotRunning: cirrus_streaming_updater_producer in eqiad (k8s) is not running - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=producer - https://alerts.wikimedia.org/?q=alertname%3DCirrusProducerFlinkJobNotRunning [13:30:13] !log volans@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:05:00 on 
sretest1002.eqiad.wmnet with reason: Test [13:30:14] (03Merged) 10jenkins-bot: Enable SUL3 everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1132020 (https://phabricator.wikimedia.org/T384220) (owner: 10Gergő Tisza) [13:30:16] (03Merged) 10jenkins-bot: OATHAuth: Mark centralnoticeadmin as requiring 2FA [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1132615 (https://phabricator.wikimedia.org/T208113) (owner: 10Gergő Tisza) [13:30:21] (03Merged) 10jenkins-bot: SpecialTranslationTargetLanguages: Use cxserver-supported language codes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131990 (https://phabricator.wikimedia.org/T390300) (owner: 10Nik Gkountas) [13:30:31] !log dcausse@deploy1003 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [13:30:34] !log tgr@deploy1003 Started scap sync-world: Backport for [[gerrit:1132020|Enable SUL3 everywhere (T384220)]], [[gerrit:1132615|OATHAuth: Mark centralnoticeadmin as requiring 2FA (T208113)]], [[gerrit:1131990|SpecialTranslationTargetLanguages: Use cxserver-supported language codes (T390300)]] [13:30:36] !log dcausse@deploy1003 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [13:30:41] T384220: SUL3 Phase 5: Staged rollout for all temporary accounts - https://phabricator.wikimedia.org/T384220 [13:30:41] T390300: SX mobile frequent languages entrypoint not working properly with special language codes - https://phabricator.wikimedia.org/T390300 [13:30:57] !log dcausse@deploy1003 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [13:31:05] !log dcausse@deploy1003 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [13:31:24] !log dcausse@deploy1003 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [13:32:10] (03CR) 10Filippo Giunchedi: "This is an experiment to get more meaningful tasks open for dcops. 
Namely the task title will be the alert's "summary" annotation, which w" [puppet] - 10https://gerrit.wikimedia.org/r/1132637 (https://phabricator.wikimedia.org/T388641) (owner: 10Filippo Giunchedi) [13:32:26] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [13:32:45] FIRING: [2x] CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable [13:33:31] !log dcausse@deploy1003 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [13:33:37] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [13:34:39] RESOLVED: [2x] CirrusSearchUpdaterKafkaMessagesInTooLow: The summed message update rate of `(eqiad|codfw).cirrussearch.update_pipeline.update.v1` is too low - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchUpdaterKafkaMessagesInTooLow [13:34:59] 10ops-codfw, 06SRE, 06DC-Ops: InboundInterfaceErrors - https://phabricator.wikimedia.org/T390008#10693660 (10fgiunchedi) FYI should be fixed / mitigated by https://gerrit.wikimedia.org/r/1132591 [13:35:43] !log tgr@deploy1003 ngkountas, tgr: Backport for [[gerrit:1132020|Enable SUL3 everywhere (T384220)]], [[gerrit:1132615|OATHAuth: Mark centralnoticeadmin as requiring 2FA (T208113)]], [[gerrit:1131990|SpecialTranslationTargetLanguages: Use cxserver-supported language codes (T390300)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:35:49] T384220: SUL3 Phase 5: Staged rollout for all temporary accounts - https://phabricator.wikimedia.org/T384220 [13:35:50] T390300: SX mobile frequent languages entrypoint not working properly with special language codes - https://phabricator.wikimedia.org/T390300 [13:37:45] FIRING: [3x] 
CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable [13:38:43] stephanebisson: ^ [13:39:30] jouncebot: now and next [13:39:30] For the next 0 hour(s) and 20 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250331T1300) [13:39:41] RESOLVED: SLOMetricAbsent: - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [13:41:33] 06SRE, 06MediaWiki-Platform-Team: Identify and remediate large increase in sessionstore Cassandra disk usage - https://phabricator.wikimedia.org/T390514#10693687 (10Tgr) I assume the sawtooth pattern is the artifact of some OS optimization process? [13:41:53] I'll wait for the backport window then go ahead with https://gerrit.wikimedia.org/r/c/operations/puppet/+/1131301 [13:42:02] RESOLVED: [2x] DatasourceNoData: - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData [13:42:56] tgr_ all good [13:43:20] 06SRE, 06MediaWiki-Platform-Team: Identify and remediate large increase in sessionstore Cassandra disk usage - https://phabricator.wikimedia.org/T390514#10693709 (10Tgr) [13:43:30] !log tgr@deploy1003 ngkountas, tgr: Continuing with sync [13:43:37] (03PS1) 10Muehlenhoff: Bitu: Add approval config for airflow-ml-ops [puppet] - 10https://gerrit.wikimedia.org/r/1132640 [13:44:30] !log dcausse@deploy1003 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [13:44:32] !log dcausse@deploy1003 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [13:44:45] !log dcausse@deploy1003 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [13:44:55] !log dcausse@deploy1003 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [13:45:43] !log dcausse@deploy1003 helmfile [eqiad] START 
helmfile.d/services/cirrus-streaming-updater: apply [13:45:52] !log dcausse@deploy1003 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [13:45:59] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 10Thumbor, and 6 others: Change default image thumbnail size - https://phabricator.wikimedia.org/T355914#10693761 (10Ladsgroup) Update: if all goes well, this should be done in two to three weeks. [13:47:45] FIRING: [3x] CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable [13:48:54] 06SRE, 06MediaWiki-Platform-Team: Identify and remediate large increase in sessionstore Cassandra disk usage - https://phabricator.wikimedia.org/T390514#10693878 (10Tgr) CentralAuth session read / write rates are flat: {F58952025} (The big spike is presumably {T389727} related. Although I wonder how we get a s... [13:49:35] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390535#10693913 (10phaultfinder) [13:49:39] !log dcausse@deploy1003 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [13:49:41] 06SRE, 06Commons, 10MediaWiki-File-management, 06serviceops, and 2 others: Frequent "Error: 429, Too Many Requests" errors on pages with many (>50) thumbnails - https://phabricator.wikimedia.org/T266155#10693912 (10Ladsgroup) >>! In T266155#9766334, @Bawolff wrote: > I think if we did deliver the wrong thu... 
[13:50:00] !log dcausse@deploy1003 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [13:50:37] (03PS1) 10Ayounsi: Fix TransitPeering[In|Out]boundSaturation [alerts] - 10https://gerrit.wikimedia.org/r/1132641 (https://phabricator.wikimedia.org/T388641) [13:51:00] !log tgr@deploy1003 Finished scap sync-world: Backport for [[gerrit:1132020|Enable SUL3 everywhere (T384220)]], [[gerrit:1132615|OATHAuth: Mark centralnoticeadmin as requiring 2FA (T208113)]], [[gerrit:1131990|SpecialTranslationTargetLanguages: Use cxserver-supported language codes (T390300)]] (duration: 20m 25s) [13:51:06] T384220: SUL3 Phase 5: Staged rollout for all temporary accounts - https://phabricator.wikimedia.org/T384220 [13:51:06] T390300: SX mobile frequent languages entrypoint not working properly with special language codes - https://phabricator.wikimedia.org/T390300 [13:51:06] (03CR) 10Slyngshede: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1132640 (owner: 10Muehlenhoff) [13:51:27] !log UTC afternoon deploys done [13:51:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:32] ^godog [13:51:39] (03CR) 10Clément Goubert: [C:03+2] mediawiki: Add labels to CronJobs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1132630 (https://phabricator.wikimedia.org/T341555) (owner: 10Clément Goubert) [13:51:43] (03CR) 10Clément Goubert: [C:03+2] mediawiki: Fix jobConfig scope rewrite [deployment-charts] - 10https://gerrit.wikimedia.org/r/1132636 (https://phabricator.wikimedia.org/T341555) (owner: 10Clément Goubert) [13:51:48] (03CR) 10CI reject: [V:04-1] Fix TransitPeering[In|Out]boundSaturation [alerts] - 10https://gerrit.wikimedia.org/r/1132641 (https://phabricator.wikimedia.org/T388641) (owner: 10Ayounsi) [13:52:00] (03CR) 10Muehlenhoff: [C:03+2] Bitu: Add approval config for airflow-ml-ops [puppet] - 10https://gerrit.wikimedia.org/r/1132640 (owner: 10Muehlenhoff) [13:52:00] tgr_: thank you! 
[13:52:34] !log dcausse@deploy1003 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [13:52:45] !log dcausse@deploy1003 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [13:52:59] (03CR) 10D3r1ck01: "Related to T384232 while we were doing testing of SUL3 on production." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1121055 (owner: 10Gergő Tisza) [13:53:46] (03Merged) 10jenkins-bot: mediawiki: Add labels to CronJobs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1132630 (https://phabricator.wikimedia.org/T341555) (owner: 10Clément Goubert) [13:54:32] (03Merged) 10jenkins-bot: mediawiki: Fix jobConfig scope rewrite [deployment-charts] - 10https://gerrit.wikimedia.org/r/1132636 (https://phabricator.wikimedia.org/T341555) (owner: 10Clément Goubert) [13:54:39] (03PS1) 10Bking: relforge: replace soon-to-be-decommissioned cluster endpoint [puppet] - 10https://gerrit.wikimedia.org/r/1132642 (https://phabricator.wikimedia.org/T390565) [13:54:58] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1132642 (https://phabricator.wikimedia.org/T390565) (owner: 10Bking) [13:55:06] (03CR) 10Filippo Giunchedi: [C:03+2] hieradata: move k8s prometheus1005 -> 1007 [puppet] - 10https://gerrit.wikimedia.org/r/1131301 (https://phabricator.wikimedia.org/T383232) (owner: 10Filippo Giunchedi) [13:55:11] !log cgoubert@deploy1003 Started scap sync-world: Deploy mediawiki chart 0.8.8 [13:55:21] (no-op for anything but mw-cron) [13:55:59] !log cgoubert@deploy1003 cgoubert: Deploy mediawiki chart 0.8.8 synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:56:09] !log cgoubert@deploy1003 cgoubert: Continuing with sync [13:56:51] !log cgoubert@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-cron: apply [13:56:54] (03CR) 10Bking: [C:03+2] relforge: replace soon-to-be-decommissioned cluster endpoint [puppet] - 10https://gerrit.wikimedia.org/r/1132642 
(https://phabricator.wikimedia.org/T390565) (owner: 10Bking) [13:57:04] !log cgoubert@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply [13:57:08] (03CR) 10Bking: [C:03+2] "self-merging, as this does not affect a production environment." [puppet] - 10https://gerrit.wikimedia.org/r/1132642 (https://phabricator.wikimedia.org/T390565) (owner: 10Bking) [13:57:19] !log cgoubert@deploy1003 Finished scap sync-world: Deploy mediawiki chart 0.8.8 (duration: 02m 19s) [13:57:29] (03PS1) 10Jelto: ceph: add gitlab dummy credentials [labs/private] - 10https://gerrit.wikimedia.org/r/1132643 (https://phabricator.wikimedia.org/T378922) [13:58:26] !log move k8s instances from prometheus1005 to prometheus1007 - T383232 [13:58:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:31] T383232: Move k8s Prometheus instances to new Prometheus hw in eqiad/codfw - https://phabricator.wikimedia.org/T383232 [13:59:42] (03CR) 10Arnaudb: [C:03+1] ceph: add gitlab dummy credentials [labs/private] - 10https://gerrit.wikimedia.org/r/1132643 (https://phabricator.wikimedia.org/T378922) (owner: 10Jelto) [14:00:37] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390535#10693964 (10phaultfinder) [14:01:56] !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Banning hosts: relforge1004* for ban relforge1004 prior to service restart and decom T390565 - bking@cumin2002 - T390100 [14:01:57] !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: relforge1004* for ban relforge1004 prior to service restart and decom T390565 - bking@cumin2002 - T390100 [14:02:02] T390565: decommission relforge100[34] - https://phabricator.wikimedia.org/T390565 [14:02:02] T390100: Build and deploy updated opensearch plugins deb - https://phabricator.wikimedia.org/T390100 [14:02:45] RESOLVED: CirrusStreamingUpdaterFlinkJobUnstable: 
cirrus_streaming_updater_consumer_search_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable [14:05:38] (03PS1) 10BPirkle: REST: enable Specs module on certain wikis, adjust Sandbox modules [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1132645 (https://phabricator.wikimedia.org/T389407) [14:08:34] !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster cloudelastic: apply new plugins - bking@cumin2002 - T390100 [14:08:39] T390100: Build and deploy updated opensearch plugins deb - https://phabricator.wikimedia.org/T390100 [14:08:41] 06SRE, 06Infrastructure-Foundations, 13Patch-For-Review: Make choice of firewall stack in insetup roles specific / Add nftables variants - https://phabricator.wikimedia.org/T389825#10694022 (10MoritzMuehlenhoff) p:05Triage→03Medium a:03MoritzMuehlenhoff [14:09:43] 06SRE, 06Infrastructure-Foundations, 13Patch-For-Review: Deprecate use of bullseye-backports - https://phabricator.wikimedia.org/T383557#10694031 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff All uses of bullseye-backports have been removed and bullseye-backports is no longer includ... 
[14:15:07] (03PS3) 10Federico Ceratto: clone.py: skip dbctl addition on --nopool [cookbooks] - 10https://gerrit.wikimedia.org/r/1132618 (https://phabricator.wikimedia.org/T390217) [14:17:48] (03PS2) 10Filippo Giunchedi: hieradata: move k8s prometheus1006 -> 1008 [puppet] - 10https://gerrit.wikimedia.org/r/1131302 (https://phabricator.wikimedia.org/T383232) [14:22:08] 10ops-codfw, 06SRE, 06DC-Ops: InboundInterfaceErrors - https://phabricator.wikimedia.org/T390008#10694056 (10Jhancock.wm) cool, if it doesn't alert again in 24 hours, i'll close the ticket. thanks for your help! [14:22:15] !log dcausse@deploy1003 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [14:23:07] !log dcausse@deploy1003 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [14:23:26] FIRING: [6x] ProbeDown: Service install1004:8080 has failed probes (http_squid_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:25:42] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4007.ulsfo.wmnet [14:27:07] FIRING: [2x] DatasourceNoData: - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData [14:29:20] 10ops-eqiad, 06DC-Ops, 10Data-Platform-SRE (2025.03.22 - 2025.04.11): Upgrade an-worker hard drives from 4TB to 8TB (group 1 - rack F7) - https://phabricator.wikimedia.org/T390168#10694072 (10Jclark-ctr) a:05VRiley-WMF→03Jclark-ctr [14:30:18] !log dcausse@deploy1003 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [14:30:24] 10ops-eqiad, 06DC-Ops, 10Data-Platform-SRE (2025.03.22 - 2025.04.11): Upgrade an-worker hard drives from 4TB to 8TB (group 1 - rack F7) - https://phabricator.wikimedia.org/T390168#10694076 (10Jclark-ctr) [14:30:26] !log dcausse@deploy1003 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [14:30:43] FIRING: 
ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [14:30:44] FIRING: CirrusProducerFlinkJobNotRunning: cirrus_streaming_updater_producer in codfw (k8s) is not running - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=producer - https://alerts.wikimedia.org/?q=alertname%3DCirrusProducerFlinkJobNotRunning [14:30:55] !log dcausse@deploy1003 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [14:31:07] !log dcausse@deploy1003 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [14:31:28] 10ops-eqiad, 06DC-Ops, 10Data-Platform-SRE (2025.03.22 - 2025.04.11): Upgrade an-worker hard drives from 4TB to 8TB (group 1 - rack F7) - https://phabricator.wikimedia.org/T390168#10694081 (10Jclark-ctr) Completed swapping disk in an-worker117[2-4] @BTullis [14:31:53] !log dcausse@deploy1003 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [14:32:08] !log dcausse@deploy1003 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [14:33:26] FIRING: [6x] ProbeDown: Service install1004:8080 has failed probes (http_squid_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:35:06] FIRING: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - 
https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable [14:38:33] (03CR) 10Daniel Kinzler: [C:03+1] "Looks good to me" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1132645 (https://phabricator.wikimedia.org/T389407) (owner: 10BPirkle) [14:38:37] (03CR) 10Ssingh: [C:03+1] trafficserver: gateway-check ignore list, roll pcs/mobileapps to more wikis [puppet] - 10https://gerrit.wikimedia.org/r/1131748 (https://phabricator.wikimedia.org/T388140) (owner: 10Hnowlan) [14:38:44] FIRING: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): ... [14:38:45] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [14:39:15] (03CR) 10Thcipriani: "Note: needs to be added to all deployed branches before this merges; i.e., EmailAuth needs to be added as a submodule to `wmf/1.44.0-wmf.2" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1132302 (https://phabricator.wikimedia.org/T390437) (owner: 10Kosta Harlan) [14:40:37] !log dcausse@deploy1003 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [14:40:43] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [14:40:44] !log dcausse@deploy1003 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [14:40:52] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1132653 [14:41:11] !log dcausse@deploy1003 helmfile [codfw] START 
helmfile.d/services/cirrus-streaming-updater: apply [14:41:19] !log dcausse@deploy1003 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [14:41:23] 06SRE, 06MediaWiki-Platform-Team: Identify and remediate large increase in sessionstore Cassandra disk usage - https://phabricator.wikimedia.org/T390514#10694124 (10Eevans) >>! In T390514#10693687, @Tgr wrote: > I assume the sawtooth pattern is the artifact of some OS optimization process? It's Cassandra comp... [14:41:40] (03CR) 10Thcipriani: "*as a submodule of `mediawiki/core` on the `wmf/1.44.0-wmf.22` branch, that is: https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/c" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1132302 (https://phabricator.wikimedia.org/T390437) (owner: 10Kosta Harlan) [14:43:45] FIRING: [2x] CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [14:45:34] (03CR) 10Ayounsi: [C:03+1] alertmanager: open dcops tasks with title as summary [puppet] - 10https://gerrit.wikimedia.org/r/1132637 (https://phabricator.wikimedia.org/T388641) (owner: 10Filippo Giunchedi) [14:45:43] !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster cloudelastic: apply new plugins - bking@cumin2002 - T390100 [14:45:49] T390100: Build and deploy updated opensearch plugins deb - https://phabricator.wikimedia.org/T390100 [14:46:07] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1131320 (https://phabricator.wikimedia.org/T388641) (owner: 10Ayounsi) [14:52:02] RESOLVED: [2x] DatasourceNoData: - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData [14:52:34] !log dcausse@deploy1003 helmfile [codfw] START 
helmfile.d/services/cirrus-streaming-updater: apply [14:52:45] !log dcausse@deploy1003 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [14:53:04] !log dcausse@deploy1003 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [14:53:29] !log dcausse@deploy1003 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [14:53:45] RESOLVED: [2x] CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [14:54:45] RESOLVED: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable [14:54:45] (03CR) 10Dbrant: [C:03+2] wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131072 (owner: 10PipelineBot) [14:55:37] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 31 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1132645 (https://phabricator.wikimedia.org/T389407) (owner: 10BPirkle) [14:55:45] FIRING: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_producer_codfw in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=producer - 
https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable [14:55:50] RESOLVED: CirrusProducerFlinkJobNotRunning: cirrus_streaming_updater_producer in codfw (k8s) is not running - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=producer - https://alerts.wikimedia.org/?q=alertname%3DCirrusProducerFlinkJobNotRunning [14:56:24] (03Merged) 10jenkins-bot: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131072 (owner: 10PipelineBot) [14:57:38] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for @Ben.buchenau - https://phabricator.wikimedia.org/T386904#10694216 (10Ben.buchenau) Thanks. Fixed my SSH login issue locally, as I had a typo in my `.ssh/config` which made the connection to `stat1011.eqiad.wmnet` fail. From her... [14:57:55] (03PS2) 10Ayounsi: Fix TransitPeering[In|Out]boundSaturation [alerts] - 10https://gerrit.wikimedia.org/r/1132641 (https://phabricator.wikimedia.org/T388641) [14:58:48] 07Puppet, 06Infrastructure-Foundations: Improve the user experience adding new nodes to puppet - https://phabricator.wikimedia.org/T389932#10694219 (10joanna_borun) a:03jhathaway [15:00:45] RESOLVED: [2x] CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable [15:00:46] (03PS1) 10Joal: Update data-eng gobblin alert [alerts] - 10https://gerrit.wikimedia.org/r/1132663 (https://phabricator.wikimedia.org/T386177) [15:01:57] (03CR) 10CI reject: [V:04-1] Update data-eng gobblin alert [alerts] - 10https://gerrit.wikimedia.org/r/1132663 (https://phabricator.wikimedia.org/T386177) (owner: 10Joal) [15:02:16] !log `elukey@cumin1002:~$ 
sudo cumin 'registry*' 'rm -rf /var/cache/nginx-docker-registry'` - T390251
[15:02:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:02:20] !log jmm@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on ganeti4007.ulsfo.wmnet with reason: remove from cluster for reimage
[15:02:21] T390251: docker-registry.wikimedia.org was serving a bad blob - https://phabricator.wikimedia.org/T390251
[15:02:25] SRE, Ganeti, Infrastructure-Foundations: Update Ganeti servers in ulsfo to Bookworm - https://phabricator.wikimedia.org/T382511#10694233 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=754c015a-5966-406b-8711-e527c555dafe) set by jmm@cumin2002 for 1 day, 0:00:00 on 1 host(s) and the...
[15:02:52] elukey: Thanks for the fixes!
[15:03:12] (PS1) Muehlenhoff: Switch ganeti4007 to nftables [puppet] - https://gerrit.wikimedia.org/r/1132664
[15:03:31] dancy: credits to akosiaris too! It was a weird bug, hope it is fixed! 
:)
[15:03:41] !log jmm@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ganeti4007.ulsfo.wmnet
[15:03:50] (CR) Tiziano Fogli: [C:+2] Fix TransitPeering[In|Out]boundSaturation [alerts] - https://gerrit.wikimedia.org/r/1132641 (https://phabricator.wikimedia.org/T388641) (owner: Ayounsi)
[15:05:01] (Merged) jenkins-bot: Fix TransitPeering[In|Out]boundSaturation [alerts] - https://gerrit.wikimedia.org/r/1132641 (https://phabricator.wikimedia.org/T388641) (owner: Ayounsi)
[15:05:59] elukey: look at https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&refresh=5m&from=now-2d&to=now&viewPanel=12&var-server=registry2004&var-datasource=thanos&var-cluster=misc btw
[15:06:21] I am willing to bet that while the image was getting fetched we got to 95% and 5% is reserved for root
[15:06:25] FIRING: SystemdUnitFailed: opensearch_1@relforge-eqiad-small-alpha.service\x2copensearch_1@relforge-eqiad.service on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:06:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:06:55] it's weird that I haven't yet found anything in the logs pointing out that we ran out of disk space, but it looks plausible
[15:08:06] What I'm curious about is what data is being served after the presumed truncated portion of the file.
[15:08:20] Since the right _amount_ of data is sent (at least in the case I ran into)
[15:08:41] What could nginx possibly be reading in that case?
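Editor's note on the "5% is reserved for root" remark: on filesystems that keep a root-only reserve (ext4 commonly defaults to about 5%), an unprivileged process can hit "no space left" while the disk still looks roughly 95% full. The sketch below is only an illustration of that gap using Python's standard `os.statvfs`. The helper names `disk_headroom` and `reserved_fraction` are hypothetical, and nothing here is taken from the registry hosts or from the nginx configuration discussed in the channel.

```python
import os


def disk_headroom(path="/"):
    """Return (total, free_for_root, free_for_users) in bytes for the
    filesystem that holds `path`.

    free_for_root counts every free block, including any root-only reserve;
    free_for_users is what an unprivileged process can still allocate.
    """
    st = os.statvfs(path)
    block = st.f_frsize
    total = st.f_blocks * block
    free_for_root = st.f_bfree * block
    free_for_users = st.f_bavail * block
    return total, free_for_root, free_for_users


def reserved_fraction(total, free_for_root, free_for_users):
    """Fraction of the filesystem currently held back for root."""
    if total == 0:
        return 0.0
    return (free_for_root - free_for_users) / total


if __name__ == "__main__":
    # "/" is just an example mount point; pass the cache directory's
    # filesystem to inspect a specific volume.
    total, free_root, free_users = disk_headroom("/")
    print(f"total={total} free(root)={free_root} free(users)={free_users}")
    print(f"reserved for root: {reserved_fraction(total, free_root, free_users):.1%}")
```

If `free(users)` reaches zero while `free(root)` is still positive, writes from a non-root service would fail even though usage graphs stop short of 100%, which is consistent with the theory voiced above but does not confirm it.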
[15:09:30] akosiaris: ahh okok yes that could explain it yes
[15:09:30] oh, wait, luca has the same theory in the task, just saw
[15:09:42] I didn't think about the 5% reserved
[15:10:05] but somehow I thought the cache got corrupted, and it is sad that nginx didn't really tell us in the logs
[15:10:11] maybe at verbose=100
[15:10:30] I would definitely expect something in the logs. out of disk space is such a typical error to handle
[15:10:39] ops-eqiad, SRE, DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390535#10694265 (phaultfinder)
[15:10:58] and there was something for the tmpfs partition, it was loud and clear
[15:11:01] not for the caching stuff
[15:11:07] dancy: yes indeed really weird, no idea
[15:11:11] (CR) Ssingh: [C:+1] upgrade cp1100 to Varnish 7.1 [puppet] - https://gerrit.wikimedia.org/r/1131824 (https://phabricator.wikimedia.org/T378737) (owner: BCornwall)
[15:11:13] (CR) Ssingh: [C:+1] upgrade cp1101 to Varnish 7.1 [puppet] - https://gerrit.wikimedia.org/r/1131825 (https://phabricator.wikimedia.org/T378737) (owner: BCornwall)
[15:11:14] (CR) Ssingh: [C:+1] upgrade cp1102 to Varnish 7.1 [puppet] - https://gerrit.wikimedia.org/r/1131826 (https://phabricator.wikimedia.org/T378737) (owner: BCornwall)
[15:11:16] (CR) Ssingh: [C:+1] upgrade cp1103 to Varnish 7.1 [puppet] - https://gerrit.wikimedia.org/r/1131827 (https://phabricator.wikimedia.org/T378737) (owner: BCornwall)
[15:11:17] (CR) Ssingh: [C:+1] upgrade cp1104 to Varnish 7.1 [puppet] - https://gerrit.wikimedia.org/r/1131828 (https://phabricator.wikimedia.org/T378737) (owner: BCornwall)
[15:11:19] (CR) Ssingh: [C:+1] upgrade cp1105 to Varnish 7.1 [puppet] - https://gerrit.wikimedia.org/r/1131829 (https://phabricator.wikimedia.org/T378737) (owner: BCornwall)
[15:11:20] (CR) Ssingh: [C:+1] upgrade cp1106 to Varnish 7.1 [puppet] - 
10https://gerrit.wikimedia.org/r/1131830 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [15:11:24] (03CR) 10Ssingh: [C:03+1] upgrade cp1107 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1131831 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [15:11:28] (03CR) 10Ssingh: [C:03+1] upgrade cp1108 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1131832 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [15:11:32] (03CR) 10Ssingh: [C:03+1] upgrade cp1109 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1131833 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [15:11:35] but having such cache without a dedicated partition seems to be looking for troubles [15:11:37] (03CR) 10Ssingh: [C:03+1] upgrade cp1110 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1131834 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [15:11:41] (03CR) 10Ssingh: [C:03+1] upgrade cp1111 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1131835 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [15:11:45] (03CR) 10Ssingh: [C:03+1] upgrade cp1112 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1131836 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [15:11:49] (03CR) 10Ssingh: [C:03+1] upgrade cp1113 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1131837 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [15:11:53] (03CR) 10Ssingh: [C:03+1] upgrade cp1114 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1131838 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [15:11:57] (03CR) 10Ssingh: [C:03+1] upgrade cp1115 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1131839 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [15:12:45] (03PS4) 10Kosta Harlan: EmailAuth: Prepare config for enabling [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1132408 
(https://phabricator.wikimedia.org/T390437) [15:14:37] 06SRE-OnFire, 06Release-Engineering-Team, 06serviceops, 10Sustainability (Incident Followup): Should scap be able to update helmfile-defaults when -Dbuild_mw_container_image:False ? - https://phabricator.wikimedia.org/T390531#10694284 (10jijiki) [15:14:38] 06SRE-OnFire, 06MediaWiki-Engineering, 06serviceops-radar, 10Sustainability (Incident Followup): Reduce the amount of messages sent through channel:Memcached during failures - https://phabricator.wikimedia.org/T390529#10694285 (10jijiki) [15:16:47] (03PS5) 10Kosta Harlan: EmailAuth: Prepare config for enabling [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1132408 (https://phabricator.wikimedia.org/T390437) [15:18:05] (03CR) 10BCornwall: [C:03+2] upgrade cp1101 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1131825 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [15:18:09] (03CR) 10BCornwall: [C:03+2] upgrade cp1100 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1131824 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [15:22:06] (03PS6) 10Kosta Harlan: EmailAuth: Prepare config for enabling [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1132408 (https://phabricator.wikimedia.org/T390437) [15:25:03] (03PS7) 10Kosta Harlan: EmailAuth: Prepare config for enabling [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1132408 (https://phabricator.wikimedia.org/T390437) [15:26:35] (03PS1) 10Ssingh: P:durum: add conditional to enable ECH [puppet] - 10https://gerrit.wikimedia.org/r/1132669 (https://phabricator.wikimedia.org/T205378) [15:27:22] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on P{cp1100.eqiad.wmnet} and A:cp [15:27:24] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on P{cp1101.eqiad.wmnet} and A:cp [15:27:32] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5177/co" [puppet] - 10https://gerrit.wikimedia.org/r/1132669 (https://phabricator.wikimedia.org/T205378) (owner: 10Ssingh) [15:29:36] (03PS8) 10Kosta Harlan: EmailAuth: Prepare config for enabling [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1132408 (https://phabricator.wikimedia.org/T390437) [15:29:45] FIRING: [2x] CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s): fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [15:30:05] jan_drewniak: I, the Bot under the Fountain, call upon thee, The Deployer, to do Wikimedia Portals Update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250331T1530). [15:30:39] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390535#10694342 (10phaultfinder) [15:30:41] (03CR) 10Brouberol: "You also have to update `gobblin_test,yaml` to reflect the changes" [alerts] - 10https://gerrit.wikimedia.org/r/1132663 (https://phabricator.wikimedia.org/T386177) (owner: 10Joal) [15:30:49] 06SRE, 06MediaWiki-Platform-Team: Identify and remediate large increase in sessionstore Cassandra disk usage - https://phabricator.wikimedia.org/T390514#10694345 (10Tgr) There's a GET spike starting on the 18th around 15:15: {F58952037} An increase in POSTs that's more gradual (the huge spike is the crash tod... 
[15:30:53] !log uploaded spicerack_10.0.0 to apt.wikimedia.org bullseye-wikimedia [15:30:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:32:53] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on P{cp1101.eqiad.wmnet} and A:cp [15:33:08] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on P{cp1100.eqiad.wmnet} and A:cp [15:33:50] (03PS2) 10Joal: Update data-eng gobblin alert [alerts] - 10https://gerrit.wikimedia.org/r/1132663 (https://phabricator.wikimedia.org/T386177) [15:34:21] jouncebot: now [15:34:21] For the next 0 hour(s) and 25 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250331T1530) [15:34:53] Hey jan_drewniak - any portal deployments right now? Would like to get a sec patch fix out, if possible. [15:36:05] (03PS1) 10Ayounsi: sre.network.tls: allow running it on more types [cookbooks] - 10https://gerrit.wikimedia.org/r/1132671 (https://phabricator.wikimedia.org/T390052) [15:36:09] hi sbassett, I haven't done one in a few weeks so I we just about to start one. It won't take long. [15:36:27] (03PS1) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1132672 (https://phabricator.wikimedia.org/T128546) [15:36:42] (03CR) 10Ayounsi: "Not sure it fully works everywhere yet, but first step to test it." [cookbooks] - 10https://gerrit.wikimedia.org/r/1132671 (https://phabricator.wikimedia.org/T390052) (owner: 10Ayounsi) [15:36:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:36:59] jan_drewniak: ok, does it involve a sync-world? 
there’s a change staged in PS.php in /private right now. [15:37:00] (03PS2) 10Ssingh: P:durum: add conditional to enable ECH [puppet] - 10https://gerrit.wikimedia.org/r/1132669 (https://phabricator.wikimedia.org/T205378) [15:37:12] (03CR) 10Ssingh: [C:03+2] sre.network.cf: log if no changes were made [cookbooks] - 10https://gerrit.wikimedia.org/r/1130135 (owner: 10Ssingh) [15:37:19] !log ayounsi@cumin1002 START - Cookbook sre.network.tls for network device fasw2-c8b-codfw [15:38:02] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1132669 (https://phabricator.wikimedia.org/T205378) (owner: 10Ssingh) [15:38:13] (03CR) 10Joal: "I don't see where it is needed - can you show me please?" [alerts] - 10https://gerrit.wikimedia.org/r/1132663 (https://phabricator.wikimedia.org/T386177) (owner: 10Joal) [15:39:37] (03PS9) 10Kosta Harlan: EmailAuth: Prepare config for enabling in log-only mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1132408 (https://phabricator.wikimedia.org/T390437) [15:39:39] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device fasw2-c8b-codfw [15:39:45] FIRING: [3x] CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [15:39:57] sbassett: portals deploy only uses `sync-file`; is that fine? 
[15:40:40] (03PS3) 10Filippo Giunchedi: mediawiki-global: add alerts for too many login attempts [alerts] - 10https://gerrit.wikimedia.org/r/1132580 (owner: 10Giuseppe Lavagetto) [15:41:47] it's basically `scap sync-file portals/wikipedia.org/assets $*` and `scap sync-file portals $*` so I don't think that should affect anything in `/private` [15:43:04] (03CR) 10Jdrewniak: [C:03+2] Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1132672 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [15:43:26] it's technically a mediawiki-config change [15:43:41] 06SRE, 06MediaWiki-Platform-Team: Identify and remediate large increase in sessionstore Cassandra disk usage - https://phabricator.wikimedia.org/T390514#10694415 (10Tgr) The GET and DELETE increase has to be the DC switch, it lines up perfectly in time. The POSTs are maybe related to the bot? Too late for the... [15:43:57] (03Merged) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1132672 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [15:44:22] !log ayounsi@cumin1002 START - Cookbook sre.network.tls for network device fasw2-c8b-codfw [15:44:22] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device fasw2-c8b-codfw [15:44:29] !log ayounsi@cumin1002 START - Cookbook sre.network.tls for network device fasw2-c8a-codfw [15:45:02] (03PS10) 10Kosta Harlan: EmailAuth: Prepare config for enabling in log-only mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1132408 (https://phabricator.wikimedia.org/T390437) [15:45:21] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti4007.ulsfo.wmnet [15:45:26] (03Merged) 10jenkins-bot: sre.network.cf: log if no changes were made [cookbooks] - 10https://gerrit.wikimedia.org/r/1130135 (owner: 10Ssingh) [15:45:32] (03PS1) 10Clément Goubert: alertmanager: Route task-level GrowthExperiments alerts [puppet] - 
10https://gerrit.wikimedia.org/r/1132673 (https://phabricator.wikimedia.org/T385782) [15:45:34] (03PS1) 10Clément Goubert: mw::periodic_jobs: Migrate deleteOldSurveys [puppet] - 10https://gerrit.wikimedia.org/r/1132674 (https://phabricator.wikimedia.org/T385782) [15:46:13] (03CR) 10Máté Szabó: [C:04-1] EmailAuth: Prepare config for enabling in log-only mode (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1132408 (https://phabricator.wikimedia.org/T390437) (owner: 10Kosta Harlan) [15:46:42] (03CR) 10Clément Goubert: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1132674 (https://phabricator.wikimedia.org/T385782) (owner: 10Clément Goubert) [15:46:49] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device fasw2-c8a-codfw [15:47:04] !log ayounsi@cumin1002 START - Cookbook sre.network.tls for network device cloudsw2-d5-eqiad [15:47:27] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device cloudsw2-d5-eqiad [15:48:30] (03PS2) 10Clément Goubert: alertmanager: Route task-level GrowthExperiments alerts [puppet] - 10https://gerrit.wikimedia.org/r/1132673 (https://phabricator.wikimedia.org/T385782) [15:49:45] jan_drewniak: sounds good, thanks. [15:50:51] (03PS4) 10Filippo Giunchedi: mediawiki-global: add alerts for too many login attempts [alerts] - 10https://gerrit.wikimedia.org/r/1132580 (owner: 10Giuseppe Lavagetto) [15:51:34] (03PS2) 10Clément Goubert: mw::periodic_jobs: Migrate deleteOldSurveys [puppet] - 10https://gerrit.wikimedia.org/r/1132674 (https://phabricator.wikimedia.org/T385782) [15:51:35] (03PS1) 10Ayounsi: gNMIc: start collecting metrics from fasw, ignore asw1-eqsin VC [puppet] - 10https://gerrit.wikimedia.org/r/1132675 (https://phabricator.wikimedia.org/T390052) [15:52:28] jan_drewniak: just let mstyles and I know when you’re done, thanks. 
[15:53:33] (03CR) 10Ayounsi: "codfw fasw are ready, will merge/deploy this, check that all is good before applying the change to eqiad." [puppet] - 10https://gerrit.wikimedia.org/r/1132675 (https://phabricator.wikimedia.org/T390052) (owner: 10Ayounsi) [15:54:45] FIRING: [3x] CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [15:55:00] (03PS5) 10Filippo Giunchedi: mediawiki-global: add alerts for too many login attempts [alerts] - 10https://gerrit.wikimedia.org/r/1132580 (owner: 10Giuseppe Lavagetto) [15:55:35] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4007.ulsfo.wmnet [15:55:37] !log jmm@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ganeti4007.ulsfo.wmnet [15:57:52] (03PS11) 10Kosta Harlan: EmailAuth: Prepare config for enabling in log-only mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1132408 (https://phabricator.wikimedia.org/T390437) [15:58:11] !log jdrewniak@deploy1003 Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:1132672| Bumping portals to master (T128546)]] (duration: 11m 48s) [15:58:16] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546 [15:58:19] (03CR) 10Kosta Harlan: EmailAuth: Prepare config for enabling in log-only mode (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1132408 (https://phabricator.wikimedia.org/T390437) (owner: 10Kosta Harlan) [15:59:45] RESOLVED: [2x] CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s): fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - 
https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [15:59:58] (03CR) 10Muehlenhoff: [C:03+2] Switch ganeti4007 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1132664 (owner: 10Muehlenhoff) [16:00:47] !log jdrewniak@deploy1003 Synchronized portals: Wikimedia Portals Update: [[gerrit:1132672| Bumping portals to master (T128546)]] (duration: 02m 35s) [16:01:15] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1132675 (https://phabricator.wikimedia.org/T390052) (owner: 10Ayounsi) [16:01:29] sbassett: mstyles, portals deploy is done. [16:02:20] jan_drewniak thank you! [16:03:01] 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install apus-fe2003 - https://phabricator.wikimedia.org/T390578 (10Jhancock.wm) 03NEW [16:04:35] 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install apus-fe2003 - https://phabricator.wikimedia.org/T390578#10694539 (10Jhancock.wm) [16:05:07] (03CR) 10BCornwall: [C:03+2] upgrade cp1102 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1131826 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [16:05:13] (03CR) 10BCornwall: [C:03+2] upgrade cp1103 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1131827 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [16:05:37] 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install apus-fe2003 - https://phabricator.wikimedia.org/T390578#10694546 (10Jhancock.wm) [16:07:16] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on P{cp1103.eqiad.wmnet} and A:cp [16:07:24] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on P{cp1102.eqiad.wmnet} and A:cp [16:11:54] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on P{cp1103.eqiad.wmnet} and A:cp [16:12:26] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1132683 [16:12:44] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on P{cp1102.eqiad.wmnet} and A:cp [16:16:42] 10ops-eqiad, 06SRE, 06DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T390254#10694621 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr Reseated power cable [16:17:21] Thanks, jan_drewniak! [16:21:32] !log deploy fix for T389727 [16:21:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:22:33] (03CR) 10Cathal Mooney: [C:03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/1132675 (https://phabricator.wikimedia.org/T390052) (owner: 10Ayounsi) [16:22:51] (03CR) 10Cathal Mooney: [C:03+1] sre.network.tls: allow running it on more types [cookbooks] - 10https://gerrit.wikimedia.org/r/1132671 (https://phabricator.wikimedia.org/T390052) (owner: 10Ayounsi) [16:23:39] FIRING: CoreBGPDown: Core BGP session down between cr4-ulsfo and cr2-codfw (208.80.153.193) - group Confed_codfw - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=ulsfo&var-device=cr4-ulsfo:9804&var-bgp_group=Confed_codfw&var-bgp_neighbor=cr2-codfw - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [16:24:01] (03CR) 10Cathal Mooney: [C:03+1] "LGTM, nice :)" [puppet] - 10https://gerrit.wikimedia.org/r/1131320 (https://phabricator.wikimedia.org/T388641) (owner: 10Ayounsi) [16:25:21] FIRING: [7x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:xe-1/1/1:0 (Transport: cr4-ulsfo:xe-0/1/1 (Lumen, 442550294) {#12252_12295-1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [16:26:57] (03CR) 10Ottomata: [C:03+2] eventgate-main - upgrade to NodeJS 20 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131792 
(https://phabricator.wikimedia.org/T383814) (owner: 10Ottomata) [16:28:13] !log beginning eventgate-main upgrade to NodeJS 20 - T383814 [16:28:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:28:18] T383814: Upgrade eventgate-wikimedia to node20 - https://phabricator.wikimedia.org/T383814 [16:28:35] (03Merged) 10jenkins-bot: eventgate-main - upgrade to NodeJS 20 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131792 (https://phabricator.wikimedia.org/T383814) (owner: 10Ottomata) [16:28:39] FIRING: [4x] CoreBGPDown: Core BGP session down between cr2-codfw and cr4-ulsfo (198.35.26.193) - group Confed_ulsfo - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [16:29:15] !log otto@deploy1003 helmfile [staging] START helmfile.d/services/eventgate-main: apply [16:29:34] !log otto@deploy1003 helmfile [staging] DONE helmfile.d/services/eventgate-main: apply [16:31:16] !log otto@deploy1003 helmfile [codfw] START helmfile.d/services/eventgate-main: apply [16:31:24] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host cloudelastic1008.eqiad.wmnet with OS bullseye [16:31:29] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.22 - 2025.04.11): cloudelastic1008 stuck at boot screen after multiple reboots, SEL reports Comm Error: Backplane 0 - https://phabricator.wikimedia.org/T388150#10694716 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jcla... 
[16:32:06] !log otto@deploy1003 helmfile [codfw] DONE helmfile.d/services/eventgate-main: apply [16:35:27] (03PS1) 10Alexandros Kosiaris: wikifunctions: Add group{0,1,2} releases [deployment-charts] - 10https://gerrit.wikimedia.org/r/1132684 (https://phabricator.wikimedia.org/T384944) [16:41:59] (03CR) 10Kosta Harlan: EmailAuth: Prepare config for enabling in log-only mode (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1132408 (https://phabricator.wikimedia.org/T390437) (owner: 10Kosta Harlan) [16:43:13] jouncebot: now [16:43:13] No deployments scheduled for the next 0 hour(s) and 16 minute(s) [16:46:05] !log otto@deploy1003 helmfile [eqiad] START helmfile.d/services/eventgate-main: apply [16:46:42] !log otto@deploy1003 helmfile [eqiad] DONE helmfile.d/services/eventgate-main: apply [16:47:56] Hey all - mstyles and I have one last, quick deploy for PS.php. [16:48:56] (03CR) 10Reedy: EmailAuth: Prepare config for enabling in log-only mode (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1132408 (https://phabricator.wikimedia.org/T390437) (owner: 10Kosta Harlan) [16:49:18] (03CR) 10Dzahn: [C:03+2] servicecatalog: add codesearch in state service_setup [puppet] - 10https://gerrit.wikimedia.org/r/1128989 (https://phabricator.wikimedia.org/T268199) (owner: 10Dzahn) [16:50:09] (03CR) 10Dzahn: [C:03+2] conftool-data: add codesearch service to discovery objects [puppet] - 10https://gerrit.wikimedia.org/r/1128988 (https://phabricator.wikimedia.org/T268199) (owner: 10Dzahn) [16:50:32] 06SRE, 07SRE-Unowned, 06serviceops-radar, 10wikitech.wikimedia.org, 13Patch-For-Review: Redesign wikitech-static - https://phabricator.wikimedia.org/T376400#10694816 (10Volans) @Andrew thanks for setting this up. I did a quick tour and found some issues: 1. The first page is very very slow to load, I th... 
[16:50:41] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudelastic1008.eqiad.wmnet with reason: host reimage [16:51:21] 10ops-codfw, 06SRE, 06DC-Ops: Q4:rack/setup/install apus-fe2003 - https://phabricator.wikimedia.org/T390578#10694821 (10Jhancock.wm) a:03Jhancock.wm [16:51:50] (03CR) 10BCornwall: [C:03+2] upgrade cp1104 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1131828 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [16:51:53] (03CR) 10BCornwall: [C:03+2] upgrade cp1105 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1131829 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [16:52:11] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install thanos-fe200[5-7] - https://phabricator.wikimedia.org/T389634#10694834 (10Jhancock.wm) a:03Jhancock.wm [16:52:49] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Q4:rack/setup/install ms-fe201[56] - https://phabricator.wikimedia.org/T388887#10694842 (10Jhancock.wm) a:03Jhancock.wm [16:53:45] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudelastic1008.eqiad.wmnet with reason: host reimage [16:54:34] 06SRE, 06MediaWiki-Platform-Team: Identify and remediate large increase in sessionstore Cassandra disk usage - https://phabricator.wikimedia.org/T390514#10694844 (10Tgr) I guess GET/DELETE increases couldn't have much to do with disk space use increase, anyway. I am not sure about the POST increase and the bot -... 
[16:57:24] (03PS12) 10Kosta Harlan: EmailAuth: Prepare config for enabling in log-only mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1132408 (https://phabricator.wikimedia.org/T390437) [16:58:16] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on P{cp1104.eqiad.wmnet} and A:cp [16:58:18] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on P{cp1105.eqiad.wmnet} and A:cp [16:58:54] (03CR) 10Kosta Harlan: EmailAuth: Prepare config for enabling in log-only mode (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1132408 (https://phabricator.wikimedia.org/T390437) (owner: 10Kosta Harlan) [16:59:22] 10ops-eqiad, 06SRE, 10Ceph, 10Cloud-VPS, and 2 others: [cloudceph] test the new DELL hard drives throughput - https://phabricator.wikimedia.org/T390134#10694888 (10Andrew) a:05dcaro→03Andrew [16:59:29] (03CR) 10Dzahn: [C:03+1] miscweb: os-reports: deploy os-reports to k8s-aux [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131955 (https://phabricator.wikimedia.org/T350794) (owner: 10Jelto) [17:00:05] swfrench-wmf: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for MediaWiki infrastructure (UTC late) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250331T1700). [17:00:05] ryankemper: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Wikidata Query Service weekly deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250331T1700). [17:00:26] o/ [17:00:59] swfrench-wmf: I have a new release of scap ready to deploy. It has your changes in it. 
[17:01:14] (03PS1) 10Alexandros Kosiaris: scap: Add 3 releases in mw-wikifunctions [puppet] - 10https://gerrit.wikimedia.org/r/1132690 (https://phabricator.wikimedia.org/T384944) [17:01:15] (03PS1) 10Alexandros Kosiaris: wikifunctions: Switch to ingress service [puppet] - 10https://gerrit.wikimedia.org/r/1132691 (https://phabricator.wikimedia.org/T384944) [17:01:44] (03CR) 10CI reject: [V:04-1] scap: Add 3 releases in mw-wikifunctions [puppet] - 10https://gerrit.wikimedia.org/r/1132690 (https://phabricator.wikimedia.org/T384944) (owner: 10Alexandros Kosiaris) [17:01:45] dancy: ah, thanks for the heads-up! feel free to go ahead - I'm not actually going to deploy anything during this window [17:01:48] (03CR) 10CI reject: [V:04-1] wikifunctions: Switch to ingress service [puppet] - 10https://gerrit.wikimedia.org/r/1132691 (https://phabricator.wikimedia.org/T384944) (owner: 10Alexandros Kosiaris) [17:02:06] ok. I'll deploy after the security team is done. [17:02:07] (03CR) 10Dzahn: miscweb: os-report: use puppetdb from external_services (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131952 (https://phabricator.wikimedia.org/T350794) (owner: 10Jelto) [17:02:25] 10ops-eqiad, 06SRE, 10Ceph, 10Cloud-VPS, and 2 others: [cloudceph] test the new DELL hard drives throughput - https://phabricator.wikimedia.org/T390134#10694917 (10Andrew) During dcaro's PTO he wants me to get the host back up and confirm that the drive appears to the OS. He'll do performance testing when... 
[17:02:35] for context, I am going to attempt to reproduce T389734 with some additional logging added on mwdebug1001 [17:02:35] T389734: Fatal exception of type "Wikimedia\RequestTimeout\EmergencyTimeoutException" or "Wikimedia\Rdbms\DBUnexpectedError" errors - https://phabricator.wikimedia.org/T389734 [17:03:27] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on P{cp1105.eqiad.wmnet} and A:cp [17:03:35] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on P{cp1104.eqiad.wmnet} and A:cp [17:04:21] !log deploy fix for T389727 [17:04:23] (03PS1) 10Ebernhardson: Update opensearch-madvise call for version 0.2 [puppet] - 10https://gerrit.wikimedia.org/r/1132692 (https://phabricator.wikimedia.org/T390592) [17:04:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:04:53] (03CR) 10Dzahn: [C:03+2] create k8s-ingress-aux -ro and -rw discovery records, metafo/geodns [dns] - 10https://gerrit.wikimedia.org/r/1126182 (https://phabricator.wikimedia.org/T268199) (owner: 10Dzahn) [17:04:58] (03PS3) 10Dzahn: create k8s-ingress-aux -ro and -rw discovery records, metafo/geodns [dns] - 10https://gerrit.wikimedia.org/r/1126182 (https://phabricator.wikimedia.org/T268199) [17:05:34] (03CR) 10Ebernhardson: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1132692 (https://phabricator.wikimedia.org/T390592) (owner: 10Ebernhardson) [17:05:37] dancy: ah, I did not realize that was still ongoing. holding for now [17:06:49] just finished the deploy, please go ahead [17:07:38] Thanks Maryum! [17:08:02] maryum: thanks! 
[17:08:07] !log dancy@deploy1003 Installing scap version "4.148.0" for 193 host(s) [17:08:19] (03PS2) 10Ebernhardson: Update opensearch-madvise call for version 0.2 [puppet] - 10https://gerrit.wikimedia.org/r/1132692 (https://phabricator.wikimedia.org/T390592) [17:08:36] (03CR) 10Dzahn: [V:03+1] "recheck" [dns] - 10https://gerrit.wikimedia.org/r/1126182 (https://phabricator.wikimedia.org/T268199) (owner: 10Dzahn) [17:08:42] (03CR) 10CI reject: [V:04-1] Update opensearch-madvise call for version 0.2 [puppet] - 10https://gerrit.wikimedia.org/r/1132692 (https://phabricator.wikimedia.org/T390592) (owner: 10Ebernhardson) [17:09:03] (03PS3) 10Ebernhardson: Update opensearch-madvise call for version 0.2 [puppet] - 10https://gerrit.wikimedia.org/r/1132692 (https://phabricator.wikimedia.org/T390592) [17:09:47] !log dzahn@dns1004 START - running authdns-update [17:10:29] (03CR) 10Gergő Tisza: mediawiki-global: add alerts for too many login attempts (037 comments) [alerts] - 10https://gerrit.wikimedia.org/r/1132580 (owner: 10Giuseppe Lavagetto) [17:12:12] (03CR) 10Ebernhardson: [C:04-1] "I think this patch is ready, but the .deb needs to be built before this can be deployed" [puppet] - 10https://gerrit.wikimedia.org/r/1132692 (https://phabricator.wikimedia.org/T390592) (owner: 10Ebernhardson) [17:12:45] !log dancy@deploy1003 Installation of scap version "4.148.0" completed for 193 hosts [17:14:07] !log attempting to reproduce T389734 with enhanced logging on mwdebug1001 [17:14:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:14:12] T389734: Fatal exception of type "Wikimedia\RequestTimeout\EmergencyTimeoutException" or "Wikimedia\Rdbms\DBUnexpectedError" errors - https://phabricator.wikimedia.org/T389734 [17:15:07] (03PS1) 10Dzahn: Revert "create k8s-ingress-aux -ro and -rw discovery records, metafo/geodns" [dns] - 10https://gerrit.wikimedia.org/r/1132696 [17:15:25] 07Puppet, 06Infrastructure-Foundations: Improve the user experience adding 
new nodes to puppet - https://phabricator.wikimedia.org/T389932#10694961 (10jhathaway) One lower tech option, is to use multiple simple regexes, e.g. ` node /^sretest1002\.eqiad\./, /^sretest1004\.eqiad\./, /^sretest1006\.eqiad... [17:15:37] 07Puppet, 06Infrastructure-Foundations: Improve the user experience adding new nodes to puppet - https://phabricator.wikimedia.org/T389932#10694962 (10jhathaway) p:05Triage→03Low [17:15:39] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr2-codfw and Lumen (2001:1900:2100::4b41) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [17:15:40] I need to revert my DNS change.. [17:15:41] Exception: Command /usr/sbin/gdnsd -c /tmp/dns-check.osfjlyjb checkconf failed with exit code 42, stderr: [17:15:44] :/ [17:16:17] (03CR) 10Ssingh: [C:03+1] "[Context: The service IPs need to be created in Netbox before this can work.] This change is still required but that should happen first." [dns] - 10https://gerrit.wikimedia.org/r/1132696 (owner: 10Dzahn) [17:16:42] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudelastic1008.eqiad.wmnet with OS bullseye [17:16:49] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.22 - 2025.04.11): cloudelastic1008 stuck at boot screen after multiple reboots, SEL reports Comm Error: Backplane 0 - https://phabricator.wikimedia.org/T388150#10694970 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@c... 
[17:17:20] (03CR) 10Dzahn: [C:03+2] Revert "create k8s-ingress-aux -ro and -rw discovery records, metafo/geodns" [dns] - 10https://gerrit.wikimedia.org/r/1132696 (owner: 10Dzahn) [17:17:49] (03PS1) 10Dzahn: Revert^2 "create k8s-ingress-aux -ro and -rw discovery records, metafo/geodns" [dns] - 10https://gerrit.wikimedia.org/r/1132699 [17:18:02] !log dzahn@dns1004 START - running authdns-update [17:21:12] !log dzahn@dns1004 END - running authdns-update [17:22:13] DNS clean again [17:22:17] thanks! [17:30:29] (03CR) 10Dzahn: [C:03+2] create a namespace for codesearch on k8s-aux cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126175 (https://phabricator.wikimedia.org/T268199) (owner: 10Dzahn) [17:36:08] (03CR) 10BCornwall: [C:03+2] upgrade cp1106 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1131830 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [17:36:10] (03CR) 10BCornwall: [C:03+2] upgrade cp1107 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1131831 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [17:38:19] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on P{cp1105.eqiad.wmnet} and A:cp [17:38:20] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on P{cp1107.eqiad.wmnet} and A:cp [17:38:26] !log brett@cumin2002 END (ERROR) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=97) rolling upgrade of Varnish on P{cp1105.eqiad.wmnet} and A:cp [17:38:44] 06SRE, 10MediaWiki-User-login-and-signup, 07Wikimedia-Incident: "Invalid CSRF token" on any actions by registered users - https://phabricator.wikimedia.org/T390512#10695084 (10matmarex) Investigation into the root cause and avoiding repeats of this event is in progress at T390514. 
[17:39:21] !log brett@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp1105.eqiad.wmnet [17:39:41] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on P{cp1106.eqiad.wmnet} and A:cp [17:41:28] !log dzahn@deploy1003 helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. [17:43:05] !log dzahn@deploy1003 helmfile [aux-k8s-codfw] DONE helmfile.d/admin 'apply'. [17:43:29] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on P{cp1107.eqiad.wmnet} and A:cp [17:44:55] 06SRE, 10MediaWiki-User-login-and-signup, 07Wikimedia-Incident: "Invalid CSRF token" on any actions by registered users - https://phabricator.wikimedia.org/T390512#10695107 (10bd808) a:05MBH→03None [17:45:01] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on P{cp1106.eqiad.wmnet} and A:cp [17:46:07] !log dzahn@deploy1003 helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. [17:46:57] !log dzahn@deploy1003 helmfile [aux-k8s-codfw] DONE helmfile.d/admin 'apply'. [17:47:30] FIRING: Primary inbound port utilisation over 80% #page: Alert for device cr1-eqiad.wikimedia.org - Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page [17:49:31] FIRING: Primary outbound port utilisation over 80% #page: Alert for device asw2-a-eqiad.mgmt.eqiad.wmnet - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [17:50:37] looking... 
[17:50:42] note: the link in there does not work for me [17:51:39] (03PS1) 10Andrew Bogott: Remove openstacksdk auth patch from version Dalmatian [puppet] - 10https://gerrit.wikimedia.org/r/1132712 [17:51:39] (03PS1) 10Andrew Bogott: Glance: remove patch for glance image resize [puppet] - 10https://gerrit.wikimedia.org/r/1132713 [17:51:39] (03PS1) 10Andrew Bogott: Trove: remove a backported fix for dns integration [puppet] - 10https://gerrit.wikimedia.org/r/1132714 [17:51:49] !incidents [17:51:49] 5919 (ACKED) Primary inbound port utilisation over 80% (paged) network noc (cr1-eqiad.wikimedia.org) [17:51:49] 5920 (ACKED) Primary outbound port utilisation over 80% (paged) network noc (asw2-a-eqiad.mgmt.eqiad.wmnet) [17:51:49] 5918 (RESOLVED) NELHigh sre (thanos-rule tcp.timed_out) [17:51:50] 5917 (RESOLVED) [2x] SessionStoreErrorRateHigh data-persistence () [17:52:19] (03CR) 10CI reject: [V:04-1] Glance: remove patch for glance image resize [puppet] - 10https://gerrit.wikimedia.org/r/1132713 (owner: 10Andrew Bogott) [17:52:26] (03CR) 10CI reject: [V:04-1] Trove: remove a backported fix for dns integration [puppet] - 10https://gerrit.wikimedia.org/r/1132714 (owner: 10Andrew Bogott) [17:52:59] (03CR) 10Andrew Bogott: [C:03+2] Remove openstacksdk auth patch from version Dalmatian [puppet] - 10https://gerrit.wikimedia.org/r/1132712 (owner: 10Andrew Bogott) [17:54:22] (03CR) 10Kosta Harlan: EmailAuth: Prepare config for enabling in log-only mode (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1132408 (https://phabricator.wikimedia.org/T390437) (owner: 10Kosta Harlan) [17:54:31] RESOLVED: Primary outbound port utilisation over 80% #page: Device asw2-a-eqiad.mgmt.eqiad.wmnet recovered from Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [17:54:54] looked at CDN traffic dashboard, saw nothing unusual looking [17:55:03] (03PS2) 10Andrew Bogott: Glance: 
remove patch for glance image resize [puppet] - 10https://gerrit.wikimedia.org/r/1132713 [17:55:03] (03PS2) 10Andrew Bogott: Trove: remove a backported fix for dns integration [puppet] - 10https://gerrit.wikimedia.org/r/1132714 [17:55:46] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1132714 (owner: 10Andrew Bogott) [17:55:48] (03CR) 10Alexandros Kosiaris: [C:04-2] "Not ready for this." [puppet] - 10https://gerrit.wikimedia.org/r/1132691 (https://phabricator.wikimedia.org/T384944) (owner: 10Alexandros Kosiaris) [17:56:03] (03PS2) 10Alexandros Kosiaris: scap: Add 3 releases in mw-wikifunctions [puppet] - 10https://gerrit.wikimedia.org/r/1132690 (https://phabricator.wikimedia.org/T384944) [17:56:31] (03CR) 10CI reject: [V:04-1] scap: Add 3 releases in mw-wikifunctions [puppet] - 10https://gerrit.wikimedia.org/r/1132690 (https://phabricator.wikimedia.org/T384944) (owner: 10Alexandros Kosiaris) [17:57:30] RESOLVED: Primary inbound port utilisation over 80% #page: Device cr1-eqiad.wikimedia.org recovered from Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page [17:58:16] (03PS3) 10Andrew Bogott: Trove: remove a backported fix for dns integration [puppet] - 10https://gerrit.wikimedia.org/r/1132714 [17:58:32] (03CR) 10Andrew Bogott: [C:03+2] Glance: remove patch for glance image resize [puppet] - 10https://gerrit.wikimedia.org/r/1132713 (owner: 10Andrew Bogott) [17:58:38] (03PS3) 10Alexandros Kosiaris: scap: Add 3 releases in mw-wikifunctions [puppet] - 10https://gerrit.wikimedia.org/r/1132690 (https://phabricator.wikimedia.org/T384944) [18:00:00] (03CR) 10Andrew Bogott: [C:03+2] Trove: remove a backported fix for dns integration [puppet] - 10https://gerrit.wikimedia.org/r/1132714 (owner: 10Andrew Bogott) [18:06:57] 06SRE, 06MediaWiki-Platform-Team: Identify and remediate large increase in sessionstore Cassandra disk usage - 
https://phabricator.wikimedia.org/T390514#10695166 (10Quiddity) [18:09:09] (03PS2) 10Alexandros Kosiaris: wikifunctions: Add group{0,1,2} releases [deployment-charts] - 10https://gerrit.wikimedia.org/r/1132684 (https://phabricator.wikimedia.org/T384944) [18:14:09] jouncebot: nowandnext [18:14:10] No deployments scheduled for the next 1 hour(s) and 45 minute(s) [18:14:10] In 1 hour(s) and 45 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250331T2000) [18:14:32] got a bit delayed, but I will wrap up my debugging work on mwdebug1001 shortly [18:15:55] (03PS2) 10Reedy: CommonSettings.php: Reduce usage of wmgUseCentralAuth [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1132720 [18:16:38] (03CR) 10CI reject: [V:04-1] CommonSettings.php: Reduce usage of wmgUseCentralAuth [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1132720 (owner: 10Reedy) [18:17:28] (03PS3) 10Reedy: CommonSettings.php: Reduce usage of wmgUseCentralAuth [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1132720 [18:21:48] (03CR) 10BCornwall: [C:03+2] upgrade cp1108 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1131832 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [18:21:50] (03CR) 10BCornwall: [C:03+2] upgrade cp1109 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1131833 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [18:23:47] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on P{cp1108.eqiad.wmnet} and A:cp [18:23:48] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on P{cp1109.eqiad.wmnet} and A:cp [18:28:18] !log dzahn@deploy1003 helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. 
[18:28:36] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390535#10695240 (10phaultfinder) [18:28:41] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on P{cp1109.eqiad.wmnet} and A:cp [18:29:23] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on P{cp1108.eqiad.wmnet} and A:cp [18:29:36] !log dzahn@deploy1003 helmfile [aux-k8s-codfw] DONE helmfile.d/admin 'apply'. [18:31:21] 06SRE, 10SRE-Access-Requests: Remove production data access for NDA expired user mobrovac - https://phabricator.wikimedia.org/T388030#10695252 (10KFrancis) There isn't a valid NDA on file. Would you like a new one processed? [18:33:26] FIRING: [4x] ProbeDown: Service ripe-atlas-codfw:0 has failed probes (icmp_ripe_atlas_codfw_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:35:10] (03CR) 10Gergő Tisza: [C:03+1] EmailAuth: Prepare config for enabling in log-only mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1132408 (https://phabricator.wikimedia.org/T390437) (owner: 10Kosta Harlan) [18:37:02] FIRING: [2x] DatasourceNoData: - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData [18:38:07] 06SRE, 10SRE-Access-Requests: Remove production data access for NDA expired user mobrovac - https://phabricator.wikimedia.org/T388030#10695279 (10Dzahn) @acooper Want to connect directly with Katie to exchange where you got the information about the expired NDA and where we track it? This is a former staff me... 
[18:46:53] (03PS1) 10Ahmon Dancy: .gitmodules: Add extensions/EmailAuth [core] (wmf/1.44.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1132727 (https://phabricator.wikimedia.org/T390437) [18:47:21] 06SRE, 10SRE-Access-Requests: Remove production data access for NDA expired user mobrovac - https://phabricator.wikimedia.org/T388030#10695314 (10MoritzMuehlenhoff) >>! In T388030#10695252, @KFrancis wrote: > There isn't a valid NDA on file. Would you like a new one processed? It's present in line 26 of the... [18:49:12] (03PS1) 10Ahmon Dancy: extension-list: Add EmailAuth [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1132729 (https://phabricator.wikimedia.org/T390437) [18:50:55] jouncebot nowandnext [18:50:55] No deployments scheduled for the next 1 hour(s) and 9 minute(s) [18:50:56] In 1 hour(s) and 9 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250331T2000) [18:52:00] (03PS1) 10Ssingh: aptrepo: add component for ECH-enabled nginx [puppet] - 10https://gerrit.wikimedia.org/r/1132730 (https://phabricator.wikimedia.org/T205378) [18:54:06] (03CR) 10Dzahn: [C:03+2] "I am wondering if it's expected that I see the ceph-csi-rbd release here:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126175 (https://phabricator.wikimedia.org/T268199) (owner: 10Dzahn) [18:56:48] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1132730 (https://phabricator.wikimedia.org/T205378) (owner: 10Ssingh) [18:57:02] RESOLVED: [2x] DatasourceNoData: - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData [18:57:24] (03CR) 10Ssingh: [C:03+2] aptrepo: add component for ECH-enabled nginx [puppet] - 10https://gerrit.wikimedia.org/r/1132730 (https://phabricator.wikimedia.org/T205378) (owner: 10Ssingh) [19:00:36] 06SRE, 10SRE-Access-Requests: Remove production data access for NDA expired user mobrovac - https://phabricator.wikimedia.org/T388030#10695404 (10KFrancis) Yes, I saw it listed 
there, however the NDA box did not indicate if one was on file. I checked Coupa (which we transitioned to from Cobblestone) and no ND... [19:01:34] !log Deploying EmailAuth extension to wmf.22 for T390437 [19:01:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:01:40] T390437: Deploy Extension:EmailAuth - https://phabricator.wikimedia.org/T390437 [19:02:27] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dancy@deploy1003 using scap backport" [core] (wmf/1.44.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1132727 (https://phabricator.wikimedia.org/T390437) (owner: 10Ahmon Dancy) [19:02:27] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dancy@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1132729 (https://phabricator.wikimedia.org/T390437) (owner: 10Ahmon Dancy) [19:03:27] (03Merged) 10jenkins-bot: extension-list: Add EmailAuth [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1132729 (https://phabricator.wikimedia.org/T390437) (owner: 10Ahmon Dancy) [19:06:07] (03CR) 10BCornwall: [C:03+2] upgrade cp1110 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1131834 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [19:06:11] (03CR) 10BCornwall: [C:03+2] upgrade cp1111 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1131835 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [19:06:25] FIRING: SystemdUnitFailed: opensearch_1@relforge-eqiad-small-alpha.service\x2copensearch_1@relforge-eqiad.service on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:06:28] (03Merged) 10jenkins-bot: .gitmodules: Add extensions/EmailAuth [core] (wmf/1.44.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1132727 (https://phabricator.wikimedia.org/T390437) (owner: 10Ahmon Dancy) [19:06:42] !log dancy@deploy1003 
Started scap sync-world: Backport for [[gerrit:1132727|.gitmodules: Add extensions/EmailAuth (T390437)]], [[gerrit:1132729|extension-list: Add EmailAuth (T390437)]] [19:06:48] T390437: Deploy Extension:EmailAuth - https://phabricator.wikimedia.org/T390437 [19:08:08] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on P{cp1110.eqiad.wmnet} and A:cp [19:08:09] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on P{cp1111.eqiad.wmnet} and A:cp [19:11:44] (03CR) 10SBassett: [C:03+1] EmailAuth: Prepare config for enabling in log-only mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1132408 (https://phabricator.wikimedia.org/T390437) (owner: 10Kosta Harlan) [19:12:37] 06SRE-OnFire, 10Incident Tooling: Incident documents are less visible with Corto - https://phabricator.wikimedia.org/T390126#10695443 (10jhathaway) I think using the existing folder makes the most sense. [19:13:07] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on P{cp1111.eqiad.wmnet} and A:cp [19:13:22] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on P{cp1110.eqiad.wmnet} and A:cp [19:13:55] !log dzahn@deploy1003 helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. [19:14:12] !log dzahn@deploy1003 helmfile [aux-k8s-codfw] DONE helmfile.d/admin 'apply'. [19:14:51] (03CR) 10JHathaway: [C:03+1] corto: use #acl*security for new incidents [puppet] - 10https://gerrit.wikimedia.org/r/1131479 (https://phabricator.wikimedia.org/T389664) (owner: 10Eevans) [19:15:38] 06SRE-OnFire, 10Incident Tooling, 13Patch-For-Review: Reconsider default incident visibility - https://phabricator.wikimedia.org/T389664#10695444 (10jhathaway) +1 on changing to #acl_security [19:16:51] !log dzahn@deploy1003 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. 
[19:17:00] !log dzahn@deploy1003 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. [19:17:18] (03Abandoned) 10Kosta Harlan: extension-list: Add EmailAuth [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1132302 (https://phabricator.wikimedia.org/T390437) (owner: 10Kosta Harlan) [19:19:53] !log kamila@deploy1003 helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. [19:20:34] !log kamila@deploy1003 helmfile [aux-k8s-codfw] DONE helmfile.d/admin 'apply'. [19:20:59] (03PS1) 10Reedy: FileBackend: PHP Deprecated: strrpos(): Passing null to parameter #1 ($haystack) [core] (wmf/1.44.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1132737 (https://phabricator.wikimedia.org/T384851) [19:22:18] !log kamila@deploy1003 helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. [19:22:38] !log kamila@deploy1003 helmfile [aux-k8s-codfw] DONE helmfile.d/admin 'apply'. [19:23:10] (03CR) 10Dzahn: [C:03+2] "I deployed with "helmfile -e aux-k8s-codfw -l name=namespaces -l name=namespace-certificates -i apply" to skip the metrics and ceph releas" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126175 (https://phabricator.wikimedia.org/T268199) (owner: 10Dzahn) [19:24:21] (03CR) 10Dzahn: [C:03+1] corto: use #acl*security for new incidents [puppet] - 10https://gerrit.wikimedia.org/r/1131479 (https://phabricator.wikimedia.org/T389664) (owner: 10Eevans) [19:24:42] Gaah! Registry problems still [19:27:00] =/ [19:27:41] https://phabricator.wikimedia.org/T390251 has reared its ugly head again. Deployments are blocked in the meantime. 
[19:28:09] (03PS1) 10Ebernhardson: Set envoy keepalive's for search to match nginx [puppet] - 10https://gerrit.wikimedia.org/r/1132739 [19:30:33] (03PS2) 10Ebernhardson: Set envoy keepalive's for search to match nginx [puppet] - 10https://gerrit.wikimedia.org/r/1132739 (https://phabricator.wikimedia.org/T390612) [19:31:30] (03PS3) 10Ebernhardson: Set envoy keepalive's for search to match nginx [puppet] - 10https://gerrit.wikimedia.org/r/1132739 (https://phabricator.wikimedia.org/T390612) [19:34:02] cccccbukvgbctdhkuibvflltcriciruutbeitikhfejr [19:34:08] Exactly [19:34:12] sigh, you already know [19:34:15] (03CR) 10JHathaway: [C:03+1] Add service record for puppetserver2004 [dns] - 10https://gerrit.wikimedia.org/r/1130073 (https://phabricator.wikimedia.org/T381274) (owner: 10Muehlenhoff) [19:34:46] (03CR) 10Ebernhardson: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1132739 (https://phabricator.wikimedia.org/T390612) (owner: 10Ebernhardson) [19:44:26] (03CR) 10BCornwall: [C:03+2] upgrade cp1112 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1131836 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [19:44:29] (03CR) 10BCornwall: [C:03+2] upgrade cp1113 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1131837 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [19:45:07] !log dancy@deploy1003 Started scap sync-world: Backport for [[gerrit:1132727|.gitmodules: Add extensions/EmailAuth (T390437)]], [[gerrit:1132729|extension-list: Add EmailAuth (T390437)]] [19:45:12] T390437: Deploy Extension:EmailAuth - https://phabricator.wikimedia.org/T390437 [19:46:31] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on P{cp1112.eqiad.wmnet} and A:cp [19:46:33] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on P{cp1113.eqiad.wmnet} and A:cp [19:51:11] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish 
(exit_code=0) rolling upgrade of Varnish on P{cp1113.eqiad.wmnet} and A:cp [19:51:48] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on P{cp1112.eqiad.wmnet} and A:cp [19:53:11] !log dancy@deploy1003 dancy: Backport for [[gerrit:1132727|.gitmodules: Add extensions/EmailAuth (T390437)]], [[gerrit:1132729|extension-list: Add EmailAuth (T390437)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [19:53:16] T390437: Deploy Extension:EmailAuth - https://phabricator.wikimedia.org/T390437 [19:55:31] !log dancy@deploy1003 dancy: Continuing with sync [19:59:33] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 31 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1132408 (https://phabricator.wikimedia.org/T390437) (owner: 10Kosta Harlan) [19:59:49] (03PS1) 10Kosta Harlan: EmailAuth: Add EmailAuthRequireToken hook implementation [extensions/WikimediaEvents] (wmf/1.44.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1132745 (https://phabricator.wikimedia.org/T390437) [19:59:58] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 31 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [extensions/WikimediaEvents] (wmf/1.44.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1132745 (https://phabricator.wikimedia.org/T390437) (owner: 10Kosta Harlan) [20:00:05] RoanKattouw, Urbanecm, TheresNoTime, and kindrobot: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC late backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250331T2000). [20:00:05] Jdlrobson, bpirkle, Superpes, and kostajh: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. 
Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:20] I'm here [20:00:48] There's a deployment going on right now that is 32% complete in the sync-prod-k8s phase. [20:00:52] i'm here [20:00:56] 06SRE, 06Infrastructure-Foundations, 10netops, 10Data-Platform-SRE (2025.03.22 - 2025.04.11): Add QoS markings to profile Hadoop/HDFS analytics traffic - https://phabricator.wikimedia.org/T381389#10695612 (10xcollazo) Hello, I would like to exercise this rule by running a very heavy Presto query. Is t... [20:01:13] I'll be around for a while, no hurry [20:01:39] 50% [20:01:53] Once you see the announcement about scap being finished, feel free to take over! [20:02:34] o/ [20:03:03] 75% [20:04:33] Deployers: If you find scap sitting in the `sync-testservers-k8s` phase for longer than usual (e.g., more than 3 minutes), please shout about it! [20:04:45] It might be https://phabricator.wikimedia.org/T390251 [20:04:59] (03CR) 10BCornwall: [C:03+1] P:durum: add conditional to enable ECH [puppet] - 10https://gerrit.wikimedia.org/r/1132669 (https://phabricator.wikimedia.org/T205378) (owner: 10Ssingh) [20:05:43] dancy: ack [20:06:00] !log dancy@deploy1003 Finished scap sync-world: Backport for [[gerrit:1132727|.gitmodules: Add extensions/EmailAuth (T390437)]], [[gerrit:1132729|extension-list: Add EmailAuth (T390437)]] (duration: 20m 53s) [20:06:06] T390437: Deploy Extension:EmailAuth - https://phabricator.wikimedia.org/T390437 [20:06:08] (03CR) 10Dzahn: [C:03+1] P:durum: add conditional to enable ECH [puppet] - 10https://gerrit.wikimedia.org/r/1132669 (https://phabricator.wikimedia.org/T205378) (owner: 10Ssingh) [20:06:12] that sounds like our cue [20:06:25] Yep. Good luck! I'm stepping away from the keyboard for a break. [20:06:46] I can do my patches at the end [20:06:55] I also need a little break. Is there another deployer around? [20:07:41] dancy: was the last sync the l10n-rebuild one? or is that still needed? 
[20:07:51] (I can deploy) [20:10:15] l10n rebuild happened. [20:10:20] which of the patches don't need much testing? the throttle one and the REST one I guess? [20:10:24] thanks dancy! [20:10:40] REST one is very straightforward [20:10:44] and the emailauth hook since it doesn't do anything on its own [20:11:19] (03CR) 10Gergő Tisza: [C:03+2] EmailAuth: Add EmailAuthRequireToken hook implementation [extensions/WikimediaEvents] (wmf/1.44.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1132745 (https://phabricator.wikimedia.org/T390437) (owner: 10Kosta Harlan) [20:12:34] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1132645 (https://phabricator.wikimedia.org/T389407) (owner: 10BPirkle) [20:12:34] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1132197 (https://phabricator.wikimedia.org/T390290) (owner: 10Superpes15) [20:12:34] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy1003 using scap backport" [extensions/WikimediaEvents] (wmf/1.44.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1132745 (https://phabricator.wikimedia.org/T390437) (owner: 10Kosta Harlan) [20:12:41] (03Merged) 10jenkins-bot: EmailAuth: Add EmailAuthRequireToken hook implementation [extensions/WikimediaEvents] (wmf/1.44.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1132745 (https://phabricator.wikimedia.org/T390437) (owner: 10Kosta Harlan) [20:13:28] (03Merged) 10jenkins-bot: REST: enable Specs module on certain wikis, adjust Sandbox modules [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1132645 (https://phabricator.wikimedia.org/T389407) (owner: 10BPirkle) [20:13:30] (03Merged) 10jenkins-bot: Throttle exemption for Editathon at Universidad Nacional de La Plata - 9 April 2025 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1132197 (https://phabricator.wikimedia.org/T390290) (owner: 10Superpes15) [20:13:43] 
!log tgr@deploy1003 Started scap sync-world: Backport for [[gerrit:1132645|REST: enable Specs module on certain wikis, adjust Sandbox modules (T389407)]], [[gerrit:1132197|Throttle exemption for Editathon at Universidad Nacional de La Plata - 9 April 2025 (T390290)]], [[gerrit:1132745|EmailAuth: Add EmailAuthRequireToken hook implementation (T390437)]] [20:13:50] T389407: Release REST API Sandbox on 6 initial wikis - https://phabricator.wikimedia.org/T389407 [20:13:50] T390290: Lift IP cap on 2025-04-09 for edit-a-thon for eswiki, commons and wikidata - https://phabricator.wikimedia.org/T390290 [20:13:51] T390437: Deploy Extension:EmailAuth - https://phabricator.wikimedia.org/T390437 [20:18:23] !log tgr@deploy1003 superpes, bpirkle, tgr, kharlan: Backport for [[gerrit:1132645|REST: enable Specs module on certain wikis, adjust Sandbox modules (T389407)]], [[gerrit:1132197|Throttle exemption for Editathon at Universidad Nacional de La Plata - 9 April 2025 (T390290)]], [[gerrit:1132745|EmailAuth: Add EmailAuthRequireToken hook implementation (T390437)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:18:41] REST change looks good [20:19:37] !log tgr@deploy1003 superpes, bpirkle, tgr, kharlan: Continuing with sync [20:19:38] Throttle obviously doesn't need to be tested :) [20:19:38] (03CR) 10BCornwall: [C:03+2] upgrade cp1114 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1131838 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [20:19:39] (03CR) 10BCornwall: [C:03+2] upgrade cp1115 to Varnish 7.1 [puppet] - 10https://gerrit.wikimedia.org/r/1131839 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [20:20:59] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on P{cp1114.eqiad.wmnet} and A:cp [20:21:00] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on P{cp1115.eqiad.wmnet} and A:cp [20:25:21] 
FIRING: [7x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:xe-1/1/1:0 (Transport: cr4-ulsfo:xe-0/1/1 (Lumen, 442550294) {#12252_12295-1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [20:26:05] tgr_: would you be able to help me deploy my 2 changes as well? I can't find anyone from web to help with those today; we're a bit short-staffed [20:26:12] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on P{cp1115.eqiad.wmnet} and A:cp [20:26:33] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on P{cp1114.eqiad.wmnet} and A:cp [20:26:35] Jdlrobson: sure. Do they need to be deployed separately? [20:26:43] !log tgr@deploy1003 Finished scap sync-world: Backport for [[gerrit:1132645|REST: enable Specs module on certain wikis, adjust Sandbox modules (T389407)]], [[gerrit:1132197|Throttle exemption for Editathon at Universidad Nacional de La Plata - 9 April 2025 (T390290)]], [[gerrit:1132745|EmailAuth: Add EmailAuthRequireToken hook implementation (T390437)]] (duration: 12m 59s) [20:26:50] T389407: Release REST API Sandbox on 6 initial wikis - https://phabricator.wikimedia.org/T389407 [20:26:50] T390290: Lift IP cap on 2025-04-09 for edit-a-thon for eswiki, commons and wikidata - https://phabricator.wikimedia.org/T390290 [20:26:51] T390437: Deploy Extension:EmailAuth - https://phabricator.wikimedia.org/T390437 [20:26:54] tgr_: thank you for deploying [20:27:04] tgr_: they can go out at the same time [20:27:05] Thanks for your assistance tgr_ :) [20:27:42] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131483 (https://phabricator.wikimedia.org/T387155) (owner: 10Jdlrobson) [20:27:42] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy1003 using 
scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131484 (https://phabricator.wikimedia.org/T390112) (owner: 10Jdlrobson) [20:28:30] (03Merged) 10jenkins-bot: Deploy dark mode and Vector 2022 to German Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131483 (https://phabricator.wikimedia.org/T387155) (owner: 10Jdlrobson) [20:28:34] (03Merged) 10jenkins-bot: Enable Vector 2022 for Russian Wikimedia and arbcom_ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131484 (https://phabricator.wikimedia.org/T390112) (owner: 10Jdlrobson) [20:28:39] FIRING: [4x] CoreBGPDown: Core BGP session down between cr2-codfw and cr4-ulsfo (198.35.26.193) - group Confed_ulsfo - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [20:28:49] !log tgr@deploy1003 Started scap sync-world: Backport for [[gerrit:1131483|Deploy dark mode and Vector 2022 to German Wikipedia (T387155)]], [[gerrit:1131484|Enable Vector 2022 for Russian Wikimedia and arbcom_ruwiki (T390112)]] [20:28:54] T387155: Enable Vector 2022 and dark mode for German wikis (anonymous users) - https://phabricator.wikimedia.org/T387155 [20:28:55] T390112: Enable Vector 2022 for Russian Wikimedia and arbcom_ruwiki - https://phabricator.wikimedia.org/T390112 [20:33:33] !log tgr@deploy1003 jdlrobson, tgr: Backport for [[gerrit:1131483|Deploy dark mode and Vector 2022 to German Wikipedia (T387155)]], [[gerrit:1131484|Enable Vector 2022 for Russian Wikimedia and arbcom_ruwiki (T390112)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:38:48] tgr looks good in production. 
[20:39:21] !log tgr@deploy1003 jdlrobson, tgr: Continuing with sync [20:46:45] !log tgr@deploy1003 Finished scap sync-world: Backport for [[gerrit:1131483|Deploy dark mode and Vector 2022 to German Wikipedia (T387155)]], [[gerrit:1131484|Enable Vector 2022 for Russian Wikimedia and arbcom_ruwiki (T390112)]] (duration: 17m 56s) [20:46:51] T387155: Enable Vector 2022 and dark mode for German wikis (anonymous users) - https://phabricator.wikimedia.org/T387155 [20:46:51] T390112: Enable Vector 2022 for Russian Wikimedia and arbcom_ruwiki - https://phabricator.wikimedia.org/T390112 [20:47:26] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1132408 (https://phabricator.wikimedia.org/T390437) (owner: 10Kosta Harlan) [20:47:53] 10ops-ulsfo, 06SRE, 06DC-Ops: cp4047 flapped (host went down) - https://phabricator.wikimedia.org/T387238#10695805 (10RobH) > The NVME and tech was sent out Dell order 458888470 for Wednesday. Will update with tech info when becomes available. > > Will the technician be carrying the replacement PCIe... [20:48:14] (03Merged) 10jenkins-bot: EmailAuth: Prepare config for enabling in log-only mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1132408 (https://phabricator.wikimedia.org/T390437) (owner: 10Kosta Harlan) [20:48:21] thanks tgr_ [20:48:28] !log tgr@deploy1003 Started scap sync-world: Backport for [[gerrit:1132408|EmailAuth: Prepare config for enabling in log-only mode (T390437)]] [20:48:33] T390437: Deploy Extension:EmailAuth - https://phabricator.wikimedia.org/T390437 [20:53:10] !log tgr@deploy1003 tgr, kharlan: Backport for [[gerrit:1132408|EmailAuth: Prepare config for enabling in log-only mode (T390437)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:53:59] testing [20:56:05] (03CR) 10Scott French: [C:03+1] "Looks good!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1132690 (https://phabricator.wikimedia.org/T384944) (owner: 10Alexandros Kosiaris) [21:00:05] Reedy, sbassett, Maryum, and manfredi: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Weekly Security deployment window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250331T2100). [21:01:06] (03PS1) 10Kosta Harlan: EmailAuth: Enable info level logging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1132755 (https://phabricator.wikimedia.org/T390437) [21:01:59] !log tgr@deploy1003 Sync cancelled. [21:02:18] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1132755 (https://phabricator.wikimedia.org/T390437) (owner: 10Kosta Harlan) [21:02:57] (03PS1) 10Aleksandar Mastilovic: Upgrade the Gobblin JAR version to 1.0.6 [puppet] - 10https://gerrit.wikimedia.org/r/1132756 (https://phabricator.wikimedia.org/T390247) [21:03:20] (03Merged) 10jenkins-bot: EmailAuth: Enable info level logging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1132755 (https://phabricator.wikimedia.org/T390437) (owner: 10Kosta Harlan) [21:03:35] !log tgr@deploy1003 Started scap sync-world: Backport for [[gerrit:1132408|EmailAuth: Prepare config for enabling in log-only mode (T390437)]], [[gerrit:1132755|EmailAuth: Enable info level logging (T390437)]] [21:03:40] T390437: Deploy Extension:EmailAuth - https://phabricator.wikimedia.org/T390437 [21:05:17] (03CR) 10CI reject: [V:04-1] Upgrade the Gobblin JAR version to 1.0.6 [puppet] - 10https://gerrit.wikimedia.org/r/1132756 (https://phabricator.wikimedia.org/T390247) (owner: 10Aleksandar Mastilovic) [21:08:34] (03CR) 10Reedy: [C:03+2] FileBackend: PHP Deprecated: strrpos(): Passing null to parameter #1 ($haystack) [core] (wmf/1.44.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1132737 (https://phabricator.wikimedia.org/T384851) (owner: 10Reedy) [21:08:51] !log 
tgr@deploy1003 kharlan, tgr: Backport for [[gerrit:1132408|EmailAuth: Prepare config for enabling in log-only mode (T390437)]], [[gerrit:1132755|EmailAuth: Enable info level logging (T390437)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:08:56] T390437: Deploy Extension:EmailAuth - https://phabricator.wikimedia.org/T390437 [21:10:27] 06SRE, 06Infrastructure-Foundations, 10netops, 10Data-Platform-SRE (2025.03.22 - 2025.04.11): Add QoS markings to profile Hadoop/HDFS analytics traffic - https://phabricator.wikimedia.org/T381389#10695918 (10BTullis) >>! In T381389#10695612, @xcollazo wrote: > Hello, I would like to exercise this rule... [21:10:41] (03CR) 10Eevans: [C:03+2] corto: use #acl*security for new incidents [puppet] - 10https://gerrit.wikimedia.org/r/1131479 (https://phabricator.wikimedia.org/T389664) (owner: 10Eevans) [21:15:39] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr2-codfw and Lumen (2001:1900:2100::4b41) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [21:20:22] tgr_: lgtm [21:21:13] (03PS1) 10Ryan Kemper: cirrus: (WIP) support rename elastic->cirrussearch [puppet] - 10https://gerrit.wikimedia.org/r/1132758 (https://phabricator.wikimedia.org/T388610) [21:21:18] (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!" 
[puppet] - https://gerrit.wikimedia.org/r/1132637 (https://phabricator.wikimedia.org/T388641) (owner: Filippo Giunchedi)
[21:21:41] (PS1) Andrew Bogott: magnum: remove backported fix from version dalmation [puppet] - https://gerrit.wikimedia.org/r/1132759
[21:22:04] (CR) Ryan Kemper: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1132758 (https://phabricator.wikimedia.org/T388610) (owner: Ryan Kemper)
[21:22:25] (CR) Andrew Bogott: [C:+2] magnum: remove backported fix from version dalmation [puppet] - https://gerrit.wikimedia.org/r/1132759 (owner: Andrew Bogott)
[21:22:35] SRE-OnFire, Incident Tooling, Patch-For-Review: Reconsider default incident visibility - https://phabricator.wikimedia.org/T389664#10695977 (Eevans) Open→Resolved a:Eevans Updated.
[21:22:36] (Merged) jenkins-bot: FileBackend: PHP Deprecated: strrpos(): Passing null to parameter #1 ($haystack) [core] (wmf/1.44.0-wmf.22) - https://gerrit.wikimedia.org/r/1132737 (https://phabricator.wikimedia.org/T384851) (owner: Reedy)
[21:22:41] !log tgr@deploy1003 kharlan, tgr: Continuing with sync
[21:22:55] (CR) Andrea Denisse: [C:+1] "LGTM, thank you!" [puppet] - https://gerrit.wikimedia.org/r/1131302 (https://phabricator.wikimedia.org/T383232) (owner: Filippo Giunchedi)
[21:24:31] (PS2) Aleksandar Mastilovic: Upgrade the Gobblin JAR version to 1.0.6 [puppet] - https://gerrit.wikimedia.org/r/1132756 (https://phabricator.wikimedia.org/T390247)
[21:26:19] (PS2) Ryan Kemper: cirrus: (WIP) support rename elastic->cirrussearch [puppet] - https://gerrit.wikimedia.org/r/1132758 (https://phabricator.wikimedia.org/T388610)
[21:26:34] (CR) Ryan Kemper: "PS2 is intentionally broken to sanity check some PCC stuff" [puppet] - https://gerrit.wikimedia.org/r/1132758 (https://phabricator.wikimedia.org/T388610) (owner: Ryan Kemper)
[21:26:38] (CR) Ryan Kemper: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1132758 (https://phabricator.wikimedia.org/T388610) (owner: Ryan Kemper)
[21:27:31] (CR) Ahmon Dancy: [C:+1] "If Scott's good with it, I am too." [puppet] - https://gerrit.wikimedia.org/r/1132690 (https://phabricator.wikimedia.org/T384944) (owner: Alexandros Kosiaris)
[21:28:30] (PS1) Gergő Tisza: Add EmailAuth provider to local domain exclusion list [extensions/CentralAuth] (wmf/1.44.0-wmf.22) - https://gerrit.wikimedia.org/r/1132763 (https://phabricator.wikimedia.org/T390437)
[21:28:53] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, April 01 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [extensions/CentralAuth] (wmf/1.44.0-wmf.22) - https://gerrit.wikimedia.org/r/1132763 (https://phabricator.wikimedia.org/T390437) (owner: Gergő Tisza)
[21:29:43] !log tgr@deploy1003 Finished scap sync-world: Backport for [[gerrit:1132408|EmailAuth: Prepare config for enabling in log-only mode (T390437)]], [[gerrit:1132755|EmailAuth: Enable info level logging (T390437)]] (duration: 26m 08s)
[21:29:48] T390437: Deploy Extension:EmailAuth - https://phabricator.wikimedia.org/T390437
[21:30:51] (CR) TrainBranchBot: [C:+2] "Approved by tgr@deploy1003 using scap backport" [extensions/CentralAuth] (wmf/1.44.0-wmf.22) - https://gerrit.wikimedia.org/r/1132763 (https://phabricator.wikimedia.org/T390437) (owner: Gergő Tisza)
[21:31:23] (PS1) Eevans: Revert "corto: use #acl*security for new incidents" [puppet] - https://gerrit.wikimedia.org/r/1132764
[21:32:40] (PS2) Eevans: Revert "corto: use #acl*security for new incidents" [puppet] - https://gerrit.wikimedia.org/r/1132764
[21:33:06] (PS3) Eevans: Revert "corto: use #acl*security for new incidents" [puppet] - https://gerrit.wikimedia.org/r/1132764
[21:34:29] (CR) Eevans: [C:+2] Revert "corto: use #acl*security for new incidents" [puppet] - https://gerrit.wikimedia.org/r/1132764 (owner: Eevans)
[21:36:27] (PS3) Ryan Kemper: cirrus: (WIP) support rename elastic->cirrussearch [puppet] - https://gerrit.wikimedia.org/r/1132758 (https://phabricator.wikimedia.org/T388610)
[21:36:29] SRE-OnFire, Incident Tooling: Reconsider default incident visibility - https://phabricator.wikimedia.org/T389664#10696041 (Eevans) Resolved→Open Change reverted because the bot needs to be part of the #acl_security project in order to create new issues.
[21:36:36] (CR) Ryan Kemper: "Nevermind I'd forgotten to amend my commit, PS3 is the intentionally broken one :)" [puppet] - https://gerrit.wikimedia.org/r/1132758 (https://phabricator.wikimedia.org/T388610) (owner: Ryan Kemper)
[21:37:46] (CR) Ryan Kemper: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1132758 (https://phabricator.wikimedia.org/T388610) (owner: Ryan Kemper)
[21:38:57] (PS1) BCornwall: varnish: Remove support for below version 7 [puppet] - https://gerrit.wikimedia.org/r/1132765 (https://phabricator.wikimedia.org/T378737)
[21:40:09] (Merged) jenkins-bot: Add EmailAuth provider to local domain exclusion list [extensions/CentralAuth] (wmf/1.44.0-wmf.22) - https://gerrit.wikimedia.org/r/1132763 (https://phabricator.wikimedia.org/T390437) (owner: Gergő Tisza)
[21:40:34] (CR) BCornwall: [V:+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5179/console" [puppet] - https://gerrit.wikimedia.org/r/1132765 (https://phabricator.wikimedia.org/T378737) (owner: BCornwall)
[21:43:10] (CR) BCornwall: [V:+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5181/co" [puppet] - https://gerrit.wikimedia.org/r/1132765 (https://phabricator.wikimedia.org/T378737) (owner: BCornwall)
[21:43:50] (CR) JHathaway: [C:+1] community_crm: Add trusted_host_patterns to settings template [puppet] - https://gerrit.wikimedia.org/r/1123711 (https://phabricator.wikimedia.org/T386267) (owner: Dwisehaupt)
[21:44:26] Reedy: are you backporting https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1086000 ?
[21:44:45] It's already merged
[21:44:50] https://gerrit.wikimedia.org/r/1132737
[21:44:56] yeah that's why I'm asking
[21:45:09] scap was complaining about unexpected commits
[21:45:09] SRE-OnFire, Incident Tooling: Reconsider default incident visibility - https://phabricator.wikimedia.org/T389664#10696071 (Eevans) >>! In T389664#10696041, @Eevans wrote: > Change reverted because the bot needs to be part of the #acl_security project in order to create new issues. Opened: {T390627}
[21:45:15] I was planning on getting it deployed to clear quite a bit of logspam
[21:45:18] and it has a -1 which isn't super reassuring
[21:45:33] ok, as long as it was intentional
[21:45:52] the -1 was very much from a "why are you making this patch"
[21:45:55] !log tgr@deploy1003 Started scap sync-world: Backport for [[gerrit:1132763|Add EmailAuth provider to local domain exclusion list (T390437)]]
[21:46:00] T390437: Deploy Extension:EmailAuth - https://phabricator.wikimedia.org/T390437
[21:46:15] That became apparent a while later, when we started seeing the same logspam
[21:48:27] (PS1) Btullis: presto: Double the heap size for the coordinator [puppet] - https://gerrit.wikimedia.org/r/1132769 (https://phabricator.wikimedia.org/T390623)
[21:49:43] (CR) Btullis: [V:+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5182/co" [puppet] - https://gerrit.wikimedia.org/r/1132769 (https://phabricator.wikimedia.org/T390623) (owner: Btullis)
[21:51:14] !log tgr@deploy1003 tgr: Backport for [[gerrit:1132763|Add EmailAuth provider to local domain exclusion list (T390437)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[21:51:19] T390437: Deploy Extension:EmailAuth - https://phabricator.wikimedia.org/T390437
[21:54:34] !log tgr@deploy1003 tgr: Continuing with sync
[21:54:37] ops-eqiad, SRE, DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390535#10696119 (phaultfinder)
[21:58:25] (PS1) Ryan Kemper: cirrus: test rename of single host elastic2055 [puppet] - https://gerrit.wikimedia.org/r/1132772 (https://phabricator.wikimedia.org/T388610)
[21:58:57] FIRING: ProbeDown: Service shellbox-video:4080 has failed probes (http_shellbox-video_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#shellbox-video:4080 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[21:59:47] (PS2) Ryan Kemper: cirrus: test rename of single host elastic2055 [puppet] - https://gerrit.wikimedia.org/r/1132772 (https://phabricator.wikimedia.org/T388610)
[22:00:10] (CR) CI reject: [V:-1] cirrus: test rename of single host elastic2055 [puppet] - https://gerrit.wikimedia.org/r/1132772 (https://phabricator.wikimedia.org/T388610) (owner: Ryan Kemper)
[22:01:06] (PS3) Ryan Kemper: cirrus: test rename of single host elastic2055 [puppet] - https://gerrit.wikimedia.org/r/1132772 (https://phabricator.wikimedia.org/T388610)
[22:01:33] !log tgr@deploy1003 Finished scap sync-world: Backport for [[gerrit:1132763|Add EmailAuth provider to local domain exclusion list (T390437)]] (duration: 15m 37s)
[22:01:34] re: shellbox-video. the graphs says this was very short lived
[22:01:38] T390437: Deploy Extension:EmailAuth - https://phabricator.wikimedia.org/T390437
[22:01:40] !incidents
[22:01:40] You're not allowed to perform this action.
[22:01:47] oh noez :D
[22:01:54] mutante: jhathaway: that ProbeDown probably means we have a transcode backlog and just deployed a bunch
[22:01:55] (CR) Ryan Kemper: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1132772 (https://phabricator.wikimedia.org/T388610) (owner: Ryan Kemper)
[22:01:57] (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1132772 (https://phabricator.wikimedia.org/T388610) (owner: Ryan Kemper)
[22:02:16] !log UTC late deploys done
[22:02:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:02:25] swfrench-wmf: okay, anything actionable?
[22:02:33] swfrench-wmf: thank you, the graph looks like it is already back to normal
[22:02:57] if it doesn't self-resolve, or comes back, we can throw capacity at the problem
[22:03:08] should slowly drain off now that the deployments are done
[22:03:22] I see the deploy just ended. ack
[22:03:23] https://grafana.wikimedia.org/goto/nQOqXwoNR?orgId=1
[22:03:26] FIRING: [5x] ProbeDown: Service ripe-atlas-codfw:0 has failed probes (icmp_ripe_atlas_codfw_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:03:57] RESOLVED: ProbeDown: Service shellbox-video:4080 has failed probes (http_shellbox-video_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#shellbox-video:4080 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:03:57] that grafana link shows the number of available pods (pods go unavailable while transcoding, which is intentional)
[22:04:02] gives it 5 minutes.. heh
[22:04:19] so, I _think_ this is the first time we've seen this happen :)
[22:05:02] it's been a theoretical risk for a while, since a series of back-to-back deployments transiently increases the number of concurrent transcodes
[22:08:34] anyway, I'll keep an eye out and can upsize if needed
[22:09:32] thank you swfrench-wmf
[22:13:48] !log reedy@deploy1003 Synchronized php-1.44.0-wmf.22/includes/libs/filebackend/FileBackend.php: T384851 (duration: 02m 14s)
[22:13:53] T384851: PHP Deprecated: strrpos(): Passing null to parameter #1 ($haystack) of type string is deprecated - https://phabricator.wikimedia.org/T384851
[22:21:30] ops-eqiad, DC-Ops, Data-Platform-SRE (2025.03.22 - 2025.04.11): Upgrade an-worker hard drives from 4TB to 8TB (group 1 - rack F7) - https://phabricator.wikimedia.org/T390168#10696151 (BTullis) a:Jclark-ctr→BTullis Thanks @Jclark-ctr I have added the new logical drives with: ` sudo perccli64...
[22:21:41] ops-eqiad, DC-Ops, Data-Platform-SRE (2025.03.22 - 2025.04.11): Upgrade an-worker hard drives from 4TB to 8TB (group 1 - rack F7) - https://phabricator.wikimedia.org/T390168#10696153 (BTullis)
[22:25:24] !log btullis@cumin1002 START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1172.eqiad.wmnet
[22:27:08] !log btullis@cumin1002 END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1172.eqiad.wmnet
[22:31:58] SRE, Data-Persistence: Alert disk space utilization on sessionstore nodes is too high - https://phabricator.wikimedia.org/T390630 (Scott_French) NEW
[22:32:28] ops-eqiad, DC-Ops, Data-Platform-SRE (2025.03.22 - 2025.04.11): Upgrade an-worker hard drives from 4TB to 8TB (group 1 - rack F7) - https://phabricator.wikimedia.org/T390168#10696183 (BTullis) I executed the following: ` btullis@cumin1002:~$ sudo cookbook sre.hadoop.init-hadoop-workers --skip-disks 0...
[22:33:56] !log btullis@cumin1002 START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker[1173-1174].eqiad.wmnet
[22:35:13] (PS1) Scott French: sessionstore-resources: add SessionStoreDiskSpaceUtilizationTooHigh [alerts] - https://gerrit.wikimedia.org/r/1132775 (https://phabricator.wikimedia.org/T390630)
[22:35:13] (CR) Scott French: "I'd propose that we start with something simple like this - i.e., a basic utilization threshold that we do not expect to cross in normal o" [alerts] - https://gerrit.wikimedia.org/r/1132775 (https://phabricator.wikimedia.org/T390630) (owner: Scott French)
[22:35:56] !log btullis@cumin1002 END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker[1173-1174].eqiad.wmnet
[22:38:05] !log btullis@cumin1002 START - Cookbook sre.hosts.reboot-single for host an-worker1174.eqiad.wmnet
[22:38:22] SRE, Data-Persistence, Patch-For-Review: Alert when disk space utilization on sessionstore nodes is too high - https://phabricator.wikimedia.org/T390630#10696195 (Scott_French)
[22:38:22] ops-eqiad, DC-Ops, Data-Platform-SRE (2025.03.22 - 2025.04.11): Upgrade an-worker hard drives from 4TB to 8TB (group 1 - rack F7) - https://phabricator.wikimedia.org/T390168#10696196 (ops-monitoring-bot) Host rebooted by btullis@cumin1002 with reason: Reboot post HDD replacement
[22:42:02] FIRING: [2x] DatasourceNoData: - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData
[22:49:14] !log btullis@cumin1002 END (ERROR) - Cookbook sre.hosts.reboot-single (exit_code=97) for host an-worker1174.eqiad.wmnet
[22:50:58] (PS1) Btullis: Revert "Exclude an-worker group1 for hard drive replacement" [puppet] - https://gerrit.wikimedia.org/r/1132777
[22:51:57] (CR) Btullis: [C:+2] Revert "Exclude an-worker group1 for hard drive replacement" [puppet] - https://gerrit.wikimedia.org/r/1132777 (owner: Btullis)
[22:57:37] !log btullis@cumin1002 START - Cookbook sre.hosts.reboot-single for host an-worker1172.eqiad.wmnet
[22:57:58] ops-eqiad, DC-Ops, Data-Platform-SRE (2025.03.22 - 2025.04.11): Upgrade an-worker hard drives from 4TB to 8TB (group 1 - rack F7) - https://phabricator.wikimedia.org/T390168#10696360 (ops-monitoring-bot) Host rebooted by btullis@cumin1002 with reason: Reboot post HDD replacement
[23:00:04] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250331T2300)
[23:02:02] RESOLVED: [2x] DatasourceNoData: - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData
[23:05:12] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1172.eqiad.wmnet
[23:06:07] !log btullis@cumin1002 START - Cookbook sre.hosts.reboot-single for host an-worker1173.eqiad.wmnet
[23:06:25] FIRING: SystemdUnitFailed: opensearch_1@relforge-eqiad-small-alpha.service\x2copensearch_1@relforge-eqiad.service on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:06:27] ops-eqiad, DC-Ops, Data-Platform-SRE (2025.03.22 - 2025.04.11): Upgrade an-worker hard drives from 4TB to 8TB (group 1 - rack F7) - https://phabricator.wikimedia.org/T390168#10696370 (ops-monitoring-bot) Host rebooted by btullis@cumin1002 with reason: Reboot post HDD replacement
[23:07:02] ops-eqiad, DC-Ops, Data-Platform-SRE (2025.03.22 - 2025.04.11): Upgrade an-worker hard drives from 4TB to 8TB (group 1 - rack F7) - https://phabricator.wikimedia.org/T390168#10696375 (BTullis) Open→Resolved
[23:10:25] ops-eqiad, DC-Ops, Data-Platform-SRE (2025.03.22 - 2025.04.11): Upgrade an-worker hard drives from 4TB to 8TB (group 1 - rack F7) - https://phabricator.wikimedia.org/T390168#10696390 (BTullis) All looking good so far. We can proceed to the next group. {F58954950,width=50%} There are a couple of t...
[23:13:44] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1173.eqiad.wmnet
[23:25:21] FIRING: [7x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:xe-1/1/1:0 (Transport: cr4-ulsfo:xe-0/1/1 (Lumen, 442550294) {#12252_12295-1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[23:28:39] FIRING: [4x] CoreBGPDown: Core BGP session down between cr2-codfw and cr4-ulsfo (198.35.26.193) - group Confed_ulsfo - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[23:29:35] ops-eqiad, SRE, DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390535#10696397 (phaultfinder)
[23:30:21] FIRING: [7x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:xe-1/1/1:0 (Transport: cr4-ulsfo:xe-0/1/1 (Lumen, 442550294) {#12252_12295-1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[23:30:39] RESOLVED: [2x] TransitBGPDown: Transit BGP session down between cr2-codfw and Lumen (2001:1900:2100::4b41) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[23:34:39] SRE-OnFire, Release-Engineering-Team, serviceops, Sustainability (Incident Followup): Should scap be able to update helmfile-defaults when -Dbuild_mw_container_image:False ? - https://phabricator.wikimedia.org/T390531#10696405 (Scott_French) A couple of thoughts: I think it would make a lot of s...
[23:38:43] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1132778
[23:38:43] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1132778 (owner: TrainBranchBot)
[23:49:34] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1132778 (owner: TrainBranchBot)
[23:49:37] ops-eqiad, SRE, DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390535#10696432 (phaultfinder)