[00:19:14] (03PS1) 10Dzahn: zuul: specify charset=utf8mb4 in database connection config [puppet] - 10https://gerrit.wikimedia.org/r/1261670 (https://phabricator.wikimedia.org/T395938) [00:19:37] (03CR) 10CI reject: [V:04-1] zuul: specify charset=utf8mb4 in database connection config [puppet] - 10https://gerrit.wikimedia.org/r/1261670 (https://phabricator.wikimedia.org/T395938) (owner: 10Dzahn) [00:20:46] (03PS2) 10Dzahn: zuul: specify charset=utf8mb4 in database connection config [puppet] - 10https://gerrit.wikimedia.org/r/1261670 (https://phabricator.wikimedia.org/T395938) [00:22:00] (03PS3) 10Dzahn: zuul: specify charset=utf8mb4 in database connection config [puppet] - 10https://gerrit.wikimedia.org/r/1261670 (https://phabricator.wikimedia.org/T395938) [00:22:16] (03PS4) 10Dzahn: zuul: specify charset=utf8mb4 in database connection config [puppet] - 10https://gerrit.wikimedia.org/r/1261670 (https://phabricator.wikimedia.org/T395938) [00:31:57] (03CR) 10Dzahn: [C:03+2] zuul: specify charset=utf8mb4 in database connection config [puppet] - 10https://gerrit.wikimedia.org/r/1261670 (https://phabricator.wikimedia.org/T395938) (owner: 10Dzahn) [00:42:32] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1261686 [00:42:32] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1261686 (owner: 10TrainBranchBot) [00:55:55] (03PS1) 10Dzahn: zuul: break out mTLS setup into separate class [puppet] - 10https://gerrit.wikimedia.org/r/1261690 (https://phabricator.wikimedia.org/T421398) [00:56:30] (03PS2) 10Dzahn: zuul: break out mTLS setup into separate class [puppet] - 10https://gerrit.wikimedia.org/r/1261690 (https://phabricator.wikimedia.org/T421398) [00:56:46] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1261686 (owner: 10TrainBranchBot) [01:00:34] (03PS3) 10Dzahn: zuul: break out mTLS setup into separate class [puppet] - 10https://gerrit.wikimedia.org/r/1261690 (https://phabricator.wikimedia.org/T421398) [01:06:39] (03PS1) 10Dzahn: zuul: add fake TLS passwords under renamed class name [labs/private] - 10https://gerrit.wikimedia.org/r/1261693 [01:07:07] (03CR) 10Dzahn: [V:03+2 C:03+2] zuul: add fake TLS passwords under renamed class name [labs/private] - 10https://gerrit.wikimedia.org/r/1261693 (owner: 10Dzahn) [01:10:54] (03PS1) 10AKhatun: stream: mediawiki.page_edit_type_simple.dev1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1261695 (https://phabricator.wikimedia.org/T421005) [01:11:59] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1261696 [01:11:59] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1261696 (owner: 10TrainBranchBot) [01:12:22] !log kevinbazira@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' . [01:24:32] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1261696 (owner: 10TrainBranchBot) [01:27:53] (03CR) 10AKhatun: stream: mw-page-html-content-change-enrich-next (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1261475 (https://phabricator.wikimedia.org/T421341) (owner: 10JavierMonton) [01:29:11] (03PS1) 10Dzahn: zuul: have 2 separate configs for main vs executor [puppet] - 10https://gerrit.wikimedia.org/r/1261701 [01:29:53] !log dzahn@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 14 days, 0:00:00 on zuul2002.codfw.wmnet with reason: T421330 [01:29:58] T421330: SystemdUnitFailed - zuul-scheduler - https://phabricator.wikimedia.org/T421330 [01:30:03] !log dzahn@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 14 days, 0:00:00 on zuul2001.codfw.wmnet with reason: T421330 [01:30:21] !log dzahn@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 14 days, 0:00:00 on zuul1001.eqiad.wmnet with reason: T421330 [01:30:32] !log dzahn@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 14 days, 0:00:00 on zuul1002.eqiad.wmnet with reason: T421330 [01:36:59] (03PS1) 10AKhatun: stream: mediawiki.page_edit_type_simple.dev1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1261706 (https://phabricator.wikimedia.org/T421005) [02:01:16] !log mwpresync@deploy1003 Started scap build-images: Publishing wmf/next image [02:07:24] !log mwpresync@deploy1003 Finished scap build-images: Publishing wmf/next image (duration: 06m 07s) [02:09:12] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:12:12] !log kevinbazira@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' . [02:14:18] (03PS1) 10Aaron Schulz: Move all analytics API sandbox entries to metawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1261732 (https://phabricator.wikimedia.org/T419429) [02:28:15] FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [02:32:07] !log kevinbazira@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' . [02:33:15] RESOLVED: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [02:34:12] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:39:12] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:06:39] !log kevinbazira@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' . [03:35:25] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:58:54] FIRING: TransitBGPDown: Transit BGP session down between cr2-esams and Hurricane Electric (2001:7f8:13::a500:6939:1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=esams&var-device=cr2-esams:9804&var-bgp_group=Transit6&var-bgp_neighbor=Hurricane+Electric - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPD [04:03:39] RESOLVED: TransitBGPDown: Transit BGP session down between cr2-esams and Hurricane Electric (2001:7f8:13::a500:6939:1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=esams&var-device=cr2-esams:9804&var-bgp_group=Transit6&var-bgp_neighbor=Hurricane+Electric - https://alerts.wikimedia.org/?q=alertname%3DTransitBG [04:40:39] FIRING: TransitBGPDown: Transit BGP session down between cr1-eqiad and Hurricane Electric (2001:504:0:2::6939:1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=eqiad&var-device=cr1-eqiad:9804&var-bgp_group=Transit6&var-bgp_neighbor=Hurricane+Electric - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [04:45:39] FIRING: [4x] TransitBGPDown: Transit BGP session down between cr1-eqiad and Hurricane Electric (2001:504:0:2::6939:1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [05:10:00] (03CR) 10Marostegui: [C:03+1] "Thank you for catching this!" [dns] - 10https://gerrit.wikimedia.org/r/1260132 (https://phabricator.wikimedia.org/T387332) (owner: 10Jasmine) [05:12:10] (03PS3) 10Ryan Kemper: Add sre.hadoop.reboot-coordinators cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1261271 (https://phabricator.wikimedia.org/T421285) [05:30:39] RESOLVED: [4x] TransitBGPDown: Transit BGP session down between cr1-eqiad and Hurricane Electric (2001:504:0:2::6939:1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [06:00:04] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260327T0600) [06:27:43] (03PS1) 10Arnaudb: gerrit: tweak downstream_idle_timeout on Envoy [puppet] - 10https://gerrit.wikimedia.org/r/1261932 (https://phabricator.wikimedia.org/T420909) [06:28:03] (03CR) 10Arnaudb: [C:03+2] gerrit: tweak downstream_idle_timeout on Envoy [puppet] - 10https://gerrit.wikimedia.org/r/1261932 (https://phabricator.wikimedia.org/T420909) (owner: 10Arnaudb) [06:32:51] (03PS1) 10Arnaudb: gerrit: Envoy downstream timeout fix [puppet] - 10https://gerrit.wikimedia.org/r/1261933 (https://phabricator.wikimedia.org/T420909) [07:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260327T0700) [07:04:12] (03PS1) 10Arnaudb: gerrit: tweak envoy::idle_timeout [puppet] - 10https://gerrit.wikimedia.org/r/1261937 (https://phabricator.wikimedia.org/T420909) [07:13:51] (03CR) 10Joal: "I think this change was not needed (see comment inline)." [puppet] - 10https://gerrit.wikimedia.org/r/1261504 (https://phabricator.wikimedia.org/T420008) (owner: 10Eevans) [07:21:15] (03CR) 10Slyngshede: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1261481 (https://phabricator.wikimedia.org/T417213) (owner: 10Btullis) [07:35:25] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:40:34] (03PS1) 10Hashar: gerrit: align ATS/Envoy/Apache timeouts [puppet] - 10https://gerrit.wikimedia.org/r/1261957 (https://phabricator.wikimedia.org/T420909) [07:43:45] (03Abandoned) 10Filippo Giunchedi: wmcs: update Filippo's root key [puppet] - 10https://gerrit.wikimedia.org/r/1189209 (owner: 10Filippo Giunchedi) [07:44:47] (03CR) 10Arnaudb: [C:03+2] "lets try these values" [puppet] - 10https://gerrit.wikimedia.org/r/1261957 (https://phabricator.wikimedia.org/T420909) (owner: 10Hashar) [07:46:18] !log elukey@deploy1003 helmfile [eqiad] START helmfile.d/services/kartotherian: sync [07:49:19] (03PS1) 10Arnaudb: Revert "gerrit: align ATS/Envoy/Apache timeouts" [puppet] - 10https://gerrit.wikimedia.org/r/1261961 [07:54:21] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [07:55:22] (03PS1) 10Elukey: admin_ng: bump kartotherian's resourcequota limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1261965 (https://phabricator.wikimedia.org/T421350) [07:56:13] (03PS1) 10Brouberol: deployment_server: define the turnilo kubeconfig in dse-k8s-eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1261966 (https://phabricator.wikimedia.org/T416119) [07:59:06] (03PS1) 10Brouberol: trafficerver: redirect turnilo.w.o to dse-k8s-eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1261968 (https://phabricator.wikimedia.org/T416125) [07:59:15] (03PS1) 10Brouberol: Define the turnilo namespace in dse-k8s-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1261969 (https://phabricator.wikimedia.org/T416120) [07:59:17] (03PS1) 10Brouberol: Define the turnilo helmfile and values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1261970 (https://phabricator.wikimedia.org/T416121) [08:00:31] (03PS2) 10Brouberol: trafficserver: redirect turnilo.w.o to dse-k8s-eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1261968 (https://phabricator.wikimedia.org/T416125) [08:02:02] !log elukey@deploy1003 helmfile [eqiad] DONE helmfile.d/services/kartotherian: sync [08:04:20] !log elukey@deploy1003 helmfile [eqiad] START helmfile.d/services/kartotherian: sync [08:04:21] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [08:05:11] !log elukey@deploy1003 helmfile [eqiad] DONE helmfile.d/services/kartotherian: sync [08:06:55] (03PS2) 10Elukey: admin_ng: bump kartotherian's resourcequota limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1261965 (https://phabricator.wikimedia.org/T421350) [08:07:32] (03PS1) 10Brouberol: Remove kafka.roll-restart-mirror-maker cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1261971 (https://phabricator.wikimedia.org/T417407) [08:15:06] (03CR) 10Joal: [C:03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/1261966 (https://phabricator.wikimedia.org/T416119) (owner: 10Brouberol) [08:15:27] (03CR) 10Joal: [C:03+1] "LGTM!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1261969 (https://phabricator.wikimedia.org/T416120) (owner: 10Brouberol) [08:16:41] (03CR) 10Brouberol: [C:03+2] deployment_server: define the turnilo kubeconfig in dse-k8s-eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1261966 (https://phabricator.wikimedia.org/T416119) (owner: 10Brouberol) [08:16:56] (03CR) 10Joal: [C:03+1] "LGTM!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1261970 (https://phabricator.wikimedia.org/T416121) (owner: 10Brouberol) [08:17:29] (03CR) 10Joal: [C:03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/1261968 (https://phabricator.wikimedia.org/T416125) (owner: 10Brouberol) [08:18:35] (03CR) 10Brouberol: [C:03+2] Define the turnilo namespace in dse-k8s-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1261969 (https://phabricator.wikimedia.org/T416120) (owner: 10Brouberol) [08:18:38] (03CR) 10Brouberol: [C:03+2] Define the turnilo helmfile and values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1261970 (https://phabricator.wikimedia.org/T416121) (owner: 10Brouberol) [08:26:42] (03Merged) 10jenkins-bot: Define the turnilo namespace in dse-k8s-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1261969 (https://phabricator.wikimedia.org/T416120) (owner: 10Brouberol) [08:27:09] (03Merged) 10jenkins-bot: Define the turnilo helmfile and values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1261970 (https://phabricator.wikimedia.org/T416121) (owner: 10Brouberol) [08:28:45] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [08:28:47] (03CR) 10Elukey: [C:03+2] "Manually tested and applied in eqiad to unblock deployments." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1261965 (https://phabricator.wikimedia.org/T421350) (owner: 10Elukey) [08:30:04] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [08:31:34] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/turnilo: apply [08:32:03] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/turnilo: apply [08:40:19] (03CR) 10Brouberol: [C:03+2] trafficserver: redirect turnilo.w.o to dse-k8s-eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1261968 (https://phabricator.wikimedia.org/T416125) (owner: 10Brouberol) [08:41:53] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-03-06 - 2026-03-27): Configure dse-k8s-worker10[20-23] - https://phabricator.wikimedia.org/T421465 (10Gehel) 03NEW [08:44:35] (03PS6) 10Dpogorzelski: ml-serve: add modified kserve 0.17 chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1261460 (https://phabricator.wikimedia.org/T419722) [08:45:59] (03CR) 10CI reject: [V:04-1] ml-serve: add modified kserve 0.17 chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1261460 (https://phabricator.wikimedia.org/T419722) (owner: 10Dpogorzelski) [08:50:16] (03CR) 10Elukey: "I left some notes after a quick pass, overall it looks good but I would split the change in 3 stesp:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1261460 (https://phabricator.wikimedia.org/T419722) (owner: 10Dpogorzelski) [08:53:33] (03CR) 10JavierMonton: stream: mw-page-html-content-change-enrich-next (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1261475 (https://phabricator.wikimedia.org/T421341) (owner: 10JavierMonton) [08:56:34] (03CR) 10Elukey: ml-serve: add modified kserve 0.17 chart (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1261460 (https://phabricator.wikimedia.org/T419722) (owner: 10Dpogorzelski) [08:57:01] (03PS1) 10Arnaudb: gerrit: discard ttl=20 on httpd [puppet] - 10https://gerrit.wikimedia.org/r/1262007 (https://phabricator.wikimedia.org/T420909) [08:57:59] (03CR) 10Elukey: [C:03+1] Remove kafka.roll-restart-mirror-maker cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1261971 (https://phabricator.wikimedia.org/T417407) (owner: 10Brouberol) [08:58:14] (03CR) 10Brouberol: [C:03+2] Remove kafka.roll-restart-mirror-maker cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1261971 (https://phabricator.wikimedia.org/T417407) (owner: 10Brouberol) [09:03:45] !log elukey@deploy1003 helmfile [eqiad] START helmfile.d/admin 'apply'. [09:04:19] !log elukey@deploy1003 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [09:05:09] !log elukey@deploy1003 helmfile [codfw] START helmfile.d/admin 'apply'. [09:06:19] !log elukey@deploy1003 helmfile [codfw] DONE helmfile.d/admin 'apply'. [09:10:20] 06SRE, 10Maps, 07Sustainability (Incident Followup): Kartotherian dashboard links don't work - https://phabricator.wikimedia.org/T421226#11757172 (10elukey) 05Open→03Resolved Updated the links, they now work :) [09:12:48] (03Abandoned) 10Elukey: role::kafka::test: prepare the cluster for the Kafka upgrade [puppet] - 10https://gerrit.wikimedia.org/r/1239142 (https://phabricator.wikimedia.org/T416670) (owner: 10Elukey) [09:13:04] (03Abandoned) 10Elukey: Disable notifications for db1253 [puppet] - 10https://gerrit.wikimedia.org/r/1252002 (https://phabricator.wikimedia.org/T420041) (owner: 10Elukey) [09:14:47] (03PS1) 10Elukey: role::kafka::test: update the inter broker protocol [puppet] - 10https://gerrit.wikimedia.org/r/1262008 (https://phabricator.wikimedia.org/T417035) [09:15:43] (03CR) 10JavierMonton: [C:03+1] role::kafka::test: update the inter broker protocol [puppet] - 10https://gerrit.wikimedia.org/r/1262008 (https://phabricator.wikimedia.org/T417035) (owner: 10Elukey) [09:21:09] (03PS7) 10Dpogorzelski: ml-serve: add modified kserve 0.17 chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1261460 (https://phabricator.wikimedia.org/T419722) [09:22:34] (03CR) 10CI reject: [V:04-1] ml-serve: add modified kserve 0.17 chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1261460 (https://phabricator.wikimedia.org/T419722) (owner: 10Dpogorzelski) [09:24:41] (03PS8) 10Dpogorzelski: ml-serve: add modified kserve 0.17 chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1261460 (https://phabricator.wikimedia.org/T419722) [09:25:24] (03CR) 10Dpogorzelski: ml-serve: add modified kserve 0.17 chart (035 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1261460 (https://phabricator.wikimedia.org/T419722) (owner: 10Dpogorzelski) [09:25:58] (03CR) 10Elukey: [C:03+2] role::kafka::test: update the inter broker protocol [puppet] - 10https://gerrit.wikimedia.org/r/1262008 (https://phabricator.wikimedia.org/T417035) (owner: 10Elukey) [09:26:15] (03CR) 10CI reject: [V:04-1] ml-serve: add modified kserve 0.17 chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1261460 (https://phabricator.wikimedia.org/T419722) (owner: 10Dpogorzelski) [09:29:05] (03PS1) 10Elukey: role::kafka::test::broker: fix inter-broker paramter type [puppet] - 10https://gerrit.wikimedia.org/r/1262017 [09:31:56] (03CR) 10Elukey: [C:03+2] role::kafka::test::broker: fix inter-broker paramter type [puppet] - 10https://gerrit.wikimedia.org/r/1262017 (owner: 10Elukey) [09:36:49] (03PS1) 10Arnaudb: gerrit: adjust idleTimeout on Jetty [puppet] - 10https://gerrit.wikimedia.org/r/1262020 (https://phabricator.wikimedia.org/T420909) [09:37:49] !log elukey@cumin1003 START - Cookbook sre.kafka.roll-restart-reboot-brokers rolling restart_daemons on A:kafka-test-eqiad [09:39:27] (03CR) 10Effie Mouzeli: [C:03+1] "Thank you very much Luca!" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1259148 (https://phabricator.wikimedia.org/T420223) (owner: 10Elukey) [09:40:11] (03CR) 10Elukey: [V:03+2 C:03+2] "merging now to avoid the monday rebase! :D" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1259148 (https://phabricator.wikimedia.org/T420223) (owner: 10Elukey) [09:41:35] (03PS6) 10Federico Ceratto: sre.mysql.clone: record clone runs into Zarcillo [cookbooks] - 10https://gerrit.wikimedia.org/r/1243772 (https://phabricator.wikimedia.org/T417608) [09:44:07] (03CR) 10Hashar: "That is to sync up with the timeout on mod_proxy side which I think has:" [puppet] - 10https://gerrit.wikimedia.org/r/1262020 (https://phabricator.wikimedia.org/T420909) (owner: 10Arnaudb) [09:44:25] (03CR) 10CI reject: [V:04-1] sre.mysql.clone: record clone runs into Zarcillo [cookbooks] - 10https://gerrit.wikimedia.org/r/1243772 (https://phabricator.wikimedia.org/T417608) (owner: 10Federico Ceratto) [09:46:12] (03PS1) 10Brouberol: Add monitors for the turnilo deployment pods [alerts] - 10https://gerrit.wikimedia.org/r/1262024 (https://phabricator.wikimedia.org/T416113) [09:48:47] 07sre-alert-triage, 06Data-Platform-SRE (2026-03-27 - 2026-04-17): Alert in need of triage: KubernetesAPIErrorRate - https://phabricator.wikimedia.org/T414970#11757285 (10Gehel) [09:48:51] 07sre-alert-triage, 06Data-Platform-SRE (2026-03-27 - 2026-04-17): Alert in need of triage: KubernetesAPIErrorRate - https://phabricator.wikimedia.org/T414413#11757289 (10Gehel) [09:49:18] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-03-27 - 2026-04-17): Configure dse-k8s-worker10[20-23] - https://phabricator.wikimedia.org/T421465#11757297 (10Gehel) [09:49:49] (03PS2) 10Brouberol: Add monitors for the turnilo deployment pods [alerts] - 10https://gerrit.wikimedia.org/r/1262024 (https://phabricator.wikimedia.org/T416113) [09:52:24] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-03-27 - 2026-04-17): Degraded RAID on an-worker1213 - https://phabricator.wikimedia.org/T420812#11757379 (10Gehel) [09:54:14] 06SRE, 10SRE-Access-Requests, 06Data-Platform-SRE (2026-03-27 - 2026-04-17): Requesting access to analytics-admins for Jerrywang - https://phabricator.wikimedia.org/T419820#11757406 (10Gehel) [09:54:26] (03PS9) 10Dpogorzelski: ml-serve: add modified kserve 0.17 chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1261460 (https://phabricator.wikimedia.org/T419722) [09:54:36] (03PS1) 10Effie Mouzeli: mw-parsoid: add CNAMES [dns] - 10https://gerrit.wikimedia.org/r/1262025 (https://phabricator.wikimedia.org/T420468) [09:55:07] 06SRE, 06Data-Platform-SRE (2026-03-27 - 2026-04-17), 13Patch-For-Review: Data Platform SRE paging alerts and on-call SRE response - https://phabricator.wikimedia.org/T420264#11757420 (10Gehel) [09:55:29] 06SRE, 10SRE-Access-Requests, 06Data-Platform-SRE (2026-03-27 - 2026-04-17), 13Patch-For-Review: Requesting access to analytics-privatedata-users for AWesterinen - https://phabricator.wikimedia.org/T420053#11757426 (10Gehel) [09:56:05] 10SRE-SLO, 10observability, 10Wikidata, 10Wikidata-Query-Service, and 3 others: Update WDQS SLOs to reflect graph split changes - https://phabricator.wikimedia.org/T393966#11757440 (10Gehel) [09:56:09] (03PS1) 10Effie Mouzeli: trafficserver: update mw-parsoid XWD entries [puppet] - 10https://gerrit.wikimedia.org/r/1262026 (https://phabricator.wikimedia.org/T420468) [09:56:13] (03PS10) 10Dpogorzelski: ml-serve: add modified kserve 0.17 chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1261460 (https://phabricator.wikimedia.org/T419722) [09:56:47] 10SRE-SLO, 06ServiceOps new, 06Data-Platform-SRE (2026-03-27 - 2026-04-17), 07Essential-Work, and 2 others: IPoid: Define service level indicators and service level objectives - https://phabricator.wikimedia.org/T348935#11757454 (10Gehel) [09:56:59] (03CR) 10CI reject: [V:04-1] mw-parsoid: add CNAMES [dns] - 10https://gerrit.wikimedia.org/r/1262025 (https://phabricator.wikimedia.org/T420468) (owner: 10Effie Mouzeli) [09:57:12] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-03-27 - 2026-04-17): Degraded RAID on an-presto1007 - https://phabricator.wikimedia.org/T419329#11757461 (10Gehel) [09:57:30] !log elukey@cumin1003 END (PASS) - Cookbook sre.kafka.roll-restart-reboot-brokers (exit_code=0) rolling restart_daemons on A:kafka-test-eqiad [09:57:40] 10ops-eqiad, 06DC-Ops, 06Data-Platform-SRE (2026-03-27 - 2026-04-17): Disk error on an-worker1178 - https://phabricator.wikimedia.org/T419206#11757471 (10Gehel) [09:57:54] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-03-27 - 2026-04-17): Missing physical volume on an-worker1159 - https://phabricator.wikimedia.org/T419129#11757473 (10Gehel) [09:58:43] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [09:59:17] (03CR) 10CI reject: [V:04-1] ml-serve: add modified kserve 0.17 chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1261460 (https://phabricator.wikimedia.org/T419722) (owner: 10Dpogorzelski) [09:59:21] (03Abandoned) 10Effie Mouzeli: php8.1-cli: introduce opcache and JIT [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1113124 (https://phabricator.wikimedia.org/T384294) (owner: 10Effie Mouzeli) [09:59:33] (03Abandoned) 10Effie Mouzeli: trafficserver: respect PHP_ENGINE_STICKY cookie value [puppet] - 10https://gerrit.wikimedia.org/r/1125176 (https://phabricator.wikimedia.org/T383845) (owner: 10Effie Mouzeli) [09:59:37] (03PS11) 10Dpogorzelski: ml-serve: add modified kserve 0.17 chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1261460 (https://phabricator.wikimedia.org/T419722) [09:59:41] (03Abandoned) 10Effie Mouzeli: mw-cron: disable mcrouter container [deployment-charts] - 10https://gerrit.wikimedia.org/r/1143517 (https://phabricator.wikimedia.org/T341555) (owner: 10Effie Mouzeli) [09:59:45] (03PS12) 10Dpogorzelski: ml-serve: add modified kserve 0.17 chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1261460 (https://phabricator.wikimedia.org/T419722) [10:00:05] (03CR) 10CI reject: [V:04-1] ml-serve: add modified kserve 0.17 chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1261460 (https://phabricator.wikimedia.org/T419722) (owner: 10Dpogorzelski) [10:00:09] (03CR) 10Klausman: "Could you add (a link to) some quick overview of what was changed relative to upstream (if anything)?" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1261460 (https://phabricator.wikimedia.org/T419722) (owner: 10Dpogorzelski) [10:00:39] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [10:01:11] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [10:03:01] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [10:03:57] (03PS2) 10Effie Mouzeli: mw-parsoid: add CNAMES [dns] - 10https://gerrit.wikimedia.org/r/1262025 (https://phabricator.wikimedia.org/T420468) [10:04:38] (03CR) 10Slyngshede: [C:03+1] "Look sensible." [puppet] - 10https://gerrit.wikimedia.org/r/1259927 (https://phabricator.wikimedia.org/T419873) (owner: 10Vgutierrez) [10:04:52] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . [10:08:06] (03PS1) 10Elukey: Move kafka-test1006 to Trixie [puppet] - 10https://gerrit.wikimedia.org/r/1262031 (https://phabricator.wikimedia.org/T417035) [10:08:38] 06SRE, 10SRE-Access-Requests: Requesting access to superset dashboard for mpostoronca - https://phabricator.wikimedia.org/T421471 (10MPostoronca-WMF) 03NEW [10:09:44] (03CR) 10Elukey: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1262031 (https://phabricator.wikimedia.org/T417035) (owner: 10Elukey) [10:10:33] (03PS13) 10Dpogorzelski: ml-serve: add modified kserve 0.17 chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1261460 (https://phabricator.wikimedia.org/T419722) [10:10:56] (03CR) 10JavierMonton: [C:03+1] Move kafka-test1006 to Trixie [puppet] - 10https://gerrit.wikimedia.org/r/1262031 (https://phabricator.wikimedia.org/T417035) (owner: 10Elukey) [10:11:09] (03CR) 10Dpogorzelski: "I think i'll add an additional readme as Luca suggested so that we know how to keep track of things." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1261460 (https://phabricator.wikimedia.org/T419722) (owner: 10Dpogorzelski) [10:11:11] 06SRE, 10SRE-Access-Requests: Requesting access to superset dashboard for mpostoronca - https://phabricator.wikimedia.org/T421471#11757568 (10MPostoronca-WMF) [10:11:26] (03CR) 10Elukey: [C:03+2] Move kafka-test1006 to Trixie [puppet] - 10https://gerrit.wikimedia.org/r/1262031 (https://phabricator.wikimedia.org/T417035) (owner: 10Elukey) [10:12:16] (03CR) 10CI reject: [V:04-1] ml-serve: add modified kserve 0.17 chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1261460 (https://phabricator.wikimedia.org/T419722) (owner: 10Dpogorzelski) [10:12:20] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1262032 [10:12:21] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1262033 [10:12:26] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host kafka-test1006.eqiad.wmnet with OS trixie [10:18:06] !log taavi@cumin1003 START - Cookbook sre.wikireplicas.add-wiki for database abstractwiki (T420637) [10:18:12] T420637: [wikireplicas] Create views for new wiki abstractwiki - https://phabricator.wikimedia.org/T420637 [10:24:03] FIRING: KafkaUnderReplicatedPartitions: Under replicated partitions for Kafka cluster test-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=test-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [10:27:54] !log elukey@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on kafka-test1006.eqiad.wmnet with reason: host reimage [10:31:04] (03CR) 10Joal: [C:03+1] "LGTM!" [alerts] - 10https://gerrit.wikimedia.org/r/1262024 (https://phabricator.wikimedia.org/T416113) (owner: 10Brouberol) [10:31:20] (03CR) 10Brouberol: [C:03+2] Add monitors for the turnilo deployment pods [alerts] - 10https://gerrit.wikimedia.org/r/1262024 (https://phabricator.wikimedia.org/T416113) (owner: 10Brouberol) [10:33:59] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kafka-test1006.eqiad.wmnet with reason: host reimage [10:41:09] (03PS7) 10Federico Ceratto: sre.mysql.clone: record clone runs into Zarcillo [cookbooks] - 10https://gerrit.wikimedia.org/r/1243772 (https://phabricator.wikimedia.org/T417608) [10:43:07] !log jayme@cumin1003 START - Cookbook sre.hosts.reboot-single for host poolcounter2005.codfw.wmnet [10:46:58] !log jayme@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host poolcounter2005.codfw.wmnet [10:49:03] RESOLVED: KafkaUnderReplicatedPartitions: Under replicated partitions for Kafka cluster test-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=test-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [10:50:22] !log jayme@cumin1003 START - Cookbook sre.hosts.reboot-single for host poolcounter2006.codfw.wmnet [10:51:12] !log blake@cumin1003 START - Cookbook sre.hosts.reboot-single for host deploy2002.codfw.wmnet [10:51:22] (03PS8) 10Slyngshede: P:tofurkey Add tofurkey [puppet] - 10https://gerrit.wikimedia.org/r/1260730 (https://phabricator.wikimedia.org/T355446) [10:54:21] !log jayme@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host poolcounter2006.codfw.wmnet [11:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260327T0700) [11:00:05] jelto, arnoldokoth, mutante, and arnaudb: Time to snap out of that daydream and deploy GitLab version upgrades. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260327T1100). [11:00:53] (03CR) 10Dpogorzelski: ml-serve: add modified kserve 0.17 chart (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1261460 (https://phabricator.wikimedia.org/T419722) (owner: 10Dpogorzelski) [11:02:04] !log blake@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host deploy2002.codfw.wmnet [11:02:49] (03PS1) 10Effie Mouzeli: mw-parsoid: remove service definition 4 [puppet] - 10https://gerrit.wikimedia.org/r/1262052 (https://phabricator.wikimedia.org/T420468) [11:03:59] (03PS1) 10Effie Mouzeli: envoy: remove mw-parsoid listener [puppet] - 10https://gerrit.wikimedia.org/r/1262054 (https://phabricator.wikimedia.org/T420468) [11:05:28] (03PS1) 10Elukey: profile::base::certificates: rename Puppet Internal CA's path [puppet] - 10https://gerrit.wikimedia.org/r/1262055 [11:05:55] (03PS3) 10Effie Mouzeli: mw-parsoid: add CNAMES [dns] - 10https://gerrit.wikimedia.org/r/1262025 (https://phabricator.wikimedia.org/T420468) [11:06:10] PROBLEM - Blazegraph Port for wdqs-blazegraph on wdqs1020 is CRITICAL: connect to address 127.0.0.1 and port 9999: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [11:07:10] RECOVERY - Blazegraph Port for wdqs-blazegraph on wdqs1020 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9999 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [11:15:17] !log taavi@cumin1003 END (PASS) - Cookbook sre.wikireplicas.add-wiki (exit_code=0) for database abstractwiki (T420637) [11:15:23] T420637: [wikireplicas] Create views for new wiki abstractwiki - https://phabricator.wikimedia.org/T420637 [11:15:56] !log elukey@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host kafka-test1006.eqiad.wmnet with OS trixie [11:20:36] 06SRE, 06Infrastructure-Foundations: Create a cookbook to execute Kafka rolling upgrades - https://phabricator.wikimedia.org/T417035#11757818 (10elukey) 05Open→03Resolved a:03elukey [11:20:59] 06SRE, 06Infrastructure-Foundations: Test and upgrade Kafka clusters to Openjdk 17 - https://phabricator.wikimedia.org/T416674#11757822 (10elukey) 05Stalled→03Declined [11:27:54] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [11:30:51] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [11:34:52] (03PS5) 10Fabfur: cache:haproxy: suppress startup warn for haproxy 3.2 (lua scripts) [puppet] - 10https://gerrit.wikimedia.org/r/1261484 (https://phabricator.wikimedia.org/T421402) [11:34:52] (03PS2) 10Fabfur: hiera: upgrade haproxy to version 3.2 on cp6001 and cp6009 [puppet] - 10https://gerrit.wikimedia.org/r/1261492 (https://phabricator.wikimedia.org/T421402) [11:34:52] (03PS1) 10Fabfur: hiera: upgrade haproxy to version 3.2 on drmrs [puppet] - 10https://gerrit.wikimedia.org/r/1262059 (https://phabricator.wikimedia.org/T421402) [11:34:55] (03PS1) 10Fabfur: hiera: upgrade haproxy to version magru [puppet] - 10https://gerrit.wikimedia.org/r/1262060 (https://phabricator.wikimedia.org/T421402) [11:34:56] (03PS1) 10Fabfur: hiera: upgrade haproxy to version 3.2 on ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/1262061 (https://phabricator.wikimedia.org/T421402) [11:34:59] (03PS1) 10Fabfur: hiera: upgrade haproxy to version 3.2 on eqsin [puppet] - 10https://gerrit.wikimedia.org/r/1262062 (https://phabricator.wikimedia.org/T421402) [11:35:01] (03PS1) 10Fabfur: hiera: upgrade haproxy to version 3.2 on codfw [puppet] - 10https://gerrit.wikimedia.org/r/1262063 (https://phabricator.wikimedia.org/T421402) [11:35:03] (03PS1) 10Fabfur: hiera: upgrade haproxy to version 3.2 on eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1262064 (https://phabricator.wikimedia.org/T421402) [11:35:06] (03PS1) 10Fabfur: hiera: upgrade haproxy to version 3.2 on esams [puppet] - 10https://gerrit.wikimedia.org/r/1262065 (https://phabricator.wikimedia.org/T421402) [11:35:38] (03PS2) 10Fabfur: hiera: upgrade haproxy to version 3.2 on magru [puppet] - 10https://gerrit.wikimedia.org/r/1262060 (https://phabricator.wikimedia.org/T421402) [11:35:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:41:51] 06SRE, 06Infrastructure-Foundations: Add some kafka clients to the Kafka test cluster - https://phabricator.wikimedia.org/T417034#11757879 (10JMonton-WMF) These applications were tested during the Kafka upgrade to 3.7: - PyFlink 1.20: [[ https://github.com/wikimedia/operations-deployment-charts/tree/master/he... [11:48:37] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [11:50:26] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [11:51:34] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [11:52:13] (03PS1) 10Fabfur: aptrepo,haproxy: add haproxy-awslc component/package [puppet] - 10https://gerrit.wikimedia.org/r/1262068 (https://phabricator.wikimedia.org/T419825) [11:53:29] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [11:55:53] (03CR) 10Slyngshede: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1260730 (https://phabricator.wikimedia.org/T355446) (owner: 10Slyngshede) [11:57:41] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2026-03-27 - 2026-04-17): Configure dse-k8s-worker10[20-23] - https://phabricator.wikimedia.org/T421465#11757915 (10Jclark-ctr) @Gehel If nothing is needed from DC Ops on this ticket, do you mind if I remove tags ops-eqiad-dc-ops? [11:58:33] 10ops-eqiad, 06SRE, 06DC-Ops, 10observability, 13Patch-For-Review: Q4:rack/setup/install kafka-logging100[6-8] - https://phabricator.wikimedia.org/T418929#11757935 (10Jclark-ctr) [12:05:31] (03CR) 10Daniel Kinzler: [C:04-1] "CR-1 to remind myself to come back to this. The MediaWiki change has been deployed, but we should wait a while with deploying this, to av" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1259242 (https://phabricator.wikimedia.org/T420280) (owner: 10Daniel Kinzler) [12:06:47] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [12:08:36] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [12:10:01] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [12:10:06] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1012.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [12:10:06] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-main_443: Servers wdqs1017.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1011.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1021.eqiad.wmnet, wdqs1016.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1012.eqiad.wmnet, wdqs1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [12:11:57] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [12:13:51] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netbox: netbox report error for puppetdb serial versus netbox serial for backup1012 - https://phabricator.wikimedia.org/T420623#11758040 (10Jclark-ctr) >>! In T420623#11753681, @Jclark-ctr wrote: > I also see a typo on wikikube-worker1371 Mac... [12:14:04] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . [12:14:34] (03PS9) 10Slyngshede: P:tofurkey Add tofurkey [puppet] - 10https://gerrit.wikimedia.org/r/1260730 (https://phabricator.wikimedia.org/T355446) [12:16:06] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [12:16:06] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [12:18:59] (03CR) 10Slyngshede: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1260730 (https://phabricator.wikimedia.org/T355446) (owner: 10Slyngshede) [12:25:40] (03PS10) 10Slyngshede: P:tofurkey Add tofurkey [puppet] - 10https://gerrit.wikimedia.org/r/1260730 (https://phabricator.wikimedia.org/T355446) [12:31:36] (03CR) 10Slyngshede: "Note that the package in not yet imported into apt-repo." [puppet] - 10https://gerrit.wikimedia.org/r/1260730 (https://phabricator.wikimedia.org/T355446) (owner: 10Slyngshede) [12:36:44] 10ops-eqiad, 06SRE, 06DC-Ops: Q3:rack/setup/install frdata1003, frmx1002, frqueue100[5-6] - https://phabricator.wikimedia.org/T416249#11758091 (10Jgreen) frqueue1005/frqueue1006 are up and running [12:37:15] (03CR) 10Dpogorzelski: ml-serve: add modified kserve 0.17 chart (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1261460 (https://phabricator.wikimedia.org/T419722) (owner: 10Dpogorzelski) [12:40:07] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [12:41:55] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [12:43:24] (03CR) 10Bartosz Wójtowicz: services: add linked-artifacts service (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1250651 (https://phabricator.wikimedia.org/T414112) (owner: 10Eevans) [12:45:49] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [12:47:45] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [12:53:04] (03CR) 10Dpogorzelski: ml-serve: add modified kserve 0.17 chart (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1261460 (https://phabricator.wikimedia.org/T419722) (owner: 10Dpogorzelski) [12:55:10] (03PS1) 10Kamila Součková: Enable $wgTempCategoryCollations for s3 wikis. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1262091 (https://phabricator.wikimedia.org/T419274) [12:57:19] (03CR) 10Kamila Součková: "It works on testwiki, I created a new category and checked that both tables are getting updated correctly." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1262091 (https://phabricator.wikimedia.org/T419274) (owner: 10Kamila Součková) [13:05:25] RESOLVED: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:25:51] (03PS7) 10Tiziano Fogli: thanos/compact: add support for instance-based partitioning [puppet] - 10https://gerrit.wikimedia.org/r/1260650 (https://phabricator.wikimedia.org/T386911) [13:25:51] (03PS7) 10Tiziano Fogli: pontoon: override promethues_instances designated_compactor [puppet] - 10https://gerrit.wikimedia.org/r/1260651 (https://phabricator.wikimedia.org/T386911) [13:28:20] (03CR) 10Eevans: [C:03+2] cassandra_dev: add aqsloader grants to staging (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1261504 (https://phabricator.wikimedia.org/T420008) (owner: 10Eevans) [13:28:39] (03CR) 10Cathal Mooney: "I've applied this manually in ulsfo to deal with the current issue. Makes no sense to leave things broken when there is an easy fix avail" [homer/public] - 10https://gerrit.wikimedia.org/r/1260734 (https://phabricator.wikimedia.org/T420821) (owner: 10Cathal Mooney) [13:43:15] (03PS8) 10Tiziano Fogli: thanos/compact: add support for instance-based partitioning [puppet] - 10https://gerrit.wikimedia.org/r/1260650 (https://phabricator.wikimedia.org/T386911) [13:43:16] (03PS8) 10Tiziano Fogli: pontoon: override promethues_instances designated_compactor [puppet] - 10https://gerrit.wikimedia.org/r/1260651 (https://phabricator.wikimedia.org/T386911) [13:46:05] (03PS14) 10Dpogorzelski: ml-serve: add modified kserve 0.17 chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1261460 (https://phabricator.wikimedia.org/T419722) [13:47:18] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [13:47:35] (03CR) 10CI reject: [V:04-1] ml-serve: add modified kserve 0.17 chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1261460 (https://phabricator.wikimedia.org/T419722) (owner: 10Dpogorzelski) [13:48:52] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [13:49:06] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [13:49:56] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [13:51:51] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [13:52:15] !log dpogorzelski@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . [13:54:54] (03PS3) 10Cathal Mooney: Add policy 'transport-in' to apply as import on transport circuits [homer/public] - 10https://gerrit.wikimedia.org/r/1260734 (https://phabricator.wikimedia.org/T420821) [13:58:37] 10ops-eqiad, 06SRE, 06DC-Ops: Q3:rack/setup/install frdata1003, frmx1002, frqueue100[5-6] - https://phabricator.wikimedia.org/T416249#11758415 (10Jclark-ctr) >>! In T416249#11755868, @Jgreen wrote: > @Jclark-ctr I finally had a chance to look into the frmx1002/frdata1003 issue. I think the servers are in the... [13:59:12] (03Abandoned) 10Jsn.sherman: Remove local configuration routing and loading [extensions/AutoModerator] (wmf/1.46.0-wmf.20) - 10https://gerrit.wikimedia.org/r/1255772 (https://phabricator.wikimedia.org/T419835) (owner: 10Jsn.sherman) [14:02:55] !log cmooney@cumin1003 START - Cookbook sre.dns.netbox [14:08:58] !log cmooney@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: change ips for frack servers - cmooney@cumin1003" [14:09:06] !log cmooney@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: change ips for frack servers - cmooney@cumin1003" [14:09:06] !log cmooney@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:22:09] 10ops-eqiad, 06SRE, 06DC-Ops: Q3:rack/setup/install frdata1003, frmx1002, frqueue100[5-6] - https://phabricator.wikimedia.org/T416249#11758506 (10cmooney) >>! In T416249#11758413, @Jclark-ctr wrote: > @cmooney Could you assist with this next week? Done now, I've moved the frmx1002 and frdata1003 ports into... [14:22:21] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q3:rack/setup/install frdata1003, frmx1002, frqueue100[5-6] - https://phabricator.wikimedia.org/T416249#11758507 (10Jgreen) >>! In T416249#11758413, @Jclark-ctr wrote: >>>! In T416249#11755868, @Jgreen wrote: >> @Jclark-ctr I finally had a chance to loo... [14:27:03] (03CR) 10Elukey: "I didn't see a diff from CI and I checked a bit, I think there is some yaml to tune sorry :(" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1261455 (owner: 10Jforrester) [14:36:20] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q3:rack/setup/install frdev1003 - https://phabricator.wikimedia.org/T418928#11758535 (10Jgreen) [14:37:59] (03CR) 10Vgutierrez: [C:04-1] "`aptrepo/files/updates` needs to be updated as well" [puppet] - 10https://gerrit.wikimedia.org/r/1262068 (https://phabricator.wikimedia.org/T419825) (owner: 10Fabfur) [14:43:34] 06SRE, 06Infrastructure-Foundations: Review the most critical/popular Kafka clients before the Kafka upgrade - https://phabricator.wikimedia.org/T417031#11758556 (10JMonton-WMF) We have checked all `PyFlink applications`, they are using Flink 1.20 and JDK 17. On a test on the Kafka Test cluster, they worked fi... [14:49:23] 06SRE, 06DC-Ops, 10Wikidata: NVMe versus standard SSD performance info - https://phabricator.wikimedia.org/T419884#11758575 (10RobH) [14:51:47] 06SRE, 06DC-Ops, 10Wikidata: NVMe versus standard SSD performance info - https://phabricator.wikimedia.org/T419884#11758587 (10RobH) a:05RobH→03gmodena @gmodena, I've gotten back the following links to the whitepapers for our currently used SSDs and NVMe offerings: G3ZJM0K: 480GB SSD SATA Read Intensiv... [14:52:00] 10ops-eqiad, 06SRE, 06DC-Ops, 10observability, 13Patch-For-Review: Q4:rack/setup/install kafka-logging100[6-8] - https://phabricator.wikimedia.org/T418929#11758589 (10elukey) I got another issue after my patch, namely the root user creation (in the BMC) returns a plain HTTP 400. I tried this from the spi... [14:52:05] (03PS1) 10Fabfur: aptrepo: updates configuration for haproxy32 [puppet] - 10https://gerrit.wikimedia.org/r/1262146 (https://phabricator.wikimedia.org/T421402) [14:57:59] (03CR) 10Vgutierrez: [C:03+1] aptrepo: updates configuration for haproxy32 [puppet] - 10https://gerrit.wikimedia.org/r/1262146 (https://phabricator.wikimedia.org/T421402) (owner: 10Fabfur) [15:00:43] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [15:07:04] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [15:08:08] 06SRE, 10Maps, 07Sustainability (Incident Followup): Kartotherian dashboard links don't work - https://phabricator.wikimedia.org/T421226#11758650 (10Scott_French) Thank you very much @elukey! [15:10:21] (03PS1) 10Scott French: trafficserver: Extend validation for .lua.conf files [puppet] - 10https://gerrit.wikimedia.org/r/1262152 [15:14:44] (03CR) 10Scott French: "Thanks again for adding the syntactic validity checks." [puppet] - 10https://gerrit.wikimedia.org/r/1262152 (owner: 10Scott French) [15:15:41] (03PS18) 10Vgutierrez: prometheus::ops: Monitor IPIP realservers [puppet] - 10https://gerrit.wikimedia.org/r/1259927 (https://phabricator.wikimedia.org/T419873) [15:15:50] (03CR) 10Vgutierrez: prometheus::ops: Monitor IPIP realservers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1259927 (https://phabricator.wikimedia.org/T419873) (owner: 10Vgutierrez) [15:16:56] FIRING: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:18:25] (03PS1) 10JMeybohm: kubernetes: Remove docker as supported container runtime [puppet] - 10https://gerrit.wikimedia.org/r/1262154 (https://phabricator.wikimedia.org/T395870) [15:19:05] (03CR) 10JMeybohm: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1262154 (https://phabricator.wikimedia.org/T395870) (owner: 10JMeybohm) [15:19:53] (03CR) 10Vgutierrez: [C:03+1] "nice addition, thanks Scott! please amend the commit message to include the phab task, LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/1262152 (owner: 10Scott French) [15:23:13] (03PS2) 10Scott French: trafficserver: Extend validation for .lua.conf files [puppet] - 10https://gerrit.wikimedia.org/r/1262152 (https://phabricator.wikimedia.org/T421203) [15:24:10] FIRING: [2x] BFDdown: BFD session down between cr2-eqiad and 185.15.59.149 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [15:24:21] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-esams:xe-0/0/7 (Transport: cr2-eqiad:xe-3/2/1 (Colt, 445419311 80ms 10Gbps wave) {#30385}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [15:24:24] (03CR) 10Scott French: trafficserver: Extend validation for .lua.conf files (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1262152 (https://phabricator.wikimedia.org/T421203) (owner: 10Scott French) [15:24:39] FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-esams and cr2-eqiad (185.15.59.148) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [15:29:10] RESOLVED: [2x] BFDdown: BFD session down between cr2-eqiad and 185.15.59.149 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [15:29:21] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-esams:xe-0/0/7 (Transport: cr2-eqiad:xe-3/2/1 (Colt, 445419311 80ms 10Gbps wave) {#30385}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [15:29:39] RESOLVED: [4x] CoreBGPDown: Core BGP session down between cr1-esams and cr2-eqiad (185.15.59.148) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [15:31:56] RESOLVED: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:34:56] FIRING: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:38:02] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, March 30 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1261526 (https://phabricator.wikimedia.org/T421366) (owner: 10Mmartorana) [15:39:49] (03CR) 10JHathaway: [C:03+2] run_ci_locally: add nounset, cleanup [puppet] - 10https://gerrit.wikimedia.org/r/1261610 (owner: 10JHathaway) [15:39:56] RESOLVED: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:42:16] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, April 02 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1261516 (https://phabricator.wikimedia.org/T421366) (owner: 10Mmartorana) [15:51:27] (03CR) 10Cathal Mooney: [C:03+2] Add Nokia POPs BGP policies [homer/public] - 10https://gerrit.wikimedia.org/r/1260715 (https://phabricator.wikimedia.org/T408892) (owner: 10Ayounsi) [15:52:42] (03Merged) 10jenkins-bot: Add Nokia POPs BGP policies [homer/public] - 10https://gerrit.wikimedia.org/r/1260715 (https://phabricator.wikimedia.org/T408892) (owner: 10Ayounsi) [15:56:44] (03PS1) 10Herron: wip [alerts] - 10https://gerrit.wikimedia.org/r/1262175 [15:56:44] (03CR) 10Klausman: [C:03+1] ml-serve: add modified kserve 0.17 chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1261460 (https://phabricator.wikimedia.org/T419722) (owner: 10Dpogorzelski) [15:57:47] (03PS1) 10Herron: kafkamon: check all clusters from both kafkamon hosts [puppet] - 10https://gerrit.wikimedia.org/r/1262176 [16:00:22] (03PS1) 10Cathal Mooney: Nokia BGP policy: add new policeis for version 24 syntax too [homer/public] - 10https://gerrit.wikimedia.org/r/1262179 (https://phabricator.wikimedia.org/T408892) [16:00:56] FIRING: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:08:36] (03PS2) 10Herron: kafkamon: check all clusters from both kafkamon hosts [puppet] - 10https://gerrit.wikimedia.org/r/1262176 (https://phabricator.wikimedia.org/T418858) [16:09:12] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:10:46] !log jhathaway@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-ctrl2004.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [16:11:45] (03PS2) 10Cathal Mooney: Nokia BGP policy: add new policeis for version 24 syntax too [homer/public] - 10https://gerrit.wikimedia.org/r/1262179 (https://phabricator.wikimedia.org/T408892) [16:11:48] !log jhathaway@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-ctrl2004.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [16:11:53] (03PS3) 10Herron: kafkamon: check all clusters from both kafkamon hosts [puppet] - 10https://gerrit.wikimedia.org/r/1262176 (https://phabricator.wikimedia.org/T418858) [16:12:08] (03PS3) 10Cathal Mooney: Nokia BGP policy: add new policeis for version 24 syntax too [homer/public] - 10https://gerrit.wikimedia.org/r/1262179 (https://phabricator.wikimedia.org/T408892) [16:12:24] !log jhathaway@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-ctrl2004.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [16:12:26] (03CR) 10CI reject: [V:04-1] kafkamon: check all clusters from both kafkamon hosts [puppet] - 10https://gerrit.wikimedia.org/r/1262176 (https://phabricator.wikimedia.org/T418858) (owner: 10Herron) [16:12:53] !log brouberol@deploy1003 helmfile [dse-k8s-codfw] START helmfile.d/admin 'apply'. [16:13:13] !log brouberol@deploy1003 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'apply'. [16:13:39] (03PS4) 10Herron: kafkamon: check all clusters from both kafkamon hosts [puppet] - 10https://gerrit.wikimedia.org/r/1262176 (https://phabricator.wikimedia.org/T418858) [16:15:56] RESOLVED: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:16:28] (03CR) 10Cathal Mooney: [C:03+2] Nokia BGP policy: add new policeis for version 24 syntax too [homer/public] - 10https://gerrit.wikimedia.org/r/1262179 (https://phabricator.wikimedia.org/T408892) (owner: 10Cathal Mooney) [16:18:00] (03Merged) 10jenkins-bot: Nokia BGP policy: add new policeis for version 24 syntax too [homer/public] - 10https://gerrit.wikimedia.org/r/1262179 (https://phabricator.wikimedia.org/T408892) (owner: 10Cathal Mooney) [16:18:44] (03CR) 10Scott French: "Thanks, Raine!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1262091 (https://phabricator.wikimedia.org/T419274) (owner: 10Kamila Součková) [16:22:46] !log jhathaway@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-ctrl2004.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [16:23:08] 07sre-alert-triage, 06Infrastructure-Foundations: Alert in need of triage: AlertLintProblem (instance localhost:9123) - https://phabricator.wikimedia.org/T421517 (10tappof) 03NEW [16:24:04] (03CR) 10Scott French: [C:03+1] "Thanks, effie!" [puppet] - 10https://gerrit.wikimedia.org/r/1262054 (https://phabricator.wikimedia.org/T420468) (owner: 10Effie Mouzeli) [16:25:33] (03CR) 10Scott French: [C:03+1] "Thanks, effie!" [puppet] - 10https://gerrit.wikimedia.org/r/1262026 (https://phabricator.wikimedia.org/T420468) (owner: 10Effie Mouzeli) [16:26:57] (03CR) 10Andrew Bogott: [C:03+1] cloudnfs: Remove Huggle project config [puppet] - 10https://gerrit.wikimedia.org/r/1259079 (owner: 10Majavah) [16:27:24] !log jhathaway@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-ctrl2004.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [16:31:25] (03CR) 10Herron: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/8358/co" [puppet] - 10https://gerrit.wikimedia.org/r/1262176 (https://phabricator.wikimedia.org/T418858) (owner: 10Herron) [16:31:42] (03PS1) 10Brouberol: dse-k8s-eqiad: bind security policies onto the mediawiki-dumps-legacy ns [deployment-charts] - 10https://gerrit.wikimedia.org/r/1262190 (https://phabricator.wikimedia.org/T419259) [16:33:04] PROBLEM - SSH on an-druid1006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [16:34:12] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:34:30] (03CR) 10CDanis: "@jgleeson@wikimedia.org Whenever you can provide me with an ssh public key, we can get this merged and enabled" [puppet] - 10https://gerrit.wikimedia.org/r/1255036 (https://phabricator.wikimedia.org/T416948) (owner: 10CDanis) [16:35:28] (03PS12) 10Btullis: Apply the new VAP to several namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1245369 (https://phabricator.wikimedia.org/T405509) [16:35:40] (03Abandoned) 10Brouberol: dse-k8s-eqiad: bind security policies onto the mediawiki-dumps-legacy ns [deployment-charts] - 10https://gerrit.wikimedia.org/r/1262190 (https://phabricator.wikimedia.org/T419259) (owner: 10Brouberol) [16:36:30] !log jhathaway@cumin2002 END (ERROR) - Cookbook sre.hosts.provision (exit_code=97) for host wikikube-ctrl2004.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [16:37:05] !log jhathaway@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-ctrl2004.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [16:37:20] (03PS13) 10Brouberol: Apply the new VAP to several namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1245369 (https://phabricator.wikimedia.org/T405509) (owner: 10Btullis) [16:38:29] (03PS2) 10Herron: burrow: update expressions to handle multiple instances [alerts] - 10https://gerrit.wikimedia.org/r/1262175 (https://phabricator.wikimedia.org/T418858) [16:40:54] RECOVERY - SSH on an-druid1006 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [16:40:54] PROBLEM - Druid historical on an-druid1006 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [16:41:14] !log dancy@deploy1003 Started deploy [releng/jenkins-deploy@31ace7e] (releasing): (no justification provided) [16:41:45] (03CR) 10Herron: [V:03+1] "the kafkamon hosts are getting upgraded to trixie soon, which is a good opportunity to improve redundancy for burrow (instead of failing o" [puppet] - 10https://gerrit.wikimedia.org/r/1262176 (https://phabricator.wikimedia.org/T418858) (owner: 10Herron) [16:41:54] RECOVERY - Druid historical on an-druid1006 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [16:42:26] !log dancy@deploy1003 Finished deploy [releng/jenkins-deploy@31ace7e] (releasing): (no justification provided) (duration: 01m 18s) [16:43:09] (03PS3) 10Herron: burrow: update expressions to handle multiple instances [alerts] - 10https://gerrit.wikimedia.org/r/1262175 (https://phabricator.wikimedia.org/T418858) [16:44:12] (03PS5) 10Herron: kafkamon: check all clusters from both kafkamon hosts [puppet] - 10https://gerrit.wikimedia.org/r/1262176 (https://phabricator.wikimedia.org/T418858) [16:45:03] (03PS2) 10JMeybohm: kubernetes: Remove docker as supported container runtime [puppet] - 10https://gerrit.wikimedia.org/r/1262154 (https://phabricator.wikimedia.org/T395870) [16:45:25] (03CR) 10JMeybohm: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1262154 (https://phabricator.wikimedia.org/T395870) (owner: 10JMeybohm) [16:46:19] (03CR) 10CI reject: [V:04-1] kafkamon: check all clusters from both kafkamon hosts [puppet] - 10https://gerrit.wikimedia.org/r/1262176 (https://phabricator.wikimedia.org/T418858) (owner: 10Herron) [16:46:51] (03PS6) 10Herron: kafkamon: check all clusters from both kafkamon hosts [puppet] - 10https://gerrit.wikimedia.org/r/1262176 (https://phabricator.wikimedia.org/T418858) [16:47:13] (03CR) 10Scott French: "Thanks, effie! So, I think the docs [0] recommend cleaning out `profile::lvs::realserver::pools` on the relevant hosts at the same time [1" [puppet] - 10https://gerrit.wikimedia.org/r/1261433 (https://phabricator.wikimedia.org/T420468) (owner: 10Effie Mouzeli) [16:47:27] !log jhathaway@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-ctrl2004.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [16:50:19] !log jhathaway@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-ctrl2004.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [16:52:40] FIRING: KubernetesRsyslogDown: rsyslog on wikikube-worker1067:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1067 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown