[00:00:40] (PS2) 1Veertje: Fix login form: responsive labels and mobile-friendly layout [software/gerrit] (wmf/stable-3.10) - https://gerrit.wikimedia.org/r/1305745
[00:01:01] (PS1) 1Veertje: Add daily background rotation from Wikimedia Commons (29 winter landscapes) [software/gerrit] (wmf/stable-3.10) - https://gerrit.wikimedia.org/r/1305786
[00:01:11] (PS1) 1Veertje: Background rotation: add license links and Commons file page links to attribution [software/gerrit] (wmf/stable-3.10) - https://gerrit.wikimedia.org/r/1305787
[00:01:20] (PS1) 1Veertje: Use full-original Commons URLs (no thumbnails) + fix author credits [software/gerrit] (wmf/stable-3.10) - https://gerrit.wikimedia.org/r/1305788
[00:01:30] (PS1) 1Veertje: Add dark backdrop to footer and attribution for contrast over background images [software/gerrit] (wmf/stable-3.10) - https://gerrit.wikimedia.org/r/1305789
[00:01:39] (PS1) 1Veertje: Footer and attribution in white box matching form card [software/gerrit] (wmf/stable-3.10) - https://gerrit.wikimedia.org/r/1305790
[00:01:49] (PS1) 1Veertje: Move footer and attribution inside the login form card [software/gerrit] (wmf/stable-3.10) - https://gerrit.wikimedia.org/r/1305791
[00:01:58] (PS1) 1Veertje: Author link opens Commons Media Viewer instead of plain file page [software/gerrit] (wmf/stable-3.10) - https://gerrit.wikimedia.org/r/1305792
[00:02:08] (PS1) 1Veertje: Add featured desktop backgrounds (98 sampled) with winter/featured date switching [software/gerrit] (wmf/stable-3.10) - https://gerrit.wikimedia.org/r/1305793
[00:02:17] (PS1) 1Veertje: Use 1920px thumbnails (API thumburl) instead of full-res originals [software/gerrit] (wmf/stable-3.10) - https://gerrit.wikimedia.org/r/1305794
[00:02:27] (PS1) 1Veertje: Add preload link in head for faster background image loading [software/gerrit] (wmf/stable-3.10) - https://gerrit.wikimedia.org/r/1305795
[00:02:36] (PS1) 1Veertje: Add footer links (Code of Conduct, Privacy Policy, Reset Password) with non-breaking space [software/gerrit] (wmf/stable-3.10) - https://gerrit.wikimedia.org/r/1305796
[00:17:07] (PS1) RLazarus: function-evaluator, function-orchestrator: Allow customizing dnsConfig [deployment-charts] - https://gerrit.wikimedia.org/r/1305797 (https://phabricator.wikimedia.org/T427864)
[00:17:08] (PS1) RLazarus: wikifunctions: Enable custom dnsConfig [deployment-charts] - https://gerrit.wikimedia.org/r/1305798 (https://phabricator.wikimedia.org/T427864)
[00:17:09] (PS1) RLazarus: admin_ng: Restrict wikifunctions network access to DNS [deployment-charts] - https://gerrit.wikimedia.org/r/1305799 (https://phabricator.wikimedia.org/T427864)
[00:17:15] !log T429919 Followed the steps in https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater#Clean_up_object_storage to clear out stale checkpoints
[00:17:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:17:20] T429919: RdfStreamingUpdaterSpaceUsageTooHigh - https://phabricator.wikimedia.org/T429919
[00:25:51] SRE, Traffic, Patch-For-Review: WE5.2.13 Dumps UA enforcement - https://phabricator.wikimedia.org/T427836#12057012 (HCoplin-WMF) Hey there -- we are ready to deploy! If we could get it out the door on Monday, June 29, that would be ideal. Let us know if that works for you, @SLyngshede-WMF & @BCornwa...
[00:30:56] (PS1) 1Veertje: Fix login form: responsive labels and mobile-friendly layout [software/gerrit] (wmf/stable-3.10) - https://gerrit.wikimedia.org/r/1305800
[00:30:56] (PS1) 1Veertje: Add daily background rotation with attribution and footer polish [software/gerrit] (wmf/stable-3.10) - https://gerrit.wikimedia.org/r/1305801
[00:42:16] FIRING: [2x] NodeBGPSessionStatusNotEstablished: Kubernetes node dse-k8s-worker1023:0 has a BGP session which is not in the 'established' state. - https://wikitech.wikimedia.org/wiki/Kubernetes/Administration#NodeBGPSessionStatusNotEstablished - https://alerts.wikimedia.org/?q=alertname%3DNodeBGPSessionStatusNotEstablished
[00:47:45] (Abandoned) 1Veertje: Author link opens Commons Media Viewer instead of plain file page [software/gerrit] (wmf/stable-3.10) - https://gerrit.wikimedia.org/r/1305792 (owner: 1Veertje)
[00:50:30] (Abandoned) 1Veertje: Move footer and attribution inside the login form card [software/gerrit] (wmf/stable-3.10) - https://gerrit.wikimedia.org/r/1305791 (owner: 1Veertje)
[00:52:26] (Abandoned) 1Veertje: Footer and attribution in white box matching form card [software/gerrit] (wmf/stable-3.10) - https://gerrit.wikimedia.org/r/1305790 (owner: 1Veertje)
[00:53:09] (Abandoned) 1Veertje: Add dark backdrop to footer and attribution for contrast over background images [software/gerrit] (wmf/stable-3.10) - https://gerrit.wikimedia.org/r/1305789 (owner: 1Veertje)
[00:55:23] ops-eqiad, SRE, DC-Ops, decommission-hardware, fundraising-tech-ops: decommission frqueue1003.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T429520#12057091 (Jclark-ctr) a:Jclark-ctr
[00:56:22] (Abandoned) 1Veertje: Add footer links (Code of Conduct, Privacy Policy, Reset Password) with non-breaking space [software/gerrit] (wmf/stable-3.10) - https://gerrit.wikimedia.org/r/1305796 (owner: 1Veertje)
[00:56:47] (Abandoned) 1Veertje: Fix login form: use label elements instead of th for accessibility [software/gerrit] (wmf/stable-3.10) - https://gerrit.wikimedia.org/r/1305742 (owner: 1Veertje)
[00:56:58] (Abandoned) 1Veertje: Use full-original Commons URLs (no thumbnails) + fix author credits [software/gerrit] (wmf/stable-3.10) - https://gerrit.wikimedia.org/r/1305788 (owner: 1Veertje)
[00:57:08] (Abandoned) 1Veertje: Background rotation: add license links and Commons file page links to attribution [software/gerrit] (wmf/stable-3.10) - https://gerrit.wikimedia.org/r/1305787 (owner: 1Veertje)
[00:57:21] (Abandoned) 1Veertje: Use 1920px thumbnails (API thumburl) instead of full-res originals [software/gerrit] (wmf/stable-3.10) - https://gerrit.wikimedia.org/r/1305794 (owner: 1Veertje)
[00:57:33] (Abandoned) 1Veertje: Add preload link in head for faster background image loading [software/gerrit] (wmf/stable-3.10) - https://gerrit.wikimedia.org/r/1305795 (owner: 1Veertje)
[00:57:44] (Abandoned) 1Veertje: Add daily background rotation from Wikimedia Commons (29 winter landscapes) [software/gerrit] (wmf/stable-3.10) - https://gerrit.wikimedia.org/r/1305786 (owner: 1Veertje)
[00:57:54] (Abandoned) 1Veertje: Add featured desktop backgrounds (98 sampled) with winter/featured date switching [software/gerrit] (wmf/stable-3.10) - https://gerrit.wikimedia.org/r/1305793 (owner: 1Veertje)
[00:58:15] ops-eqiad, SRE, DC-Ops, decommission-hardware, fundraising-tech-ops: decommission frmx1001.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T429529#12057097 (Jclark-ctr) a:Jclark-ctr E15 , U 20
[00:59:26] ops-eqiad, SRE, DC-Ops, decommission-hardware, fundraising-tech-ops: decommission frqueue1003.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T429520#12057108 (Jclark-ctr) E16 U23
[00:59:44] (Abandoned) 1Veertje: Fix login form: responsive labels and mobile-friendly layout [software/gerrit] (wmf/stable-3.10) - https://gerrit.wikimedia.org/r/1305745 (owner: 1Veertje)
[01:00:46] ops-eqiad, SRE, DC-Ops, decommission-hardware, fundraising-tech-ops: decommission frqueue1003.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T429520#12057114 (Jclark-ctr) @Dwisehaupt this server is still listed as active. Can I set to Decommission?
[01:01:08] ops-eqiad, SRE, DC-Ops, decommission-hardware, fundraising-tech-ops: decommission frmx1001.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T429529#12057116 (Jclark-ctr) @Dwisehaupt this server is still listed as active. Can I set to Decommission?
[01:01:55] ops-eqiad, SRE, DC-Ops, decommission-hardware, fundraising-tech-ops: decommission frmx1001.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T429529#12057117 (Jclark-ctr)
[01:02:06] ops-eqiad, SRE, DC-Ops, decommission-hardware, fundraising-tech-ops: decommission frqueue1003.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T429520#12057118 (Jclark-ctr)
[01:04:13] (PS2) RLazarus: wikifunctions: Enable custom dnsConfig pointing at coredns-internalonly [deployment-charts] - https://gerrit.wikimedia.org/r/1305798 (https://phabricator.wikimedia.org/T427864)
[01:04:13] (PS2) RLazarus: admin_ng: Restrict wikifunctions network access to DNS [deployment-charts] - https://gerrit.wikimedia.org/r/1305799 (https://phabricator.wikimedia.org/T427864)
[01:11:40] PROBLEM - MD RAID on wikikube-worker2159 is CRITICAL: CRITICAL: State: degraded, Active: 1, Working: 1, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[01:11:42] ACKNOWLEDGEMENT - MD RAID on wikikube-worker2159 is CRITICAL: CRITICAL: State: degraded, Active: 1, Working: 1, Failed: 0, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T430240 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[01:11:55] ops-codfw, SRE, DC-Ops: Degraded RAID on wikikube-worker2159 - https://phabricator.wikimedia.org/T430240 (ops-monitoring-bot) NEW
[01:12:11] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1305803
[01:12:11] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1305803 (owner: TrainBranchBot)
[01:17:42] SRE, Infrastructure-Foundations, Patch-For-Review, Release-Engineering-Team (Radar), User-notice: Sunsetting mirrors.wikimedia.org - https://phabricator.wikimedia.org/T416707#12057193 (vaughnwalters)
[01:20:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:20:40] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1305803 (owner: TrainBranchBot)
[01:44:27] ops-eqiad, SRE, DC-Ops: Degraded RAID on an-worker1231 - https://phabricator.wikimedia.org/T430219#12057288 (Jclark-ctr) a:Jclark-ctr Updated firmware on backplane , idrac
[01:53:20] ops-eqiad, SRE, DC-Ops: Degraded RAID on an-worker1231 - https://phabricator.wikimedia.org/T430219#12057308 (Jclark-ctr) SR 228295552
[02:00:42] !log mwpresync@deploy1003 Started scap build-images: Publishing wmf/next image
[02:07:52] !log mwpresync@deploy1003 Finished scap build-images: Publishing wmf/next image (duration: 07m 09s)
[02:09:41] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:14:41] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:34:41] FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-esams:et-0/0/0 (Transport: Arelion (IC-398709) {#20260602}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-esams:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[02:53:45] (PS1) Krinkle: extension-list: Remove WikimediaApiPortalOAuth ext and WikimediaApiPortal skin [mediawiki-config] - https://gerrit.wikimedia.org/r/1305804 (https://phabricator.wikimedia.org/T429373)
[04:08:30] !log cjming@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/test-kitchen: apply
[04:08:58] !log cjming@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/test-kitchen: apply
[04:42:16] FIRING: [2x] NodeBGPSessionStatusNotEstablished: Kubernetes node dse-k8s-worker1023:0 has a BGP session which is not in the 'established' state. - https://wikitech.wikimedia.org/wiki/Kubernetes/Administration#NodeBGPSessionStatusNotEstablished - https://alerts.wikimedia.org/?q=alertname%3DNodeBGPSessionStatusNotEstablished
[05:08:07] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host db2234.codfw.wmnet with OS trixie
[05:09:52] (PS1) Marostegui: Revert "major-upgrade.py: Add !log dbmaint on the start" [cookbooks] - https://gerrit.wikimedia.org/r/1305810
[05:13:32] (CR) Marostegui: [C:+2] Revert "major-upgrade.py: Add !log dbmaint on the start" [cookbooks] - https://gerrit.wikimedia.org/r/1305810 (owner: Marostegui)
[05:15:53] ops-codfw, SRE, DBA, DC-Ops: db2234 Comm Error: Riser 2. - https://phabricator.wikimedia.org/T430116#12057439 (Marostegui) >>! In T430116#12055186, @Jhancock.wm wrote: > okay give it another shot. If it does it again I'm gonna open a ticket with Dell. Or we can try just leaving the riser out....
[05:15:56] ops-codfw, SRE, DBA, DC-Ops: db2234 Comm Error: Riser 2. - https://phabricator.wikimedia.org/T430116#12057440 (Marostegui) Open→Resolved
[05:20:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:23:51] !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db2234.codfw.wmnet with reason: host reimage
[05:28:55] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2234.codfw.wmnet with reason: host reimage
[05:36:56] (PS1) VadymTS1: hrwiki: Add to to wgCiteResponsiveReferences [mediawiki-config] - https://gerrit.wikimedia.org/r/1305811 (https://phabricator.wikimedia.org/T430182)
[05:39:56] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, June 29 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - https://gerrit.wikimedia.org/r/1305811 (https://phabricator.wikimedia.org/T430182) (owner: VadymTS1)
[05:45:39] RECOVERY - MariaDB Replica IO: m3 on db2160 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/Troubleshooting%23Incident_Response
[05:51:54] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2234.codfw.wmnet with OS trixie
[06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260626T0600)
[06:17:09] (PS2) Arnaudb: backups: prune debug artifacts from gerrit backups [puppet] - https://gerrit.wikimedia.org/r/1305827 (https://phabricator.wikimedia.org/T411583)
[06:17:10] (CR) Arnaudb: "@jcrespo@wikimedia.org feel free to merge it if you think it's OK!" [puppet] - https://gerrit.wikimedia.org/r/1305827 (https://phabricator.wikimedia.org/T411583) (owner: Arnaudb)
[06:29:41] FIRING: JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:34:41] RESOLVED: JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:34:41] FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-esams:et-0/0/0 (Transport: Arelion (IC-398709) {#20260602}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-esams:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[07:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260626T0700)
[07:25:37] (PS1) Filippo Giunchedi: admin: add lerickson to deployment group [puppet] - https://gerrit.wikimedia.org/r/1305833
[07:26:16] (PS2) Filippo Giunchedi: admin: add lerickson to deployment group [puppet] - https://gerrit.wikimedia.org/r/1305833 (https://phabricator.wikimedia.org/T429610)
[07:28:28] (CR) Jcrespo: [C:+1] "All good from me, but someone with more gerrit knowledge should approve." [puppet] - https://gerrit.wikimedia.org/r/1305827 (https://phabricator.wikimedia.org/T411583) (owner: Arnaudb)
[07:33:16] (PS1) Slyngshede: data.yaml: extend contract for georgemikesell [puppet] - https://gerrit.wikimedia.org/r/1305834
[07:36:40] SRE, Infrastructure-Foundations, ServiceOps new: Setup url-downloader-next.w.o to simply tests - https://phabricator.wikimedia.org/T430166#12057726 (MLechvien-WMF)
[07:37:03] SRE, Infrastructure-Foundations, ServiceOps new, Sustainability (Incident Followup): Setup url-downloader-next.w.o to simply tests - https://phabricator.wikimedia.org/T430166#12057728 (MLechvien-WMF)
[07:37:26] SRE, hCaptcha, Product Safety and Integrity: hcaptcha and parsoid failed to connect to the new URL downloader proxies - https://phabricator.wikimedia.org/T430045#12057729 (MLechvien-WMF)
[07:37:49] SRE, hCaptcha, Product Safety and Integrity: hcaptcha and citoid failed to connect to the new URL downloader proxies - https://phabricator.wikimedia.org/T430045#12057734 (MLechvien-WMF)
[07:38:48] !log jelto@cumin1003 START - Cookbook sre.hosts.reboot-unattended Sequential unattended reboot of 6 host(s) [team=collaboration-services, os=bookworm]
[07:39:03] !log jelto@cumin1003 END (FAIL) - Cookbook sre.hosts.reboot-unattended (exit_code=99) Sequential unattended reboot of 6 host(s) [team=collaboration-services, os=bookworm]
[07:39:49] (CR) Slyngshede: [C:+1] admin: add lerickson to deployment group [puppet] - https://gerrit.wikimedia.org/r/1305833 (https://phabricator.wikimedia.org/T429610) (owner: Filippo Giunchedi)
[07:40:31] (CR) Slyngshede: [C:+2] data.yaml: extend contract for georgemikesell [puppet] - https://gerrit.wikimedia.org/r/1305834 (owner: Slyngshede)
[07:45:55] (PS1) Filippo Giunchedi: clinic-duty: add Telxius multiple dates support [software] - https://gerrit.wikimedia.org/r/1305837
[07:46:25] SRE, hCaptcha, Product Safety and Integrity, Sustainability (Incident Followup): hcaptcha and citoid failed to connect to the new URL downloader proxies - https://phabricator.wikimedia.org/T430045#12057741 (MLechvien-WMF)
[07:47:01] (PS3) Jelto: sre.hosts.reboot-unattended: add new cookbook for unattended reboots [cookbooks] - https://gerrit.wikimedia.org/r/1305707
[07:47:01] (CR) Jelto: "What do you think about this approach? I tried extending the existing `sre.hosts.reboot-cluster` cookbook but the scope is quite different" [cookbooks] - https://gerrit.wikimedia.org/r/1305707 (owner: Jelto)
[07:47:17] (PS1) Dpogorzelski: ml-serve: add size label to GPU memory metrics [puppet] - https://gerrit.wikimedia.org/r/1305843 (https://phabricator.wikimedia.org/T429597)
[07:48:15] (CR) Dpogorzelski: [C:+2] ml-serve: add size label to GPU memory metrics [puppet] - https://gerrit.wikimedia.org/r/1305843 (https://phabricator.wikimedia.org/T429597) (owner: Dpogorzelski)
[07:48:34] SRE, Traffic, Patch-For-Review: WE5.2.13 Dumps UA enforcement - https://phabricator.wikimedia.org/T427836#12057769 (SLyngshede-WMF) @HCoplin-WMF I can merge the patch on Monday morning EU time, if that works for you?
[07:49:32] (CR) Filippo Giunchedi: [C:+2] admin: add lerickson to deployment group [puppet] - https://gerrit.wikimedia.org/r/1305833 (https://phabricator.wikimedia.org/T429610) (owner: Filippo Giunchedi)
[07:50:31] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access for lerickson to deploy the RDF streaming updater on wikikube - https://phabricator.wikimedia.org/T429610#12057776 (fgiunchedi) Open→Resolved a:fgiunchedi This is done -- change will be live in the next 30m. If something is...
[08:06:08] (CR) Fabfur: [C:+1] systemd::path: fix empty Unit= in path unit [puppet] - https://gerrit.wikimedia.org/r/1305701 (owner: Volans)
[08:10:36] (CR) Bartosz Wójtowicz: "The +1 got removed after the last update, could you approve the new patchset and we'll merge and deploy the changes?" [deployment-charts] - https://gerrit.wikimedia.org/r/1305621 (https://phabricator.wikimedia.org/T426749) (owner: Bartosz Wójtowicz)
[08:20:49] (PS1) Gmodena: WIP: admin_ng: add wdqs local-storage resources for qlever indexer [deployment-charts] - https://gerrit.wikimedia.org/r/1305852
[08:21:36] (CR) Zabe: [C:+1] extension-list: Remove WikimediaApiPortalOAuth ext and WikimediaApiPortal skin [mediawiki-config] - https://gerrit.wikimedia.org/r/1305804 (https://phabricator.wikimedia.org/T429373) (owner: Krinkle)
[08:26:36] (PS1) Gmodena: airflow-wikidata: Add qlever index PVCs [deployment-charts] - https://gerrit.wikimedia.org/r/1305854
[08:27:18] (PS2) Gmodena: WIP: airflow-wikidata: add qlever index PVCs [deployment-charts] - https://gerrit.wikimedia.org/r/1305854
[08:31:25] (PS1) CWilliams: Revert^2 "major-upgrade.py: Add !log dbmaint on the start" [cookbooks] - https://gerrit.wikimedia.org/r/1305856
[08:31:58] (CR) CWilliams: [C:+2] Revert^2 "major-upgrade.py: Add !log dbmaint on the start" [cookbooks] - https://gerrit.wikimedia.org/r/1305856 (owner: CWilliams)
[08:41:49] FIRING: HelmReleaseBadStatus: Helm release wdqs/main-internal on k8s-dse@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=wdqs - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[08:42:16] FIRING: [2x] NodeBGPSessionStatusNotEstablished: Kubernetes node dse-k8s-worker1023:0 has a BGP session which is not in the 'established' state. - https://wikitech.wikimedia.org/wiki/Kubernetes/Administration#NodeBGPSessionStatusNotEstablished - https://alerts.wikimedia.org/?q=alertname%3DNodeBGPSessionStatusNotEstablished
[08:45:32] ops-eqiad, SRE, DC-Ops: Check list of PXE miss-configs for eqiad - https://phabricator.wikimedia.org/T401441#12058074 (fgiunchedi) >>! In T401441#12055962, @VRiley-WMF wrote: > Thanks @fgiunchedi Would you be able to do this tomorrow or sometime next week? Yes next week works great for me, Tues 30th...
[08:46:11] (CR) Arnaudb: [C:+2] backups: prune debug artifacts from gerrit backups [puppet] - https://gerrit.wikimedia.org/r/1305827 (https://phabricator.wikimedia.org/T411583) (owner: Arnaudb)
[08:50:01] SRE-OnFire, Citoid: Large increase in citoid latency starting on June 25/ ~ 21 UTC - present - https://phabricator.wikimedia.org/T430279 (Mvolz) NEW
[08:50:18] SRE-OnFire, Citoid: Large increase in citoid latency starting on June 25/ ~ 21 UTC - present - https://phabricator.wikimedia.org/T430279#12058102 (Mvolz) p:Triage→Unbreak!
[08:51:49] RESOLVED: HelmReleaseBadStatus: Helm release wdqs/main-internal on k8s-dse@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=wdqs - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[08:52:43] SRE-OnFire, Citoid: Large increase in citoid latency starting on June 25/ ~ 21 UTC - present - https://phabricator.wikimedia.org/T430279#12058118 (Mvolz) {F90546614}
[09:05:20] !log fnegri@cumin1003 START - Cookbook sre.wikireplicas.add-wiki for database isvwiki (T429938)
[09:05:25] T429938: [wikireplicas] Create views for new wiki isvwiki - https://phabricator.wikimedia.org/T429938
[09:05:32] !log fnegri@cumin1003 END (FAIL) - Cookbook sre.wikireplicas.add-wiki (exit_code=99) for database isvwiki (T429938)
[09:09:22] !log fnegri@cumin1003 START - Cookbook sre.wikireplicas.add-wiki for database isvwiki (T429938)
[09:09:32] !log fnegri@cumin1003 END (FAIL) - Cookbook sre.wikireplicas.add-wiki (exit_code=99) for database isvwiki (T429938)
[09:09:49] FIRING: HelmReleaseBadStatus: Helm release wdqs/main-internal on k8s-dse@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=wdqs - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[09:15:32] !log fnegri@cumin1003 START - Cookbook sre.wikireplicas.add-wiki for database isvwiki (T429938)
[09:15:36] T429938: [wikireplicas] Create views for new wiki isvwiki - https://phabricator.wikimedia.org/T429938
[09:20:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:21:35] jelto: not quite sure if this is pingable but we've had a big increase in latency in the citoid service since 21 utc last night and overnight have burned through half our SLO. It's looking like latency for 404s has increased drastically and p99 has increased from 30s to 2 minutes. In the front end we time out earlier than that which saves us a bit from the user perspective, but I'm seeing increased latency for 200s too though
[09:21:35] not as drastic. Is there anything you know of that could be causing this? My first theory was some external service we use but I checked the major culprits and they're not reporting anything/curling them directly seems fast enough. Any ideas what to check next?
[09:21:42] https://phabricator.wikimedia.org/T430279
[09:24:19] SRE-OnFire, Citoid: Large increase in citoid latency starting on June 25/ ~ 21 UTC - present - https://phabricator.wikimedia.org/T430279#12058221 (Mvolz)
[09:25:41] let me take a look, one sec, also cc slyngs ^
[09:25:59] ty :)
[09:26:57] Mvolz: can you add the link from the dashboard to the task as well?
[09:29:23] SRE-OnFire, Citoid: Large increase in citoid latency starting on June 25/ ~ 21 UTC - present - https://phabricator.wikimedia.org/T430279#12058237 (Mvolz)
[09:29:34] SRE-OnFire, Citoid: Large increase in citoid latency starting on June 25/ ~ 21 UTC - present - https://phabricator.wikimedia.org/T430279#12058238 (Mvolz)
[09:30:01] SRE-OnFire, Citoid: Large increase in citoid latency starting on June 25/ ~ 21 UTC - present - https://phabricator.wikimedia.org/T430279#12058250 (Mvolz)
[09:30:19] SRE-OnFire, Citoid: Large increase in citoid latency starting on June 25/ ~ 21 UTC - present - https://phabricator.wikimedia.org/T430279#12058251 (Mvolz)
[09:30:31] Looking through the SAL log there's this: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1305639
[09:31:06] SRE-OnFire, Citoid: Large increase in citoid latency starting on June 25/ ~ 21 UTC - present - https://phabricator.wikimedia.org/T430279#12058253 (Mvolz)
[09:31:40] SRE-OnFire, Citoid: Large increase in citoid latency starting on June 25/ ~ 21 UTC - present - https://phabricator.wikimedia.org/T430279#12058263 (Mvolz)
[09:32:09] jelto: done, hope those work for you. the share link buttons were not terribly helpful!
[09:34:41] FIRING: JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:35:14] Thank you, unfortunately the links are not working on my side, I get redirected to the grafana landing page.
[09:35:14] I'm currently digging through traffic to see if there is anything.
[09:35:48] thanks for finding the SAL entry, there was indeed a deployment in the Thursday, June 25 UTC late backport window at 2026-06-25 20:00 UTC
[09:37:44] looking at late backport just from what's on wikitech nothing seems related - all mediawiki stuff, not services...
[09:39:41] RESOLVED: JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:42:40] !log fnegri@cumin1003 START - Cookbook sre.wikireplicas.add-wiki for database magwiki (T428282)
[09:42:45] T428282: [wikireplicas] Create views for new wiki magwiki - https://phabricator.wikimedia.org/T428282
[09:42:53] !log fnegri@cumin1003 END (PASS) - Cookbook sre.wikireplicas.add-wiki (exit_code=0) for database magwiki (T428282)
[09:43:19] SRE-OnFire, Citoid: Large increase in citoid latency starting on June 25/ ~ 21 UTC - present - https://phabricator.wikimedia.org/T430279#12058323 (Mvolz)
[09:47:20] (PS1) Filippo Giunchedi: site: make sure new cloudvirts have no firewall [puppet] - https://gerrit.wikimedia.org/r/1305862 (https://phabricator.wikimedia.org/T429563)
[09:48:38] SRE-OnFire, Citoid: Large increase in citoid latency starting on June 25/ ~ 21 UTC - present - https://phabricator.wikimedia.org/T430279#12058360 (Jelto) I did some digging in superset but did not found any obvious pattern yet. @SLyngshede-WMF mentioned there was a deployment yesterday https://gerrit.wi...
[09:48:47] I'll ask in serviceops channel
[09:50:09] Okay, looking around logstash doesn't really tell me much. I do see a lot of "Unable to retrieve metadata from ISBN XXXXX from Zotero" but I don't know if that's normal
[09:50:25] maybe related to https://phabricator.wikimedia.org/T430045 ?
[09:52:45] The isbn failures are normal, we don't have that many isbns and a bot has been doing that for a while
[09:53:12] ack
[09:53:13] T430045 is resolved, the deployer reverted, afaik nothing new has been done?
[09:53:13] T430045: hcaptcha and citoid failed to connect to the new URL downloader proxies - https://phabricator.wikimedia.org/T430045
[09:54:00] The failed url-downloader deploy did cause a latency spike but it also caused a success ratio failure spike and was reverted yesterday around noon.
[09:54:31] Weirdly there's no effect on success ratio with this latency issue, just latency only it looks like
[09:56:15] yeah and only 404s? https://grafana.wikimedia.org/d/NJkCVermz/citoid?orgId=1&from=now-24h&to=now&timezone=utc&var-dc=000000026&refresh=5m&viewPanel=panel-21
[09:56:30] ah no 200s are also a bit slower
[09:57:16] 4 or 5x slower :-)
[09:57:16] (PS5) Santiago Faci: growthbook: Updated chart to add API_RATE_LIMIT_MAX env var [deployment-charts] - https://gerrit.wikimedia.org/r/1305785 (https://phabricator.wikimedia.org/T429420)
[10:00:16] SRE-OnFire, Citoid: Large increase in citoid latency starting on June 25/ ~ 21 UTC - present - https://phabricator.wikimedia.org/T430279#12058387 (Mvolz) >>! In T430279#12058358, @Jelto wrote: > I did some digging in superset but did not found any obvious pattern yet. > > @SLyngshede-WMF mentioned there...
[10:03:10] SRE-OnFire, Citoid: Large increase in citoid latency starting on June 25/ ~ 21 UTC - present - https://phabricator.wikimedia.org/T430279#12058402 (Mvolz) My first thought was that some third party service we use a lot might have affected things, like the doi.org resolved, crossref.org, archive,org or pmc...
[10:04:49] RESOLVED: HelmReleaseBadStatus: Helm release wdqs/main-internal on k8s-dse@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=wdqs - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [10:09:10] (03PS1) 10Urbanecm: [Growth] frwiki: Deploy automated mentor list cleaner [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1305867 (https://phabricator.wikimedia.org/T427386) [10:09:13] (03CR) 10Filippo Giunchedi: [C:03+2] "Hosts not in service, self-merging" [puppet] - 10https://gerrit.wikimedia.org/r/1305862 (https://phabricator.wikimedia.org/T429563) (owner: 10Filippo Giunchedi) [10:09:49] FIRING: HelmReleaseBadStatus: Helm release wdqs/main-internal on k8s-dse@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=wdqs - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [10:12:53] !log fnegri@cumin1003 END (PASS) - Cookbook sre.wikireplicas.add-wiki (exit_code=0) for database isvwiki (T429938) [10:12:57] T429938: [wikireplicas] Create views for new wiki isvwiki - https://phabricator.wikimedia.org/T429938 [10:13:47] 06SRE-OnFire, 10Citoid: Large increase in citoid latency starting on June 25/ ~ 21 UTC - present - https://phabricator.wikimedia.org/T430279#12058435 (10SLyngshede-WMF) I notice that one bot starts at around the time latency start going up: https://w.wiki/Rn5a [10:14:49] (03CR) 10Btullis: dse-k8s-services: Enable ingress on WDQS namespaces (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1302784 (https://phabricator.wikimedia.org/T429313) (owner: 10Trueg) [10:15:08] 06SRE, 10Wikimedia-maintenance-script-run, 07Serbian-Sites: Special pages didn't update since 10th of May on srwiki 
and shwiki - https://phabricator.wikimedia.org/T430269#12058441 (10Peachey88) [10:15:09] (03CR) 10Btullis: [C:03+1] dse-k8s-services: Enable ingress on WDQS namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1302784 (https://phabricator.wikimedia.org/T429313) (owner: 10Trueg) [10:17:50] (03CR) 10Btullis: [C:03+2] namespaces: pageview-trending [deployment-charts] - 10https://gerrit.wikimedia.org/r/1305632 (https://phabricator.wikimedia.org/T430136) (owner: 10JavierMonton) [10:18:18] (03CR) 10Btullis: [C:03+2] k8s namespace: pageview-trending [puppet] - 10https://gerrit.wikimedia.org/r/1305630 (https://phabricator.wikimedia.org/T430136) (owner: 10JavierMonton) [10:26:09] 06SRE-OnFire, 10Citoid: Large increase in citoid latency starting on June 25/ ~ 21 UTC - present - https://phabricator.wikimedia.org/T430279#12058596 (10Mvolz) >>! In T430279#12058435, @SLyngshede-WMF wrote: > I notice that one bot starts at around the time latency start going up: https://w.wiki/Rn5a Hmm... c... [10:26:16] (03Merged) 10jenkins-bot: namespaces: pageview-trending [deployment-charts] - 10https://gerrit.wikimedia.org/r/1305632 (https://phabricator.wikimedia.org/T430136) (owner: 10JavierMonton) [10:27:28] (03PS1) 10Ayounsi: Fix homer crash if subinterface doesn't have an IP [homer/public] - 10https://gerrit.wikimedia.org/r/1305871 [10:28:33] (03CR) 10Brouberol: [C:03+1] growthbook: Updated chart to add API_RATE_LIMIT_MAX env var [deployment-charts] - 10https://gerrit.wikimedia.org/r/1305785 (https://phabricator.wikimedia.org/T429420) (owner: 10Santiago Faci) [10:31:18] 06SRE-OnFire, 10Citoid: Large increase in citoid latency starting on June 25/ ~ 21 UTC - present - https://phabricator.wikimedia.org/T430279#12058600 (10Jelto) Yes I can't find any suspicious resource metrics in citoid and zotero. So I see no reason in scaling the deployment in Kubernetes. Also no elevated tra... 
[10:34:41] FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-esams:et-0/0/0 (Transport: Arelion (IC-398709) {#20260602}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-esams:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [10:38:20] (03CR) 10Cathal Mooney: [C:03+1] Fix homer crash if subinterface doesn't have an IP [homer/public] - 10https://gerrit.wikimedia.org/r/1305871 (owner: 10Ayounsi) [10:38:39] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr2-codfw and Hurricane Electric (2001:504:61::1b1b:0:1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [10:39:41] FIRING: JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:40:07] !log filippo@cumin1003 START - Cookbook sre.dns.netbox [10:44:23] !log filippo@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: new IPs for cloudvirt1078-80 - filippo@cumin1003" [10:44:31] !log filippo@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: new IPs for cloudvirt1078-80 - filippo@cumin1003" [10:44:31] !log filippo@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:44:41] RESOLVED: JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable 
[10:45:04] !log filippo@cumin1003 START - Cookbook sre.hosts.reimage for host cloudvirt1078.eqiad.wmnet with OS trixie [10:45:26] !log filippo@cumin1003 START - Cookbook sre.hosts.reimage for host cloudvirt1079.eqiad.wmnet with OS trixie [10:45:46] !log filippo@cumin1003 START - Cookbook sre.hosts.reimage for host cloudvirt1080.eqiad.wmnet with OS trixie [10:49:03] (03CR) 10Ayounsi: [C:03+2] Fix homer crash if subinterface doesn't have an IP [homer/public] - 10https://gerrit.wikimedia.org/r/1305871 (owner: 10Ayounsi) [10:51:40] (03Merged) 10jenkins-bot: Fix homer crash if subinterface doesn't have an IP [homer/public] - 10https://gerrit.wikimedia.org/r/1305871 (owner: 10Ayounsi) [10:52:59] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [10:55:01] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [10:56:46] !log filippo@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1078.eqiad.wmnet with reason: host reimage [10:57:02] !log filippo@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1079.eqiad.wmnet with reason: host reimage [10:57:13] !log filippo@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1080.eqiad.wmnet with reason: host reimage [10:59:00] (03CR) 10Btullis: [C:03+2] Add commonswiki globalimagelinks monthly sqoop [puppet] - 10https://gerrit.wikimedia.org/r/1295045 (https://phabricator.wikimedia.org/T427532) (owner: 10Dr0ptp4kt) [10:59:22] (03CR) 10Btullis: [C:03+2] Add filerevision to the mediawiki not-history sqoop [puppet] - 10https://gerrit.wikimedia.org/r/1295047 (https://phabricator.wikimedia.org/T427532) (owner: 10Dr0ptp4kt) [11:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260626T0700) [11:00:04] jelto, arnoldokoth, mutante, and arnaudb: #bothumor I � Unicode. All rise for GitLab version upgrades deploy. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20260626T1100). [11:02:04] 06SRE, 10Wikimedia-Mailing-lists: Request for creation: les sans pagEs Mailing List - https://phabricator.wikimedia.org/T245066#12058699 (10Natacha_LSP) a:05Volans→03Volans42 Hello, would it be possible to add communication@sans-pages.org to the admins ? We have grown a little bit since we started and... [11:03:40] 06SRE, 10Wikimedia-Mailing-lists: Request for creation: les sans pagEs Mailing List - https://phabricator.wikimedia.org/T245066#12058701 (10Natacha_LSP) a:05Volans42→03Volans Hello, would it be possible to add communication@sans-pages.org to the admins ? We have grown a little bit since we started and... [11:04:00] !log filippo@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1078.eqiad.wmnet with reason: host reimage [11:04:20] 06SRE-OnFire, 10Citoid: Large increase in citoid latency starting on June 25/ ~ 21 UTC - present - https://phabricator.wikimedia.org/T430279#12058705 (10Joe) The increase in latency seems related to an increase in 404 results. Has anyone investigated that? It's very possible that the issue is related to someon... [11:06:13] 06SRE-OnFire, 10Citoid: Large increase in citoid latency starting on June 25/ ~ 21 UTC - present - https://phabricator.wikimedia.org/T430279#12058709 (10Joe) So: I don't think this is an incident with user-facing consequences. It lines up with an increase in requests and corresponding 404s. Unless we have evid... 
[11:09:09] !log filippo@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1079.eqiad.wmnet with reason: host reimage [11:13:52] !log filippo@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1080.eqiad.wmnet with reason: host reimage [11:18:40] !log filippo@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1078.eqiad.wmnet with OS trixie [11:23:55] !log atsuko@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply [11:24:24] !log filippo@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1079.eqiad.wmnet with OS trixie [11:25:57] !log atsuko@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply [11:26:02] !log atsuko@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply [11:30:15] 06SRE, 10Wikimedia-Mailing-lists: Request for creation: les sans pagEs Mailing List - https://phabricator.wikimedia.org/T245066#12058764 (10Aklapper) @Natacha_LSP: Hi, please contact the current mailing list owners. Existing list owners can edit/add list owners via https://lists.wikimedia.org/postorius/lis... 
[11:30:52] !log filippo@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1080.eqiad.wmnet with OS trixie [11:32:04] !log oblivian@deploy1003 helmfile [eqiad] START helmfile.d/services/zotero: sync [11:32:21] !log oblivian@deploy1003 helmfile [eqiad] DONE helmfile.d/services/zotero: sync [11:32:43] <_joe_> !log rolling restart of zotero in eqiad T430279 [11:32:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:32:48] T430279: Large increase in citoid latency starting on June 25/ ~ 21 UTC - present - https://phabricator.wikimedia.org/T430279 [11:34:20] 06SRE-OnFire, 10Citoid: Large increase in citoid latency starting on June 25/ ~ 21 UTC - present - https://phabricator.wikimedia.org/T430279#12058781 (10Joe) The problem is also limited to eqiad, showing that the problem is probably not systemic - further excluding url-downloader as the culprit. [11:35:56] 06SRE-OnFire, 10Citoid: Large increase in citoid latency starting on June 25/ ~ 21 UTC - present - https://phabricator.wikimedia.org/T430279#12058782 (10Joe) Also: eqiad is showing a 45% failure of citations, which might be due to it getting more traffic, as it's the current active datacenter. [11:42:33] !log oblivian@deploy1003 helmfile [eqiad] START helmfile.d/services/citoid: sync [11:42:45] <_joe_> !log rolling restart of citoid in eqiad T430279 [11:42:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:42:51] T430279: Large increase in citoid latency starting on June 25/ ~ 21 UTC - present - https://phabricator.wikimedia.org/T430279 [11:42:51] !log atsuko@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply [11:43:00] !log oblivian@deploy1003 helmfile [eqiad] DONE helmfile.d/services/citoid: sync [11:43:28] (03CR) 10Bartosz Wójtowicz: [C:03+1] "Thank you for updating those!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1305679 (https://phabricator.wikimedia.org/T421237) (owner: 10AikoChou) [11:44:51] !log atsuko@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply [11:51:17] (03CR) 10Trueg: [C:03+2] dse-k8s-services: Enable ingress on WDQS namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1302784 (https://phabricator.wikimedia.org/T429313) (owner: 10Trueg) [11:53:18] 06SRE, 10SRE-Access-Requests: Requesting access to "analytics-privatedata" for mona_thierse - https://phabricator.wikimedia.org/T430304 (10Monrac5) 03NEW [11:53:38] (03Merged) 10jenkins-bot: dse-k8s-services: Enable ingress on WDQS namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1302784 (https://phabricator.wikimedia.org/T429313) (owner: 10Trueg) [11:56:19] (03PS2) 10VadymTS1: hrwiki: Add to wgCiteResponsiveReferences [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1305811 (https://phabricator.wikimedia.org/T430182) [11:59:58] 06SRE-OnFire, 10Citoid: Large increase in citoid latency starting on June 25/ ~ 21 UTC - present - https://phabricator.wikimedia.org/T430279#12058880 (10Joe) I am pretty sure what's going on is that some upstream service is slowing up responses to us intentionally. I've roll-restarted both citoid and zotero to... 
[12:01:36] !log atsuko@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [12:02:12] !log atsuko@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [12:04:51] !log atsuko@deploy1003 helmfile [dse-k8s-codfw] START helmfile.d/dse-k8s-services/echoserver: apply [12:05:02] !log atsuko@deploy1003 helmfile [dse-k8s-codfw] DONE helmfile.d/dse-k8s-services/echoserver: apply [12:08:19] !log atsuko@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/superset: apply [12:10:17] !log oblivian@cumin1003 START - Cookbook sre.discovery.service-route depool zotero in eqiad: Testing theory about upstreams [12:12:03] 06SRE-OnFire, 10Citoid: Large increase in citoid latency starting on June 25/ ~ 21 UTC - present - https://phabricator.wikimedia.org/T430279#12058922 (10Joe) Looking at responses from zotero, I see a lot of 501 not implemented responses. Not sure if this has any relevance. [12:15:21] !log oblivian@cumin1003 END (PASS) - Cookbook sre.discovery.service-route (exit_code=0) depool zotero in eqiad: Testing theory about upstreams [12:17:24] !log oblivian@cumin1003 START - Cookbook sre.discovery.service-route check citoid: maintenance [12:17:24] !log oblivian@cumin1003 END (PASS) - Cookbook sre.discovery.service-route (exit_code=0) check citoid: maintenance [12:18:07] !log atsuko@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/superset: apply [12:19:05] !log atsuko@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/superset: apply [12:19:32] !log oblivian@cumin1003 START - Cookbook sre.discovery.service-route pool zotero in eqiad: maintenance [12:19:33] !log oblivian@cumin1003 START - Cookbook sre.dns.wipe-cache zotero.discovery.wmnet on all recursors [12:19:37] !log oblivian@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) zotero.discovery.wmnet on all recursors [12:24:36] !log oblivian@cumin1003 END (PASS) - Cookbook 
sre.discovery.service-route (exit_code=0) pool zotero in eqiad: maintenance [12:30:22] !log marostegui@cumin1003 conftool action : set/weight=1; selector: name=clouddb1026.eqiad.wmnet [12:30:27] !log marostegui@cumin1003 conftool action : set/pooled=yes; selector: name=clouddb1026.eqiad.wmnet,service=s1 [12:32:28] !log atsuko@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/spark-history: apply [12:32:44] !log marostegui@cumin1003 conftool action : set/weight=10; selector: name=clouddb1026.eqiad.wmnet [12:32:57] !log marostegui@cumin1003 conftool action : set/pooled=yes; selector: name=clouddb1026.eqiad.wmnet,service=s1 [12:33:15] !log atsuko@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/spark-history: apply [12:37:37] (03PS1) 10Gkyziridis: ml-services: Deploy Qwen3.6 latest image. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1305886 (https://phabricator.wikimedia.org/T425680) [12:39:09] (03CR) 10Aude: [C:04-1] Phase 3 Legal contact link deployments. (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1305773 (https://phabricator.wikimedia.org/T430227) (owner: 10Jdrewniak) [12:39:53] (03CR) 10Aude: [C:04-1] Phase 3 Legal contact link deployments. (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1305773 (https://phabricator.wikimedia.org/T430227) (owner: 10Jdrewniak) [12:42:16] FIRING: [2x] NodeBGPSessionStatusNotEstablished: Kubernetes node dse-k8s-worker1023:0 has a BGP session which is not in the 'established' state. 
- https://wikitech.wikimedia.org/wiki/Kubernetes/Administration#NodeBGPSessionStatusNotEstablished - https://alerts.wikimedia.org/?q=alertname%3DNodeBGPSessionStatusNotEstablished [12:44:03] (03CR) 10AikoChou: [C:03+2] ml-services: bump event-emitting isvc image tags in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1305679 (https://phabricator.wikimedia.org/T421237) (owner: 10AikoChou) [12:46:24] (03Merged) 10jenkins-bot: ml-services: bump event-emitting isvc image tags in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1305679 (https://phabricator.wikimedia.org/T421237) (owner: 10AikoChou) [12:54:51] (03PS2) 10Jdrewniak: Phase 3 Legal contact link deployments. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1305773 (https://phabricator.wikimedia.org/T430227) [12:56:55] !log atsuko@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/datasets-config-next: apply [12:57:09] !log atsuko@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/datasets-config-next: apply [12:58:52] (03CR) 10Kevin Bazira: [C:03+1] ml-services: Deploy Qwen3.6 latest image. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1305886 (https://phabricator.wikimedia.org/T425680) (owner: 10Gkyziridis) [12:59:40] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: db2234 Comm Error: Riser 2. - https://phabricator.wikimedia.org/T430116#12058995 (10Jhancock.wm) i reseated both risers. most of the time that issue is triggered by a frimware issue. but it can also be a "floating riser" issue. some parts just eventually come lose b... [13:01:12] (03CR) 10Gkyziridis: [C:03+2] ml-services: Deploy Qwen3.6 latest image. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1305886 (https://phabricator.wikimedia.org/T425680) (owner: 10Gkyziridis) [13:03:24] (03Merged) 10jenkins-bot: ml-services: Deploy Qwen3.6 latest image. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1305886 (https://phabricator.wikimedia.org/T425680) (owner: 10Gkyziridis) [13:05:19] !log Restarting CI Jenkins [13:05:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:05:25] RESOLVED: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:05:29] (03PS1) 10Gkyziridis: ml-services: Deploy latest version of revertrisk-wikidata. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1305889 (https://phabricator.wikimedia.org/T429675) [13:06:45] 06SRE, 10Maps, 06Traffic, 07affects-Kiwix-and-openZIM: Wikipedia wikis have broken maps URLs in infobox: "Bad GeoJSON - unknown \"type\" property \"ExternalData\"" - https://phabricator.wikimedia.org/T424046#12059010 (10Benoit74) @Aklapper we are a bit surprised this is still not fully triaged. This has... [13:09:45] !log gkyziridis@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' . 
[13:13:08] (03CR) 10TrainBranchBot: [C:03+2] "Approved by krinkle@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1304247 (https://phabricator.wikimedia.org/T107188) (owner: 10Krinkle) [13:13:09] (03CR) 10TrainBranchBot: [C:03+2] "Approved by krinkle@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1305804 (https://phabricator.wikimedia.org/T429373) (owner: 10Krinkle) [13:13:13] !log atsuko@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s_services/services/datahub-next: apply [13:14:05] (03Merged) 10jenkins-bot: Undeploy the ShortUrl extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1304247 (https://phabricator.wikimedia.org/T107188) (owner: 10Krinkle) [13:14:09] (03Merged) 10jenkins-bot: extension-list: Remove WikimediaApiPortalOAuth ext and WikimediaApiPortal skin [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1305804 (https://phabricator.wikimedia.org/T429373) (owner: 10Krinkle) [13:14:43] !log krinkle@deploy1003 Started scap sync-world: Backport for [[gerrit:1304247|Undeploy the ShortUrl extension (T107188)]], [[gerrit:1305804|extension-list: Remove WikimediaApiPortalOAuth ext and WikimediaApiPortal skin (T429373 T429374 T418494)]] [13:14:53] T107188: Sunset ShortUrl extension in favour of UrlShortener extension - https://phabricator.wikimedia.org/T107188 [13:14:54] T429373: Archive the WikimediaApiPortal skin - https://phabricator.wikimedia.org/T429373 [13:14:54] T429374: Archive the WikimediaApiPortalOAuth extension - https://phabricator.wikimedia.org/T429374 [13:14:54] T418494: Delete the API Portal wiki - https://phabricator.wikimedia.org/T418494 [13:15:27] !log atsuko@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s_services/services/datahub-next: apply [13:16:35] 10ops-codfw, 06SRE, 06DC-Ops: Degraded RAID on wikikube-worker2159 - https://phabricator.wikimedia.org/T430240#12059055 (10Jhancock.wm) i checked the lifecycle logs on this one 
and it seems odd. 2026-06-26 01:00:15 PDR8 Disk 1 in Backplane 1 of Storage Controller in SL 3 is inserted. 2026-06-26 00:5... [13:17:05] 06SRE, 10Maps, 06MediaWiki-Engineering, 06Traffic, 07affects-Kiwix-and-openZIM: Wikipedia wikis have broken maps URLs in infobox: "Bad GeoJSON - unknown \"type\" property \"ExternalData\"" - https://phabricator.wikimedia.org/T424046#12059056 (10Krinkle) [13:17:13] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: db2234 Comm Error: Riser 2. - https://phabricator.wikimedia.org/T430116#12059057 (10Marostegui) thank you, so far so good :) thanks for the help [13:19:01] 06SRE, 06Content-Transform-Team, 10Maps, 06Traffic, 07affects-Kiwix-and-openZIM: Wikipedia wikis have broken maps URLs in infobox: "Bad GeoJSON - unknown \"type\" property \"ExternalData\"" - https://phabricator.wikimedia.org/T424046#12059061 (10Krinkle) [13:22:03] !log atsuko@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply [13:22:07] !log atsuko@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply [13:22:13] !log atsuko@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/webrequest-page-view-next: apply [13:22:17] !log atsuko@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/webrequest-page-view-next: apply [13:29:14] (03PS1) 10AikoChou: ml-services: add revertrisk-wikidata to ml-staging-codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1305890 (https://phabricator.wikimedia.org/T420883) [13:32:41] !log krinkle@deploy1003 krinkle: Backport for [[gerrit:1304247|Undeploy the ShortUrl extension (T107188)]], [[gerrit:1305804|extension-list: Remove WikimediaApiPortalOAuth ext and WikimediaApiPortal skin (T429373 T429374 T418494)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[13:32:52] T107188: Sunset ShortUrl extension in favour of UrlShortener extension - https://phabricator.wikimedia.org/T107188 [13:32:53] T429373: Archive the WikimediaApiPortal skin - https://phabricator.wikimedia.org/T429373 [13:32:53] T429374: Archive the WikimediaApiPortalOAuth extension - https://phabricator.wikimedia.org/T429374 [13:32:53] T418494: Delete the API Portal wiki - https://phabricator.wikimedia.org/T418494 [13:36:03] 10ops-eqiad, 06SRE, 06DC-Ops: Q3:rack/setup/install dse-k8s-worker10[20-23] - https://phabricator.wikimedia.org/T414216#12059099 (10BTullis) We discovered a little problem in that dse-k8s-worker1023 is in the analytics vlan, rather than the private vlan. I will reimage it now and change the vlan as per:... [13:36:26] !log btullis@cumin1003 START - Cookbook sre.hosts.decommission for hosts dse-k8s-worker1023.eqiad.wmnet [13:43:29] !log btullis@cumin1003 START - Cookbook sre.dns.netbox [13:45:18] PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:45:53] PROBLEM - OSPF status on cr2-drmrs is CRITICAL: OSPFv2: 4/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:46:17] RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:46:51] !log krinkle@deploy1003 krinkle: Continuing with deployment [13:46:53] RECOVERY - OSPF status on cr2-drmrs is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:48:15] !log btullis@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: dse-k8s-worker1023.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - btullis@cumin1003" [13:48:31] !log btullis@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate 
netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: dse-k8s-worker1023.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - btullis@cumin1003" [13:48:31] !log btullis@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:48:33] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts dse-k8s-worker1023.eqiad.wmnet [13:48:38] 10ops-eqiad, 06SRE, 06DC-Ops: Q3:rack/setup/install dse-k8s-worker10[20-23] - https://phabricator.wikimedia.org/T414216#12059179 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by btullis@cumin1003 for hosts: `dse-k8s-worker1023.eqiad.wmnet` - dse-k8s-worker1023.eqiad.wmnet (**PASS**)... [13:49:51] (03PS9) 10Tiziano Fogli: mirrormaker: move alert defs on profile::kafka::mirror [puppet] - 10https://gerrit.wikimedia.org/r/1192539 (https://phabricator.wikimedia.org/T370153) [13:55:01] (03PS10) 10Tiziano Fogli: mirrormaker: move alert defs on profile::kafka::mirror [puppet] - 10https://gerrit.wikimedia.org/r/1192539 (https://phabricator.wikimedia.org/T370153) [13:56:34] !log btullis@cumin1003 START - Cookbook sre.dns.netbox [13:58:34] (03PS1) 10TChin: [eventgate-*] Bump to v1.31.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1305900 (https://phabricator.wikimedia.org/T415590) [13:59:18] !log krinkle@deploy1003 Finished scap sync-world: Backport for [[gerrit:1304247|Undeploy the ShortUrl extension (T107188)]], [[gerrit:1305804|extension-list: Remove WikimediaApiPortalOAuth ext and WikimediaApiPortal skin (T429373 T429374 T418494)]] (duration: 44m 35s) [13:59:28] T107188: Sunset ShortUrl extension in favour of UrlShortener extension - https://phabricator.wikimedia.org/T107188 [13:59:28] T429373: Archive the WikimediaApiPortal skin - https://phabricator.wikimedia.org/T429373 [13:59:29] T429374: Archive the WikimediaApiPortalOAuth extension - https://phabricator.wikimedia.org/T429374 [13:59:29] T418494: Delete the API Portal wiki - 
https://phabricator.wikimedia.org/T418494 [14:02:02] !log btullis@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Moved dse-k8s-worker1023 vlan - btullis@cumin1003" [14:02:06] !log btullis@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Moved dse-k8s-worker1023 vlan - btullis@cumin1003" [14:02:07] !log btullis@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:04:08] !log btullis@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host dse-k8s-worker1023 [14:04:20] !log btullis@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host dse-k8s-worker1023 [14:06:00] (03PS10) 10Ahmon Dancy: modules/profile/files/puppet/bin: cleanup puppet SSL on CA server change [puppet] - 10https://gerrit.wikimedia.org/r/1302978 (https://phabricator.wikimedia.org/T429413) [14:07:05] (03CR) 10FNegri: "LGTM!" 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1302745 (https://phabricator.wikimedia.org/T429230) (owner: 10CWilliams) [14:09:47] (03CR) 10CWilliams: [C:03+2] Cookbook sre.mysql.upgrade should not accept multiple hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/1302745 (https://phabricator.wikimedia.org/T429230) (owner: 10CWilliams) [14:09:49] FIRING: HelmReleaseBadStatus: Helm release wdqs/main-internal on k8s-dse@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=wdqs - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [14:10:30] (03PS5) 10CWilliams: Cookbook sre.mysql.upgrade should not accept multiple hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/1302745 (https://phabricator.wikimedia.org/T429230) [14:14:24] 06SRE-OnFire, 10Citoid: Large increase in citoid latency starting on June 25/ ~ 21 UTC - present - https://phabricator.wikimedia.org/T430279#12059375 (10Jelto) p:05Unbreak!→03High It looks like the latency is back to baseline since 13:00 UTC. I'll remove the UBN for now (feel free to re-adjust if latency g... 
[14:16:36] (03PS1) 10Btullis: matomo: Enable the CustomDimensions plugin [puppet] - 10https://gerrit.wikimedia.org/r/1305903 (https://phabricator.wikimedia.org/T430307) [14:17:57] (03CR) 10TChin: [C:03+2] [eventstreams] Bump to v0.19.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1276779 (https://phabricator.wikimedia.org/T420257) (owner: 10TChin) [14:20:17] (03Merged) 10jenkins-bot: [eventstreams] Bump to v0.19.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1276779 (https://phabricator.wikimedia.org/T420257) (owner: 10TChin) [14:25:12] !log tchin@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/eventstreams-internal: apply [14:25:44] !log tchin@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/eventstreams-internal: apply [14:28:31] 10SRE-swift-storage, 10EasyTimeline: "Timeline error. Could not store output files" - https://phabricator.wikimedia.org/T428063#12059463 (10WindEwriX) At least for some cases in ruwiki timeline is broken by small cyrillic letter х, but if it is replaced by mnemonic х all starts working again. For example... 
[14:29:31] !log tchin@deploy1003 helmfile [staging] START helmfile.d/services/eventstreams: apply [14:29:59] !log tchin@deploy1003 helmfile [staging] DONE helmfile.d/services/eventstreams: apply [14:32:40] !log tchin@deploy1003 helmfile [codfw] START helmfile.d/services/eventstreams: apply [14:32:41] !log tchin@deploy1003 helmfile [codfw] DONE helmfile.d/services/eventstreams: apply [14:32:51] !log tchin@deploy1003 helmfile [codfw] START helmfile.d/services/eventstreams: apply [14:33:42] !log tchin@deploy1003 helmfile [codfw] DONE helmfile.d/services/eventstreams: apply [14:34:41] FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-esams:et-0/0/0 (Transport: Arelion (IC-398709) {#20260602}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-esams:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [14:37:37] !log btullis@cumin1003 START - Cookbook sre.hosts.provision for host dse-k8s-worker1023.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [14:38:55] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr2-codfw and Hurricane Electric (2001:504:61::1b1b:0:1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [14:40:50] (03PS2) 10Btullis: matomo: Enable the CustomDimensions plugin [puppet] - 10https://gerrit.wikimedia.org/r/1305903 (https://phabricator.wikimedia.org/T430307) [14:41:35] (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1305903 (https://phabricator.wikimedia.org/T430307) (owner: 10Btullis) [14:41:44] (03CR) 10Bartosz Wójtowicz: [C:03+1] "LGTM!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1305890 (https://phabricator.wikimedia.org/T420883) (owner: 10AikoChou) [14:42:59] !log btullis@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host dse-k8s-worker1023.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [14:46:05] (03PS11) 10Ahmon Dancy: modules/profile/files/puppet/bin: cleanup puppet SSL on CA server change [puppet] - 10https://gerrit.wikimedia.org/r/1302978 (https://phabricator.wikimedia.org/T429413) [14:46:35] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1023.eqiad.wmnet with OS bookworm [14:46:46] 10ops-eqiad, 06SRE, 06DC-Ops: Q3:rack/setup/install dse-k8s-worker10[20-23] - https://phabricator.wikimedia.org/T414216#12059545 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1003 for host dse-k8s-worker1023.eqiad.wmnet with OS bookworm [14:50:39] !log tchin@deploy1003 helmfile [eqiad] START helmfile.d/services/eventstreams: apply [14:51:28] (03CR) 10Ahmon Dancy: "re-revised. The change in puppet-common.sh is must simpler now" [puppet] - 10https://gerrit.wikimedia.org/r/1302978 (https://phabricator.wikimedia.org/T429413) (owner: 10Ahmon Dancy) [14:51:29] !log tchin@deploy1003 helmfile [eqiad] DONE helmfile.d/services/eventstreams: apply [14:57:02] (03PS2) 10Snwachukwu: Sqoop Mediawiki: Block monthly sqoop jobs on ingestion_wikis success flag. [puppet] - 10https://gerrit.wikimedia.org/r/1303460 (https://phabricator.wikimedia.org/T425385) [15:02:06] !log btullis@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host dse-k8s-worker1023.eqiad.wmnet with OS bookworm [15:02:20] 10ops-eqiad, 06SRE, 06DC-Ops: Q3:rack/setup/install dse-k8s-worker10[20-23] - https://phabricator.wikimedia.org/T414216#12059654 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1003 for host dse-k8s-worker1023.eqiad.wmnet with OS bookworm executed with errors: - dse-k8... 
[15:02:39] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1023.eqiad.wmnet with OS bookworm [15:02:52] 10ops-eqiad, 06SRE, 06DC-Ops: Q3:rack/setup/install dse-k8s-worker10[20-23] - https://phabricator.wikimedia.org/T414216#12059656 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1003 for host dse-k8s-worker1023.eqiad.wmnet with OS bookworm [15:03:50] !log atsuko@deploy1003 helmfile [dse-k8s-codfw] START helmfile.d/admin 'apply'. [15:04:46] !log atsuko@deploy1003 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'apply'. [15:14:47] (03PS1) 10Gkyziridis: ml-services: Switch to float16 and reduce context length for Qwen3.6-27B deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1305919 (https://phabricator.wikimedia.org/T425680) [15:19:53] !log btullis@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on dse-k8s-worker1023.eqiad.wmnet with reason: host reimage [15:23:46] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dse-k8s-worker1023.eqiad.wmnet with reason: host reimage [15:27:05] btullis@cumin1003 reimage (PID 3577118) is awaiting input [15:35:50] !log btullis@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dse-k8s-worker1023.eqiad.wmnet with OS bookworm [15:35:58] 10ops-eqiad, 06SRE, 06DC-Ops: Q3:rack/setup/install dse-k8s-worker10[20-23] - https://phabricator.wikimedia.org/T414216#12059774 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1003 for host dse-k8s-worker1023.eqiad.wmnet with OS bookworm executed with errors: - dse-k8... 
[15:43:24] (03PS1) 10Bernard Wang: Remove wgMinervaEnableSiteNotice config flag [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1305921 [15:43:44] (03PS2) 10Bernard Wang: Remove wgMinervaEnableSiteNotice config flag [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1305921 (https://phabricator.wikimedia.org/T417638) [15:45:30] (03PS1) 10Btullis: Temporarily set dse-k8s-worker1023 back to insetup mode [puppet] - 10https://gerrit.wikimedia.org/r/1305922 (https://phabricator.wikimedia.org/T414216) [15:46:46] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware, 10fundraising-tech-ops: decommission frqueue1003.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T429520#12059831 (10Dwisehaupt) @Jclark-ctr Sorry about that, I have moved to to decommissioning. You are clear to do the work. [15:46:56] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware, 10fundraising-tech-ops: decommission frmx1001.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T429529#12059833 (10Dwisehaupt) @Jclark-ctr Sorry about that, I have moved to to decommissioning. You are clear to do the work. 
[15:46:59] (03CR) 10Btullis: [C:03+2] Temporarily set dse-k8s-worker1023 back to insetup mode [puppet] - 10https://gerrit.wikimedia.org/r/1305922 (https://phabricator.wikimedia.org/T414216) (owner: 10Btullis) [15:47:12] 06SRE, 10hCaptcha, 06Product Safety and Integrity, 06Traffic: memcached errors seen in hCaptcha health checks - https://phabricator.wikimedia.org/T430340 (10kostajh) 03NEW [15:47:34] 06SRE, 10hCaptcha, 06Product Safety and Integrity, 06Traffic: memcached errors seen in hCaptcha health checks - https://phabricator.wikimedia.org/T430340#12059849 (10kostajh) Possibly related: {T420223} [15:52:19] (03PS1) 10BCornwall: haproxy: Enable jwt for upload clusters [puppet] - 10https://gerrit.wikimedia.org/r/1305923 (https://phabricator.wikimedia.org/T400238) [15:56:32] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host dse-k8s-worker1023.eqiad.wmnet with OS bookworm [15:56:45] 10ops-eqiad, 06SRE, 06DC-Ops, 13Patch-For-Review: Q3:rack/setup/install dse-k8s-worker10[20-23] - https://phabricator.wikimedia.org/T414216#12059876 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1003 for host dse-k8s-worker1023.eqiad.wmnet with OS bookworm [15:58:31] (03PS2) 10BCornwall: haproxy: Enable jwt for upload clusters [puppet] - 10https://gerrit.wikimedia.org/r/1305923 (https://phabricator.wikimedia.org/T400238) [15:59:54] (03CR) 10BCornwall: [V:03+1] "PCC SUCCESS (CORE_DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1305923 (https://phabricator.wikimedia.org/T400238) (owner: 10BCornwall) [16:02:03] (03PS3) 10BCornwall: cache::haproxy: Enable jwt for upload clusters [puppet] - 10https://gerrit.wikimedia.org/r/1305923 (https://phabricator.wikimedia.org/T400238) [16:09:41] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:12:54] (03PS2) 10JHathaway: weak etag comments [software/spicerack] - 10https://gerrit.wikimedia.org/r/1303529 [16:13:10] (03PS2) 10JHathaway: WIP: CI test [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1304167 [16:13:30] (03PS2) 10JHathaway: durable: fix test when run in a tmux [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1304619 [16:13:58] (03PS2) 10JHathaway: Change find_account to find_accounts [software/spicerack] - 10https://gerrit.wikimedia.org/r/1303559 (https://phabricator.wikimedia.org/T426180) [16:14:41] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:14:47] !log btullis@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on dse-k8s-worker1023.eqiad.wmnet with reason: host reimage [16:17:30] (03CR) 10CI reject: [V:04-1] Change find_account to find_accounts [software/spicerack] - 10https://gerrit.wikimedia.org/r/1303559 (https://phabricator.wikimedia.org/T426180) (owner: 10JHathaway) [16:18:21] (03CR) 10JHathaway: "ready for review" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1303529 (owner: 10JHathaway) [16:18:30] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dse-k8s-worker1023.eqiad.wmnet with reason: host reimage [16:19:03] (03PS1) 10Cathal Mooney: Add includes for new link ranges for esams transport [dns] - 10https://gerrit.wikimedia.org/r/1305930 (https://phabricator.wikimedia.org/T412537) [16:19:21] (03Abandoned) 10JHathaway: WIP: CI test [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1304167 (owner: 10JHathaway) [16:19:23] !log 
cmooney@cumin1003 START - Cookbook sre.dns.netbox [16:19:58] (03CR) 10CI reject: [V:04-1] Add includes for new link ranges for esams transport [dns] - 10https://gerrit.wikimedia.org/r/1305930 (https://phabricator.wikimedia.org/T412537) (owner: 10Cathal Mooney) [16:20:17] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1304139 (https://phabricator.wikimedia.org/T422801) (owner: 10Andrew Bogott) [16:20:45] (03Abandoned) 10JHathaway: notify_logger: fix tests [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1304618 (owner: 10JHathaway) [16:21:00] (03CR) 10JHathaway: "ready for review" [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1304619 (owner: 10JHathaway) [16:21:29] (03Abandoned) 10JHathaway: WIP: ini config rev 2 [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1304908 (owner: 10JHathaway) [16:23:24] !log cmooney@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add new entries for esams transport IPs - cmooney@cumin1003" [16:24:29] !log cmooney@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add new entries for esams transport IPs - cmooney@cumin1003" [16:24:29] !log cmooney@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:25:20] (03PS2) 10Cathal Mooney: Add includes for new link ranges for esams transport [dns] - 10https://gerrit.wikimedia.org/r/1305930 (https://phabricator.wikimedia.org/T412537) [16:26:17] (03CR) 10CI reject: [V:04-1] Add includes for new link ranges for esams transport [dns] - 10https://gerrit.wikimedia.org/r/1305930 (https://phabricator.wikimedia.org/T412537) (owner: 10Cathal Mooney) [16:26:25] (03PS3) 10JHathaway: Change find_account to find_accounts [software/spicerack] - 10https://gerrit.wikimedia.org/r/1303559 (https://phabricator.wikimedia.org/T426180) [16:29:41] (03PS3) 10Cathal Mooney: Add 
includes for new link ranges for esams transport [dns] - 10https://gerrit.wikimedia.org/r/1305930 (https://phabricator.wikimedia.org/T412537) [16:30:16] (03CR) 10Dzahn: [C:03+1] site: remove phab2002 [puppet] - 10https://gerrit.wikimedia.org/r/1305661 (https://phabricator.wikimedia.org/T423727) (owner: 10AOkoth) [16:30:43] (03CR) 10Dzahn: [C:03+1] "well, maybe do one quick check if any users have stuff in their home directories" [puppet] - 10https://gerrit.wikimedia.org/r/1305661 (https://phabricator.wikimedia.org/T423727) (owner: 10AOkoth) [16:31:05] (03CR) 10Cathal Mooney: [C:03+2] Add includes for new link ranges for esams transport [dns] - 10https://gerrit.wikimedia.org/r/1305930 (https://phabricator.wikimedia.org/T412537) (owner: 10Cathal Mooney) [16:31:31] !log cmooney@dns2005 START - running authdns-update [16:33:10] !log cmooney@dns2005 END - running authdns-update [16:38:13] !log btullis@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - btullis@cumin1003" [16:41:17] btullis@cumin1003 reimage (PID 3585117) is awaiting input [16:41:31] !log dwisehaupt@cumin1003 START - Cookbook sre.dns.netbox [16:43:59] (03PS4) 10JHathaway: Add find_accounts [software/spicerack] - 10https://gerrit.wikimedia.org/r/1303559 (https://phabricator.wikimedia.org/T426180) [16:44:34] !log dwisehaupt@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:45:50] (03Abandoned) 10JHathaway: WIP: fix tests? 
[software/pywmflib] - 10https://gerrit.wikimedia.org/r/1304180 (owner: 10JHathaway) [16:46:27] (03Abandoned) 10JHathaway: load_ini_config: fix typing of config_file [software/pywmflib] - 10https://gerrit.wikimedia.org/r/1304617 (owner: 10JHathaway) [16:48:37] !log btullis@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - btullis@cumin1003" [16:48:38] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dse-k8s-worker1023.eqiad.wmnet with OS bookworm [16:48:46] 10ops-eqiad, 06SRE, 06DC-Ops: Q3:rack/setup/install dse-k8s-worker10[20-23] - https://phabricator.wikimedia.org/T414216#12060054 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1003 for host dse-k8s-worker1023.eqiad.wmnet with OS bookworm completed: - dse-k8s-worker102... [16:49:04] (03PS1) 10Btullis: Revert "Temporarily set dse-k8s-worker1023 back to insetup mode" [puppet] - 10https://gerrit.wikimedia.org/r/1305936 [16:50:25] (03CR) 10JHathaway: "Ready for review. Added the function separately, to ease migration." 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/1303559 (https://phabricator.wikimedia.org/T426180) (owner: 10JHathaway) [16:55:26] (03PS1) 10Andrew Bogott: cloud-vps backups: exclude a few more categories of servers [puppet] - 10https://gerrit.wikimedia.org/r/1305937 (https://phabricator.wikimedia.org/T430018) [16:55:29] (03PS1) 10Andrew Bogott: wmcs_backup_instances.yaml.erb: sort project list [puppet] - 10https://gerrit.wikimedia.org/r/1305938 (https://phabricator.wikimedia.org/T430018) [16:55:31] (03PS1) 10Andrew Bogott: cloud-vps backups: move more projects to cloudbackup1003 [puppet] - 10https://gerrit.wikimedia.org/r/1305939 (https://phabricator.wikimedia.org/T430018) [16:58:56] (03PS2) 10Andrew Bogott: cloud-vps backups: exclude a few more categories of servers [puppet] - 10https://gerrit.wikimedia.org/r/1305937 (https://phabricator.wikimedia.org/T430018) [16:58:56] (03PS2) 10Andrew Bogott: wmcs_backup_instances.yaml.erb: sort project list [puppet] - 10https://gerrit.wikimedia.org/r/1305938 (https://phabricator.wikimedia.org/T430018) [16:58:56] (03PS2) 10Andrew Bogott: cloud-vps backups: move more projects to cloudbackup1003 [puppet] - 10https://gerrit.wikimedia.org/r/1305939 (https://phabricator.wikimedia.org/T430018) [17:00:18] (03PS3) 10Dzahn: aphlict: create system user with systemd:sysuser [puppet] - 10https://gerrit.wikimedia.org/r/1080823 (https://phabricator.wikimedia.org/T377374) [17:01:58] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1305937 (https://phabricator.wikimedia.org/T430018) (owner: 10Andrew Bogott) [17:05:32] (03CR) 10Btullis: [C:03+2] Revert "Temporarily set dse-k8s-worker1023 back to insetup mode" [puppet] - 10https://gerrit.wikimedia.org/r/1305936 (owner: 10Btullis) [17:07:07] (03PS1) 10Andrew Bogott: cloud-vps backups: include a few more deployment-prep servers [puppet] - 10https://gerrit.wikimedia.org/r/1305943 (https://phabricator.wikimedia.org/T430018) [17:12:25] !log 
cmooney@cumin1003 START - Cookbook sre.dns.netbox [17:12:28] (03PS3) 10Andrew Bogott: cloud-vps backups: exclude a few more categories of servers [puppet] - 10https://gerrit.wikimedia.org/r/1305937 (https://phabricator.wikimedia.org/T430018) [17:12:28] (03PS3) 10Andrew Bogott: wmcs_backup_instances.yaml.erb: sort project list [puppet] - 10https://gerrit.wikimedia.org/r/1305938 (https://phabricator.wikimedia.org/T430018) [17:12:28] (03PS3) 10Andrew Bogott: cloud-vps backups: move more projects to cloudbackup1003 [puppet] - 10https://gerrit.wikimedia.org/r/1305939 (https://phabricator.wikimedia.org/T430018) [17:16:29] (03PS12) 10Ahmon Dancy: modules/profile/files/puppet/bin: cleanup puppet SSL on CA server change [puppet] - 10https://gerrit.wikimedia.org/r/1302978 (https://phabricator.wikimedia.org/T429413) [17:19:41] FIRING: CoreRouterInterfaceDown: Core router interface down - cr1-esams:et-1/0/1 (Transport: Hurricane Electric (dc4841.ams5) {#changeme_esams_he_cct}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-esams:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [17:20:15] !log cmooney@cumin1003 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [17:20:17] !log cmooney@cumin1003 START - Cookbook sre.dns.netbox [17:22:33] (03CR) 10Andrew Bogott: "reviewers take note: cloud-vps does NOT offer a backup-and-restore service to users. 
These backups exist as a hedge against a catastrophic" [puppet] - 10https://gerrit.wikimedia.org/r/1305937 (https://phabricator.wikimedia.org/T430018) (owner: 10Andrew Bogott) [17:24:33] !log cmooney@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add esams router ips - cmooney@cumin1003" [17:24:37] !log cmooney@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add esams router ips - cmooney@cumin1003" [17:24:37] !log cmooney@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:25:26] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host restbase2039.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [17:26:19] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host restbase2039.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [17:26:47] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host restbase2039.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [17:27:00] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host restbase2039.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [17:27:08] (03PS13) 10Ahmon Dancy: modules/profile/files/puppet/bin: cleanup puppet SSL on CA server change [puppet] - 10https://gerrit.wikimedia.org/r/1302978 (https://phabricator.wikimedia.org/T429413) [17:35:30] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host dse-k8s-worker1023.eqiad.wmnet [17:36:41] (03CR) 10Dzahn: "Jelto, Arnaud: do you have opinions on whether we should have or don't need at all backups of gitlab-runners (in devtools and separately i" [puppet] - 10https://gerrit.wikimedia.org/r/1305937 (https://phabricator.wikimedia.org/T430018) (owner: 10Andrew Bogott) [17:41:16] !log enable "graceful shutdown" community on cr2-eqiad to allow for reset of line 
card 1/1 T427843 [17:41:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:42:07] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dse-k8s-worker1023.eqiad.wmnet [17:43:12] (03CR) 10AikoChou: [C:03+2] ml-services: add revertrisk-wikidata to ml-staging-codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1305890 (https://phabricator.wikimedia.org/T420883) (owner: 10AikoChou) [17:44:41] FIRING: NetworkDeviceAlarmActive: Alarm active on cr2-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [17:45:19] (03Merged) 10jenkins-bot: ml-services: add revertrisk-wikidata to ml-staging-codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1305890 (https://phabricator.wikimedia.org/T420883) (owner: 10AikoChou) [17:45:25] (03PS4) 10Andrew Bogott: cloud-vps backups: exclude a few more categories of servers [puppet] - 10https://gerrit.wikimedia.org/r/1305937 (https://phabricator.wikimedia.org/T430018) [17:45:25] (03PS4) 10Andrew Bogott: wmcs_backup_instances.yaml.erb: sort project list [puppet] - 10https://gerrit.wikimedia.org/r/1305938 (https://phabricator.wikimedia.org/T430018) [17:45:25] (03PS4) 10Andrew Bogott: cloud-vps backups: move more projects to cloudbackup1003 [puppet] - 10https://gerrit.wikimedia.org/r/1305939 (https://phabricator.wikimedia.org/T430018) [17:45:25] (03PS1) 10Andrew Bogott: cloud-vps backups: exclude zuul3 projects [puppet] - 10https://gerrit.wikimedia.org/r/1305946 (https://phabricator.wikimedia.org/T430018) [17:48:48] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1305937 (https://phabricator.wikimedia.org/T430018) (owner: 10Andrew Bogott) [17:49:24] (03PS14) 10Ahmon Dancy: modules/profile/files/puppet/bin: 
cleanup puppet SSL on CA server change [puppet] - 10https://gerrit.wikimedia.org/r/1302978 (https://phabricator.wikimedia.org/T429413)