[00:03:29] (03CR) 10Scott French: [C:03+1] "Thanks, Reuven! Agreed that this is, alas, the best interim solution solution." [puppet] - 10https://gerrit.wikimedia.org/r/1121455 (https://phabricator.wikimedia.org/T378429) (owner: 10RLazarus) [00:05:14] PROBLEM - BGP status on cr2-eqdfw is CRITICAL: BGP CRITICAL - ASunknown/IPv4: Connect https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [00:06:45] FIRING: CirrusStreamingUpdaterRateTooLow: CirrusSearch update rate from flink-app-consumer-search is critically low - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterRateTooLow [00:09:50] PROBLEM - Disk space on Hadoop worker on an-worker1164 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/l 14 GB (0% inode=99%): /var/lib/hadoop/data/h 26 GB (0% inode=99%): https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration [00:11:45] RESOLVED: CirrusStreamingUpdaterRateTooLow: CirrusSearch update rate from flink-app-consumer-search is critically low - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterRateTooLow [00:12:50] PROBLEM - Disk space on Hadoop worker on an-worker1164 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/l 14 GB (0% inode=99%): /var/lib/hadoop/data/h 20 GB (0% inode=99%): https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration [00:14:15] (03CR) 10BCornwall: provision: Adjust thermal profile for F4 (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1121086 (https://phabricator.wikimedia.org/T373993) (owner: 10BCornwall) [00:23:39] 10ops-magru: Solicit Dell to investigate magru cp temperatures - https://phabricator.wikimedia.org/T386959#10569770 (10BCornwall) 
@wiki_willy It's worth noting that esams cp nodes have this same issue and that it's not specific to magru. Do we have any other hosts using the same r450 hardware configuration that... [00:27:08] (03CR) 10Cwhite: [C:03+1] hiera: restore thanos retention settings [puppet] - 10https://gerrit.wikimedia.org/r/1121324 (https://phabricator.wikimedia.org/T357747) (owner: 10Filippo Giunchedi) [00:39:46] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1121467 [00:39:46] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1121467 (owner: 10TrainBranchBot) [00:45:22] PROBLEM - BGP status on asw1-b3-magru.mgmt is CRITICAL: BGP CRITICAL - AS14907/IPv4: Connect - wmf_public_asn https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [00:47:22] RECOVERY - BGP status on asw1-b3-magru.mgmt is OK: BGP OK - up: 13, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [00:50:14] PROBLEM - Router interfaces on mr1-drmrs is CRITICAL: CRITICAL: host 185.15.58.130, interfaces up: 34, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [00:50:28] PROBLEM - Host mr1-drmrs.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [00:50:53] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1121467 (owner: 10TrainBranchBot) [00:50:54] PROBLEM - Host mr1-drmrs.oob is DOWN: PING CRITICAL - Packet loss = 100% [00:52:14] RECOVERY - Router interfaces on mr1-drmrs is OK: OK: host 185.15.58.130, interfaces up: 35, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [00:54:52] PROBLEM - Disk space on archiva1002 is CRITICAL: DISK CRITICAL - free space: / 3496 MB 
(3% inode=98%): /tmp 3496 MB (3% inode=98%): /var/tmp 3496 MB (3% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=archiva1002&var-datasource=eqiad+prometheus/ops [00:57:14] PROBLEM - Router interfaces on mr1-drmrs is CRITICAL: CRITICAL: host 185.15.58.130, interfaces up: 34, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [01:00:14] RECOVERY - Router interfaces on mr1-drmrs is OK: OK: host 185.15.58.130, interfaces up: 35, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [01:01:08] RECOVERY - Host mr1-drmrs.oob is UP: PING OK - Packet loss = 0%, RTA = 86.40 ms [01:03:20] RECOVERY - MegaRAID on an-worker1065 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [01:05:54] RECOVERY - Host mr1-drmrs.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 86.43 ms [01:08:45] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1121469 [01:08:45] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1121469 (owner: 10TrainBranchBot) [01:24:40] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10569849 (10phaultfinder) [01:31:39] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1121469 (owner: 10TrainBranchBot) [01:33:20] PROBLEM - MegaRAID on an-worker1065 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough 
https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [01:41:13] (03PS1) 10Albertoleoncio: brwikimedia: update icon, logo and wordmark [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1121473 [01:53:20] RECOVERY - MegaRAID on an-worker1065 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [02:06:15] (03PS2) 10RLazarus: deployment_server: Support multiple Kubernetes configs in mwscript-k8s [puppet] - 10https://gerrit.wikimedia.org/r/1121443 (https://phabricator.wikimedia.org/T378429) [02:06:16] (03PS2) 10RLazarus: deployment_server: Read mwscript-k8s MW image from values, not kube API [puppet] - 10https://gerrit.wikimedia.org/r/1121455 (https://phabricator.wikimedia.org/T378429) [02:09:19] (03CR) 10RLazarus: [C:03+2] deployment_server: Support multiple Kubernetes configs in mwscript-k8s (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1121443 (https://phabricator.wikimedia.org/T378429) (owner: 10RLazarus) [02:09:31] (03CR) 10RLazarus: [C:03+2] "Thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1121455 (https://phabricator.wikimedia.org/T378429) (owner: 10RLazarus) [02:23:20] PROBLEM - MegaRAID on an-worker1065 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [02:36:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:04:14] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [03:06:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:10:14] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [03:11:20] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [03:13:20] RECOVERY - MegaRAID on an-worker1065 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [03:16:14] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [03:23:08] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [03:23:14] RECOVERY - OSPF status on cr1-drmrs is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [03:23:42] RECOVERY - BFD status on cr2-eqiad is OK: UP: 25 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [03:43:20] PROBLEM - MegaRAID on an-worker1065 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [04:58:07] (03PS1) 10Papaul: Remove obsolete ospf interface from cr2-esams [homer/public] - 10https://gerrit.wikimedia.org/r/1121484 (https://phabricator.wikimedia.org/T386766) [05:10:14] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Idle - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [05:18:58] PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [06:03:20] RECOVERY - MegaRAID on an-worker1065 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [06:33:21] PROBLEM - MegaRAID on an-worker1065 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough 
https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [06:40:36] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti1026.eqiad.wmnet with OS bookworm [06:40:41] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in eqiad to Bookworm - https://phabricator.wikimedia.org/T382507#10570166 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti1026.eqiad.wmnet with OS bookworm [06:56:24] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti1026.eqiad.wmnet with reason: host reimage [07:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250221T0700) [07:02:44] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti1026.eqiad.wmnet with reason: host reimage [07:06:26] !log jynus@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2200.codfw.wmnet with reason: upgrade and rebuild tables [07:21:02] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti1026.eqiad.wmnet with OS bookworm [07:21:11] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in eqiad to Bookworm - https://phabricator.wikimedia.org/T382507#10570183 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti1026.eqiad.wmnet with OS bookworm completed: - ganeti102... 
[07:22:39] !log rebuilding tables for db2200 T385550 [07:22:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:22:43] T385550: Upgrade and rebuild s7 - https://phabricator.wikimedia.org/T385550 [07:32:41] (03CR) 10Filippo Giunchedi: [C:03+2] hiera: restore thanos retention settings [puppet] - 10https://gerrit.wikimedia.org/r/1121324 (https://phabricator.wikimedia.org/T357747) (owner: 10Filippo Giunchedi) [07:37:26] (03Abandoned) 10DCausse: cirrus: run the sanitizer only for wikitech [puppet] - 10https://gerrit.wikimedia.org/r/1052136 (owner: 10DCausse) [07:43:48] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1026.eqiad.wmnet [07:51:27] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1026.eqiad.wmnet [07:53:03] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti1026.eqiad.wmnet to cluster eqiad and group A [07:54:02] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti1026.eqiad.wmnet to cluster eqiad and group A [07:58:10] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1120587 (https://phabricator.wikimedia.org/T386006) (owner: 10Scott French) [07:58:28] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1120586 (https://phabricator.wikimedia.org/T386006) (owner: 10Scott French) [08:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250221T0800) [08:02:21] FIRING: [2x] ErrorBudgetBurn: search - search-update-lag - https://wikitech.wikimedia.org/wiki/Monitoring/ErrorBudgetBurn - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [08:07:02] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in eqiad to Bookworm - https://phabricator.wikimedia.org/T382507#10570203 (10MoritzMuehlenhoff) [08:25:36] FIRING: [2x] GatewayBackendErrorsHigh: api-gateway: elevated 5xx errors from lw_inference_reference_need_cluster in eqiad #page - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=api-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsHigh [08:25:41] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics: apply [08:26:25] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics: apply [08:27:59] here [08:28:46] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics: apply [08:29:43] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics: apply [08:32:37] (03PS1) 10Brouberol: airflow-analytics: enable kerberos to allow airflow-research to hit the API [deployment-charts] - 10https://gerrit.wikimedia.org/r/1121568 (https://phabricator.wikimedia.org/T386933) [08:32:46] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 219, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:33:08] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 128, down: 1, dormant: 0, excluded: 0, 
unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:34:31] 10SRE-swift-storage, 10Observability-Metrics: Capacity planning/estimation for Thanos - https://phabricator.wikimedia.org/T357747#10570212 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi Resolving since we're on the retention set in the description [08:42:00] (03CR) 10Filippo Giunchedi: [C:03+1] "I'll probably followup with a CI test to catch this in the future" [alerts] - 10https://gerrit.wikimedia.org/r/1121415 (https://phabricator.wikimedia.org/T386900) (owner: 10Stevemunene) [08:43:52] (03CR) 10Stevemunene: [C:03+2] "Thanks Filippo!" [alerts] - 10https://gerrit.wikimedia.org/r/1121415 (https://phabricator.wikimedia.org/T386900) (owner: 10Stevemunene) [08:45:28] (03Merged) 10jenkins-bot: Fix team name typo for hadoop worker [alerts] - 10https://gerrit.wikimedia.org/r/1121415 (https://phabricator.wikimedia.org/T386900) (owner: 10Stevemunene) [08:51:20] (03PS1) 10Filippo Giunchedi: test: add k8s-mlstaging to prometheus instances [alerts] - 10https://gerrit.wikimedia.org/r/1121571 [08:52:56] (03CR) 10Filippo Giunchedi: [C:03+2] test: add k8s-mlstaging to prometheus instances [alerts] - 10https://gerrit.wikimedia.org/r/1121571 (owner: 10Filippo Giunchedi) [08:53:58] (03CR) 10Filippo Giunchedi: [C:03+1] "No problem! 
I misread the change and thought team label was a problem in the alert file not the tests, which is fine/harmless so I'll leav" [alerts] - 10https://gerrit.wikimedia.org/r/1121415 (https://phabricator.wikimedia.org/T386900) (owner: 10Stevemunene) [09:10:36] RESOLVED: [2x] GatewayBackendErrorsHigh: api-gateway: elevated 5xx errors from lw_inference_reference_need_cluster in eqiad #page - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=api-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsHigh [09:20:11] (03PS4) 10Vgutierrez: sre: Provide LibericaDiffFPCheck alert [alerts] - 10https://gerrit.wikimedia.org/r/1121409 [09:21:26] !log elukey@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . [09:27:39] (03PS48) 10Federico Ceratto: sre.mysql.sanitize-wiki: sanitize wiki cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1080129 (https://phabricator.wikimedia.org/T366146) (owner: 10Arnaudb) [09:33:42] (03CR) 10Brouberol: [C:03+2] "Let's deploy" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1114001 (https://phabricator.wikimedia.org/T352650) (owner: 10Btullis) [09:33:56] (03CR) 10Brouberol: [C:03+2] mediawiki: Add support for dumps suspended job [deployment-charts] - 10https://gerrit.wikimedia.org/r/1104605 (https://phabricator.wikimedia.org/T352650) (owner: 10Giuseppe Lavagetto) [09:34:42] (03CR) 10CI reject: [V:04-1] sre.mysql.sanitize-wiki: sanitize wiki cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1080129 (https://phabricator.wikimedia.org/T366146) (owner: 10Arnaudb) [09:36:14] (03Merged) 10jenkins-bot: mediawiki: Add support for dumps suspended job [deployment-charts] - 10https://gerrit.wikimedia.org/r/1104605 (https://phabricator.wikimedia.org/T352650) (owner: 10Giuseppe Lavagetto) [09:36:16] 
(03Merged) 10jenkins-bot: mediwiki-dumps-legacy: Create helmfile deployment of a suspended job [deployment-charts] - 10https://gerrit.wikimedia.org/r/1114001 (https://phabricator.wikimedia.org/T352650) (owner: 10Btullis) [09:40:24] (03PS1) 10Aklapper: Add withIsDisabled() to PhabricatorPeopleQuery [phabricator/antivandalism] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1121585 [09:41:44] (03CR) 10Aklapper: [V:03+2 C:03+2] Add withIsDisabled() to PhabricatorPeopleQuery [phabricator/antivandalism] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1121585 (owner: 10Aklapper) [09:50:40] (03PS1) 10Brouberol: mediawiki: bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1121587 (https://phabricator.wikimedia.org/T352650) [09:52:24] PROBLEM - BFD status on cr2-eqdfw is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [09:53:22] (03CR) 10Stevemunene: [C:03+1] "lgtm!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1121587 (https://phabricator.wikimedia.org/T352650) (owner: 10Brouberol) [09:53:24] RECOVERY - BFD status on cr2-eqdfw is OK: UP: 16 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [09:56:26] (03CR) 10Brouberol: [C:03+2] mediawiki: bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1121587 (https://phabricator.wikimedia.org/T352650) (owner: 10Brouberol) [10:01:38] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply [10:01:45] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/mediawiki-dumps-legacy: apply [10:01:50] (03PS5) 10Vgutierrez: sre: Provide LibericaDiffFPCheck alert [alerts] - 10https://gerrit.wikimedia.org/r/1121409 [10:03:27] (03PS1) 10Filippo Giunchedi: pontoon: stack to test prometheus instance sharding [puppet] - 10https://gerrit.wikimedia.org/r/1121589 [10:05:40] (03CR) 10Filippo Giunchedi: [C:03+2] pontoon: stack to test prometheus instance sharding 
[puppet] - 10https://gerrit.wikimedia.org/r/1121589 (owner: 10Filippo Giunchedi) [10:05:57] (03CR) 10Vgutierrez: [C:03+2] "Done" [alerts] - 10https://gerrit.wikimedia.org/r/1121409 (owner: 10Vgutierrez) [10:06:29] (03PS1) 10Cathal Mooney: gNMIc: Increase prometheus worker threads and cache time [puppet] - 10https://gerrit.wikimedia.org/r/1121590 (https://phabricator.wikimedia.org/T386807) [10:07:31] (03PS1) 10Elukey: kserve-inference: move pod security settings for seccomp to staging only [deployment-charts] - 10https://gerrit.wikimedia.org/r/1121591 (https://phabricator.wikimedia.org/T369493) [10:07:36] (03CR) 10Federico Ceratto: sre.mysql.sanitize-wiki: sanitize wiki cookbook (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1080129 (https://phabricator.wikimedia.org/T366146) (owner: 10Arnaudb) [10:10:13] (03PS2) 10Cathal Mooney: gNMIc: Increase prometheus worker threads and cache time [puppet] - 10https://gerrit.wikimedia.org/r/1121590 (https://phabricator.wikimedia.org/T386807) [10:12:25] 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Gaps in gNMI network statistics in eqiad - https://phabricator.wikimedia.org/T386807#10570490 (10cmooney) I increased the "cache timeout" for stats received from routers in eqiad, and upped the number of threads for the prometheus output fr... 
[10:18:11] (03CR) 10Cathal Mooney: provision: Adjust thermal profile for F4 (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1121086 (https://phabricator.wikimedia.org/T373993) (owner: 10BCornwall) [10:21:56] PROBLEM - BGP status on cr2-drmrs is CRITICAL: BGP CRITICAL - AS5511/IPv6: Idle - Orange, AS5511/IPv4: Connect - Orange https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:23:20] RECOVERY - MegaRAID on an-worker1065 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [10:25:02] (03CR) 10Filippo Giunchedi: [C:03+1] gNMIc: Increase prometheus worker threads and cache time [puppet] - 10https://gerrit.wikimedia.org/r/1121590 (https://phabricator.wikimedia.org/T386807) (owner: 10Cathal Mooney) [10:26:56] (03CR) 10Ilias Sarantopoulos: [C:03+1] "LGTM!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1121591 (https://phabricator.wikimedia.org/T369493) (owner: 10Elukey) [10:27:45] (03CR) 10Cathal Mooney: [C:03+2] gNMIc: Increase prometheus worker threads and cache time [puppet] - 10https://gerrit.wikimedia.org/r/1121590 (https://phabricator.wikimedia.org/T386807) (owner: 10Cathal Mooney) [10:32:46] (03PS2) 10Elukey: kserve-inference: move pod security settings for seccomp to staging only [deployment-charts] - 10https://gerrit.wikimedia.org/r/1121591 (https://phabricator.wikimedia.org/T369493) [10:39:00] (03PS1) 10Ilias Sarantopoulos: admin-ng: increase cpu resource quotas for revision-models ns [deployment-charts] - 10https://gerrit.wikimedia.org/r/1121595 [10:39:43] !log jynus@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2198.codfw.wmnet with reason: upgrade and rebuild tables [10:40:44] (03CR) 10Klausman: [C:03+1] "LGTM." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1121591 (https://phabricator.wikimedia.org/T369493) (owner: 10Elukey) [10:41:45] (03CR) 10Elukey: [C:03+2] kserve-inference: move pod security settings for seccomp to staging only [deployment-charts] - 10https://gerrit.wikimedia.org/r/1121591 (https://phabricator.wikimedia.org/T369493) (owner: 10Elukey) [10:41:49] (03PS49) 10Federico Ceratto: sre.mysql.sanitize-wiki: sanitize wiki cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1080129 (https://phabricator.wikimedia.org/T366146) (owner: 10Arnaudb) [10:42:16] 10ops-magru, 06SRE, 06Infrastructure-Foundations, 10netops: cr2-magru errors on xe-0/1/0 (EdgeUno Transit) - https://phabricator.wikimedia.org/T387006 (10cmooney) 03NEW p:05Triage→03High [10:42:53] 10ops-magru, 06SRE, 06Infrastructure-Foundations, 10netops: cr2-magru errors on xe-0/1/0 (EdgeUno Transit) - https://phabricator.wikimedia.org/T387006#10570549 (10cmooney) [10:46:17] 10ops-magru, 06SRE, 06Infrastructure-Foundations, 10netops: cr2-magru errors on xe-0/1/0 (EdgeUno Transit) - https://phabricator.wikimedia.org/T387006#10570552 (10cmooney) [10:48:47] (03CR) 10Klausman: [C:03+1] admin-ng: increase cpu resource quotas for revision-models ns [deployment-charts] - 10https://gerrit.wikimedia.org/r/1121595 (owner: 10Ilias Sarantopoulos) [10:49:49] (03PS3) 10Federico Ceratto: clone.py: Add helper functions for later use [cookbooks] - 10https://gerrit.wikimedia.org/r/1120213 [10:50:38] (03CR) 10Ilias Sarantopoulos: [C:03+2] admin-ng: increase cpu resource quotas for revision-models ns [deployment-charts] - 10https://gerrit.wikimedia.org/r/1121595 (owner: 10Ilias Sarantopoulos) [10:54:40] (03Merged) 10jenkins-bot: admin-ng: increase cpu resource quotas for revision-models ns [deployment-charts] - 10https://gerrit.wikimedia.org/r/1121595 (owner: 10Ilias Sarantopoulos) [10:56:16] !log isaranto@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . 
[10:58:11] !log isaranto@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revision-models' for release 'main' . [10:58:29] !log klausman@deploy2002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. [10:59:13] (03PS4) 10Federico Ceratto: clone.py: Add helper functions for later use [cookbooks] - 10https://gerrit.wikimedia.org/r/1120213 [10:59:45] !log klausman@deploy2002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. [11:00:55] (03PS1) 10Elukey: ml-services: fix articletopic-outlink's settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/1121599 (https://phabricator.wikimedia.org/T369493) [11:01:22] (03CR) 10Ilias Sarantopoulos: [C:03+1] ml-services: fix articletopic-outlink's settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/1121599 (https://phabricator.wikimedia.org/T369493) (owner: 10Elukey) [11:01:39] (03PS1) 10Urbanecm: revalidateLinkRecommendations: Initialize $allowedChecksums [extensions/GrowthExperiments] (wmf/1.44.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1121600 (https://phabricator.wikimedia.org/T387001) [11:02:04] (03CR) 10Klausman: [C:03+1] ml-services: fix articletopic-outlink's settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/1121599 (https://phabricator.wikimedia.org/T369493) (owner: 10Elukey) [11:02:21] !log klausman@deploy2002 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. [11:03:20] PROBLEM - MegaRAID on an-worker1065 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [11:03:27] !log klausman@deploy2002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. 
[11:06:32] (03PS1) 10Aklapper: Expand list of trusted projects for isFriendlyUser() check [phabricator/antivandalism] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1121601 [11:07:15] (03PS2) 10Aklapper: Expand list of trusted projects for isFriendlyUser() check [phabricator/antivandalism] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1121601 [11:07:44] (03CR) 10Aklapper: [V:03+2 C:03+2] Expand list of trusted projects for isFriendlyUser() check [phabricator/antivandalism] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1121601 (owner: 10Aklapper) [11:11:55] (03PS1) 10Elukey: admin_ng: apply new Knative docker images only to ml-staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1121602 (https://phabricator.wikimedia.org/T369493) [11:12:03] (03CR) 10Elukey: [C:03+2] ml-services: fix articletopic-outlink's settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/1121599 (https://phabricator.wikimedia.org/T369493) (owner: 10Elukey) [11:14:14] (03CR) 10Klausman: [C:03+1] admin_ng: apply new Knative docker images only to ml-staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1121602 (https://phabricator.wikimedia.org/T369493) (owner: 10Elukey) [11:14:52] PROBLEM - Disk space on archiva1002 is CRITICAL: DISK CRITICAL - free space: / 3622 MB (3% inode=98%): /tmp 3622 MB (3% inode=98%): /var/tmp 3622 MB (3% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=archiva1002&var-datasource=eqiad+prometheus/ops [11:16:41] (03CR) 10Elukey: [C:03+2] admin_ng: apply new Knative docker images only to ml-staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1121602 (https://phabricator.wikimedia.org/T369493) (owner: 10Elukey) [11:17:09] (03CR) 10Federico Ceratto: clone.py: Add helper functions for later use (0310 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1120213 (owner: 10Federico Ceratto) [11:20:21] !log elukey@deploy2002 helmfile 
[ml-staging-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . [11:24:14] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - string Wikitech not found on https://wikitech-static.wikimedia.org:443/wiki/Main_Page?debug=true - 2517 bytes in 0.118 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [11:31:15] (03PS5) 10Federico Ceratto: clone.py: Add helper functions for later use [cookbooks] - 10https://gerrit.wikimedia.org/r/1120213 [11:31:57] (03CR) 10Federico Ceratto: clone.py: Add helper functions for later use (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1120213 (owner: 10Federico Ceratto) [11:40:39] 10ops-magru, 06SRE, 06Infrastructure-Foundations, 10netops: cr2-magru errors on xe-0/1/0 (EdgeUno Transit) - https://phabricator.wikimedia.org/T387006#10570708 (10cmooney) [11:54:22] 06SRE, 06Commons, 10MediaWiki-Uploading, 07Wikimedia-production-error: Reproducible blocking error using the basic upload form, no upload possible - https://phabricator.wikimedia.org/T387007#10570719 (10A_smart_kitten) (added #SRE for triage based on the Varnish 503 errors) [11:55:12] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Frequent disk resets on ms-be2075 - https://phabricator.wikimedia.org/T382707#10570726 (10MatthewVernon) @Jhancock.wm this server is still not reachable over ssh... [12:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250221T0800) [12:00:05] jelto, arnoldokoth, and mutante: Time to snap out of that daydream and deploy GitLab version upgrades. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250221T1200). 
[12:02:21] FIRING: [2x] ErrorBudgetBurn: search - search-update-lag - https://wikitech.wikimedia.org/wiki/Monitoring/ErrorBudgetBurn - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [12:03:14] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 29348 bytes in 0.204 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [12:10:13] (03PS1) 10Aklapper: Be more lenient on account disabling again [phabricator/antivandalism] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1121607 [12:10:58] (03CR) 10Aklapper: [V:03+2 C:03+2] Be more lenient on account disabling again [phabricator/antivandalism] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1121607 (owner: 10Aklapper) [12:13:20] RECOVERY - MegaRAID on an-worker1065 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [12:19:29] (03PS1) 10Aklapper: Penalize removal of all project tags and/or all parent/subtasks [phabricator/antivandalism] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1121609 [12:19:52] (03PS2) 10Aklapper: Penalize removal of all project tags and/or all parent/subtasks [phabricator/antivandalism] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1121609 (https://phabricator.wikimedia.org/T371831) [12:23:11] (03CR) 10Aklapper: [V:03+2 C:03+2] Penalize removal of all project tags and/or all parent/subtasks [phabricator/antivandalism] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1121609 (https://phabricator.wikimedia.org/T371831) (owner: 10Aklapper) [12:24:42] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10570787 (10phaultfinder) [12:27:54] RECOVERY - Disk space on Hadoop worker on an-worker1164 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration [12:35:45] (03PS2) 10Hnowlan: trafficserver: use mobileapps directly for hewiki APIs [puppet] - 
10https://gerrit.wikimedia.org/r/1117508 (https://phabricator.wikimedia.org/T372746) [12:40:49] !log jynus@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4 days, 0:00:00 on db[2198,2200].codfw.wmnet with reason: Table rebuilding ongoing [12:43:20] PROBLEM - MegaRAID on an-worker1065 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [12:50:18] !log dcausse@deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [12:50:34] !log dcausse@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [12:53:51] (03PS1) 10Jgiannelos: pcs: Increase TTL for cassandra storage in prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1121619 [12:56:23] (03CR) 10Huji: [C:04-1] "Please hold off from merging this. There is ongoing discussion on fawiki about its appropriateness." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1121449 (owner: 10Ebrahim) [12:59:38] (03PS50) 10Federico Ceratto: sre.mysql.sanitize-wiki: sanitize wiki cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1080129 (https://phabricator.wikimedia.org/T366146) (owner: 10Arnaudb) [13:00:34] (03CR) 10Federico Ceratto: sre.mysql.sanitize-wiki: sanitize wiki cookbook (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1080129 (https://phabricator.wikimedia.org/T366146) (owner: 10Arnaudb) [13:04:04] !log cmooney@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on netflow1002.eqiad.wmnet with reason: keeping gnmic running in debug mode to observe performance change [13:08:51] (03Abandoned) 10Ebrahim: Improve Persian Wikipedia's tagline and wordmark [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1121449 (owner: 10Ebrahim) [13:11:07] (03PS1) 10DCausse: cirrus-streaming-updater: scale up the consumer-search job [deployment-charts] - 10https://gerrit.wikimedia.org/r/1121621 (https://phabricator.wikimedia.org/T386935) [13:14:45] (03CR) 10Ebrahim: "Make sense, let's see what we will come up with locally." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1121449 (owner: 10Ebrahim) [13:21:46] PROBLEM - BFD status on cr2-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:21:52] 10ops-magru, 06SRE, 06Infrastructure-Foundations, 10netops: cr2-magru errors on xe-0/1/0 (EdgeUno Transit) - https://phabricator.wikimedia.org/T387006#10570948 (10RobH) @cmooney, Can we have the carrier check their end first, since we'll incur hourly billing for our check? 
[13:22:16] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:22:20] PROBLEM - OSPF status on cr1-drmrs is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:28:13] (03CR) 10Hnowlan: [C:03+1] pcs: Increase TTL for cassandra storage in prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1121619 (owner: 10Jgiannelos) [13:34:52] PROBLEM - Disk space on archiva1002 is CRITICAL: DISK CRITICAL - free space: / 3525 MB (3% inode=98%): /tmp 3525 MB (3% inode=98%): /var/tmp 3525 MB (3% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=archiva1002&var-datasource=eqiad+prometheus/ops [13:35:31] (03CR) 10Michael Große: [C:03+1] revalidateLinkRecommendations: Initialize $allowedChecksums [extensions/GrowthExperiments] (wmf/1.44.0-wmf.17) - 10https://gerrit.wikimedia.org/r/1121600 (https://phabricator.wikimedia.org/T387001) (owner: 10Urbanecm) [13:35:35] 06SRE, 06Infrastructure-Foundations, 10netops: gNMIc connection not working for cloudsw2-d5-eqiad - https://phabricator.wikimedia.org/T387018 (10cmooney) 03NEW p:05Triage→03Low [13:42:42] RECOVERY - Wikitech and wt-static content in sync on wikitech-static.wikimedia.org is OK: wikitech-static OK - wikitech and wikitech-static in sync (41805 200000s) https://wikitech.wikimedia.org/wiki/Wikitech-static [13:42:49] (03PS4) 10Federico Ceratto: clone.py: Cleanup, extract fqdn and hostname [cookbooks] - 10https://gerrit.wikimedia.org/r/1120214 [13:43:24] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1120684 (https://phabricator.wikimedia.org/T379030) (owner: 10Andrew Bogott) [13:51:30] 06SRE, 10Maps, 06Traffic, 13Patch-For-Review: Allow Wikimedia Maps usage on schoolwiki.in - https://phabricator.wikimedia.org/T383210#10571019 
(10MSantos) Approved. [13:52:10] !log set global read_only=false @ db2230 [13:52:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:02] (03PS1) 10Majavah: toolforge: toolviews: Drop support for tools.wmflabs.org [puppet] - 10https://gerrit.wikimedia.org/r/1121630 [13:57:21] RESOLVED: [2x] ErrorBudgetBurn: search - search-update-lag - https://wikitech.wikimedia.org/wiki/Monitoring/ErrorBudgetBurn - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [14:00:34] (03PS6) 10Fabfur: benthos: enable Benthos on whole ulsfo DC [puppet] - 10https://gerrit.wikimedia.org/r/1103291 (https://phabricator.wikimedia.org/T329332) [14:01:16] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1103291 (https://phabricator.wikimedia.org/T329332) (owner: 10Fabfur) [14:03:23] 10SRE-tools, 06DBA, 06Infrastructure-Foundations: Automate mariadb cloning process - https://phabricator.wikimedia.org/T387023 (10FCeratto-WMF) 03NEW [14:05:48] (03PS2) 10Ssingh: varnish: add schoolwiki.in to allowed maps domains [puppet] - 10https://gerrit.wikimedia.org/r/1115031 (https://phabricator.wikimedia.org/T383210) [14:07:30] (03CR) 10Vgutierrez: [C:03+1] varnish: add schoolwiki.in to allowed maps domains [puppet] - 10https://gerrit.wikimedia.org/r/1115031 (https://phabricator.wikimedia.org/T383210) (owner: 10Ssingh) [14:07:34] (03CR) 10Arturo Borrero Gonzalez: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1121348 (https://phabricator.wikimedia.org/T386850) (owner: 10FNegri) [14:07:36] (03Abandoned) 10Fabfur: benthos: enable Benthos on whole ulsfo DC [puppet] - 10https://gerrit.wikimedia.org/r/1103291 (https://phabricator.wikimedia.org/T329332) (owner: 10Fabfur) [14:10:33] (03PS1) 10Fabfur: hiera: enable benthos on ulsfo text|upload [puppet] - 10https://gerrit.wikimedia.org/r/1121636 (https://phabricator.wikimedia.org/T329332) [14:11:34] (03CR) 10Fabfur: "check experimental" [puppet] - 
10https://gerrit.wikimedia.org/r/1121636 (https://phabricator.wikimedia.org/T329332) (owner: 10Fabfur) [14:12:48] 06SRE, 10Maps, 06Traffic, 13Patch-For-Review: Allow Wikimedia Maps usage on schoolwiki.in - https://phabricator.wikimedia.org/T383210#10571103 (10ssingh) Thanks @MSantos! @Gnoeee: We will merge this on Monday (Feb 24). [14:13:16] (03PS1) 10Jgiannelos: pcs: Expose port for native prometheus metrics [deployment-charts] - 10https://gerrit.wikimedia.org/r/1121637 [14:14:52] PROBLEM - Disk space on archiva1002 is CRITICAL: DISK CRITICAL - free space: / 3521 MB (3% inode=98%): /tmp 3521 MB (3% inode=98%): /var/tmp 3521 MB (3% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=archiva1002&var-datasource=eqiad+prometheus/ops [14:15:03] (03CR) 10DCausse: [C:03+2] cirrus-streaming-updater: scale up the consumer-search job [deployment-charts] - 10https://gerrit.wikimedia.org/r/1121621 (https://phabricator.wikimedia.org/T386935) (owner: 10DCausse) [14:16:16] (03Merged) 10jenkins-bot: cirrus-streaming-updater: scale up the consumer-search job [deployment-charts] - 10https://gerrit.wikimedia.org/r/1121621 (https://phabricator.wikimedia.org/T386935) (owner: 10DCausse) [14:18:25] !log dcausse@deploy2002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [14:18:40] !log dcausse@deploy2002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [14:21:19] !log dcausse@deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [14:21:22] !log dcausse@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [14:21:27] (03PS1) 10Majavah: P:wmcs: Drop unused postgres class [puppet] - 10https://gerrit.wikimedia.org/r/1121638 [14:22:05] (03CR) 10Majavah: "I8fd61d3d90076730179caba2d91ceb71a7dbeb11 proposes dropping this class entirely instead." 
[puppet] - 10https://gerrit.wikimedia.org/r/1115316 (owner: 10Muehlenhoff) [14:24:05] (03PS2) 10Jgiannelos: pcs: Expose port for native prometheus metrics [deployment-charts] - 10https://gerrit.wikimedia.org/r/1121637 [14:24:42] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10571156 (10phaultfinder) [14:25:31] (03CR) 10CI reject: [V:04-1] pcs: Expose port for native prometheus metrics [deployment-charts] - 10https://gerrit.wikimedia.org/r/1121637 (owner: 10Jgiannelos) [14:26:21] (03PS3) 10Jgiannelos: pcs: Expose port for native prometheus metrics [deployment-charts] - 10https://gerrit.wikimedia.org/r/1121637 [14:27:20] (03CR) 10CI reject: [V:04-1] pcs: Expose port for native prometheus metrics [deployment-charts] - 10https://gerrit.wikimedia.org/r/1121637 (owner: 10Jgiannelos) [14:27:27] (03PS2) 10Fabfur: hiera: enable benthos on ulsfo text|upload [puppet] - 10https://gerrit.wikimedia.org/r/1121636 (https://phabricator.wikimedia.org/T329332) [14:30:07] (03CR) 10Fabfur: [C:04-2] "To be merged next Monday" [puppet] - 10https://gerrit.wikimedia.org/r/1121636 (https://phabricator.wikimedia.org/T329332) (owner: 10Fabfur) [14:32:17] (03PS4) 10Jgiannelos: pcs: Expose port for native prometheus metrics [deployment-charts] - 10https://gerrit.wikimedia.org/r/1121637 [14:33:15] (03CR) 10CI reject: [V:04-1] pcs: Expose port for native prometheus metrics [deployment-charts] - 10https://gerrit.wikimedia.org/r/1121637 (owner: 10Jgiannelos) [14:36:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:38:59] 10ops-magru, 06SRE, 06Infrastructure-Foundations, 10netops: cr2-magru errors on xe-0/1/0 (EdgeUno Transit) - https://phabricator.wikimedia.org/T387006#10571213 (10cmooney) >>! 
In T387006#10570948, @RobH wrote: > @cmooney, > > Can we have the carrier check their end first, since we'll incur hourly billing... [14:39:06] (03CR) 10FNegri: [C:03+2] prometheus::node_kernel_messages: ignore some false positives [puppet] - 10https://gerrit.wikimedia.org/r/1121321 (https://phabricator.wikimedia.org/T386850) (owner: 10FNegri) [14:39:09] (03CR) 10FNegri: [C:03+2] prometheus::node_kernel_messages: add new line to ignore list [puppet] - 10https://gerrit.wikimedia.org/r/1121348 (https://phabricator.wikimedia.org/T386850) (owner: 10FNegri) [14:41:47] (03PS5) 10Jgiannelos: pcs: Expose port for native prometheus metrics [deployment-charts] - 10https://gerrit.wikimedia.org/r/1121637 [14:44:43] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1121636 (https://phabricator.wikimedia.org/T329332) (owner: 10Fabfur) [14:44:49] (03CR) 10Vgutierrez: hiera: enable benthos on ulsfo text|upload (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1121636 (https://phabricator.wikimedia.org/T329332) (owner: 10Fabfur) [14:46:16] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:46:20] RECOVERY - OSPF status on cr1-drmrs is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:46:29] I'll do some manual code editing on the debug hosts [14:46:46] RECOVERY - BFD status on cr2-eqiad is OK: UP: 25 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:46:57] PROBLEM - MariaDB read only pc7 #page on pc2017 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [14:46:57] PROBLEM - MariaDB Event Scheduler pc7 on pc2017 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Event_Scheduler [14:47:15] ^ 
Amir1 [14:47:20] here [14:47:24] one sec [14:47:29] acked [14:47:31] just depool pc7 [14:47:33] !incidents [14:47:34] 5691 (ACKED) pc2017 (paged)/MariaDB read only pc7 (paged) [14:47:34] 5690 (RESOLVED) [2x] GatewayBackendErrorsHigh sre (api-gateway eqiad) [14:47:35] will prepare depool [14:47:56] https://wikitech.wikimedia.org/wiki/MariaDB/Troubleshooting#Depooling_a_parsercache_host [14:48:06] PROBLEM - MariaDB Replica IO: pc5 on pc1015 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 1040, Errmsg: error connecting to master repl2024@pc2015.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 100000 message: Too many connections https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [14:48:15] FIRING: [2x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-int/canary at codfw: 25% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [14:48:18] 10ops-magru, 06SRE, 06Infrastructure-Foundations, 10netops: cr2-magru errors on xe-0/1/0 (EdgeUno Transit) - https://phabricator.wikimedia.org/T387006#10571241 (10cmooney) Ticket #341504 created. [14:48:35] hnowlan: please check impact while I depool [14:48:51] ack [14:48:53] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depool pc7', diff saved to https://phabricator.wikimedia.org/P73502 and previous config saved to /var/cache/conftool/dbconfig/20250221-144852-ladsgroup.json [14:48:57] depooled [14:49:06] jynus: I depooled it [14:49:12] ah ok [14:49:15] I was doing it [14:49:35] pc1015 is pc7 on eqiad? [14:49:53] ladsgroup@cumin1002:~$ sudo dbctl instance pc2017 set-weight 0 [14:49:53] ladsgroup@cumin1002:~$ sudo dbctl instance pc1017 set-weight 0 [14:49:53] ladsgroup@cumin1002:~$ sudo dbctl config commit -m "Depool pc7" [14:49:57] These were the commands [14:50:06] jynus: pc1017 is pc7 [14:50:11] manuel reordered them [14:50:13] wait, why did pc1015 complained about replication too? 
[14:50:18] latency/saturation are up for mw, not critical yet and dropping [14:50:24] (to make it easier) [14:50:34] pc5 has issues too [14:50:39] I depool it [14:50:46] we should be still fine [14:51:05] not touching anything yet [14:51:11] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depool pc5', diff saved to https://phabricator.wikimedia.org/P73503 and previous config saved to /var/cache/conftool/dbconfig/20250221-145110-ladsgroup.json [14:51:18] weird both hosts failed at the same time [14:51:19] I depooled both pc7 and pc5 [14:51:28] thank you [14:51:49] the user impact should be minimal, parsercache now automatically falls back to the second host in line [14:51:50] will silence hosts around [14:51:55] and create a ticket [14:52:02] Thanks. I will investigate after meeting [14:52:07] "100000 message: Too many connections" [14:52:13] (right now in a meeting) [14:52:17] I think the failover went badly [14:52:30] (03CR) 10Vgutierrez: [C:04-1] "hieradata/hosts/cp4039.yaml needs to be updated" [puppet] - 10https://gerrit.wikimedia.org/r/1121636 (https://phabricator.wikimedia.org/T329332) (owner: 10Fabfur) [14:52:31] not important, but may be interesting for debugging later [14:52:36] (03PS6) 10Jgiannelos: pcs: Expose port for native prometheus metrics [deployment-charts] - 10https://gerrit.wikimedia.org/r/1121637 [14:53:06] RECOVERY - MariaDB Replica IO: pc5 on pc1015 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [14:53:15] RESOLVED: [2x] PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-int/canary at codfw: 18.75% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [14:53:42] I should reduce timeout for pc [14:54:52] PROBLEM - Disk space on archiva1002 is CRITICAL: DISK CRITICAL - free space: / 3370 MB (3% inode=98%): /tmp 3370 MB (3% inode=98%): /var/tmp 3370 MB (3% inode=98%): 
https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=archiva1002&var-datasource=eqiad+prometheus/ops [14:55:02] PROBLEM - MariaDB Replica Lag: pc5 on pc1015 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 601.48 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [14:55:20] (03PS7) 10Jgiannelos: pcs: Expose port for native prometheus metrics [deployment-charts] - 10https://gerrit.wikimedia.org/r/1121637 [14:56:39] nice to see the impact being so limited compared to previous instances [14:56:51] latency is still a little increased but tolerable [14:57:50] is it still bad? [14:57:57] I created https://phabricator.wikimedia.org/T387032 [14:58:26] not bad, just a little elevated https://grafana.wikimedia.org/goto/xDUAHy5HR?orgId=1 [14:58:28] (03PS8) 10Jgiannelos: pcs: Expose port for native prometheus metrics [deployment-charts] - 10https://gerrit.wikimedia.org/r/1121637 [14:58:35] 10ms won't hurt anyone [14:58:38] there should be a small increase in latency due to cache loss, but only measurable for performance, not for availability [14:59:00] ah, ok, so that is expected and should go as new cache entries get warmed up again [14:59:09] nice [14:59:16] it is the price to pay for high availability [14:59:28] I am more worried about the snowballing [14:59:46] as that happened in the past- a host down leading to max_connections elsewhere [14:59:57] but that can be debugged later on [15:00:05] (03PS9) 10Jgiannelos: pcs: Expose port for native prometheus metrics [deployment-charts] - 10https://gerrit.wikimedia.org/r/1121637 (https://phabricator.wikimedia.org/T372749) [15:00:16] see my ticket hnowlan if you want to add that info [15:00:40] will silence now the depooled hosts [15:02:10] however all should be ok, but without traffic [15:02:20] I wonder why pc1015 is complaining about lag stilll [15:03:20] RECOVERY - MegaRAID on an-worker1065 is OK: OK: optimal, 13 
logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [15:03:58] uff, I see why, pc2015 is overloaded still [15:04:22] pc1015 can barely catch up with so much load [15:05:09] apparently the failover situation is still not enough :-( [15:06:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:06:54] that section is depooled though no? [15:07:15] it has thousands of stuck updates [15:08:54] not only from the app, but from orchestrator [15:09:35] I'm going to confirm it is depooled and restart it [15:09:45] because it is not healthy right now [15:11:32] yeah, both are weight 0 [15:11:43] I am going to downtime it and restart the service [15:12:02] meeting just over [15:12:49] so pc2015 is in the "stuck processes" [15:12:58] that you may have seen before [15:13:37] and the same for pc2017 [15:13:48] now even from orchestrator connections [15:15:11] !log jynus@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4 days, 0:00:00 on pc[2015,2017].codfw.wmnet with reason: processes stuck [15:15:18] thanks for the ticket update, hnowlan [15:15:44] I was going to restart mysql on both, Amir1 unless you want to have a look first [15:15:57] (03PS10) 10Jgiannelos: pcs: Expose port for native prometheus metrics [deployment-charts] - 10https://gerrit.wikimedia.org/r/1121637 (https://phabricator.wikimedia.org/T372749) [15:16:16] if you reboot it, I'd be grateful [15:16:18] thanks [15:16:24] *restart [15:16:28] the host too you mean? [15:16:43] I was going to do the process (daemon) only [15:17:30] I mean the process [15:17:34] ok, yes, doing [15:17:42] I accidentally said reboot. Sorry. 
[15:17:55] no prob, just confirming in case it was necessary [15:18:17] hnowlan: we may have complains from connections from eqiad, but that's not worring [15:18:25] to make it clear. The user impact was the slow down. Nothing else? [15:18:50] I think we saturated workers [15:18:50] yep [15:18:56] because of the waits [15:18:57] saturation never became critical [15:19:01] ah, good [15:19:06] latency is quite high in eqiad still [15:19:29] !log restarting mariadb @ pc2015 [15:19:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:19:32] it will recover, it'll take some time [15:19:48] cool [15:20:15] not sure if it will restart quick, otherwise will kill [15:20:26] I think last time we had to kill [15:20:35] and I wonder if this is hw-related [15:21:43] I need to -9 it, it says normal shutdown but does nothing [15:22:55] doesn't even respond to a -9 [15:23:16] finally [15:23:33] pc2015 finally looking healthy [15:23:38] doing the same with pc2017 [15:24:02] RECOVERY - MariaDB Replica Lag: pc5 on pc1015 is OK: OK slave_sql_lag Replication lag: 0.24 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:25:58] we should check if it was the same hosts, if they were the same, it's a hw issue [15:26:28] yeah, it will be on the previous ticket [15:26:32] clearly it is the same issue [15:26:37] whatever it was [15:27:11] RECOVERY - MariaDB read only pc7 #page on pc2017 is OK: Version 10.6.20-MariaDB-log, Uptime 37s, read_only: False, event_scheduler: True, 26.13 QPS, connection latency: 0.024322s, query latency: 0.001226s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [15:27:11] RECOVERY - MariaDB Event Scheduler pc7 on pc2017 is OK: Version 10.6.20-MariaDB-log, Uptime 37s, read_only: False, event_scheduler: True, 24.29 QPS, connection latency: 0.022731s, query latency: 0.001156s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Event_Scheduler 
[15:27:53] !log restarted (kill -9) mariadb @ pc2015,pc2017 T387032 [15:27:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:27:56] T387032: pc7 parsercache db got unavailable, leading to pc5 and mediawiki workers getting overloaded - https://phabricator.wikimedia.org/T387032 [15:28:18] hnowlan: that should conclude the work for now [15:28:35] https://docs.google.com/document/d/1XPw05yiO_76xxkknPWIaJtS0Gxr1qpEKeUnRhITh66w/edit?tab=t.0#heading=h.95p2g5d67t9q [15:28:36] now debugging, we will leave it to the dbas to avoid it from happening again [15:28:49] this was pc2015 [15:28:51] T387032 [15:29:16] oh, so different pc but the same host [15:29:20] *shard [15:29:30] yeah, I was going to say, this sounds similar to the December incident, but different hosts (and section) [15:29:47] ah, true, it was pc1016, not 17 [15:30:04] not sure it is hw, I think it is traffic patterns [15:30:23] too many rollbacks end up overloading the host [15:30:23] similar resolution needed, though (needed to kill the process) [15:30:32] yea [15:30:37] I agree with that [15:30:44] should be added to the ticket [15:31:02] actually, I was thinking of a previous issue [15:31:10] as I wasn't around on deceber [15:31:19] so this is the third time this has happened [15:31:22] for pc [15:32:28] and it is not the query killer- it is not enabled on pc hosts [15:32:50] so it is some traffic pattern that causes overload and mariadb cannot handle it [15:33:15] so probably traffic-related, but I wonder if something new on 10.6 makes it worse [15:33:20] PROBLEM - MegaRAID on an-worker1065 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [15:33:45] I think everything is green now, so I am going to 
declare the incident closed [15:33:52] will remove the downtimes, in case it happens again [15:34:06] but pc7 and pc5 are depooled still [15:34:11] if it is based on traffic, I would like to repool them [15:34:15] (not hw) [15:34:29] the avoid the risk of two other sections getting overloaded [15:34:30] obviously I don't know for sure [15:34:44] but I would say it is not the leading reason IMHO [15:35:03] and would agree with repooling them [15:35:12] let me remove the downtimes [15:35:22] do we have the older incident? [15:35:29] I will search it [15:36:04] T378076 [15:36:05] T378076: Parsercache issues in codfw causing large-scale outage - https://phabricator.wikimedia.org/T378076 [15:36:07] found it! [15:36:13] !log jynus@cumin1002 START - Cookbook sre.hosts.remove-downtime for pc[2015,2017].codfw.wmnet [15:36:14] !log jynus@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for pc[2015,2017].codfw.wmnet [15:36:55] yeah, there was a crash, and then another host went down a few minutes later [15:37:42] still, pc1017 [15:38:27] it would be nice to see the graphs more slowly later on, too much info [15:38:31] (03PS1) 10Federico Ceratto: clone.py: Add helper functions for later use [cookbooks] - 10https://gerrit.wikimedia.org/r/1121646 (https://phabricator.wikimedia.org/T387023) [15:38:37] it is extremely weird, the fallback I implemented in mw (which was implemented after the Oct incident), distributes across all clusters so either all should struggle or none, unless that specific host has a hw issue and goes down easily [15:38:51] yeah, I don't think it is the cause [15:39:03] I think there is some bad pattern that activates when higher load [15:39:12] as it happened before you implemented it [15:39:15] and after [15:39:28] (03PS5) 10Federico Ceratto: clone.py: Cleanup, extract fqdn and hostname [cookbooks] - 10https://gerrit.wikimedia.org/r/1120214 (https://phabricator.wikimedia.org/T387023) [15:39:30] some stampede or something [15:39:46] 
but indeed it is confusing [15:39:54] still better only 2 hosts than all [15:40:42] I am a bit saturated to think clearly, so I will let it stay for now [15:41:01] let's leave it for a bit. I need to eat lunch [15:41:28] if you are coming back, I will let you handle the repools [15:43:20] RECOVERY - MegaRAID on an-worker1065 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [15:46:20] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Frequent disk resets on ms-be2075 - https://phabricator.wikimedia.org/T382707#10571415 (10Jhancock.wm) had a weather event locally. taking another look at it today. [15:51:14] any idea what kind of timeline we should see recovery in eqiad over? [15:51:41] p99/p75 are staying increased (but again not panic-worthy) [15:51:51] do you have a link? [15:52:48] https://grafana.wikimedia.org/goto/_RAfFy5NR?orgId=1 [15:53:48] that's bigger than I would expect [15:53:55] (03PS3) 10Fabfur: hiera: enable benthos on ulsfo text|upload [puppet] - 10https://gerrit.wikimedia.org/r/1121636 (https://phabricator.wikimedia.org/T329332) [15:53:56] I guess it is due to: https://grafana.wikimedia.org/goto/_dzPKycNR?orgId=1 [15:54:13] double the number of regular reparses [15:54:17] (03CR) 10CI reject: [V:04-1] hiera: enable benthos on ulsfo text|upload [puppet] - 10https://gerrit.wikimedia.org/r/1121636 (https://phabricator.wikimedia.org/T329332) (owner: 10Fabfur) [15:54:25] (03CR) 10Fabfur: hiera: enable benthos on ulsfo text|upload (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1121636 (https://phabricator.wikimedia.org/T329332) (owner: 10Fabfur) [15:54:34] (03CR) 10Fabfur: [C:04-2] "Done" [puppet] - 10https://gerrit.wikimedia.org/r/1121636 (https://phabricator.wikimedia.org/T329332) (owner: 10Fabfur) [15:54:51] hnowlan: I would wonder if that is a cause or a consequence, if something else changed [15:56:11] that eerily correlates with the alert on pc7 [15:56:54] the hitrate 
has gone from 74% to 66%, I wouldn't expect such a latency difference from that [15:57:36] I'm going to pool the hosts again, see if that is better [15:57:45] also correlates with the replag on pc5 [15:57:56] but tbh I don't quite understand the impact of different parsercache sections [15:58:07] they are supposed to be shards [15:58:19] just random partitioning [15:58:41] but could be that something started hitting mw expensively which in turn made pc5/pc7 unhappy? [15:59:03] but in the codfw case depooling increased latency almost immediately [16:00:01] hnowlan: that was actually my question [16:00:08] (03CR) 10Scott French: "Thanks for the review, Moritz!" [puppet] - 10https://gerrit.wikimedia.org/r/1120586 (https://phabricator.wikimedia.org/T386006) (owner: 10Scott French) [16:00:28] (03CR) 10Scott French: [C:03+2] aptrepo: add component/pcre2 for bullseye-wikimedia [puppet] - 10https://gerrit.wikimedia.org/r/1120586 (https://phabricator.wikimedia.org/T386006) (owner: 10Scott French) [16:00:40] depooling should increase misses, but I wouldn't expect such a difference, specially when memcache is in front of it [16:01:02] I am going to repool and see if that helps [16:01:21] I'll do some digging and see if I can find anything odd [16:01:55] FIRING: SystemdUnitFailed: opensearch-disable-readahead.service on relforge1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:01:59] !log jynus@cumin1002 dbctl commit (dc=all): 'Repool pc5 & pc7', diff saved to https://phabricator.wikimedia.org/P73504 and previous config saved to /var/cache/conftool/dbconfig/20250221-160158-jynus.json [16:02:28] hnowlan: help me see if that makes things better or worse [16:03:07] I'll keep an eye [16:03:21] I think we should handover [16:03:31] so people in americas can keep an eye [16:03:55] oh [16:03:56] 
https://grafana.wikimedia.org/goto/Al-N5ycNR?orgId=1 [16:04:04] reparses got down a lot [16:04:40] doubling parsoidcacheprewarm probably didn't help [16:04:42] lines up well [16:04:53] depooling seems to beet too impactful, right? [16:04:56] *be [16:06:47] I think the new arch, while on paper better must have something that still makes it too impactful [16:06:56] (03CR) 10Scott French: "Thank you both for the review!" [puppet] - 10https://gerrit.wikimedia.org/r/1120587 (https://phabricator.wikimedia.org/T386006) (owner: 10Scott French) [16:06:56] oooh yeah big drops [16:06:58] (03CR) 10Scott French: [C:03+2] package_builder: add pbuilder hook for pcre2 component [puppet] - 10https://gerrit.wikimedia.org/r/1120587 (https://phabricator.wikimedia.org/T386006) (owner: 10Scott French) [16:07:13] we stayed up, which is a big improvement either way :D [16:07:17] hnowlan: feel free to comment on ticket [16:07:22] will do [16:07:33] yeah, but not ideal if mw starts reparsing like crazy [16:07:43] :-( [16:10:45] (03CR) 10Vgutierrez: [C:04-1] "`" [puppet] - 10https://gerrit.wikimedia.org/r/1121636 (https://phabricator.wikimedia.org/T329332) (owner: 10Fabfur) [16:12:56] (03PS4) 10Fabfur: hiera: enable benthos on ulsfo text|upload [puppet] - 10https://gerrit.wikimedia.org/r/1121636 (https://phabricator.wikimedia.org/T329332) [16:13:19] (03CR) 10Fabfur: "yeah, fixed, committed and forgot to `review`... 
😞" [puppet] - 10https://gerrit.wikimedia.org/r/1121636 (https://phabricator.wikimedia.org/T329332) (owner: 10Fabfur) [16:13:20] PROBLEM - MegaRAID on an-worker1065 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [16:21:56] (03Abandoned) 10Federico Ceratto: clone.py: Add helper functions for later use [cookbooks] - 10https://gerrit.wikimedia.org/r/1121646 (https://phabricator.wikimedia.org/T387023) (owner: 10Federico Ceratto) [16:25:35] (03PS2) 10Federico Ceratto: clone.py, clone_test.py: Automate cloning [cookbooks] - 10https://gerrit.wikimedia.org/r/1120605 (https://phabricator.wikimedia.org/T387023) [16:29:50] (03CR) 10Federico Ceratto: clone.py, clone_test.py: Automate cloning (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1120605 (https://phabricator.wikimedia.org/T387023) (owner: 10Federico Ceratto) [16:31:47] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Frequent disk resets on ms-be2075 - https://phabricator.wikimedia.org/T382707#10571614 (10Jhancock.wm) @MatthewVernon the new controller card wasn't registering for some reason. I reseated it and it shows up now. BUT. the raid config is gone. assuming its... 
[16:32:25] jynus: pc5 is not replicating, I should do a start slave there [16:32:41] (03CR) 10CI reject: [V:04-1] clone.py, clone_test.py: Automate cloning [cookbooks] - 10https://gerrit.wikimedia.org/r/1120605 (https://phabricator.wikimedia.org/T387023) (owner: 10Federico Ceratto) [16:33:20] RECOVERY - MegaRAID on an-worker1065 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [16:33:43] (03CR) 10Andrew Bogott: [C:03+2] vendordata.txt: include rudimentary clouds.yaml in initial VM [puppet] - 10https://gerrit.wikimedia.org/r/1120683 (https://phabricator.wikimedia.org/T379030) (owner: 10Andrew Bogott) [16:33:58] started both [16:34:29] !log started replication on pc2017 and pc2015 (T387032) [16:34:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:34:35] T387032: pc7 parsercache db got unavailable, leading to pc5 and mediawiki workers getting overloaded - https://phabricator.wikimedia.org/T387032 [16:35:23] (03CR) 10Fabfur: [C:04-2] "Done" [puppet] - 10https://gerrit.wikimedia.org/r/1121636 (https://phabricator.wikimedia.org/T329332) (owner: 10Fabfur) [16:35:24] (03CR) 10Fabfur: [C:04-2] "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1121636 (https://phabricator.wikimedia.org/T329332) (owner: 10Fabfur) [16:35:35] (03CR) 10Andrew Bogott: [C:03+2] Add wmcs_project_id custom fact and handling in realm [puppet] - 10https://gerrit.wikimedia.org/r/1121346 (https://phabricator.wikimedia.org/T379030) (owner: 10Andrew Bogott) [16:38:30] (03PS6) 10Andrew Bogott: Add wmcs_project_id custom fact and handling in realm [puppet] - 10https://gerrit.wikimedia.org/r/1121346 (https://phabricator.wikimedia.org/T379030) [16:38:30] (03PS4) 10Andrew Bogott: validatecloudvpsfqdn.py: Support projects with project_name in fqdn [puppet] - 10https://gerrit.wikimedia.org/r/1121344 (https://phabricator.wikimedia.org/T379030) [16:38:30] (03PS6) 10Andrew Bogott: wmcs puppet-enc:
use project id for endpoints [puppet] - 10https://gerrit.wikimedia.org/r/1121347 [16:38:30] (03PS6) 10Andrew Bogott: wmfkeystonehooks: use project name instead of project id for ldap key [puppet] - 10https://gerrit.wikimedia.org/r/1121345 (https://phabricator.wikimedia.org/T379030) [16:38:32] (03PS3) 10Andrew Bogott: validatecloudvpsfqdn.py: Only support projects with project_name in fqdn [puppet] - 10https://gerrit.wikimedia.org/r/1121423 (https://phabricator.wikimedia.org/T379030) [16:39:04] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Frequent disk resets on ms-be2075 - https://phabricator.wikimedia.org/T382707#10571638 (10MatthewVernon) This system has been drained, so I think it's OK to re-set-up the new card and then reimage the node. The disks should all be JBOD. [16:41:14] (03CR) 10Andrew Bogott: [C:03+2] Add wmcs_project_id custom fact and handling in realm [puppet] - 10https://gerrit.wikimedia.org/r/1121346 (https://phabricator.wikimedia.org/T379030) (owner: 10Andrew Bogott) [16:42:06] (03CR) 10Andrew Bogott: [C:03+2] validatecloudvpsfqdn.py: Support projects with project_name in fqdn [puppet] - 10https://gerrit.wikimedia.org/r/1121344 (https://phabricator.wikimedia.org/T379030) (owner: 10Andrew Bogott) [16:45:28] (03PS3) 10Federico Ceratto: clone.py, clone_test.py: Automate cloning [cookbooks] - 10https://gerrit.wikimedia.org/r/1120605 (https://phabricator.wikimedia.org/T387023) [16:46:40] (03CR) 10CI reject: [V:04-1] clone.py, clone_test.py: Automate cloning [cookbooks] - 10https://gerrit.wikimedia.org/r/1120605 (https://phabricator.wikimedia.org/T387023) (owner: 10Federico Ceratto) [16:47:14] (03PS6) 10Federico Ceratto: clone.py: Cleanup, extract fqdn and hostname [cookbooks] - 10https://gerrit.wikimedia.org/r/1120214 (https://phabricator.wikimedia.org/T387023) [16:47:25] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ms-be2075.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy 
GRACEFUL [16:47:39] (03CR) 10CI reject: [V:04-1] clone.py: Cleanup, extract fqdn and hostname [cookbooks] - 10https://gerrit.wikimedia.org/r/1120214 (https://phabricator.wikimedia.org/T387023) (owner: 10Federico Ceratto) [16:57:17] (03CR) 10BCornwall: provision: Adjust thermal profile for F4 (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1121086 (https://phabricator.wikimedia.org/T373993) (owner: 10BCornwall) [17:03:03] (03CR) 10LD: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1120152 (https://phabricator.wikimedia.org/T386622) (owner: 10LD) [17:12:21] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-be2075.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [17:12:57] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ms-be2075'] [17:13:19] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['ms-be2075'] [17:25:44] !log jhathaway@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on ms-be2088.codfw.wmnet with reason: T381919 [17:25:47] T381919: Supermicro: unable to set boot order after using Redfish to boot once - https://phabricator.wikimedia.org/T381919 [17:33:20] PROBLEM - MegaRAID on an-worker1065 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [17:38:05] (03PS1) 10Andrew Bogott: Fix yet another ordering issue with new puppetservers. [puppet] - 10https://gerrit.wikimedia.org/r/1121664 [17:38:27] (03CR) 10CI reject: [V:04-1] Fix yet another ordering issue with new puppetservers. 
[puppet] - 10https://gerrit.wikimedia.org/r/1121664 (owner: 10Andrew Bogott) [17:41:20] (03PS7) 10Federico Ceratto: clone.py: Cleanup, extract fqdn and hostname [cookbooks] - 10https://gerrit.wikimedia.org/r/1120214 (https://phabricator.wikimedia.org/T387023) [17:41:20] (03PS4) 10Federico Ceratto: clone.py, clone_test.py: Automate cloning [cookbooks] - 10https://gerrit.wikimedia.org/r/1120605 (https://phabricator.wikimedia.org/T387023) [17:41:40] (03CR) 10Federico Ceratto: clone.py, clone_test.py: Automate cloning (034 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1120605 (https://phabricator.wikimedia.org/T387023) (owner: 10Federico Ceratto) [17:41:54] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2075.codfw.wmnet with OS bullseye [17:42:04] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Frequent disk resets on ms-be2075 - https://phabricator.wikimedia.org/T382707#10571782 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host ms-be2075.codfw.wmnet with OS bullseye [17:43:40] (03PS6) 10Federico Ceratto: clone.py: Add helper functions for later use [cookbooks] - 10https://gerrit.wikimedia.org/r/1120213 (https://phabricator.wikimedia.org/T387023) [17:44:10] (03PS8) 10Federico Ceratto: clone.py: Cleanup, extract fqdn and hostname [cookbooks] - 10https://gerrit.wikimedia.org/r/1120214 (https://phabricator.wikimedia.org/T387023) [17:44:47] (03PS5) 10Federico Ceratto: clone.py, clone_test.py: Automate cloning [cookbooks] - 10https://gerrit.wikimedia.org/r/1120605 (https://phabricator.wikimedia.org/T387023) [17:52:40] (03PS51) 10Federico Ceratto: sre.mysql.sanitize-wiki: sanitize wiki cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1080129 (https://phabricator.wikimedia.org/T366146) (owner: 10Arnaudb) [18:09:43] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10571844 (10phaultfinder) [18:11:03] !log bking@apt1002:~$ sudo -E 
reprepro --ignore=wrongdistribution -C component/opensearch13 include bullseye-wikimedia $HOME/madvise-pkg/opensearch-madvise_0.1_amd64.changes T387030 [18:11:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:11:06] T387030: Recompile/repackage elasticsearch-madvise for Opensearch - https://phabricator.wikimedia.org/T387030 [18:11:13] (03PS1) 10DCausse: cirrus: configure wgCirrusSearchLanguageKeywordExtraFields [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1121666 (https://phabricator.wikimedia.org/T271776) [18:20:05] (03PS1) 10Bking: relforge: re-enable opensearch-madvise [puppet] - 10https://gerrit.wikimedia.org/r/1121671 (https://phabricator.wikimedia.org/T387030) [18:27:06] (03PS1) 10Aklapper: Move recentEditRatio code to where it's used [phabricator/antivandalism] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1121674 [18:28:05] (03CR) 10Aklapper: [V:03+2 C:03+2] Move recentEditRatio code to where it's used [phabricator/antivandalism] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1121674 (owner: 10Aklapper) [18:33:30] !log xcollazo@deploy2002 Started deploy [airflow-dags/analytics@60223e2]: Deploying latest DAGs for the analytics Airflow instance. T387033. [18:33:33] T387033: Figure root cause of silent failures when computing metrics for mediawiki_content_history_v1 - https://phabricator.wikimedia.org/T387033 [18:34:15] !log xcollazo@deploy2002 Finished deploy [airflow-dags/analytics@60223e2]: Deploying latest DAGs for the analytics Airflow instance. T387033. 
(duration: 00m 45s) [18:37:47] (03CR) 10Brouberol: [C:03+1] relforge: re-enable opensearch-madvise [puppet] - 10https://gerrit.wikimedia.org/r/1121671 (https://phabricator.wikimedia.org/T387030) (owner: 10Bking) [18:43:49] (03CR) 10Bking: [C:03+2] relforge: re-enable opensearch-madvise [puppet] - 10https://gerrit.wikimedia.org/r/1121671 (https://phabricator.wikimedia.org/T387030) (owner: 10Bking) [18:44:10] (03PS1) 10Dzahn: puppetserver: create full path to /srv/puppet/server/ssl/ca [puppet] - 10https://gerrit.wikimedia.org/r/1121682 [18:44:12] 10ops-codfw, 06SRE, 06DC-Ops, 10decommission-hardware, 10Observability-Logging: decommission logstash202[6-9] - https://phabricator.wikimedia.org/T383288#10571966 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm [18:47:47] (03CR) 10JHathaway: [C:03+1] "Copied votes on follow-up patch sets have been updated:" [puppet] - 10https://gerrit.wikimedia.org/r/1121682 (https://phabricator.wikimedia.org/T382960) (owner: 10Dzahn) [18:47:47] (03PS2) 10Dzahn: puppetserver: create full path to /srv/puppet/server/ssl/ca [puppet] - 10https://gerrit.wikimedia.org/r/1121682 (https://phabricator.wikimedia.org/T382960) [18:49:18] (03CR) 10Dzahn: [C:03+2] puppetserver: create full path to /srv/puppet/server/ssl/ca [puppet] - 10https://gerrit.wikimedia.org/r/1121682 (https://phabricator.wikimedia.org/T382960) (owner: 10Dzahn) [18:51:37] !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Banning hosts: relforge1004* for test ability to ban opensearch node - bking@cumin2002 - T387030 [18:51:38] !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: relforge1004* for test ability to ban opensearch node - bking@cumin2002 - T387030 [18:51:40] T387030: Recompile/repackage elasticsearch-madvise for Opensearch - https://phabricator.wikimedia.org/T387030 [18:53:19] PROBLEM - Host ms-be2075 is DOWN: PING CRITICAL - Packet loss = 100% [18:55:47] RECOVERY - Host ms-be2075 is UP: PING OK - Packet loss = 0%,
RTA = 34.42 ms [18:56:39] 06SRE, 10SRE-Access-Requests: Requesting access to stewards-users for Melos - https://phabricator.wikimedia.org/T386581#10571996 (10KFrancis) Hi all, the NDA is complete. Thanks! [18:57:58] 10ops-codfw, 06SRE, 06collaboration-services, 06Data-Persistence, and 2 others: Tracking List: Relocating servers to free up 10G switch space in codfw - https://phabricator.wikimedia.org/T383709#10572009 (10Jhancock.wm) @MoritzMuehlenhoff when is a good time next week to move maps2009? 1500 UTC on is when... [18:58:37] 06SRE, 10SRE-Access-Requests: Requesting access to stewards-users for Melos - https://phabricator.wikimedia.org/T386581#10572010 (10KFrancis) Hi all, the NDA is complete. Thanks! [19:01:38] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ms-be2075.codfw.wmnet with OS bullseye [19:01:45] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Frequent disk resets on ms-be2075 - https://phabricator.wikimedia.org/T382707#10572014 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host ms-be2075.codfw.wmnet with OS bullseye executed with errors: - ms-be20... [19:06:48] (03PS1) 10Dzahn: puppetserver: create file resource for /srv/puppet/server/ssl/ca [puppet] - 10https://gerrit.wikimedia.org/r/1121684 [19:10:05] (03Abandoned) 10Andrew Bogott: Fix yet another ordering issue with new puppetservers. 
[puppet] - 10https://gerrit.wikimedia.org/r/1121664 (owner: 10Andrew Bogott) [19:11:19] PROBLEM - Host ms-be2075 is DOWN: PING CRITICAL - Packet loss = 100% [19:14:47] RECOVERY - Host ms-be2075 is UP: PING OK - Packet loss = 0%, RTA = 33.23 ms [19:16:08] (03PS2) 10Dzahn: puppetserver: create file resource for /srv/puppet/server/ssl/ca [puppet] - 10https://gerrit.wikimedia.org/r/1121684 (https://phabricator.wikimedia.org/T382960) [19:16:33] (03CR) 10CI reject: [V:04-1] puppetserver: create file resource for /srv/puppet/server/ssl/ca [puppet] - 10https://gerrit.wikimedia.org/r/1121684 (https://phabricator.wikimedia.org/T382960) (owner: 10Dzahn) [19:17:36] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be2075.codfw.wmnet with OS bullseye [19:17:49] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Frequent disk resets on ms-be2075 - https://phabricator.wikimedia.org/T382707#10572039 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host ms-be2075.codfw.wmnet with OS bullseye [19:18:41] (03PS3) 10Dzahn: puppetserver: create file resource for /srv/puppet/server/ssl/ca [puppet] - 10https://gerrit.wikimedia.org/r/1121684 (https://phabricator.wikimedia.org/T382960) [19:19:10] (03CR) 10CI reject: [V:04-1] puppetserver: create file resource for /srv/puppet/server/ssl/ca [puppet] - 10https://gerrit.wikimedia.org/r/1121684 (https://phabricator.wikimedia.org/T382960) (owner: 10Dzahn) [19:20:05] (03PS4) 10Dzahn: puppetserver: create file resource for /srv/puppet/server/ssl/ca [puppet] - 10https://gerrit.wikimedia.org/r/1121684 (https://phabricator.wikimedia.org/T382960) [19:20:09] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Install and cable Nokia test devices and test servers in codfw - https://phabricator.wikimedia.org/T385217#10572041 (10cmooney) Bit of an update on this one. I was able to get on to three of the devices with the default password for t... 
[19:21:16] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Install and cable Nokia test devices and test servers in codfw - https://phabricator.wikimedia.org/T385217#10572043 (10cmooney) Also @Jhancock.wm not sure if you want to close this task at this point, or if we want to keep it open to d... [19:23:16] (03PS1) 10ZhaoFJx: zhwiki: Change abusefilter-editor group name [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1121687 (https://phabricator.wikimedia.org/T386879) [19:31:13] PROBLEM - Host relforge1004 is DOWN: PING CRITICAL - Packet loss = 100% [19:31:43] RECOVERY - Host relforge1004 is UP: PING OK - Packet loss = 0%, RTA = 0.44 ms [19:31:55] RESOLVED: SystemdUnitFailed: opensearch-disable-readahead.service on relforge1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:36:17] (03PS1) 10ZhaoFJx: kywiki: Add namespace aliases [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1121688 (https://phabricator.wikimedia.org/T386617) [19:37:55] FIRING: SystemdUnitFailed: opensearch-disable-readahead.service on relforge1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:41:41] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, February 24 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1121688 (https://phabricator.wikimedia.org/T386617) (owner: 10ZhaoFJx) [19:44:04] (03PS11) 10Andrew Bogott: nova vendordata: set fqdn from project_name rather than project_id [puppet] - 10https://gerrit.wikimedia.org/r/1120684 (https://phabricator.wikimedia.org/T379030) [19:44:04] (03PS1) 10Andrew Bogott: openstack_project_id.rb: discard 
trailing newline [puppet] - 10https://gerrit.wikimedia.org/r/1121691 (https://phabricator.wikimedia.org/T379030) [19:44:06] (03PS1) 10Andrew Bogott: Revert "cloud-vps instance: populate /etc/openstack/project_id" [puppet] - 10https://gerrit.wikimedia.org/r/1121692 [19:44:38] (03PS1) 10Bking: cirrus: point bash script to the correct executable [puppet] - 10https://gerrit.wikimedia.org/r/1121693 (https://phabricator.wikimedia.org/T387030) [19:44:59] (03CR) 10CI reject: [V:04-1] openstack_project_id.rb: discard trailing newline [puppet] - 10https://gerrit.wikimedia.org/r/1121691 (https://phabricator.wikimedia.org/T379030) (owner: 10Andrew Bogott) [19:46:48] (03PS2) 10Andrew Bogott: openstack_project_id.rb: discard trailing newline [puppet] - 10https://gerrit.wikimedia.org/r/1121691 (https://phabricator.wikimedia.org/T379030) [19:46:48] (03PS2) 10Andrew Bogott: Revert "cloud-vps instance: populate /etc/openstack/project_id" [puppet] - 10https://gerrit.wikimedia.org/r/1121692 [19:46:48] (03PS12) 10Andrew Bogott: nova vendordata: set fqdn from project_name rather than project_id [puppet] - 10https://gerrit.wikimedia.org/r/1120684 (https://phabricator.wikimedia.org/T379030) [19:50:03] (03CR) 10Andrew Bogott: [C:03+2] openstack_project_id.rb: discard trailing newline [puppet] - 10https://gerrit.wikimedia.org/r/1121691 (https://phabricator.wikimedia.org/T379030) (owner: 10Andrew Bogott) [19:56:17] (03CR) 10Bking: [C:03+2] cirrus: point bash script to the correct executable [puppet] - 10https://gerrit.wikimedia.org/r/1121693 (https://phabricator.wikimedia.org/T387030) (owner: 10Bking) [19:56:31] (03CR) 10Bking: [V:03+2 C:03+2] "self-merging, as this does not affect production services." 
[puppet] - 10https://gerrit.wikimedia.org/r/1121693 (https://phabricator.wikimedia.org/T387030) (owner: 10Bking) [20:01:37] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ms-be2075.codfw.wmnet with OS bullseye [20:01:48] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Frequent disk resets on ms-be2075 - https://phabricator.wikimedia.org/T382707#10572167 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host ms-be2075.codfw.wmnet with OS bullseye executed with errors: - ms-be20... [20:09:37] (03CR) 10JHathaway: puppetserver: create file resource for /srv/puppet/server/ssl/ca (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1121684 (https://phabricator.wikimedia.org/T382960) (owner: 10Dzahn) [20:18:05] (03CR) 10LD: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1120152 (https://phabricator.wikimedia.org/T386622) (owner: 10LD) [20:18:38] (03PS13) 10LD: frwiki: Enable the CampaignEvents extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1120152 (https://phabricator.wikimedia.org/T386622) [20:20:53] (03CR) 10LD: [C:03+1] "@daimona.wiki@gmail.com thanks for the feedback and refs." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1120152 (https://phabricator.wikimedia.org/T386622) (owner: 10LD) [20:31:06] !log jhathaway@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on ms-be2088.codfw.wmnet with reason: T381919 [20:31:09] T381919: Supermicro: unable to set boot order after using Redfish to boot once - https://phabricator.wikimedia.org/T381919 [20:37:55] RESOLVED: SystemdUnitFailed: opensearch-disable-readahead.service on relforge1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:56:21] (03CR) 10Cathal Mooney: [C:03+1] "Well spotted!" 
[homer/public] - 10https://gerrit.wikimedia.org/r/1121484 (https://phabricator.wikimedia.org/T386766) (owner: 10Papaul) [21:15:14] !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Unbanning all hosts in relforge [21:15:16] !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Unbanning all hosts in relforge [22:11:12] (03PS1) 10Andrew Bogott: puppetserver: remove duplicate directory definition [puppet] - 10https://gerrit.wikimedia.org/r/1121702 [22:11:38] (03CR) 10CI reject: [V:04-1] puppetserver: remove duplicate directory definition [puppet] - 10https://gerrit.wikimedia.org/r/1121702 (owner: 10Andrew Bogott) [22:15:24] (03PS2) 10Andrew Bogott: puppetserver: move duplicate directory definition [puppet] - 10https://gerrit.wikimedia.org/r/1121702 [22:24:35] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10572405 (10phaultfinder) [22:25:36] (03CR) 10JHathaway: [C:03+2] puppetserver: create file resource for /srv/puppet/server/ssl/ca (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1121684 (https://phabricator.wikimedia.org/T382960) (owner: 10Dzahn) [22:26:47] (03PS1) 10Bking: relforge: reassign relforge1005 to Opensearch role [puppet] - 10https://gerrit.wikimedia.org/r/1121711 (https://phabricator.wikimedia.org/T380752) [22:31:46] !log jhathaway@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on ms-be2088.codfw.wmnet with reason: T381919 [22:31:50] T381919: Supermicro: unable to set boot order after using Redfish to boot once - https://phabricator.wikimedia.org/T381919 [22:34:39] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10572419 (10phaultfinder) [23:11:36] (03CR) 10Dzahn: "I don't think we need it anymore since the duplicate definition was already fixed in https://gerrit.wikimedia.org/r/c/operations/puppet/+/" [puppet] - 10https://gerrit.wikimedia.org/r/1121702 (owner: 10Andrew Bogott) 
[23:24:35] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10572466 (10phaultfinder) [23:39:55] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 220, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:40:41] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 129, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down