[00:07:40] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:07:46] SRE, Commons, DBA, Traffic: Unable to save edits or delete pages on Commons – database lag - https://phabricator.wikimedia.org/T402749#11113548 (Ladsgroup) To be sure update category membership is the culprit, I went through all slow write queries reordered by the master around the time of th...
[00:07:58] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1181307
[00:07:58] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1181307 (owner: TrainBranchBot)
[00:11:51] SRE, Commons, DBA, Traffic: Unable to save edits or delete pages on Commons – database lag - https://phabricator.wikimedia.org/T402749#11113549 (Ladsgroup) Specifically these edits seemed to be the main reason: https://commons.wikimedia.org/w/index.php?title=Special:Contributions/Yac%C3%A0wot...
[00:36:26] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1181307 (owner: TrainBranchBot)
[00:43:56] ops-magru: Alert for device ps1-b3-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T402581#11113571 (phaultfinder)
[00:48:53] ops-magru: Alert for device ps1-b4-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T402582#11113574 (phaultfinder)
[01:16:17] SRE-SLO, SRE Observability, Abstract Wikipedia team (26Q1 (Jul–Sep)), Essential-Work: Create new SLO dashboard via Pyrra for Wikifunctions - https://phabricator.wikimedia.org/T394057#11113584 (ecarg) Thank you so much, @RLazarus! Will keep you posted with any Qs 😃
[01:32:25] RESOLVED: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:37:25] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:37:44] FIRING: NodeTextfileStale: Stale textfile for wdqs2025:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[01:42:44] FIRING: NodeTextfileStale: Stale textfile for wdqs2009:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[01:45:44] FIRING: [3x] NodeTextfileStale: Stale textfile for wdqs1027:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[01:48:44] FIRING: [2x] NodeTextfileStale: Stale textfile for apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[01:49:44] FIRING: [6x] NodeTextfileStale: Stale textfile for cloudelastic1007:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[01:58:58] FIRING: [5x] NodeTextfileStale: Stale textfile for wdqs1018:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[01:59:53] FIRING: [4x] NodeTextfileStale: Stale textfile for wdqs1023:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[02:00:44] FIRING: [6x] NodeTextfileStale: Stale textfile for wcqs1001:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[02:00:48] FIRING: [112x] NodeTextfileStale: Stale textfile for cirrussearch1068:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[02:01:44] FIRING: [5x] NodeTextfileStale: Stale textfile for wdqs1025:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[02:04:53] FIRING: [19x] NodeTextfileStale: Stale textfile for wdqs1011:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[02:05:53] FIRING: [3x] NodeTextfileStale: Stale textfile for relforge1008:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[02:07:43] (PS1) Krinkle: Enable wmgUseMdotRouting in Beta Cluster for testwiki only [mediawiki-config] - https://gerrit.wikimedia.org/r/1181310 (https://phabricator.wikimedia.org/T401595)
[02:08:31] (CR) CI reject: [V:-1] Enable wmgUseMdotRouting in Beta Cluster for testwiki only [mediawiki-config] - https://gerrit.wikimedia.org/r/1181310 (https://phabricator.wikimedia.org/T401595) (owner: Krinkle)
[02:14:25] SRE, Commons, DBA, Traffic: Unable to save edits or delete pages on Commons – database lag - https://phabricator.wikimedia.org/T402749#11113595 (Zache) @Ladsgroup : Just FYI, from the Cat-a-lot code side, the user was using a pre-August 18, 2024 version of Cat-a-lot which didn't have the thro...
[02:15:54] FIRING: CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld
[02:57:56] FIRING: [4x] KubernetesRsyslogDown: rsyslog on dse-k8s-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[03:04:34] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag
[03:04:34] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown
[03:14:39] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr2-esams and LibertyGlobal (2001:730:2209:1::d52e:ba09) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[03:14:51] FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-esams:xe-0/1/7 (Transit: Liberty Global (BB00088) {#021468}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-esams:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[03:29:34] FIRING: NetworkDeviceAlarmActive: Alarm active on ssw1-f1-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive
[04:45:39] RESOLVED: CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld
[04:48:55] ops-magru: Alert for device ps1-b3-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T402581#11113663 (phaultfinder)
[04:53:58] ops-magru: Alert for device ps1-b4-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T402582#11113664 (phaultfinder)
[05:08:39] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:20:05] ops-codfw, DC-Ops: Alert for device ps1-d4-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T402759 (phaultfinder) NEW
[05:28:39] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:37:25] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:37:44] FIRING: NodeTextfileStale: Stale textfile for wdqs2025:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[05:42:44] FIRING: NodeTextfileStale: Stale textfile for wdqs2009:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[05:45:44] FIRING: [3x] NodeTextfileStale: Stale textfile for wdqs1027:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[05:48:44] FIRING: [2x] NodeTextfileStale: Stale textfile for apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[05:49:44] FIRING: [6x] NodeTextfileStale: Stale textfile for cloudelastic1007:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[05:58:53] FIRING: [5x] NodeTextfileStale: Stale textfile for wdqs1018:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[05:59:53] FIRING: [4x] NodeTextfileStale: Stale textfile for wdqs1023:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[06:00:44] FIRING: [6x] NodeTextfileStale: Stale textfile for wcqs1001:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[06:00:49] FIRING: [112x] NodeTextfileStale: Stale textfile for cirrussearch1068:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[06:01:44] FIRING: [5x] NodeTextfileStale: Stale textfile for wdqs1025:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[06:04:53] FIRING: [19x] NodeTextfileStale: Stale textfile for wdqs1011:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[06:05:44] FIRING: [3x] NodeTextfileStale: Stale textfile for relforge1008:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[06:23:39] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:38:07] jouncebot: nowandnext
[06:38:08] For the next 0 hour(s) and 21 minute(s): No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250824T0700)
[06:38:08] In 0 hour(s) and 21 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250825T0700)
[06:38:38] (CR) TrainBranchBot: [C:+2] "Approved by kharlan@deploy1003 using scap backport" [extensions/ConfirmEdit] (wmf/1.45.0-wmf.15) - https://gerrit.wikimedia.org/r/1181130 (https://phabricator.wikimedia.org/T402641) (owner: Kosta Harlan)
[06:38:50] (CR) Arnaudb: [C:+1] "thanks for the bump" [puppet] - https://gerrit.wikimedia.org/r/1181213 (owner: Dzahn)
[06:40:19] (Merged) jenkins-bot: hcaptcha: Delay challenge execution until submit [extensions/ConfirmEdit] (wmf/1.45.0-wmf.15) - https://gerrit.wikimedia.org/r/1181130 (https://phabricator.wikimedia.org/T402641) (owner: Kosta Harlan)
[06:42:21] dcausse: it looks like https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1114956 was merged, but not backported
[06:42:43] kostajh: yes I think it was merged by accident on friday
[06:42:47] ok
[06:42:51] dcausse: ok if I sync it now?
[06:43:02] I'm syncing another patch, and scap is asking me to deploy it
[06:43:07] kostajh: yes please
[06:43:23] !log kharlan@deploy1003 Started scap sync-world: Backport for [[gerrit:1181130|hcaptcha: Delay challenge execution until submit (T402641)]]
[06:43:28] T402641: hCaptcha: Only display challenge on form submission - https://phabricator.wikimedia.org/T402641
[06:43:38] dcausse: ok, it's going out now. Do you need to verify it when it's on mwdebug?
[06:44:05] kostajh: I could do a quick test yes
[06:44:18] ok I'll let you know when it's ready for review
[06:44:26] sure
[06:48:07] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[06:48:36] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[06:57:56] FIRING: [4x] KubernetesRsyslogDown: rsyslog on dse-k8s-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[07:00:05] Amir1, Urbanecm, and awight: #bothumor My software never has bugs. It just develops random features. Rise for UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250825T0700).
[07:00:05] kostajh and dcausse: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:00:16] (CR) Filippo Giunchedi: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1098962 (https://phabricator.wikimedia.org/T135991) (owner: Muehlenhoff)
[07:04:31] dcausse: it's still syncing out...
[07:04:34] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag
[07:04:34] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown
[07:06:22] (CR) Muehlenhoff: [C:+2] cloudcontrol/codfw1dev:: Enable profile::auto_restarts::service for apache2 [puppet] - https://gerrit.wikimedia.org/r/1098962 (https://phabricator.wikimedia.org/T135991) (owner: Muehlenhoff)
[07:08:29] !log kharlan@deploy1003 kharlan: Backport for [[gerrit:1181130|hcaptcha: Delay challenge execution until submit (T402641)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[07:08:34] T402641: hCaptcha: Only display challenge on form submission - https://phabricator.wikimedia.org/T402641
[07:08:55] testing
[07:09:26] kostajh: all good from my side
[07:11:40] cool
[07:11:51] !log kharlan@deploy1003 kharlan: Continuing with sync
[07:13:36] SRE, envoy, serviceops, Traffic, Patch-For-Review: Upgrade Envoy to v1.26.8 and drop buster - https://phabricator.wikimedia.org/T402584#11113754 (MoritzMuehlenhoff) We also have 237 baremetal hosts with Envoy, how shall we handle these? We could e.g. add a profile parameter $use_future to pro...
[07:14:54] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr2-esams and LibertyGlobal (2001:730:2209:1::d52e:ba09) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[07:15:06] FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-esams:xe-0/1/7 (Transit: Liberty Global (BB00088) {#021468}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-esams:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[07:24:46] !log kharlan@deploy1003 Finished scap sync-world: Backport for [[gerrit:1181130|hcaptcha: Delay challenge execution until submit (T402641)]] (duration: 41m 22s)
[07:24:50] T402641: hCaptcha: Only display challenge on form submission - https://phabricator.wikimedia.org/T402641
[07:25:04] dcausse: synced!
[07:25:24] kostajh: thanks and sorry about this!
[07:25:31] no worries at all
[07:29:13] (CR) Kevin Bazira: [C:+2] ml-services: update image for readability model on prod [deployment-charts] - https://gerrit.wikimedia.org/r/1180098 (https://phabricator.wikimedia.org/T400352) (owner: AikoChou)
[07:29:34] FIRING: NetworkDeviceAlarmActive: Alarm active on ssw1-f1-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive
[07:31:11] (Merged) jenkins-bot: ml-services: update image for readability model on prod [deployment-charts] - https://gerrit.wikimedia.org/r/1180098 (https://phabricator.wikimedia.org/T400352) (owner: AikoChou)
[07:31:54] (PS1) Ayounsi: Routed Ganeti: switch to dnsmasq for DHCP relay [puppet] - https://gerrit.wikimedia.org/r/1181505 (https://phabricator.wikimedia.org/T396864)
[07:33:12] !log kevinbazira@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'readability' for release 'main' .
[07:33:41] SRE, envoy, serviceops, Traffic, Patch-For-Review: Upgrade Envoy to v1.26.8 and drop buster - https://phabricator.wikimedia.org/T402584#11113776 (hashar) I have updated the [[ https://integration.wikimedia.org/ci/job/helm-lint/ | helm-lint ]] job to the new image :)
[07:34:57] (CR) CI reject: [V:-1] Routed Ganeti: switch to dnsmasq for DHCP relay [puppet] - https://gerrit.wikimedia.org/r/1181505 (https://phabricator.wikimedia.org/T396864) (owner: Ayounsi)
[07:36:25] !log kevinbazira@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'readability' for release 'main' .
[07:36:30] (PS2) Ayounsi: Routed Ganeti: switch to dnsmasq for DHCP relay [puppet] - https://gerrit.wikimedia.org/r/1181505 (https://phabricator.wikimedia.org/T396864)
[07:36:44] (PS1) Brouberol: mediawiki-dumps-legacy: adapt RBAC to a recent apache-airflow-providers-cncf-kubernetes upgrade [deployment-charts] - https://gerrit.wikimedia.org/r/1181519 (https://phabricator.wikimedia.org/T402529)
[07:36:54] SRE, Traffic, Patch-For-Review, WE4.2 Bot detection (WE4.2 hCaptcha account creation trial): hCaptcha: Ensure GeoIP and WMF-Uniq cookies are removed in proxied requests - https://phabricator.wikimedia.org/T402713#11113808 (kostajh)
[07:41:28] (CR) Muehlenhoff: [C:+2] Apply installserver role on install4003 [puppet] - https://gerrit.wikimedia.org/r/1181094 (https://phabricator.wikimedia.org/T396487) (owner: Muehlenhoff)
[07:41:31] (CR) Ayounsi: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1181505 (https://phabricator.wikimedia.org/T396864) (owner: Ayounsi)
[07:43:07] (PS3) Ayounsi: Routed Ganeti: switch to dnsmasq for DHCP relay [puppet] - https://gerrit.wikimedia.org/r/1181505 (https://phabricator.wikimedia.org/T396864)
[07:43:49] (CR) Urbanecm: [C:-1] [Growth] enwiki: Deploy "Add a link" to 100% of users (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/1179648 (https://phabricator.wikimedia.org/T395524) (owner: Cyndywikime)
[07:45:22] kostajh: any other deployment still happening, do you know?
[07:46:16] (PS4) Ayounsi: Routed Ganeti: switch to dnsmasq for DHCP relay [puppet] - https://gerrit.wikimedia.org/r/1181505 (https://phabricator.wikimedia.org/T396864)
[07:46:33] (CR) Ayounsi: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1181505 (https://phabricator.wikimedia.org/T396864) (owner: Ayounsi)
[07:50:29] (CR) Ayounsi: [C:-1] "Ready for review but can't be merged before we have a compatible dnsmasq (v2.92) in APT." [puppet] - https://gerrit.wikimedia.org/r/1181505 (https://phabricator.wikimedia.org/T396864) (owner: Ayounsi)
[07:55:16] FIRING: [19x] NodeTextfileStale: Stale textfile for wdqs1011:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[07:55:30] FIRING: [4x] NodeTextfileStale: Stale textfile for wdqs1023:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[07:55:34] FIRING: [6x] NodeTextfileStale: Stale textfile for cloudelastic1007:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[07:55:59] RESOLVED: [3x] NodeTextfileStale: Stale textfile for wdqs1027:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[07:56:08] RESOLVED: [6x] NodeTextfileStale: Stale textfile for wcqs1001:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[07:56:41] RESOLVED: [3x] NodeTextfileStale: Stale textfile for relforge1008:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[07:57:19] RESOLVED: [112x] NodeTextfileStale: Stale textfile for cirrussearch1068:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[07:57:31] RESOLVED: [5x] NodeTextfileStale: Stale textfile for wdqs1025:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[07:58:34] RESOLVED: NodeTextfileStale: Stale textfile for wdqs2009:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[07:58:53] RESOLVED: NodeTextfileStale: Stale textfile for wdqs2025:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[07:59:06] (PS1) Muehlenhoff: Add dummy keytabs for new install servers T396487 [labs/private] - https://gerrit.wikimedia.org/r/1181638
[07:59:43] RESOLVED: [5x] NodeTextfileStale: Stale textfile for wdqs1018:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[08:00:11] RESOLVED: [2x] NodeTextfileStale: Stale textfile for apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[08:01:19] RESOLVED: [19x] NodeTextfileStale: Stale textfile for wdqs1011:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[08:01:23] RESOLVED: [4x] NodeTextfileStale: Stale textfile for wdqs1023:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[08:01:42] RESOLVED: [6x] NodeTextfileStale: Stale textfile for cloudelastic1007:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[08:13:15] (CR) Jcrespo: "Hey, Moritz, maybe I am not understanding the patch, but I don't think this will work as intended, "os.path.exists" runs something on the " [software/transferpy] - https://gerrit.wikimedia.org/r/1180570 (https://phabricator.wikimedia.org/T393692) (owner: Muehlenhoff)
[08:18:48] (PS1) Slyngshede: P:cache::haproxy block generic user-agents [puppet] - https://gerrit.wikimedia.org/r/1181639 (https://phabricator.wikimedia.org/T400119)
[08:23:41] (PS1) Brouberol: airflow-ml: define an RBD volume claim used as a model training scratch space [deployment-charts] - https://gerrit.wikimedia.org/r/1181641 (https://phabricator.wikimedia.org/T396495)
[08:26:55] (CR) Vgutierrez: P:cache::haproxy block generic user-agents (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1181639 (https://phabricator.wikimedia.org/T400119) (owner: Slyngshede)
[08:29:46] (CR) Vgutierrez: [C:-1] P:cache::haproxy block generic user-agents (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1181639 (https://phabricator.wikimedia.org/T400119) (owner: Slyngshede)
[08:32:44] (CR) Gkyziridis: [C:+1] "LGTM!" [deployment-charts] - https://gerrit.wikimedia.org/r/1181641 (https://phabricator.wikimedia.org/T396495) (owner: Brouberol)
[08:35:30] (CR) Brouberol: [C:+2] airflow-ml: define an RBD volume claim used as a model training scratch space [deployment-charts] - https://gerrit.wikimedia.org/r/1181641 (https://phabricator.wikimedia.org/T396495) (owner: Brouberol)
[08:36:15] (PS2) Slyngshede: P:cache::haproxy block generic user-agents [puppet] - https://gerrit.wikimedia.org/r/1181639 (https://phabricator.wikimedia.org/T400119)
[08:39:09] (PS3) Slyngshede: P:cache::haproxy block generic user-agents [puppet] - https://gerrit.wikimedia.org/r/1181639 (https://phabricator.wikimedia.org/T400119)
[08:39:15] (CR) Slyngshede: P:cache::haproxy block generic user-agents (2 comments) [puppet] - https://gerrit.wikimedia.org/r/1181639 (https://phabricator.wikimedia.org/T400119) (owner: Slyngshede)
[08:40:22] (CR) Muehlenhoff: [C:+2] installserver: Failover DHCP server in ulsfo [puppet] - https://gerrit.wikimedia.org/r/1181095 (https://phabricator.wikimedia.org/T396487) (owner: Muehlenhoff)
[08:40:44] (CR) Muehlenhoff: [V:+2 C:+2] Add dummy keytabs for new install servers T396487 [labs/private] - https://gerrit.wikimedia.org/r/1181638 (owner: Muehlenhoff)
[08:41:33] (PS4) Máté Szabó: hcaptcha: Only pass hmt_id cookie to upstream [puppet] - https://gerrit.wikimedia.org/r/1181128 (https://phabricator.wikimedia.org/T402713)
[08:41:59] (CR) Máté Szabó: hcaptcha: Only pass hmt_id cookie to upstream (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1181128 (https://phabricator.wikimedia.org/T402713) (owner: Máté Szabó)
[08:43:03] (CR) Muehlenhoff: [C:+2] Failover webproxy in ulsfo to new node [dns] - https://gerrit.wikimedia.org/r/1181096 (https://phabricator.wikimedia.org/T396487) (owner: Muehlenhoff)
[08:43:08] !log jmm@dns1004 START - running authdns-update
[08:43:16] (CR) Vgutierrez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1181131 (owner: Vgutierrez)
[08:43:42] (CR) CI reject: [V:-1] hcaptcha: Only pass hmt_id cookie to upstream [puppet] - https://gerrit.wikimedia.org/r/1181128 (https://phabricator.wikimedia.org/T402713) (owner: Máté Szabó)
[08:44:15] !log jmm@dns1004 END - running authdns-update
[08:45:25] (PS2) Vgutierrez: haproxy: Provide basic X-Analytics data for blocked requests [puppet] - https://gerrit.wikimedia.org/r/1181131
[08:46:46] (PS5) Máté Szabó: hcaptcha: Only pass hmt_id cookie to upstream [puppet] - https://gerrit.wikimedia.org/r/1181128 (https://phabricator.wikimedia.org/T402713)
[08:46:59] (CR) Cyndywikime: "Thanks Martin, you are right!Created the task here : https://phabricator.wikimedia.org/T402769 😊" [mediawiki-config] - https://gerrit.wikimedia.org/r/1179648 (https://phabricator.wikimedia.org/T395524) (owner: Cyndywikime)
[08:47:53] (PS3) Cyndywikime: [Growth] enwiki: Deploy "Add a link" to 100% of users [mediawiki-config] - https://gerrit.wikimedia.org/r/1179648 (https://phabricator.wikimedia.org/T395524)
[08:53:52] ops-magru: Alert for device ps1-b3-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T402581#11114056 (phaultfinder)
[08:54:14] (CR) Filippo Giunchedi: [C:+2] openstack: clarify libvirtd debug levels [puppet] - https://gerrit.wikimedia.org/r/1180535 (https://phabricator.wikimedia.org/T355145) (owner: Filippo Giunchedi)
[08:54:19] (CR) Filippo Giunchedi: [C:+2] openstack: enable cfssl certs for libvirt in eqiad1 [puppet] - https://gerrit.wikimedia.org/r/1180556 (https://phabricator.wikimedia.org/T355145) (owner: Filippo Giunchedi)
[08:56:13] (CR) Slyngshede: [C:+1] "Looks good. Syntax tested in local test environment." [puppet] - https://gerrit.wikimedia.org/r/1181131 (owner: Vgutierrez)
[08:56:34] PROBLEM - nova-compute proc minimum on cloudvirt1070 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[08:56:46] PROBLEM - nova-compute proc minimum on cloudvirt1071 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[08:56:54] PROBLEM - nova-compute proc minimum on cloudvirt1069 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[08:56:58] PROBLEM - nova-compute proc minimum on cloudvirt1040 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[08:57:42] PROBLEM - nova-compute proc minimum on cloudvirt1059 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[08:57:58] PROBLEM - nova-compute proc minimum on cloudvirtlocal1003 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[08:58:54] ops-magru: Alert for device ps1-b4-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T402582#11114087 (phaultfinder)
[08:58:58] RECOVERY - nova-compute proc minimum on cloudvirt1040 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:00:16] that's me
[09:00:18] PROBLEM - nova-compute proc minimum on cloudvirt1056 is
CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [09:00:34] PROBLEM - nova-compute proc maximum on cloudvirt1070 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [09:00:42] PROBLEM - nova-compute proc maximum on cloudvirt1071 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [09:00:58] PROBLEM - nova-compute proc maximum on cloudvirt1069 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [09:01:22] PROBLEM - nova-compute proc minimum on cloudvirt1051 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [09:01:52] PROBLEM - nova-compute proc maximum on cloudvirtlocal1003 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [09:01:54] PROBLEM - nova-compute proc maximum on cloudvirt1059 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [09:03:52] PROBLEM - nova-compute proc maximum on cloudvirt1056 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [09:04:42] (03CR) 10Vgutierrez: [C:03+2] haproxy: Provide basic X-Analytics data for 
blocked requests [puppet] - 10https://gerrit.wikimedia.org/r/1181131 (owner: 10Vgutierrez) [09:04:42] RECOVERY - nova-compute proc minimum on cloudvirt1059 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [09:04:54] RECOVERY - nova-compute proc maximum on cloudvirt1059 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [09:05:42] PROBLEM - nova-compute proc maximum on cloudvirt1051 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [09:07:17] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1181505 (https://phabricator.wikimedia.org/T396864) (owner: 10Ayounsi) [09:08:54] PROBLEM - nova-compute proc minimum on cloudvirt1063 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [09:09:18] PROBLEM - nova-compute proc minimum on cloudvirt1046 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [09:09:18] RECOVERY - nova-compute proc minimum on cloudvirt1056 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [09:09:19] PROBLEM - nova-compute proc minimum on cloudvirt1062 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [09:09:22] PROBLEM - nova-compute proc minimum on cloudvirt1045 is CRITICAL: PROCS 
CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [09:09:22] PROBLEM - nova-compute proc minimum on cloudvirt1050 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [09:09:23] RECOVERY - nova-compute proc minimum on cloudvirt1051 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [09:09:24] PROBLEM - nova-compute proc minimum on cloudvirt1058 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [09:09:25] PROBLEM - nova-compute proc minimum on cloudvirt1053 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [09:09:26] PROBLEM - nova-compute proc minimum on cloudvirt1076 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [09:09:27] PROBLEM - nova-compute proc minimum on cloudvirt1064 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [09:09:28] PROBLEM - nova-compute proc minimum on cloudvirt1054 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [09:09:29] PROBLEM - nova-compute proc minimum on cloudvirt1052 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute 
https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [09:09:30] PROBLEM - nova-compute proc minimum on cloudvirt1057 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [09:09:31] PROBLEM - nova-compute proc minimum on cloudvirt1068 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [09:09:32] PROBLEM - nova-compute proc minimum on cloudvirt1047 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [09:09:33] PROBLEM - nova-compute proc minimum on cloudvirt1074 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [09:09:34] PROBLEM - nova-compute proc minimum on cloudvirt1065 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [09:09:35] PROBLEM - nova-compute proc minimum on cloudvirt1049 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [09:09:36] PROBLEM - nova-compute proc minimum on cloudvirt1043 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [09:09:37] RECOVERY - nova-compute proc maximum on cloudvirt1070 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [09:09:38] 
PROBLEM - nova-compute proc minimum on cloudvirt1075 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [09:09:39] PROBLEM - nova-compute proc minimum on cloudvirt1041 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [09:09:40] RECOVERY - nova-compute proc minimum on cloudvirt1070 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [09:09:42] PROBLEM - nova-compute proc minimum on cloudvirt1042 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [09:09:42] PROBLEM - nova-compute proc minimum on cloudvirt1073 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [09:09:43] RECOVERY - nova-compute proc maximum on cloudvirt1071 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [09:09:44] RECOVERY - nova-compute proc maximum on cloudvirt1051 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [09:09:45] PROBLEM - nova-compute proc minimum on cloudvirt1067 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [09:09:47] RECOVERY - nova-compute proc minimum on cloudvirt1071 is OK: PROCS OK: 1 process with regex args 
^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [09:09:54] RECOVERY - nova-compute proc maximum on cloudvirt1056 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [09:09:54] RECOVERY - nova-compute proc minimum on cloudvirt1069 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [09:09:58] PROBLEM - nova-compute proc minimum on cloudvirt1061 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [09:09:58] RECOVERY - nova-compute proc maximum on cloudvirt1069 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [09:09:59] PROBLEM - nova-compute proc minimum on cloudvirt1072 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [09:10:02] PROBLEM - nova-compute proc minimum on cloudvirt1048 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [09:10:23] should be recovering fully soon [09:10:30] RECOVERY - nova-compute proc minimum on cloudvirt1065 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [09:10:59] (03PS1) 10Máté Szabó: hcaptcha: Implement more restrictive CSP [puppet] - 10https://gerrit.wikimedia.org/r/1181644 [09:11:22] RECOVERY - nova-compute proc minimum on cloudvirt1076 is OK: 
PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [09:11:25] (03CR) 10CI reject: [V:04-1] hcaptcha: Implement more restrictive CSP [puppet] - 10https://gerrit.wikimedia.org/r/1181644 (owner: 10Máté Szabó) [09:11:26] PROBLEM - nova-compute proc minimum on cloudvirtlocal1002 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [09:11:27] RECOVERY - nova-compute proc minimum on cloudvirt1054 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [09:11:28] RECOVERY - nova-compute proc minimum on cloudvirt1068 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [09:11:28] RECOVERY - nova-compute proc minimum on cloudvirt1074 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [09:11:34] RECOVERY - nova-compute proc minimum on cloudvirt1049 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [09:11:35] RECOVERY - nova-compute proc minimum on cloudvirt1043 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [09:11:35] RECOVERY - nova-compute proc minimum on cloudvirt1075 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [09:11:42] RECOVERY - nova-compute proc minimum on cloudvirt1042 is OK: PROCS OK: 1 
process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [09:11:54] RECOVERY - nova-compute proc minimum on cloudvirt1063 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [09:11:58] RECOVERY - nova-compute proc minimum on cloudvirt1061 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [09:11:58] RECOVERY - nova-compute proc minimum on cloudvirt1072 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [09:12:02] RECOVERY - nova-compute proc minimum on cloudvirt1048 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [09:12:18] RECOVERY - nova-compute proc minimum on cloudvirt1046 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [09:12:18] RECOVERY - nova-compute proc minimum on cloudvirt1062 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [09:12:22] RECOVERY - nova-compute proc minimum on cloudvirt1053 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [09:12:22] RECOVERY - nova-compute proc minimum on cloudvirt1050 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [09:12:23] RECOVERY - nova-compute proc minimum 
on cloudvirt1058 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [09:12:24] RECOVERY - nova-compute proc minimum on cloudvirt1045 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [09:12:25] RECOVERY - nova-compute proc minimum on cloudvirt1064 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [09:12:26] RECOVERY - nova-compute proc minimum on cloudvirt1052 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [09:12:28] RECOVERY - nova-compute proc minimum on cloudvirt1057 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [09:12:28] RECOVERY - nova-compute proc minimum on cloudvirt1047 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [09:12:34] RECOVERY - nova-compute proc minimum on cloudvirt1041 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [09:12:42] RECOVERY - nova-compute proc minimum on cloudvirt1073 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [09:12:42] RECOVERY - nova-compute proc minimum on cloudvirt1067 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [09:12:53] 
(03PS2) 10Máté Szabó: hcaptcha: Implement more restrictive CSP [puppet] - 10https://gerrit.wikimedia.org/r/1181644 [09:13:51] (03PS6) 10Máté Szabó: hcaptcha: Only pass hmt_id cookie to upstream [puppet] - 10https://gerrit.wikimedia.org/r/1181128 (https://phabricator.wikimedia.org/T402713) [09:14:46] PROBLEM - nova-compute proc maximum on cloudvirtlocal1002 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [09:14:54] PROBLEM - nova-compute proc minimum on cloudvirtlocal1001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [09:16:36] (03CR) 10Máté Szabó: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1181128 (https://phabricator.wikimedia.org/T402713) (owner: 10Máté Szabó) [09:17:01] (03PS3) 10Máté Szabó: hcaptcha: Implement more restrictive CSP [puppet] - 10https://gerrit.wikimedia.org/r/1181644 [09:17:12] (03CR) 10Máté Szabó: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1181644 (owner: 10Máté Szabó) [09:18:00] RECOVERY - nova-compute proc minimum on cloudvirtlocal1003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [09:18:26] RECOVERY - nova-compute proc minimum on cloudvirtlocal1002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [09:18:46] RECOVERY - nova-compute proc maximum on cloudvirtlocal1002 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [09:18:54] RECOVERY - nova-compute proc maximum on cloudvirtlocal1003 is OK: PROCS 
OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [09:18:55] RECOVERY - nova-compute proc minimum on cloudvirtlocal1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [09:19:50] (03PS1) 10Muehlenhoff: Point webproxy in eqsin to install5003 [dns] - 10https://gerrit.wikimedia.org/r/1181645 (https://phabricator.wikimedia.org/T396487) [09:19:52] (03PS1) 10Muehlenhoff: Apply installserver role to install5003 [puppet] - 10https://gerrit.wikimedia.org/r/1181646 (https://phabricator.wikimedia.org/T396487) [09:19:53] (03PS4) 10Máté Szabó: hcaptcha: Implement more restrictive CSP [puppet] - 10https://gerrit.wikimedia.org/r/1181644 [09:19:54] (03PS1) 10Muehlenhoff: Failover DHCP server in eqsin to install5003 [puppet] - 10https://gerrit.wikimedia.org/r/1181647 (https://phabricator.wikimedia.org/T396487) [09:20:24] (03CR) 10Máté Szabó: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1181644 (owner: 10Máté Szabó) [09:22:38] (03PS1) 10Vgutierrez: haproxy: Fix x-analytics for requestctl blocked requests [puppet] - 10https://gerrit.wikimedia.org/r/1181648 [09:25:23] (03CR) 10Slyngshede: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1181648 (owner: 10Vgutierrez) [09:25:31] (03CR) 10Vgutierrez: [C:03+2] haproxy: Fix x-analytics for requestctl blocked requests [puppet] - 10https://gerrit.wikimedia.org/r/1181648 (owner: 10Vgutierrez) [09:29:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.75s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - 
https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [09:32:34] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2179.codfw.wmnet with reason: Maintenance [09:32:41] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2179 (T399249)', diff saved to https://phabricator.wikimedia.org/P81734 and previous config saved to /var/cache/conftool/dbconfig/20250825-093241-fceratto.json [09:32:46] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [09:34:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.75s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [09:35:45] (03PS1) 10Muehlenhoff: Update install server in ulsfo [homer/public] - 10https://gerrit.wikimedia.org/r/1181651 [09:36:19] (03PS1) 10Muehlenhoff: Update install server in eqsin [homer/public] - 10https://gerrit.wikimedia.org/r/1181652 [09:37:19] (03CR) 10Ayounsi: [C:03+1] Failover DHCP server in eqsin to install5003 [puppet] - 10https://gerrit.wikimedia.org/r/1181647 (https://phabricator.wikimedia.org/T396487) (owner: 10Muehlenhoff) [09:37:39] (03CR) 10Vgutierrez: [C:03+1] P:cache::haproxy block generic user-agents [puppet] - 10https://gerrit.wikimedia.org/r/1181639 (https://phabricator.wikimedia.org/T400119) (owner: 10Slyngshede) [09:37:40] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:37:49] (03CR) 10Ayounsi: 
[C:03+1] Update install server in ulsfo [homer/public] - 10https://gerrit.wikimedia.org/r/1181651 (owner: 10Muehlenhoff) [09:38:04] (03CR) 10Ayounsi: [C:03+1] Update install server in eqsin [homer/public] - 10https://gerrit.wikimedia.org/r/1181652 (owner: 10Muehlenhoff) [09:41:41] (03CR) 10Muehlenhoff: [C:03+2] Update install server in ulsfo [homer/public] - 10https://gerrit.wikimedia.org/r/1181651 (owner: 10Muehlenhoff) [09:41:48] (03PS1) 10Brouberol: provision the mysql analytics research password in the analytics-ml HDFS home [puppet] - 10https://gerrit.wikimedia.org/r/1181653 (https://phabricator.wikimedia.org/T398950) [09:41:57] (03PS13) 10Arnaudb: gerrit: mod qos configuration [puppet] - 10https://gerrit.wikimedia.org/r/1181124 (https://phabricator.wikimedia.org/T402611) [09:41:57] (03CR) 10Arnaudb: "this change adds mod_qos to gerrit's httpd reverse proxy configuration" [puppet] - 10https://gerrit.wikimedia.org/r/1181124 (https://phabricator.wikimedia.org/T402611) (owner: 10Arnaudb) [09:43:33] (03PS1) 10Zabe: Set categorylinks to read new on commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1181655 (https://phabricator.wikimedia.org/T397912) [09:44:00] (03CR) 10Slyngshede: [C:03+2] P:cache::haproxy block generic user-agents [puppet] - 10https://gerrit.wikimedia.org/r/1181639 (https://phabricator.wikimedia.org/T400119) (owner: 10Slyngshede) [09:44:17] (03CR) 10Brouberol: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1181653 (https://phabricator.wikimedia.org/T398950) (owner: 10Brouberol) [09:46:09] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-ml: apply [09:46:41] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-ml: apply [09:47:26] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-ml: apply [09:48:38] 10ops-eqiad, 06SRE, 06DC-Ops: Supermicro incorrectly exposing LinkStatus in Redfish - 
https://phabricator.wikimedia.org/T400034#11114185 (10ayounsi) Thanks, fyi with that firmware I replied the following: > The LinkStatus is now missing : > > {'@odata.etag': '"12722e886e91fe533b80e55b5bbd72ee"', >... [09:48:49] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-ml: apply [09:48:53] jouncebot: nowandnext [09:48:53] No deployments scheduled for the next 0 hour(s) and 11 minute(s) [09:48:53] In 0 hour(s) and 11 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250825T1000) [09:49:08] 06SRE, 10SRE-Access-Requests: Requesting access to Superset dashboards for mszwarc - https://phabricator.wikimedia.org/T402779 (10mszwarc) 03NEW [09:49:25] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-ml: apply [09:49:31] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-ml: apply [09:50:08] jouncebot: nowandnext [09:50:08] No deployments scheduled for the next 0 hour(s) and 9 minute(s) [09:50:08] In 0 hour(s) and 9 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250825T1000) [09:50:14] (03CR) 10Zabe: [C:03+2] Set categorylinks to read new on commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1181655 (https://phabricator.wikimedia.org/T397912) (owner: 10Zabe) [09:50:16] zabe: trying to deploy sth atm [09:50:24] alright [09:50:29] (03CR) 10Urbanecm: [C:04-2] "will conflict with my deployment" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1181655 (https://phabricator.wikimedia.org/T397912) (owner: 10Zabe) [09:50:29] will wait [09:50:31] ty [09:50:33] (03PS1) 10Brouberol: airflow-ml: fix typo in storage quantity and storage class name [deployment-charts] - 10https://gerrit.wikimedia.org/r/1181656 (https://phabricator.wikimedia.org/T396495) [09:50:36] (03CR) 10Urbanecm: Set categorylinks to 
read new on commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1181655 (https://phabricator.wikimedia.org/T397912) (owner: 10Zabe) [09:52:00] 06SRE, 10SRE-Access-Requests: Requesting access to Superset dashboards for mszwarc - https://phabricator.wikimedia.org/T402779#11114199 (10OKryva-WMF) Approve as a Marcin's EM. [09:52:02] !log urbanecm@deploy1003 Started scap sync-world: Deploying a security patch (T402698, T402600) [09:52:33] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:52:55] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts install4002.wikimedia.org [09:54:08] !log urbanecm@deploy1003 Finished scap sync-world: Deploying a security patch (T402698, T402600) (duration: 02m 06s) [09:54:18] verifying [09:55:16] (03PS1) 10Hnowlan: thumbor: use native subprocess timeouts [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1181657 (https://phabricator.wikimedia.org/T379569) [09:55:23] (03CR) 10CI reject: [V:04-1] thumbor: use native subprocess timeouts [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1181657 (https://phabricator.wikimedia.org/T379569) (owner: 10Hnowlan) [09:55:59] (03CR) 10Brouberol: [C:03+2] airflow-ml: fix typo in storage quantity and storage class name [deployment-charts] - 10https://gerrit.wikimedia.org/r/1181656 (https://phabricator.wikimedia.org/T396495) (owner: 10Brouberol) [09:57:12] hmm...i am doing something wrong [09:57:33] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - 
https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:57:43] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [09:58:13] (03PS1) 10Brouberol: airflow-ml: drop -pvc suffix in the PVC name [deployment-charts] - 10https://gerrit.wikimedia.org/r/1181658 (https://phabricator.wikimedia.org/T396495) [09:58:21] (03PS3) 10Tiziano Fogli: nrpe2nodexp: remove file under node.d for disabled checks [puppet] - 10https://gerrit.wikimedia.org/r/1181650 (https://phabricator.wikimedia.org/T395446) [09:58:21] (03CR) 10Tiziano Fogli: "It ensures that the cache of any disabled check is removed from /var/lib/prometheus/node.d. This prevents the NodeTextfileStale alert from" [puppet] - 10https://gerrit.wikimedia.org/r/1181650 (https://phabricator.wikimedia.org/T395446) (owner: 10Tiziano Fogli) [09:58:25] (03CR) 10Vgutierrez: [C:03+2] benthos: Verify TLS cert of kafka brokers on webrequest_live [puppet] - 10https://gerrit.wikimedia.org/r/1181101 (https://phabricator.wikimedia.org/T291905) (owner: 10Vgutierrez) [10:00:04] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250825T1000) [10:00:08] (03CR) 10Ozge: [C:03+1] provision the mysql analytics research password in the analytics-ml HDFS home [puppet] - 10https://gerrit.wikimedia.org/r/1181653 (https://phabricator.wikimedia.org/T398950) (owner: 10Brouberol) [10:00:21] (03CR) 10Brouberol: [C:03+2] provision the mysql analytics research password in the analytics-ml HDFS home [puppet] - 10https://gerrit.wikimedia.org/r/1181653 (https://phabricator.wikimedia.org/T398950) (owner: 10Brouberol) [10:00:45] !log urbanecm@deploy1003 Started scap sync-world: Deploying a security patch (T402698, T402600) [10:00:46] one more time [10:01:38] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox 
hiera data: "Triggered by cookbooks.sre.dns.netbox: install4002.wikimedia.org decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [10:02:01] (03PS4) 10Tiziano Fogli: nrpe2nodexp: remove file under node.d for disabled checks [puppet] - 10https://gerrit.wikimedia.org/r/1181650 (https://phabricator.wikimedia.org/T395446) [10:02:16] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: install4002.wikimedia.org decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [10:02:16] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:02:18] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts install4002.wikimedia.org [10:02:27] 06SRE, 06Infrastructure-Foundations, 13Patch-For-Review: Move install servers to Bookworm - https://phabricator.wikimedia.org/T396487#11114237 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `install4002.wikimedia.org` - install4002.wikimedia.org (**PASS**) - Do... 
[10:05:20] (03CR) 10Gkyziridis: [C:03+1] "Thank you" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1181658 (https://phabricator.wikimedia.org/T396495) (owner: 10Brouberol) [10:05:39] (03PS2) 10Volans: ServiceOps: simplify Phabricator usage [cookbooks] - 10https://gerrit.wikimedia.org/r/1167888 [10:05:45] (03CR) 10Brouberol: [C:03+2] airflow-ml: drop -pvc suffix in the PVC name [deployment-charts] - 10https://gerrit.wikimedia.org/r/1181658 (https://phabricator.wikimedia.org/T396495) (owner: 10Brouberol) [10:06:29] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-ml: apply [10:06:54] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-ml: apply [10:07:03] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-ml: apply [10:07:09] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-ml: apply [10:08:12] (03PS1) 10Brouberol: airflow-ml: fix typo in PVC class name [deployment-charts] - 10https://gerrit.wikimedia.org/r/1181660 [10:08:57] (03CR) 10Hnowlan: [C:03+2] imagemagick: ignore all py3exiv2 exceptions [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1101154 (https://phabricator.wikimedia.org/T381594) (owner: 10AntiCompositeNumber) [10:09:07] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [10:09:34] (03PS2) 10Hnowlan: thumbor: use native subprocess timeouts [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1181657 (https://phabricator.wikimedia.org/T379569) [10:09:59] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
[10:15:51] !log urbanecm@deploy1003 Finished scap sync-world: Deploying a security patch (T402698, T402600) (duration: 15m 06s) [10:15:54] (03CR) 10Brouberol: [C:03+2] airflow-ml: fix typo in PVC class name [deployment-charts] - 10https://gerrit.wikimedia.org/r/1181660 (owner: 10Brouberol) [10:17:16] (03CR) 10Clément Goubert: [C:03+1] ServiceOps: simplify Phabricator usage [cookbooks] - 10https://gerrit.wikimedia.org/r/1167888 (owner: 10Volans) [10:20:07] (03Merged) 10jenkins-bot: imagemagick: ignore all py3exiv2 exceptions [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1101154 (https://phabricator.wikimedia.org/T381594) (owner: 10AntiCompositeNumber) [10:21:17] (03CR) 10CI reject: [V:04-1] thumbor: use native subprocess timeouts [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1181657 (https://phabricator.wikimedia.org/T379569) (owner: 10Hnowlan) [10:22:46] !log urbanecm@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-experimental: apply [10:23:36] (03CR) 10Filippo Giunchedi: [C:03+1] nrpe2nodexp: remove file under node.d for disabled checks [puppet] - 10https://gerrit.wikimedia.org/r/1181650 (https://phabricator.wikimedia.org/T395446) (owner: 10Tiziano Fogli) [10:24:50] !log urbanecm@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-experimental: apply [10:27:51] (03CR) 10Tiziano Fogli: [C:03+2] nrpe2nodexp: remove file under node.d for disabled checks [puppet] - 10https://gerrit.wikimedia.org/r/1181650 (https://phabricator.wikimedia.org/T395446) (owner: 10Tiziano Fogli) [10:30:47] !log installing postgresql-13 security updates [10:30:49] 06SRE, 06Infrastructure-Foundations, 13Patch-For-Review: Move install servers to Bookworm - https://phabricator.wikimedia.org/T396487#11114309 (10MoritzMuehlenhoff) [10:30:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:35:34] jouncebot: nowandnext [10:35:34] For the next 0 hour(s) and 24 minute(s): MediaWiki infrastructure (UTC mid-day) 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250825T1000) [10:35:34] In 2 hour(s) and 24 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250825T1300) [10:36:01] (03PS1) 10Kosta Harlan: hcaptcha: Instrument siteverify API call [extensions/ConfirmEdit] (wmf/1.45.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1181664 (https://phabricator.wikimedia.org/T402492) [10:36:15] (03PS1) 10Kosta Harlan: hCaptcha: Log errors to Logstash [extensions/ConfirmEdit] (wmf/1.45.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1181665 (https://phabricator.wikimedia.org/T402767) [10:36:57] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy1003 using scap backport" [extensions/ConfirmEdit] (wmf/1.45.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1181664 (https://phabricator.wikimedia.org/T402492) (owner: 10Kosta Harlan) [10:36:57] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy1003 using scap backport" [extensions/ConfirmEdit] (wmf/1.45.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1181665 (https://phabricator.wikimedia.org/T402767) (owner: 10Kosta Harlan) [10:42:18] (03PS1) 10Hnowlan: thumbor: new version with permissive pyexiv error handling [deployment-charts] - 10https://gerrit.wikimedia.org/r/1181666 (https://phabricator.wikimedia.org/T381594) [10:44:47] (03CR) 10Clément Goubert: [C:03+1] thumbor: new version with permissive pyexiv error handling [deployment-charts] - 10https://gerrit.wikimedia.org/r/1181666 (https://phabricator.wikimedia.org/T381594) (owner: 10Hnowlan) [10:45:29] (03CR) 10Hnowlan: [C:03+2] thumbor: new version with permissive pyexiv error handling [deployment-charts] - 10https://gerrit.wikimedia.org/r/1181666 (https://phabricator.wikimedia.org/T381594) (owner: 10Hnowlan) [10:46:45] FIRING: WidespreadPuppetFailure: Puppet has failed in ulsfo - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - 
https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [10:47:27] (03Merged) 10jenkins-bot: thumbor: new version with permissive pyexiv error handling [deployment-charts] - 10https://gerrit.wikimedia.org/r/1181666 (https://phabricator.wikimedia.org/T381594) (owner: 10Hnowlan) [10:47:48] !log installing openjdk-17 security updates [10:47:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:48:05] ^ the alert for ulsfo is expected and will recover shortly [10:49:10] (03Merged) 10jenkins-bot: hcaptcha: Instrument siteverify API call [extensions/ConfirmEdit] (wmf/1.45.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1181664 (https://phabricator.wikimedia.org/T402492) (owner: 10Kosta Harlan) [10:49:45] (03PS1) 10Hnowlan: thumbor: remove staging version pin [deployment-charts] - 10https://gerrit.wikimedia.org/r/1181667 [10:50:01] (03Merged) 10jenkins-bot: hCaptcha: Log errors to Logstash [extensions/ConfirmEdit] (wmf/1.45.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1181665 (https://phabricator.wikimedia.org/T402767) (owner: 10Kosta Harlan) [10:50:21] !log kharlan@deploy1003 Started scap sync-world: Backport for [[gerrit:1181664|hcaptcha: Instrument siteverify API call (T402492)]], [[gerrit:1181665|hCaptcha: Log errors to Logstash (T402767)]] [10:50:27] T402492: hCaptcha: Instrument call to /siteverify - https://phabricator.wikimedia.org/T402492 [10:50:27] T402767: hCaptcha: Log hCaptcha error codes to Logstash - https://phabricator.wikimedia.org/T402767 [10:50:51] (03CR) 10Muehlenhoff: [C:03+2] Apply installserver role to install5003 [puppet] - 10https://gerrit.wikimedia.org/r/1181646 (https://phabricator.wikimedia.org/T396487) (owner: 10Muehlenhoff) [10:52:20] (03CR) 10Hnowlan: [C:03+2] thumbor: remove staging version pin [deployment-charts] - 10https://gerrit.wikimedia.org/r/1181667 (owner: 10Hnowlan) [10:53:56] (03Merged) 10jenkins-bot: thumbor: remove staging version pin [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1181667 (owner: 10Hnowlan) [10:54:47] !log hnowlan@deploy1003 helmfile [staging] START helmfile.d/services/thumbor: apply [10:54:54] !log hnowlan@deploy1003 helmfile [staging] DONE helmfile.d/services/thumbor: apply [10:56:32] !log kharlan@deploy1003 kharlan: Backport for [[gerrit:1181664|hcaptcha: Instrument siteverify API call (T402492)]], [[gerrit:1181665|hCaptcha: Log errors to Logstash (T402767)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [10:56:41] T402492: hCaptcha: Instrument call to /siteverify - https://phabricator.wikimedia.org/T402492 [10:56:41] T402767: hCaptcha: Log hCaptcha error codes to Logstash - https://phabricator.wikimedia.org/T402767 [10:57:56] FIRING: [4x] KubernetesRsyslogDown: rsyslog on dse-k8s-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [10:58:07] !log hnowlan@deploy1003 helmfile [codfw] START helmfile.d/services/thumbor: apply [10:58:16] !log kharlan@deploy1003 kharlan: Continuing with sync [11:01:57] !log hnowlan@deploy1003 helmfile [codfw] DONE helmfile.d/services/thumbor: apply [11:03:22] (03PS1) 10Urbanecm: [Growth] wikidata: Preconfigure for limited Growth features release [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1181669 (https://phabricator.wikimedia.org/T400937) [11:04:34] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [11:04:34] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - 
https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [11:04:48] !log kharlan@deploy1003 Finished scap sync-world: Backport for [[gerrit:1181664|hcaptcha: Instrument siteverify API call (T402492)]], [[gerrit:1181665|hCaptcha: Log errors to Logstash (T402767)]] (duration: 14m 26s) [11:04:54] T402492: hCaptcha: Instrument call to /siteverify - https://phabricator.wikimedia.org/T402492 [11:04:54] T402767: hCaptcha: Log hCaptcha error codes to Logstash - https://phabricator.wikimedia.org/T402767 [11:09:23] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/thumbor: apply [11:11:05] PROBLEM - HTTP on install5003 is CRITICAL: connect to address 103.102.166.11 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Install_servers [11:12:16] ^ install5003 is WIP, will resolve soon [11:12:25] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10Mail, and 2 others: Replace Spamassassin with Rspam for VRTS on Postfix - https://phabricator.wikimedia.org/T402260#11114408 (10ABran-WMF) [11:12:51] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/thumbor: apply [11:14:54] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr2-esams and LibertyGlobal (2001:730:2209:1::d52e:ba09) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [11:15:05] RECOVERY - HTTP on install5003 is OK: HTTP OK: HTTP/1.1 200 OK - 848 bytes in 0.449 second response time https://wikitech.wikimedia.org/wiki/Install_servers [11:15:06] FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-esams:xe-0/1/7 (Transit: Liberty Global (BB00088) {#021468}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - 
https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-esams:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [11:16:45] RESOLVED: WidespreadPuppetFailure: Puppet has failed in ulsfo - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [11:21:03] (03CR) 10Muehlenhoff: [C:03+2] Update install server in eqsin [homer/public] - 10https://gerrit.wikimedia.org/r/1181652 (owner: 10Muehlenhoff) [11:23:11] (03CR) 10Muehlenhoff: [C:03+2] Point webproxy in eqsin to install5003 [dns] - 10https://gerrit.wikimedia.org/r/1181645 (https://phabricator.wikimedia.org/T396487) (owner: 10Muehlenhoff) [11:23:16] !log jmm@dns1004 START - running authdns-update [11:24:24] !log jmm@dns1004 END - running authdns-update [11:24:57] (03PS1) 10Clément Goubert: mw_experimental: Fix PuppetConstantChange alert [puppet] - 10https://gerrit.wikimedia.org/r/1181673 (https://phabricator.wikimedia.org/T396767) [11:29:34] FIRING: NetworkDeviceAlarmActive: Alarm active on ssw1-f1-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [11:33:38] FIRING: [2x] ProbeDown: Ripe Atlas anchor atlas5001:80 is not returning HTTP 200 OK on port 80 - https://wikitech.wikimedia.org/wiki/RIPE_Atlas#HTTP_checks_failing - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:36:18] (03PS3) 10FNegri: maintain-views: Stop providing rc_new and rc_type to replicas [puppet] - 10https://gerrit.wikimedia.org/r/1178899 (https://phabricator.wikimedia.org/T36320) (owner: 10Zabe) 
[11:37:03] (03PS1) 10Máté Szabó: hcaptcha: Add proxied CSP reporting endpoint [puppet] - 10https://gerrit.wikimedia.org/r/1181675 [11:38:07] (03CR) 10Muehlenhoff: [C:03+2] Failover DHCP server in eqsin to install5003 [puppet] - 10https://gerrit.wikimedia.org/r/1181647 (https://phabricator.wikimedia.org/T396487) (owner: 10Muehlenhoff) [11:41:18] (03PS2) 10Máté Szabó: hcaptcha: Add proxied CSP reporting endpoint [puppet] - 10https://gerrit.wikimedia.org/r/1181675 [11:41:36] (03CR) 10Máté Szabó: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1181675 (owner: 10Máté Szabó) [11:45:58] (03CR) 10Clément Goubert: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1181673 (https://phabricator.wikimedia.org/T396767) (owner: 10Clément Goubert) [11:56:01] (03PS1) 10Filippo Giunchedi: openstack: switch libvirt live migration uri to cloud-private hostnames [puppet] - 10https://gerrit.wikimedia.org/r/1181676 (https://phabricator.wikimedia.org/T355145) [11:57:05] (03CR) 10Filippo Giunchedi: "Tested in codfw1dev for correctness, live migration happens over the expected hostnames, e.g." 
[puppet] - 10https://gerrit.wikimedia.org/r/1181676 (https://phabricator.wikimedia.org/T355145) (owner: 10Filippo Giunchedi) [11:58:39] RESOLVED: [2x] ProbeDown: Ripe Atlas anchor atlas5001:80 is not returning HTTP 200 OK on port 80 - https://wikitech.wikimedia.org/wiki/RIPE_Atlas#HTTP_checks_failing - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:01:53] 10SRE-swift-storage, 10Thumbor: Gradually drop all thumbnails as a one-off clean up - https://phabricator.wikimedia.org/T379942#11114554 (10Ladsgroup) [12:03:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 2.098s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [12:04:45] FIRING: WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [12:07:25] FIRING: [2x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:08:47] (03CR) 10Muehlenhoff: [C:03+2] Blacklist orangefs [puppet] - 10https://gerrit.wikimedia.org/r/1181110 (owner: 10Muehlenhoff) [12:09:31] PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly 
https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [12:15:05] (03PS1) 10Slyngshede: P:cache::varnish::frontend user-agent rate limit cleanup [puppet] - 10https://gerrit.wikimedia.org/r/1181679 (https://phabricator.wikimedia.org/T400119) [12:18:16] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.615s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [12:23:07] !log Restarted CI Jenkins to update some plugins [12:23:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:28:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.068s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [12:28:46] (03CR) 10Ayounsi: [C:03+2] Routed ganeti: improve firewalling [puppet] - 10https://gerrit.wikimedia.org/r/1180579 (https://phabricator.wikimedia.org/T402372) (owner: 10Ayounsi) [12:33:00] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1181687 [12:34:45] RESOLVED: WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [12:36:05] (03CR) 10Michael Große: [C:03+1] [Growth] wikidata: Preconfigure for limited Growth features release 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1181669 (https://phabricator.wikimedia.org/T400937) (owner: 10Urbanecm) [12:36:39] (03PS1) 10STran: mediawiki: Run CheckUser/revokeTemporaryAccountViewerGroup.php [puppet] - 10https://gerrit.wikimedia.org/r/1181689 (https://phabricator.wikimedia.org/T375115) [12:38:59] (03PS1) 10Bartosz Dziewoński: PHPSessionHandler: In warn mode, report the changed keys [core] (wmf/1.45.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1181695 (https://phabricator.wikimedia.org/T400668) [12:40:39] (03CR) 10STran: mediawiki: Run CheckUser/revokeTemporaryAccountViewerGroup.php (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1181689 (https://phabricator.wikimedia.org/T375115) (owner: 10STran) [12:43:16] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.449s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [12:43:42] (03PS1) 10Ayounsi: Routed ganeti: fix nftables typoes [puppet] - 10https://gerrit.wikimedia.org/r/1181696 (https://phabricator.wikimedia.org/T402372) [12:44:07] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1181696 (https://phabricator.wikimedia.org/T402372) (owner: 10Ayounsi) [12:44:27] (03PS1) 10Bartosz Dziewoński: Set wgPHPSessionHandling to 'warn' again [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1181697 (https://phabricator.wikimedia.org/T362324) [12:44:44] (03PS2) 10Bartosz Dziewoński: Set wgPHPSessionHandling to 'warn' again [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1181697 (https://phabricator.wikimedia.org/T362324) [12:45:01] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, August 25 UTC 
afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [core] (wmf/1.45.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1181695 (https://phabricator.wikimedia.org/T400668) (owner: 10Bartosz Dziewoński) [12:45:12] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, August 25 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1181697 (https://phabricator.wikimedia.org/T362324) (owner: 10Bartosz Dziewoński) [12:51:02] (03CR) 10Ayounsi: [C:03+2] Routed ganeti: fix nftables typoes [puppet] - 10https://gerrit.wikimedia.org/r/1181696 (https://phabricator.wikimedia.org/T402372) (owner: 10Ayounsi) [12:55:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.392s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [12:55:27] (03CR) 10Slyngshede: [C:03+1] "Looks correct in local tests." [puppet] - 10https://gerrit.wikimedia.org/r/1181133 (owner: 10Vgutierrez) [12:55:39] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Install new disk controllers to SM swift backends (codfw) - https://phabricator.wikimedia.org/T400876#11114728 (10Jhancock.wm) I'll be in today to do these. was OoO last week. 
[12:58:53] (03PS3) 10Bartosz Dziewoński: Set wgPHPSessionHandling to 'warn' again [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1181697 (https://phabricator.wikimedia.org/T362324) [12:58:59] 10ops-magru: Alert for device ps1-b3-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T402581#11114735 (10phaultfinder) [13:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: Time to do the UTC afternoon backport window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250825T1300). [13:00:05] MatmaRex: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:16] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.29s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [13:00:41] hi [13:01:27] (03CR) 10Vgutierrez: [C:03+2] haproxy: Stop sending X-Analytics-TLS to varnish [puppet] - 10https://gerrit.wikimedia.org/r/1181133 (owner: 10Vgutierrez) [13:02:25] 10ops-eqiad, 06SRE, 06DC-Ops: Q4: eqiad: (4) PDUs for future fundraising racks - https://phabricator.wikimedia.org/T400779#11114753 (10Jclark-ctr) Connected Mgmt to m sw in rack E11 and console to msw2-eqiad [13:03:53] 10ops-magru: Alert for device ps1-b4-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T402582#11114754 (10phaultfinder) [13:04:01] o/ [13:05:12] I can deploy [13:05:44] MatmaRex: should the two changes be deployed together? 
[13:06:03] Lucas_WMDE: yes please [13:06:05] thanks :) [13:06:08] alright [13:06:42] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [core] (wmf/1.45.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1181695 (https://phabricator.wikimedia.org/T400668) (owner: 10Bartosz Dziewoński) [13:06:43] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1181697 (https://phabricator.wikimedia.org/T362324) (owner: 10Bartosz Dziewoński) [13:07:10] 10ops-eqiad, 06SRE, 06cloud-services-team, 06DC-Ops: Put cloudcephosd10[42-47] in service - https://phabricator.wikimedia.org/T401693#11114777 (10Andrew) [13:07:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.777s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [13:07:25] FIRING: [2x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:07:39] (03Merged) 10jenkins-bot: Set wgPHPSessionHandling to 'warn' again [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1181697 (https://phabricator.wikimedia.org/T362324) (owner: 10Bartosz Dziewoński) [13:09:31] RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:11:07] !log jmm@cumin2002 START - Cookbook 
sre.ganeti.drain-node for draining ganeti node ganeti2034.codfw.wmnet [13:11:07] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti2034.codfw.wmnet [13:11:21] 10ops-eqiad, 06SRE, 06DC-Ops: Q2:rack/setup/install cloudcephosd10[42-47] - https://phabricator.wikimedia.org/T378828#11114786 (10cmooney) >>! In T378828#11111862, @Andrew wrote: > This is getting very close! I still see ping failures with cloudcephosd1045, probably because the second network connection isn'... [13:11:29] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2034.codfw.wmnet [13:11:30] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti2034.codfw.wmnet [13:12:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.777s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [13:12:32] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2034.codfw.wmnet [13:13:19] (03Merged) 10jenkins-bot: PHPSessionHandler: In warn mode, report the changed keys [core] (wmf/1.45.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1181695 (https://phabricator.wikimedia.org/T400668) (owner: 10Bartosz Dziewoński) [13:13:36] !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1181695|PHPSessionHandler: In warn mode, report the changed keys (T400668)]], [[gerrit:1181697|Set wgPHPSessionHandling to 'warn' again (T362324)]] [13:13:42] T400668: Debug warnings that were recorded with $wgPHPSessionHandling = 'warn' in WMF production - https://phabricator.wikimedia.org/T400668 [13:13:42] 
T362324: Disable PHPSessionHandler in Wikimedia production - https://phabricator.wikimedia.org/T362324 [13:14:25] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [13:16:54] (03CR) 10Máté Szabó: "Boldly tagging Hugh for review per last week, please untag or delegate if the assignment is no longer appropriate!" [puppet] - 10https://gerrit.wikimedia.org/r/1181644 (owner: 10Máté Szabó) [13:17:02] (03CR) 10Máté Szabó: "Boldly tagging Hugh for review per last week, please untag or delegate if the assignment is no longer appropriate!" [puppet] - 10https://gerrit.wikimedia.org/r/1181675 (owner: 10Máté Szabó) [13:17:09] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:17:14] (03CR) 10Máté Szabó: "Boldly tagging Hugh for review per last week, please untag or delegate if the assignment is no longer appropriate!" [puppet] - 10https://gerrit.wikimedia.org/r/1181128 (https://phabricator.wikimedia.org/T402713) (owner: 10Máté Szabó) [13:18:06] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2034.codfw.wmnet [13:20:04] !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde, matmarex: Backport for [[gerrit:1181695|PHPSessionHandler: In warn mode, report the changed keys (T400668)]], [[gerrit:1181697|Set wgPHPSessionHandling to 'warn' again (T362324)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[13:20:10] T400668: Debug warnings that were recorded with $wgPHPSessionHandling = 'warn' in WMF production - https://phabricator.wikimedia.org/T400668 [13:20:10] T362324: Disable PHPSessionHandler in Wikimedia production - https://phabricator.wikimedia.org/T362324 [13:21:12] Lucas_WMDE: seems good [13:21:30] !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde, matmarex: Continuing with sync [13:21:30] yay [13:22:09] * Lucas_WMDE sees some INFO but no WARNING in mwdebug lostsah [13:22:12] *logstash [13:22:47] i'm not sure how to trigger the logging, i just checked that i can log in [13:23:05] PROBLEM - Squid on install1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/HTTP_proxy [13:23:20] ok [13:23:20] one case i know of involved being IP blocked and visiting a wiki where you don't have a local account, which is a bit complex [13:23:39] FIRING: [4x] ProbeDown: Ripe Atlas anchor atlas1001:80 is not returning HTTP 200 OK on port 80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:23:58] (03CR) 10Kosta Harlan: "Seems reasonable to me." 
[puppet] - 10https://gerrit.wikimedia.org/r/1181128 (https://phabricator.wikimedia.org/T402713) (owner: 10Máté Szabó) [13:24:05] looks like we have one real log entry already [13:24:21] https://logstash.wikimedia.org/goto/ba89d8cfaaadca0b6af9953af7b408e4 [13:24:33] (which is exactly the case i said, heh) [13:25:12] hehe neat [13:25:37] 06SRE, 06Infrastructure-Foundations, 10netops: Investigate using BGP addpath for unicast IBGP spine/leaf pods - https://phabricator.wikimedia.org/T402640#11114837 (10cmooney) [13:26:40] 06SRE, 06Traffic, 13Patch-For-Review, 07User-notice: Block traffic from user-agents not honoring our policy - https://phabricator.wikimedia.org/T400119#11114838 (10TheDJ) There's reports that this breaks command line download of mediawiki tarballs via https://releases.wikimedia.org/mediawiki/1.44/ That se... [13:26:50] !log lucaswerkmeister-wmde@deploy1003 Finished scap sync-world: Backport for [[gerrit:1181695|PHPSessionHandler: In warn mode, report the changed keys (T400668)]], [[gerrit:1181697|Set wgPHPSessionHandling to 'warn' again (T362324)]] (duration: 13m 14s) [13:26:56] T400668: Debug warnings that were recorded with $wgPHPSessionHandling = 'warn' in WMF production - https://phabricator.wikimedia.org/T400668 [13:26:56] T362324: Disable PHPSessionHandler in Wikimedia production - https://phabricator.wikimedia.org/T362324 [13:27:48] !log UTC afternoon backport+config window done [13:27:51] PROBLEM - SSH on install1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [13:27:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:11] thanks Lucas_WMDE [13:31:35] np :) [13:32:33] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1181185 (https://phabricator.wikimedia.org/T402317) (owner: 10Andrew Bogott) [13:32:55] PROBLEM - grafana-rw.wikimedia.org requires authentication on grafana1002 is CRITICAL: CRITICAL - Socket 
timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [13:34:16] grafana is down [13:35:51] RECOVERY - SSH on install1004 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [13:36:51] FIRING: ATSBackendErrorsHigh: ATS: elevated 5xx errors from wdqs-main.discovery.wmnet in esams #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=esams&var-cluster=text&var-origin=wdqs-main.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [13:36:56] ah [13:37:07] but well no, that's unrelated [13:37:10] I'm around [13:37:11] !incidents [13:37:12] 6699 (UNACKED) ATSBackendErrorsHigh cache_text sre (wdqs-main.discovery.wmnet esams) [13:37:12] 6646 (RESOLVED) db1238 (paged)/MariaDB Replica Lag: s4 (paged) [13:37:12] 6643 (RESOLVED) db1221 (paged)/MariaDB Replica Lag: s4 (paged) [13:37:12] 6642 (RESOLVED) db1243 (paged)/MariaDB Replica Lag: s4 (paged) [13:37:12] 6644 (RESOLVED) db1199 (paged)/MariaDB Replica Lag: s4 (paged) [13:37:13] 6651 (RESOLVED) db1242 (paged)/MariaDB Replica Lag: s4 (paged) [13:37:13] 6652 (RESOLVED) db2240 (paged)/MariaDB Replica Lag: s4 (paged) [13:37:13] 6650 (RESOLVED) db1190 (paged)/MariaDB Replica Lag: s4 (paged) [13:37:14] 6649 (RESOLVED) db1249 (paged)/MariaDB Replica Lag: s4 (paged) [13:37:14] 6648 (RESOLVED) db1247 (paged)/MariaDB Replica Lag: s4 (paged) [13:37:15] 6647 (RESOLVED) db1241 (paged)/MariaDB Replica Lag: s4 (paged) [13:37:15] 6645 (RESOLVED) db1248 (paged)/MariaDB Replica Lag: s4 (paged) [13:37:16] 6641 (RESOLVED) ProbeDown sre (10.2.2.24 ip4 thumbor:8800 probes/service http_thumbor_ip4 eqiad) [13:37:16] 6640 (RESOLVED) HaproxyUnavailable cache_upload global sre (thanos-rule) [13:37:17] !ack 6699 [13:37:38] Although if grafana is down that's gonna be hard to check -_- [13:37:39] around 
as well [13:37:55] grafana is slow but responds here [13:38:00] claime: it's slow but back [13:38:07] I take it back [13:38:14] tappof: grafana is still down [13:38:51] PROBLEM - SSH on install1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [13:39:29] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: Transition codfw data persistence external storage (es) hosts to 10G - https://phabricator.wikimedia.org/T399927#11114905 (10Jhancock.wm) @Ladsgroup i have two proposals for es2039. 1) we leave it where it is and use port 43 on the switch. It'll be using a port that wo... [13:39:33] PROBLEM - grafana-next-rw.wikimedia.org requires authentication on grafana1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [13:41:52] I've been able to hit https://grafana-next-rw.wikimedia.org if needed [13:42:29] (03PS1) 10Muehlenhoff: Make wmf-update-known-hosts-production compatible with enforcement of robot policy [debs/wmf-laptop] - 10https://gerrit.wikimedia.org/r/1181709 [13:43:55] (03CR) 10Kosta Harlan: "Looks like this was implemented in I3aff0a5be5a87fe01ee4f365b920d2c98e6e7cee" [puppet] - 10https://gerrit.wikimedia.org/r/1175876 (owner: 10Hnowlan) [13:44:23] RECOVERY - grafana-next-rw.wikimedia.org requires authentication on grafana1002 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 562 bytes in 0.054 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [13:44:26] sukhe: is back [13:44:38] thanks :) what was the secret sauce? [13:44:45] RECOVERY - grafana-rw.wikimedia.org requires authentication on grafana1002 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 552 bytes in 0.051 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [13:44:50] asking in case it happens again, did you restart the service or something? [13:45:52] sukhe: I’ve just restarted the Apache service. 
I’ll look into the reason. [13:46:27] thanks <3 [13:48:47] RECOVERY - SSH on install1004 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [13:48:47] MatmaRex: fyi I can see the warnings in logspam-watch, it’s now the top warning (above https://phabricator.wikimedia.org/T304960) but at a manageable volume (~350 hits in the last ~half hour) [13:48:55] RECOVERY - Squid on install1004 is OK: TCP OK - 0.000 second response time on 208.80.154.74 port 8080 https://wikitech.wikimedia.org/wiki/HTTP_proxy [13:49:11] 06SRE, 06Traffic, 13Patch-For-Review, 07User-notice: Block traffic from user-agents not honoring our policy - https://phabricator.wikimedia.org/T400119#11114961 (10Bugreporter) >>! In T400119#11114838, @TheDJ wrote: > There's reports that this breaks command line download of mediawiki tarballs via https://... [13:49:18] Looks like the 500s are wdqs timeouting on mwapi request calls [13:49:45] (03CR) 10Volans: [C:03+1] "LGTM" [debs/wmf-laptop] - 10https://gerrit.wikimedia.org/r/1181709 (owner: 10Muehlenhoff) [13:50:23] https://logstash.wikimedia.org/goto/7f6edb2400b275c9fea9e077ef4e5221 the uri_path seems to match [13:50:40] (03CR) 10Volans: [C:03+1] "LGTM" [homer/mock-private] - 10https://gerrit.wikimedia.org/r/1180557 (https://phabricator.wikimedia.org/T402511) (owner: 10Cathal Mooney) [13:51:49] (03CR) 10Volans: [C:03+1] "LGTM" [homer/public] - 10https://gerrit.wikimedia.org/r/1181117 (owner: 10Ayounsi) [13:51:51] RESOLVED: ATSBackendErrorsHigh: ATS: elevated 5xx errors from wdqs-main.discovery.wmnet in esams #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=esams&var-cluster=text&var-origin=wdqs-main.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [13:53:39] RESOLVED: [4x] ProbeDown: Ripe Atlas anchor atlas1001:80 is not 
returning HTTP 200 OK on port 80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:54:26] (03CR) 10Muehlenhoff: [V:03+2 C:03+2] Make wmf-update-known-hosts-production compatible with enforcement of robot policy [debs/wmf-laptop] - 10https://gerrit.wikimedia.org/r/1181709 (owner: 10Muehlenhoff) [13:55:43] (03PS1) 10Muehlenhoff: wmf-laptop: Update changefog for 1.0.3 [debs/wmf-laptop] - 10https://gerrit.wikimedia.org/r/1181716 [13:57:17] (03CR) 10Stevemunene: [C:03+1] mediawiki-dumps-legacy: adapt RBAC to a recent apache-airflow-providers-cncf-kubernetes upgrade [deployment-charts] - 10https://gerrit.wikimedia.org/r/1181519 (https://phabricator.wikimedia.org/T402529) (owner: 10Brouberol) [13:57:26] (03CR) 10Muehlenhoff: [V:03+2 C:03+2] wmf-laptop: Update changefog for 1.0.3 [debs/wmf-laptop] - 10https://gerrit.wikimedia.org/r/1181716 (owner: 10Muehlenhoff) [13:59:27] 10ops-eqiad, 06SRE, 06cloud-services-team, 06DC-Ops: Put cloudcephosd10[42-47] in service - https://phabricator.wikimedia.org/T401693#11115023 (10Andrew) [13:59:48] (03CR) 10Brouberol: [C:03+2] mediawiki-dumps-legacy: adapt RBAC to a recent apache-airflow-providers-cncf-kubernetes upgrade [deployment-charts] - 10https://gerrit.wikimedia.org/r/1181519 (https://phabricator.wikimedia.org/T402529) (owner: 10Brouberol) [14:00:11] (03CR) 10Brouberol: [C:03+1] Remove mention of an-druid100[1-2] [puppet] - 10https://gerrit.wikimedia.org/r/1180529 (https://phabricator.wikimedia.org/T401116) (owner: 10Stevemunene) [14:03:53] Lucas_WMDE: yes, that was expected, i have patches in progress that will resolve it soon [14:04:41] but i wanted to see if there are any lower-frequency warnings before we resolve them all [14:04:49] 👍 [14:04:53] and wanted to be able to compare the log voluime more easily [14:05:16] since the logs from the last time have almost rotated out of logstash already [14:06:10] 
jouncebot: nowandnext [14:06:10] No deployments scheduled for the next 0 hour(s) and 23 minute(s) [14:06:10] In 0 hour(s) and 23 minute(s): xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250825T1430) [14:08:09] (03CR) 10Zabe: [C:03+2] Set categorylinks to read new on commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1181655 (https://phabricator.wikimedia.org/T397912) (owner: 10Zabe) [14:08:56] (03Merged) 10jenkins-bot: Set categorylinks to read new on commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1181655 (https://phabricator.wikimedia.org/T397912) (owner: 10Zabe) [14:08:59] 06SRE, 06Infrastructure-Foundations: Move failoid to trixie - https://phabricator.wikimedia.org/T402406#11115040 (10MoritzMuehlenhoff) p:05Triage→03Medium a:03MoritzMuehlenhoff [14:09:25] !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1181655|Set categorylinks to read new on commonswiki (T397912)]] [14:09:30] T397912: Set categorylinks to read new - https://phabricator.wikimedia.org/T397912 [14:10:49] 06SRE, 06Traffic, 13Patch-For-Review, 07User-notice: Block traffic from user-agents not honoring our policy - https://phabricator.wikimedia.org/T400119#11115049 (10TheDJ) Yeah getting the swagger spec via `curl https://api.wikimedia.org/core/v1/wikipedia/en/search/page?q=earth&limit=10` also no longer work... [14:12:50] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6710/co" [puppet] - 10https://gerrit.wikimedia.org/r/1181134 (owner: 10Vgutierrez) [14:15:00] !log zabe@deploy1003 zabe: Backport for [[gerrit:1181655|Set categorylinks to read new on commonswiki (T397912)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[14:15:05] T397912: Set categorylinks to read new - https://phabricator.wikimedia.org/T397912 [14:15:23] (03CR) 10Scott French: [C:03+1] mw_experimental: Fix PuppetConstantChange alert [puppet] - 10https://gerrit.wikimedia.org/r/1181673 (https://phabricator.wikimedia.org/T396767) (owner: 10Clément Goubert) [14:15:34] (03CR) 10Clément Goubert: [C:03+2] mw_experimental: Fix PuppetConstantChange alert [puppet] - 10https://gerrit.wikimedia.org/r/1181673 (https://phabricator.wikimedia.org/T396767) (owner: 10Clément Goubert) [14:16:08] !log zabe@deploy1003 zabe: Continuing with sync [14:17:42] (03CR) 10Ssingh: dnsrecursor: add recursor.yml.erb (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1169156 (https://phabricator.wikimedia.org/T381608) (owner: 10CDobbins) [14:17:53] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6711/co" [puppet] - 10https://gerrit.wikimedia.org/r/1181134 (owner: 10Vgutierrez) [14:20:50] (03CR) 10Andrew Bogott: [C:03+1] "This looks good." 
[puppet] - 10https://gerrit.wikimedia.org/r/1181676 (https://phabricator.wikimedia.org/T355145) (owner: 10Filippo Giunchedi) [14:21:22] !log zabe@deploy1003 Finished scap sync-world: Backport for [[gerrit:1181655|Set categorylinks to read new on commonswiki (T397912)]] (duration: 11m 56s) [14:21:27] T397912: Set categorylinks to read new - https://phabricator.wikimedia.org/T397912 [14:21:55] (03PS1) 10Slyngshede: PCC: Add user-agent to PCC util [puppet] - 10https://gerrit.wikimedia.org/r/1181719 (https://phabricator.wikimedia.org/T400119) [14:22:20] (03CR) 10CI reject: [V:04-1] PCC: Add user-agent to PCC util [puppet] - 10https://gerrit.wikimedia.org/r/1181719 (https://phabricator.wikimedia.org/T400119) (owner: 10Slyngshede) [14:22:33] PROBLEM - Host ms-be2081 is DOWN: PING CRITICAL - Packet loss = 100% [14:22:58] (03PS2) 10Slyngshede: PCC: Add user-agent to PCC util [puppet] - 10https://gerrit.wikimedia.org/r/1181719 (https://phabricator.wikimedia.org/T400119) [14:23:34] (03PS1) 10Zabe: Stop writing to cl_to and cl_collation on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1181720 (https://phabricator.wikimedia.org/T399579) [14:25:19] (03CR) 10Vgutierrez: "given you only need to pass one cookie I think you could skip the map entirely and just use `proxy_set_header Cookie $http_cookie_hmt_id;`" [puppet] - 10https://gerrit.wikimedia.org/r/1181128 (https://phabricator.wikimedia.org/T402713) (owner: 10Máté Szabó) [14:26:01] RECOVERY - Host ms-be2081 is UP: PING OK - Packet loss = 0%, RTA = 30.44 ms [14:28:13] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: Transition codfw data persistence external storage (es) hosts to 10G - https://phabricator.wikimedia.org/T399927#11115141 (10ayounsi) Changing its rack would also allow us to change its IP to per rack vlans: https://wikitech.wikimedia.org/wiki/Vlan_migration [14:28:31] (03CR) 10Ssingh: dnsrecursor: add recursor.yml.erb (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1169156 
(https://phabricator.wikimedia.org/T381608) (owner: 10CDobbins) [14:28:58] (03CR) 10Ayounsi: [C:03+2] Add CI for python [homer/public] - 10https://gerrit.wikimedia.org/r/1181117 (owner: 10Ayounsi) [14:30:05] Deploy window xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250825T1430) [14:30:33] (03Merged) 10jenkins-bot: Add CI for python [homer/public] - 10https://gerrit.wikimedia.org/r/1181117 (owner: 10Ayounsi) [14:31:29] (03CR) 10Ozge: [C:03+1] "hello, thanks for working on this. We are looking forward to get approval for this patch in SRE IF meeting today. Please feel free to ask " [puppet] - 10https://gerrit.wikimedia.org/r/1180559 (https://phabricator.wikimedia.org/T400902) (owner: 10Brouberol) [14:34:56] (03PS1) 10Clément Goubert: Revert^2 "mw-cron: Disable memory limit" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1181722 [14:35:44] (03PS2) 10Clément Goubert: Revert^2 "mw-cron: Disable memory limit" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1181722 (https://phabricator.wikimedia.org/T395893) [14:36:10] (03CR) 10Vgutierrez: PCC: Add user-agent to PCC util (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1181719 (https://phabricator.wikimedia.org/T400119) (owner: 10Slyngshede) [14:37:43] (03CR) 10Clément Goubert: [C:03+2] Revert^2 "mw-cron: Disable memory limit" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1181722 (https://phabricator.wikimedia.org/T395893) (owner: 10Clément Goubert) [14:38:37] (03CR) 10Muehlenhoff: [C:03+1] "Approved in the weekly SRE IF meeting" [puppet] - 10https://gerrit.wikimedia.org/r/1180559 (https://phabricator.wikimedia.org/T400902) (owner: 10Brouberol) [14:38:58] (03PS3) 10Slyngshede: PCC: Add user-agent to PCC util [puppet] - 10https://gerrit.wikimedia.org/r/1181719 (https://phabricator.wikimedia.org/T400119) [14:39:08] (03PS1) 10Vgutierrez: haproxy: Skip curl/wget from ua_policy:library_default [puppet] - 
10https://gerrit.wikimedia.org/r/1181723 (https://phabricator.wikimedia.org/T400119) [14:39:51] (03Merged) 10jenkins-bot: Revert^2 "mw-cron: Disable memory limit" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1181722 (https://phabricator.wikimedia.org/T395893) (owner: 10Clément Goubert) [14:39:56] 10ops-eqiad, 06SRE, 06cloud-services-team, 06DC-Ops: Put cloudcephosd10[42-47] in service - https://phabricator.wikimedia.org/T401693#11115223 (10Andrew) [14:40:05] (03CR) 10Slyngshede: PCC: Add user-agent to PCC util (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1181719 (https://phabricator.wikimedia.org/T400119) (owner: 10Slyngshede) [14:40:18] (03PS7) 10Krinkle: [WIP] varnish: Improve 08-mobile-hostnames-rewrite.vtc [puppet] - 10https://gerrit.wikimedia.org/r/1180969 (https://phabricator.wikimedia.org/T401595) [14:41:00] (03CR) 10Slyngshede: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1181723 (https://phabricator.wikimedia.org/T400119) (owner: 10Vgutierrez) [14:41:17] (03CR) 10Máté Szabó: "Thanks! Unfortunately it doesn't seem like it'd make things simpler because we'd need to set `hmt_id=$cookie_hmt_id` conditionally if `$co" [puppet] - 10https://gerrit.wikimedia.org/r/1181128 (https://phabricator.wikimedia.org/T402713) (owner: 10Máté Szabó) [14:41:34] (03CR) 10Ssingh: [C:03+1] haproxy: Skip curl/wget from ua_policy:library_default [puppet] - 10https://gerrit.wikimedia.org/r/1181723 (https://phabricator.wikimedia.org/T400119) (owner: 10Vgutierrez) [14:41:53] 10ops-magru: Alert for device ps1-b3-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T402581#11115228 (10RobH) I don't see any sensor firing over '60' when it isn't quite clear what sensor they mean via this alert? 
[14:42:38] 10ops-magru: Alert for device ps1-b3-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T402581#11115236 (10RobH) 05Open→03Resolved Other than perhaps the line frequency which now shows Line Frequency: 60.0 Hz but perhaps it feed in at 60.1 at some poinut? It is now flow... [14:45:45] (03CR) 10Vgutierrez: [C:03+2] haproxy: Skip curl/wget from ua_policy:library_default [puppet] - 10https://gerrit.wikimedia.org/r/1181723 (https://phabricator.wikimedia.org/T400119) (owner: 10Vgutierrez) [14:48:30] (03CR) 10Vgutierrez: [C:03+1] PCC: Add user-agent to PCC util [puppet] - 10https://gerrit.wikimedia.org/r/1181719 (https://phabricator.wikimedia.org/T400119) (owner: 10Slyngshede) [14:49:05] 06SRE, 06Traffic, 13Patch-For-Review, 07User-notice: Block traffic from user-agents not honoring our policy - https://phabricator.wikimedia.org/T400119#11115266 (10Bugreporter) curl/wget should still be rate limited with 1/s. [14:51:16] (03CR) 10Vgutierrez: "let's keep curl|wget with their own tag, something like `ua_policy:cli_tool`" [puppet] - 10https://gerrit.wikimedia.org/r/1181679 (https://phabricator.wikimedia.org/T400119) (owner: 10Slyngshede) [14:56:24] 06SRE, 06Traffic, 13Patch-For-Review, 07User-notice: Block traffic from user-agents not honoring our policy - https://phabricator.wikimedia.org/T400119#11115304 (10Vgutierrez) [14:56:33] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
[14:56:46] jouncebot: nowandnext [14:56:46] For the next 0 hour(s) and 3 minute(s): xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250825T1430) [14:56:46] In 0 hour(s) and 33 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250825T1530) [14:57:20] 06SRE, 06Traffic, 13Patch-For-Review, 07User-notice: Block traffic from user-agents not honoring our policy - https://phabricator.wikimedia.org/T400119#11115308 (10Vgutierrez) [14:57:56] FIRING: [4x] KubernetesRsyslogDown: rsyslog on dse-k8s-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [14:57:59] 06SRE, 06Traffic, 13Patch-For-Review, 07User-notice: Block traffic from user-agents not honoring our policy - https://phabricator.wikimedia.org/T400119#11115313 (10Vgutierrez) [14:58:26] (03CR) 10Slyngshede: [C:03+2] PCC: Add user-agent to PCC util [puppet] - 10https://gerrit.wikimedia.org/r/1181719 (https://phabricator.wikimedia.org/T400119) (owner: 10Slyngshede) [14:59:40] !log jclark@cumin1002 START - Cookbook sre.dns.netbox [15:00:19] !log cgoubert@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-cron: apply [15:00:58] !log cgoubert@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply [15:03:14] (03CR) 10Vgutierrez: "so proxy_set_header should only set the header if its value isn't an empty string, nginx doc says:" [puppet] - 10https://gerrit.wikimedia.org/r/1181128 (https://phabricator.wikimedia.org/T402713) (owner: 10Máté Szabó) [15:03:32] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update dns for frac pdus - jclark@cumin1002" [15:03:37] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by 
cookbooks.sre.dns.netbox: update dns for frac pdus - jclark@cumin1002" [15:03:37] !log jclark@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:04:32] Hey folks, I need to run a couple queries in production to fix some logspam due to invalid stored data. I put the queries in T402239#11115333 (at the end of the comment). May I go ahead? [15:04:32] T402239: RuntimeException: Event 1836 should have only one address. - https://phabricator.wikimedia.org/T402239 [15:04:34] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [15:04:34] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [15:04:58] (03CR) 10Máté Szabó: "But the value needs to be `hmt_id=$cookie_hmt_id` because the variable won't include the cookie name and equals sign, so it'd still need t" [puppet] - 10https://gerrit.wikimedia.org/r/1181128 (https://phabricator.wikimedia.org/T402713) (owner: 10Máté Szabó) [15:05:56] !log imported wmf-laptop 1.0.3 to apt.wikimedia.org [15:05:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:08:39] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:11:09] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install frdata2002 & 
frmx2002 - https://phabricator.wikimedia.org/T400275#11115384 (10Jhancock.wm) a:05Jgreen→03Papaul [15:11:52] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install frdata2002 & frmx2002 - https://phabricator.wikimedia.org/T400275#11115390 (10Jhancock.wm) @Papaul these servers are ready for your part. mgmt ips are pingable. [15:12:43] !log jhancock@cumin1003 START - Cookbook sre.dns.netbox [15:13:04] 10ops-magru, 06DC-Ops, 06Traffic: planned power redundancy depreciation 2025-09-20 @ 18:00 GMT to 2025-09-21 @ 21:00 GMT - https://phabricator.wikimedia.org/T402818 (10RobH) 03NEW p:05Triage→03Medium [15:13:26] (03CR) 10Stevemunene: [C:03+2] Remove mention of an-druid100[1-2] [puppet] - 10https://gerrit.wikimedia.org/r/1180529 (https://phabricator.wikimedia.org/T401116) (owner: 10Stevemunene) [15:14:54] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr2-esams and LibertyGlobal (2001:730:2209:1::d52e:ba09) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [15:15:06] FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-esams:xe-0/1/7 (Transit: Liberty Global (BB00088) {#021468}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-esams:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [15:16:37] PROBLEM - Druid broker on an-druid1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server broker https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [15:16:45] PROBLEM - Druid historical on an-druid1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [15:16:57] PROBLEM - Druid coordinator on an-druid1001 is 
CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server coordinator https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [15:16:57] PROBLEM - Druid overlord on an-druid1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server overlord https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [15:17:03] PROBLEM - Druid middlemanager on an-druid1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [15:17:38] !log jhancock@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: updating new frack mgmt ips - jhancock@cumin1003" [15:17:42] !log jhancock@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: updating new frack mgmt ips - jhancock@cumin1003" [15:17:42] !log jhancock@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:17:48] !log stevemunene@cumin1003 START - Cookbook sre.hosts.decommission for hosts an-druid1001.eqiad.wmnet [15:24:36] !log stevemunene@cumin1003 START - Cookbook sre.dns.netbox [15:26:16] Retrying since my message above got lost amongst the bot stuff. I need to run 3 queries in production to fix some logspam: T402239#11115333. I would like to go ahead shortly unless instructed otherwise. [15:26:16] T402239: RuntimeException: Event 1836 should have only one address. 
- https://phabricator.wikimedia.org/T402239 [15:28:39] FIRING: [4x] ProbeDown: Service install2004:8080 has failed probes (http_squid_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:28:39] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:29:06] !log stevemunene@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: an-druid1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - stevemunene@cumin1003" [15:29:16] (03PS1) 10Urbanecm: [beta] Growth: Enable on beta Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1181731 (https://phabricator.wikimedia.org/T400937) [15:29:34] FIRING: NetworkDeviceAlarmActive: Alarm active on ssw1-f1-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [15:29:37] !log stevemunene@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: an-druid1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - stevemunene@cumin1003" [15:29:37] !log stevemunene@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:29:38] !log stevemunene@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts an-druid1001.eqiad.wmnet [15:30:05] jan_drewniak: OwO what's this, a deployment window?? Wikimedia Portals Update. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250825T1530). nyaa~ [15:31:18] (03CR) 10Urbanecm: [Growth] wikidata: Preconfigure for limited Growth features release (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1181669 (https://phabricator.wikimedia.org/T400937) (owner: 10Urbanecm) [15:31:41] (03PS2) 10Urbanecm: [Growth] wikidata: Preconfigure for limited Growth features release [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1181669 (https://phabricator.wikimedia.org/T400937) [15:32:18] Daimona: I think you should be fine, how long do you expect the queries to take (for information) [15:32:33] A split second ;) [15:32:40] Fire away then [15:33:34] !log Running queries from T402239#11115333 in x1.wikishared to fix broken event addresses [15:33:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:33:39] RESOLVED: [4x] ProbeDown: Service install2004:8080 has failed probes (http_squid_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:33:39] T402239: RuntimeException: Event 1836 should have only one address. 
- https://phabricator.wikimedia.org/T402239 [15:34:25] Done, thank you :) [15:38:28] (03PS2) 10Urbanecm: [beta] Growth: Enable on beta Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1181731 (https://phabricator.wikimedia.org/T400937) [15:39:22] (03CR) 10Urbanecm: [Growth] wikidata: Preconfigure for limited Growth features release (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1181669 (https://phabricator.wikimedia.org/T400937) (owner: 10Urbanecm) [15:40:32] PROBLEM - Druid historical on an-druid1002 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [15:40:38] PROBLEM - Druid middlemanager on an-druid1002 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [15:40:46] PROBLEM - Druid overlord on an-druid1002 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server overlord https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [15:41:04] PROBLEM - Druid coordinator on an-druid1002 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server coordinator https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [15:41:14] (03Abandoned) 10Hnowlan: profile::hcaptcha: add missing private configs to subdomains [puppet] - 10https://gerrit.wikimedia.org/r/1175876 (owner: 10Hnowlan) [15:41:24] PROBLEM - Druid broker on an-druid1002 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server broker https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [15:41:26] !log stevemunene@cumin1003 START - Cookbook sre.hosts.decommission for hosts an-druid1002.eqiad.wmnet [15:44:27] (03CR) 10Hnowlan: [V:03+1] "PCC SUCCESS (): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6713/console" [puppet] - 10https://gerrit.wikimedia.org/r/1181675 (owner: 10Máté Szabó) [15:44:53] (03PS1) 10Muehlenhoff: Assign installserver role to install6003 [puppet] - 10https://gerrit.wikimedia.org/r/1181732 (https://phabricator.wikimedia.org/T396487) [15:44:55] (03PS1) 10Muehlenhoff: Point DHCP server in drmrs to install6003 [puppet] - 10https://gerrit.wikimedia.org/r/1181733 (https://phabricator.wikimedia.org/T396487) [15:45:02] !log jhancock@cumin1003 START - Cookbook sre.dns.netbox [15:45:56] (03PS1) 10Muehlenhoff: Update DHCP server in drmrs [homer/public] - 10https://gerrit.wikimedia.org/r/1181734 (https://phabricator.wikimedia.org/T396487) [15:46:50] (03PS1) 10Muehlenhoff: Point webproxy in drmrs to install6003 [dns] - 10https://gerrit.wikimedia.org/r/1181736 (https://phabricator.wikimedia.org/T396487) [15:47:06] 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install frdata2002 & frmx2002 - https://phabricator.wikimedia.org/T400275#11115578 (10Jhancock.wm) a:05Papaul→03Jgreen @Jgreen i forgot about the new netbox script. the networking is set up on these for you. Let us know if you need any further assist... 
[15:49:36] !log jhancock@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: updating new f servers in codfw - jhancock@cumin1003" [15:49:54] (03PS5) 10CDobbins: sre.loadbalancer: modify admin.py to accept 'reboot' action [cookbooks] - 10https://gerrit.wikimedia.org/r/1180137 [15:50:06] !log stevemunene@cumin1003 START - Cookbook sre.dns.netbox [15:51:53] (03CR) 10SBassett: [C:03+1] hcaptcha: Implement more restrictive CSP [puppet] - 10https://gerrit.wikimedia.org/r/1181644 (owner: 10Máté Szabó) [15:52:41] jhancock@cumin1003 netbox (PID 3662257) is awaiting input [15:52:52] !log stevemunene@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:52:53] !log stevemunene@cumin1003 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts an-druid1002.eqiad.wmnet [15:54:01] (03CR) 10Hnowlan: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6714/co" [puppet] - 10https://gerrit.wikimedia.org/r/1181675 (owner: 10Máté Szabó) [15:55:05] (03PS40) 10CDobbins: dnsrecursor: add recursor.yml.erb [puppet] - 10https://gerrit.wikimedia.org/r/1169156 (https://phabricator.wikimedia.org/T381608) [15:58:01] (03CR) 10Hnowlan: [V:03+1 C:03+1] "lgtm- a corresponding DNS change will be needed first, I can set that up." [puppet] - 10https://gerrit.wikimedia.org/r/1181675 (owner: 10Máté Szabó) [15:59:47] (03CR) 10Vgutierrez: [C:03+1] "oh gotcha! 
you're totally right :)" [puppet] - 10https://gerrit.wikimedia.org/r/1181128 (https://phabricator.wikimedia.org/T402713) (owner: 10Máté Szabó) [16:02:07] (03PS1) 10Hnowlan: wikimedia.org: add hcaptcha-sentry CNAME [dns] - 10https://gerrit.wikimedia.org/r/1181739 (https://phabricator.wikimedia.org/T397841) [16:02:21] jouncebot: nowandnext [16:02:21] No deployments scheduled for the next 0 hour(s) and 57 minute(s) [16:02:21] In 0 hour(s) and 57 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250825T1700) [16:02:22] In 0 hour(s) and 57 minute(s): Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250825T1700) [16:03:38] (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1181669 (https://phabricator.wikimedia.org/T400937) (owner: 10Urbanecm) [16:04:30] (03Merged) 10jenkins-bot: [Growth] wikidata: Preconfigure for limited Growth features release [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1181669 (https://phabricator.wikimedia.org/T400937) (owner: 10Urbanecm) [16:04:45] !log urbanecm@deploy1003 Started scap sync-world: Backport for [[gerrit:1181669|[Growth] wikidata: Preconfigure for limited Growth features release (T400937)]] [16:04:51] T400937: Investigate Feasibility of Enabling Growth Features on Wikidata - https://phabricator.wikimedia.org/T400937 [16:07:07] !log set unused FPC 0 line card to offline mode on cr1-codfw T401937 [16:07:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:07:12] T401937: codfw:cr* router power not balance on all 4 PEM's - https://phabricator.wikimedia.org/T401937 [16:09:05] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-d4-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T402759#11115710 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm [16:09:11] (03CR) 
10Hnowlan: [V:03+1] "We will also need a record added to hieradata/common/profile/trafficserver/backend.yaml in this change to remap the domain in the same way" [puppet] - 10https://gerrit.wikimedia.org/r/1181675 (owner: 10Máté Szabó) [16:10:38] !log urbanecm@deploy1003 urbanecm: Backport for [[gerrit:1181669|[Growth] wikidata: Preconfigure for limited Growth features release (T400937)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [16:10:43] T400937: Investigate Feasibility of Enabling Growth Features on Wikidata - https://phabricator.wikimedia.org/T400937 [16:11:15] !log urbanecm@deploy1003 urbanecm: Continuing with sync [16:12:27] (03PS3) 10Urbanecm: [beta] Growth: Enable on beta Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1181731 (https://phabricator.wikimedia.org/T400937) [16:13:50] !log cmooney@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:20:00 on cr1-codfw with reason: suppress alerts so we can re-seat one of the PSUs [16:13:57] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netbox: codfw:cr* router power not balance on all 4 PEM's - https://phabricator.wikimedia.org/T401937#11115746 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=47d79845-d3b9-4b1e-af6c-788acd3f696b) set by cmooney@cumin1003 f... 
[16:16:34] !log urbanecm@deploy1003 Finished scap sync-world: Backport for [[gerrit:1181669|[Growth] wikidata: Preconfigure for limited Growth features release (T400937)]] (duration: 11m 49s) [16:16:39] T400937: Investigate Feasibility of Enabling Growth Features on Wikidata - https://phabricator.wikimedia.org/T400937 [16:17:05] (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1181731 (https://phabricator.wikimedia.org/T400937) (owner: 10Urbanecm) [16:17:22] (03CR) 10BCornwall: [C:03+1] Point webproxy in drmrs to install6003 [dns] - 10https://gerrit.wikimedia.org/r/1181736 (https://phabricator.wikimedia.org/T396487) (owner: 10Muehlenhoff) [16:17:57] (03Merged) 10jenkins-bot: [beta] Growth: Enable on beta Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1181731 (https://phabricator.wikimedia.org/T400937) (owner: 10Urbanecm) [16:19:10] (03CR) 10BCornwall: [C:03+1] wikimedia.org: add hcaptcha-sentry CNAME [dns] - 10https://gerrit.wikimedia.org/r/1181739 (https://phabricator.wikimedia.org/T397841) (owner: 10Hnowlan) [16:19:49] 06SRE, 10envoy, 06serviceops, 06Traffic: Upgrade Envoy to v1.26.8 and drop buster - https://phabricator.wikimedia.org/T402584#11115783 (10RLazarus) >>! In T402584#11113754, @MoritzMuehlenhoff wrote: > We also have 237 baremetal hosts with Envoy, how shall we handle these? We could e.g. add a profile parame... [16:28:32] (03CR) 10Volans: "I've left some suggestions on the potential abstractions to be more DRY inline. 
None of the suggestions is a blocker and feel free to igno" [homer/public] - 10https://gerrit.wikimedia.org/r/1180562 (https://phabricator.wikimedia.org/T402511) (owner: 10Cathal Mooney) [16:34:22] (03PS1) 10Ssingh: wikidata.org: adding additional TXT verification [dns] - 10https://gerrit.wikimedia.org/r/1181742 [16:34:45] (03CR) 10Hnowlan: [C:03+2] wikimedia.org: add hcaptcha-sentry CNAME [dns] - 10https://gerrit.wikimedia.org/r/1181739 (https://phabricator.wikimedia.org/T397841) (owner: 10Hnowlan) [16:35:07] !log hnowlan@dns1004 START - running authdns-update [16:35:08] 10ops-eqiad, 06SRE, 06DC-Ops: Q4: eqiad: (4) PDUs for future fundraising racks - https://phabricator.wikimedia.org/T400779#11115886 (10Jclark-ctr) The correction and console connect to scs-f8-eqiad has been completed. NetBox records have been updated accordingly. An IP address has been successfully assigned.... [16:35:23] (03PS1) 10Jdlrobson: Revert "Temporarily use production for summary endpoint" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1181743 [16:35:38] 10ops-eqiad, 06SRE, 06DC-Ops: Q4: eqiad: (4) PDUs for future fundraising racks - https://phabricator.wikimedia.org/T400779#11115890 (10Jclark-ctr) [16:36:21] !log hnowlan@dns1004 END - running authdns-update [16:36:28] (03CR) 10BCornwall: [C:03+1] wikidata.org: adding additional TXT verification [dns] - 10https://gerrit.wikimedia.org/r/1181742 (owner: 10Ssingh) [16:36:57] (03CR) 10Ssingh: [C:03+2] wikidata.org: adding additional TXT verification [dns] - 10https://gerrit.wikimedia.org/r/1181742 (owner: 10Ssingh) [16:37:06] !log sukhe@dns1004 START - running authdns-update [16:37:29] (03CR) 10Hnowlan: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6715/co" [puppet] - 10https://gerrit.wikimedia.org/r/1181644 (owner: 10Máté Szabó) [16:38:15] !log sukhe@dns1004 END - running authdns-update [16:38:56] (03CR) 10Hnowlan: [V:03+1 C:03+1] hcaptcha: 
Implement more restrictive CSP [puppet] - 10https://gerrit.wikimedia.org/r/1181644 (owner: 10Máté Szabó) [16:43:10] !log jhancock@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: updating new f servers in codfw - jhancock@cumin1003" [16:43:10] !log jhancock@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:45:01] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netbox: codfw:cr* router power not balance on all 4 PEM's - https://phabricator.wikimedia.org/T401937#11115927 (10Papaul) Case open with Juniper ` Case Number 2025-0825-829681 [16:46:06] (03PS2) 10Vgutierrez: ncredir,benthos: Move processors to the pipeline section [puppet] - 10https://gerrit.wikimedia.org/r/1030010 (https://phabricator.wikimedia.org/T364379) [16:46:32] (03CR) 10CI reject: [V:04-1] ncredir,benthos: Move processors to the pipeline section [puppet] - 10https://gerrit.wikimedia.org/r/1030010 (https://phabricator.wikimedia.org/T364379) (owner: 10Vgutierrez) [16:51:07] (03PS2) 10Krinkle: Enable wmgUseMdotRouting in Beta Cluster for testwiki only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1181310 (https://phabricator.wikimedia.org/T401595) [16:51:49] (03CR) 10Hnowlan: [C:03+1] hcaptcha: Only pass hmt_id cookie to upstream [puppet] - 10https://gerrit.wikimedia.org/r/1181128 (https://phabricator.wikimedia.org/T402713) (owner: 10Máté Szabó) [16:51:57] (03CR) 10CI reject: [V:04-1] Enable wmgUseMdotRouting in Beta Cluster for testwiki only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1181310 (https://phabricator.wikimedia.org/T401595) (owner: 10Krinkle) [16:52:31] 06SRE, 10SRE-Access-Requests: Request to add dsaez to analytics-research-admins - https://phabricator.wikimedia.org/T400344#11115983 (10FCeratto-WMF) Hello @Miriam, sorry for the recurrent ask, could you please approve @diego's request for membership in analytics-research-admins? 
Thank you [16:56:41] !log jhancock@cumin1003 START - Cookbook sre.hosts.reimage for host deploy2003.codfw.wmnet with OS bookworm [16:56:50] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install deploy2003 - https://phabricator.wikimedia.org/T400485#11116003 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin1003 for host deploy2003.codfw.wmnet with OS bookworm [16:59:40] (03PS1) 10Herron: wip [puppet] - 10https://gerrit.wikimedia.org/r/1181745 [17:00:05] swfrench-wmf: I, the Bot under the Fountain, call upon thee, The Deployer, to do MediaWiki infrastructure (UTC late) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250825T1700). [17:00:05] ryankemper: It is that lovely time of the day again! You are hereby commanded to deploy Wikidata Query Service weekly deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250825T1700). [17:00:13] o/ [17:01:19] (03CR) 10Scott French: "Thanks for the review!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1181149 (https://phabricator.wikimedia.org/T401721) (owner: 10Scott French) [17:01:23] (03CR) 10Scott French: [C:03+2] mediawiki: clean up php.version overrides [deployment-charts] - 10https://gerrit.wikimedia.org/r/1181149 (https://phabricator.wikimedia.org/T401721) (owner: 10Scott French) [17:03:56] 10ops-magru: Alert for device ps1-b3-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T402835 (10phaultfinder) 03NEW [17:04:32] (03Merged) 10jenkins-bot: mediawiki: clean up php.version overrides [deployment-charts] - 10https://gerrit.wikimedia.org/r/1181149 (https://phabricator.wikimedia.org/T401721) (owner: 10Scott French) [17:05:25] 06SRE, 10SRE-Access-Requests: Requesting access to Superset dashboards for mszwarc - https://phabricator.wikimedia.org/T402779#11116039 (10FCeratto-WMF) 05Open→03In progress p:05Triage→03Medium [17:06:18] * swfrench-wmf is waiting for chartmuseum ... 
[17:06:45] 10ops-codfw, 06SRE, 06DC-Ops: codfw netbox cable cleanup - https://phabricator.wikimedia.org/T402535#11116042 (10Jhancock.wm) 05Open→03Resolved [17:07:25] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:08:45] (03PS5) 10Bernard Wang: Update vector search config with new wgVectorTypeahead [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1179750 (https://phabricator.wikimedia.org/T397084) [17:08:57] 10ops-magru: Alert for device ps1-b4-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T402582#11116046 (10phaultfinder) [17:09:35] (03CR) 10CI reject: [V:04-1] Update vector search config with new wgVectorTypeahead [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1179750 (https://phabricator.wikimedia.org/T397084) (owner: 10Bernard Wang) [17:13:54] (03PS1) 10Scott French: mediawiki: bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1181747 (https://phabricator.wikimedia.org/T401721) [17:14:17] * swfrench-wmf shakes fist at chart version [17:16:44] (03CR) 10RLazarus: [C:03+1] mediawiki: bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1181747 (https://phabricator.wikimedia.org/T401721) (owner: 10Scott French) [17:17:01] (03CR) 10Scott French: [C:03+2] mediawiki: bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1181747 (https://phabricator.wikimedia.org/T401721) (owner: 10Scott French) [17:17:30] (03CR) 10Kosta Harlan: [C:03+1] "When could we deploy this?" 
[puppet] - 10https://gerrit.wikimedia.org/r/1181128 (https://phabricator.wikimedia.org/T402713) (owner: 10Máté Szabó) [17:19:55] (03Merged) 10jenkins-bot: mediawiki: bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1181747 (https://phabricator.wikimedia.org/T401721) (owner: 10Scott French) [17:20:06] * swfrench-wmf is *actually* waiting for chartmuseum ... [17:21:48] (03CR) 10Ssingh: "I think it looks good but let's run PCC on both the hosts (dns1004, doh1001) to confirm." [puppet] - 10https://gerrit.wikimedia.org/r/1169156 (https://phabricator.wikimedia.org/T381608) (owner: 10CDobbins) [17:22:28] (03PS10) 10Bking: opensearch-operator: Add chart for review [deployment-charts] - 10https://gerrit.wikimedia.org/r/1174038 (https://phabricator.wikimedia.org/T397246) [17:24:08] !log swfrench@deploy1003 Started scap sync-world: Helmfile-only deployment for php.version override cleanup - T401721 [17:24:13] T401721: Provide MediaWiki app image PHP version in helm values - https://phabricator.wikimedia.org/T401721 [17:24:27] (03CR) 10Bking: opensearch-operator: Add chart for review (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1174038 (https://phabricator.wikimedia.org/T397246) (owner: 10Bking) [17:26:29] !log swfrench@deploy1003 Finished scap sync-world: Helmfile-only deployment for php.version override cleanup - T401721 (duration: 03m 34s) [17:27:10] no additional items planned on my end for this infra window [17:33:58] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2179 (T399249)', diff saved to https://phabricator.wikimedia.org/P81735 and previous config saved to /var/cache/conftool/dbconfig/20250825-173358-fceratto.json [17:34:03] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [17:44:53] (03CR) 10Ssingh: "I think we can abandon this?" 
[puppet] - 10https://gerrit.wikimedia.org/r/1172056 (https://phabricator.wikimedia.org/T381608) (owner: 10CDobbins) [17:49:06] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P81736 and previous config saved to /var/cache/conftool/dbconfig/20250825-174905-fceratto.json [18:02:06] (03PS1) 10Jasmine: wikikube: Add wikikube-worker2[248-330] [puppet] - 10https://gerrit.wikimedia.org/r/1181753 (https://phabricator.wikimedia.org/T390859) [18:02:35] (03CR) 10CI reject: [V:04-1] wikikube: Add wikikube-worker2[248-330] [puppet] - 10https://gerrit.wikimedia.org/r/1181753 (https://phabricator.wikimedia.org/T390859) (owner: 10Jasmine) [18:04:13] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P81737 and previous config saved to /var/cache/conftool/dbconfig/20250825-180413-fceratto.json [18:16:29] !log jhancock@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host deploy2003.codfw.wmnet with OS bookworm [18:16:38] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install deploy2003 - https://phabricator.wikimedia.org/T400485#11116267 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin1003 for host deploy2003.codfw.wmnet with OS bookworm executed with errors: - deploy2003 (**... 
[18:19:21] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2179 (T399249)', diff saved to https://phabricator.wikimedia.org/P81739 and previous config saved to /var/cache/conftool/dbconfig/20250825-181920-fceratto.json [18:19:26] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [18:23:45] (03PS1) 10David Caro: wmcs-enc-cli: update client params [puppet] - 10https://gerrit.wikimedia.org/r/1181756 [18:25:47] (03CR) 10Andrew Bogott: [C:03+1] "I don't know what work 'enabled' was doing there but we tend to delete/replace endpoints so this should be fine in our setup regardless." [puppet] - 10https://gerrit.wikimedia.org/r/1181756 (owner: 10David Caro) [18:45:02] (03CR) 10David Caro: [C:03+2] wmcs-enc-cli: update client params [puppet] - 10https://gerrit.wikimedia.org/r/1181756 (owner: 10David Caro) [18:50:53] 06SRE, 10SRE-Access-Requests: Requesting access to Superset dashboards for mszwarc - https://phabricator.wikimedia.org/T402779#11116357 (10SCherukuwada) Ollie's status in Dayforce is not up-to-date. Skip-level manager approving. 
[18:57:56] FIRING: [4x] KubernetesRsyslogDown: rsyslog on dse-k8s-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [18:58:04] (03CR) 10Ssingh: "Leaving to Valenti.n for the final say; some initial thoughts:" [cookbooks] - 10https://gerrit.wikimedia.org/r/1180137 (owner: 10CDobbins) [18:58:12] (03CR) 10CDobbins: dnsrecursor: add recursor.yml.erb (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1169156 (https://phabricator.wikimedia.org/T381608) (owner: 10CDobbins) [19:01:13] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, August 25 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1180229 (https://phabricator.wikimedia.org/T402349) (owner: 10Arlolra) [19:01:33] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host deploy2003.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [19:04:34] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [19:04:34] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [19:09:12] 06SRE, 10Maps, 06Traffic: Allow Wikimedia Maps usage on  - https://phabricator.wikimedia.org/T402846 (10GuidoSP) 03NEW Closing this task as invalid due to missing information. 
[19:09:41] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host deploy2003.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [19:10:04] (03CR) 10Ssingh: "https://puppet-compiler.wmflabs.org/output/1169156/6727/doh1001.wikimedia.org/fulldiff.html" [puppet] - 10https://gerrit.wikimedia.org/r/1169156 (https://phabricator.wikimedia.org/T381608) (owner: 10CDobbins) [19:12:26] 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install frdata2002 & frmx2002 - https://phabricator.wikimedia.org/T400275#11116443 (10Jgreen) Hi @Jhancock.wm I'm not able to ping frdata2002's management interface, is it up on the IP that is in DNS? I'm able to ssh to frmx2002, but where can get the p... [19:12:44] I suspect Gerrit is unhappy [19:14:09] !log jhancock@cumin1003 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['deploy2003'] [19:14:28] !log jhancock@cumin1003 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['deploy2003'] [19:14:31] FIRING: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:14:34] yeah... 
[19:14:54] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr2-esams and LibertyGlobal (2001:730:2209:1::d52e:ba09) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [19:15:01] !log jhancock@cumin1003 START - Cookbook sre.hosts.reimage for host deploy2003.codfw.wmnet with OS bookworm [19:15:06] FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-esams:xe-0/1/7 (Transit: Liberty Global (BB00088) {#021468}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-esams:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [19:15:16] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install deploy2003 - https://phabricator.wikimedia.org/T400485#11116454 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin1003 for host deploy2003.codfw.wmnet with OS bookworm [19:18:39] FIRING: [4x] JobUnavailable: Reduced availability for job gerrit in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:20:23] oh good. [19:21:06] !log restart apache gerrit1003 [19:21:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:22:19] thcipriani: much better thanks. 
I guess I will just do it in future instead of waiting :] [19:22:52] !log dancy@deploy1003 Installing scap version "4.209.0" for 169 host(s) [19:23:17] sukhe: it's typically only apache that needs a kick there [19:23:39] RESOLVED: [4x] JobUnavailable: Reduced availability for job gerrit in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:24:31] RESOLVED: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:26:27] !log vriley@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host cloudcephosd1045 [19:26:55] !log vriley@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudcephosd1045 [19:26:57] !log dancy@deploy1003 Installation of scap version "4.209.0" completed for 169 hosts [19:29:31] FIRING: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:29:32] 10ops-eqiad, 06SRE, 06DC-Ops: Q2:rack/setup/install cloudcephosd10[42-47] - https://phabricator.wikimedia.org/T378828#11116548 (10VRiley-WMF) @cmooney Thanks! The second link on cloudcephosd1045 in port 23 in cloudsw1-d5-eqiad. I also made a few changes to the cable itself. I pushed out the update as well. I... 
[19:29:39] FIRING: NetworkDeviceAlarmActive: Alarm active on ssw1-f1-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [19:31:16] FIRING: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:31:17] sukhe: thcipriani: wasnt here. it was from google cloud this time :( [19:31:26] 10ops-eqiad, 06SRE, 06DC-Ops: Q4: eqiad: (4) PDUs for future fundraising racks - https://phabricator.wikimedia.org/T400779#11116590 (10Jclark-ctr) [19:31:59] 10ops-eqiad, 06SRE, 06DC-Ops: Q4: eqiad: (4) PDUs for future fundraising racks - https://phabricator.wikimedia.org/T400779#11116598 (10Jclark-ctr) Both pdu's have been configured and added to librenms [19:32:08] 10ops-eqiad, 06SRE, 06DC-Ops: Q4: eqiad: (4) PDUs for future fundraising racks - https://phabricator.wikimedia.org/T400779#11116601 (10Jclark-ctr) 05Open→03Resolved [19:32:55] (03CR) 10Cwhite: [C:03+2] k8s-ops: add disk space check overrides (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1179177 (https://phabricator.wikimedia.org/T332764) (owner: 10Cwhite) [19:34:31] RESOLVED: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:35:30] (03Merged) 10jenkins-bot: k8s-ops: add disk space check overrides [alerts] - 10https://gerrit.wikimedia.org/r/1179177 (https://phabricator.wikimedia.org/T332764) (owner: 
10Cwhite) [19:36:09] (03PS2) 10Cwhite: logstash: alert on unassigned shards and cluster status [alerts] - 10https://gerrit.wikimedia.org/r/1179226 [19:39:02] (03PS1) 10Andrew Bogott: Replace cloudvirt1045 [puppet] - 10https://gerrit.wikimedia.org/r/1181767 (https://phabricator.wikimedia.org/T401693) [19:40:19] !log jhathaway@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on sretest2001.codfw.wmnet with reason: sleep test [19:40:24] (03PS1) 10Dzahn: gerrit: adding prefixes to abusers list [puppet] - 10https://gerrit.wikimedia.org/r/1181768 (https://phabricator.wikimedia.org/T402847) [19:40:39] (03CR) 10CI reject: [V:04-1] gerrit: adding prefixes to abusers list [puppet] - 10https://gerrit.wikimedia.org/r/1181768 (https://phabricator.wikimedia.org/T402847) (owner: 10Dzahn) [19:40:54] (03CR) 10Dzahn: [C:03+2] gerrit: adding prefixes to abusers list [puppet] - 10https://gerrit.wikimedia.org/r/1181768 (https://phabricator.wikimedia.org/T402847) (owner: 10Dzahn) [19:41:03] (03PS2) 10Dzahn: gerrit: adding prefixes to abusers list [puppet] - 10https://gerrit.wikimedia.org/r/1181768 (https://phabricator.wikimedia.org/T402847) [19:41:35] (03CR) 10Dzahn: [C:03+2] gerrit: adding prefixes to abusers list [puppet] - 10https://gerrit.wikimedia.org/r/1181768 (https://phabricator.wikimedia.org/T402847) (owner: 10Dzahn) [19:43:40] !log jhathaway@cumin1002 START - Cookbook sre.hosts.provision for host sretest2001.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [19:43:41] (03PS2) 10Ebernhardson: EventStream: Enable hive ingestion for wcqs-external.sparql-query [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1154300 [19:43:50] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, August 25 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1154300 (owner: 10Ebernhardson) [19:43:58] (03CR) 10ScheduleDeploymentBot: "Scheduled for 
deployment in the [Monday, August 25 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1180610 (https://phabricator.wikimedia.org/T397083) (owner: 10Ebernhardson) [19:44:25] (03PS3) 10Ebernhardson: EventStream: Enable hive ingestion for wcqs-external.sparql-query [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1154300 [19:45:24] (03PS41) 10CDobbins: dnsrecursor: add recursor.yml.erb [puppet] - 10https://gerrit.wikimedia.org/r/1169156 (https://phabricator.wikimedia.org/T381608) [19:45:34] (03PS4) 10Ebernhardson: EventStream: Enable hive ingestion for wcqs-external.sparql-query [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1154300 (https://phabricator.wikimedia.org/T391383) [19:48:14] (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!" [alerts] - 10https://gerrit.wikimedia.org/r/1179226 (owner: 10Cwhite) [19:48:22] (03PS3) 10Ebernhardson: cirrus: Enable phrase suggester variant [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1180610 (https://phabricator.wikimedia.org/T397083) [19:48:42] !log jhathaway@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sretest2001.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [19:49:10] !log jhathaway@cumin1002 START - Cookbook sre.hosts.provision for host sretest2010.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [19:50:56] PROBLEM - Host sretest2010 is DOWN: PING CRITICAL - Packet loss = 100% [19:51:01] (03PS1) 10Dzahn: gerrit: add prefixes to abusers list [puppet] - 10https://gerrit.wikimedia.org/r/1181769 (https://phabricator.wikimedia.org/T402847) [19:51:09] !log jhathaway@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sretest2010.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [19:51:54] (03CR) 10Dzahn: [C:03+2] gerrit: add prefixes to abusers list [puppet] - 10https://gerrit.wikimedia.org/r/1181769 
(https://phabricator.wikimedia.org/T402847) (owner: 10Dzahn) [19:52:00] (03PS2) 10Dzahn: gerrit: add prefixes to abusers list [puppet] - 10https://gerrit.wikimedia.org/r/1181769 (https://phabricator.wikimedia.org/T402847) [19:52:24] RECOVERY - Host sretest2010 is UP: PING OK - Packet loss = 0%, RTA = 30.39 ms [19:53:09] (03CR) 10Dzahn: [C:03+2] gerrit: add prefixes to abusers list [puppet] - 10https://gerrit.wikimedia.org/r/1181769 (https://phabricator.wikimedia.org/T402847) (owner: 10Dzahn) [19:54:46] mutante: :( [19:54:59] where do you check this out of curiosity? [19:55:09] just access logs? [19:58:15] (03PS1) 10Dzahn: gerrit: add more prefixes to abusers list [puppet] - 10https://gerrit.wikimedia.org/r/1181770 [19:58:15] !log jhathaway@cumin1002 START - Cookbook sre.hosts.provision for host sretest2010.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [19:58:36] (03CR) 10CI reject: [V:04-1] gerrit: add more prefixes to abusers list [puppet] - 10https://gerrit.wikimedia.org/r/1181770 (owner: 10Dzahn) [19:58:54] (03PS3) 10Krinkle: Enable wmgUseMdotRouting in Beta Cluster for testwiki only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1181310 (https://phabricator.wikimedia.org/T401595) [19:59:06] (03PS2) 10Dzahn: gerrit: add more prefixes to abusers list [puppet] - 10https://gerrit.wikimedia.org/r/1181770 [19:59:41] (03PS8) 10Krinkle: [WIP] varnish: Improve 08-mobile-hostnames-rewrite.vtc [puppet] - 10https://gerrit.wikimedia.org/r/1180969 [19:59:46] PROBLEM - Host sretest2010 is DOWN: PING CRITICAL - Packet loss = 100% [19:59:48] (03CR) 10CI reject: [V:04-1] Enable wmgUseMdotRouting in Beta Cluster for testwiki only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1181310 (https://phabricator.wikimedia.org/T401595) (owner: 10Krinkle) [20:00:05] RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC late backport window deploy. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250825T2000). [20:00:05] arlolra and ebernhardson: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:19] !log jhathaway@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sretest2010.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [20:00:34] here. I can handle my deploy [20:00:47] \o [20:00:51] I can do mine after [20:01:05] I'll get started [20:01:14] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install deploy2003 - https://phabricator.wikimedia.org/T400485#11116698 (10Jhancock.wm) @Papaul this one is going to fail again. looks like there might be a mismatch between hardware and the site.pp or preseed. I'm not sure which, but they both exi... [20:01:16] (03CR) 10Cwhite: [C:03+2] logstash: alert on unassigned shards and cluster status [alerts] - 10https://gerrit.wikimedia.org/r/1179226 (owner: 10Cwhite) [20:01:24] RECOVERY - Host sretest2010 is UP: PING OK - Packet loss = 0%, RTA = 30.45 ms [20:01:27] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install deploy2003 - https://phabricator.wikimedia.org/T400485#11116699 (10Jhancock.wm) [20:01:48] (03CR) 10TrainBranchBot: [C:03+2] "Approved by arlolra@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1180229 (https://phabricator.wikimedia.org/T402349) (owner: 10Arlolra) [20:02:27] (03Merged) 10jenkins-bot: logstash: alert on unassigned shards and cluster status [alerts] - 10https://gerrit.wikimedia.org/r/1179226 (owner: 10Cwhite) [20:02:36] (03CR) 10Dzahn: [C:03+2] gerrit: add more prefixes to abusers list [puppet] - 10https://gerrit.wikimedia.org/r/1181770 (owner: 10Dzahn) [20:02:49] (03Merged) 10jenkins-bot: Deploy Parsoid Read Views to ~20 Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1180229
(https://phabricator.wikimedia.org/T402349) (owner: 10Arlolra) [20:03:07] !log arlolra@deploy1003 Started scap sync-world: Backport for [[gerrit:1180229|Deploy Parsoid Read Views to ~20 Wikipedias (T402349)]] [20:03:11] T402349: Parsoid Read Views to Wikipedia deploy ~2025-08-25 - https://phabricator.wikimedia.org/T402349 [20:06:34] (03PS4) 10Krinkle: Enable wmgUseMdotRouting in Beta Cluster for testwiki only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1181310 (https://phabricator.wikimedia.org/T401595) [20:08:52] !log arlolra@deploy1003 arlolra: Backport for [[gerrit:1180229|Deploy Parsoid Read Views to ~20 Wikipedias (T402349)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:08:56] T402349: Parsoid Read Views to Wikipedia deploy ~2025-08-25 - https://phabricator.wikimedia.org/T402349 [20:10:25] !log arlolra@deploy1003 arlolra: Continuing with sync [20:14:47] !log jhathaway@cumin1002 START - Cookbook sre.hosts.provision for host sretest2003.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [20:14:48] !log jhathaway@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest2003.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [20:14:55] jhathaway@cumin1002: Failed to log message to wiki. Somebody should check the error logs. 
[20:15:46] !log arlolra@deploy1003 Finished scap sync-world: Backport for [[gerrit:1180229|Deploy Parsoid Read Views to ~20 Wikipedias (T402349)]] (duration: 12m 40s) [20:15:51] T402349: Parsoid Read Views to Wikipedia deploy ~2025-08-25 - https://phabricator.wikimedia.org/T402349 [20:15:57] ebernhardson: all yours [20:16:24] arlolra: thanks [20:17:23] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ebernhardson@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1154300 (https://phabricator.wikimedia.org/T391383) (owner: 10Ebernhardson) [20:17:24] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ebernhardson@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1180610 (https://phabricator.wikimedia.org/T397083) (owner: 10Ebernhardson) [20:18:09] (03Merged) 10jenkins-bot: EventStream: Enable hive ingestion for wcqs-external.sparql-query [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1154300 (https://phabricator.wikimedia.org/T391383) (owner: 10Ebernhardson) [20:18:16] (03Merged) 10jenkins-bot: cirrus: Enable phrase suggester variant [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1180610 (https://phabricator.wikimedia.org/T397083) (owner: 10Ebernhardson) [20:18:31] !log ebernhardson@deploy1003 Started scap sync-world: Backport for [[gerrit:1154300|EventStream: Enable hive ingestion for wcqs-external.sparql-query (T391383)]], [[gerrit:1180610|cirrus: Enable phrase suggester variant (T397083)]] [20:18:38] T391383: Metrics for federated querying - https://phabricator.wikimedia.org/T391383 [20:18:38] T397083: Add a second suggest field to the CirrusSearch mapping - https://phabricator.wikimedia.org/T397083 [20:19:26] !log jhathaway@cumin1002 START - Cookbook sre.hosts.provision for host sretest2003.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [20:19:32] !log jhathaway@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for 
host sretest2003.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [20:20:52] PROBLEM - Host sretest2003 is DOWN: PING CRITICAL - Packet loss = 100% [20:21:52] (03PS1) 10Santiago Faci: xLab: Deploy v0.8.3 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1181773 (https://phabricator.wikimedia.org/T380592) [20:22:03] (03PS2) 10Santiago Faci: xLab: Deploy v0.8.4 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1181773 (https://phabricator.wikimedia.org/T380592) [20:23:24] (03CR) 10CDobbins: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6729/co" [puppet] - 10https://gerrit.wikimedia.org/r/1169156 (https://phabricator.wikimedia.org/T381608) (owner: 10CDobbins) [20:23:54] !log ebernhardson@deploy1003 ebernhardson: Backport for [[gerrit:1154300|EventStream: Enable hive ingestion for wcqs-external.sparql-query (T391383)]], [[gerrit:1180610|cirrus: Enable phrase suggester variant (T397083)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[20:23:58] (03CR) 10Clare Ming: [C:03+2] xLab: Deploy v0.8.4 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1181773 (https://phabricator.wikimedia.org/T380592) (owner: 10Santiago Faci) [20:24:00] T391383: Metrics for federated querying - https://phabricator.wikimedia.org/T391383 [20:24:00] T397083: Add a second suggest field to the CirrusSearch mapping - https://phabricator.wikimedia.org/T397083 [20:24:18] RECOVERY - Host sretest2003 is UP: PING OK - Packet loss = 0%, RTA = 31.62 ms [20:24:59] (03CR) 10CDobbins: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6730/co" [puppet] - 10https://gerrit.wikimedia.org/r/1169156 (https://phabricator.wikimedia.org/T381608) (owner: 10CDobbins) [20:25:44] (03Merged) 10jenkins-bot: xLab: Deploy v0.8.4 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1181773 (https://phabricator.wikimedia.org/T380592) (owner: 10Santiago Faci) [20:26:29] !log ebernhardson@deploy1003 ebernhardson: Continuing with sync [20:27:18] 06SRE, 10SRE-Access-Requests: Requesting access to Superset dashboards for mszwarc - https://phabricator.wikimedia.org/T402779#11116795 (10FCeratto-WMF) [20:31:36] !log ebernhardson@deploy1003 Finished scap sync-world: Backport for [[gerrit:1154300|EventStream: Enable hive ingestion for wcqs-external.sparql-query (T391383)]], [[gerrit:1180610|cirrus: Enable phrase suggester variant (T397083)]] (duration: 13m 04s) [20:31:42] T391383: Metrics for federated querying - https://phabricator.wikimedia.org/T391383 [20:31:43] T397083: Add a second suggest field to the CirrusSearch mapping - https://phabricator.wikimedia.org/T397083 [20:34:40] !log jhancock@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host deploy2003.codfw.wmnet with OS bookworm [20:34:46] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install deploy2003 - 
https://phabricator.wikimedia.org/T400485#11116894 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin1003 for host deploy2003.codfw.wmnet with OS bookworm executed with errors: - deploy2003 (**... [20:35:25] !log jhathaway@cumin1002 START - Cookbook sre.hosts.provision for host sretest2003.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [20:37:20] PROBLEM - Host sretest2003 is DOWN: PING CRITICAL - Packet loss = 100% [20:38:07] !log jhathaway@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sretest2003.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [20:39:55] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install es2049-es2057 - https://phabricator.wikimedia.org/T400195#11116917 (10FCeratto-WMF) [20:40:18] RECOVERY - Host sretest2003 is UP: PING OK - Packet loss = 0%, RTA = 30.42 ms [20:45:14] !log jhathaway@cumin1002 START - Cookbook sre.hosts.provision for host sretest2003.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [20:46:07] !log jhathaway@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on sretest2003.codfw.wmnet with reason: sleep test [20:48:07] !log jhathaway@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sretest2003.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [20:48:30] !log jhathaway@cumin1002 START - Cookbook sre.hosts.provision for host sretest2010.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [20:50:16] PROBLEM - Host sretest2010 is DOWN: PING CRITICAL - Packet loss = 100% [20:50:29] !log jhathaway@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sretest2010.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [20:51:24] RECOVERY - Host sretest2010 is UP: PING OK - 
Packet loss = 0%, RTA = 30.51 ms [20:51:31] (03PS1) 10Cwhite: logstash: bugfix: add missing threshold [alerts] - 10https://gerrit.wikimedia.org/r/1181777 [20:53:00] (03CR) 10Cwhite: [C:03+2] logstash: bugfix: add missing threshold [alerts] - 10https://gerrit.wikimedia.org/r/1181777 (owner: 10Cwhite) [20:53:15] !log rzl@deploy1003 mwscript-k8s job started: foreachwikiindblist mwscript.dblist Version.php # dblist: https://phabricator.wikimedia.org/P81742 [20:53:20] !log jhathaway@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on sretest2001.codfw.wmnet with reason: sleep test [20:54:11] (03Merged) 10jenkins-bot: logstash: bugfix: add missing threshold [alerts] - 10https://gerrit.wikimedia.org/r/1181777 (owner: 10Cwhite) [20:54:18] !log jhathaway@cumin1002 START - Cookbook sre.hosts.provision for host sretest2001.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [20:58:20] (03PS1) 10RLazarus: deployment_server: Include --local_dblist contents when logging to SAL [puppet] - 10https://gerrit.wikimedia.org/r/1181779 (https://phabricator.wikimedia.org/T401737) [20:59:17] Hey all - is the late backport window wrapped up yet? [20:59:20] !log jhathaway@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sretest2001.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [20:59:27] We’ve definitely got a few sec patches to get out during the window. [21:00:05] Reedy, sbassett, Maryum, and manfredi: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Weekly Security deployment window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250825T2100). 
[21:00:44] !log jhathaway@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on sretest2006.codfw.wmnet with reason: sleep test [21:02:58] (03CR) 10RLazarus: "Tested:" [puppet] - 10https://gerrit.wikimedia.org/r/1181779 (https://phabricator.wikimedia.org/T401737) (owner: 10RLazarus) [21:03:40] !log jhathaway@cumin1002 START - Cookbook sre.hosts.provision for host sretest2006.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [21:03:50] (03CR) 10Andrew Bogott: [C:03+2] Replace cloudvirt1045 [puppet] - 10https://gerrit.wikimedia.org/r/1181767 (https://phabricator.wikimedia.org/T401693) (owner: 10Andrew Bogott) [21:07:36] !log jhathaway@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sretest2006.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [21:07:40] FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:08:09] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install deploy2003 - https://phabricator.wikimedia.org/T400485#11116995 (10Papaul) @Jhancock.wm no entry on the wrong puppet server for this server. Please check site.pp. 
Thanks [21:08:25] !log jhathaway@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on sretest2009.codfw.wmnet with reason: sleep test [21:08:45] !log jhathaway@cumin1002 START - Cookbook sre.hosts.provision for host sretest2009.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [21:08:50] 10ops-magru: Alert for device ps1-b3-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T402835#11116999 (10phaultfinder) [21:10:24] !log jhathaway@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sretest2009.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [21:13:50] 10ops-magru: Alert for device ps1-b4-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T402582#11117005 (10phaultfinder) [21:13:51] !log jhathaway@cumin1002 START - Cookbook sre.hosts.provision for host sretest2009.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [21:15:08] (03PS1) 10Bartosz Dziewoński: PHPSessionHandler: Better handle objects stored in the session [core] (wmf/1.45.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1181782 (https://phabricator.wikimedia.org/T402602) [21:15:41] !log jhathaway@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sretest2009.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [21:16:11] !log jhathaway@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on sretest1002.eqiad.wmnet with reason: sleep test [21:17:02] !log jhathaway@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on sretest1003.eqiad.wmnet with reason: sleep test [21:17:06] !log jhathaway@cumin1002 START - Cookbook sre.hosts.provision for host sretest1002.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [21:17:27] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, August 26 UTC afternoon backport 
window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [core] (wmf/1.45.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1181782 (https://phabricator.wikimedia.org/T402602) (owner: 10Bartosz Dziewoński) [21:18:47] !log jhathaway@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sretest1002.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [21:19:23] !log jhathaway@cumin1002 START - Cookbook sre.hosts.provision for host sretest1003.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [21:19:46] (03PS1) 10Bartosz Dziewoński: Add maint script to fix global edit count of renamed users [extensions/CentralAuth] (wmf/1.45.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1181788 (https://phabricator.wikimedia.org/T313900) [21:19:59] (03PS1) 10Bartosz Dziewoński: Add maint script to fix wrong actors in local log entries for global renames [extensions/CentralAuth] (wmf/1.45.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1181789 (https://phabricator.wikimedia.org/T398177) [21:20:15] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, August 26 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [extensions/CentralAuth] (wmf/1.45.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1181788 (https://phabricator.wikimedia.org/T313900) (owner: 10Bartosz Dziewoński) [21:20:26] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, August 26 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [extensions/CentralAuth] (wmf/1.45.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1181789 (https://phabricator.wikimedia.org/T398177) (owner: 10Bartosz Dziewoński) [21:21:51] !log jhathaway@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sretest1003.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL 
[21:21:53] (03PS11) 10Bking: opensearch-operator: Add chart for review [deployment-charts] - 10https://gerrit.wikimedia.org/r/1174038 (https://phabricator.wikimedia.org/T397246) [21:23:26] (03CR) 10CI reject: [V:04-1] opensearch-operator: Add chart for review [deployment-charts] - 10https://gerrit.wikimedia.org/r/1174038 (https://phabricator.wikimedia.org/T397246) (owner: 10Bking) [21:23:51] !log Deployed security mitigations for T402146, T402077, T402095, T400525 [21:23:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:23:56] (03PS2) 10Ladsgroup: Move update of category members count to a dedicated job [core] (wmf/1.45.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1181786 (https://phabricator.wikimedia.org/T365303) [21:33:24] (03PS12) 10Bking: opensearch-operator: Add chart for review [deployment-charts] - 10https://gerrit.wikimedia.org/r/1174038 (https://phabricator.wikimedia.org/T397246) [21:38:07] (03CR) 10Ladsgroup: [C:03+2] Move update of category members count to a dedicated job [core] (wmf/1.45.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1181786 (https://phabricator.wikimedia.org/T365303) (owner: 10Ladsgroup) [21:38:13] jouncebot: nowandnext [21:38:13] For the next 1 hour(s) and 21 minute(s): Weekly Security deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250825T2100) [21:38:14] In 1 hour(s) and 21 minute(s): Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250825T2300) [21:39:30] (03PS6) 10Jdlrobson: Update vector search config with new wgVectorTypeahead [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1179750 (https://phabricator.wikimedia.org/T397084) (owner: 10Bernard Wang) [21:41:40] (03PS1) 10Andrew Bogott: Remove manifests/files/templaces for openstack 'Caracal' [puppet] - 10https://gerrit.wikimedia.org/r/1181790 (https://phabricator.wikimedia.org/T390914) [21:42:06] (03Merged) 10jenkins-bot: Move update of category members count to a 
dedicated job [core] (wmf/1.45.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1181786 (https://phabricator.wikimedia.org/T365303) (owner: 10Ladsgroup) [21:44:57] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1181790 (https://phabricator.wikimedia.org/T390914) (owner: 10Andrew Bogott) [21:45:00] (03PS1) 10Cwhite: opensearch: selectively enable cluster health check [puppet] - 10https://gerrit.wikimedia.org/r/1181791 (https://phabricator.wikimedia.org/T321808) [21:47:23] (03CR) 10Andrew Bogott: [C:03+2] Remove manifests/files/templaces for openstack 'Caracal' [puppet] - 10https://gerrit.wikimedia.org/r/1181790 (https://phabricator.wikimedia.org/T390914) (owner: 10Andrew Bogott) [21:47:28] !log Deployed updated security mitigations for T399627 [21:47:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:47:34] !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1181786|Move update of category members count to a dedicated job (T365303)]] [21:47:39] T365303: Move update of category members count to CategoryMembershipChangeJob - https://phabricator.wikimedia.org/T365303 [21:49:07] (03PS2) 10Andrew Bogott: openstack: switch libvirt live migration uri to cloud-private hostnames [puppet] - 10https://gerrit.wikimedia.org/r/1181676 (https://phabricator.wikimedia.org/T355145) (owner: 10Filippo Giunchedi) [21:49:51] sbassett: my patch went immediately after yours, I'll be quick. Need to fix this UBN [21:51:15] preparing to do further security deploys [21:51:23] Amir1 are you done with yours? [21:51:32] nope, it's running [21:51:53] I ping you once I'm done [21:53:36] !log ladsgroup@deploy1003 ladsgroup: Backport for [[gerrit:1181786|Move update of category members count to a dedicated job (T365303)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[21:53:41] T365303: Move update of category members count to CategoryMembershipChangeJob - https://phabricator.wikimedia.org/T365303 [21:56:25] 10ops-eqiad, 06SRE, 06DC-Ops: eqiad netbox cable cleanup - https://phabricator.wikimedia.org/T402536#11117170 (10VRiley-WMF) [21:56:39] great just let me know [22:00:01] (03PS1) 10JHathaway: provision: poll for reboot via Redfish [cookbooks] - 10https://gerrit.wikimedia.org/r/1181795 [22:05:24] !log ladsgroup@deploy1003 Sync cancelled. [22:05:50] maryum: is anything you're doing in core? if so, then let me revert my patch [22:06:04] if not, then I can spend time to fix it [22:06:05] I was going to deploy a core patch, but it's not working right now [22:06:19] ah okay [22:06:37] I have a patch to deploy for abuse filter and one for cirrus search [22:07:12] I have a patch you wrote to deploy as well [22:07:19] (03CR) 10CI reject: [V:04-1] provision: poll for reboot via Redfish [cookbooks] - 10https://gerrit.wikimedia.org/r/1181795 (owner: 10JHathaway) [22:07:25] FIRING: [2x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:07:45] wait, is that the core patch you're trying to deploy? 
[22:08:03] Amir1 are working on the same thing possibly [22:08:07] *we [22:08:24] nope, mine is different [22:08:28] !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1181786|Move update of category members count to a dedicated job (T365303)]] [22:08:33] T365303: Move update of category members count to CategoryMembershipChangeJob - https://phabricator.wikimedia.org/T365303 [22:08:39] I'm pushing it again, I realized what was wrong [22:08:55] well I do have a core patch that you wrote that I also want to deploy [22:09:00] PROBLEM - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [22:09:03] so just let me know when I can get started [22:09:04] (03CR) 10Cwhite: "PCC: OK https://puppet-compiler.wmflabs.org/output/1181791/6732/" [puppet] - 10https://gerrit.wikimedia.org/r/1181791 (https://phabricator.wikimedia.org/T321808) (owner: 10Cwhite) [22:09:55] Ah I remember which patch [22:10:02] it's different :D [22:10:15] 10ops-eqiad, 06SRE, 06DC-Ops: eqiad netbox cable cleanup - https://phabricator.wikimedia.org/T402536#11117185 (10VRiley-WMF) [22:10:21] I'll be done quickly, sorry for barging in the security window, I'm having a very fun time [22:11:04] (03PS1) 10Bking: [WIP]:dse-k8s-worker: Add sysctl setting that's required for OpenSearch [puppet] - 10https://gerrit.wikimedia.org/r/1181797 (https://phabricator.wikimedia.org/T362105) [22:11:19] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1181797 (https://phabricator.wikimedia.org/T362105) (owner: 10Bking) [22:13:55] To be fair, "making cat-a-lot not crash wikipedia" is arguably security related. 
[22:14:08] !log ladsgroup@deploy1003 ladsgroup: Backport for [[gerrit:1181786|Move update of category members count to a dedicated job (T365303)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [22:14:09] Amir1 there's still time in the window that's fine [22:14:13] T365303: Move update of category members count to CategoryMembershipChangeJob - https://phabricator.wikimedia.org/T365303 [22:15:44] !log ladsgroup@deploy1003 ladsgroup: Continuing with sync [22:16:36] perryprog: xD Indeeeeed [22:20:55] !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1181786|Move update of category members count to a dedicated job (T365303)]] (duration: 12m 26s) [22:21:00] T365303: Move update of category members count to CategoryMembershipChangeJob - https://phabricator.wikimedia.org/T365303 [22:21:16] maryum: I'm done, feel free to move forward [22:21:24] awesome, thanks!! [22:23:08] preparing to deploy the core security patch first [22:27:07] !log Deployed security fix for T298690 [22:27:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:34:41] 06SRE, 06Commons, 06DBA, 06Traffic: Unable to save edits or delete pages on Commons – database lag - https://phabricator.wikimedia.org/T402749#11117263 (10Ladsgroup) >>! In T402749#11113595, @Zache wrote: > @Ladsgroup : Just FYI, from the Cat-a-lot code side, the user was using a pre-August 18, 2024 ve... 
[22:41:22] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db1184 to s1 master [puppet] - 10https://gerrit.wikimedia.org/r/1181799 (https://phabricator.wikimedia.org/T402870) [22:41:27] (03PS1) 10Gerrit maintenance bot: wmnet: Update s1-master alias [dns] - 10https://gerrit.wikimedia.org/r/1181800 (https://phabricator.wikimedia.org/T402870) [22:42:35] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db1160 to s4 master [puppet] - 10https://gerrit.wikimedia.org/r/1181801 (https://phabricator.wikimedia.org/T402871) [22:42:39] (03PS1) 10Gerrit maintenance bot: wmnet: Update s4-master alias [dns] - 10https://gerrit.wikimedia.org/r/1181802 (https://phabricator.wikimedia.org/T402871) [22:44:20] ran into some issues, running scap again [22:51:41] have one more scap to run after this, will go over this window for a slight bit [22:54:25] 06SRE, 06Commons, 06DBA, 06Traffic: Unable to save edits or delete pages on Commons – database lag - https://phabricator.wikimedia.org/T402749#11117358 (10JJMC89) >>! In T402749#11117263, @Ladsgroup wrote: >>>! In T402749#11113595, @Zache wrote: >> @Ladsgroup : Just FYI, from the Cat-a-lot code side, t... 
[22:55:49] !log Deploy security fix for T401220 [22:55:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:56:36] (03PS1) 10Dzahn: zuul: add a provider and zookeeper server to nodepool config [puppet] - 10https://gerrit.wikimedia.org/r/1181804 (https://phabricator.wikimedia.org/T401614) [22:56:38] running the last of the scaps [22:59:02] (03CR) 10Dzahn: [V:03+1 C:03+2] "https://puppet-compiler.wmflabs.org/output/1181804/6733/zuul1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1181804 (https://phabricator.wikimedia.org/T401614) (owner: 10Dzahn) [22:59:56] finished with scap [23:00:05] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250825T2300) [23:00:37] !log Deploy security fix for T397396 [23:00:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:04:04] maryum I just need to deploy a beta cluster only change. Are you done with your deploys? [23:04:34] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [23:04:35] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [23:07:25] FIRING: [2x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed 
[23:08:27] (CR) TrainBranchBot: [C:+2] "Approved by jdlrobson@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1181743 (owner: Jdlrobson)
[23:09:00] RECOVERY - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[23:09:15] (Merged) jenkins-bot: Revert "Temporarily use production for summary endpoint" [mediawiki-config] - https://gerrit.wikimedia.org/r/1181743 (owner: Jdlrobson)
[23:14:54] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[23:14:54] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[23:14:54] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr2-esams and LibertyGlobal (2001:730:2209:1::d52e:ba09) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[23:15:06] FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-esams:xe-0/1/7 (Transit: Liberty Global (BB00088) {#021468}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-esams:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[23:15:26] (PS1) RLazarus: mathoid: Upgrade to envoy-future:1.26.8-2 for validation [deployment-charts] - https://gerrit.wikimedia.org/r/1181806 (https://phabricator.wikimedia.org/T402584)
[23:17:55] !log jhathaway@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on sretest1003.eqiad.wmnet with reason: sleep test
[23:21:02] !log jhathaway@cumin1002 START - Cookbook sre.hosts.provision for host sretest1003.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL
[23:23:29] !log jhathaway@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sretest1003.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL
[23:24:41] jouncebot: nowandnext
[23:24:41] For the next 0 hour(s) and 35 minute(s): Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250825T2300)
[23:24:41] In 2 hour(s) and 35 minute(s): Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous deployment/Train deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250826T0200)
[23:29:29] (CR) Scott French: [C:+1] "Very nice!" [puppet] - https://gerrit.wikimedia.org/r/1181779 (https://phabricator.wikimedia.org/T401737) (owner: RLazarus)
[23:29:34] FIRING: NetworkDeviceAlarmActive: Alarm active on ssw1-f1-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive
[23:29:40] SRE, Commons, DBA, Traffic: Unable to save edits or delete pages on Commons – database lag - https://phabricator.wikimedia.org/T402749#11117413 (Ladsgroup) >>! In T402749#11117358, @JJMC89 wrote: >>>! In T402749#11117263, @Ladsgroup wrote: >>>>! In T402749#11113595, @Zache wrote: >>> @Ladsgro...
[23:30:55] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 32 hosts with reason: Primary switchover s4 T402871
[23:30:59] T402871: Switchover s4 master (db1244 -> db1160) - https://phabricator.wikimedia.org/T402871
[23:31:30] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Set db1160 with weight 0 T402871', diff saved to https://phabricator.wikimedia.org/P81743 and previous config saved to /var/cache/conftool/dbconfig/20250825-233128-ladsgroup.json
[23:33:40] SRE, Commons, DBA, Traffic: Unable to save edits or delete pages on Commons – database lag - https://phabricator.wikimedia.org/T402749#11117423 (Josve05a) >>! In T402749#11117413, @Ladsgroup wrote: > [...] Maybe someone should mention it to them? There is https://commons.wikimedia.org/wiki/U...
[23:35:02] (CR) RLazarus: [C:+2] deployment_server: Include --local_dblist contents when logging to SAL [puppet] - https://gerrit.wikimedia.org/r/1181779 (https://phabricator.wikimedia.org/T401737) (owner: RLazarus)
[23:37:11] (PS2) Gerrit maintenance bot: mariadb: Promote db1160 to s4 master [puppet] - https://gerrit.wikimedia.org/r/1181801 (https://phabricator.wikimedia.org/T402871)
[23:37:16] (CR) Ladsgroup: [V:+2 C:+2] mariadb: Promote db1160 to s4 master [puppet] - https://gerrit.wikimedia.org/r/1181801 (https://phabricator.wikimedia.org/T402871) (owner: Gerrit maintenance bot)
[23:38:18] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1181807
[23:38:18] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1181807 (owner: TrainBranchBot)
[23:39:20] !log Starting s4 eqiad failover from db1244 to db1160 - T402871
[23:39:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:39:24] T402871: Switchover s4 master (db1244 -> db1160) - https://phabricator.wikimedia.org/T402871
[23:39:35] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Set s4 eqiad as read-only for maintenance - T402871', diff saved to https://phabricator.wikimedia.org/P81744 and previous config saved to /var/cache/conftool/dbconfig/20250825-233934-ladsgroup.json
[23:42:43] ops-eqiad, DC-Ops, decommission-hardware, Data-Platform-SRE (2025.08.16 - 2025.09.05): decommission an-druid100[1-2] - https://phabricator.wikimedia.org/T402814#11117433 (Jclark-ctr)
[23:43:04] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Promote db1160 to s4 primary and set section read-write T402871', diff saved to https://phabricator.wikimedia.org/P81745 and previous config saved to /var/cache/conftool/dbconfig/20250825-234303-ladsgroup.json
[23:43:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[23:45:32] (CR) Ladsgroup: [V:+2 C:+2] wmnet: Update s4-master alias [dns] - https://gerrit.wikimedia.org/r/1181802 (https://phabricator.wikimedia.org/T402871) (owner: Gerrit maintenance bot)
[23:45:46] !log ladsgroup@dns1004 START - running authdns-update
[23:47:01] !log ladsgroup@dns1004 END - running authdns-update
[23:48:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[23:48:57] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depool db1244 T402871', diff saved to https://phabricator.wikimedia.org/P81746 and previous config saved to /var/cache/conftool/dbconfig/20250825-234856-ladsgroup.json
[23:49:02] T402871: Switchover s4 master (db1244 -> db1160) - https://phabricator.wikimedia.org/T402871
[23:50:08] (CR) BCornwall: [C:+1] wmnet: Update s1-master alias [dns] - https://gerrit.wikimedia.org/r/1181800 (https://phabricator.wikimedia.org/T402870) (owner: Gerrit maintenance bot)
[23:51:27] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1181807 (owner: TrainBranchBot)
[23:54:53] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9235 bytes in 7.379 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[23:54:53] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54682 bytes in 7.525 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[23:59:13] !log ladsgroup@cumin1002 START - Cookbook sre.mysql.upgrade for db1244.eqiad.wmnet
[23:59:21] !log ladsgroup@cumin1002 START - Cookbook sre.mysql.depool db1244 - Upgrading db1244.eqiad.wmnet
[23:59:29] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db1244 - Upgrading db1244.eqiad.wmnet