[00:07:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:07:46] <wikibugs>	 SRE, Commons, DBA, Traffic: Unable to save edits or delete pages on Commons – database lag - https://phabricator.wikimedia.org/T402749#11113548 (Ladsgroup) To be sure update category membership is the culprit, I went through all slow write queries reordered by the master around the time of th...
[00:07:58] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1181307
[00:07:58] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1181307 (owner: TrainBranchBot)
[00:11:51] <wikibugs>	 SRE, Commons, DBA, Traffic: Unable to save edits or delete pages on Commons – database lag - https://phabricator.wikimedia.org/T402749#11113549 (Ladsgroup) Specifically these edits seemed to be the main reason: https://commons.wikimedia.org/w/index.php?title=Special:Contributions/Yac%C3%A0wot...
[00:36:26] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1181307 (owner: TrainBranchBot)
[00:43:56] <wikibugs>	 ops-magru: Alert for device ps1-b3-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T402581#11113571 (phaultfinder)
[00:48:53] <wikibugs>	 ops-magru: Alert for device ps1-b4-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T402582#11113574 (phaultfinder)
[01:16:17] <wikibugs>	 SRE-SLO, SRE Observability, Abstract Wikipedia team (26Q1 (Jul–Sep)), Essential-Work: Create new SLO dashboard via Pyrra for Wikifunctions - https://phabricator.wikimedia.org/T394057#11113584 (ecarg) Thank you so much, @RLazarus! Will keep you posted with any Qs 😃
[01:32:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:37:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:37:44] <jinxer-wm>	 FIRING: NodeTextfileStale: Stale textfile for wdqs2025:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[01:42:44] <jinxer-wm>	 FIRING: NodeTextfileStale: Stale textfile for wdqs2009:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[01:45:44] <jinxer-wm>	 FIRING: [3x] NodeTextfileStale: Stale textfile for wdqs1027:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[01:48:44] <jinxer-wm>	 FIRING: [2x] NodeTextfileStale: Stale textfile for apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[01:49:44] <jinxer-wm>	 FIRING: [6x] NodeTextfileStale: Stale textfile for cloudelastic1007:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[01:58:58] <jinxer-wm>	 FIRING: [5x] NodeTextfileStale: Stale textfile for wdqs1018:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[01:59:53] <jinxer-wm>	 FIRING: [4x] NodeTextfileStale: Stale textfile for wdqs1023:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[02:00:44] <jinxer-wm>	 FIRING: [6x] NodeTextfileStale: Stale textfile for wcqs1001:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[02:00:48] <jinxer-wm>	 FIRING: [112x] NodeTextfileStale: Stale textfile for cirrussearch1068:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[02:01:44] <jinxer-wm>	 FIRING: [5x] NodeTextfileStale: Stale textfile for wdqs1025:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[02:04:53] <jinxer-wm>	 FIRING: [19x] NodeTextfileStale: Stale textfile for wdqs1011:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[02:05:53] <jinxer-wm>	 FIRING: [3x] NodeTextfileStale: Stale textfile for relforge1008:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[02:07:43] <wikibugs>	 (PS1) Krinkle: Enable wmgUseMdotRouting in Beta Cluster for testwiki only [mediawiki-config] - https://gerrit.wikimedia.org/r/1181310 (https://phabricator.wikimedia.org/T401595)
[02:08:31] <wikibugs>	 (CR) CI reject: [V:-1] Enable wmgUseMdotRouting in Beta Cluster for testwiki only [mediawiki-config] - https://gerrit.wikimedia.org/r/1181310 (https://phabricator.wikimedia.org/T401595) (owner: Krinkle)
[02:14:25] <wikibugs>	 SRE, Commons, DBA, Traffic: Unable to save edits or delete pages on Commons – database lag - https://phabricator.wikimedia.org/T402749#11113595 (Zache) @Ladsgroup : Just FYI, from the Cat-a-lot code side, the user was using a pre-August 18, 2024 version of Cat-a-lot which didn't have the thro...
[02:15:54] <jinxer-wm>	 FIRING: CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld
[02:57:56] <jinxer-wm>	 FIRING: [4x] KubernetesRsyslogDown: rsyslog on dse-k8s-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[03:04:34] <jinxer-wm>	 FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag
[03:04:34] <jinxer-wm>	 FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown
[03:14:39] <jinxer-wm>	 FIRING: [2x] TransitBGPDown: Transit BGP session down between cr2-esams and LibertyGlobal (2001:730:2209:1::d52e:ba09) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[03:14:51] <jinxer-wm>	 FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-esams:xe-0/1/7 (Transit: Liberty Global (BB00088) {#021468}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-esams:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[03:29:34] <jinxer-wm>	 FIRING: NetworkDeviceAlarmActive: Alarm active on ssw1-f1-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive
[04:45:39] <jinxer-wm>	 RESOLVED: CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld
[04:48:55] <wikibugs>	 ops-magru: Alert for device ps1-b3-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T402581#11113663 (phaultfinder)
[04:53:58] <wikibugs>	 ops-magru: Alert for device ps1-b4-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T402582#11113664 (phaultfinder)
[05:08:39] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:20:05] <wikibugs>	 ops-codfw, DC-Ops: Alert for device ps1-d4-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T402759 (phaultfinder) NEW
[05:28:39] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:37:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:37:44] <jinxer-wm>	 FIRING: NodeTextfileStale: Stale textfile for wdqs2025:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[05:42:44] <jinxer-wm>	 FIRING: NodeTextfileStale: Stale textfile for wdqs2009:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[05:45:44] <jinxer-wm>	 FIRING: [3x] NodeTextfileStale: Stale textfile for wdqs1027:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[05:48:44] <jinxer-wm>	 FIRING: [2x] NodeTextfileStale: Stale textfile for apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[05:49:44] <jinxer-wm>	 FIRING: [6x] NodeTextfileStale: Stale textfile for cloudelastic1007:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[05:58:53] <jinxer-wm>	 FIRING: [5x] NodeTextfileStale: Stale textfile for wdqs1018:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[05:59:53] <jinxer-wm>	 FIRING: [4x] NodeTextfileStale: Stale textfile for wdqs1023:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[06:00:44] <jinxer-wm>	 FIRING: [6x] NodeTextfileStale: Stale textfile for wcqs1001:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[06:00:49] <jinxer-wm>	 FIRING: [112x] NodeTextfileStale: Stale textfile for cirrussearch1068:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[06:01:44] <jinxer-wm>	 FIRING: [5x] NodeTextfileStale: Stale textfile for wdqs1025:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[06:04:53] <jinxer-wm>	 FIRING: [19x] NodeTextfileStale: Stale textfile for wdqs1011:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[06:05:44] <jinxer-wm>	 FIRING: [3x] NodeTextfileStale: Stale textfile for relforge1008:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[06:23:39] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:38:07] <kostajh>	 jouncebot: nowandnext
[06:38:08] <jouncebot>	 For the next 0 hour(s) and 21 minute(s): No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250824T0700)
[06:38:08] <jouncebot>	 In 0 hour(s) and 21 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250825T0700)
[06:38:38] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by kharlan@deploy1003 using scap backport" [extensions/ConfirmEdit] (wmf/1.45.0-wmf.15) - https://gerrit.wikimedia.org/r/1181130 (https://phabricator.wikimedia.org/T402641) (owner: Kosta Harlan)
[06:38:50] <wikibugs>	 (CR) Arnaudb: [C:+1] "thanks for the bump" [puppet] - https://gerrit.wikimedia.org/r/1181213 (owner: Dzahn)
[06:40:19] <wikibugs>	 (Merged) jenkins-bot: hcaptcha: Delay challenge execution until submit [extensions/ConfirmEdit] (wmf/1.45.0-wmf.15) - https://gerrit.wikimedia.org/r/1181130 (https://phabricator.wikimedia.org/T402641) (owner: Kosta Harlan)
[06:42:21] <kostajh>	 dcausse: it looks like https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1114956 was merged, but not backported 
[06:42:43] <dcausse>	 kostajh: yes I think it was merged by accident on friday
[06:42:47] <kostajh>	 ok 
[06:42:51] <kostajh>	 dcausse: ok if I sync it now? 
[06:43:02] <kostajh>	 I'm syncing another patch, and scap is asking me to deploy it 
[06:43:07] <dcausse>	 kostajh: yes please
[06:43:23] <logmsgbot>	 !log kharlan@deploy1003 Started scap sync-world: Backport for [[gerrit:1181130|hcaptcha: Delay challenge execution until submit (T402641)]]
[06:43:28] <stashbot>	 T402641: hCaptcha: Only display challenge on form submission - https://phabricator.wikimedia.org/T402641
[06:43:38] <kostajh>	 dcausse: ok, it's going out now. Do you need to verify it when it's on mwdebug?
[06:44:05] <dcausse>	 kostajh: I could do a quick test yes
[06:44:18] <kostajh>	 ok I'll let you know when it's ready for review 
[06:44:26] <dcausse>	 sure
[06:48:07] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[06:48:36] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[06:57:56] <jinxer-wm>	 FIRING: [4x] KubernetesRsyslogDown: rsyslog on dse-k8s-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[07:00:05] <jouncebot>	 Amir1, Urbanecm, and awight: #bothumor My software never has bugs. It just develops random features. Rise for UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250825T0700).
[07:00:05] <jouncebot>	 kostajh and dcausse: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:00:16] <wikibugs>	 (CR) Filippo Giunchedi: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1098962 (https://phabricator.wikimedia.org/T135991) (owner: Muehlenhoff)
[07:04:31] <kostajh>	 dcausse: it's still syncing out... 
[07:04:34] <jinxer-wm>	 FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag
[07:04:34] <jinxer-wm>	 FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown
[07:06:22] <wikibugs>	 (CR) Muehlenhoff: [C:+2] cloudcontrol/codfw1dev:: Enable profile::auto_restarts::service for apache2 [puppet] - https://gerrit.wikimedia.org/r/1098962 (https://phabricator.wikimedia.org/T135991) (owner: Muehlenhoff)
[07:08:29] <logmsgbot>	 !log kharlan@deploy1003 kharlan: Backport for [[gerrit:1181130|hcaptcha: Delay challenge execution until submit (T402641)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[07:08:34] <stashbot>	 T402641: hCaptcha: Only display challenge on form submission - https://phabricator.wikimedia.org/T402641
[07:08:55] <dcausse>	 testing
[07:09:26] <dcausse>	 kostajh: all good from my side
[07:11:40] <kostajh>	 cool
[07:11:51] <logmsgbot>	 !log kharlan@deploy1003 kharlan: Continuing with sync
[07:13:36] <wikibugs>	 SRE, envoy, serviceops, Traffic, Patch-For-Review: Upgrade Envoy to v1.26.8 and drop buster - https://phabricator.wikimedia.org/T402584#11113754 (MoritzMuehlenhoff) We also have 237 baremetal hosts with Envoy, how shall we handle these? We could e.g. add a profile parameter $use_future to pro...
[07:14:54] <jinxer-wm>	 FIRING: [2x] TransitBGPDown: Transit BGP session down between cr2-esams and LibertyGlobal (2001:730:2209:1::d52e:ba09) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[07:15:06] <jinxer-wm>	 FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-esams:xe-0/1/7 (Transit: Liberty Global (BB00088) {#021468}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-esams:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[07:24:46] <logmsgbot>	 !log kharlan@deploy1003 Finished scap sync-world: Backport for [[gerrit:1181130|hcaptcha: Delay challenge execution until submit (T402641)]] (duration: 41m 22s)
[07:24:50] <stashbot>	 T402641: hCaptcha: Only display challenge on form submission - https://phabricator.wikimedia.org/T402641
[07:25:04] <kostajh>	 dcausse: synced!
[07:25:24] <dcausse>	 kostajh: thanks and sorry about this!
[07:25:31] <kostajh>	 no worries at all 
[07:29:13] <wikibugs>	 (CR) Kevin Bazira: [C:+2] ml-services: update image for readability model on prod [deployment-charts] - https://gerrit.wikimedia.org/r/1180098 (https://phabricator.wikimedia.org/T400352) (owner: AikoChou)
[07:29:34] <jinxer-wm>	 FIRING: NetworkDeviceAlarmActive: Alarm active on ssw1-f1-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive
[07:31:11] <wikibugs>	 (Merged) jenkins-bot: ml-services: update image for readability model on prod [deployment-charts] - https://gerrit.wikimedia.org/r/1180098 (https://phabricator.wikimedia.org/T400352) (owner: AikoChou)
[07:31:54] <wikibugs>	 (PS1) Ayounsi: Routed Ganeti: switch to dnsmasq for DHCP relay [puppet] - https://gerrit.wikimedia.org/r/1181505 (https://phabricator.wikimedia.org/T396864)
[07:33:12] <logmsgbot>	 !log kevinbazira@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'readability' for release 'main' .
[07:33:41] <wikibugs>	 SRE, envoy, serviceops, Traffic, Patch-For-Review: Upgrade Envoy to v1.26.8 and drop buster - https://phabricator.wikimedia.org/T402584#11113776 (hashar) I have updated the [[ https://integration.wikimedia.org/ci/job/helm-lint/ | helm-lint ]] job to the new image :)
[07:34:57] <wikibugs>	 (CR) CI reject: [V:-1] Routed Ganeti: switch to dnsmasq for DHCP relay [puppet] - https://gerrit.wikimedia.org/r/1181505 (https://phabricator.wikimedia.org/T396864) (owner: Ayounsi)
[07:36:25] <logmsgbot>	 !log kevinbazira@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'readability' for release 'main' .
[07:36:30] <wikibugs>	 (PS2) Ayounsi: Routed Ganeti: switch to dnsmasq for DHCP relay [puppet] - https://gerrit.wikimedia.org/r/1181505 (https://phabricator.wikimedia.org/T396864)
[07:36:44] <wikibugs>	 (PS1) Brouberol: mediawiki-dumps-legacy: adapt RBAC to a recent apache-airflow-providers-cncf-kubernetes upgrade [deployment-charts] - https://gerrit.wikimedia.org/r/1181519 (https://phabricator.wikimedia.org/T402529)
[07:36:54] <wikibugs>	 SRE, Traffic, Patch-For-Review, WE4.2 Bot detection (WE4.2 hCaptcha account creation trial): hCaptcha: Ensure GeoIP and WMF-Uniq cookies are removed in proxied requests - https://phabricator.wikimedia.org/T402713#11113808 (kostajh)
[07:41:28] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Apply installserver role on install4003 [puppet] - https://gerrit.wikimedia.org/r/1181094 (https://phabricator.wikimedia.org/T396487) (owner: Muehlenhoff)
[07:41:31] <wikibugs>	 (CR) Ayounsi: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1181505 (https://phabricator.wikimedia.org/T396864) (owner: Ayounsi)
[07:43:07] <wikibugs>	 (PS3) Ayounsi: Routed Ganeti: switch to dnsmasq for DHCP relay [puppet] - https://gerrit.wikimedia.org/r/1181505 (https://phabricator.wikimedia.org/T396864)
[07:43:49] <wikibugs>	 (CR) Urbanecm: [C:-1] [Growth] enwiki: Deploy "Add a link" to 100% of users (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/1179648 (https://phabricator.wikimedia.org/T395524) (owner: Cyndywikime)
[07:45:22] <urbanecm>	 kostajh: any other deployment still happening, do you know?
[07:46:16] <wikibugs>	 (PS4) Ayounsi: Routed Ganeti: switch to dnsmasq for DHCP relay [puppet] - https://gerrit.wikimedia.org/r/1181505 (https://phabricator.wikimedia.org/T396864)
[07:46:33] <wikibugs>	 (CR) Ayounsi: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1181505 (https://phabricator.wikimedia.org/T396864) (owner: Ayounsi)
[07:50:29] <wikibugs>	 (CR) Ayounsi: [C:-1] "Ready for review but can't be merged before we have a compatible dnsmasq (v2.92) in APT." [puppet] - https://gerrit.wikimedia.org/r/1181505 (https://phabricator.wikimedia.org/T396864) (owner: Ayounsi)
[07:55:16] <jinxer-wm>	 FIRING: [19x] NodeTextfileStale: Stale textfile for wdqs1011:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[07:55:30] <jinxer-wm>	 FIRING: [4x] NodeTextfileStale: Stale textfile for wdqs1023:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[07:55:34] <jinxer-wm>	 FIRING: [6x] NodeTextfileStale: Stale textfile for cloudelastic1007:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[07:55:59] <jinxer-wm>	 RESOLVED: [3x] NodeTextfileStale: Stale textfile for wdqs1027:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[07:56:08] <jinxer-wm>	 RESOLVED: [6x] NodeTextfileStale: Stale textfile for wcqs1001:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[07:56:41] <jinxer-wm>	 RESOLVED: [3x] NodeTextfileStale: Stale textfile for relforge1008:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[07:57:19] <jinxer-wm>	 RESOLVED: [112x] NodeTextfileStale: Stale textfile for cirrussearch1068:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[07:57:31] <jinxer-wm>	 RESOLVED: [5x] NodeTextfileStale: Stale textfile for wdqs1025:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[07:58:34] <jinxer-wm>	 RESOLVED: NodeTextfileStale: Stale textfile for wdqs2009:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[07:58:53] <jinxer-wm>	 RESOLVED: NodeTextfileStale: Stale textfile for wdqs2025:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[07:59:06] <wikibugs>	 (PS1) Muehlenhoff: Add dummy keytabs for new install servers T396487 [labs/private] - https://gerrit.wikimedia.org/r/1181638
[07:59:43] <jinxer-wm>	 RESOLVED: [5x] NodeTextfileStale: Stale textfile for wdqs1018:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[08:00:11] <jinxer-wm>	 RESOLVED: [2x] NodeTextfileStale: Stale textfile for apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[08:01:19] <jinxer-wm>	 RESOLVED: [19x] NodeTextfileStale: Stale textfile for wdqs1011:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[08:01:23] <jinxer-wm>	 RESOLVED: [4x] NodeTextfileStale: Stale textfile for wdqs1023:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[08:01:42] <jinxer-wm>	 RESOLVED: [6x] NodeTextfileStale: Stale textfile for cloudelastic1007:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[08:13:15] <wikibugs>	 (CR) Jcrespo: "Hey, Moritz, maybe I am not understanding the patch, but I don't think this will work as intended, "os.path.exists" runs something on the " [software/transferpy] - https://gerrit.wikimedia.org/r/1180570 (https://phabricator.wikimedia.org/T393692) (owner: Muehlenhoff)
[08:18:48] <wikibugs>	 (PS1) Slyngshede: P:cache::haproxy block generic user-agents [puppet] - https://gerrit.wikimedia.org/r/1181639 (https://phabricator.wikimedia.org/T400119)
[08:23:41] <wikibugs>	 (PS1) Brouberol: airflow-ml: define an RBD volume claim used as a model training scratch space [deployment-charts] - https://gerrit.wikimedia.org/r/1181641 (https://phabricator.wikimedia.org/T396495)
[08:26:55] <wikibugs>	 (CR) Vgutierrez: P:cache::haproxy block generic user-agents (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1181639 (https://phabricator.wikimedia.org/T400119) (owner: Slyngshede)
[08:29:46] <wikibugs>	 (CR) Vgutierrez: [C:-1] P:cache::haproxy block generic user-agents (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1181639 (https://phabricator.wikimedia.org/T400119) (owner: Slyngshede)
[08:32:44] <wikibugs>	 (CR) Gkyziridis: [C:+1] "LGTM!" [deployment-charts] - https://gerrit.wikimedia.org/r/1181641 (https://phabricator.wikimedia.org/T396495) (owner: Brouberol)
[08:35:30] <wikibugs>	 (CR) Brouberol: [C:+2] airflow-ml: define an RBD volume claim used as a model training scratch space [deployment-charts] - https://gerrit.wikimedia.org/r/1181641 (https://phabricator.wikimedia.org/T396495) (owner: Brouberol)
[08:36:15] <wikibugs>	 (PS2) Slyngshede: P:cache::haproxy block generic user-agents [puppet] - https://gerrit.wikimedia.org/r/1181639 (https://phabricator.wikimedia.org/T400119)
[08:39:09] <wikibugs>	 (PS3) Slyngshede: P:cache::haproxy block generic user-agents [puppet] - https://gerrit.wikimedia.org/r/1181639 (https://phabricator.wikimedia.org/T400119)
[08:39:15] <wikibugs>	 (CR) Slyngshede: P:cache::haproxy block generic user-agents (2 comments) [puppet] - https://gerrit.wikimedia.org/r/1181639 (https://phabricator.wikimedia.org/T400119) (owner: Slyngshede)
[08:40:22] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] installserver: Failover DHCP server in ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/1181095 (https://phabricator.wikimedia.org/T396487) (owner: 10Muehlenhoff)
[08:40:44] <wikibugs>	 (03CR) 10Muehlenhoff: [V:03+2 C:03+2] Add dummy keytabs for new install servers T396487 [labs/private] - 10https://gerrit.wikimedia.org/r/1181638 (owner: 10Muehlenhoff)
[08:41:33] <wikibugs>	 (03PS4) 10Máté Szabó: hcaptcha: Only pass hmt_id cookie to upstream [puppet] - 10https://gerrit.wikimedia.org/r/1181128 (https://phabricator.wikimedia.org/T402713)
[08:41:59] <wikibugs>	 (03CR) 10Máté Szabó: hcaptcha: Only pass hmt_id cookie to upstream (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1181128 (https://phabricator.wikimedia.org/T402713) (owner: 10Máté Szabó)
[08:43:03] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Failover webproxy in ulsfo to new node [dns] - 10https://gerrit.wikimedia.org/r/1181096 (https://phabricator.wikimedia.org/T396487) (owner: 10Muehlenhoff)
[08:43:08] <logmsgbot>	 !log jmm@dns1004 START - running authdns-update
[08:43:16] <wikibugs>	 (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1181131 (owner: 10Vgutierrez)
[08:43:42] <wikibugs>	 (03CR) 10CI reject: [V:04-1] hcaptcha: Only pass hmt_id cookie to upstream [puppet] - 10https://gerrit.wikimedia.org/r/1181128 (https://phabricator.wikimedia.org/T402713) (owner: 10Máté Szabó)
[08:44:15] <logmsgbot>	 !log jmm@dns1004 END - running authdns-update
[08:45:25] <wikibugs>	 (03PS2) 10Vgutierrez: haproxy: Provide basic X-Analytics data for blocked requests [puppet] - 10https://gerrit.wikimedia.org/r/1181131
[08:46:46] <wikibugs>	 (03PS5) 10Máté Szabó: hcaptcha: Only pass hmt_id cookie to upstream [puppet] - 10https://gerrit.wikimedia.org/r/1181128 (https://phabricator.wikimedia.org/T402713)
[08:46:59] <wikibugs>	 (03CR) 10Cyndywikime: "Thanks Martin, you are right! Created the task here: https://phabricator.wikimedia.org/T402769 😊" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1179648 (https://phabricator.wikimedia.org/T395524) (owner: 10Cyndywikime)
[08:47:53] <wikibugs>	 (03PS3) 10Cyndywikime: [Growth] enwiki: Deploy "Add a link" to 100% of users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1179648 (https://phabricator.wikimedia.org/T395524)
[08:53:52] <wikibugs>	 10ops-magru: Alert for device ps1-b3-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T402581#11114056 (10phaultfinder)
[08:54:14] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+2] openstack: clarify libvirtd debug levels [puppet] - 10https://gerrit.wikimedia.org/r/1180535 (https://phabricator.wikimedia.org/T355145) (owner: 10Filippo Giunchedi)
[08:54:19] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+2] openstack: enable cfssl certs for libvirt in eqiad1 [puppet] - 10https://gerrit.wikimedia.org/r/1180556 (https://phabricator.wikimedia.org/T355145) (owner: 10Filippo Giunchedi)
[08:56:13] <wikibugs>	 (03CR) 10Slyngshede: [C:03+1] "Looks good. Syntax tested in local test environment." [puppet] - 10https://gerrit.wikimedia.org/r/1181131 (owner: 10Vgutierrez)
[08:56:34] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1070 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[08:56:46] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1071 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[08:56:54] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1069 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[08:56:58] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1040 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[08:57:42] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1059 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[08:57:58] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirtlocal1003 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[08:58:54] <wikibugs>	 10ops-magru: Alert for device ps1-b4-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T402582#11114087 (10phaultfinder)
[08:58:58] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1040 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:00:16] <godog>	 that's me
[09:00:18] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1056 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:00:34] <icinga-wm>	 PROBLEM - nova-compute proc maximum on cloudvirt1070 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:00:42] <icinga-wm>	 PROBLEM - nova-compute proc maximum on cloudvirt1071 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:00:58] <icinga-wm>	 PROBLEM - nova-compute proc maximum on cloudvirt1069 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:01:22] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1051 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:01:52] <icinga-wm>	 PROBLEM - nova-compute proc maximum on cloudvirtlocal1003 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:01:54] <icinga-wm>	 PROBLEM - nova-compute proc maximum on cloudvirt1059 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:03:52] <icinga-wm>	 PROBLEM - nova-compute proc maximum on cloudvirt1056 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:04:42] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+2] haproxy: Provide basic X-Analytics data for blocked requests [puppet] - 10https://gerrit.wikimedia.org/r/1181131 (owner: 10Vgutierrez)
[09:04:42] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1059 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:04:54] <icinga-wm>	 RECOVERY - nova-compute proc maximum on cloudvirt1059 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:05:42] <icinga-wm>	 PROBLEM - nova-compute proc maximum on cloudvirt1051 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:07:17] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1181505 (https://phabricator.wikimedia.org/T396864) (owner: 10Ayounsi)
[09:08:54] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1063 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:09:18] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1046 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:09:18] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1056 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:09:19] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1062 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:09:22] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1045 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:09:22] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1050 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:09:23] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1051 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:09:24] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1058 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:09:25] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1053 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:09:26] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1076 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:09:27] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1064 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:09:28] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1054 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:09:29] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1052 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:09:30] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1057 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:09:31] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1068 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:09:32] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1047 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:09:33] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1074 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:09:34] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1065 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:09:35] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1049 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:09:36] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1043 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:09:37] <icinga-wm>	 RECOVERY - nova-compute proc maximum on cloudvirt1070 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:09:38] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1075 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:09:39] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1041 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:09:40] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1070 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:09:42] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1042 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:09:42] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1073 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:09:43] <icinga-wm>	 RECOVERY - nova-compute proc maximum on cloudvirt1071 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:09:44] <icinga-wm>	 RECOVERY - nova-compute proc maximum on cloudvirt1051 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:09:45] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1067 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:09:47] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1071 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:09:54] <icinga-wm>	 RECOVERY - nova-compute proc maximum on cloudvirt1056 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:09:54] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1069 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:09:58] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1061 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:09:58] <icinga-wm>	 RECOVERY - nova-compute proc maximum on cloudvirt1069 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:09:59] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1072 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:10:02] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1048 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:10:23] <godog>	 should be recovering fully soon
[09:10:30] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1065 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:10:59] <wikibugs>	 (03PS1) 10Máté Szabó: hcaptcha: Implement more restrictive CSP [puppet] - 10https://gerrit.wikimedia.org/r/1181644
[09:11:22] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1076 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:11:25] <wikibugs>	 (03CR) 10CI reject: [V:04-1] hcaptcha: Implement more restrictive CSP [puppet] - 10https://gerrit.wikimedia.org/r/1181644 (owner: 10Máté Szabó)
[09:11:26] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirtlocal1002 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:11:27] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1054 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:11:28] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1068 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:11:28] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1074 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:11:34] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1049 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:11:35] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1043 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:11:35] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1075 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:11:42] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1042 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:11:54] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1063 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:11:58] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1061 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:11:58] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1072 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:12:02] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1048 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:12:18] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1046 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:12:18] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1062 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:12:22] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1053 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:12:22] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1050 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:12:23] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1058 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:12:24] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1045 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:12:25] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1064 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:12:26] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1052 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:12:28] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1057 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:12:28] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1047 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:12:34] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1041 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:12:42] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1073 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:12:42] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1067 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:12:53] <wikibugs>	 (03PS2) 10Máté Szabó: hcaptcha: Implement more restrictive CSP [puppet] - 10https://gerrit.wikimedia.org/r/1181644
[09:13:51] <wikibugs>	 (03PS6) 10Máté Szabó: hcaptcha: Only pass hmt_id cookie to upstream [puppet] - 10https://gerrit.wikimedia.org/r/1181128 (https://phabricator.wikimedia.org/T402713)
[09:14:46] <icinga-wm>	 PROBLEM - nova-compute proc maximum on cloudvirtlocal1002 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:14:54] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirtlocal1001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:16:36] <wikibugs>	 (03CR) 10Máté Szabó: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1181128 (https://phabricator.wikimedia.org/T402713) (owner: 10Máté Szabó)
[09:17:01] <wikibugs>	 (03PS3) 10Máté Szabó: hcaptcha: Implement more restrictive CSP [puppet] - 10https://gerrit.wikimedia.org/r/1181644
[09:17:12] <wikibugs>	 (03CR) 10Máté Szabó: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1181644 (owner: 10Máté Szabó)
[09:18:00] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirtlocal1003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:18:26] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirtlocal1002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:18:46] <icinga-wm>	 RECOVERY - nova-compute proc maximum on cloudvirtlocal1002 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:18:54] <icinga-wm>	 RECOVERY - nova-compute proc maximum on cloudvirtlocal1003 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:18:55] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirtlocal1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:19:50] <wikibugs>	 (03PS1) 10Muehlenhoff: Point webproxy in eqsin to install5003 [dns] - 10https://gerrit.wikimedia.org/r/1181645 (https://phabricator.wikimedia.org/T396487)
[09:19:52] <wikibugs>	 (03PS1) 10Muehlenhoff: Apply installserver role to install5003 [puppet] - 10https://gerrit.wikimedia.org/r/1181646 (https://phabricator.wikimedia.org/T396487)
[09:19:53] <wikibugs>	 (03PS4) 10Máté Szabó: hcaptcha: Implement more restrictive CSP [puppet] - 10https://gerrit.wikimedia.org/r/1181644
[09:19:54] <wikibugs>	 (03PS1) 10Muehlenhoff: Failover DHCP server in eqsin to install5003 [puppet] - 10https://gerrit.wikimedia.org/r/1181647 (https://phabricator.wikimedia.org/T396487)
[09:20:24] <wikibugs>	 (03CR) 10Máté Szabó: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1181644 (owner: 10Máté Szabó)
[09:22:38] <wikibugs>	 (03PS1) 10Vgutierrez: haproxy: Fix x-analytics for requestctl blocked requests [puppet] - 10https://gerrit.wikimedia.org/r/1181648
[09:25:23] <wikibugs>	 (03CR) 10Slyngshede: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1181648 (owner: 10Vgutierrez)
[09:25:31] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+2] haproxy: Fix x-analytics for requestctl blocked requests [puppet] - 10https://gerrit.wikimedia.org/r/1181648 (owner: 10Vgutierrez)
[09:29:15] <jinxer-wm>	 FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.75s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[09:32:34] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2179.codfw.wmnet with reason: Maintenance
[09:32:41] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2179 (T399249)', diff saved to https://phabricator.wikimedia.org/P81734 and previous config saved to /var/cache/conftool/dbconfig/20250825-093241-fceratto.json
[09:32:46] <stashbot>	 T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249
[09:34:15] <jinxer-wm>	 RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.75s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[09:35:45] <wikibugs>	 (03PS1) 10Muehlenhoff: Update install server in ulsfo [homer/public] - 10https://gerrit.wikimedia.org/r/1181651
[09:36:19] <wikibugs>	 (03PS1) 10Muehlenhoff: Update install server in eqsin [homer/public] - 10https://gerrit.wikimedia.org/r/1181652
[09:37:19] <wikibugs>	 (03CR) 10Ayounsi: [C:03+1] Failover DHCP server in eqsin to install5003 [puppet] - 10https://gerrit.wikimedia.org/r/1181647 (https://phabricator.wikimedia.org/T396487) (owner: 10Muehlenhoff)
[09:37:39] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+1] P:cache::haproxy block generic user-agents [puppet] - 10https://gerrit.wikimedia.org/r/1181639 (https://phabricator.wikimedia.org/T400119) (owner: 10Slyngshede)
[09:37:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:37:49] <wikibugs>	 (03CR) 10Ayounsi: [C:03+1] Update install server in ulsfo [homer/public] - 10https://gerrit.wikimedia.org/r/1181651 (owner: 10Muehlenhoff)
[09:38:04] <wikibugs>	 (03CR) 10Ayounsi: [C:03+1] Update install server in eqsin [homer/public] - 10https://gerrit.wikimedia.org/r/1181652 (owner: 10Muehlenhoff)
[09:41:41] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Update install server in ulsfo [homer/public] - 10https://gerrit.wikimedia.org/r/1181651 (owner: 10Muehlenhoff)
[09:41:48] <wikibugs>	 (03PS1) 10Brouberol: provision the mysql analytics research password in the analytics-ml HDFS home [puppet] - 10https://gerrit.wikimedia.org/r/1181653 (https://phabricator.wikimedia.org/T398950)
[09:41:57] <wikibugs>	 (03PS13) 10Arnaudb: gerrit: mod qos configuration [puppet] - 10https://gerrit.wikimedia.org/r/1181124 (https://phabricator.wikimedia.org/T402611)
[09:41:57] <wikibugs>	 (03CR) 10Arnaudb: "this change adds mod_qos to gerrit's httpd reverse proxy configuration" [puppet] - 10https://gerrit.wikimedia.org/r/1181124 (https://phabricator.wikimedia.org/T402611) (owner: 10Arnaudb)
[09:43:33] <wikibugs>	 (03PS1) 10Zabe: Set categorylinks to read new on commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1181655 (https://phabricator.wikimedia.org/T397912)
[09:44:00] <wikibugs>	 (03CR) 10Slyngshede: [C:03+2] P:cache::haproxy block generic user-agents [puppet] - 10https://gerrit.wikimedia.org/r/1181639 (https://phabricator.wikimedia.org/T400119) (owner: 10Slyngshede)
[09:44:17] <wikibugs>	 (03CR) 10Brouberol: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1181653 (https://phabricator.wikimedia.org/T398950) (owner: 10Brouberol)
[09:46:09] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-ml: apply
[09:46:41] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-ml: apply
[09:47:26] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-ml: apply
[09:48:38] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Supermicro incorrectly exposing LinkStatus in Redfish - https://phabricator.wikimedia.org/T400034#11114185 (10ayounsi) Thanks, fyi with that firmware I replied the following:  > The LinkStatus is now missing : >  >   {'@odata.etag': '"12722e886e91fe533b80e55b5bbd72ee"', >...
[09:48:49] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-ml: apply
[09:48:53] <urbanecm>	 jouncebot: nowandnext
[09:48:53] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 11 minute(s)
[09:48:53] <jouncebot>	 In 0 hour(s) and 11 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250825T1000)
[09:49:08] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to Superset dashboards for mszwarc - https://phabricator.wikimedia.org/T402779 (10mszwarc) 03NEW
[09:49:25] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-ml: apply
[09:49:31] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-ml: apply
[09:50:08] <zabe>	 jouncebot: nowandnext
[09:50:08] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 9 minute(s)
[09:50:08] <jouncebot>	 In 0 hour(s) and 9 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250825T1000)
[09:50:14] <wikibugs>	 (03CR) 10Zabe: [C:03+2] Set categorylinks to read new on commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1181655 (https://phabricator.wikimedia.org/T397912) (owner: 10Zabe)
[09:50:16] <urbanecm>	 zabe: trying to deploy sth atm
[09:50:24] <zabe>	 alright
[09:50:29] <wikibugs>	 (03CR) 10Urbanecm: [C:04-2] "will conflict with my deployment" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1181655 (https://phabricator.wikimedia.org/T397912) (owner: 10Zabe)
[09:50:29] <zabe>	 will wait
[09:50:31] <urbanecm>	 ty
[09:50:33] <wikibugs>	 (03PS1) 10Brouberol: airflow-ml: fix typo in storage quantity and storage class name [deployment-charts] - 10https://gerrit.wikimedia.org/r/1181656 (https://phabricator.wikimedia.org/T396495)
[09:50:36] <wikibugs>	 (03CR) 10Urbanecm: Set categorylinks to read new on commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1181655 (https://phabricator.wikimedia.org/T397912) (owner: 10Zabe)
[09:52:00] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to Superset dashboards for mszwarc - https://phabricator.wikimedia.org/T402779#11114199 (10OKryva-WMF) Approve as a Marcin's EM.
[09:52:02] <logmsgbot>	 !log urbanecm@deploy1003 Started scap sync-world: Deploying a security patch (T402698, T402600)
[09:52:33] <jinxer-wm>	 FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[09:52:55] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts install4002.wikimedia.org
[09:54:08] <logmsgbot>	 !log urbanecm@deploy1003 Finished scap sync-world: Deploying a security patch (T402698, T402600) (duration: 02m 06s)
[09:54:18] <urbanecm>	 verifying
[09:55:16] <wikibugs>	 (03PS1) 10Hnowlan: thumbor: use native subprocess timeouts [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1181657 (https://phabricator.wikimedia.org/T379569)
[09:55:23] <wikibugs>	 (03CR) 10CI reject: [V:04-1] thumbor: use native subprocess timeouts [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1181657 (https://phabricator.wikimedia.org/T379569) (owner: 10Hnowlan)
[09:55:59] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] airflow-ml: fix typo in storage quantity and storage class name [deployment-charts] - 10https://gerrit.wikimedia.org/r/1181656 (https://phabricator.wikimedia.org/T396495) (owner: 10Brouberol)
[09:57:12] <urbanecm>	 hmm...i am doing something wrong
[09:57:33] <jinxer-wm>	 RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[09:57:43] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[09:58:13] <wikibugs>	 (03PS1) 10Brouberol: airflow-ml: drop -pvc suffix in the PVC name [deployment-charts] - 10https://gerrit.wikimedia.org/r/1181658 (https://phabricator.wikimedia.org/T396495)
[09:58:21] <wikibugs>	 (03PS3) 10Tiziano Fogli: nrpe2nodexp: remove file under node.d for disabled checks [puppet] - 10https://gerrit.wikimedia.org/r/1181650 (https://phabricator.wikimedia.org/T395446)
[09:58:21] <wikibugs>	 (03CR) 10Tiziano Fogli: "It ensures that the cache of any disabled check is removed from /var/lib/prometheus/node.d. This prevents the NodeTextfileStale alert from" [puppet] - 10https://gerrit.wikimedia.org/r/1181650 (https://phabricator.wikimedia.org/T395446) (owner: 10Tiziano Fogli)
[09:58:25] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+2] benthos: Verify TLS cert of kafka brokers on webrequest_live [puppet] - 10https://gerrit.wikimedia.org/r/1181101 (https://phabricator.wikimedia.org/T291905) (owner: 10Vgutierrez)
[10:00:04] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250825T1000)
[10:00:08] <wikibugs>	 (03CR) 10Ozge: [C:03+1] provision the mysql analytics research password in the analytics-ml HDFS home [puppet] - 10https://gerrit.wikimedia.org/r/1181653 (https://phabricator.wikimedia.org/T398950) (owner: 10Brouberol)
[10:00:21] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] provision the mysql analytics research password in the analytics-ml HDFS home [puppet] - 10https://gerrit.wikimedia.org/r/1181653 (https://phabricator.wikimedia.org/T398950) (owner: 10Brouberol)
[10:00:45] <logmsgbot>	 !log urbanecm@deploy1003 Started scap sync-world: Deploying a security patch (T402698, T402600)
[10:00:46] <urbanecm>	 one more time
[10:01:38] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: install4002.wikimedia.org decommissioned, removing all IPs except the asset tag one - jmm@cumin2002"
[10:02:01] <wikibugs>	 (03PS4) 10Tiziano Fogli: nrpe2nodexp: remove file under node.d for disabled checks [puppet] - 10https://gerrit.wikimedia.org/r/1181650 (https://phabricator.wikimedia.org/T395446)
[10:02:16] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: install4002.wikimedia.org decommissioned, removing all IPs except the asset tag one - jmm@cumin2002"
[10:02:16] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[10:02:18] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts install4002.wikimedia.org
[10:02:27] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 13Patch-For-Review: Move install servers to Bookworm - https://phabricator.wikimedia.org/T396487#11114237 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `install4002.wikimedia.org` - install4002.wikimedia.org (**PASS**)   - Do...
[10:05:20] <wikibugs>	 (03CR) 10Gkyziridis: [C:03+1] "Thank you" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1181658 (https://phabricator.wikimedia.org/T396495) (owner: 10Brouberol)
[10:05:39] <wikibugs>	 (03PS2) 10Volans: ServiceOps: simplify Phabricator usage [cookbooks] - 10https://gerrit.wikimedia.org/r/1167888
[10:05:45] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] airflow-ml: drop -pvc suffix in the PVC name [deployment-charts] - 10https://gerrit.wikimedia.org/r/1181658 (https://phabricator.wikimedia.org/T396495) (owner: 10Brouberol)
[10:06:29] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-ml: apply
[10:06:54] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-ml: apply
[10:07:03] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-ml: apply
[10:07:09] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-ml: apply
[10:08:12] <wikibugs>	 (03PS1) 10Brouberol: airflow-ml: fix typo in PVC class name [deployment-charts] - 10https://gerrit.wikimedia.org/r/1181660
[10:08:57] <wikibugs>	 (03CR) 10Hnowlan: [C:03+2] imagemagick: ignore all py3exiv2 exceptions [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1101154 (https://phabricator.wikimedia.org/T381594) (owner: 10AntiCompositeNumber)
[10:09:07] <logmsgbot>	 !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
[10:09:34] <wikibugs>	 (03PS2) 10Hnowlan: thumbor: use native subprocess timeouts [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1181657 (https://phabricator.wikimedia.org/T379569)
[10:09:59] <logmsgbot>	 !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
[10:15:51] <logmsgbot>	 !log urbanecm@deploy1003 Finished scap sync-world: Deploying a security patch (T402698, T402600) (duration: 15m 06s)
[10:15:54] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] airflow-ml: fix typo in PVC class name [deployment-charts] - 10https://gerrit.wikimedia.org/r/1181660 (owner: 10Brouberol)
[10:17:16] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+1] ServiceOps: simplify Phabricator usage [cookbooks] - 10https://gerrit.wikimedia.org/r/1167888 (owner: 10Volans)
[10:20:07] <wikibugs>	 (03Merged) 10jenkins-bot: imagemagick: ignore all py3exiv2 exceptions [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1101154 (https://phabricator.wikimedia.org/T381594) (owner: 10AntiCompositeNumber)
[10:21:17] <wikibugs>	 (03CR) 10CI reject: [V:04-1] thumbor: use native subprocess timeouts [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1181657 (https://phabricator.wikimedia.org/T379569) (owner: 10Hnowlan)
[10:22:46] <logmsgbot>	 !log urbanecm@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-experimental: apply
[10:23:36] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+1] nrpe2nodexp: remove file under node.d for disabled checks [puppet] - 10https://gerrit.wikimedia.org/r/1181650 (https://phabricator.wikimedia.org/T395446) (owner: 10Tiziano Fogli)
[10:24:50] <logmsgbot>	 !log urbanecm@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-experimental: apply
[10:27:51] <wikibugs>	 (03CR) 10Tiziano Fogli: [C:03+2] nrpe2nodexp: remove file under node.d for disabled checks [puppet] - 10https://gerrit.wikimedia.org/r/1181650 (https://phabricator.wikimedia.org/T395446) (owner: 10Tiziano Fogli)
[10:30:47] <moritzm>	 !log installing postgresql-13 security updates
[10:30:49] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 13Patch-For-Review: Move install servers to Bookworm - https://phabricator.wikimedia.org/T396487#11114309 (10MoritzMuehlenhoff)
[10:30:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:35:34] <kostajh>	 jouncebot: nowandnext
[10:35:34] <jouncebot>	 For the next 0 hour(s) and 24 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250825T1000)
[10:35:34] <jouncebot>	 In 2 hour(s) and 24 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250825T1300)
[10:36:01] <wikibugs>	 (03PS1) 10Kosta Harlan: hcaptcha: Instrument siteverify API call [extensions/ConfirmEdit] (wmf/1.45.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1181664 (https://phabricator.wikimedia.org/T402492)
[10:36:15] <wikibugs>	 (03PS1) 10Kosta Harlan: hCaptcha: Log errors to Logstash [extensions/ConfirmEdit] (wmf/1.45.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1181665 (https://phabricator.wikimedia.org/T402767)
[10:36:57] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy1003 using scap backport" [extensions/ConfirmEdit] (wmf/1.45.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1181664 (https://phabricator.wikimedia.org/T402492) (owner: 10Kosta Harlan)
[10:36:57] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy1003 using scap backport" [extensions/ConfirmEdit] (wmf/1.45.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1181665 (https://phabricator.wikimedia.org/T402767) (owner: 10Kosta Harlan)
[10:42:18] <wikibugs>	 (03PS1) 10Hnowlan: thumbor: new version with permissive pyexiv error handling [deployment-charts] - 10https://gerrit.wikimedia.org/r/1181666 (https://phabricator.wikimedia.org/T381594)
[10:44:47] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+1] thumbor: new version with permissive pyexiv error handling [deployment-charts] - 10https://gerrit.wikimedia.org/r/1181666 (https://phabricator.wikimedia.org/T381594) (owner: 10Hnowlan)
[10:45:29] <wikibugs>	 (03CR) 10Hnowlan: [C:03+2] thumbor: new version with permissive pyexiv error handling [deployment-charts] - 10https://gerrit.wikimedia.org/r/1181666 (https://phabricator.wikimedia.org/T381594) (owner: 10Hnowlan)
[10:46:45] <jinxer-wm>	 FIRING: WidespreadPuppetFailure: Puppet has failed in ulsfo - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[10:47:27] <wikibugs>	 (03Merged) 10jenkins-bot: thumbor: new version with permissive pyexiv error handling [deployment-charts] - 10https://gerrit.wikimedia.org/r/1181666 (https://phabricator.wikimedia.org/T381594) (owner: 10Hnowlan)
[10:47:48] <moritzm>	 !log installing openjdk-17 security updates
[10:47:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:48:05] <moritzm>	 ^ the alert for ulsfo is expected and will recover shortly
[10:49:10] <wikibugs>	 (03Merged) 10jenkins-bot: hcaptcha: Instrument siteverify API call [extensions/ConfirmEdit] (wmf/1.45.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1181664 (https://phabricator.wikimedia.org/T402492) (owner: 10Kosta Harlan)
[10:49:45] <wikibugs>	 (03PS1) 10Hnowlan: thumbor: remove staging version pin [deployment-charts] - 10https://gerrit.wikimedia.org/r/1181667
[10:50:01] <wikibugs>	 (03Merged) 10jenkins-bot: hCaptcha: Log errors to Logstash [extensions/ConfirmEdit] (wmf/1.45.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1181665 (https://phabricator.wikimedia.org/T402767) (owner: 10Kosta Harlan)
[10:50:21] <logmsgbot>	 !log kharlan@deploy1003 Started scap sync-world: Backport for [[gerrit:1181664|hcaptcha: Instrument siteverify API call (T402492)]], [[gerrit:1181665|hCaptcha: Log errors to Logstash (T402767)]]
[10:50:27] <stashbot>	 T402492: hCaptcha: Instrument call to /siteverify - https://phabricator.wikimedia.org/T402492
[10:50:27] <stashbot>	 T402767: hCaptcha: Log hCaptcha error codes to Logstash - https://phabricator.wikimedia.org/T402767
[10:50:51] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Apply installserver role to install5003 [puppet] - 10https://gerrit.wikimedia.org/r/1181646 (https://phabricator.wikimedia.org/T396487) (owner: 10Muehlenhoff)
[10:52:20] <wikibugs>	 (03CR) 10Hnowlan: [C:03+2] thumbor: remove staging version pin [deployment-charts] - 10https://gerrit.wikimedia.org/r/1181667 (owner: 10Hnowlan)
[10:53:56] <wikibugs>	 (03Merged) 10jenkins-bot: thumbor: remove staging version pin [deployment-charts] - 10https://gerrit.wikimedia.org/r/1181667 (owner: 10Hnowlan)
[10:54:47] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [staging] START helmfile.d/services/thumbor: apply
[10:54:54] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [staging] DONE helmfile.d/services/thumbor: apply
[10:56:32] <logmsgbot>	 !log kharlan@deploy1003 kharlan: Backport for [[gerrit:1181664|hcaptcha: Instrument siteverify API call (T402492)]], [[gerrit:1181665|hCaptcha: Log errors to Logstash (T402767)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[10:56:41] <stashbot>	 T402492: hCaptcha: Instrument call to /siteverify - https://phabricator.wikimedia.org/T402492
[10:56:41] <stashbot>	 T402767: hCaptcha: Log hCaptcha error codes to Logstash - https://phabricator.wikimedia.org/T402767
[10:57:56] <jinxer-wm>	 FIRING: [4x] KubernetesRsyslogDown: rsyslog on dse-k8s-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[10:58:07] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [codfw] START helmfile.d/services/thumbor: apply
[10:58:16] <logmsgbot>	 !log kharlan@deploy1003 kharlan: Continuing with sync
[11:01:57] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [codfw] DONE helmfile.d/services/thumbor: apply
[11:03:22] <wikibugs>	 (03PS1) 10Urbanecm: [Growth] wikidata: Preconfigure for limited Growth features release [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1181669 (https://phabricator.wikimedia.org/T400937)
[11:04:34] <jinxer-wm>	 FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown
[11:04:34] <jinxer-wm>	 FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag
[11:04:48] <logmsgbot>	 !log kharlan@deploy1003 Finished scap sync-world: Backport for [[gerrit:1181664|hcaptcha: Instrument siteverify API call (T402492)]], [[gerrit:1181665|hCaptcha: Log errors to Logstash (T402767)]] (duration: 14m 26s)
[11:04:54] <stashbot>	 T402492: hCaptcha: Instrument call to /siteverify - https://phabricator.wikimedia.org/T402492
[11:04:54] <stashbot>	 T402767: hCaptcha: Log hCaptcha error codes to Logstash - https://phabricator.wikimedia.org/T402767
[11:09:23] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/thumbor: apply
[11:11:05] <icinga-wm>	 PROBLEM - HTTP on install5003 is CRITICAL: connect to address 103.102.166.11 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Install_servers
[11:12:16] <moritzm>	 ^ install5003 is WIP, will resolve soon
[11:12:25] <wikibugs>	 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10Mail, and 2 others: Replace Spamassassin with Rspam for VRTS on Postfix - https://phabricator.wikimedia.org/T402260#11114408 (10ABran-WMF)
[11:12:51] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/thumbor: apply
[11:14:54] <jinxer-wm>	 FIRING: [2x] TransitBGPDown: Transit BGP session down between cr2-esams and LibertyGlobal (2001:730:2209:1::d52e:ba09) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[11:15:05] <icinga-wm>	 RECOVERY - HTTP on install5003 is OK: HTTP OK: HTTP/1.1 200 OK - 848 bytes in 0.449 second response time https://wikitech.wikimedia.org/wiki/Install_servers
[11:15:06] <jinxer-wm>	 FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-esams:xe-0/1/7 (Transit: Liberty Global (BB00088) {#021468}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-esams:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[11:16:45] <jinxer-wm>	 RESOLVED: WidespreadPuppetFailure: Puppet has failed in ulsfo - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[11:21:03] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Update install server in eqsin [homer/public] - 10https://gerrit.wikimedia.org/r/1181652 (owner: 10Muehlenhoff)
[11:23:11] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Point webproxy in eqsin to install5003 [dns] - 10https://gerrit.wikimedia.org/r/1181645 (https://phabricator.wikimedia.org/T396487) (owner: 10Muehlenhoff)
[11:23:16] <logmsgbot>	 !log jmm@dns1004 START - running authdns-update
[11:24:24] <logmsgbot>	 !log jmm@dns1004 END - running authdns-update
[11:24:57] <wikibugs>	 (03PS1) 10Clément Goubert: mw_experimental: Fix PuppetConstantChange alert [puppet] - 10https://gerrit.wikimedia.org/r/1181673 (https://phabricator.wikimedia.org/T396767)
[11:29:34] <jinxer-wm>	 FIRING: NetworkDeviceAlarmActive: Alarm active on ssw1-f1-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive
[11:33:38] <jinxer-wm>	 FIRING: [2x] ProbeDown: Ripe Atlas anchor atlas5001:80 is not returning HTTP 200 OK on port 80 - https://wikitech.wikimedia.org/wiki/RIPE_Atlas#HTTP_checks_failing - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:36:18] <wikibugs>	 (03PS3) 10FNegri: maintain-views: Stop providing rc_new and rc_type to replicas [puppet] - 10https://gerrit.wikimedia.org/r/1178899 (https://phabricator.wikimedia.org/T36320) (owner: 10Zabe)
[11:37:03] <wikibugs>	 (03PS1) 10Máté Szabó: hcaptcha: Add proxied CSP reporting endpoint [puppet] - 10https://gerrit.wikimedia.org/r/1181675
[11:38:07] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Failover DHCP server in eqsin to install5003 [puppet] - 10https://gerrit.wikimedia.org/r/1181647 (https://phabricator.wikimedia.org/T396487) (owner: 10Muehlenhoff)
[11:41:18] <wikibugs>	 (03PS2) 10Máté Szabó: hcaptcha: Add proxied CSP reporting endpoint [puppet] - 10https://gerrit.wikimedia.org/r/1181675
[11:41:36] <wikibugs>	 (03CR) 10Máté Szabó: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1181675 (owner: 10Máté Szabó)
[11:45:58] <wikibugs>	 (03CR) 10Clément Goubert: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1181673 (https://phabricator.wikimedia.org/T396767) (owner: 10Clément Goubert)
[11:56:01] <wikibugs>	 (03PS1) 10Filippo Giunchedi: openstack: switch libvirt live migration uri to cloud-private hostnames [puppet] - 10https://gerrit.wikimedia.org/r/1181676 (https://phabricator.wikimedia.org/T355145)
[11:57:05] <wikibugs>	 (03CR) 10Filippo Giunchedi: "Tested in codfw1dev for correctness, live migration happens over the expected hostnames, e.g." [puppet] - 10https://gerrit.wikimedia.org/r/1181676 (https://phabricator.wikimedia.org/T355145) (owner: 10Filippo Giunchedi)
[11:58:39] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Ripe Atlas anchor atlas5001:80 is not returning HTTP 200 OK on port 80 - https://wikitech.wikimedia.org/wiki/RIPE_Atlas#HTTP_checks_failing - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:01:53] <wikibugs>	 10SRE-swift-storage, 10Thumbor: Gradually drop all thumbnails as a one-off clean up - https://phabricator.wikimedia.org/T379942#11114554 (10Ladsgroup)
[12:03:16] <jinxer-wm>	 FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 2.098s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[12:04:45] <jinxer-wm>	 FIRING: WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[12:07:25] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:08:47] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Blacklist orangefs [puppet] - 10https://gerrit.wikimedia.org/r/1181110 (owner: 10Muehlenhoff)
[12:09:31] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[12:15:05] <wikibugs>	 (03PS1) 10Slyngshede: P:cache::varnish::frontend user-agent rate limit cleanup [puppet] - 10https://gerrit.wikimedia.org/r/1181679 (https://phabricator.wikimedia.org/T400119)
[12:18:16] <jinxer-wm>	 RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.615s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[12:23:07] <hashar>	 !log Restarted CI Jenkins to update some plugins
[12:23:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:28:16] <jinxer-wm>	 FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.068s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[12:28:46] <wikibugs>	 (03CR) 10Ayounsi: [C:03+2] Routed ganeti: improve firewalling [puppet] - 10https://gerrit.wikimedia.org/r/1180579 (https://phabricator.wikimedia.org/T402372) (owner: 10Ayounsi)
[12:33:00] <wikibugs>	 (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1181687
[12:34:45] <jinxer-wm>	 RESOLVED: WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[12:36:05] <wikibugs>	 (03CR) 10Michael Große: [C:03+1] [Growth] wikidata: Preconfigure for limited Growth features release [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1181669 (https://phabricator.wikimedia.org/T400937) (owner: 10Urbanecm)
[12:36:39] <wikibugs>	 (03PS1) 10STran: mediawiki: Run CheckUser/revokeTemporaryAccountViewerGroup.php [puppet] - 10https://gerrit.wikimedia.org/r/1181689 (https://phabricator.wikimedia.org/T375115)
[12:38:59] <wikibugs>	 (03PS1) 10Bartosz Dziewoński: PHPSessionHandler: In warn mode, report the changed keys [core] (wmf/1.45.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1181695 (https://phabricator.wikimedia.org/T400668)
[12:40:39] <wikibugs>	 (03CR) 10STran: mediawiki: Run CheckUser/revokeTemporaryAccountViewerGroup.php (3 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1181689 (https://phabricator.wikimedia.org/T375115) (owner: 10STran)
[12:43:16] <jinxer-wm>	 RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.449s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[12:43:42] <wikibugs>	 (03PS1) 10Ayounsi: Routed ganeti: fix nftables typoes [puppet] - 10https://gerrit.wikimedia.org/r/1181696 (https://phabricator.wikimedia.org/T402372)
[12:44:07] <wikibugs>	 (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1181696 (https://phabricator.wikimedia.org/T402372) (owner: 10Ayounsi)
[12:44:27] <wikibugs>	 (03PS1) 10Bartosz Dziewoński: Set wgPHPSessionHandling to 'warn' again [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1181697 (https://phabricator.wikimedia.org/T362324)
[12:44:44] <wikibugs>	 (03PS2) 10Bartosz Dziewoński: Set wgPHPSessionHandling to 'warn' again [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1181697 (https://phabricator.wikimedia.org/T362324)
[12:45:01] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, August 25 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [core] (wmf/1.45.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1181695 (https://phabricator.wikimedia.org/T400668) (owner: 10Bartosz Dziewoński)
[12:45:12] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, August 25 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1181697 (https://phabricator.wikimedia.org/T362324) (owner: 10Bartosz Dziewoński)
[12:51:02] <wikibugs>	 (03CR) 10Ayounsi: [C:03+2] Routed ganeti: fix nftables typoes [puppet] - 10https://gerrit.wikimedia.org/r/1181696 (https://phabricator.wikimedia.org/T402372) (owner: 10Ayounsi)
[12:55:16] <jinxer-wm>	 FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.392s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[12:55:27] <wikibugs>	 (03CR) 10Slyngshede: [C:03+1] "Looks correct in local tests." [puppet] - 10https://gerrit.wikimedia.org/r/1181133 (owner: 10Vgutierrez)
[12:55:39] <wikibugs>	 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Install new disk controllers to SM swift backends (codfw) - https://phabricator.wikimedia.org/T400876#11114728 (10Jhancock.wm) I'll be in today to do these. was OoO last week.
[12:58:53] <wikibugs>	 (03PS3) 10Bartosz Dziewoński: Set wgPHPSessionHandling to 'warn' again [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1181697 (https://phabricator.wikimedia.org/T362324)
[12:58:59] <wikibugs>	 10ops-magru: Alert for device ps1-b3-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T402581#11114735 (10phaultfinder)
[13:00:05] <jouncebot>	 Lucas_WMDE, Urbanecm, and TheresNoTime: Time to do the UTC afternoon backport window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250825T1300).
[13:00:05] <jouncebot>	 MatmaRex: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:16] <jinxer-wm>	 RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.29s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[13:00:41] <MatmaRex>	 hi
[13:01:27] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+2] haproxy: Stop sending X-Analytics-TLS to varnish [puppet] - 10https://gerrit.wikimedia.org/r/1181133 (owner: 10Vgutierrez)
[13:02:25] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Q4: eqiad: (4) PDUs for future fundraising racks - https://phabricator.wikimedia.org/T400779#11114753 (10Jclark-ctr) Connected Mgmt to m sw in rack  E11 and console to msw2-eqiad
[13:03:53] <wikibugs>	 10ops-magru: Alert for device ps1-b4-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T402582#11114754 (10phaultfinder)
[13:04:01] <Lucas_WMDE>	 o/
[13:05:12] <Lucas_WMDE>	 I can deploy
[13:05:44] <Lucas_WMDE>	 MatmaRex: should the two changes be deployed together?
[13:06:03] <MatmaRex>	 Lucas_WMDE: yes please
[13:06:05] <MatmaRex>	 thanks :)
[13:06:08] <Lucas_WMDE>	 alright
[13:06:42] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [core] (wmf/1.45.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1181695 (https://phabricator.wikimedia.org/T400668) (owner: 10Bartosz Dziewoński)
[13:06:43] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1181697 (https://phabricator.wikimedia.org/T362324) (owner: 10Bartosz Dziewoński)
[13:07:10] <wikibugs>	 10ops-eqiad, 06SRE, 06cloud-services-team, 06DC-Ops: Put cloudcephosd10[42-47] in service - https://phabricator.wikimedia.org/T401693#11114777 (10Andrew)
[13:07:15] <jinxer-wm>	 FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.777s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[13:07:25] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:07:39] <wikibugs>	 (03Merged) 10jenkins-bot: Set wgPHPSessionHandling to 'warn' again [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1181697 (https://phabricator.wikimedia.org/T362324) (owner: 10Bartosz Dziewoński)
[13:09:31] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[13:11:07] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2034.codfw.wmnet
[13:11:07] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti2034.codfw.wmnet
[13:11:21] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Q2:rack/setup/install cloudcephosd10[42-47] - https://phabricator.wikimedia.org/T378828#11114786 (10cmooney) >>! In T378828#11111862, @Andrew wrote: > This is getting very close! I still see ping failures with cloudcephosd1045, probably because the second network connection isn'...
[13:11:29] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2034.codfw.wmnet
[13:11:30] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti2034.codfw.wmnet
[13:12:15] <jinxer-wm>	 RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.777s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[13:12:32] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2034.codfw.wmnet
[13:13:19] <wikibugs>	 (03Merged) 10jenkins-bot: PHPSessionHandler: In warn mode, report the changed keys [core] (wmf/1.45.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1181695 (https://phabricator.wikimedia.org/T400668) (owner: 10Bartosz Dziewoński)
[13:13:36] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1181695|PHPSessionHandler: In warn mode, report the changed keys (T400668)]], [[gerrit:1181697|Set wgPHPSessionHandling to 'warn' again (T362324)]]
[13:13:42] <stashbot>	 T400668: Debug warnings that were recorded with $wgPHPSessionHandling = 'warn' in WMF production - https://phabricator.wikimedia.org/T400668
[13:13:42] <stashbot>	 T362324: Disable PHPSessionHandler in Wikimedia production - https://phabricator.wikimedia.org/T362324
[13:14:25] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.dns.netbox
[13:16:54] <wikibugs>	 (03CR) 10Máté Szabó: "Boldly tagging Hugh for review per last week, please untag or delegate if the assignment is no longer appropriate!" [puppet] - 10https://gerrit.wikimedia.org/r/1181644 (owner: 10Máté Szabó)
[13:17:02] <wikibugs>	 (03CR) 10Máté Szabó: "Boldly tagging Hugh for review per last week, please untag or delegate if the assignment is no longer appropriate!" [puppet] - 10https://gerrit.wikimedia.org/r/1181675 (owner: 10Máté Szabó)
[13:17:09] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[13:17:14] <wikibugs>	 (03CR) 10Máté Szabó: "Boldly tagging Hugh for review per last week, please untag or delegate if the assignment is no longer appropriate!" [puppet] - 10https://gerrit.wikimedia.org/r/1181128 (https://phabricator.wikimedia.org/T402713) (owner: 10Máté Szabó)
[13:18:06] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2034.codfw.wmnet
[13:20:04] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde, matmarex: Backport for [[gerrit:1181695|PHPSessionHandler: In warn mode, report the changed keys (T400668)]], [[gerrit:1181697|Set wgPHPSessionHandling to 'warn' again (T362324)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[13:20:10] <stashbot>	 T400668: Debug warnings that were recorded with $wgPHPSessionHandling = 'warn' in WMF production - https://phabricator.wikimedia.org/T400668
[13:20:10] <stashbot>	 T362324: Disable PHPSessionHandler in Wikimedia production - https://phabricator.wikimedia.org/T362324
[13:21:12] <MatmaRex>	 Lucas_WMDE: seems good
[13:21:30] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde, matmarex: Continuing with sync
[13:21:30] <Lucas_WMDE>	 yay
[13:22:09] * Lucas_WMDE sees some INFO but no WARNING in mwdebug lostsah
[13:22:12] <Lucas_WMDE>	 *logstash
[13:22:47] <MatmaRex>	 i'm not sure how to trigger the logging, i just checked that i can log in
[13:23:05] <icinga-wm>	 PROBLEM - Squid on install1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/HTTP_proxy
[13:23:20] <Lucas_WMDE>	 ok
[13:23:20] <MatmaRex>	 one case i know of involved being IP blocked and visiting a wiki where you don't have a local account, which is a bit complex
[13:23:39] <jinxer-wm>	 FIRING: [4x] ProbeDown: Ripe Atlas anchor atlas1001:80 is not returning HTTP 200 OK on port 80  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:23:58] <wikibugs>	 (03CR) 10Kosta Harlan: "Seems reasonable to me." [puppet] - 10https://gerrit.wikimedia.org/r/1181128 (https://phabricator.wikimedia.org/T402713) (owner: 10Máté Szabó)
[13:24:05] <MatmaRex>	 looks like we have one real log entry already
[13:24:21] <MatmaRex>	 https://logstash.wikimedia.org/goto/ba89d8cfaaadca0b6af9953af7b408e4
[13:24:33] <MatmaRex>	 (which is exactly the case i said, heh)
[13:25:12] <Lucas_WMDE>	 hehe neat
[13:25:37] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops: Investigate using BGP addpath for unicast IBGP spine/leaf pods - https://phabricator.wikimedia.org/T402640#11114837 (10cmooney)
[13:26:40] <wikibugs>	 06SRE, 06Traffic, 13Patch-For-Review, 07User-notice: Block traffic from user-agents not honoring our policy - https://phabricator.wikimedia.org/T400119#11114838 (10TheDJ) There's reports that this breaks command line download of mediawiki tarballs via https://releases.wikimedia.org/mediawiki/1.44/  That se...
[13:26:50] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 Finished scap sync-world: Backport for [[gerrit:1181695|PHPSessionHandler: In warn mode, report the changed keys (T400668)]], [[gerrit:1181697|Set wgPHPSessionHandling to 'warn' again (T362324)]] (duration: 13m 14s)
[13:26:56] <stashbot>	 T400668: Debug warnings that were recorded with $wgPHPSessionHandling = 'warn' in WMF production - https://phabricator.wikimedia.org/T400668
[13:26:56] <stashbot>	 T362324: Disable PHPSessionHandler in Wikimedia production - https://phabricator.wikimedia.org/T362324
[13:27:48] <Lucas_WMDE>	 !log UTC afternoon backport+config window done
[13:27:51] <icinga-wm>	 PROBLEM - SSH on install1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[13:27:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:30:11] <MatmaRex>	 thanks Lucas_WMDE
[13:31:35] <Lucas_WMDE>	 np :)
[13:32:33] <wikibugs>	 (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1181185 (https://phabricator.wikimedia.org/T402317) (owner: 10Andrew Bogott)
[13:32:55] <icinga-wm>	 PROBLEM - grafana-rw.wikimedia.org requires authentication on grafana1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[13:34:16] <sukhe>	 grafana is down
[13:35:51] <icinga-wm>	 RECOVERY - SSH on install1004 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[13:36:51] <jinxer-wm>	 FIRING: ATSBackendErrorsHigh: ATS: elevated 5xx errors from wdqs-main.discovery.wmnet in esams #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=esams&var-cluster=text&var-origin=wdqs-main.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[13:36:56] <sukhe>	 ah
[13:37:07] <sukhe>	 but well no, that's unrelated
[13:37:10] <claime>	 I'm around
[13:37:11] <sukhe>	 !incidents
[13:37:12] <sirenbot>	 6699 (UNACKED)  ATSBackendErrorsHigh cache_text sre (wdqs-main.discovery.wmnet esams)
[13:37:12] <sirenbot>	 6646 (RESOLVED)  db1238 (paged)/MariaDB Replica Lag: s4 (paged)
[13:37:12] <sirenbot>	 6643 (RESOLVED)  db1221 (paged)/MariaDB Replica Lag: s4 (paged)
[13:37:12] <sirenbot>	 6642 (RESOLVED)  db1243 (paged)/MariaDB Replica Lag: s4 (paged)
[13:37:12] <sirenbot>	 6644 (RESOLVED)  db1199 (paged)/MariaDB Replica Lag: s4 (paged)
[13:37:13] <sirenbot>	 6651 (RESOLVED)  db1242 (paged)/MariaDB Replica Lag: s4 (paged)
[13:37:13] <sirenbot>	 6652 (RESOLVED)  db2240 (paged)/MariaDB Replica Lag: s4 (paged)
[13:37:13] <sirenbot>	 6650 (RESOLVED)  db1190 (paged)/MariaDB Replica Lag: s4 (paged)
[13:37:14] <sirenbot>	 6649 (RESOLVED)  db1249 (paged)/MariaDB Replica Lag: s4 (paged)
[13:37:14] <sirenbot>	 6648 (RESOLVED)  db1247 (paged)/MariaDB Replica Lag: s4 (paged)
[13:37:15] <sirenbot>	 6647 (RESOLVED)  db1241 (paged)/MariaDB Replica Lag: s4 (paged)
[13:37:15] <sirenbot>	 6645 (RESOLVED)  db1248 (paged)/MariaDB Replica Lag: s4 (paged)
[13:37:16] <sirenbot>	 6641 (RESOLVED)  ProbeDown sre (10.2.2.24 ip4 thumbor:8800 probes/service http_thumbor_ip4 eqiad)
[13:37:16] <sirenbot>	 6640 (RESOLVED)  HaproxyUnavailable cache_upload global sre (thanos-rule)
[13:37:17] <sukhe>	 !ack 6699
[13:37:38] <claime>	 Although if grafana is down that's gonna be hard to check -_-
[13:37:39] <arnaudb>	 around as well
[13:37:55] <arnaudb>	 grafana is slow but responds here
[13:38:00] <sukhe>	 claime: it's slow but back
[13:38:07] <sukhe>	 I take it back
[13:38:14] <sukhe>	 tappof: grafana is still down
[13:38:51] <icinga-wm>	 PROBLEM - SSH on install1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[13:39:29] <wikibugs>	 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: Transition codfw data persistence external storage (es) hosts to 10G - https://phabricator.wikimedia.org/T399927#11114905 (10Jhancock.wm) @Ladsgroup  i have two proposals for es2039. 1) we leave it where it is and use port 43 on the switch. It'll be using a port that wo...
[13:39:33] <icinga-wm>	 PROBLEM - grafana-next-rw.wikimedia.org requires authentication on grafana1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[13:41:52] <arnaudb>	 I've been able to hit https://grafana-next-rw.wikimedia.org if needed
[13:42:29] <wikibugs>	 (03PS1) 10Muehlenhoff: Make wmf-update-known-hosts-production compatible with enforcement of robot policy [debs/wmf-laptop] - 10https://gerrit.wikimedia.org/r/1181709
[13:43:55] <wikibugs>	 (03CR) 10Kosta Harlan: "Looks like this was implemented in I3aff0a5be5a87fe01ee4f365b920d2c98e6e7cee" [puppet] - 10https://gerrit.wikimedia.org/r/1175876 (owner: 10Hnowlan)
[13:44:23] <icinga-wm>	 RECOVERY - grafana-next-rw.wikimedia.org requires authentication on grafana1002 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 562 bytes in 0.054 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[13:44:26] <tappof>	 sukhe: it's back
[13:44:38] <sukhe>	 thanks :) what was the secret sauce?
[13:44:45] <icinga-wm>	 RECOVERY - grafana-rw.wikimedia.org requires authentication on grafana1002 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 552 bytes in 0.051 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[13:44:50] <sukhe>	 asking in case it happens again, did you restart the service or something?
[13:45:52] <tappof>	 sukhe: I’ve just restarted the Apache service. I’ll look into the reason.
[13:46:27] <sukhe>	 thanks <3
[13:48:47] <icinga-wm>	 RECOVERY - SSH on install1004 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[13:48:47] <Lucas_WMDE>	 MatmaRex: fyi I can see the warnings in logspam-watch, it’s now the top warning (above https://phabricator.wikimedia.org/T304960) but at a manageable volume (~350 hits in the last ~half hour)
[13:48:55] <icinga-wm>	 RECOVERY - Squid on install1004 is OK: TCP OK - 0.000 second response time on 208.80.154.74 port 8080 https://wikitech.wikimedia.org/wiki/HTTP_proxy
[13:49:11] <wikibugs>	 06SRE, 06Traffic, 13Patch-For-Review, 07User-notice: Block traffic from user-agents not honoring our policy - https://phabricator.wikimedia.org/T400119#11114961 (10Bugreporter) >>! In T400119#11114838, @TheDJ wrote: > There's reports that this breaks command line download of mediawiki tarballs via https://...
[13:49:18] <claime>	 Looks like the 500s are wdqs timing out on mwapi request calls
[13:49:45] <wikibugs>	 (03CR) 10Volans: [C:03+1] "LGTM" [debs/wmf-laptop] - 10https://gerrit.wikimedia.org/r/1181709 (owner: 10Muehlenhoff)
[13:50:23] <arnaudb>	 https://logstash.wikimedia.org/goto/7f6edb2400b275c9fea9e077ef4e5221 the uri_path seems to match
[13:50:40] <wikibugs>	 (03CR) 10Volans: [C:03+1] "LGTM" [homer/mock-private] - 10https://gerrit.wikimedia.org/r/1180557 (https://phabricator.wikimedia.org/T402511) (owner: 10Cathal Mooney)
[13:51:49] <wikibugs>	 (03CR) 10Volans: [C:03+1] "LGTM" [homer/public] - 10https://gerrit.wikimedia.org/r/1181117 (owner: 10Ayounsi)
[13:51:51] <jinxer-wm>	 RESOLVED: ATSBackendErrorsHigh: ATS: elevated 5xx errors from wdqs-main.discovery.wmnet in esams #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-site=esams&var-cluster=text&var-origin=wdqs-main.discovery.wmnet - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[13:53:39] <jinxer-wm>	 RESOLVED: [4x] ProbeDown: Ripe Atlas anchor atlas1001:80 is not returning HTTP 200 OK on port 80  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:54:26] <wikibugs>	 (03CR) 10Muehlenhoff: [V:03+2 C:03+2] Make wmf-update-known-hosts-production compatible with enforcement of robot policy [debs/wmf-laptop] - 10https://gerrit.wikimedia.org/r/1181709 (owner: 10Muehlenhoff)
[13:55:43] <wikibugs>	 (03PS1) 10Muehlenhoff: wmf-laptop: Update changefog for 1.0.3 [debs/wmf-laptop] - 10https://gerrit.wikimedia.org/r/1181716
[13:57:17] <wikibugs>	 (03CR) 10Stevemunene: [C:03+1] mediawiki-dumps-legacy: adapt RBAC to a recent apache-airflow-providers-cncf-kubernetes upgrade [deployment-charts] - 10https://gerrit.wikimedia.org/r/1181519 (https://phabricator.wikimedia.org/T402529) (owner: 10Brouberol)
[13:57:26] <wikibugs>	 (03CR) 10Muehlenhoff: [V:03+2 C:03+2] wmf-laptop: Update changefog for 1.0.3 [debs/wmf-laptop] - 10https://gerrit.wikimedia.org/r/1181716 (owner: 10Muehlenhoff)
[13:59:27] <wikibugs>	 10ops-eqiad, 06SRE, 06cloud-services-team, 06DC-Ops: Put cloudcephosd10[42-47] in service - https://phabricator.wikimedia.org/T401693#11115023 (10Andrew)
[13:59:48] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] mediawiki-dumps-legacy: adapt RBAC to a recent apache-airflow-providers-cncf-kubernetes upgrade [deployment-charts] - 10https://gerrit.wikimedia.org/r/1181519 (https://phabricator.wikimedia.org/T402529) (owner: 10Brouberol)
[14:00:11] <wikibugs>	 (03CR) 10Brouberol: [C:03+1] Remove mention of an-druid100[1-2] [puppet] - 10https://gerrit.wikimedia.org/r/1180529 (https://phabricator.wikimedia.org/T401116) (owner: 10Stevemunene)
[14:03:53] <MatmaRex>	 Lucas_WMDE: yes, that was expected, i have patches in progress that will resolve it soon
[14:04:41] <MatmaRex>	 but i wanted to see if there are any lower-frequency warnings before we resolve them all
[14:04:49] <Lucas_WMDE>	 👍
[14:04:53] <MatmaRex>	 and wanted to be able to compare the log volume more easily
[14:05:16] <MatmaRex>	 since the logs from the last time have almost rotated out of logstash already
[14:06:10] <zabe>	 jouncebot: nowandnext
[14:06:10] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 23 minute(s)
[14:06:10] <jouncebot>	 In 0 hour(s) and 23 minute(s): xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250825T1430)
[14:08:09] <wikibugs>	 (03CR) 10Zabe: [C:03+2] Set categorylinks to read new on commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1181655 (https://phabricator.wikimedia.org/T397912) (owner: 10Zabe)
[14:08:56] <wikibugs>	 (03Merged) 10jenkins-bot: Set categorylinks to read new on commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1181655 (https://phabricator.wikimedia.org/T397912) (owner: 10Zabe)
[14:08:59] <wikibugs>	 06SRE, 06Infrastructure-Foundations: Move failoid to trixie - https://phabricator.wikimedia.org/T402406#11115040 (10MoritzMuehlenhoff) p:05Triage→03Medium a:03MoritzMuehlenhoff
[14:09:25] <logmsgbot>	 !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1181655|Set categorylinks to read new on commonswiki (T397912)]]
[14:09:30] <stashbot>	 T397912: Set categorylinks to read new - https://phabricator.wikimedia.org/T397912
[14:10:49] <wikibugs>	 06SRE, 06Traffic, 13Patch-For-Review, 07User-notice: Block traffic from user-agents not honoring our policy - https://phabricator.wikimedia.org/T400119#11115049 (10TheDJ) Yeah getting the swagger spec via `curl https://api.wikimedia.org/core/v1/wikipedia/en/search/page?q=earth&limit=10` also no longer work...
[14:12:50] <wikibugs>	 (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6710/co" [puppet] - 10https://gerrit.wikimedia.org/r/1181134 (owner: 10Vgutierrez)
[14:15:00] <logmsgbot>	 !log zabe@deploy1003 zabe: Backport for [[gerrit:1181655|Set categorylinks to read new on commonswiki (T397912)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[14:15:05] <stashbot>	 T397912: Set categorylinks to read new - https://phabricator.wikimedia.org/T397912
[14:15:23] <wikibugs>	 (03CR) 10Scott French: [C:03+1] mw_experimental: Fix PuppetConstantChange alert [puppet] - 10https://gerrit.wikimedia.org/r/1181673 (https://phabricator.wikimedia.org/T396767) (owner: 10Clément Goubert)
[14:15:34] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+2] mw_experimental: Fix PuppetConstantChange alert [puppet] - 10https://gerrit.wikimedia.org/r/1181673 (https://phabricator.wikimedia.org/T396767) (owner: 10Clément Goubert)
[14:16:08] <logmsgbot>	 !log zabe@deploy1003 zabe: Continuing with sync
[14:17:42] <wikibugs>	 (03CR) 10Ssingh: dnsrecursor: add recursor.yml.erb (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1169156 (https://phabricator.wikimedia.org/T381608) (owner: 10CDobbins)
[14:17:53] <wikibugs>	 (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6711/co" [puppet] - 10https://gerrit.wikimedia.org/r/1181134 (owner: 10Vgutierrez)
[14:20:50] <wikibugs>	 (03CR) 10Andrew Bogott: [C:03+1] "This looks good." [puppet] - 10https://gerrit.wikimedia.org/r/1181676 (https://phabricator.wikimedia.org/T355145) (owner: 10Filippo Giunchedi)
[14:21:22] <logmsgbot>	 !log zabe@deploy1003 Finished scap sync-world: Backport for [[gerrit:1181655|Set categorylinks to read new on commonswiki (T397912)]] (duration: 11m 56s)
[14:21:27] <stashbot>	 T397912: Set categorylinks to read new - https://phabricator.wikimedia.org/T397912
[14:21:55] <wikibugs>	 (03PS1) 10Slyngshede: PCC: Add user-agent to PCC util [puppet] - 10https://gerrit.wikimedia.org/r/1181719 (https://phabricator.wikimedia.org/T400119)
[14:22:20] <wikibugs>	 (03CR) 10CI reject: [V:04-1] PCC: Add user-agent to PCC util [puppet] - 10https://gerrit.wikimedia.org/r/1181719 (https://phabricator.wikimedia.org/T400119) (owner: 10Slyngshede)
[14:22:33] <icinga-wm>	 PROBLEM - Host ms-be2081 is DOWN: PING CRITICAL - Packet loss = 100%
[14:22:58] <wikibugs>	 (03PS2) 10Slyngshede: PCC: Add user-agent to PCC util [puppet] - 10https://gerrit.wikimedia.org/r/1181719 (https://phabricator.wikimedia.org/T400119)
[14:23:34] <wikibugs>	 (03PS1) 10Zabe: Stop writing to cl_to and cl_collation on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1181720 (https://phabricator.wikimedia.org/T399579)
[14:25:19] <wikibugs>	 (03CR) 10Vgutierrez: "given you only need to pass one cookie I think you could skip the map entirely and just use `proxy_set_header Cookie $http_cookie_hmt_id;`" [puppet] - 10https://gerrit.wikimedia.org/r/1181128 (https://phabricator.wikimedia.org/T402713) (owner: 10Máté Szabó)
[14:26:01] <icinga-wm>	 RECOVERY - Host ms-be2081 is UP: PING OK - Packet loss = 0%, RTA = 30.44 ms
[14:28:13] <wikibugs>	 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: Transition codfw data persistence external storage (es) hosts to 10G - https://phabricator.wikimedia.org/T399927#11115141 (10ayounsi) Changing its rack would also allow us to change its IP to per rack vlans: https://wikitech.wikimedia.org/wiki/Vlan_migration
[14:28:31] <wikibugs>	 (03CR) 10Ssingh: dnsrecursor: add recursor.yml.erb (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1169156 (https://phabricator.wikimedia.org/T381608) (owner: 10CDobbins)
[14:28:58] <wikibugs>	 (03CR) 10Ayounsi: [C:03+2] Add CI for python [homer/public] - 10https://gerrit.wikimedia.org/r/1181117 (owner: 10Ayounsi)
[14:30:05] <jouncebot>	 Deploy window xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250825T1430)
[14:30:33] <wikibugs>	 (03Merged) 10jenkins-bot: Add CI for python [homer/public] - 10https://gerrit.wikimedia.org/r/1181117 (owner: 10Ayounsi)
[14:31:29] <wikibugs>	 (03CR) 10Ozge: [C:03+1] "hello, thanks for working on this. We are looking forward to get approval for this patch in SRE IF meeting today. Please feel free to ask " [puppet] - 10https://gerrit.wikimedia.org/r/1180559 (https://phabricator.wikimedia.org/T400902) (owner: 10Brouberol)
[14:34:56] <wikibugs>	 (03PS1) 10Clément Goubert: Revert^2 "mw-cron: Disable memory limit" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1181722
[14:35:44] <wikibugs>	 (03PS2) 10Clément Goubert: Revert^2 "mw-cron: Disable memory limit" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1181722 (https://phabricator.wikimedia.org/T395893)
[14:36:10] <wikibugs>	 (03CR) 10Vgutierrez: PCC: Add user-agent to PCC util (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1181719 (https://phabricator.wikimedia.org/T400119) (owner: 10Slyngshede)
[14:37:43] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+2] Revert^2 "mw-cron: Disable memory limit" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1181722 (https://phabricator.wikimedia.org/T395893) (owner: 10Clément Goubert)
[14:38:37] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "Approved in the weekly SRE IF meeting" [puppet] - 10https://gerrit.wikimedia.org/r/1180559 (https://phabricator.wikimedia.org/T400902) (owner: 10Brouberol)
[14:38:58] <wikibugs>	 (03PS3) 10Slyngshede: PCC: Add user-agent to PCC util [puppet] - 10https://gerrit.wikimedia.org/r/1181719 (https://phabricator.wikimedia.org/T400119)
[14:39:08] <wikibugs>	 (03PS1) 10Vgutierrez: haproxy: Skip curl/wget from ua_policy:library_default [puppet] - 10https://gerrit.wikimedia.org/r/1181723 (https://phabricator.wikimedia.org/T400119)
[14:39:51] <wikibugs>	 (03Merged) 10jenkins-bot: Revert^2 "mw-cron: Disable memory limit" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1181722 (https://phabricator.wikimedia.org/T395893) (owner: 10Clément Goubert)
[14:39:56] <wikibugs>	 10ops-eqiad, 06SRE, 06cloud-services-team, 06DC-Ops: Put cloudcephosd10[42-47] in service - https://phabricator.wikimedia.org/T401693#11115223 (10Andrew)
[14:40:05] <wikibugs>	 (03CR) 10Slyngshede: PCC: Add user-agent to PCC util (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1181719 (https://phabricator.wikimedia.org/T400119) (owner: 10Slyngshede)
[14:40:18] <wikibugs>	 (03PS7) 10Krinkle: [WIP] varnish: Improve 08-mobile-hostnames-rewrite.vtc [puppet] - 10https://gerrit.wikimedia.org/r/1180969 (https://phabricator.wikimedia.org/T401595)
[14:41:00] <wikibugs>	 (03CR) 10Slyngshede: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1181723 (https://phabricator.wikimedia.org/T400119) (owner: 10Vgutierrez)
[14:41:17] <wikibugs>	 (03CR) 10Máté Szabó: "Thanks! Unfortunately it doesn't seem like it'd make things simpler because we'd need to set `hmt_id=$cookie_hmt_id` conditionally if `$co" [puppet] - 10https://gerrit.wikimedia.org/r/1181128 (https://phabricator.wikimedia.org/T402713) (owner: 10Máté Szabó)
[14:41:34] <wikibugs>	 (03CR) 10Ssingh: [C:03+1] haproxy: Skip curl/wget from ua_policy:library_default [puppet] - 10https://gerrit.wikimedia.org/r/1181723 (https://phabricator.wikimedia.org/T400119) (owner: 10Vgutierrez)
[14:41:53] <wikibugs>	 10ops-magru: Alert for device ps1-b3-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T402581#11115228 (10RobH) I don't see any sensor firing over '60' when it isn't quite clear what sensor they mean via this alert?
[14:42:38] <wikibugs>	 10ops-magru: Alert for device ps1-b3-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T402581#11115236 (10RobH) 05Open→03Resolved Other than perhaps the line frequency which now shows  Line Frequency: 60.0 Hz but perhaps it feed in at 60.1 at some poinut?  It is now flow...
[14:45:45] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+2] haproxy: Skip curl/wget from ua_policy:library_default [puppet] - 10https://gerrit.wikimedia.org/r/1181723 (https://phabricator.wikimedia.org/T400119) (owner: 10Vgutierrez)
[14:48:30] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+1] PCC: Add user-agent to PCC util [puppet] - 10https://gerrit.wikimedia.org/r/1181719 (https://phabricator.wikimedia.org/T400119) (owner: 10Slyngshede)
[14:49:05] <wikibugs>	 06SRE, 06Traffic, 13Patch-For-Review, 07User-notice: Block traffic from user-agents not honoring our policy - https://phabricator.wikimedia.org/T400119#11115266 (10Bugreporter) curl/wget should still be rate limited with 1/s.
[14:51:16] <wikibugs>	 (03CR) 10Vgutierrez: "let's keep curl|wget with their own tag, something like `ua_policy:cli_tool`" [puppet] - 10https://gerrit.wikimedia.org/r/1181679 (https://phabricator.wikimedia.org/T400119) (owner: 10Slyngshede)
[14:56:24] <wikibugs>	 06SRE, 06Traffic, 13Patch-For-Review, 07User-notice: Block traffic from user-agents not honoring our policy - https://phabricator.wikimedia.org/T400119#11115304 (10Vgutierrez)
[14:56:33] <logmsgbot>	 !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
[14:56:46] <Daimona>	 jouncebot: nowandnext
[14:56:46] <jouncebot>	 For the next 0 hour(s) and 3 minute(s): xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250825T1430)
[14:56:46] <jouncebot>	 In 0 hour(s) and 33 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250825T1530)
[14:57:20] <wikibugs>	 06SRE, 06Traffic, 13Patch-For-Review, 07User-notice: Block traffic from user-agents not honoring our policy - https://phabricator.wikimedia.org/T400119#11115308 (10Vgutierrez)
[14:57:56] <jinxer-wm>	 FIRING: [4x] KubernetesRsyslogDown: rsyslog on dse-k8s-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[14:57:59] <wikibugs>	 06SRE, 06Traffic, 13Patch-For-Review, 07User-notice: Block traffic from user-agents not honoring our policy - https://phabricator.wikimedia.org/T400119#11115313 (10Vgutierrez)
[14:58:26] <wikibugs>	 (03CR) 10Slyngshede: [C:03+2] PCC: Add user-agent to PCC util [puppet] - 10https://gerrit.wikimedia.org/r/1181719 (https://phabricator.wikimedia.org/T400119) (owner: 10Slyngshede)
[14:59:40] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.dns.netbox
[15:00:19] <logmsgbot>	 !log cgoubert@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-cron: apply
[15:00:58] <logmsgbot>	 !log cgoubert@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply
[15:03:14] <wikibugs>	 (03CR) 10Vgutierrez: "so proxy_set_header should only set the header if its value isn't an empty string, nginx doc says:" [puppet] - 10https://gerrit.wikimedia.org/r/1181128 (https://phabricator.wikimedia.org/T402713) (owner: 10Máté Szabó)
[15:03:32] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update dns for frac pdus  - jclark@cumin1002"
[15:03:37] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update dns for frac pdus  - jclark@cumin1002"
[15:03:37] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:04:32] <Daimona>	 Hey folks, I need to run a couple queries in production to fix some logspam due to invalid stored data. I put the queries in T402239#11115333 (at the end of the comment). May I go ahead?
[15:04:32] <stashbot>	 T402239: RuntimeException: Event 1836 should have only one address. - https://phabricator.wikimedia.org/T402239
[15:04:34] <jinxer-wm>	 FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown
[15:04:34] <jinxer-wm>	 FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag
[15:04:58] <wikibugs>	 (03CR) 10Máté Szabó: "But the value needs to be `hmt_id=$cookie_hmt_id` because the variable won't include the cookie name and equals sign, so it'd still need t" [puppet] - 10https://gerrit.wikimedia.org/r/1181128 (https://phabricator.wikimedia.org/T402713) (owner: 10Máté Szabó)
[15:05:56] <moritzm>	 !log imported wmf-laptop 1.0.3 to apt.wikimedia.org
[15:05:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:08:39] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:11:09] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install frdata2002 & frmx2002 - https://phabricator.wikimedia.org/T400275#11115384 (10Jhancock.wm) a:05Jgreen→03Papaul
[15:11:52] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install frdata2002 & frmx2002 - https://phabricator.wikimedia.org/T400275#11115390 (10Jhancock.wm) @Papaul these servers are ready for your part. mgmt ips are pingable.
[15:12:43] <logmsgbot>	 !log jhancock@cumin1003 START - Cookbook sre.dns.netbox
[15:13:04] <wikibugs>	 10ops-magru, 06DC-Ops, 06Traffic: planned power redundancy depreciation 2025-09-20 @ 18:00 GMT to 2025-09-21 @ 21:00 GMT - https://phabricator.wikimedia.org/T402818 (10RobH) 03NEW p:05Triage→03Medium
[15:13:26] <wikibugs>	 (03CR) 10Stevemunene: [C:03+2] Remove mention of an-druid100[1-2] [puppet] - 10https://gerrit.wikimedia.org/r/1180529 (https://phabricator.wikimedia.org/T401116) (owner: 10Stevemunene)
[15:14:54] <jinxer-wm>	 FIRING: [2x] TransitBGPDown: Transit BGP session down between cr2-esams and LibertyGlobal (2001:730:2209:1::d52e:ba09) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[15:15:06] <jinxer-wm>	 FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-esams:xe-0/1/7 (Transit: Liberty Global (BB00088) {#021468}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-esams:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[15:16:37] <icinga-wm>	 PROBLEM - Druid broker on an-druid1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server broker https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid
[15:16:45] <icinga-wm>	 PROBLEM - Druid historical on an-druid1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid
[15:16:57] <icinga-wm>	 PROBLEM - Druid coordinator on an-druid1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server coordinator https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid
[15:16:57] <icinga-wm>	 PROBLEM - Druid overlord on an-druid1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server overlord https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid
[15:17:03] <icinga-wm>	 PROBLEM - Druid middlemanager on an-druid1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid
[15:17:38] <logmsgbot>	 !log jhancock@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: updating new frack mgmt ips - jhancock@cumin1003"
[15:17:42] <logmsgbot>	 !log jhancock@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: updating new frack mgmt ips - jhancock@cumin1003"
[15:17:42] <logmsgbot>	 !log jhancock@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:17:48] <logmsgbot>	 !log stevemunene@cumin1003 START - Cookbook sre.hosts.decommission for hosts an-druid1001.eqiad.wmnet
[15:24:36] <logmsgbot>	 !log stevemunene@cumin1003 START - Cookbook sre.dns.netbox
[15:26:16] <Daimona>	 Retrying since my message above got lost amongst the bot stuff. I need to run 3 queries in production to fix some logspam: T402239#11115333. I would like to go ahead shortly unless instructed otherwise.
[15:26:16] <stashbot>	 T402239: RuntimeException: Event 1836 should have only one address. - https://phabricator.wikimedia.org/T402239
[15:28:39] <jinxer-wm>	 FIRING: [4x] ProbeDown: Service install2004:8080 has failed probes (http_squid_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:28:39] <jinxer-wm>	 RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:29:06] <logmsgbot>	 !log stevemunene@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: an-druid1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - stevemunene@cumin1003"
[15:29:16] <wikibugs>	 (03PS1) 10Urbanecm: [beta] Growth: Enable on beta Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1181731 (https://phabricator.wikimedia.org/T400937)
[15:29:34] <jinxer-wm>	 FIRING: NetworkDeviceAlarmActive: Alarm active on ssw1-f1-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive
[15:29:37] <logmsgbot>	 !log stevemunene@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: an-druid1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - stevemunene@cumin1003"
[15:29:37] <logmsgbot>	 !log stevemunene@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:29:38] <logmsgbot>	 !log stevemunene@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts an-druid1001.eqiad.wmnet
[15:30:05] <jouncebot>	 jan_drewniak: OwO what's this, a deployment window?? Wikimedia Portals Update. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250825T1530). nyaa~
[15:31:18] <wikibugs>	 (03CR) 10Urbanecm: [Growth] wikidata: Preconfigure for limited Growth features release (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1181669 (https://phabricator.wikimedia.org/T400937) (owner: 10Urbanecm)
[15:31:41] <wikibugs>	 (03PS2) 10Urbanecm: [Growth] wikidata: Preconfigure for limited Growth features release [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1181669 (https://phabricator.wikimedia.org/T400937)
[15:32:18] <claime>	 Daimona: I think you should be fine. How long do you expect the queries to take (for information)?
[15:32:33] <Daimona>	 A split second ;)
[15:32:40] <claime>	 Fire away then
[15:33:34] <Daimona>	 !log Running queries from T402239#11115333 in x1.wikishared to fix broken event addresses
[15:33:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:33:39] <jinxer-wm>	 RESOLVED: [4x] ProbeDown: Service install2004:8080 has failed probes (http_squid_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:33:39] <stashbot>	 T402239: RuntimeException: Event 1836 should have only one address. - https://phabricator.wikimedia.org/T402239
[15:34:25] <Daimona>	 Done, thank you :)
[15:38:28] <wikibugs>	 (03PS2) 10Urbanecm: [beta] Growth: Enable on beta Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1181731 (https://phabricator.wikimedia.org/T400937)
[15:39:22] <wikibugs>	 (03CR) 10Urbanecm: [Growth] wikidata: Preconfigure for limited Growth features release (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1181669 (https://phabricator.wikimedia.org/T400937) (owner: 10Urbanecm)
[15:40:32] <icinga-wm>	 PROBLEM - Druid historical on an-druid1002 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid
[15:40:38] <icinga-wm>	 PROBLEM - Druid middlemanager on an-druid1002 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid
[15:40:46] <icinga-wm>	 PROBLEM - Druid overlord on an-druid1002 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server overlord https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid
[15:41:04] <icinga-wm>	 PROBLEM - Druid coordinator on an-druid1002 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server coordinator https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid
[15:41:14] <wikibugs>	 (03Abandoned) 10Hnowlan: profile::hcaptcha: add missing private configs to subdomains [puppet] - 10https://gerrit.wikimedia.org/r/1175876 (owner: 10Hnowlan)
[15:41:24] <icinga-wm>	 PROBLEM - Druid broker on an-druid1002 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server broker https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid
[15:41:26] <logmsgbot>	 !log stevemunene@cumin1003 START - Cookbook sre.hosts.decommission for hosts an-druid1002.eqiad.wmnet
[15:44:27] <wikibugs>	 (03CR) 10Hnowlan: [V:03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6713/console" [puppet] - 10https://gerrit.wikimedia.org/r/1181675 (owner: 10Máté Szabó)
[15:44:53] <wikibugs>	 (03PS1) 10Muehlenhoff: Assign installserver role to install6003 [puppet] - 10https://gerrit.wikimedia.org/r/1181732 (https://phabricator.wikimedia.org/T396487)
[15:44:55] <wikibugs>	 (03PS1) 10Muehlenhoff: Point DHCP server in drmrs to install6003 [puppet] - 10https://gerrit.wikimedia.org/r/1181733 (https://phabricator.wikimedia.org/T396487)
[15:45:02] <logmsgbot>	 !log jhancock@cumin1003 START - Cookbook sre.dns.netbox
[15:45:56] <wikibugs>	 (03PS1) 10Muehlenhoff: Update DHCP server in drmrs [homer/public] - 10https://gerrit.wikimedia.org/r/1181734 (https://phabricator.wikimedia.org/T396487)
[15:46:50] <wikibugs>	 (03PS1) 10Muehlenhoff: Point webproxy in drmrs to install6003 [dns] - 10https://gerrit.wikimedia.org/r/1181736 (https://phabricator.wikimedia.org/T396487)
[15:47:06] <wikibugs>	 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install frdata2002 & frmx2002 - https://phabricator.wikimedia.org/T400275#11115578 (10Jhancock.wm) a:05Papaul→03Jgreen @Jgreen I forgot about the new netbox script. The networking is set up on these for you. Let us know if you need any further assist...
[15:49:36] <logmsgbot>	 !log jhancock@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: updating new f servers in codfw - jhancock@cumin1003"
[15:49:54] <wikibugs>	 (03PS5) 10CDobbins: sre.loadbalancer: modify admin.py to accept 'reboot' action [cookbooks] - 10https://gerrit.wikimedia.org/r/1180137
[15:50:06] <logmsgbot>	 !log stevemunene@cumin1003 START - Cookbook sre.dns.netbox
[15:51:53] <wikibugs>	 (03CR) 10SBassett: [C:03+1] hcaptcha: Implement more restrictive CSP [puppet] - 10https://gerrit.wikimedia.org/r/1181644 (owner: 10Máté Szabó)
[15:52:41] <logmsgbot>	 jhancock@cumin1003 netbox (PID 3662257) is awaiting input
[15:52:52] <logmsgbot>	 !log stevemunene@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:52:53] <logmsgbot>	 !log stevemunene@cumin1003 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts an-druid1002.eqiad.wmnet
[15:54:01] <wikibugs>	 (03CR) 10Hnowlan: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6714/co" [puppet] - 10https://gerrit.wikimedia.org/r/1181675 (owner: 10Máté Szabó)
[15:55:05] <wikibugs>	 (03PS40) 10CDobbins: dnsrecursor: add recursor.yml.erb [puppet] - 10https://gerrit.wikimedia.org/r/1169156 (https://phabricator.wikimedia.org/T381608)
[15:58:01] <wikibugs>	 (03CR) 10Hnowlan: [V:03+1 C:03+1] "lgtm- a corresponding DNS change will be needed first, I can set that up." [puppet] - 10https://gerrit.wikimedia.org/r/1181675 (owner: 10Máté Szabó)
[15:59:47] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+1] "oh gotcha! you're totally right :)" [puppet] - 10https://gerrit.wikimedia.org/r/1181128 (https://phabricator.wikimedia.org/T402713) (owner: 10Máté Szabó)
[16:02:07] <wikibugs>	 (03PS1) 10Hnowlan: wikimedia.org: add hcaptcha-sentry CNAME [dns] - 10https://gerrit.wikimedia.org/r/1181739 (https://phabricator.wikimedia.org/T397841)
[16:02:21] <urbanecm>	 jouncebot: nowandnext
[16:02:21] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 57 minute(s)
[16:02:21] <jouncebot>	 In 0 hour(s) and 57 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250825T1700)
[16:02:22] <jouncebot>	 In 0 hour(s) and 57 minute(s): Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250825T1700)
[16:03:38] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1181669 (https://phabricator.wikimedia.org/T400937) (owner: 10Urbanecm)
[16:04:30] <wikibugs>	 (03Merged) 10jenkins-bot: [Growth] wikidata: Preconfigure for limited Growth features release [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1181669 (https://phabricator.wikimedia.org/T400937) (owner: 10Urbanecm)
[16:04:45] <logmsgbot>	 !log urbanecm@deploy1003 Started scap sync-world: Backport for [[gerrit:1181669|[Growth] wikidata: Preconfigure for limited Growth features release (T400937)]]
[16:04:51] <stashbot>	 T400937: Investigate Feasibility of Enabling Growth Features on Wikidata - https://phabricator.wikimedia.org/T400937
[16:07:07] <topranks>	 !log set unused FPC 0 line card to offline mode on cr1-codfw T401937
[16:07:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:07:12] <stashbot>	 T401937: codfw:cr* router power not balance on all 4 PEM's - https://phabricator.wikimedia.org/T401937
[16:09:05] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-d4-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T402759#11115710 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm
[16:09:11] <wikibugs>	 (03CR) 10Hnowlan: [V:03+1] "We will also need a record added to hieradata/common/profile/trafficserver/backend.yaml in this change to remap the domain in the same way" [puppet] - 10https://gerrit.wikimedia.org/r/1181675 (owner: 10Máté Szabó)
[16:10:38] <logmsgbot>	 !log urbanecm@deploy1003 urbanecm: Backport for [[gerrit:1181669|[Growth] wikidata: Preconfigure for limited Growth features release (T400937)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[16:10:43] <stashbot>	 T400937: Investigate Feasibility of Enabling Growth Features on Wikidata - https://phabricator.wikimedia.org/T400937
[16:11:15] <logmsgbot>	 !log urbanecm@deploy1003 urbanecm: Continuing with sync
[16:12:27] <wikibugs>	 (03PS3) 10Urbanecm: [beta] Growth: Enable on beta Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1181731 (https://phabricator.wikimedia.org/T400937)
[16:13:50] <logmsgbot>	 !log cmooney@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:20:00 on cr1-codfw with reason: suppress alerts so we can re-seat one of the PSUs
[16:13:57] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netbox: codfw:cr* router power not balance on all 4 PEM's - https://phabricator.wikimedia.org/T401937#11115746 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=47d79845-d3b9-4b1e-af6c-788acd3f696b) set by cmooney@cumin1003 f...
[16:16:34] <logmsgbot>	 !log urbanecm@deploy1003 Finished scap sync-world: Backport for [[gerrit:1181669|[Growth] wikidata: Preconfigure for limited Growth features release (T400937)]] (duration: 11m 49s)
[16:16:39] <stashbot>	 T400937: Investigate Feasibility of Enabling Growth Features on Wikidata - https://phabricator.wikimedia.org/T400937
[16:17:05] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1181731 (https://phabricator.wikimedia.org/T400937) (owner: 10Urbanecm)
[16:17:22] <wikibugs>	 (03CR) 10BCornwall: [C:03+1] Point webproxy in drmrs to install6003 [dns] - 10https://gerrit.wikimedia.org/r/1181736 (https://phabricator.wikimedia.org/T396487) (owner: 10Muehlenhoff)
[16:17:57] <wikibugs>	 (03Merged) 10jenkins-bot: [beta] Growth: Enable on beta Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1181731 (https://phabricator.wikimedia.org/T400937) (owner: 10Urbanecm)
[16:19:10] <wikibugs>	 (03CR) 10BCornwall: [C:03+1] wikimedia.org: add hcaptcha-sentry CNAME [dns] - 10https://gerrit.wikimedia.org/r/1181739 (https://phabricator.wikimedia.org/T397841) (owner: 10Hnowlan)
[16:19:49] <wikibugs>	 06SRE, 10envoy, 06serviceops, 06Traffic: Upgrade Envoy to v1.26.8 and drop buster - https://phabricator.wikimedia.org/T402584#11115783 (10RLazarus) >>! In T402584#11113754, @MoritzMuehlenhoff wrote: > We also have 237 baremetal hosts with Envoy, how shall we handle these? We could e.g. add a profile parame...
[16:28:32] <wikibugs>	 (03CR) 10Volans: "I've left some suggestions on the potential abstractions to be more DRY inline. None of the suggestions is a blocker and feel free to igno" [homer/public] - 10https://gerrit.wikimedia.org/r/1180562 (https://phabricator.wikimedia.org/T402511) (owner: 10Cathal Mooney)
[16:34:22] <wikibugs>	 (03PS1) 10Ssingh: wikidata.org: adding additional TXT verification [dns] - 10https://gerrit.wikimedia.org/r/1181742
[16:34:45] <wikibugs>	 (03CR) 10Hnowlan: [C:03+2] wikimedia.org: add hcaptcha-sentry CNAME [dns] - 10https://gerrit.wikimedia.org/r/1181739 (https://phabricator.wikimedia.org/T397841) (owner: 10Hnowlan)
[16:35:07] <logmsgbot>	 !log hnowlan@dns1004 START - running authdns-update
[16:35:08] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Q4: eqiad: (4) PDUs for future fundraising racks - https://phabricator.wikimedia.org/T400779#11115886 (10Jclark-ctr) The correction and console connect to scs-f8-eqiad has been completed. NetBox records have been updated accordingly. An IP address has been successfully assigned....
[16:35:23] <wikibugs>	 (03PS1) 10Jdlrobson: Revert "Temporarily use production for summary endpoint" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1181743
[16:35:38] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Q4: eqiad: (4) PDUs for future fundraising racks - https://phabricator.wikimedia.org/T400779#11115890 (10Jclark-ctr)
[16:36:21] <logmsgbot>	 !log hnowlan@dns1004 END - running authdns-update
[16:36:28] <wikibugs>	 (03CR) 10BCornwall: [C:03+1] wikidata.org: adding additional TXT verification [dns] - 10https://gerrit.wikimedia.org/r/1181742 (owner: 10Ssingh)
[16:36:57] <wikibugs>	 (03CR) 10Ssingh: [C:03+2] wikidata.org: adding additional TXT verification [dns] - 10https://gerrit.wikimedia.org/r/1181742 (owner: 10Ssingh)
[16:37:06] <logmsgbot>	 !log sukhe@dns1004 START - running authdns-update
[16:37:29] <wikibugs>	 (03CR) 10Hnowlan: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6715/co" [puppet] - 10https://gerrit.wikimedia.org/r/1181644 (owner: 10Máté Szabó)
[16:38:15] <logmsgbot>	 !log sukhe@dns1004 END - running authdns-update
[16:38:56] <wikibugs>	 (03CR) 10Hnowlan: [V:03+1 C:03+1] hcaptcha: Implement more restrictive CSP [puppet] - 10https://gerrit.wikimedia.org/r/1181644 (owner: 10Máté Szabó)
[16:43:10] <logmsgbot>	 !log jhancock@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: updating new f servers in codfw - jhancock@cumin1003"
[16:43:10] <logmsgbot>	 !log jhancock@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[16:45:01] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netbox: codfw:cr* router power not balance on all 4 PEM's - https://phabricator.wikimedia.org/T401937#11115927 (10Papaul) Case open with Juniper ` Case Number  2025-0825-829681
[16:46:06] <wikibugs>	 (03PS2) 10Vgutierrez: ncredir,benthos: Move processors to the pipeline section [puppet] - 10https://gerrit.wikimedia.org/r/1030010 (https://phabricator.wikimedia.org/T364379)
[16:46:32] <wikibugs>	 (03CR) 10CI reject: [V:04-1] ncredir,benthos: Move processors to the pipeline section [puppet] - 10https://gerrit.wikimedia.org/r/1030010 (https://phabricator.wikimedia.org/T364379) (owner: 10Vgutierrez)
[16:51:07] <wikibugs>	 (03PS2) 10Krinkle: Enable wmgUseMdotRouting in Beta Cluster for testwiki only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1181310 (https://phabricator.wikimedia.org/T401595)
[16:51:49] <wikibugs>	 (03CR) 10Hnowlan: [C:03+1] hcaptcha: Only pass hmt_id cookie to upstream [puppet] - 10https://gerrit.wikimedia.org/r/1181128 (https://phabricator.wikimedia.org/T402713) (owner: 10Máté Szabó)
[16:51:57] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Enable wmgUseMdotRouting in Beta Cluster for testwiki only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1181310 (https://phabricator.wikimedia.org/T401595) (owner: 10Krinkle)
[16:52:31] <wikibugs>	 06SRE, 10SRE-Access-Requests: Request to add dsaez to analytics-research-admins - https://phabricator.wikimedia.org/T400344#11115983 (10FCeratto-WMF) Hello @Miriam, sorry for the recurrent ask, could you please approve @diego's request for membership in analytics-research-admins? Thank you
[16:56:41] <logmsgbot>	 !log jhancock@cumin1003 START - Cookbook sre.hosts.reimage for host deploy2003.codfw.wmnet with OS bookworm
[16:56:50] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install deploy2003 - https://phabricator.wikimedia.org/T400485#11116003 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin1003 for host deploy2003.codfw.wmnet with OS bookworm
[16:59:40] <wikibugs>	 (03PS1) 10Herron: wip [puppet] - 10https://gerrit.wikimedia.org/r/1181745
[17:00:05] <jouncebot>	 swfrench-wmf: I, the Bot under the Fountain, call upon thee, The Deployer, to do MediaWiki infrastructure (UTC late) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250825T1700).
[17:00:05] <jouncebot>	 ryankemper: It is that lovely time of the day again! You are hereby commanded to deploy Wikidata Query Service weekly deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250825T1700).
[17:00:13] <swfrench-wmf>	 o/
[17:01:19] <wikibugs>	 (03CR) 10Scott French: "Thanks for the review!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1181149 (https://phabricator.wikimedia.org/T401721) (owner: 10Scott French)
[17:01:23] <wikibugs>	 (03CR) 10Scott French: [C:03+2] mediawiki: clean up php.version overrides [deployment-charts] - 10https://gerrit.wikimedia.org/r/1181149 (https://phabricator.wikimedia.org/T401721) (owner: 10Scott French)
[17:03:56] <wikibugs>	 10ops-magru: Alert for device ps1-b3-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T402835 (10phaultfinder) 03NEW
[17:04:32] <wikibugs>	 (03Merged) 10jenkins-bot: mediawiki: clean up php.version overrides [deployment-charts] - 10https://gerrit.wikimedia.org/r/1181149 (https://phabricator.wikimedia.org/T401721) (owner: 10Scott French)
[17:05:25] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to Superset dashboards for mszwarc - https://phabricator.wikimedia.org/T402779#11116039 (10FCeratto-WMF) 05Open→03In progress p:05Triage→03Medium
[17:06:18] * swfrench-wmf is waiting for chartmuseum ...
[17:06:45] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: codfw netbox cable cleanup - https://phabricator.wikimedia.org/T402535#11116042 (10Jhancock.wm) 05Open→03Resolved
[17:07:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:08:45] <wikibugs>	 (03PS5) 10Bernard Wang: Update vector search config with new wgVectorTypeahead [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1179750 (https://phabricator.wikimedia.org/T397084)
[17:08:57] <wikibugs>	 10ops-magru: Alert for device ps1-b4-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T402582#11116046 (10phaultfinder)
[17:09:35] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Update vector search config with new wgVectorTypeahead [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1179750 (https://phabricator.wikimedia.org/T397084) (owner: 10Bernard Wang)
[17:13:54] <wikibugs>	 (03PS1) 10Scott French: mediawiki: bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1181747 (https://phabricator.wikimedia.org/T401721)
[17:14:17] * swfrench-wmf shakes fist at chart version
[17:16:44] <wikibugs>	 (03CR) 10RLazarus: [C:03+1] mediawiki: bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1181747 (https://phabricator.wikimedia.org/T401721) (owner: 10Scott French)
[17:17:01] <wikibugs>	 (03CR) 10Scott French: [C:03+2] mediawiki: bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1181747 (https://phabricator.wikimedia.org/T401721) (owner: 10Scott French)
[17:17:30] <wikibugs>	 (03CR) 10Kosta Harlan: [C:03+1] "When could we deploy this?" [puppet] - 10https://gerrit.wikimedia.org/r/1181128 (https://phabricator.wikimedia.org/T402713) (owner: 10Máté Szabó)
[17:19:55] <wikibugs>	 (03Merged) 10jenkins-bot: mediawiki: bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1181747 (https://phabricator.wikimedia.org/T401721) (owner: 10Scott French)
[17:20:06] * swfrench-wmf is *actually* waiting for chartmuseum ...
[17:21:48] <wikibugs>	 (03CR) 10Ssingh: "I think it looks good but let's run PCC on both the hosts (dns1004, doh1001) to confirm." [puppet] - 10https://gerrit.wikimedia.org/r/1169156 (https://phabricator.wikimedia.org/T381608) (owner: 10CDobbins)
[17:22:28] <wikibugs>	 (03PS10) 10Bking: opensearch-operator: Add chart for review [deployment-charts] - 10https://gerrit.wikimedia.org/r/1174038 (https://phabricator.wikimedia.org/T397246)
[17:24:08] <logmsgbot>	 !log swfrench@deploy1003 Started scap sync-world: Helmfile-only deployment for php.version override cleanup - T401721
[17:24:13] <stashbot>	 T401721: Provide MediaWiki app image PHP version in helm values - https://phabricator.wikimedia.org/T401721
[17:24:27] <wikibugs>	 (03CR) 10Bking: opensearch-operator: Add chart for review (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1174038 (https://phabricator.wikimedia.org/T397246) (owner: 10Bking)
[17:26:29] <logmsgbot>	 !log swfrench@deploy1003 Finished scap sync-world: Helmfile-only deployment for php.version override cleanup - T401721 (duration: 03m 34s)
[17:27:10] <swfrench-wmf>	 no additional items planned on my end for this infra window
[17:33:58] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2179 (T399249)', diff saved to https://phabricator.wikimedia.org/P81735 and previous config saved to /var/cache/conftool/dbconfig/20250825-173358-fceratto.json
[17:34:03] <stashbot>	 T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249
[17:44:53] <wikibugs>	 (03CR) 10Ssingh: "I think we can abandon this?" [puppet] - 10https://gerrit.wikimedia.org/r/1172056 (https://phabricator.wikimedia.org/T381608) (owner: 10CDobbins)
[17:49:06] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P81736 and previous config saved to /var/cache/conftool/dbconfig/20250825-174905-fceratto.json
[18:02:06] <wikibugs>	 (03PS1) 10Jasmine: wikikube: Add wikikube-worker2[248-330] [puppet] - 10https://gerrit.wikimedia.org/r/1181753 (https://phabricator.wikimedia.org/T390859)
[18:02:35] <wikibugs>	 (03CR) 10CI reject: [V:04-1] wikikube: Add wikikube-worker2[248-330] [puppet] - 10https://gerrit.wikimedia.org/r/1181753 (https://phabricator.wikimedia.org/T390859) (owner: 10Jasmine)
[18:04:13] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P81737 and previous config saved to /var/cache/conftool/dbconfig/20250825-180413-fceratto.json
[18:16:29] <logmsgbot>	 !log jhancock@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host deploy2003.codfw.wmnet with OS bookworm
[18:16:38] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install deploy2003 - https://phabricator.wikimedia.org/T400485#11116267 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin1003 for host deploy2003.codfw.wmnet with OS bookworm executed with errors: - deploy2003 (**...
[18:19:21] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2179 (T399249)', diff saved to https://phabricator.wikimedia.org/P81739 and previous config saved to /var/cache/conftool/dbconfig/20250825-181920-fceratto.json
[18:19:26] <stashbot>	 T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249
[18:23:45] <wikibugs>	 (03PS1) 10David Caro: wmcs-enc-cli: update client params [puppet] - 10https://gerrit.wikimedia.org/r/1181756
[18:25:47] <wikibugs>	 (03CR) 10Andrew Bogott: [C:03+1] "I don't know what work 'enabled' was doing there but we tend to delete/replace endpoints so this should be fine in our setup regardless." [puppet] - 10https://gerrit.wikimedia.org/r/1181756 (owner: 10David Caro)
[18:45:02] <wikibugs>	 (03CR) 10David Caro: [C:03+2] wmcs-enc-cli: update client params [puppet] - 10https://gerrit.wikimedia.org/r/1181756 (owner: 10David Caro)
[18:50:53] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to Superset dashboards for mszwarc - https://phabricator.wikimedia.org/T402779#11116357 (10SCherukuwada) Ollie's status in Dayforce is not up-to-date. Skip-level manager approving.
[18:57:56] <jinxer-wm>	 FIRING: [4x] KubernetesRsyslogDown: rsyslog on dse-k8s-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[18:58:04] <wikibugs>	 (03CR) 10Ssingh: "Leaving to Valenti.n for the final say; some initial thoughts:" [cookbooks] - 10https://gerrit.wikimedia.org/r/1180137 (owner: 10CDobbins)
[18:58:12] <wikibugs>	 (03CR) 10CDobbins: dnsrecursor: add recursor.yml.erb (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1169156 (https://phabricator.wikimedia.org/T381608) (owner: 10CDobbins)
[19:01:13] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, August 25 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1180229 (https://phabricator.wikimedia.org/T402349) (owner: 10Arlolra)
[19:01:33] <logmsgbot>	 !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host deploy2003.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[19:04:34] <jinxer-wm>	 FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag
[19:04:34] <jinxer-wm>	 FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown
[19:09:12] <wikibugs>	 06SRE, 10Maps, 06Traffic: Allow Wikimedia Maps usage on <domain> - https://phabricator.wikimedia.org/T402846 (10GuidoSP) 03NEW Closing this task as invalid due to missing information.
[19:09:41] <logmsgbot>	 !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host deploy2003.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[19:10:04] <wikibugs>	 (03CR) 10Ssingh: "https://puppet-compiler.wmflabs.org/output/1169156/6727/doh1001.wikimedia.org/fulldiff.html" [puppet] - 10https://gerrit.wikimedia.org/r/1169156 (https://phabricator.wikimedia.org/T381608) (owner: 10CDobbins)
[19:12:26] <wikibugs>	 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install frdata2002 & frmx2002 - https://phabricator.wikimedia.org/T400275#11116443 (10Jgreen) Hi @Jhancock.wm I'm not able to ping frdata2002's management interface, is it up on the IP that is in DNS?  I'm able to ssh to frmx2002, but where can get the p...
[19:12:44] <sukhe>	 I suspect Gerrit is unhappy
[19:14:09] <logmsgbot>	 !log jhancock@cumin1003 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['deploy2003']
[19:14:28] <logmsgbot>	 !log jhancock@cumin1003 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['deploy2003']
[19:14:31] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:14:34] <sukhe>	 yeah...
[19:14:54] <jinxer-wm>	 FIRING: [2x] TransitBGPDown: Transit BGP session down between cr2-esams and LibertyGlobal (2001:730:2209:1::d52e:ba09) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[19:15:01] <logmsgbot>	 !log jhancock@cumin1003 START - Cookbook sre.hosts.reimage for host deploy2003.codfw.wmnet with OS bookworm
[19:15:06] <jinxer-wm>	 FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-esams:xe-0/1/7 (Transit: Liberty Global (BB00088) {#021468}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-esams:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[19:15:16] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install deploy2003 - https://phabricator.wikimedia.org/T400485#11116454 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin1003 for host deploy2003.codfw.wmnet with OS bookworm
[19:18:39] <jinxer-wm>	 FIRING: [4x] JobUnavailable: Reduced availability for job gerrit in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[19:20:23] <thcipriani>	 oh good.
[19:21:06] <thcipriani>	 !log restart apache gerrit1003
[19:21:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:22:19] <sukhe>	 thcipriani: much better thanks. I guess I will just do it in future instead of waiting :]
[19:22:52] <logmsgbot>	 !log dancy@deploy1003 Installing scap version "4.209.0" for 169 host(s)
[19:23:17] <thcipriani>	 sukhe: it's typically only apache that needs a kick there
[19:23:39] <jinxer-wm>	 RESOLVED: [4x] JobUnavailable: Reduced availability for job gerrit in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[19:24:31] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:26:27] <logmsgbot>	 !log vriley@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host cloudcephosd1045
[19:26:55] <logmsgbot>	 !log vriley@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudcephosd1045
[19:26:57] <logmsgbot>	 !log dancy@deploy1003 Installation of scap version "4.209.0" completed for 169 hosts
[19:29:31] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:29:32] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Q2:rack/setup/install cloudcephosd10[42-47] - https://phabricator.wikimedia.org/T378828#11116548 (10VRiley-WMF) @cmooney Thanks! The second link on cloudcephosd1045 in port 23 in cloudsw1-d5-eqiad. I also made a few changes to the cable itself. I pushed out the update as well. I...
[19:29:39] <jinxer-wm>	 FIRING: NetworkDeviceAlarmActive: Alarm active on ssw1-f1-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive
[19:31:16] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:31:17] <mutante>	 sukhe: thcipriani: wasn't here. it was from google cloud this time :(
[19:31:26] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Q4: eqiad: (4) PDUs for future fundraising racks - https://phabricator.wikimedia.org/T400779#11116590 (10Jclark-ctr)
[19:31:59] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Q4: eqiad: (4) PDUs for future fundraising racks - https://phabricator.wikimedia.org/T400779#11116598 (10Jclark-ctr) Both PDUs have been configured and added to librenms
[19:32:08] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Q4: eqiad: (4) PDUs for future fundraising racks - https://phabricator.wikimedia.org/T400779#11116601 (10Jclark-ctr) 05Open→03Resolved
[19:32:55] <wikibugs>	 (03CR) 10Cwhite: [C:03+2] k8s-ops: add disk space check overrides (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1179177 (https://phabricator.wikimedia.org/T332764) (owner: 10Cwhite)
[19:34:31] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:35:30] <wikibugs>	 (03Merged) 10jenkins-bot: k8s-ops: add disk space check overrides [alerts] - 10https://gerrit.wikimedia.org/r/1179177 (https://phabricator.wikimedia.org/T332764) (owner: 10Cwhite)
[19:36:09] <wikibugs>	 (03PS2) 10Cwhite: logstash: alert on unassigned shards and cluster status [alerts] - 10https://gerrit.wikimedia.org/r/1179226
[19:39:02] <wikibugs>	 (03PS1) 10Andrew Bogott: Replace cloudvirt1045 [puppet] - 10https://gerrit.wikimedia.org/r/1181767 (https://phabricator.wikimedia.org/T401693)
[19:40:19] <logmsgbot>	 !log jhathaway@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on sretest2001.codfw.wmnet with reason: sleep test
[19:40:24] <wikibugs>	 (03PS1) 10Dzahn: gerrit: adding prefixes to abusers list [puppet] - 10https://gerrit.wikimedia.org/r/1181768 (https://phabricator.wikimedia.org/T402847)
[19:40:39] <wikibugs>	 (03CR) 10CI reject: [V:04-1] gerrit: adding prefixes to abusers list [puppet] - 10https://gerrit.wikimedia.org/r/1181768 (https://phabricator.wikimedia.org/T402847) (owner: 10Dzahn)
[19:40:54] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] gerrit: adding prefixes to abusers list [puppet] - 10https://gerrit.wikimedia.org/r/1181768 (https://phabricator.wikimedia.org/T402847) (owner: 10Dzahn)
[19:41:03] <wikibugs>	 (03PS2) 10Dzahn: gerrit: adding prefixes to abusers list [puppet] - 10https://gerrit.wikimedia.org/r/1181768 (https://phabricator.wikimedia.org/T402847)
[19:41:35] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] gerrit: adding prefixes to abusers list [puppet] - 10https://gerrit.wikimedia.org/r/1181768 (https://phabricator.wikimedia.org/T402847) (owner: 10Dzahn)
[19:43:40] <logmsgbot>	 !log jhathaway@cumin1002 START - Cookbook sre.hosts.provision for host sretest2001.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART
[19:43:41] <wikibugs>	 (03PS2) 10Ebernhardson: EventStream: Enable hive ingestion for wcqs-external.sparql-query [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1154300
[19:43:50] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, August 25 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1154300 (owner: 10Ebernhardson)
[19:43:58] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, August 25 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1180610 (https://phabricator.wikimedia.org/T397083) (owner: 10Ebernhardson)
[19:44:25] <wikibugs>	 (03PS3) 10Ebernhardson: EventStream: Enable hive ingestion for wcqs-external.sparql-query [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1154300
[19:45:24] <wikibugs>	 (03PS41) 10CDobbins: dnsrecursor: add recursor.yml.erb [puppet] - 10https://gerrit.wikimedia.org/r/1169156 (https://phabricator.wikimedia.org/T381608)
[19:45:34] <wikibugs>	 (03PS4) 10Ebernhardson: EventStream: Enable hive ingestion for wcqs-external.sparql-query [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1154300 (https://phabricator.wikimedia.org/T391383)
[19:48:14] <wikibugs>	 (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!" [alerts] - 10https://gerrit.wikimedia.org/r/1179226 (owner: 10Cwhite)
[19:48:22] <wikibugs>	 (03PS3) 10Ebernhardson: cirrus: Enable phrase suggester variant [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1180610 (https://phabricator.wikimedia.org/T397083)
[19:48:42] <logmsgbot>	 !log jhathaway@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sretest2001.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART
[19:49:10] <logmsgbot>	 !log jhathaway@cumin1002 START - Cookbook sre.hosts.provision for host sretest2010.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART
[19:50:56] <icinga-wm>	 PROBLEM - Host sretest2010 is DOWN: PING CRITICAL - Packet loss = 100%
[19:51:01] <wikibugs>	 (03PS1) 10Dzahn: gerrit: add prefixes to abusers list [puppet] - 10https://gerrit.wikimedia.org/r/1181769 (https://phabricator.wikimedia.org/T402847)
[19:51:09] <logmsgbot>	 !log jhathaway@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sretest2010.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART
[19:51:54] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] gerrit: add prefixes to abusers list [puppet] - 10https://gerrit.wikimedia.org/r/1181769 (https://phabricator.wikimedia.org/T402847) (owner: 10Dzahn)
[19:52:00] <wikibugs>	 (03PS2) 10Dzahn: gerrit: add prefixes to abusers list [puppet] - 10https://gerrit.wikimedia.org/r/1181769 (https://phabricator.wikimedia.org/T402847)
[19:52:24] <icinga-wm>	 RECOVERY - Host sretest2010 is UP: PING OK - Packet loss = 0%, RTA = 30.39 ms
[19:53:09] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] gerrit: add prefixes to abusers list [puppet] - 10https://gerrit.wikimedia.org/r/1181769 (https://phabricator.wikimedia.org/T402847) (owner: 10Dzahn)
[19:54:46] <sukhe>	 mutante: :(
[19:54:59] <sukhe>	 where do you check this out of curiosity?
[19:55:09] <sukhe>	 just access logs?
[19:58:15] <wikibugs>	 (03PS1) 10Dzahn: gerrit: add more prefixes to abusers list [puppet] - 10https://gerrit.wikimedia.org/r/1181770
[19:58:15] <logmsgbot>	 !log jhathaway@cumin1002 START - Cookbook sre.hosts.provision for host sretest2010.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART
[19:58:36] <wikibugs>	 (03CR) 10CI reject: [V:04-1] gerrit: add more prefixes to abusers list [puppet] - 10https://gerrit.wikimedia.org/r/1181770 (owner: 10Dzahn)
[19:58:54] <wikibugs>	 (03PS3) 10Krinkle: Enable wmgUseMdotRouting in Beta Cluster for testwiki only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1181310 (https://phabricator.wikimedia.org/T401595)
[19:59:06] <wikibugs>	 (03PS2) 10Dzahn: gerrit: add more prefixes to abusers list [puppet] - 10https://gerrit.wikimedia.org/r/1181770
[19:59:41] <wikibugs>	 (03PS8) 10Krinkle: [WIP] varnish: Improve 08-mobile-hostnames-rewrite.vtc [puppet] - 10https://gerrit.wikimedia.org/r/1180969
[19:59:46] <icinga-wm>	 PROBLEM - Host sretest2010 is DOWN: PING CRITICAL - Packet loss = 100%
[19:59:48] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Enable wmgUseMdotRouting in Beta Cluster for testwiki only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1181310 (https://phabricator.wikimedia.org/T401595) (owner: 10Krinkle)
[20:00:05] <jouncebot>	 RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250825T2000).
[20:00:05] <jouncebot>	 arlolra and ebernhardson: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:00:19] <logmsgbot>	 !log jhathaway@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sretest2010.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART
[20:00:34] <arlolra>	 here.  I can handle my deploy
[20:00:47] <ebernhardson>	 \o
[20:00:51] <ebernhardson>	 i can do mine after
[20:01:05] <arlolra>	 I'll get started
[20:01:14] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install deploy2003 - https://phabricator.wikimedia.org/T400485#11116698 (10Jhancock.wm) @Papaul this one is going to fail again. looks like there might be a mismatch between hardware and the site.pp or preseed. I'm not sure which, but they both exi...
[20:01:16] <wikibugs>	 (03CR) 10Cwhite: [C:03+2] logstash: alert on unassigned shards and cluster status [alerts] - 10https://gerrit.wikimedia.org/r/1179226 (owner: 10Cwhite)
[20:01:24] <icinga-wm>	 RECOVERY - Host sretest2010 is UP: PING OK - Packet loss = 0%, RTA = 30.45 ms
[20:01:27] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install deploy2003 - https://phabricator.wikimedia.org/T400485#11116699 (10Jhancock.wm)
[20:01:48] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by arlolra@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1180229 (https://phabricator.wikimedia.org/T402349) (owner: 10Arlolra)
[20:02:27] <wikibugs>	 (03Merged) 10jenkins-bot: logstash: alert on unassigned shards and cluster status [alerts] - 10https://gerrit.wikimedia.org/r/1179226 (owner: 10Cwhite)
[20:02:36] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] gerrit: add more prefixes to abusers list [puppet] - 10https://gerrit.wikimedia.org/r/1181770 (owner: 10Dzahn)
[20:02:49] <wikibugs>	 (03Merged) 10jenkins-bot: Deploy Parsoid Read Views to ~20 Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1180229 (https://phabricator.wikimedia.org/T402349) (owner: 10Arlolra)
[20:03:07] <logmsgbot>	 !log arlolra@deploy1003 Started scap sync-world: Backport for [[gerrit:1180229|Deploy Parsoid Read Views to ~20 Wikipedias (T402349)]]
[20:03:11] <stashbot>	 T402349: Parsoid Read Views to Wikipedia deploy ~2025-08-25 - https://phabricator.wikimedia.org/T402349
[20:06:34] <wikibugs>	 (03PS4) 10Krinkle: Enable wmgUseMdotRouting in Beta Cluster for testwiki only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1181310 (https://phabricator.wikimedia.org/T401595)
[20:08:52] <logmsgbot>	 !log arlolra@deploy1003 arlolra: Backport for [[gerrit:1180229|Deploy Parsoid Read Views to ~20 Wikipedias (T402349)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[20:08:56] <stashbot>	 T402349: Parsoid Read Views to Wikipedia deploy ~2025-08-25 - https://phabricator.wikimedia.org/T402349
[20:10:25] <logmsgbot>	 !log arlolra@deploy1003 arlolra: Continuing with sync
[20:14:47] <logmsgbot>	 !log jhathaway@cumin1002 START - Cookbook sre.hosts.provision for host sretest2003.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL
[20:14:48] <logmsgbot>	 !log jhathaway@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest2003.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL
[20:14:55] <stashbot>	 jhathaway@cumin1002: Failed to log message to wiki. Somebody should check the error logs.
[20:15:46] <logmsgbot>	 !log arlolra@deploy1003 Finished scap sync-world: Backport for [[gerrit:1180229|Deploy Parsoid Read Views to ~20 Wikipedias (T402349)]] (duration: 12m 40s)
[20:15:51] <stashbot>	 T402349: Parsoid Read Views to Wikipedia deploy ~2025-08-25 - https://phabricator.wikimedia.org/T402349
[20:15:57] <arlolra>	 ebernhardson: all yours
[20:16:24] <ebernhardson>	 arlolra: thanks
[20:17:23] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by ebernhardson@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1154300 (https://phabricator.wikimedia.org/T391383) (owner: 10Ebernhardson)
[20:17:24] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by ebernhardson@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1180610 (https://phabricator.wikimedia.org/T397083) (owner: 10Ebernhardson)
[20:18:09] <wikibugs>	 (03Merged) 10jenkins-bot: EventStream: Enable hive ingestion for wcqs-external.sparql-query [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1154300 (https://phabricator.wikimedia.org/T391383) (owner: 10Ebernhardson)
[20:18:16] <wikibugs>	 (03Merged) 10jenkins-bot: cirrus: Enable phrase suggester variant [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1180610 (https://phabricator.wikimedia.org/T397083) (owner: 10Ebernhardson)
[20:18:31] <logmsgbot>	 !log ebernhardson@deploy1003 Started scap sync-world: Backport for [[gerrit:1154300|EventStream: Enable hive ingestion for wcqs-external.sparql-query (T391383)]], [[gerrit:1180610|cirrus: Enable phrase suggester variant (T397083)]]
[20:18:38] <stashbot>	 T391383: Metrics for federated querying - https://phabricator.wikimedia.org/T391383
[20:18:38] <stashbot>	 T397083: Add a second suggest field to the CirrusSearch mapping - https://phabricator.wikimedia.org/T397083
[20:19:26] <logmsgbot>	 !log jhathaway@cumin1002 START - Cookbook sre.hosts.provision for host sretest2003.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL
[20:19:32] <logmsgbot>	 !log jhathaway@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest2003.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL
[20:20:52] <icinga-wm>	 PROBLEM - Host sretest2003 is DOWN: PING CRITICAL - Packet loss = 100%
[20:21:52] <wikibugs>	 (03PS1) 10Santiago Faci: xLab: Deploy v0.8.3 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1181773 (https://phabricator.wikimedia.org/T380592)
[20:22:03] <wikibugs>	 (03PS2) 10Santiago Faci: xLab: Deploy v0.8.4 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1181773 (https://phabricator.wikimedia.org/T380592)
[20:23:24] <wikibugs>	 (03CR) 10CDobbins: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6729/co" [puppet] - 10https://gerrit.wikimedia.org/r/1169156 (https://phabricator.wikimedia.org/T381608) (owner: 10CDobbins)
[20:23:54] <logmsgbot>	 !log ebernhardson@deploy1003 ebernhardson: Backport for [[gerrit:1154300|EventStream: Enable hive ingestion for wcqs-external.sparql-query (T391383)]], [[gerrit:1180610|cirrus: Enable phrase suggester variant (T397083)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[20:23:58] <wikibugs>	 (03CR) 10Clare Ming: [C:03+2] xLab: Deploy v0.8.4 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1181773 (https://phabricator.wikimedia.org/T380592) (owner: 10Santiago Faci)
[20:24:00] <stashbot>	 T391383: Metrics for federated querying - https://phabricator.wikimedia.org/T391383
[20:24:00] <stashbot>	 T397083: Add a second suggest field to the CirrusSearch mapping - https://phabricator.wikimedia.org/T397083
[20:24:18] <icinga-wm>	 RECOVERY - Host sretest2003 is UP: PING OK - Packet loss = 0%, RTA = 31.62 ms
[20:24:59] <wikibugs>	 (03CR) 10CDobbins: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6730/co" [puppet] - 10https://gerrit.wikimedia.org/r/1169156 (https://phabricator.wikimedia.org/T381608) (owner: 10CDobbins)
[20:25:44] <wikibugs>	 (03Merged) 10jenkins-bot: xLab: Deploy v0.8.4 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1181773 (https://phabricator.wikimedia.org/T380592) (owner: 10Santiago Faci)
[20:26:29] <logmsgbot>	 !log ebernhardson@deploy1003 ebernhardson: Continuing with sync
[20:27:18] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to Superset dashboards for mszwarc - https://phabricator.wikimedia.org/T402779#11116795 (10FCeratto-WMF)
[20:31:36] <logmsgbot>	 !log ebernhardson@deploy1003 Finished scap sync-world: Backport for [[gerrit:1154300|EventStream: Enable hive ingestion for wcqs-external.sparql-query (T391383)]], [[gerrit:1180610|cirrus: Enable phrase suggester variant (T397083)]] (duration: 13m 04s)
[20:31:42] <stashbot>	 T391383: Metrics for federated querying - https://phabricator.wikimedia.org/T391383
[20:31:43] <stashbot>	 T397083: Add a second suggest field to the CirrusSearch mapping - https://phabricator.wikimedia.org/T397083
[20:34:40] <logmsgbot>	 !log jhancock@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host deploy2003.codfw.wmnet with OS bookworm
[20:34:46] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install deploy2003 - https://phabricator.wikimedia.org/T400485#11116894 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin1003 for host deploy2003.codfw.wmnet with OS bookworm executed with errors: - deploy2003 (**...
[20:35:25] <logmsgbot>	 !log jhathaway@cumin1002 START - Cookbook sre.hosts.provision for host sretest2003.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL
[20:37:20] <icinga-wm>	 PROBLEM - Host sretest2003 is DOWN: PING CRITICAL - Packet loss = 100%
[20:38:07] <logmsgbot>	 !log jhathaway@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sretest2003.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL
[20:39:55] <wikibugs>	 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install es2049-es2057 - https://phabricator.wikimedia.org/T400195#11116917 (10FCeratto-WMF)
[20:40:18] <icinga-wm>	 RECOVERY - Host sretest2003 is UP: PING OK - Packet loss = 0%, RTA = 30.42 ms
[20:45:14] <logmsgbot>	 !log jhathaway@cumin1002 START - Cookbook sre.hosts.provision for host sretest2003.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL
[20:46:07] <logmsgbot>	 !log jhathaway@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on sretest2003.codfw.wmnet with reason: sleep test
[20:48:07] <logmsgbot>	 !log jhathaway@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sretest2003.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL
[20:48:30] <logmsgbot>	 !log jhathaway@cumin1002 START - Cookbook sre.hosts.provision for host sretest2010.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART
[20:50:16] <icinga-wm>	 PROBLEM - Host sretest2010 is DOWN: PING CRITICAL - Packet loss = 100%
[20:50:29] <logmsgbot>	 !log jhathaway@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sretest2010.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART
[20:51:24] <icinga-wm>	 RECOVERY - Host sretest2010 is UP: PING OK - Packet loss = 0%, RTA = 30.51 ms
[20:51:31] <wikibugs>	 (03PS1) 10Cwhite: logstash: bugfix: add missing threshold [alerts] - 10https://gerrit.wikimedia.org/r/1181777
[20:53:00] <wikibugs>	 (03CR) 10Cwhite: [C:03+2] logstash: bugfix: add missing threshold [alerts] - 10https://gerrit.wikimedia.org/r/1181777 (owner: 10Cwhite)
[20:53:15] <logmsgbot>	 !log rzl@deploy1003 mwscript-k8s job started: foreachwikiindblist mwscript.dblist Version.php  # dblist: https://phabricator.wikimedia.org/P81742
[20:53:20] <logmsgbot>	 !log jhathaway@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on sretest2001.codfw.wmnet with reason: sleep test
[20:54:11] <wikibugs>	 (03Merged) 10jenkins-bot: logstash: bugfix: add missing threshold [alerts] - 10https://gerrit.wikimedia.org/r/1181777 (owner: 10Cwhite)
[20:54:18] <logmsgbot>	 !log jhathaway@cumin1002 START - Cookbook sre.hosts.provision for host sretest2001.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART
[20:58:20] <wikibugs>	 (03PS1) 10RLazarus: deployment_server: Include --local_dblist contents when logging to SAL [puppet] - 10https://gerrit.wikimedia.org/r/1181779 (https://phabricator.wikimedia.org/T401737)
[20:59:17] <sbassett>	 Hey all - is the late backport window wrapped up yet?
[20:59:20] <logmsgbot>	 !log jhathaway@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sretest2001.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART
[20:59:27] <sbassett>	 We’ve definitely got a few sec patches to get out during the window.
[21:00:05] <jouncebot>	 Reedy, sbassett, Maryum, and manfredi: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Weekly Security deployment window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250825T2100).
[21:00:44] <logmsgbot>	 !log jhathaway@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on sretest2006.codfw.wmnet with reason: sleep test
[21:02:58] <wikibugs>	 (CR) RLazarus: "Tested:" [puppet] - https://gerrit.wikimedia.org/r/1181779 (https://phabricator.wikimedia.org/T401737) (owner: RLazarus)
[21:03:40] <logmsgbot>	 !log jhathaway@cumin1002 START - Cookbook sre.hosts.provision for host sretest2006.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL
[21:03:50] <wikibugs>	 (CR) Andrew Bogott: [C:+2] Replace cloudvirt1045 [puppet] - https://gerrit.wikimedia.org/r/1181767 (https://phabricator.wikimedia.org/T401693) (owner: Andrew Bogott)
[21:07:36] <logmsgbot>	 !log jhathaway@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sretest2006.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL
[21:07:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:08:09] <wikibugs>	 ops-codfw, SRE, DC-Ops, serviceops: Q1:rack/setup/install deploy2003 - https://phabricator.wikimedia.org/T400485#11116995 (Papaul) @Jhancock.wm  no entry on the wrong puppet server for this server. Please check site.pp. Thanks
[21:08:25] <logmsgbot>	 !log jhathaway@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on sretest2009.codfw.wmnet with reason: sleep test
[21:08:45] <logmsgbot>	 !log jhathaway@cumin1002 START - Cookbook sre.hosts.provision for host sretest2009.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART
[21:08:50] <wikibugs>	 ops-magru: Alert for device ps1-b3-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T402835#11116999 (phaultfinder)
[21:10:24] <logmsgbot>	 !log jhathaway@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sretest2009.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART
[21:13:50] <wikibugs>	 ops-magru: Alert for device ps1-b4-magru.mgmt.magru.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T402582#11117005 (phaultfinder)
[21:13:51] <logmsgbot>	 !log jhathaway@cumin1002 START - Cookbook sre.hosts.provision for host sretest2009.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART
[21:15:08] <wikibugs>	 (PS1) Bartosz Dziewoński: PHPSessionHandler: Better handle objects stored in the session [core] (wmf/1.45.0-wmf.15) - https://gerrit.wikimedia.org/r/1181782 (https://phabricator.wikimedia.org/T402602)
[21:15:41] <logmsgbot>	 !log jhathaway@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sretest2009.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART
[21:16:11] <logmsgbot>	 !log jhathaway@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on sretest1002.eqiad.wmnet with reason: sleep test
[21:17:02] <logmsgbot>	 !log jhathaway@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on sretest1003.eqiad.wmnet with reason: sleep test
[21:17:06] <logmsgbot>	 !log jhathaway@cumin1002 START - Cookbook sre.hosts.provision for host sretest1002.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL
[21:17:27] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, August 26 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [core] (wmf/1.45.0-wmf.15) - https://gerrit.wikimedia.org/r/1181782 (https://phabricator.wikimedia.org/T402602) (owner: Bartosz Dziewoński)
[21:18:47] <logmsgbot>	 !log jhathaway@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sretest1002.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL
[21:19:23] <logmsgbot>	 !log jhathaway@cumin1002 START - Cookbook sre.hosts.provision for host sretest1003.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL
[21:19:46] <wikibugs>	 (PS1) Bartosz Dziewoński: Add maint script to fix global edit count of renamed users [extensions/CentralAuth] (wmf/1.45.0-wmf.15) - https://gerrit.wikimedia.org/r/1181788 (https://phabricator.wikimedia.org/T313900)
[21:19:59] <wikibugs>	 (PS1) Bartosz Dziewoński: Add maint script to fix wrong actors in local log entries for global renames [extensions/CentralAuth] (wmf/1.45.0-wmf.15) - https://gerrit.wikimedia.org/r/1181789 (https://phabricator.wikimedia.org/T398177)
[21:20:15] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, August 26 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [extensions/CentralAuth] (wmf/1.45.0-wmf.15) - https://gerrit.wikimedia.org/r/1181788 (https://phabricator.wikimedia.org/T313900) (owner: Bartosz Dziewoński)
[21:20:26] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, August 26 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [extensions/CentralAuth] (wmf/1.45.0-wmf.15) - https://gerrit.wikimedia.org/r/1181789 (https://phabricator.wikimedia.org/T398177) (owner: Bartosz Dziewoński)
[21:21:51] <logmsgbot>	 !log jhathaway@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sretest1003.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL
[21:21:53] <wikibugs>	 (PS11) Bking: opensearch-operator: Add chart for review [deployment-charts] - https://gerrit.wikimedia.org/r/1174038 (https://phabricator.wikimedia.org/T397246)
[21:23:26] <wikibugs>	 (CR) CI reject: [V:-1] opensearch-operator: Add chart for review [deployment-charts] - https://gerrit.wikimedia.org/r/1174038 (https://phabricator.wikimedia.org/T397246) (owner: Bking)
[21:23:51] <sbassett>	 !log Deployed security mitigations for T402146, T402077, T402095, T400525
[21:23:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:23:56] <wikibugs>	 (PS2) Ladsgroup: Move update of category members count to a dedicated job [core] (wmf/1.45.0-wmf.15) - https://gerrit.wikimedia.org/r/1181786 (https://phabricator.wikimedia.org/T365303)
[21:33:24] <wikibugs>	 (PS12) Bking: opensearch-operator: Add chart for review [deployment-charts] - https://gerrit.wikimedia.org/r/1174038 (https://phabricator.wikimedia.org/T397246)
[21:38:07] <wikibugs>	 (CR) Ladsgroup: [C:+2] Move update of category members count to a dedicated job [core] (wmf/1.45.0-wmf.15) - https://gerrit.wikimedia.org/r/1181786 (https://phabricator.wikimedia.org/T365303) (owner: Ladsgroup)
[21:38:13] <Amir1>	 jouncebot: nowandnext
[21:38:13] <jouncebot>	 For the next 1 hour(s) and 21 minute(s): Weekly Security deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250825T2100)
[21:38:14] <jouncebot>	 In 1 hour(s) and 21 minute(s): Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250825T2300)
[21:39:30] <wikibugs>	 (PS6) Jdlrobson: Update vector search config with new wgVectorTypeahead [mediawiki-config] - https://gerrit.wikimedia.org/r/1179750 (https://phabricator.wikimedia.org/T397084) (owner: Bernard Wang)
[21:41:40] <wikibugs>	 (PS1) Andrew Bogott: Remove manifests/files/templaces for openstack 'Caracal' [puppet] - https://gerrit.wikimedia.org/r/1181790 (https://phabricator.wikimedia.org/T390914)
[21:42:06] <wikibugs>	 (Merged) jenkins-bot: Move update of category members count to a dedicated job [core] (wmf/1.45.0-wmf.15) - https://gerrit.wikimedia.org/r/1181786 (https://phabricator.wikimedia.org/T365303) (owner: Ladsgroup)
[21:44:57] <wikibugs>	 (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1181790 (https://phabricator.wikimedia.org/T390914) (owner: Andrew Bogott)
[21:45:00] <wikibugs>	 (PS1) Cwhite: opensearch: selectively enable cluster health check [puppet] - https://gerrit.wikimedia.org/r/1181791 (https://phabricator.wikimedia.org/T321808)
[21:47:23] <wikibugs>	 (CR) Andrew Bogott: [C:+2] Remove manifests/files/templaces for openstack 'Caracal' [puppet] - https://gerrit.wikimedia.org/r/1181790 (https://phabricator.wikimedia.org/T390914) (owner: Andrew Bogott)
[21:47:28] <sbassett>	 !log Deployed updated security mitigations for T399627
[21:47:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:47:34] <logmsgbot>	 !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1181786|Move update of category members count to a dedicated job (T365303)]]
[21:47:39] <stashbot>	 T365303: Move update of category members count to CategoryMembershipChangeJob - https://phabricator.wikimedia.org/T365303
[21:49:07] <wikibugs>	 (PS2) Andrew Bogott: openstack: switch libvirt live migration uri to cloud-private hostnames [puppet] - https://gerrit.wikimedia.org/r/1181676 (https://phabricator.wikimedia.org/T355145) (owner: Filippo Giunchedi)
[21:49:51] <Amir1>	 sbassett: my patch went immediately after yours, I'll be quick. Need to fix this UBN
[21:51:15] <maryum>	 preparing to do further security deploys
[21:51:23] <maryum>	 Amir1 are you done with yours?
[21:51:32] <Amir1>	 nope, it's running
[21:51:53] <Amir1>	 I ping you once I'm done
[21:53:36] <logmsgbot>	 !log ladsgroup@deploy1003 ladsgroup: Backport for [[gerrit:1181786|Move update of category members count to a dedicated job (T365303)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[21:53:41] <stashbot>	 T365303: Move update of category members count to CategoryMembershipChangeJob - https://phabricator.wikimedia.org/T365303
[21:56:25] <wikibugs>	 ops-eqiad, SRE, DC-Ops: eqiad netbox cable cleanup - https://phabricator.wikimedia.org/T402536#11117170 (VRiley-WMF)
[21:56:39] <maryum>	 great just let me know
[22:00:01] <wikibugs>	 (PS1) JHathaway: provision: poll for reboot via Redfish [cookbooks] - https://gerrit.wikimedia.org/r/1181795
[22:05:24] <logmsgbot>	 !log ladsgroup@deploy1003 Sync cancelled.
[22:05:50] <Amir1>	 maryum: is anything you're doing in core? if so, then let me revert my patch
[22:06:04] <Amir1>	 if not, then I can spend time to fix it
[22:06:05] <maryum>	 I was going to deploy a core patch, but it's not working right now
[22:06:19] <Amir1>	 ah okay
[22:06:37] <maryum>	 I have a patch to deploy for abuse filter and one for cirrus search
[22:07:12] <maryum>	 I have a patch you wrote to deploy as well
[22:07:19] <wikibugs>	 (CR) CI reject: [V:-1] provision: poll for reboot via Redfish [cookbooks] - https://gerrit.wikimedia.org/r/1181795 (owner: JHathaway)
[22:07:25] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:07:45] <maryum>	 wait, is that the core patch you're trying to deploy?
[22:08:03] <maryum>	 Amir1 are working on the same thing possibly
[22:08:07] <maryum>	 *we
[22:08:24] <Amir1>	 nope, mine is different 
[22:08:28] <logmsgbot>	 !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1181786|Move update of category members count to a dedicated job (T365303)]]
[22:08:33] <stashbot>	 T365303: Move update of category members count to CategoryMembershipChangeJob - https://phabricator.wikimedia.org/T365303
[22:08:39] <Amir1>	 I'm pushing it again, I realized what was wrong
[22:08:55] <maryum>	 well I do have a core patch that you wrote that I also want to deploy
[22:09:00] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[22:09:03] <maryum>	 so just let me know when I can get started
[22:09:04] <wikibugs>	 (CR) Cwhite: "PCC: OK https://puppet-compiler.wmflabs.org/output/1181791/6732/" [puppet] - https://gerrit.wikimedia.org/r/1181791 (https://phabricator.wikimedia.org/T321808) (owner: Cwhite)
[22:09:55] <Amir1>	 Ah I remember which patch 
[22:10:02] <Amir1>	 it's different :D
[22:10:15] <wikibugs>	 ops-eqiad, SRE, DC-Ops: eqiad netbox cable cleanup - https://phabricator.wikimedia.org/T402536#11117185 (VRiley-WMF)
[22:10:21] <Amir1>	 I'll be done quickly, sorry for barging in the security window, I'm having a very fun time
[22:11:04] <wikibugs>	 (PS1) Bking: [WIP]:dse-k8s-worker: Add sysctl setting that's required for OpenSearch [puppet] - https://gerrit.wikimedia.org/r/1181797 (https://phabricator.wikimedia.org/T362105)
[22:11:19] <wikibugs>	 (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1181797 (https://phabricator.wikimedia.org/T362105) (owner: Bking)
[22:13:55] <perryprog>	 To be fair, "making cat-a-lot not crash wikipedia" is arguably security related. 
[22:14:08] <logmsgbot>	 !log ladsgroup@deploy1003 ladsgroup: Backport for [[gerrit:1181786|Move update of category members count to a dedicated job (T365303)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[22:14:09] <maryum>	 Amir1 there's still time in the window that's fine
[22:14:13] <stashbot>	 T365303: Move update of category members count to CategoryMembershipChangeJob - https://phabricator.wikimedia.org/T365303
[22:15:44] <logmsgbot>	 !log ladsgroup@deploy1003 ladsgroup: Continuing with sync
[22:16:36] <Amir1>	 perryprog: xD Indeeeeed
[22:20:55] <logmsgbot>	 !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1181786|Move update of category members count to a dedicated job (T365303)]] (duration: 12m 26s)
[22:21:00] <stashbot>	 T365303: Move update of category members count to CategoryMembershipChangeJob - https://phabricator.wikimedia.org/T365303
[22:21:16] <Amir1>	 maryum: I'm done, feel free to move forward
[22:21:24] <maryum>	 awesome, thanks!!
[22:23:08] <maryum>	 preparing to deploy the core security patch first
[22:27:07] <maryum>	 !log Deployed security fix for T298690
[22:27:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:34:41] <wikibugs>	 SRE, Commons, DBA, Traffic: Unable to save edits or delete pages on Commons – database lag - https://phabricator.wikimedia.org/T402749#11117263 (Ladsgroup) >>! In T402749#11113595, @Zache wrote: > @Ladsgroup : Just FYI, from the Cat-a-lot code side, the user was using a pre-August 18, 2024 ve...
[22:41:22] <wikibugs>	 (PS1) Gerrit maintenance bot: mariadb: Promote db1184 to s1 master [puppet] - https://gerrit.wikimedia.org/r/1181799 (https://phabricator.wikimedia.org/T402870)
[22:41:27] <wikibugs>	 (PS1) Gerrit maintenance bot: wmnet: Update s1-master alias [dns] - https://gerrit.wikimedia.org/r/1181800 (https://phabricator.wikimedia.org/T402870)
[22:42:35] <wikibugs>	 (PS1) Gerrit maintenance bot: mariadb: Promote db1160 to s4 master [puppet] - https://gerrit.wikimedia.org/r/1181801 (https://phabricator.wikimedia.org/T402871)
[22:42:39] <wikibugs>	 (PS1) Gerrit maintenance bot: wmnet: Update s4-master alias [dns] - https://gerrit.wikimedia.org/r/1181802 (https://phabricator.wikimedia.org/T402871)
[22:44:20] <maryum>	 ran into some issues, running scap again
[22:51:41] <maryum>	 have one more scap to run after this, will go over this window for a slight bit
[22:54:25] <wikibugs>	 SRE, Commons, DBA, Traffic: Unable to save edits or delete pages on Commons – database lag - https://phabricator.wikimedia.org/T402749#11117358 (JJMC89) >>! In T402749#11117263, @Ladsgroup wrote: >>>! In T402749#11113595, @Zache wrote: >> @Ladsgroup : Just FYI, from the Cat-a-lot code side, t...
[22:55:49] <maryum>	 !log Deploy security fix for T401220
[22:55:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:56:36] <wikibugs>	 (PS1) Dzahn: zuul: add a provider and zookeeper server to nodepool config [puppet] - https://gerrit.wikimedia.org/r/1181804 (https://phabricator.wikimedia.org/T401614)
[22:56:38] <maryum>	 running the last of the scaps
[22:59:02] <wikibugs>	 (CR) Dzahn: [V:+1 C:+2] "https://puppet-compiler.wmflabs.org/output/1181804/6733/zuul1001.eqiad.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/1181804 (https://phabricator.wikimedia.org/T401614) (owner: Dzahn)
[22:59:56] <maryum>	 finished with scap
[23:00:05] <jouncebot>	 Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250825T2300)
[23:00:37] <maryum>	 !log Deploy security fix for T397396
[23:00:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:04:04] <Jdlrobson>	 maryum I just need to deploy a beta cluster only change. Are you done with your deploys?
[23:04:34] <jinxer-wm>	 FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag
[23:04:35] <jinxer-wm>	 FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown
[23:07:25] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:08:27] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by jdlrobson@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1181743 (owner: Jdlrobson)
[23:09:00] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[23:09:15] <wikibugs>	 (Merged) jenkins-bot: Revert "Temporarily use production for summary endpoint" [mediawiki-config] - https://gerrit.wikimedia.org/r/1181743 (owner: Jdlrobson)
[23:14:54] <icinga-wm>	 PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[23:14:54] <icinga-wm>	 PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[23:14:54] <jinxer-wm>	 FIRING: [2x] TransitBGPDown: Transit BGP session down between cr2-esams and LibertyGlobal (2001:730:2209:1::d52e:ba09) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status  - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
[23:15:06] <jinxer-wm>	 FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-esams:xe-0/1/7 (Transit: Liberty Global (BB00088) {#021468}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-esams:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[23:15:26] <wikibugs>	 (PS1) RLazarus: mathoid: Upgrade to envoy-future:1.26.8-2 for validation [deployment-charts] - https://gerrit.wikimedia.org/r/1181806 (https://phabricator.wikimedia.org/T402584)
[23:17:55] <logmsgbot>	 !log jhathaway@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on sretest1003.eqiad.wmnet with reason: sleep test
[23:21:02] <logmsgbot>	 !log jhathaway@cumin1002 START - Cookbook sre.hosts.provision for host sretest1003.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL
[23:23:29] <logmsgbot>	 !log jhathaway@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sretest1003.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL
[23:24:41] <Amir1>	 jouncebot: nowandnext
[23:24:41] <jouncebot>	 For the next 0 hour(s) and 35 minute(s): Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250825T2300)
[23:24:41] <jouncebot>	 In 2 hour(s) and 35 minute(s): Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous deployment/Train deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250826T0200)
[23:29:29] <wikibugs>	 (CR) Scott French: [C:+1] "Very nice!" [puppet] - https://gerrit.wikimedia.org/r/1181779 (https://phabricator.wikimedia.org/T401737) (owner: RLazarus)
[23:29:34] <jinxer-wm>	 FIRING: NetworkDeviceAlarmActive: Alarm active on ssw1-f1-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive
[23:29:40] <wikibugs>	 SRE, Commons, DBA, Traffic: Unable to save edits or delete pages on Commons – database lag - https://phabricator.wikimedia.org/T402749#11117413 (Ladsgroup) >>! In T402749#11117358, @JJMC89 wrote: >>>! In T402749#11117263, @Ladsgroup wrote: >>>>! In T402749#11113595, @Zache wrote: >>> @Ladsgro...
[23:30:55] <logmsgbot>	 !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 32 hosts with reason: Primary switchover s4 T402871
[23:30:59] <stashbot>	 T402871: Switchover s4 master (db1244 -> db1160) - https://phabricator.wikimedia.org/T402871
[23:31:30] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Set db1160 with weight 0 T402871', diff saved to https://phabricator.wikimedia.org/P81743 and previous config saved to /var/cache/conftool/dbconfig/20250825-233128-ladsgroup.json
[23:33:40] <wikibugs>	 SRE, Commons, DBA, Traffic: Unable to save edits or delete pages on Commons – database lag - https://phabricator.wikimedia.org/T402749#11117423 (Josve05a) >>! In T402749#11117413, @Ladsgroup wrote: > [...] Maybe someone should mention it to them?  There is https://commons.wikimedia.org/wiki/U...
[23:35:02] <wikibugs>	 (CR) RLazarus: [C:+2] deployment_server: Include --local_dblist contents when logging to SAL [puppet] - https://gerrit.wikimedia.org/r/1181779 (https://phabricator.wikimedia.org/T401737) (owner: RLazarus)
[23:37:11] <wikibugs>	 (PS2) Gerrit maintenance bot: mariadb: Promote db1160 to s4 master [puppet] - https://gerrit.wikimedia.org/r/1181801 (https://phabricator.wikimedia.org/T402871)
[23:37:16] <wikibugs>	 (CR) Ladsgroup: [V:+2 C:+2] mariadb: Promote db1160 to s4 master [puppet] - https://gerrit.wikimedia.org/r/1181801 (https://phabricator.wikimedia.org/T402871) (owner: Gerrit maintenance bot)
[23:38:18] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1181807
[23:38:18] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1181807 (owner: TrainBranchBot)
[23:39:20] <Amir1>	 !log Starting s4 eqiad failover from db1244 to db1160 - T402871
[23:39:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:39:24] <stashbot>	 T402871: Switchover s4 master (db1244 -> db1160) - https://phabricator.wikimedia.org/T402871
[23:39:35] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Set s4 eqiad as read-only for maintenance - T402871', diff saved to https://phabricator.wikimedia.org/P81744 and previous config saved to /var/cache/conftool/dbconfig/20250825-233934-ladsgroup.json
[23:42:43] <wikibugs>	 ops-eqiad, DC-Ops, decommission-hardware, Data-Platform-SRE (2025.08.16 - 2025.09.05): decommission an-druid100[1-2] - https://phabricator.wikimedia.org/T402814#11117433 (Jclark-ctr)
[23:43:04] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Promote db1160 to s4 primary and set section read-write T402871', diff saved to https://phabricator.wikimedia.org/P81745 and previous config saved to /var/cache/conftool/dbconfig/20250825-234303-ladsgroup.json
[23:43:15] <jinxer-wm>	 FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[23:45:32] <wikibugs>	 (CR) Ladsgroup: [V:+2 C:+2] wmnet: Update s4-master alias [dns] - https://gerrit.wikimedia.org/r/1181802 (https://phabricator.wikimedia.org/T402871) (owner: Gerrit maintenance bot)
[23:45:46] <logmsgbot>	 !log ladsgroup@dns1004 START - running authdns-update
[23:47:01] <logmsgbot>	 !log ladsgroup@dns1004 END - running authdns-update
[23:48:15] <jinxer-wm>	 RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[23:48:57] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depool db1244 T402871', diff saved to https://phabricator.wikimedia.org/P81746 and previous config saved to /var/cache/conftool/dbconfig/20250825-234856-ladsgroup.json
[23:49:02] <stashbot>	 T402871: Switchover s4 master (db1244 -> db1160) - https://phabricator.wikimedia.org/T402871
[23:50:08] <wikibugs>	 (CR) BCornwall: [C:+1] wmnet: Update s1-master alias [dns] - https://gerrit.wikimedia.org/r/1181800 (https://phabricator.wikimedia.org/T402870) (owner: Gerrit maintenance bot)
[23:51:27] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1181807 (owner: TrainBranchBot)
[23:54:53] <icinga-wm>	 RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9235 bytes in 7.379 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[23:54:53] <icinga-wm>	 RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 54682 bytes in 7.525 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[23:59:13] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.mysql.upgrade for db1244.eqiad.wmnet
[23:59:21] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.mysql.depool db1244 - Upgrading db1244.eqiad.wmnet
[23:59:29] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db1244 - Upgrading db1244.eqiad.wmnet