Wikimedia IRC logs browser - #wikimedia-operations

Displaying 1056 items:

2025-10-20 00:07:53 <wikibugs> (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1197059
2025-10-20 00:07:53 <wikibugs> (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1197059 (owner: TrainBranchBot)
2025-10-20 00:11:39 <wikibugs> SRE, Data-Engineering, Traffic-Icebox, MobileFrontend (Tracking): RFC: Serve mobile and desktop variants through the same URL (unified mobile routing) - https://phabricator.wikimedia.org/T214998#11288227 (toni.stoev) Now that mobile and desktop are served from the same URL, I am kind of satisfied...
2025-10-20 00:43:44 <jinxer-wm> FIRING: [2x] NodeTextfileStale: Stale textfile for wdqs2009:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
2025-10-20 00:45:03 <wikibugs> (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1197059 (owner: TrainBranchBot)
2025-10-20 00:52:34 <jinxer-wm> FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
2025-10-20 01:00:39 <logmsgbot> !log mwpresync@deploy2002 Started scap build-images: Publishing wmf/next image
2025-10-20 01:14:32 <logmsgbot> !log mwpresync@deploy2002 Finished scap build-images: Publishing wmf/next image (duration: 13m 52s)
2025-10-20 01:28:11 <jinxer-wm> FIRING: [2x] SystemdUnitFailed: amd-k8s-node-labeller.service on ml-serve1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
2025-10-20 01:32:11 <jinxer-wm> FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
2025-10-20 01:35:07 <jinxer-wm> FIRING: SwitchCoreInterfaceDown: Switch core interface down - ssw1-d1-eqiad:ethernet-1/31 (Transport: ssw1-f1-eqiad:et-0/0/29 (Equinix, 21996480) {#0107202f1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-d1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown
2025-10-20 01:37:11 <jinxer-wm> FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
2025-10-20 02:09:24 <jinxer-wm> FIRING: [6x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
2025-10-20 02:10:03 <icinga-wm> PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
2025-10-20 02:16:35 <wikibugs> (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, October 21 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy"; [mediawiki-config] - https://gerrit.wikimedia.org/r/1180523 (https://phabricator.wikimedia.org/T401288) (owner: Seanleong-wmde)
2025-10-20 02:17:44 <jinxer-wm> FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
2025-10-20 03:07:11 <jinxer-wm> FIRING: [6x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
2025-10-20 03:10:03 <icinga-wm> RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
2025-10-20 03:45:21 <jinxer-wm> FIRING: NetworkDeviceAlarmActive: Alarm active on ssw1-f1-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive
2025-10-20 03:46:18 <wikibugs> (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, October 20 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc"; [mediawiki-config] - https://gerrit.wikimedia.org/r/1196492 (https://phabricator.wikimedia.org/T389409) (owner: BPirkle)
2025-10-20 03:51:48 <jinxer-wm> FIRING: KubernetesCalicoDown: wikikube-worker2203.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=wikikube-worker2203.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
2025-10-20 04:06:51 <jinxer-wm> FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
2025-10-20 04:26:51 <jinxer-wm> RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
2025-10-20 04:43:44 <jinxer-wm> FIRING: [2x] NodeTextfileStale: Stale textfile for wdqs2009:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
2025-10-20 04:52:34 <jinxer-wm> FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
2025-10-20 04:54:35 <wikibugs> (PS1) Marostegui: mariadb: Productionize db2245 [puppet] - https://gerrit.wikimedia.org/r/1197061 (https://phabricator.wikimedia.org/T406551)
2025-10-20 04:56:23 <wikibugs> (CR) Marostegui: [C:+2] mariadb: Productionize db2245 [puppet] - https://gerrit.wikimedia.org/r/1197061 (https://phabricator.wikimedia.org/T406551) (owner: Marostegui)
2025-10-20 05:03:27 <logmsgbot> !log marostegui@cumin1003 START - Cookbook sre.mysql.clone of db2248.codfw.wmnet onto db2245.codfw.wmnet
2025-10-20 05:03:32 <logmsgbot> !log marostegui@cumin1003 START - Cookbook sre.mysql.depool db2248 - Depool db2248.codfw.wmnet to then clone it to db2245.codfw.wmnet - marostegui@cumin1003
2025-10-20 05:04:44 <logmsgbot> !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2248 - Depool db2248.codfw.wmnet to then clone it to db2245.codfw.wmnet - marostegui@cumin1003
2025-10-20 05:04:44 <logmsgbot> !log marostegui@cumin1003 END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of db2248.codfw.wmnet onto db2245.codfw.wmnet
2025-10-20 05:05:15 <logmsgbot> !log marostegui@cumin1003 START - Cookbook sre.mysql.clone of db2248.codfw.wmnet onto db2245.codfw.wmnet
2025-10-20 05:08:29 <jinxer-wm> FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
2025-10-20 05:08:51 <jinxer-wm> FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
2025-10-20 05:14:11 <wikibugs> (PS1) Marostegui: db1206: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1197062 (https://phabricator.wikimedia.org/T407463)
2025-10-20 05:14:58 <wikibugs> (CR) Marostegui: [C:+2] db1206: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1197062 (https://phabricator.wikimedia.org/T407463) (owner: Marostegui)
2025-10-20 05:17:09 <logmsgbot> !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1206.eqiad.wmnet with reason: Maintenance
2025-10-20 05:17:13 <logmsgbot> !log marostegui@cumin1003 dbctl commit (dc=all): 'Depool db1206 for migration to mariadb 10.11', diff saved to https://phabricator.wikimedia.org/P84069 and previous config saved to /var/cache/conftool/dbconfig/20251020-051712-marostegui.json
2025-10-20 05:19:27 <wikibugs> (PS1) Marostegui: instances.yaml: Remove es1027 from dbctl [puppet] - https://gerrit.wikimedia.org/r/1197065 (https://phabricator.wikimedia.org/T407595)
2025-10-20 05:19:56 <wikibugs> (CR) Marostegui: [C:+2] instances.yaml: Remove es1027 from dbctl [puppet] - https://gerrit.wikimedia.org/r/1197065 (https://phabricator.wikimedia.org/T407595) (owner: Marostegui)
2025-10-20 05:20:58 <logmsgbot> !log marostegui@cumin1003 dbctl commit (dc=all): 'Remove es1027 from dbctl T407595', diff saved to https://phabricator.wikimedia.org/P84070 and previous config saved to /var/cache/conftool/dbconfig/20251020-052057-marostegui.json
2025-10-20 05:21:03 <stashbot> T407595: decommission es1027.eqiad.wmnet - https://phabricator.wikimedia.org/T407595
2025-10-20 05:24:39 <logmsgbot> !log marostegui@cumin1003 dbctl commit (dc=all): 'db1206 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P84071 and previous config saved to /var/cache/conftool/dbconfig/20251020-052438-root.json
2025-10-20 05:28:11 <jinxer-wm> FIRING: [2x] SystemdUnitFailed: amd-k8s-node-labeller.service on ml-serve1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
2025-10-20 05:28:51 <jinxer-wm> RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
2025-10-20 05:30:31 <wikibugs> (PS1) Marostegui: mariadb: Decommission es1027 [puppet] - https://gerrit.wikimedia.org/r/1197066 (https://phabricator.wikimedia.org/T407595)
2025-10-20 05:33:32 <wikibugs> (CR) Marostegui: [C:+2] mariadb: Decommission es1027 [puppet] - https://gerrit.wikimedia.org/r/1197066 (https://phabricator.wikimedia.org/T407595) (owner: Marostegui)
2025-10-20 05:34:23 <logmsgbot> !log marostegui@cumin1003 START - Cookbook sre.hosts.decommission for hosts es1027.eqiad.wmnet
2025-10-20 05:34:24 <jinxer-wm> RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
2025-10-20 05:34:32 <wikibugs> (CR) CI reject: [V:-1] Localisation updates from https://translatewiki.net. [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/1197067 (owner: L10n-bot)
2025-10-20 05:35:07 <jinxer-wm> FIRING: SwitchCoreInterfaceDown: Switch core interface down - ssw1-d1-eqiad:ethernet-1/31 (Transport: ssw1-f1-eqiad:et-0/0/29 (Equinix, 21996480) {#0107202f1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-d1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown
2025-10-20 05:39:45 <logmsgbot> !log marostegui@cumin1003 dbctl commit (dc=all): 'db1206 (re)pooling @ 50%: 10', diff saved to https://phabricator.wikimedia.org/P84072 and previous config saved to /var/cache/conftool/dbconfig/20251020-053944-root.json
2025-10-20 05:39:59 <logmsgbot> !log marostegui@cumin1003 START - Cookbook sre.dns.netbox
2025-10-20 05:43:17 <logmsgbot> !log marostegui@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: es1027.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1003"
2025-10-20 05:43:36 <logmsgbot> !log marostegui@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: es1027.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1003"
2025-10-20 05:43:37 <logmsgbot> !log marostegui@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
2025-10-20 05:43:37 <logmsgbot> !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts es1027.eqiad.wmnet
2025-10-20 05:46:48 <wikibugs> ops-eqiad, DBA, DC-Ops, decommission-hardware: decommission es1027.eqiad.wmnet - https://phabricator.wikimedia.org/T407595#11288401 (Marostegui) a:''Marostegui''None
2025-10-20 05:46:50 <wikibugs> ops-eqiad, DBA, DC-Ops, decommission-hardware: decommission es1027.eqiad.wmnet - https://phabricator.wikimedia.org/T407595#11288405 (Marostegui) This is ready for DCOps
2025-10-20 05:48:52 <wikibugs> (PS1) Marostegui: db1261: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/1197071 (https://phabricator.wikimedia.org/T406550)
2025-10-20 05:49:53 <wikibugs> (CR) Marostegui: [C:+2] db1261: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/1197071 (https://phabricator.wikimedia.org/T406550) (owner: Marostegui)
2025-10-20 05:54:51 <logmsgbot> !log marostegui@cumin1003 dbctl commit (dc=all): 'db1206 (re)pooling @ 75%: 10', diff saved to https://phabricator.wikimedia.org/P84074 and previous config saved to /var/cache/conftool/dbconfig/20251020-055450-root.json
2025-10-20 05:56:31 <wikibugs> (PS1) Marostegui: instances.yaml: Add db1261 to dbctl [puppet] - https://gerrit.wikimedia.org/r/1197074 (https://phabricator.wikimedia.org/T406550)
2025-10-20 05:57:01 <wikibugs> (CR) Marostegui: [C:+2] instances.yaml: Add db1261 to dbctl [puppet] - https://gerrit.wikimedia.org/r/1197074 (https://phabricator.wikimedia.org/T406550) (owner: Marostegui)
2025-10-20 05:59:00 <logmsgbot> !log marostegui@cumin1003 dbctl commit (dc=all): 'Add db1261 depooled T406550', diff saved to https://phabricator.wikimedia.org/P84075 and previous config saved to /var/cache/conftool/dbconfig/20251020-055859-marostegui.json
2025-10-20 05:59:04 <stashbot> T406550: Productionize db126[0-3] - https://phabricator.wikimedia.org/T406550
2025-10-20 05:59:43 <logmsgbot> !log marostegui@cumin1003 dbctl commit (dc=all): 'db1261 (re)pooling @ 1%: Host provisioned T406550', diff saved to https://phabricator.wikimedia.org/P84076 and previous config saved to /var/cache/conftool/dbconfig/20251020-055942-root.json
2025-10-20 06:09:57 <logmsgbot> !log marostegui@cumin1003 dbctl commit (dc=all): 'db1206 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P84077 and previous config saved to /var/cache/conftool/dbconfig/20251020-060956-root.json
2025-10-20 06:14:49 <logmsgbot> !log marostegui@cumin1003 dbctl commit (dc=all): 'db1261 (re)pooling @ 5%: Host provisioned T406550', diff saved to https://phabricator.wikimedia.org/P84078 and previous config saved to /var/cache/conftool/dbconfig/20251020-061449-root.json
2025-10-20 06:14:54 <stashbot> T406550: Productionize db126[0-3] - https://phabricator.wikimedia.org/T406550
2025-10-20 06:17:44 <jinxer-wm> FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
2025-10-20 06:29:55 <logmsgbot> !log marostegui@cumin1003 dbctl commit (dc=all): 'db1261 (re)pooling @ 7%: Host provisioned T406550', diff saved to https://phabricator.wikimedia.org/P84079 and previous config saved to /var/cache/conftool/dbconfig/20251020-062955-root.json
2025-10-20 06:29:59 <stashbot> T406550: Productionize db126[0-3] - https://phabricator.wikimedia.org/T406550
2025-10-20 06:45:01 <logmsgbot> !log marostegui@cumin1003 dbctl commit (dc=all): 'db1261 (re)pooling @ 10%: Host provisioned T406550', diff saved to https://phabricator.wikimedia.org/P84080 and previous config saved to /var/cache/conftool/dbconfig/20251020-064501-root.json
2025-10-20 06:45:06 <stashbot> T406550: Productionize db126[0-3] - https://phabricator.wikimedia.org/T406550
2025-10-20 07:00:05 <jouncebot> Amir1, Urbanecm, and awight: #bothumor My software never has bugs. It just develops random features. Rise for UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251020T0700).
2025-10-20 07:00:05 <jouncebot> cormacparle, sergi0, and Superpes: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
2025-10-20 07:00:08 <logmsgbot> !log marostegui@cumin1003 dbctl commit (dc=all): 'db1261 (re)pooling @ 20%: Host provisioned T406550', diff saved to https://phabricator.wikimedia.org/P84081 and previous config saved to /var/cache/conftool/dbconfig/20251020-070007-root.json
2025-10-20 07:00:12 <stashbot> T406550: Productionize db126[0-3] - https://phabricator.wikimedia.org/T406550
2025-10-20 07:00:27 <Superpes> o/
2025-10-20 07:04:38 <wikibugs> (CR) Jelto: [V:+1] "PCC SUCCESS (CORE_DIFF 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7301/co"; [puppet] - https://gerrit.wikimedia.org/r/1196929 (https://phabricator.wikimedia.org/T405742) (owner: FNegri)
2025-10-20 07:07:11 <jinxer-wm> FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
2025-10-20 07:08:14 <wikibugs> (CR) Jelto: [V:+1 C:+1] "lgtm from the gitlab-runner side. Also `docker::network` is just used by gitlab-runners afaict." [puppet] - https://gerrit.wikimedia.org/r/1196929 (https://phabricator.wikimedia.org/T405742) (owner: FNegri)
2025-10-20 07:15:14 <logmsgbot> !log marostegui@cumin1003 dbctl commit (dc=all): 'db1261 (re)pooling @ 25%: Host provisioned T406550', diff saved to https://phabricator.wikimedia.org/P84082 and previous config saved to /var/cache/conftool/dbconfig/20251020-071513-root.json
2025-10-20 07:15:18 <stashbot> T406550: Productionize db126[0-3] - https://phabricator.wikimedia.org/T406550
2025-10-20 07:20:00 <wikibugs> (PS1) Marostegui: db1218: Migration to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1197080 (https://phabricator.wikimedia.org/T407463)
2025-10-20 07:20:31 <wikibugs> (CR) Marostegui: [C:+2] db1218: Migration to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1197080 (https://phabricator.wikimedia.org/T407463) (owner: Marostegui)
2025-10-20 07:21:50 <logmsgbot> !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1218.eqiad.wmnet with reason: Maintenance
2025-10-20 07:21:54 <logmsgbot> !log marostegui@cumin1003 dbctl commit (dc=all): 'Depool db1218 for migration to mariadb 10.11', diff saved to https://phabricator.wikimedia.org/P84083 and previous config saved to /var/cache/conftool/dbconfig/20251020-072153-marostegui.json
2025-10-20 07:22:36 <logmsgbot> !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
2025-10-20 07:23:11 <logmsgbot> !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
2025-10-20 07:23:38 <logmsgbot> !log brouberol@deploy2002 helmfile [dse-k8s-codfw] START helmfile.d/admin 'apply'.
2025-10-20 07:24:10 <logmsgbot> !log brouberol@deploy2002 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'apply'.
2025-10-20 07:27:03 <logmsgbot> !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on es2032.codfw.wmnet,sretest2003.codfw.wmnet with reason: Cloning
2025-10-20 07:28:33 <marostegui> !log Stop MariaDB on es2032 to clone sretest2003 T407472
2025-10-20 07:28:36 <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2025-10-20 07:28:37 <stashbot> T407472: Install a testing db with Debian Trixie - https://phabricator.wikimedia.org/T407472
2025-10-20 07:29:21 <sergi0> meh, I overslept, I will move my change for later window
2025-10-20 07:29:41 <logmsgbot> !log marostegui@cumin1003 dbctl commit (dc=all): 'db1218 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P84084 and previous config saved to /var/cache/conftool/dbconfig/20251020-072939-root.json
2025-10-20 07:29:52 <wikibugs> (CR) Marostegui: [C:+1] "We are not removing the check from icinga yet, right?" [puppet] - https://gerrit.wikimedia.org/r/1196918 (https://phabricator.wikimedia.org/T407137) (owner: Tiziano Fogli)
2025-10-20 07:30:15 <wikibugs> (CR) Marostegui: [C:+1] site.pp: set role for db-test* hosts [puppet] - https://gerrit.wikimedia.org/r/1196910 (https://phabricator.wikimedia.org/T400056) (owner: Federico Ceratto)
2025-10-20 07:30:20 <logmsgbot> !log marostegui@cumin1003 dbctl commit (dc=all): 'db1261 (re)pooling @ 30%: Host provisioned T406550', diff saved to https://phabricator.wikimedia.org/P84085 and previous config saved to /var/cache/conftool/dbconfig/20251020-073019-root.json
2025-10-20 07:30:24 <stashbot> T406550: Productionize db126[0-3] - https://phabricator.wikimedia.org/T406550
2025-10-20 07:34:38 <wikibugs> (CR) Majavah: [C:+2] remote: Support timezone-aware objects [software/spicerack] - https://gerrit.wikimedia.org/r/1196139 (https://phabricator.wikimedia.org/T401581) (owner: Majavah)
2025-10-20 07:35:30 <marostegui> !log Stop MariaDB on es2032 to clone sretest2003 T407352
2025-10-20 07:35:31 <wikibugs> (PS1) Marostegui: sretest2003: Add note [puppet] - https://gerrit.wikimedia.org/r/1197081
2025-10-20 07:35:34 <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2025-10-20 07:35:35 <stashbot> T407352: Test config H 1P in external store - https://phabricator.wikimedia.org/T407352
2025-10-20 07:35:50 <wikibugs> (CR) Majavah: [C:+1] Remove Hiera option to disable agent forwarding [puppet] - https://gerrit.wikimedia.org/r/1189855 (https://phabricator.wikimedia.org/T198138) (owner: Muehlenhoff)
2025-10-20 07:36:16 <logmsgbot> !log marostegui@cumin1003 START - Cookbook sre.mysql.pool db2248 gradually with 4 steps - Pool db2248.codfw.wmnet in after cloning
2025-10-20 07:37:13 <wikibugs> (CR) Marostegui: [C:+2] sretest2003: Add note [puppet] - https://gerrit.wikimedia.org/r/1197081 (owner: Marostegui)
2025-10-20 07:41:41 <Superpes> Uhm so no one is available to deploy?
2025-10-20 07:43:57 <wikibugs> (Merged) jenkins-bot: remote: Support timezone-aware objects [software/spicerack] - https://gerrit.wikimedia.org/r/1196139 (https://phabricator.wikimedia.org/T401581) (owner: Majavah)
2025-10-20 07:44:47 <logmsgbot> !log marostegui@cumin1003 dbctl commit (dc=all): 'db1218 (re)pooling @ 50%: 10', diff saved to https://phabricator.wikimedia.org/P84088 and previous config saved to /var/cache/conftool/dbconfig/20251020-074446-root.json
2025-10-20 07:45:21 <jinxer-wm> FIRING: NetworkDeviceAlarmActive: Alarm active on ssw1-f1-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive
2025-10-20 07:45:26 <logmsgbot> !log marostegui@cumin1003 dbctl commit (dc=all): 'db1261 (re)pooling @ 50%: Host provisioned T406550', diff saved to https://phabricator.wikimedia.org/P84089 and previous config saved to /var/cache/conftool/dbconfig/20251020-074525-root.json
2025-10-20 07:45:30 <stashbot> T406550: Productionize db126[0-3] - https://phabricator.wikimedia.org/T406550
2025-10-20 07:46:38 <wikibugs> SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for vicaplet - https://phabricator.wikimedia.org/T407605#11288528 (WMDECyn) Approved from WMDE side
2025-10-20 07:51:48 <jinxer-wm> FIRING: KubernetesCalicoDown: wikikube-worker2203.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=wikikube-worker2203.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
2025-10-20 07:53:27 <wikibugs> (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, October 20 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc"; [mediawiki-config] - https://gerrit.wikimedia.org/r/1193703 (https://phabricator.wikimedia.org/T397258) (owner: Seanleong-wmde)
2025-10-20 07:53:30 <wikibugs> (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, October 20 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc"; [mediawiki-config] - https://gerrit.wikimedia.org/r/1193703 (https://phabricator.wikimedia.org/T397258) (owner: Seanleong-wmde)
2025-10-20 07:56:29 <wikibugs> (CR) Federico Ceratto: [C:+2] site.pp: set role for db-test* hosts [puppet] - https://gerrit.wikimedia.org/r/1196910 (https://phabricator.wikimedia.org/T400056) (owner: Federico Ceratto)
2025-10-20 07:59:53 <logmsgbot> !log marostegui@cumin1003 dbctl commit (dc=all): 'db1218 (re)pooling @ 75%: 10', diff saved to https://phabricator.wikimedia.org/P84091 and previous config saved to /var/cache/conftool/dbconfig/20251020-075952-root.json
2025-10-20 08:00:32 <logmsgbot> !log marostegui@cumin1003 dbctl commit (dc=all): 'db1261 (re)pooling @ 60%: Host provisioned T406550', diff saved to https://phabricator.wikimedia.org/P84092 and previous config saved to /var/cache/conftool/dbconfig/20251020-080031-root.json
2025-10-20 08:00:35 <stashbot> T406550: Productionize db126[0-3] - https://phabricator.wikimedia.org/T406550
2025-10-20 08:04:12 <logmsgbot> !log cmooney@cumin1003 START - Cookbook sre.network.peering with action 'configure' for AS: 60051
2025-10-20 08:04:13 <wikibugs> (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, October 20 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc"; [mediawiki-config] - https://gerrit.wikimedia.org/r/1196857 (https://phabricator.wikimedia.org/T406332) (owner: Phuedx)
2025-10-20 08:04:38 <logmsgbot> !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 60051
2025-10-20 08:07:21 <logmsgbot> !log marostegui@cumin1003 dbctl commit (dc=all): 'Promote es1051 to es3 primary as es1028 will be decommissioned T406690 T407720', diff saved to https://phabricator.wikimedia.org/P84094 and previous config saved to /var/cache/conftool/dbconfig/20251020-080721-marostegui.json
2025-10-20 08:07:27 <stashbot> T406690: Decommission es1026 - es1034 - https://phabricator.wikimedia.org/T406690
2025-10-20 08:07:27 <stashbot> T407720: decommission es1028.eqiad.wmnet - https://phabricator.wikimedia.org/T407720
2025-10-20 08:08:05 <logmsgbot> !log marostegui@cumin1003 dbctl commit (dc=all): 'Depool es1028 T407720', diff saved to https://phabricator.wikimedia.org/P84095 and previous config saved to /var/cache/conftool/dbconfig/20251020-080804-marostegui.json
2025-10-20 08:08:50 <wikibugs> (PS1) Marostegui: es1028: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1197191 (https://phabricator.wikimedia.org/T407720)
2025-10-20 08:09:34 <wikibugs> (CR) Marostegui: [C:+2] es1028: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1197191 (https://phabricator.wikimedia.org/T407720) (owner: Marostegui)
2025-10-20 08:11:43 <marostegui> federico3: I think your puppet-merge is waiting for your answer, as it's been locked for 20 mins now, can you double check?
2025-10-20 08:14:59 <logmsgbot> !log marostegui@cumin1003 dbctl commit (dc=all): 'db1218 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P84096 and previous config saved to /var/cache/conftool/dbconfig/20251020-081458-root.json
2025-10-20 08:15:38 <logmsgbot> !log marostegui@cumin1003 dbctl commit (dc=all): 'db1261 (re)pooling @ 75%: Host provisioned T406550', diff saved to https://phabricator.wikimedia.org/P84097 and previous config saved to /var/cache/conftool/dbconfig/20251020-081537-root.json
2025-10-20 08:15:42 <stashbot> T406550: Productionize db126[0-3] - https://phabricator.wikimedia.org/T406550
2025-10-20 08:17:59 <wikibugs> SRE, Cloud-VPS, DC-Ops, cloud-services-team (FY2025/26-Q1): Experiment with cloudcephosd1050 and cloudcephosd1051 in single-nic configuration - https://phabricator.wikimedia.org/T405478#11288584 (dcaro) Nice! I'm eager to see the results of adding it to the cluster, as now a single NIC might be a...
2025-10-20 08:20:02 <marostegui> federico3: ping
2025-10-20 08:21:02 <wikibugs> (CR) Volans: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1196886 (https://phabricator.wikimedia.org/T389380) (owner: Jcrespo)
2025-10-20 08:21:45 <logmsgbot> !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2248 gradually with 4 steps - Pool db2248.codfw.wmnet in after cloning
2025-10-20 08:21:48 <logmsgbot> !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db2248.codfw.wmnet onto db2245.codfw.wmnet
2025-10-20 08:26:13 <wikibugs> SRE: Sendemail network error (deployment) - https://phabricator.wikimedia.org/T407723 (MKopec) NEW
2025-10-20 08:28:43 <wikibugs> SRE: Sendmail network error (deployment) - https://phabricator.wikimedia.org/T407723#11288627 (MKopec)
2025-10-20 08:30:01 <logmsgbot> !log fceratto@cumin1003 START - Cookbook sre.mysql.pool es2032 gradually with 4 steps - Pool es2032.codfw.wmnet in after cloning
2025-10-20 08:30:44 <logmsgbot> !log marostegui@cumin1003 dbctl commit (dc=all): 'db1261 (re)pooling @ 100%: Host provisioned T406550', diff saved to https://phabricator.wikimedia.org/P84100 and previous config saved to /var/cache/conftool/dbconfig/20251020-083043-root.json
2025-10-20 08:30:48 <stashbot> T406550: Productionize db126[0-3] - https://phabricator.wikimedia.org/T406550
2025-10-20 08:31:44 <logmsgbot> !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on es2028.codfw.wmnet,sretest2003.codfw.wmnet with reason: Cloning
2025-10-20 08:33:35 <wikibugs> ('PS4) ''Jcrespo: cumin: Migrate cumin1002 mariadb remote backups to cumin1003 [puppet] - ''https://gerrit.wikimedia.org/r/1196886 (https://phabricator.wikimedia.org/T389380)'
2025-10-20 08:34:21 <logmsgbot> !log fceratto@cumin1003 END (ERROR) - Cookbook sre.mysql.pool (exit_code=97) es2032 gradually with 4 steps - Pool es2032.codfw.wmnet in after cloning
2025-10-20 08:36:03 <logmsgbot> !log fceratto@cumin1003 END (FAIL) - Cookbook sre.mysql.clone_es (exit_code=99) of es2032.codfw.wmnet onto es2055.codfw.wmnet
2025-10-20 08:37:01 <logmsgbot> !log fceratto@cumin1003 START - Cookbook sre.mysql.depool es2032 - Cloning issue
2025-10-20 08:37:09 <logmsgbot> !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) es2032 - Cloning issue
2025-10-20 08:38:50 <wikibugs> ('CR) ''Jcrespo: [C:''+2] cumin: Migrate cumin1002 mariadb remote backups to cumin1003 [puppet] - ''https://gerrit.wikimedia.org/r/1196886 (https://phabricator.wikimedia.org/T389380) (owner: ''Jcrespo)'
2025-10-20 08:39:33 <icinga-wm> PROBLEM - MariaDB read only es1 on es2032 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
2025-10-20 08:39:34 <icinga-wm> PROBLEM - mysqld processes #page on es2032 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
2025-10-20 08:40:00 <federico3> the host had a glitch, looking at it
2025-10-20 08:40:05 <jynus> mmmm
2025-10-20 08:40:43 <marostegui> !incidents
2025-10-20 08:40:44 <sirenbot> 6889 (UNACKED) es2032 (paged)/mysqld processes (paged)
2025-10-20 08:40:48 <marostegui> !ack 6689
2025-10-20 08:40:48 <sirenbot> Attempt to ack incident 6689 failed.
2025-10-20 08:40:53 <marostegui> !ack 6889
2025-10-20 08:40:54 <sirenbot> 6889 (ACKED) es2032 (paged)/mysqld processes (paged)
2025-10-20 08:41:44 <logmsgbot> !log marostegui@cumin1003 dbctl commit (dc=all): 'Depool es2028 to clone sretest2003', diff saved to https://phabricator.wikimedia.org/P84102 and previous config saved to /var/cache/conftool/dbconfig/20251020-084143-marostegui.json
2025-10-20 08:42:05 <logmsgbot> !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on es2032.codfw.wmnet with reason: Cloning tool bug
2025-10-20 08:42:35 <marostegui> federico3: If you will reclone it, maybe it will need more than 4 hours given the size of external store?
2025-10-20 08:42:43 <wikibugs> ('CR) ''Santiago Faci: [C:''+1] MetricsPlatform: Initialize $wgMetricsPlatformExperimentStreamNames [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1196857 (https://phabricator.wikimedia.org/T406332) (owner: ''Phuedx)'
2025-10-20 08:43:05 <federico3> marostegui: reclone *from* it or reclone es2032 itself?
2025-10-20 08:43:26 <marostegui> federico3: i don't know if you have to reclone or not, just asking if 4h is enough for anything you need
2025-10-20 08:43:44 <jinxer-wm> FIRING: [2x] NodeTextfileStale: Stale textfile for wdqs2009:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
2025-10-20 08:43:50 <federico3> it's probably longer but transfer.py does not calculate ETA...
2025-10-20 08:45:01 <marostegui> federico3: then maybe extend the downtime a bit more
2025-10-20 08:45:08 <marostegui> to avoid paging
2025-10-20 08:46:31 <federico3> I don't know yet if we want to repool it now or clone it. Is the bug in transfer.py able to cause data corruption? (in theory it should be only reading from the source host, not make changes)
2025-10-20 08:47:07 <marostegui> it shouldn't cause data corruption on the source host, no
2025-10-20 08:47:13 <wikibugs> 'SRE, ''Infrastructure-Foundations: Increase net.nf_conntrack_max on kerberos hosts if needed - https://phabricator.wikimedia.org/T407726 (''cmooney) ''NEW p:''Triage''Low'
2025-10-20 08:48:03 <marostegui> federico3: you should be fine to repool
2025-10-20 08:48:34 <wikibugs> 'SRE, ''Infrastructure-Foundations: Increase net.nf_conntrack_max on kerberos hosts if needed - https://phabricator.wikimedia.org/T407726#11288728 (''cmooney)'
2025-10-20 08:49:34 <federico3> odd, pigz and nc terminated by themselves
2025-10-20 08:50:34 <icinga-wm> RECOVERY - mysqld processes #page on es2032 is OK: PROCS OK: 1 process with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
2025-10-20 08:50:34 <icinga-wm> RECOVERY - MariaDB read only es1 on es2032 is OK: Version 10.11.13-MariaDB-log, Uptime 34s, read_only: True, event_scheduler: True, 24.45 QPS, connection latency: 0.032657s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
2025-10-20 08:50:54 <wikibugs> 'SRE, ''Infrastructure-Foundations: Increase net.nf_conntrack_max on kerberos hosts if needed - https://phabricator.wikimedia.org/T407726#11288760 (''cmooney)'
2025-10-20 08:52:34 <jinxer-wm> FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
2025-10-20 08:56:10 <wikibugs> ('PS1) ''Brouberol: airflow: enable the triggerer to hit the Kubernetes API servers with appropriate permissions [deployment-charts] - ''https://gerrit.wikimedia.org/r/1197207 (https://phabricator.wikimedia.org/T406958)'
2025-10-20 08:59:57 <wikibugs> ('CR) ''Kevin Bazira: [C:''+1] airflow: enable the triggerer to hit the Kubernetes API servers with appropriate permissions [deployment-charts] - ''https://gerrit.wikimedia.org/r/1197207 (https://phabricator.wikimedia.org/T406958) (owner: ''Brouberol)'
2025-10-20 09:00:38 <wikibugs> ('CR) ''Brouberol: [C:''+2] airflow: enable the triggerer to hit the Kubernetes API servers with appropriate permissions [deployment-charts] - ''https://gerrit.wikimedia.org/r/1197207 (https://phabricator.wikimedia.org/T406958) (owner: ''Brouberol)'
2025-10-20 09:03:09 <logmsgbot> !log jgiannelos@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-experimental: apply
2025-10-20 09:05:56 <logmsgbot> !log fceratto@cumin1003 START - Cookbook sre.hosts.remove-downtime for es2032.codfw.wmnet
2025-10-20 09:05:57 <logmsgbot> !log fceratto@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for es2032.codfw.wmnet
2025-10-20 09:06:16 <logmsgbot> !log fceratto@cumin1003 START - Cookbook sre.mysql.pool es2032 gradually with 4 steps - Pooling in
2025-10-20 09:07:29 <logmsgbot> !log jgiannelos@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-experimental: apply
2025-10-20 09:07:54 <logmsgbot> !log jgiannelos@deploy2002 helmfile [codfw] START helmfile.d/services/mw-experimental: apply
2025-10-20 09:12:06 <logmsgbot> !log jgiannelos@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-experimental: apply
2025-10-20 09:16:57 <wikibugs> ('PS1) ''Jgiannelos: mw-experimental: Fix motd so user with wikidev permissions can restart the timers [puppet] - ''https://gerrit.wikimedia.org/r/1197210'
2025-10-20 09:18:16 <wikibugs> ('PS2) ''Jgiannelos: mw-experimental: Fix motd for users with wikidev permissions [puppet] - ''https://gerrit.wikimedia.org/r/1197210'
2025-10-20 09:27:55 <wikibugs> ('PS1) ''Marostegui: db2247: Enable notifications [puppet] - ''https://gerrit.wikimedia.org/r/1197211 (https://phabricator.wikimedia.org/T406551)'
2025-10-20 09:28:11 <jinxer-wm> FIRING: [2x] SystemdUnitFailed: amd-k8s-node-labeller.service on ml-serve1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
2025-10-20 09:31:13 <wikibugs> ('CR) ''Krinkle: trafficserver: Add missing REST Gateway for Beta Cluster (''1 comment) [puppet] - ''https://gerrit.wikimedia.org/r/1182652 (https://phabricator.wikimedia.org/T404387) (owner: ''Krinkle)'
2025-10-20 09:33:50 <wikibugs> ('CR) ''Krinkle: trafficserver: Add missing REST Gateway for Beta Cluster (''1 comment) [puppet] - ''https://gerrit.wikimedia.org/r/1182652 (https://phabricator.wikimedia.org/T404387) (owner: ''Krinkle)'
2025-10-20 09:34:33 <wikibugs> ('CR) ''Marostegui: [C:''+2] db2247: Enable notifications [puppet] - ''https://gerrit.wikimedia.org/r/1197211 (https://phabricator.wikimedia.org/T406551) (owner: ''Marostegui)'
2025-10-20 09:35:07 <jinxer-wm> FIRING: SwitchCoreInterfaceDown: Switch core interface down - ssw1-d1-eqiad:ethernet-1/31 (Transport: ssw1-f1-eqiad:et-0/0/29 (Equinix, 21996480) {#0107202f1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-d1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown
2025-10-20 09:37:12 <wikibugs> ('PS1) ''Marostegui: db2247: Add to dbctl [puppet] - ''https://gerrit.wikimedia.org/r/1197214 (https://phabricator.wikimedia.org/T406551)'
2025-10-20 09:38:23 <wikibugs> ('CR) ''Marostegui: [C:''+2] db2247: Add to dbctl [puppet] - ''https://gerrit.wikimedia.org/r/1197214 (https://phabricator.wikimedia.org/T406551) (owner: ''Marostegui)'
2025-10-20 09:42:07 <logmsgbot> !log marostegui@cumin1003 dbctl commit (dc=all): 'Add db2247 to dbctl T406551', diff saved to https://phabricator.wikimedia.org/P84106 and previous config saved to /var/cache/conftool/dbconfig/20251020-094207-marostegui.json
2025-10-20 09:42:12 <stashbot> T406551: Productionize db224[5-8] - https://phabricator.wikimedia.org/T406551
2025-10-20 09:42:12 <logmsgbot> !log marostegui@cumin1003 dbctl commit (dc=all): 'db2247 (re)pooling @ 1%: Host provisioned T406551', diff saved to https://phabricator.wikimedia.org/P84107 and previous config saved to /var/cache/conftool/dbconfig/20251020-094212-root.json
2025-10-20 09:44:29 <wikibugs> ('PS1) ''Federico Ceratto: es2055.yaml: enable notifications [puppet] - ''https://gerrit.wikimedia.org/r/1197216 (https://phabricator.wikimedia.org/T402859)'
2025-10-20 09:45:02 <wikibugs> ('CR) ''Marostegui: [C:''+1] "All green in icinga" [puppet] - ''https://gerrit.wikimedia.org/r/1197216 (https://phabricator.wikimedia.org/T402859) (owner: ''Federico Ceratto)'
2025-10-20 09:45:35 <wikibugs> ('PS2) ''Federico Ceratto: es2055.yaml, instances.yaml: prepare es2055 [puppet] - ''https://gerrit.wikimedia.org/r/1197216 (https://phabricator.wikimedia.org/T402859)'
2025-10-20 09:46:07 <wikibugs> ('CR) ''Marostegui: [C:''+1] es2055.yaml, instances.yaml: prepare es2055 [puppet] - ''https://gerrit.wikimedia.org/r/1197216 (https://phabricator.wikimedia.org/T402859) (owner: ''Federico Ceratto)'
2025-10-20 09:51:43 <logmsgbot> !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2032 gradually with 4 steps - Pooling in
2025-10-20 09:54:32 <wikibugs> ('CR) ''FNegri: [C:''+2] docker::network allow custom MTU value [puppet] - ''https://gerrit.wikimedia.org/r/1196929 (https://phabricator.wikimedia.org/T405742) (owner: ''FNegri)'
2025-10-20 09:57:18 <logmsgbot> !log marostegui@cumin1003 dbctl commit (dc=all): 'db2247 (re)pooling @ 5%: Host provisioned T406551', diff saved to https://phabricator.wikimedia.org/P84109 and previous config saved to /var/cache/conftool/dbconfig/20251020-095718-root.json
2025-10-20 09:57:23 <stashbot> T406551: Productionize db224[5-8] - https://phabricator.wikimedia.org/T406551
2025-10-20 10:00:04 <jouncebot> Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251020T1000)
2025-10-20 10:01:42 <wikibugs> ('CR) ''Federico Ceratto: [C:''+2] es2055.yaml, instances.yaml: prepare es2055 [puppet] - ''https://gerrit.wikimedia.org/r/1197216 (https://phabricator.wikimedia.org/T402859) (owner: ''Federico Ceratto)'
2025-10-20 10:04:19 <logmsgbot> !log fceratto@cumin1003 dbctl commit (dc=all): 'Add es2055 T402859', diff saved to https://phabricator.wikimedia.org/P84110 and previous config saved to /var/cache/conftool/dbconfig/20251020-100419-fceratto.json
2025-10-20 10:04:24 <stashbot> T402859: Productionize es2049-es2057 - https://phabricator.wikimedia.org/T402859
2025-10-20 10:04:33 <logmsgbot> !log fceratto@cumin1003 START - Cookbook sre.hosts.remove-downtime for es2055.codfw.wmnet
2025-10-20 10:04:34 <logmsgbot> !log fceratto@cumin1003 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for es2055.codfw.wmnet
2025-10-20 10:10:28 <logmsgbot> !log fceratto@cumin1003 START - Cookbook sre.mysql.pool es2055 gradually with 4 steps - Pooling in new host
2025-10-20 10:12:24 <cormacparle> gah! sorry folks, mixed up the times for that deployment I had scheduled - will schedule for this afternoon instead
2025-10-20 10:12:25 <logmsgbot> !log marostegui@cumin1003 dbctl commit (dc=all): 'db2247 (re)pooling @ 7%: Host provisioned T406551', diff saved to https://phabricator.wikimedia.org/P84111 and previous config saved to /var/cache/conftool/dbconfig/20251020-101224-root.json
2025-10-20 10:12:29 <stashbot> T406551: Productionize db224[5-8] - https://phabricator.wikimedia.org/T406551
2025-10-20 10:14:40 <wikibugs> ('CR) ''ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, October 20 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1196703 (https://phabricator.wikimedia.org/T41510) (owner: ''Cparle)'
2025-10-20 10:17:44 <jinxer-wm> FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
2025-10-20 10:20:24 <wikibugs> ('CR) ''Hnowlan: [C:''+1] [DNM] Set wgRestSandboxSpecs['wmf-restbase'] on testwiki to use the static specs [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1190742 (https://phabricator.wikimedia.org/T396805) (owner: ''Aaron Schulz)'
2025-10-20 10:20:31 <wikibugs> ('CR) ''Hnowlan: [C:''+1] Set wgRestSandboxSpecs['wmf-restbase'] to use the static specs everywhere [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1190743 (https://phabricator.wikimedia.org/T396805) (owner: ''Aaron Schulz)'
2025-10-20 10:27:30 <logmsgbot> !log marostegui@cumin1003 dbctl commit (dc=all): 'db2247 (re)pooling @ 10%: Host provisioned T406551', diff saved to https://phabricator.wikimedia.org/P84112 and previous config saved to /var/cache/conftool/dbconfig/20251020-102730-root.json
2025-10-20 10:27:35 <stashbot> T406551: Productionize db224[5-8] - https://phabricator.wikimedia.org/T406551
2025-10-20 10:30:17 <wikibugs> ('PS1) ''Effie Mouzeli: mw-experimental-mediawiki-image-update: support environment in release [puppet] - ''https://gerrit.wikimedia.org/r/1197225 (https://phabricator.wikimedia.org/T405110)'
2025-10-20 10:31:52 <wikibugs> ('CR) ''Effie Mouzeli: [C:''+1] mw-experimental-mediawiki-image-update: support environment in release [puppet] - ''https://gerrit.wikimedia.org/r/1197225 (https://phabricator.wikimedia.org/T405110) (owner: ''Effie Mouzeli)'
2025-10-20 10:32:11 <wikibugs> ('CR) ''Effie Mouzeli: [C:''+1] mw-experimental: Fix motd for users with wikidev permissions [puppet] - ''https://gerrit.wikimedia.org/r/1197210 (owner: ''Jgiannelos)'
2025-10-20 10:33:19 <wikibugs> ('CR) ''Jgiannelos: [C:''+1] mw-experimental-mediawiki-image-update: support environment in release [puppet] - ''https://gerrit.wikimedia.org/r/1197225 (https://phabricator.wikimedia.org/T405110) (owner: ''Effie Mouzeli)'
2025-10-20 10:34:04 <wikibugs> ('CR) ''Effie Mouzeli: [C:''+2] mw-experimental: Fix motd for users with wikidev permissions [puppet] - ''https://gerrit.wikimedia.org/r/1197210 (owner: ''Jgiannelos)'
2025-10-20 10:34:18 <wikibugs> ('CR) ''Effie Mouzeli: [C:''+2] mw-experimental-mediawiki-image-update: support environment in release [puppet] - ''https://gerrit.wikimedia.org/r/1197225 (https://phabricator.wikimedia.org/T405110) (owner: ''Effie Mouzeli)'
2025-10-20 10:42:36 <logmsgbot> !log marostegui@cumin1003 dbctl commit (dc=all): 'db2247 (re)pooling @ 20%: Host provisioned T406551', diff saved to https://phabricator.wikimedia.org/P84114 and previous config saved to /var/cache/conftool/dbconfig/20251020-104236-root.json
2025-10-20 10:42:39 <wikibugs> 'SRE, ''Infrastructure-Foundations, ''Mail: Sendmail network error (deployment) - https://phabricator.wikimedia.org/T407723#11289002 (''Aklapper)'
2025-10-20 10:42:41 <stashbot> T406551: Productionize db224[5-8] - https://phabricator.wikimedia.org/T406551
2025-10-20 10:42:56 <wikibugs> 'sre-alert-triage, ''SRE Observability: Alert in need of triage: PuppetConstantChange (instance prometheus2007:9100) - https://phabricator.wikimedia.org/T407484#11289007 (''tappof) a:''tappof'
2025-10-20 10:48:17 <wikibugs> ('PS1) ''Marostegui: db1219: Migrate to MariaDB 10.11 [puppet] - ''https://gerrit.wikimedia.org/r/1197228 (https://phabricator.wikimedia.org/T407463)'
2025-10-20 10:49:11 <wikibugs> ('CR) ''Marostegui: [C:''+2] db1219: Migrate to MariaDB 10.11 [puppet] - ''https://gerrit.wikimedia.org/r/1197228 (https://phabricator.wikimedia.org/T407463) (owner: ''Marostegui)'
2025-10-20 10:49:58 <logmsgbot> !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1219.eqiad.wmnet with reason: Maintenance
2025-10-20 10:50:03 <logmsgbot> !log marostegui@cumin1003 dbctl commit (dc=all): 'Depool db1219 for migration to mariadb 10.11', diff saved to https://phabricator.wikimedia.org/P84115 and previous config saved to /var/cache/conftool/dbconfig/20251020-105002-marostegui.json
2025-10-20 10:53:36 <wikibugs> ('PS1) ''Effie Mouzeli: proxoid: fix healthchecks [puppet] - ''https://gerrit.wikimedia.org/r/1197230 (https://phabricator.wikimedia.org/T407615)'
2025-10-20 10:57:42 <logmsgbot> !log marostegui@cumin1003 dbctl commit (dc=all): 'db2247 (re)pooling @ 25%: Host provisioned T406551', diff saved to https://phabricator.wikimedia.org/P84117 and previous config saved to /var/cache/conftool/dbconfig/20251020-105742-root.json
2025-10-20 10:57:48 <stashbot> T406551: Productionize db224[5-8] - https://phabricator.wikimedia.org/T406551
2025-10-20 10:57:55 <logmsgbot> !log marostegui@cumin1003 dbctl commit (dc=all): 'db1219 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P84118 and previous config saved to /var/cache/conftool/dbconfig/20251020-105754-root.json
2025-10-20 10:58:04 <wikibugs> ('PS1) ''Slyngshede: P::cache::haproxy enable x-is-browser everywhere [puppet] - ''https://gerrit.wikimedia.org/r/1197231 (https://phabricator.wikimedia.org/T398161)'
2025-10-20 11:02:40 <wikibugs> ('CR) ''Slyngshede: [V:''+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7302/console" [puppet] - ''https://gerrit.wikimedia.org/r/1197231 (https://phabricator.wikimedia.org/T398161) (owner: ''Slyngshede)'
2025-10-20 11:02:56 <wikibugs> ('CR) ''Hnowlan: [C:''+1] "I think this is enough of a general concern for SRE at large (and beyond) that keeping SRE as the team here makes sense to me." [puppet] - ''https://gerrit.wikimedia.org/r/1196943 (https://phabricator.wikimedia.org/T407120) (owner: ''Tiziano Fogli)'
2025-10-20 11:06:44 <wikibugs> ('CR) ''Slyngshede: [V:''+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7303/console" [puppet] - ''https://gerrit.wikimedia.org/r/1197231 (https://phabricator.wikimedia.org/T398161) (owner: ''Slyngshede)'
2025-10-20 11:07:11 <jinxer-wm> FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
2025-10-20 11:12:27 <wikibugs> ('CR) ''Slyngshede: [V:''+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7304/co" [puppet] - ''https://gerrit.wikimedia.org/r/1197231 (https://phabricator.wikimedia.org/T398161) (owner: ''Slyngshede)'
2025-10-20 11:12:48 <logmsgbot> !log marostegui@cumin1003 dbctl commit (dc=all): 'db2247 (re)pooling @ 30%: Host provisioned T406551', diff saved to https://phabricator.wikimedia.org/P84120 and previous config saved to /var/cache/conftool/dbconfig/20251020-111248-root.json
2025-10-20 11:12:53 <stashbot> T406551: Productionize db224[5-8] - https://phabricator.wikimedia.org/T406551
2025-10-20 11:13:01 <logmsgbot> !log marostegui@cumin1003 dbctl commit (dc=all): 'db1219 (re)pooling @ 50%: 10', diff saved to https://phabricator.wikimedia.org/P84121 and previous config saved to /var/cache/conftool/dbconfig/20251020-111300-root.json
2025-10-20 11:16:43 <icinga-wm> PROBLEM - Thanos swift https on thanos-fe1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Thanos
2025-10-20 11:18:06 <wikibugs> ('PS1) ''Jelto: admin: remove legacy ssh key for jelto [puppet] - ''https://gerrit.wikimedia.org/r/1197233 (https://phabricator.wikimedia.org/T407606)'
2025-10-20 11:19:33 <icinga-wm> RECOVERY - Thanos swift https on thanos-fe1004 is OK: HTTP OK: HTTP/1.1 200 OK - 279 bytes in 0.050 second response time https://wikitech.wikimedia.org/wiki/Thanos
2025-10-20 11:21:02 <logmsgbot> !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2055 gradually with 4 steps - Pooling in new host
2025-10-20 11:23:31 <wikibugs> ('CR) ''Vgutierrez: [C:''+1] P::cache::haproxy enable x-is-browser everywhere [puppet] - ''https://gerrit.wikimedia.org/r/1197231 (https://phabricator.wikimedia.org/T398161) (owner: ''Slyngshede)'
2025-10-20 11:24:17 <jinxer-wm> FIRING: [2x] ProbeDown: Service wdqs1018:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1018:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
2025-10-20 11:27:54 <logmsgbot> !log marostegui@cumin1003 dbctl commit (dc=all): 'db2247 (re)pooling @ 50%: Host provisioned T406551', diff saved to https://phabricator.wikimedia.org/P84123 and previous config saved to /var/cache/conftool/dbconfig/20251020-112754-root.json
2025-10-20 11:27:59 <stashbot> T406551: Productionize db224[5-8] - https://phabricator.wikimedia.org/T406551
2025-10-20 11:28:04 <wikibugs> ('CR) ''Vgutierrez: [C:''+1] "tested against the 4 realservers using `curl --connect-to ::$(dig +short hcaptcha1001.wikimedia.org):4260 https://hcaptcha.wikimedia.org/h" [puppet] - ''https://gerrit.wikimedia.org/r/1197230 (https://phabricator.wikimedia.org/T407615) (owner: ''Effie Mouzeli)'
2025-10-20 11:28:07 <logmsgbot> !log marostegui@cumin1003 dbctl commit (dc=all): 'db1219 (re)pooling @ 75%: 10', diff saved to https://phabricator.wikimedia.org/P84124 and previous config saved to /var/cache/conftool/dbconfig/20251020-112806-root.json
2025-10-20 11:29:17 <jinxer-wm> RESOLVED: [2x] ProbeDown: Service wdqs1018:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1018:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
2025-10-20 11:31:06 <wikibugs> 'ops-eqiad, ''SRE, ''DBA, ''DC-Ops, ''decommission-hardware: decommission es1027.eqiad.wmnet - https://phabricator.wikimedia.org/T407595#11289124 (''Jclark-ctr) a:''Jclark-ctr'
2025-10-20 11:31:22 <wikibugs> 'ops-eqiad, ''SRE, ''DBA, ''DC-Ops, ''decommission-hardware: decommission es1027.eqiad.wmnet - https://phabricator.wikimedia.org/T407595#11289127 (''Jclark-ctr) ''Open''Resolved'
2025-10-20 11:33:05 <wikibugs> ('PS1) ''Federico Ceratto: site.pp, es2056.yaml, preseed.yaml: Prepare es2056 for es2 [puppet] - ''https://gerrit.wikimedia.org/r/1197238 (https://phabricator.wikimedia.org/T402859)'
2025-10-20 11:34:47 <wikibugs> ('CR) ''Fabfur: [C:''+1] P::cache::haproxy enable x-is-browser everywhere [puppet] - ''https://gerrit.wikimedia.org/r/1197231 (https://phabricator.wikimedia.org/T398161) (owner: ''Slyngshede)'
2025-10-20 11:39:57 <wikibugs> ('CR) ''Slyngshede: [V:''+1 C:''+2] P::cache::haproxy enable x-is-browser everywhere [puppet] - ''https://gerrit.wikimedia.org/r/1197231 (https://phabricator.wikimedia.org/T398161) (owner: ''Slyngshede)'
2025-10-20 11:42:44 <wikibugs> ('PS1) ''Majavah: admin: home: Add mux alias for taavi [puppet] - ''https://gerrit.wikimedia.org/r/1197240'
2025-10-20 11:43:00 <logmsgbot> !log marostegui@cumin1003 dbctl commit (dc=all): 'db2247 (re)pooling @ 60%: Host provisioned T406551', diff saved to https://phabricator.wikimedia.org/P84125 and previous config saved to /var/cache/conftool/dbconfig/20251020-114300-root.json
2025-10-20 11:43:04 <stashbot> T406551: Productionize db224[5-8] - https://phabricator.wikimedia.org/T406551
2025-10-20 11:43:13 <logmsgbot> !log marostegui@cumin1003 dbctl commit (dc=all): 'db1219 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P84126 and previous config saved to /var/cache/conftool/dbconfig/20251020-114312-root.json
2025-10-20 11:44:08 <wikibugs> ('CR) ''Marostegui: [C:''+1] site.pp, es2056.yaml, preseed.yaml: Prepare es2056 for es2 [puppet] - ''https://gerrit.wikimedia.org/r/1197238 (https://phabricator.wikimedia.org/T402859) (owner: ''Federico Ceratto)'
2025-10-20 11:45:21 <jinxer-wm> FIRING: NetworkDeviceAlarmActive: Alarm active on ssw1-f1-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive
2025-10-20 11:45:39 <wikibugs> ('CR) ''Federico Ceratto: [C:''+2] site.pp, es2056.yaml, preseed.yaml: Prepare es2056 for es2 [puppet] - ''https://gerrit.wikimedia.org/r/1197238 (https://phabricator.wikimedia.org/T402859) (owner: ''Federico Ceratto)'
2025-10-20 11:47:04 <wikibugs> ('CR) ''Hnowlan: [C:''+1] Route transform/wikitext/to/lint(.*) to the gateway on test2wiki [puppet] - ''https://gerrit.wikimedia.org/r/1189936 (https://phabricator.wikimedia.org/T385066) (owner: ''Aaron Schulz)'
2025-10-20 11:48:57 <wikibugs> ('CR) ''Hnowlan: [C:''+1] "I can get this one out for you today if you'd like." [puppet] - ''https://gerrit.wikimedia.org/r/1189936 (https://phabricator.wikimedia.org/T385066) (owner: ''Aaron Schulz)'
2025-10-20 11:51:48 <jinxer-wm> FIRING: KubernetesCalicoDown: wikikube-worker2203.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=wikikube-worker2203.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
2025-10-20 11:52:43 <godog> !log add cloudcephosd1051 to the cluster via wmcs.ceph.osd.bootstrap_and_add - T405478
2025-10-20 11:52:47 <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2025-10-20 11:52:48 <stashbot> T405478: Experiment with cloudcephosd1050 and cloudcephosd1051 in single-nic configuration - https://phabricator.wikimedia.org/T405478
2025-10-20 11:54:01 <wikibugs> 'SRE, ''Cloud-VPS, ''DC-Ops, ''cloud-services-team (FY2025/26-Q1): Experiment with cloudcephosd1050 and cloudcephosd1051 in single-nic configuration - https://phabricator.wikimedia.org/T405478#11289156 (''fgiunchedi) >>! In T405478#11288584, @dcaro wrote: > Nice! I'm eager to see the results of adding it...'
2025-10-20 11:56:19 <wikibugs> 'ops-eqiad, ''SRE, ''DC-Ops: aqs1012 is down - https://phabricator.wikimedia.org/T407414#11289159 (''Jclark-ctr) a:''Jclark-ctr''Eevans'
2025-10-20 11:58:05 <wikibugs> ('CR) ''Brouberol: "I think we won't need to, cf the WIP work in https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1196700" [deployment-charts] - ''https://gerrit.wikimedia.org/r/1196505 (https://phabricator.wikimedia.org/T406876) (owner: ''Btullis)'
2025-10-20 11:58:06 <logmsgbot> !log marostegui@cumin1003 dbctl commit (dc=all): 'db2247 (re)pooling @ 75%: Host provisioned T406551', diff saved to https://phabricator.wikimedia.org/P84127 and previous config saved to /var/cache/conftool/dbconfig/20251020-115805-root.json
2025-10-20 11:58:11 <stashbot> T406551: Productionize db224[5-8] - https://phabricator.wikimedia.org/T406551
2025-10-20 12:02:15 <jinxer-wm> FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid releases routed via main (k8s) 2.169s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
2025-10-20 12:05:52 <Dreamy_Jazz> Is it just me or does gerrit feel slow?
2025-10-20 12:06:15 <Dreamy_Jazz> Like refreshing the page gets a slow response and my last attempt gets an `ERR_CONNECTION_RESET` error
2025-10-20 12:06:56 <wikibugs> ('PS1) ''Majavah: toolforge: toolviews: Move nginx-specific parts to nginx profile [puppet] - ''https://gerrit.wikimedia.org/r/1197242 (https://phabricator.wikimedia.org/T284558)'
2025-10-20 12:06:58 <wikibugs> ('PS1) ''Majavah: toolforge: toolviews: Add initial HAProxy support [puppet] - ''https://gerrit.wikimedia.org/r/1197243 (https://phabricator.wikimedia.org/T284558)'
2025-10-20 12:07:15 <jinxer-wm> RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid releases routed via main (k8s) 2.103s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
2025-10-20 12:07:36 <wikibugs> ('CR) ''CI reject: [V:''-1] toolforge: toolviews: Move nginx-specific parts to nginx profile [puppet] - ''https://gerrit.wikimedia.org/r/1197242 (https://phabricator.wikimedia.org/T284558) (owner: ''Majavah)'
2025-10-20 12:07:42 <wikibugs> ('CR) ''CI reject: [V:''-1] toolforge: toolviews: Add initial HAProxy support [puppet] - ''https://gerrit.wikimedia.org/r/1197243 (https://phabricator.wikimedia.org/T284558) (owner: ''Majavah)'
2025-10-20 12:08:45 <Dreamy_Jazz> Gerrit seems to be back to normal for me now
2025-10-20 12:09:03 <wikibugs> ('PS1) ''Filippo Giunchedi: cloudceph: set mtu only when interfaces exist [puppet] - ''https://gerrit.wikimedia.org/r/1197245 (https://phabricator.wikimedia.org/T405478)'
2025-10-20 12:12:20 <wikibugs> ('PS2) ''Majavah: toolforge: toolviews: Move nginx-specific parts to nginx profile [puppet] - ''https://gerrit.wikimedia.org/r/1197242 (https://phabricator.wikimedia.org/T284558)'
2025-10-20 12:12:24 <wikibugs> ('PS2) ''Majavah: toolforge: toolviews: Add initial HAProxy support [puppet] - ''https://gerrit.wikimedia.org/r/1197243 (https://phabricator.wikimedia.org/T284558)'
2025-10-20 12:13:12 <logmsgbot> !log marostegui@cumin1003 dbctl commit (dc=all): 'db2247 (re)pooling @ 100%: Host provisioned T406551', diff saved to https://phabricator.wikimedia.org/P84128 and previous config saved to /var/cache/conftool/dbconfig/20251020-121311-root.json
2025-10-20 12:13:17 <stashbot> T406551: Productionize db224[5-8] - https://phabricator.wikimedia.org/T406551
2025-10-20 12:13:54 <wikibugs> ('CR) ''Majavah: [V:''+1] "PCC SUCCESS (CORE_DIFF 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/"; [puppet] - ''https://gerrit.wikimedia.org/r/1197243 (https://phabricator.wikimedia.org/T284558) (owner: ''Majavah)'
2025-10-20 12:14:33 <logmsgbot> !log ozge@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' .
2025-10-20 12:15:47 <wikibugs> ('CR) ''Majavah: [V:''+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7306/co"; [puppet] - ''https://gerrit.wikimedia.org/r/1197243 (https://phabricator.wikimedia.org/T284558) (owner: ''Majavah)'
2025-10-20 12:35:37 <wikibugs> ('CR) ''Slyngshede: [C:''+1] admin: remove legacy ssh key for jelto [puppet] - ''https://gerrit.wikimedia.org/r/1197233 (https://phabricator.wikimedia.org/T407606) (owner: ''Jelto)'
2025-10-20 12:36:23 <wikibugs> ('CR) ''Marostegui: "This is an interesting discussion, and I understand both sides" [puppet] - ''https://gerrit.wikimedia.org/r/1184544 (https://phabricator.wikimedia.org/T402859) (owner: ''Federico Ceratto)'
2025-10-20 12:41:33 <logmsgbot> !log fceratto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4 days, 0:00:00 on es2056.codfw.wmnet with reason: Setting up new ES host
2025-10-20 12:43:44 <jinxer-wm> FIRING: [2x] NodeTextfileStale: Stale textfile for wdqs2009:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
2025-10-20 12:43:46 <wikibugs> ('CR) ''Filippo Giunchedi: "LGTM, see also inline" [puppet] - ''https://gerrit.wikimedia.org/r/1197243 (https://phabricator.wikimedia.org/T284558) (owner: ''Majavah)'
2025-10-20 12:44:56 <wikibugs> ('CR) ''Filippo Giunchedi: [C:''+1] toolforge: toolviews: Move nginx-specific parts to nginx profile [puppet] - ''https://gerrit.wikimedia.org/r/1197242 (https://phabricator.wikimedia.org/T284558) (owner: ''Majavah)'
2025-10-20 12:50:40 <wikibugs> ('PS3) ''Majavah: toolforge: toolviews: Add initial HAProxy support [puppet] - ''https://gerrit.wikimedia.org/r/1197243 (https://phabricator.wikimedia.org/T284558)'
2025-10-20 12:50:53 <wikibugs> ('CR) ''Majavah: toolforge: toolviews: Add initial HAProxy support (''2 comments) [puppet] - ''https://gerrit.wikimedia.org/r/1197243 (https://phabricator.wikimedia.org/T284558) (owner: ''Majavah)'
2025-10-20 12:51:25 <wikibugs> ('CR) ''Majavah: [C:''+2] toolforge: toolviews: Move nginx-specific parts to nginx profile [puppet] - ''https://gerrit.wikimedia.org/r/1197242 (https://phabricator.wikimedia.org/T284558) (owner: ''Majavah)'
2025-10-20 12:51:46 <wikibugs> ('CR) ''Majavah: [V:''+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7307/co"; [puppet] - ''https://gerrit.wikimedia.org/r/1197243 (https://phabricator.wikimedia.org/T284558) (owner: ''Majavah)'
2025-10-20 12:51:53 <wikibugs> ('CR) ''Filippo Giunchedi: [C:''+1] toolforge: toolviews: Add initial HAProxy support [puppet] - ''https://gerrit.wikimedia.org/r/1197243 (https://phabricator.wikimedia.org/T284558) (owner: ''Majavah)'
2025-10-20 12:52:04 <wikibugs> ('CR) ''Majavah: [V:''+1 C:''+2] toolforge: toolviews: Add initial HAProxy support [puppet] - ''https://gerrit.wikimedia.org/r/1197243 (https://phabricator.wikimedia.org/T284558) (owner: ''Majavah)'
2025-10-20 12:52:18 <wikibugs> ('CR) ''Majavah: [C:''+2] admin: home: Add mux alias for taavi [puppet] - ''https://gerrit.wikimedia.org/r/1197240 (owner: ''Majavah)'
2025-10-20 12:52:34 <jinxer-wm> FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
2025-10-20 12:53:21 <wikibugs> ('CR) ''Kamila Součková: [C:''+1] "Thank you Effie!" [puppet] - ''https://gerrit.wikimedia.org/r/1197230 (https://phabricator.wikimedia.org/T407615) (owner: ''Effie Mouzeli)'
2025-10-20 12:55:34 <wikibugs> ('CR) ''Kamila Součková: "Not really needed given Ie8a088958116fd9db24c3c678540f3dc3ff65281 ." [puppet] - ''https://gerrit.wikimedia.org/r/1196954 (https://phabricator.wikimedia.org/T407615) (owner: ''Kamila Součková)'
2025-10-20 12:57:21 <wikibugs> ('CR) ''Jelto: [C:''+2] admin: remove legacy ssh key for jelto [puppet] - ''https://gerrit.wikimedia.org/r/1197233 (https://phabricator.wikimedia.org/T407606) (owner: ''Jelto)'
2025-10-20 12:57:36 <wikibugs> ('CR) ''Majavah: "q: Is there a risk of an ordering issue here where the MTU is not set at all? i.e. is it fine to not run the command, or should this have " [puppet] - ''https://gerrit.wikimedia.org/r/1197245 (https://phabricator.wikimedia.org/T405478) (owner: ''Filippo Giunchedi)'
2025-10-20 12:59:32 <wikibugs> 'ops-eqiad, ''SRE, ''DC-Ops, ''Infrastructure-Foundations: Eqiad C/D refresh: move asw2-c-eqiad CR uplinks to Nokia switches - https://phabricator.wikimedia.org/T405579#11289265 (''Jclark-ctr) After discussing this with @cmooney over IRC, I reviewed the moves on the Eqiad side and noted that we had one fr...'
2025-10-20 13:00:05 <jouncebot> Lucas_WMDE, Urbanecm, and TheresNoTime: How many deployers does it take to do UTC afternoon backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251020T1300).
2025-10-20 13:00:05 <jouncebot> edsanders, bpirkle, sergi0, seanleong-wmde, phuedx, and cormacparle: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
2025-10-20 13:00:07 <edsanders> o/
2025-10-20 13:00:09 <Lucas_WMDE> o/
2025-10-20 13:00:10 <edsanders> I can self deploy
2025-10-20 13:00:13 <cormacparle> o/
2025-10-20 13:00:26 <bpirkle> o/
2025-10-20 13:00:52 <Lucas_WMDE> edsanders: go ahead :)
2025-10-20 13:01:04 <cormacparle> erm ... my wikimedia debug extension says "unspecified backend"
2025-10-20 13:01:07 <Lucas_WMDE> (looks like Flow backport CI is pretty fast, so no need to put a config change ahead of it I think)
2025-10-20 13:01:08 <cormacparle> is this expected?
2025-10-20 13:01:24 <wikibugs> ('CR) ''TrainBranchBot: [C:''+2] "Approved by esanders@deploy2002 using scap backport" [extensions/Flow] (wmf/1.45.0-wmf.23) - ''https://gerrit.wikimedia.org/r/1196884 (https://phabricator.wikimedia.org/T407357) (owner: ''Esanders)'
2025-10-20 13:01:24 <logmsgbot> !log fceratto@cumin1003 START - Cookbook sre.mysql.clone_es of es2033.codfw.wmnet onto es2056.codfw.wmnet
2025-10-20 13:01:25 <Lucas_WMDE> cormacparle: are you on a WMF production domain?
2025-10-20 13:01:29 <logmsgbot> !log fceratto@cumin1003 START - Cookbook sre.mysql.depool es2033 - Depool es2033.codfw.wmnet to then clone it to es2056.codfw.wmnet - fceratto@cumin1003
2025-10-20 13:01:34 <Lucas_WMDE> (the dropdown contents change depending on which domain you’re on)
2025-10-20 13:01:48 <logmsgbot> !log fceratto@cumin1003 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) es2033 - Depool es2033.codfw.wmnet to then clone it to es2056.codfw.wmnet - fceratto@cumin1003
2025-10-20 13:02:05 <seanleong-wmde> o/
2025-10-20 13:02:21 <cormacparle> Lucas_WMDE: no, on beta
2025-10-20 13:02:30 <cormacparle> (which seems to be down :( )
2025-10-20 13:02:40 <cormacparle> it's just a beta config change I want to deploy
2025-10-20 13:02:56 <Lucas_WMDE> you can’t use WikimediaDebug on beta afaik
2025-10-20 13:03:09 <Lucas_WMDE> the config change will just be deployed, and ca. 10 minutes later you can check if it worked or not
2025-10-20 13:03:14 <cormacparle> aha ok grand
2025-10-20 13:03:21 <Lucas_WMDE> (beta WFM)
2025-10-20 13:04:03 <wikibugs> ('Merged) ''jenkins-bot: Follow-up I6698875: Set insert-ignore on all insert queries [extensions/Flow] (wmf/1.45.0-wmf.23) - ''https://gerrit.wikimedia.org/r/1196884 (https://phabricator.wikimedia.org/T407357) (owner: ''Esanders)'
2025-10-20 13:04:23 <wikibugs> ('CR) ''Filippo Giunchedi: "I'm not aware of ordering issues no, if the interface is down when interface::setting runs then mtu will be set the next time the interfac" [puppet] - ''https://gerrit.wikimedia.org/r/1197245 (https://phabricator.wikimedia.org/T405478) (owner: ''Filippo Giunchedi)'
2025-10-20 13:04:48 <logmsgbot> !log esanders@deploy2002 Started scap sync-world: Backport for [[gerrit:1196884|Follow-up I6698875: Set insert-ignore on all insert queries (T407357)]]
2025-10-20 13:04:48 <logmsgbot> fceratto@cumin1003 clone_es (PID 1381498) is awaiting input
2025-10-20 13:04:53 <stashbot> T407357: Ignore duplicate key errors when creating Flow posts from LQT - https://phabricator.wikimedia.org/T407357
2025-10-20 13:09:41 <wikibugs> ('CR) ''Lucas Werkmeister (WMDE): [C:''+1] "Can confirm that this is unused in wmf.23:" [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1192913 (https://phabricator.wikimedia.org/T396382) (owner: ''Sergio Gimeno)'
2025-10-20 13:10:06 <Lucas_WMDE> once the current deploy is done I think we can do the changes for bpirkle, sergi0 and cormacparle together
2025-10-20 13:10:15 <Lucas_WMDE> one actual change, one cleanup that should be a no-op, and one beta change
2025-10-20 13:10:26 <bpirkle> sounds good to me
2025-10-20 13:10:40 <cormacparle> 👍
2025-10-20 13:11:26 <wikibugs> ('CR) ''CI reject: [V:''-1] Localisation updates from https://translatewiki.net. [phabricator/translations] (wmf/stable) - ''https://gerrit.wikimedia.org/r/1197247 (owner: ''L10n-bot)'
2025-10-20 13:15:24 <Lucas_WMDE> scap is taking a while building those container images
2025-10-20 13:20:40 <Lucas_WMDE> “Waiting 300 seconds for swift after full mediawiki image build (T390251)”
2025-10-20 13:20:40 <stashbot> T390251: docker-registry.wikimedia.org keeps serving bad blobs - https://phabricator.wikimedia.org/T390251
2025-10-20 13:20:49 <Lucas_WMDE> (that was 13:19:12 UTC)
2025-10-20 13:20:53 <edsanders> yeah
2025-10-20 13:21:26 <Lucas_WMDE> not sure why it was a full image build, your backport doesn’t include i18n changes
2025-10-20 13:21:55 <Lucas_WMDE> maybe because it’s the first backport this week? there was an earlier window this morning but it seemingly only deployed config changes, maybe that’s different
2025-10-20 13:24:37 <edsanders> hmm - finished now at least
2025-10-20 13:28:11 <jinxer-wm> FIRING: [2x] SystemdUnitFailed: amd-k8s-node-labeller.service on ml-serve1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
2025-10-20 13:29:08 <wikibugs> 'ops-eqiad, ''SRE, ''DC-Ops, ''Infrastructure-Foundations: Eqiad C/D refresh: move asw2-c-eqiad CR uplinks to Nokia switches - https://phabricator.wikimedia.org/T405579#11289373 (''cmooney) >>! In T405579#11289265, @Jclark-ctr wrote: > After discussing this with @cmooney over IRC, I reviewed the moves on...'
2025-10-20 13:30:05 <logmsgbot> !log esanders@deploy2002 esanders: Backport for [[gerrit:1196884|Follow-up I6698875: Set insert-ignore on all insert queries (T407357)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
2025-10-20 13:30:09 <stashbot> T407357: Ignore duplicate key errors when creating Flow posts from LQT - https://phabricator.wikimedia.org/T407357
2025-10-20 13:30:26 <logmsgbot> !log esanders@deploy2002 esanders: Continuing with sync
2025-10-20 13:35:07 <jinxer-wm> FIRING: SwitchCoreInterfaceDown: Switch core interface down - ssw1-d1-eqiad:ethernet-1/31 (Transport: ssw1-f1-eqiad:et-0/0/29 (Equinix, 21996480) {#0107202f1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-d1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown
2025-10-20 13:37:49 <phuedx> Sorry I'm late
2025-10-20 13:37:56 <phuedx> o/
2025-10-20 13:38:08 <Lucas_WMDE> no worries, you didn’t miss anything yet ^^
2025-10-20 13:38:12 <Lucas_WMDE> we’re still in the first deployment
2025-10-20 13:38:33 <phuedx> That is both good and bad
2025-10-20 13:38:38 <Lucas_WMDE> (:
2025-10-20 13:39:00 <Lucas_WMDE> (kinda tempted to !bash that, with timestamps, ngl)
2025-10-20 13:39:30 <phuedx> reads the scrollback
2025-10-20 13:39:46 <phuedx> D:
2025-10-20 13:41:00 <seanleong-wmde> Lucas_WMDE Hii, the config changes will be at the last?
2025-10-20 13:41:21 <Lucas_WMDE> I was planning to do the config changes for bpirkle, sergi0 and cormacparle together next
2025-10-20 13:41:30 <Lucas_WMDE> and then yours and that by phuedx afterwards, not yet sure if together or separately
2025-10-20 13:42:01 <Lucas_WMDE> actually, is sergi0 around?
2025-10-20 13:42:25 <phuedx> Mine is a NOOP. It can be bundled
2025-10-20 13:43:05 <Lucas_WMDE> ok
2025-10-20 13:43:24 <logmsgbot> !log esanders@deploy2002 Finished scap sync-world: Backport for [[gerrit:1196884|Follow-up I6698875: Set insert-ignore on all insert queries (T407357)]] (duration: 38m 36s)
2025-10-20 13:43:28 <stashbot> T407357: Ignore duplicate key errors when creating Flow posts from LQT - https://phabricator.wikimedia.org/T407357
2025-10-20 13:44:10 <wikibugs> ('CR) ''TrainBranchBot: [C:''+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1196492 (https://phabricator.wikimedia.org/T389409) (owner: ''BPirkle)'
2025-10-20 13:44:11 <wikibugs> ('CR) ''TrainBranchBot: [C:''+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1192913 (https://phabricator.wikimedia.org/T396382) (owner: ''Sergio Gimeno)'
2025-10-20 13:44:11 <wikibugs> ('CR) ''TrainBranchBot: [C:''+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1196857 (https://phabricator.wikimedia.org/T406332) (owner: ''Phuedx)'
2025-10-20 13:44:12 <wikibugs> ('CR) ''TrainBranchBot: [C:''+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1196703 (https://phabricator.wikimedia.org/T41510) (owner: ''Cparle)'
2025-10-20 13:45:59 <wikibugs> ('Merged) ''jenkins-bot: Enable REST Sandbox on all wikis [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1196492 (https://phabricator.wikimedia.org/T389409) (owner: ''BPirkle)'
2025-10-20 13:46:01 <wikibugs> ('Merged) ''jenkins-bot: Growth: remove no longer in use GENewcomerTasksStarterDifficultyEnabled [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1192913 (https://phabricator.wikimedia.org/T396382) (owner: ''Sergio Gimeno)'
2025-10-20 13:46:26 <wikibugs> ('Merged) ''jenkins-bot: MetricsPlatform: Initialize $wgMetricsPlatformExperimentStreamNames [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1196857 (https://phabricator.wikimedia.org/T406332) (owner: ''Phuedx)'
2025-10-20 13:46:28 <wikibugs> ('Merged) ''jenkins-bot: Enable Special:EditWatchlist pagination on beta [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1196703 (https://phabricator.wikimedia.org/T41510) (owner: ''Cparle)'
2025-10-20 13:46:46 <logmsgbot> !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1196492|Enable REST Sandbox on all wikis (T389409)]], [[gerrit:1192913|Growth: remove no longer in use GENewcomerTasksStarterDifficultyEnabled (T396382)]], [[gerrit:1196857|MetricsPlatform: Initialize $wgMetricsPlatformExperimentStreamNames (T406332)]], [[gerrit:1196703|Enable Special:EditWatchlist pagination on beta (T41510)]]
2025-10-20 13:46:56 <stashbot> T389409: Release REST API Sandbox on all remaining wikis - https://phabricator.wikimedia.org/T389409
2025-10-20 13:46:56 <stashbot> T396382: Deployment Plan: Allow limiting "Add a Link" to new editors - https://phabricator.wikimedia.org/T396382
2025-10-20 13:46:57 <stashbot> T406332: Make XLAB_STREAMS allowlist configurable - https://phabricator.wikimedia.org/T406332
2025-10-20 13:46:57 <stashbot> T41510: Opening Special:EditWatchlist with a large watchlist hits server timeout (Create watchlist pager) - https://phabricator.wikimedia.org/T41510
2025-10-20 13:49:16 <wikibugs> ('PS8) ''Federico Ceratto: clone_es.py: clone readonly es* hosts [cookbooks] - ''https://gerrit.wikimedia.org/r/1183646'
2025-10-20 13:49:33 <wikibugs> ('CR) ''Federico Ceratto: "(see comments)" [cookbooks] - ''https://gerrit.wikimedia.org/r/1183646 (owner: ''Federico Ceratto)'
2025-10-20 13:51:18 <logmsgbot> !log lucaswerkmeister-wmde@deploy2002 sgimeno, bpirkle, phuedx, lucaswerkmeister-wmde, cparle: Backport for [[gerrit:1196492|Enable REST Sandbox on all wikis (T389409)]], [[gerrit:1192913|Growth: remove no longer in use GENewcomerTasksStarterDifficultyEnabled (T396382)]], [[gerrit:1196857|MetricsPlatform: Initialize $wgMetricsPlatformExperimentStreamNames (T406332)]], [[gerrit:1196703|Enable Special:EditWatchlist pagination on beta (T41510)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
2025-10-20 13:51:40 <Lucas_WMDE> sergi0, bpirkle, phuedx: please test :)
2025-10-20 13:52:39 <bpirkle> Mine looks good, thank you!
2025-10-20 13:54:30 <phuedx> Lucas_WMDE: LGTM. As I said, it's a NOP. I did take a moment to confirm the name though :)
2025-10-20 13:54:39 <Lucas_WMDE> ok :)
2025-10-20 13:54:59 <topranks> !log enable 2x40G lag from asw2-c-eqiad to ssw1-dX-eqiad T405579
2025-10-20 13:55:01 <Lucas_WMDE> sergi0’s should be a no-op as well but i wouldn’t mind if he could confirm it ^^
2025-10-20 13:55:02 <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2025-10-20 13:55:03 <stashbot> T405579: Eqiad C/D refresh: move asw2-c-eqiad CR uplinks to Nokia switches - https://phabricator.wikimedia.org/T405579
2025-10-20 13:55:09 <Lucas_WMDE> but otherwise I’ll just click the “yes” button in a moment
2025-10-20 13:55:42 <wikibugs> ('PS1) ''Majavah: toolforge: toolviews: Fix parsing HAProxy logs [puppet] - ''https://gerrit.wikimedia.org/r/1197270 (https://phabricator.wikimedia.org/T284558)'
2025-10-20 13:56:04 <logmsgbot> !log lucaswerkmeister-wmde@deploy2002 sgimeno, bpirkle, phuedx, lucaswerkmeister-wmde, cparle: Continuing with sync
2025-10-20 13:56:08 <seanleong-wmde> Lucas_WMDE, is there still space for this backport to revert the qual and ref change?
2025-10-20 13:56:25 <Lucas_WMDE> there’s always chance to deploy reverts that fix UBNs ;)
2025-10-20 13:56:27 <Lucas_WMDE> jouncebot: next
2025-10-20 13:56:27 <jouncebot> In 0 hour(s) and 33 minute(s): xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251020T1430)
2025-10-20 13:56:35 <Lucas_WMDE> and there’s a half-hour break before the next window, so sure
2025-10-20 13:56:40 <cormacparle> beta is still down so I can't test anything :/
2025-10-20 13:56:58 <Lucas_WMDE> it’s still working for me
2025-10-20 13:57:09 <Lucas_WMDE> what does “down” look like?
2025-10-20 13:57:30 <cormacparle> https://usercontent.irccloud-cdn.com/file/r2fCFMVm/image.png
2025-10-20 13:57:48 <wikibugs> ('CR) ''Majavah: [C:''+2] toolforge: toolviews: Fix parsing HAProxy logs [puppet] - ''https://gerrit.wikimedia.org/r/1197270 (https://phabricator.wikimedia.org/T284558) (owner: ''Majavah)'
2025-10-20 13:57:53 <cdanis> cormacparle: your IP might be blocked from beta heh
2025-10-20 13:57:55 <Lucas_WMDE> please see the bottom of the screen
2025-10-20 13:58:06 <Lucas_WMDE> (not included in the screenshot but I’m making an educated guess at what might be there :P)
2025-10-20 13:58:22 <wikibugs> ('PS1) ''Brouberol: deployment_server: create kubeconfigs to deploy postgresql-growthbook [puppet] - ''https://gerrit.wikimedia.org/r/1197271 (https://phabricator.wikimedia.org/T406578)'
2025-10-20 13:58:25 <Lucas_WMDE> (what cdanis said)
2025-10-20 13:58:28 <wikibugs> ('PS1) ''Brouberol: cloudnative-pg-operator: watch the growthbook namespace [deployment-charts] - ''https://gerrit.wikimedia.org/r/1197272 (https://phabricator.wikimedia.org/T406578)'
2025-10-20 13:58:30 <wikibugs> ('PS1) ''Brouberol: Deploy a postgresql-growthbook cluster in dse-k8s-eqiad [deployment-charts] - ''https://gerrit.wikimedia.org/r/1197273 (https://phabricator.wikimedia.org/T406578)'
2025-10-20 13:58:46 <cormacparle> Error: 403, Requests from your IP have been blocked, please see https://wikitech.wikimedia.org/wiki/Beta/Blocked for more information. at Mon, 20 Oct 2025 13:57:53 GMT
2025-10-20 13:58:48 <cormacparle> hah!
2025-10-20 13:58:51 <cormacparle> ok
2025-10-20 13:59:02 <Lucas_WMDE> yeah, that :)
2025-10-20 13:59:04 <_joe_> cormacparle: you naughty boy what did you do with beta to get banned?
2025-10-20 13:59:04 <cormacparle> how do I get unblocked?
2025-10-20 13:59:13 <taavi> I would start from that link :-)
2025-10-20 13:59:14 <cormacparle> looks innocent
2025-10-20 14:00:25 <seanleong-wmde> okay, doing the revert and scheduling it now Lucas_WMDE, thanks
2025-10-20 14:00:36 <Lucas_WMDE> alright, thanks!
2025-10-20 14:01:57 <wikibugs> ('CR) ''Lucas Werkmeister (WMDE): [C:''-1] Add feature flag for pilot wikis about visual changes coming from Wikibase having an icon. (''1 comment) [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1193703 (https://phabricator.wikimedia.org/T397258) (owner: ''Seanleong-wmde)'
2025-10-20 14:02:31 <logmsgbot> !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1196492|Enable REST Sandbox on all wikis (T389409)]], [[gerrit:1192913|Growth: remove no longer in use GENewcomerTasksStarterDifficultyEnabled (T396382)]], [[gerrit:1196857|MetricsPlatform: Initialize $wgMetricsPlatformExperimentStreamNames (T406332)]], [[gerrit:1196703|Enable Special:EditWatchlist pagination on beta (T41510)]] (duration: 15m 45s)
2025-10-20 14:02:40 <stashbot> T389409: Release REST API Sandbox on all remaining wikis - https://phabricator.wikimedia.org/T389409
2025-10-20 14:02:40 <stashbot> T396382: Deployment Plan: Allow limiting "Add a Link" to new editors - https://phabricator.wikimedia.org/T396382
2025-10-20 14:02:41 <stashbot> T406332: Make XLAB_STREAMS allowlist configurable - https://phabricator.wikimedia.org/T406332
2025-10-20 14:02:41 <stashbot> T41510: Opening Special:EditWatchlist with a large watchlist hits server timeout (Create watchlist pager) - https://phabricator.wikimedia.org/T41510
2025-10-20 14:04:16 <Lucas_WMDE> (backport+config window is still open, waiting to deploy a Wikibase revert)
2025-10-20 14:04:44 <seanleong-wmde> np! Lucas_WMDE, regarding the feature flag for visual changes, the patch is here https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Wikibase/+/1196896 and will be +2 today for the train tmr. Is it possible to deploy the config change now or do you prefer it tmr?
2025-10-20 14:05:20 <seanleong-wmde> Currently waiting
2025-10-20 14:05:20 <bpirkle> Thank you @Lucas_WMDE
2025-10-20 14:05:43 <Lucas_WMDE> seanleong-wmde: config changes should only be deployed once the code using the config has rolled out with the train
2025-10-20 14:05:51 <seanleong-wmde> > (backport+config window is still open, waiting to deploy a Wikibase revert)
2025-10-20 14:05:51 <seanleong-wmde> Currently waiting* for the tests to pass and will be on its way to backport
2025-10-20 14:05:56 <Lucas_WMDE> so that any potential issues can be checked when the config change is deployed, and not when the train rolls out
2025-10-20 14:06:01 <wikibugs> 'ops-eqiad, ''SRE, ''DC-Ops: aqs1012 is down - https://phabricator.wikimedia.org/T407414#11289616 (''Eevans) >>! In T407414#11285096, @Jclark-ctr wrote: > @Eevans are you able to reimage the server i have had no luck due to no root partition error. and preseed file has -efi for raid configuration for a s...'
2025-10-20 14:06:19 <Lucas_WMDE> seanleong-wmde: I would cherry-pick https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Wikibase/+/1197274 to the wmf branch and +2 it immediately
2025-10-20 14:06:33 <Lucas_WMDE> (it’ll still have to go through CI there and that will take long enough anyway. no need to wait for that on the master branch imho)
2025-10-20 14:07:44 <wikibugs> 'ops-eqiad, ''SRE, ''DC-Ops: aqs1012 is down - https://phabricator.wikimedia.org/T407414#11289657 (''Eevans) >>! In T407414#11289616, @Eevans wrote: >>>! In T407414#11285096, @Jclark-ctr wrote: >> @Eevans are you able to reimage the server i have had no luck due to no root partition error. and preseed fi...'
2025-10-20 14:09:05 <vgutierrez> !log cleaning up IPVS leftovers from HTTPS migration of wdqs-internal services - T193473
2025-10-20 14:09:10 <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2025-10-20 14:09:10 <stashbot> T193473: Add HTTPS support to wdqs-internal service - https://phabricator.wikimedia.org/T193473
2025-10-20 14:10:03 <seanleong-wmde> got it, for the config we will schedule another backport afterwards, for the cherry pick Lucas_WMDE, to this branch wmf/1.45.0-wmf.23?
2025-10-20 14:10:22 <Lucas_WMDE> yes
2025-10-20 14:10:34 <hnowlan> jouncebot: nowandnext
2025-10-20 14:10:34 <jouncebot> No deployments scheduled for the next 0 hour(s) and 19 minute(s)
2025-10-20 14:10:34 <jouncebot> In 0 hour(s) and 19 minute(s): xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251020T1430)
2025-10-20 14:10:43 <Lucas_WMDE> hnowlan: I’m about to deploy a Wikibase revert
2025-10-20 14:10:53 <wikibugs> ('PS1) ''Neslihan Turan: Revert "Implement new usage types for statement with qualifiers and references" [extensions/Wikibase] (wmf/1.45.0-wmf.23) - ''https://gerrit.wikimedia.org/r/1197276 (https://phabricator.wikimedia.org/T401290)'
2025-10-20 14:10:56 <hnowlan> Lucas_WMDE: ack, no worries
2025-10-20 14:11:15 <wikibugs> ('CR) ''Hnowlan: [C:''+1] "I think this looks good to go. Let me know when you'd like to try the rollout." [deployment-charts] - ''https://gerrit.wikimedia.org/r/1189447 (https://phabricator.wikimedia.org/T405574) (owner: ''Daniel Kinzler)'
2025-10-20 14:11:22 <Lucas_WMDE> let’s try it
2025-10-20 14:11:31 <wikibugs> ('CR) ''TrainBranchBot: [C:''+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [extensions/Wikibase] (wmf/1.45.0-wmf.23) - ''https://gerrit.wikimedia.org/r/1197276 (https://phabricator.wikimedia.org/T401290) (owner: ''Neslihan Turan)'
2025-10-20 14:13:09 <icinga-wm> RECOVERY - PyBal IPVS diff check on lvs2014 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
2025-10-20 14:13:27 <icinga-wm> RECOVERY - PyBal IPVS diff check on lvs1020 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
2025-10-20 14:13:43 <icinga-wm> RECOVERY - PyBal IPVS diff check on lvs1019 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
2025-10-20 14:16:09 <seanleong-wmde> Lucas_WMDE I can't add more patch into this timeslot https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Wikibase/+/1197276
2025-10-20 14:16:39 <Lucas_WMDE> https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251020T1300
2025-10-20 14:16:44 <Lucas_WMDE> you can edit the wiki page manually
2025-10-20 14:16:57 <Lucas_WMDE> (sorry, those messages were supposed to be the other way around but my IRC client eated them)
2025-10-20 14:17:44 <jinxer-wm> FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
2025-10-20 14:18:25 <wikibugs> 'ops-eqiad, ''SRE, ''DC-Ops: eqiad row C/D DC Ops host migrations - https://phabricator.wikimedia.org/T405021#11289791 (''Jclark-ctr) T405560 2 servers were racked previously on this ticket and are cabled to nokia switches'
2025-10-20 14:21:29 <seanleong-wmde> hahaha no worries, added now, thanks
2025-10-20 14:21:37 <seanleong-wmde> Lucas_WMDE o7
2025-10-20 14:21:43 <Lucas_WMDE> nice, thanks!
2025-10-20 14:22:03 <wikibugs> ('CR) ''Krinkle: [C:''+1] Add virtual domain mapping for OAuth (''1 comment) [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1196441 (https://phabricator.wikimedia.org/T348485) (owner: ''D3r1ck01)'
2025-10-20 14:27:29 <wikibugs> ('Merged) ''jenkins-bot: Revert "Implement new usage types for statement with qualifiers and references" [extensions/Wikibase] (wmf/1.45.0-wmf.23) - ''https://gerrit.wikimedia.org/r/1197276 (https://phabricator.wikimedia.org/T401290) (owner: ''Neslihan Turan)'
2025-10-20 14:27:50 <logmsgbot> !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1197276|Revert "Implement new usage types for statement with qualifiers and references" (T401290 T407684 T407744)]]
2025-10-20 14:27:57 <stashbot> T401290: Implement new usage types for qualifiers and references - https://phabricator.wikimedia.org/T401290
2025-10-20 14:27:58 <stashbot> T407684: Lua's ipairs() function can no longer iterate over Wikidata references - https://phabricator.wikimedia.org/T407684
2025-10-20 14:27:58 <stashbot> T407744: Wikibase\DataModel\Entity\EntityIdParsingException: The serialization "Q42902012 " is not recognized by the configured id builders - https://phabricator.wikimedia.org/T407744
2025-10-20 14:28:43 <Lucas_WMDE> let’s see how it goes
2025-10-20 14:30:05 <jouncebot> Deploy window xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251020T1430)
2025-10-20 14:31:17 <wikibugs> 'ops-codfw, ''SRE, ''DC-Ops, ''Infrastructure-Foundations, ''Essential-Work: Reimage failed after prompt...is prompt needed? - https://phabricator.wikimedia.org/T406656#11289937 (''bking) {F66767261} Thanks Luca, I'm learning a lot about the process. A few more questions. > If you are reimaging a node...'
2025-10-20 14:31:24 <Lucas_WMDE> I’m still deploying, sorry xLab’ers
2025-10-20 14:31:56 <logmsgbot> !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde, neslihanturan: Backport for [[gerrit:1197276|Revert "Implement new usage types for statement with qualifiers and references" (T401290 T407684 T407744)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
2025-10-20 14:32:07 <Lucas_WMDE> seanleong-wmde: please test!
2025-10-20 14:32:10 <Lucas_WMDE> also looks
2025-10-20 14:32:30 <Lucas_WMDE> https://fi.wikipedia.org/wiki/Vantaa looks okay again on WikimediaDebug, phew
2025-10-20 14:33:02 <seanleong-wmde> In the meantime, Lucas_WMDE, our config change for visual change will be on hewiki, cawiki (group1), ukwiki (group2), in this case can we schedule the config deployment on this Thursday?
2025-10-20 14:33:03 <Lucas_WMDE> (it’s that place what where lentokenttä is!)
2025-10-20 14:33:10 <seanleong-wmde> Lucas_WMDE testing now
2025-10-20 14:33:56 <Lucas_WMDE> https://no.wikipedia.org/wiki/Roberta_Williams also has four references on WikimediaDebug
2025-10-20 14:34:16 <Lucas_WMDE> hm, nevermind, it also has four references without it (even after purging)
2025-10-20 14:34:35 <Lucas_WMDE> ah, they worked around it https://phabricator.wikimedia.org/T407684#11287349
2025-10-20 14:35:17 <Lucas_WMDE> okay, with those instructions I can see a difference between WikimediaDebug and normal
2025-10-20 14:35:49 <seanleong-wmde> yea they change it from ipairs to pairs
2025-10-20 14:35:56 <seanleong-wmde> but it's working fine now
2025-10-20 14:36:08 <Lucas_WMDE> looks like it
2025-10-20 14:36:15 <Lucas_WMDE> okay to continue? or do you want to test anything else?
2025-10-20 14:36:16 <seanleong-wmde> I think it's due to the schema change of the table masking
2025-10-20 14:36:27 <seanleong-wmde> okay to continue
2025-10-20 14:36:31 <logmsgbot> !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde, neslihanturan: Continuing with sync
2025-10-20 14:36:49 <seanleong-wmde> thanks for helping with the tests as well Lucas_WMDE o/
2025-10-20 14:36:52 <Lucas_WMDE> and about the config change, I think it would be okay to do it on Wednesday (you just wouldn’t be able to test it on ukwiki then)
2025-10-20 14:37:03 <seanleong-wmde> got it
2025-10-20 14:37:28 <Lucas_WMDE> also depends on whether the train happens at 10:00 or 20:00 CEST this week, I guess
2025-10-20 14:37:32 <Lucas_WMDE> I never know how to tell
2025-10-20 14:37:38 <Lucas_WMDE> both windows are in the deployment calendar and idk which is the “real” one
2025-10-20 14:38:43 <seanleong-wmde> got it! we will schedule it accordingly this week
2025-10-20 14:38:46 <wikibugs> ('PS1) ''Dreamy Jazz: Define CheckUser SuggestedInvestigations event stream [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1197278 (https://phabricator.wikimedia.org/T404177)'
2025-10-20 14:38:47 <Lucas_WMDE> o_O scap died, what
2025-10-20 14:38:53 <Lucas_WMDE> canary checks failed
2025-10-20 14:38:59 <Lucas_WMDE> retrying them…
2025-10-20 14:39:19 <Lucas_WMDE> oh no
2025-10-20 14:39:34 <Lucas_WMDE> Top 1 errors: InvalidArgumentException: $aspect must use one of the XXX_USAGE constants, "CQR" given
2025-10-20 14:39:56 <Lucas_WMDE> that’s bad news
2025-10-20 14:40:23 <seanleong-wmde> yea, because we introduced a new aspect to the DB
2025-10-20 14:40:26 <seanleong-wmde> oh no
2025-10-20 14:40:30 <wikibugs> ('PS2) ''Dreamy Jazz: Define CheckUser Suggested Investigations event stream [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1197278 (https://phabricator.wikimedia.org/T404177)'
2025-10-20 14:40:36 <Lucas_WMDE> oh, to the database!
2025-10-20 14:40:39 <Lucas_WMDE> ah fuck
2025-10-20 14:41:00 <seanleong-wmde> yea C is further granularized to C and CQR
2025-10-20 14:41:20 <Lucas_WMDE> oh god that’s already 1070 hits in logstash
2025-10-20 14:41:33 <Lucas_WMDE> across all sorts of wikis
2025-10-20 14:41:41 <Lucas_WMDE> shit
2025-10-20 14:41:45 <Lucas_WMDE> I don’t think we can deploy that then
2025-10-20 14:42:05 <seanleong-wmde> can we stop the deployment now?
2025-10-20 14:42:15 <jinxer-wm> FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
2025-10-20 14:42:37 <wikibugs> ('PS1) ''Lucas Werkmeister (WMDE): Restore "Implement new usage types for statement with qualifiers and references" [extensions/Wikibase] (wmf/1.45.0-wmf.23) - ''https://gerrit.wikimedia.org/r/1197281'
2025-10-20 14:42:37 <seanleong-wmde> we can only fix the patch now
2025-10-20 14:42:55 <seanleong-wmde> reverting will only work if we retouch all the affected pages
2025-10-20 14:42:57 <wikibugs> ('CR) ''LSobanski: "Approved in the IF meeting." [puppet] - ''https://gerrit.wikimedia.org/r/1196090 (https://phabricator.wikimedia.org/T402511) (owner: ''Cathal Mooney)'
2025-10-20 14:43:04 <wikibugs> ('PS2) ''Lucas Werkmeister (WMDE): Restore "Implement new usage types for statement with qualifiers and references" [extensions/Wikibase] (wmf/1.45.0-wmf.23) - ''https://gerrit.wikimedia.org/r/1197281 (https://phabricator.wikimedia.org/T401290)'
2025-10-20 14:43:16 <wikibugs> ('CR) ''TrainBranchBot: [C:''+2] "Copied votes on follow-up patch sets have been updated:" [extensions/Wikibase] (wmf/1.45.0-wmf.23) - ''https://gerrit.wikimedia.org/r/1197281 (https://phabricator.wikimedia.org/T401290) (owner: ''Lucas Werkmeister (WMDE))'
2025-10-20 14:43:19 <seanleong-wmde> sorry Lucas_WMDE
2025-10-20 14:43:42 <Lucas_WMDE> I’m reverting the revert
2025-10-20 14:43:46 <wikibugs> ('CR) ''TrainBranchBot: "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [extensions/Wikibase] (wmf/1.45.0-wmf.23) - ''https://gerrit.wikimedia.org/r/1197281 (https://phabricator.wikimedia.org/T401290) (owner: ''Lucas Werkmeister (WMDE))'
2025-10-20 14:43:56 <Lucas_WMDE> because right now the revert is still sitting on the canary servers, soaking up user traffic and causing errors
2025-10-20 14:44:05 <Lucas_WMDE> so that’s my top priority right now
2025-10-20 14:44:08 <seanleong-wmde> okay
2025-10-20 14:44:22 <Lucas_WMDE> meanwhile, please try to put together a version of the revert that won’t have this InvalidArgumentException
2025-10-20 14:44:54 <Lucas_WMDE> probably still most of the revert code, but some code that reads the usage from the DB, whenever it sees "CQR", just, idk, ignore it or something
2025-10-20 14:45:00 <Lucas_WMDE> and then we can try rolling that out
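A minimal sketch of the "ignore it or something" idea suggested above, written in Python for illustration only (the real code is PHP inside Wikibase). The names `KNOWN_ASPECTS` and `read_usages`, the set of aspect codes, and the row shape are all hypothetical assumptions, not the actual Wikibase API. The point is only that a reader which skips unrecognized aspects such as "CQR" does not raise on rows that a newer version wrote to the database.

```python
# Hypothetical set of aspect codes the reverted code understands.
# The real constants are the XXX_USAGE constants in Wikibase (PHP).
KNOWN_ASPECTS = {"S", "L", "D", "T", "C", "O", "X"}


def read_usages(rows, known=KNOWN_ASPECTS):
    """Split usage rows into (kept, skipped).

    Each row is assumed to be an (entity_id, aspect_key) pair, where
    aspect_key is an aspect code optionally followed by ".modifier",
    e.g. "C.P31" or "L.en". Rows with an unknown aspect are skipped
    instead of raising an exception.
    """
    kept, skipped = [], []
    for entity_id, aspect_key in rows:
        # Everything before the first "." is the aspect code.
        aspect = aspect_key.partition(".")[0]
        if aspect in known:
            kept.append((entity_id, aspect_key))
        else:
            # Tolerate rows written by newer code (e.g. "CQR").
            skipped.append((entity_id, aspect_key))
    return kept, skipped


rows = [("Q42", "C.P31"), ("Q42", "CQR.P31"), ("Q64", "L.en")]
kept, skipped = read_usages(rows)
```

Returning the skipped rows separately, rather than dropping them silently, would let the caller log how many unknown aspects were encountered; whether the real fix ignores or remaps them is a decision for the patch authors.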
2025-10-20 14:45:38 <Lucas_WMDE> or "retouch all the affected pages" as you said
2025-10-20 14:45:48 <Lucas_WMDE> but I’m skeptical that that’s realistic
2025-10-20 14:45:49 <seanleong-wmde> retouch is probably not possible
2025-10-20 14:45:54 <Lucas_WMDE> seemed to affect a lot of pages looking at logstash
2025-10-20 14:45:54 <Lucas_WMDE> yeah
2025-10-20 14:45:55 <seanleong-wmde> will do the first suggestion
2025-10-20 14:46:01 <Lucas_WMDE> thank you
2025-10-20 14:46:04 <seanleong-wmde> then patch it back asap afterwards
2025-10-20 14:46:11 <seanleong-wmde> sorry for the inconvenience
2025-10-20 14:46:21 <wikibugs> ('CR) ''Lucas Werkmeister (WMDE): [V:''+2 C:''+2] "skipping gate-and-submit" [extensions/Wikibase] (wmf/1.45.0-wmf.23) - ''https://gerrit.wikimedia.org/r/1197281 (https://phabricator.wikimedia.org/T401290) (owner: ''Lucas Werkmeister (WMDE))'
2025-10-20 14:46:33 <wikibugs> ('PS1) ''Majavah: toolforge: toolviews: Ignore requests for *.svc.toolforge.org [puppet] - ''https://gerrit.wikimedia.org/r/1197283'
2025-10-20 14:46:49 <logmsgbot> !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1197281|Restore "Implement new usage types for statement with qualifiers and references" (T401290 T407684 T407744)]]
2025-10-20 14:46:57 <stashbot> T401290: Implement new usage types for qualifiers and references - https://phabricator.wikimedia.org/T401290
2025-10-20 14:46:57 <stashbot> T407684: Lua's ipairs() function can no longer iterate over Wikidata references - https://phabricator.wikimedia.org/T407684
2025-10-20 14:46:57 <stashbot> T407744: Wikibase\DataModel\Entity\EntityIdParsingException: The serialization "Q42902012 " is not recognized by the configured id builders - https://phabricator.wikimedia.org/T407744
2025-10-20 14:47:07 <wikibugs> ('PS1) ''Scott French: hieradata: enable analytics-web listener in mediawiki [puppet] - ''https://gerrit.wikimedia.org/r/1196733 (https://phabricator.wikimedia.org/T309738)'
2025-10-20 14:47:09 <wikibugs> ('PS1) ''Scott French: hieradata: allow access to analytics-web from wikikube [puppet] - ''https://gerrit.wikimedia.org/r/1196734 (https://phabricator.wikimedia.org/T309738)'
2025-10-20 14:47:10 <wikibugs> ('PS1) ''Scott French: mw-*: update network policy for access to analytics-web [deployment-charts] - ''https://gerrit.wikimedia.org/r/1196735 (https://phabricator.wikimedia.org/T309738)'
2025-10-20 14:47:15 <jinxer-wm> FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
2025-10-20 14:47:23 <Lucas_WMDE> that ^ *might* be me
2025-10-20 14:47:34 <Lucas_WMDE> *looks at logstash*
2025-10-20 14:47:36 <Lucas_WMDE> oh god oh fuck
2025-10-20 14:47:37 <Lucas_WMDE> yeah definitely
2025-10-20 14:47:41 <Lucas_WMDE> fix is already rolling out
2025-10-20 14:47:51 <Lucas_WMDE> at https://spiderpig.wikimedia.org/jobs/776
2025-10-20 14:48:02 <Lucas_WMDE> what a day
2025-10-20 14:48:33 <Lucas_WMDE> why does scap not have an option “yes, the canary servers were correct, this code should be immediately undeployed, please roll back to the previous replicaset of the deployment”
2025-10-20 14:48:34 <wikibugs> ('CR) ''CDanis: [C:''+1] multirootca: add the client auth usage to the dse_k8s discovery issuer profile [puppet] - ''https://gerrit.wikimedia.org/r/1196920 (https://phabricator.wikimedia.org/T406876) (owner: ''Brouberol)'
2025-10-20 14:49:18 <Lucas_WMDE> that logstash volume is *just from the canary servers*
2025-10-20 14:49:21 <Lucas_WMDE> (I think)
2025-10-20 14:49:39 <Lucas_WMDE> yeah the Top Hosts table all says mw-api-int.codfw.canary-[hex]
2025-10-20 14:50:13 <taavi> because that wasn't possible pre-mw-on-k8s ('previous' deployment was not a thing then), and I guess no-one implemented an easy option for that afterwards
2025-10-20 14:50:24 <hnowlan> yeah looks like it
2025-10-20 14:50:37 <ihurbain> achievement unlocked: make logstash alert on quantity with only canary logs :P
2025-10-20 14:50:42 <ihurbain> (congratulations.)
2025-10-20 14:50:47 <Lucas_WMDE> /o\
2025-10-20 14:50:55 <Lucas_WMDE> :blobfoxnotlikethis:
2025-10-20 14:50:59 <hnowlan> interesting that it's across everything (-web, -api-ext, -api, even -jobrunner)
2025-10-20 14:51:02 <Lucas_WMDE> I can haz sticker?
2025-10-20 14:51:05 <logmsgbot> !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde: Backport for [[gerrit:1197281|Restore "Implement new usage types for statement with qualifiers and references" (T401290 T407684 T407744)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
2025-10-20 14:51:14 <Lucas_WMDE> just waiting for the testservers check
2025-10-20 14:51:25 <logmsgbot> !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde: Continuing with sync
2025-10-20 14:51:25 <Lucas_WMDE> I’m not manually testing this, I’ll just
2025-10-20 14:51:27 <Lucas_WMDE> trust the revert
2025-10-20 14:51:35 <ihurbain> gives a sticker and a :pat: :pat: to Lucas_WMDE
2025-10-20 14:51:55 <Lucas_WMDE> curious what the canaries will say now
2025-10-20 14:53:08 <Lucas_WMDE> they were happy!
2025-10-20 14:53:16 <Lucas_WMDE> sync-prod-k8s is running
2025-10-20 14:53:35 <Lucas_WMDE> “Counted 0 error(s) in the last 20 seconds.” X doubt
2025-10-20 14:53:39 <Lucas_WMDE> (I guess it means 0 *new* errors ^^)
2025-10-20 14:53:44 <wikibugs> ('PS1) ''Esanders: Follow-up Iedb6361: Set insert-ignore on all insertSelect queries [extensions/Flow] (wmf/1.45.0-wmf.23) - ''https://gerrit.wikimedia.org/r/1197284 (https://phabricator.wikimedia.org/T407357)'
2025-10-20 14:54:34 <Lucas_WMDE> (https://spiderpig.wikimedia.org/jobs/775 is an interesting scap crash btw, I’ll report that later)
2025-10-20 14:54:41 <wikibugs> ('CR) ''ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, October 20 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [extensions/Flow] (wmf/1.45.0-wmf.23) - ''https://gerrit.wikimedia.org/r/1197284 (https://phabricator.wikimedia.org/T407357) (owner: ''Esanders)'
2025-10-20 14:54:56 <Lucas_WMDE> volume appears to be going down again
2025-10-20 14:55:31 <logmsgbot> !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1197281|Restore "Implement new usage types for statement with qualifiers and references" (T401290 T407684 T407744)]] (duration: 08m 43s)
2025-10-20 14:55:40 <stashbot> T401290: Implement new usage types for qualifiers and references - https://phabricator.wikimedia.org/T401290
2025-10-20 14:55:40 <stashbot> T407684: Lua's ipairs() function can no longer iterate over Wikidata references - https://phabricator.wikimedia.org/T407684
2025-10-20 14:55:40 <stashbot> T407744: Wikibase\DataModel\Entity\EntityIdParsingException: The serialization "Q42902012 " is not recognized by the configured id builders - https://phabricator.wikimedia.org/T407744
2025-10-20 14:55:47 <Lucas_WMDE> right.
2025-10-20 14:55:50 <Lucas_WMDE> looks at alerts
2025-10-20 14:56:08 <Lucas_WMDE> does not understand the alerts website
2025-10-20 14:56:42 <Lucas_WMDE> https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate is empty, but jinxer-wm didn’t say anything about it resolving yet…
2025-10-20 14:57:15 <jinxer-wm> RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
2025-10-20 14:57:19 <Lucas_WMDE> yay
2025-10-20 14:57:33 <Lucas_WMDE> jouncebot: nowandnext
2025-10-20 14:57:33 <jouncebot> For the next 0 hour(s) and 2 minute(s): xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251020T1430)
2025-10-20 14:57:33 <jouncebot> In 0 hour(s) and 32 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251020T1530)
2025-10-20 14:57:59 <Lucas_WMDE> so. ideally we’d still deploy a version of that revert which won’t cause a flood of production errors
2025-10-20 14:58:13 <Lucas_WMDE> but I don’t know how long it would take to put that version of the change together
2025-10-20 14:58:54 <wikibugs> 'SRE, ''SRE-Access-Requests: Enroll Jeltos YubiKey for production access - https://phabricator.wikimedia.org/T407606#11290070 (''Jelto) ''Open''Resolved p:''Triage''Medium My new FIDO ssh key was added and works and the old ssh key was removed. I'll resolve the task.'
2025-10-20 14:59:16 <wikibugs> 'SRE, ''SRE-swift-storage, ''Infrastructure-Foundations, ''Patch-For-Review: Key packages missing from trixie-wikimedia - https://phabricator.wikimedia.org/T407513#11290073 (''LSobanski) p:''Triage''Medium'
2025-10-20 15:03:19 <wikibugs> 'SRE, ''Infrastructure-Foundations, ''netops: Arelion 100G transport cr1-eqiad:et-1/1/2 <-> cr1-codfw:et-1/0/2 flapping on eqiad side [Oct 2025] - https://phabricator.wikimedia.org/T407578#11290097 (''cmooney) p:''Triage''Low a:''cmooney Gonna leave this a few days before closing, we've had a few fla...'
2025-10-20 15:03:31 <Lucas_WMDE> posted a summary at https://phabricator.wikimedia.org/T407684#11290101
2025-10-20 15:07:12 <jinxer-wm> FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
2025-10-20 15:08:18 <logmsgbot> !log jhancock@cumin1003 START - Cookbook sre.dns.netbox
2025-10-20 15:08:29 <jinxer-wm> FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
2025-10-20 15:11:39 <logmsgbot> !log jhancock@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding franio2004 to codfw - jhancock@cumin1003"
2025-10-20 15:11:44 <logmsgbot> !log jhancock@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding franio2004 to codfw - jhancock@cumin1003"
2025-10-20 15:11:44 <logmsgbot> !log jhancock@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
2025-10-20 15:11:55 <logmsgbot> !log bking@deploy2002 helmfile [staging] START helmfile.d/services/kartotherian: apply
2025-10-20 15:11:58 <seanleong-wmde> Lucas_WMDE Hi sry, got disconnected, we did a quick temp fix patch, pushing it now
2025-10-20 15:12:04 <seanleong-wmde> is it still possible?
2025-10-20 15:12:19 <Lucas_WMDE> I think so
2025-10-20 15:12:20 <Lucas_WMDE> jouncebot: nowandnext
2025-10-20 15:12:20 <jouncebot> No deployments scheduled for the next 0 hour(s) and 17 minute(s)
2025-10-20 15:12:21 <jouncebot> In 0 hour(s) and 17 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251020T1530)
2025-10-20 15:12:35 <Lucas_WMDE> I *think* mediawiki deploys don’t usually conflict with portals deploys
2025-10-20 15:13:06 <Lucas_WMDE> jan_drewniak: just checking, is it okay to deploy a MediaWiki backport (revert, hopefully fixes UBNs) even if it runs into the portals window?
2025-10-20 15:13:31 <wikibugs> 'ops-codfw, ''SRE, ''DC-Ops, ''fundraising-tech-ops: Q2:rack/setup/install franio2004 - https://phabricator.wikimedia.org/T405981#11290148 (''Jhancock.wm)'
2025-10-20 15:13:48 <Lucas_WMDE> seanleong-wmde: can you link the change? (I also left a comment at https://phabricator.wikimedia.org/T407684#11290101, idk if you saw that yet)
2025-10-20 15:14:30 <wikibugs> ('CR) ''Scott French: "Many thanks for the follow-up on the task, Balthazar. If I could have your review on this when you get a chance, that would be greatly app" [puppet] - ''https://gerrit.wikimedia.org/r/1196734 (https://phabricator.wikimedia.org/T309738) (owner: ''Scott French)'
2025-10-20 15:14:34 <jan_drewniak> Lucas_WMDE: yes, go ahead, I'm not planning a portal deployment this week
2025-10-20 15:14:40 <Lucas_WMDE> great, thank you :)
2025-10-20 15:15:07 <logmsgbot> !log bking@deploy2002 helmfile [staging] DONE helmfile.d/services/kartotherian: apply
2025-10-20 15:15:33 <wikibugs> ('CR) ''Btullis: "check experimental" [puppet] - ''https://gerrit.wikimedia.org/r/1197271 (https://phabricator.wikimedia.org/T406578) (owner: ''Brouberol)'
2025-10-20 15:16:20 <wikibugs> ('CR) ''Btullis: [C:''+1] cloudnative-pg-operator: watch the growthbook namespace [deployment-charts] - ''https://gerrit.wikimedia.org/r/1197272 (https://phabricator.wikimedia.org/T406578) (owner: ''Brouberol)'
2025-10-20 15:16:50 <seanleong-wmde> yes I just read it Lucas_WMDE, for now the new C usage will remain as normal like last time, and only the CQR aspect currently in the DB will show as the new Ref and Aliases
2025-10-20 15:17:12 <Lucas_WMDE> ok
2025-10-20 15:17:15 <logmsgbot> !log hnowlan@deploy1003 helmfile [staging] START helmfile.d/services/kartotherian: apply
2025-10-20 15:17:54 <logmsgbot> !log hnowlan@deploy1003 helmfile [staging] DONE helmfile.d/services/kartotherian: apply
2025-10-20 15:18:09 <wikibugs> ('CR) ''Btullis: [C:''+1] Deploy a postgresql-growthbook cluster in dse-k8s-eqiad [deployment-charts] - ''https://gerrit.wikimedia.org/r/1197273 (https://phabricator.wikimedia.org/T406578) (owner: ''Brouberol)'
2025-10-20 15:18:37 <Lucas_WMDE> created T407767 for the scap error I mentioned above btw
2025-10-20 15:18:38 <stashbot> T407767: scap crash in SpiderPig job #775 (change was edited after creating job): TypeError: prompt_for_approval_or_exit() missing 1 required positional argument: 'exit_message' - https://phabricator.wikimedia.org/T407767
2025-10-20 15:19:12 <wikibugs> ('CR) ''Btullis: "PCC failure appears unrelated, so +1 in principle from me." [puppet] - ''https://gerrit.wikimedia.org/r/1197271 (https://phabricator.wikimedia.org/T406578) (owner: ''Brouberol)'
2025-10-20 15:20:39 <seanleong-wmde> will include that phab ticket into the revert patch as well
2025-10-20 15:20:44 <seanleong-wmde> gimme a few more min
2025-10-20 15:20:58 <Lucas_WMDE> the scap one? no need imho, that could’ve happened with any change (I assume)
2025-10-20 15:21:15 <Lucas_WMDE> and thanks :)
2025-10-20 15:21:34 <wikibugs> 'ops-codfw, ''SRE, ''DC-Ops, ''fundraising-tech-ops: Q2:rack/setup/install franio2004 - https://phabricator.wikimedia.org/T405981#11290204 (''Jhancock.wm) a:''Jgreen @Jgreen this is ready for you. please lemme know if you need anything.'
2025-10-20 15:22:54 <wikibugs> ('CR) ''CDanis: [C:''+2] varnish: WMF-Uniq -> Analytics: fix frequency bug [puppet] - ''https://gerrit.wikimedia.org/r/1196154 (https://phabricator.wikimedia.org/T405783) (owner: ''CDanis)'
2025-10-20 15:27:11 <seanleong-wmde> Lucas_WMDE nope, the CQR aspects introduction
2025-10-20 15:27:53 <Lucas_WMDE> yeah, but the fourth Bug: line in https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Wikibase/+/1197289/2 is not necessary IMHO
2025-10-20 15:27:56 <Lucas_WMDE> (the first three are useful)
2025-10-20 15:28:00 <wikibugs> 'SRE, ''SRE-Access-Requests: Requesting access to deployment for VolkerE - https://phabricator.wikimedia.org/T406243#11290252 (''Raine)'
2025-10-20 15:28:08 <Lucas_WMDE> the changes in there look good to me so far btw
2025-10-20 15:28:29 <Lucas_WMDE> (but it will need to be squashed into the parent change, at least for deployment)
2025-10-20 15:28:45 <wikibugs> 'SRE, ''Traffic-Icebox: Improve how we build the 'haproxy_allowed_healthcheck_sources' list of IPs - https://phabricator.wikimedia.org/T407769 (''cmooney) ''NEW p:''Triage''Low'
2025-10-20 15:30:05 <jouncebot> jan_drewniak: I, the Bot under the Fountain, call upon thee, The Deployer, to do Wikimedia Portals Update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251020T1530).
2025-10-20 15:30:08 <seanleong-wmde> Lucas_WMDE can you guide me on how to squash it?
2025-10-20 15:30:25 <Lucas_WMDE> yeah
2025-10-20 15:30:31 <Lucas_WMDE> I’m not sure if the Gerrit UI has an option for it
2025-10-20 15:30:55 <Lucas_WMDE> I would, in a local terminal, run something like `git rebase -i master` (assuming you’re currently on a branch with those changes)
2025-10-20 15:31:11 <Lucas_WMDE> and then, in the “todo list”, change the beginning of the second line from “pick” to “squash”
2025-10-20 15:31:23 <Lucas_WMDE> and then git should squash them together and let you edit the commit message
2025-10-20 15:31:46 <wikibugs> 'SRE, ''Traffic: Improve how we build the 'haproxy_allowed_healthcheck_sources' list of IPs - https://phabricator.wikimedia.org/T407769#11290283 (''ssingh) Thanks for filing this task! I think this is a good idea to reduce the manual updates to this list, and something we have failed to keep updated. We will...'
2025-10-20 15:33:51 <wikibugs> ('PS5) ''Aaron Schulz: [DNM] Set wgRestSandboxSpecs['wmf-restbase'] on testwiki to use the static specs [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1190742 (https://phabricator.wikimedia.org/T396805)'
2025-10-20 15:34:17 <wikibugs> ('PS6) ''Aaron Schulz: Set wgRestSandboxSpecs['wmf-restbase'] on testwiki to use the static specs [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1190742 (https://phabricator.wikimedia.org/T396805)'
2025-10-20 15:36:37 <wikibugs> ('CR) ''Alexandros Kosiaris: [C:''+1] hieradata: enable analytics-web listener in mediawiki [puppet] - ''https://gerrit.wikimedia.org/r/1196733 (https://phabricator.wikimedia.org/T309738) (owner: ''Scott French)'
2025-10-20 15:37:25 <wikibugs> ('CR) ''Alexandros Kosiaris: [C:''+1] "I am wondering whether this makes sense to put only in mw-cron specific yaml values files, but I am probably over thinking this?" [deployment-charts] - ''https://gerrit.wikimedia.org/r/1196735 (https://phabricator.wikimedia.org/T309738) (owner: ''Scott French)'
2025-10-20 15:38:29 <jinxer-wm> RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
2025-10-20 15:42:39 <seanleong-wmde> Lucas_WMDE okay done
2025-10-20 15:43:04 <seanleong-wmde> do we create a cherry pick now?
2025-10-20 15:43:17 <Lucas_WMDE> just a moment
2025-10-20 15:43:36 <Lucas_WMDE> the commit message shouldn’t be two commit messages pasted together ^^
2025-10-20 15:43:42 <Lucas_WMDE> I’ll fix it locally
2025-10-20 15:44:36 <seanleong-wmde> okay thanks, our change just adds back the lines in EntityUsage.php, but since it reverts the revert, the file is now missing from the current patch
2025-10-20 15:45:01 <Lucas_WMDE> uploaded a new patch set at https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Wikibase/+/1197274
2025-10-20 15:45:03 <wikibugs> ('CR) ''Aaron Schulz: "Sounds good!" [puppet] - ''https://gerrit.wikimedia.org/r/1189936 (https://phabricator.wikimedia.org/T385066) (owner: ''Aaron Schulz)'
2025-10-20 15:45:09 <Lucas_WMDE> and there we can see the diff https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Wikibase/+/1197274/2..3/client/includes/Usage/EntityUsage.php
2025-10-20 15:45:20 <Lucas_WMDE> hm, I wonder if Gerrit will even let us cherry pick this
2025-10-20 15:45:20 <logmsgbot> !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/kartotherian: apply
2025-10-20 15:45:21 <jinxer-wm> FIRING: NetworkDeviceAlarmActive: Alarm active on ssw1-f1-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive
2025-10-20 15:45:26 <Lucas_WMDE> since there’s already a change with this Change-Id on the wmf.23 branch 🤔
2025-10-20 15:45:29 <Lucas_WMDE> jouncebot: nowandnext
2025-10-20 15:45:29 <jouncebot> For the next 0 hour(s) and 14 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251020T1530)
2025-10-20 15:45:29 <jouncebot> In 1 hour(s) and 14 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251020T1700)
2025-10-20 15:45:29 <jouncebot> In 1 hour(s) and 14 minute(s): Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251020T1700)
2025-10-20 15:45:36 <Lucas_WMDE> let’s try it
2025-10-20 15:45:50 <Lucas_WMDE> nope
2025-10-20 15:45:52 <Lucas_WMDE> Could not perform action: Cherry-pick with Change-Id Ib6ddef47e577a413ccc11d9cca5f71973faaeae7 could not update the existing change 1197276 in destination branch refs/heads/wmf/1.45.0-wmf.23 of project mediawiki/extensions/Wikibase, because the change was closed (MERGED)
2025-10-20 15:45:57 <Lucas_WMDE> ok, new Change-Id then
2025-10-20 15:46:00 <hnowlan> just a heads-up, I am applying some changes to kartotherian which will only affect maps. I'll be keeping an eye but if you see anything weird maps-adjacent let me know
2025-10-20 15:46:05 <logmsgbot> !log dancy@deploy2002 Installing scap version "4.215.0" for 2 host(s)
2025-10-20 15:46:07 <logmsgbot> !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/kartotherian: apply
2025-10-20 15:46:13 <Lucas_WMDE> ack
2025-10-20 15:47:00 <wikibugs> ('PS1) ''Lucas Werkmeister (WMDE): Revert "Implement new usage types for statement with qualifiers and references" [extensions/Wikibase] (wmf/1.45.0-wmf.23) - ''https://gerrit.wikimedia.org/r/1197294 (https://phabricator.wikimedia.org/T401290)'
2025-10-20 15:47:12 <Lucas_WMDE> there’s our cherry-pick to deploy
2025-10-20 15:47:35 <wikibugs> ('CR) ''TrainBranchBot: [C:''+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [extensions/Wikibase] (wmf/1.45.0-wmf.23) - ''https://gerrit.wikimedia.org/r/1197294 (https://phabricator.wikimedia.org/T401290) (owner: ''Lucas Werkmeister (WMDE))'
2025-10-20 15:47:52 <logmsgbot> !log dancy@deploy2002 Installation of scap version "4.215.0" completed for 2 hosts
2025-10-20 15:47:55 <Lucas_WMDE> I’ll let CI run normally on this, it’s not as urgent as the revert-revert earlier
2025-10-20 15:48:39 <seanleong-wmde> okay
2025-10-20 15:49:26 <logmsgbot> !log hnowlan@deploy1003 helmfile [codfw] START helmfile.d/services/kartotherian: apply
2025-10-20 15:50:05 <wikibugs> 'ops-codfw, ''DC-Ops: Alert for device ps1-a1-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T407772 (''phaultfinder) ''NEW'
2025-10-20 15:50:36 <logmsgbot> !log hnowlan@deploy1003 helmfile [codfw] DONE helmfile.d/services/kartotherian: apply
2025-10-20 15:50:49 <logmsgbot> !log eevans@cumin1003 START - Cookbook sre.cassandra.roll-reboot rolling reboot on A:aqs-codfw
2025-10-20 15:51:06 <wikibugs> ('PS2) ''DLynch: Edit check: fix some eslint warnings [extensions/VisualEditor] (wmf/1.45.0-wmf.23) - ''https://gerrit.wikimedia.org/r/1197295'
2025-10-20 15:51:08 <logmsgbot> !log sukhe@cumin1003 START - Cookbook sre.hosts.reimage for host durum6001.drmrs.wmnet with OS trixie
2025-10-20 15:51:48 <jinxer-wm> FIRING: KubernetesCalicoDown: wikikube-worker2203.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=wikikube-worker2203.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
2025-10-20 15:52:50 <wikibugs> ('PS3) ''DLynch: Edit check: fix some eslint warnings [extensions/VisualEditor] (wmf/1.45.0-wmf.23) - ''https://gerrit.wikimedia.org/r/1197295 (https://phabricator.wikimedia.org/T407747)'
2025-10-20 15:54:24 <jinxer-wm> FIRING: [4x] ProbeDown: Service aqs2001-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
2025-10-20 15:54:43 <icinga-wm> PROBLEM - BFD status on asw1-b12-drmrs.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
2025-10-20 15:55:15 <Kemayo> I have a pretty urgent editing-fix that I'm going to deploy, if nobody has any objections: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/VisualEditor/+/1197295
2025-10-20 15:55:47 <Kemayo> (Once Lucas_WMDE is done, I mean.)
2025-10-20 15:57:14 <Lucas_WMDE> looks
2025-10-20 15:57:30 <Lucas_WMDE> ack
2025-10-20 15:59:52 <wikibugs> ('CR) ''DLynch: "The commit message sounds very non-severe because the original patch didn't *realize* that it was fixing a bug which breaks editcheck pre-" [extensions/VisualEditor] (wmf/1.45.0-wmf.23) - ''https://gerrit.wikimedia.org/r/1197295 (https://phabricator.wikimedia.org/T407747) (owner: ''DLynch)'
2025-10-20 16:00:22 <wikibugs> 'sre-alert-triage, ''SRE Observability: Alert in need of triage: PuppetConstantChange (instance prometheus2007:9100) - https://phabricator.wikimedia.org/T407484#11290430 (''tappof) I found that the certificates used by Prometheus to authenticate against Kubernetes are being renewed every hour. I believe the r...'
2025-10-20 16:01:16 <wikibugs> ('CR) ''Lucas Werkmeister (WMDE): "And here I thought it meant something like “we’re accidentally showing lots of fake eslint warnings to people who are CodeMirror’ing on-wi" [extensions/VisualEditor] (wmf/1.45.0-wmf.23) - ''https://gerrit.wikimedia.org/r/1197295 (https://phabricator.wikimedia.org/T407747) (owner: ''DLynch)'
2025-10-20 16:02:10 <wikibugs> ('Merged) ''jenkins-bot: Revert "Implement new usage types for statement with qualifiers and references" [extensions/Wikibase] (wmf/1.45.0-wmf.23) - ''https://gerrit.wikimedia.org/r/1197294 (https://phabricator.wikimedia.org/T401290) (owner: ''Lucas Werkmeister (WMDE))'
2025-10-20 16:02:30 <logmsgbot> !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1197294|Revert "Implement new usage types for statement with qualifiers and references" (T401290 T407684 T407744)]]
2025-10-20 16:02:37 <stashbot> T401290: Implement new usage types for qualifiers and references - https://phabricator.wikimedia.org/T401290
2025-10-20 16:02:37 <stashbot> T407684: Lua's ipairs() function can no longer iterate over Wikidata references - https://phabricator.wikimedia.org/T407684
2025-10-20 16:02:38 <stashbot> T407744: Wikibase\DataModel\Entity\EntityIdParsingException: The serialization "Q42902012 " is not recognized by the configured id builders - https://phabricator.wikimedia.org/T407744
2025-10-20 16:05:53 <wikibugs> ('CR) ''Hnowlan: Set wgRestSandboxSpecs['wmf-restbase'] on testwiki to use the static specs [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1190742 (https://phabricator.wikimedia.org/T396805) (owner: ''Aaron Schulz)'
2025-10-20 16:07:13 <logmsgbot> !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde: Backport for [[gerrit:1197294|Revert "Implement new usage types for statement with qualifiers and references" (T401290 T407684 T407744)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
2025-10-20 16:07:23 <Lucas_WMDE> seanleong-wmde: please test
2025-10-20 16:07:24 <seanleong-wmde> Testing it now
2025-10-20 16:07:36 <Lucas_WMDE> https://en.wikipedia.org/w/index.php?title=Samuel_Freeman_(philosopher)&action=info doesn’t crash, which is promising
2025-10-20 16:07:42 <Kemayo> (I have to make a doctor appointment, so I will do my backport when I get back instead.)
2025-10-20 16:07:48 <Lucas_WMDE> good luck!
2025-10-20 16:07:53 <Lucas_WMDE> it even still shows “Some statements (with qualifiers and references)”, I guess the revert didn’t remove the i18n message
2025-10-20 16:08:40 <Lucas_WMDE> https://fi.wikipedia.org/wiki/Vantaa is fixed
2025-10-20 16:08:44 <seanleong-wmde> nope, that's another patch, for the curr fix we just make sure that the current CQR entities will remain
2025-10-20 16:08:55 <seanleong-wmde> I'll find some crashing stuff in the report to check
2025-10-20 16:09:16 <icinga-wm> RECOVERY - PyBal IPVS diff check on lvs2013 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
2025-10-20 16:09:29 <Lucas_WMDE> the “preview page with this template” bit looks like the ipairs() references issue is fixed too, so far so good
2025-10-20 16:09:54 <wikibugs> ('CR) ''Scott French: "I was wondering the same, yeah. In an ideal world, there would be a straightforward way to both enable the listener and open up the networ" [deployment-charts] - ''https://gerrit.wikimedia.org/r/1196735 (https://phabricator.wikimedia.org/T309738) (owner: ''Scott French)'
2025-10-20 16:10:16 <Lucas_WMDE> trying some URLs from logstash
2025-10-20 16:11:06 <Lucas_WMDE> hm, https://arz.wikipedia.org/w/rest.php/v1/page/1990_%D8%A8%D8%B7%D9%88%D9%84%D8%A9_%D8%A7%D9%88%D8%B1%D9%88%D8%A8%D8%A7_%D9%84%D8%A7%D9%84%D8%B9%D8%A7%D8%A8_%D8%A7%D9%84%D9%82%D9%88%D9%89_1500_%D9%85%D8%AA%D8%B1_%D8%B3%D9%8A%D8%AF%D8%A7%D8%AA/html shows “Lua error in Module:External_links at line 843: bad argument #1 to
2025-10-20 16:11:06 <Lucas_WMDE> 'ipairs' (table expected, got nil).”
2025-10-20 16:11:12 <Lucas_WMDE> not sure what to make of that
2025-10-20 16:11:43 <seanleong-wmde> ah I think that ticket has a typo
2025-10-20 16:11:50 <Lucas_WMDE> but it seems to show the same thing without WikimediaDebug
2025-10-20 16:11:53 <seanleong-wmde> if you copy and paste the pairs one
2025-10-20 16:11:56 <Lucas_WMDE> and also the message quickly vanishes on https://arz.wikipedia.org/wiki/1990_%D8%A8%D8%B7%D9%88%D9%84%D8%A9_%D8%A7%D9%88%D8%B1%D9%88%D8%A8%D8%A7_%D9%84%D8%A7%D9%84%D8%B9%D8%A7%D8%A8_%D8%A7%D9%84%D9%82%D9%88%D9%89_1500_%D9%85%D8%AA%D8%B1_%D8%B3%D9%8A%D8%AF%D8%A7%D8%AA
2025-10-20 16:11:58 <seanleong-wmde> and just add an i manually
2025-10-20 16:12:03 <seanleong-wmde> it should work
2025-10-20 16:12:11 <seanleong-wmde> manually add an i before the pairs
2025-10-20 16:12:40 <Lucas_WMDE> ok, https://arz.wikipedia.org/wiki/1990_%D8%A8%D8%B7%D9%88%D9%84%D8%A9_%D8%A7%D9%88%D8%B1%D9%88%D8%A8%D8%A7_%D9%84%D8%A7%D9%84%D8%B9%D8%A7%D8%A8_%D8%A7%D9%84%D9%82%D9%88%D9%89_1500_%D9%85%D8%AA%D8%B1_%D8%B3%D9%8A%D8%AF%D8%A7%D8%AA?safemode=1 shows the lua error
2025-10-20 16:12:48 <Lucas_WMDE> I guess they have some site JS that hides lua errors by default 🤷
2025-10-20 16:12:59 <Lucas_WMDE> but it happens with or without WikimediaDebug, so not the revert’s fault
2025-10-20 16:14:12 <seanleong-wmde> okay, not sure about that issue
2025-10-20 16:14:14 <Lucas_WMDE> I think we should be good to go
2025-10-20 16:14:22 <seanleong-wmde> but so far the bug report ones are fixed
2025-10-20 16:14:24 <Lucas_WMDE> I tried some more URLs and found no errors
2025-10-20 16:14:29 <wikibugs> 'SRE, ''SRE-Access-Requests: Requesting access to "analytics-admins" and "deployment" groups for a-pizzata - https://phabricator.wikimedia.org/T407228#11290474 (''Raine) a:''Ahoelzl Assigning to @Ahoelzl for approval.'
2025-10-20 16:14:32 <Lucas_WMDE> nothing in mwdebug logstash either
2025-10-20 16:14:51 <seanleong-wmde> okay, let's go
2025-10-20 16:15:00 <Lucas_WMDE> (well, plenty of boring debug messages, like our two accounts being autocreated on arzwiki :P but no errors)
2025-10-20 16:15:07 <logmsgbot> !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde: Continuing with sync
2025-10-20 16:15:09 <Lucas_WMDE> let’s roll
2025-10-20 16:15:10 <seanleong-wmde> xD
2025-10-20 16:15:41 <Lucas_WMDE> (I’m slightly surprised I didn’t have an account yet, I thought I visited arzwiki before ^^)
2025-10-20 16:16:01 <Lucas_WMDE> now can’t read arzwiki without thinking of https://de.wikipedia.org/wiki/So_klingt%E2%80%99s_bei_uns_im_Arzgebirg
2025-10-20 16:16:43 <wikibugs> ('CR) ''Andrea Denisse: [C:''+2] alertmanager: Add Slack route for the rweb team [puppet] - ''https://gerrit.wikimedia.org/r/1196533 (https://phabricator.wikimedia.org/T406689) (owner: ''Andrea Denisse)'
2025-10-20 16:17:01 <wikibugs> ('PS1) ''Tiziano Fogli: k8s/client_cert: adjust Prometheus certificate renewal timing [puppet] - ''https://gerrit.wikimedia.org/r/1197303 (https://phabricator.wikimedia.org/T407484)'
2025-10-20 16:18:24 <Lucas_WMDE> wow, spread out over the past 24 hours, the $aspect error is actually less common than the one from T402548
2025-10-20 16:18:24 <stashbot> T402548: PHP Warning: DOMNode::appendChild(): Document Fragment is empty - https://phabricator.wikimedia.org/T402548
2025-10-20 16:18:37 <Lucas_WMDE> that one has 5k, $aspect 4.6k
2025-10-20 16:18:55 <logmsgbot> !log sukhe@cumin1003 START - Cookbook sre.cdn.roll-reboot rolling reboot on A:cp-ulsfo and not P{cp4037*} and A:cp
2025-10-20 16:18:56 <Lucas_WMDE> anyway, nothing concerning in mediawiki-errors so far as this rolls out
2025-10-20 16:19:13 <logmsgbot> !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1197294|Revert "Implement new usage types for statement with qualifiers and references" (T401290 T407684 T407744)]] (duration: 16m 43s)
2025-10-20 16:19:14 <logmsgbot> !log sukhe@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on durum6001.drmrs.wmnet with reason: host reimage
2025-10-20 16:19:21 <stashbot> T401290: Implement new usage types for qualifiers and references - https://phabricator.wikimedia.org/T401290
2025-10-20 16:19:21 <stashbot> T407684: Lua's ipairs() function can no longer iterate over Wikidata references - https://phabricator.wikimedia.org/T407684
2025-10-20 16:19:21 <stashbot> T407744: Wikibase\DataModel\Entity\EntityIdParsingException: The serialization "Q42902012 " is not recognized by the configured id builders - https://phabricator.wikimedia.org/T407744
2025-10-20 16:19:38 <wikibugs> 'ops-eqiad, ''DC-Ops: Power Supply - Status - issue on wikikube-worker1268:9290 - https://phabricator.wikimedia.org/T407774 (''phaultfinder) ''NEW'
2025-10-20 16:19:55 <seanleong-wmde> hahaha that sounds like a more serious issue
2025-10-20 16:20:27 <Lucas_WMDE> last occurrence of T407744 is at 16:16:08 UTC
2025-10-20 16:21:25 <seanleong-wmde> fingers crossed
2025-10-20 16:21:32 <seanleong-wmde> no more after the patch
2025-10-20 16:22:19 <seanleong-wmde> what a great incident to start the week
2025-10-20 16:22:21 <wikibugs> 'ops-codfw, ''SRE, ''DC-Ops, ''Infrastructure-Foundations, ''Essential-Work: Reimage failed after prompt...is prompt needed? - https://phabricator.wikimedia.org/T406656#11290513 (''Dzahn) I just wanted to add that I still just see a logical conflict between 2 statements around this. The first is made by...'
2025-10-20 16:22:41 <Lucas_WMDE> yeah
2025-10-20 16:23:32 <Lucas_WMDE> I think that qualifies you for one of those “I broke Wikipedia but then I fixed it” stickers (t-shirts?) but I have no idea where to get those
2025-10-20 16:23:54 <seanleong-wmde> I'll stay for a bit to monitor, but thank you for the help Lucas_WMDE! appreciate it, it was a great journey o7
2025-10-20 16:24:15 <Lucas_WMDE> thank you too!
2025-10-20 16:24:16 <seanleong-wmde> I def will ask around
2025-10-20 16:24:19 <wikibugs> 'ops-eqiad, ''DC-Ops: Power Supply - PS Redundancy - issue on wikikube-worker1268:9290 - https://phabricator.wikimedia.org/T407775 (''phaultfinder) ''NEW'
2025-10-20 16:24:53 <wikibugs> ('CR) ''Alexandros Kosiaris: [C:''+1] "ack and agreed." [deployment-charts] - ''https://gerrit.wikimedia.org/r/1196735 (https://phabricator.wikimedia.org/T309738) (owner: ''Scott French)'
2025-10-20 16:27:50 <wikibugs> ('CR) ''Tiziano Fogli: "More details on the task." [puppet] - ''https://gerrit.wikimedia.org/r/1197303 (https://phabricator.wikimedia.org/T407484) (owner: ''Tiziano Fogli)'
2025-10-20 16:28:04 <logmsgbot> !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum6001.drmrs.wmnet with reason: host reimage
2025-10-20 16:29:03 <Lucas_WMDE> still nothing new in logstash, I’ll close the window
2025-10-20 16:29:24 <Lucas_WMDE> !log UTC afternoon backport+config window (belatedly, more or less) done
2025-10-20 16:29:27 <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2025-10-20 16:29:41 <Lucas_WMDE> Kemayo: I’m done, feel free to deploy when you’re back :)
2025-10-20 16:30:01 <wikibugs> ('CR) ''Tiziano Fogli: "Yes, right." [puppet] - ''https://gerrit.wikimedia.org/r/1196918 (https://phabricator.wikimedia.org/T407137) (owner: ''Tiziano Fogli)'
2025-10-20 16:32:00 <seanleong-wmde> Lucas_WMDE \o/
2025-10-20 16:32:15 <logmsgbot> !log sukhe@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp4038.ulsfo.wmnet
2025-10-20 16:32:30 <wikibugs> ('CR) ''Tiziano Fogli: [C:''+2] haproxy: enable nrpe2nodexp wrapper on haproxy_alive check [puppet] - ''https://gerrit.wikimedia.org/r/1196918 (https://phabricator.wikimedia.org/T407137) (owner: ''Tiziano Fogli)'
2025-10-20 16:32:54 <wikibugs> ('CR) ''Tiziano Fogli: [C:''+2] mariadb::proxy::master: enable nrpe2ndoexp wrapper on haproxy_failover [puppet] - ''https://gerrit.wikimedia.org/r/1196925 (https://phabricator.wikimedia.org/T407137) (owner: ''Tiziano Fogli)'
2025-10-20 16:32:59 <wikibugs> ('PS1) ''Alexandros Kosiaris: Remove wmf.volumes from all charts [deployment-charts] - ''https://gerrit.wikimedia.org/r/1197304'
2025-10-20 16:34:14 <wikibugs> ('PS1) ''Majavah: toolforge: toolviews: Remove obsolete version check [puppet] - ''https://gerrit.wikimedia.org/r/1197305 (https://phabricator.wikimedia.org/T407750)'
2025-10-20 16:36:51 <wikibugs> ('PS2) ''Tiziano Fogli: monitoring: enable nrpe2nodexp wrapper on <dir>_owned [puppet] - ''https://gerrit.wikimedia.org/r/1196943 (https://phabricator.wikimedia.org/T407120)'
2025-10-20 16:37:32 <wikibugs> ('CR) ''Tiziano Fogli: [C:''+2] monitoring: enable nrpe2nodexp wrapper on <dir>_owned [puppet] - ''https://gerrit.wikimedia.org/r/1196943 (https://phabricator.wikimedia.org/T407120) (owner: ''Tiziano Fogli)'
2025-10-20 16:37:54 <wikibugs> ('PS1) ''Dzahn: zuul: stop using path including hardcode host name [puppet] - ''https://gerrit.wikimedia.org/r/1197306 (https://phabricator.wikimedia.org/T407671)'
2025-10-20 16:38:11 <wikibugs> ('CR) ''CI reject: [V:''-1] zuul: stop using path including hardcode host name [puppet] - ''https://gerrit.wikimedia.org/r/1197306 (https://phabricator.wikimedia.org/T407671) (owner: ''Dzahn)'
2025-10-20 16:38:14 <wikibugs> ('CR) ''CI reject: [V:''-1] Remove wmf.volumes from all charts [deployment-charts] - ''https://gerrit.wikimedia.org/r/1197304 (owner: ''Alexandros Kosiaris)'
2025-10-20 16:39:00 <wikibugs> ('PS1) ''Majavah: P:toolforge: Move toolviews processing to HAProxy [puppet] - ''https://gerrit.wikimedia.org/r/1197308 (https://phabricator.wikimedia.org/T284558)'
2025-10-20 16:39:16 <wikibugs> ('PS2) ''Dzahn: zuul: stop using path including hardcode host name [puppet] - ''https://gerrit.wikimedia.org/r/1197306 (https://phabricator.wikimedia.org/T407671)'
2025-10-20 16:39:34 <wikibugs> ('CR) ''CI reject: [V:''-1] zuul: stop using path including hardcode host name [puppet] - ''https://gerrit.wikimedia.org/r/1197306 (https://phabricator.wikimedia.org/T407671) (owner: ''Dzahn)'
2025-10-20 16:40:38 <wikibugs> 'SRE, ''Domains, ''Traffic: Acquire enwp.org - https://phabricator.wikimedia.org/T332220#11290600 (''BCornwall) Thank you, all. :) This has been migrated and things should continue to behave as expected. If that's not true, please re-open this ticket so we can look into it!'
2025-10-20 16:40:43 <wikibugs> ('CR) ''Marostegui: [C:''+1] clone_es.py: clone readonly es* hosts [cookbooks] - ''https://gerrit.wikimedia.org/r/1183646 (owner: ''Federico Ceratto)'
2025-10-20 16:40:47 <wikibugs> 'SRE, ''Domains, ''Traffic: Acquire enwp.org - https://phabricator.wikimedia.org/T332220#11290601 (''BCornwall) ''In progress→''Resolved'
2025-10-20 16:41:44 <icinga-wm> RECOVERY - BFD status on asw1-b12-drmrs.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
2025-10-20 16:44:17 <jinxer-wm> FIRING: [2x] NodeTextfileStale: Stale textfile for wdqs2009:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
2025-10-20 16:44:44 <icinga-wm> PROBLEM - BFD status on asw1-b12-drmrs.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
2025-10-20 16:45:01 <sukhe> ^ expected
2025-10-20 16:45:44 <icinga-wm> RECOVERY - BFD status on asw1-b12-drmrs.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
2025-10-20 16:46:24 <logmsgbot> !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum6001.drmrs.wmnet with OS trixie
2025-10-20 16:46:54 <wikibugs> ('PS3) ''Dzahn: zuul: stop using path including hardcode host name [puppet] - ''https://gerrit.wikimedia.org/r/1197306 (https://phabricator.wikimedia.org/T407671)'
2025-10-20 16:48:48 <logmsgbot> !log sukhe@cumin1003 START - Cookbook sre.hosts.reimage for host durum6002.drmrs.wmnet with OS trixie
2025-10-20 16:53:44 <icinga-wm> PROBLEM - BFD status on asw1-b13-drmrs.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
2025-10-20 16:53:52 <sukhe> ^ expected
2025-10-20 16:54:12 <jinxer-wm> FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
2025-10-20 16:54:52 <wikibugs> 'ops-codfw, ''SRE, ''DC-Ops: Alert for device ps1-a1-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T407772#11290631 (''phaultfinder)'
2025-10-20 16:55:09 <wikibugs> ('PS6) ''Ssingh: P:cache::haproxy: exempt releases.wikimedia.org from UA policy [puppet] - ''https://gerrit.wikimedia.org/r/1192210 (https://phabricator.wikimedia.org/T405165)'
2025-10-20 16:55:56 <wikibugs> ('CR) ''Ssingh: [V:''+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7308/co" [puppet] - ''https://gerrit.wikimedia.org/r/1192210 (https://phabricator.wikimedia.org/T405165) (owner: ''Ssingh)'
2025-10-20 17:00:05 <jouncebot> Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251020T1700)
2025-10-20 17:00:05 <jouncebot> ryankemper: Your horoscope predicts another Wikidata Query Service weekly deploy deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251020T1700).
2025-10-20 17:00:44 <wikibugs> ('CR) ''Dzahn: [V:''-1 C:''-1] "https://puppet-compiler.wmflabs.org/output/1197306/7309/zuul2001.codfw.wmnet/index.html" [puppet] - ''https://gerrit.wikimedia.org/r/1197306 (https://phabricator.wikimedia.org/T407671) (owner: ''Dzahn)'
2025-10-20 17:03:04 <wikibugs> 'SRE, ''Infrastructure-Foundations: Increase net.nf_conntrack_max on kerberos hosts if needed - https://phabricator.wikimedia.org/T407726#11290653 (''jhathaway) From a brief look, most of these conntrack entries are from `an-coord1003.eqiad.wmnet`, along with log entries of the form: ` presto/an-coord1003.eq...'
2025-10-20 17:04:27 <wikibugs> ('PS4) ''Dzahn: zuul: stop using path including hardcode host name [puppet] - ''https://gerrit.wikimedia.org/r/1197306 (https://phabricator.wikimedia.org/T407671)'
2025-10-20 17:06:18 <wikibugs> ('PS5) ''Dzahn: zuul: stop using path including hardcode host name [puppet] - ''https://gerrit.wikimedia.org/r/1197306 (https://phabricator.wikimedia.org/T407671)'
2025-10-20 17:07:12 <wikibugs> ('CR) ''Bking: [C:''+2] ganeti-jumbo: Add hosts and partman recipe [puppet] - ''https://gerrit.wikimedia.org/r/1196952 (https://phabricator.wikimedia.org/T405964) (owner: ''Bking)'
2025-10-20 17:07:54 <wikibugs> ('CR) ''Bking: [C:''+2] "self-merging in the interest of time. These are net-new hosts, so I'm not aware of any risks that are involved here." [puppet] - ''https://gerrit.wikimedia.org/r/1196952 (https://phabricator.wikimedia.org/T405964) (owner: ''Bking)'
2025-10-20 17:08:26 <wikibugs> ('PS6) ''Dzahn: zuul: stop using path including hardcode host name [puppet] - ''https://gerrit.wikimedia.org/r/1197306 (https://phabricator.wikimedia.org/T407671)'
2025-10-20 17:09:39 <wikibugs> ('CR) ''FNegri: [C:''+1] toolforge: toolviews: Ignore requests for *.svc.toolforge.org [puppet] - ''https://gerrit.wikimedia.org/r/1197283 (owner: ''Majavah)'
2025-10-20 17:10:11 <wikibugs> ('CR) ''FNegri: [C:''+1] toolforge: toolviews: Remove obsolete version check [puppet] - ''https://gerrit.wikimedia.org/r/1197305 (https://phabricator.wikimedia.org/T407750) (owner: ''Majavah)'
2025-10-20 17:13:21 <wikibugs> ('CR) ''Dzahn: [V:''+1 C:''+2] "https://puppet-compiler.wmflabs.org/output/1197306/7311/zuul2001.codfw.wmnet/index.html" [puppet] - ''https://gerrit.wikimedia.org/r/1197306 (https://phabricator.wikimedia.org/T407671) (owner: ''Dzahn)'
2025-10-20 17:13:38 <logmsgbot> !log sukhe@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp4039.ulsfo.wmnet
2025-10-20 17:15:57 <wikibugs> ('CR) ''Btullis: [C:''+2] "This is actually a no-op, since the canary-events resources are absented. I'll merge it, but then follow up with a patch to remove the res" [puppet] - ''https://gerrit.wikimedia.org/r/1195778 (https://phabricator.wikimedia.org/T402943) (owner: ''Btullis)'
2025-10-20 17:19:03 <logmsgbot> !log sukhe@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on durum6002.drmrs.wmnet with reason: host reimage
2025-10-20 17:24:30 <logmsgbot> !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum6002.drmrs.wmnet with reason: host reimage
2025-10-20 17:26:40 <wikibugs> ('CR) ''Majavah: [C:''+2] toolforge: toolviews: Ignore requests for *.svc.toolforge.org [puppet] - ''https://gerrit.wikimedia.org/r/1197283 (owner: ''Majavah)'
2025-10-20 17:26:48 <wikibugs> ('CR) ''Majavah: [C:''+2] toolforge: toolviews: Remove obsolete version check [puppet] - ''https://gerrit.wikimedia.org/r/1197305 (https://phabricator.wikimedia.org/T407750) (owner: ''Majavah)'
2025-10-20 17:29:12 <jinxer-wm> FIRING: [2x] SystemdUnitFailed: amd-k8s-node-labeller.service on ml-serve1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
2025-10-20 17:29:37 <wikibugs> ('CR) ''Majavah: [C:''-2] "Holding for now, the switch needs to happen at the same time we move traffic to keep the unique IP counter happy." [puppet] - ''https://gerrit.wikimedia.org/r/1197308 (https://phabricator.wikimedia.org/T284558) (owner: ''Majavah)'
2025-10-20 17:39:12 <jinxer-wm> FIRING: SwitchCoreInterfaceDown: Switch core interface down - ssw1-d1-eqiad:ethernet-1/31 (Transport: ssw1-f1-eqiad:et-0/0/29 (Equinix, 21996480) {#0107202f1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-d1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown
2025-10-20 17:39:57 <wikibugs> 'ops-codfw, ''SRE, ''DC-Ops: Alert for device ps1-a1-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T407772#11290753 (''phaultfinder)'
2025-10-20 17:42:44 <icinga-wm> RECOVERY - BFD status on asw1-b13-drmrs.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
2025-10-20 17:42:51 <logmsgbot> !log sukhe@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum6002.drmrs.wmnet with OS trixie
2025-10-20 17:47:25 <wikibugs> ('PS2) ''Krinkle: varnish: Remove unused "Mobile Redirect" logic [puppet] - ''https://gerrit.wikimedia.org/r/1194558 (https://phabricator.wikimedia.org/T405931)'
2025-10-20 17:52:50 <logmsgbot> !log sukhe@cumin1003 START - Cookbook sre.cdn.roll-reboot rolling reboot on A:cp-text_eqsin and A:cp
2025-10-20 17:52:59 <logmsgbot> !log sukhe@cumin1003 START - Cookbook sre.cdn.roll-reboot rolling reboot on A:cp-upload_eqsin and A:cp
2025-10-20 17:54:56 <logmsgbot> !log sukhe@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp4040.ulsfo.wmnet
2025-10-20 18:02:57 <wikibugs> ('PS19) ''CDobbins: sre.loadbalancer: modify admin.py to accept 'reboot' action [cookbooks] - ''https://gerrit.wikimedia.org/r/1180137 (https://phabricator.wikimedia.org/T395240)'
2025-10-20 18:04:08 <logmsgbot> !log sukhe@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp5025.eqsin.wmnet
2025-10-20 18:05:02 <logmsgbot> !log eevans@cumin1003 END (PASS) - Cookbook sre.cassandra.roll-reboot (exit_code=0) rolling reboot on A:aqs-codfw
2025-10-20 18:05:10 <wikibugs> 'ops-codfw, ''SRE, ''DC-Ops: Alert for device ps1-a1-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T407772#11290875 (''phaultfinder)'
2025-10-20 18:06:05 <logmsgbot> !log dzahn@cumin2002 START - Cookbook sre.hosts.reimage for host zuul1001.eqiad.wmnet with OS trixie
2025-10-20 18:06:06 <jinxer-wm> FIRING: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures
2025-10-20 18:06:10 <logmsgbot> !log sukhe@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp5017.eqsin.wmnet
2025-10-20 18:06:11 <wikibugs> 'ops-codfw, ''SRE, ''DC-Ops: Alert for device ps1-a1-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T407772#11290886 (''Jhancock.wm) ''Open→''Resolved a:''Jhancock.wm'
2025-10-20 18:08:14 <wikibugs> ('PS20) ''CDobbins: sre.loadbalancer: modify admin.py to accept 'reboot' action [cookbooks] - ''https://gerrit.wikimedia.org/r/1180137 (https://phabricator.wikimedia.org/T395240)'
2025-10-20 18:11:00 <wikibugs> ('PS21) ''CDobbins: sre.loadbalancer: modify admin.py to accept 'reboot' action [cookbooks] - ''https://gerrit.wikimedia.org/r/1180137 (https://phabricator.wikimedia.org/T395240)'
2025-10-20 18:13:44 <wikibugs> ('PS22) ''CDobbins: sre.loadbalancer: modify admin.py to accept 'reboot' action [cookbooks] - ''https://gerrit.wikimedia.org/r/1180137 (https://phabricator.wikimedia.org/T395240)'
2025-10-20 18:17:31 <logmsgbot> !log dzahn@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on zuul1001.eqiad.wmnet with reason: host reimage
2025-10-20 18:18:20 <wikibugs> 'ops-eqiad, ''SRE, ''DC-Ops: Power Supply - PS Redundancy - issue on wikikube-worker1268:9290 - https://phabricator.wikimedia.org/T407775#11290935 (''VRiley-WMF)'
2025-10-20 18:18:25 <wikibugs> 'ops-eqiad, ''SRE, ''DC-Ops: Power Supply - Status - issue on wikikube-worker1268:9290 - https://phabricator.wikimedia.org/T407774#11290937 (''VRiley-WMF) →''Duplicate dup:''T407775'
2025-10-20 18:18:40 <wikibugs> 'ops-eqiad, ''SRE, ''DC-Ops: Power Supply - PS Redundancy - issue on wikikube-worker1268:9290 - https://phabricator.wikimedia.org/T407775#11290941 (''VRiley-WMF) a:''VRiley-WMF'
2025-10-20 18:19:12 <jinxer-wm> FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
2025-10-20 18:21:06 <jinxer-wm> RESOLVED: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures
2025-10-20 18:21:46 <wikibugs> ('CR) ''CI reject: [V:''-1] sre.loadbalancer: modify admin.py to accept 'reboot' action [cookbooks] - ''https://gerrit.wikimedia.org/r/1180137 (https://phabricator.wikimedia.org/T395240) (owner: ''CDobbins)'
2025-10-20 18:23:12 <wikibugs> ('PS1) ''Cathal Mooney: homer-diff-checker: move execution from cumin1002 to cumin1003 [puppet] - ''https://gerrit.wikimedia.org/r/1197321 (https://phabricator.wikimedia.org/T389380)'
2025-10-20 18:24:05 <logmsgbot> !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on zuul1001.eqiad.wmnet with reason: host reimage
2025-10-20 18:25:09 <wikibugs> ('CR) ''Cathal Mooney: "Riccardo, sorry to put you on this one but you are probably the one who knows best if this is the correct way to do this. I'm guessing it" [puppet] - ''https://gerrit.wikimedia.org/r/1197321 (https://phabricator.wikimedia.org/T389380) (owner: ''Cathal Mooney)'
2025-10-20 18:28:50 <wikibugs> ('PS23) ''CDobbins: sre.loadbalancer: modify admin.py to accept 'reboot' action [cookbooks] - ''https://gerrit.wikimedia.org/r/1180137 (https://phabricator.wikimedia.org/T395240)'
2025-10-20 18:31:06 <jinxer-wm> FIRING: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures
2025-10-20 18:36:09 <logmsgbot> !log sukhe@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp4041.ulsfo.wmnet
2025-10-20 18:41:13 <Kemayo> Okay, I'm back and will do that backport now.
2025-10-20 18:41:41 <wikibugs> ('CR) ''TrainBranchBot: [C:''+2] "Approved by kemayo@deploy2002 using scap backport" [extensions/VisualEditor] (wmf/1.45.0-wmf.23) - ''https://gerrit.wikimedia.org/r/1197295 (https://phabricator.wikimedia.org/T407747) (owner: ''DLynch)'
2025-10-20 18:47:08 <logmsgbot> !log sukhe@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp5026.eqsin.wmnet
2025-10-20 18:49:24 <logmsgbot> !log sukhe@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp5018.eqsin.wmnet
2025-10-20 18:51:30 <wikibugs> ('PS1) ''CDanis: varnish: WMF-Uniq -> Analytics: fix actual frequency bug [puppet] - ''https://gerrit.wikimedia.org/r/1197323 (https://phabricator.wikimedia.org/T407092)'
2025-10-20 18:52:25 <wikibugs> ('Merged) ''jenkins-bot: Edit check: fix some eslint warnings [extensions/VisualEditor] (wmf/1.45.0-wmf.23) - ''https://gerrit.wikimedia.org/r/1197295 (https://phabricator.wikimedia.org/T407747) (owner: ''DLynch)'
2025-10-20 18:52:44 <logmsgbot> !log kemayo@deploy2002 Started scap sync-world: Backport for [[gerrit:1197295|Edit check: fix some eslint warnings (T407747)]]
2025-10-20 18:52:49 <stashbot> T407747: Screen freezes for new editors if no or few references are added - https://phabricator.wikimedia.org/T407747
2025-10-20 18:56:43 <logmsgbot> !log kemayo@deploy2002 kemayo: Backport for [[gerrit:1197295|Edit check: fix some eslint warnings (T407747)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
2025-10-20 18:57:25 <logmsgbot> !log kemayo@deploy2002 kemayo: Continuing with sync
2025-10-20 18:59:33 <wikibugs> ('PS24) ''CDobbins: sre.loadbalancer: modify admin.py to accept 'reboot' action [cookbooks] - ''https://gerrit.wikimedia.org/r/1180137 (https://phabricator.wikimedia.org/T395240)'
2025-10-20 19:01:31 <logmsgbot> !log kemayo@deploy2002 Finished scap sync-world: Backport for [[gerrit:1197295|Edit check: fix some eslint warnings (T407747)]] (duration: 08m 46s)
2025-10-20 19:01:36 <stashbot> T407747: Screen freezes for new editors if no or few references are added - https://phabricator.wikimedia.org/T407747
2025-10-20 19:03:14 <logmsgbot> !log rzl@deploy1003 helmfile [aux-k8s-eqiad] START helmfile.d/services/miscweb: apply
2025-10-20 19:03:39 <wikibugs> ('PS4) ''JHathaway: sre.hardware.upgrade-firmware: improve matching for SSD checks [cookbooks] - ''https://gerrit.wikimedia.org/r/1194969 (https://phabricator.wikimedia.org/T392851) (owner: ''Elukey)'
2025-10-20 19:03:42 <logmsgbot> !log rzl@deploy1003 helmfile [aux-k8s-eqiad] DONE helmfile.d/services/miscweb: apply
2025-10-20 19:05:41 <wikibugs> ('CR) ''JHathaway: sre.hardware.upgrade-firmware: improve matching for SSD checks (''1 comment) [cookbooks] - ''https://gerrit.wikimedia.org/r/1194969 (https://phabricator.wikimedia.org/T392851) (owner: ''Elukey)'
2025-10-20 19:06:10 <wikibugs> ('CR) ''CI reject: [V:''-1] sre.loadbalancer: modify admin.py to accept 'reboot' action [cookbooks] - ''https://gerrit.wikimedia.org/r/1180137 (https://phabricator.wikimedia.org/T395240) (owner: ''CDobbins)'
2025-10-20 19:06:20 <logmsgbot> !log rzl@deploy1003 helmfile [aux-k8s-codfw] START helmfile.d/services/miscweb: apply
2025-10-20 19:06:52 <logmsgbot> !log rzl@deploy1003 helmfile [aux-k8s-codfw] DONE helmfile.d/services/miscweb: apply
2025-10-20 19:08:40 <wikibugs> ('PS25) ''CDobbins: sre.loadbalancer: modify admin.py to accept 'reboot' action [cookbooks] - ''https://gerrit.wikimedia.org/r/1180137 (https://phabricator.wikimedia.org/T395240)'
2025-10-20 19:09:12 <jinxer-wm> FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
2025-10-20 19:13:48 <wikibugs> ('PS2) ''CDanis: varnish: WMF-Uniq -> Analytics: fix actual frequency bug [puppet] - ''https://gerrit.wikimedia.org/r/1197323 (https://phabricator.wikimedia.org/T407092)'
2025-10-20 19:14:38 <wikibugs> ('CR) ''Volans: [C:''+1] "LGTM, the current puppettization will take care of absenting the resource on the old host." [puppet] - ''https://gerrit.wikimedia.org/r/1197321 (https://phabricator.wikimedia.org/T389380) (owner: ''Cathal Mooney)'
2025-10-20 19:15:07 <wikibugs> ('PS26) ''CDobbins: sre.loadbalancer: modify admin.py to accept 'reboot' action [cookbooks] - ''https://gerrit.wikimedia.org/r/1180137 (https://phabricator.wikimedia.org/T395240)'
2025-10-20 19:17:06 <logmsgbot> !log sukhe@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp4042.ulsfo.wmnet
2025-10-20 19:17:19 <wikibugs> ('CR) ''Herron: [V:''+1 C:''+2] thanos-rule: add support for multiple instances [puppet] - ''https://gerrit.wikimedia.org/r/1188441 (https://phabricator.wikimedia.org/T406054) (owner: ''Herron)'
2025-10-20 19:18:53 <wikibugs> ('PS27) ''CDobbins: sre.loadbalancer: modify admin.py to accept 'reboot' action [cookbooks] - ''https://gerrit.wikimedia.org/r/1180137 (https://phabricator.wikimedia.org/T395240)'
2025-10-20 19:22:22 <wikibugs> ('CR) ''Ssingh: [C:''+1] varnish: WMF-Uniq -> Analytics: fix actual frequency bug [puppet] - ''https://gerrit.wikimedia.org/r/1197323 (https://phabricator.wikimedia.org/T407092) (owner: ''CDanis)'
2025-10-20 19:22:52 <wikibugs> ('CR) ''CDanis: [C:''+2] varnish: WMF-Uniq -> Analytics: fix actual frequency bug [puppet] - ''https://gerrit.wikimedia.org/r/1197323 (https://phabricator.wikimedia.org/T407092) (owner: ''CDanis)'
2025-10-20 19:26:06 <jinxer-wm> FIRING: [2x] MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures
2025-10-20 19:30:18 <logmsgbot> !log sukhe@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp5027.eqsin.wmnet
2025-10-20 19:32:46 <logmsgbot> !log sukhe@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp5019.eqsin.wmnet
2025-10-20 19:34:12 <jinxer-wm> FIRING: [2x] SLOMetricAbsent: wdqs-scholarly-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-scholarly-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
2025-10-20 19:36:06 <jinxer-wm> FIRING: [2x] MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures
2025-10-20 19:38:33 <wikibugs> ('PS28) ''CDobbins: sre.loadbalancer: modify admin.py to accept 'reboot' action [cookbooks] - ''https://gerrit.wikimedia.org/r/1180137 (https://phabricator.wikimedia.org/T395240)'
2025-10-20 19:41:23 <wikibugs> ('PS1) ''Herron: ThanosRecordingRuleGaps: update thanos-rule to thanos-rule@main [alerts] - ''https://gerrit.wikimedia.org/r/1197326 (https://phabricator.wikimedia.org/T406054)'
2025-10-20 19:43:47 <wikibugs> ('CR) ''Herron: [C:''+2] ThanosRecordingRuleGaps: update thanos-rule to thanos-rule@main [alerts] - ''https://gerrit.wikimedia.org/r/1197326 (https://phabricator.wikimedia.org/T406054) (owner: ''Herron)'
2025-10-20 19:45:00 <wikibugs> ('CR) ''CI reject: [V:''-1] sre.loadbalancer: modify admin.py to accept 'reboot' action [cookbooks] - ''https://gerrit.wikimedia.org/r/1180137 (https://phabricator.wikimedia.org/T395240) (owner: ''CDobbins)'
2025-10-20 19:45:03 <wikibugs> ('PS29) ''CDobbins: sre.loadbalancer: modify admin.py to accept 'reboot' action [cookbooks] - ''https://gerrit.wikimedia.org/r/1180137 (https://phabricator.wikimedia.org/T395240)'
2025-10-20 19:45:43 <wikibugs> ('Merged) ''jenkins-bot: ThanosRecordingRuleGaps: update thanos-rule to thanos-rule@main [alerts] - ''https://gerrit.wikimedia.org/r/1197326 (https://phabricator.wikimedia.org/T406054) (owner: ''Herron)'
2025-10-20 19:49:12 <jinxer-wm> FIRING: NetworkDeviceAlarmActive: Alarm active on ssw1-f1-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive
2025-10-20 19:54:12 <jinxer-wm> FIRING: KubernetesCalicoDown: wikikube-worker2203.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=wikikube-worker2203.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
2025-10-20 19:56:06 <jinxer-wm> RESOLVED: MediaWikiEditFailures: Elevated MediaWiki edit failures (session_loss) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures
2025-10-20 19:56:11 <logmsgbot> !log eevans@cumin1003 START - Cookbook sre.cassandra.roll-reboot rolling reboot on P{aqs[1014-1022]*} and P{P:Cassandra}
2025-10-20 19:58:14 <logmsgbot> !log sukhe@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp4043.ulsfo.wmnet
2025-10-20 19:59:03 <wikibugs> ('PS1) ''Dzahn: zuul: use wmflib mkdir_p to ensure /var/www/zuul exists [puppet] - ''https://gerrit.wikimedia.org/r/1197327 (https://phabricator.wikimedia.org/T395938)'
2025-10-20 19:59:12 <jinxer-wm> FIRING: ProbeDown: Service aqs1014-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#aqs1014-a:7000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
2025-10-20 19:59:50 <wikibugs> ('CR) ''CI reject: [V:''-1] zuul: use wmflib mkdir_p to ensure /var/www/zuul exists [puppet] - ''https://gerrit.wikimedia.org/r/1197327 (https://phabricator.wikimedia.org/T395938) (owner: ''Dzahn)'
2025-10-20 20:00:05 <jouncebot> RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: #bothumor My software never has bugs. It just develops random features. Rise for UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251020T2000).
2025-10-20 20:00:05 <jouncebot> edsanders: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
2025-10-20 20:02:07 <wikibugs> 'ops-eqiad, ''SRE, ''DC-Ops: Power Supply - PS Redundancy - issue on wikikube-worker1268:9290 - https://phabricator.wikimedia.org/T407775#11291315 (''VRiley-WMF) reseated cable and it came back'
2025-10-20 20:02:16 <wikibugs> 'ops-eqiad, ''SRE, ''DC-Ops: Power Supply - PS Redundancy - issue on wikikube-worker1268:9290 - https://phabricator.wikimedia.org/T407775#11291316 (''VRiley-WMF) ''Open → Resolved'
2025-10-20 20:04:49 <icinga-wm> PROBLEM - Host sretest2001 is DOWN: PING CRITICAL - Packet loss = 100%
2025-10-20 20:06:34 <wikibugs> ('PS2) ''Dzahn: zuul: ensure /var/www exists [puppet] - ''https://gerrit.wikimedia.org/r/1197327 (https://phabricator.wikimedia.org/T395938)'
2025-10-20 20:07:42 <wikibugs> ('CR) ''BCornwall: [V:''+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7313/co" [puppet] - ''https://gerrit.wikimedia.org/r/1194558 (https://phabricator.wikimedia.org/T405931) (owner: ''Krinkle)'
2025-10-20 20:09:04 <Superpes> Hi any deployer available? I scheduled 3 patches for the morning window (also mergeable together), I waited an entire hour, but there was no one active this morning...
2025-10-20 20:09:28 <wikibugs> ('PS30) ''CDobbins: sre.loadbalancer: modify admin.py to accept 'reboot' action [cookbooks] - ''https://gerrit.wikimedia.org/r/1180137 (https://phabricator.wikimedia.org/T395240)'
2025-10-20 20:09:40 <wikibugs> ('CR) ''Dzahn: [C:''+2] zuul: ensure /var/www exists [puppet] - ''https://gerrit.wikimedia.org/r/1197327 (https://phabricator.wikimedia.org/T395938) (owner: ''Dzahn)'
2025-10-20 20:13:18 <logmsgbot> !log sukhe@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp5028.eqsin.wmnet
2025-10-20 20:13:29 <logmsgbot> !log jhathaway@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on sretest2001.codfw.wmnet with reason: T383173
2025-10-20 20:13:33 <stashbot> T383173: Supermicro: UEFI HTTP boot request hangs on cold boot - https://phabricator.wikimedia.org/T383173
2025-10-20 20:13:53 <wikibugs> 'ops-esams, ''DC-Ops, ''Infrastructure-Foundations, ''netops: esams switch oritentation migration - https://phabricator.wikimedia.org/T407794 (''RobH) ''NEW p:''Triage → Medium'
2025-10-20 20:15:03 <icinga-wm> RECOVERY - Host sretest2001 is UP: PING WARNING - Packet loss = 33%, RTA = 30.46 ms
2025-10-20 20:15:07 <wikibugs> ('CR) ''BCornwall: [V:''+1 C:''+2] varnish: Remove unused "Mobile Redirect" logic [puppet] - ''https://gerrit.wikimedia.org/r/1194558 (https://phabricator.wikimedia.org/T405931) (owner: ''Krinkle)'
2025-10-20 20:15:19 <wikibugs> ('CR) ''BCornwall: [V:''+2 C:''+2] "Tests are happy" [puppet] - ''https://gerrit.wikimedia.org/r/1194558 (https://phabricator.wikimedia.org/T405931) (owner: ''Krinkle)'
2025-10-20 20:16:06 <logmsgbot> !log sukhe@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp5020.eqsin.wmnet
2025-10-20 20:19:56 <wikibugs> ('PS2) ''BCornwall: Remove wikimedia_trust ACLs from varnish/haproxy [puppet] - ''https://gerrit.wikimedia.org/r/1192230 (https://phabricator.wikimedia.org/T399688)'
2025-10-20 20:21:25 <wikibugs> ('CR) ''BCornwall: [V:''+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7315/co" [puppet] - ''https://gerrit.wikimedia.org/r/1192230 (https://phabricator.wikimedia.org/T399688) (owner: ''BCornwall)'
2025-10-20 20:22:51 <logmsgbot> !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host zuul1001.eqiad.wmnet with OS trixie
2025-10-20 20:26:22 <wikibugs> ('PS31) ''CDobbins: sre.loadbalancer: modify admin.py to accept 'reboot' action [cookbooks] - ''https://gerrit.wikimedia.org/r/1180137 (https://phabricator.wikimedia.org/T395240)'
2025-10-20 20:29:39 <wikibugs> ('PS1) ''CDanis: varnish: WMF-Uniq -> Analytics: no, really this time [puppet] - ''https://gerrit.wikimedia.org/r/1197331 (https://phabricator.wikimedia.org/T407092)'
2025-10-20 20:33:45 <Superpes> RoanKattouw urbanecm TheresNoTime cjming Sorry for multi-pinging, but are any of you available for deploy? otherwise I won't wait, thanks :)
2025-10-20 20:39:34 <wikibugs> ('PS2) ''CDanis: varnish: WMF-Uniq -> Analytics: no, really this time [puppet] - ''https://gerrit.wikimedia.org/r/1197331 (https://phabricator.wikimedia.org/T407092)'
2025-10-20 20:41:22 <wikibugs> 'ops-esams, ''SRE, ''DC-Ops, ''Infrastructure-Foundations, ''netops: esams switch oritentation migration - https://phabricator.wikimedia.org/T407794#11291474 (''RobH)'
2025-10-20 20:41:47 <logmsgbot> !log sukhe@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp4044.ulsfo.wmnet
2025-10-20 20:43:51 <logmsgbot> !log jhathaway@cumin2002 START - Cookbook sre.hosts.provision for host sretest2001.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART
2025-10-20 20:44:15 <jinxer-wm> FIRING: [2x] NodeTextfileStale: Stale textfile for wdqs2009:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
2025-10-20 20:44:27 <wikibugs> ('PS3) ''CDanis: varnish: WMF-Uniq -> Analytics: no, really this time [puppet] - ''https://gerrit.wikimedia.org/r/1197331 (https://phabricator.wikimedia.org/T407092)'
2025-10-20 20:46:51 <wikibugs> ('CR) ''BBlack: [C:''+1] "easy peasy right? 😊" [puppet] - ''https://gerrit.wikimedia.org/r/1197331 (https://phabricator.wikimedia.org/T407092) (owner: ''CDanis)'
2025-10-20 20:48:16 <wikibugs> ('PS1) ''Dzahn: zookeeper: drop safety check for buster, no more buster [puppet] - ''https://gerrit.wikimedia.org/r/1197334'
2025-10-20 20:51:16 <wikibugs> ('CR) ''CDanis: [C:''+2] varnish: WMF-Uniq -> Analytics: no, really this time [puppet] - ''https://gerrit.wikimedia.org/r/1197331 (https://phabricator.wikimedia.org/T407092) (owner: ''CDanis)'
2025-10-20 20:54:09 <logmsgbot> !log jhathaway@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sretest2001.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART
2025-10-20 20:54:12 <jinxer-wm> FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
2025-10-20 20:55:35 <wikibugs> 'ops-esams, ''SRE, ''DC-Ops, ''Infrastructure-Foundations, ''netops: esams switch orientation migration - https://phabricator.wikimedia.org/T407794#11291516 (''Krinkle)'
2025-10-20 20:56:30 <logmsgbot> !log sukhe@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp5029.eqsin.wmnet
2025-10-20 20:59:17 <logmsgbot> !log sukhe@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp5021.eqsin.wmnet
2025-10-20 20:59:40 <sbassett> Hey all - one security patch to get out today!
2025-10-20 21:00:04 <jouncebot> Reedy, sbassett, Maryum, and manfredi: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Weekly Security deployment window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251020T2100).
2025-10-20 21:03:57 <wikibugs> ('PS1) ''Dzahn: zookeeper: add support for TLS [puppet] - ''https://gerrit.wikimedia.org/r/1197339 (https://phabricator.wikimedia.org/T395938)'
2025-10-20 21:10:19 <sbassett> !log Deployed security fix for T406639
2025-10-20 21:10:21 <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2025-10-20 21:14:00 <wikibugs> ('PS2) ''Dzahn: zookeeper: add support for TLS [puppet] - ''https://gerrit.wikimedia.org/r/1197339 (https://phabricator.wikimedia.org/T395938)'
2025-10-20 21:14:11 <wikibugs> ('PS1) ''Krinkle: varnish: Add test for m.wikisource.org x-dt-host rewrite [puppet] - ''https://gerrit.wikimedia.org/r/1197341 (https://phabricator.wikimedia.org/T405931)'
2025-10-20 21:14:38 <wikibugs> ('CR) ''CI reject: [V:''-1] zookeeper: add support for TLS [puppet] - ''https://gerrit.wikimedia.org/r/1197339 (https://phabricator.wikimedia.org/T395938) (owner: ''Dzahn)'
2025-10-20 21:16:58 <wikibugs> ('PS3) ''Dzahn: zookeeper: add support for TLS [puppet] - ''https://gerrit.wikimedia.org/r/1197339 (https://phabricator.wikimedia.org/T395938)'
2025-10-20 21:22:18 <logmsgbot> !log sukhe@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp4045.ulsfo.wmnet
2025-10-20 21:22:38 <wikibugs> ('PS1) ''Dzahn: zookeeper: replace legacy facts, fix lint warnings [puppet] - ''https://gerrit.wikimedia.org/r/1197342'
2025-10-20 21:28:04 <wikibugs> ('PS2) ''Krinkle: varnish: Add test for m.wikisource.org x-dt-host rewrite [puppet] - ''https://gerrit.wikimedia.org/r/1197341 (https://phabricator.wikimedia.org/T405931)'
2025-10-20 21:28:04 <wikibugs> ('PS1) ''Krinkle: varnish: Simplify m-dot rewrite and fix m.wikipedia.org bug [puppet] - ''https://gerrit.wikimedia.org/r/1197343 (https://phabricator.wikimedia.org/T405931)'
2025-10-20 21:29:12 <jinxer-wm> FIRING: [2x] SystemdUnitFailed: amd-k8s-node-labeller.service on ml-serve1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
2025-10-20 21:31:46 <wikibugs> ('CR) ''Krinkle: varnish: Add test for m.wikisource.org x-dt-host rewrite (''1 comment) [puppet] - ''https://gerrit.wikimedia.org/r/1197341 (https://phabricator.wikimedia.org/T405931) (owner: ''Krinkle)'
2025-10-20 21:32:05 <sukhe> ///
2025-10-20 21:32:08 <sukhe> er
2025-10-20 21:33:41 <wikibugs> ('PS1) ''Clare Ming: Add config for xLab MW Module experiment [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1197344 (https://phabricator.wikimedia.org/T401705)'
2025-10-20 21:34:36 <logmsgbot> !log eevans@cumin1003 END (PASS) - Cookbook sre.cassandra.roll-reboot (exit_code=0) rolling reboot on P{aqs[1014-1022]*} and P{P:Cassandra}
2025-10-20 21:35:10 <wikibugs> ('CR) ''Clare Ming: Add config for xLab MW Module experiment (''1 comment) [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1197344 (https://phabricator.wikimedia.org/T401705) (owner: ''Clare Ming)'
2025-10-20 21:39:12 <jinxer-wm> FIRING: SwitchCoreInterfaceDown: Switch core interface down - ssw1-d1-eqiad:ethernet-1/31 (Transport: ssw1-f1-eqiad:et-0/0/29 (Equinix, 21996480) {#0107202f1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-d1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown
2025-10-20 21:39:21 <logmsgbot> !log sukhe@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp5030.eqsin.wmnet
2025-10-20 21:42:30 <logmsgbot> !log sukhe@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp5022.eqsin.wmnet
2025-10-20 21:52:34 <wikibugs> ('PS1) ''Dzahn: zuul: move zookeeper code from base to main profile [puppet] - ''https://gerrit.wikimedia.org/r/1197349 (https://phabricator.wikimedia.org/T395938)'
2025-10-20 21:52:51 <wikibugs> ('CR) ''CI reject: [V:''-1] zuul: move zookeeper code from base to main profile [puppet] - ''https://gerrit.wikimedia.org/r/1197349 (https://phabricator.wikimedia.org/T395938) (owner: ''Dzahn)'
2025-10-20 21:55:04 <wikibugs> ('PS2) ''Dzahn: zuul: move zookeeper code from base to main profile [puppet] - ''https://gerrit.wikimedia.org/r/1197349 (https://phabricator.wikimedia.org/T395938)'
2025-10-20 21:55:31 <wikibugs> ('CR) ''CI reject: [V:''-1] zuul: move zookeeper code from base to main profile [puppet] - ''https://gerrit.wikimedia.org/r/1197349 (https://phabricator.wikimedia.org/T395938) (owner: ''Dzahn)'
2025-10-20 21:56:56 <logmsgbot> !log eevans@cumin1003 START - Cookbook sre.cassandra.roll-reboot rolling reboot on A:cassandra-dev
2025-10-20 22:01:11 <wikibugs> ('PS2) ''Krinkle: varnish: Simplify m-dot rewrite and fix m.wikipedia.org bug [puppet] - ''https://gerrit.wikimedia.org/r/1197343 (https://phabricator.wikimedia.org/T405931)'
2025-10-20 22:03:08 <wikibugs> ('PS1) ''Krinkle: varnish: Implement enable_m_redir and enable in Beta Cluster [puppet] - ''https://gerrit.wikimedia.org/r/1197351 (https://phabricator.wikimedia.org/T405931)'
2025-10-20 22:03:15 <logmsgbot> !log sukhe@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp4046.ulsfo.wmnet
2025-10-20 22:04:02 <wikibugs> ('CR) ''Aaron Schulz: "Gah, nice fix :)" [deployment-charts] - ''https://gerrit.wikimedia.org/r/1196875 (https://phabricator.wikimedia.org/T397203) (owner: ''Clément Goubert)'
2025-10-20 22:19:12 <jinxer-wm> FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
2025-10-20 22:22:27 <logmsgbot> !log eevans@cumin1003 END (PASS) - Cookbook sre.cassandra.roll-reboot (exit_code=0) rolling reboot on A:cassandra-dev
2025-10-20 22:22:32 <logmsgbot> !log sukhe@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp5031.eqsin.wmnet
2025-10-20 22:25:40 <logmsgbot> !log sukhe@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp5023.eqsin.wmnet
2025-10-20 22:31:08 <wikibugs> ('PS3) ''Dzahn: zuul: move zookeeper code from base to main profile [puppet] - ''https://gerrit.wikimedia.org/r/1197349 (https://phabricator.wikimedia.org/T395938)'
2025-10-20 22:44:12 <logmsgbot> !log sukhe@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp4047.ulsfo.wmnet
2025-10-20 22:58:16 <wikibugs> ('PS1) ''Dzahn: zuul: move ssl_password to new parameter name [labs/private] - ''https://gerrit.wikimedia.org/r/1197355 (https://phabricator.wikimedia.org/T395938)'
2025-10-20 22:58:34 <wikibugs> ('PS2) ''Dzahn: zuul: move ssl_password to new parameter name [labs/private] - ''https://gerrit.wikimedia.org/r/1197355 (https://phabricator.wikimedia.org/T395938)'
2025-10-20 23:00:05 <jouncebot> Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251020T2300)
2025-10-20 23:00:32 <wikibugs> ('CR) ''Dzahn: [V:''+2 C:''+2] zuul: move ssl_password to new parameter name [labs/private] - ''https://gerrit.wikimedia.org/r/1197355 (https://phabricator.wikimedia.org/T395938) (owner: ''Dzahn)'
2025-10-20 23:05:54 <logmsgbot> !log sukhe@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp5032.eqsin.wmnet
2025-10-20 23:05:54 <logmsgbot> !log sukhe@cumin1003 END (PASS) - Cookbook sre.cdn.roll-reboot (exit_code=0) rolling reboot on A:cp-upload_eqsin and A:cp
2025-10-20 23:08:05 <wikibugs> ('CR) ''Dzahn: [V:''+1 C:''+2] "https://puppet-compiler.wmflabs.org/output/1197349/7324/" [puppet] - ''https://gerrit.wikimedia.org/r/1197349 (https://phabricator.wikimedia.org/T395938) (owner: ''Dzahn)'
2025-10-20 23:08:51 <logmsgbot> !log sukhe@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp5024.eqsin.wmnet
2025-10-20 23:08:51 <logmsgbot> !log sukhe@cumin1003 END (PASS) - Cookbook sre.cdn.roll-reboot (exit_code=0) rolling reboot on A:cp-text_eqsin and A:cp
2025-10-20 23:09:15 <jinxer-wm> FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
2025-10-20 23:11:22 <wikibugs> ('PS1) ''Awight: Temporarily revoke ssh access for awight [puppet] - ''https://gerrit.wikimedia.org/r/1197356'
2025-10-20 23:12:03 <wikibugs> ('CR) ''Dzahn: "now or once you tell us?" [puppet] - ''https://gerrit.wikimedia.org/r/1197356 (owner: ''Awight)'
2025-10-20 23:21:26 <wikibugs> ('PS1) ''Dzahn: zuul: still need TLS cert pathes in base class [puppet] - ''https://gerrit.wikimedia.org/r/1197357 (https://phabricator.wikimedia.org/T395938)'
2025-10-20 23:25:31 <wikibugs> ('CR) ''Dzahn: [V:''+1 C:''+2] "https://puppet-compiler.wmflabs.org/output/1197357/7325/" [puppet] - ''https://gerrit.wikimedia.org/r/1197357 (https://phabricator.wikimedia.org/T395938) (owner: ''Dzahn)'
2025-10-20 23:27:11 <logmsgbot> !log sukhe@cumin1003 cookbooks.sre.cdn.roll-reboot finished rebooting cp4048.ulsfo.wmnet
2025-10-20 23:30:08 <wikibugs> ('Abandoned) ''TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - ''https://gerrit.wikimedia.org/r/1197058 (owner: ''TrainBranchBot)'
2025-10-20 23:38:19 <wikibugs> ('PS1) ''TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - ''https://gerrit.wikimedia.org/r/1197364'
2025-10-20 23:38:19 <wikibugs> ('CR) ''TrainBranchBot: [C:''+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - ''https://gerrit.wikimedia.org/r/1197364 (owner: ''TrainBranchBot)'
2025-10-20 23:43:51 <jinxer-wm> FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-eqiad:xe-1/0/1:0 (Transport: cr2-eqord:xe-0/1/5 (Arelion, IC-314533 24ms 10Gbps wave) {#10180823000321:0}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDo
2025-10-20 23:45:39 <jinxer-wm> FIRING: CoreBGPDown: Core BGP session down between cr2-eqord and cr1-eqiad (208.80.154.196) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=eqiad&var-device=cr2-eqord:9804&var-bgp_group=Confed_eqiad&var-bgp_neighbor=cr1-eqiad - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
2025-10-20 23:48:51 <jinxer-wm> FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqiad:xe-1/0/1:0 (Transport: cr2-eqord:xe-0/1/5 (Arelion, IC-314533 24ms 10Gbps wave) {#10180823000321:0}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
2025-10-20 23:49:15 <jinxer-wm> FIRING: NetworkDeviceAlarmActive: Alarm active on ssw1-f1-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive
2025-10-20 23:50:39 <jinxer-wm> FIRING: [4x] CoreBGPDown: Core BGP session down between cr1-eqiad and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
2025-10-20 23:54:05 <wikibugs> ('Merged) ''jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - ''https://gerrit.wikimedia.org/r/1197364 (owner: ''TrainBranchBot)'
2025-10-20 23:54:12 <jinxer-wm> FIRING: KubernetesCalicoDown: wikikube-worker2203.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=wikikube-worker2203.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
2025-10-20 23:57:01 <wikibugs> ('PS1) ''Scott French: shellbox: bump image version [deployment-charts] - ''https://gerrit.wikimedia.org/r/1196771'

This page is generated from SQL logs; static txt files are also available for download.