Wikimedia IRC logs browser - #wikimedia-operations

Displaying 1192 items:

2025-10-30 00:06:02 <wikibugs> (PS1) Arlolra: Deploy Parsoid Read Views to 7 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1199880 (https://phabricator.wikimedia.org/T408765)
2025-10-30 00:08:43 <jinxer-wm> FIRING: [6x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
2025-10-30 00:11:03 <icinga-wm> PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
2025-10-30 00:13:18 <wikibugs> (CR) Scott French: "Thanks, Fabrizio!" [puppet] - https://gerrit.wikimedia.org/r/1199247 (https://phabricator.wikimedia.org/T408060) (owner: Fabfur)
2025-10-30 00:17:36 <wikibugs> SRE, SRE-Access-Requests, LDAP-Access-Requests: Grant Access to wmf LDAP and analytics-privatedata-users shell group for SherryYang-WMF - https://phabricator.wikimedia.org/T408639#11325872 (SherryYang-WMF) requested wmf on IDM I think I can start with level one of analytics-privatedata-users and see...
2025-10-30 00:25:38 <wikibugs> SRE: Migrate from Squid to Varnish - https://phabricator.wikimedia.org/T78911#11325883 (Krinkle)
2025-10-30 00:38:29 <wikibugs> (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1199885
2025-10-30 00:38:29 <wikibugs> (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1199885 (owner: TrainBranchBot)
2025-10-30 00:38:43 <wikibugs> (PS1) Aaron Schulz: Route "/api/rest_v1/" requests with "?spec" query to the rest gateway [puppet] - https://gerrit.wikimedia.org/r/1199886 (https://phabricator.wikimedia.org/T397203)
2025-10-30 00:38:54 <wikibugs> (CR) Superpes15: azwiktionary: use new wordmark and tagline (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1198390 (https://phabricator.wikimedia.org/T408147) (owner: Əkrəm)
2025-10-30 00:39:10 <wikibugs> (CR) Superpes15: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/1198390 (https://phabricator.wikimedia.org/T408147) (owner: Əkrəm)
2025-10-30 00:40:46 <wikibugs> (CR) CI reject: [V:-1] Route "/api/rest_v1/" requests with "?spec" query to the rest gateway [puppet] - https://gerrit.wikimedia.org/r/1199886 (https://phabricator.wikimedia.org/T397203) (owner: Aaron Schulz)
2025-10-30 00:42:57 <wikibugs> (PS2) Aaron Schulz: Route "/api/rest_v1/" requests with "?spec" query to the rest gateway [puppet] - https://gerrit.wikimedia.org/r/1199886 (https://phabricator.wikimedia.org/T397203)
2025-10-30 00:43:48 <wikibugs> (CR) Superpes15: azwiktionary: use new wordmark and tagline (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1198390 (https://phabricator.wikimedia.org/T408147) (owner: Əkrəm)
2025-10-30 00:54:56 <wikibugs> (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1199885 (owner: TrainBranchBot)
2025-10-30 01:00:49 <logmsgbot> !log mwpresync@deploy2002 Started scap build-images: Publishing wmf/next image
2025-10-30 01:02:24 <wikibugs> (PS1) Tim Starling: Enable ChangesListQuery partitioning on mediawikiwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1199890 (https://phabricator.wikimedia.org/T403798)
2025-10-30 01:02:26 <wikibugs> (PS1) Tim Starling: Enable ChangesListQuery partitioning on enwiki and commonswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1199891 (https://phabricator.wikimedia.org/T403798)
2025-10-30 01:02:28 <wikibugs> (PS1) Tim Starling: Enable ChangesListQuery partitioning on all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1199892 (https://phabricator.wikimedia.org/T403798)
2025-10-30 01:04:21 <jinxer-wm> FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
2025-10-30 01:08:18 <wikibugs> (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1199895
2025-10-30 01:08:18 <wikibugs> (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1199895 (owner: TrainBranchBot)
2025-10-30 01:08:44 <jinxer-wm> FIRING: [6x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
2025-10-30 01:11:03 <icinga-wm> RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
2025-10-30 01:14:02 <logmsgbot> !log mwpresync@deploy2002 Finished scap build-images: Publishing wmf/next image (duration: 13m 12s)
2025-10-30 01:31:08 <wikibugs> (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1199895 (owner: TrainBranchBot)
2025-10-30 01:33:43 <jinxer-wm> FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
2025-10-30 01:43:43 <jinxer-wm> FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
2025-10-30 02:14:18 <wikibugs> (CR) RLazarus: [C:+1] Enroll 50% of client sessions in PHP 8.3 [mediawiki-config] - https://gerrit.wikimedia.org/r/1199836 (https://phabricator.wikimedia.org/T405955) (owner: Scott French)
2025-10-30 02:14:33 <wikibugs> (CR) RLazarus: [C:+1] mw-(api-int|jobrunner): serve 25% of traffic on PHP 8.3 [deployment-charts] - https://gerrit.wikimedia.org/r/1199837 (https://phabricator.wikimedia.org/T405955) (owner: Scott French)
2025-10-30 02:29:21 <jinxer-wm> FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
2025-10-30 02:34:15 <wikibugs> SRE, SRE-Access-Requests, LDAP-Access-Requests: Grant Access to wmf LDAP and analytics-privatedata-users shell group for SherryYang-WMF - https://phabricator.wikimedia.org/T408639#11325974 (Dzahn) a:''SherryYang-WMF''None Thank you, sounds good. Will continue with this information.
2025-10-30 03:09:21 <jinxer-wm> FIRING: [2x] JobUnavailable: Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
2025-10-30 03:18:43 <jinxer-wm> FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
2025-10-30 03:34:21 <jinxer-wm> FIRING: [2x] CoreBGPDown: Core BGP session down between cloudsw1-b1-codfw and cloudservices2004-dev (172.20.5.8) - group cloud_host - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
2025-10-30 04:06:45 <jinxer-wm> FIRING: WidespreadPuppetFailure: Puppet has failed in eqsin - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
2025-10-30 04:23:43 <jinxer-wm> FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
2025-10-30 04:28:43 <jinxer-wm> FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
2025-10-30 04:46:45 <jinxer-wm> RESOLVED: WidespreadPuppetFailure: Puppet has failed in eqsin - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
2025-10-30 05:04:21 <jinxer-wm> FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
2025-10-30 05:08:43 <jinxer-wm> FIRING: [4x] JobUnavailable: Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
2025-10-30 05:33:43 <jinxer-wm> FIRING: [4x] JobUnavailable: Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
2025-10-30 05:44:21 <jinxer-wm> FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
2025-10-30 05:44:34 <logmsgbot> !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1165.eqiad.wmnet with reason: Maintenance
2025-10-30 05:44:42 <logmsgbot> !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1015,1019].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
2025-10-30 05:44:50 <logmsgbot> !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1165 (T407997)', diff saved to https://phabricator.wikimedia.org/P84410 and previous config saved to /var/cache/conftool/dbconfig/20251030-054449-marostegui.json
2025-10-30 05:44:55 <stashbot> T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997
2025-10-30 05:47:00 <logmsgbot> !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T407997)', diff saved to https://phabricator.wikimedia.org/P84411 and previous config saved to /var/cache/conftool/dbconfig/20251030-054659-marostegui.json
2025-10-30 05:47:19 <wikibugs> (PS1) Marostegui: db2153: Migration to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1199943 (https://phabricator.wikimedia.org/T407463)
2025-10-30 05:48:13 <wikibugs> (CR) Marostegui: [C:+2] db2153: Migration to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1199943 (https://phabricator.wikimedia.org/T407463) (owner: Marostegui)
2025-10-30 05:49:19 <logmsgbot> !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2153.codfw.wmnet with reason: Maintenance
2025-10-30 05:49:25 <logmsgbot> !log marostegui@cumin1003 dbctl commit (dc=all): 'Depool db2153 for migration to mariadb 10.11', diff saved to https://phabricator.wikimedia.org/P84412 and previous config saved to /var/cache/conftool/dbconfig/20251030-054923-marostegui.json
2025-10-30 05:51:28 <wikibugs> (PS1) Marostegui: installserver: Remove es2048 [puppet] - https://gerrit.wikimedia.org/r/1199945
2025-10-30 05:53:55 <wikibugs> (CR) Marostegui: [C:+2] installserver: Remove es2048 [puppet] - https://gerrit.wikimedia.org/r/1199945 (owner: Marostegui)
2025-10-30 05:57:33 <logmsgbot> !log marostegui@cumin1003 dbctl commit (dc=all): 'db2153 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P84413 and previous config saved to /var/cache/conftool/dbconfig/20251030-055732-root.json
2025-10-30 05:58:15 <wikibugs> (PS1) Marostegui: instances.yaml: Remove es1033 from dbctl [puppet] - https://gerrit.wikimedia.org/r/1199946 (https://phabricator.wikimedia.org/T408772)
2025-10-30 05:58:53 <wikibugs> (CR) Marostegui: [C:+2] instances.yaml: Remove es1033 from dbctl [puppet] - https://gerrit.wikimedia.org/r/1199946 (https://phabricator.wikimedia.org/T408772) (owner: Marostegui)
2025-10-30 06:00:05 <jouncebot> Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251030T0600)
2025-10-30 06:00:05 <jouncebot> marostegui, Amir1, and federico3: Primary database switchover (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251030T0600). Please do the needful.
2025-10-30 06:00:19 <logmsgbot> !log marostegui@cumin1003 dbctl commit (dc=all): 'Remove es1033 from dbctl T408772', diff saved to https://phabricator.wikimedia.org/P84414 and previous config saved to /var/cache/conftool/dbconfig/20251030-060018-marostegui.json
2025-10-30 06:00:24 <stashbot> T408772: decommission es1033.eqiad.wmnet - https://phabricator.wikimedia.org/T408772
2025-10-30 06:00:41 <wikibugs> (PS1) Marostegui: es1033: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1199948 (https://phabricator.wikimedia.org/T408772)
2025-10-30 06:01:16 <wikibugs> (CR) Marostegui: [C:+2] es1033: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1199948 (https://phabricator.wikimedia.org/T408772) (owner: Marostegui)
2025-10-30 06:02:09 <logmsgbot> !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P84415 and previous config saved to /var/cache/conftool/dbconfig/20251030-060208-marostegui.json
2025-10-30 06:12:39 <logmsgbot> !log marostegui@cumin1003 dbctl commit (dc=all): 'db2153 (re)pooling @ 50%: 10', diff saved to https://phabricator.wikimedia.org/P84416 and previous config saved to /var/cache/conftool/dbconfig/20251030-061238-root.json
2025-10-30 06:15:11 <logmsgbot> !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host es1033.eqiad.wmnet with OS trixie
2025-10-30 06:17:16 <logmsgbot> !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P84417 and previous config saved to /var/cache/conftool/dbconfig/20251030-061715-marostegui.json
2025-10-30 06:22:37 <icinga-wm> PROBLEM - Kafka MirrorMaker main-codfw_to_main-eqiad max lag in last 10 minutes on alert1002 is CRITICAL: 1.078e+05 gt 1e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=codfw+prometheus/ops&var-mirror_name=main-codfw_to_main-eqiad
2025-10-30 06:27:45 <logmsgbot> !log marostegui@cumin1003 dbctl commit (dc=all): 'db2153 (re)pooling @ 75%: 10', diff saved to https://phabricator.wikimedia.org/P84418 and previous config saved to /var/cache/conftool/dbconfig/20251030-062744-root.json
2025-10-30 06:29:21 <jinxer-wm> FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
2025-10-30 06:32:24 <logmsgbot> !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T407997)', diff saved to https://phabricator.wikimedia.org/P84419 and previous config saved to /var/cache/conftool/dbconfig/20251030-063223-marostegui.json
2025-10-30 06:32:29 <stashbot> T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997
2025-10-30 06:32:40 <logmsgbot> !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1168.eqiad.wmnet with reason: Maintenance
2025-10-30 06:32:48 <logmsgbot> !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1168 (T407997)', diff saved to https://phabricator.wikimedia.org/P84420 and previous config saved to /var/cache/conftool/dbconfig/20251030-063247-marostegui.json
2025-10-30 06:34:58 <logmsgbot> !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T407997)', diff saved to https://phabricator.wikimedia.org/P84421 and previous config saved to /var/cache/conftool/dbconfig/20251030-063457-marostegui.json
2025-10-30 06:42:51 <logmsgbot> !log marostegui@cumin1003 dbctl commit (dc=all): 'db2153 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P84422 and previous config saved to /var/cache/conftool/dbconfig/20251030-064250-root.json
2025-10-30 06:50:05 <logmsgbot> !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P84423 and previous config saved to /var/cache/conftool/dbconfig/20251030-065004-marostegui.json
2025-10-30 06:50:18 <logmsgbot> !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on es1033.eqiad.wmnet with reason: host reimage
2025-10-30 06:54:00 <logmsgbot> !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es1033.eqiad.wmnet with reason: host reimage
2025-10-30 07:00:05 <jouncebot> Amir1, Urbanecm, and awight: How many deployers does it take to do UTC morning backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251030T0700).
2025-10-30 07:00:05 <jouncebot> No Gerrit patches in the queue for this window AFAICS.
2025-10-30 07:05:13 <logmsgbot> !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P84424 and previous config saved to /var/cache/conftool/dbconfig/20251030-070512-marostegui.json
2025-10-30 07:10:08 <wikibugs> ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025.10.17 - 2025.11.07): Degraded RAID on an-worker1203 - https://phabricator.wikimedia.org/T408359#11326172 (Jclark-ctr) Replacement drive has arrived @btullis
2025-10-30 07:15:36 <icinga-wm> RECOVERY - Kafka MirrorMaker main-codfw_to_main-eqiad max lag in last 10 minutes on alert1002 is OK: (C)1e+05 gt (W)1e+04 gt 9420 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=codfw+prometheus/ops&var-mirror_name=main-codfw_to_main-eqiad
2025-10-30 07:18:43 <jinxer-wm> FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
2025-10-30 07:20:21 <logmsgbot> !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T407997)', diff saved to https://phabricator.wikimedia.org/P84425 and previous config saved to /var/cache/conftool/dbconfig/20251030-072020-marostegui.json
2025-10-30 07:20:26 <stashbot> T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997
2025-10-30 07:20:37 <logmsgbot> !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1173.eqiad.wmnet with reason: Maintenance
2025-10-30 07:20:44 <logmsgbot> !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1173 (T407997)', diff saved to https://phabricator.wikimedia.org/P84426 and previous config saved to /var/cache/conftool/dbconfig/20251030-072043-marostegui.json
2025-10-30 07:22:54 <logmsgbot> !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1173 (T407997)', diff saved to https://phabricator.wikimedia.org/P84427 and previous config saved to /var/cache/conftool/dbconfig/20251030-072253-marostegui.json
2025-10-30 07:33:17 <jinxer-wm> FIRING: ProbeDown: Service wdqs2021:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2021:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
2025-10-30 07:34:21 <jinxer-wm> FIRING: [2x] CoreBGPDown: Core BGP session down between cloudsw1-b1-codfw and cloudservices2004-dev (172.20.5.8) - group cloud_host - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
2025-10-30 07:38:02 <logmsgbot> !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1173', diff saved to https://phabricator.wikimedia.org/P84428 and previous config saved to /var/cache/conftool/dbconfig/20251030-073801-marostegui.json
2025-10-30 07:38:17 <jinxer-wm> FIRING: [10x] ProbeDown: Service wdqs2008:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
2025-10-30 07:38:32 <jinxer-wm> FIRING: [10x] ProbeDown: Service wdqs2008:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
2025-10-30 07:38:36 <wikibugs> (PS8) Stevemunene: Deploy airflow images from airflow-dags repository build [deployment-charts] - https://gerrit.wikimedia.org/r/1199819 (https://phabricator.wikimedia.org/T408711)
2025-10-30 07:41:56 <wikibugs> SRE, SRE-Access-Requests: Add dpogorzelski to ML and Data Platform posix groups - https://phabricator.wikimedia.org/T408579#11326218 (elukey) Hi Daniel! I think full access since the kerberos identity was requested :)
2025-10-30 07:43:17 <jinxer-wm> FIRING: [18x] ProbeDown: Service wdqs2008:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
2025-10-30 07:43:24 <wikibugs> SRE, SRE-Access-Requests, Machine-Learning-Team: Promote dpogorzelski from ops-limited to ops - https://phabricator.wikimedia.org/T408702#11326220 (elukey) Correct this needs an approval from Mark afaik :) @mark Hi! Looping you in to approve the ops membership for Dawid (new Staff SRE in ML).
2025-10-30 07:46:43 <jinxer-wm> FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS
2025-10-30 07:48:17 <jinxer-wm> FIRING: [18x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
2025-10-30 07:48:44 <jinxer-wm> FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
2025-10-30 07:50:15 <gehel> brouberol, stevemunene : could you have a look at the WDQS elevated max lag ? Ping David or Gabriele if needed.
2025-10-30 07:51:43 <jinxer-wm> RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS
2025-10-30 07:52:02 <jinxer-wm> FIRING: [3x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2008:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
2025-10-30 07:53:09 <logmsgbot> !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1173', diff saved to https://phabricator.wikimedia.org/P84429 and previous config saved to /var/cache/conftool/dbconfig/20251030-075308-marostegui.json
2025-10-30 07:53:17 <jinxer-wm> FIRING: [20x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
2025-10-30 07:53:34 <logmsgbot> !log marostegui@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host es1033.eqiad.wmnet with OS trixie
2025-10-30 07:53:43 <jinxer-wm> FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
2025-10-30 07:54:09 <logmsgbot> !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host es1033.eqiad.wmnet with OS trixie
2025-10-30 07:54:51 <wikibugs> SRE, SRE-swift-storage, Infrastructure-Foundations: Key packages missing from trixie-wikimedia - https://phabricator.wikimedia.org/T407513#11326229 (Marostegui)
2025-10-30 07:57:02 <jinxer-wm> FIRING: [4x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2008:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
2025-10-30 07:58:43 <jinxer-wm> FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
2025-10-30 08:03:17 <jinxer-wm> FIRING: [20x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
2025-10-30 08:04:45 <stevemunene> Ack gehel , though it seems to have resolved. Following up on any extra steps that might be needed
2025-10-30 08:05:34 <gehel> stevemunene: there is a discussion on slack. More context there.
2025-10-30 08:07:31 <wikibugs> (CR) Slyngshede: [C:-1] "We should add tests, like so:" [alerts] - https://gerrit.wikimedia.org/r/1199809 (https://phabricator.wikimedia.org/T382902) (owner: Muehlenhoff)
2025-10-30 08:08:17 <logmsgbot> !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1173 (T407997)', diff saved to https://phabricator.wikimedia.org/P84430 and previous config saved to /var/cache/conftool/dbconfig/20251030-080816-marostegui.json
2025-10-30 08:08:17 <jinxer-wm> FIRING: [18x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
2025-10-30 08:08:22 <stashbot> T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997
2025-10-30 08:08:33 <logmsgbot> !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1180.eqiad.wmnet with reason: Maintenance
2025-10-30 08:08:41 <logmsgbot> !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1180 (T407997)', diff saved to https://phabricator.wikimedia.org/P84431 and previous config saved to /var/cache/conftool/dbconfig/20251030-080840-marostegui.json
2025-10-30 08:08:43 <jinxer-wm> FIRING: SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
2025-10-30 08:10:29 <logmsgbot> !log elukey@puppetserver1001 conftool action : set/pooled=true; selector: dnsdisc=kartotherian,name=codfw
2025-10-30 08:10:51 <logmsgbot> !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T407997)', diff saved to https://phabricator.wikimedia.org/P84432 and previous config saved to /var/cache/conftool/dbconfig/20251030-081050-marostegui.json
2025-10-30 08:12:02 <jinxer-wm> FIRING: [6x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2008:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
2025-10-30 08:12:53 <wikibugs> SRE, SRE-Unowned, Maps, Patch-For-Review: Move maps servers to Bookworm - https://phabricator.wikimedia.org/T381565#11326255 (elukey) Repooled codfw after the eqiad-only test, I think we are good! We'll wait a couple more days to be sure, but from next week we should start decomming the old har...
2025-10-30 08:13:17 <jinxer-wm> FIRING: [20x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
2025-10-30 08:15:25 <icinga-wm> PROBLEM - Blazegraph Port for wdqs-blazegraph on wdqs2013 is CRITICAL: connect to address 127.0.0.1 and port 9999: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
2025-10-30 08:15:43 <jinxer-wm> FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS
2025-10-30 08:16:23 <icinga-wm> RECOVERY - Blazegraph Port for wdqs-blazegraph on wdqs2013 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9999 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
2025-10-30 08:17:02 <jinxer-wm> FIRING: [7x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2008:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
2025-10-30 08:18:17 <jinxer-wm> FIRING: [20x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
2025-10-30 08:18:31 <logmsgbot> !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on es1033.eqiad.wmnet with reason: host reimage
2025-10-30 08:18:43 <jinxer-wm> RESOLVED: SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
2025-10-30 08:20:43 <jinxer-wm> RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS
2025-10-30 08:22:02 <jinxer-wm> FIRING: [8x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2008:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
2025-10-30 08:22:33 <logmsgbot> !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1155.eqiad.wmnet with reason: Upgrade
2025-10-30 08:23:17 <jinxer-wm> FIRING: [18x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
2025-10-30 08:23:38 <logmsgbot> !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1015,1019].eqiad.wmnet with reason: Fixing triggers
2025-10-30 08:23:41 <logmsgbot> !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es1033.eqiad.wmnet with reason: host reimage
2025-10-30 08:25:59 <logmsgbot> !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P84433 and previous config saved to /var/cache/conftool/dbconfig/20251030-082558-marostegui.json
2025-10-30 08:27:03 <wikibugs> ('CR) ''Stevemunene: Deploy airflow images from airflow-dags repository build (''1 comment) [deployment-charts] - ''https://gerrit.wikimedia.org/r/1199819 (https://phabricator.wikimedia.org/T408711) (owner: ''Stevemunene)'
2025-10-30 08:27:09 <wikibugs> ('CR) ''Jcrespo: [C:''+1] "Yes, we don't actively backup this host (only every 5 years). Although we should migrate the backup user grants." [puppet] - ''https://gerrit.wikimedia.org/r/1199541 (https://phabricator.wikimedia.org/T408662) (owner: ''Marostegui)'
2025-10-30 08:28:17 <jinxer-wm> FIRING: [20x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
2025-10-30 08:28:30 <wikibugs> ('CR) ''Marostegui: "The host was cloned from the existing one, so if they were there, they should be on the new host too" [puppet] - ''https://gerrit.wikimedia.org/r/1199541 (https://phabricator.wikimedia.org/T408662) (owner: ''Marostegui)'
2025-10-30 08:28:31 <wikibugs> ('CR) ''Marostegui: [C:''+2] backup1013.cnf.erb: Change es1032 with es1055 [puppet] - ''https://gerrit.wikimedia.org/r/1199541 (https://phabricator.wikimedia.org/T408662) (owner: ''Marostegui)'
2025-10-30 08:30:59 <wikibugs> ('PS8) ''Fabfur: P:cache:haproxy: introduce ua classes [puppet] - ''https://gerrit.wikimedia.org/r/1199247 (https://phabricator.wikimedia.org/T408060)'
2025-10-30 08:32:02 <jinxer-wm> FIRING: [9x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2008:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
2025-10-30 08:32:03 <wikibugs> ('CR) ''Fabfur: P:cache:haproxy: introduce ua classes (''2 comments) [puppet] - ''https://gerrit.wikimedia.org/r/1199247 (https://phabricator.wikimedia.org/T408060) (owner: ''Fabfur)'
2025-10-30 08:32:54 <wikibugs> ('CR) ''Elukey: [C:''+2] conftool: upgrade to 6.x and above [software/spicerack] - ''https://gerrit.wikimedia.org/r/1199723 (owner: ''Giuseppe Lavagetto)'
2025-10-30 08:33:44 <jinxer-wm> FIRING: SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
2025-10-30 08:39:43 <jinxer-wm> FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS
2025-10-30 08:41:06 <logmsgbot> !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P84434 and previous config saved to /var/cache/conftool/dbconfig/20251030-084105-marostegui.json
2025-10-30 08:43:44 <jinxer-wm> RESOLVED: SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
2025-10-30 08:47:02 <jinxer-wm> FIRING: [9x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2008:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
2025-10-30 08:48:17 <jinxer-wm> RESOLVED: [20x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
2025-10-30 08:52:02 <jinxer-wm> RESOLVED: [9x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2008:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
2025-10-30 08:54:43 <jinxer-wm> RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS
2025-10-30 08:56:02 <jinxer-wm> FIRING: [6x] ProbeDown: Service wdqs2010:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
2025-10-30 08:56:08 <wikibugs> ('CR) ''ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, October 30 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [core] (wmf/1.45.0-wmf.25) - ''https://gerrit.wikimedia.org/r/1199854 (https://phabricator.wikimedia.org/T406170) (owner: ''D3r1ck01)'
2025-10-30 08:56:14 <logmsgbot> !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T407997)', diff saved to https://phabricator.wikimedia.org/P84435 and previous config saved to /var/cache/conftool/dbconfig/20251030-085613-marostegui.json
2025-10-30 08:56:17 <jinxer-wm> FIRING: [12x] ProbeDown: Service wdqs2010:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
2025-10-30 08:56:19 <stashbot> T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997
2025-10-30 08:56:26 <wikibugs> ('CR) ''ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, October 30 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [core] (wmf/1.45.0-wmf.25) - ''https://gerrit.wikimedia.org/r/1199856 (https://phabricator.wikimedia.org/T406170) (owner: ''D3r1ck01)'
2025-10-30 08:56:29 <logmsgbot> !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1187.eqiad.wmnet with reason: Maintenance
2025-10-30 08:56:37 <logmsgbot> !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1187 (T407997)', diff saved to https://phabricator.wikimedia.org/P84436 and previous config saved to /var/cache/conftool/dbconfig/20251030-085636-marostegui.json
2025-10-30 08:58:47 <logmsgbot> !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1187 (T407997)', diff saved to https://phabricator.wikimedia.org/P84437 and previous config saved to /var/cache/conftool/dbconfig/20251030-085846-marostegui.json
2025-10-30 08:59:47 <jinxer-wm> FIRING: [18x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
2025-10-30 09:03:38 <wikibugs> ('PS1) ''Slyngshede: data.yaml add tracking for sherryyang [puppet] - ''https://gerrit.wikimedia.org/r/1199993'
2025-10-30 09:04:21 <jinxer-wm> FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
2025-10-30 09:06:17 <jinxer-wm> FIRING: [10x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
2025-10-30 09:08:43 <jinxer-wm> FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS
2025-10-30 09:08:47 <wikibugs> 'SRE: offline rackspace wikitech-static, online aws wikitech-static - https://phabricator.wikimedia.org/T408704#11326433 (''LSobanski) cc @akosiaris'
2025-10-30 09:10:02 <jinxer-wm> FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
2025-10-30 09:13:43 <jinxer-wm> RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS
2025-10-30 09:13:54 <logmsgbot> !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1187', diff saved to https://phabricator.wikimedia.org/P84438 and previous config saved to /var/cache/conftool/dbconfig/20251030-091354-marostegui.json
2025-10-30 09:14:24 <wikibugs> ('CR) ''Muehlenhoff: [C:''+1] "Looks good" [puppet] - ''https://gerrit.wikimedia.org/r/1199993 (owner: ''Slyngshede)'
2025-10-30 09:15:02 <jinxer-wm> FIRING: [3x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
2025-10-30 09:15:43 <jinxer-wm> FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS
2025-10-30 09:18:54 <wikibugs> ('PS1) ''Daniel Kinzler: Fix handling of per-route ratelimit config [deployment-charts] - ''https://gerrit.wikimedia.org/r/1199999'
2025-10-30 09:19:08 <wikibugs> ('CR) ''Slyngshede: [C:''+2] data.yaml add tracking for sherryyang [puppet] - ''https://gerrit.wikimedia.org/r/1199993 (owner: ''Slyngshede)'
2025-10-30 09:19:26 <wikibugs> 'SRE, ''SRE-Unowned, ''Maps, ''Patch-For-Review: Move maps servers to Bookworm - https://phabricator.wikimedia.org/T381565#11326454 (''elukey)'
2025-10-30 09:20:02 <jinxer-wm> FIRING: [3x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
2025-10-30 09:20:26 <wikibugs> ('CR) ''Brouberol: Deploy airflow images from airflow-dags repository build (''1 comment) [deployment-charts] - ''https://gerrit.wikimedia.org/r/1199819 (https://phabricator.wikimedia.org/T408711) (owner: ''Stevemunene)'
2025-10-30 09:20:50 <wikibugs> ('CR) ''Brouberol: "Also, as a general point, please render locally to ferret these issues earlier." [deployment-charts] - ''https://gerrit.wikimedia.org/r/1199819 (https://phabricator.wikimedia.org/T408711) (owner: ''Stevemunene)'
2025-10-30 09:25:02 <jinxer-wm> FIRING: [4x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
2025-10-30 09:25:11 <wikibugs> ('CR) ''Tiziano Fogli: [C:''+2] nrpe2nodexp: use service description as alertname [puppet] - ''https://gerrit.wikimedia.org/r/1199242 (https://phabricator.wikimedia.org/T395446) (owner: ''Tiziano Fogli)'
2025-10-30 09:25:45 <wikibugs> ('CR) ''Brouberol: "Something like" [deployment-charts] - ''https://gerrit.wikimedia.org/r/1199819 (https://phabricator.wikimedia.org/T408711) (owner: ''Stevemunene)'
2025-10-30 09:28:36 <wikibugs> ('CR) ''Brouberol: "You can also run `rake run_locally` in your `deployment-charts` directory to run the CI job locally." [deployment-charts] - ''https://gerrit.wikimedia.org/r/1199819 (https://phabricator.wikimedia.org/T408711) (owner: ''Stevemunene)'
2025-10-30 09:29:02 <logmsgbot> !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1187', diff saved to https://phabricator.wikimedia.org/P84439 and previous config saved to /var/cache/conftool/dbconfig/20251030-092901-marostegui.json
2025-10-30 09:30:21 <icinga-wm> PROBLEM - PyBal IPVS diff check on lvs2014 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal
2025-10-30 09:34:21 <jinxer-wm> FIRING: [2x] JobUnavailable: Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
2025-10-30 09:35:43 <jinxer-wm> RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS
2025-10-30 09:38:44 <jinxer-wm> FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
2025-10-30 09:40:21 <icinga-wm> RECOVERY - PyBal IPVS diff check on lvs2014 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
2025-10-30 09:41:43 <wikibugs> ('CR) ''Majavah: [C:''+2] aptrepo: Retire kubeadm/1.29 components [puppet] - ''https://gerrit.wikimedia.org/r/1199240 (owner: ''Majavah)'
2025-10-30 09:41:50 <wikibugs> ('CR) ''Majavah: [C:''+2] aptrepo: Import Kubeadm/1.31 packages [puppet] - ''https://gerrit.wikimedia.org/r/1199241 (https://phabricator.wikimedia.org/T372697) (owner: ''Majavah)'
2025-10-30 09:42:13 <wikibugs> ('PS1) ''JMeybohm: admin: Replace my ssh key with a FIDO token [puppet] - ''https://gerrit.wikimedia.org/r/1200008'
2025-10-30 09:42:13 <jinxer-wm> FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS
2025-10-30 09:42:58 <wikibugs> ('PS2) ''Daniel Kinzler: Fix handling of per-route ratelimit config [deployment-charts] - ''https://gerrit.wikimedia.org/r/1199999'
2025-10-30 09:43:15 <wikibugs> ('PS1) ''D3r1ck01: Stats: add getLabels() function [core] (wmf/1.45.0-wmf.24) - ''https://gerrit.wikimedia.org/r/1200009 (https://phabricator.wikimedia.org/T406170)'
2025-10-30 09:44:10 <logmsgbot> !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1187 (T407997)', diff saved to https://phabricator.wikimedia.org/P84440 and previous config saved to /var/cache/conftool/dbconfig/20251030-094409-marostegui.json
2025-10-30 09:44:15 <stashbot> T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997
2025-10-30 09:44:21 <jinxer-wm> FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
2025-10-30 09:44:26 <logmsgbot> !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1225.eqiad.wmnet with reason: Maintenance
2025-10-30 09:44:54 <wikibugs> ('Abandoned) ''D3r1ck01: Stats: add getLabels() function [core] (wmf/1.45.0-wmf.24) - ''https://gerrit.wikimedia.org/r/1200009 (https://phabricator.wikimedia.org/T406170) (owner: ''D3r1ck01)'
2025-10-30 09:47:13 <jinxer-wm> RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS
2025-10-30 09:47:36 <wikibugs> ('PS1) ''Stevemunene: Add an opensearch-test-codfw namespace [puppet] - ''https://gerrit.wikimedia.org/r/1200010 (https://phabricator.wikimedia.org/T408779)'
2025-10-30 09:48:31 <wikibugs> ('PS1) ''Majavah: aptrepo: Remove previously-missed reference to kubeadm 1.29 [puppet] - ''https://gerrit.wikimedia.org/r/1200011'
2025-10-30 09:48:43 <jinxer-wm> FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS
2025-10-30 09:48:47 <logmsgbot> !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1231.eqiad.wmnet with reason: Maintenance
2025-10-30 09:48:55 <logmsgbot> !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1231 (T407997)', diff saved to https://phabricator.wikimedia.org/P84441 and previous config saved to /var/cache/conftool/dbconfig/20251030-094854-marostegui.json
2025-10-30 09:49:13 <wikibugs> ('CR) ''Majavah: [C:''+2] aptrepo: Remove previously-missed reference to kubeadm 1.29 [puppet] - ''https://gerrit.wikimedia.org/r/1200011 (owner: ''Majavah)'
2025-10-30 09:50:30 <moritzm> !log import prometheus-statsd-exporter to trixie-wikimedia T407513
2025-10-30 09:50:35 <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2025-10-30 09:50:36 <stashbot> T407513: Key packages missing from trixie-wikimedia - https://phabricator.wikimedia.org/T407513
2025-10-30 09:51:04 <logmsgbot> !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1231 (T407997)', diff saved to https://phabricator.wikimedia.org/P84442 and previous config saved to /var/cache/conftool/dbconfig/20251030-095103-marostegui.json
2025-10-30 09:51:09 <stashbot> T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997
2025-10-30 09:51:32 <wikibugs> ('PS3) ''Daniel Kinzler: Fix handling of per-route ratelimit config [deployment-charts] - ''https://gerrit.wikimedia.org/r/1199999'
2025-10-30 09:53:04 <wikibugs> ('PS4) ''Daniel Kinzler: Fix handling of per-route ratelimit config [deployment-charts] - ''https://gerrit.wikimedia.org/r/1199999'
2025-10-30 09:54:28 <wikibugs> ('CR) ''Muehlenhoff: [C:''+1] "Looks good, key has been verified out of band" [puppet] - ''https://gerrit.wikimedia.org/r/1200008 (owner: ''JMeybohm)'
2025-10-30 09:55:06 <wikibugs> ('CR) ''JMeybohm: [C:''+2] admin: Replace my ssh key with a FIDO token [puppet] - ''https://gerrit.wikimedia.org/r/1200008 (owner: ''JMeybohm)'
2025-10-30 09:56:25 <icinga-wm> PROBLEM - PyBal IPVS diff check on lvs2013 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal
2025-10-30 10:00:05 <jouncebot> Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251030T1000)
2025-10-30 10:01:25 <icinga-wm> RECOVERY - PyBal IPVS diff check on lvs2013 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
2025-10-30 10:03:43 <jinxer-wm> RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS
2025-10-30 10:05:43 <jinxer-wm> FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS
2025-10-30 10:05:45 <wikibugs> ('CR) ''Elukey: "Left some comments to better understand the code!" [cookbooks] - ''https://gerrit.wikimedia.org/r/1198108 (https://phabricator.wikimedia.org/T265342) (owner: ''Cathal Mooney)'
2025-10-30 10:06:12 <logmsgbot> !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1231', diff saved to https://phabricator.wikimedia.org/P84443 and previous config saved to /var/cache/conftool/dbconfig/20251030-100611-marostegui.json
2025-10-30 10:08:44 <jinxer-wm> RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
2025-10-30 10:10:43 <jinxer-wm> RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS
2025-10-30 10:12:43 <jinxer-wm> FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS
2025-10-30 10:12:52 <wikibugs> ('CR) ''Cathal Mooney: sre.hosts.provision: move the switch config to parent class and run (''3 comments) [cookbooks] - ''https://gerrit.wikimedia.org/r/1198108 (https://phabricator.wikimedia.org/T265342) (owner: ''Cathal Mooney)'
2025-10-30 10:13:43 <jinxer-wm> FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
2025-10-30 10:14:06 <wikibugs> ('PS1) ''Tiziano Fogli: haproxy: enable nrpe2nodexp wrapper on check-cinder-snapshot-leaks [puppet] - ''https://gerrit.wikimedia.org/r/1200012 (https://phabricator.wikimedia.org/T328502)'
2025-10-30 10:14:06 <wikibugs> ('CR) ''Tiziano Fogli: "This change enables the nrpe2nodexp wrapper to export NRPE plugin results to Prometheus via the node exporter." [puppet] - ''https://gerrit.wikimedia.org/r/1200012 (https://phabricator.wikimedia.org/T328502) (owner: ''Tiziano Fogli)'
2025-10-30 10:14:25 <icinga-wm> PROBLEM - PyBal IPVS diff check on lvs2013 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal
2025-10-30 10:14:45 <wikibugs> ('PS2) ''Tiziano Fogli: cinder: enable nrpe2nodexp wrapper on check-cinder-snapshot-leaks [puppet] - ''https://gerrit.wikimedia.org/r/1200012 (https://phabricator.wikimedia.org/T328502)'
2025-10-30 10:14:47 <jinxer-wm> FIRING: [10x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
2025-10-30 10:17:11 <wikibugs> ('PS5) ''Daniel Kinzler: Fix handling of per-route ratelimit config [deployment-charts] - ''https://gerrit.wikimedia.org/r/1199999'
2025-10-30 10:18:43 <jinxer-wm> RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
2025-10-30 10:19:08 <wikibugs> ('PS1) ''Tiziano Fogli: neutron: enable nrpe2nodexp wrapper on check-neutron-conntrack [puppet] - ''https://gerrit.wikimedia.org/r/1200016 (https://phabricator.wikimedia.org/T328502)'
2025-10-30 10:19:25 <icinga-wm> RECOVERY - PyBal IPVS diff check on lvs2013 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
2025-10-30 10:21:12 <wikibugs> ('PS6) ''Daniel Kinzler: Fix handling of per-route ratelimit config [deployment-charts] - ''https://gerrit.wikimedia.org/r/1199999'
2025-10-30 10:21:19 <logmsgbot> !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1231', diff saved to https://phabricator.wikimedia.org/P84444 and previous config saved to /var/cache/conftool/dbconfig/20251030-102118-marostegui.json
2025-10-30 10:21:59 <wikibugs> 'SRE-Access-Requests: Posix group membership: dpogorzelski ->ml-lab-users - https://phabricator.wikimedia.org/T408788 (''DPogorzelski-WMF) ''NEW'
2025-10-30 10:22:29 <wikibugs> ('PS1) ''Dpogorzelski: topic: add dpogorzelski to ml-lab-users [puppet] - ''https://gerrit.wikimedia.org/r/1200017 (https://phabricator.wikimedia.org/T408788)'
2025-10-30 10:22:35 <wikibugs> ('CR) ''CI reject: [V:''-1] Fix handling of per-route ratelimit config [deployment-charts] - ''https://gerrit.wikimedia.org/r/1199999 (owner: ''Daniel Kinzler)'
2025-10-30 10:23:09 <wikibugs> ('PS1) ''Tiziano Fogli: nova: enable nrpe2nodexp wrapper on check-flavor_aggregates [puppet] - ''https://gerrit.wikimedia.org/r/1200018 (https://phabricator.wikimedia.org/T328502)'
2025-10-30 10:24:01 <wikibugs> ('PS7) ''Daniel Kinzler: Fix handling of per-route ratelimit config [deployment-charts] - ''https://gerrit.wikimedia.org/r/1199999'
2025-10-30 10:24:47 <jinxer-wm> FIRING: [10x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
2025-10-30 10:25:00 <wikibugs> ('CR) ''Filippo Giunchedi: [C:''+1] "LGTM" [puppet] - ''https://gerrit.wikimedia.org/r/1200012 (https://phabricator.wikimedia.org/T328502) (owner: ''Tiziano Fogli)'
2025-10-30 10:26:12 <wikibugs> ('PS9) ''Stevemunene: Deploy airflow images from airflow-dags repository build [deployment-charts] - ''https://gerrit.wikimedia.org/r/1199819 (https://phabricator.wikimedia.org/T408711)'
2025-10-30 10:27:14 <wikibugs> ('CR) ''Stevemunene: "Thanks, using this for now and bookmarked the other helpful tips" [deployment-charts] - ''https://gerrit.wikimedia.org/r/1199819 (https://phabricator.wikimedia.org/T408711) (owner: ''Stevemunene)'
2025-10-30 10:27:43 <jinxer-wm> RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS
2025-10-30 10:28:13 <jinxer-wm> FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS
2025-10-30 10:28:50 <logmsgbot> !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host es1033.eqiad.wmnet with OS trixie
2025-10-30 10:29:21 <jinxer-wm> FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
2025-10-30 10:33:53 <wikibugs> ('Abandoned) ''Dpogorzelski: topic: add dpogorzelski to ml-lab-users [puppet] - ''https://gerrit.wikimedia.org/r/1200017 (https://phabricator.wikimedia.org/T408788) (owner: ''Dpogorzelski)'
2025-10-30 10:34:46 <wikibugs> 'SRE, ''SRE-Access-Requests, ''Patch-For-Review: Posix group membership: dpogorzelski ->ml-lab-users - https://phabricator.wikimedia.org/T408788#11326666 (''DPogorzelski-WMF) ''Open''Invalid'
2025-10-30 10:36:28 <logmsgbot> !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1231 (T407997)', diff saved to https://phabricator.wikimedia.org/P84445 and previous config saved to /var/cache/conftool/dbconfig/20251030-103626-marostegui.json
2025-10-30 10:36:33 <stashbot> T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997
2025-10-30 10:36:44 <logmsgbot> !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on dbstore1009.eqiad.wmnet with reason: Maintenance
2025-10-30 10:39:40 <wikibugs> ('PS8) ''Daniel Kinzler: Fix handling of per-route ratelimit config [deployment-charts] - ''https://gerrit.wikimedia.org/r/1199999'
2025-10-30 10:40:02 <jinxer-wm> FIRING: [4x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
2025-10-30 10:44:11 <icinga-wm> PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - thanos-query_443: Servers titan1002.eqiad.wmnet are marked down but pooled: thanos-web_443: Servers titan1002.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
2025-10-30 10:44:57 <jinxer-wm> FIRING: ProbeDown: Service thanos-query:443 has failed probes (http_thanos-query_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thanos-query:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
2025-10-30 10:45:16 <hnowlan> here
2025-10-30 10:45:38 <Emperor> !incidents
2025-10-30 10:45:38 <sirenbot> 6910 (UNACKED) ProbeDown sre (10.2.2.53 ip4 thanos-query:443 probes/service http_thanos-query_ip4 eqiad)
2025-10-30 10:45:41 <sobanski> We were just seeing 503s on Grafana
2025-10-30 10:45:42 <Emperor> !ack 6910
2025-10-30 10:45:43 <sirenbot> 6910 (ACKED) ProbeDown sre (10.2.2.53 ip4 thanos-query:443 probes/service http_thanos-query_ip4 eqiad)
2025-10-30 10:46:31 <icinga-wm> PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - thanos-web_443: Servers titan1002.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
2025-10-30 10:47:11 <icinga-wm> RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
2025-10-30 10:47:31 <icinga-wm> RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
2025-10-30 10:49:57 <jinxer-wm> RESOLVED: ProbeDown: Service thanos-query:443 has failed probes (http_thanos-query_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thanos-query:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
2025-10-30 10:50:13 <Emperor> Looking at the thanos-swift dashboard, there was a big spike of requests (resulting in 206) around the time the page fired
2025-10-30 10:52:28 <Emperor> (with consequent rise in network traffic etc)
2025-10-30 10:54:56 <hnowlan> looks like it hit 1002 a lot harder than 1001
2025-10-30 10:54:58 <wikibugs> 'SRE, ''Infrastructure-Foundations: megacli issues on Debian Trixie - https://phabricator.wikimedia.org/T408776#11326712 (''MoritzMuehlenhoff) p:''Triage''→Medium'
2025-10-30 10:55:02 <jinxer-wm> FIRING: [4x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
2025-10-30 10:55:31 <hnowlan> memory usage spike at the same time lines up. $someone did $something expensive
2025-10-30 10:55:42 <hnowlan> the `thanos-query-log-explore` script in the docs doesn't output anything it seems unless I'm using it wrong
2025-10-30 10:56:06 <Emperor> I can't get it to either.
2025-10-30 10:56:22 <tappof> yeah, during the same window, some queries took several minutes to complete..
2025-10-30 10:57:09 <Emperor> hnowlan: shall I open a ticket about that for observability to look at?
2025-10-30 10:57:39 <Emperor> (but I think this incident probably doesn't need more work from oncall now otherwise)
2025-10-30 10:58:38 <Emperor> ah, no, I get it now, if I specify --min-range 1m then I get answers
2025-10-30 11:01:40 <_joe_> https://www.youtube.com/watch?v=M_5u3ESfFv0
2025-10-30 11:03:32 <hnowlan> :D
2025-10-30 11:03:42 <hnowlan> tappof: could you have a look to see if anything sticks out please?
2025-10-30 11:04:15 <tappof> yes hnowlan
2025-10-30 11:05:00 <hnowlan> afk for an hour
2025-10-30 11:05:06 <hnowlan> thanks m.oritzm! <3
2025-10-30 11:05:11 <moritzm> yw!
2025-10-30 11:05:15 <wikibugs> ('CR) ''Clément Goubert: [C:''+1] Fix handling of per-route ratelimit config [deployment-charts] - ''https://gerrit.wikimedia.org/r/1199999 (owner: ''Daniel Kinzler)'
2025-10-30 11:07:06 <wikibugs> ('PS9) ''Daniel Kinzler: Fix handling of per-route ratelimit config [deployment-charts] - ''https://gerrit.wikimedia.org/r/1199999'
2025-10-30 11:07:31 <wikibugs> ('CR) ''Clément Goubert: [C:''+1] Fix handling of per-route ratelimit config [deployment-charts] - ''https://gerrit.wikimedia.org/r/1199999 (owner: ''Daniel Kinzler)'
2025-10-30 11:07:49 <wikibugs> 'SRE, ''Data-Engineering (Q2 FY25/26 October 1st - December 31th): Move Druid realtime configuration out of Refinery into standalone repo on GitLab - https://phabricator.wikimedia.org/T407994#11326744 (''JAllemandou) The reason for which I suggested doing this task is that Druid-realtime are a specific type o...'
2025-10-30 11:08:13 <wikibugs> ('PS5) ''Muehlenhoff: Add an alert for Ganeti CA expiry [alerts] - ''https://gerrit.wikimedia.org/r/1199809 (https://phabricator.wikimedia.org/T382902)'
2025-10-30 11:09:54 <wikibugs> ('CR) ''CI reject: [V:''-1] Add an alert for Ganeti CA expiry [alerts] - ''https://gerrit.wikimedia.org/r/1199809 (https://phabricator.wikimedia.org/T382902) (owner: ''Muehlenhoff)'
2025-10-30 11:10:44 <wikibugs> ('CR) ''Clément Goubert: [C:''+2] Fix handling of per-route ratelimit config [deployment-charts] - ''https://gerrit.wikimedia.org/r/1199999 (owner: ''Daniel Kinzler)'
2025-10-30 11:12:55 <wikibugs> ('Merged) ''jenkins-bot: Fix handling of per-route ratelimit config [deployment-charts] - ''https://gerrit.wikimedia.org/r/1199999 (owner: ''Daniel Kinzler)'
2025-10-30 11:13:13 <jinxer-wm> RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS
2025-10-30 11:14:47 <jinxer-wm> FIRING: [10x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
2025-10-30 11:15:02 <jinxer-wm> FIRING: [3x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
2025-10-30 11:16:07 <wikibugs> ('CR) ''Brouberol: [C:''-1] Add an opensearch-test-codfw namespace (''1 comment) [puppet] - ''https://gerrit.wikimedia.org/r/1200010 (https://phabricator.wikimedia.org/T408779) (owner: ''Stevemunene)'
2025-10-30 11:17:56 <wikibugs> ('CR) ''Brouberol: [C:''-1] "I'm still seeing" [deployment-charts] - ''https://gerrit.wikimedia.org/r/1199819 (https://phabricator.wikimedia.org/T408711) (owner: ''Stevemunene)'
2025-10-30 11:18:22 <logmsgbot> !log daniel@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply
2025-10-30 11:19:36 <logmsgbot> !log daniel@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply
2025-10-30 11:20:02 <jinxer-wm> RESOLVED: [3x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
2025-10-30 11:24:42 <wikibugs> ('CR) ''Stevemunene: Add an opensearch-test-codfw namespace (''1 comment) [puppet] - ''https://gerrit.wikimedia.org/r/1200010 (https://phabricator.wikimedia.org/T408779) (owner: ''Stevemunene)'
2025-10-30 11:25:17 <logmsgbot> !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply
2025-10-30 11:25:31 <logmsgbot> !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply
2025-10-30 11:25:49 <wikibugs> ('PS6) ''Muehlenhoff: Add an alert for Ganeti CA expiry [alerts] - ''https://gerrit.wikimedia.org/r/1199809 (https://phabricator.wikimedia.org/T382902)'
2025-10-30 11:27:31 <wikibugs> ('CR) ''CI reject: [V:''-1] Add an alert for Ganeti CA expiry [alerts] - ''https://gerrit.wikimedia.org/r/1199809 (https://phabricator.wikimedia.org/T382902) (owner: ''Muehlenhoff)'
2025-10-30 11:28:30 <wikibugs> ('PS1) ''Muehlenhoff: Re-enable monitoring for maps/bookworm [puppet] - ''https://gerrit.wikimedia.org/r/1200030 (https://phabricator.wikimedia.org/T381565)'
2025-10-30 11:29:08 <wikibugs> 'SRE, ''SRE-Access-Requests, ''Machine-Learning-Team: Promote dpogorzelski from ops-limited to ops - https://phabricator.wikimedia.org/T408702#11326775 (''DPogorzelski-WMF) a:''mark'
2025-10-30 11:33:44 <jinxer-wm> FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
2025-10-30 11:34:05 <logmsgbot> !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply
2025-10-30 11:34:14 <logmsgbot> !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply
2025-10-30 11:34:21 <jinxer-wm> FIRING: [2x] CoreBGPDown: Core BGP session down between cloudsw1-b1-codfw and cloudservices2004-dev (172.20.5.8) - group cloud_host - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
2025-10-30 11:38:49 <wikibugs> 'ops-codfw, ''SRE, ''DC-Ops, ''decommission-hardware: decommission es2026 - https://phabricator.wikimedia.org/T408385#11326804 (''Marostegui)'
2025-10-30 11:40:19 <wikibugs> ('PS1) ''Marostegui: es2028: Disable notifications [puppet] - ''https://gerrit.wikimedia.org/r/1200031 (https://phabricator.wikimedia.org/T408407)'
2025-10-30 11:41:03 <wikibugs> ('CR) ''Marostegui: [C:''+2] es2028: Disable notifications [puppet] - ''https://gerrit.wikimedia.org/r/1200031 (https://phabricator.wikimedia.org/T408407) (owner: ''Marostegui)'
2025-10-30 11:41:24 <wikibugs> ('CR) ''Marostegui: [C:''-2] "Not for now, as I am using it for some Debian trixie testing." [puppet] - ''https://gerrit.wikimedia.org/r/1199825 (https://phabricator.wikimedia.org/T408407) (owner: ''Federico Ceratto)'
2025-10-30 11:41:38 <wikibugs> ('PS10) ''Stevemunene: Deploy airflow images from airflow-dags repository build [deployment-charts] - ''https://gerrit.wikimedia.org/r/1199819 (https://phabricator.wikimedia.org/T408711)'
2025-10-30 11:42:47 <wikibugs> ('PS7) ''Muehlenhoff: Add an alert for Ganeti CA expiry [alerts] - ''https://gerrit.wikimedia.org/r/1199809 (https://phabricator.wikimedia.org/T382902)'
2025-10-30 11:43:55 <logmsgbot> !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host es2028.codfw.wmnet with OS trixie
2025-10-30 11:44:02 <wikibugs> ('CR) ''CI reject: [V:''-1] Add an alert for Ganeti CA expiry [alerts] - ''https://gerrit.wikimedia.org/r/1199809 (https://phabricator.wikimedia.org/T382902) (owner: ''Muehlenhoff)'
2025-10-30 11:45:08 <tappof> > tappof: could you have a look to see if anything sticks out please?
2025-10-30 11:45:23 <moritzm> !log installing pdns-recursor security updates
2025-10-30 11:45:25 <tappof> Looks like the short outage was caused by a request on the "all clusters utilization" dashboard with a time range of a year
2025-10-30 11:45:26 <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2025-10-30 11:45:41 <logmsgbot> !log fnegri@cumin1003 START - Cookbook sre.wikireplicas.add-wiki for database minwikisource (T408346)
2025-10-30 11:45:47 <stashbot> T408346: [wikireplicas] Create views for new wiki minwikisource - https://phabricator.wikimedia.org/T408346
2025-10-30 11:48:50 <wikibugs> 'SRE, ''Data-Engineering, ''LDAP-Access-Requests: Grant Access to analytics-privatedata-users for mvernon - https://phabricator.wikimedia.org/T408793 (''MatthewVernon) ''NEW'
2025-10-30 11:54:15 <wikibugs> ('PS1) ''Clément Goubert: api-gateway: Improve policy override [deployment-charts] - ''https://gerrit.wikimedia.org/r/1200033'
2025-10-30 11:54:48 <wikibugs> ('CR) ''Effie Mouzeli: [C:''+1] Remove obsolete appserver cergen certs [puppet] - ''https://gerrit.wikimedia.org/r/1178528 (https://phabricator.wikimedia.org/T360636) (owner: ''Muehlenhoff)'
2025-10-30 11:54:49 <wikibugs> 'SRE, ''SRE-Access-Requests, ''Data-Engineering: Grant Access to analytics-privatedata-users for mvernon - https://phabricator.wikimedia.org/T408793#11326898 (''taavi)'
2025-10-30 11:55:07 <wikibugs> ('CR) ''Effie Mouzeli: [C:''+1] "woohoo" [puppet] - ''https://gerrit.wikimedia.org/r/1198952 (owner: ''Muehlenhoff)'
2025-10-30 11:55:58 <wikibugs> ('CR) ''Muehlenhoff: [C:''+2] Remove Cumin aliases for legacy mediawiki servers [puppet] - ''https://gerrit.wikimedia.org/r/1198952 (owner: ''Muehlenhoff)'
2025-10-30 11:59:09 <wikibugs> ('PS1) ''Stevemunene: druid: switch to using the druid-public-coordinator url [puppet] - ''https://gerrit.wikimedia.org/r/1200034 (https://phabricator.wikimedia.org/T403955)'
2025-10-30 11:59:48 <wikibugs> ('CR) ''Clément Goubert: [C:''+2] api-gateway: Improve policy override [deployment-charts] - ''https://gerrit.wikimedia.org/r/1200033 (owner: ''Clément Goubert)'
2025-10-30 12:00:05 <jouncebot> Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251030T1200)
2025-10-30 12:01:03 <logmsgbot> !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on es2028.codfw.wmnet with reason: host reimage
2025-10-30 12:01:39 <wikibugs> ('Merged) ''jenkins-bot: api-gateway: Improve policy override [deployment-charts] - ''https://gerrit.wikimedia.org/r/1200033 (owner: ''Clément Goubert)'
2025-10-30 12:02:32 <wikibugs> ('PS8) ''Muehlenhoff: Add an alert for Ganeti CA expiry [alerts] - ''https://gerrit.wikimedia.org/r/1199809 (https://phabricator.wikimedia.org/T382902)'
2025-10-30 12:03:42 <wikibugs> ('CR) ''CI reject: [V:''-1] Add an alert for Ganeti CA expiry [alerts] - ''https://gerrit.wikimedia.org/r/1199809 (https://phabricator.wikimedia.org/T382902) (owner: ''Muehlenhoff)'
2025-10-30 12:03:46 <logmsgbot> !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es2028.codfw.wmnet with reason: host reimage
2025-10-30 12:03:53 <logmsgbot> !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply
2025-10-30 12:04:22 <logmsgbot> !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply
2025-10-30 12:04:44 <logmsgbot> !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply
2025-10-30 12:05:21 <logmsgbot> !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply
2025-10-30 12:06:05 <logmsgbot> !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply
2025-10-30 12:06:16 <logmsgbot> !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply
2025-10-30 12:06:46 <wikibugs> ('PS9) ''Muehlenhoff: Add an alert for Ganeti CA expiry [alerts] - ''https://gerrit.wikimedia.org/r/1199809 (https://phabricator.wikimedia.org/T382902)'
2025-10-30 12:10:26 <wikibugs> ('CR) ''Slyngshede: [C:''+1] "Looks good." [alerts] - ''https://gerrit.wikimedia.org/r/1199809 (https://phabricator.wikimedia.org/T382902) (owner: ''Muehlenhoff)'
2025-10-30 12:13:44 <jinxer-wm> RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-eqord:xe-0/1/3 (Transport: cr3-ulsfo:xe-0/1/1 (Arelion, IC-313592 51ms 10Gbps wave) {#11372}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
2025-10-30 12:21:18 <wikibugs> 'ops-eqiad, ''SRE, ''DBA, ''DC-Ops, ''decommission-hardware: decommission es1032.eqiad.wmnet - https://phabricator.wikimedia.org/T408662#11326937 (''Jclark-ctr)'
2025-10-30 12:22:22 <wikibugs> 'ops-eqiad, ''SRE, ''DBA, ''DC-Ops, ''decommission-hardware: decommission es1032.eqiad.wmnet - https://phabricator.wikimedia.org/T408662#11326938 (''Jclark-ctr) ''Open''→Resolved'
2025-10-30 12:23:36 <wikibugs> 'ops-eqiad, ''SRE, ''DC-Ops, ''decommission-hardware: decommission kafka-jumbo100[7-9].eqiad.wmnet - https://phabricator.wikimedia.org/T404413#11326940 (''Jclark-ctr)'
2025-10-30 12:29:31 <wikibugs> ('CR) ''ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, October 30 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1198626 (https://phabricator.wikimedia.org/T408284) (owner: ''Bunnypranav)'
2025-10-30 12:31:12 <moritzm> !log installing nginx security updates
2025-10-30 12:31:15 <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2025-10-30 12:36:52 <wikibugs> 'ops-eqiad, ''SRE, ''DC-Ops, ''decommission-hardware: decommission kafka-jumbo100[7-9].eqiad.wmnet - https://phabricator.wikimedia.org/T404413#11326973 (''Jclark-ctr) ''Open''→Resolved'
2025-10-30 12:41:24 <wikibugs> ('PS1) ''Marostegui: installserver: Format /srv/ in es2028 [puppet] - ''https://gerrit.wikimedia.org/r/1200049 (https://phabricator.wikimedia.org/T407472)'
2025-10-30 12:45:21 <wikibugs> ('CR) ''Marostegui: [C:''+2] installserver: Format /srv/ in es2028 [puppet] - ''https://gerrit.wikimedia.org/r/1200049 (https://phabricator.wikimedia.org/T407472) (owner: ''Marostegui)'
2025-10-30 12:48:37 <wikibugs> ('PS1) ''Huei Tan: alartmanager: change the lpl-team-slack-api-alerts config [puppet] - ''https://gerrit.wikimedia.org/r/1200053 (https://phabricator.wikimedia.org/T376535)'
2025-10-30 12:49:05 <wikibugs> ('CR) ''CI reject: [V:''-1] alartmanager: change the lpl-team-slack-api-alerts config [puppet] - ''https://gerrit.wikimedia.org/r/1200053 (https://phabricator.wikimedia.org/T376535) (owner: ''Huei Tan)'
2025-10-30 12:52:10 <logmsgbot> !log marostegui@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host es2028.codfw.wmnet with OS trixie
2025-10-30 12:53:59 <wikibugs> 'ops-eqiad, ''SRE, ''DC-Ops, ''Data-Platform-SRE (2025.10.17 - 2025.11.07), ''Essential-Work: Degraded RAID on an-presto1013 - https://phabricator.wikimedia.org/T408065#11327005 (''Jclark-ctr) Replacement drive is being ordered from dell on ticket T408572 after reviewing Available options other supplie...'
2025-10-30 12:58:11 <logmsgbot> !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host es2028.codfw.wmnet with OS trixie
2025-10-30 13:00:05 <jouncebot> Urbanecm and TheresNoTime: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC afternoon backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251030T1300).
2025-10-30 13:00:05 <jouncebot> seanleong-wmde, JavierMonton, mfossati, Superpes, and Bunnypranav: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
2025-10-30 13:00:13 <Superpes> o/
2025-10-30 13:00:25 <bunnypranav> o/
2025-10-30 13:00:37 <seanleong-wmde> o/
2025-10-30 13:01:48 <mfossati> o/
2025-10-30 13:02:19 <Superpes> bunnypranav Any reason why you moved on workboard the 2 tasks I was handling and change the status?
2025-10-30 13:04:21 <jinxer-wm> FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
2025-10-30 13:04:29 <mfossati> Urbanecm, TheresNoTime: I can self-deploy
2025-10-30 13:06:48 <bunnypranav> Superpes: Nothing special, I thought the general process for work board/task management (someone did that for mine as well earlier). Apologies if you do not want it; feel free to revert and I will take a note for future.
2025-10-30 13:07:51 <Superpes> bunnypranav Nope, don't get me wrong, the process is absolutely correct! But, since they should be closed in less than an hour... well, I'd say it's a pointless change, just more unnecessary work for us, that's all :D No need to revert :)
2025-10-30 13:08:28 <Superpes> mfossati Start with your patches, or I think we'll be here all night; then, if you can, there's us too :D
2025-10-30 13:08:43 <bunnypranav> Oh okay, will remember for any future changes.
2025-10-30 13:08:59 <mfossati> Superpes ok, on it!
2025-10-30 13:09:38 <wikibugs> ('CR) ''TrainBranchBot: [C:''+2] "Approved by mfossati@deploy2002 using scap backport" [extensions/ReaderExperiments] (wmf/1.45.0-wmf.25) - ''https://gerrit.wikimedia.org/r/1199814 (owner: ''Marco Fossati)'
2025-10-30 13:09:38 <wikibugs> ('CR) ''TrainBranchBot: [C:''+2] "Approved by mfossati@deploy2002 using scap backport" [extensions/ReaderExperiments] (wmf/1.45.0-wmf.25) - ''https://gerrit.wikimedia.org/r/1199844 (https://phabricator.wikimedia.org/T408618) (owner: ''Marco Fossati)'
2025-10-30 13:09:38 <wikibugs> ('CR) ''TrainBranchBot: [C:''+2] "Approved by mfossati@deploy2002 using scap backport" [extensions/ReaderExperiments] (wmf/1.45.0-wmf.25) - ''https://gerrit.wikimedia.org/r/1199847 (owner: ''Marco Fossati)'
2025-10-30 13:12:58 <wikibugs> ('Merged) ''jenkins-bot: Localisation updates from https://translatewiki.net. [extensions/ReaderExperiments] (wmf/1.45.0-wmf.25) - ''https://gerrit.wikimedia.org/r/1199814 (owner: ''Marco Fossati)'
2025-10-30 13:12:58 <wikibugs> ('Merged) ''jenkins-bot: Style adjustments [extensions/ReaderExperiments] (wmf/1.45.0-wmf.25) - ''https://gerrit.wikimedia.org/r/1199844 (https://phabricator.wikimedia.org/T408618) (owner: ''Marco Fossati)'
2025-10-30 13:13:01 <wikibugs> ('Merged) ''jenkins-bot: Capture more captions [extensions/ReaderExperiments] (wmf/1.45.0-wmf.25) - ''https://gerrit.wikimedia.org/r/1199847 (owner: ''Marco Fossati)'
2025-10-30 13:13:50 <logmsgbot> !log mfossati@deploy2002 Started scap sync-world: Backport for [[gerrit:1199814|Localisation updates from https://translatewiki.net.]], [[gerrit:1199844|Style adjustments (T408618)]], [[gerrit:1199847|Capture more captions]]
2025-10-30 13:13:55 <stashbot> T408618: UI Bug Bash for Image browsing (production) - https://phabricator.wikimedia.org/T408618
2025-10-30 13:14:45 <JavierMonton> Hi, sorry for the question, it's my first time trying to deploy a change, I added it to the calendar but I'm not sure if I have to do anything else. Can I help with it somehow?
2025-10-30 13:15:16 <logmsgbot> !log elukey@cumin1003 START - Cookbook sre.k8s.pool-depool-node pool for host ml-serve2001.codfw.wmnet
2025-10-30 13:15:19 <logmsgbot> !log elukey@cumin1003 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host ml-serve2001.codfw.wmnet
2025-10-30 13:15:30 <wikibugs> 'ops-codfw, ''SRE, ''DC-Ops, ''Machine-Learning-Team: DIMM_A2 errors for ml-serve2001 - https://phabricator.wikimedia.org/T408516#11327064 (''ops-monitoring-bot) Cookbook cookbooks.sre.k8s.pool-depool-node started by elukey@cumin1003 pool for host ml-serve2001.codfw.wmnet completed: - ml-serve2001.codfw.w...'
2025-10-30 13:15:33 <seanleong-wmde> Hi, anyone able to help me deploy my changes? Thanks!
2025-10-30 13:15:38 <wikibugs> 'ops-codfw, ''SRE, ''DC-Ops, ''Machine-Learning-Team: DIMM_A2 errors for ml-serve2001 - https://phabricator.wikimedia.org/T408516#11327067 (''elukey) ''Open''→Resolved a:''elukey Host repooled!'
2025-10-30 13:16:26 <mfossati> urbanecm, TheresNoTime: are you around to deploy the config changes by JavierMonton and seanleong-wmde?
2025-10-30 13:16:45 <bunnypranav> same here btw, I also need a deployer
2025-10-30 13:16:51 <logmsgbot> !log marostegui@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on es2028.codfw.wmnet with reason: host reimage
2025-10-30 13:16:53 <seanleong-wmde> yup, I am around, but I need a deployer
2025-10-30 13:18:10 <logmsgbot> !log mfossati@deploy2002 mfossati: Backport for [[gerrit:1199814|Localisation updates from https://translatewiki.net.]], [[gerrit:1199844|Style adjustments (T408618)]], [[gerrit:1199847|Capture more captions]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
2025-10-30 13:18:35 <logmsgbot> !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host pki-root1001.eqiad.wmnet
2025-10-30 13:18:43 <mfossati> checking, please hold on :-)
2025-10-30 13:19:22 <wikibugs> 'SRE, ''SRE-Access-Requests: Add dpogorzelski to ML and Data Platform posix groups - https://phabricator.wikimedia.org/T408579#11327081 (''elukey) rationale for `ml-team-admins`: while Dawid will soon be in `ops`, some tools available only to `ml-team-admins` will need to be tested in the future and not needi...'
2025-10-30 13:20:30 <logmsgbot> !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es2028.codfw.wmnet with reason: host reimage
2025-10-30 13:22:15 <mfossati> hmm for some reason I'm not seeing the changes with WikimediaDebug ... let me dig further
2025-10-30 13:22:45 <logmsgbot> !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1201.eqiad.wmnet with reason: Maintenance
2025-10-30 13:23:16 <logmsgbot> !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2214.codfw.wmnet with reason: Maintenance
2025-10-30 13:24:32 <logmsgbot> !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host pki-root1001.eqiad.wmnet
2025-10-30 13:26:59 <mfossati> no idea why the WikimediaDebug extension in Firefox isn't showing me the changes
2025-10-30 13:28:03 <mfossati> I'm trying to switch a few backends
2025-10-30 13:28:58 <wikibugs> ('PS2) ''Gehel: WDQS: remove ferm rule for port 80 [puppet] - ''https://gerrit.wikimedia.org/r/1199849 (https://phabricator.wikimedia.org/T408736)'
2025-10-30 13:31:30 <mfossati> Oh I think I got it: I can't test on the wikis where the extension is deployed since they aren't yet at 1.45.0-wmf.25. I'll go forward. Thanks for bearing with me
2025-10-30 13:32:07 <logmsgbot> !log mfossati@deploy2002 mfossati: Continuing with sync
2025-10-30 13:32:17 <logmsgbot> !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1167.eqiad.wmnet with reason: Maintenance
2025-10-30 13:32:36 <logmsgbot> !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1016,1020].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
2025-10-30 13:32:44 <logmsgbot> !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1167 (T407997)', diff saved to https://phabricator.wikimedia.org/P84446 and previous config saved to /var/cache/conftool/dbconfig/20251030-133243-marostegui.json
2025-10-30 13:32:49 <stashbot> T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997
2025-10-30 13:34:05 <logmsgbot> !log cmooney@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5 days, 0:00:00 on 60 hosts with reason: downtime new nokia devices in case they alert during tests
2025-10-30 13:34:12 <wikibugs> 'SRE, ''Infrastructure-Foundations, ''netops: Nokia: add new switches in eqiad/codfw to monitoring and make 'active' - https://phabricator.wikimedia.org/T405558#11327142 (''ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=aacadee6-1bf1-45b7-bbed-963884cb38ed) set by cmooney@cumin1003 for 5 d...'
2025-10-30 13:34:21 <jinxer-wm> FIRING: [2x] JobUnavailable: Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
2025-10-30 13:35:59 <logmsgbot> !log fnegri@cumin1003 START - Cookbook sre.wikireplicas.add-wiki for database minwikisource (T408346)
2025-10-30 13:36:00 <Superpes> mfossati I just noticed! That's right, it can't be tested on wikis, because it's for the next update :D
2025-10-30 13:36:04 <stashbot> T408346: [wikireplicas] Create views for new wiki minwikisource - https://phabricator.wikimedia.org/T408346
2025-10-30 13:36:10 <logmsgbot> !log fnegri@cumin1003 END (PASS) - Cookbook sre.wikireplicas.add-wiki (exit_code=0) for database minwikisource (T408346)
2025-10-30 13:36:18 <wikibugs> ('CR) ''Brouberol: [C:''+1] "LGTM" [deployment-charts] - ''https://gerrit.wikimedia.org/r/1199819 (https://phabricator.wikimedia.org/T408711) (owner: ''Stevemunene)'
2025-10-30 13:36:35 <Superpes> Is anyone available to deploy the other patches?
2025-10-30 13:36:55 <logmsgbot> !log mfossati@deploy2002 Finished scap sync-world: Backport for [[gerrit:1199814|Localisation updates from https://translatewiki.net.]], [[gerrit:1199844|Style adjustments (T408618)]], [[gerrit:1199847|Capture more captions]] (duration: 23m 05s)
2025-10-30 13:37:00 <stashbot> T408618: UI Bug Bash for Image browsing (production) - https://phabricator.wikimedia.org/T408618
2025-10-30 13:37:01 <mfossati> Superpes: LOL, I definitely overlooked that
2025-10-30 13:37:21 <mfossati> I'm all done here!
2025-10-30 13:39:36 <mfossati> Superpes: I have deploy rights, so I guess I could do that, but don't wanna step on any official deployer toes :-)
2025-10-30 13:39:58 <wikibugs> ('CR) ''Muehlenhoff: C:openldap extend wikimediaPerson schema for Phabricator (''1 comment) [puppet] - ''https://gerrit.wikimedia.org/r/1197617 (https://phabricator.wikimedia.org/T406495) (owner: ''Slyngshede)'
2025-10-30 13:40:11 <mfossati> Let me check if they're available on Slack
2025-10-30 13:40:18 <Superpes> I think no one is available atm (except you) :D
2025-10-30 13:40:22 <Superpes> Yep for sure!
2025-10-30 13:40:26 <seanleong-wmde> thanks!
2025-10-30 13:40:37 <seanleong-wmde> I have just 1 config change
2025-10-30 13:42:02 <logmsgbot> !log klausman@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'.
2025-10-30 13:42:19 <logmsgbot> !log klausman@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'.
2025-10-30 13:42:27 <mfossati> They both seem away on Slack, too. Well, I'll wear the deployer hat then
2025-10-30 13:42:50 <logmsgbot> !log fnegri@cumin1003 START - Cookbook sre.wikireplicas.add-wiki for database pcmwikiquote (T408354)
2025-10-30 13:42:51 <seanleong-wmde> o7
2025-10-30 13:42:55 <stashbot> T408354: [wikireplicas] Create views for new wiki pcmwikiquote - https://phabricator.wikimedia.org/T408354
2025-10-30 13:43:21 <wikibugs> ('CR) ''Bking: [C:''+1] Add an opensearch-test-codfw namespace (''1 comment) [puppet] - ''https://gerrit.wikimedia.org/r/1200010 (https://phabricator.wikimedia.org/T408779) (owner: ''Stevemunene)'
2025-10-30 13:44:00 <JavierMonton> thanks mfossati!
2025-10-30 13:44:21 <jinxer-wm> FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
2025-10-30 13:44:38 <mfossati> seanleong-wmde, JavierMonton, Superpes, Bunnypranav: I'll backport all config patches at once. If anybody needs to verify their patch, please let me know
2025-10-30 13:45:34 <wikibugs> ('PS2) ''Andrea Denisse: alartmanager: change the lpl-team-slack-api-alerts config [puppet] - ''https://gerrit.wikimedia.org/r/1200053 (https://phabricator.wikimedia.org/T376535) (owner: ''Huei Tan)'
2025-10-30 13:45:54 <wikibugs> ('CR) ''Andrea Denisse: "check experimental" [puppet] - ''https://gerrit.wikimedia.org/r/1200053 (https://phabricator.wikimedia.org/T376535) (owner: ''Huei Tan)'
2025-10-30 13:46:02 <wikibugs> ('CR) ''TrainBranchBot: [C:''+2] "Approved by mfossati@deploy2002 using scap backport" [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1193703 (https://phabricator.wikimedia.org/T397258) (owner: ''Seanleong-wmde)'
2025-10-30 13:46:03 <wikibugs> ('CR) ''TrainBranchBot: [C:''+2] "Approved by mfossati@deploy2002 using scap backport" [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1199246 (https://phabricator.wikimedia.org/T384964) (owner: ''JavierMonton)'
2025-10-30 13:46:03 <wikibugs> ('CR) ''TrainBranchBot: [C:''+2] "Approved by mfossati@deploy2002 using scap backport" [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1199725 (https://phabricator.wikimedia.org/T408298) (owner: ''Superpes15)'
2025-10-30 13:46:04 <wikibugs> ('CR) ''TrainBranchBot: [C:''+2] "Approved by mfossati@deploy2002 using scap backport" [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1199727 (https://phabricator.wikimedia.org/T408514) (owner: ''Superpes15)'
2025-10-30 13:46:04 <wikibugs> ('CR) ''TrainBranchBot: [C:''+2] "Approved by mfossati@deploy2002 using scap backport" [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1198626 (https://phabricator.wikimedia.org/T408284) (owner: ''Bunnypranav)'
2025-10-30 13:46:27 <seanleong-wmde> mfossati I could do a short test if it's possible in mwdebug
2025-10-30 13:46:40 <logmsgbot> !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1167 (T407997)', diff saved to https://phabricator.wikimedia.org/P84447 and previous config saved to /var/cache/conftool/dbconfig/20251030-134639-marostegui.json
2025-10-30 13:46:43 <wikibugs> ('CR) ''Andrea Denisse: [C:''+2] alartmanager: change the lpl-team-slack-api-alerts config [puppet] - ''https://gerrit.wikimedia.org/r/1200053 (https://phabricator.wikimedia.org/T376535) (owner: ''Huei Tan)'
2025-10-30 13:46:46 <stashbot> T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997
2025-10-30 13:46:50 <mfossati> seanleong-wmde: sure thing
2025-10-30 13:46:56 <seanleong-wmde> thanks!
2025-10-30 13:46:59 <wikibugs> ('Merged) ''jenkins-bot: Add feature flag for pilot wikis about visual changes coming from Wikibase having an icon. [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1193703 (https://phabricator.wikimedia.org/T397258) (owner: ''Seanleong-wmde)'
2025-10-30 13:47:02 <wikibugs> ('Merged) ''jenkins-bot: Disable default user-agent collection. [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1199246 (https://phabricator.wikimedia.org/T384964) (owner: ''JavierMonton)'
2025-10-30 13:47:04 <wikibugs> ('Merged) ''jenkins-bot: [huwiki] Set $wgUploadNavigationUrl [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1199725 (https://phabricator.wikimedia.org/T408298) (owner: ''Superpes15)'
2025-10-30 13:47:07 <wikibugs> ('Merged) ''jenkins-bot: [ruwiki] Enable WikiLove extension [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1199727 (https://phabricator.wikimedia.org/T408514) (owner: ''Superpes15)'
2025-10-30 13:47:09 <wikibugs> ('Merged) ''jenkins-bot: core-Namespaces: Add R: and R_talk: NS for crhwiki [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1198626 (https://phabricator.wikimedia.org/T408284) (owner: ''Bunnypranav)'
2025-10-30 13:47:43 <logmsgbot> !log mfossati@deploy2002 Started scap sync-world: Backport for [[gerrit:1193703|Add feature flag for pilot wikis about visual changes coming from Wikibase having an icon. (T397258)]], [[gerrit:1199246|Disable default user-agent collection. (T384964)]], [[gerrit:1199725|[huwiki] Set $wgUploadNavigationUrl (T408298)]], [[gerrit:1199727|[ruwiki] Enable WikiLove extension (T408514)]], [[gerrit:1198626|core-Namespaces: Add R: and R_talk: NS for crhwiki (T408284)]]
2025-10-30 13:47:53 <stashbot> T397258: Implement Visual Changes to Edit Summary Based on UX Proposal - https://phabricator.wikimedia.org/T397258
2025-10-30 13:47:55 <stashbot> T384964: [Event Platform] Disable default collection of user agent for analytics streams - https://phabricator.wikimedia.org/T384964
2025-10-30 13:47:57 <stashbot> T408298: Set $wgUploadNavigationUrl for hu.wikipedia.org - https://phabricator.wikimedia.org/T408298
2025-10-30 13:47:57 <stashbot> T408514: Install Extension:WikiLove in Russian Wikipedia - https://phabricator.wikimedia.org/T408514
2025-10-30 13:47:58 <stashbot> T408284: Request to create a namespace for Crimean Tatar Wikipedia - https://phabricator.wikimedia.org/T408284
2025-10-30 13:48:09 <wikibugs> ('CR) ''Slyngshede: C:openldap extend wikimediaPerson schema for Phabricator (''1 comment) [puppet] - ''https://gerrit.wikimedia.org/r/1197617 (https://phabricator.wikimedia.org/T406495) (owner: ''Slyngshede)'
2025-10-30 13:50:20 <logmsgbot> !log mfossati@deploy2002 superpes, bunnypranav, javiermonton, mfossati, seanleong-wmde: Backport for [[gerrit:1193703|Add feature flag for pilot wikis about visual changes coming from Wikibase having an icon. (T397258)]], [[gerrit:1199246|Disable default user-agent collection. (T384964)]], [[gerrit:1199725|[huwiki] Set $wgUploadNavigationUrl (T408298)]], [[gerrit:1199727|[ruwiki] Enable WikiLove extension (T408514)]], [[gerrit:1198626|core-Namespaces: Add R: and R_talk: NS for crhwiki (T408284)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
2025-10-30 13:50:41 <Superpes> Testing :)
2025-10-30 13:50:42 <seanleong-wmde> testing it now
2025-10-30 13:51:22 <bunnypranav> testing
2025-10-30 13:51:45 <JavierMonton> checking
2025-10-30 13:52:27 <bunnypranav> mfossati, all good for mine, but you'll need to run namespace dupes script as well for that wiki.
2025-10-30 13:53:26 <bunnypranav> there are existing pages with the new namespace prefix (R:, https://crh.wikipedia.org/wiki/Mahsus:%C3%96nekDizini?prefix=R%3A&namespace=0), so that will need fixing
2025-10-30 13:53:35 <mfossati> bunnypranav: sorry, but I've never done that and not sure how to
2025-10-30 13:53:56 <bunnypranav> Oh, anyone here that can help with the script?
2025-10-30 13:55:28 <wikibugs> ('CR) ''Majavah: [C:''-1] "It seems like this is caused by a mismatch of the wmf server packages and the debian client package:" [puppet] - ''https://gerrit.wikimedia.org/r/1199850 (owner: ''Andrew Bogott)'
2025-10-30 13:56:50 <bunnypranav> mfossati: https://www.mediawiki.org/wiki/Manual:NamespaceDupes.php is the docs for it, if it helps. As far as I understand there should be no clashes, so I assume the automatic repair, "./maintenance/run namespaceDupes --fix", should work
2025-10-30 13:56:56 <suzannewoodWMDE2> on the debug extension, do you know which of the options e.g. k8s-mwdebug we should be looking on?
2025-10-30 13:57:01 <suzannewoodWMDE2> in the dropdown
2025-10-30 13:57:07 <bunnypranav> would you be able to do it?
2025-10-30 13:57:58 <wikibugs> ('PS1) ''Elukey: team-sre: set only critical alerts for mirrors [alerts] - ''https://gerrit.wikimedia.org/r/1200068'
2025-10-30 13:58:17 <wikibugs> 'ops-eqiad, ''SRE, ''SRE-swift-storage, ''DC-Ops: Install new disk controllers to SM swift backends (eqiad) - https://phabricator.wikimedia.org/T400877#11327268 (''MatthewVernon) @VRiley-WMF the host is up, but it can't reach any of its spinning disks (the OS sees none, and the BMC says 0 physical disks)....'
2025-10-30 13:59:49 <Superpes> Sorry for the late reply, everything is fine for my patches :)
2025-10-30 14:00:15 <mfossati> bunnypranav: I'm afraid I can't help further, never done that, so not confident at all
2025-10-30 14:00:59 <JavierMonton> everything fine on my side too
2025-10-30 14:01:03 <bunnypranav> hmm, I was actually advised that this is the script to be run.
2025-10-30 14:01:05 <wikibugs> ('CR) ''Bking: [C:''+1] global_config: stop relying on DNS to translate FQDNs into IP addresses [puppet] - ''https://gerrit.wikimedia.org/r/1199813 (https://phabricator.wikimedia.org/T408706) (owner: ''Brouberol)'
2025-10-30 14:01:50 <logmsgbot> !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P84448 and previous config saved to /var/cache/conftool/dbconfig/20251030-140147-marostegui.json
2025-10-30 14:01:52 <bunnypranav> (advised by other deployers a few days ago)
2025-10-30 14:02:11 <wikibugs> ('CR) ''Tiziano Fogli: [C:''+2] cinder: enable nrpe2nodexp wrapper on check-cinder-snapshot-leaks [puppet] - ''https://gerrit.wikimedia.org/r/1200012 (https://phabricator.wikimedia.org/T328502) (owner: ''Tiziano Fogli)'
2025-10-30 14:02:37 <mfossati> bunnypranav: it would be great if you could directly ask them
2025-10-30 14:03:42 <bunnypranav> I can send xSavitar a DM to do it in a few hours when they said they will be available, and we continue the patch for now. Would that be okay with you?
2025-10-30 14:03:48 <wikibugs> ('CR) ''Kgraessle: [C:''+1] Enable ChangesListQuery partitioning on mediawikiwiki [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1199890 (https://phabricator.wikimedia.org/T403798) (owner: ''Tim Starling)'
2025-10-30 14:03:57 <wikibugs> ('PS1) ''Marostegui: rebuild_abuse_filter_log_trigger.sh: Quick oneliner to drop a trigger [software] - ''https://gerrit.wikimedia.org/r/1200070 (https://phabricator.wikimedia.org/T408780)'
2025-10-30 14:04:07 <wikibugs> ('CR) ''Kgraessle: [C:''+1] Enable ChangesListQuery partitioning on enwiki and commonswiki [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1199891 (https://phabricator.wikimedia.org/T403798) (owner: ''Tim Starling)'
2025-10-30 14:04:50 <wikibugs> ('CR) ''Marostegui: "Just a quick script for this task, as it may be needed in the future for other schema changes, just leaving this one here as example as so" [software] - ''https://gerrit.wikimedia.org/r/1200070 (https://phabricator.wikimedia.org/T408780) (owner: ''Marostegui)'
2025-10-30 14:04:55 <mfossati> bunnypranav: yep, that sounds good. Thanks for your understanding, as I came here only to backport my patches :-)
2025-10-30 14:05:02 <wikibugs> ('CR) ''Marostegui: [C:''+2] rebuild_abuse_filter_log_trigger.sh: Quick oneliner to drop a trigger [software] - ''https://gerrit.wikimedia.org/r/1200070 (https://phabricator.wikimedia.org/T408780) (owner: ''Marostegui)'
2025-10-30 14:05:12 <wikibugs> ('CR) ''Kgraessle: [C:''+1] Enable ChangesListQuery partitioning on all wikis [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1199892 (https://phabricator.wikimedia.org/T403798) (owner: ''Tim Starling)'
2025-10-30 14:05:18 <bunnypranav> Thank you for the backport, really appreciate it! :)
2025-10-30 14:05:32 <mfossati> all right, let's go!
2025-10-30 14:05:33 <wikibugs> ('Merged) ''jenkins-bot: rebuild_abuse_filter_log_trigger.sh: Quick oneliner to drop a trigger [software] - ''https://gerrit.wikimedia.org/r/1200070 (https://phabricator.wikimedia.org/T408780) (owner: ''Marostegui)'
2025-10-30 14:05:45 <logmsgbot> !log mfossati@deploy2002 superpes, bunnypranav, javiermonton, mfossati, seanleong-wmde: Continuing with sync
2025-10-30 14:06:46 <logmsgbot> !log aqu@deploy2002 Started deploy [analytics/refinery@39e92e9] (hadoop-test): Update pageview allowlist TEST [analytics/refinery@39e92e9f]
2025-10-30 14:07:04 <wikibugs> ('PS3) ''Bking: ingress: remove reference to defunct template [deployment-charts] - ''https://gerrit.wikimedia.org/r/1196174 (https://phabricator.wikimedia.org/T406876)'
2025-10-30 14:07:51 <logmsgbot> !log aqu@deploy2002 Finished deploy [analytics/refinery@39e92e9] (hadoop-test): Update pageview allowlist TEST [analytics/refinery@39e92e9f] (duration: 01m 04s)
2025-10-30 14:08:24 <logmsgbot> !log aqu@deploy2002 Started deploy [analytics/refinery@39e92e9]: Update pageview allowlist [analytics/refinery@39e92e9f]
2025-10-30 14:09:17 <wikibugs> ('PS4) ''Bking: ingress: remove reference to defunct template [deployment-charts] - ''https://gerrit.wikimedia.org/r/1196174 (https://phabricator.wikimedia.org/T406876)'
2025-10-30 14:09:49 <wikibugs> ('CR) ''Bking: "Done" [deployment-charts] - ''https://gerrit.wikimedia.org/r/1196174 (https://phabricator.wikimedia.org/T406876) (owner: ''Bking)'
2025-10-30 14:11:22 <logmsgbot> !log mfossati@deploy2002 Finished scap sync-world: Backport for [[gerrit:1193703|Add feature flag for pilot wikis about visual changes coming from Wikibase having an icon. (T397258)]], [[gerrit:1199246|Disable default user-agent collection. (T384964)]], [[gerrit:1199725|[huwiki] Set $wgUploadNavigationUrl (T408298)]], [[gerrit:1199727|[ruwiki] Enable WikiLove extension (T408514)]], [[gerrit:1198626|core-Namespaces: Add R: and R_talk: NS for crhwiki (T408284)]] (duration: 23m 39s)
2025-10-30 14:11:31 <stashbot> T397258: Implement Visual Changes to Edit Summary Based on UX Proposal - https://phabricator.wikimedia.org/T397258
2025-10-30 14:11:32 <stashbot> T384964: [Event Platform] Disable default collection of user agent for analytics streams - https://phabricator.wikimedia.org/T384964
2025-10-30 14:11:33 <stashbot> T408298: Set $wgUploadNavigationUrl for hu.wikipedia.org - https://phabricator.wikimedia.org/T408298
2025-10-30 14:11:33 <stashbot> T408514: Install Extension:WikiLove in Russian Wikipedia - https://phabricator.wikimedia.org/T408514
2025-10-30 14:11:34 <stashbot> T408284: Request to create a namespace for Crimean Tatar Wikipedia - https://phabricator.wikimedia.org/T408284
2025-10-30 14:12:10 <mfossati> we're all done here! This was a quite impromptu session :-D
2025-10-30 14:12:17 <logmsgbot> !log aqu@deploy2002 Finished deploy [analytics/refinery@39e92e9]: Update pageview allowlist [analytics/refinery@39e92e9f] (duration: 03m 52s)
2025-10-30 14:12:27 <bunnypranav> Thank you so much!
2025-10-30 14:12:59 <seanleong-wmde> thanks! mfossati
2025-10-30 14:13:46 <logmsgbot> !log aqu@deploy2002 Started deploy [analytics/refinery@39e92e9] (thin): Update pageview allowlist THIN [analytics/refinery@39e92e9f]
2025-10-30 14:14:00 <wikibugs> ('PS2) ''Tiziano Fogli: base: remove check_microcode [puppet] - ''https://gerrit.wikimedia.org/r/1184447 (https://phabricator.wikimedia.org/T350694)'
2025-10-30 14:14:24 <wikibugs> ('PS1) ''Marostegui: db2195: Migration to MariaDB 10.11 [puppet] - ''https://gerrit.wikimedia.org/r/1200075'
2025-10-30 14:14:44 <mfossati> a pleasure, have a nice one folks!
2025-10-30 14:15:03 <logmsgbot> !log aqu@deploy2002 Finished deploy [analytics/refinery@39e92e9] (thin): Update pageview allowlist THIN [analytics/refinery@39e92e9f] (duration: 01m 16s)
2025-10-30 14:15:12 <wikibugs> ('CR) ''Marostegui: [C:''+2] db2195: Migration to MariaDB 10.11 [puppet] - ''https://gerrit.wikimedia.org/r/1200075 (owner: ''Marostegui)'
2025-10-30 14:16:34 <logmsgbot> !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2195.codfw.wmnet with reason: Maintenance
2025-10-30 14:16:38 <logmsgbot> !log marostegui@cumin1003 dbctl commit (dc=all): 'Depool db2195 for migration to mariadb 10.11', diff saved to https://phabricator.wikimedia.org/P84449 and previous config saved to /var/cache/conftool/dbconfig/20251030-141638-marostegui.json
2025-10-30 14:16:58 <logmsgbot> !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P84450 and previous config saved to /var/cache/conftool/dbconfig/20251030-141657-marostegui.json
2025-10-30 14:17:53 <Superpes> Grazie mfossati :3
2025-10-30 14:18:00 <wikibugs> ('CR) ''Vgutierrez: Route "/api/rest_v1/" requests with "?spec" query to the rest gateway (''1 comment) [puppet] - ''https://gerrit.wikimedia.org/r/1199886 (https://phabricator.wikimedia.org/T397203) (owner: ''Aaron Schulz)'
2025-10-30 14:18:08 <wikibugs> ('CR) ''Stevemunene: Deploy airflow images from airflow-dags repository build (''1 comment) [deployment-charts] - ''https://gerrit.wikimedia.org/r/1199819 (https://phabricator.wikimedia.org/T408711) (owner: ''Stevemunene)'
2025-10-30 14:18:13 <wikibugs> ('CR) ''Stevemunene: [C:''+2] Deploy airflow images from airflow-dags repository build [deployment-charts] - ''https://gerrit.wikimedia.org/r/1199819 (https://phabricator.wikimedia.org/T408711) (owner: ''Stevemunene)'
2025-10-30 14:19:58 <wikibugs> ('Merged) ''jenkins-bot: Deploy airflow images from airflow-dags repository build [deployment-charts] - ''https://gerrit.wikimedia.org/r/1199819 (https://phabricator.wikimedia.org/T408711) (owner: ''Stevemunene)'
2025-10-30 14:20:10 <wikibugs> ('CR) ''Muehlenhoff: base: remove check_microcode (''1 comment) [puppet] - ''https://gerrit.wikimedia.org/r/1184447 (https://phabricator.wikimedia.org/T350694) (owner: ''Tiziano Fogli)'
2025-10-30 14:21:23 <wikibugs> ('CR) ''Muehlenhoff: [C:''+1] "Looks good" [puppet] - ''https://gerrit.wikimedia.org/r/1197617 (https://phabricator.wikimedia.org/T406495) (owner: ''Slyngshede)'
2025-10-30 14:22:31 <wikibugs> ('PS3) ''Tiziano Fogli: base: remove check_microcode [puppet] - ''https://gerrit.wikimedia.org/r/1184447 (https://phabricator.wikimedia.org/T350694)'
2025-10-30 14:23:00 <logmsgbot> !log stevemunene@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply
2025-10-30 14:24:29 <logmsgbot> !log marostegui@cumin1003 dbctl commit (dc=all): 'db2195 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P84451 and previous config saved to /var/cache/conftool/dbconfig/20251030-142428-root.json
2025-10-30 14:26:10 <logmsgbot> !log klausman@deploy2002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'.
2025-10-30 14:26:52 <logmsgbot> !log klausman@deploy2002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'.
2025-10-30 14:27:20 <logmsgbot> !log klausman@deploy2002 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'.
2025-10-30 14:27:40 <logmsgbot> !log klausman@deploy2002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'.
2025-10-30 14:28:54 <logmsgbot> !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host dborch1001.wikimedia.org
2025-10-30 14:30:05 <jouncebot> Deploy window xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251030T1430)
2025-10-30 14:30:48 <wikibugs> ('CR) ''Muehlenhoff: [C:''+1] "LGTM" [alerts] - ''https://gerrit.wikimedia.org/r/1200068 (owner: ''Elukey)'
2025-10-30 14:31:38 <wikibugs> ('CR) ''Muehlenhoff: [C:''+1] "Looks good" [puppet] - ''https://gerrit.wikimedia.org/r/1184447 (https://phabricator.wikimedia.org/T350694) (owner: ''Tiziano Fogli)'
2025-10-30 14:32:05 <logmsgbot> !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1167 (T407997)', diff saved to https://phabricator.wikimedia.org/P84452 and previous config saved to /var/cache/conftool/dbconfig/20251030-143204-marostegui.json
2025-10-30 14:32:10 <stashbot> T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997
2025-10-30 14:32:21 <logmsgbot> !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1171.eqiad.wmnet with reason: Maintenance
2025-10-30 14:32:35 <logmsgbot> !log eevans@cumin1003 START - Cookbook sre.cassandra.roll-restart for nodes matching A:aqs-codfw: Apply JVM upgrade to 11.0.29 - eevans@cumin1003
2025-10-30 14:33:22 <wikibugs> ('PS5) ''Bking: Add OpenSearch cluster configs for net-new clusters [deployment-charts] - ''https://gerrit.wikimedia.org/r/1198139 (https://phabricator.wikimedia.org/T357753)'
2025-10-30 14:33:31 <wikibugs> ('PS3) ''Cathal Mooney: sre.hosts.provision: move the switch config to parent class and run [cookbooks] - ''https://gerrit.wikimedia.org/r/1198108 (https://phabricator.wikimedia.org/T265342)'
2025-10-30 14:33:48 <logmsgbot> !log stevemunene@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply
2025-10-30 14:33:51 <jinxer-wm> FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
2025-10-30 14:34:29 <wikibugs> ('PS2) ''Tiziano Fogli: dbctl: enable nrpe2nodexp wrapper on dbctl_uncommitted_diffs [puppet] - ''https://gerrit.wikimedia.org/r/1200074 (https://phabricator.wikimedia.org/T350694)'
2025-10-30 14:34:29 <wikibugs> ('CR) ''Tiziano Fogli: "This change enables the nrpe2nodexp wrapper to export NRPE plugin results to Prometheus via the node exporter." [puppet] - ''https://gerrit.wikimedia.org/r/1200074 (https://phabricator.wikimedia.org/T350694) (owner: ''Tiziano Fogli)'
2025-10-30 14:35:56 <wikibugs> ('PS1) ''Andrew Bogott: cloud-vps pdns: Don't install (or use) default-mysql-client [puppet] - ''https://gerrit.wikimedia.org/r/1200080'
2025-10-30 14:35:57 <wikibugs> ('CR) ''Elukey: [C:''+2] team-sre: set only critical alerts for mirrors [alerts] - ''https://gerrit.wikimedia.org/r/1200068 (owner: ''Elukey)'
2025-10-30 14:36:06 <wikibugs> 'ops-codfw, ''SRE, ''DC-Ops, ''decommission-hardware: decommission es2026 - https://phabricator.wikimedia.org/T408385#11327475 (''Jhancock.wm) ''Open → Resolved'
2025-10-30 14:36:18 <wikibugs> ('PS2) ''Andrew Bogott: cloud-vps pdns: Don't install (or use) default-mysql-client [puppet] - ''https://gerrit.wikimedia.org/r/1200080'
2025-10-30 14:36:22 <wikibugs> ('CR) ''Brouberol: [V:''+1] "check experimental" [puppet] - ''https://gerrit.wikimedia.org/r/1199813 (https://phabricator.wikimedia.org/T408706) (owner: ''Brouberol)'
2025-10-30 14:36:54 <wikibugs> ('CR) ''Andrew Bogott: "check experimental" [puppet] - ''https://gerrit.wikimedia.org/r/1200080 (owner: ''Andrew Bogott)'
2025-10-30 14:37:04 <logmsgbot> !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dborch1001.wikimedia.org
2025-10-30 14:39:07 <wikibugs> ('CR) ''CI reject: [V:''-1] cloud-vps pdns: Don't install (or use) default-mysql-client [puppet] - ''https://gerrit.wikimedia.org/r/1200080 (owner: ''Andrew Bogott)'
2025-10-30 14:39:35 <logmsgbot> !log marostegui@cumin1003 dbctl commit (dc=all): 'db2195 (re)pooling @ 50%: 10', diff saved to https://phabricator.wikimedia.org/P84453 and previous config saved to /var/cache/conftool/dbconfig/20251030-143934-root.json
2025-10-30 14:39:35 <logmsgbot> !log fnegri@cumin1003 END (PASS) - Cookbook sre.wikireplicas.add-wiki (exit_code=0) for database pcmwikiquote (T408354)
2025-10-30 14:39:38 <wikibugs> ('CR) ''Brouberol: [V:''+1] "check experimental" [puppet] - ''https://gerrit.wikimedia.org/r/1199813 (https://phabricator.wikimedia.org/T408706) (owner: ''Brouberol)'
2025-10-30 14:39:43 <stashbot> T408354: [wikireplicas] Create views for new wiki pcmwikiquote - https://phabricator.wikimedia.org/T408354
2025-10-30 14:41:11 <wikibugs> ('PS3) ''Andrew Bogott: cloud-vps pdns: Don't install (or use) default-mysql-client [puppet] - ''https://gerrit.wikimedia.org/r/1200080'
2025-10-30 14:42:06 <wikibugs> ('PS3) ''Vgutierrez: haproxy,varnish: Report X-Is-Browser back from varnish [puppet] - ''https://gerrit.wikimedia.org/r/1199792 (https://phabricator.wikimedia.org/T398161)'
2025-10-30 14:44:01 <wikibugs> ('CR) ''Muehlenhoff: "This runs on the Cumin hosts, which are shared infrastructure, but the dbctl infrastructure is used by the DBAs, so I'll add them as revie" [puppet] - ''https://gerrit.wikimedia.org/r/1200074 (https://phabricator.wikimedia.org/T350694) (owner: ''Tiziano Fogli)'
2025-10-30 14:44:42 <wikibugs> ('CR) ''Andrew Bogott: "check experimental" [puppet] - ''https://gerrit.wikimedia.org/r/1200080 (owner: ''Andrew Bogott)'
2025-10-30 14:44:45 <logmsgbot> !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1172.eqiad.wmnet with reason: Maintenance
2025-10-30 14:44:53 <logmsgbot> !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1172 (T407997)', diff saved to https://phabricator.wikimedia.org/P84454 and previous config saved to /var/cache/conftool/dbconfig/20251030-144452-marostegui.json
2025-10-30 14:44:59 <stashbot> T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997
2025-10-30 14:47:55 <wikibugs> ('PS4) ''Andrew Bogott: cloud-vps pdns: Don't install (or use) default-mysql-client [puppet] - ''https://gerrit.wikimedia.org/r/1200080'
2025-10-30 14:47:59 <wikibugs> ('CR) ''Andrew Bogott: "check experimental" [puppet] - ''https://gerrit.wikimedia.org/r/1200080 (owner: ''Andrew Bogott)'
2025-10-30 14:48:26 <wikibugs> ('CR) ''Marostegui: [C:''+1] "This is fine, and pretty easy to generate an alert (it only goes to irc) so we can see how it works. Let me know if you want me to do so" [puppet] - ''https://gerrit.wikimedia.org/r/1200074 (https://phabricator.wikimedia.org/T350694) (owner: ''Tiziano Fogli)'
2025-10-30 14:50:03 <wikibugs> ('CR) ''Ladsgroup: [C:''+1] dbctl: enable nrpe2nodexp wrapper on dbctl_uncommitted_diffs [puppet] - ''https://gerrit.wikimedia.org/r/1200074 (https://phabricator.wikimedia.org/T350694) (owner: ''Tiziano Fogli)'
2025-10-30 14:50:56 <wikibugs> ('CR) ''Vgutierrez: [V:''+2] "varnishtests are happy for both text and upload" [puppet] - ''https://gerrit.wikimedia.org/r/1199792 (https://phabricator.wikimedia.org/T398161) (owner: ''Vgutierrez)'
2025-10-30 14:51:18 <wikibugs> ('CR) ''Andrew Bogott: [C:''+2] cloud-vps pdns: Don't install (or use) default-mysql-client [puppet] - ''https://gerrit.wikimedia.org/r/1200080 (owner: ''Andrew Bogott)'
2025-10-30 14:51:31 <logmsgbot> !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply
2025-10-30 14:51:40 <logmsgbot> !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply
2025-10-30 14:51:49 <wikibugs> ('CR) ''CDanis: [C:''+1] haproxy,varnish: Report X-Is-Browser back from varnish [puppet] - ''https://gerrit.wikimedia.org/r/1199792 (https://phabricator.wikimedia.org/T398161) (owner: ''Vgutierrez)'
2025-10-30 14:53:11 <wikibugs> 'SRE-swift-storage, ''Infrastructure-Foundations: UEFI installer not installing grub correctly (at least on systems where / is RAID) - https://phabricator.wikimedia.org/T404356#11327643 (''elukey) After a chat with Jesse it may be possible that this bug is a variant of what we have been chasing in T381919. Th...'
2025-10-30 14:54:24 <wikibugs> ('CR) ''Bking: [C:''+2] Add an opensearch-test-codfw namespace [puppet] - ''https://gerrit.wikimedia.org/r/1200010 (https://phabricator.wikimedia.org/T408779) (owner: ''Stevemunene)'
2025-10-30 14:54:24 <wikibugs> ('CR) ''Elukey: [C:''+1] sre.hosts.provision: move the switch config to parent class and run (''3 comments) [cookbooks] - ''https://gerrit.wikimedia.org/r/1198108 (https://phabricator.wikimedia.org/T265342) (owner: ''Cathal Mooney)'
2025-10-30 14:54:40 <wikibugs> ('PS1) ''STran: Deploy temporary accounts to enwiki [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1200083'
2025-10-30 14:54:40 <logmsgbot> !log marostegui@cumin1003 dbctl commit (dc=all): 'db2195 (re)pooling @ 75%: 10', diff saved to https://phabricator.wikimedia.org/P84455 and previous config saved to /var/cache/conftool/dbconfig/20251030-145440-root.json
2025-10-30 14:55:11 <wikibugs> ('PS1) ''CDanis: benthos webrequest: x-is-browser [puppet] - ''https://gerrit.wikimedia.org/r/1200084'
2025-10-30 14:55:34 <wikibugs> ('PS3) ''Gehel: WDQS: remove ferm rule for port 80 [puppet] - ''https://gerrit.wikimedia.org/r/1199849 (https://phabricator.wikimedia.org/T408736)'
2025-10-30 14:55:36 <wikibugs> ('PS1) ''Arlolra: Turn off GeoCrumbsUseParserOutputFallback [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1200085 (https://phabricator.wikimedia.org/T390236)'
2025-10-30 14:55:52 <wikibugs> ('CR) ''Elukey: "I am totally fine with this, do we know what monitors will be re-enabled? Just to be sure and avoid noise :)" [puppet] - ''https://gerrit.wikimedia.org/r/1200030 (https://phabricator.wikimedia.org/T381565) (owner: ''Muehlenhoff)'
2025-10-30 14:57:00 <wikibugs> ('PS1) ''Marostegui: db2170: Migration to MariaDB 10.11 [puppet] - ''https://gerrit.wikimedia.org/r/1200086 (https://phabricator.wikimedia.org/T407463)'
2025-10-30 14:57:30 <wikibugs> ('CR) ''Marostegui: [C:''+2] db2170: Migration to MariaDB 10.11 [puppet] - ''https://gerrit.wikimedia.org/r/1200086 (https://phabricator.wikimedia.org/T407463) (owner: ''Marostegui)'
2025-10-30 14:57:59 <wikibugs> ('CR) ''Brouberol: [V:''+1 C:''+2] global_config: stop relying on DNS to translate FQDNs into IP addresses [puppet] - ''https://gerrit.wikimedia.org/r/1199813 (https://phabricator.wikimedia.org/T408706) (owner: ''Brouberol)'
2025-10-30 14:58:15 <wikibugs> ('CR) ''Gehel: [C:''+2] WDQS: remove ferm rule for port 80 [puppet] - ''https://gerrit.wikimedia.org/r/1199849 (https://phabricator.wikimedia.org/T408736) (owner: ''Gehel)'
2025-10-30 14:58:27 <logmsgbot> !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2170.codfw.wmnet with reason: Maintenance
2025-10-30 14:58:32 <logmsgbot> !log marostegui@cumin1003 dbctl commit (dc=all): 'Depool db2170 for migration to mariadb 10.11', diff saved to https://phabricator.wikimedia.org/P84456 and previous config saved to /var/cache/conftool/dbconfig/20251030-145831-marostegui.json
2025-10-30 14:58:58 <logmsgbot> !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1172 (T407997)', diff saved to https://phabricator.wikimedia.org/P84457 and previous config saved to /var/cache/conftool/dbconfig/20251030-145857-marostegui.json
2025-10-30 14:59:03 <stashbot> T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997
2025-10-30 15:00:05 <jouncebot> dduvall and dancy: OwO what's this, a deployment window?? Train log triage. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251030T1500). nyaa~
2025-10-30 15:00:05 <gehel> marostegui: I think you have a pending puppet merge on puppetmaster, Feel free to merge mine at the same time (a removal of a ferm rule)
2025-10-30 15:00:16 <marostegui> gehel: No, I don't
2025-10-30 15:00:22 <marostegui> gehel: I think it is brouberol
2025-10-30 15:00:23 <brouberol> gehel: ok to merge the ferm rule change for wdqs?
2025-10-30 15:00:27 <marostegui> There we go!
2025-10-30 15:00:27 <gehel> sure
2025-10-30 15:00:31 <brouberol> (it's both of us)
2025-10-30 15:01:00 <gehel> the mariadb change has now disappeared!
2025-10-30 15:01:08 <marostegui> gehel: I think I was already pushing when you pinged me
2025-10-30 15:01:13 <marostegui> But your change wasn't there when I pushed :)
2025-10-30 15:02:06 <gehel> everything is now merged!
2025-10-30 15:02:44 <marostegui> yay!
2025-10-30 15:03:12 <wikibugs> ('CR) ''CDanis: [C:''+1] benthos::webrequest: Provide X-Is-Browser data [puppet] - ''https://gerrit.wikimedia.org/r/1199781 (owner: ''Vgutierrez)'
2025-10-30 15:04:02 <wikibugs> ('CR) ''Vgutierrez: [V:''+2 C:''+2] haproxy,varnish: Report X-Is-Browser back from varnish [puppet] - ''https://gerrit.wikimedia.org/r/1199792 (https://phabricator.wikimedia.org/T398161) (owner: ''Vgutierrez)'
2025-10-30 15:04:05 <wikibugs> ('Abandoned) ''CDanis: benthos webrequest: x-is-browser [puppet] - ''https://gerrit.wikimedia.org/r/1200084 (owner: ''CDanis)'
2025-10-30 15:05:09 <wikibugs> ('PS1) ''Tiziano Fogli: dotls: enable nrpe2nodexp wrapper on check_dotls [puppet] - ''https://gerrit.wikimedia.org/r/1200088 (https://phabricator.wikimedia.org/T384425)'
2025-10-30 15:05:09 <wikibugs> ('CR) ''Tiziano Fogli: "This change enables the nrpe2nodexp wrapper to export NRPE plugin results to Prometheus via the node exporter." [puppet] - ''https://gerrit.wikimedia.org/r/1200088 (https://phabricator.wikimedia.org/T384425) (owner: ''Tiziano Fogli)'
2025-10-30 15:05:54 <wikibugs> ('CR) ''Tiziano Fogli: [C:''+2] base: remove check_microcode (''1 comment) [puppet] - ''https://gerrit.wikimedia.org/r/1184447 (https://phabricator.wikimedia.org/T350694) (owner: ''Tiziano Fogli)'
2025-10-30 15:06:00 <wikibugs> ('CR) ''Tiziano Fogli: [C:''+2] dbctl: enable nrpe2nodexp wrapper on dbctl_uncommitted_diffs [puppet] - ''https://gerrit.wikimedia.org/r/1200074 (https://phabricator.wikimedia.org/T350694) (owner: ''Tiziano Fogli)'
2025-10-30 15:06:36 <logmsgbot> !log marostegui@cumin1003 dbctl commit (dc=all): 'db2170 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P84458 and previous config saved to /var/cache/conftool/dbconfig/20251030-150636-root.json
2025-10-30 15:07:04 <wikibugs> ('PS1) ''Daniel Kinzler: Note that per-route rate limits require Envoy 1.33 [deployment-charts] - ''https://gerrit.wikimedia.org/r/1200090'
2025-10-30 15:08:39 <wikibugs> ('CR) ''C. Scott Ananian: [C:''+1] Turn off GeoCrumbsUseParserOutputFallback [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1200085 (https://phabricator.wikimedia.org/T390236) (owner: ''Arlolra)'
2025-10-30 15:08:51 <jinxer-wm> FIRING: [4x] JobUnavailable: Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
2025-10-30 15:09:45 <jinxer-wm> FIRING: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
2025-10-30 15:09:46 <logmsgbot> !log marostegui@cumin1003 dbctl commit (dc=all): 'db2195 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P84459 and previous config saved to /var/cache/conftool/dbconfig/20251030-150946-root.json
2025-10-30 15:14:06 <logmsgbot> !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P84460 and previous config saved to /var/cache/conftool/dbconfig/20251030-151405-marostegui.json
2025-10-30 15:14:34 <wikibugs> ('CR) ''Muehlenhoff: "AFAICT currently no alerts would be issued for all the common base alerts (disk space, host down etc) and also not for the OSM replication" [puppet] - ''https://gerrit.wikimedia.org/r/1200030 (https://phabricator.wikimedia.org/T381565) (owner: ''Muehlenhoff)'
2025-10-30 15:16:22 <wikibugs> ('CR) ''Vgutierrez: [C:''+2] benthos::webrequest: Provide X-Is-Browser data [puppet] - ''https://gerrit.wikimedia.org/r/1199781 (owner: ''Vgutierrez)'
2025-10-30 15:16:59 <wikibugs> ('Abandoned) ''Andrew Bogott: dbutils::statement: add option to --skip-ssl [puppet] - ''https://gerrit.wikimedia.org/r/1199850 (owner: ''Andrew Bogott)'
2025-10-30 15:17:04 <wikibugs> ('Abandoned) ''Andrew Bogott: pdns_server::db_backups: --skip-ssl for db setup commands [puppet] - ''https://gerrit.wikimedia.org/r/1199851 (owner: ''Andrew Bogott)'
2025-10-30 15:18:51 <jinxer-wm> FIRING: [8x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
2025-10-30 15:21:42 <logmsgbot> !log marostegui@cumin1003 dbctl commit (dc=all): 'db2170 (re)pooling @ 50%: 10', diff saved to https://phabricator.wikimedia.org/P84461 and previous config saved to /var/cache/conftool/dbconfig/20251030-152141-root.json
2025-10-30 15:24:07 <logmsgbot> !log cmooney@cumin1003 START - Cookbook sre.hosts.provision for host sretest1005.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL
2025-10-30 15:24:19 <logmsgbot> !log cmooney@cumin1003 END (ERROR) - Cookbook sre.hosts.provision (exit_code=97) for host sretest1005.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL
2025-10-30 15:24:48 <wikibugs> ('CR) ''Cathal Mooney: "Re-tested with test-cookbook and working as expected. Thanks for the review!" [cookbooks] - ''https://gerrit.wikimedia.org/r/1198108 (https://phabricator.wikimedia.org/T265342) (owner: ''Cathal Mooney)'
2025-10-30 15:24:50 <wikibugs> ('CR) ''Cathal Mooney: [C:''+2] sre.hosts.provision: move the switch config to parent class and run [cookbooks] - ''https://gerrit.wikimedia.org/r/1198108 (https://phabricator.wikimedia.org/T265342) (owner: ''Cathal Mooney)'
2025-10-30 15:27:09 <wikibugs> ('PS1) ''Bking: opensearch-cluster: temporarily remove prometheus-related annotations from chart [deployment-charts] - ''https://gerrit.wikimedia.org/r/1200093 (https://phabricator.wikimedia.org/T362114)'
2025-10-30 15:28:08 <wikibugs> 'ops-codfw, ''SRE, ''DC-Ops, ''Traffic, ''Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#11327839 (''Jhancock.wm) ''In progress''Resolved a:''Jhancock.wm'
2025-10-30 15:29:14 <logmsgbot> !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P84462 and previous config saved to /var/cache/conftool/dbconfig/20251030-152913-marostegui.json
2025-10-30 15:30:35 <wikibugs> 'ops-eqiad, ''SRE, ''SRE-swift-storage, ''DC-Ops: Install new disk controllers to SM swift backends (eqiad) - https://phabricator.wikimedia.org/T400877#11327853 (''VRiley-WMF) Of course, I'm looking into this now.'
2025-10-30 15:31:15 <wikibugs> ('Merged) ''jenkins-bot: sre.hosts.provision: move the switch config to parent class and run [cookbooks] - ''https://gerrit.wikimedia.org/r/1198108 (https://phabricator.wikimedia.org/T265342) (owner: ''Cathal Mooney)'
2025-10-30 15:31:20 <logmsgbot> !log dancy@deploy2002 Installing scap version "4.221.0" for 165 host(s)
2025-10-30 15:32:29 <moritzm> !log installing openjdk-21 security updates
2025-10-30 15:32:33 <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2025-10-30 15:33:20 <wikibugs> ('CR) ''Elukey: [C:''+1] "Let's try :)" [puppet] - ''https://gerrit.wikimedia.org/r/1200030 (https://phabricator.wikimedia.org/T381565) (owner: ''Muehlenhoff)'
2025-10-30 15:33:51 <jinxer-wm> FIRING: [4x] JobUnavailable: Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
2025-10-30 15:34:23 <wikibugs> ('CR) ''Bking: [C:''+2] opensearch-cluster: temporarily remove prometheus-related annotations from chart [deployment-charts] - ''https://gerrit.wikimedia.org/r/1200093 (https://phabricator.wikimedia.org/T362114) (owner: ''Bking)'
2025-10-30 15:34:31 <moritzm> !log installing imagemagick security updates
2025-10-30 15:34:34 <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2025-10-30 15:35:09 <logmsgbot> !log dancy@deploy2002 Installation of scap version "4.221.0" completed for 165 hosts
2025-10-30 15:36:24 <logmsgbot> !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply
2025-10-30 15:36:34 <logmsgbot> !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply
2025-10-30 15:36:48 <logmsgbot> !log marostegui@cumin1003 dbctl commit (dc=all): 'db2170 (re)pooling @ 75%: 10', diff saved to https://phabricator.wikimedia.org/P84463 and previous config saved to /var/cache/conftool/dbconfig/20251030-153647-root.json
2025-10-30 15:37:10 <wikibugs> ('PS6) ''Bking: Add OpenSearch cluster configs for net-new clusters [deployment-charts] - ''https://gerrit.wikimedia.org/r/1198139 (https://phabricator.wikimedia.org/T357753)'
2025-10-30 15:37:28 <logmsgbot> !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply
2025-10-30 15:37:36 <logmsgbot> !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply
2025-10-30 15:38:51 <jinxer-wm> FIRING: [2x] CoreBGPDown: Core BGP session down between cloudsw1-b1-codfw and cloudservices2004-dev (172.20.5.8) - group cloud_host - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
2025-10-30 15:39:45 <jinxer-wm> RESOLVED: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
2025-10-30 15:44:22 <logmsgbot> !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1172 (T407997)', diff saved to https://phabricator.wikimedia.org/P84464 and previous config saved to /var/cache/conftool/dbconfig/20251030-154420-marostegui.json
2025-10-30 15:44:27 <stashbot> T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997
2025-10-30 15:44:27 <logmsgbot> !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1177.eqiad.wmnet with reason: Maintenance
2025-10-30 15:44:35 <logmsgbot> !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1177 (T407997)', diff saved to https://phabricator.wikimedia.org/P84465 and previous config saved to /var/cache/conftool/dbconfig/20251030-154434-marostegui.json
2025-10-30 15:50:40 <wikibugs> ('CR) ''Brouberol: [C:''+1] Add OpenSearch cluster configs for net-new clusters [deployment-charts] - ''https://gerrit.wikimedia.org/r/1198139 (https://phabricator.wikimedia.org/T357753) (owner: ''Bking)'
2025-10-30 15:51:20 <logmsgbot> !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply
2025-10-30 15:51:37 <logmsgbot> !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply
2025-10-30 15:51:54 <logmsgbot> !log marostegui@cumin1003 dbctl commit (dc=all): 'db2170 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P84466 and previous config saved to /var/cache/conftool/dbconfig/20251030-155153-root.json
2025-10-30 15:52:18 <wikibugs> ('CR) ''Brouberol: [C:''-1] "Don't merge yet:" [puppet] - ''https://gerrit.wikimedia.org/r/1200034 (https://phabricator.wikimedia.org/T403955) (owner: ''Stevemunene)'
2025-10-30 15:57:59 <logmsgbot> !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1177 (T407997)', diff saved to https://phabricator.wikimedia.org/P84467 and previous config saved to /var/cache/conftool/dbconfig/20251030-155758-marostegui.json
2025-10-30 15:58:04 <stashbot> T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997
2025-10-30 16:00:05 <jouncebot> jhathaway and moritzm: I, the Bot under the Fountain, call upon thee, The Deployer, to do Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251030T1600).
2025-10-30 16:00:05 <jouncebot> No Gerrit patches in the queue for this window AFAICS.
2025-10-30 16:02:53 <logmsgbot> !log eevans@cumin1003 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:aqs-codfw: Apply JVM upgrade to 11.0.29 - eevans@cumin1003
2025-10-30 16:04:10 <wikibugs> ('CR) ''Aaron Schulz: Route "/api/rest_v1/" requests with "?spec" query to the rest gateway (''1 comment) [puppet] - ''https://gerrit.wikimedia.org/r/1199886 (https://phabricator.wikimedia.org/T397203) (owner: ''Aaron Schulz)'
2025-10-30 16:06:03 <wikibugs> ('CR) ''Aaron Schulz: Route "/api/rest_v1/" requests with "?spec" query to the rest gateway (''1 comment) [puppet] - ''https://gerrit.wikimedia.org/r/1199886 (https://phabricator.wikimedia.org/T397203) (owner: ''Aaron Schulz)'
2025-10-30 16:06:31 <wikibugs> ('PS1) ''Andrew Bogott: labsaliaser: include python3-keystoneauth1 [puppet] - ''https://gerrit.wikimedia.org/r/1200095'
2025-10-30 16:09:12 <logmsgbot> !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply
2025-10-30 16:10:05 <logmsgbot> !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply
2025-10-30 16:12:03 <logmsgbot> !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply
2025-10-30 16:12:20 <logmsgbot> !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply
2025-10-30 16:12:35 <wikibugs> ('PS1) ''Andrew Bogott: cloudservices: include openstack client packages [puppet] - ''https://gerrit.wikimedia.org/r/1200096'
2025-10-30 16:12:52 <wikibugs> ('Abandoned) ''Andrew Bogott: labsaliaser: include python3-keystoneauth1 [puppet] - ''https://gerrit.wikimedia.org/r/1200095 (owner: ''Andrew Bogott)'
2025-10-30 16:12:56 <wikibugs> ('CR) ''Andrew Bogott: "check experimental" [puppet] - ''https://gerrit.wikimedia.org/r/1200096 (owner: ''Andrew Bogott)'
2025-10-30 16:13:07 <logmsgbot> !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P84468 and previous config saved to /var/cache/conftool/dbconfig/20251030-161306-marostegui.json
2025-10-30 16:13:32 <wikibugs> 'SRE, ''SRE-Access-Requests: Add dpogorzelski to ML and Data Platform posix groups - https://phabricator.wikimedia.org/T408579#11328180 (''Dzahn) >>! In T408579#11326218, @elukey wrote: > Hi Daniel! I think full access since the kerberos identity was requested :) I think there is still a misunderstanding her...'
2025-10-30 16:16:34 <wikibugs> ('CR) ''Andrew Bogott: [C:''+2] cloudservices: include openstack client packages [puppet] - ''https://gerrit.wikimedia.org/r/1200096 (owner: ''Andrew Bogott)'
2025-10-30 16:16:48 <logmsgbot> !log jasmine@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[2248-2267].codfw.wmnet
2025-10-30 16:16:56 <logmsgbot> !log jasmine@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[2248-2267].codfw.wmnet
2025-10-30 16:19:20 <logmsgbot> !log jasmine@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[2003-2004,2007-2010,2019-2032].codfw.wmnet
2025-10-30 16:21:36 <wikibugs> 'SRE, ''SRE-Access-Requests: Add dpogorzelski to ML and Data Platform posix groups - https://phabricator.wikimedia.org/T408579#11328259 (''elukey) My understanding is that asking a kerberos identity implies https://wikitech.wikimedia.org/wiki/Data_Platform/Data_access#All_of_the_above, what is the misundersta...'
2025-10-30 16:22:43 <logmsgbot> !log eevans@cumin1003 START - Cookbook sre.cassandra.roll-restart for nodes matching A:aqs-eqiad: Apply JVM upgrade to 11.0.29 - eevans@cumin1003
2025-10-30 16:24:18 <wikibugs> ('PS1) ''Andrew Bogott: pdns_server: rename 'master' to 'primary' [puppet] - ''https://gerrit.wikimedia.org/r/1200097'
2025-10-30 16:25:38 <wikibugs> ('PS2) ''Andrew Bogott: pdns_server: rename 'master' to 'primary' [puppet] - ''https://gerrit.wikimedia.org/r/1200097'
2025-10-30 16:25:45 <wikibugs> ('CR) ''Andrew Bogott: "check experimental" [puppet] - ''https://gerrit.wikimedia.org/r/1200097 (owner: ''Andrew Bogott)'
2025-10-30 16:26:29 <wikibugs> 'SRE, ''SRE-Access-Requests, ''Data-Engineering: Grant Access to analytics-privatedata-users for mvernon - https://phabricator.wikimedia.org/T408793#11328329 (''KOfori) This has my approval.'
2025-10-30 16:28:15 <logmsgbot> !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P84469 and previous config saved to /var/cache/conftool/dbconfig/20251030-162814-marostegui.json
2025-10-30 16:31:39 <wikibugs> ('PS1) ''Ottomata: page-analytics - bump image to get pageviews/v3/per_editor [deployment-charts] - ''https://gerrit.wikimedia.org/r/1200099 (https://phabricator.wikimedia.org/T405041)'
2025-10-30 16:32:38 <logmsgbot> !log jasmine@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[2003-2004,2007-2010,2019-2032].codfw.wmnet
2025-10-30 16:33:03 <wikibugs> ('CR) ''BryanDavis: "Andrew added the prometheus logging in I0443357a7e2abb5b48ea6d2f78053078dc3f68c8" [puppet] - ''https://gerrit.wikimedia.org/r/1199305 (https://phabricator.wikimedia.org/T408457) (owner: ''Majavah)'
2025-10-30 16:33:18 <wikibugs> ('CR) ''Ottomata: [C:''+2] page-analytics - bump image to get pageviews/v3/per_editor [deployment-charts] - ''https://gerrit.wikimedia.org/r/1200099 (https://phabricator.wikimedia.org/T405041) (owner: ''Ottomata)'
2025-10-30 16:33:51 <jinxer-wm> FIRING: [2x] JobUnavailable: Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
2025-10-30 16:34:14 <wikibugs> ('PS3) ''Aaron Schulz: Route "/api/rest_v1/" requests with "?spec" query to the rest gateway [puppet] - ''https://gerrit.wikimedia.org/r/1199886 (https://phabricator.wikimedia.org/T397203)'
2025-10-30 16:34:48 <wikibugs> 'ops-codfw, ''SRE, ''DC-Ops: Q4:rack/setup/install sretest2010 Config J 1P test host - https://phabricator.wikimedia.org/T394357#11328383 (''Jhancock.wm) i ran `reset /system1/pwrmgtsvc1` with a physical console up to observe. it didn't reboot for me. i powered it down manually and checked the insides aga...'
2025-10-30 16:35:17 <wikibugs> ('Merged) ''jenkins-bot: page-analytics - bump image to get pageviews/v3/per_editor [deployment-charts] - ''https://gerrit.wikimedia.org/r/1200099 (https://phabricator.wikimedia.org/T405041) (owner: ''Ottomata)'
2025-10-30 16:35:54 <jinxer-wm> FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
2025-10-30 16:39:09 <logmsgbot> !log otto@deploy2002 helmfile [staging] START helmfile.d/services/page-analytics: apply
2025-10-30 16:39:31 <logmsgbot> !log otto@deploy2002 helmfile [staging] DONE helmfile.d/services/page-analytics: apply
2025-10-30 16:40:54 <jinxer-wm> RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
2025-10-30 16:42:39 <logmsgbot> !log otto@deploy2002 helmfile [staging] START helmfile.d/services/page-analytics: apply
2025-10-30 16:42:45 <logmsgbot> !log otto@deploy2002 helmfile [staging] DONE helmfile.d/services/page-analytics: apply
2025-10-30 16:42:50 <logmsgbot> !log otto@deploy2002 helmfile [codfw] START helmfile.d/services/page-analytics: apply
2025-10-30 16:43:06 <logmsgbot> !log otto@deploy2002 helmfile [codfw] DONE helmfile.d/services/page-analytics: apply
2025-10-30 16:43:22 <logmsgbot> !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1177 (T407997)', diff saved to https://phabricator.wikimedia.org/P84470 and previous config saved to /var/cache/conftool/dbconfig/20251030-164322-marostegui.json
2025-10-30 16:43:27 <stashbot> T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997
2025-10-30 16:43:39 <logmsgbot> !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1178.eqiad.wmnet with reason: Maintenance
2025-10-30 16:43:47 <logmsgbot> !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1178 (T407997)', diff saved to https://phabricator.wikimedia.org/P84471 and previous config saved to /var/cache/conftool/dbconfig/20251030-164346-marostegui.json
2025-10-30 16:43:56 <logmsgbot> !log jasmine@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[2268-2287].codfw.wmnet
2025-10-30 16:44:05 <logmsgbot> !log jasmine@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[2268-2287].codfw.wmnet
2025-10-30 16:44:31 <wikibugs> ('CR) ''Brouberol: [C:''-1] "`" [deployment-charts] - ''https://gerrit.wikimedia.org/r/1198139 (https://phabricator.wikimedia.org/T357753) (owner: ''Bking)'
2025-10-30 16:44:57 <logmsgbot> !log jasmine@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[2040,2043,2045,2048,2052-2054,2063,2079-2084,2096-2101].codfw.wmnet
2025-10-30 16:45:07 <logmsgbot> !log otto@deploy2002 helmfile [eqiad] START helmfile.d/services/page-analytics: apply
2025-10-30 16:45:20 <logmsgbot> !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-ipoid: apply
2025-10-30 16:45:25 <logmsgbot> !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-ipoid: apply
2025-10-30 16:45:40 <logmsgbot> !log otto@deploy2002 helmfile [eqiad] DONE helmfile.d/services/page-analytics: apply
2025-10-30 16:48:51 <jinxer-wm> FIRING: [2x] JobUnavailable: Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
2025-10-30 16:49:53 <wikibugs> ('PS1) ''Cparle: Enable pagination on Special:Watchlist everywhere [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1200105 (https://phabricator.wikimedia.org/T41510)'
2025-10-30 16:51:58 <wikibugs> ('CR) ''ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, November 03 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1200105 (https://phabricator.wikimedia.org/T41510) (owner: ''Cparle)'
2025-10-30 16:54:55 <wikibugs> 'ops-eqiad, ''SRE, ''DC-Ops, ''serviceops: Q2:rack/setup/install wikikube-worker11XX - https://phabricator.wikimedia.org/T408749#11328564 (''Clement_Goubert)'
2025-10-30 16:57:11 <logmsgbot> !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1178 (T407997)', diff saved to https://phabricator.wikimedia.org/P84472 and previous config saved to /var/cache/conftool/dbconfig/20251030-165710-marostegui.json
2025-10-30 16:57:17 <stashbot> T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997
2025-10-30 16:58:25 <wikibugs> 'ops-eqiad, ''SRE, ''DC-Ops, ''procurement, ''serviceops: Q2:rack/setup/install wikikube-worker refresh - https://phabricator.wikimedia.org/T408760#11328592 (''Clement_Goubert)'
2025-10-30 16:58:56 <wikibugs> 'ops-eqiad, ''SRE, ''DC-Ops, ''serviceops: Q2:rack/setup/install wikikube-worker1335-59 - https://phabricator.wikimedia.org/T408752#11328593 (''Clement_Goubert)'
2025-10-30 16:59:28 <logmsgbot> !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-ipoid: apply
2025-10-30 16:59:33 <logmsgbot> !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-ipoid: apply
2025-10-30 16:59:41 <wikibugs> ('CR) ''ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, October 30 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1200085 (https://phabricator.wikimedia.org/T390236) (owner: ''Arlolra)'
2025-10-30 17:00:05 <jouncebot> bd808: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251030T1700).
2025-10-30 17:00:05 <jouncebot> swfrench-wmf: #bothumor My software never has bugs. It just develops random features. Rise for MediaWiki infrastructure (UTC late). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251030T1700).
2025-10-30 17:00:11 <swfrench-wmf> o/
2025-10-30 17:00:38 <logmsgbot> !log jasmine@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[2040,2043,2045,2048,2052-2054,2063,2079-2084,2096-2101].codfw.wmnet
2025-10-30 17:00:54 <jinxer-wm> FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
2025-10-30 17:01:33 <logmsgbot> !log jasmine@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[2288-2299].codfw.wmnet
2025-10-30 17:01:39 <logmsgbot> !log jasmine@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[2288-2299].codfw.wmnet
2025-10-30 17:02:16 <logmsgbot> !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-ipoid: apply
2025-10-30 17:02:22 <logmsgbot> !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-ipoid: apply
2025-10-30 17:02:26 <wikibugs> ('CR) ''TrainBranchBot: [C:''+2] "Approved by swfrench@deploy2002 using scap backport" [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1199836 (https://phabricator.wikimedia.org/T405955) (owner: ''Scott French)'
2025-10-30 17:02:27 <logmsgbot> !log jasmine@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[2230-2241].codfw.wmnet
2025-10-30 17:02:40 <jinxer-wm> FIRING: KubernetesRsyslogDown: rsyslog on wikikube-worker2028:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker2028 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
2025-10-30 17:03:15 <wikibugs> ('Merged) ''jenkins-bot: Enroll 50% of client sessions in PHP 8.3 [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1199836 (https://phabricator.wikimedia.org/T405955) (owner: ''Scott French)'
2025-10-30 17:03:48 <logmsgbot> !log swfrench@deploy2002 Started scap sync-world: Backport for [[gerrit:1199836|Enroll 50% of client sessions in PHP 8.3 (T405955)]]
2025-10-30 17:03:52 <stashbot> T405955: MediaWiki on PHP 8.3 production workload migration - https://phabricator.wikimedia.org/T405955
2025-10-30 17:05:57 <bd808> nothing for my deploy window this week. The only changes in developer-portal were translation file noise from a MediaWiki major version bump at TWN.
2025-10-30 17:07:25 <wikibugs> 'ops-eqiad, ''SRE, ''DBA, ''DC-Ops, ''decommission-hardware: decommission es1031.eqiad.wmnet - https://phabricator.wikimedia.org/T408600#11328646 (''VRiley-WMF)'
2025-10-30 17:07:37 <wikibugs> 'ops-eqiad, ''SRE, ''DBA, ''DC-Ops, ''decommission-hardware: decommission es1031.eqiad.wmnet - https://phabricator.wikimedia.org/T408600#11328647 (''VRiley-WMF) ''Open''Resolved This is completed.'
2025-10-30 17:08:38 <logmsgbot> !log swfrench@deploy2002 swfrench: Backport for [[gerrit:1199836|Enroll 50% of client sessions in PHP 8.3 (T405955)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
2025-10-30 17:08:51 <jinxer-wm> FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
2025-10-30 17:09:22 <logmsgbot> !log jasmine@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[2230-2241].codfw.wmnet
2025-10-30 17:10:22 <wikibugs> ('PS1) ''Kosta Harlan: EventBus: Enable TYPE_EVENT for loginwiki [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1200111 (https://phabricator.wikimedia.org/T408701)'
2025-10-30 17:11:15 <wikibugs> ('CR) ''ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, October 30 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1200111 (https://phabricator.wikimedia.org/T408701) (owner: ''Kosta Harlan)'
2025-10-30 17:11:50 <logmsgbot> !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-ipoid: apply
2025-10-30 17:11:55 <logmsgbot> !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-ipoid: apply
2025-10-30 17:12:01 <logmsgbot> !log jasmine@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[2300-2319].codfw.wmnet
2025-10-30 17:12:01 <wikibugs> ('PS3) ''Dzahn: aptrepo::staging: add job to clear incoming folder [puppet] - ''https://gerrit.wikimedia.org/r/1199243 (https://phabricator.wikimedia.org/T408527) (owner: ''Jelto)'
2025-10-30 17:12:08 <wikibugs> ('CR) ''Dzahn: aptrepo::staging: add job to clear incoming folder (''1 comment) [puppet] - ''https://gerrit.wikimedia.org/r/1199243 (https://phabricator.wikimedia.org/T408527) (owner: ''Jelto)'
2025-10-30 17:12:09 <logmsgbot> !log jasmine@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[2300-2319].codfw.wmnet
2025-10-30 17:12:13 <logmsgbot> !log swfrench@deploy2002 swfrench: Continuing with sync
2025-10-30 17:12:13 <dduvall> jouncebot: now
2025-10-30 17:12:13 <jouncebot> For the next 0 hour(s) and 47 minute(s): Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251030T1700)
2025-10-30 17:12:13 <jouncebot> For the next 0 hour(s) and 47 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251030T1700)
2025-10-30 17:12:19 <logmsgbot> !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P84473 and previous config saved to /var/cache/conftool/dbconfig/20251030-171218-marostegui.json
2025-10-30 17:12:46 <logmsgbot> !log jasmine@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[2320-2330].codfw.wmnet
2025-10-30 17:12:51 <logmsgbot> !log jasmine@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[2320-2330].codfw.wmnet
2025-10-30 17:13:22 <swfrench-wmf> dduvall: I have a deployment in flight, then I'll need to do some manual capacity tuning on two mediawiki services, but there should be some time afterward left in the window if you need it
2025-10-30 17:13:51 <jinxer-wm> RESOLVED: [2x] JobUnavailable: Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
2025-10-30 17:14:03 <dduvall> swfrench-wmf: perfect, thanks. group1 was rolled back yesterday so i'm hoping to get it back out a little early before rolling to all wikis
2025-10-30 17:14:03 <wikibugs> ('PS1) ''Jcrespo: Fix unit tests that had been broken (but only were detected on trixie) [software/transferpy] - ''https://gerrit.wikimedia.org/r/1200112'
2025-10-30 17:14:21 <logmsgbot> !log jasmine@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[2116-2123,2216-2230].codfw.wmnet
2025-10-30 17:15:54 <jinxer-wm> RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
2025-10-30 17:16:53 <wikibugs> 'ops-eqiad, ''SRE, ''DC-Ops: Audit Eqiad Patch panels for variance from Netbox - https://phabricator.wikimedia.org/T408197#11328747 (''VRiley-WMF) PP:0000:103234 - Has no additional interconnects PP:000:1268259 - Has 23324916, 23324917, 23324918 and 23324919'
2025-10-30 17:17:00 <swfrench-wmf> dduvall: sounds good. I'll let you know when I'm done
2025-10-30 17:20:32 <logmsgbot> !log swfrench@deploy2002 Finished scap sync-world: Backport for [[gerrit:1199836|Enroll 50% of client sessions in PHP 8.3 (T405955)]] (duration: 16m 44s)
2025-10-30 17:20:38 <stashbot> T405955: MediaWiki on PHP 8.3 production workload migration - https://phabricator.wikimedia.org/T405955
2025-10-30 17:23:08 <wikibugs> ('CR) ''CDanis: add discovery records for gerrit as CNAMEs to public names (''2 comments) [dns] - ''https://gerrit.wikimedia.org/r/1199486 (https://phabricator.wikimedia.org/T365259) (owner: ''Dzahn)'
2025-10-30 17:24:46 <wikibugs> ('CR) ''Scott French: [C:''+2] mw-(api-int|jobrunner): serve 25% of traffic on PHP 8.3 [deployment-charts] - ''https://gerrit.wikimedia.org/r/1199837 (https://phabricator.wikimedia.org/T405955) (owner: ''Scott French)'
2025-10-30 17:25:04 <wikibugs> ('PS3) ''Dzahn: add discovery records for gerrit as CNAMEs to public names [dns] - ''https://gerrit.wikimedia.org/r/1199486 (https://phabricator.wikimedia.org/T365259)'
2025-10-30 17:25:10 <wikibugs> ('CR) ''Dzahn: add discovery records for gerrit as CNAMEs to public names (''2 comments) [dns] - ''https://gerrit.wikimedia.org/r/1199486 (https://phabricator.wikimedia.org/T365259) (owner: ''Dzahn)'
2025-10-30 17:26:28 <wikibugs> ('Merged) ''jenkins-bot: mw-(api-int|jobrunner): serve 25% of traffic on PHP 8.3 [deployment-charts] - ''https://gerrit.wikimedia.org/r/1199837 (https://phabricator.wikimedia.org/T405955) (owner: ''Scott French)'
2025-10-30 17:27:20 <wikibugs> ('CR) ''Muehlenhoff: [C:''+1] "LGTM" [puppet] - ''https://gerrit.wikimedia.org/r/1199243 (https://phabricator.wikimedia.org/T408527) (owner: ''Jelto)'
2025-10-30 17:27:24 <logmsgbot> !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply
2025-10-30 17:27:26 <logmsgbot> !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P84474 and previous config saved to /var/cache/conftool/dbconfig/20251030-172726-marostegui.json
2025-10-30 17:27:30 <logmsgbot> !log jasmine@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[2116-2123,2216-2230].codfw.wmnet
2025-10-30 17:27:31 <logmsgbot> !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply
2025-10-30 17:27:44 <wikibugs> ('PS1) ''Clément Goubert: site.pp: Add new wikikube insetup hosts [puppet] - ''https://gerrit.wikimedia.org/r/1200116 (https://phabricator.wikimedia.org/T408749)'
2025-10-30 17:29:10 <logmsgbot> !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-jobrunner: apply
2025-10-30 17:29:25 <logmsgbot> !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-jobrunner: apply
2025-10-30 17:29:32 <logmsgbot> !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply
2025-10-30 17:29:48 <logmsgbot> !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply
2025-10-30 17:30:08 <logmsgbot> !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-jobrunner: apply
2025-10-30 17:30:18 <logmsgbot> !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-jobrunner: apply
2025-10-30 17:30:26 <logmsgbot> !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply
2025-10-30 17:30:37 <logmsgbot> !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply
2025-10-30 17:30:54 <jinxer-wm> FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
2025-10-30 17:32:23 <wikibugs> ('CR) ''Vgutierrez: [C:''+1] Route "/api/rest_v1/" requests with "?spec" query to the rest gateway [puppet] - ''https://gerrit.wikimedia.org/r/1199886 (https://phabricator.wikimedia.org/T397203) (owner: ''Aaron Schulz)'
2025-10-30 17:35:01 <logmsgbot> !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-jobrunner: apply
2025-10-30 17:35:12 <logmsgbot> !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-jobrunner: apply
2025-10-30 17:35:17 <logmsgbot> !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply
2025-10-30 17:35:29 <logmsgbot> !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply
2025-10-30 17:35:52 <logmsgbot> !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-jobrunner: apply
2025-10-30 17:35:54 <jinxer-wm> RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
2025-10-30 17:36:00 <logmsgbot> !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-jobrunner: apply
2025-10-30 17:36:07 <logmsgbot> !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply
2025-10-30 17:36:08 <logmsgbot> !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply
2025-10-30 17:36:14 <logmsgbot> !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply
2025-10-30 17:36:30 <logmsgbot> !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply
2025-10-30 17:37:40 <wikibugs> 'SRE, ''Infrastructure-Foundations: megacli issues on Debian Trixie - https://phabricator.wikimedia.org/T408776#11328850 (''MoritzMuehlenhoff) ''Open''Resolved a:''MoritzMuehlenhoff I grabbed the megacli "source package" from the http://hwraid.le-vert.net/debian (written in brackets since it doesn'...'
2025-10-30 17:39:40 <wikibugs> ('CR) ''Aaron Schulz: Route "/api/rest_v1/" requests with "?spec" query to the rest gateway (''1 comment) [puppet] - ''https://gerrit.wikimedia.org/r/1199886 (https://phabricator.wikimedia.org/T397203) (owner: ''Aaron Schulz)'
2025-10-30 17:40:45 <wikibugs> ('CR) ''Dzahn: [C:''+2] aptrepo::staging: add job to clear incoming folder [puppet] - ''https://gerrit.wikimedia.org/r/1199243 (https://phabricator.wikimedia.org/T408527) (owner: ''Jelto)'
2025-10-30 17:41:04 <wikibugs> 'SRE, ''SRE-swift-storage, ''Infrastructure-Foundations: Key packages missing from trixie-wikimedia - https://phabricator.wikimedia.org/T407513#11328866 (''MoritzMuehlenhoff) megacli is not available, all the details in T407513. I've also imported the prometheus-statsd-exporter to trixie-wikimedia, so once...'
2025-10-30 17:42:02 <wikibugs> ('PS1) ''Kosta Harlan: hCaptcha: Enable 100% passive mode for edits on test2wiki [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1200120 (https://phabricator.wikimedia.org/T405586)'
2025-10-30 17:42:34 <logmsgbot> !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1178 (T407997)', diff saved to https://phabricator.wikimedia.org/P84475 and previous config saved to /var/cache/conftool/dbconfig/20251030-174233-marostegui.json
2025-10-30 17:42:41 <stashbot> T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997
2025-10-30 17:42:50 <logmsgbot> !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1192.eqiad.wmnet with reason: Maintenance
2025-10-30 17:42:58 <logmsgbot> !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1192 (T407997)', diff saved to https://phabricator.wikimedia.org/P84476 and previous config saved to /var/cache/conftool/dbconfig/20251030-174257-marostegui.json
2025-10-30 17:43:10 <swfrench-wmf> dduvall: I believe I'm done. all yours!
2025-10-30 17:43:23 <dduvall> swfrench-wmf: ty!
2025-10-30 17:45:56 <wikibugs> ('PS1) ''TrainBranchBot: group1 to 1.45.0-wmf.25 [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1200121 (https://phabricator.wikimedia.org/T405681)'
2025-10-30 17:45:58 <wikibugs> ('CR) ''TrainBranchBot: [C:''+2] "Initiated by dduvall@deploy2002" [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1200121 (https://phabricator.wikimedia.org/T405681) (owner: ''TrainBranchBot)'
2025-10-30 17:46:50 <logmsgbot> !log eevans@cumin1003 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:aqs-eqiad: Apply JVM upgrade to 11.0.29 - eevans@cumin1003
2025-10-30 17:47:11 <wikibugs> ('Merged) ''jenkins-bot: group1 to 1.45.0-wmf.25 [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1200121 (https://phabricator.wikimedia.org/T405681) (owner: ''TrainBranchBot)'
2025-10-30 17:48:51 <jinxer-wm> FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
2025-10-30 17:52:19 <wikibugs> ('CR) ''Samuel (WMF): [C:''+1] hCaptcha: Enable 100% passive mode for edits on test2wiki [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1200120 (https://phabricator.wikimedia.org/T405586) (owner: ''Kosta Harlan)'
2025-10-30 17:53:48 <logmsgbot> !log dduvall@deploy2002 rebuilt and synchronized wikiversions files: group1 to 1.45.0-wmf.25 refs T405681
2025-10-30 17:53:53 <stashbot> T405681: 1.45.0-wmf.25 deployment blockers - https://phabricator.wikimedia.org/T405681
2025-10-30 17:56:02 <logmsgbot> !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply
2025-10-30 17:56:08 <logmsgbot> !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply
2025-10-30 17:56:14 <logmsgbot> !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1192 (T407997)', diff saved to https://phabricator.wikimedia.org/P84477 and previous config saved to /var/cache/conftool/dbconfig/20251030-175611-marostegui.json
2025-10-30 17:56:19 <stashbot> T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997
2025-10-30 17:57:58 <wikibugs> 'ops-eqiad, ''SRE, ''DC-Ops: Audit Eqiad Patch panels for variance from Netbox - https://phabricator.wikimedia.org/T408197#11328938 (''Jclark-ctr) https://netbox.wikimedia.org/circuits/circuit-terminations/?site_id=6&sort=circuit We should go through each of these and verify the connections. @VRiley-WMF Th...'
2025-10-30 17:58:40 <wikibugs> ('PS1) ''TrainBranchBot: group2 to 1.45.0-wmf.25 [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1200123 (https://phabricator.wikimedia.org/T405681)'
2025-10-30 17:58:42 <wikibugs> ('CR) ''TrainBranchBot: [C:''+2] "Initiated by dduvall@deploy2002" [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1200123 (https://phabricator.wikimedia.org/T405681) (owner: ''TrainBranchBot)'
2025-10-30 17:59:36 <wikibugs> ('Merged) ''jenkins-bot: group2 to 1.45.0-wmf.25 [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1200123 (https://phabricator.wikimedia.org/T405681) (owner: ''TrainBranchBot)'
2025-10-30 18:00:05 <jouncebot> dduvall and dancy: OwO what's this, a deployment window?? MediaWiki train - Utc-7 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251030T1800). nyaa~
2025-10-30 18:05:05 <logmsgbot> !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host es2028.codfw.wmnet with OS trixie
2025-10-30 18:06:26 <logmsgbot> !log dduvall@deploy2002 rebuilt and synchronized wikiversions files: group2 to 1.45.0-wmf.25 refs T405681
2025-10-30 18:06:31 <stashbot> T405681: 1.45.0-wmf.25 deployment blockers - https://phabricator.wikimedia.org/T405681
2025-10-30 18:08:51 <jinxer-wm> FIRING: JobUnavailable: Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
2025-10-30 18:09:35 <dduvall> !log rolling back group2 from 1.45.0-wmf.25 to wmf.24 due to high rate of `PHP Deprecated: Asking for a replica from groups except dump/vslow is deprecated` errors
2025-10-30 18:09:37 <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2025-10-30 18:09:52 <dduvall> !log rolling back group2 from 1.45.0-wmf.25 to wmf.24 due to high rate of `PHP Deprecated: Asking for a replica from groups except dump/vslow is deprecated` errors (T405681)
2025-10-30 18:09:56 <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2025-10-30 18:10:06 <wikibugs> ('PS1) ''TrainBranchBot: group1 to 1.45.0-wmf.25 [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1200126 (https://phabricator.wikimedia.org/T405681)'
2025-10-30 18:10:09 <wikibugs> ('CR) ''TrainBranchBot: [C:''+2] "Initiated by dduvall@deploy2002" [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1200126 (https://phabricator.wikimedia.org/T405681) (owner: ''TrainBranchBot)'
2025-10-30 18:11:22 <logmsgbot> !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P84478 and previous config saved to /var/cache/conftool/dbconfig/20251030-181121-marostegui.json
2025-10-30 18:11:26 <wikibugs> ('Merged) ''jenkins-bot: group1 to 1.45.0-wmf.25 [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1200126 (https://phabricator.wikimedia.org/T405681) (owner: ''TrainBranchBot)'
2025-10-30 18:14:52 <wikibugs> ('PS2) ''BCornwall: ACMEChiefConfig: Automated MarkMonitor domain sync [puppet] - ''https://gerrit.wikimedia.org/r/1196777 (owner: ''Ncmonitor)'
2025-10-30 18:15:11 <wikibugs> ('PS1) ''Dzahn: admin: create user for Sherry Yang, no ssh key but analytics-privatedata [puppet] - ''https://gerrit.wikimedia.org/r/1200128 (https://phabricator.wikimedia.org/T408639)'
2025-10-30 18:15:27 <wikibugs> ('CR) ''CI reject: [V:''-1] admin: create user for Sherry Yang, no ssh key but analytics-privatedata [puppet] - ''https://gerrit.wikimedia.org/r/1200128 (https://phabricator.wikimedia.org/T408639) (owner: ''Dzahn)'
2025-10-30 18:15:53 <wikibugs> ('CR) ''BCornwall: [C:''+2] NCRedirRedirects: Automated MarkMonitor domain sync [puppet] - ''https://gerrit.wikimedia.org/r/1196776 (owner: ''Ncmonitor)'
2025-10-30 18:16:45 <wikibugs> ('PS3) ''BCornwall: NCRedirRedirects: Automated MarkMonitor domain sync [puppet] - ''https://gerrit.wikimedia.org/r/1196776 (owner: ''Ncmonitor)'
2025-10-30 18:17:09 <wikibugs> ('PS2) ''Dzahn: admin: create user for Sherry Yang, no ssh key but analytics-privatedata [puppet] - ''https://gerrit.wikimedia.org/r/1200128 (https://phabricator.wikimedia.org/T408639)'
2025-10-30 18:17:55 <wikibugs> ('CR) ''CI reject: [V:''-1] admin: create user for Sherry Yang, no ssh key but analytics-privatedata [puppet] - ''https://gerrit.wikimedia.org/r/1200128 (https://phabricator.wikimedia.org/T408639) (owner: ''Dzahn)'
2025-10-30 18:18:07 <logmsgbot> !log dduvall@deploy2002 rebuilt and synchronized wikiversions files: group1 to 1.45.0-wmf.25 refs T405681
2025-10-30 18:18:12 <stashbot> T405681: 1.45.0-wmf.25 deployment blockers - https://phabricator.wikimedia.org/T405681
2025-10-30 18:18:20 <wikibugs> ('PS4) ''BCornwall: NCRedirRedirects: Automated MarkMonitor domain sync [puppet] - ''https://gerrit.wikimedia.org/r/1196776 (owner: ''Ncmonitor)'
2025-10-30 18:18:45 <wikibugs> ('CR) ''BCornwall: [C:''+2] NCRedirRedirects: Automated MarkMonitor domain sync [puppet] - ''https://gerrit.wikimedia.org/r/1196776 (owner: ''Ncmonitor)'
2025-10-30 18:18:47 <wikibugs> ('CR) ''BCornwall: [C:''+2] ACMEChiefConfig: Automated MarkMonitor domain sync [puppet] - ''https://gerrit.wikimedia.org/r/1196777 (owner: ''Ncmonitor)'
2025-10-30 18:18:51 <wikibugs> ('CR) ''BCornwall: [V:''+2 C:''+2] NCRedirRedirects: Automated MarkMonitor domain sync [puppet] - ''https://gerrit.wikimedia.org/r/1196776 (owner: ''Ncmonitor)'
2025-10-30 18:19:42 <wikibugs> ('PS3) ''Dzahn: admin: upgrade user for Sherry Yang, no ssh key but analytics-privatedata [puppet] - ''https://gerrit.wikimedia.org/r/1200128 (https://phabricator.wikimedia.org/T408639)'
2025-10-30 18:22:39 <wikibugs> ('PS2) ''Andrea Denisse: alertmanager: Add dashboard and runbook for Slack alerts [puppet] - ''https://gerrit.wikimedia.org/r/1200124 (https://phabricator.wikimedia.org/T408145)'
2025-10-30 18:22:39 <wikibugs> ('CR) ''Andrea Denisse: "Hi folks, I tested this in Pontoon and I sent an alert to the #engineering-all channel." [puppet] - ''https://gerrit.wikimedia.org/r/1200124 (https://phabricator.wikimedia.org/T408145) (owner: ''Andrea Denisse)'
2025-10-30 18:23:54 <wikibugs> ('PS1) ''Zabe: Do not use special db group [extensions/Flow] (wmf/1.45.0-wmf.25) - ''https://gerrit.wikimedia.org/r/1200132 (https://phabricator.wikimedia.org/T408540)'
2025-10-30 18:26:29 <logmsgbot> !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P84479 and previous config saved to /var/cache/conftool/dbconfig/20251030-182629-marostegui.json
2025-10-30 18:28:16 <wikibugs> 'SRE, ''SRE-Access-Requests, ''Data-Engineering, ''LDAP-Access-Requests, ''Patch-For-Review: Grant Access to wmf LDAP and analytics-privatedata-users shell group for SherryYang-WMF - https://phabricator.wikimedia.org/T408639#11329123 (''Dzahn)'
2025-10-30 18:28:26 <jinxer-wm> FIRING: ProbeDown: Service wdqs1015:443 has failed probes (http_query_wikidata_org_ldf_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1015:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
2025-10-30 18:28:35 <wikibugs> 'SRE, ''SRE-Access-Requests, ''Data-Engineering: Add dpogorzelski to ML and Data Platform posix groups - https://phabricator.wikimedia.org/T408579#11329137 (''Dzahn)'
2025-10-30 18:32:40 <jinxer-wm> RESOLVED: KubernetesRsyslogDown: rsyslog on wikikube-worker2028:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker2028 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
2025-10-30 18:33:26 <jinxer-wm> RESOLVED: ProbeDown: Service wdqs1015:443 has failed probes (http_query_wikidata_org_ldf_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1015:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
2025-10-30 18:34:00 <jinxer-wm> FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
2025-10-30 18:34:07 <wikibugs> ('PS1) ''Bking: opensearch-cluster: stop hard-coding admin username [deployment-charts] - ''https://gerrit.wikimedia.org/r/1200135 (https://phabricator.wikimedia.org/T408012)'
2025-10-30 18:35:21 <wikibugs> ('PS6) ''Xcollazo: dumps: Release the new MW Content File Export. Deprecate legacy XML dumps. [puppet] - ''https://gerrit.wikimedia.org/r/1199783 (https://phabricator.wikimedia.org/T401022)'
2025-10-30 18:35:21 <wikibugs> ('CR) ''Xcollazo: "@joal@wikimedia.org, @brouberol@wikimedia.org, for your review." [puppet] - ''https://gerrit.wikimedia.org/r/1199783 (https://phabricator.wikimedia.org/T401022) (owner: ''Xcollazo)'
2025-10-30 18:35:38 <dduvall> zabe: thanks for triaging/fixing. would it make sense to deploy https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Flow/+/1200132 now or should i wait for a backport of https://gerrit.wikimedia.org/r/c/mediawiki/extensions/FlaggedRevs/+/1200133 ?
2025-10-30 18:36:32 <zabe> It is a deprecation warning. The fix can be backported to reduce log spam, but imo this doesn't have to block the train
2025-10-30 18:36:54 <wikibugs> ('PS1) ''Zabe: Do not use special db group [extensions/FlaggedRevs] (wmf/1.45.0-wmf.25) - ''https://gerrit.wikimedia.org/r/1200139 (https://phabricator.wikimedia.org/T408540)'
2025-10-30 18:37:20 <dduvall> we typically block on egregious amounts of logspam, but yeah, not always
2025-10-30 18:37:33 <zabe> Amir +2'ed the other patch
2025-10-30 18:37:39 <zabe> so we can backport both
2025-10-30 18:37:49 <dduvall> alright. i'll do both at the same time then. ty!
2025-10-30 18:38:35 <wikibugs> ('PS2) ''Bking: opensearch-cluster: stop hard-coding admin username [deployment-charts] - ''https://gerrit.wikimedia.org/r/1200135 (https://phabricator.wikimedia.org/T408012)'
2025-10-30 18:39:59 <wikibugs> ('CR) ''TrainBranchBot: [C:''+2] "Approved by dduvall@deploy2002 using scap backport" [extensions/FlaggedRevs] (wmf/1.45.0-wmf.25) - ''https://gerrit.wikimedia.org/r/1200139 (https://phabricator.wikimedia.org/T408540) (owner: ''Zabe)'
2025-10-30 18:40:00 <wikibugs> ('CR) ''TrainBranchBot: [C:''+2] "Approved by dduvall@deploy2002 using scap backport" [extensions/Flow] (wmf/1.45.0-wmf.25) - ''https://gerrit.wikimedia.org/r/1200132 (https://phabricator.wikimedia.org/T408540) (owner: ''Zabe)'
2025-10-30 18:41:07 <wikibugs> ('Merged) ''jenkins-bot: Do not use special db group [extensions/FlaggedRevs] (wmf/1.45.0-wmf.25) - ''https://gerrit.wikimedia.org/r/1200139 (https://phabricator.wikimedia.org/T408540) (owner: ''Zabe)'
2025-10-30 18:41:37 <logmsgbot> !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1192 (T407997)', diff saved to https://phabricator.wikimedia.org/P84480 and previous config saved to /var/cache/conftool/dbconfig/20251030-184136-marostegui.json
2025-10-30 18:41:42 <stashbot> T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997
2025-10-30 18:41:53 <logmsgbot> !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1203.eqiad.wmnet with reason: Maintenance
2025-10-30 18:42:01 <logmsgbot> !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1203 (T407997)', diff saved to https://phabricator.wikimedia.org/P84481 and previous config saved to /var/cache/conftool/dbconfig/20251030-184200-marostegui.json
2025-10-30 18:43:51 <jinxer-wm> RESOLVED: JobUnavailable: Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
2025-10-30 18:49:10 <jinxer-wm> FIRING: [2x] BFDdown: BFD session down between cr2-eqdfw and fe80::7a4f:9b00:174e:7c0c - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
2025-10-30 18:49:50 <wikibugs> ('CR) ''Bking: [C:''+2] opensearch-cluster: stop hard-coding admin username [deployment-charts] - ''https://gerrit.wikimedia.org/r/1200135 (https://phabricator.wikimedia.org/T408012) (owner: ''Bking)'
2025-10-30 18:50:01 <wikibugs> ('Merged) ''jenkins-bot: Do not use special db group [extensions/Flow] (wmf/1.45.0-wmf.25) - ''https://gerrit.wikimedia.org/r/1200132 (https://phabricator.wikimedia.org/T408540) (owner: ''Zabe)'
2025-10-30 18:50:40 <logmsgbot> !log dduvall@deploy2002 Started scap sync-world: Backport for [[gerrit:1200139|Do not use special db group (T408540)]], [[gerrit:1200132|Do not use special db group (T408540)]]
2025-10-30 18:50:45 <stashbot> T408540: PHP Deprecated: Asking for a replica from groups except dump/vslow is deprecated: watchlist [Called from Wikimedia\Rdbms\LoadBalancer::getConnectionInternal] - https://phabricator.wikimedia.org/T408540
2025-10-30 18:52:16 <wikibugs> ('PS7) ''Bking: Add OpenSearch cluster configs for net-new clusters [deployment-charts] - ''https://gerrit.wikimedia.org/r/1198139 (https://phabricator.wikimedia.org/T357753)'
2025-10-30 18:52:58 <logmsgbot> !log dduvall@deploy2002 zabe, dduvall: Backport for [[gerrit:1200139|Do not use special db group (T408540)]], [[gerrit:1200132|Do not use special db group (T408540)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
2025-10-30 18:53:50 <logmsgbot> !log dduvall@deploy2002 zabe, dduvall: Continuing with sync
2025-10-30 18:54:10 <jinxer-wm> RESOLVED: [2x] BFDdown: BFD session down between cr2-eqdfw and fe80::7a4f:9b00:174e:7c0c - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
2025-10-30 18:54:47 <wikibugs> ('PS1) ''Bking: dse-k8s: Create CNAME record for opensearch-ipoid-test [dns] - ''https://gerrit.wikimedia.org/r/1200145 (https://phabricator.wikimedia.org/T408012)'
2025-10-30 18:55:19 <logmsgbot> !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1203 (T407997)', diff saved to https://phabricator.wikimedia.org/P84482 and previous config saved to /var/cache/conftool/dbconfig/20251030-185518-marostegui.json
2025-10-30 18:55:24 <stashbot> T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997
2025-10-30 18:57:58 <wikibugs> ('CR) ''ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, October 30 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1200120 (https://phabricator.wikimedia.org/T405586) (owner: ''Kosta Harlan)'
2025-10-30 18:58:04 <wikibugs> ('CR) ''CDanis: [C:''+1] dse-k8s: Create CNAME record for opensearch-ipoid-test [dns] - ''https://gerrit.wikimedia.org/r/1200145 (https://phabricator.wikimedia.org/T408012) (owner: ''Bking)'
2025-10-30 18:58:04 <logmsgbot> !log dduvall@deploy2002 Finished scap sync-world: Backport for [[gerrit:1200139|Do not use special db group (T408540)]], [[gerrit:1200132|Do not use special db group (T408540)]] (duration: 07m 24s)
2025-10-30 18:58:10 <stashbot> T408540: PHP Deprecated: Asking for a replica from groups except dump/vslow is deprecated: watchlist [Called from Wikimedia\Rdbms\LoadBalancer::getConnectionInternal] - https://phabricator.wikimedia.org/T408540
2025-10-30 18:58:19 <wikibugs> ('PS1) ''Scott French: deployment_server: default to PHP 8.3 in mwscript-k8s [puppet] - ''https://gerrit.wikimedia.org/r/1200142 (https://phabricator.wikimedia.org/T405955)'
2025-10-30 18:59:42 <wikibugs> ('CR) ''Dzahn: [C:''+1] dse-k8s: Create CNAME record for opensearch-ipoid-test [dns] - ''https://gerrit.wikimedia.org/r/1200145 (https://phabricator.wikimedia.org/T408012) (owner: ''Bking)'
2025-10-30 18:59:48 <wikibugs> ('CR) ''Bking: [C:''+2] dse-k8s: Create CNAME record for opensearch-ipoid-test [dns] - ''https://gerrit.wikimedia.org/r/1200145 (https://phabricator.wikimedia.org/T408012) (owner: ''Bking)'
2025-10-30 19:01:08 <logmsgbot> !log bking@dns1004 START - running authdns-update
2025-10-30 19:02:01 <logmsgbot> !log bking@dns1004 END - running authdns-update
2025-10-30 19:02:32 <wikibugs> ('PS1) ''TrainBranchBot: group2 to 1.45.0-wmf.25 [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1200153 (https://phabricator.wikimedia.org/T405681)'
2025-10-30 19:02:34 <wikibugs> ('CR) ''TrainBranchBot: [C:''+2] "Initiated by dduvall@deploy2002" [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1200153 (https://phabricator.wikimedia.org/T405681) (owner: ''TrainBranchBot)'
2025-10-30 19:03:22 <wikibugs> ('Merged) ''jenkins-bot: group2 to 1.45.0-wmf.25 [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1200153 (https://phabricator.wikimedia.org/T405681) (owner: ''TrainBranchBot)'
2025-10-30 19:10:27 <logmsgbot> !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1203', diff saved to https://phabricator.wikimedia.org/P84483 and previous config saved to /var/cache/conftool/dbconfig/20251030-191026-marostegui.json
2025-10-30 19:12:25 <logmsgbot> !log dduvall@deploy2002 rebuilt and synchronized wikiversions files: group2 to 1.45.0-wmf.25 refs T405681
2025-10-30 19:12:29 <stashbot> T405681: 1.45.0-wmf.25 deployment blockers - https://phabricator.wikimedia.org/T405681
2025-10-30 19:13:20 <logmsgbot> !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudbackup1001-dev.eqiad.wmnet with OS trixie
2025-10-30 19:15:16 <wikibugs> ('PS5) ''Func: Revert "Adding Movepage-summary to wgForceUIMsgAsContentMsg to allow" [mediawiki-config] - ''https://gerrit.wikimedia.org/r/941424 (https://phabricator.wikimedia.org/T183848)'
2025-10-30 19:15:48 <wikibugs> ('CR) ''Bartosz Dziewoński: "I'll schedule this for deployment the next time I have something to deploy, if I don't forget." [mediawiki-config] - ''https://gerrit.wikimedia.org/r/941424 (https://phabricator.wikimedia.org/T183848) (owner: ''Func)'
2025-10-30 19:19:00 <jinxer-wm> FIRING: [8x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
2025-10-30 19:24:48 <wikibugs> ('CR) ''Andrea Denisse: [C:''+1] "LGTM, thank you!" [puppet] - ''https://gerrit.wikimedia.org/r/1200128 (https://phabricator.wikimedia.org/T408639) (owner: ''Dzahn)'
2025-10-30 19:25:34 <logmsgbot> !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1203', diff saved to https://phabricator.wikimedia.org/P84484 and previous config saved to /var/cache/conftool/dbconfig/20251030-192534-marostegui.json
2025-10-30 19:27:41 <logmsgbot> !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudbackup1001-dev.eqiad.wmnet with reason: host reimage
2025-10-30 19:32:31 <logmsgbot> !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudbackup1001-dev.eqiad.wmnet with reason: host reimage
2025-10-30 19:33:01 <wikibugs> ('CR) ''Dzahn: [C:''+2] admin: upgrade user for Sherry Yang, no ssh key but analytics-privatedata [puppet] - ''https://gerrit.wikimedia.org/r/1200128 (https://phabricator.wikimedia.org/T408639) (owner: ''Dzahn)'
2025-10-30 19:35:55 <wikibugs> 'SRE, ''SRE-Access-Requests, ''Data-Engineering, ''LDAP-Access-Requests, ''Patch-For-Review: Grant Access to wmf LDAP and analytics-privatedata-users shell group for SherryYang-WMF - https://phabricator.wikimedia.org/T408639#11329331 (''Dzahn) Hello @SherryYang-WMF give it max. 30 minutes from now and y...'
2025-10-30 19:36:51 <wikibugs> 'SRE, ''SRE-Access-Requests, ''Data-Engineering, ''LDAP-Access-Requests, ''Patch-For-Review: Grant Access to wmf LDAP and analytics-privatedata-users shell group for SherryYang-WMF - https://phabricator.wikimedia.org/T408639#11329332 (''Dzahn) ''Open''Resolved a:''Dzahn'
2025-10-30 19:38:51 <jinxer-wm> FIRING: [2x] CoreBGPDown: Core BGP session down between cloudsw1-b1-codfw and cloudservices2004-dev (172.20.5.8) - group cloud_host - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
2025-10-30 19:39:12 <wikibugs> ('PS1) ''Dzahn: admin: add mvernon to analytics-privatedata-users [puppet] - ''https://gerrit.wikimedia.org/r/1200159 (https://phabricator.wikimedia.org/T408793)'
2025-10-30 19:40:16 <jinxer-wm> FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid releases routed via main (k8s) 1s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
2025-10-30 19:40:42 <logmsgbot> !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1203 (T407997)', diff saved to https://phabricator.wikimedia.org/P84485 and previous config saved to /var/cache/conftool/dbconfig/20251030-194041-marostegui.json
2025-10-30 19:40:47 <stashbot> T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997
2025-10-30 19:40:58 <logmsgbot> !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1209.eqiad.wmnet with reason: Maintenance
2025-10-30 19:41:06 <logmsgbot> !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1209 (T407997)', diff saved to https://phabricator.wikimedia.org/P84486 and previous config saved to /var/cache/conftool/dbconfig/20251030-194105-marostegui.json
2025-10-30 19:45:16 <jinxer-wm> RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid releases routed via main (k8s) 1s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
2025-10-30 19:47:29 <logmsgbot> !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudbackup1001-dev.eqiad.wmnet with OS trixie
2025-10-30 19:50:20 <wikibugs> ('CR) ''Andrea Denisse: [C:''+1] "LGTM, thank you!" [puppet] - ''https://gerrit.wikimedia.org/r/1200159 (https://phabricator.wikimedia.org/T408793) (owner: ''Dzahn)'
2025-10-30 19:51:17 <wikibugs> ('CR) ''D3r1ck01: [C:''+2] "starting gate-and-submit ahead of backport window" [core] (wmf/1.45.0-wmf.25) - ''https://gerrit.wikimedia.org/r/1199854 (https://phabricator.wikimedia.org/T406170) (owner: ''D3r1ck01)'
2025-10-30 19:51:25 <wikibugs> ('CR) ''D3r1ck01: [C:''+2] "starting gate-and-submit ahead of backport window" [core] (wmf/1.45.0-wmf.25) - ''https://gerrit.wikimedia.org/r/1199856 (https://phabricator.wikimedia.org/T406170) (owner: ''D3r1ck01)'
2025-10-30 19:53:48 <logmsgbot> !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1209 (T407997)', diff saved to https://phabricator.wikimedia.org/P84487 and previous config saved to /var/cache/conftool/dbconfig/20251030-195347-marostegui.json
2025-10-30 19:53:54 <stashbot> T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997
2025-10-30 19:54:00 <wikibugs> ('CR) ''Dzahn: [C:''+2] admin: add mvernon to analytics-privatedata-users [puppet] - ''https://gerrit.wikimedia.org/r/1200159 (https://phabricator.wikimedia.org/T408793) (owner: ''Dzahn)'
2025-10-30 19:56:20 <wikibugs> ('CR) ''RLazarus: [C:''+1] deployment_server: default to PHP 8.3 in mwscript-k8s (''2 comments) [puppet] - ''https://gerrit.wikimedia.org/r/1200142 (https://phabricator.wikimedia.org/T405955) (owner: ''Scott French)'
2025-10-30 19:56:45 <wikibugs> 'SRE, ''SRE-Access-Requests, ''Data-Engineering, ''Patch-For-Review: Grant Access to analytics-privatedata-users for mvernon - https://phabricator.wikimedia.org/T408793#11329422 (''Dzahn) @MatthewVernon You have been added to the group. Give it the usual couple minutes for puppet to deploy it across the f...'
2025-10-30 20:00:05 <jouncebot> RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: Your horoscope predicts another UTC late backport window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251030T2000).
2025-10-30 20:00:05 <jouncebot> xSavitar, arlolra, and kostajh: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
2025-10-30 20:01:20 <xSavitar> o/
2025-10-30 20:01:28 <arlolra> hello
2025-10-30 20:02:11 <wikibugs> ('PS1) ''Dzahn: admin: add kerberos principal indication to mvernon [puppet] - ''https://gerrit.wikimedia.org/r/1200162 (https://phabricator.wikimedia.org/T408793)'
2025-10-30 20:02:41 <xSavitar> arlolra, do you want to do your config change first while my patches land? I +2'd them 10 mins ahead of the window to save us some time.
2025-10-30 20:02:57 <arlolra> sure
2025-10-30 20:03:02 <xSavitar> I can do mine after yours then kostajh takes it from there.
2025-10-30 20:03:06 <wikibugs> ('Merged) ''jenkins-bot: Stats: add getLabels() function [core] (wmf/1.45.0-wmf.25) - ''https://gerrit.wikimedia.org/r/1199854 (https://phabricator.wikimedia.org/T406170) (owner: ''D3r1ck01)'
2025-10-30 20:03:06 <xSavitar> arlolra, go for it.
2025-10-30 20:03:20 <wikibugs> ('CR) ''TrainBranchBot: [C:''+2] "Approved by arlolra@deploy2002 using scap backport" [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1200085 (https://phabricator.wikimedia.org/T390236) (owner: ''Arlolra)'
2025-10-30 20:03:46 <kostajh> Hi
2025-10-30 20:03:51 <xSavitar> kostajh, heh
2025-10-30 20:04:05 <kostajh> Sounds good
2025-10-30 20:04:10 <wikibugs> ('Merged) ''jenkins-bot: Turn off GeoCrumbsUseParserOutputFallback [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1200085 (https://phabricator.wikimedia.org/T390236) (owner: ''Arlolra)'
2025-10-30 20:04:28 <wikibugs> ('CR) ''Dzahn: [C:''+2] admin: add kerberos principal indication to mvernon [puppet] - ''https://gerrit.wikimedia.org/r/1200162 (https://phabricator.wikimedia.org/T408793) (owner: ''Dzahn)'
2025-10-30 20:04:54 <logmsgbot> !log arlolra@deploy2002 Started scap sync-world: Backport for [[gerrit:1200085|Turn off GeoCrumbsUseParserOutputFallback (T390236)]]
2025-10-30 20:04:59 <stashbot> T390236: Turn off GeoCrumbsUseParserOutputFallback in production - https://phabricator.wikimedia.org/T390236
2025-10-30 20:06:25 <wikibugs> ('Merged) ''jenkins-bot: Stats: have RunningTimer manage the initial label set [core] (wmf/1.45.0-wmf.25) - ''https://gerrit.wikimedia.org/r/1199856 (https://phabricator.wikimedia.org/T406170) (owner: ''D3r1ck01)'
2025-10-30 20:07:08 <logmsgbot> !log arlolra@deploy2002 arlolra: Backport for [[gerrit:1200085|Turn off GeoCrumbsUseParserOutputFallback (T390236)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
2025-10-30 20:07:41 <arlolra> xSavitar: hmm, so it seems I'm deploying your changes as well
2025-10-30 20:08:06 <arlolra> "The following are unexpected commits pulled from origin for /srv/mediawiki-staging/php-1.45.0-wmf.25"
2025-10-30 20:08:07 <xSavitar> I'm not sure
2025-10-30 20:08:30 <xSavitar> Oh! and you didn't supply the gerrit patches?
2025-10-30 20:08:39 <xSavitar> So scap will auto-detect even when not specified?
2025-10-30 20:08:55 <logmsgbot> !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1209', diff saved to https://phabricator.wikimedia.org/P84488 and previous config saved to /var/cache/conftool/dbconfig/20251030-200854-marostegui.json
2025-10-30 20:09:50 <xSavitar> arlolra, if it insists, go ahead and deploy both.
2025-10-30 20:09:51 <arlolra> I suppose. I clicked to see the diff
2025-10-30 20:09:53 <wikibugs> 'SRE, ''SRE-Access-Requests, ''Data-Engineering, ''Patch-For-Review: Grant Access to analytics-privatedata-users for mvernon - https://phabricator.wikimedia.org/T408793#11329460 (''Dzahn) @MatthewVernon I created the Kerberos principal for you per https://wikitech.wikimedia.org/wiki/Data_Platform/Systems/...'
2025-10-30 20:10:06 <xSavitar> arlolra, what does the diff say?
2025-10-30 20:10:14 <arlolra> It showed your patches
2025-10-30 20:10:15 <wikibugs> 'SRE, ''SRE-Access-Requests, ''Data-Engineering, ''Patch-For-Review: Grant Access to analytics-privatedata-users for mvernon - https://phabricator.wikimedia.org/T408793#11329461 (''Dzahn) ''Open''Resolved a:''Dzahn'
2025-10-30 20:10:22 <arlolra> And then there was another prompt
2025-10-30 20:10:23 <arlolra> "Continue with deployment (all patches will be deployed)? [y/N]:"
2025-10-30 20:10:39 <xSavitar> Okay, accept it, and roll it out
2025-10-30 20:10:54 <arlolra> You can see the interaction in the log
2025-10-30 20:11:07 <xSavitar> My patches are about prometheus metrics, shouldn't be too much to worry about
2025-10-30 20:11:18 <xSavitar> checks...
2025-10-30 20:11:22 <arlolra> So nothing for you to check on the testservers?
2025-10-30 20:11:56 <xSavitar> nothing
2025-10-30 20:12:05 <xSavitar> I checked the logs and saw `20:07:08 arlolra: Backport for [[gerrit:1200085|Turn off GeoCrumbsUseParserOutputFallback (T390236)]] synced to the testservers`
2025-10-30 20:12:05 <stashbot> T390236: Turn off GeoCrumbsUseParserOutputFallback in production - https://phabricator.wikimedia.org/T390236
2025-10-30 20:12:25 <xSavitar> Not sure if hidden in that will deploy the others but let's try.
2025-10-30 20:12:49 <arlolra> I mean the spiderpig log
2025-10-30 20:13:06 <xSavitar> I see `Continue with deployment (all patches will be deployed)? [y/N]:`
2025-10-30 20:13:08 <xSavitar> Yes, that's fine
2025-10-30 20:13:19 <xSavitar> We can deploy all of them, yes!
2025-10-30 20:13:38 <xSavitar> arlolra, I mean once you're done testing on mwdebug
2025-10-30 20:14:02 <xSavitar> There is really nothing to test on my side. I can only see once it rolls out if metrics are being logged again
2025-10-30 20:14:06 <logmsgbot> !log arlolra@deploy2002 arlolra: Continuing with sync
2025-10-30 20:14:15 <xSavitar> s/logged/sent
2025-10-30 20:18:20 <logmsgbot> !log arlolra@deploy2002 Finished scap sync-world: Backport for [[gerrit:1200085|Turn off GeoCrumbsUseParserOutputFallback (T390236)]] (duration: 13m 26s)
2025-10-30 20:18:26 <stashbot> T390236: Turn off GeoCrumbsUseParserOutputFallback in production - https://phabricator.wikimedia.org/T390236
2025-10-30 20:19:35 <xSavitar> arlolra, it seems to me like my changes were not actually deployed
2025-10-30 20:19:39 <jinxer-wm> FIRING: TransitBGPDown: Transit BGP session down between cr3-eqsin and Hurricane Electric (2001:de8:4::6939:1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=eqsin&var-device=cr3-eqsin:9804&var-bgp_group=Transit6&var-bgp_neighbor=Hurricane+Electric - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
2025-10-30 20:19:52 <xSavitar> So I'd like to actually try to deploy them and see if anything happens.
2025-10-30 20:20:09 <xSavitar> Maybe scap didn't do what it said it would do?
2025-10-30 20:20:19 <arlolra> That's somewhat surprising
2025-10-30 20:20:31 <arlolra> Has the train rolled out to group2 yet?
2025-10-30 20:20:44 <xSavitar> yes per https://versions.toolforge.org/
2025-10-30 20:21:09 <icinga-wm> PROBLEM - Host mr1-eqsin.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
2025-10-30 20:21:23 <xSavitar> arlolra, should I try? :)
2025-10-30 20:21:42 <arlolra> Sure, but it seems like a no-op
2025-10-30 20:21:48 <arlolra> Are you sure your patches work?
2025-10-30 20:22:12 <xSavitar> Yes, I tested them locally
2025-10-30 20:22:17 <dancy> xSavitar: Scap deploys all mediawiki config and code that is merged into a suitable train branch.
2025-10-30 20:22:28 <kostajh> ^ this
2025-10-30 20:22:36 <xSavitar> I'm looking at https://grafana-rw.wikimedia.org/d/000000067/resourceloader-module-builds?forceLogin=true&from=now-6M&orgId=1&timezone=utc&to=now&var-module=startup&viewPanel=panel-17 and it's not going up yet
2025-10-30 20:22:37 <kostajh> You had merged the patches, that's why they got picked up by arlolra's deploy
2025-10-30 20:23:09 <kostajh> Can I proceed with my patches?
2025-10-30 20:23:33 <xSavitar> kostajh, you can go ahead and I'll wait for a while.
2025-10-30 20:23:42 <xSavitar> dancy, okay, thanks!
2025-10-30 20:24:03 <logmsgbot> !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1209', diff saved to https://phabricator.wikimedia.org/P84489 and previous config saved to /var/cache/conftool/dbconfig/20251030-202402-marostegui.json
2025-10-30 20:24:39 <jinxer-wm> FIRING: [4x] TransitBGPDown: Transit BGP session down between cr2-eqsin and Hurricane Electric (103.231.152.47) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
2025-10-30 20:24:53 <wikibugs> ('PS6) ''Dzahn: site/role: create placeholder role/profile for tcpproxy [puppet] - ''https://gerrit.wikimedia.org/r/1198397 (https://phabricator.wikimedia.org/T408532)'
2025-10-30 20:24:59 <wikibugs> ('PS7) ''Dzahn: site/role: create placeholder role/profile for tcpproxy [puppet] - ''https://gerrit.wikimedia.org/r/1198397 (https://phabricator.wikimedia.org/T408532)'
2025-10-30 20:25:15 <wikibugs> ('CR) ''TrainBranchBot: [C:''+2] "Approved by kharlan@deploy2002 using scap backport" [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1200111 (https://phabricator.wikimedia.org/T408701) (owner: ''Kosta Harlan)'
2025-10-30 20:25:15 <wikibugs> ('CR) ''TrainBranchBot: [C:''+2] "Approved by kharlan@deploy2002 using scap backport" [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1200120 (https://phabricator.wikimedia.org/T405586) (owner: ''Kosta Harlan)'
2025-10-30 20:25:27 <xSavitar> dancy, would it be terrible if I try again after Kosta is done just in case? :)
2025-10-30 20:25:44 <dancy> That's fine.
2025-10-30 20:25:58 <xSavitar> dancy, okay, if I find anything unusual, I'll let you know, okay?
2025-10-30 20:26:04 <wikibugs> ('Merged) ''jenkins-bot: EventBus: Enable TYPE_EVENT for loginwiki [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1200111 (https://phabricator.wikimedia.org/T408701) (owner: ''Kosta Harlan)'
2025-10-30 20:26:07 <wikibugs> ('Merged) ''jenkins-bot: hCaptcha: Enable 100% passive mode for edits on test2wiki [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1200120 (https://phabricator.wikimedia.org/T405586) (owner: ''Kosta Harlan)'
2025-10-30 20:26:07 <dancy> Sounds good. I'll be around.
2025-10-30 20:26:11 <icinga-wm> RECOVERY - Host mr1-eqsin.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 232.15 ms
2025-10-30 20:26:13 <xSavitar> dancy, thank you
2025-10-30 20:26:24 <kostajh> I see one of your patches wasn't synced, xSavitar
2025-10-30 20:26:25 <wikibugs> ('PS8) ''Dzahn: site/role: create placeholder role/profile for tcpproxy [puppet] - ''https://gerrit.wikimedia.org/r/1198397 (https://phabricator.wikimedia.org/T408532)'
2025-10-30 20:26:29 <kostajh> xSavitar: https://spiderpig.wikimedia.org/jobs/840
2025-10-30 20:26:46 <kostajh> so https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1199856 will sync out now
2025-10-30 20:26:59 <xSavitar> interesting, I had a weird feeling, okay
2025-10-30 20:27:01 <xSavitar> please sync it
2025-10-30 20:27:14 <logmsgbot> !log eevans@cumin1003 START - Cookbook sre.cassandra.roll-restart for nodes matching A:restbase-codfw: Apply JVM upgrade to 11.0.29 - eevans@cumin1003
2025-10-30 20:27:27 <logmsgbot> !log kharlan@deploy2002 Started scap sync-world: Backport for [[gerrit:1200111|EventBus: Enable TYPE_EVENT for loginwiki (T408701)]], [[gerrit:1200120|hCaptcha: Enable 100% passive mode for edits on test2wiki (T405586)]]
2025-10-30 20:27:34 <stashbot> T408701: Enable event logging for the mediawiki.product_metrics.suggested_investigations_interaction stream on loginwiki - https://phabricator.wikimedia.org/T408701
2025-10-30 20:27:34 <stashbot> T405586: hCaptcha editing trial deployment tracker - https://phabricator.wikimedia.org/T405586
2025-10-30 20:27:56 <kostajh> xSavitar: IMO in general, it's better not to +2 things ahead of the deployment window, because it makes it more difficult to know what is getting synced out and when
2025-10-30 20:28:19 <xSavitar> kostajh, yes you're right. I was just about to write that to myself
2025-10-30 20:28:30 <wikibugs> ('PS9) ''Dzahn: site/role: create placeholder role/profile for tcpproxy [puppet] - ''https://gerrit.wikimedia.org/r/1198397 (https://phabricator.wikimedia.org/T408532)'
2025-10-30 20:28:53 <xSavitar> Because now the task doesn't have any trace of it being backported since scap will use the gerrit ID to log actions/activity to the task
2025-10-30 20:28:58 <xSavitar> notes...
2025-10-30 20:29:35 <logmsgbot> !log kharlan@deploy2002 kharlan: Backport for [[gerrit:1200111|EventBus: Enable TYPE_EVENT for loginwiki (T408701)]], [[gerrit:1200120|hCaptcha: Enable 100% passive mode for edits on test2wiki (T405586)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
2025-10-30 20:29:37 <mutante> you can still use !log manually and if you mention a task number it will add it there
2025-10-30 20:30:44 <xSavitar> mutante, thanks! Was just wondering: if scap autodetects changes while deploying something else and the deployer accepts to proceed, can those be logged to the autodetected tasks as well if the gerrit patch references them?
2025-10-30 20:32:02 <kostajh> xSavitar: we're on mwdebug now, if you want to verify your change
2025-10-30 20:32:20 <xSavitar> kostajh, nothing to verify for now. I'm fine
2025-10-30 20:32:50 <mutante> xSavitar: if logmsgbot can be made to say it, it will be logged by stashbot. other than that it sounds like a scap feature request I guess
2025-10-30 20:33:35 <xSavitar> mutante, ack! I'll file something tomorrow then let the RelEng experts decide if it's a good idea or not.
2025-10-30 20:33:53 <mutante> sounds good
2025-10-30 20:35:19 <logmsgbot> !log kharlan@deploy2002 kharlan: Continuing with sync
2025-10-30 20:39:10 <logmsgbot> !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1209 (T407997)', diff saved to https://phabricator.wikimedia.org/P84491 and previous config saved to /var/cache/conftool/dbconfig/20251030-203910-marostegui.json
2025-10-30 20:39:16 <stashbot> T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997
2025-10-30 20:39:26 <logmsgbot> !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1214.eqiad.wmnet with reason: Maintenance
2025-10-30 20:39:34 <logmsgbot> !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1214 (T407997)', diff saved to https://phabricator.wikimedia.org/P84492 and previous config saved to /var/cache/conftool/dbconfig/20251030-203933-marostegui.json
2025-10-30 20:39:35 <logmsgbot> !log kharlan@deploy2002 Finished scap sync-world: Backport for [[gerrit:1200111|EventBus: Enable TYPE_EVENT for loginwiki (T408701)]], [[gerrit:1200120|hCaptcha: Enable 100% passive mode for edits on test2wiki (T405586)]] (duration: 12m 08s)
2025-10-30 20:39:39 <jinxer-wm> FIRING: [4x] TransitBGPDown: Transit BGP session down between cr2-eqsin and Hurricane Electric (103.231.152.47) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
2025-10-30 20:39:46 <stashbot> T408701: Enable event logging for the mediawiki.product_metrics.suggested_investigations_interaction stream on loginwiki - https://phabricator.wikimedia.org/T408701
2025-10-30 20:39:47 <stashbot> T405586: hCaptcha editing trial deployment tracker - https://phabricator.wikimedia.org/T405586
2025-10-30 20:39:52 <kostajh> xSavitar: it's live
2025-10-30 20:40:14 <xSavitar> checks...
2025-10-30 20:41:24 <xSavitar> kostajh, yep, metrics are coming in again, thanks for sync 🙏🏽
2025-10-30 20:41:29 <icinga-wm> PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
2025-10-30 20:42:19 <icinga-wm> RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Fri 05 Dec 2025 08:25:21 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
2025-10-30 20:43:23 <wikibugs> ('CR) ''Andrew Bogott: [C:''+2] pdns_server: rename 'master' to 'primary' [puppet] - ''https://gerrit.wikimedia.org/r/1200097 (owner: ''Andrew Bogott)'
2025-10-30 20:43:37 <wikibugs> ('CR) ''Dzahn: [V:''+1 C:''+2] "applies all of the base stuff but only on node 1001 - https://puppet-compiler.wmflabs.org/output/1198397/7518/tcp-proxy1001.eqiad.wmnet/in" [puppet] - ''https://gerrit.wikimedia.org/r/1198397 (https://phabricator.wikimedia.org/T408532) (owner: ''Dzahn)'
2025-10-30 20:44:34 <kostajh> xSavitar: you're welcome!
2025-10-30 20:44:39 <jinxer-wm> RESOLVED: [4x] TransitBGPDown: Transit BGP session down between cr2-eqsin and Hurricane Electric (103.231.152.47) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown
2025-10-30 20:45:46 <wikibugs> 'SRE, ''collaboration-services, ''Traffic, ''Patch-For-Review, ''Release-Engineering-Team (Radar): Deploy a TCP proxy across all DCs - https://phabricator.wikimedia.org/T408532#11329606 (''Dzahn) config example kindly provided by Chris Danis: {P84490}'
2025-10-30 20:46:07 <wikibugs> 'SRE, ''envoy, ''serviceops, ''Patch-For-Review: Upgrade Envoy to v1.29.12 - https://phabricator.wikimedia.org/T403663#11329609 (''RLazarus) ''In progress''Resolved'
2025-10-30 20:47:12 <xSavitar> dancy, I filed https://phabricator.wikimedia.org/T408868 so that I don't forget. I can always improve the task if needed, but I just did a brain-dump right now. Thanks!
2025-10-30 20:47:23 <wikibugs> ('CR) ''Dzahn: [C:''+2] site/role: create placeholder role/profile for tcpproxy [puppet] - ''https://gerrit.wikimedia.org/r/1198397 (https://phabricator.wikimedia.org/T408532) (owner: ''Dzahn)'
2025-10-30 20:51:03 <logmsgbot> !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1214 (T407997)', diff saved to https://phabricator.wikimedia.org/P84493 and previous config saved to /var/cache/conftool/dbconfig/20251030-205102-marostegui.json
2025-10-30 20:51:08 <stashbot> T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997
2025-10-30 20:51:15 <wikibugs> ('CR) ''Scott French: P:cache:haproxy: introduce ua classes (''3 comments) [puppet] - ''https://gerrit.wikimedia.org/r/1199247 (https://phabricator.wikimedia.org/T408060) (owner: ''Fabfur)'
2025-10-30 20:55:13 <wikibugs> ('PS8) ''Bking: Add OpenSearch cluster configs for net-new clusters [deployment-charts] - ''https://gerrit.wikimedia.org/r/1198139 (https://phabricator.wikimedia.org/T357753)'
2025-10-30 20:55:54 <wikibugs> ('CR) ''Bking: "Record added in I09456d395dd57caa9a61ab2a86a9c9df163f995c" [deployment-charts] - ''https://gerrit.wikimedia.org/r/1198139 (https://phabricator.wikimedia.org/T357753) (owner: ''Bking)'
2025-10-30 20:58:53 <sbassett> Will the Web Team be using their deployment window in a few minutes for anything? If not, there’s a sec patch update I’d like to get out.
2025-10-30 20:59:00 <jinxer-wm> RESOLVED: [6x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
2025-10-30 21:00:05 <jouncebot> Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251030T2100)
2025-10-30 21:05:37 <logmsgbot> !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply
2025-10-30 21:05:51 <logmsgbot> !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply
2025-10-30 21:06:11 <logmsgbot> !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P84495 and previous config saved to /var/cache/conftool/dbconfig/20251030-210610-marostegui.json
2025-10-30 21:07:15 <wikibugs> ('PS1) ''Andrew Bogott: cloud-vps pdns recursor: include nagios_common::check_dns_query [puppet] - ''https://gerrit.wikimedia.org/r/1200170'
2025-10-30 21:09:00 <jinxer-wm> FIRING: [2x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
2025-10-30 21:09:41 <wikibugs> ('CR) ''Andrew Bogott: [C:''+2] cloud-vps pdns recursor: include nagios_common::check_dns_query [puppet] - ''https://gerrit.wikimedia.org/r/1200170 (owner: ''Andrew Bogott)'
2025-10-30 21:11:25 <wikibugs> ('PS1) ''Bking: opensearch-cluster: fix chart typo [deployment-charts] - ''https://gerrit.wikimedia.org/r/1200171 (https://phabricator.wikimedia.org/T408012)'
2025-10-30 21:12:50 <icinga-wm> RECOVERY - BFD status on cloudsw1-b1-codfw.mgmt is OK: UP: 10 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
2025-10-30 21:13:51 <jinxer-wm> RESOLVED: [2x] CoreBGPDown: Core BGP session down between cloudsw1-b1-codfw and cloudservices2004-dev (172.20.5.8) - group cloud_host - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
2025-10-30 21:17:10 <sbassett> !log Deployed updated security mitigation for T407131
2025-10-30 21:17:13 <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2025-10-30 21:21:18 <logmsgbot> !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P84496 and previous config saved to /var/cache/conftool/dbconfig/20251030-212117-marostegui.json
2025-10-30 21:25:32 <logmsgbot> !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudservices2005-dev.codfw.wmnet with OS trixie
2025-10-30 21:28:44 <wikibugs> ('PS1) ''Jdlrobson: Drop references to removed configuration [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1200173 (https://phabricator.wikimedia.org/T402470)'
2025-10-30 21:28:50 <icinga-wm> PROBLEM - BFD status on cloudsw1-b1-codfw.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
2025-10-30 21:29:00 <wikibugs> ('PS2) ''Jdlrobson: Drop references to removed Advanced mobile contribution configuration [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1200173 (https://phabricator.wikimedia.org/T402470)'
2025-10-30 21:29:05 <wikibugs> ('CR) ''Jdlrobson: [C:''-2] Drop references to removed Advanced mobile contribution configuration [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1200173 (https://phabricator.wikimedia.org/T402470) (owner: ''Jdlrobson)'
2025-10-30 21:33:51 <jinxer-wm> FIRING: [2x] JobUnavailable: Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
2025-10-30 21:33:51 <jinxer-wm> FIRING: [2x] CoreBGPDown: Core BGP session down between cloudsw1-b1-codfw and cloudservices2005-dev (172.20.5.9) - group cloud_host - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
2025-10-30 21:36:26 <logmsgbot> !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1214 (T407997)', diff saved to https://phabricator.wikimedia.org/P84497 and previous config saved to /var/cache/conftool/dbconfig/20251030-213625-marostegui.json
2025-10-30 21:36:32 <stashbot> T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997
2025-10-30 21:36:42 <logmsgbot> !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1226.eqiad.wmnet with reason: Maintenance
2025-10-30 21:36:50 <logmsgbot> !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1226 (T407997)', diff saved to https://phabricator.wikimedia.org/P84498 and previous config saved to /var/cache/conftool/dbconfig/20251030-213649-marostegui.json
2025-10-30 21:42:04 <logmsgbot> !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudservices2005-dev.codfw.wmnet with reason: host reimage
2025-10-30 21:44:31 <logmsgbot> !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudbackup1002-dev.eqiad.wmnet with OS trixie
2025-10-30 21:48:07 <logmsgbot> !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudservices2005-dev.codfw.wmnet with reason: host reimage
2025-10-30 21:48:09 <logmsgbot> !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1226 (T407997)', diff saved to https://phabricator.wikimedia.org/P84499 and previous config saved to /var/cache/conftool/dbconfig/20251030-214808-marostegui.json
2025-10-30 21:48:15 <stashbot> T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997
2025-10-30 21:49:00 <jinxer-wm> FIRING: [5x] SystemdUnitFailed: docker-reporter-kubernetes-dse_eqiad-images.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
2025-10-30 21:57:49 <logmsgbot> !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudbackup1002-dev.eqiad.wmnet with reason: host reimage
2025-10-30 21:58:04 <icinga-wm> PROBLEM - Host ms-be1090 is DOWN: PING CRITICAL - Packet loss = 100%
2025-10-30 22:03:17 <logmsgbot> !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P84500 and previous config saved to /var/cache/conftool/dbconfig/20251030-220316-marostegui.json
2025-10-30 22:04:15 <logmsgbot> !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudbackup1002-dev.eqiad.wmnet with reason: host reimage
2025-10-30 22:04:34 <icinga-wm> RECOVERY - Host ms-be1090 is UP: PING OK - Packet loss = 0%, RTA = 0.44 ms
2025-10-30 22:09:57 <wikibugs> ('PS2) ''Tim Starling: Enable ChangesListQuery partitioning on mediawikiwiki [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1199890 (https://phabricator.wikimedia.org/T403798)'
2025-10-30 22:09:57 <wikibugs> ('PS2) ''Tim Starling: Enable ChangesListQuery partitioning on enwiki and commonswiki [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1199891 (https://phabricator.wikimedia.org/T403798)'
2025-10-30 22:09:57 <wikibugs> ('PS2) ''Tim Starling: Enable ChangesListQuery partitioning on all wikis [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1199892 (https://phabricator.wikimedia.org/T403798)'
2025-10-30 22:10:52 <icinga-wm> RECOVERY - BFD status on cloudsw1-b1-codfw.mgmt is OK: UP: 10 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
2025-10-30 22:11:10 <TimStarling> any deployments going on?
2025-10-30 22:11:57 <wikibugs> 'ops-eqiad, ''SRE, ''SRE-swift-storage, ''DC-Ops: Install new disk controllers to SM swift backends (eqiad) - https://phabricator.wikimedia.org/T400877#11329915 (''VRiley-WMF) Hey @MatthewVernon I apologize about that. It seems the cable slipped out of the card while I was trying to diagnose the issue. It...'
2025-10-30 22:12:38 <wikibugs> 'ops-eqiad, ''SRE, ''DC-Ops: Unresponsive management for ms-be1090.mgmt:22 - https://phabricator.wikimedia.org/T408585#11329916 (''VRiley-WMF) ''Open''→Resolved closing duplicate.'
2025-10-30 22:13:51 <jinxer-wm> RESOLVED: [2x] CoreBGPDown: Core BGP session down between cloudsw1-b1-codfw and cloudservices2005-dev (172.20.5.9) - group cloud_host - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
2025-10-30 22:14:15 <wikibugs> ('CR) ''TrainBranchBot: [C:''+2] "Approved by tstarling@deploy2002 using scap backport" [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1199890 (https://phabricator.wikimedia.org/T403798) (owner: ''Tim Starling)'
2025-10-30 22:15:04 <wikibugs> ('Merged) ''jenkins-bot: Enable ChangesListQuery partitioning on mediawikiwiki [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1199890 (https://phabricator.wikimedia.org/T403798) (owner: ''Tim Starling)'
2025-10-30 22:15:37 <logmsgbot> !log tstarling@deploy2002 Started scap sync-world: Backport for [[gerrit:1199890|Enable ChangesListQuery partitioning on mediawikiwiki (T403798)]]
2025-10-30 22:15:43 <stashbot> T403798: Slow watchlist queries due to large and expensive temporary table construction - https://phabricator.wikimedia.org/T403798
2025-10-30 22:18:24 <logmsgbot> !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P84501 and previous config saved to /var/cache/conftool/dbconfig/20251030-221824-marostegui.json
2025-10-30 22:32:35 <logmsgbot> !log eevans@cumin1003 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:restbase-codfw: Apply JVM upgrade to 11.0.29 - eevans@cumin1003
2025-10-30 22:33:32 <logmsgbot> !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1226 (T407997)', diff saved to https://phabricator.wikimedia.org/P84502 and previous config saved to /var/cache/conftool/dbconfig/20251030-223331-marostegui.json
2025-10-30 22:33:37 <logmsgbot> !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on dbstore1009.eqiad.wmnet with reason: Maintenance
2025-10-30 22:33:37 <stashbot> T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table - https://phabricator.wikimedia.org/T407997
2025-10-30 22:34:00 <jinxer-wm> FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
2025-10-30 22:39:48 <jinxer-wm> FIRING: PuppetFailure: Puppet has failed on tcp-proxy1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
2025-10-30 22:42:04 <logmsgbot> !log tstarling@deploy2002 tstarling: Backport for [[gerrit:1199890|Enable ChangesListQuery partitioning on mediawikiwiki (T403798)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
2025-10-30 22:42:11 <stashbot> T403798: Slow watchlist queries due to large and expensive temporary table construction - https://phabricator.wikimedia.org/T403798
2025-10-30 22:42:40 <logmsgbot> !log tstarling@deploy2002 tstarling: Continuing with sync
2025-10-30 22:48:51 <jinxer-wm> FIRING: [2x] JobUnavailable: Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
2025-10-30 22:55:58 <logmsgbot> !log tstarling@deploy2002 Finished scap sync-world: Backport for [[gerrit:1199890|Enable ChangesListQuery partitioning on mediawikiwiki (T403798)]] (duration: 40m 21s)
2025-10-30 22:56:03 <stashbot> T403798: Slow watchlist queries due to large and expensive temporary table construction - https://phabricator.wikimedia.org/T403798
2025-10-30 22:56:56 <icinga-wm> PROBLEM - BFD status on cloudsw1-b1-codfw.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
2025-10-30 22:57:39 <jinxer-wm> FIRING: [2x] CoreBGPDown: Core BGP session down between cloudsw1-b1-codfw and cloudservices2005-dev (172.20.5.9) - group cloud_host - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
2025-10-30 22:59:56 <icinga-wm> RECOVERY - BFD status on cloudsw1-b1-codfw.mgmt is OK: UP: 10 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
2025-10-30 23:00:46 <logmsgbot> !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudservices2005-dev.codfw.wmnet with OS trixie
2025-10-30 23:02:39 <jinxer-wm> RESOLVED: [2x] CoreBGPDown: Core BGP session down between cloudsw1-b1-codfw and cloudservices2005-dev (172.20.5.9) - group cloud_host - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
2025-10-30 23:04:47 <logmsgbot> !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudbackup1002-dev.eqiad.wmnet with OS trixie
2025-10-30 23:19:58 <wikibugs> ('CR) ''TrainBranchBot: [C:''+2] "Approved by tstarling@deploy2002 using scap backport" [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1199891 (https://phabricator.wikimedia.org/T403798) (owner: ''Tim Starling)'
2025-10-30 23:20:46 <wikibugs> ('Merged) ''jenkins-bot: Enable ChangesListQuery partitioning on enwiki and commonswiki [mediawiki-config] - ''https://gerrit.wikimedia.org/r/1199891 (https://phabricator.wikimedia.org/T403798) (owner: ''Tim Starling)'
2025-10-30 23:21:07 <logmsgbot> !log tstarling@deploy2002 Started scap sync-world: Backport for [[gerrit:1199891|Enable ChangesListQuery partitioning on enwiki and commonswiki (T403798)]]
2025-10-30 23:21:12 <stashbot> T403798: Slow watchlist queries due to large and expensive temporary table construction - https://phabricator.wikimedia.org/T403798
2025-10-30 23:25:30 <logmsgbot> !log tstarling@deploy2002 tstarling: Backport for [[gerrit:1199891|Enable ChangesListQuery partitioning on enwiki and commonswiki (T403798)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
2025-10-30 23:27:43 <logmsgbot> !log tstarling@deploy2002 tstarling: Continuing with sync
2025-10-30 23:30:32 <wikibugs> ('PS1) ''Dzahn: site: fix regex for tcp-proxy to cover 1002 [puppet] - ''https://gerrit.wikimedia.org/r/1200188 (https://phabricator.wikimedia.org/T408532)'
2025-10-30 23:31:06 <wikibugs> ('CR) ''Dzahn: [C:''+2] site: fix regex for tcp-proxy to cover 1002 [puppet] - ''https://gerrit.wikimedia.org/r/1200188 (https://phabricator.wikimedia.org/T408532) (owner: ''Dzahn)'
2025-10-30 23:35:41 <logmsgbot> !log tstarling@deploy2002 Finished scap sync-world: Backport for [[gerrit:1199891|Enable ChangesListQuery partitioning on enwiki and commonswiki (T403798)]] (duration: 14m 33s)
2025-10-30 23:35:46 <stashbot> T403798: Slow watchlist queries due to large and expensive temporary table construction - https://phabricator.wikimedia.org/T403798
2025-10-30 23:36:01 <wikibugs> ('PS1) ''Dzahn: tcpproxy: set puppet7 and firewall provider to ferm for new role [puppet] - ''https://gerrit.wikimedia.org/r/1200189 (https://phabricator.wikimedia.org/T408532)'
2025-10-30 23:38:55 <wikibugs> ('CR) ''Dzahn: [C:''+2] tcpproxy: set puppet7 and firewall provider to ferm for new role [puppet] - ''https://gerrit.wikimedia.org/r/1200189 (https://phabricator.wikimedia.org/T408532) (owner: ''Dzahn)'
2025-10-30 23:48:48 <mutante> !log forward-fixing to puppet7 on tcp-proxy1001/1002 per T349619 T408532
2025-10-30 23:48:54 <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2025-10-30 23:48:55 <stashbot> T349619: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619
2025-10-30 23:48:55 <stashbot> T408532: Deploy a TCP proxy across all DCs - https://phabricator.wikimedia.org/T408532
2025-10-30 23:50:29 <icinga-wm> PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
2025-10-30 23:51:19 <icinga-wm> RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Fri 05 Dec 2025 08:25:21 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
2025-10-30 23:57:56 <wikibugs> ('PS1) ''Dzahn: tcpproxy: add config template [puppet] - ''https://gerrit.wikimedia.org/r/1200190 (https://phabricator.wikimedia.org/T408532)'

This page is generated from SQL logs; static txt files are also available for download.