[00:00:02] (Abandoned) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1175970 (owner: TrainBranchBot)
[00:08:39] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1176327
[00:08:39] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1176327 (owner: TrainBranchBot)
[00:11:52] SRE, SRE-Access-Requests: Requesting access to parsoidtest1001 and testreduce1002 for OSleger_WMF - https://phabricator.wikimedia.org/T401300#11067058 (Dzahn) @ssastry Unfortunately it's not that simple to answer. groups and their members: ` parsoid-test-roots: ssastry, arlolra, cscott, ihurbain, mbs...
[00:36:08] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1176327 (owner: TrainBranchBot)
[00:40:48] FIRING: PuppetZeroResources: Puppet has failed generate resources on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[00:45:30] ops-codfw, SRE, Data-Persistence, DC-Ops: Q1:rack/setup/install db224[5-8] - https://phabricator.wikimedia.org/T400213#11067067 (Papaul) @Jhancock.wm done
[00:48:45] SRE, SRE-Access-Requests: Requesting access to parsoidtest1001 and testreduce1002 for OSleger_WMF - https://phabricator.wikimedia.org/T401300#11067068 (ABreault-WMF) > Then we would only keep parsoid-test-roots and just make sure those group members match your actual team members. Sounds good. The curr...
[01:00:53] !log mwpresync@deploy1003 Started scap build-images: Publishing wmf/next image [01:12:17] !log mwpresync@deploy1003 Finished scap build-images: Publishing wmf/next image (duration: 11m 24s) [01:25:48] RESOLVED: PuppetZeroResources: Puppet has failed generate resources on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [01:31:15] FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [01:36:15] RESOLVED: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [01:37:15] FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=eqiad%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [01:37:58] (03CR) 10RLazarus: "Thanks for this extra context!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1171205 (https://phabricator.wikimedia.org/T398422) (owner: 10Phuedx) [01:41:30] RESOLVED: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=eqiad%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [01:51:15] FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [01:56:15] RESOLVED: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [02:00:32] PROBLEM - Disk space on an-worker1127 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/e 164610 MB (4% inode=99%): /var/lib/hadoop/data/g 166537 MB (4% inode=99%): /var/lib/hadoop/data/j 165382 MB (4% inode=99%): /var/lib/hadoop/data/c 156515 MB (4% inode=99%): /var/lib/hadoop/data/b 158361 MB (4% inode=99%): /var/lib/hadoop/data/l 157618 MB (4% inode=99%): /var/lib/hadoop/data/k 157931 MB (4% inode=99%): /var/lib/hadoop/data [02:00:32] 7 MB (4% inode=99%): /var/lib/hadoop/data/i 147263 MB (3% inode=99%): /var/lib/hadoop/data/m 157960 MB (4% inode=99%): /var/lib/hadoop/data/d 163381 MB (4% inode=99%): /var/lib/hadoop/data/h 159421 MB (4% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1127&var-datasource=eqiad+prometheus/ops [02:16:15] FIRING: 
MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[02:21:15] RESOLVED: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[03:09:31] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[03:17:31] (PS1) Dzahn: admin: create user osleger, add to parsoid-test-roots [puppet] - https://gerrit.wikimedia.org/r/1176336 (https://phabricator.wikimedia.org/T401300)
[03:18:12] (CR) CI reject: [V:-1] admin: create user osleger, add to parsoid-test-roots [puppet] - https://gerrit.wikimedia.org/r/1176336 (https://phabricator.wikimedia.org/T401300) (owner: Dzahn)
[03:19:40] (CR) Dzahn: [V:-1] ":DataTest::test_no_shell_user_has_entry_in_ldap_only" [puppet] - https://gerrit.wikimedia.org/r/1176336 (https://phabricator.wikimedia.org/T401300) (owner: Dzahn)
[03:21:27] (PS2) Dzahn: admin: upgrade user osleger, add to parsoid-test-roots [puppet] - https://gerrit.wikimedia.org/r/1176336 (https://phabricator.wikimedia.org/T401300)
[03:32:40] (PS1) Dzahn: admin: stop using groups parsoid-roots and parsoid-admin [puppet] - https://gerrit.wikimedia.org/r/1176337 (https://phabricator.wikimedia.org/T401300)
[03:35:55] (CR) Dzahn: "note: those groups 
reference RT tickets from before Phabricator but the tickets are still available :)" [puppet] - 10https://gerrit.wikimedia.org/r/1176337 (https://phabricator.wikimedia.org/T401300) (owner: 10Dzahn) [03:40:38] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to parsoidtest1001 and testreduce1002 for OSleger_WMF - https://phabricator.wikimedia.org/T401300#11067108 (10Dzahn) >>! In T401300#11067068, @ABreault-WMF wrote: >> Then we would only keep parsoid-test-roots and just make sure those group m... [03:47:34] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.restart [03:47:35] !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.wdqs.restart (exit_code=99) [03:47:45] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.restart [04:07:15] FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [04:07:19] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.wdqs.restart (exit_code=0) [04:12:15] FIRING: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [04:17:15] RESOLVED: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [04:18:15] FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - 
https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=eqiad%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [04:23:15] RESOLVED: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=eqiad%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [04:26:15] FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [04:27:30] FIRING: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [04:28:51] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.restart [04:31:15] RESOLVED: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [04:32:09] (03PS1) 10Dr0ptp4kt: Introduce v1 xLab / MPIC SLOs [puppet] - 10https://gerrit.wikimedia.org/r/1176343 (https://phabricator.wikimedia.org/T398869) [04:32:34] (03CR) 10CI reject: [V:04-1] Introduce v1 xLab / MPIC SLOs [puppet] - 10https://gerrit.wikimedia.org/r/1176343 (https://phabricator.wikimedia.org/T398869) (owner: 10Dr0ptp4kt) [04:37:15] FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[04:42:15] FIRING: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[04:44:35] (PS2) Dr0ptp4kt: Introduce v1 xLab / MPIC SLOs [puppet] - https://gerrit.wikimedia.org/r/1176343 (https://phabricator.wikimedia.org/T398869)
[04:45:01] (CR) CI reject: [V:-1] Introduce v1 xLab / MPIC SLOs [puppet] - https://gerrit.wikimedia.org/r/1176343 (https://phabricator.wikimedia.org/T398869) (owner: Dr0ptp4kt)
[04:49:04] (PS3) Dr0ptp4kt: Introduce v1 xLab / MPIC SLOs [puppet] - https://gerrit.wikimedia.org/r/1176343 (https://phabricator.wikimedia.org/T398869)
[05:02:15] RESOLVED: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[05:08:09] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:12:15] FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[05:16:47] !log ryankemper@cumin2002 END (PASS) - Cookbook 
sre.wdqs.restart (exit_code=0)
[05:17:11] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.restart
[05:17:15] FIRING: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[05:23:31] (CR) Dr0ptp4kt: "Okay, here's the SLO piece for our A/B testing system Experimentation Lab ("xLab") discussed at https://wikitech.wikimedia.org/wiki/SLO/Ex" [puppet] - https://gerrit.wikimedia.org/r/1176343 (https://phabricator.wikimedia.org/T398869) (owner: Dr0ptp4kt)
[05:25:32] (PS4) Dr0ptp4kt: Introduce v1 xLab / MPIC SLOs [puppet] - https://gerrit.wikimedia.org/r/1176343 (https://phabricator.wikimedia.org/T398869)
[05:27:15] RESOLVED: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[05:28:54] ops-eqiad, SRE, DC-Ops: Investigate whether we can add RAM to dumpsdata100[4-7] from any decommissioned hosts - https://phabricator.wikimedia.org/T401299#11067151 (VRiley-WMF) a:VRiley-WMF
[05:30:08] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.wdqs.restart (exit_code=0)
[05:32:15] FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=eqiad%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[05:36:15] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-transfer (T386098, transfer newly-reloaded data) xfer wikidata_main from wdqs1022.eqiad.wmnet -> wdqs1015.eqiad.wmnet w/ force delete existing files, repooling both afterwards
[05:36:19] T386098: Run a 
full data-reload on wdqs-main, wdqs-scholarly and wdqs to capture new blank node labels - https://phabricator.wikimedia.org/T386098 [05:36:21] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-transfer (T386098, transfer newly-reloaded data) xfer wikidata_main from wdqs2007.codfw.wmnet -> wdqs2021.codfw.wmnet w/ force delete existing files, repooling both afterwards [05:47:15] FIRING: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [05:51:04] PROBLEM - Backup freshness on backup1014 is CRITICAL: Stale: 3 (gerrit1003, ...), Fresh: 134 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [05:52:15] FIRING: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [05:57:15] FIRING: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [06:00:04] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250807T0600) [06:00:05] marostegui, Amir1, and federico3: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Primary database switchover deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250807T0600). 
[06:13:09] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:19:35] (03CR) 10Slyngshede: [C:03+1] "Looks good." [puppet] - 10https://gerrit.wikimedia.org/r/1176336 (https://phabricator.wikimedia.org/T401300) (owner: 10Dzahn) [06:22:15] RESOLVED: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [06:27:15] FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [06:32:15] RESOLVED: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [06:33:08] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T386098, transfer newly-reloaded data) xfer wikidata_main from wdqs2007.codfw.wmnet -> wdqs2021.codfw.wmnet w/ force delete existing files, repooling both afterwards [06:33:12] T386098: Run a full data-reload on wdqs-main, wdqs-scholarly and wdqs to capture new blank node labels - https://phabricator.wikimedia.org/T386098 [06:34:56] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T386098, transfer newly-reloaded data) xfer wikidata_main from wdqs1022.eqiad.wmnet -> wdqs1015.eqiad.wmnet w/ force delete existing files, 
repooling both afterwards [06:53:43] (03CR) 10Jforrester: "<3" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1176234 (https://phabricator.wikimedia.org/T386794) (owner: 10Genoveva Galarza) [07:00:05] Amir1, Urbanecm, and awight: Time to snap out of that daydream and deploy UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250807T0700). [07:00:05] No Gerrit patches in the queue for this window AFAICS. [07:09:31] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [07:12:42] (03PS1) 10MVernon: swift: add ms-be1091 to profile::swift::storagehosts: [puppet] - 10https://gerrit.wikimedia.org/r/1176357 [07:14:21] (03CR) 10MVernon: [C:03+2] swift: add ms-be1091 to profile::swift::storagehosts: [puppet] - 10https://gerrit.wikimedia.org/r/1176357 (owner: 10MVernon) [07:20:32] PROBLEM - Disk space on an-worker1127 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/e 158673 MB (4% inode=99%): /var/lib/hadoop/data/g 155374 MB (4% inode=99%): /var/lib/hadoop/data/j 152850 MB (4% inode=99%): /var/lib/hadoop/data/c 156622 MB (4% inode=99%): /var/lib/hadoop/data/b 157580 MB (4% inode=99%): /var/lib/hadoop/data/l 156198 MB (4% inode=99%): /var/lib/hadoop/data/k 157225 MB (4% inode=99%): /var/lib/hadoop/data [07:20:32] 9 MB (4% inode=99%): /var/lib/hadoop/data/i 153905 MB (4% inode=99%): /var/lib/hadoop/data/m 156462 MB (4% inode=99%): /var/lib/hadoop/data/d 154166 MB (4% inode=99%): /var/lib/hadoop/data/h 149144 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1127&var-datasource=eqiad+prometheus/ops [07:35:14] (03PS2) 10Hashar: gerrit: replicate repo renames as "gerrit2" application user [puppet] - 
10https://gerrit.wikimedia.org/r/1175122 (https://phabricator.wikimedia.org/T239693) [07:39:56] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1150.eqiad.wmnet with reason: Maintenance [07:41:17] (03CR) 10Stevemunene: [C:03+1] "lgtm!" [puppet] - 10https://gerrit.wikimedia.org/r/1176261 (https://phabricator.wikimedia.org/T396037) (owner: 10Brouberol) [07:43:00] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1157.eqiad.wmnet with reason: Maintenance [07:43:07] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1157 (T399728)', diff saved to https://phabricator.wikimedia.org/P80940 and previous config saved to /var/cache/conftool/dbconfig/20250807-074306-fceratto.json [07:43:10] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [07:43:15] (03CR) 10Brouberol: [C:03+2] Update the image tag associated with PG 15 [puppet] - 10https://gerrit.wikimedia.org/r/1176261 (https://phabricator.wikimedia.org/T396037) (owner: 10Brouberol) [07:47:15] FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=eqiad%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [07:48:04] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1157 (T399728)', diff saved to https://phabricator.wikimedia.org/P80941 and previous config saved to /var/cache/conftool/dbconfig/20250807-074803-fceratto.json [07:51:41] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/postgresql-airflow-ml: apply [07:51:45] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE 
helmfile.d/dse-k8s-services/postgresql-airflow-ml: apply [07:52:15] RESOLVED: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=eqiad%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [07:54:29] !log vriley@cumin1002 START - Cookbook sre.dns.netbox [07:57:54] (03CR) 10MVernon: [C:03+2] swift: remove ms-be106[1-3] [puppet] - 10https://gerrit.wikimedia.org/r/1176258 (https://phabricator.wikimedia.org/T391354) (owner: 10MVernon) [07:57:57] !log vriley@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt cloudcephosd1045 - vriley@cumin1002" [07:58:02] !log vriley@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt cloudcephosd1045 - vriley@cumin1002" [07:58:02] !log vriley@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [07:59:29] !log vriley@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host cloudcephosd1045 [07:59:45] !log vriley@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudcephosd1045 [08:00:05] hashar and brennen: Deploy window MediaWiki train - Utc-0+Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250807T0800) [08:00:21] 10ops-eqiad, 06SRE, 06DC-Ops: Q2:rack/setup/install cloudcephosd10[42-47] - https://phabricator.wikimedia.org/T378828#11067275 (10VRiley-WMF) [08:01:39] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/postgresql-airflow-dev: apply [08:01:43] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/postgresql-airflow-dev: apply [08:02:02] 
!log mvernon@cumin1003 START - Cookbook sre.hosts.decommission for hosts ms-be[1061-1063].eqiad.wmnet [08:02:31] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host cloudcephosd1045.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [08:03:11] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1157', diff saved to https://phabricator.wikimedia.org/P80942 and previous config saved to /var/cache/conftool/dbconfig/20250807-080311-fceratto.json [08:04:00] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/postgresql-airflow-analytics-test: apply [08:04:04] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/postgresql-airflow-analytics-test: apply [08:04:37] 10ops-eqiad, 10SRE-swift-storage, 06DC-Ops, 10decommission-hardware: decommission ms-be106[1-3].eqiad.wmnet - https://phabricator.wikimedia.org/T401368 (10MatthewVernon) 03NEW [08:06:01] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/postgresql-airflow-analytics-product: apply [08:06:04] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/postgresql-airflow-analytics-product: apply [08:06:09] 06SRE, 10SRE-swift-storage, 10Ceph, 13Patch-For-Review: Q4 object storage hardware tasks - https://phabricator.wikimedia.org/T391354#11067307 (10MatthewVernon) 05Open→03Resolved [08:06:25] FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-web-next_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:06:34] PROBLEM - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state 
[08:06:52] vriley@cumin1002 provision (PID 3324402) is awaiting input [08:08:17] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/postgresql-airflow-platform-eng: apply [08:08:21] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/postgresql-airflow-platform-eng: apply [08:09:44] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/postgresql-airflow-research: apply [08:09:47] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/postgresql-airflow-research: apply [08:11:15] FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [08:11:16] good morning. I am going to run the train! [08:11:41] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/postgresql-airflow-search: apply [08:11:45] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/postgresql-airflow-search: apply [08:11:53] 06SRE, 10LDAP-Access-Requests, 10Wikidata, 10Wikidata Omega Product: Grant Access to for - https://phabricator.wikimedia.org/T401118#11067320 (10Sadiya.Mohammed_WMDE) Hi. yes the name is Sadiya Halima Mohammed. 
Email: sadiya.mohammed@wikimedia.de
[08:13:31] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/postgresql-airflow-test-k8s: apply
[08:13:35] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/postgresql-airflow-test-k8s: apply
[08:14:15] (PS1) TrainBranchBot: group2 to 1.45.0-wmf.13 [mediawiki-config] - https://gerrit.wikimedia.org/r/1176430 (https://phabricator.wikimedia.org/T396374)
[08:14:17] (CR) TrainBranchBot: [C:+2] group2 to 1.45.0-wmf.13 [mediawiki-config] - https://gerrit.wikimedia.org/r/1176430 (https://phabricator.wikimedia.org/T396374) (owner: TrainBranchBot)
[08:14:58] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/postgresql-airflow-wmde: apply
[08:15:02] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/postgresql-airflow-wmde: apply
[08:15:06] (Merged) jenkins-bot: group2 to 1.45.0-wmf.13 [mediawiki-config] - https://gerrit.wikimedia.org/r/1176430 (https://phabricator.wikimedia.org/T396374) (owner: TrainBranchBot)
[08:15:28] ops-eqiad, SRE, SRE-swift-storage, DC-Ops: Install new disk controllers to SM swift backends (eqiad) - https://phabricator.wikimedia.org/T400877#11067331 (MatthewVernon)
[08:15:55] !log mvernon@cumin1003 START - Cookbook sre.dns.netbox
[08:16:15] RESOLVED: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate
[08:17:13] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/postgresql-airflow-main: apply
[08:17:15] FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - 
https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [08:17:17] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/postgresql-airflow-main: apply [08:18:19] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1157', diff saved to https://phabricator.wikimedia.org/P80943 and previous config saved to /var/cache/conftool/dbconfig/20250807-081818-fceratto.json [08:20:00] !log mvernon@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ms-be[1061-1063].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - mvernon@cumin1003" [08:21:00] those memcached errors are for kube-dumps [08:21:07] PROBLEM - Disk space on deploy1003 is CRITICAL: DISK CRITICAL - free space: /srv 8607 MB (3% inode=69%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=deploy1003&var-datasource=eqiad+prometheus/ops [08:21:11] so that would be something related to the wiki db dumps as I get it [08:21:15] that deplo1 [08:21:22] that deploy1003 disk space issue is hmm .. 
different :\ [08:21:30] RESOLVED: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [08:22:43] PROBLEM - ganeti-noded running on ganeti1032 is CRITICAL: PROCS CRITICAL: 3 processes with UID = 0 (root), command name ganeti-noded https://wikitech.wikimedia.org/wiki/Ganeti [08:23:04] mvernon@cumin1003 decommission (PID 966271) is awaiting input [08:23:43] RECOVERY - ganeti-noded running on ganeti1032 is OK: PROCS OK: 2 processes with UID = 0 (root), command name ganeti-noded https://wikitech.wikimedia.org/wiki/Ganeti [08:24:27] (03PS1) 10MVernon: swift: add 1 new codfw host, drain 3 [puppet] - 10https://gerrit.wikimedia.org/r/1176432 (https://phabricator.wikimedia.org/T382056) [08:25:06] !log mvernon@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ms-be[1061-1063].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - mvernon@cumin1003" [08:25:06] !log mvernon@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [08:25:07] !log mvernon@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ms-be[1061-1063].eqiad.wmnet [08:25:25] 06SRE, 10SRE-swift-storage, 10Ceph, 13Patch-For-Review: Q4 object storage hardware tasks - https://phabricator.wikimedia.org/T391354#11067370 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by mvernon@cumin1003 for hosts: `ms-be[1061-1063].eqiad.wmnet` - ms-be1061.eqiad.wmnet (**PASS**... 
[08:26:25] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-web-next_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:33:26] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1157 (T399728)', diff saved to https://phabricator.wikimedia.org/P80944 and previous config saved to /var/cache/conftool/dbconfig/20250807-083325-fceratto.json [08:33:30] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [08:33:33] vriley@cumin1002 provision (PID 3324402) is awaiting input [08:33:41] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1166.eqiad.wmnet with reason: Maintenance [08:33:49] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1166 (T399728)', diff saved to https://phabricator.wikimedia.org/P80945 and previous config saved to /var/cache/conftool/dbconfig/20250807-083348-fceratto.json [08:34:10] 13G /srv/homedirs/mwmaint2002 [08:34:11] 13.5G /srv/homedirs/mwmaint1002 [08:34:11] :) [08:34:16] no idea what those are [08:34:32] * hashar files a task [08:35:57] ah that is T397017 [08:35:58] T397017: Turn down mwmaint production servers - https://phabricator.wikimedia.org/T397017 [08:36:15] FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [08:37:17] !log brouberol@cumin1003 START - Cookbook sre.opensearch.roll-restart-reboot rolling restart_daemons on A:datahubsearch [08:38:48] !log fceratto@cumin1002 
dbctl commit (dc=all): 'Repooling after maintenance db1166 (T399728)', diff saved to https://phabricator.wikimedia.org/P80946 and previous config saved to /var/cache/conftool/dbconfig/20250807-083848-fceratto.json [08:38:52] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [08:41:15] RESOLVED: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [08:41:21] !log hashar@deploy1003 rebuilt and synchronized wikiversions files: group2 to 1.45.0-wmf.13 refs T396374 [08:41:25] T396374: 1.45.0-wmf.13 deployment blockers - https://phabricator.wikimedia.org/T396374 [08:43:15] FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [08:43:40] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1045.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [08:45:30] 10ops-eqiad, 06SRE, 06DC-Ops: Q2:rack/setup/install cloudcephosd10[42-47] - https://phabricator.wikimedia.org/T378828#11067457 (10VRiley-WMF) [08:45:41] !log brouberol@cumin1003 END (PASS) - Cookbook sre.opensearch.roll-restart-reboot (exit_code=0) rolling restart_daemons on A:datahubsearch [08:48:15] RESOLVED: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - 
https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [08:51:28] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host cloudcephosd1045.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [08:52:50] (03CR) 10David Caro: "This does not work for maintain-harbor, as the logs are not where you expect them to be:" [puppet] - 10https://gerrit.wikimedia.org/r/1113412 (https://phabricator.wikimedia.org/T383081) (owner: 10Raymond Ndibe) [08:53:31] (03CR) 10David Caro: "wait no, it's my lazy typing xd, typo in the mantian-harbor parameter :facepalm:" [puppet] - 10https://gerrit.wikimedia.org/r/1113412 (https://phabricator.wikimedia.org/T383081) (owner: 10Raymond Ndibe) [08:53:56] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P80947 and previous config saved to /var/cache/conftool/dbconfig/20250807-085355-fceratto.json [08:57:17] (03CR) 10David Caro: [C:03+2] [toolforge] persist target logs in /var/log/pods in journald [puppet] - 10https://gerrit.wikimedia.org/r/1113412 (https://phabricator.wikimedia.org/T383081) (owner: 10Raymond Ndibe) [09:03:39] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1045.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [09:04:40] !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1045.eqiad.wmnet with OS bullseye [09:04:55] 10ops-eqiad, 06SRE, 06DC-Ops: Q2:rack/setup/install cloudcephosd10[42-47] - https://phabricator.wikimedia.org/T378828#11067506 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host cloudcephosd1045.eqiad.wmnet with OS bullseye [09:06:25] RESOLVED: SystemdUnitFailed: 
httpbb_kubernetes_mw-web-next_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:06:33] RECOVERY - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [09:07:41] PROBLEM - Disk space on an-worker1120 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/k 151177 MB (4% inode=99%): /var/lib/hadoop/data/m 154153 MB (4% inode=99%): /var/lib/hadoop/data/d 161312 MB (4% inode=99%): /var/lib/hadoop/data/b 154653 MB (4% inode=99%): /var/lib/hadoop/data/e 152146 MB (4% inode=99%): /var/lib/hadoop/data/g 150056 MB (3% inode=99%): /var/lib/hadoop/data/f 159666 MB (4% inode=99%): /var/lib/hadoop/data [09:07:41] 9 MB (4% inode=99%): /var/lib/hadoop/data/i 153608 MB (4% inode=99%): /var/lib/hadoop/data/j 154603 MB (4% inode=99%): /var/lib/hadoop/data/l 155195 MB (4% inode=99%): /var/lib/hadoop/data/c 151513 MB (4% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1120&var-datasource=eqiad+prometheus/ops [09:08:57] !log jnuche@deploy1003 Started deploy [releng/jenkins-deploy@417d4e8] (releasing): T400645 [09:09:00] T400645: dpkg error when deploying Jenkins to releases2003.codfw.wmnet - https://phabricator.wikimedia.org/T400645 [09:09:03] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P80948 and previous config saved to /var/cache/conftool/dbconfig/20250807-090903-fceratto.json [09:09:29] !log jnuche@deploy1003 Finished deploy [releng/jenkins-deploy@417d4e8] (releasing): T400645 (duration: 00m 31s) [09:09:42] 10ops-eqiad, 06SRE, 06DC-Ops: Q2:rack/setup/install 
cloudcephosd10[42-47] - https://phabricator.wikimedia.org/T378828#11067518 (10VRiley-WMF) Could not connect to cloudcephosd1044, Will need to check the management cables. Then on cloudcephosd1045 it seems like it's failing with the cable on the 10g cable. Wi... [09:24:11] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T399728)', diff saved to https://phabricator.wikimedia.org/P80949 and previous config saved to /var/cache/conftool/dbconfig/20250807-092410-fceratto.json [09:24:16] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [09:24:26] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1175.eqiad.wmnet with reason: Maintenance [09:24:34] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1175 (T399728)', diff saved to https://phabricator.wikimedia.org/P80950 and previous config saved to /var/cache/conftool/dbconfig/20250807-092433-fceratto.json [09:29:30] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T399728)', diff saved to https://phabricator.wikimedia.org/P80951 and previous config saved to /var/cache/conftool/dbconfig/20250807-092930-fceratto.json [09:29:34] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [09:33:18] (03PS1) 10Elukey: role::maps::master: enable import times on maps-test [puppet] - 10https://gerrit.wikimedia.org/r/1176437 (https://phabricator.wikimedia.org/T381565) [09:35:50] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (NOOP 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1176437 (https://phabricator.wikimedia.org/T381565) (owner: 10Elukey) [09:36:40] (03PS2) 10Elukey: role::maps::master: enable
import times on maps-test [puppet] - 10https://gerrit.wikimedia.org/r/1176437 (https://phabricator.wikimedia.org/T381565) [09:37:51] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (NOOP 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1176437 (https://phabricator.wikimedia.org/T381565) (owner: 10Elukey) [09:39:37] (03CR) 10Federico Ceratto: [C:03+1] "I check the names for consistency with the CR description / commit message and related tasks." [puppet] - 10https://gerrit.wikimedia.org/r/1176432 (https://phabricator.wikimedia.org/T382056) (owner: 10MVernon) [09:41:12] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6511/co" [puppet] - 10https://gerrit.wikimedia.org/r/1176437 (https://phabricator.wikimedia.org/T381565) (owner: 10Elukey) [09:41:48] FIRING: PuppetZeroResources: Puppet has failed generate resources on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [09:44:38] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P80952 and previous config saved to /var/cache/conftool/dbconfig/20250807-094437-fceratto.json [09:51:48] RESOLVED: PuppetZeroResources: Puppet has failed generate resources on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [09:59:45] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P80953 and previous config saved to /var/cache/conftool/dbconfig/20250807-095945-fceratto.json [10:00:05] Deploy 
window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250807T1000) [10:06:05] (03CR) 10MVernon: [C:03+2] swift: add 1 new codfw host, drain 3 [puppet] - 10https://gerrit.wikimedia.org/r/1176432 (https://phabricator.wikimedia.org/T382056) (owner: 10MVernon) [10:14:51] 06SRE, 10SRE-swift-storage, 13Patch-For-Review: ms backend hardware refresh for 24/25 - https://phabricator.wikimedia.org/T382056#11067640 (10MatthewVernon) 05Open→03Resolved [10:14:53] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T399728)', diff saved to https://phabricator.wikimedia.org/P80954 and previous config saved to /var/cache/conftool/dbconfig/20250807-101452-fceratto.json [10:14:57] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [10:14:58] (03CR) 10Hnowlan: profile::hcaptcha::proxy: config improvements (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1176248 (https://phabricator.wikimedia.org/T397841) (owner: 10Effie Mouzeli) [10:15:08] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1189.eqiad.wmnet with reason: Maintenance [10:15:08] (03CR) 10Hashar: [C:04-1] "+1 let me know and I will be happy to +2 / deploy this for you 😊" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1175850 (owner: 10Tim Starling) [10:15:15] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1189 (T399728)', diff saved to https://phabricator.wikimedia.org/P80955 and previous config saved to /var/cache/conftool/dbconfig/20250807-101515-fceratto.json [10:16:26] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
[10:20:05] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1189 (T399728)', diff saved to https://phabricator.wikimedia.org/P80956 and previous config saved to /var/cache/conftool/dbconfig/20250807-102004-fceratto.json [10:20:08] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [10:24:23] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops, 13Patch-For-Review: Install new disk controllers to SM swift backends (codfw) - https://phabricator.wikimedia.org/T400876#11067682 (10MatthewVernon) [10:24:43] 06SRE, 10SRE-swift-storage, 06Infrastructure-Foundations, 13Patch-For-Review: Re-IP Swift hosts to per-rack subnets in codfw rows A-D - https://phabricator.wikimedia.org/T354872#11067685 (10MatthewVernon) [10:24:53] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1045.eqiad.wmnet with OS bullseye [10:25:00] 10ops-eqiad, 06SRE, 06DC-Ops: Q2:rack/setup/install cloudcephosd10[42-47] - https://phabricator.wikimedia.org/T378828#11067687 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host cloudcephosd1045.eqiad.wmnet with OS bullseye executed with errors: - cloudcephosd104... 
[10:28:03] FIRING: KafkaUnderReplicatedPartitions: Under replicated partitions for Kafka cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [10:30:17] (03CR) 10Hnowlan: [C:03+1] api-gateway: Conditional restbase compatibility headers (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1172599 (https://phabricator.wikimedia.org/T400346) (owner: 10Clément Goubert) [10:30:31] 06SRE, 06Infrastructure-Foundations, 10netops: Allow read-only users to view logs on Juniper devices - https://phabricator.wikimedia.org/T401378 (10cmooney) 03NEW p:05Triage→03Low [10:35:12] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1189', diff saved to https://phabricator.wikimedia.org/P80957 and previous config saved to /var/cache/conftool/dbconfig/20250807-103512-fceratto.json [10:40:12] (03PS1) 10Cathal Mooney: User management: create new RO login class and allow to view logs [homer/public] - 10https://gerrit.wikimedia.org/r/1176443 (https://phabricator.wikimedia.org/T401378) [10:40:31] PROBLEM - Disk space on an-worker1121 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/d 155379 MB (4% inode=99%): /var/lib/hadoop/data/h 153869 MB (4% inode=99%): /var/lib/hadoop/data/b 158054 MB (4% inode=99%): /var/lib/hadoop/data/k 154642 MB (4% inode=99%): /var/lib/hadoop/data/m 154430 MB (4% inode=99%): /var/lib/hadoop/data/f 153393 MB (4% inode=99%): /var/lib/hadoop/data/j 153646 MB (4% inode=99%): /var/lib/hadoop/data [10:40:31] 6 MB (4% inode=99%): /var/lib/hadoop/data/l 155365 MB (4% inode=99%): /var/lib/hadoop/data/i 155715 MB (4% inode=99%): /var/lib/hadoop/data/g 160023 MB (4% inode=99%): /var/lib/hadoop/data/c 147838 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space 
https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1121&var-datasource=eqiad+prometheus/ops [10:45:09] (03PS3) 10Effie Mouzeli: profile::hcaptcha::proxy: config improvements [puppet] - 10https://gerrit.wikimedia.org/r/1176248 (https://phabricator.wikimedia.org/T397841) [10:45:24] (03CR) 10Effie Mouzeli: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1176248 (https://phabricator.wikimedia.org/T397841) (owner: 10Effie Mouzeli) [10:45:53] (03CR) 10Effie Mouzeli: profile::hcaptcha::proxy: config improvements (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1176248 (https://phabricator.wikimedia.org/T397841) (owner: 10Effie Mouzeli) [10:50:20] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1189', diff saved to https://phabricator.wikimedia.org/P80958 and previous config saved to /var/cache/conftool/dbconfig/20250807-105019-fceratto.json [10:50:34] 06SRE, 10Hiddenparma, 06Traffic: Browser behaviour detection at the edge - https://phabricator.wikimedia.org/T400270#11067785 (10Vgutierrez) a:03Vgutierrez [10:51:05] 10ops-eqiad, 06SRE, 06DC-Ops: ssw1-f1-eqiad: Fan Spinning Upgraded - https://phabricator.wikimedia.org/T400783#11067787 (10cmooney) I found a possibly related Juniper PR entry here: https://prsearch.juniper.net/problemreport/PR1763499 It's for a different, but similar platform. From what I can tell our ve... 
[10:53:50] (03PS4) 10Effie Mouzeli: profile::hcaptcha::proxy: config improvements [puppet] - 10https://gerrit.wikimedia.org/r/1176248 (https://phabricator.wikimedia.org/T397841) [10:54:02] (03CR) 10Effie Mouzeli: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1176248 (https://phabricator.wikimedia.org/T397841) (owner: 10Effie Mouzeli) [11:03:57] RECOVERY - Disk space on stat1008 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=stat1008&var-datasource=eqiad+prometheus/ops [11:04:31] FIRING: NetworkDeviceAlarmActive: Alarm active on ssw1-f1-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [11:05:27] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1189 (T399728)', diff saved to https://phabricator.wikimedia.org/P80959 and previous config saved to /var/cache/conftool/dbconfig/20250807-110527-fceratto.json [11:05:31] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [11:05:42] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1198.eqiad.wmnet with reason: Maintenance [11:05:50] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1198 (T399728)', diff saved to https://phabricator.wikimedia.org/P80960 and previous config saved to /var/cache/conftool/dbconfig/20250807-110549-fceratto.json [11:09:31] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire 
[11:10:44] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1198 (T399728)', diff saved to https://phabricator.wikimedia.org/P80961 and previous config saved to /var/cache/conftool/dbconfig/20250807-111043-fceratto.json [11:10:48] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [11:19:26] !log deploy1003:~# lvextend -L+30G /dev/vg0/srv [11:19:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:21:07] RECOVERY - Disk space on deploy1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=deploy1003&var-datasource=eqiad+prometheus/ops [11:21:13] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [11:22:33] 06SRE, 10Hiddenparma, 06Traffic: Browser behaviour detection at the edge - https://phabricator.wikimedia.org/T400270#11067835 (10Vgutierrez) https://phabricator.wikimedia.org/P80962 for future reference [11:25:52] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1198', diff saved to https://phabricator.wikimedia.org/P80963 and previous config saved to /var/cache/conftool/dbconfig/20250807-112551-fceratto.json [11:29:54] (03PS1) 10Jon Harald Søby: Enable wgParserEnableUserLanguage for incubatorwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1176436 [11:30:33] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, August 07 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1176436 (owner: 10Jon Harald Søby) [11:40:59] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1198', diff saved to https://phabricator.wikimedia.org/P80964 and previous config saved to 
/var/cache/conftool/dbconfig/20250807-114058-fceratto.json [11:43:29] (03PS1) 10Clément Goubert: trafficserver: Add fractional routing to gateway-check [puppet] - 10https://gerrit.wikimedia.org/r/1171994 (https://phabricator.wikimedia.org/T400131) [11:43:30] (03CR) 10Clément Goubert: "I don't think so, but we should run it past Traffic" [puppet] - 10https://gerrit.wikimedia.org/r/1171994 (https://phabricator.wikimedia.org/T400131) (owner: 10Clément Goubert) [11:49:40] (03CR) 10Dr0ptp4kt: Introduce v1 xLab / MPIC SLOs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1176343 (https://phabricator.wikimedia.org/T398869) (owner: 10Dr0ptp4kt) [11:56:07] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1198 (T399728)', diff saved to https://phabricator.wikimedia.org/P80965 and previous config saved to /var/cache/conftool/dbconfig/20250807-115606-fceratto.json [11:56:12] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [11:56:22] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1212.eqiad.wmnet with reason: Maintenance [11:56:40] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1013,1017].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [11:56:47] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1212 (T399728)', diff saved to https://phabricator.wikimedia.org/P80966 and previous config saved to /var/cache/conftool/dbconfig/20250807-115646-fceratto.json [11:58:44] (03CR) 10Elukey: [V:03+1 C:03+2] role::maps::master: enable import times on maps-test [puppet] - 10https://gerrit.wikimedia.org/r/1176437 (https://phabricator.wikimedia.org/T381565) (owner: 10Elukey) [12:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250807T1200) [12:02:06] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1212 (T399728)', diff saved to https://phabricator.wikimedia.org/P80967 and previous config saved to /var/cache/conftool/dbconfig/20250807-120205-fceratto.json [12:02:09] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [12:03:03] RESOLVED: KafkaUnderReplicatedPartitions: Under replicated partitions for Kafka cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [12:05:24] !log jclark@cumin1002 START - Cookbook sre.dns.netbox [12:07:36] !log jhancock@cumin1003 START - Cookbook sre.hosts.reimage for host db2245.codfw.wmnet with OS bookworm [12:07:45] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install db224[5-8] - https://phabricator.wikimedia.org/T400213#11067971 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin1003 for host db2245.codfw.wmnet with OS bookworm [12:07:49] !log jhancock@cumin1003 START - Cookbook sre.hosts.reimage for host db2246.codfw.wmnet with OS bookworm [12:07:57] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install db224[5-8] - https://phabricator.wikimedia.org/T400213#11067974 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin1003 for host db2246.codfw.wmnet with OS bookworm [12:08:02] !log jhancock@cumin1003 START - Cookbook sre.hosts.reimage for host db2247.codfw.wmnet with OS bookworm [12:08:10] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install db224[5-8] - 
https://phabricator.wikimedia.org/T400213#11067975 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin1003 for host db2247.codfw.wmnet with OS bookworm [12:09:12] (03PS5) 10Dr0ptp4kt: Introduce v1 xLab / MPIC SLOs [puppet] - 10https://gerrit.wikimedia.org/r/1176343 (https://phabricator.wikimedia.org/T398869) [12:10:36] (03PS6) 10Dr0ptp4kt: Introduce v1 xLab / MPIC SLOs [puppet] - 10https://gerrit.wikimedia.org/r/1176343 (https://phabricator.wikimedia.org/T398869) [12:11:11] jclark@cumin1002 netbox (PID 3578491) is awaiting input [12:11:37] (03PS7) 10Dr0ptp4kt: Introduce v1 xLab / MPIC SLOs [puppet] - 10https://gerrit.wikimedia.org/r/1176343 (https://phabricator.wikimedia.org/T398869) [12:13:38] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update dns for db1260-3 - jclark@cumin1002" [12:13:41] (03PS8) 10Dr0ptp4kt: Introduce v1 xLab / MPIC SLOs [puppet] - 10https://gerrit.wikimedia.org/r/1176343 (https://phabricator.wikimedia.org/T398869) [12:13:43] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update dns for db1260-3 - jclark@cumin1002" [12:13:43] !log jclark@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:14:30] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host db1260.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [12:14:56] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host db1261.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [12:15:03] FIRING: KafkaUnderReplicatedPartitions: Under replicated partitions for Kafka cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - 
https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [12:15:09] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host db1263.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [12:15:12] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host db1262.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [12:16:26] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install db126[0-3] - https://phabricator.wikimedia.org/T400214#11068004 (10Jclark-ctr) a:05Marostegui→03Jclark-ctr [12:17:13] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1212', diff saved to https://phabricator.wikimedia.org/P80969 and previous config saved to /var/cache/conftool/dbconfig/20250807-121712-fceratto.json [12:18:21] (03CR) 10Dr0ptp4kt: "In the latest version of the patch separate subcomponents is no longer used for Request class 2, but rather uses division with two differe" [puppet] - 10https://gerrit.wikimedia.org/r/1176343 (https://phabricator.wikimedia.org/T398869) (owner: 10Dr0ptp4kt) [12:19:01] jclark@cumin1002 provision (PID 3586944) is awaiting input [12:19:25] jclark@cumin1002 provision (PID 3586959) is awaiting input [12:20:10] jclark@cumin1002 provision (PID 3587005) is awaiting input [12:20:13] jclark@cumin1002 provision (PID 3586980) is awaiting input [12:21:54] (03PS9) 10Dr0ptp4kt: Introduce v1 xLab / MPIC SLOs [puppet] - 10https://gerrit.wikimedia.org/r/1176343 (https://phabricator.wikimedia.org/T398869) [12:23:43] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db1263.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [12:23:50] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision 
(exit_code=99) for host db1262.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [12:24:08] !log jhancock@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db2245.codfw.wmnet with reason: host reimage [12:24:17] !log jhancock@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db2246.codfw.wmnet with reason: host reimage [12:24:28] !log jhancock@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db2247.codfw.wmnet with reason: host reimage [12:25:13] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): hw troubleshooting: disk sdj failure for cloudcephosd1013.eqiad.wmnet - https://phabricator.wikimedia.org/T401319#11068026 (10Jclark-ctr) 05Open→03Resolved @fnegri i have removed failed drive. I have installed drive from decom serve... [12:27:43] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host db1263.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [12:27:58] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host db1262.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [12:28:45] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2245.codfw.wmnet with reason: host reimage [12:32:13] jclark@cumin1002 provision (PID 3600318) is awaiting input [12:32:20] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2246.codfw.wmnet with reason: host reimage [12:32:21] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1212', diff saved to https://phabricator.wikimedia.org/P80970 and previous config saved to /var/cache/conftool/dbconfig/20250807-123220-fceratto.json [12:32:33] jclark@cumin1002 provision (PID 3599163) is awaiting input [12:34:43] (03CR) 10Hnowlan: [C:03+1] "lgtm from my end, but I'll defer to approval by traffic before merging." 
[puppet] - 10https://gerrit.wikimedia.org/r/1171994 (https://phabricator.wikimedia.org/T400131) (owner: 10Clément Goubert) [12:36:51] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db1263.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [12:37:12] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2247.codfw.wmnet with reason: host reimage [12:37:22] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host db1263.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [12:40:07] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops, 10decommission-hardware: decommission ms-be106[1-3].eqiad.wmnet - https://phabricator.wikimedia.org/T401368#11068071 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr [12:41:40] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db1260.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [12:41:46] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db1261.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [12:42:01] (03PS1) 10Hnowlan: trafficserver: simplify gateway-check path globs [puppet] - 10https://gerrit.wikimedia.org/r/1176473 (https://phabricator.wikimedia.org/T400131) [12:45:24] !log jhancock@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin1003" [12:45:43] !log jhancock@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin1003" [12:45:44] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2245.codfw.wmnet with OS bookworm [12:45:49] 06SRE, 
10SRE-swift-storage: Swift device names should not contain underscores - https://phabricator.wikimedia.org/T401387 (10MatthewVernon) 03NEW [12:45:52] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install db224[5-8] - https://phabricator.wikimedia.org/T400213#11068104 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin1003 for host db2245.codfw.wmnet with OS bookworm completed: - db2245 (**PASS**) - R... [12:46:20] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host db1261.eqiad.wmnet with OS bookworm [12:46:28] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install db126[0-3] - https://phabricator.wikimedia.org/T400214#11068106 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host db1261.eqiad.wmnet with OS bookworm [12:46:37] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host db1260.eqiad.wmnet with OS bookworm [12:46:44] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install db126[0-3] - https://phabricator.wikimedia.org/T400214#11068119 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host db1260.eqiad.wmnet with OS bookworm [12:47:29] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1212 (T399728)', diff saved to https://phabricator.wikimedia.org/P80972 and previous config saved to /var/cache/conftool/dbconfig/20250807-124728-fceratto.json [12:47:32] T399728: Add '*_actor_ip_hex_time' indexes to 'cu_changes', 'cu_log_event', and 'cu_private_event' on WMF wikis - https://phabricator.wikimedia.org/T399728 [12:47:44] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1240.eqiad.wmnet with reason: Maintenance [12:47:50] 06SRE, 10SRE-swift-storage: Swift device names should not contain underscores - https://phabricator.wikimedia.org/T401387#11068122 (10MatthewVernon) p:05Triage→03High 
[12:48:14] 06SRE, 07SRE-Unowned, 10Maps, 13Patch-For-Review: Move maps servers to Bookworm - https://phabricator.wikimedia.org/T381565#11068129 (10elukey) [12:49:00] 06SRE, 10SRE-swift-storage: Swift device names should not contain underscores - https://phabricator.wikimedia.org/T401387#11068130 (10MatthewVernon) [12:49:01] 06SRE, 10SRE-swift-storage, 06DC-Ops: Install new JBOD disk controllers into SM swift backends - https://phabricator.wikimedia.org/T400878#11068131 (10MatthewVernon) [12:49:02] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Install new disk controllers to SM swift backends (eqiad) - https://phabricator.wikimedia.org/T400877#11068132 (10MatthewVernon) [12:49:05] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Install new disk controllers to SM swift backends (codfw) - https://phabricator.wikimedia.org/T400876#11068133 (10MatthewVernon) [12:49:27] !log jhancock@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin1003" [12:51:25] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [12:52:32] jhancock@cumin1003 reimage (PID 990150) is awaiting input [12:54:33] !log jhancock@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin1003" [12:54:34] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2246.codfw.wmnet with OS bookworm [12:54:44] !log jhancock@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin1003" [12:54:45] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install db224[5-8] - https://phabricator.wikimedia.org/T400213#11068146 (10ops-monitoring-bot) Cookbook 
cookbooks.sre.hosts.reimage started by jhancock@cumin1003 for host db2246.codfw.wmnet with OS bookworm completed: - db2246 (**PASS**) - R... [12:55:04] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host db2248.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [12:55:05] !log jhancock@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin1003" [12:55:06] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2247.codfw.wmnet with OS bookworm [12:55:12] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install db224[5-8] - https://phabricator.wikimedia.org/T400213#11068148 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin1003 for host db2247.codfw.wmnet with OS bookworm completed: - db2247 (**PASS**) - R... [12:55:59] (03CR) 10Jelto: [V:03+1 C:03+2] gitlab: adjust nftables throttling thresholds [puppet] - 10https://gerrit.wikimedia.org/r/1176246 (https://phabricator.wikimedia.org/T400971) (owner: 10Jelto) [12:59:34] (03PS1) 10MVernon: Swift - avoid _ in device names [puppet] - 10https://gerrit.wikimedia.org/r/1176476 (https://phabricator.wikimedia.org/T401387) [13:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250807T1300). [13:00:05] Jhs: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[13:00:24] o/ [13:00:47] I can deploy ^^ [13:01:38] (03CR) 10CI reject: [V:04-1] Swift - avoid _ in device names [puppet] - 10https://gerrit.wikimedia.org/r/1176476 (https://phabricator.wikimedia.org/T401387) (owner: 10MVernon) [13:02:20] jclark@cumin1002 provision (PID 3610476) is awaiting input [13:02:43] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db1263.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [13:03:11] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db1262.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [13:03:55] Jhs: are you there? [13:04:02] Lucas_WMDE, yeah [13:04:08] ok [13:05:45] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1176436 (owner: 10Jon Harald Søby) [13:06:01] Jhs: out of interest, would this setting be useful for Wikidata as well? [13:06:04] (03PS2) 10MVernon: Swift - avoid _ in device names [puppet] - 10https://gerrit.wikimedia.org/r/1176476 (https://phabricator.wikimedia.org/T401387) [13:06:38] (03Merged) 10jenkins-bot: Enable wgParserEnableUserLanguage for incubatorwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1176436 (owner: 10Jon Harald Søby) [13:07:08] !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1176436|Enable wgParserEnableUserLanguage for incubatorwiki]] [13:07:46] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2248.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [13:09:21] !log lucaswerkmeister-wmde@deploy1003 jhsoby, lucaswerkmeister-wmde: Backport for [[gerrit:1176436|Enable wgParserEnableUserLanguage for incubatorwiki]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). 
Changes can now be verified there. [13:09:55] !log jhancock@cumin1003 START - Cookbook sre.hosts.reimage for host db2248.codfw.wmnet with OS bookworm [13:09:59] Lucas_WMDE, IMO it would be useful for any wiki that uses the {{int:lang}} hack. but there might be caching implications? But I also don't really know why there would be when {{int:lang}} is already used 🤷‍♂️ [13:10:05] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install db224[5-8] - https://phabricator.wikimedia.org/T400213#11068190 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin1003 for host db2248.codfw.wmnet with OS bookworm [13:10:06] (sorry, bad wifi here in Nairobi) [13:10:11] sounds reasonable to me [13:10:21] the change ended up on mwdebug while you were out, please test :) [13:10:34] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install db126[0-3] - https://phabricator.wikimedia.org/T400214#11068214 (10Jclark-ctr) [13:11:05] Lucas_WMDE, tested, works as it should 👍 [13:11:12] !log lucaswerkmeister-wmde@deploy1003 jhsoby, lucaswerkmeister-wmde: Continuing with sync [13:11:15] nice, thanks! 
[13:11:36] (03CR) 10Giuseppe Lavagetto: [C:03+1] benthos: webrequest_sampled_live: remove client_port [puppet] - 10https://gerrit.wikimedia.org/r/1176295 (https://phabricator.wikimedia.org/T398236) (owner: 10CDanis) [13:11:50] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host db1262.eqiad.wmnet with OS bookworm [13:11:51] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host db1263.eqiad.wmnet with OS bookworm [13:11:55] (03CR) 10Giuseppe Lavagetto: [C:03+1] turnilo: webrequest_sampled_live: remove client_port [puppet] - 10https://gerrit.wikimedia.org/r/1176296 (https://phabricator.wikimedia.org/T398236) (owner: 10CDanis) [13:12:00] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install db126[0-3] - https://phabricator.wikimedia.org/T400214#11068217 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host db1262.eqiad.wmnet with OS bookworm [13:12:02] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install db126[0-3] - https://phabricator.wikimedia.org/T400214#11068218 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host db1263.eqiad.wmnet with OS bookworm [13:12:06] (03CR) 10Eevans: [C:03+1] Swift - avoid _ in device names [puppet] - 10https://gerrit.wikimedia.org/r/1176476 (https://phabricator.wikimedia.org/T401387) (owner: 10MVernon) [13:12:31] (03CR) 10Giuseppe Lavagetto: [C:03+1] benthos webrequest: Add wmfuniq X-Analytics sub-field [puppet] - 10https://gerrit.wikimedia.org/r/1176299 (https://phabricator.wikimedia.org/T400753) (owner: 10CDanis) [13:12:51] (03CR) 10Giuseppe Lavagetto: [C:03+1] turnilo: webrequest: add wmfuniq X-Analytics sub-field [puppet] - 10https://gerrit.wikimedia.org/r/1176300 (https://phabricator.wikimedia.org/T400753) (owner: 10CDanis) [13:13:50] can confirm, 
https://incubator.wikimedia.org/w/api.php?action=parse&format=json&uselang=de&text=%7B%7BUSERLANGUAGE%7D%7D&prop=text&wrapoutputclass=&disablelimitreport=1&contentmodel=wikitext&formatversion=2 changes from en to de with WikimediaDebug \oi [13:16:45] !log lucaswerkmeister-wmde@deploy1003 Finished scap sync-world: Backport for [[gerrit:1176436|Enable wgParserEnableUserLanguage for incubatorwiki]] (duration: 09m 37s) [13:18:08] (03CR) 10MVernon: [C:03+2] Swift - avoid _ in device names [puppet] - 10https://gerrit.wikimedia.org/r/1176476 (https://phabricator.wikimedia.org/T401387) (owner: 10MVernon) [13:20:03] RESOLVED: KafkaUnderReplicatedPartitions: Under replicated partitions for Kafka cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [13:26:04] !log jhancock@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on db2248.codfw.wmnet with reason: host reimage [13:27:38] 06SRE, 10SRE-swift-storage, 13Patch-For-Review: Swift device names should not contain underscores - https://phabricator.wikimedia.org/T401387#11068300 (10MatthewVernon) The change has worked in eqiad, where ms-be1091 is no longer in the rings. codfw has 8h45m before another ring change can take place. 
[13:28:51] !log UTC afternoon backport+config window done [13:28:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:29:18] !log mvernon@cumin1003 START - Cookbook sre.hosts.reimage for host ms-be1091.eqiad.wmnet with OS bullseye [13:29:31] 06SRE, 10SRE-swift-storage, 13Patch-For-Review: Swift device names should not contain underscores - https://phabricator.wikimedia.org/T401387#11068305 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin1003 for host ms-be1091.eqiad.wmnet with OS bullseye [13:30:02] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2248.codfw.wmnet with reason: host reimage [13:30:52] jclark@cumin1002 reimage (PID 3649296) is awaiting input [13:31:01] jclark@cumin1002 reimage (PID 3649328) is awaiting input [13:36:29] (03PS1) 10Brouberol: eventgate-analytics: update kafka-jumbo-eqiad broker list [deployment-charts] - 10https://gerrit.wikimedia.org/r/1176479 (https://phabricator.wikimedia.org/T397447) [13:36:31] (03PS1) 10Brouberol: eventstreams-internal: update kafka-jumbo-eqiad broker list [deployment-charts] - 10https://gerrit.wikimedia.org/r/1176480 (https://phabricator.wikimedia.org/T397447) [13:36:34] (03PS1) 10Brouberol: mw-page-content-change-enrich: update kafka-jumbo-eqiad broker list [deployment-charts] - 10https://gerrit.wikimedia.org/r/1176481 (https://phabricator.wikimedia.org/T397447) [13:41:32] !log mvernon@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1091.eqiad.wmnet with reason: host reimage [13:42:48] (03PS1) 10Herron: thanos: clean citoid SLO recording rule history [puppet] - 10https://gerrit.wikimedia.org/r/1176484 (https://phabricator.wikimedia.org/T400073) [13:44:22] (03PS1) 10Brouberol: eventlogging: remove reference to kafka-jumbo1007 [alerts] - 10https://gerrit.wikimedia.org/r/1176485 (https://phabricator.wikimedia.org/T397447) [13:44:39] !log mvernon@cumin1003 END (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1091.eqiad.wmnet with reason: host reimage [13:45:11] (03CR) 10CI reject: [V:04-1] thanos: clean citoid SLO recording rule history [puppet] - 10https://gerrit.wikimedia.org/r/1176484 (https://phabricator.wikimedia.org/T400073) (owner: 10Herron) [13:47:23] !log jhancock@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin1003" [13:49:22] !log jhancock@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin1003" [13:49:23] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2248.codfw.wmnet with OS bookworm [13:49:36] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install db224[5-8] - https://phabricator.wikimedia.org/T400213#11068354 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin1003 for host db2248.codfw.wmnet with OS bookworm completed: - db2248 (**PASS**) - R... [13:50:45] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install db224[5-8] - https://phabricator.wikimedia.org/T400213#11068364 (10Jhancock.wm) [13:51:15] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install db224[5-8] - https://phabricator.wikimedia.org/T400213#11068365 (10Jhancock.wm) 05Open→03Resolved @Marostegui these are complete [13:55:17] (03PS10) 10Dr0ptp4kt: Introduce v1 xLab / MPIC SLOs [puppet] - 10https://gerrit.wikimedia.org/r/1176343 (https://phabricator.wikimedia.org/T398869) [13:56:32] (03CR) 10Dr0ptp4kt: "One smallish change." 
[puppet] - 10https://gerrit.wikimedia.org/r/1176343 (https://phabricator.wikimedia.org/T398869) (owner: 10Dr0ptp4kt) [13:57:16] (03CR) 10Ssingh: dnsrecursor: add recursor.yml.erb (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1169156 (https://phabricator.wikimedia.org/T381608) (owner: 10CDobbins) [13:57:40] (03CR) 10Ssingh: "I resolved/addressed some of the comments and will review this again." [puppet] - 10https://gerrit.wikimedia.org/r/1169156 (https://phabricator.wikimedia.org/T381608) (owner: 10CDobbins) [14:01:41] !log mvernon@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1091.eqiad.wmnet with OS bullseye [14:01:54] 06SRE, 10SRE-swift-storage: Swift device names should not contain underscores - https://phabricator.wikimedia.org/T401387#11068393 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin1003 for host ms-be1091.eqiad.wmnet with OS bullseye completed: - ms-be1091 (**PASS**) - Downt... [14:02:05] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host db1262.eqiad.wmnet with OS bookworm [14:02:09] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host db1263.eqiad.wmnet with OS bookworm [14:02:11] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install db126[0-3] - https://phabricator.wikimedia.org/T400214#11068394 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host db1262.eqiad.wmnet with OS bookworm executed with errors: - db1262 (**FAIL... [14:02:14] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install db126[0-3] - https://phabricator.wikimedia.org/T400214#11068395 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host db1263.eqiad.wmnet with OS bookworm executed with errors: - db1263 (**FAIL... 
[14:02:40] (03CR) 10Elukey: "@jhathaway@wikimedia.org I am dumping all the current work that I am doing for iDRAC 10 hosts (cp2043 for example) so you are aware while " [cookbooks] - 10https://gerrit.wikimedia.org/r/1173335 (https://phabricator.wikimedia.org/T392851) (owner: 10Elukey) [14:06:03] (03PS1) 10Brouberol: eventgate-analytics-external: update kafka-jumbo-eqiad broker list [deployment-charts] - 10https://gerrit.wikimedia.org/r/1176487 (https://phabricator.wikimedia.org/T397447) [14:08:59] 10ops-codfw, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#11068426 (10elukey) Brain dump before I go on holidays, if anything is needed and I am not around. The current list of issues are: 1) The hosts are iDRAC 10, s... [14:11:22] (03CR) 10Elukey: "I am totally ignorant about this, cannot really provide a valid code review :D It looks good from a high level perspective, but Cathal wil" [puppet] - 10https://gerrit.wikimedia.org/r/1176216 (owner: 10Ayounsi) [14:14:12] jouncebot: nowandnext [14:14:13] No deployments scheduled for the next 0 hour(s) and 15 minute(s) [14:14:13] In 0 hour(s) and 15 minute(s): xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250807T1430) [14:14:21] (03CR) 10Zabe: [C:03+2] Do not create a database table when a different provider is used [extensions/ApiFeatureUsage] (wmf/1.45.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1176250 (https://phabricator.wikimedia.org/T397348) (owner: 10Zabe) [14:14:22] (03CR) 10Zabe: [C:03+2] Do not create a database table when a different provider is used [extensions/ApiFeatureUsage] (wmf/1.45.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1176251 (https://phabricator.wikimedia.org/T397348) (owner: 10Zabe) [14:15:15] (03Merged) 10jenkins-bot: Do not create a database table when a different provider is used [extensions/ApiFeatureUsage] (wmf/1.45.0-wmf.12) - 
10https://gerrit.wikimedia.org/r/1176250 (https://phabricator.wikimedia.org/T397348) (owner: 10Zabe) [14:15:24] (03Merged) 10jenkins-bot: Do not create a database table when a different provider is used [extensions/ApiFeatureUsage] (wmf/1.45.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1176251 (https://phabricator.wikimedia.org/T397348) (owner: 10Zabe) [14:16:14] !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1176251|Do not create a database table when a different provider is used (T397348)]], [[gerrit:1176250|Do not create a database table when a different provider is used (T397348)]] [14:16:18] T397348: addWiki.php create tables it should not - https://phabricator.wikimedia.org/T397348 [14:16:41] (03PS1) 10Brouberol: Decommission kafka-jumbo1007 [puppet] - 10https://gerrit.wikimedia.org/r/1176489 (https://phabricator.wikimedia.org/T397447) [14:16:42] (03PS1) 10Brouberol: Decommission kafka-jumbo1008 [puppet] - 10https://gerrit.wikimedia.org/r/1176490 (https://phabricator.wikimedia.org/T397447) [14:16:44] (03PS1) 10Brouberol: Decommission kafka-jumbo1009 [puppet] - 10https://gerrit.wikimedia.org/r/1176491 (https://phabricator.wikimedia.org/T397447) [14:17:07] (03CR) 10CI reject: [V:04-1] Decommission kafka-jumbo1007 [puppet] - 10https://gerrit.wikimedia.org/r/1176489 (https://phabricator.wikimedia.org/T397447) (owner: 10Brouberol) [14:18:08] !log zabe@deploy1003 zabe: Backport for [[gerrit:1176251|Do not create a database table when a different provider is used (T397348)]], [[gerrit:1176250|Do not create a database table when a different provider is used (T397348)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[14:18:46] !log zabe@deploy1003 zabe: Continuing with sync [14:18:52] (03PS2) 10Brouberol: Decommission kafka-jumbo1007 [puppet] - 10https://gerrit.wikimedia.org/r/1176489 (https://phabricator.wikimedia.org/T397447) [14:18:52] (03PS2) 10Brouberol: Decommission kafka-jumbo1008 [puppet] - 10https://gerrit.wikimedia.org/r/1176490 (https://phabricator.wikimedia.org/T397447) [14:18:52] (03PS2) 10Brouberol: Decommission kafka-jumbo1009 [puppet] - 10https://gerrit.wikimedia.org/r/1176491 (https://phabricator.wikimedia.org/T397447) [14:19:06] (03PS2) 10Elukey: thanos: clean citoid SLO recording rule history [puppet] - 10https://gerrit.wikimedia.org/r/1176484 (https://phabricator.wikimedia.org/T400073) (owner: 10Herron) [14:23:20] (03PS3) 10Elukey: thanos: clean citoid SLO recording rule history [puppet] - 10https://gerrit.wikimedia.org/r/1176484 (https://phabricator.wikimedia.org/T400073) (owner: 10Herron) [14:24:08] !log zabe@deploy1003 Finished scap sync-world: Backport for [[gerrit:1176251|Do not create a database table when a different provider is used (T397348)]], [[gerrit:1176250|Do not create a database table when a different provider is used (T397348)]] (duration: 07m 54s) [14:24:12] (03CR) 10Elukey: "I tried to double check with:" [puppet] - 10https://gerrit.wikimedia.org/r/1176484 (https://phabricator.wikimedia.org/T400073) (owner: 10Herron) [14:24:12] T397348: addWiki.php create tables it should not - https://phabricator.wikimedia.org/T397348 [14:24:55] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host db1260.eqiad.wmnet with OS bookworm [14:25:05] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host db1261.eqiad.wmnet with OS bookworm [14:25:07] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install db126[0-3] - https://phabricator.wikimedia.org/T400214#11068491 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host db1260.eqiad.wmnet with OS bookworm [14:25:12] 
10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install db126[0-3] - https://phabricator.wikimedia.org/T400214#11068492 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host db1261.eqiad.wmnet with OS bookworm [14:25:15] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host db1262.eqiad.wmnet with OS bookworm [14:25:19] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host db1263.eqiad.wmnet with OS bookworm [14:25:23] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install db126[0-3] - https://phabricator.wikimedia.org/T400214#11068493 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host db1262.eqiad.wmnet with OS bookworm [14:25:26] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install db126[0-3] - https://phabricator.wikimedia.org/T400214#11068494 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host db1263.eqiad.wmnet with OS bookworm [14:30:05] Deploy window xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250807T1430) [14:31:56] (03CR) 10Filippo Giunchedi: [C:03+1] "Makes sense to me, thanks to Luca for checking" [puppet] - 10https://gerrit.wikimedia.org/r/1176484 (https://phabricator.wikimedia.org/T400073) (owner: 10Herron) [14:34:36] (03PS1) 10Zabe: Initial configuration for tlwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1176493 (https://phabricator.wikimedia.org/T388639) [14:34:38] (03PS1) 10Zabe: Blah [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1176494 [14:35:29] (03CR) 10CI reject: [V:04-1] Initial configuration for tlwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1176493 (https://phabricator.wikimedia.org/T388639) (owner: 10Zabe) [14:35:39] (03CR) 10CI reject: [V:04-1] Blah [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1176494 (owner: 
10Zabe) [14:37:41] blah? [14:40:32] (03PS1) 10Zabe: Add new SUL wikis to the sul dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1176495 [14:41:26] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1260.eqiad.wmnet with reason: host reimage [14:41:35] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1261.eqiad.wmnet with reason: host reimage [14:41:46] Its a DNM spam patch for myself, I just needed something so that the commit message is not empty [14:41:53] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1262.eqiad.wmnet with reason: host reimage [14:41:58] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1263.eqiad.wmnet with reason: host reimage [14:44:35] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1260.eqiad.wmnet with reason: host reimage [14:48:20] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1261.eqiad.wmnet with reason: host reimage [14:49:09] 06SRE, 07SRE-Unowned, 10Maps, 13Patch-For-Review: Move maps servers to Bookworm - https://phabricator.wikimedia.org/T381565#11068616 (10elukey) Brain dump before the holidays :) Yiannis created https://gitlab.wikimedia.org/jgiannelos/kartotherian-difftesting/-/tree/main to make a set of request to two kar... 
[14:49:13] !log jhancock@cumin1003 START - Cookbook sre.dns.netbox [14:49:15] (03PS1) 10Zabe: manage-dblist: Correct list of prod/labs wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1176499 [14:50:11] (03CR) 10CI reject: [V:04-1] manage-dblist: Correct list of prod/labs wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1176499 (owner: 10Zabe) [14:51:16] (03CR) 10Zabe: multiversion: Move remaining dblist helper to WmfConfig class (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1141522 (owner: 10Krinkle) [14:52:10] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1262.eqiad.wmnet with reason: host reimage [14:52:37] !log jhancock@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding ms-fe2020 to codfw - jhancock@cumin1003" [14:52:42] !log jhancock@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding ms-fe2020 to codfw - jhancock@cumin1003" [14:52:42] !log jhancock@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:53:14] !log jhancock@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host deploy2003 [14:53:15] !log jhancock@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host ms-fe2017 [14:53:16] !log jhancock@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host ms-fe2018 [14:53:17] !log jhancock@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host ms-fe2019 [14:53:19] !log jhancock@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host ms-fe2020 [14:53:24] !log jhancock@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host deploy2003 [14:53:26] !log jhancock@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for 
host ms-fe2017 [14:53:28] !log jhancock@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ms-fe2018 [14:53:30] !log jhancock@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ms-fe2019 [14:53:32] !log jhancock@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ms-fe2020 [14:53:54] 06SRE, 07SRE-Unowned, 10Maps, 13Patch-For-Review: Move maps servers to Bookworm - https://phabricator.wikimedia.org/T381565#11068671 (10elukey) At the moment we just have kartotherian in eqiad pooled, so if any issue arises we could do two things: * Simplest one, just repool codfw and accept the latency p... [14:53:57] (03CR) 10Btullis: [C:03+1] eventgate-analytics: update kafka-jumbo-eqiad broker list [deployment-charts] - 10https://gerrit.wikimedia.org/r/1176479 (https://phabricator.wikimedia.org/T397447) (owner: 10Brouberol) [14:54:11] (03CR) 10Btullis: [C:03+1] eventstreams-internal: update kafka-jumbo-eqiad broker list [deployment-charts] - 10https://gerrit.wikimedia.org/r/1176480 (https://phabricator.wikimedia.org/T397447) (owner: 10Brouberol) [14:54:25] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host deploy2003.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [14:54:27] (03CR) 10Btullis: [C:03+1] mw-page-content-change-enrich: update kafka-jumbo-eqiad broker list [deployment-charts] - 10https://gerrit.wikimedia.org/r/1176481 (https://phabricator.wikimedia.org/T397447) (owner: 10Brouberol) [14:54:36] (03PS2) 10Zabe: manage-dblist: Correct list of prod/labs wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1176499 [14:54:43] (03CR) 10Btullis: [C:03+1] eventgate-analytics-external: update kafka-jumbo-eqiad broker list [deployment-charts] - 10https://gerrit.wikimedia.org/r/1176487 (https://phabricator.wikimedia.org/T397447) (owner: 10Brouberol) [14:54:46] (03PS3) 10Zabe: 
manage-dblist: Correct list of prod wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1176499 [14:55:04] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host ms-fe2017.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [14:55:19] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host ms-fe2018.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [14:55:31] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1263.eqiad.wmnet with reason: host reimage [14:55:36] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host ms-fe2019.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [14:55:56] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host ms-fe2020.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [14:56:56] (03CR) 10Zabe: [C:03+2] Add new SUL wikis to the sul dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1176495 (owner: 10Zabe) [14:56:57] (03CR) 10Btullis: [C:03+1] eventlogging: remove reference to kafka-jumbo1007 [alerts] - 10https://gerrit.wikimedia.org/r/1176485 (https://phabricator.wikimedia.org/T397447) (owner: 10Brouberol) [14:57:52] (03Merged) 10jenkins-bot: Add new SUL wikis to the sul dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1176495 (owner: 10Zabe) [14:58:11] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host deploy2003.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [14:58:14] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-fe2017.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [14:58:57] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) 
for host ms-fe2018.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [14:59:13] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-fe2020.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [14:59:19] (03CR) 10Elukey: "Folks I am going on holidays during the next couple of weeks, sorry that we didn't get this sorted out sooner :(" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1174038 (https://phabricator.wikimedia.org/T397246) (owner: 10Bking) [14:59:20] !log jhancock@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ms-fe2019.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [14:59:45] FIRING: Emergency syslog message: Alert for device pfw1-codfw.wikimedia.org - Emergency syslog message - https://alerts.wikimedia.org/?q=alertname%3DEmergency+syslog+message [14:59:46] FIRING: Emergency syslog message: Alert for device pfw1-codfw.wikimedia.org - Emergency syslog message - https://alerts.wikimedia.org/?q=alertname%3DEmergency+syslog+message [15:00:05] hashar and brennen: Deploy window Train log triage (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250807T1500) [15:01:03] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [15:02:22] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [15:02:23] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1260.eqiad.wmnet with OS bookworm [15:02:30] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install db126[0-3] - https://phabricator.wikimedia.org/T400214#11068716 (10ops-monitoring-bot) Cookbook 
cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host db1260.eqiad.wmnet with OS bookworm completed: - db1260 (**PASS**) - Rem... [15:02:47] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host ms-fe2019.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [15:03:10] (03CR) 10Btullis: [C:03+2] Add collation to the list of sqooped table [puppet] - 10https://gerrit.wikimedia.org/r/1175924 (https://phabricator.wikimedia.org/T397923) (owner: 10Aleksandar Mastilovic) [15:04:20] (03CR) 10Jforrester: "Oh, oops, thanks!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1176495 (owner: 10Zabe) [15:04:31] FIRING: NetworkDeviceAlarmActive: Alarm active on ssw1-f1-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [15:04:32] (03CR) 10Jforrester: [C:03+1] manage-dblist: Correct list of prod wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1176499 (owner: 10Zabe) [15:04:45] RESOLVED: Emergency syslog message: Device pfw1-codfw.wikimedia.org recovered from Emergency syslog message - https://alerts.wikimedia.org/?q=alertname%3DEmergency+syslog+message [15:04:46] RESOLVED: Emergency syslog message: Device pfw1-codfw.wikimedia.org recovered from Emergency syslog message - https://alerts.wikimedia.org/?q=alertname%3DEmergency+syslog+message [15:04:50] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [15:05:29] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [15:05:30] !log jclark@cumin1002 END (PASS) - Cookbook 
sre.hosts.reimage (exit_code=0) for host db1261.eqiad.wmnet with OS bookworm [15:05:42] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install db126[0-3] - https://phabricator.wikimedia.org/T400214#11068765 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host db1261.eqiad.wmnet with OS bookworm completed: - db1261 (**PASS**) - Rem... [15:05:46] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-fe2019.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [15:06:41] !log jhancock@cumin1003 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['deploy2003'] [15:07:03] !log jhancock@cumin1003 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['deploy2003'] [15:07:44] !log jhancock@cumin1003 START - Cookbook sre.hosts.reimage for host deploy2003.codfw.wmnet with OS bookworm [15:07:58] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install deploy2003 - https://phabricator.wikimedia.org/T400485#11068783 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin1003 for host deploy2003.codfw.wmnet with OS bookworm [15:08:03] !log jhancock@cumin1003 START - Cookbook sre.hosts.reimage for host ms-fe2017.codfw.wmnet with OS bullseye [15:08:09] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:08:11] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Q1:rack/setup/install ms-fe20[17-20] - https://phabricator.wikimedia.org/T401225#11068785 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin1003 for host ms-fe2017.codfw.wmnet with OS bullseye [15:08:22] !log 
jhancock@cumin1003 START - Cookbook sre.hosts.reimage for host ms-fe2018.codfw.wmnet with OS bullseye [15:08:29] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Q1:rack/setup/install ms-fe20[17-20] - https://phabricator.wikimedia.org/T401225#11068790 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin1003 for host ms-fe2018.codfw.wmnet with OS bullseye [15:08:58] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [15:09:15] (03CR) 10Zabe: [C:03+2] manage-dblist: Correct list of prod wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1176499 (owner: 10Zabe) [15:09:31] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [15:09:59] (03PS2) 10Zabe: Initial configuration for tlwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1176493 (https://phabricator.wikimedia.org/T388639) [15:10:18] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [15:10:21] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1262.eqiad.wmnet with OS bookworm [15:10:31] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install db126[0-3] - https://phabricator.wikimedia.org/T400214#11068793 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host db1262.eqiad.wmnet with OS bookworm completed: - db1262 (**PASS**) - Rem... 
[15:11:35] (03Merged) 10jenkins-bot: manage-dblist: Correct list of prod wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1176499 (owner: 10Zabe) [15:11:39] (03CR) 10CI reject: [V:04-1] Initial configuration for tlwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1176493 (https://phabricator.wikimedia.org/T388639) (owner: 10Zabe) [15:12:31] (03CR) 10Phuedx: "Hearty thanks for your questions, RLazarus." [puppet] - 10https://gerrit.wikimedia.org/r/1171205 (https://phabricator.wikimedia.org/T398422) (owner: 10Phuedx) [15:13:51] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [15:14:12] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [15:14:13] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1263.eqiad.wmnet with OS bookworm [15:14:21] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install db126[0-3] - https://phabricator.wikimedia.org/T400214#11068810 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host db1263.eqiad.wmnet with OS bookworm completed: - db1263 (**PASS**) - Rem... 
[15:14:33] (03CR) 10Arlolra: admin: stop using groups parsoid-roots and parsoid-admin (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1176337 (https://phabricator.wikimedia.org/T401300) (owner: 10Dzahn) [15:14:56] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install db126[0-3] - https://phabricator.wikimedia.org/T400214#11068812 (10Jclark-ctr) [15:15:18] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install db126[0-3] - https://phabricator.wikimedia.org/T400214#11068813 (10Jclark-ctr) 05Open→03Resolved [15:15:49] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.07.26 - 2025.08.15): Q1:rack/setup/install dse-k8s-worker1014 - https://phabricator.wikimedia.org/T399779#11068819 (10BTullis) a:05bking→03None [15:16:17] (03PS1) 10Zabe: Fix SUL dblist expression [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1176502 [15:17:06] (03PS4) 10STran: Enable temporary accounts for special/non-standard/private wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1175113 (https://phabricator.wikimedia.org/T400672) [15:17:19] 10ops-codfw, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.07.26 - 2025.08.15): Q1:rack/setup/install dse-k8s-worker2003 - https://phabricator.wikimedia.org/T399778#11068832 (10BTullis) a:05bking→03None [15:17:38] (03CR) 10Zabe: [C:03+2] Fix SUL dblist expression [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1176502 (owner: 10Zabe) [15:18:35] (03Merged) 10jenkins-bot: Fix SUL dblist expression [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1176502 (owner: 10Zabe) [15:19:31] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:19:59] (03PS3) 10Zabe: Initial configuration for tlwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1176493 
(https://phabricator.wikimedia.org/T388639) [15:20:57] (03CR) 10Tchanders: [C:03+1] Enable temporary accounts for special/non-standard/private wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1175113 (https://phabricator.wikimedia.org/T400672) (owner: 10STran) [15:22:13] (03CR) 10Zabe: [C:03+2] Initial configuration for tlwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1176493 (https://phabricator.wikimedia.org/T388639) (owner: 10Zabe) [15:23:08] (03Merged) 10jenkins-bot: Initial configuration for tlwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1176493 (https://phabricator.wikimedia.org/T388639) (owner: 10Zabe) [15:23:36] !log jhancock@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-fe2017.codfw.wmnet with reason: host reimage [15:23:53] !log jhancock@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-fe2018.codfw.wmnet with reason: host reimage [15:24:07] !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1176493|Initial configuration for tlwikisource (T388639)]] [15:24:11] T388639: Create Wikisource Tagalog - https://phabricator.wikimedia.org/T388639 [15:25:53] (03CR) 10Ssingh: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1169156 (https://phabricator.wikimedia.org/T381608) (owner: 10CDobbins) [15:26:05] !log zabe@deploy1003 zabe: Backport for [[gerrit:1176493|Initial configuration for tlwikisource (T388639)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[15:26:29] !log zabe@deploy1003 zabe: Continuing with sync [15:28:24] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-fe2017.codfw.wmnet with reason: host reimage [15:31:47] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-fe2018.codfw.wmnet with reason: host reimage [15:31:52] !log zabe@deploy1003 Finished scap sync-world: Backport for [[gerrit:1176493|Initial configuration for tlwikisource (T388639)]] (duration: 07m 45s) [15:31:56] T388639: Create Wikisource Tagalog - https://phabricator.wikimedia.org/T388639 [15:33:52] !log Create Wikisource Tagalog # T388639 [15:33:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:34:13] (03PS1) 10Elukey: profile::pyrra::filesystem::slo: refactor the class [puppet] - 10https://gerrit.wikimedia.org/r/1176503 [15:34:28] (03PS1) 10Zabe: Activate tlwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1176504 (https://phabricator.wikimedia.org/T388639) [15:34:42] (03CR) 10CI reject: [V:04-1] profile::pyrra::filesystem::slo: refactor the class [puppet] - 10https://gerrit.wikimedia.org/r/1176503 (owner: 10Elukey) [15:35:15] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6512/co" [puppet] - 10https://gerrit.wikimedia.org/r/1176503 (owner: 10Elukey) [15:35:15] (03CR) 10Zabe: [C:03+2] Activate tlwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1176504 (https://phabricator.wikimedia.org/T388639) (owner: 10Zabe) [15:36:08] (03Merged) 10jenkins-bot: Activate tlwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1176504 (https://phabricator.wikimedia.org/T388639) (owner: 10Zabe) [15:36:23] (03PS2) 10Elukey: profile::pyrra::filesystem::slo: refactor the class [puppet] - 10https://gerrit.wikimedia.org/r/1176503 [15:36:26] !log zabe@deploy1003 Started scap sync-world: 
Backport for [[gerrit:1176504|Activate tlwikisource (T388639)]] [15:36:50] (03CR) 10CI reject: [V:04-1] profile::pyrra::filesystem::slo: refactor the class [puppet] - 10https://gerrit.wikimedia.org/r/1176503 (owner: 10Elukey) [15:37:17] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6513/console" [puppet] - 10https://gerrit.wikimedia.org/r/1176503 (owner: 10Elukey) [15:38:20] !log zabe@deploy1003 zabe: Backport for [[gerrit:1176504|Activate tlwikisource (T388639)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [15:38:23] T388639: Create Wikisource Tagalog - https://phabricator.wikimedia.org/T388639 [15:38:59] !log zabe@deploy1003 zabe: Continuing with sync [15:39:10] 10ops-eqiad, 06SRE, 06DC-Ops: ssw1-f1-eqiad: Fan Spinning Upgraded - https://phabricator.wikimedia.org/T400783#11068903 (10Jclark-ctr) case number 2025-0807-807153 [15:44:13] !log zabe@deploy1003 Finished scap sync-world: Backport for [[gerrit:1176504|Activate tlwikisource (T388639)]] (duration: 07m 47s) [15:44:17] T388639: Create Wikisource Tagalog - https://phabricator.wikimedia.org/T388639 [15:44:50] 10ops-codfw, 06DC-Ops: Alert for device lsw1-f1-codfw.mgmt.codfw.wmnet - Port with no description on access switch - https://phabricator.wikimedia.org/T401411 (10phaultfinder) 03NEW [15:45:21] (03PS3) 10Elukey: profile::pyrra::filesystem::slo: refactor the class [puppet] - 10https://gerrit.wikimedia.org/r/1176503 [15:46:11] !log jhancock@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin1003" [15:46:31] !log jhancock@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin1003" [15:46:32] !log jhancock@cumin1003 
END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-fe2017.codfw.wmnet with OS bullseye [15:46:47] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Q1:rack/setup/install ms-fe20[17-20] - https://phabricator.wikimedia.org/T401225#11068938 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin1003 for host ms-fe2017.codfw.wmnet with OS bullseye completed: - ms-fe2017 (**PA... [15:47:09] (03PS4) 10Elukey: profile::pyrra::filesystem::slo: refactor the class [puppet] - 10https://gerrit.wikimedia.org/r/1176503 [15:47:59] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6515/console" [puppet] - 10https://gerrit.wikimedia.org/r/1176503 (owner: 10Elukey) [15:48:12] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install deploy2003 - https://phabricator.wikimedia.org/T400485#11068941 (10Jhancock.wm) a:03Jhancock.wm [15:48:14] PROBLEM - Swift https backend on ms-fe2017 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [15:48:34] PROBLEM - Host mr1-eqsin.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [15:48:41] (03PS1) 10Zabe: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1176506 [15:48:41] (03CR) 10Zabe: [C:03+2] Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1176506 (owner: 10Zabe) [15:49:02] !log jhancock@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin1003" [15:49:27] !log jhancock@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin1003" [15:49:28] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-fe2018.codfw.wmnet with OS bullseye 
[15:49:40] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Q1:rack/setup/install ms-fe20[17-20] - https://phabricator.wikimedia.org/T401225#11068949 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin1003 for host ms-fe2018.codfw.wmnet with OS bullseye completed: - ms-fe2018 (**PA... [15:49:46] (03Merged) 10jenkins-bot: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1176506 (owner: 10Zabe) [15:50:11] !log zabe@deploy1003 Started scap sync-world: update interwiki cache [15:51:06] PROBLEM - Swift https backend on ms-fe2018 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [15:53:05] 06SRE, 06Fundraising-Backlog, 10MediaWiki-extensions-CentralNotice, 06Traffic: Set expiry time for GeoIP cookies - https://phabricator.wikimedia.org/T122097#11068965 (10greg) Moving this out of unscheduled into Triage for us (FR Tech) to re-review/prioritize our side on it as it's a thing that needs cross-... [15:53:36] RECOVERY - Host mr1-eqsin.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 253.17 ms [15:53:38] PROBLEM - Host ms-fe2017 is DOWN: PING CRITICAL - Packet loss = 100% [15:55:22] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host ms-fe2017.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [15:55:30] (03CR) 10BCornwall: [V:03+2 C:03+2] "Confirmed NS servers are appropriate and DNSSEC is disabled." 
[puppet] - 10https://gerrit.wikimedia.org/r/1175589 (owner: 10Ncmonitor) [15:55:39] !log jhancock@cumin1003 END (ERROR) - Cookbook sre.hosts.provision (exit_code=97) for host ms-fe2017.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [15:55:40] 10SRE-SLO, 06SRE Observability, 10Abstract Wikipedia team (26Q1 (Jul–Sep)), 07Essential-Work: Create new SLO dashboard via Pyrra for Wikifunctions - https://phabricator.wikimedia.org/T394057#11068976 (10elukey) I am going on holidays for a couple of weeks, @RLazarus will take my place for the Pyrra configs... [15:55:55] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host ms-fe2017.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [15:56:13] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host ms-fe2018.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [15:57:40] !log zabe@deploy1003 Finished scap sync-world: update interwiki cache (duration: 07m 29s) [15:57:42] (03CR) 10Vgutierrez: "Thanks for the first draft, next week I'll submit the mentioned recording rules that are needed here and I can take it from here and submi" [puppet] - 10https://gerrit.wikimedia.org/r/1176343 (https://phabricator.wikimedia.org/T398869) (owner: 10Dr0ptp4kt) [15:59:10] PROBLEM - Host ms-fe2018 is DOWN: PING CRITICAL - Packet loss = 100% [16:00:04] jhathaway and moritzm: #bothumor I � Unicode. All rise for Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250807T1600). [16:00:05] No Gerrit patches in the queue for this window AFAICS. 
[16:03:52] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-fe2018.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [16:03:58] RECOVERY - Host ms-fe2018 is UP: PING OK - Packet loss = 0%, RTA = 31.09 ms [16:13:36] jhancock@cumin1003 provision (PID 1028371) is awaiting input [16:26:32] !log jhancock@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host deploy2003.codfw.wmnet with OS bookworm [16:26:39] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install deploy2003 - https://phabricator.wikimedia.org/T400485#11069127 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin1003 for host deploy2003.codfw.wmnet with OS bookworm executed with errors: - deploy2003 (**... [16:32:17] (03PS5) 10Andrew Bogott: Add k3s class for installing k3s on a single cloud-vps node [puppet] - 10https://gerrit.wikimedia.org/r/1175625 (https://phabricator.wikimedia.org/T393782) [16:32:18] (03PS5) 10Andrew Bogott: Add puppet class and profile to create k3s cluster-api worker for Magnum [puppet] - 10https://gerrit.wikimedia.org/r/1175626 (https://phabricator.wikimedia.org/T393782) [16:32:18] (03PS1) 10Andrew Bogott: Magnum/capi: switch to using wmf-internal image and helm repos [puppet] - 10https://gerrit.wikimedia.org/r/1176513 (https://phabricator.wikimedia.org/T393782) [16:33:38] (03PS2) 10Anzx: tlwikisource: add author ( Manunulat ) namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1176509 (https://phabricator.wikimedia.org/T388654) [16:46:44] (03CR) 10A smart kitten: "I don't have a strong opinion either way on whether or not to do this here, but it might be worth noting that the UA policy is a translata" [puppet] - 10https://gerrit.wikimedia.org/r/1172435 (https://phabricator.wikimedia.org/T400421) (owner: 10BryanDavis) [16:49:21] !log dancy@deploy1003 Installing scap version "4.198.0" for 2 host(s) [16:51:08] !log 
dancy@deploy1003 Installation of scap version "4.198.0" completed for 2 hosts [17:00:04] bd808: That opportune time for a Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker) deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250807T1700). [17:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250807T1700) [17:01:09] (03PS1) 10BryanDavis: developer-portal: Bump container to 2025-08-07-122428-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1176517 [17:03:49] (03CR) 10BryanDavis: [C:03+2] developer-portal: Bump container to 2025-08-07-122428-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1176517 (owner: 10BryanDavis) [17:05:27] (03Merged) 10jenkins-bot: developer-portal: Bump container to 2025-08-07-122428-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1176517 (owner: 10BryanDavis) [17:08:39] !log bd808@deploy1003 helmfile [staging] START helmfile.d/services/developer-portal: apply [17:09:15] !log bd808@deploy1003 helmfile [staging] DONE helmfile.d/services/developer-portal: apply [17:09:24] !log bd808@deploy1003 helmfile [codfw] START helmfile.d/services/developer-portal: apply [17:09:44] !log bd808@deploy1003 helmfile [codfw] DONE helmfile.d/services/developer-portal: apply [17:09:52] !log bd808@deploy1003 helmfile [eqiad] START helmfile.d/services/developer-portal: apply [17:11:11] !log bd808@deploy1003 helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply [17:20:06] (03CR) 10Dzahn: admin: stop using groups parsoid-roots and parsoid-admin (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1176337 (https://phabricator.wikimedia.org/T401300) (owner: 10Dzahn) [17:24:08] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to parsoidtest1001 and testreduce1002 for OSleger_WMF - 
https://phabricator.wikimedia.org/T401300#11069331 (10Dzahn) [17:24:19] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to parsoidtest1001 and testreduce1002 for OSleger_WMF - https://phabricator.wikimedia.org/T401300#11069332 (10Dzahn) 05Open→03In progress [17:25:07] 06SRE, 10SRE-Access-Requests, 06MW-Interfaces-Team: Requesting access to analytics-privatedata-users, SSH and Kerberos for HCoplin-WMF - https://phabricator.wikimedia.org/T400897#11069335 (10Dzahn) This ticket looks resolved. Is it? [17:25:53] 06SRE, 10SRE-Access-Requests: Request to add dsaez to analytics-research-admins - https://phabricator.wikimedia.org/T400344#11069338 (10Dzahn) a:03Miriam [17:26:03] !log bking@cumin2002 START - Cookbook sre.wdqs.data-transfer (T386098, transfer newly-reloaded data) xfer wikidata_main from wdqs1022.eqiad.wmnet -> wdqs1015.eqiad.wmnet w/ force delete existing files, repooling both afterwards [17:26:06] T386098: Run a full data-reload on wdqs-main, wdqs-scholarly and wdqs to capture new blank node labels - https://phabricator.wikimedia.org/T386098 [17:27:18] !log bking@cumin2002 START - Cookbook sre.wdqs.data-transfer (T386098, transfer newly-reloaded data) xfer wikidata_main from wdqs2007.codfw.wmnet -> wdqs2022.codfw.wmnet w/ force delete existing files, repooling both afterwards [17:28:39] 06SRE, 10SRE-Access-Requests, 06MW-Interfaces-Team: Requesting access to analytics-privatedata-users, SSH and Kerberos for HCoplin-WMF - https://phabricator.wikimedia.org/T400897#11069354 (10Dzahn) @HCoplin-WMF Does your access work? 
[17:29:03] (03CR) 10Dzahn: [C:03+2] admin: add an alias to my own .bash_profile [puppet] - 10https://gerrit.wikimedia.org/r/1176322 (owner: 10Dzahn) [17:29:47] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-fe2017.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [17:30:58] (03CR) 10Dzahn: [C:03+2] gerrit: replicate repo renames as "gerrit2" application user [puppet] - 10https://gerrit.wikimedia.org/r/1175122 (https://phabricator.wikimedia.org/T239693) (owner: 10Hashar) [17:31:56] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host ms-fe2019.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [17:32:45] FIRING: [2x] Traffic bill over quota: Alert for device cr1-eqiad.wikimedia.org - Traffic bill over quota - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [17:33:30] (03CR) 10Arlolra: [C:03+1] admin: stop using groups parsoid-roots and parsoid-admin [puppet] - 10https://gerrit.wikimedia.org/r/1176337 (https://phabricator.wikimedia.org/T401300) (owner: 10Dzahn) [17:34:00] (03CR) 10RLazarus: "Welcome back, and thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1171205 (https://phabricator.wikimedia.org/T398422) (owner: 10Phuedx) [17:34:33] (03CR) 10Dzahn: [C:03+2] admin: upgrade user osleger, add to parsoid-test-roots [puppet] - 10https://gerrit.wikimedia.org/r/1176336 (https://phabricator.wikimedia.org/T401300) (owner: 10Dzahn) [17:37:45] FIRING: [4x] Traffic bill over quota: Alert for device cr1-eqiad.wikimedia.org - Traffic bill over quota - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [17:38:07] (03CR) 10Ssingh: [C:03+1] "I think that's probably better. I will let Bryan comment too given they are the author of this patch." 
[puppet] - 10https://gerrit.wikimedia.org/r/1172435 (https://phabricator.wikimedia.org/T400421) (owner: 10BryanDavis) [17:39:17] hmm traffic bill over qupta [17:39:19] quota [17:40:11] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to parsoidtest1001 and testreduce1002 for OSleger_WMF - https://phabricator.wikimedia.org/T401300#11069385 (10Dzahn) ` [testreduce1002:~] $ id osleger uid=49599(osleger) gid=500(wikidev) groups=500(wikidev),772(parsoid-test-roots) [parsoidt... [17:40:32] PROBLEM - Disk space on an-worker1127 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/e 159616 MB (4% inode=99%): /var/lib/hadoop/data/g 150122 MB (3% inode=99%): /var/lib/hadoop/data/j 160621 MB (4% inode=99%): /var/lib/hadoop/data/c 159199 MB (4% inode=99%): /var/lib/hadoop/data/b 153330 MB (4% inode=99%): /var/lib/hadoop/data/l 158755 MB (4% inode=99%): /var/lib/hadoop/data/k 161148 MB (4% inode=99%): /var/lib/hadoop/data [17:40:32] 2 MB (4% inode=99%): /var/lib/hadoop/data/i 159121 MB (4% inode=99%): /var/lib/hadoop/data/m 151467 MB (4% inode=99%): /var/lib/hadoop/data/d 158220 MB (4% inode=99%): /var/lib/hadoop/data/h 160578 MB (4% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1127&var-datasource=eqiad+prometheus/ops [17:40:51] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to parsoidtest1001 and testreduce1002 for OSleger_WMF - https://phabricator.wikimedia.org/T401300#11069399 (10Dzahn) 05In progress→03Resolved a:03Dzahn [17:41:02] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to parsoidtest1001 and testreduce1002 for OSleger_WMF - https://phabricator.wikimedia.org/T401300#11069401 (10Dzahn) a:05Dzahn→03None [17:42:41] jhancock@cumin1003 provision (PID 1037275) is awaiting input [17:42:48] (03PS2) 10Dzahn: admin: stop using groups parsoid-roots and parsoid-admin [puppet] - 
https://gerrit.wikimedia.org/r/1176337 (https://phabricator.wikimedia.org/T401300)
[17:45:22] (CR) Dzahn: [V:+1] "Thank you very much for the response, Elukey :)" [puppet] - https://gerrit.wikimedia.org/r/1174566 (https://phabricator.wikimedia.org/T395938) (owner: Dzahn)
[17:46:46] ops-eqiad, SRE, DC-Ops: ssw1-f1-eqiad: Fan Spinning Upgraded - https://phabricator.wikimedia.org/T400783#11069417 (Jclark-ctr) @cmooney @ayounsi I am unable to SCP files from the Juniper device. Could you please assist? I am getting a “permission denied” error, and the /var/tmp directory is not acc...
[17:50:01] (CR) Andrew Bogott: [C:+2] Add k3s class for installing k3s on a single cloud-vps node [puppet] - https://gerrit.wikimedia.org/r/1175625 (https://phabricator.wikimedia.org/T393782) (owner: Andrew Bogott)
[17:50:05] (CR) Andrew Bogott: [C:+2] Add puppet class and profile to create k3s cluster-api worker for Magnum [puppet] - https://gerrit.wikimedia.org/r/1175626 (https://phabricator.wikimedia.org/T393782) (owner: Andrew Bogott)
[17:50:09] (CR) Andrew Bogott: [C:+2] Magnum/capi: switch to using wmf-internal image and helm repos [puppet] - https://gerrit.wikimedia.org/r/1176513 (https://phabricator.wikimedia.org/T393782) (owner: Andrew Bogott)
[17:52:45] FIRING: [4x] Traffic bill over quota: Alert for device cr1-eqiad.wikimedia.org - Traffic bill over quota - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota
[17:53:21] jhancock@cumin1003 provision (PID 1037275) is awaiting input
[17:54:20] !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1042.eqiad.wmnet with OS bookworm
[17:54:27] ops-eqiad, SRE, DC-Ops: Q2:rack/setup/install cloudcephosd10[42-47] - https://phabricator.wikimedia.org/T378828#11069427 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host cloudcephosd1042.eqiad.wmnet with OS bookworm
[17:56:10] PROBLEM - Postgres
Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 473371976 and 10 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[17:57:10] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 4784 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[17:57:15] ops-esams, ops-magru, DC-Ops, Traffic, Patch-For-Review: CPU temperature issues in cp hosts - https://phabricator.wikimedia.org/T373993#11069441 (RobH) >>! In T373993#11052738, @BCornwall wrote: > Re-assigning to @RobH: Rob, can you check the hot aisle in magru for us? I can, but can you adv...
[17:57:45] RESOLVED: [2x] Traffic bill over quota: Alert for device cr2-eqord.wikimedia.org - Traffic bill over quota - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota
[17:58:04] (PS3) BryanDavis: varnish: Update User-Agent Policy url in error messages [puppet] - https://gerrit.wikimedia.org/r/1172435 (https://phabricator.wikimedia.org/T400421)
[17:59:21] (CR) BryanDavis: "{{Done}}" [puppet] - https://gerrit.wikimedia.org/r/1172435 (https://phabricator.wikimedia.org/T400421) (owner: BryanDavis)
[17:59:24] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-fe2019.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[17:59:51] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host ms-fe2020.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[18:00:03] SRE, LDAP-Access-Requests, Wikidata, Wikidata Omega Product: Grant Access to for - https://phabricator.wikimedia.org/T401118#11069458 (KFrancis) Thank you! The NDA has been sent to you to sign. I will confirm when all signatures are complete.
[18:00:05] hashar and brennen: Deploy window MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250807T1800)
[18:00:36] o/ train stable on all wikis, nothing for this window.
[18:01:08] ops-esams, ops-magru, DC-Ops, Traffic, Patch-For-Review: CPU temperature issues in cp hosts - https://phabricator.wikimedia.org/T373993#11069462 (ssingh) >>! In T373993#11069441, @RobH wrote: >>>! In T373993#11052738, @BCornwall wrote: >> Re-assigning to @RobH: Rob, can you check the hot aisl...
[18:02:07] ops-esams, ops-magru, DC-Ops, Traffic, Patch-For-Review: CPU temperature issues in cp hosts - https://phabricator.wikimedia.org/T373993#11069464 (RobH) Ok, if Willy asked then I can put in the ticket no worries. I was asking so I could include it in the reasoning for the ticket later to him!
[18:06:06] (CR) Ssingh: [C:+1] "@bcornwall@wikimedia.org: Please review as well and roll it out for @bd808@wikimedia.org. Thank you!" [puppet] - https://gerrit.wikimedia.org/r/1172435 (https://phabricator.wikimedia.org/T400421) (owner: BryanDavis)
[18:07:30] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-fe2020.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[18:10:56] (CR) Ssingh: "https://puppet-compiler.wmflabs.org/output/1169156/6519/dns1005.wikimedia.org/fulldiff.html" [puppet] - https://gerrit.wikimedia.org/r/1169156 (https://phabricator.wikimedia.org/T381608) (owner: CDobbins)
[18:11:58] (CR) Ssingh: "Looking pretty good. Do you want to include the changes from 1172056 (child CR) to this commit itself for a full review?"
[puppet] - https://gerrit.wikimedia.org/r/1169156 (https://phabricator.wikimedia.org/T381608) (owner: CDobbins)
[18:16:12] vriley@cumin1002 reimage (PID 3939021) is awaiting input
[18:16:28] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1042.eqiad.wmnet with OS bookworm
[18:16:38] ops-eqiad, SRE, DC-Ops: Q2:rack/setup/install cloudcephosd10[42-47] - https://phabricator.wikimedia.org/T378828#11069483 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host cloudcephosd1042.eqiad.wmnet with OS bookworm executed with errors: - cloudcephosd104...
[18:18:39] (CR) Dzahn: [V:+1 C:+2] "https://puppet-compiler.wmflabs.org/output/1174566/6518/" [puppet] - https://gerrit.wikimedia.org/r/1174566 (https://phabricator.wikimedia.org/T395938) (owner: Dzahn)
[18:21:52] (PS1) Clare Ming: XLab/Hooks: Only fetch experiment configs when user is registered [extensions/MetricsPlatform] (wmf/1.45.0-wmf.13) - https://gerrit.wikimedia.org/r/1176528
[18:22:05] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, August 07 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [extensions/MetricsPlatform] (wmf/1.45.0-wmf.13) - https://gerrit.wikimedia.org/r/1176528 (owner: Clare Ming)
[18:23:24] !log bking@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T386098, transfer newly-reloaded data) xfer wikidata_main from wdqs2007.codfw.wmnet -> wdqs2022.codfw.wmnet w/ force delete existing files, repooling both afterwards
[18:23:28] T386098: Run a full data-reload on wdqs-main, wdqs-scholarly and wdqs to capture new blank node labels - https://phabricator.wikimedia.org/T386098
[18:24:22] !log bking@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T386098, transfer newly-reloaded data) xfer wikidata_main from wdqs1022.eqiad.wmnet -> wdqs1015.eqiad.wmnet w/ force delete existing
files, repooling both afterwards
[18:27:53] (CR) Dzahn: [V:-1 C:-1] "mix up of apc vs apcu: "APCu is the official replacement for the outdated APC extension"" [puppet] - https://gerrit.wikimedia.org/r/1175916 (owner: Dzahn)
[18:36:12] (CR) BCornwall: [V:+2 C:+2] "All good to go." [puppet] - https://gerrit.wikimedia.org/r/1175588 (owner: Ncmonitor)
[18:37:51] (PS5) Dzahn: phabricator: increase APCu shared memory segment size [puppet] - https://gerrit.wikimedia.org/r/1175916
[18:38:18] (CR) CI reject: [V:-1] phabricator: increase APCu shared memory segment size [puppet] - https://gerrit.wikimedia.org/r/1175916 (owner: Dzahn)
[18:39:35] (PS6) Dzahn: phabricator: increase APCu shared memory segment size [puppet] - https://gerrit.wikimedia.org/r/1175916
[18:40:02] (CR) CI reject: [V:-1] phabricator: increase APCu shared memory segment size [puppet] - https://gerrit.wikimedia.org/r/1175916 (owner: Dzahn)
[18:40:29] (PS1) Clare Ming: Update PageVisit instruments for a logged-in synth experiment [extensions/WikimediaEvents] (wmf/1.45.0-wmf.13) - https://gerrit.wikimedia.org/r/1176532 (https://phabricator.wikimedia.org/T397140)
[18:40:45] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, August 07 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [extensions/WikimediaEvents] (wmf/1.45.0-wmf.13) - https://gerrit.wikimedia.org/r/1176532 (https://phabricator.wikimedia.org/T397140) (owner: Clare Ming)
[18:41:07] !log jhancock@cumin1003 START - Cookbook sre.hosts.reimage for host ms-fe2019.codfw.wmnet with OS bullseye
[18:41:18] ops-codfw, SRE, SRE-swift-storage, DC-Ops: Q1:rack/setup/install ms-fe20[17-20] - https://phabricator.wikimedia.org/T401225#11069533 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin1003 for host ms-fe2019.codfw.wmnet with OS bullseye
[18:41:20] !log jhancock@cumin1003 START -
Cookbook sre.hosts.reimage for host ms-fe2020.codfw.wmnet with OS bullseye
[18:41:28] jouncebot: nowandnext
[18:41:28] For the next 1 hour(s) and 18 minute(s): MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250807T1800)
[18:41:28] In 1 hour(s) and 18 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250807T2000)
[18:41:29] ops-codfw, SRE, SRE-swift-storage, DC-Ops: Q1:rack/setup/install ms-fe20[17-20] - https://phabricator.wikimedia.org/T401225#11069534 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin1003 for host ms-fe2020.codfw.wmnet with OS bullseye
[18:41:47] cjming and dr0ptp4kt and I will be pushing some buttons for https://gerrit.wikimedia.org/r/1171205 shortly
[18:41:56] (PS7) Dzahn: phabricator: increase APCu shared memory segment size [puppet] - https://gerrit.wikimedia.org/r/1175916
[18:42:24] (this is the item scheduled for 22 UTC, opportunistically getting it done early)
[18:42:24] (CR) CI reject: [V:-1] phabricator: increase APCu shared memory segment size [puppet] - https://gerrit.wikimedia.org/r/1175916 (owner: Dzahn)
[18:43:35] (PS1) Phuedx: DNM: MetricsPlatform: Disable logged-in experiments [mediawiki-config] - https://gerrit.wikimedia.org/r/1176533
[18:44:28] (PS8) Dzahn: phabricator: increase APCu shared memory segment size [puppet] - https://gerrit.wikimedia.org/r/1175916
[18:44:56] (CR) CI reject: [V:-1] phabricator: increase APCu shared memory segment size [puppet] - https://gerrit.wikimedia.org/r/1175916 (owner: Dzahn)
[18:45:17] (CR) Dr0ptp4kt: [C:+1] "Per text chats, moving ahead now earlier in the US afternoon now."
[puppet] - https://gerrit.wikimedia.org/r/1171205 (https://phabricator.wikimedia.org/T398422) (owner: Phuedx)
[18:46:44] i +1'd https://gerrit.wikimedia.org/r/c/operations/puppet/+/1171205 rzl cjming (we're chatting on Slack, but we should take it here now
[18:46:49] thanks!
[18:47:06] ack - exciting!
[18:47:10] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 1189241560 and 73 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[18:47:17] cjming: do you want to run the script once manually first, before we start it going automatically? I can walk you through that
[18:47:41] sure
[18:48:02] this'll be on the deployment host, you can just run something like
[18:48:22] mwscript-k8s --sal --comment="Test run for T398422" --follow -- extensions/MetricsPlatform/maintenance/UpdateConfigs.php --wiki aawiki
[18:48:22] T398422: MetricsPlatform: InstrumentConfigFetcher: Make fetching asynchronous - https://phabricator.wikimedia.org/T398422
[18:49:04] rzl: sounds good - thanks!
[18:49:10] cjming you on it?
[18:49:17] on it!
[18:49:21] ty
[18:50:00] oh -- now? or after deploy?
[18:50:04] yep go for it
[18:50:10] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 7880 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[18:50:22] the idea is, if the script doesn't work, we can find that out now and fix it, instead of generating an error every minute :)
[18:50:23] (CR) Dzahn: "how to configure it if the key is "apcu.shm_size" but a . creates a syntax error?" [puppet] - https://gerrit.wikimedia.org/r/1175916 (owner: Dzahn)
[18:50:25] !log cjming@deploy1003 mwscript-k8s job started: extensions/MetricsPlatform/maintenance/UpdateConfigs.php --wiki aawiki # Test run for T398422
[18:50:39] it ran!
[18:50:43] lgtm
[18:50:46] sweet
[18:51:02] RECOVERY - Backup freshness on backup1014 is OK: Fresh: 137 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[18:51:07] here goes, please keep hands and arms inside the vehicle
[18:51:18] 🤞
[18:51:19] (CR) RLazarus: [C:+2] mw::maintenance: ExperimentationLab periodic job [puppet] - https://gerrit.wikimedia.org/r/1171205 (https://phabricator.wikimedia.org/T398422) (owner: Phuedx)
[18:52:01] running puppet-merge now, then puppet on the deployment server, then we can watch for the first run
[18:52:56] > The last Puppet run was at Thu Aug 7 17:52:58 UTC 2025 (59 minutes ago). Last Puppet commit:
[18:52:59] that's weird
[18:53:10] (CR) Dzahn: "extension name: apcu (replaces apc) config key name: apc.shm_size https://www.php.net/manual/en/apcu.configuration.php#ini.apcu.shm-size" [puppet] - https://gerrit.wikimedia.org/r/1175916 (owner: Dzahn)
[18:53:35] sukhe, cdanis: just fyi, puppet hasn't completed on deploy1003 for the last hour, might or might not be related to that puppetdb replication alert
[18:53:44] uh
[18:53:46] hm
[18:53:47] trying a manual run-puppet-agent but not sure if something Interesting is happening
[18:54:04] ok!
[18:54:35] > DNS lookup failed for zuul1001.eqiad.wmet Resolv::DNS::Resource::IN::A
[18:54:47] yeah just got there too
[18:54:54] there is no zuul, only dana 🤔
[18:55:01] ha!
[18:55:09] what the hack?
[18:55:13] I am on that host right now
[18:55:14] cjming, dr0ptp4kt: hang on one sec, unrelated issue :)
[18:55:16] so should be transient unless I am mistaken
[18:55:37] but twice, hmm, that's no good
[18:55:42] https://puppetboard.wikimedia.org/node/deploy1003.eqiad.wmnet
[18:56:02] 👻 :siren:
[18:56:05] oh wait
[18:56:09] zuul1001.eqiad.wmet
[18:56:11] there's your problem
[18:56:17] haha yeah
[18:56:19] ha!
[18:56:23] where is this coming from?
[18:56:38] it's my change :o
[18:56:43] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1174566/8/hieradata/common.yaml
[18:56:47] let me fix the typo
[18:56:55] phew
[18:57:11] ahh thanks!
[18:57:26] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/0525209c07e0a768f7833df3abe745e707230db6%5E%21/#F0
[18:57:29] + zuul-eqiad:
[18:57:31] this basically
[18:57:33] + hosts:
[18:57:35] !log jhancock@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-fe2019.codfw.wmnet with reason: host reimage
[18:57:36] + zuul1001.eqiad.wmet: '1001'
[18:57:45] cdanis, sukhe: sorry to jump the gun!
[18:57:56] no worries, thanks for finding the error :)
[18:58:28] (PS1) Dzahn: zuul/hieradata: fix typo in zuul1001 hostname [puppet] - https://gerrit.wikimedia.org/r/1176535 (https://phabricator.wikimedia.org/T395938)
[18:58:38] (CR) RLazarus: [C:+1] zuul/hieradata: fix typo in zuul1001 hostname [puppet] - https://gerrit.wikimedia.org/r/1176535 (https://phabricator.wikimedia.org/T395938) (owner: Dzahn)
[18:58:54] (CR) Dzahn: [C:+2] "follow-up to I5fef80737cc58972bf464" [puppet] - https://gerrit.wikimedia.org/r/1176535 (https://phabricator.wikimedia.org/T395938) (owner: Dzahn)
[19:00:05] CI would have failed on eqad but not wmet :P
[19:00:32] PROBLEM - Disk space on an-worker1127 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/e 155529 MB (4% inode=99%): /var/lib/hadoop/data/g 145685 MB (3% inode=99%): /var/lib/hadoop/data/j 159884 MB (4% inode=99%): /var/lib/hadoop/data/c 160328 MB (4% inode=99%): /var/lib/hadoop/data/b 153515 MB (4% inode=99%): /var/lib/hadoop/data/l 157739 MB (4% inode=99%): /var/lib/hadoop/data/k 157383 MB (4% inode=99%): /var/lib/hadoop/data
[19:00:32] 1 MB (4% inode=99%): /var/lib/hadoop/data/i 155779 MB (4% inode=99%): /var/lib/hadoop/data/m 153760 MB (4% inode=99%): /var/lib/hadoop/data/d 159154 MB (4% inode=99%): /var/lib/hadoop/data/h 157165 MB (4% inode=99%):
https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-worker1127&var-datasource=eqiad+prometheus/ops
[19:01:22] background: zuul1001/zuul2001 are new zuul CI machines not in production yet. one of the requirements was that they need a zookeeper server on them. when adding the zookeeper::server profile you have to add the hosts to a global (hiera common.yaml) list of all zookeeper servers
[19:01:30] yes, wmet could be added to the typos file
[19:01:36] running puppet on deploy1003
[19:02:01] (PS1) Ssingh: typos: add wmet as a typo for wmnet [puppet] - https://gerrit.wikimedia.org/r/1176536
[19:02:03] thanks!
[19:02:27] mutante: yep ^ :)
[19:03:13] https://www.merriam-webster.com/wordfinder/classic/contains/all/-1/wmet/1
[19:03:31] how likely is "flowmeter" haha
[19:03:37] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-fe2019.codfw.wmnet with reason: host reimage
[19:03:50] (CR) RLazarus: [C:+1] "this might bite us when we announce the Wikimedia Equestrian Team" [puppet] - https://gerrit.wikimedia.org/r/1176536 (owner: Ssingh)
[19:03:54] (CR) Dzahn: [C:+1] "not many likely false positives:P https://www.merriam-webster.com/wordfinder/classic/contains/all/-1/wmet/1" [puppet] - https://gerrit.wikimedia.org/r/1176536 (owner: Ssingh)
[19:04:09] :P
[19:04:09] very creative, rzl :)
[19:04:31] FIRING: NetworkDeviceAlarmActive: Alarm active on ssw1-f1-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive
[19:04:41] (CR) Ssingh: [C:+2] typos: add wmet as a typo for wmnet [puppet] - https://gerrit.wikimedia.org/r/1176536 (owner: Ssingh)
[19:05:07] was puppet broken only on the deployment server but not
globally?
[19:05:22] not sure if there were other hosts but definitely not globally
[19:05:23] just deploy1003 from what I can see
[19:05:25] cant explain right away why that one server does that DNS lookup
[19:05:26] the other servers were not related
[19:05:37] but it's in common.yaml
[19:05:57] puppet run on deploy1003 just finished now
[19:06:03] thanks!
[19:06:05] we will need on deploy2002 as well but yeah
[19:06:12] Deployment_server::Mediawiki::Periodic_jobs/Concat[/etc/helmfile-defaults/mediawiki/periodic-jobs.yaml]/File[/etc/helmfile-defaults/mediawiki/periodic-jobs.yaml]/content:
[19:06:13] cjming, dr0ptp4kt: going ahead now, thanks for your patience
[19:06:24] np!
[19:06:37] thx rzl
[19:07:00] !log rzl@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-cron: apply
[19:07:43] !log rzl@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply
[19:08:54] cjming, dr0ptp4kt: first run looks good, at least from the output!
[19:09:01] https://www.irccloud.com/pastebin/GbIMPxcI/
[19:09:13] nice
[19:09:23] thanks rzl...watching various dashboards
[19:09:25] have a look at any data you want to check to make sure it actually did what you want, otherwise we're all set as far as I'm concerned
[19:09:31] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[19:09:40] that sentence was a mess, you know what I mean
[19:10:44] 5 hosts with failed puppet. deploy2002 - running right now. logging-hd2005 - unrelated SSL cert error when trying to use puppetserver2004. dse-k8s-etcd* (3 servers): fails to start etcd
[19:10:48] rzl: whatever you want
[19:11:13] haha
[19:11:28] rzl: tysm! much obliged 🙌
[19:12:16] puppet finished on deploy2002, you can do codfw now. sorry for the typo.
I had compiled this on all of C:zookeeper but that did not cover the deploy hosts.
[19:12:28] mutante: thanks for fixing it so quickly <3
[19:12:32] not sure why exactly it affected them, but ok :)
[19:12:36] ops-codfw, SRE, SRE-swift-storage, DC-Ops: Q1:rack/setup/install ms-fe20[17-20] - https://phabricator.wikimedia.org/T401225#11069614 (Jhancock.wm)
[19:12:56] mutante: modules/profile/manifests/kubernetes/deployment_server/global_config.pp, iterates over zookeeper_clusters
[19:13:01] and then $ips = $data['brokers'].keys().map |$n| {
[19:13:01] $v4 = ipresolve($n)
[19:13:07] this part failed, the resolve
[19:13:25] aha!:) thanks
[19:14:24] I'll delete our deployment window later, since we don't need it
[19:17:50] cjming, dr0ptp4kt: error logs look fine (per logspam-watch on mwlog1002)
[19:18:06] \o/
[19:18:09] thnx phuedx
[19:28:03] !log jhancock@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin1003"
[19:28:46] !log jhancock@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin1003"
[19:28:47] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-fe2019.codfw.wmnet with OS bullseye
[19:28:59] ops-codfw, SRE, SRE-swift-storage, DC-Ops: Q1:rack/setup/install ms-fe20[17-20] - https://phabricator.wikimedia.org/T401225#11069635 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin1003 for host ms-fe2019.codfw.wmnet with OS bullseye completed: - ms-fe2019 (**WA...
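(Editor's note, not part of the channel traffic.) The incident above comes down to a misspelled domain suffix (`wmet` instead of `wmnet`) in a hostname key in hieradata, which only surfaced when `ipresolve()` ran on the deploy hosts. As a rough illustration of the kind of pre-merge check that would flag it, here is a minimal Python sketch. The allowed-suffix list, the function name `find_suspect_hosts`, and the `example-codfw` entry are assumptions made up for this example; only the `zuul-eqiad` / `hosts` / `zuul1001.eqiad.wmet: '1001'` shape is taken from the diff quoted in the log. This is not the project's actual typos check.

```python
# Hypothetical pre-merge check: flag hostnames whose domain suffix is not
# one of the expected ones. The suffix list below is an assumption for
# illustration, not the real set of valid Wikimedia domains.
ALLOWED_SUFFIXES = (".eqiad.wmnet", ".codfw.wmnet")


def find_suspect_hosts(clusters):
    """Return (cluster, host) pairs whose name ends in no allowed suffix."""
    suspects = []
    for cluster, data in clusters.items():
        for host in data.get("hosts", {}):
            # str.endswith accepts a tuple and matches any of its members.
            if not host.endswith(ALLOWED_SUFFIXES):
                suspects.append((cluster, host))
    return suspects


# Sample data: the first entry mirrors the diff quoted in the log, the
# second is a made-up, correctly spelled entry for contrast.
sample = {
    "zuul-eqiad": {"hosts": {"zuul1001.eqiad.wmet": "1001"}},
    "example-codfw": {"hosts": {"example2001.codfw.wmnet": "2001"}},
}

for cluster, host in find_suspect_hosts(sample):
    print(f"suspect hostname in {cluster}: {host}")
```

A suffix allow-list catches this class of typo without any DNS lookup, so it can run in CI where the hosts may not resolve yet; the trade-off is that the list has to be kept up to date as new sites or domains are added.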
[19:38:11] SRE, Traffic: Setting up Wikimedia Trust and Safety Help Center with Zendesk product: Seeking Guidance on host mapping - https://phabricator.wikimedia.org/T400952#11069639 (Dzahn) a:jhathaway→None
[19:48:32] SRE-SLO, SRE Observability, Abstract Wikipedia team (26Q1 (Jul–Sep)), Essential-Work: Create new SLO dashboard via Pyrra for Wikifunctions - https://phabricator.wikimedia.org/T394057#11069653 (ecarg) @RLazarus We are using [[ https://grafana.wikimedia.org/goto/OZBSw7_HR?orgId=1 | this query ]] t...
[19:55:07] ops-codfw, SRE, SRE-swift-storage, DC-Ops: Q1:rack/setup/install ms-fe20[17-20] - https://phabricator.wikimedia.org/T401225#11069687 (Jhancock.wm)
[19:55:08] !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1042.eqiad.wmnet with OS bookworm
[19:55:16] ops-eqiad, SRE, DC-Ops: Q2:rack/setup/install cloudcephosd10[42-47] - https://phabricator.wikimedia.org/T378828#11069688 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host cloudcephosd1042.eqiad.wmnet with OS bookworm
[19:58:45] FIRING: Outbound discards: Alert for device asw2-a-eqiad.mgmt.eqiad.wmnet - Outbound discards - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards
[20:00:04] RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: OwO what's this, a deployment window?? UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250807T2000). nyaa~
[20:00:05] cjming: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:00:11] o/
[20:01:11] (CR) TrainBranchBot: [C:+2] "Approved by cjming@deploy1003 using scap backport" [extensions/MetricsPlatform] (wmf/1.45.0-wmf.13) - https://gerrit.wikimedia.org/r/1176528 (owner: Clare Ming)
[20:01:36] !log jhancock@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ms-fe2020.codfw.wmnet with OS bullseye
[20:01:44] ops-codfw, SRE, SRE-swift-storage, DC-Ops: Q1:rack/setup/install ms-fe20[17-20] - https://phabricator.wikimedia.org/T401225#11069690 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin1003 for host ms-fe2020.codfw.wmnet with OS bullseye executed with errors: - ms-f...
[20:02:06] (Merged) jenkins-bot: XLab/Hooks: Only fetch experiment configs when user is registered [extensions/MetricsPlatform] (wmf/1.45.0-wmf.13) - https://gerrit.wikimedia.org/r/1176528 (owner: Clare Ming)
[20:02:20] !log cjming@deploy1003 Started scap sync-world: Backport for [[gerrit:1176528|XLab/Hooks: Only fetch experiment configs when user is registered]]
[20:04:16] !log cjming@deploy1003 cjming: Backport for [[gerrit:1176528|XLab/Hooks: Only fetch experiment configs when user is registered]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[20:05:05] !log cjming@deploy1003 cjming: Continuing with sync
[20:10:25] !log cjming@deploy1003 Finished scap sync-world: Backport for [[gerrit:1176528|XLab/Hooks: Only fetch experiment configs when user is registered]] (duration: 08m 05s)
[20:11:13] (CR) TrainBranchBot: [C:+2] "Approved by cjming@deploy1003 using scap backport" [extensions/WikimediaEvents] (wmf/1.45.0-wmf.13) - https://gerrit.wikimedia.org/r/1176532 (https://phabricator.wikimedia.org/T397140) (owner: Clare Ming)
[20:16:45] vriley@cumin1002 reimage (PID 4060279) is awaiting input
[20:22:44] (Merged) jenkins-bot: Update PageVisit instruments for a logged-in synth experiment [extensions/WikimediaEvents] (wmf/1.45.0-wmf.13) - https://gerrit.wikimedia.org/r/1176532 (https://phabricator.wikimedia.org/T397140) (owner: Clare Ming)
[20:23:00] !log cjming@deploy1003 Started scap sync-world: Backport for [[gerrit:1176532|Update PageVisit instruments for a logged-in synth experiment (T397140)]]
[20:23:04] T397140: Run a logged-in synthetic A/A test using the JS SDK - https://phabricator.wikimedia.org/T397140
[20:24:45] !log cjming@deploy1003 cjming: Backport for [[gerrit:1176532|Update PageVisit instruments for a logged-in synth experiment (T397140)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[20:25:07] !log cjming@deploy1003 cjming: Continuing with sync
[20:28:14] (PS1) Ahmon Dancy: mediawiki: Make LOG_FORMAT configurable [deployment-charts] - https://gerrit.wikimedia.org/r/1176543
[20:30:34] !log cjming@deploy1003 Finished scap sync-world: Backport for [[gerrit:1176532|Update PageVisit instruments for a logged-in synth experiment (T397140)]] (duration: 07m 34s)
[20:30:38] T397140: Run a logged-in synthetic A/A test using the JS SDK - https://phabricator.wikimedia.org/T397140
[20:31:21] (PS11) Dr0ptp4kt: Introduce v1 xLab / MPIC SLOs [puppet] - https://gerrit.wikimedia.org/r/1176343 (https://phabricator.wikimedia.org/T398869)
[20:31:46] closing the window since i was the only one in the queue
[20:32:26] (CR) Dr0ptp4kt: "Applying suggested fixes. There's a bit more tidying yet and an open question for `+` cases that @vgutierrez@wikimedia.org has open in her" [puppet] - https://gerrit.wikimedia.org/r/1176343 (https://phabricator.wikimedia.org/T398869) (owner: Dr0ptp4kt)
[20:35:41] (PS12) Dr0ptp4kt: Introduce v1 xLab / MPIC SLOs [puppet] - https://gerrit.wikimedia.org/r/1176343 (https://phabricator.wikimedia.org/T398869)
[20:39:10] ops-eqiad, SRE, DC-Ops: Check list of PXE miss-configs for eqiad - https://phabricator.wikimedia.org/T401441 (Jhancock.wm) NEW
[20:39:16] (CR) Dr0ptp4kt: Introduce v1 xLab / MPIC SLOs (2 comments) [puppet] - https://gerrit.wikimedia.org/r/1176343 (https://phabricator.wikimedia.org/T398869) (owner: Dr0ptp4kt)
[20:40:19] (PS2) Ahmon Dancy: mediawiki: Make LOG_FORMAT configurable [deployment-charts] - https://gerrit.wikimedia.org/r/1176543
[20:40:44] ops-codfw, ops-eqiad, SRE, DC-Ops: Checklist of PCE miss-configs in codfw - https://phabricator.wikimedia.org/T401442 (Jhancock.wm) NEW
[20:40:53] ops-codfw, SRE, DC-Ops: Checklist of PCE miss-configs in codfw - https://phabricator.wikimedia.org/T401442#11069785 (Jhancock.wm)
[20:42:22] ops-codfw, SRE,
DC-Ops: Check list of PXE miss-configs for codfw - https://phabricator.wikimedia.org/T401442#11069796 (Jhancock.wm)
[20:46:25] jouncebot: nowandnext
[20:46:25] For the next 0 hour(s) and 13 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250807T2000)
[20:46:25] In 0 hour(s) and 13 minute(s): Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250807T2100)
[20:47:40] SRE, DNS, Traffic: Set mediawiki.gr, wikipedia.pt, and wiktionary.org.uk NS records to WMF - https://phabricator.wikimedia.org/T401438#11069798 (Dzahn) re: wiktionary.org.uk - My first thought was that Wikimedia UK chapter might want this and/or we should redirect it to https://wikimedia.org.uk/ - Le...
[20:48:45] RESOLVED: Outbound discards: Device asw2-a-eqiad.mgmt.eqiad.wmnet recovered from Outbound discards - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards
[20:49:46] SRE, DNS, Traffic: Set mediawiki.gr, wikipedia.pt, and wiktionary.org.uk NS records to WMF - https://phabricator.wikimedia.org/T401438#11069800 (Dzahn) re: mediawiki.gr - similarly I would think let's ask https://wikimedia.gr Greek chapter if they know or have opinions.
[20:50:51] FYI, in the remainder of the backport window, I'm going to merge https://gerrit.wikimedia.org/r/1176543, which I will follow with a helmfile-only scap deployment to clear no-op chart diffs
[20:50:54] +cc dancy
[20:52:08] (CR) Scott French: "Thanks, Ahmon! This looks good to me. Since this is functionally a noop, I'll merge and then clear the chart-version diffs shortly."
[deployment-charts] - https://gerrit.wikimedia.org/r/1176543 (owner: Ahmon Dancy)
[20:52:12] (CR) Scott French: [C:+2] mediawiki: Make LOG_FORMAT configurable [deployment-charts] - https://gerrit.wikimedia.org/r/1176543 (owner: Ahmon Dancy)
[20:54:54] (Merged) jenkins-bot: mediawiki: Make LOG_FORMAT configurable [deployment-charts] - https://gerrit.wikimedia.org/r/1176543 (owner: Ahmon Dancy)
[20:58:29] !log swfrench@deploy1003 Started scap sync-world: No-op deployment to clear chart version diffs from https://gerrit.wikimedia.org/r/1176543
[21:00:05] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250807T2100)
[21:01:14] !log swfrench@deploy1003 Finished scap sync-world: No-op deployment to clear chart version diffs from https://gerrit.wikimedia.org/r/1176543 (duration: 02m 45s)
[21:03:38] PROBLEM - Swift https backend on ms-fe2019 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[21:13:53] ops-eqiad, SRE, DC-Ops: Q2:rack/setup/install cloudcephosd10[42-47] - https://phabricator.wikimedia.org/T378828#11069848 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host cloudcephosd1042.eqiad.wmnet with OS bookworm executed with errors: - cloudcephosd104...
[21:14:19] !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1042.eqiad.wmnet with OS bullseye [21:14:26] 10ops-eqiad, 06SRE, 06DC-Ops: Q2:rack/setup/install cloudcephosd10[42-47] - https://phabricator.wikimedia.org/T378828#11069851 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host cloudcephosd1042.eqiad.wmnet with OS bullseye [21:15:06] !log bking@cumin2002 START - Cookbook sre.wdqs.data-transfer (T386098, transfer newly-reloaded data) xfer wikidata_main from wdqs1022.eqiad.wmnet -> wdqs1016.eqiad.wmnet w/ force delete existing files, repooling both afterwards [21:15:10] T386098: Run a full data-reload on wdqs-main, wdqs-scholarly and wdqs to capture new blank node labels - https://phabricator.wikimedia.org/T386098 [21:15:15] !log bking@cumin2002 START - Cookbook sre.wdqs.data-transfer (T386098, transfer newly-reloaded data) xfer wikidata_main from wdqs2007.codfw.wmnet -> wdqs2018.codfw.wmnet w/ force delete existing files, repooling both afterwards [21:44:09] 10ops-eqiad, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-fe10[17-20] - https://phabricator.wikimedia.org/T401448 (10RobH) 03NEW [21:44:53] 10ops-eqiad, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-fe10[17-20] - https://phabricator.wikimedia.org/T401448#11069934 (10RobH) [21:45:33] 10ops-eqiad, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install ms-fe10[17-20] - https://phabricator.wikimedia.org/T401448#11069936 (10RobH) [21:53:06] PROBLEM - Backup freshness on backup1014 is CRITICAL: Stale: 3 (gerrit1003, ...), Fresh: 134 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [22:03:55] 06SRE, 10DNS, 06Traffic: Set mediawiki.gr, wikipedia.pt, and wiktionary.org.uk NS records to WMF - https://phabricator.wikimedia.org/T401438#11069994 (10BCornwall) Emails sent to @Mike_Peel and @Geraki and brazenly subbed them here too :) [22:09:02] !log 
bking@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T386098, transfer newly-reloaded data) xfer wikidata_main from wdqs2007.codfw.wmnet -> wdqs2018.codfw.wmnet w/ force delete existing files, repooling both afterwards [22:09:06] T386098: Run a full data-reload on wdqs-main, wdqs-scholarly and wdqs to capture new blank node labels - https://phabricator.wikimedia.org/T386098 [22:15:14] !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1042.eqiad.wmnet with OS bookworm [22:15:29] 10ops-eqiad, 06SRE, 06DC-Ops: Q2:rack/setup/install cloudcephosd10[42-47] - https://phabricator.wikimedia.org/T378828#11070016 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host cloudcephosd1042.eqiad.wmnet with OS bookworm [22:15:35] !log bking@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T386098, transfer newly-reloaded data) xfer wikidata_main from wdqs1022.eqiad.wmnet -> wdqs1016.eqiad.wmnet w/ force delete existing files, repooling both afterwards [22:15:38] T386098: Run a full data-reload on wdqs-main, wdqs-scholarly and wdqs to capture new blank node labels - https://phabricator.wikimedia.org/T386098 [22:34:52] !log vriley@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1042.eqiad.wmnet with reason: host reimage [22:38:39] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1042.eqiad.wmnet with reason: host reimage [22:58:18] !log vriley@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1002" [22:59:08] !log vriley@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1002" [22:59:09] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for 
host cloudcephosd1042.eqiad.wmnet with OS bookworm [22:59:22] 10ops-eqiad, 06SRE, 06DC-Ops: Q2:rack/setup/install cloudcephosd10[42-47] - https://phabricator.wikimedia.org/T378828#11070053 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host cloudcephosd1042.eqiad.wmnet with OS bookworm completed: - cloudcephosd1042 (**PASS**... [22:59:53] 10ops-eqiad, 06SRE, 06DC-Ops: Q2:rack/setup/install cloudcephosd10[42-47] - https://phabricator.wikimedia.org/T378828#11070054 (10VRiley-WMF) [23:04:31] FIRING: NetworkDeviceAlarmActive: Alarm active on ssw1-f1-eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [23:06:44] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [23:08:25] FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-api-ext-next_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:09:31] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate thanos-query.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [23:33:45] !log vriley@cumin1002 START - Cookbook sre.dns.netbox [23:37:10] !log vriley@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt cloudcephosd1047 - vriley@cumin1002" [23:37:15] !log vriley@cumin1002 END 
(PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt cloudcephosd1047 - vriley@cumin1002" [23:37:15] !log vriley@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [23:37:58] !log vriley@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host cloudcephosd1047 [23:38:06] !log vriley@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudcephosd1047 [23:38:28] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1176553 [23:38:28] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1176553 (owner: 10TrainBranchBot) [23:38:47] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host cloudcephosd1047.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [23:53:04] RECOVERY - Backup freshness on backup1014 is OK: Fresh: 137 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [23:53:51] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1176553 (owner: 10TrainBranchBot)