[00:00:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3312', diff saved to https://phabricator.wikimedia.org/P49220 and previous config saved to /var/cache/conftool/dbconfig/20230608-000028-ladsgroup.json
[00:01:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2110 (T336886)', diff saved to https://phabricator.wikimedia.org/P49221 and previous config saved to /var/cache/conftool/dbconfig/20230608-000134-ladsgroup.json
[00:01:38] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886
[00:05:36] (CR) Dzahn: [C: +1] "https://puppet-compiler.wmflabs.org/output/928136/41607/" [puppet] - https://gerrit.wikimedia.org/r/928136 (https://phabricator.wikimedia.org/T330920) (owner: AOkoth)
[00:14:56] (CR) Dzahn: "so what happens here is this:" [puppet] - https://gerrit.wikimedia.org/r/928133 (https://phabricator.wikimedia.org/T330920) (owner: AOkoth)
[00:15:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3312 (T336886)', diff saved to https://phabricator.wikimedia.org/P49222 and previous config saved to /var/cache/conftool/dbconfig/20230608-001534-ladsgroup.json
[00:15:37] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2175.codfw.wmnet with reason: Maintenance
[00:15:39] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886
[00:15:49] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2175.codfw.wmnet with reason: Maintenance
[00:15:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2175 (T336886)', diff saved to https://phabricator.wikimedia.org/P49223 and previous config saved to /var/cache/conftool/dbconfig/20230608-001555-ladsgroup.json
[00:16:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2110', diff saved to https://phabricator.wikimedia.org/P49224 and previous config saved to /var/cache/conftool/dbconfig/20230608-001640-ladsgroup.json
[00:22:06] (CR) Dzahn: "in this context also see https://askubuntu.com/questions/20953/sudo-source-command-not-found" [puppet] - https://gerrit.wikimedia.org/r/928133 (https://phabricator.wikimedia.org/T330920) (owner: AOkoth)
[00:23:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2175 (T336886)', diff saved to https://phabricator.wikimedia.org/P49225 and previous config saved to /var/cache/conftool/dbconfig/20230608-002335-ladsgroup.json
[00:23:39] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886
[00:28:25] !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-cluster
[00:28:25] !log sukhe@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-cluster (exit_code=99)
[00:31:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2110', diff saved to https://phabricator.wikimedia.org/P49226 and previous config saved to /var/cache/conftool/dbconfig/20230608-003146-ladsgroup.json
[00:35:57] (CR) Dzahn: admin: add all miscweb domains as extra SANs (3 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/927998 (https://phabricator.wikimedia.org/T300171) (owner: Jelto)
[00:37:23] (CR) Dzahn: "see inline comments. I think:" [deployment-charts] - https://gerrit.wikimedia.org/r/927998 (https://phabricator.wikimedia.org/T300171) (owner: Jelto)
[00:38:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2175', diff saved to https://phabricator.wikimedia.org/P49227 and previous config saved to /var/cache/conftool/dbconfig/20230608-003841-ladsgroup.json
[00:39:35] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/927778
[00:39:41] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/927778 (owner: TrainBranchBot)
[00:45:47] (CR) Dzahn: "I updated the list of checkboxes on https://phabricator.wikimedia.org/T300171" [deployment-charts] - https://gerrit.wikimedia.org/r/927998 (https://phabricator.wikimedia.org/T300171) (owner: Jelto)
[00:46:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2110 (T336886)', diff saved to https://phabricator.wikimedia.org/P49228 and previous config saved to /var/cache/conftool/dbconfig/20230608-004653-ladsgroup.json
[00:46:55] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2119.codfw.wmnet with reason: Maintenance
[00:46:57] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886
[00:47:08] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2119.codfw.wmnet with reason: Maintenance
[00:47:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2119 (T336886)', diff saved to https://phabricator.wikimedia.org/P49229 and previous config saved to /var/cache/conftool/dbconfig/20230608-004713-ladsgroup.json
[00:52:06] I'm sure it'll get promptly resolved tomorrow*, but I've added T338381 as a train blocker, group0 wikis don't currently have a working math extension
[00:52:07] T338381: Math extension cannot connect to Restbase - https://phabricator.wikimedia.org/T338381
[00:52:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2119 (T336886)', diff saved to https://phabricator.wikimedia.org/P49230 and previous config saved to /var/cache/conftool/dbconfig/20230608-005218-ladsgroup.json
[00:52:22] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886
[00:53:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2175', diff saved to https://phabricator.wikimedia.org/P49231 and previous config saved to /var/cache/conftool/dbconfig/20230608-005347-ladsgroup.json
[00:56:12] SRE, Continuous-Integration-Infrastructure, serviceops-collab, Patch-For-Review: contint2002 service implementation tracking - https://phabricator.wikimedia.org/T324659 (Dzahn) > contint.wikimedia.org minor correction, it's https://integration.wikimedia.org >Daniel proposed to reimage it...
[00:59:03] (CR) Dzahn: "last heads-up, shell access is going to be removed tomorrow" [puppet] - https://gerrit.wikimedia.org/r/919407 (https://phabricator.wikimedia.org/T336427) (owner: Dzahn)
[00:59:12] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/927778 (owner: TrainBranchBot)
[01:06:52] (CR) Dzahn: doc: Switch sync between nodes to rsync::quickdatacopy (1 comment) [puppet] - https://gerrit.wikimedia.org/r/925969 (https://phabricator.wikimedia.org/T333945) (owner: EoghanGaffney)
[01:07:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2119', diff saved to https://phabricator.wikimedia.org/P49232 and previous config saved to /var/cache/conftool/dbconfig/20230608-010724-ladsgroup.json
[01:08:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2175 (T336886)', diff saved to https://phabricator.wikimedia.org/P49233 and previous config saved to /var/cache/conftool/dbconfig/20230608-010853-ladsgroup.json
[01:08:57] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886
[01:22:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2119', diff saved to https://phabricator.wikimedia.org/P49234 and previous config saved to /var/cache/conftool/dbconfig/20230608-012230-ladsgroup.json
[01:23:30] !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 20 days, 0:00:00 on wdqs2012.codfw.wmnet with reason: attempting WDQS stack on bullseye
[01:23:44] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 20 days, 0:00:00 on wdqs2012.codfw.wmnet with reason: attempting WDQS stack on bullseye
[01:37:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2119 (T336886)', diff saved to https://phabricator.wikimedia.org/P49235 and previous config saved to /var/cache/conftool/dbconfig/20230608-013736-ladsgroup.json
[01:37:38] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2136.codfw.wmnet with reason: Maintenance
[01:37:41] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886
[01:38:02] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2136.codfw.wmnet with reason: Maintenance
[01:38:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2136 (T336886)', diff saved to https://phabricator.wikimedia.org/P49236 and previous config saved to /var/cache/conftool/dbconfig/20230608-013808-ladsgroup.json
[01:43:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2136 (T336886)', diff saved to https://phabricator.wikimedia.org/P49237 and previous config saved to /var/cache/conftool/dbconfig/20230608-014303-ladsgroup.json
[01:43:07] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886
[01:55:39] SRE, ops-eqiad, DC-Ops: Eqiad: Sprint Week main tracking task - https://phabricator.wikimedia.org/T332516 (Papaul) Open→Resolved Closing this since Sprint week has been over
[01:56:08] SRE, ops-eqiad, DC-Ops: Eqiad: Sprint Week main tracking task - https://phabricator.wikimedia.org/T332516 (Papaul)
[01:56:19] SRE, ops-eqiad, DC-Ops: Eqiad: Backlog HW failure-racking tasks - Decommision and remote work tasks - https://phabricator.wikimedia.org/T332523 (Papaul) Open→Resolved Closing this since Sprint week has been over
[01:56:58] ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T338194 (Papaul) Open→Resolved a:Papaul
[01:58:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2136', diff saved to https://phabricator.wikimedia.org/P49238 and previous config saved to /var/cache/conftool/dbconfig/20230608-015809-ladsgroup.json
[02:03:08] SRE, ops-codfw, DC-Ops: Relabel: puppetserver2005 to puppetserver2001 - https://phabricator.wikimedia.org/T338327 (Papaul) a:Jhancock.wm @Jhancock.wm this server was called puppetmaster2005 and it was renamed to puppetserver2001 in Netbox but we still have the physical label as puppetmaster2005.C...
[02:03:22] SRE, ops-codfw, DC-Ops: Relabel: puppetmaster2005 to puppetserver2001 - https://phabricator.wikimedia.org/T338327 (Papaul)
[02:06:32] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:07:11] ops-codfw: PowerSupplyFailure - https://phabricator.wikimedia.org/T338347 (Papaul) a:Jhancock.wm
[02:09:48] SRE, ops-codfw, Cloud-Services: rack/setup codfw: cloudbackup2001.codfw.wmnet and cloudbackup2002.codfw.wmnet - https://phabricator.wikimedia.org/T224528 (Papaul)
[02:10:32] SRE, ops-codfw, Cloud-Services, cloud-services-team: cloudbackup2001 lockup on 2023-05-05 - https://phabricator.wikimedia.org/T336060 (Papaul) Open→Resolved @Andrew this is now resolved. Please let us know if you still have issues. Thanks
[02:11:14] (PS1) Samtar: Remove additional v1 suffix when computing internalRestbaseURL [extensions/Math] (wmf/1.41.0-wmf.12) - https://gerrit.wikimedia.org/r/928158 (https://phabricator.wikimedia.org/T334842)
[02:11:32] (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:13:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2136', diff saved to https://phabricator.wikimedia.org/P49239 and previous config saved to /var/cache/conftool/dbconfig/20230608-021315-ladsgroup.json
[02:15:21] jouncebot: nowandnext
[02:15:21] No deployments scheduled for the next 3 hour(s) and 44 minute(s)
[02:15:21] In 3 hour(s) and 44 minute(s): MediaWiki infrastucture (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230608T0600)
[02:15:21] In 3 hour(s) and 44 minute(s): Primary database switchover (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230608T0600)
[02:27:05] PROBLEM - Check systemd state on mwlog2002 is CRITICAL: CRITICAL - degraded: The following units failed: mw-log-cleanup.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:28:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2136 (T336886)', diff saved to https://phabricator.wikimedia.org/P49240 and previous config saved to /var/cache/conftool/dbconfig/20230608-022821-ladsgroup.json
[02:28:24] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2137.codfw.wmnet with reason: Maintenance
[02:28:26] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886
[02:28:37] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2137.codfw.wmnet with reason: Maintenance
[02:28:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2137:3314 (T336886)', diff saved to https://phabricator.wikimedia.org/P49241 and previous config saved to /var/cache/conftool/dbconfig/20230608-022842-ladsgroup.json
[02:28:49] (CR) TrainBranchBot: [C: +2] "Approved by samtar@deploy1002 using scap backport" [extensions/Math] (wmf/1.41.0-wmf.12) - https://gerrit.wikimedia.org/r/928158 (https://phabricator.wikimedia.org/T334842) (owner: Samtar)
[02:31:32] (JobUnavailable) resolved: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:33:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3314 (T336886)', diff saved to https://phabricator.wikimedia.org/P49242 and previous config saved to /var/cache/conftool/dbconfig/20230608-023343-ladsgroup.json
[02:33:47] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886
[02:41:05] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[02:44:10] (Merged) jenkins-bot: Remove additional v1 suffix when computing internalRestbaseURL [extensions/Math] (wmf/1.41.0-wmf.12) - https://gerrit.wikimedia.org/r/928158 (https://phabricator.wikimedia.org/T334842) (owner: Samtar)
[02:44:49] !log samtar@deploy1002 Started scap: Backport for [[gerrit:928158|Remove additional v1 suffix when computing internalRestbaseURL (T334842 T338381)]]
[02:44:54] T338381: Math extension cannot connect to Restbase - https://phabricator.wikimedia.org/T338381
[02:44:54] T334842: Replace usage of VirtualRESTServiceClient in Math extension - https://phabricator.wikimedia.org/T334842
[02:46:33] !log samtar@deploy1002 samtar: Backport for [[gerrit:928158|Remove additional v1 suffix when computing internalRestbaseURL (T334842 T338381)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet
[02:46:38] * TheresNoTime testing
[02:48:38] * TheresNoTime syncing
[02:48:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3314', diff saved to https://phabricator.wikimedia.org/P49243 and previous config saved to /var/cache/conftool/dbconfig/20230608-024849-ladsgroup.json
[02:54:40] !log samtar@deploy1002 Finished scap: Backport for [[gerrit:928158|Remove additional v1 suffix when computing internalRestbaseURL (T334842 T338381)]] (duration: 09m 50s)
[02:54:45] T338381: Math extension cannot connect to Restbase - https://phabricator.wikimedia.org/T338381
[02:54:46] T334842: Replace usage of VirtualRESTServiceClient in Math extension - https://phabricator.wikimedia.org/T334842
[03:03:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3314', diff saved to https://phabricator.wikimedia.org/P49244 and previous config saved to /var/cache/conftool/dbconfig/20230608-030355-ladsgroup.json
[03:19:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3314 (T336886)', diff saved to https://phabricator.wikimedia.org/P49245 and previous config saved to /var/cache/conftool/dbconfig/20230608-031901-ladsgroup.json
[03:19:03] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2138.codfw.wmnet with reason: Maintenance
[03:19:05] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886
[03:19:06] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2138.codfw.wmnet with reason: Maintenance
[03:19:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2138:3314 (T336886)', diff saved to https://phabricator.wikimedia.org/P49246 and previous config saved to /var/cache/conftool/dbconfig/20230608-031911-ladsgroup.json
[03:22:20] (PS2) KartikMistry: testwiki: Enable Section Translation for 10 Wikipedias [mediawiki-config] - https://gerrit.wikimedia.org/r/928013 (https://phabricator.wikimedia.org/T337834)
[03:24:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3314 (T336886)', diff saved to https://phabricator.wikimedia.org/P49247 and previous config saved to /var/cache/conftool/dbconfig/20230608-032416-ladsgroup.json
[03:24:21] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886
[03:26:37] PROBLEM - Check systemd state on mwlog1002 is CRITICAL: CRITICAL - degraded: The following units failed: mw-log-cleanup.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:31:05] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[03:39:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3314', diff saved to https://phabricator.wikimedia.org/P49248 and previous config saved to /var/cache/conftool/dbconfig/20230608-033922-ladsgroup.json
[03:41:05] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[03:54:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3314', diff saved to https://phabricator.wikimedia.org/P49249 and previous config saved to /var/cache/conftool/dbconfig/20230608-035428-ladsgroup.json
[04:09:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3314 (T336886)', diff saved to https://phabricator.wikimedia.org/P49250 and previous config saved to /var/cache/conftool/dbconfig/20230608-040935-ladsgroup.json
[04:09:37] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2139.codfw.wmnet with reason: Maintenance
[04:09:39] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886
[04:10:01] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2139.codfw.wmnet with reason: Maintenance
[04:12:53] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2147.codfw.wmnet with reason: Maintenance
[04:13:06] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2147.codfw.wmnet with reason: Maintenance
[04:13:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2147 (T336886)', diff saved to https://phabricator.wikimedia.org/P49251 and previous config saved to /var/cache/conftool/dbconfig/20230608-041311-ladsgroup.json
[04:18:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2147 (T336886)', diff saved to https://phabricator.wikimedia.org/P49252 and previous config saved to /var/cache/conftool/dbconfig/20230608-041809-ladsgroup.json
[04:18:13] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886
[04:33:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P49253 and previous config saved to /var/cache/conftool/dbconfig/20230608-043315-ladsgroup.json
[04:48:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P49254 and previous config saved to /var/cache/conftool/dbconfig/20230608-044821-ladsgroup.json
[04:58:35] (PS2) KartikMistry: Enable Content and Section Translation for 9 Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/924511 (https://phabricator.wikimedia.org/T337290)
[05:03:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2147 (T336886)', diff saved to https://phabricator.wikimedia.org/P49255 and previous config saved to /var/cache/conftool/dbconfig/20230608-050328-ladsgroup.json
[05:03:31] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2155.codfw.wmnet with reason: Maintenance
[05:03:32] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886
[05:03:44] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2155.codfw.wmnet with reason: Maintenance
[05:03:45] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2187.codfw.wmnet with reason: Maintenance
[05:03:47] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2187.codfw.wmnet with reason: Maintenance
[05:03:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2155 (T336886)', diff saved to https://phabricator.wikimedia.org/P49256 and previous config saved to /var/cache/conftool/dbconfig/20230608-050353-ladsgroup.json
[05:08:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2155 (T336886)', diff saved to https://phabricator.wikimedia.org/P49257 and previous config saved to /var/cache/conftool/dbconfig/20230608-050852-ladsgroup.json
[05:08:56] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886
[05:23:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P49258 and previous config saved to /var/cache/conftool/dbconfig/20230608-052358-ladsgroup.json
[05:39:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P49259 and previous config saved to /var/cache/conftool/dbconfig/20230608-053904-ladsgroup.json
[05:54:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2155 (T336886)', diff saved to https://phabricator.wikimedia.org/P49260 and previous config saved to /var/cache/conftool/dbconfig/20230608-055411-ladsgroup.json
[05:54:13] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2172.codfw.wmnet with reason: Maintenance
[05:54:15] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886
[05:54:26] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2172.codfw.wmnet with reason: Maintenance
[05:54:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2172 (T336886)', diff saved to https://phabricator.wikimedia.org/P49261 and previous config saved to /var/cache/conftool/dbconfig/20230608-055432-ladsgroup.json
[05:58:04] (PS13) Elukey: profile::cache::kafka: add support for PKI [puppet] - https://gerrit.wikimedia.org/r/924507 (https://phabricator.wikimedia.org/T337825)
[05:58:06] (PS9) Elukey: Move cp4037's varnishkafka instances to PKI [puppet] - https://gerrit.wikimedia.org/r/924509 (https://phabricator.wikimedia.org/T337825)
[05:59:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2172 (T336886)', diff saved to https://phabricator.wikimedia.org/P49262 and previous config saved to /var/cache/conftool/dbconfig/20230608-055929-ladsgroup.json
[05:59:33] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886
[06:00:04] Deploy window MediaWiki infrastucture (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230608T0600)
[06:00:04] kormat, marostegui, and Amir1: How many deployers does it take to do Primary database switchover deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230608T0600).
[06:00:31] (CR) Elukey: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41609/console" [puppet] - https://gerrit.wikimedia.org/r/924507 (https://phabricator.wikimedia.org/T337825) (owner: Elukey)
[06:05:25] (CR) Elukey: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41610/console" [puppet] - https://gerrit.wikimedia.org/r/924509 (https://phabricator.wikimedia.org/T337825) (owner: Elukey)
[06:10:37] !log kill remaining processes for `andyrussg` on stat100x nodes to unblock puppet
[06:10:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:14:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P49263 and previous config saved to /var/cache/conftool/dbconfig/20230608-061435-ladsgroup.json
[06:18:47] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:19:37] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:21:01] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50134 bytes in 0.074 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:21:47] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.262 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:29:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P49264 and previous config saved to /var/cache/conftool/dbconfig/20230608-062941-ladsgroup.json
[06:34:45] (CR) Elukey: [V: +1 C: +2] profile::cache::kafka: add support for PKI [puppet] - https://gerrit.wikimedia.org/r/924507 (https://phabricator.wikimedia.org/T337825) (owner: Elukey)
[06:41:24] (CR) Ayounsi: Add rule to allow TFTP to install server to support Juniper ZTP (1 comment) [homer/public] - https://gerrit.wikimedia.org/r/928121 (https://phabricator.wikimedia.org/T336485) (owner: Cathal Mooney)
[06:42:44] (CR) Ayounsi: [C: +1] sites.yaml: add new LVS host lvs2014 (codfw hardware refresh) [homer/public] - https://gerrit.wikimedia.org/r/928113 (https://phabricator.wikimedia.org/T326767) (owner: Ssingh)
[06:44:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2172 (T336886)', diff saved to https://phabricator.wikimedia.org/P49265 and previous config saved to /var/cache/conftool/dbconfig/20230608-064447-ladsgroup.json
[06:44:49] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2179.codfw.wmnet with reason: Maintenance
[06:44:52] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886
[06:45:02] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2179.codfw.wmnet with reason: Maintenance
[06:45:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2179 (T336886)', diff saved to https://phabricator.wikimedia.org/P49266 and previous config saved to /var/cache/conftool/dbconfig/20230608-064508-ladsgroup.json
[06:46:27] PROBLEM - Check systemd state on analytics1069 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-hdfs-journalnode.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:46:29] PROBLEM - Hadoop JournalNode on analytics1069 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.qjournal.server.JournalNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Journalnode_process
[06:50:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2179 (T336886)', diff saved to https://phabricator.wikimedia.org/P49267 and previous config saved to /var/cache/conftool/dbconfig/20230608-065006-ladsgroup.json
[06:50:10] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886
[06:50:14] (PS1) Stevemunene: Swap journal node analytics1069 with an-worker1142 [puppet] - https://gerrit.wikimedia.org/r/928349 (https://phabricator.wikimedia.org/T338336)
[06:51:05] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[06:54:17] (PS2) Cathal Mooney: Add rule to allow TFTP to install server to support Juniper ZTP [homer/public] - https://gerrit.wikimedia.org/r/928121 (https://phabricator.wikimedia.org/T336485)
[07:00:04] Amir1, apergos, and jnuche: gettimeofday() says it's time for UTC morning backport and config training. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230608T0700)
[07:00:04] kart_ and Superpes: A patch you scheduled for UTC morning backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:00:49] \o
[07:00:52] morning! there are no trainees signed up today and only you, kart_ with a couple patches for the window. are you self-deploying today?
[07:00:52] hmm
[07:00:58] Hello :)
[07:01:07] (CR) Joal: [C: +1] "LGTM! Thanks elukey" [puppet] - https://gerrit.wikimedia.org/r/927976 (https://phabricator.wikimedia.org/T337460) (owner: Elukey)
[07:01:11] I see someone else snuck in with 4 patches at the last minute *cough* Superpes *cough* :-)
[07:01:13] apergos: yes. I can go ahead with self-deploy stuff
[07:01:55] Lol :D
[07:02:05] hmm there is no way to guarantee the order in which files will arrive,
[07:02:26] let me look more closely at one of those patches Superpes
[07:02:31] (go ahead, kart_!)
[07:03:08] (CR) TrainBranchBot: [C: +2] "Approved by kartik@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/924511 (https://phabricator.wikimedia.org/T337290) (owner: KartikMistry)
[07:03:20] apergos: for a patch adding new logos, that's no longer an issue since the php config change is only picked up at the restart stage which happens after the sync is done
[07:03:41] those are the two I was concerned about, yes
[07:03:55] (Merged) jenkins-bot: Enable Content and Section Translation for 9 Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/924511 (https://phabricator.wikimedia.org/T337290) (owner: KartikMistry)
[07:04:22] ok then, as the patch owner that snuck them in second, you'll need to wait for kart_ to finish up. are you self-deploying today or would you like us to handle that for you?
[07:04:30] !log kartik@deploy1002 Started scap: Backport for [[gerrit:924511|Enable Content and Section Translation for 9 Wikipedia (T337290)]] [07:04:34] T337290: Enable MinT, Content and Section Translation for 10 languages previously lacking machine translation - https://phabricator.wikimedia.org/T337290 [07:05:07] Superpes: ^ [07:05:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P49268 and previous config saved to /var/cache/conftool/dbconfig/20230608-070512-ladsgroup.json [07:05:58] apergos I can't deploy myself :P [07:06:01] !log kartik@deploy1002 kartik: Backport for [[gerrit:924511|Enable Content and Section Translation for 9 Wikipedia (T337290)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet [07:06:27] ok! you should sign up for a training eventually, learn how to do it, and get deployment rights :-) but that's later. [07:06:38] we'll be happy to do the deploy for you. [07:08:25] Uh, yep for sure, I'd gladly learn to deploy (when I have some time) :) Thanks! [07:09:27] sweet! more info on that is at https://wikitech.wikimedia.org/wiki/Deployments/Training [07:09:45] (03CR) 10Muehlenhoff: "Looks good, two small nits" [puppet] - 10https://gerrit.wikimedia.org/r/928052 (owner: 10Slyngshede) [07:10:30] Oh thanks! 
I love the badge lol :D [07:10:42] :-) [07:12:09] (03PS6) 10Slyngshede: C:idm:jobs enable Bitu timers [puppet] - 10https://gerrit.wikimedia.org/r/928052 [07:14:23] !log kartik@deploy1002 Finished scap: Backport for [[gerrit:924511|Enable Content and Section Translation for 9 Wikipedia (T337290)]] (duration: 09m 52s) [07:14:27] T337290: Enable MinT, Content and Section Translation for 10 languages previously lacking machine translation - https://phabricator.wikimedia.org/T337290 [07:14:45] !log delete pod kask-production-7dfdfc7cbc-2vw5q in wikikube codfw, since it was scheduled on a non dedicated node [07:14:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:14:57] (03CR) 10Cathal Mooney: Add rule to allow TFTP to install server to support Juniper ZTP (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/928121 (https://phabricator.wikimedia.org/T336485) (owner: 10Cathal Mooney) [07:15:08] (03PS3) 10KartikMistry: testwiki: Enable Section Translation for 10 Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/928013 (https://phabricator.wikimedia.org/T337834) [07:15:17] (03CR) 10Cathal Mooney: Add rule to allow TFTP to install server to support Juniper ZTP (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/928121 (https://phabricator.wikimedia.org/T336485) (owner: 10Cathal Mooney) [07:16:00] 10SRE, 10ops-codfw, 10Cloud-VPS, 10cloud-services-team: cloudbackup2001 lockup on 2023-05-05 - https://phabricator.wikimedia.org/T336060 (10taavi) [07:16:10] (03CR) 10Elukey: [V: 03+1 C: 03+2] analytics refinery: add a data purge job for webrequest_sampled_live [puppet] - 10https://gerrit.wikimedia.org/r/927976 (https://phabricator.wikimedia.org/T337460) (owner: 10Elukey) [07:16:12] (03CR) 10Slyngshede: C:idm:jobs enable Bitu timers (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/928052 (owner: 10Slyngshede) [07:16:39] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by kartik@deploy1002 using scap 
backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/928013 (https://phabricator.wikimedia.org/T337834) (owner: 10KartikMistry) [07:17:25] (03Merged) 10jenkins-bot: testwiki: Enable Section Translation for 10 Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/928013 (https://phabricator.wikimedia.org/T337834) (owner: 10KartikMistry) [07:17:55] !log kartik@deploy1002 Started scap: Backport for [[gerrit:928013|testwiki: Enable Section Translation for 10 Wikipedias (T337834)]] [07:17:58] T337834: Enable MinT, Content and Section Translation for a 3rd group of 10 languages previously lacking machine translation - https://phabricator.wikimedia.org/T337834 [07:19:28] !log kartik@deploy1002 kartik: Backport for [[gerrit:928013|testwiki: Enable Section Translation for 10 Wikipedias (T337834)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet [07:20:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P49270 and previous config saved to /var/cache/conftool/dbconfig/20230608-072018-ladsgroup.json [07:21:05] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [07:23:38] oh huh, scap backport actually +2's the change these days? wow. maybe someone told me that recently and I forgot it. gotta try this! 
[07:26:56] apergos: I can confirm it is mostly automatic nowadays `scap backport ` does all the magic [07:26:58] yeah, although if you're +2'ing multiple patches at once it's a good idea to get them stacked in gerrit in advance [07:27:14] !log kartik@deploy1002 Finished scap: Backport for [[gerrit:928013|testwiki: Enable Section Translation for 10 Wikipedias (T337834)]] (duration: 09m 19s) [07:27:17] T337834: Enable MinT, Content and Section Translation for a 3rd group of 10 languages previously lacking machine translation - https://phabricator.wikimedia.org/T337834 [07:27:27] apergos: I'm done :) [07:27:29] these are all config changes so they should be quick enough [07:27:29] though if change number is on master, I don't think it will cherry-pick to the wmf branch for ya [07:27:54] also good morning [07:28:27] morning! [07:28:47] apergos: each deploy still takes at least 10 minutes these days.. [07:29:09] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by ariel@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/924166 (https://phabricator.wikimedia.org/T337641) (owner: 10Superpes15) [07:29:43] hm a merge conflict? [07:29:56] probably you need to press the 'rebase' button manually in gerrit [07:30:22] there are two cases: [07:30:36] Amir1: sorry, didn't see your question yesterday. cache warming is disable for wikidata and commons. [07:30:38] (03PS4) 10ArielGlenn: [kaawiki] Change the logo with an HD version and the tagline [mediawiki-config] - 10https://gerrit.wikimedia.org/r/924166 (https://phabricator.wikimedia.org/T337641) (owner: 10Superpes15) [07:30:54] 10SRE, 10ops-knams, 10DC-Ops: Main Tracking Task for ESAMS Migration to KNAMS - https://phabricator.wikimedia.org/T329219 (10wiki_willy) [07:31:17] 1) it is recognized as a merge conflict by Gerrit: yes you need to rebase. 
For operations/mediawiki-config that happens anytime when one of the files touched by the change has been touched in the new head of the branch regardless of whether that would properly merge or not [07:31:27] it is a hint that you gotta review what got since merged in the branch [07:31:59] 2) Zuul/Jenkins-bot saying as a comment that it cannot merge the change, which is either due to the above or the CI infra being faulty which happens from time to time [07:32:27] (assuming most knows about the above, but it never hurts to repeat for those that do not know :) ) [07:32:44] the rebase was fine, now waiting for zuul [07:33:16] apparently you merged it yourself :] [07:33:20] I did [07:33:26] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/928052 (owner: 10Slyngshede) [07:33:28] But, just to learn, in this case which patchset was merged? [07:33:35] scripts get one chance to get it right, after that I go manual [07:33:54] https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/924166 starting with your first patchset. [07:34:20] we'll be doing them in order unless you have some other preference (this one, regardless, will go first because, well, it's already in process now.) [07:35:14] Oh no, I mean, I read Patchset 4 - "Patch Set 3 was rebased" but then I read "TrainBranchBot - Patchset3 - Approved by ariel@deploy1002 using scap backport" [07:35:19] https://deploy-commands.toolforge.org/bacc/924166 what we were doing, in theory, until a rebase was required. [07:35:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2179 (T336886)', diff saved to https://phabricator.wikimedia.org/P49271 and previous config saved to /var/cache/conftool/dbconfig/20230608-073524-ladsgroup.json [07:35:29] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [07:35:37] we'll continue with scap backport after zuul is done. 
[07:36:09] what scap backport does under the hood: https://wikitech.wikimedia.org/wiki/Backport_windows/Deployers#Merging_and_applying_patches [07:36:37] unrelated: can someone point me to the place where I can see the effective config for each wiki? [07:36:43] Oh thanks will read everything :) [07:36:47] what are you expecting zuul to do since you merged the patch by hand for whatever reason? [07:36:59] I didn't merge it by hand. sorry, I +2'd it by hand [07:37:00] duesen: https://noc.wikimedia.org/wiki.php [07:37:04] or [07:37:06] ugh [07:37:08] you did merge it by hand [07:37:20] it said 'ready to submit', sorry. and I did that. [07:38:03] (03PS2) 10Jelto: admin: add all miscweb domains as extra SANs [deployment-charts] - 10https://gerrit.wikimedia.org/r/927998 (https://phabricator.wikimedia.org/T300171) [07:38:14] well crap [07:38:44] no worries, just run `scap backport` again (if it's stuck somehow) and it'll deal with it just fine [07:39:17] oh I know, I was however intending for the normal gate-and-submit job to complete and somehow thought it had [07:40:00] !log ariel@deploy1002 Started scap: Backport for [[gerrit:924166|[kaawiki] Change the logo with an HD version and the tagline (T337641)]] [07:40:03] T337641: Change logo of Karakalpak Wikipedia (kaa.wikipedia.org) - https://phabricator.wikimedia.org/T337641 [07:40:23] build-and-push-container-images [07:41:12] sync-testservers-k8s [07:41:44] !log ariel@deploy1002 ariel and superpes: Backport for [[gerrit:924166|[kaawiki] Change the logo with an HD version and the tagline (T337641)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet [07:42:04] Superpes: your first change is now live on mwdebug1002; please check that it does what it should [07:42:17] Testing :) [07:43:09] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/927975 (https://phabricator.wikimedia.org/T338277) (owner: 10Hashar) [07:43:25] 
It's fine! Thanks apergos :) [07:43:36] ok, continuing! [07:43:56] sync-canaries-k8s [07:45:32] sync-prod-k8s [07:45:55] that's done, now sync-proxies, sync-apaches [07:46:37] php fpm restart... [07:47:34] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [07:47:50] (03CR) 10Jelto: "thanks for the review, answered in-line. Adding serviceops folks here to get them onboard as well for adding additional miscweb services t" [deployment-charts] - 10https://gerrit.wikimedia.org/r/927998 (https://phabricator.wikimedia.org/T300171) (owner: 10Jelto) [07:47:52] (03CR) 10Ayounsi: [C: 03+1] Add rule to allow TFTP to install server to support Juniper ZTP [homer/public] - 10https://gerrit.wikimedia.org/r/928121 (https://phabricator.wikimedia.org/T336485) (owner: 10Cathal Mooney) [07:49:09] !log ariel@deploy1002 Finished scap: Backport for [[gerrit:924166|[kaawiki] Change the logo with an HD version and the tagline (T337641)]] (duration: 09m 09s) [07:49:14] T337641: Change logo of Karakalpak Wikipedia (kaa.wikipedia.org) - https://phabricator.wikimedia.org/T337641 [07:50:18] Superpes: your first change is live in production; please test there [07:52:17] Uhm I still see the old logo! Maybe need some sort of "purge" apergos [07:52:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [07:53:07] I also tried in incognito mode [07:53:12] are you testing as a logged in user? 
[07:53:43] I tried both [07:53:53] And also with another browser [07:54:04] and I take it you have flushed your browser cache and all that [07:54:31] Yep [07:54:35] I am at kaa.wikipedia.org and in the upper left I see a puzzle globe, right of that the Wikipedia text with large W and A, under that [07:54:51] Erkin ensiklopediya [07:54:56] is that what I should be seeing? [07:55:07] It should be "enciklopediya" [07:55:31] nope [07:56:11] However, it usually takes some time - up to a week - to update the logo, but I know there are some commands to "clear the cache" and show the new image right away! taavi ? [07:56:25] https://kaa.wikipedia.org/static/images/mobile/copyright/wikipedia-tagline-kaa.svg [07:56:40] this is apparently what I am looking at. not sure how that gets populated. [07:57:08] when updating files in static/, you need to use purgeList.php to purge the edge caches for those files [07:57:30] https://wikitech.wikimedia.org/wiki/Kafka_HTTP_purging#One-off_purge [07:57:39] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Netflow/pmacct: use forwardingStatus - https://phabricator.wikimedia.org/T331707 (10ayounsi) @JAllemandou, I was wondering if you had a time-frame for this or need anything from SRE. [07:58:50] I can see it well on my browser/edge after forcing a reload, so probably just caching [07:59:15] thank you taavi ! [07:59:23] (03CR) 10Muehlenhoff: [C: 03+2] Allow setting a separate LDAP DN for Bitu LDAP access [puppet] - 10https://gerrit.wikimedia.org/r/928059 (https://phabricator.wikimedia.org/T338008) (owner: 10Muehlenhoff) [07:59:45] Superpes: I'll do the purge now if you haven't already [08:00:08] Now I see it! Nope I can't lol :P [08:00:22] um you see it? [08:00:24] ok well then [08:00:27] moving along :-D [08:00:31] Btw imho you can merge the other 3 patches together! 
:) [08:00:53] I don't want to deploy them together [08:01:15] we can run over the window, there's nothing right after us for deployments on the calendar [08:01:31] (03PS3) 10ArielGlenn: [itwiktionary] Add a tagline [mediawiki-config] - 10https://gerrit.wikimedia.org/r/926036 (https://phabricator.wikimedia.org/T337688) (owner: 10Superpes15) [08:01:48] Oh perfect, so no problem, thanks :) [08:01:49] doing rebase now before we feed it to scap backport [08:02:03] 10SRE, 10Infrastructure-Foundations, 10netops: Peering: prefer primary IXP for direcly connected networks - https://phabricator.wikimedia.org/T338201 (10cmooney) Thanks for this one. While not ideal I think probably option 2 / adding DIRECT_PEER_PRIMARY is gonna be best. Is getting a little complex, but at... [08:02:35] (03CR) 10Slyngshede: [C: 03+2] C:idm:jobs enable Bitu timers [puppet] - 10https://gerrit.wikimedia.org/r/928052 (owner: 10Slyngshede) [08:03:26] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by ariel@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/926036 (https://phabricator.wikimedia.org/T337688) (owner: 10Superpes15) [08:03:47] letting scap backport do the merge [08:04:11] (03Merged) 10jenkins-bot: [itwiktionary] Add a tagline [mediawiki-config] - 10https://gerrit.wikimedia.org/r/926036 (https://phabricator.wikimedia.org/T337688) (owner: 10Superpes15) [08:04:36] !log ariel@deploy1002 Started scap: Backport for [[gerrit:926036|[itwiktionary] Add a tagline (T337688)]] [08:04:39] T337688: Adding a tagline to itwiktionary - https://phabricator.wikimedia.org/T337688 [08:05:18] doing k8s stuffs... [08:06:10] !log ariel@deploy1002 ariel and superpes: Backport for [[gerrit:926036|[itwiktionary] Add a tagline (T337688)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet [08:06:43] (03CR) 10Elukey: "Looks good Steve! 
Just to be sure, did you follow all the steps in https://wikitech.wikimedia.org/wiki/Data_Engineering/Systems/Cluster/Ha" [puppet] - 10https://gerrit.wikimedia.org/r/928349 (https://phabricator.wikimedia.org/T338336) (owner: 10Stevemunene) [08:07:09] Superpes: please test your second patch on mwdebug1002 [08:07:10] Tested! Looks good apergos :) [08:07:18] heh you're way ahead of me :-) [08:07:31] k8s sync... [08:08:01] Lol I was ready to test :P [08:08:11] Doesn't seem too much complicated [08:09:24] syncing to production now... [08:09:33] Surely it takes some practice to get the hang of it (but it seems that the manuals explain everything) [08:10:24] the wiki docs are very thorough [08:10:35] (03PS1) 10Muehlenhoff: Remove default entries for profile::idm::ldap_dn_password [puppet] - 10https://gerrit.wikimedia.org/r/928354 [08:10:47] php fpm restart... [08:12:43] !log ariel@deploy1002 Finished scap: Backport for [[gerrit:926036|[itwiktionary] Add a tagline (T337688)]] (duration: 08m 07s) [08:12:46] T337688: Adding a tagline to itwiktionary - https://phabricator.wikimedia.org/T337688 [08:12:55] live in production, please test there [08:13:08] I didn't purge anything, same as last time [08:13:14] so do whatever you did to see it :-D [08:13:33] Perfect :P I can see it! Thanks [08:13:39] (03PS2) 10ArielGlenn: [fiwiki] Limitate the use of the ContentTranslation tool [mediawiki-config] - 10https://gerrit.wikimedia.org/r/926628 (https://phabricator.wikimedia.org/T337412) (owner: 10Superpes15) [08:13:56] awesome! 
[08:14:03] rebase of next patch is in progress [08:15:06] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by ariel@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/926628 (https://phabricator.wikimedia.org/T337412) (owner: 10Superpes15) [08:15:52] (03Merged) 10jenkins-bot: [fiwiki] Limitate the use of the ContentTranslation tool [mediawiki-config] - 10https://gerrit.wikimedia.org/r/926628 (https://phabricator.wikimedia.org/T337412) (owner: 10Superpes15) [08:16:22] !log ariel@deploy1002 Started scap: Backport for [[gerrit:926628|[fiwiki] Limitate the use of the ContentTranslation tool (T337412)]] [08:16:25] T337412: Restrict Content Translation tool to the autoreview group in fiwiki - https://phabricator.wikimedia.org/T337412 [08:17:10] (03PS2) 10ArielGlenn: [ruwiki] Add an editautoreviewprotected level protecion [mediawiki-config] - 10https://gerrit.wikimedia.org/r/926632 (https://phabricator.wikimedia.org/T337430) (owner: 10Superpes15) [08:17:50] !log ariel@deploy1002 superpes and ariel: Backport for [[gerrit:926628|[fiwiki] Limitate the use of the ContentTranslation tool (T337412)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet [08:17:58] please test on mwdebug1002. [08:19:43] It works :) apergos [08:19:49] great! coontinuing [08:21:46] (03PS1) 10Muehlenhoff: Add stub secrets for profile::idm::ldap_dn_password [labs/private] - 10https://gerrit.wikimedia.org/r/928356 [08:22:35] (03CR) 10Muehlenhoff: [V: 03+2 C: 03+2] Add stub secrets for profile::idm::ldap_dn_password [labs/private] - 10https://gerrit.wikimedia.org/r/928356 (owner: 10Muehlenhoff) [08:24:21] (03Abandoned) 10Slyngshede: C:idm switch to read/write user for LDAP access. 
[puppet] - 10https://gerrit.wikimedia.org/r/927752 (owner: 10Slyngshede) [08:25:38] !log ariel@deploy1002 Finished scap: Backport for [[gerrit:926628|[fiwiki] Limitate the use of the ContentTranslation tool (T337412)]] (duration: 09m 16s) [08:25:41] T337412: Restrict Content Translation tool to the autoreview group in fiwiki - https://phabricator.wikimedia.org/T337412 [08:26:15] live-in-production-please-test [08:26:57] (03CR) 10Muehlenhoff: [C: 03+2] Remove default entries for profile::idm::ldap_dn_password [puppet] - 10https://gerrit.wikimedia.org/r/928354 (owner: 10Muehlenhoff) [08:26:57] PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/translate/{from}/{to}/{provider} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX [08:27:52] Everything is fine apergos :P [08:28:09] ok! moving on then [08:28:25] RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [08:29:05] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by ariel@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/926632 (https://phabricator.wikimedia.org/T337430) (owner: 10Superpes15) [08:29:19] I was checking who touched the Cxserver ;) [08:29:50] (03Merged) 10jenkins-bot: [ruwiki] Add an editautoreviewprotected level protecion [mediawiki-config] - 10https://gerrit.wikimedia.org/r/926632 (https://phabricator.wikimedia.org/T337430) (owner: 10Superpes15) [08:29:54] and who was it then? 
;-) [08:30:15] !log ariel@deploy1002 Started scap: Backport for [[gerrit:926632|[ruwiki] Add an editautoreviewprotected level protecion (T337430)]] [08:30:21] T337430: Create autoreview-based protection level in Russian Wikipedia - https://phabricator.wikimedia.org/T337430 [08:31:51] !log ariel@deploy1002 ariel and superpes: Backport for [[gerrit:926632|[ruwiki] Add an editautoreviewprotected level protecion (T337430)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet [08:32:40] please test on mwdebug1002 [08:32:55] It's fine apergos [08:33:04] okey dokey [08:34:09] (03PS2) 10Vgutierrez: fifo_log_demux: Fix systemd unit file [puppet] - 10https://gerrit.wikimedia.org/r/927989 (https://phabricator.wikimedia.org/T284555) [08:35:01] (03CR) 10Vgutierrez: fifo_log_demux: Fix systemd unit file (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/927989 (https://phabricator.wikimedia.org/T284555) (owner: 10Vgutierrez) [08:37:32] (03PS1) 10Muehlenhoff: Update access date for aarora [puppet] - 10https://gerrit.wikimedia.org/r/928358 [08:38:01] php fpm restart... [08:38:40] !log ariel@deploy1002 Finished scap: Backport for [[gerrit:926632|[ruwiki] Add an editautoreviewprotected level protecion (T337430)]] (duration: 08m 25s) [08:38:44] T337430: Create autoreview-based protection level in Russian Wikipedia - https://phabricator.wikimedia.org/T337430 [08:38:55] live in production, please test there [08:39:24] Yep it works well! Many many thanks for your help and, above all, for your time apergos :3 [08:39:56] that's what we're here for. thanks for choosing us for your deployment needs :-D [08:40:05] see you next time and don't forget to sign up for a training! [08:40:09] Lol [08:40:17] Yep I'll do it! Thanks again :D [08:40:50] I'll wait a few minutes to make sure nothing unexpected happens (I don't suppose so but still) before declaring the window closed. 
[08:46:43] (03CR) 10Muehlenhoff: [C: 03+2] Update access date for aarora [puppet] - 10https://gerrit.wikimedia.org/r/928358 (owner: 10Muehlenhoff) [08:52:38] (03PS2) 10Majavah: hieradata: add cache_hosts for codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/831044 [08:52:40] (03PS2) 10Majavah: P:mariadb::cloudinfra: add web proxy database/grants [puppet] - 10https://gerrit.wikimedia.org/r/831045 (https://phabricator.wikimedia.org/T316982) [08:52:42] (03PS2) 10Majavah: P:wmcs::novaproxy: enable keepalived for HA [puppet] - 10https://gerrit.wikimedia.org/r/829289 (https://phabricator.wikimedia.org/T316982) [08:52:44] (03PS1) 10Majavah: dynamicproxy: remove proxygetter [puppet] - 10https://gerrit.wikimedia.org/r/928457 [08:52:46] (03PS1) 10Majavah: dynamicproxy: move api files to api/ folder [puppet] - 10https://gerrit.wikimedia.org/r/928458 [08:52:48] (03PS1) 10Majavah: dynamicproxy: use a mariadb backend [puppet] - 10https://gerrit.wikimedia.org/r/928459 (https://phabricator.wikimedia.org/T316982) [08:54:29] 10SRE, 10Infrastructure-Foundations, 10Znuny, 10serviceops-collab: VRTS eqiad replacement - https://phabricator.wikimedia.org/T338418 (10Arnoldokoth) [08:54:33] !log UTC morning backport and config training window done [08:54:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:56:36] (03CR) 10Filippo Giunchedi: [C: 03+1] udp2log: add 6to4 relay [puppet] - 10https://gerrit.wikimedia.org/r/928122 (https://phabricator.wikimedia.org/T338127) (owner: 10Herron) [08:57:39] 10SRE, 10Infrastructure-Foundations, 10Znuny, 10serviceops-collab: VRTS eqiad replacement - https://phabricator.wikimedia.org/T338418 (10Arnoldokoth) [08:57:49] (03CR) 10CI reject: [V: 04-1] dynamicproxy: use a mariadb backend [puppet] - 10https://gerrit.wikimedia.org/r/928459 (https://phabricator.wikimedia.org/T316982) (owner: 10Majavah) [08:57:54] (03PS1) 10Jbond: network: add basic role for to test sonic devices [puppet] - 10https://gerrit.wikimedia.org/r/928460 
[08:58:46] 10SRE, 10Infrastructure-Foundations, 10Znuny, 10serviceops-collab: VRTS eqiad replacement - https://phabricator.wikimedia.org/T338418 (10Arnoldokoth) [09:01:36] (03Abandoned) 10Vgutierrez: varnish: Limit ESI depth to 1 [puppet] - 10https://gerrit.wikimedia.org/r/891586 (https://phabricator.wikimedia.org/T308799) (owner: 10Vgutierrez) [09:09:23] (03CR) 10Stevemunene: Swap journal node analytics1069 with an-worker1142 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/928349 (https://phabricator.wikimedia.org/T338336) (owner: 10Stevemunene) [09:09:38] (03PS2) 10Majavah: dynamicproxy: use a mariadb backend [puppet] - 10https://gerrit.wikimedia.org/r/928459 (https://phabricator.wikimedia.org/T316982) [09:09:40] (03PS3) 10Majavah: P:wmcs::novaproxy: enable keepalived for HA [puppet] - 10https://gerrit.wikimedia.org/r/829289 (https://phabricator.wikimedia.org/T316982) [09:09:42] (03PS1) 10Majavah: mariadb::config::client: allow configuring default database [puppet] - 10https://gerrit.wikimedia.org/r/928461 [09:10:36] !log fetch HAProxy 2.7.9 for thirdparty/haproxy27 bullseye (apt.wm.o) [09:10:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:12:36] (03PS4) 10Majavah: P:wmcs::novaproxy: enable keepalived for HA [puppet] - 10https://gerrit.wikimedia.org/r/829289 (https://phabricator.wikimedia.org/T316982) [09:13:25] (03CR) 10Majavah: P:wmcs::novaproxy: enable keepalived for HA (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/829289 (https://phabricator.wikimedia.org/T316982) (owner: 10Majavah) [09:15:31] (03CR) 10Elukey: [C: 04-1] "Precautionary -1 since something doesn't look right" [puppet] - 10https://gerrit.wikimedia.org/r/928349 (https://phabricator.wikimedia.org/T338336) (owner: 10Stevemunene) [09:16:41] (03PS3) 10Majavah: hieradata: add cache_hosts for codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/831044 [09:16:43] (03PS3) 10Majavah: P:mariadb::cloudinfra: add web proxy database/grants [puppet] - 
10https://gerrit.wikimedia.org/r/831045 (https://phabricator.wikimedia.org/T316982) [09:16:45] (03PS2) 10Majavah: dynamicproxy: remove proxygetter [puppet] - 10https://gerrit.wikimedia.org/r/928457 [09:16:47] (03PS2) 10Majavah: dynamicproxy: move api files to api/ folder [puppet] - 10https://gerrit.wikimedia.org/r/928458 [09:16:49] (03PS2) 10Majavah: mariadb::config::client: allow configuring default database [puppet] - 10https://gerrit.wikimedia.org/r/928461 [09:16:51] (03PS3) 10Majavah: dynamicproxy: use a mariadb backend [puppet] - 10https://gerrit.wikimedia.org/r/928459 (https://phabricator.wikimedia.org/T316982) [09:16:53] (03PS5) 10Majavah: P:wmcs::novaproxy: enable keepalived for HA [puppet] - 10https://gerrit.wikimedia.org/r/829289 (https://phabricator.wikimedia.org/T316982) [09:17:36] !log vgutierrez@cumin1001 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on P{cp5032.eqsin.wmnet,cp4052.ulsfo.wmnet} and A:cp [09:18:58] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: sync [09:19:05] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: sync [09:22:21] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on P{cp5032.eqsin.wmnet,cp4052.ulsfo.wmnet} and A:cp [09:24:26] !log updated to HAProxy 2.7.9 on cp4052 and cp5032 [09:24:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:38:58] (03CR) 10Jbond: [C: 03+1] "lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/921410 (https://phabricator.wikimedia.org/T335531) (owner: 10BCornwall) [09:40:02] (03PS1) 10Muehlenhoff: Fully manage /etc/nftables/ in Puppet [puppet] - 10https://gerrit.wikimedia.org/r/928463 [09:40:19] !log jbond@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jbond@cumin2002" [09:40:20] !log jbond@cumin2002 END 
(PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host puppetserver2001.codfw.wmnet with OS bookworm [09:40:30] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Next steps for Puppet 7 - https://phabricator.wikimedia.org/T330490 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jbond@cumin2002 for host puppetserver2001.codfw.wmnet with OS bookworm completed: - puppetserver2001... [09:43:19] 10SRE, 10Infrastructure-Foundations, 10Znuny, 10serviceops-collab: VRTS eqiad replacement - https://phabricator.wikimedia.org/T338418 (10Arnoldokoth) [09:43:46] (03CR) 10Jbond: [V: 03+1 C: 03+2] ferm: Add types and docs to ferm::client [puppet] - 10https://gerrit.wikimedia.org/r/923396 (owner: 10Jbond) [09:45:58] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Netflow/pmacct: use forwardingStatus - https://phabricator.wikimedia.org/T331707 (10JAllemandou) Thank you for reminding me, I had forgotten about this task @ayounsi. We can prioritize the work, the details we'll need are: - precise names... 
[09:49:06] (03PS1) 10Gehel: analytics: add analytics19[59-60] to excluded_hosts [puppet] - 10https://gerrit.wikimedia.org/r/928465 (https://phabricator.wikimedia.org/T317861) [09:49:10] (03PS1) 10Gehel: analytics: remove analytics10[58-60] from net_topology [puppet] - 10https://gerrit.wikimedia.org/r/928466 (https://phabricator.wikimedia.org/T317861) [09:49:12] (03PS1) 10Gehel: analytics: exclude analytics106[123] [puppet] - 10https://gerrit.wikimedia.org/r/928467 [09:49:14] (03PS1) 10Gehel: analytics: exclude analytics106[123] from net_topology [puppet] - 10https://gerrit.wikimedia.org/r/928468 [09:56:49] (03PS2) 10Gehel: analytics: add analytics19[59-60] to excluded_hosts [puppet] - 10https://gerrit.wikimedia.org/r/928465 (https://phabricator.wikimedia.org/T317861) [09:56:51] (03PS2) 10Gehel: analytics: remove analytics10[58-60] from net_topology [puppet] - 10https://gerrit.wikimedia.org/r/928466 (https://phabricator.wikimedia.org/T317861) [09:56:53] (03PS2) 10Gehel: analytics: exclude analytics106[123] [puppet] - 10https://gerrit.wikimedia.org/r/928467 [09:56:55] (03PS2) 10Gehel: analytics: exclude analytics106[123] from net_topology [puppet] - 10https://gerrit.wikimedia.org/r/928468 [09:57:55] !log mfossati@deploy1002 Started deploy [airflow-dags/platform_eng@bb7526e]: (no justification provided) [09:58:04] !log mfossati@deploy1002 Finished deploy [airflow-dags/platform_eng@bb7526e]: (no justification provided) (duration: 00m 08s) [09:58:12] (03CR) 10Vgutierrez: "looking good, see inline comments" [cookbooks] - 10https://gerrit.wikimedia.org/r/921410 (https://phabricator.wikimedia.org/T335531) (owner: 10BCornwall) [10:00:05] mvolz: May I have your attention please! Services – Citoid / Zotero. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230608T1000) [10:00:05] Deploy window MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230608T1000) [10:00:49] (03PS1) 10Jbond: puppetdb: add puppetserver hosts to puppetdb [puppet] - 10https://gerrit.wikimedia.org/r/928471 [10:00:51] (03PS1) 10Jbond: puppetserver: Add puppetserver role [puppet] - 10https://gerrit.wikimedia.org/r/928472 [10:01:30] (03CR) 10CI reject: [V: 04-1] puppetserver: Add puppetserver role [puppet] - 10https://gerrit.wikimedia.org/r/928472 (owner: 10Jbond) [10:05:12] (03PS2) 10Jbond: puppetdb: add puppetserver hosts to puppetdb [puppet] - 10https://gerrit.wikimedia.org/r/928471 [10:05:55] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_mw-api-int_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:06:27] PROBLEM - WDQS SPARQL on wdqs1006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [10:06:37] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 5): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41615/console" [puppet] - 10https://gerrit.wikimedia.org/r/928471 (owner: 10Jbond) [10:07:04] (03PS2) 10Jbond: puppetserver: Add puppetserver role [puppet] - 10https://gerrit.wikimedia.org/r/928472 [10:07:14] (03PS5) 10EoghanGaffney: doc: Switch sync between nodes to rsync::quickdatacopy [puppet] - 10https://gerrit.wikimedia.org/r/925969 (https://phabricator.wikimedia.org/T333945) [10:07:17] (03PS1) 10Effie Mouzeli: ipoid: add records [dns] - 10https://gerrit.wikimedia.org/r/928473 (https://phabricator.wikimedia.org/T325147) [10:07:31] (03CR) 10CI reject: [V: 04-1] puppetserver: Add puppetserver role [puppet] - 10https://gerrit.wikimedia.org/r/928472 (owner: 10Jbond) [10:08:20] (03Abandoned) 10Gehel: analytics: exclude 
analytics106[123] from net_topology [puppet] - 10https://gerrit.wikimedia.org/r/928468 (owner: 10Gehel) [10:08:28] (03Abandoned) 10Gehel: analytics: remove analytics10[58-60] from net_topology [puppet] - 10https://gerrit.wikimedia.org/r/928466 (https://phabricator.wikimedia.org/T317861) (owner: 10Gehel) [10:08:30] (03PS3) 10Jbond: puppetserver: Add puppetserver role [puppet] - 10https://gerrit.wikimedia.org/r/928472 [10:08:32] (03CR) 10Jbond: [V: 03+1 C: 03+2] puppetdb: add puppetserver hosts to puppetdb [puppet] - 10https://gerrit.wikimedia.org/r/928471 (owner: 10Jbond) [10:08:35] (03Abandoned) 10Gehel: analytics: add analytics19[59-60] to excluded_hosts [puppet] - 10https://gerrit.wikimedia.org/r/928465 (https://phabricator.wikimedia.org/T317861) (owner: 10Gehel) [10:08:40] (03Abandoned) 10Gehel: analytics: exclude analytics106[123] [puppet] - 10https://gerrit.wikimedia.org/r/928467 (owner: 10Gehel) [10:08:57] (03CR) 10CI reject: [V: 04-1] puppetserver: Add puppetserver role [puppet] - 10https://gerrit.wikimedia.org/r/928472 (owner: 10Jbond) [10:10:25] PROBLEM - WDQS SPARQL on wdqs1005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [10:10:54] (03PS1) 10Clément Goubert: mediawiki: Continue deployment downtime tests [deployment-charts] - 10https://gerrit.wikimedia.org/r/928475 (https://phabricator.wikimedia.org/T331609) [10:12:11] sigh... 
seems like the beginning of an outage for wdqs@eqiad [10:12:35] RECOVERY - WDQS SPARQL on wdqs1006 is OK: HTTP OK: HTTP/1.1 200 OK - 690 bytes in 1.097 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [10:12:46] (03PS1) 10Majavah: add fake novaproxy passwords [labs/private] - 10https://gerrit.wikimedia.org/r/928477 [10:13:19] (03CR) 10EoghanGaffney: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41616/console" [puppet] - 10https://gerrit.wikimedia.org/r/925969 (https://phabricator.wikimedia.org/T333945) (owner: 10EoghanGaffney) [10:13:31] (03CR) 10Effie Mouzeli: [C: 03+2] ipoid: add records [dns] - 10https://gerrit.wikimedia.org/r/928473 (https://phabricator.wikimedia.org/T325147) (owner: 10Effie Mouzeli) [10:13:59] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [10:14:59] RECOVERY - WDQS SPARQL on wdqs1005 is OK: HTTP OK: HTTP/1.1 200 OK - 690 bytes in 1.287 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [10:21:45] !log jiji@cumin1001 START - Cookbook sre.dns.netbox [10:22:18] (03PS1) 10Stevemunene: analytics: Decommission analytics10[59-60] from hadoop cluster [puppet] - 10https://gerrit.wikimedia.org/r/928478 (https://phabricator.wikimedia.org/T338408) [10:22:21] (03PS1) 10Stevemunene: analytics: Remove analytics58_60 from the HDFS topology [puppet] - 10https://gerrit.wikimedia.org/r/928479 (https://phabricator.wikimedia.org/T338408) [10:22:42] (03CR) 10CI reject: [V: 04-1] analytics: Decommission analytics10[59-60] from hadoop cluster [puppet] - 10https://gerrit.wikimedia.org/r/928478 (https://phabricator.wikimedia.org/T338408) (owner: 10Stevemunene) [10:22:45] (03CR) 10Hashar: "PCC 
https://puppet-compiler.wmflabs.org/output/927975/1911/contint2002.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/927975 (https://phabricator.wikimedia.org/T338277) (owner: 10Hashar) [10:22:50] (03CR) 10CI reject: [V: 04-1] analytics: Remove analytics58_60 from the HDFS topology [puppet] - 10https://gerrit.wikimedia.org/r/928479 (https://phabricator.wikimedia.org/T338408) (owner: 10Stevemunene) [10:22:56] !log jiji@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:27:05] (03CR) 10EoghanGaffney: [V: 03+1] doc: Switch sync between nodes to rsync::quickdatacopy (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/925969 (https://phabricator.wikimedia.org/T333945) (owner: 10EoghanGaffney) [10:35:23] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/928463 (owner: 10Muehlenhoff) [10:37:29] (03CR) 10Cathal Mooney: [C: 03+2] Add rule to allow TFTP to install server to support Juniper ZTP [homer/public] - 10https://gerrit.wikimedia.org/r/928121 (https://phabricator.wikimedia.org/T336485) (owner: 10Cathal Mooney) [10:37:56] (03PS4) 10Jbond: puppetserver: Add puppetserver role [puppet] - 10https://gerrit.wikimedia.org/r/928472 [10:38:11] (03Merged) 10jenkins-bot: Add rule to allow TFTP to install server to support Juniper ZTP [homer/public] - 10https://gerrit.wikimedia.org/r/928121 (https://phabricator.wikimedia.org/T336485) (owner: 10Cathal Mooney) [10:38:20] (03PS1) 10Slyngshede: Blocklists: Check that in imported regex is valid. 
[software/bitu] - 10https://gerrit.wikimedia.org/r/928482 [10:39:19] (03CR) 10Jbond: [C: 03+2] puppetserver: Add puppetserver role [puppet] - 10https://gerrit.wikimedia.org/r/928472 (owner: 10Jbond) [10:49:12] (03PS1) 10Jbond: puppetserver: add hiera config [puppet] - 10https://gerrit.wikimedia.org/r/928485 (https://phabricator.wikimedia.org/T330490) [10:49:25] !log mwscript maintenance/storage/moveToExternal.php --wiki=svwiki --iconv DB cluster27 (T128153) [10:49:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:49:29] T128153: Migrate all old DB rows from windows-1252 to UTF-8 on svwiki - https://phabricator.wikimedia.org/T128153 [10:49:34] (03CR) 10CI reject: [V: 04-1] puppetserver: add hiera config [puppet] - 10https://gerrit.wikimedia.org/r/928485 (https://phabricator.wikimedia.org/T330490) (owner: 10Jbond) [10:51:12] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [software/bitu] - 10https://gerrit.wikimedia.org/r/928482 (owner: 10Slyngshede) [10:53:00] (03PS2) 10Jbond: puppetserver: add hiera config [puppet] - 10https://gerrit.wikimedia.org/r/928485 (https://phabricator.wikimedia.org/T330490) [10:53:27] (03PS1) 10Effie Mouzeli: service.yaml: add iPoid to the service catalogue [puppet] - 10https://gerrit.wikimedia.org/r/928487 (https://phabricator.wikimedia.org/T325147) [10:53:30] (03CR) 10Jbond: [C: 03+2] puppetserver: add hiera config [puppet] - 10https://gerrit.wikimedia.org/r/928485 (https://phabricator.wikimedia.org/T330490) (owner: 10Jbond) [11:03:04] !log mwscript maintenance/storage/moveToExternal.php --wiki=dawiki --iconv DB cluster27 (T128153) [11:03:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:03:08] T128153: Migrate all old DB rows from windows-1252 to UTF-8 on svwiki - https://phabricator.wikimedia.org/T128153 [11:03:19] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:03:59] (03PS1) 10Jbond: puppetserver: add hiera defaults [puppet] - 10https://gerrit.wikimedia.org/r/928488 [11:05:50] (03CR) 10Cathal Mooney: cloudservices: codfw1dev: enable cloud-private subnet (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/919352 (https://phabricator.wikimedia.org/T324992) (owner: 10Arturo Borrero Gonzalez) [11:06:37] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin1001 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:12:13] (03PS2) 10Clément Goubert: mediawiki: Continue deployment downtime tests [deployment-charts] - 10https://gerrit.wikimedia.org/r/928475 (https://phabricator.wikimedia.org/T331609) [11:13:27] (03CR) 10CI reject: [V: 04-1] mediawiki: Continue deployment downtime tests [deployment-charts] - 10https://gerrit.wikimedia.org/r/928475 (https://phabricator.wikimedia.org/T331609) (owner: 10Clément Goubert) [11:17:52] (03PS3) 10Clément Goubert: mediawiki: Continue deployment downtime tests [deployment-charts] - 10https://gerrit.wikimedia.org/r/928475 (https://phabricator.wikimedia.org/T331609) [11:18:47] (03PS2) 10Jbond: puppetserver: add hiera defaults [puppet] - 10https://gerrit.wikimedia.org/r/928488 [11:19:34] (03CR) 10Jbond: [C: 03+2] puppetserver: add hiera defaults [puppet] - 10https://gerrit.wikimedia.org/r/928488 (owner: 10Jbond) [11:25:04] 10SRE, 10Cloud-VPS, 10Infrastructure-Foundations, 10cloud-services-team, and 2 others: Move cloud vps ns-recursor IPs to host/row-independent addressing - https://phabricator.wikimedia.org/T307357 (10aborrero) I just had a conversation with @cmooney about this, with result being: * we will move the setup t... 
[11:26:24] (03PS1) 10Jbond: puppetserver: ensurepuppetserver is loaded before g10k [puppet] - 10https://gerrit.wikimedia.org/r/928503 [11:27:08] (03PS2) 10Jbond: puppetserver: ensure puppetserver is loaded before g10k [puppet] - 10https://gerrit.wikimedia.org/r/928503 [11:27:12] (03CR) 10Slyngshede: [V: 03+2 C: 03+2] Blocklists: Check that in imported regex is valid. [software/bitu] - 10https://gerrit.wikimedia.org/r/928482 (owner: 10Slyngshede) [11:28:01] (03PS1) 10Superpes15: [knwiki] Add a temporary logo for the 20th anniversary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/928504 (https://phabricator.wikimedia.org/T338136) [11:28:30] 10SRE, 10Cloud-VPS, 10Infrastructure-Foundations, 10cloud-services-team, and 2 others: Move cloud vps ns-recursor IPs to host/row-independent addressing - https://phabricator.wikimedia.org/T307357 (10cmooney) @aborrero that makes sense. For the auth dns service we need to patch the dns repo to update the... [11:28:36] !log mwscript maintenance/storage/moveToExternal.php --wiki=nlwiki --iconv DB cluster26 (T128154) [11:28:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:40] T128154: Migrate all old DB rows from windows-1252 to UTF-8 on nlwiki - https://phabricator.wikimedia.org/T128154 [11:31:35] (03CR) 10Volans: "Left some comment, no blockers." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/921410 (https://phabricator.wikimedia.org/T335531) (owner: 10BCornwall) [11:32:54] (03PS1) 10Btullis: Add a workaround for a kerberos issue with Hive/HDFS on Presto 0.281 [puppet] - 10https://gerrit.wikimedia.org/r/928506 (https://phabricator.wikimedia.org/T337335) [11:34:04] (03CR) 10Btullis: [C: 03+2] Add a workaround for a kerberos issue with Hive/HDFS on Presto 0.281 [puppet] - 10https://gerrit.wikimedia.org/r/928506 (https://phabricator.wikimedia.org/T337335) (owner: 10Btullis) [11:36:21] (03PS2) 10Superpes15: [knwiki] Add a temporary logo for the 20th anniversary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/928504 (https://phabricator.wikimedia.org/T338136) [11:39:03] 10SRE, 10Cloud-VPS, 10Infrastructure-Foundations, 10cloud-services-team, and 2 others: Move cloud vps ns-recursor IPs to host/row-independent addressing - https://phabricator.wikimedia.org/T307357 (10aborrero) Created https://netbox.wikimedia.org/ipam/ip-addresses/13309/ [11:40:36] !log depooling cp4052 for some HAProxy tests - T317799 [11:40:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:40:39] T317799: Rate limiting for hotlinked images - https://phabricator.wikimedia.org/T317799 [11:46:28] (03PS1) 10Arturo Borrero Gonzalez: cloudservices: codfw1dev: refresh VIPs [puppet] - 10https://gerrit.wikimedia.org/r/928508 (https://phabricator.wikimedia.org/T307357) [11:47:17] (03PS1) 10Btullis: presto: add the workaround for kerberos problem to worker nodes [puppet] - 10https://gerrit.wikimedia.org/r/928509 (https://phabricator.wikimedia.org/T337335) [11:48:11] (03CR) 10Btullis: [C: 03+2] presto: add the workaround for kerberos problem to worker nodes [puppet] - 10https://gerrit.wikimedia.org/r/928509 (https://phabricator.wikimedia.org/T337335) (owner: 10Btullis) [11:50:19] (03PS1) 10Slyngshede: Blocklist: Disable title blocklist. 
[software/bitu] - 10https://gerrit.wikimedia.org/r/928510 [11:51:03] !log repooling cp4052 - T317799 [11:51:03] (03PS1) 10Arturo Borrero Gonzalez: wikimediacloud.org: refresh FQDNs for codfw1dev DNS servers [dns] - 10https://gerrit.wikimedia.org/r/928512 (https://phabricator.wikimedia.org/T307357) [11:51:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:51:06] T317799: Rate limiting for hotlinked images - https://phabricator.wikimedia.org/T317799 [11:53:38] (03CR) 10Anzx: [C: 03+1] "thanks for patch" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/928504 (https://phabricator.wikimedia.org/T338136) (owner: 10Superpes15) [11:53:43] (03PS3) 10Jbond: puppetserver::g10k: Call g10k from the puppetserver class [puppet] - 10https://gerrit.wikimedia.org/r/928503 (https://phabricator.wikimedia.org/T330490) [11:54:27] (03PS2) 10Arturo Borrero Gonzalez: wikimediacloud.org: refresh FQDNs for codfw1dev DNS servers [dns] - 10https://gerrit.wikimedia.org/r/928512 (https://phabricator.wikimedia.org/T307357) [11:58:07] (03CR) 10Jbond: [C: 03+2] puppetserver::g10k: Call g10k from the puppetserver class [puppet] - 10https://gerrit.wikimedia.org/r/928503 (https://phabricator.wikimedia.org/T330490) (owner: 10Jbond) [11:58:15] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [software/bitu] - 10https://gerrit.wikimedia.org/r/928510 (owner: 10Slyngshede) [12:01:18] 10SRE, 10ops-eqiad, 10Cloud-Services: Move cloudsw2-d5-eqiad servers to cloudsw1-d5-eqiad - https://phabricator.wikimedia.org/T334644 (10dcaro) @Jclark-ctr cloudvirtlocal1001 is ready to be replugged (it's up, let me know if you need it down) [12:03:43] !log restore cp4052 HAProxy configuration - T317799 [12:03:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:03:46] T317799: Rate limiting for hotlinked images - https://phabricator.wikimedia.org/T317799 [12:03:55] (03PS1) 10Jbond: puppetserver: move hiera file to correct location [puppet] - 
10https://gerrit.wikimedia.org/r/928514 [12:04:05] (03CR) 10Jbond: [V: 03+2 C: 03+2] puppetserver: move hiera file to correct location [puppet] - 10https://gerrit.wikimedia.org/r/928514 (owner: 10Jbond) [12:07:47] 10SRE, 10ops-eqiad, 10Cloud-Services: Move cloudsw2-d5-eqiad servers to cloudsw1-d5-eqiad - https://phabricator.wikimedia.org/T334644 (10dcaro) [12:07:53] (03CR) 10Slyngshede: [V: 03+2 C: 03+2] Blocklist: Disable title blocklist. [software/bitu] - 10https://gerrit.wikimedia.org/r/928510 (owner: 10Slyngshede) [12:08:14] 10SRE, 10ops-eqiad, 10Cloud-Services: Move cloudsw2-d5-eqiad servers to cloudsw1-d5-eqiad - https://phabricator.wikimedia.org/T334644 (10dcaro) [12:10:36] (03PS2) 10Stevemunene: analytics: Remove analytics58_60 from the HDFS topology [puppet] - 10https://gerrit.wikimedia.org/r/928479 (https://phabricator.wikimedia.org/T338408) [12:11:12] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [12:12:24] (03PS1) 10Jbond: puppetserver: support puppet agent5 [puppet] - 10https://gerrit.wikimedia.org/r/928515 (https://phabricator.wikimedia.org/T330490) [12:12:25] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:12:52] (03CR) 10Jbond: [C: 03+2] puppetserver: support puppet agent5 [puppet] - 10https://gerrit.wikimedia.org/r/928515 (https://phabricator.wikimedia.org/T330490) (owner: 10Jbond) [12:14:57] (03PS1) 10Jbond: Revert "puppetserver: support puppet agent5" [puppet] - 10https://gerrit.wikimedia.org/r/928162 [12:15:27] (03CR) 10Jbond: [C: 03+2] Revert "puppetserver: support puppet agent5" [puppet] - 10https://gerrit.wikimedia.org/r/928162 (owner: 10Jbond) [12:17:12] (03PS1) 10Ladsgroup: Remove svwiktionary from legacy encoding [mediawiki-config] - 10https://gerrit.wikimedia.org/r/928516 (https://phabricator.wikimedia.org/T128156) [12:18:38] !log cmooney@deploy1002 Locking from deployment [ALL REPOSITORIES]: LVS maintenance in eqiad, blocking deploys T322937 [12:18:42] T322937: Migrate row E/F network aggregation to 
dedicated Spine switches - https://phabricator.wikimedia.org/T322937 [12:19:34] !log De-pooling lvs1017 to move link to lsw1-e1-eqiad to ssw1-e1-eqiad T322937 [12:19:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:23:11] PROBLEM - pybal on lvs1017 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [12:23:25] ^^^ this is the maintenance I'm doing [12:24:07] PROBLEM - PyBal backends health check on lvs1017 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal [12:24:41] 10SRE, 10ops-eqiad, 10Cloud-Services: Move cloudsw2-d5-eqiad servers to cloudsw1-d5-eqiad - https://phabricator.wikimedia.org/T334644 (10dcaro) [12:24:53] 10SRE, 10ops-eqiad, 10Cloud-Services: Move cloudsw2-d5-eqiad servers to cloudsw1-d5-eqiad - https://phabricator.wikimedia.org/T334644 (10dcaro) [12:25:02] (03CR) 10Muehlenhoff: "The added help text is very helpful to understand the purpose, two typos inline" [puppet] - 10https://gerrit.wikimedia.org/r/927803 (owner: 10Volans) [12:25:23] 10SRE, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team: Move cloudsw2-d5-eqiad servers to cloudsw1-d5-eqiad - https://phabricator.wikimedia.org/T334644 (10taavi) [12:25:31] PROBLEM - PyBal connections to etcd on lvs1017 is CRITICAL: CRITICAL: 0 connections established with conf1007.eqiad.wmnet:4001 (min=12) https://wikitech.wikimedia.org/wiki/PyBal [12:29:02] (03PS1) 10Jbond: puppetserver::g10k: force dir removal [puppet] - 10https://gerrit.wikimedia.org/r/928524 [12:29:45] (03PS1) 10Snwachukwu: Use refinery v0.2.16 in refine jobs. 
[puppet] - 10https://gerrit.wikimedia.org/r/928525 [12:30:33] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41617/console" [puppet] - 10https://gerrit.wikimedia.org/r/928524 (owner: 10Jbond) [12:33:16] (03PS1) 10Ottomata: refine - Exclude 3 deleted old EventLogging schemas from being refined [puppet] - 10https://gerrit.wikimedia.org/r/928526 [12:33:24] (03CR) 10Majavah: "These are all covered by site-specific hiera atm." [puppet] - 10https://gerrit.wikimedia.org/r/923681 (owner: 10Jbond) [12:33:39] (03CR) 10CI reject: [V: 04-1] refine - Exclude 3 deleted old EventLogging schemas from being refined [puppet] - 10https://gerrit.wikimedia.org/r/928526 (owner: 10Ottomata) [12:34:25] (03PS2) 10Ottomata: refine - Exclude 3 deleted old EventLogging schemas from being refined [puppet] - 10https://gerrit.wikimedia.org/r/928526 [12:34:33] PROBLEM - Check systemd state on puppetserver1001 is CRITICAL: CRITICAL - degraded: The following units failed: puppetserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:34:39] (03CR) 10Jbond: "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/927803 (owner: 10Volans) [12:34:44] (03CR) 10Jbond: [C: 03+1] cookbooks: improve test-cookbook binary [puppet] - 10https://gerrit.wikimedia.org/r/927803 (owner: 10Volans) [12:34:47] 10SRE, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team: Move cloudcephosd1021 to cloudsw1-c8-eqiad - https://phabricator.wikimedia.org/T334641 (10taavi) [12:35:35] RECOVERY - PyBal connections to etcd on lvs1017 is OK: OK: 12 connections established with conf1007.eqiad.wmnet:4001 (min=12) https://wikitech.wikimedia.org/wiki/PyBal [12:36:00] !log cmooney@deploy1002 Unlocked for deployment [ALL REPOSITORIES]: LVS maintenance in eqiad, blocking deploys T322937 (duration: 17m 22s) [12:36:04] T322937: Migrate row E/F network aggregation to dedicated Spine switches - https://phabricator.wikimedia.org/T322937 [12:40:49] 
10SRE, 10ops-eqiad, 10DC-Ops: eqiad: Master Tracking Ticket for eqiad expansion cage - https://phabricator.wikimedia.org/T296966 (10cmooney) [12:40:51] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, 10netops: Q2:(Need By: TBD) Rows E/F network racking task - https://phabricator.wikimedia.org/T292095 (10cmooney) 05Open→03Resolved All links have now been migrated. Massive thanks to @Jclark-ctr for all the work on site! [12:41:49] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Migrate row E/F network aggregation to dedicated Spine switches - https://phabricator.wikimedia.org/T322937 (10cmooney) All links have now been successfully migrated. All row E/F connectivity is now flowing via Spine switches ssw1-e1-eqiad... [12:42:11] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Migrate row E/F network aggregation to dedicated Spine switches - https://phabricator.wikimedia.org/T322937 (10cmooney) 05Open→03Resolved [12:45:07] (03PS1) 10Ayounsi: Ignore .vscode and support Python 3.11 in Tox [software/homer] - 10https://gerrit.wikimedia.org/r/928528 [12:45:10] (03PS4) 10Jbond: wmcs: add wmcs-roots to wmcs data Persistence roles [puppet] - 10https://gerrit.wikimedia.org/r/923681 [12:45:13] (03PS1) 10Jbond: profile::admin::groups: set to merge unique [puppet] - 10https://gerrit.wikimedia.org/r/928529 [12:45:32] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10netops: Setup zero touch provisioning (ZTP) for network devices - https://phabricator.wikimedia.org/T336485 (10cmooney) @Papaul I've merged that changed and pushed to the mr routers, so hopefully if you try again the ZTP cookbook will work. 
[12:46:11] (03CR) 10Jbond: wmcs: add wmcs-roots to wmcs data Persistence roles (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/923681 (owner: 10Jbond) [12:46:26] (03CR) 10Majavah: wmcs: add wmcs-roots to wmcs data Persistence roles (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/923681 (owner: 10Jbond) [12:47:04] (03CR) 10Jbond: "this may not be desirable as it makes it slightly harder to see who has access to a server via hiera" [puppet] - 10https://gerrit.wikimedia.org/r/928529 (owner: 10Jbond) [12:48:38] (03CR) 10Jbond: wmcs: add wmcs-roots to wmcs data Persistence roles (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/923681 (owner: 10Jbond) [12:49:18] (03PS5) 10Jbond: wmcs: add wmcs-roots use hiera merge to allow more fine grained control [puppet] - 10https://gerrit.wikimedia.org/r/923681 [12:50:01] (03CR) 10Jbond: wmcs: add wmcs-roots use hiera merge to allow more fine grained control (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/923681 (owner: 10Jbond) [12:50:35] RECOVERY - PyBal backends health check on lvs1017 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [12:51:15] (03CR) 10Ayounsi: [C: 03+1] Change hierdata parents for leaf switches eqiad row F [puppet] - 10https://gerrit.wikimedia.org/r/928056 (https://phabricator.wikimedia.org/T322937) (owner: 10Cathal Mooney) [12:52:25] RECOVERY - pybal on lvs1017 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [12:52:42] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "+1 with the usual request to transfer all these hacks to the modules afterwards 😊" [deployment-charts] - 10https://gerrit.wikimedia.org/r/928475 (https://phabricator.wikimedia.org/T331609) (owner: 10Clément Goubert) [12:53:47] (03CR) 10Clément Goubert: [C: 03+2] mediawiki: Continue deployment downtime tests [deployment-charts] - 10https://gerrit.wikimedia.org/r/928475 (https://phabricator.wikimedia.org/T331609) (owner: 
10Clément Goubert) [12:54:17] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41618/console" [puppet] - 10https://gerrit.wikimedia.org/r/928524 (owner: 10Jbond) [12:54:57] (03Merged) 10jenkins-bot: mediawiki: Continue deployment downtime tests [deployment-charts] - 10https://gerrit.wikimedia.org/r/928475 (https://phabricator.wikimedia.org/T331609) (owner: 10Clément Goubert) [12:57:08] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [12:57:42] (03PS2) 10Jbond: puppetserver::g10k: force dir removal [puppet] - 10https://gerrit.wikimedia.org/r/928524 [12:57:43] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [12:58:29] (03CR) 10Jbond: [C: 03+2] puppetserver::g10k: force dir removal [puppet] - 10https://gerrit.wikimedia.org/r/928524 (owner: 10Jbond) [13:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230608T1300) [13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, TheresNoTime, and taavi: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC afternoon backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230608T1300). [13:00:05] No Gerrit patches in the queue for this window AFAICS. [13:00:24] (03PS1) 10Clément Goubert: mw-debug: Change sleep debug to 4s [deployment-charts] - 10https://gerrit.wikimedia.org/r/928537 (https://phabricator.wikimedia.org/T331609) [13:00:41] yup, I don’t see anything to deploy either [13:00:59] (03CR) 10Hashar: "Well the PCC output is not that helpful cause that includes changes from the parent change ..." 
[puppet] - 10https://gerrit.wikimedia.org/r/927975 (https://phabricator.wikimedia.org/T338277) (owner: 10Hashar) [13:01:17] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/927980 (https://phabricator.wikimedia.org/T338277) (owner: 10Hashar) [13:04:14] (03CR) 10Clément Goubert: [C: 03+2] mw-debug: Change sleep debug to 4s [deployment-charts] - 10https://gerrit.wikimedia.org/r/928537 (https://phabricator.wikimedia.org/T331609) (owner: 10Clément Goubert) [13:05:20] (03Merged) 10jenkins-bot: mw-debug: Change sleep debug to 4s [deployment-charts] - 10https://gerrit.wikimedia.org/r/928537 (https://phabricator.wikimedia.org/T331609) (owner: 10Clément Goubert) [13:05:37] (03PS1) 10Jbond: puppetserver: switch new puppetserveres to use them self [puppet] - 10https://gerrit.wikimedia.org/r/928539 (https://phabricator.wikimedia.org/T330490) [13:05:40] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [13:05:43] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [13:05:53] (03CR) 10Jbond: [C: 03+2] puppetserver: switch new puppetserveres to use them self [puppet] - 10https://gerrit.wikimedia.org/r/928539 (https://phabricator.wikimedia.org/T330490) (owner: 10Jbond) [13:06:19] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [13:06:44] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [13:08:10] (03CR) 10Hashar: [C: 03+1] "PCC https://puppet-compiler.wmflabs.org/output/927980/1912/" [puppet] - 10https://gerrit.wikimedia.org/r/927980 (https://phabricator.wikimedia.org/T338277) (owner: 10Hashar) [13:13:02] (03PS2) 10Cathal Mooney: Change hierdata parents for leaf switches eqiad row F [puppet] - 10https://gerrit.wikimedia.org/r/928056 (https://phabricator.wikimedia.org/T322937) [13:14:26] (03CR) 10Hashar: "So in short that makes the cloned git repo `/srv/dev-images` to be owned by `dockerpkg-builder:contint-admins` 
without the group writable" [puppet] - 10https://gerrit.wikimedia.org/r/927975 (https://phabricator.wikimedia.org/T338277) (owner: 10Hashar) [13:17:30] (03CR) 10Cathal Mooney: "LGTM, just one omission to change as well." [dns] - 10https://gerrit.wikimedia.org/r/928512 (https://phabricator.wikimedia.org/T307357) (owner: 10Arturo Borrero Gonzalez) [13:17:55] (03CR) 10Herron: [V: 03+1 C: 03+2] "thanks for the reviews!" [puppet] - 10https://gerrit.wikimedia.org/r/928122 (https://phabricator.wikimedia.org/T338127) (owner: 10Herron) [13:18:22] (03CR) 10Herron: [V: 03+1] udp2log: add 6to4 relay [puppet] - 10https://gerrit.wikimedia.org/r/928122 (https://phabricator.wikimedia.org/T338127) (owner: 10Herron) [13:19:06] (03PS3) 10Herron: udp2log: add 6to4 relay [puppet] - 10https://gerrit.wikimedia.org/r/928122 (https://phabricator.wikimedia.org/T338127) [13:19:18] (03PS1) 10Vgutierrez: haproxy: Add support for filter bwlim-(in|out) [puppet] - 10https://gerrit.wikimedia.org/r/928541 (https://phabricator.wikimedia.org/T317799) [13:19:46] (03CR) 10Herron: [C: 03+2] udp2log: add 6to4 relay (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/928122 (https://phabricator.wikimedia.org/T338127) (owner: 10Herron) [13:29:24] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [13:30:37] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:35:04] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host lvs2014.codfw.wmnet with OS bullseye [13:36:22] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host cloudswift1002.eqiad.wmnet with OS bullseye [13:39:41] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [13:40:06] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [13:41:21] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [13:43:07] !log cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by 
cookbooks.sre.dns.netbox: Add reverse for new ns-recursor.openstack.codfw1dev.wikimediacloud.org IP. - cmooney@cumin1001" [13:43:23] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [13:43:46] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [13:44:35] !log cmooney@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add reverse for new ns-recursor.openstack.codfw1dev.wikimediacloud.org IP. - cmooney@cumin1001" [13:44:35] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:47:37] jouncebot: nowandnext [13:47:38] For the next 0 hour(s) and 12 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230608T1300) [13:47:38] For the next 0 hour(s) and 12 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230608T1300) [13:47:38] In 2 hour(s) and 12 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230608T1600) [13:48:20] !log jclark@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudswift1002.eqiad.wmnet with reason: host reimage [13:49:04] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host puppetmaster1006.mgmt.eqiad.wmnet with reboot policy FORCED [13:49:43] !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:928516|Remove svwiktionary, svwiki and dawiki from legacy encoding (T128156 T128152 T128153)]] [13:49:49] T128153: Migrate all old DB rows from windows-1252 to UTF-8 on svwiki - https://phabricator.wikimedia.org/T128153 [13:49:49] T128152: Migrate all old DB rows from windows-1252 to UTF-8 on dawiki - https://phabricator.wikimedia.org/T128152 [13:49:49] T128156: Migrate all old DB rows from windows-1252 to UTF-8 on svwiktionary - https://phabricator.wikimedia.org/T128156 [13:50:11] !log 
cmooney@cumin1001 START - Cookbook sre.dns.netbox [13:51:13] !log cmooney@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [13:51:26] !log ladsgroup@deploy1002 ladsgroup: Backport for [[gerrit:928516|Remove svwiktionary, svwiki and dawiki from legacy encoding (T128156 T128152 T128153)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet [13:51:32] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [13:52:40] !log cmooney@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [13:52:44] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudswift1002.eqiad.wmnet with reason: host reimage [13:54:30] m1 lag? [13:54:35] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on lvs2014.codfw.wmnet with reason: host reimage [13:55:08] interesting, but resolved now [13:57:45] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs2014.codfw.wmnet with reason: host reimage [13:58:56] !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:928516|Remove svwiktionary, svwiki and dawiki from legacy encoding (T128156 T128152 T128153)]] (duration: 09m 13s) [13:59:02] T128153: Migrate all old DB rows from windows-1252 to UTF-8 on svwiki - https://phabricator.wikimedia.org/T128153 [13:59:02] T128152: Migrate all old DB rows from windows-1252 to UTF-8 on dawiki - https://phabricator.wikimedia.org/T128152 [13:59:02] T128156: Migrate all old DB rows from windows-1252 to UTF-8 on svwiktionary - https://phabricator.wikimedia.org/T128156 [13:59:56] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [14:00:57] !log cmooney@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [14:01:21] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [14:02:53] !log cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by 
cookbooks.sre.dns.netbox: Add reverse for new ns-recursor.openstack.codfw1dev.wikimediacloud.org IP. - cmooney@cumin1001" [14:04:04] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:04:15] !log cmooney@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add reverse for new ns-recursor.openstack.codfw1dev.wikimediacloud.org IP. - cmooney@cumin1001" [14:04:15] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:05:26] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: git_pull_charts.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:06:23] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [14:06:32] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:07:28] !log cmooney@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [14:07:30] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.279 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:07:41] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [14:07:47] !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001" [14:08:06] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [14:09:12] !log decom cloudsw2-c8-eqiad - T338459 [14:09:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:15] T338459: Decom 
cloudsw2-c8-eqiad - https://phabricator.wikimedia.org/T338459 [14:10:07] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [14:10:22] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:10:30] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [14:11:29] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [14:11:32] (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:11:52] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [14:13:11] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [14:13:51] !log cloudsw2-c8-eqiad> request system zeroize - T338459 [14:13:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:36] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [14:14:37] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host lvs2014.codfw.wmnet with OS bullseye [14:14:47] 10SRE, 10ops-codfw, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q4:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host lvs2014.codfw.wmnet with OS bullseye completed... 
[14:15:09] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [14:15:12] (03PS7) 10Vgutierrez: hiera: Test HAProxy bw limits per URL on cp4052 [puppet] - 10https://gerrit.wikimedia.org/r/928548 (https://phabricator.wikimedia.org/T317799) [14:15:46] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/927795 (https://phabricator.wikimedia.org/T338188) (owner: 10Andrew Bogott) [14:16:32] (JobUnavailable) resolved: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:16:47] (03CR) 10Jbond: [C: 03+1] "lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/924167 (owner: 10Volans) [14:17:07] !log cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add reverse for new ns-recursor.openstack.codfw1dev.wikimediacloud.org IP. - cmooney@cumin1001" [14:17:23] (03PS2) 10Volans: setup.py: remove temporary upper limits of deps [cookbooks] - 10https://gerrit.wikimedia.org/r/924167 [14:17:40] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/928548 (https://phabricator.wikimedia.org/T317799) (owner: 10Vgutierrez) [14:19:11] !log cmooney@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add reverse for new ns-recursor.openstack.codfw1dev.wikimediacloud.org IP. 
- cmooney@cumin1001" [14:19:11] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:19:14] RECOVERY - Check systemd state on puppetserver1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:19:29] (03PS4) 10Cathal Mooney: Add include in 20.172.in-addr.arpa for 172.20.254.0/24 netbox records [dns] - 10https://gerrit.wikimedia.org/r/928543 (https://phabricator.wikimedia.org/T335759) [14:19:48] 10ops-eqiad: Decom cloudsw2-c8-eqiad - https://phabricator.wikimedia.org/T338459 (10ayounsi) a:05ayounsi→03Jclark-ctr The switch has been zeroized, You can power it down (if it's not already down) remove the cables to cloudsw1, store it as spare and update Netbox as necessary. [14:20:20] (03CR) 10CI reject: [V: 04-1] Add include in 20.172.in-addr.arpa for 172.20.254.0/24 netbox records [dns] - 10https://gerrit.wikimedia.org/r/928543 (https://phabricator.wikimedia.org/T335759) (owner: 10Cathal Mooney) [14:20:27] (03CR) 10Volans: [C: 03+2] setup.py: remove temporary upper limits of deps [cookbooks] - 10https://gerrit.wikimedia.org/r/924167 (owner: 10Volans) [14:21:44] (03CR) 10Ottomata: [C: 03+2] refine - Exclude 3 deleted old EventLogging schemas from being refined [puppet] - 10https://gerrit.wikimedia.org/r/928526 (owner: 10Ottomata) [14:22:07] (03CR) 10Ssingh: sre.hosts.reboot-cluster: allow all data centers and not just core ones (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/928546 (owner: 10Ssingh) [14:22:45] (03Merged) 10jenkins-bot: setup.py: remove temporary upper limits of deps [cookbooks] - 10https://gerrit.wikimedia.org/r/924167 (owner: 10Volans) [14:22:55] (03CR) 10Andrew Bogott: [C: 03+2] apt::repository: remove conflicting .list files from bookworm /etc/apt (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/927795 (https://phabricator.wikimedia.org/T338188) (owner: 10Andrew Bogott) [14:23:12] (03CR) 10Volans: sre.hosts.reboot-cluster: allow all 
data centers and not just core ones (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/928546 (owner: 10Ssingh) [14:23:12] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host puppetmaster1006.mgmt.eqiad.wmnet with reboot policy FORCED [14:23:36] (03PS1) 10Volans: sre.hosts.reboot-cluster: simplify Icinga logic [cookbooks] - 10https://gerrit.wikimedia.org/r/928560 [14:23:44] (03PS2) 10Snwachukwu: Use refinery v0.2.16 in refine jobs. [puppet] - 10https://gerrit.wikimedia.org/r/928525 (https://phabricator.wikimedia.org/T335308) [14:24:47] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/927803 (owner: 10Volans) [14:25:09] (03CR) 10Volans: sre.hosts.reboot-cluster: simplify Icinga logic (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/928560 (owner: 10Volans) [14:25:51] (03CR) 10Volans: sre.hosts.reboot-cluster: allow all data centers and not just core ones (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/928546 (owner: 10Ssingh) [14:26:06] (03CR) 10Vgutierrez: "initial and naive approach, feel free to alter it as needed cdanis" [puppet] - 10https://gerrit.wikimedia.org/r/928548 (https://phabricator.wikimedia.org/T317799) (owner: 10Vgutierrez) [14:26:46] (03CR) 10Volans: [C: 03+2] cookbooks: improve test-cookbook binary [puppet] - 10https://gerrit.wikimedia.org/r/927803 (owner: 10Volans) [14:26:58] 10SRE, 10SRE-Access-Requests: Requesting access to deployment shell group and nda LDAP for Superpes15 - https://phabricator.wikimedia.org/T338468 (10Superpes15) [14:28:03] !log jclark@cumin1001 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001" [14:28:04] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudswift1002.eqiad.wmnet with OS bullseye [14:28:13] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, and 2 
others: Q1:(Need By: TBD) rack/setup/install cloudswift100[12] - https://phabricator.wikimedia.org/T289882 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host cloudswift1002.eqiad.wmnet with OS bu... [14:28:32] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['sretest1003'] [14:29:03] (03CR) 10Hashar: fix-staging-perms: set group name from Puppet (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/927674 (https://phabricator.wikimedia.org/T338205) (owner: 10Hashar) [14:29:30] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['sretest1003'] [14:29:43] (03PS1) 10Jbond: puppetserver: use correct puppet ca [puppet] - 10https://gerrit.wikimedia.org/r/928562 [14:30:17] (03PS1) 10Cathal Mooney: Remove include for Netbox generated 2620:0:861:fe10::/64 entries [dns] - 10https://gerrit.wikimedia.org/r/928563 (https://phabricator.wikimedia.org/T338459) [14:32:53] (03CR) 10Hashar: [C: 03+1] git::clone: Ensure that the URL for origin is always up-to-date (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/927750 (https://phabricator.wikimedia.org/T290260) (owner: 10Ahmon Dancy) [14:33:05] (03CR) 10Ayounsi: [C: 03+1] Remove include for Netbox generated 2620:0:861:fe10::/64 entries [dns] - 10https://gerrit.wikimedia.org/r/928563 (https://phabricator.wikimedia.org/T338459) (owner: 10Cathal Mooney) [14:33:12] (03CR) 10Volans: [C: 03+1] "LGTM" [software/homer] - 10https://gerrit.wikimedia.org/r/928528 (owner: 10Ayounsi) [14:33:46] (03CR) 10Cathal Mooney: [C: 03+2] Remove include for Netbox generated 2620:0:861:fe10::/64 entries [dns] - 10https://gerrit.wikimedia.org/r/928563 (https://phabricator.wikimedia.org/T338459) (owner: 10Cathal Mooney) [14:34:15] (03PS1) 10Alexandros Kosiaris: eventgate-main: Increase number of replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/928566 
(https://phabricator.wikimedia.org/T338357) [14:35:06] (03CR) 10Hashar: fix-staging-perms: set set-group-id on /srv/patches subdirs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/927676 (https://phabricator.wikimedia.org/T338205) (owner: 10Hashar) [14:35:21] (03PS2) 10Hashar: fix-staging-perms: set set-group-id on /srv/patches subdirs [puppet] - 10https://gerrit.wikimedia.org/r/927676 (https://phabricator.wikimedia.org/T338205) [14:35:22] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Next steps for Puppet 7 - https://phabricator.wikimedia.org/T330490 (10jbond) Just adding a note here that i needed to do the following to get puppet to work on the CA. this relates to the fact that we have separate ssl directories to supp... [14:36:13] (03PS5) 10Cathal Mooney: Add include in 20.172.in-addr.arpa for 172.20.254.0/24 netbox records [dns] - 10https://gerrit.wikimedia.org/r/928543 (https://phabricator.wikimedia.org/T335759) [14:36:20] (03CR) 10Alexandros Kosiaris: [C: 03+2] eventgate-main: Increase number of replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/928566 (https://phabricator.wikimedia.org/T338357) (owner: 10Alexandros Kosiaris) [14:36:37] !log installing libwep security updates on buster [14:36:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:11] (03Merged) 10jenkins-bot: eventgate-main: Increase number of replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/928566 (https://phabricator.wikimedia.org/T338357) (owner: 10Alexandros Kosiaris) [14:37:47] (03PS6) 10Cathal Mooney: Add include in 20.172.in-addr.arpa for 172.20.254.0/24 netbox records [dns] - 10https://gerrit.wikimedia.org/r/928543 (https://phabricator.wikimedia.org/T335759) [14:39:49] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Netflow/pmacct: use forwardingStatus - https://phabricator.wikimedia.org/T331707 (10ayounsi) Sounds good, thanks! 
> precise names of the fields in the data (we can look for this in realtime in the data when it starts flowing) Sure, is it sa... [14:39:56] (03CR) 10Ayounsi: [C: 03+2] Ignore .vscode and support Python 3.11 in Tox [software/homer] - 10https://gerrit.wikimedia.org/r/928528 (owner: 10Ayounsi) [14:40:03] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/928529 (owner: 10Jbond) [14:40:51] (03CR) 10Ahmon Dancy: [C: 03+1] contint: build dev-images with a system user [puppet] - 10https://gerrit.wikimedia.org/r/927975 (https://phabricator.wikimedia.org/T338277) (owner: 10Hashar) [14:41:20] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/services/eventgate-main: apply [14:41:33] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventgate-main: apply [14:41:45] !log akosiaris@deploy1002 helmfile [codfw] START helmfile.d/services/eventgate-main: apply [14:41:47] (03Merged) 10jenkins-bot: Ignore .vscode and support Python 3.11 in Tox [software/homer] - 10https://gerrit.wikimedia.org/r/928528 (owner: 10Ayounsi) [14:42:06] !log akosiaris@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventgate-main: apply [14:42:15] (03PS1) 10Eevans: sessionstore: upgrade sessionstore1001 to Cassandra 4.1.1 [puppet] - 10https://gerrit.wikimedia.org/r/928569 (https://phabricator.wikimedia.org/T337426) [14:42:17] (03PS1) 10Eevans: sessionstore: upgrade sessionstore1002 to Cassandra 4.1.1 [puppet] - 10https://gerrit.wikimedia.org/r/928570 (https://phabricator.wikimedia.org/T337426) [14:42:19] (03PS1) 10Eevans: sessionstore: upgrade sessionstore1003 to Cassandra 4.1.1 [puppet] - 10https://gerrit.wikimedia.org/r/928571 (https://phabricator.wikimedia.org/T337426) [14:42:21] (03PS1) 10Eevans: sessionstore: move per-host settings back to role [puppet] - 10https://gerrit.wikimedia.org/r/928572 (https://phabricator.wikimedia.org/T337426) [14:42:23] (03PS1) 10Eevans: sessionstore: remove transitional settings [puppet] - 
10https://gerrit.wikimedia.org/r/928573 (https://phabricator.wikimedia.org/T337426) [14:43:31] (03CR) 10Ahmon Dancy: fix-staging-perms: set set-group-id on /srv/patches subdirs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/927676 (https://phabricator.wikimedia.org/T338205) (owner: 10Hashar) [14:45:29] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] Add include in 20.172.in-addr.arpa for 172.20.254.0/24 netbox records [dns] - 10https://gerrit.wikimedia.org/r/928543 (https://phabricator.wikimedia.org/T335759) (owner: 10Cathal Mooney) [14:46:11] (03CR) 10Cathal Mooney: [C: 03+2] Add include in 20.172.in-addr.arpa for 172.20.254.0/24 netbox records [dns] - 10https://gerrit.wikimedia.org/r/928543 (https://phabricator.wikimedia.org/T335759) (owner: 10Cathal Mooney) [14:47:05] (03PS1) 10Slyngshede: next_uid_number: Fix search for large databases. [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/928574 [14:47:26] (03CR) 10Ahmon Dancy: fix-staging-perms: set group name from Puppet (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/927674 (https://phabricator.wikimedia.org/T338205) (owner: 10Hashar) [14:47:56] (03PS1) 10Jbond: puppetdb: add abiliity to override the ca_content [puppet] - 10https://gerrit.wikimedia.org/r/928575 (https://phabricator.wikimedia.org/T330490) [14:48:05] (03CR) 10Jbond: [C: 03+2] puppet::agent: update the force_puppet7 flag [puppet] - 10https://gerrit.wikimedia.org/r/928549 (owner: 10Jbond) [14:48:08] (03CR) 10Jbond: [C: 03+2] puppetserver: use correct puppet ca [puppet] - 10https://gerrit.wikimedia.org/r/928562 (owner: 10Jbond) [14:50:18] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41622/console" [puppet] - 10https://gerrit.wikimedia.org/r/928575 (https://phabricator.wikimedia.org/T330490) (owner: 10Jbond) [14:50:48] (03CR) 10Jbond: [V: 03+1 C: 03+2] puppetdb: add abiliity to override the ca_content [puppet] - 
10https://gerrit.wikimedia.org/r/928575 (https://phabricator.wikimedia.org/T330490) (owner: 10Jbond) [14:52:27] (03CR) 10Jbond: [C: 03+2] "sgtm merging" [puppet] - 10https://gerrit.wikimedia.org/r/928529 (owner: 10Jbond) [14:53:13] (03CR) 10Jbond: wmcs: add wmcs-roots use hiera merge to allow more fine grained control (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/923681 (owner: 10Jbond) [14:54:01] (03CR) 10Hashar: fix-staging-perms: set group name from Puppet (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/927674 (https://phabricator.wikimedia.org/T338205) (owner: 10Hashar) [14:54:07] (03PS2) 10Hashar: fix-staging-perms: set group name from Puppet [puppet] - 10https://gerrit.wikimedia.org/r/927674 (https://phabricator.wikimedia.org/T338205) [14:56:10] (03CR) 10Ahmon Dancy: [C: 03+1] "Thanks for fixing this Antoine!" [puppet] - 10https://gerrit.wikimedia.org/r/927674 (https://phabricator.wikimedia.org/T338205) (owner: 10Hashar) [14:56:42] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good. (Independent of python-ldap's limit, the size limit is also configured/imposed in our slapd setup ($size_limit parameter to th" [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/928574 (owner: 10Slyngshede) [14:57:19] (03CR) 10Slyngshede: [C: 03+2] next_uid_number: Fix search for large databases. 
[software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/928574 (owner: 10Slyngshede) [14:58:11] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host sretest1003.mgmt.eqiad.wmnet with reboot policy FORCED [14:59:51] (03Abandoned) 10MVernon: Update mail dashboard to use a log scale (workflow testing) [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/902324 (owner: 10MVernon) [15:00:52] (03PS1) 10Jbond: puppetdb: allow puppetserver and puppetmasteres to talk to puppetdb [puppet] - 10https://gerrit.wikimedia.org/r/928576 (https://phabricator.wikimedia.org/T330490) [15:03:18] (03PS1) 10Ilias Sarantopoulos: ml-services: update bloom image [deployment-charts] - 10https://gerrit.wikimedia.org/r/928578 (https://phabricator.wikimedia.org/T333861) [15:04:13] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41623/console" [puppet] - 10https://gerrit.wikimedia.org/r/928576 (https://phabricator.wikimedia.org/T330490) (owner: 10Jbond) [15:07:32] (03CR) 10BBlack: [C: 03+1] sre.hosts.reboot-cluster: allow all data centers and not just core ones (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/928546 (owner: 10Ssingh) [15:09:37] !log installing c-ares security updates on bullseye [15:09:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:09:51] (03PS1) 10Hashar: admin: reserve gerrit uid/gid [puppet] - 10https://gerrit.wikimedia.org/r/928580 (https://phabricator.wikimedia.org/T339470) [15:10:59] (03PS2) 10Hashar: admin: reserve gerrit uid/gid [puppet] - 10https://gerrit.wikimedia.org/r/928580 (https://phabricator.wikimedia.org/T338470) [15:11:08] 10SRE, 10ops-codfw, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q4:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767 (10Papaul) [15:12:35] (03CR) 10MVernon: [C: 03+1] sessionstore: upgrade sessionstore1001 to Cassandra 4.1.1 [puppet] - 
10https://gerrit.wikimedia.org/r/928569 (https://phabricator.wikimedia.org/T337426) (owner: 10Eevans) [15:12:38] (03PS1) 10Majavah: Remove l10nupdate manifests [puppet] - 10https://gerrit.wikimedia.org/r/928582 [15:12:42] (03CR) 10MVernon: [C: 03+1] sessionstore: upgrade sessionstore1002 to Cassandra 4.1.1 [puppet] - 10https://gerrit.wikimedia.org/r/928570 (https://phabricator.wikimedia.org/T337426) (owner: 10Eevans) [15:12:49] (03CR) 10MVernon: [C: 03+1] sessionstore: upgrade sessionstore1003 to Cassandra 4.1.1 [puppet] - 10https://gerrit.wikimedia.org/r/928571 (https://phabricator.wikimedia.org/T337426) (owner: 10Eevans) [15:13:06] (03CR) 10Ssingh: sre.hosts.reboot-cluster: simplify Icinga logic (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/928560 (owner: 10Volans) [15:13:23] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['puppetmaster1006'] [15:13:39] 10SRE, 10Deployments, 10Scap: setup automatic deletion of old l10nupdate - https://phabricator.wikimedia.org/T130317 (10taavi) [15:13:41] (03CR) 10Ahmon Dancy: [C: 03+1] Remove l10nupdate manifests [puppet] - 10https://gerrit.wikimedia.org/r/928582 (owner: 10Majavah) [15:13:45] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['puppetmaster1006'] [15:13:48] 10SRE, 10Infrastructure-Foundations, 10User-MoritzMuehlenhoff: system users with UIDs > 500 - https://phabricator.wikimedia.org/T121610 (10taavi) [15:14:24] 10SRE, 10Deployments, 10Infrastructure-Foundations, 10serviceops-radar: Make l10nupdate user a system user - https://phabricator.wikimedia.org/T120585 (10taavi) 05Open→03Declined Per https://gerrit.wikimedia.org/r/c/operations/puppet/+/896318. [15:14:42] (03CR) 10MVernon: [C: 03+1] "...this is once the eqiad rollout is done, I'm assuming." 
[puppet] - 10https://gerrit.wikimedia.org/r/928572 (https://phabricator.wikimedia.org/T337426) (owner: 10Eevans) [15:14:58] (03CR) 10CI reject: [V: 04-1] Remove l10nupdate manifests [puppet] - 10https://gerrit.wikimedia.org/r/928582 (owner: 10Majavah) [15:15:05] (03PS3) 10Arturo Borrero Gonzalez: wikimediacloud.org: refresh FQDNs for codfw1dev DNS servers [dns] - 10https://gerrit.wikimedia.org/r/928512 (https://phabricator.wikimedia.org/T307357) [15:15:17] (03CR) 10MVernon: [C: 03+1] sessionstore: remove transitional settings [puppet] - 10https://gerrit.wikimedia.org/r/928573 (https://phabricator.wikimedia.org/T337426) (owner: 10Eevans) [15:15:22] (03CR) 10Arturo Borrero Gonzalez: wikimediacloud.org: refresh FQDNs for codfw1dev DNS servers (032 comments) [dns] - 10https://gerrit.wikimedia.org/r/928512 (https://phabricator.wikimedia.org/T307357) (owner: 10Arturo Borrero Gonzalez) [15:16:09] (03PS2) 10Majavah: Remove l10nupdate manifests [puppet] - 10https://gerrit.wikimedia.org/r/928582 [15:17:52] 10SRE, 10ops-eqiad, 10DBA: db1135 has crashed - https://phabricator.wikimedia.org/T338354 (10jcrespo) >>! In T338354#8911065, @Jclark-ctr wrote: > This server is out of warranty i can pull a dimm from decom server if needed I think this is the only thing we can do, please go ahead, the server has disabled... [15:19:39] (03PS1) 10Slyngshede: LDAPBackend: Add mail and a default shell. 
[software/bitu] - 10https://gerrit.wikimedia.org/r/928586 [15:19:43] (03CR) 10Elukey: [C: 03+2] ml-services: update bloom image [deployment-charts] - 10https://gerrit.wikimedia.org/r/928578 (https://phabricator.wikimedia.org/T333861) (owner: 10Ilias Sarantopoulos) [15:20:11] 10SRE, 10ops-codfw, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q4:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767 (10Papaul) 05Open→03Resolved Complete [15:20:29] (03CR) 10Hashar: fix-staging-perms: set set-group-id on /srv/patches subdirs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/927676 (https://phabricator.wikimedia.org/T338205) (owner: 10Hashar) [15:20:39] (03PS3) 10Hashar: fix-staging-perms: set set-group-id on /srv/patches subdirs [puppet] - 10https://gerrit.wikimedia.org/r/927676 (https://phabricator.wikimedia.org/T338205) [15:20:49] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host puppetmaster1006.eqiad.wmnet with OS bullseye [15:20:56] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations: Q3:rack/setup/install puppetmaster1006 - https://phabricator.wikimedia.org/T334479 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host puppetmaster1006.eqiad.wmnet with OS bullseye [15:21:29] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations: Q3:rack/setup/install puppetmaster1006 - https://phabricator.wikimedia.org/T334479 (10Papaul) [15:21:40] (03CR) 10Ssingh: [C: 03+2] lvs2014: commission new LVS host (codfw hardware refresh) [puppet] - 10https://gerrit.wikimedia.org/r/928112 (https://phabricator.wikimedia.org/T326767) (owner: 10Ssingh) [15:22:54] (03CR) 10Ahmon Dancy: [C: 03+1] fix-staging-perms: set set-group-id on /srv/patches subdirs [puppet] - 10https://gerrit.wikimedia.org/r/927676 (https://phabricator.wikimedia.org/T338205) (owner: 10Hashar) [15:22:54] PROBLEM - Ensure legal html en.wp on en.wikipedia.org is CRITICAL: ERROR: copyright html not found 
for https://en.wikipedia.org/wiki/Main_Page (desktop site). https://wikitech.wikimedia.org/wiki/Check_legal_html [15:23:41] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host lvs2014.codfw.wmnet with OS bullseye [15:23:52] 10SRE, 10ops-codfw, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q4:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host lvs2014.codfw.wmnet with OS bullseye [15:25:05] 10SRE, 10SRE-Access-Requests: Requesting access to deployment shell group and nda LDAP for Superpes15 - https://phabricator.wikimedia.org/T338468 (10MatthewVernon) @KFrancis can you arrange for an NDA for this person, please? @ArielGlenn / @Urbanecm_WMF are you prepared to sponsor this request, please? @thci... [15:26:31] 10SRE, 10ops-eqiad, 10DBA: db1135 has crashed - https://phabricator.wikimedia.org/T338354 (10jcrespo) @Ladsgroup @Marostegui As you will be back before I am, remember to (in case you want to do it, if not you can wait for me): * Resetup data (can be done from the lastest snapshot, as documented) * Remove th... 
[15:27:09] (03PS1) 10Papaul: Add puppetmaster1006 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/928588 (https://phabricator.wikimedia.org/T334470) [15:27:33] (03CR) 10CI reject: [V: 04-1] Add puppetmaster1006 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/928588 (https://phabricator.wikimedia.org/T334470) (owner: 10Papaul) [15:29:05] 10SRE, 10SRE-Access-Requests: Requesting access to deployment shell group and nda LDAP for Superpes15 - https://phabricator.wikimedia.org/T338468 (10taavi) Is there [15:29:35] (03PS2) 10Papaul: Add puppetmaster1006 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/928588 (https://phabricator.wikimedia.org/T334470) [15:30:15] (03CR) 10Papaul: [C: 03+2] Add puppetmaster1006 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/928588 (https://phabricator.wikimedia.org/T334470) (owner: 10Papaul) [15:30:50] 10SRE, 10SRE-Access-Requests: Requesting access to deployment shell group and nda LDAP for Superpes15 - https://phabricator.wikimedia.org/T338468 (10taavi) Is there any activity [[ https://gerrit.wikimedia.org/r/q/owner:superpes15.itwiki%2540gmail.com+-project:operations/mediawiki-config | this Gerrit query ]]... [15:33:05] (03PS1) 10Arturo Borrero Gonzalez: cloud: codfw1dev: use new recursor address [puppet] - 10https://gerrit.wikimedia.org/r/928589 (https://phabricator.wikimedia.org/T338433) [15:35:02] (03PS1) 10Daniel Kinzler: Switch VisualEditor to not use RESTbase on English Wikipedia. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/928590 (https://phabricator.wikimedia.org/T320529) [15:35:45] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] Fully manage /etc/nftables/ in Puppet [puppet] - 10https://gerrit.wikimedia.org/r/928463 (owner: 10Muehlenhoff) [15:39:51] (03CR) 10Volans: sre.hosts.reboot-cluster: allow all data centers and not just core ones (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/928546 (owner: 10Ssingh) [15:40:18] PROBLEM - Ensure legal html en.m.wp on en.m.wikipedia.org is CRITICAL: ERROR: copyright html not found for https://en.m.wikipedia.org/wiki/Main_Page (mobile site). https://wikitech.wikimedia.org/wiki/Check_legal_html [15:48:10] (03CR) 10Volans: sre.hosts.reboot-cluster: simplify Icinga logic (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/928560 (owner: 10Volans) [15:49:08] jynus: Apparently we've switched to CC 4.0 ? [15:49:11] I'll update the check_legal [15:49:19] ha ha [15:49:35] legal just received an email [15:49:44] they will know [15:49:48] lol [15:50:04] should I wait for their go ahead to patch ? [15:50:18] (03PS1) 10Elukey: ml-services: add bloom-3b-gpu to the experimental namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/928593 (https://phabricator.wikimedia.org/T334583) [15:50:35] let's first run the verbose version to see which check fails [15:51:07] I did :) [15:51:47] There's quite a few more failing actually [15:52:28] (03CR) 10Ladsgroup: "Jumping from hewiki to enwiki directly? Maybe group1 first" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/928590 (https://phabricator.wikimedia.org/T320529) (owner: 10Daniel Kinzler) [15:52:39] INFO: Expected word 3.0 is missing! [15:52:46] INFO: Expected word share alike is missing! [15:53:10] (03CR) 10Ladsgroup: "aah I see, medium is already included. 
Let's go to enwiki but I remember you said there was a bug that increased the latency, is it fixed " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/928590 (https://phabricator.wikimedia.org/T320529) (owner: 10Daniel Kinzler) [15:53:27] yeah, it's now sharealike [15:55:09] (03PS2) 10Arturo Borrero Gonzalez: cloudservices: codfw1dev: refresh VIPs [puppet] - 10https://gerrit.wikimedia.org/r/928508 (https://phabricator.wikimedia.org/T307357) [15:55:28] actually, just changing the version works for me [15:55:40] 10SRE, 10ops-knams, 10DC-Ops: Main Tracking Task for ESAMS Migration to KNAMS - https://phabricator.wikimedia.org/T329219 (10wiki_willy) [15:55:54] (03PS1) 10Jforrester: abstract-wikipedia alert: Increase timeout from 10s to 180s [puppet] - 10https://gerrit.wikimedia.org/r/928594 [15:55:59] (03CR) 10D3r1ck01: Switch VisualEditor to not use RESTbase on English Wikipedia. (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/928590 (https://phabricator.wikimedia.org/T320529) (owner: 10Daniel Kinzler) [15:56:07] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host sretest1003.mgmt.eqiad.wmnet with reboot policy FORCED [15:56:12] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest1003.mgmt.eqiad.wmnet with reboot policy FORCED [15:56:14] jynus: yep [15:56:24] so we were ready for the change :-D [15:56:37] I just didn't know it was so soon [15:57:07] (03CR) 10Clément Goubert: "This change is ready for review." 
[puppet] - 10https://gerrit.wikimedia.org/r/928595 (owner: 10Clément Goubert) [15:57:20] (03CR) 10Elukey: [C: 03+2] ml-services: add bloom-3b-gpu to the experimental namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/928593 (https://phabricator.wikimedia.org/T334583) (owner: 10Elukey) [15:57:20] lets contact legal [15:57:37] after all we just run this check for them [15:57:45] Patch is ready, update whenever [15:57:48] I gotta go [15:57:55] yeah, no worries [15:57:57] I will handle that [15:58:04] or handle it to someone else [15:58:06] !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host lvs2014.codfw.wmnet with OS bullseye [15:58:20] 10SRE, 10ops-codfw, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q4:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host lvs2014.codfw.wmnet with OS bullseye executed w... [15:58:37] (03CR) 10Joal: [C: 03+1] "LGTM! Thanks Ben :)" [puppet] - 10https://gerrit.wikimedia.org/r/928558 (owner: 10Btullis) [15:58:44] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host lvs2014.codfw.wmnet with OS bullseye [15:58:56] 10SRE, 10ops-codfw, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q4:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host lvs2014.codfw.wmnet with OS bullseye [15:59:02] (03PS1) 10Papaul: Fix typo in stipe.pp for puppetmaster1006 [puppet] - 10https://gerrit.wikimedia.org/r/928596 (https://phabricator.wikimedia.org/T334479) [15:59:10] (03CR) 10Stef Dunlap: [C: 03+1] abstract-wikipedia alert: Increase timeout from 10s to 180s [puppet] - 10https://gerrit.wikimedia.org/r/928594 (owner: 10Jforrester) [15:59:35] (03CR) 10Ladsgroup: Switch VisualEditor to not use RESTbase on English Wikipedia. 
(031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/928590 (https://phabricator.wikimedia.org/T320529) (owner: 10Daniel Kinzler) [15:59:40] (03CR) 10Elukey: Fix typo in stipe.pp for puppetmaster1006 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/928596 (https://phabricator.wikimedia.org/T334479) (owner: 10Papaul) [15:59:46] (03CR) 10Papaul: [C: 03+2] Fix typo in stipe.pp for puppetmaster1006 [puppet] - 10https://gerrit.wikimedia.org/r/928596 (https://phabricator.wikimedia.org/T334479) (owner: 10Papaul) [16:00:05] jbond and rzl: #bothumor I � Unicode. All rise for Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230608T1600). [16:00:05] No Gerrit patches in the queue for this window AFAICS. [16:00:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [16:00:49] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' . [16:01:00] spike in parsoid [16:02:55] (03CR) 10Jcrespo: [C: 04-1] "This fixes wikipedia (mobile and desktop) but breaks wikibooks, still showing the 3 version." 
[puppet] - 10https://gerrit.wikimedia.org/r/928595 (owner: 10Clément Goubert) [16:03:06] (03CR) 10Jcrespo: [C: 04-1] "Contacting legal" [puppet] - 10https://gerrit.wikimedia.org/r/928595 (owner: 10Clément Goubert) [16:04:57] (03CR) 10Btullis: [C: 03+2] Bump mediawiki_history_reduced version for aqs [puppet] - 10https://gerrit.wikimedia.org/r/928558 (owner: 10Btullis) [16:05:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [16:05:50] (03CR) 10Arturo Borrero Gonzalez: "I don't understand this PCC is not modifying the anycast healthchecker configuration https://puppet-compiler.wmflabs.org/output/928508/416" [puppet] - 10https://gerrit.wikimedia.org/r/928508 (https://phabricator.wikimedia.org/T307357) (owner: 10Arturo Borrero Gonzalez) [16:06:01] !log depooling eqiad sessionstore for Cassandra upgrade — T337426 [16:06:02] (03CR) 10D3r1ck01: Switch VisualEditor to not use RESTbase on English Wikipedia. 
(031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/928590 (https://phabricator.wikimedia.org/T320529) (owner: 10Daniel Kinzler) [16:06:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:06:04] T337426: Upgrade sessionstore cluster to Cassandra 4.1.1 - https://phabricator.wikimedia.org/T337426 [16:06:09] !log eevans@cumin1001 START - Cookbook sre.discovery.service-route depool sessionstore in eqiad: maintenance [16:06:21] !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host lvs2014.codfw.wmnet with OS bullseye [16:06:30] 10SRE, 10ops-codfw, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q4:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host lvs2014.codfw.wmnet with OS bullseye executed w... [16:08:59] Amir1, re enabling cache warming for enwiki: is the sum of traffic of group1 wikis equivalent to enwiki? I'm also trying to understand these workflow myself. Because if cache warming happened for the whole of group1 nicely without failures, then maybe doing enwiki (alone) is not that bad, assuming the traffic for g1 equates or is close to enwiki alone. 
[16:10:14] enwiki is basically roughly half of all of traffic [16:10:28] so it's pretty massive in all the ways [16:11:13] !log eevans@cumin1001 END (PASS) - Cookbook sre.discovery.service-route (exit_code=0) depool sessionstore in eqiad: maintenance [16:11:59] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1106.eqiad.wmnet with reason: Maintenance [16:12:12] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1106.eqiad.wmnet with reason: Maintenance [16:17:21] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2105.codfw.wmnet with reason: Maintenance [16:17:34] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2105.codfw.wmnet with reason: Maintenance [16:19:13] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2140.codfw.wmnet with reason: Maintenance [16:19:26] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2140.codfw.wmnet with reason: Maintenance [16:20:02] (03CR) 10Elukey: [C: 03+1] Swap journal node analytics1069 with an-worker1142 [puppet] - 10https://gerrit.wikimedia.org/r/928349 (https://phabricator.wikimedia.org/T338336) (owner: 10Stevemunene) [16:20:33] 10SRE, 10Data-Engineering, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Netflow/pmacct: use forwardingStatus - https://phabricator.wikimedia.org/T331707 (10JArguello-WMF) [16:21:55] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2113.codfw.wmnet with reason: Maintenance [16:22:01] (03CR) 10Hokwelum: [C: 03+1] "looks good :-)" [puppet] - 10https://gerrit.wikimedia.org/r/927741 (https://phabricator.wikimedia.org/T325232) (owner: 10ArielGlenn) [16:22:08] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2113.codfw.wmnet with reason: Maintenance [16:22:33] !log creating pre-upgrade 
Cassandra snapshots, sessionstore/eqiad — T337426 [16:22:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:22:37] T337426: Upgrade sessionstore cluster to Cassandra 4.1.1 - https://phabricator.wikimedia.org/T337426 [16:24:19] (03CR) 10Eevans: [C: 03+2] sessionstore: upgrade sessionstore1001 to Cassandra 4.1.1 [puppet] - 10https://gerrit.wikimedia.org/r/928569 (https://phabricator.wikimedia.org/T337426) (owner: 10Eevans) [16:25:00] (03CR) 10BCornwall: [C: 03+1] sre.hosts.reboot-cluster: allow all data centers and not just core ones [cookbooks] - 10https://gerrit.wikimedia.org/r/928546 (owner: 10Ssingh) [16:26:31] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1118.eqiad.wmnet with reason: Maintenance [16:26:42] Upgrading Cassandra to 4.1.1, sessionstore1001 — T337426 [16:26:44] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1118.eqiad.wmnet with reason: Maintenance [16:26:47] !log Upgrading Cassandra to 4.1.1, sessionstore1001 — T337426 [16:26:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:26:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1118 (T336886)', diff saved to https://phabricator.wikimedia.org/P49279 and previous config saved to /var/cache/conftool/dbconfig/20230608-162650-ladsgroup.json [16:26:53] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [16:30:15] (03CR) 10ArielGlenn: [C: 03+2] for testing of dumps nfs shares, add conf files for other types of dumps [puppet] - 10https://gerrit.wikimedia.org/r/927741 (https://phabricator.wikimedia.org/T325232) (owner: 10ArielGlenn) [16:30:45] (03CR) 10Michael Große: [C: 04-1] "Based on the parent task, this should only be merged/deployed after T333655 has been done." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/928601 (https://phabricator.wikimedia.org/T332724) (owner: 10Michael Große) [16:32:08] (03CR) 10Michael Große: "This should be ready to be merged/deployed. It should be a noop. It is a prerequisite for enabling the new EntitySchema Datatype on testwi" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/928600 (https://phabricator.wikimedia.org/T332724) (owner: 10Michael Große) [16:32:25] (03CR) 10Eevans: [C: 03+2] sessionstore: upgrade sessionstore1002 to Cassandra 4.1.1 [puppet] - 10https://gerrit.wikimedia.org/r/928570 (https://phabricator.wikimedia.org/T337426) (owner: 10Eevans) [16:34:12] !log btullis@cumin1001 START - Cookbook sre.aqs.roll-restart-reboot rolling restart_daemons on A:aqs [16:35:11] !log Upgrading Cassandra to 4.1.1, sessionstore1002 — T337426 [16:35:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:35:14] T337426: Upgrade sessionstore cluster to Cassandra 4.1.1 - https://phabricator.wikimedia.org/T337426 [16:36:34] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host puppetmaster1006.eqiad.wmnet with OS bullseye [16:36:40] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations: Q3:rack/setup/install puppetmaster1006 - https://phabricator.wikimedia.org/T334479 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host puppetmaster1006.eqiad.wmnet with OS bullseye executed with errors:... 
[16:37:50] (03CR) 10Btullis: [C: 03+2] Use standard uppercase for cumin alias P selector [puppet] - 10https://gerrit.wikimedia.org/r/928111 (owner: 10Btullis) [16:38:00] (03PS1) 10BryanDavis: developer-portal: Bump container to 2023-06-06-150200-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/928604 [16:38:06] (03PS2) 10Ssingh: sre.hosts.reboot-cluster: fix-ups for Traffic/SRE usage [cookbooks] - 10https://gerrit.wikimedia.org/r/928546 [16:38:08] (03CR) 10Btullis: [C: 03+1] Swap journal node analytics1069 with an-worker1142 [puppet] - 10https://gerrit.wikimedia.org/r/928349 (https://phabricator.wikimedia.org/T338336) (owner: 10Stevemunene) [16:38:10] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host puppetmaster1006.eqiad.wmnet with OS bullseye [16:38:16] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations: Q3:rack/setup/install puppetmaster1006 - https://phabricator.wikimedia.org/T334479 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host puppetmaster1006.eqiad.wmnet with OS bullseye [16:38:57] (03CR) 10Eevans: [C: 03+2] sessionstore: upgrade sessionstore1003 to Cassandra 4.1.1 [puppet] - 10https://gerrit.wikimedia.org/r/928571 (https://phabricator.wikimedia.org/T337426) (owner: 10Eevans) [16:39:14] (03CR) 10Ssingh: sre.hosts.reboot-cluster: fix-ups for Traffic/SRE usage (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/928546 (owner: 10Ssingh) [16:39:33] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on puppetmaster1006.eqiad.wmnet with reason: host reimage [16:39:53] (03PS1) 10Ssingh: Revert "lvs2014: commission new LVS host (codfw hardware refresh)" [puppet] - 10https://gerrit.wikimedia.org/r/928606 [16:40:59] !log Upgrading Cassandra to 4.1.1, sessionstore1003 — T337426 [16:41:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:41:03] T337426: Upgrade sessionstore cluster to Cassandra 4.1.1 - 
https://phabricator.wikimedia.org/T337426 [16:41:46] (03PS3) 10Ssingh: sre.hosts.reboot-cluster: fix-ups for Traffic/SRE usage [cookbooks] - 10https://gerrit.wikimedia.org/r/928546 [16:42:13] (03CR) 10Ssingh: [C: 03+2] Revert "lvs2014: commission new LVS host (codfw hardware refresh)" [puppet] - 10https://gerrit.wikimedia.org/r/928606 (owner: 10Ssingh) [16:42:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1118 (T336886)', diff saved to https://phabricator.wikimedia.org/P49280 and previous config saved to /var/cache/conftool/dbconfig/20230608-164228-ladsgroup.json [16:42:31] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [16:42:46] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on puppetmaster1006.eqiad.wmnet with reason: host reimage [16:42:47] (03CR) 10BryanDavis: [C: 03+2] developer-portal: Bump container to 2023-06-06-150200-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/928604 (owner: 10BryanDavis) [16:43:37] (03Merged) 10jenkins-bot: developer-portal: Bump container to 2023-06-06-150200-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/928604 (owner: 10BryanDavis) [16:44:26] (03CR) 10Eevans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/928572 (https://phabricator.wikimedia.org/T337426) (owner: 10Eevans) [16:45:45] (03PS1) 10Hokwelum: Fix up more things in the README for testing new dumps nfs shares [puppet] - 10https://gerrit.wikimedia.org/r/928605 (https://phabricator.wikimedia.org/T325232) [16:46:43] !log Starting traffic test against sessionstore.svc.eqiad.wmnet — T337426 [16:46:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:46:46] T337426: Upgrade sessionstore cluster to Cassandra 4.1.1 - https://phabricator.wikimedia.org/T337426 [16:50:54] (03PS2) 10Hokwelum: Fix up more things in the README for testing new dumps nfs shares 
[puppet] - 10https://gerrit.wikimedia.org/r/928605 (https://phabricator.wikimedia.org/T325232) [16:54:06] (03PS1) 10Andrew Bogott: wmcs-image-create: add some longer naps [puppet] - 10https://gerrit.wikimedia.org/r/928627 (https://phabricator.wikimedia.org/T338320) [16:56:59] !log btullis@cumin1001 END (PASS) - Cookbook sre.aqs.roll-restart-reboot (exit_code=0) rolling restart_daemons on A:aqs [16:57:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1118', diff saved to https://phabricator.wikimedia.org/P49281 and previous config saved to /var/cache/conftool/dbconfig/20230608-165734-ladsgroup.json [16:58:29] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [16:59:35] (03CR) 10Stevemunene: [C: 03+2] Swap journal node analytics1069 with an-worker1142 [puppet] - 10https://gerrit.wikimedia.org/r/928349 (https://phabricator.wikimedia.org/T338336) (owner: 10Stevemunene) [17:00:05] bd808: #bothumor I � Unicode. All rise for Technical Engagement weekly deploy (Toolhub, Developer portal, Striker) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230608T1700). 
[17:00:05] Deploy window MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230608T1700) [17:00:51] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [17:00:52] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host puppetmaster1006.eqiad.wmnet with OS bullseye [17:01:00] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations: Q3:rack/setup/install puppetmaster1006 - https://phabricator.wikimedia.org/T334479 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host puppetmaster1006.eqiad.wmnet with OS bullseye completed: - puppetma... [17:03:26] (03CR) 10Andrew Bogott: [C: 03+2] wmcs-image-create: add some longer naps [puppet] - 10https://gerrit.wikimedia.org/r/928627 (https://phabricator.wikimedia.org/T338320) (owner: 10Andrew Bogott) [17:05:10] !log bd808@deploy1002 helmfile [staging] START helmfile.d/services/developer-portal: apply [17:10:58] !log stevemunene@cumin1001 START - Cookbook sre.hadoop.roll-restart-masters restart masters for Hadoop analytics cluster: Restart of jvm daemons. 
[17:12:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1118', diff saved to https://phabricator.wikimedia.org/P49282 and previous config saved to /var/cache/conftool/dbconfig/20230608-171240-ladsgroup.json [17:13:44] (03PS1) 10Jbond: puppetserver: Add private repo configurations [puppet] - 10https://gerrit.wikimedia.org/r/928628 (https://phabricator.wikimedia.org/T330490) [17:14:07] (03CR) 10CI reject: [V: 04-1] puppetserver: Add private repo configurations [puppet] - 10https://gerrit.wikimedia.org/r/928628 (https://phabricator.wikimedia.org/T330490) (owner: 10Jbond) [17:14:31] !log bd808@deploy1002 helmfile [staging] START helmfile.d/services/developer-portal: apply [17:16:21] 10SRE, 10SRE-Access-Requests: Requesting access to deployment shell group and nda LDAP for Superpes15 - https://phabricator.wikimedia.org/T338468 (10Superpes15) >>! In T338468#8914550, @taavi wrote: > Is there any activity [[ https://gerrit.wikimedia.org/r/q/owner:superpes15.itwiki%2540gmail.com+-project:opera... 
[17:18:01] (03CR) 10Dzahn: [C: 03+2] abstract-wikipedia alert: Increase timeout from 10s to 180s [puppet] - 10https://gerrit.wikimedia.org/r/928594 (owner: 10Jforrester) [17:20:19] 10SRE, 10SRE-Access-Requests: Requesting access to deployment shell group and nda LDAP for Superpes15 - https://phabricator.wikimedia.org/T338468 (10TheresNoTime) Either way, no harm in [[ https://wikitech.wikimedia.org/wiki/Deployments/Training#Get_training | signing up for some training ]] while you wait :-) [17:20:34] (HelmReleaseBadStatus) firing: Helm release developer-portal/main on k8s-staging@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=developer-portal - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [17:21:38] (03PS5) 10Ahmon Dancy: git::clone: Handle changes to origin URL and/or branch [puppet] - 10https://gerrit.wikimedia.org/r/927750 (https://phabricator.wikimedia.org/T290260) [17:21:52] (03CR) 10Ahmon Dancy: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/927750 (https://phabricator.wikimedia.org/T290260) (owner: 10Ahmon Dancy) [17:24:38] !log bd808@deploy1002 helmfile [staging] DONE helmfile.d/services/developer-portal: apply [17:24:57] 10SRE, 10SRE-Access-Requests: Requesting access to deployment shell group and nda LDAP for Superpes15 - https://phabricator.wikimedia.org/T338468 (10KFrancis) Hi all, I will need the volunteer's full name, mailing address, and email to process the NDA. Please send the following information to: kfrancis@wikimed... 
[17:25:34] (HelmReleaseBadStatus) resolved: Helm release developer-portal/main on k8s-staging@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=developer-portal - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [17:27:01] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations: Q3:rack/setup/install puppetmaster1006 - https://phabricator.wikimedia.org/T334479 (10Papaul) [17:27:43] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations: Q3:rack/setup/install puppetmaster1006 - https://phabricator.wikimedia.org/T334479 (10Papaul) 05Open→03Resolved @jbond this is complete [17:27:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1118 (T336886)', diff saved to https://phabricator.wikimedia.org/P49283 and previous config saved to /var/cache/conftool/dbconfig/20230608-172746-ladsgroup.json [17:27:48] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1119.eqiad.wmnet with reason: Maintenance [17:27:50] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [17:28:01] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1119.eqiad.wmnet with reason: Maintenance [17:28:35] !log bd808@deploy1002 helmfile [codfw] START helmfile.d/services/developer-portal: apply [17:30:16] !log bd808@deploy1002 helmfile [codfw] DONE helmfile.d/services/developer-portal: apply [17:30:23] !log bd808@deploy1002 helmfile [eqiad] START helmfile.d/services/developer-portal: apply [17:30:24] 10SRE, 10Data-Engineering, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Netflow/pmacct: use forwardingStatus - https://phabricator.wikimedia.org/T331707 (10JAllemandou) >> precise names of the fields in the data (we can look for this in realtime in 
the data when it starts flowing) > Sure,... [17:31:12] !log bd808@deploy1002 helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply [17:31:18] !log bd808@deploy1002 helmfile [staging] START helmfile.d/services/developer-portal: apply [17:31:30] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations: Q3:rack/setup/install pki-root1002 - https://phabricator.wikimedia.org/T334401 (10Papaul) a:05Jclark-ctr→03Papaul [17:34:16] (03PS1) 10Majavah: haproxy: Provide a custom error message for plaintext requests [puppet] - 10https://gerrit.wikimedia.org/r/928632 (https://phabricator.wikimedia.org/T338481) [17:35:29] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [17:36:06] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hadoop.roll-restart-masters (exit_code=0) restart masters for Hadoop analytics cluster: Restart of jvm daemons. [17:36:43] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:38:57] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41630/console" [puppet] - 10https://gerrit.wikimedia.org/r/928632 (https://phabricator.wikimedia.org/T338481) (owner: 10Majavah) [17:39:07] (03PS2) 10Majavah: haproxy: Provide a custom error message for plaintext requests [puppet] - 10https://gerrit.wikimedia.org/r/928632 (https://phabricator.wikimedia.org/T338481) [17:39:50] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS6939/IPv4: Idle - HE, AS6939/IPv6: Active - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:40:15] 10SRE, 10SRE-Access-Requests: Requesting access to deployment shell group and nda LDAP for Superpes15 - https://phabricator.wikimedia.org/T338468 (10Superpes15) >>! In T338468#8915005, @TheresNoTime wrote: > Either way, no harm in [[ https://wikitech.wikimedia.org/wiki/Deployments/Training#Get_training | signi... 
[17:41:07] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1128.eqiad.wmnet with reason: Maintenance [17:41:29] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41631/console" [puppet] - 10https://gerrit.wikimedia.org/r/928632 (https://phabricator.wikimedia.org/T338481) (owner: 10Majavah) [17:41:30] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1128.eqiad.wmnet with reason: Maintenance [17:41:34] !log bd808@deploy1002 helmfile [staging] DONE helmfile.d/services/developer-portal: apply [17:41:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1128 (T336886)', diff saved to https://phabricator.wikimedia.org/P49284 and previous config saved to /var/cache/conftool/dbconfig/20230608-174135-ladsgroup.json [17:41:40] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [17:42:16] (03CR) 10Eevans: [C: 03+2] sessionstore: move per-host settings back to role [puppet] - 10https://gerrit.wikimedia.org/r/928572 (https://phabricator.wikimedia.org/T337426) (owner: 10Eevans) [17:42:34] (03PS6) 10Ahmon Dancy: git::clone: Handle changes to origin URL and/or branch [puppet] - 10https://gerrit.wikimedia.org/r/927750 (https://phabricator.wikimedia.org/T290260) [17:46:16] (03CR) 10Eevans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/928573 (https://phabricator.wikimedia.org/T337426) (owner: 10Eevans) [17:46:27] (03PS7) 10Ahmon Dancy: git::clone: Handle changes to origin URL and/or branch [puppet] - 10https://gerrit.wikimedia.org/r/927750 (https://phabricator.wikimedia.org/T290260) [17:48:32] (03PS2) 10Eevans: sessionstore: remove transitional settings [puppet] - 10https://gerrit.wikimedia.org/r/928573 (https://phabricator.wikimedia.org/T337426) [17:48:57] (03CR) 10Eevans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/928573 
(https://phabricator.wikimedia.org/T337426) (owner: 10Eevans) [17:53:15] (03PS8) 10Ahmon Dancy: git::clone: Handle changes to origin URL and/or branch [puppet] - 10https://gerrit.wikimedia.org/r/927750 (https://phabricator.wikimedia.org/T290260) [17:53:21] (03PS3) 10Eevans: sessionstore: remove transitional settings [puppet] - 10https://gerrit.wikimedia.org/r/928573 (https://phabricator.wikimedia.org/T337426) [17:54:14] (03CR) 10Eevans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/928573 (https://phabricator.wikimedia.org/T337426) (owner: 10Eevans) [17:57:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1128 (T336886)', diff saved to https://phabricator.wikimedia.org/P49285 and previous config saved to /var/cache/conftool/dbconfig/20230608-175732-ladsgroup.json [17:57:37] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [17:58:36] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations: Q3:rack/setup/install pki-root1002 - https://phabricator.wikimedia.org/T334401 (10Papaul) @Jclark-ctr can you please update Netbox with the racking information for this server . thanks [17:58:37] (03CR) 10Ahmon Dancy: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/927750 (https://phabricator.wikimedia.org/T290260) (owner: 10Ahmon Dancy) [17:59:30] (Device rebooted) firing: Alert for device ps1-d2-eqiad.mgmt.eqiad.wmnet - Device rebooted - https://alerts.wikimedia.org/?q=alertname%3DDevice+rebooted [18:00:05] jeena and dduvall: How many deployers does it take to do MediaWiki train - Utc-7 Version deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230608T1800). 
[18:01:49] (03PS1) 10TrainBranchBot: group2 wikis to 1.41.0-wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/928634 (https://phabricator.wikimedia.org/T337526) [18:01:51] (03CR) 10TrainBranchBot: [C: 03+2] group2 wikis to 1.41.0-wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/928634 (https://phabricator.wikimedia.org/T337526) (owner: 10TrainBranchBot) [18:02:37] (03Merged) 10jenkins-bot: group2 wikis to 1.41.0-wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/928634 (https://phabricator.wikimedia.org/T337526) (owner: 10TrainBranchBot) [18:04:30] (Device rebooted) resolved: Device ps1-d2-eqiad.mgmt.eqiad.wmnet recovered from Device rebooted - https://alerts.wikimedia.org/?q=alertname%3DDevice+rebooted [18:05:41] 10SRE, 10SRE-Access-Requests: Requesting access to deployment shell group and nda LDAP for Superpes15 - https://phabricator.wikimedia.org/T338468 (10Urbanecm_WMF) >>! In T338468#8915158, @Superpes15 wrote: >>>! In T338468#8915005, @TheresNoTime wrote: >> Either way, no harm in [[ https://wikitech.wikimedia.org... 
[18:07:54] (03CR) 10Eevans: [C: 03+2] sessionstore: remove transitional settings [puppet] - 10https://gerrit.wikimedia.org/r/928573 (https://phabricator.wikimedia.org/T337426) (owner: 10Eevans) [18:09:52] !log jhuneidi@deploy1002 rebuilt and synchronized wikiversions files: group2 wikis to 1.41.0-wmf.12 refs T337526 [18:09:56] T337526: 1.41.0-wmf.12 deployment blockers - https://phabricator.wikimedia.org/T337526 [18:11:11] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence, and 2 others: Q4:rack/setup/install backup1010, backup1011 - https://phabricator.wikimedia.org/T326684 (10wiki_willy) p:05Medium→03High [18:12:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1128', diff saved to https://phabricator.wikimedia.org/P49286 and previous config saved to /var/cache/conftool/dbconfig/20230608-181238-ladsgroup.json [18:12:50] PROBLEM - cassandra-a SSL 10.64.0.144:7001 on sessionstore1001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [18:14:19] (03PS1) 10Dwisehaupt: Add cname for lp.email.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/928637 (https://phabricator.wikimedia.org/T336000) [18:15:14] PROBLEM - cassandra-a SSL 10.64.32.85:7001 on sessionstore1002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [18:15:18] oh, there is something I missed! 
[18:15:30] (03PS1) 10BCornwall: Create a CDN host reboot cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/928638 (https://phabricator.wikimedia.org/T335835) [18:16:00] PROBLEM - cassandra-a SSL 10.64.48.178:7001 on sessionstore1003 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [18:16:32] ACKNOWLEDGEMENT - cassandra-a SSL 10.64.0.144:7001 on sessionstore1001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused eevans This port no longer used on Cassandra 4.1.1 https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [18:16:32] ACKNOWLEDGEMENT - cassandra-a SSL 10.64.32.85:7001 on sessionstore1002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused eevans This port no longer used on Cassandra 4.1.1 https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [18:17:24] ACKNOWLEDGEMENT - cassandra-a SSL 10.64.48.178:7001 on sessionstore1003 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused eevans This port no longer used on Cassandra 4.1.1 https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [18:18:13] (03PS2) 10BCornwall: Create a CDN host reboot cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/928638 (https://phabricator.wikimedia.org/T335835) [18:18:48] !log (Re)pooling sessionstore/eqiad — T337426 [18:18:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:18:53] T337426: Upgrade sessionstore cluster to Cassandra 4.1.1 - https://phabricator.wikimedia.org/T337426 [18:19:02] (03PS3) 10BCornwall: Create a CDN host reboot cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/928638 (https://phabricator.wikimedia.org/T335835) [18:19:12] !log eevans@cumin1001 START - Cookbook sre.discovery.service-route pool sessionstore in 
eqiad: maintenance [18:24:16] !log eevans@cumin1001 END (PASS) - Cookbook sre.discovery.service-route (exit_code=0) pool sessionstore in eqiad: maintenance [18:24:23] (03PS4) 10BCornwall: Create a CDN host reboot cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/928638 (https://phabricator.wikimedia.org/T335835) [18:27:30] (03CR) 10Jgreen: [C: 03+2] Add cname for lp.email.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/928637 (https://phabricator.wikimedia.org/T336000) (owner: 10Dwisehaupt) [18:27:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1128', diff saved to https://phabricator.wikimedia.org/P49287 and previous config saved to /var/cache/conftool/dbconfig/20230608-182745-ladsgroup.json [18:36:16] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host backup1010.mgmt.eqiad.wmnet with reboot policy FORCED [18:36:17] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host backup1011.mgmt.eqiad.wmnet with reboot policy FORCED [18:41:12] (03PS5) 10BCornwall: Create a CDN host reboot cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/928638 (https://phabricator.wikimedia.org/T335835) [18:42:51] (03PS1) 10Slyngshede: C:IDM Add ldap group settings. 
[puppet] - 10https://gerrit.wikimedia.org/r/928641 [18:42:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1128 (T336886)', diff saved to https://phabricator.wikimedia.org/P49288 and previous config saved to /var/cache/conftool/dbconfig/20230608-184251-ladsgroup.json [18:42:53] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1132.eqiad.wmnet with reason: Maintenance [18:42:55] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [18:43:06] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1132.eqiad.wmnet with reason: Maintenance [18:43:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1132 (T336886)', diff saved to https://phabricator.wikimedia.org/P49289 and previous config saved to /var/cache/conftool/dbconfig/20230608-184312-ladsgroup.json [18:43:15] (03CR) 10CI reject: [V: 04-1] Create a CDN host reboot cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/928638 (https://phabricator.wikimedia.org/T335835) (owner: 10BCornwall) [18:43:23] (03PS2) 10Slyngshede: C:IDM Add ldap group settings. [puppet] - 10https://gerrit.wikimedia.org/r/928641 [18:45:43] (03PS2) 10JHathaway: DO NOT MERGE: Ensure profile::apt is applied first [puppet] - 10https://gerrit.wikimedia.org/r/927788 (https://phabricator.wikimedia.org/T338279) [18:45:57] (03PS3) 10Slyngshede: C:IDM Add ldap group settings. 
[puppet] - 10https://gerrit.wikimedia.org/r/928641 [18:47:04] (03PS1) 10JHathaway: initramfs::script: ensure initramfs-tools is installed [puppet] - 10https://gerrit.wikimedia.org/r/928642 [18:47:34] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41635/console" [puppet] - 10https://gerrit.wikimedia.org/r/928641 (owner: 10Slyngshede) [18:49:08] (03PS1) 10JHathaway: tshark: use a preseed file, rather than debconf::seen [puppet] - 10https://gerrit.wikimedia.org/r/928644 [18:50:00] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host backup1010.mgmt.eqiad.wmnet with reboot policy FORCED [19:00:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1132 (T336886)', diff saved to https://phabricator.wikimedia.org/P49290 and previous config saved to /var/cache/conftool/dbconfig/20230608-190016-ladsgroup.json [19:00:22] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [19:06:30] (03PS1) 10JHathaway: wmflib::dir::mkdir_p: exclude FHS dirs [puppet] - 10https://gerrit.wikimedia.org/r/928645 [19:08:43] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host backup1011.mgmt.eqiad.wmnet with reboot policy FORCED [19:10:03] (03PS1) 10Ladsgroup: Externallinks: Make port part of the index [core] (wmf/1.41.0-wmf.12) - 10https://gerrit.wikimedia.org/r/928608 (https://phabricator.wikimedia.org/T337149) [19:10:38] jouncebot: nowandnext [19:10:38] For the next 0 hour(s) and 49 minute(s): MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230608T1800) [19:10:38] In 0 hour(s) and 49 minute(s): UTC late backport and config training (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230608T2000) [19:13:14] (03CR) 10JHathaway: "kindly review" [puppet] - 10https://gerrit.wikimedia.org/r/927796 
(https://phabricator.wikimedia.org/T330490) (owner: 10JHathaway) [19:15:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1132', diff saved to https://phabricator.wikimedia.org/P49291 and previous config saved to /var/cache/conftool/dbconfig/20230608-191522-ladsgroup.json [19:18:11] (03PS1) 10Andrew Bogott: Stop using local storage on cloudvirt1028 [puppet] - 10https://gerrit.wikimedia.org/r/928646 [19:22:40] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host backup1010.mgmt.eqiad.wmnet with reboot policy FORCED [19:30:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1132', diff saved to https://phabricator.wikimedia.org/P49292 and previous config saved to /var/cache/conftool/dbconfig/20230608-193028-ladsgroup.json [19:31:59] (03CR) 10Andrew Bogott: [C: 03+2] Stop using local storage on cloudvirt1028 [puppet] - 10https://gerrit.wikimedia.org/r/928646 (owner: 10Andrew Bogott) [19:40:58] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1028.eqiad.wmnet with OS bullseye [19:43:25] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence, and 2 others: Q4:rack/setup/install backup1010, backup1011 - https://phabricator.wikimedia.org/T326684 (10Jclark-ctr) [19:45:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1132 (T336886)', diff saved to https://phabricator.wikimedia.org/P49293 and previous config saved to /var/cache/conftool/dbconfig/20230608-194534-ladsgroup.json [19:45:36] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1134.eqiad.wmnet with reason: Maintenance [19:45:38] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [19:45:49] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1134.eqiad.wmnet with reason: Maintenance [19:45:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling 
db1134 (T336886)', diff saved to https://phabricator.wikimedia.org/P49294 and previous config saved to /var/cache/conftool/dbconfig/20230608-194555-ladsgroup.json [19:46:39] (03PS1) 10JHathaway: dev env: nrpe listen on all interfaces in a container [puppet] - 10https://gerrit.wikimedia.org/r/928651 (https://phabricator.wikimedia.org/T337972) [19:48:22] (03CR) 10CI reject: [V: 04-1] dev env: nrpe listen on all interfaces in a container [puppet] - 10https://gerrit.wikimedia.org/r/928651 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway) [19:54:15] 10ops-eqiad: ManagementSSHDown - https://phabricator.wikimedia.org/T338499 (10phaultfinder) [19:54:30] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1028.eqiad.wmnet with reason: host reimage [19:55:09] jouncebot: nowandnext [19:55:09] For the next 0 hour(s) and 4 minute(s): MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230608T1800) [19:55:09] In 0 hour(s) and 4 minute(s): UTC late backport and config training (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230608T2000) [19:56:59] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1028.eqiad.wmnet with reason: host reimage [20:00:05] brennen and TheresNoTime: #bothumor I � Unicode. All rise for UTC late backport and config training deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230608T2000). 
[20:01:03] Nothing in the window it seems :) [20:01:38] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host backup1010.mgmt.eqiad.wmnet with reboot policy FORCED [20:01:56] (03PS2) 10Jdlrobson: Remove VectorLimitedWidthIndicator [mediawiki-config] - 10https://gerrit.wikimedia.org/r/926626 (https://phabricator.wikimedia.org/T336197) (owner: 10Kimberly Sarabia) [20:02:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134 (T336886)', diff saved to https://phabricator.wikimedia.org/P49295 and previous config saved to /var/cache/conftool/dbconfig/20230608-200204-ladsgroup.json [20:02:08] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [20:02:42] TheresNoTime: I'm going to do one [20:02:42] mwhaha [20:02:57] (03CR) 10Ladsgroup: [C: 03+2] Externallinks: Make port part of the index [core] (wmf/1.41.0-wmf.12) - 10https://gerrit.wikimedia.org/r/928608 (https://phabricator.wikimedia.org/T337149) (owner: 10Ladsgroup) [20:02:58] I was just going to say enjoy the peace [20:03:05] That can't happen now [20:03:05] :p [20:03:15] Amir1: are you backporting? [20:03:27] https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/926626 < wondered if I could squeeze in some clean up (this configuration is dead code) [20:03:34] oh sure thing [20:03:44] thanks :) [20:03:50] want me to put it on wikitech:deployments ? 
[20:04:08] (03CR) 10Ladsgroup: [C: 03+2] Remove VectorLimitedWidthIndicator [mediawiki-config] - 10https://gerrit.wikimedia.org/r/926626 (https://phabricator.wikimedia.org/T336197) (owner: 10Kimberly Sarabia) [20:04:13] nah [20:04:39] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by ladsgroup@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/926626 (https://phabricator.wikimedia.org/T336197) (owner: 10Kimberly Sarabia) [20:05:05] (03Merged) 10jenkins-bot: Remove VectorLimitedWidthIndicator [mediawiki-config] - 10https://gerrit.wikimedia.org/r/926626 (https://phabricator.wikimedia.org/T336197) (owner: 10Kimberly Sarabia) [20:05:23] !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:926626|Remove VectorLimitedWidthIndicator (T336197)]] [20:05:26] T336197: Remove popup indicator on page load and associated configuration - https://phabricator.wikimedia.org/T336197 [20:06:49] !log ladsgroup@deploy1002 ladsgroup and ksarabia: Backport for [[gerrit:926626|Remove VectorLimitedWidthIndicator (T336197)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet [20:12:20] 10SRE, 10SRE-Access-Requests: Requesting access to deployment shell group and nda LDAP for Superpes15 - https://phabricator.wikimedia.org/T338468 (10KFrancis) Thank you! The agreement has been sent for signatures. I'll confirm when it's complete. [20:12:55] !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:926626|Remove VectorLimitedWidthIndicator (T336197)]] (duration: 07m 32s) [20:12:59] T336197: Remove popup indicator on page load and associated configuration - https://phabricator.wikimedia.org/T336197 [20:13:09] deployed [20:13:13] Jdlrobson: ^ [20:15:23] (03PS1) 10David Martin: Add wikifunctions.ui stream to metawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/928654 (https://phabricator.wikimedia.org/T336722) [20:15:31] thanks Amir1 ! 
[20:15:37] (03CR) 10BryanDavis: [C: 03+2] python: Replace --mount with --wsgi-file in webservice-runner [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/925099 (https://phabricator.wikimedia.org/T337897) (owner: 10BryanDavis) [20:15:41] ^_^ [20:16:20] (03Merged) 10jenkins-bot: python: Replace --mount with --wsgi-file in webservice-runner [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/925099 (https://phabricator.wikimedia.org/T337897) (owner: 10BryanDavis) [20:17:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134', diff saved to https://phabricator.wikimedia.org/P49296 and previous config saved to /var/cache/conftool/dbconfig/20230608-201710-ladsgroup.json [20:17:59] (03PS2) 10JHathaway: dev env: nrpe listen on all interfaces in a container [puppet] - 10https://gerrit.wikimedia.org/r/928651 (https://phabricator.wikimedia.org/T337972) [20:18:20] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by ladsgroup@deploy1002 using scap backport" [core] (wmf/1.41.0-wmf.12) - 10https://gerrit.wikimedia.org/r/928608 (https://phabricator.wikimedia.org/T337149) (owner: 10Ladsgroup) [20:19:41] (03PS2) 10Jforrester: [BETA CLUSTER] Add wikifunctions.ui stream to metawiki wgEventStreams [mediawiki-config] - 10https://gerrit.wikimedia.org/r/928654 (https://phabricator.wikimedia.org/T336722) (owner: 10David Martin) [20:20:34] (03Merged) 10jenkins-bot: Externallinks: Make port part of the index [core] (wmf/1.41.0-wmf.12) - 10https://gerrit.wikimedia.org/r/928608 (https://phabricator.wikimedia.org/T337149) (owner: 10Ladsgroup) [20:20:51] !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:928608|Externallinks: Make port part of the index (T337149)]] [20:20:54] T337149: CAPTCHA required to edit any page on testwiki containing a link with no path - https://phabricator.wikimedia.org/T337149 [20:21:41] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1028.eqiad.wmnet 
with OS bullseye [20:22:27] !log ladsgroup@deploy1002 ladsgroup: Backport for [[gerrit:928608|Externallinks: Make port part of the index (T337149)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet [20:22:27] Amir1: Can you shout when you're done (or just push out https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/928654 Beta-Cluster-only one)? [20:23:15] sure thing, I'll take care of it [20:23:24] PROBLEM - aqs endpoints health on aqs2001 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) is CRITICAL: Test Get daily edits for english wikipedia page 0 returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [20:23:34] Amir1: <3 [20:25:19] (03PS1) 10JHathaway: dev env: avoid kernel tweaks when in a container [puppet] - 10https://gerrit.wikimedia.org/r/928657 (https://phabricator.wikimedia.org/T337972) [20:26:02] (03PS1) 10Andrea Denisse: Add Debian packaging for 21.3.0 [software/librenms] - 10https://gerrit.wikimedia.org/r/928658 (https://phabricator.wikimedia.org/T278309) [20:26:05] (03PS1) 10Andrea Denisse: Add missing build dependencies for the Debian package [software/librenms] - 10https://gerrit.wikimedia.org/r/928659 (https://phabricator.wikimedia.org/T278309) [20:27:09] (03PS2) 10Jforrester: Replace underscores with spaces in 4 Arabic sitenames [mediawiki-config] - 10https://gerrit.wikimedia.org/r/927713 (https://phabricator.wikimedia.org/T337725) (owner: 10Hubaishan) [20:31:01] !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:928608|Externallinks: Make port part of the index (T337149)]] (duration: 10m 10s) [20:31:08] T337149: CAPTCHA required to edit any page on testwiki containing a link with no path - https://phabricator.wikimedia.org/T337149 [20:32:17] !log ladsgroup@cumin1001 dbctl 
commit (dc=all): 'Repooling after maintenance db1134', diff saved to https://phabricator.wikimedia.org/P49297 and previous config saved to /var/cache/conftool/dbconfig/20230608-203216-ladsgroup.json [20:33:13] (03PS1) 10JHathaway: dev env: don't manage resolv.conf [puppet] - 10https://gerrit.wikimedia.org/r/928661 (https://phabricator.wikimedia.org/T337972) [20:35:52] (03CR) 10Ladsgroup: [C: 03+2] [BETA CLUSTER] Add wikifunctions.ui stream to metawiki wgEventStreams [mediawiki-config] - 10https://gerrit.wikimedia.org/r/928654 (https://phabricator.wikimedia.org/T336722) (owner: 10David Martin) [20:36:40] (03Merged) 10jenkins-bot: [BETA CLUSTER] Add wikifunctions.ui stream to metawiki wgEventStreams [mediawiki-config] - 10https://gerrit.wikimedia.org/r/928654 (https://phabricator.wikimedia.org/T336722) (owner: 10David Martin) [20:38:23] 10SRE, 10SRE-Access-Requests: Requesting access to deployment shell group and nda LDAP for Superpes15 - https://phabricator.wikimedia.org/T338468 (10KFrancis) The NDA is complete. Please proceed with the access request. [20:38:32] (03PS1) 10JHathaway: dev env: don't pull firewall rules from etcd [puppet] - 10https://gerrit.wikimedia.org/r/928662 (https://phabricator.wikimedia.org/T337972) [20:40:03] (03PS1) 10JHathaway: dev env: allow setting $site via an env var [puppet] - 10https://gerrit.wikimedia.org/r/928663 (https://phabricator.wikimedia.org/T337972) [20:41:06] Amir1, thanks for that piece of info, I had no idea. Wow! [20:45:31] Hey, was about to send someone the static reporting connectivity issue page but uh... 
https://wikitech-static.wikimedia.org/wiki/Reporting_a_connectivity_issue [20:45:46] "[68729a08240c466183d22539] /wiki/Reporting_a_connectivity_issue Shellbox\ShellboxError: Error creating directory shellbox-730c2b0c1e3faab6" [20:46:04] (03PS1) 10JHathaway: dev env: have ssh server use the dev environment ssh configs [puppet] - 10https://gerrit.wikimedia.org/r/928664 (https://phabricator.wikimedia.org/T337972) [20:47:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134 (T336886)', diff saved to https://phabricator.wikimedia.org/P49298 and previous config saved to /var/cache/conftool/dbconfig/20230608-204722-ladsgroup.json [20:47:24] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1139.eqiad.wmnet with reason: Maintenance [20:47:27] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [20:47:37] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1139.eqiad.wmnet with reason: Maintenance [20:48:02] From the backtrace it looks like some issue running syntax highlighting in shellbox [20:48:48] (03PS1) 10JHathaway: dev env: disable the puppet agent [puppet] - 10https://gerrit.wikimedia.org/r/928665 (https://phabricator.wikimedia.org/T337972) [20:51:25] (03PS1) 10JHathaway: dev env: add a basic puppet enc [puppet] - 10https://gerrit.wikimedia.org/r/928667 (https://phabricator.wikimedia.org/T337972) [20:53:11] (03PS1) 10JHathaway: dev env: get_config support for dev [puppet] - 10https://gerrit.wikimedia.org/r/928669 (https://phabricator.wikimedia.org/T337972) [20:53:50] rzl, cwhite, are either of you free to take a look at the above? 
Want to make sure I can send this email for someone's unrelated connectivity issue after all :P [20:53:50] (03CR) 10CI reject: [V: 04-1] dev env: add a basic puppet enc [puppet] - 10https://gerrit.wikimedia.org/r/928667 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway) [20:54:13] (03PS1) 10JHathaway: dev env: add an insetup role for container builds [puppet] - 10https://gerrit.wikimedia.org/r/928670 (https://phabricator.wikimedia.org/T337972) [20:55:51] (03PS1) 10JHathaway: dev env: Add role::puppetserver::dev [puppet] - 10https://gerrit.wikimedia.org/r/928671 (https://phabricator.wikimedia.org/T337972) [20:56:13] * taavi wonders if he has the wikitech-static ssh credentials [20:56:30] hackerman [20:57:28] nope. I remember I considered doing something at some point that would have required those (a php or os upgrade iirc), but that apparently never happened [20:58:16] (03PS1) 10JHathaway: dev env: hiera data [puppet] - 10https://gerrit.wikimedia.org/r/928672 (https://phabricator.wikimedia.org/T337972) [21:00:39] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1140.eqiad.wmnet with reason: Maintenance [21:00:51] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1140.eqiad.wmnet with reason: Maintenance [21:01:18] perryprog: whoa, thanks [21:01:48] Yeah surprised no alarms went off [21:02:00] perryprog: fwiw you can still access the nonstatic page at https://wikitech.wikimedia.org/wiki/Reporting_a_connectivity_issue, even if the affected user can't -- I'll follow up about wikitech-static, appreciate the report [21:02:03] * perryprog nods [21:02:47] (03PS2) 10JHathaway: dev env: add a basic puppet enc [puppet] - 10https://gerrit.wikimedia.org/r/928667 (https://phabricator.wikimedia.org/T337972) [21:03:25] (03PS2) 10JHathaway: dev env: hiera data [puppet] - 10https://gerrit.wikimedia.org/r/928672 (https://phabricator.wikimedia.org/T337972) [21:04:34] (03CR) 10JHathaway: "check 
experimental" [puppet] - 10https://gerrit.wikimedia.org/r/928642 (owner: 10JHathaway) [21:04:52] (03CR) 10JHathaway: "kindly review" [puppet] - 10https://gerrit.wikimedia.org/r/928642 (owner: 10JHathaway) [21:04:57] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/928644 (owner: 10JHathaway) [21:05:11] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/928645 (owner: 10JHathaway) [21:05:21] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/928651 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway) [21:05:31] (03CR) 10JHathaway: "kindly review" [puppet] - 10https://gerrit.wikimedia.org/r/928651 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway) [21:05:36] yeah, any wikitech-static page with tags is broken [21:05:40] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/928657 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway) [21:05:46] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/928661 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway) [21:05:49] Is wikitech-static not actually static! :O [21:05:57] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['backup1010.eqiad.wmnet'] [21:05:58] it's... 
static-ish :) [21:05:58] (03CR) 10JHathaway: "kindly review" [puppet] - 10https://gerrit.wikimedia.org/r/928661 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway) [21:06:05] (03CR) 10JHathaway: "kindly review" [puppet] - 10https://gerrit.wikimedia.org/r/928657 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway) [21:06:18] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/928663 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway) [21:06:28] (03CR) 10JHathaway: "kindly review" [puppet] - 10https://gerrit.wikimedia.org/r/928663 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway) [21:06:31] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/928662 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway) [21:06:45] (03CR) 10JHathaway: "kindly review" [puppet] - 10https://gerrit.wikimedia.org/r/928662 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway) [21:06:47] Reporting_a_connectivity_issue is the only page that's *really* tragic to not have available on -static, so if we can't get this resolved quickly I'll just temporarily remove syntax highlighting from that page [21:06:50] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/928664 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway) [21:06:54] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['backup1011.eqiad.wmnet'] [21:06:54] but let's see if it's an easy fix, first [21:06:59] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['backup1010.eqiad.wmnet'] [21:07:04] (03CR) 10JHathaway: "kindly review" [puppet] - 10https://gerrit.wikimedia.org/r/928664 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway) [21:07:09] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for 
hosts ['backup1010.eqiad.wmnet'] [21:07:11] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/928665 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway) [21:07:17] (03CR) 10JHathaway: "kindly review" [puppet] - 10https://gerrit.wikimedia.org/r/928665 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway) [21:07:22] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/928669 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway) [21:07:32] (03CR) 10JHathaway: "kindly review" [puppet] - 10https://gerrit.wikimedia.org/r/928669 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway) [21:07:40] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['backup1010.eqiad.wmnet'] [21:07:43] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['backup1011.eqiad.wmnet'] [21:07:45] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/928670 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway) [21:07:46] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['backup1011.eqiad.wmnet'] [21:07:51] (03CR) 10JHathaway: "kindly review" [puppet] - 10https://gerrit.wikimedia.org/r/928670 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway) [21:07:57] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/928671 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway) [21:08:03] (03CR) 10JHathaway: "kindly review" [puppet] - 10https://gerrit.wikimedia.org/r/928671 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway) [21:08:09] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/928667 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway) [21:08:14] (03CR) 10JHathaway: 
"kindly review" [puppet] - 10https://gerrit.wikimedia.org/r/928667 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway) [21:08:17] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['backup1011.eqiad.wmnet'] [21:08:18] perryprog: the mediawiki files on disk are more static than what the production cluster files are! [21:08:19] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/928672 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway) [21:08:25] (03CR) 10JHathaway: "kindly review" [puppet] - 10https://gerrit.wikimedia.org/r/928672 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway) [21:08:44] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence, and 2 others: Q4:rack/setup/install backup1010, backup1011 - https://phabricator.wikimedia.org/T326684 (10Jclark-ctr) [21:10:40] I would call it "hilariously ironic", but in any case it's greatly appreciated. For the person in question I more or less just repeated the non-technical bits of that page (e.g., redirects to #-tech or noc@) in case none of the other links work for them. (They're having what sounds like a routing or DNS issue, maybe? They're mentioning their devices are giving timeouts but exclusively on home wifi and not for any other site.) [21:11:15] Fingers crossed that they are technical because if it's still persisting it doesn't sound like an easy-to-diagnose issue. 
🤔 [21:14:01] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1169.eqiad.wmnet with reason: Maintenance [21:14:14] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1169.eqiad.wmnet with reason: Maintenance [21:14:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1169 (T336886)', diff saved to https://phabricator.wikimedia.org/P49300 and previous config saved to /var/cache/conftool/dbconfig/20230608-211419-ladsgroup.json [21:14:23] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [21:23:46] (03PS9) 10Ahmon Dancy: git::clone: Handle changes to origin URL and/or branch [puppet] - 10https://gerrit.wikimedia.org/r/927750 (https://phabricator.wikimedia.org/T290260) [21:25:14] https://phabricator.wikimedia.org/T338520 for the above [21:26:04] Thanks! [21:29:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T336886)', diff saved to https://phabricator.wikimedia.org/P49301 and previous config saved to /var/cache/conftool/dbconfig/20230608-212957-ladsgroup.json [21:30:01] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [21:30:59] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host sretest1003.mgmt.eqiad.wmnet with reboot policy FORCED [21:31:04] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest1003.mgmt.eqiad.wmnet with reboot policy FORCED [21:32:58] (03CR) 10Ahmon Dancy: git::clone: Handle changes to origin URL and/or branch (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/927750 (https://phabricator.wikimedia.org/T290260) (owner: 10Ahmon Dancy) [21:45:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P49302 and previous config saved to 
/var/cache/conftool/dbconfig/20230608-214503-ladsgroup.json [22:00:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P49303 and previous config saved to /var/cache/conftool/dbconfig/20230608-220009-ladsgroup.json [22:15:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T336886)', diff saved to https://phabricator.wikimedia.org/P49304 and previous config saved to /var/cache/conftool/dbconfig/20230608-221515-ladsgroup.json [22:15:17] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1184.eqiad.wmnet with reason: Maintenance [22:15:19] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [22:15:30] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1184.eqiad.wmnet with reason: Maintenance [22:15:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1184 (T336886)', diff saved to https://phabricator.wikimedia.org/P49305 and previous config saved to /var/cache/conftool/dbconfig/20230608-221536-ladsgroup.json [22:29:48] (03CR) 10Dzahn: [C: 03+2] site: remove gerrit1001 from gerrit role, rm hiera host data [puppet] - 10https://gerrit.wikimedia.org/r/919407 (https://phabricator.wikimedia.org/T336427) (owner: 10Dzahn) [22:30:56] (03PS6) 10BCornwall: Create a CDN host reboot cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/928638 (https://phabricator.wikimedia.org/T335835) [22:31:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184 (T336886)', diff saved to https://phabricator.wikimedia.org/P49306 and previous config saved to /var/cache/conftool/dbconfig/20230608-223111-ladsgroup.json [22:31:19] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [22:33:09] (03CR) 10CI reject: [V: 04-1] Create a CDN host 
reboot cookbook [cookbooks] - https://gerrit.wikimedia.org/r/928638 (https://phabricator.wikimedia.org/T335835) (owner: BCornwall) [22:33:22] (PS1) Dzahn: site: fix typo in gerrit1001 role assignment [puppet] - https://gerrit.wikimedia.org/r/928676 (https://phabricator.wikimedia.org/T336427) [22:33:48] (CR) Dzahn: [C: +2] site: fix typo in gerrit1001 role assignment [puppet] - https://gerrit.wikimedia.org/r/928676 (https://phabricator.wikimedia.org/T336427) (owner: Dzahn) [22:33:59] (PS7) BCornwall: Create a CDN host reboot cookbook [cookbooks] - https://gerrit.wikimedia.org/r/928638 (https://phabricator.wikimedia.org/T335835) [22:35:25] !log removing gerrit role from former gerrit prod machine gerrit1001, removes firewall rules, shell access, monitoring..etc [22:35:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:37:49] !log gerrit1001 - rmdir /etc/ssh/userkeys/gerrit.d which leads to puppet warnings because it cant remove empty dir [22:37:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:39:38] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 30 days, 0:00:00 on gerrit1001.wikimedia.org with reason: decom [22:39:40] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 30 days, 0:00:00 on gerrit1001.wikimedia.org with reason: decom [22:43:31] (PS2) Dzahn: admin: remove contint-roots from releases hosts [puppet] - https://gerrit.wikimedia.org/r/928108 [22:44:53] (CR) Dzahn: "either it's used or it's not used.
having it setup but also hear that it's not used just doesn't go together well" [puppet] - https://gerrit.wikimedia.org/r/928108 (owner: Dzahn) [22:46:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184', diff saved to https://phabricator.wikimedia.org/P49307 and previous config saved to /var/cache/conftool/dbconfig/20230608-224617-ladsgroup.json [23:01:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184', diff saved to https://phabricator.wikimedia.org/P49308 and previous config saved to /var/cache/conftool/dbconfig/20230608-230123-ladsgroup.json [23:03:12] (CR) Dzahn: [C: -1] ""Phabricator::Logmail[yearly_metrics]: has no parameter named 'month' "" [puppet] - https://gerrit.wikimedia.org/r/923367 (https://phabricator.wikimedia.org/T337388) (owner: Aklapper) [23:11:20] (PS1) Dzahn: phabricator: add support for month parameter in logmail class [puppet] - https://gerrit.wikimedia.org/r/928680 (https://phabricator.wikimedia.org/T337388) [23:11:36] (CR) Dzahn: [C: -1] "it will first need this https://gerrit.wikimedia.org/r/928680" [puppet] - https://gerrit.wikimedia.org/r/923367 (https://phabricator.wikimedia.org/T337388) (owner: Aklapper) [23:11:43] (CR) CI reject: [V: -1] phabricator: add support for month parameter in logmail class [puppet] - https://gerrit.wikimedia.org/r/928680 (https://phabricator.wikimedia.org/T337388) (owner: Dzahn) [23:12:31] (PS2) Dzahn: phabricator: add support for month parameter in logmail class [puppet] - https://gerrit.wikimedia.org/r/928680 (https://phabricator.wikimedia.org/T337388) [23:13:40] (CR) Dzahn: [C: -1] "same here, we don't have the "month" parameter yet, but I uploaded a patch for that.
as is should not be merged though" [puppet] - https://gerrit.wikimedia.org/r/922836 (https://phabricator.wikimedia.org/T337387) (owner: Aklapper) [23:14:46] (CR) Dzahn: "Arnold, would like to chat with you about this some time." [puppet] - https://gerrit.wikimedia.org/r/922820 (https://phabricator.wikimedia.org/T337382) (owner: Aklapper) [23:16:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184 (T336886)', diff saved to https://phabricator.wikimedia.org/P49309 and previous config saved to /var/cache/conftool/dbconfig/20230608-231629-ladsgroup.json [23:16:32] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1186.eqiad.wmnet with reason: Maintenance [23:16:34] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [23:16:45] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1186.eqiad.wmnet with reason: Maintenance [23:16:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1186 (T336886)', diff saved to https://phabricator.wikimedia.org/P49310 and previous config saved to /var/cache/conftool/dbconfig/20230608-231650-ladsgroup.json [23:22:00] (PS1) Jclark-ctr: Add backup101[0-1] site.pp and netboot.cfg [puppet] - https://gerrit.wikimedia.org/r/928681 (https://phabricator.wikimedia.org/T326684) [23:27:15] (CR) Papaul: [V: +1] Add backup101[0-1] site.pp and netboot.cfg [puppet] - https://gerrit.wikimedia.org/r/928681 (https://phabricator.wikimedia.org/T326684) (owner: Jclark-ctr) [23:27:39] (CR) Jclark-ctr: [C: +2] Add backup101[0-1] site.pp and netboot.cfg [puppet] - https://gerrit.wikimedia.org/r/928681 (https://phabricator.wikimedia.org/T326684) (owner: Jclark-ctr) [23:32:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1186 (T336886)', diff saved to https://phabricator.wikimedia.org/P49311 and previous config
saved to /var/cache/conftool/dbconfig/20230608-233214-ladsgroup.json [23:32:18] T336886: Add user_is_temp field to the user table in MediaWiki core - https://phabricator.wikimedia.org/T336886 [23:32:45] SRE, LDAP-Access-Requests: Grant Access to nda for eccenux - https://phabricator.wikimedia.org/T337121 (KFrancis) Hi all, I am confirming the NDA is complete. Please proceed with the access request. Thanks! [23:38:39] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [23:40:35] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add DNS entry for pki-root - pt1979@cumin2002" [23:41:39] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add DNS entry for pki-root - pt1979@cumin2002" [23:41:40] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [23:42:00] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host pki-root1002.mgmt.eqiad.wmnet with reboot policy FORCED [23:47:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1186', diff saved to https://phabricator.wikimedia.org/P49312 and previous config saved to /var/cache/conftool/dbconfig/20230608-234720-ladsgroup.json [23:51:14] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host backup1010.eqiad.wmnet with OS bullseye [23:51:20] SRE, ops-eqiad, DC-Ops, Data-Persistence, and 2 others: Q4:rack/setup/install backup1010, backup1011 - https://phabricator.wikimedia.org/T326684 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host backup1010.eqiad.wmnet with OS bullseye [23:54:28] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host backup1010.eqiad.wmnet with OS bullseye [23:54:33] SRE, ops-eqiad, DC-Ops, Data-Persistence, and 2 others: Q4:rack/setup/install backup1010,
backup1011 - https://phabricator.wikimedia.org/T326684 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host backup1010.eqiad.wmnet with OS bullseye [23:54:35] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host backup1010.eqiad.wmnet with OS bullseye [23:54:40] SRE, ops-eqiad, DC-Ops, Data-Persistence, and 2 others: Q4:rack/setup/install backup1010, backup1011 - https://phabricator.wikimedia.org/T326684 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host backup1010.eqiad.wmnet with OS bullseye executed with err... [23:55:10] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host backup1011.eqiad.wmnet with OS bullseye [23:55:16] SRE, ops-eqiad, DC-Ops, Data-Persistence, and 2 others: Q4:rack/setup/install backup1010, backup1011 - https://phabricator.wikimedia.org/T326684 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host backup1011.eqiad.wmnet with OS bullseye [23:58:35] SRE, ops-eqiad, DC-Ops, Data-Persistence, and 2 others: Q4:rack/setup/install backup1010, backup1011 - https://phabricator.wikimedia.org/T326684 (Jclark-ctr)