[00:00:05] RoanKattouw and Urbanecm: That opportune time is upon us again. Time for a UTC late backport window deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220125T0000).
[00:00:05] SCardenasM, eigyan, and RhinosF1: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[00:01:06] (PS3) Catrope: bgwiki: fix setup for Draft namespace [mediawiki-config] - https://gerrit.wikimedia.org/r/756712 (https://phabricator.wikimedia.org/T299224) (owner: Gerrit Patch Uploader)
[00:01:39] (CR) Catrope: [C: +2] bgwiki: fix setup for Draft namespace [mediawiki-config] - https://gerrit.wikimedia.org/r/756712 (https://phabricator.wikimedia.org/T299224) (owner: Gerrit Patch Uploader)
[00:02:03] I also added a patch at 21:01
[00:02:23] Juan_90264: Thanks, I see those now after refreshing
[00:02:24] * RhinosF1 here
[00:02:28] *at 00:01
[00:04:50] (CR) Ryan Kemper: [C: +2] rdf query service: Include host header with proxy_pass [puppet] - https://gerrit.wikimedia.org/r/756724 (https://phabricator.wikimedia.org/T295676) (owner: Ebernhardson)
[00:06:15] RoanKattouw: which mwdebug?
[00:06:21] It's still not merged :(
[00:06:45] Someone just +2ed a big stack of Wikibase patches and those are taking up all the CI resources
[00:06:54] config patches are supposed to be prioritized, but I don't see that happening
[00:07:06] yeah
[00:09:16] RoanKattouw: afaik gate-and-submit as a whole has the highest priority
[00:09:38] (config patches take priority in the pre-merge V+2)
[00:10:53] (Merged) jenkins-bot: bgwiki: fix setup for Draft namespace [mediawiki-config] - https://gerrit.wikimedia.org/r/756712 (https://phabricator.wikimedia.org/T299224) (owner: Gerrit Patch Uploader)
[00:11:07] Also apparently amending an already-+2ed patch will immediately try to merge it again?!
See https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Wikibase/+/755730
[00:11:34] RhinosF1: Alright, ready on mwdebug1002, sorry for the delay
[00:11:59] (PS1) Zabe: Add ombuds.wikimedia.org [dns] - https://gerrit.wikimedia.org/r/756732 (https://phabricator.wikimedia.org/T273323)
[00:12:40] (PS1) Zabe: Add ombuds.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/756733 (https://phabricator.wikimedia.org/T273323)
[00:12:48] RoanKattouw: lgtm, please sync
[00:13:35] (PS1) Zabe: MWMultiVersion: move ombudsmen.wikimedia.org to ombuds.wikimedia.org [mediawiki-config] - https://gerrit.wikimedia.org/r/756734 (https://phabricator.wikimedia.org/T273323)
[00:13:37] (PS1) Zabe: InitialiseSettings: move ombudsmen.wikimedia.org to ombuds.wikimedia.org [mediawiki-config] - https://gerrit.wikimedia.org/r/756735 (https://phabricator.wikimedia.org/T273323)
[00:13:55] zabe: that task is resolved now. Please check with OC if that's still desired.
[00:14:02] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn
[00:14:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:14:34] urbanecm: of course
[00:14:46] (CR) jerkins-bot: [V: -1] InitialiseSettings: move ombudsmen.wikimedia.org to ombuds.wikimedia.org [mediawiki-config] - https://gerrit.wikimedia.org/r/756735 (https://phabricator.wikimedia.org/T273323) (owner: Zabe)
[00:14:56] !log catrope@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:756712|bgwiki: fix setup for Draft namespace (T299224)]] (duration: 00m 49s)
[00:14:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:15:00] T299224: bgwiki: Add draft namespace - https://phabricator.wikimedia.org/T299224
[00:15:17] (PS6) Catrope: Enable wgMinervaEnableSiteNotice for bnwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/756585 (https://phabricator.wikimedia.org/T299529) (owner:
Juan90264)
[00:15:18] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[00:15:19] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn
[00:15:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:15:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:15:33] (CR) Catrope: [C: +2] Enable wgMinervaEnableSiteNotice for bnwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/756585 (https://phabricator.wikimedia.org/T299529) (owner: Juan90264)
[00:15:39] (CR) jerkins-bot: [V: -1] MWMultiVersion: move ombudsmen.wikimedia.org to ombuds.wikimedia.org [mediawiki-config] - https://gerrit.wikimedia.org/r/756734 (https://phabricator.wikimedia.org/T273323) (owner: Zabe)
[00:15:48] eigyan, SCardenasM: Are you here for your deployments?
[00:16:00] Yup
[00:16:03] Yes I am
[00:16:29] Thanks :) your patches will be next after Juan_90264's patch
[00:16:31] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[00:16:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:16:40] (y)
[00:16:55] RoanKattouw great; thanks!
[00:16:57] (Merged) jenkins-bot: Enable wgMinervaEnableSiteNotice for bnwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/756585 (https://phabricator.wikimedia.org/T299529) (owner: Juan90264)
[00:17:19] Juan_90264: Your patch is on mwdebug1002, please test
[00:17:25] Ok
[00:17:55] (PS2) Catrope: Lower The Wikipedia Library editcount [mediawiki-config] - https://gerrit.wikimedia.org/r/755834 (https://phabricator.wikimedia.org/T288070) (owner: Scardenasmolinar)
[00:17:59] (CR) Catrope: [C: +2] Lower The Wikipedia Library editcount [mediawiki-config] - https://gerrit.wikimedia.org/r/755834 (https://phabricator.wikimedia.org/T288070) (owner: Scardenasmolinar)
[00:19:20] (PS2) Zabe: MWMultiVersion: move ombudsmen.wikimedia.org to ombuds.wikimedia.org [mediawiki-config] - https://gerrit.wikimedia.org/r/756734 (https://phabricator.wikimedia.org/T273323)
[00:19:23] (Merged) jenkins-bot: Lower The Wikipedia Library editcount [mediawiki-config] - https://gerrit.wikimedia.org/r/755834 (https://phabricator.wikimedia.org/T288070) (owner: Scardenasmolinar)
[00:20:47] (PS2) Zabe: InitialiseSettings: move ombudsmen.wikimedia.org to ombuds.wikimedia.org [mediawiki-config] - https://gerrit.wikimedia.org/r/756735 (https://phabricator.wikimedia.org/T273323)
[00:21:35] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn
[00:21:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:22:35] RoanKattouw: I approved
[00:22:49] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[00:22:50] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn
[00:22:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:22:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:22:56] Great, deploying
[00:23:38] !log
catrope@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:756585|Enable wgMinervaEnableSiteNotice for bnwiki (T299529)]] (duration: 00m 49s)
[00:23:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:23:42] T299529: Enable wgMinervaEnableSiteNotice for bnwiki - https://phabricator.wikimedia.org/T299529
[00:24:00] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[00:24:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:24:34] SCardenasM: Your patch is now on mwdebug1002 for testing
[00:24:43] Thanks! Taking a look now
[00:24:48] (PS3) Catrope: [wmf-config]: Deploy fawiki test survey to beta [mediawiki-config] - https://gerrit.wikimedia.org/r/755792 (https://phabricator.wikimedia.org/T297628) (owner: Eigyan)
[00:24:52] (CR) Catrope: [C: +2] [wmf-config]: Deploy fawiki test survey to beta [mediawiki-config] - https://gerrit.wikimedia.org/r/755792 (https://phabricator.wikimedia.org/T297628) (owner: Eigyan)
[00:25:38] (Merged) jenkins-bot: [wmf-config]: Deploy fawiki test survey to beta [mediawiki-config] - https://gerrit.wikimedia.org/r/755792 (https://phabricator.wikimedia.org/T297628) (owner: Eigyan)
[00:28:04] RoanKattouw: LGTM!
[00:29:03] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn
[00:29:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:29:09] !log catrope@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:755834|Lower The Wikipedia Library editcount]] (duration: 00m 49s)
[00:29:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:30:02] Oh and I just realized eigyan's patch is beta-only, so it doesn't need a deployment
[00:30:21] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[00:30:22] It'll be deployed to beta by an automated process some time in the next 15-20 minutes
[00:30:22] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn
[00:30:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:30:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:30:46] All done!
Thanks everyone
[00:31:34] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[00:31:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:35:39] (PS1) Jdlrobson: Fix bug in SkinVersionLookup [skins/Vector] (wmf/1.38.0-wmf.18) - https://gerrit.wikimedia.org/r/756696 (https://phabricator.wikimedia.org/T299971)
[00:36:35] Thanks, @roan
[00:36:52] thank you RoanKattouw
[00:51:16] (CR) Cwhite: [C: +1] site: add Prometheus role to codfw hardware [puppet] - https://gerrit.wikimedia.org/r/756603 (https://phabricator.wikimedia.org/T296199) (owner: Filippo Giunchedi)
[00:51:29] (CR) Cwhite: [C: +1] site: add Prometheus role to eqiad hardware [puppet] - https://gerrit.wikimedia.org/r/756604 (https://phabricator.wikimedia.org/T296199) (owner: Filippo Giunchedi)
[00:51:44] (CR) Cwhite: [C: +1] prometheus: disable rsync where not needed [puppet] - https://gerrit.wikimedia.org/r/756607 (https://phabricator.wikimedia.org/T296199) (owner: Filippo Giunchedi)
[00:55:14] (CR) jerkins-bot: [V: -1] Fix bug in SkinVersionLookup [skins/Vector] (wmf/1.38.0-wmf.18) - https://gerrit.wikimedia.org/r/756696 (https://phabricator.wikimedia.org/T299971) (owner: Jdlrobson)
[01:06:10] (Abandoned) Urbanecm: [labs] Set GlobalBlockRemoteReasonUrl [mediawiki-config] - https://gerrit.wikimedia.org/r/743695 (https://phabricator.wikimedia.org/T243863) (owner: Urbanecm)
[01:36:59] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=sidekiq site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[01:38:45] RECOVERY - Disk space on centrallog1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=centrallog1001&var-datasource=eqiad+prometheus/ops
[01:39:21] RECOVERY -
Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[02:00:04] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220125T0200)
[02:03:31] (PS1) Jdlrobson: Opt in link should be different in migration mode [skins/Vector] (wmf/1.38.0-wmf.18) - https://gerrit.wikimedia.org/r/756697 (https://phabricator.wikimedia.org/T299927)
[02:07:07] (PS1) TrainBranchBot: Branch commit for wmf/1.38.0-wmf.19 [core] (wmf/1.38.0-wmf.19) - https://gerrit.wikimedia.org/r/756746
[02:07:09] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/1.38.0-wmf.19 [core] (wmf/1.38.0-wmf.19) - https://gerrit.wikimedia.org/r/756746 (owner: TrainBranchBot)
[02:07:34] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn
[02:07:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:10:38] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[02:10:39] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn
[02:10:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:10:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:11:53] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[02:11:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:22:35] (Merged) jenkins-bot: Branch commit for wmf/1.38.0-wmf.19 [core] (wmf/1.38.0-wmf.19) - https://gerrit.wikimedia.org/r/756746 (owner: TrainBranchBot)
[02:27:06] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START
helmfile.d/services/mwdebug: apply on pinkunicorn
[02:27:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:28:18] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[02:28:19] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn
[02:28:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:28:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:29:09] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[02:29:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:08:39] SRE, ops-codfw: Test Dell switches cabling - https://phabricator.wikimedia.org/T290133 (Papaul)
[03:09:46] SRE, ops-codfw: Test Dell switches cabling - https://phabricator.wikimedia.org/T290133 (Papaul)
[03:27:55] SRE, SRE-Access-Requests: Requesting update to SSH key for Joseph Seddon - https://phabricator.wikimedia.org/T299988 (Seddon)
[03:33:13] (IcingaOverload) firing: Checks are taking long to execute on alert2001:9245 - https://grafana.wikimedia.org/d/rsCfQfuZz/icinga - https://alerts.wikimedia.org
[03:34:34] SRE, SRE-Access-Requests: Requesting update to SSH key and Kerberos for Joseph Seddon - https://phabricator.wikimedia.org/T299988 (Seddon)
[03:34:44] SRE, SRE-Access-Requests: Requesting update to SSH key and Kerberos for Joseph Seddon - https://phabricator.wikimedia.org/T299988 (Seddon)
[03:38:13] (IcingaOverload) resolved: Checks are taking long to execute on alert2001:9245 - https://grafana.wikimedia.org/d/rsCfQfuZz/icinga - https://alerts.wikimedia.org
[03:40:33] PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin1001 is CRITICAL: CRITICAL: the following (5) node(s) change every puppet run: labstore1007, miscweb1002, labstore1006, wdqs1010, build2001
https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes
[03:48:05] SRE, SRE-tools, Infrastructure-Foundations: Pairing tool for new SREs using sudo under supervision - https://phabricator.wikimedia.org/T299989 (RLazarus)
[03:48:31] (CR) EpicPupper: "recheck" [puppet] - https://gerrit.wikimedia.org/r/756698 (https://phabricator.wikimedia.org/T283273) (owner: EpicPupper)
[05:11:59] PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin2001 is CRITICAL: CRITICAL: the following (5) node(s) change every puppet run: wdqs1010, build2001, miscweb1002, labstore1007, labstore1006 https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes
[05:58:11] SRE, ops-eqiad, DBA: es1022 troubles with PXE - https://phabricator.wikimedia.org/T299123 (Marostegui) @Volans any thoughts? I can try this reimage with you if that'd help with the troubleshooting.
[05:58:13] PROBLEM - SSH on bast3005 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[06:00:27] RECOVERY - SSH on bast3005 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[06:01:09] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1150.eqiad.wmnet with reason: Maintenance
[06:01:11] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1150.eqiad.wmnet with reason: Maintenance
[06:01:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:01:14] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1145.eqiad.wmnet with reason: Maintenance
[06:01:15] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1145.eqiad.wmnet with reason: Maintenance
[06:01:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:01:18] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime
for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
[06:01:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:01:20] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
[06:01:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:01:22] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1147.eqiad.wmnet with reason: Maintenance
[06:01:24] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1147.eqiad.wmnet with reason: Maintenance
[06:01:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:01:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:01:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1147 (T285149)', diff saved to https://phabricator.wikimedia.org/P19078 and previous config saved to /var/cache/conftool/dbconfig/20220125-060128-marostegui.json
[06:01:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:01:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:01:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:01:35] T285149: Schema change for dropping rev_page_id index - https://phabricator.wikimedia.org/T285149
[06:02:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool es1030 T299889', diff saved to https://phabricator.wikimedia.org/P19079 and previous config saved to /var/cache/conftool/dbconfig/20220125-060241-marostegui.json
[06:02:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:02:45] T299889: Upgrade es2 to Bullseye - https://phabricator.wikimedia.org/T299889
[06:02:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147 (T285149)', diff saved to
https://phabricator.wikimedia.org/P19080 and previous config saved to /var/cache/conftool/dbconfig/20220125-060247-marostegui.json
[06:02:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:03:27] (PS1) Marostegui: es1030: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/756871 (https://phabricator.wikimedia.org/T299889)
[06:05:15] (CR) Marostegui: [C: +2] es1030: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/756871 (https://phabricator.wikimedia.org/T299889) (owner: Marostegui)
[06:07:55] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host es1030.eqiad.wmnet with OS bullseye
[06:07:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:17:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147', diff saved to https://phabricator.wikimedia.org/P19081 and previous config saved to /var/cache/conftool/dbconfig/20220125-061751-marostegui.json
[06:17:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:22:52] (PS1) Marostegui: ProductionServices.php: Promote pc1014 to pc3 master [mediawiki-config] - https://gerrit.wikimedia.org/r/756873 (https://phabricator.wikimedia.org/T299046)
[06:24:12] (CR) Marostegui: [C: +2] ProductionServices.php: Promote pc1014 to pc3 master [mediawiki-config] - https://gerrit.wikimedia.org/r/756873 (https://phabricator.wikimedia.org/T299046) (owner: Marostegui)
[06:25:23] (PS1) Marostegui: pc1013: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/756874 (https://phabricator.wikimedia.org/T299046)
[06:25:38] (Merged) jenkins-bot: ProductionServices.php: Promote pc1014 to pc3 master [mediawiki-config] - https://gerrit.wikimedia.org/r/756873 (https://phabricator.wikimedia.org/T299046) (owner: Marostegui)
[06:26:12] (CR) Marostegui: [C: +2] pc1013: Disable notifications [puppet] -
https://gerrit.wikimedia.org/r/756874 (https://phabricator.wikimedia.org/T299046) (owner: Marostegui)
[06:26:50] !log marostegui@deploy1002 Synchronized wmf-config/ProductionServices.php: Promote pc1014 to master in pc3 T299046 (duration: 00m 49s)
[06:26:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:26:54] T299046: Upgrade parsercache infra to Bullseye - https://phabricator.wikimedia.org/T299046
[06:31:38] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn
[06:31:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:32:54] (CR) Giuseppe Lavagetto: [C: +2] envoy: switch production to v3 configuration api [puppet] - https://gerrit.wikimedia.org/r/754460 (owner: Giuseppe Lavagetto)
[06:32:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147', diff saved to https://phabricator.wikimedia.org/P19082 and previous config saved to /var/cache/conftool/dbconfig/20220125-063256-marostegui.json
[06:32:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:33:05] (PS3) Giuseppe Lavagetto: envoy: switch production to v3 configuration api [puppet] - https://gerrit.wikimedia.org/r/754460
[06:33:11] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[06:33:12] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn
[06:33:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:33:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:34:14] (PS1) Marostegui: Revert "es1030: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/756705
[06:34:24] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[06:34:26] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:36:03] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host es1030.eqiad.wmnet with OS bullseye
[06:36:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:36:33] (CR) Marostegui: [C: +2] Revert "es1030: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/756705 (owner: Marostegui)
[06:36:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1030 (re)pooling @ 1%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19083 and previous config saved to /var/cache/conftool/dbconfig/20220125-063655-root.json
[06:36:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:41:59] PROBLEM - BGP status on cr3-eqsin is CRITICAL: BGP CRITICAL - No response from remote host 103.102.166.131 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[06:45:09] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={pdu_sentry4,swagger_check_restbase_eqsin} site=eqsin https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[06:47:01] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[06:48:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147 (T285149)', diff saved to https://phabricator.wikimedia.org/P19084 and previous config saved to /var/cache/conftool/dbconfig/20220125-064801-marostegui.json
[06:48:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:48:06] T285149: Schema change for dropping rev_page_id index - https://phabricator.wikimedia.org/T285149
[06:48:07] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2110.codfw.wmnet with reason: Maintenance
[06:48:09] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2110.codfw.wmnet with reason: Maintenance
[06:48:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:48:10] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 12 hosts with reason: Maintenance
[06:48:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:48:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:48:19] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 12 hosts with reason: Maintenance
[06:48:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:48:23] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1142.eqiad.wmnet with reason: Maintenance
[06:48:25] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1142.eqiad.wmnet with reason: Maintenance
[06:48:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:48:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:48:30] !log marostegui@cumin1001 dbctl commit (dc=all):
'Depooling db1142 (T285149)', diff saved to https://phabricator.wikimedia.org/P19085 and previous config saved to /var/cache/conftool/dbconfig/20220125-064829-marostegui.json
[06:48:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:49:13] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host pc1013.eqiad.wmnet with OS bullseye
[06:49:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:49:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142 (T285149)', diff saved to https://phabricator.wikimedia.org/P19086 and previous config saved to /var/cache/conftool/dbconfig/20220125-064936-marostegui.json
[06:49:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:50:15] PROBLEM - Check systemd state on mwmaint1002 is CRITICAL: CRITICAL - degraded: The following units failed: mediawiki_job_purge_parsercache_pc3.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:51:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1030 (re)pooling @ 5%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19087 and previous config saved to /var/cache/conftool/dbconfig/20220125-065158-root.json
[06:52:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:04:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142', diff saved to https://phabricator.wikimedia.org/P19088 and previous config saved to /var/cache/conftool/dbconfig/20220125-070441-marostegui.json
[07:04:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:07:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1030 (re)pooling @ 10%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19089 and previous config saved to /var/cache/conftool/dbconfig/20220125-070702-root.json
[07:07:05] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:12:39] SRE, Cloud-Services, Datasets-General-or-Unknown, affects-Kiwix-and-openZIM: Mirror more Kiwix downloads directories - https://phabricator.wikimedia.org/T57503 (Kelson) @ArielGlenn Thank you for your feedback. I have created another task here https://phabricator.wikimedia.org/T299993
[07:12:52] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host pc1013.eqiad.wmnet with OS bullseye
[07:12:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:16:06] PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin2002 is CRITICAL: CRITICAL: the following (5) node(s) change every puppet run: labstore1007, labstore1006, build2001, wdqs1010, miscweb1002 https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes
[07:18:27] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb={CREATE,UPDATE} https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[07:19:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142', diff saved to https://phabricator.wikimedia.org/P19090 and previous config saved to /var/cache/conftool/dbconfig/20220125-071945-marostegui.json
[07:19:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:20:15] RECOVERY - k8s API server requests latencies on kubestagemaster2001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[07:22:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1030 (re)pooling @ 20%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19091 and previous config saved to /var/cache/conftool/dbconfig/20220125-072206-root.json
[07:22:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:23:14] PROBLEM - SSH on db2086.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[07:34:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142 (T285149)', diff saved to https://phabricator.wikimedia.org/P19092 and previous config saved to /var/cache/conftool/dbconfig/20220125-073450-marostegui.json
[07:34:52] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1141.eqiad.wmnet with reason: Maintenance
[07:34:53] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1141.eqiad.wmnet with reason: Maintenance
[07:34:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:34:55] T285149: Schema change for dropping rev_page_id index - https://phabricator.wikimedia.org/T285149
[07:34:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:34:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1141 (T285149)', diff saved to https://phabricator.wikimedia.org/P19093 and previous config saved to /var/cache/conftool/dbconfig/20220125-073457-marostegui.json
[07:34:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:35:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:37:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1030 (re)pooling @ 25%: repooling after reimage', diff saved to
https://phabricator.wikimedia.org/P19094 and previous config saved to /var/cache/conftool/dbconfig/20220125-073709-root.json [07:37:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:38:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141 (T285149)', diff saved to https://phabricator.wikimedia.org/P19095 and previous config saved to /var/cache/conftool/dbconfig/20220125-073805-marostegui.json [07:38:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:41:16] (03PS4) 10Giuseppe Lavagetto: tls_helpers: fail if a listener is non existent [deployment-charts] - 10https://gerrit.wikimedia.org/r/755527 (https://phabricator.wikimedia.org/T291959) [07:43:40] PROBLEM - SSH on kubernetes1004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [07:52:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1030 (re)pooling @ 40%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19096 and previous config saved to /var/cache/conftool/dbconfig/20220125-075213-root.json [07:52:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:53:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141', diff saved to https://phabricator.wikimedia.org/P19097 and previous config saved to /var/cache/conftool/dbconfig/20220125-075309-marostegui.json [07:53:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:07:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1030 (re)pooling @ 50%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19098 and previous config saved to /var/cache/conftool/dbconfig/20220125-080717-root.json [08:07:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:08:14] !log marostegui@cumin1001 dbctl commit (dc=all): 
'Repooling after maintenance db1141', diff saved to https://phabricator.wikimedia.org/P19099 and previous config saved to /var/cache/conftool/dbconfig/20220125-080814-marostegui.json [08:08:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:19:44] (03PS1) 10Marostegui: Revert "pc1013: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/756926 [08:20:47] (03CR) 10Marostegui: [C: 03+2] Revert "pc1013: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/756926 (owner: 10Marostegui) [08:21:05] (03CR) 10JMeybohm: [V: 03+1 C: 03+2] Upgrade staging-eqiad kubernetes master to a full node [puppet] - 10https://gerrit.wikimedia.org/r/755977 (https://phabricator.wikimedia.org/T290967) (owner: 10JMeybohm) [08:21:45] (03CR) 10JMeybohm: [C: 03+2] Add kubestagemaster1001 to k8s_staging eBGP config [homer/public] - 10https://gerrit.wikimedia.org/r/755978 (https://phabricator.wikimedia.org/T290967) (owner: 10JMeybohm) [08:21:54] (03PS1) 10Marostegui: Revert "ProductionServices.php: Promote pc1014 to pc3 master" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/756927 [08:22:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1030 (re)pooling @ 60%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19100 and previous config saved to /var/cache/conftool/dbconfig/20220125-082220-root.json [08:22:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:22:44] (03Merged) 10jenkins-bot: Add kubestagemaster1001 to k8s_staging eBGP config [homer/public] - 10https://gerrit.wikimedia.org/r/755978 (https://phabricator.wikimedia.org/T290967) (owner: 10JMeybohm) [08:23:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141 (T285149)', diff saved to https://phabricator.wikimedia.org/P19101 and previous config saved to /var/cache/conftool/dbconfig/20220125-082319-marostegui.json [08:23:20] !log marostegui@cumin1001 START - Cookbook 
sre.hosts.downtime for 6:00:00 on db1144.eqiad.wmnet with reason: Maintenance [08:23:22] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1144.eqiad.wmnet with reason: Maintenance [08:23:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:23:23] T285149: Schema change for dropping rev_page_id index - https://phabricator.wikimedia.org/T285149 [08:23:24] sobanski: fyi the VRTS (https://phabricator.wikimedia.org/project/view/5725/) project you created on phab, is probably a dup of this one https://phabricator.wikimedia.org/project/board/210/ [08:23:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:23:26] (03CR) 10Marostegui: [C: 03+2] Revert "ProductionServices.php: Promote pc1014 to pc3 master" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/756927 (owner: 10Marostegui) [08:23:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:23:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1144:3314 (T285149)', diff saved to https://phabricator.wikimedia.org/P19102 and previous config saved to /var/cache/conftool/dbconfig/20220125-082326-marostegui.json [08:23:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:24:09] (03Merged) 10jenkins-bot: Revert "ProductionServices.php: Promote pc1014 to pc3 master" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/756927 (owner: 10Marostegui) [08:25:01] !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-single for host kubestagemaster1001.eqiad.wmnet [08:25:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:25:39] !log marostegui@deploy1002 Synchronized wmf-config/ProductionServices.php: Revert: Promote pc1013 to master in pc3 T299046 (duration: 00m 49s) [08:25:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:25:43] T299046: Upgrade parsercache infra to Bullseye - 
https://phabricator.wikimedia.org/T299046 [08:29:52] (03PS1) 10Marostegui: pc1014: Move it to pc1 [puppet] - 10https://gerrit.wikimedia.org/r/756947 (https://phabricator.wikimedia.org/T299046) [08:30:47] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [08:30:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:30:53] (03CR) 10Marostegui: [C: 03+2] pc1014: Move it to pc1 [puppet] - 10https://gerrit.wikimedia.org/r/756947 (https://phabricator.wikimedia.org/T299046) (owner: 10Marostegui) [08:32:03] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [08:32:04] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [08:32:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:32:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:32:41] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kubestagemaster1001.eqiad.wmnet [08:32:41] !log kubernetes staging migrated tainted worker node setup - T290967 [08:32:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:32:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:32:47] T290967: kube-apiserver need to reach webhooks running inside of the cluster - https://phabricator.wikimedia.org/T290967 [08:33:12] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [08:33:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:33:22] (03CR) 10DCausse: wcqs: set QUERY_SERVICE env name with wcqs/wdqs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/753973 (owner: 10DCausse) [08:35:00] 10SRE, 10SRE-tools, 10Infrastructure-Foundations: Pairing tool for new SREs using sudo under supervision - 
https://phabricator.wikimedia.org/T299989 (10MoritzMuehlenhoff) One future alternative may be to use new approval plugin introduced in sudo 1.9 which would allow to write a custom approval check with a... [08:37:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1030 (re)pooling @ 75%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19103 and previous config saved to /var/cache/conftool/dbconfig/20220125-083724-root.json [08:37:26] (KubernetesRsyslogDown) firing: (2) rsyslog on kubestage1003:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org [08:37:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:39:43] 10Puppet, 10CFSSL-PKI, 10Infrastructure-Foundations, 10User-jbond: cfssl::cert dosn't refresh certificate if csr data changes - https://phabricator.wikimedia.org/T294832 (10JMeybohm) [08:44:00] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti1013.eqiad.wmnet with OS buster [08:44:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:44:07] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Migrate eqiad Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296721 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti1013.eqiad.wmnet with OS buster [08:45:47] !log draining instances off ganeti1005 for reimage [08:45:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:47:26] (KubernetesRsyslogDown) firing: (2) rsyslog on kubestage1003:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org [08:48:28] KubernetesRsyslogDown should clear in a minute [08:51:28] (03PS1) 10Marostegui: add_gb_by_central_id_T299827.py: Fixes [software/schema-changes] - 10https://gerrit.wikimedia.org/r/756948 
(https://phabricator.wikimedia.org/T299827) [08:52:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1030 (re)pooling @ 100%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19104 and previous config saved to /var/cache/conftool/dbconfig/20220125-085228-root.json [08:52:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:52:55] (03CR) 10Marostegui: "Tested without --run and worked fine" [software/schema-changes] - 10https://gerrit.wikimedia.org/r/756948 (https://phabricator.wikimedia.org/T299827) (owner: 10Marostegui) [08:53:43] (03PS2) 10Marostegui: add_gb_by_central_id_T299827.py: Fixes [software/schema-changes] - 10https://gerrit.wikimedia.org/r/756948 (https://phabricator.wikimedia.org/T299827) [09:04:13] 10SRE, 10SRE-Access-Requests, 10Fundraising-Backlog, 10observability, 10serviceops-radar: Fundraising-Tech engineers unable to ACK icinga alerts on fr-tech host groups - https://phabricator.wikimedia.org/T298649 (10Volans) a:05cmooney→03jhathaway Thanks for checking @Ejegg. @jgleeson sure, we can do... [09:08:08] (03CR) 10MMandere: [C: 03+2] site: Add drmrs ncredir host [puppet] - 10https://gerrit.wikimedia.org/r/756613 (https://phabricator.wikimedia.org/T282787) (owner: 10MMandere) [09:11:47] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti1013.eqiad.wmnet with OS buster [09:11:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:11:53] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Migrate eqiad Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296721 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti1013.eqiad.wmnet with OS buster completed: - ganeti1013 (**PASS**)... 
[09:13:10] (03PS3) 10MMandere: site: Add drmrs ncredir host [puppet] - 10https://gerrit.wikimedia.org/r/756613 (https://phabricator.wikimedia.org/T282787) [09:15:19] (03PS1) 10Muehlenhoff: Make ganeti1027 a Ganeti node [puppet] - 10https://gerrit.wikimedia.org/r/756950 (https://phabricator.wikimedia.org/T293909) [09:16:04] (03CR) 10jerkins-bot: [V: 04-1] Make ganeti1027 a Ganeti node [puppet] - 10https://gerrit.wikimedia.org/r/756950 (https://phabricator.wikimedia.org/T293909) (owner: 10Muehlenhoff) [09:17:54] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to Superset for Margeigh Novotny - https://phabricator.wikimedia.org/T299072 (10Volans) [09:19:45] (03PS2) 10Muehlenhoff: Make ganeti1027 a Ganeti node [puppet] - 10https://gerrit.wikimedia.org/r/756950 (https://phabricator.wikimedia.org/T293909) [09:20:21] (03CR) 10Ayounsi: P:installserver::proxy: Add domain whitelist to proxy (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/753029 (https://phabricator.wikimedia.org/T298087) (owner: 10Jbond) [09:22:25] (03CR) 10Volans: [C: 04-1] "Patch LGTM, voting -1 just because needs to wait the missing steps on task (L3, approval) before merging." 
[puppet] - 10https://gerrit.wikimedia.org/r/756708 (https://phabricator.wikimedia.org/T299072) (owner: 10JHathaway) [09:23:19] !log restarting blazegraph on wdqs1004 (jvm stuck for 1h) [09:23:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:23:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314 (T285149)', diff saved to https://phabricator.wikimedia.org/P19105 and previous config saved to /var/cache/conftool/dbconfig/20220125-092346-marostegui.json [09:23:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:23:50] T285149: Schema change for dropping rev_page_id index - https://phabricator.wikimedia.org/T285149 [09:23:54] !log mmandere@cumin1001 START - Cookbook sre.ganeti.makevm for new host ncredir6001.drmrs.wmnet [09:23:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:24:27] RECOVERY - SSH on db2086.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:25:26] (03CR) 10Ben Rohlfs: gerrit: move CI result table to a tab (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/756685 (owner: 10Hashar) [09:26:52] ^ when upstream at Google reviews your patch. 
I really love open source [09:27:15] PROBLEM - Confd template for /srv/config-master/pybal/codfw/ncredir-https on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/ncredir-https is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [09:27:31] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/ncredir on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/ncredir is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [09:27:58] (03PS3) 10Muehlenhoff: Make ganeti1027 a Ganeti node [puppet] - 10https://gerrit.wikimedia.org/r/756950 (https://phabricator.wikimedia.org/T293909) [09:37:13] (03CR) 10Ladsgroup: [C: 03+1] add_gb_by_central_id_T299827.py: Fixes [software/schema-changes] - 10https://gerrit.wikimedia.org/r/756948 (https://phabricator.wikimedia.org/T299827) (owner: 10Marostegui) [09:37:43] (03CR) 10Marostegui: [V: 03+2 C: 03+2] add_gb_by_central_id_T299827.py: Fixes [software/schema-changes] - 10https://gerrit.wikimedia.org/r/756948 (https://phabricator.wikimedia.org/T299827) (owner: 10Marostegui) [09:38:00] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1098.eqiad.wmnet with reason: Maintenance [09:38:01] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1098.eqiad.wmnet with reason: Maintenance [09:38:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:38:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:38:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1098:3317 (T299827)', diff saved to https://phabricator.wikimedia.org/P19106 and previous config saved to /var/cache/conftool/dbconfig/20220125-093806-marostegui.json [09:38:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:38:10] T299827: Add gb_by_central_id column to globalblocks table - 
https://phabricator.wikimedia.org/T299827 [09:38:19] PROBLEM - PyBal IPVS diff check on lvs6001 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([2a02:ec80:600:ed1a::3:443, 2a02:ec80:600:ed1a::3:80, 185.15.58.226:80, 185.15.58.226:443]) https://wikitech.wikimedia.org/wiki/PyBal [09:38:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314', diff saved to https://phabricator.wikimedia.org/P19107 and previous config saved to /var/cache/conftool/dbconfig/20220125-093850-marostegui.json [09:38:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:39:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317 (T299827)', diff saved to https://phabricator.wikimedia.org/P19108 and previous config saved to /var/cache/conftool/dbconfig/20220125-093912-marostegui.json [09:39:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:40:08] !log mmandere@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host ncredir6001.drmrs.wmnet [09:40:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:40:21] !log jayme@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [09:40:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:40:24] (03CR) 10Muehlenhoff: [C: 03+2] Make ganeti1027 a Ganeti node [puppet] - 10https://gerrit.wikimedia.org/r/756950 (https://phabricator.wikimedia.org/T293909) (owner: 10Muehlenhoff) [09:40:41] PROBLEM - Confd template for /srv/config-master/pybal/eqsin/ncredir on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqsin/ncredir is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [09:40:45] !log jayme@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. 
[09:40:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:42:03] PROBLEM - Confd template for /srv/config-master/pybal/eqsin/ncredir on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqsin/ncredir is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [09:42:27] (03PS2) 10ArielGlenn: clean up older enterprise html dumps, keep the last 6 runs [puppet] - 10https://gerrit.wikimedia.org/r/756596 (https://phabricator.wikimedia.org/T273585) [09:42:27] PROBLEM - Confd template for /srv/config-master/pybal/eqsin/ncredir-https on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqsin/ncredir-https is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [09:42:37] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/ncredir-https on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/ncredir-https is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [09:42:41] PROBLEM - Confd template for /srv/config-master/pybal/esams/ncredir on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/esams/ncredir is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [09:43:01] PROBLEM - PyBal connections to etcd on lvs6001 is CRITICAL: CRITICAL: 8 connections established with conf1006.eqiad.wmnet:4001 (min=12) https://wikitech.wikimedia.org/wiki/PyBal [09:43:07] PROBLEM - Confd template for /srv/config-master/pybal/eqsin/ncredir-https on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqsin/ncredir-https is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [09:43:10] mmandere: ^ these seem related to https://sal.toolforge.org/log/GjubkH4B1jz_IcWuX5KB [09:43:13] PROBLEM - Confd template for /srv/config-master/pybal/codfw/ncredir on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/ncredir is broken 
https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [09:43:47] taavi: yes, launching ncredir instances in drmrs, looking into it [09:44:03] (03CR) 10ArielGlenn: [C: 03+2] clean up older enterprise html dumps, keep the last 6 runs [puppet] - 10https://gerrit.wikimedia.org/r/756596 (https://phabricator.wikimedia.org/T273585) (owner: 10ArielGlenn) [09:45:47] RECOVERY - SSH on kubernetes1004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:46:28] 10SRE, 10SRE-Access-Requests, 10Research: Access to analytics-privatedata-users for Research intern AniketArs - https://phabricator.wikimedia.org/T299919 (10Volans) p:05Triage→03Medium [09:47:57] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/ncredir on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/ncredir is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [09:47:57] PROBLEM - Confd template for /srv/config-master/pybal/codfw/ncredir-https on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/ncredir-https is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [09:48:49] PROBLEM - PyBal connections to etcd on lvs6003 is CRITICAL: CRITICAL: 12 connections established with conf1006.eqiad.wmnet:4001 (min=16) https://wikitech.wikimedia.org/wiki/PyBal [09:50:00] 10SRE-Access-Requests: Requesting Google Search Console Access for a Service Account - https://phabricator.wikimedia.org/T300004 (10SCherukuwada) [09:50:17] ^^ aware of the error... 
launching ncredir vm in drmrs initiated that [09:50:23] PROBLEM - Confd template for /srv/config-master/pybal/esams/ncredir on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/esams/ncredir is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [09:50:25] PROBLEM - Confd template for /srv/config-master/pybal/esams/ncredir-https on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/esams/ncredir-https is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [09:51:07] PROBLEM - PyBal IPVS diff check on lvs6003 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([2a02:ec80:600:ed1a::3:443, 2a02:ec80:600:ed1a::3:80, 185.15.58.226:80, 185.15.58.226:443]) https://wikitech.wikimedia.org/wiki/PyBal [09:51:52] 10SRE-Access-Requests: Requesting Google Search Console Access for a Service Account - https://phabricator.wikimedia.org/T300004 (10Peachey88) [09:53:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314', diff saved to https://phabricator.wikimedia.org/P19109 and previous config saved to /var/cache/conftool/dbconfig/20220125-095355-marostegui.json [09:53:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:54:11] 10SRE-Access-Requests: Requesting Google Search Console Access for a Service Account - https://phabricator.wikimedia.org/T300004 (10Peachey88) [09:54:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317', diff saved to https://phabricator.wikimedia.org/P19110 and previous config saved to /var/cache/conftool/dbconfig/20220125-095417-marostegui.json [09:54:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:54:23] jouncebot: nowandnext [09:54:23] No deployments scheduled for the next 2 hour(s) and 5 minute(s) [09:54:23] In 2 hour(s) and 5 minute(s): UTC morning backport window 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220125T1200) [09:54:27] PROBLEM - Host ncredir-lb.drmrs.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [09:54:28] PROBLEM - Host ncredir-lb.drmrs.wikimedia.org_ipv6 is DOWN: PING CRITICAL - Packet loss = 100% [09:54:51] also expected I take it, mmandere [09:54:57] (03PS2) 10Majavah: Undeploy UserMerge (1) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/755532 (https://phabricator.wikimedia.org/T216089) [09:55:13] * volans here but seems false alarm [09:55:16] is the ncredir-lb issue expected? [09:55:19] indeed [09:55:26] Emperor: drmrs is not live yet [09:55:30] (03CR) 10Jbond: [C: 03+1] "LGTM thx" [puppet] - 10https://gerrit.wikimedia.org/r/756683 (owner: 10Cwhite) [09:55:30] * Emperor will stop twitching then :) [09:55:34] ACKNOWLEDGEMENT - PyBal IPVS diff check on lvs6001 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([2a02:ec80:600:ed1a::3:443, 2a02:ec80:600:ed1a::3:80, 185.15.58.226:80, 185.15.58.226:443]) MMandere Aware of the problem, caused by launch of ncredir vm in drmrs https://wikitech.wikimedia.org/wiki/PyBal [09:55:34] ACKNOWLEDGEMENT - PyBal connections to etcd on lvs6001 is CRITICAL: CRITICAL: 8 connections established with conf1006.eqiad.wmnet:4001 (min=12) MMandere Aware of the problem, caused by launch of ncredir vm in drmrs https://wikitech.wikimedia.org/wiki/PyBal [09:55:34] ACKNOWLEDGEMENT - PyBal IPVS diff check on lvs6003 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([2a02:ec80:600:ed1a::3:443, 2a02:ec80:600:ed1a::3:80, 185.15.58.226:80, 185.15.58.226:443]) MMandere Aware of the problem, caused by launch of ncredir vm in drmrs https://wikitech.wikimedia.org/wiki/PyBal [09:55:34] ACKNOWLEDGEMENT - PyBal connections to etcd on lvs6003 is CRITICAL: CRITICAL: 12 connections established with conf1006.eqiad.wmnet:4001 (min=16) MMandere Aware of the problem, caused by launch of ncredir vm in drmrs 
https://wikitech.wikimedia.org/wiki/PyBal [09:55:34] ACKNOWLEDGEMENT - Confd template for /srv/config-master/pybal/codfw/ncredir-https on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/ncredir-https is broken MMandere Aware of the problem, caused by launch of ncredir vm in drmrs https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [09:55:34] ACKNOWLEDGEMENT - Confd template for /srv/config-master/pybal/eqiad/ncredir on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/ncredir is broken MMandere Aware of the problem, caused by launch of ncredir vm in drmrs https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [09:55:35] ACKNOWLEDGEMENT - Confd template for /srv/config-master/pybal/eqiad/ncredir-https on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/ncredir-https is broken MMandere Aware of the problem, caused by launch of ncredir vm in drmrs https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [09:55:35] ACKNOWLEDGEMENT - Confd template for /srv/config-master/pybal/eqsin/ncredir on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqsin/ncredir is broken MMandere Aware of the problem, caused by launch of ncredir vm in drmrs https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [09:55:35] ACKNOWLEDGEMENT - Confd template for /srv/config-master/pybal/eqsin/ncredir-https on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqsin/ncredir-https is broken MMandere Aware of the problem, caused by launch of ncredir vm in drmrs https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [09:55:36] ACKNOWLEDGEMENT - Confd template for /srv/config-master/pybal/esams/ncredir on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/esams/ncredir is broken MMandere Aware of the problem, caused by launch of ncredir vm in drmrs https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [09:55:36] ACKNOWLEDGEMENT - 
Confd template for /srv/config-master/pybal/esams/ncredir-https on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/esams/ncredir-https is broken MMandere Aware of the problem, caused by launch of ncredir vm in drmrs https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [09:55:37] ACKNOWLEDGEMENT - Confd template for /srv/config-master/pybal/ulsfo/ncredir on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/ulsfo/ncredir is broken MMandere Aware of the problem, caused by launch of ncredir vm in drmrs https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [09:55:38] ACKNOWLEDGEMENT - Confd template for /srv/config-master/pybal/codfw/ncredir on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/ncredir is broken MMandere Aware of the problem, caused by launch of ncredir vm in drmrs https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [09:55:38] (03CR) 10Majavah: [C: 03+2] Undeploy UserMerge (1) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/755532 (https://phabricator.wikimedia.org/T216089) (owner: 10Majavah) [09:55:38] ACKNOWLEDGEMENT - Confd template for /srv/config-master/pybal/codfw/ncredir-https on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/ncredir-https is broken MMandere Aware of the problem, caused by launch of ncredir vm in drmrs https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [09:55:38] ACKNOWLEDGEMENT - Confd template for /srv/config-master/pybal/eqiad/ncredir on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/ncredir is broken MMandere Aware of the problem, caused by launch of ncredir vm in drmrs https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [09:55:39] ACKNOWLEDGEMENT - Confd template for /srv/config-master/pybal/eqiad/ncredir-https on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/ncredir-https is broken MMandere Aware of the problem, caused by 
launch of ncredir vm in drmrs https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [09:55:39] ACKNOWLEDGEMENT - Confd template for /srv/config-master/pybal/eqsin/ncredir on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqsin/ncredir is broken MMandere Aware of the problem, caused by launch of ncredir vm in drmrs https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [09:55:40] ACKNOWLEDGEMENT - Confd template for /srv/config-master/pybal/eqsin/ncredir-https on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqsin/ncredir-https is broken MMandere Aware of the problem, caused by launch of ncredir vm in drmrs https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [09:55:41] ACKNOWLEDGEMENT - Confd template for /srv/config-master/pybal/esams/ncredir on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/esams/ncredir is broken MMandere Aware of the problem, caused by launch of ncredir vm in drmrs https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [09:55:41] ACKNOWLEDGEMENT - Confd template for /srv/config-master/pybal/esams/ncredir-https on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/esams/ncredir-https is broken MMandere Aware of the problem, caused by launch of ncredir vm in drmrs https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [09:55:42] ACKNOWLEDGEMENT - Confd template for /srv/config-master/pybal/ulsfo/ncredir on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/ulsfo/ncredir is broken MMandere Aware of the problem, caused by launch of ncredir vm in drmrs https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [09:56:19] (03Merged) 10jenkins-bot: Undeploy UserMerge (1) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/755532 (https://phabricator.wikimedia.org/T216089) (owner: 10Majavah) [09:56:59] godog: that's right [09:57:55] mmandere: ack, thanks [09:58:01] (03PS1) 10Marostegui: es2020: Disable notifications 
[puppet] - 10https://gerrit.wikimedia.org/r/756952 (https://phabricator.wikimedia.org/T300005) [09:58:03] (03PS2) 10Ladsgroup: es2029: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/756586 (https://phabricator.wikimedia.org/T299911) [09:58:14] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] es2029: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/756586 (https://phabricator.wikimedia.org/T299911) (owner: 10Ladsgroup) [09:59:01] !log taavi@deploy1002 Synchronized wmf-config/CommonSettings.php: Config: [[gerrit:755532|Undeploy UserMerge (1) (T216089)]] (duration: 00m 49s) [09:59:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:59:05] T216089: Undeploy UserMerge Extension from WMF production - https://phabricator.wikimedia.org/T216089 [09:59:14] (03PS2) 10Majavah: Undeploy UserMerge (2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/755533 (https://phabricator.wikimedia.org/T216089) [09:59:23] (03CR) 10Majavah: [C: 03+2] Undeploy UserMerge (2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/755533 (https://phabricator.wikimedia.org/T216089) (owner: 10Majavah) [09:59:30] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [09:59:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:59:32] (03CR) 10Marostegui: [C: 03+2] es2020: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/756952 (https://phabricator.wikimedia.org/T300005) (owner: 10Marostegui) [09:59:46] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: disable rsync where not needed [puppet] - 10https://gerrit.wikimedia.org/r/756607 (https://phabricator.wikimedia.org/T296199) (owner: 10Filippo Giunchedi) [09:59:52] (03PS2) 10Filippo Giunchedi: prometheus: disable rsync where not needed [puppet] - 10https://gerrit.wikimedia.org/r/756607 (https://phabricator.wikimedia.org/T296199) [10:00:08] (03Merged) 10jenkins-bot: Undeploy UserMerge (2) 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/755533 (https://phabricator.wikimedia.org/T216089) (owner: 10Majavah) [10:00:36] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on es2029.codfw.wmnet with reason: reimage for upgrade - T299911 [10:00:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool es2020', diff saved to https://phabricator.wikimedia.org/P19111 and previous config saved to /var/cache/conftool/dbconfig/20220125-100036-marostegui.json [10:00:38] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [10:00:38] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2029.codfw.wmnet with reason: reimage for upgrade - T299911 [10:00:39] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [10:00:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:00:41] T299911: Upgrade es3 to Bullseye - https://phabricator.wikimedia.org/T299911 [10:00:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:00:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:00:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:00:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:01:34] 10SRE-Access-Requests: Requesting Google Search Console Access for a Service Account - https://phabricator.wikimedia.org/T300004 (10SCherukuwada) Additional note: if we determine that this data is useful enough for us to import into our own data stores, the same service account can be used just as well to run pe... 
[10:01:49] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [10:01:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:01:58] !log taavi@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:755533|Undeploy UserMerge (2) (T216089)]] (duration: 00m 49s) [10:02:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:02:15] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.reimage for host es2029.codfw.wmnet with OS bullseye [10:02:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:03:03] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host es2020.codfw.wmnet with OS bullseye [10:03:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:03:08] (03PS2) 10Majavah: Undeploy UserMerge (3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/755534 (https://phabricator.wikimedia.org/T216089) [10:03:14] (03CR) 10Majavah: [C: 03+2] Undeploy UserMerge (3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/755534 (https://phabricator.wikimedia.org/T216089) (owner: 10Majavah) [10:03:55] (03Merged) 10jenkins-bot: Undeploy UserMerge (3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/755534 (https://phabricator.wikimedia.org/T216089) (owner: 10Majavah) [10:04:57] !log taavi@deploy1002 Synchronized wmf-config/extension-list: Config: [[gerrit:755534|Undeploy UserMerge (3) (T216089)]] (duration: 00m 48s) [10:05:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:05:01] T216089: Undeploy UserMerge Extension from WMF production - https://phabricator.wikimedia.org/T216089 [10:05:44] (03CR) 10Filippo Giunchedi: "LGTM! 
see inline for a nit and comment" [puppet] - 10https://gerrit.wikimedia.org/r/747550 (https://phabricator.wikimedia.org/T292603) (owner: 10Herron) [10:06:17] (03CR) 10Filippo Giunchedi: [V: 03+1] "Thank you for the quick review! I'll merge/deploy later today" [puppet] - 10https://gerrit.wikimedia.org/r/756595 (https://phabricator.wikimedia.org/T299633) (owner: 10Filippo Giunchedi) [10:06:53] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [10:06:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:08:13] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [10:08:14] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [10:08:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:08:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:08:36] PROBLEM - Confd template for /srv/config-master/pybal/codfw/ncredir on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/ncredir is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [10:08:48] (03PS3) 10Jbond: ecs: post-review [puppet] - 10https://gerrit.wikimedia.org/r/756630 [10:09:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314 (T285149)', diff saved to https://phabricator.wikimedia.org/P19112 and previous config saved to /var/cache/conftool/dbconfig/20220125-100900-marostegui.json [10:09:02] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1143.eqiad.wmnet with reason: Maintenance [10:09:03] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1143.eqiad.wmnet with reason: Maintenance [10:09:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:09:04] T285149: Schema change for dropping 
rev_page_id index - https://phabricator.wikimedia.org/T285149 [10:09:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:09:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1143 (T285149)', diff saved to https://phabricator.wikimedia.org/P19113 and previous config saved to /var/cache/conftool/dbconfig/20220125-100907-marostegui.json [10:09:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:09:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:09:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317', diff saved to https://phabricator.wikimedia.org/P19114 and previous config saved to /var/cache/conftool/dbconfig/20220125-100921-marostegui.json [10:09:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:09:26] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [10:09:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:11:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143 (T285149)', diff saved to https://phabricator.wikimedia.org/P19115 and previous config saved to /var/cache/conftool/dbconfig/20220125-101114-marostegui.json [10:11:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:11:34] (03CR) 10Filippo Giunchedi: [C: 03+2] site: add Prometheus role to codfw hardware [puppet] - 10https://gerrit.wikimedia.org/r/756603 (https://phabricator.wikimedia.org/T296199) (owner: 10Filippo Giunchedi) [10:11:41] (03PS2) 10Filippo Giunchedi: site: add Prometheus role to codfw hardware [puppet] - 10https://gerrit.wikimedia.org/r/756603 (https://phabricator.wikimedia.org/T296199) [10:12:10] (03PS4) 10Jbond: ecs: post-review [puppet] - 10https://gerrit.wikimedia.org/r/756630 [10:12:50] (03CR) 10jerkins-bot: [V: 04-1] ecs: post-review [puppet] - 
10https://gerrit.wikimedia.org/r/756630 (owner: 10Jbond) [10:13:03] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1013.eqiad.wmnet [10:13:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:13:32] (03PS5) 10Jbond: ecs: post-review [puppet] - 10https://gerrit.wikimedia.org/r/756630 [10:13:34] PROBLEM - Confd template for /srv/config-master/pybal/ulsfo/ncredir-https on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/ulsfo/ncredir-https is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [10:14:08] (03CR) 10jerkins-bot: [V: 04-1] ecs: post-review [puppet] - 10https://gerrit.wikimedia.org/r/756630 (owner: 10Jbond) [10:14:15] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33414/console" [puppet] - 10https://gerrit.wikimedia.org/r/756630 (owner: 10Jbond) [10:15:27] (03PS6) 10Jbond: ecs: post-review [puppet] - 10https://gerrit.wikimedia.org/r/756630 [10:16:23] (03CR) 10Jbond: "thanks for all the feedback, i have applied the suggested changes and will give it a shot" [puppet] - 10https://gerrit.wikimedia.org/r/756630 (owner: 10Jbond) [10:18:01] (CirrusSearchJVMGCOldPoolFlatlined) firing: (38) Elasticsearch instance elastic2025-production-search-codfw is showing memory pressure in the old pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://alerts.wikimedia.org [10:18:01] (03CR) 10Jbond: [C: 03+2] ecs: post-review [puppet] - 10https://gerrit.wikimedia.org/r/756630 (owner: 10Jbond) [10:18:41] PROBLEM - Confd template for /srv/config-master/pybal/drmrs/ncredir on puppetmaster2001 is CRITICAL: File not found: /srv/config-master/pybal/drmrs/ncredir https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [10:18:53] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1013.eqiad.wmnet [10:18:56] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:20:28] (03PS1) 10MMandere: install_server: Add drmrs ncredir first instance [puppet] - 10https://gerrit.wikimedia.org/r/756953 (https://phabricator.wikimedia.org/T282787) [10:23:01] (CirrusSearchJVMGCOldPoolFlatlined) firing: (72) Elasticsearch instance elastic2025-production-search-codfw is showing memory pressure in the old pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://alerts.wikimedia.org [10:23:43] (03CR) 10Jbond: "thanks for all the feedback looks like things are working now <3" [puppet] - 10https://gerrit.wikimedia.org/r/756630 (owner: 10Jbond) [10:24:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317 (T299827)', diff saved to https://phabricator.wikimedia.org/P19116 and previous config saved to /var/cache/conftool/dbconfig/20220125-102426-marostegui.json [10:24:28] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1171.eqiad.wmnet with reason: Maintenance [10:24:30] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1171.eqiad.wmnet with reason: Maintenance [10:24:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:24:31] T299827: Add gb_by_central_id column to globalblocks table - https://phabricator.wikimedia.org/T299827 [10:24:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:24:34] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [10:24:35] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [10:24:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:24:37] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [10:24:37] Logged the message 
at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:24:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:24:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:24:42] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1101.eqiad.wmnet with reason: Maintenance [10:24:44] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1101.eqiad.wmnet with reason: Maintenance [10:24:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:24:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:24:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1101:3317 (T299827)', diff saved to https://phabricator.wikimedia.org/P19117 and previous config saved to /var/cache/conftool/dbconfig/20220125-102448-marostegui.json [10:24:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:25:57] (03PS1) 10Arturo Borrero Gonzalez: wmcs: monitoring: sharpen primary/backup rsync setup [puppet] - 10https://gerrit.wikimedia.org/r/756954 (https://phabricator.wikimedia.org/T300011) [10:26:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143', diff saved to https://phabricator.wikimedia.org/P19118 and previous config saved to /var/cache/conftool/dbconfig/20220125-102619-marostegui.json [10:26:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:26:33] (03CR) 10Vgutierrez: [C: 03+1] "vgutierrez@ganeti6001:~$ sudo -i gnt-instance show ncredir6001.drmrs.wmnet |grep MAC" [puppet] - 10https://gerrit.wikimedia.org/r/756953 (https://phabricator.wikimedia.org/T282787) (owner: 10MMandere) [10:26:35] (03CR) 10jerkins-bot: [V: 04-1] wmcs: monitoring: sharpen primary/backup rsync setup [puppet] - 10https://gerrit.wikimedia.org/r/756954 (https://phabricator.wikimedia.org/T300011) (owner: 10Arturo Borrero 
Gonzalez) [10:28:01] (CirrusSearchJVMGCOldPoolFlatlined) firing: (72) Elasticsearch instance elastic2025-production-search-codfw is showing memory pressure in the old pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://alerts.wikimedia.org [10:28:04] PROBLEM - Confd template for /srv/config-master/pybal/drmrs/ncredir-https on puppetmaster2001 is CRITICAL: File not found: /srv/config-master/pybal/drmrs/ncredir-https https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [10:28:37] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [10:28:38] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [10:28:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:28:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:29:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool es1022 T299123', diff saved to https://phabricator.wikimedia.org/P19119 and previous config saved to /var/cache/conftool/dbconfig/20220125-102912-marostegui.json [10:29:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:29:16] T299123: es1022 troubles with PXE - https://phabricator.wikimedia.org/T299123 [10:30:14] (03PS1) 10Marostegui: es1022: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/756956 (https://phabricator.wikimedia.org/T299123) [10:30:30] (03CR) 10MMandere: [C: 03+2] install_server: Add drmrs ncredir first instance [puppet] - 10https://gerrit.wikimedia.org/r/756953 (https://phabricator.wikimedia.org/T282787) (owner: 10MMandere) [10:31:39] (03CR) 10Marostegui: [C: 03+2] es1022: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/756956 (https://phabricator.wikimedia.org/T299123) (owner: 10Marostegui) [10:31:53] mmandere: ok to merge your puppet change? 
[10:32:26] marostegui: thank you, yes it is ok [10:32:35] mmandere: thanks - done [10:32:37] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [10:32:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:32:48] PROBLEM - Check systemd state on prometheus2005 is CRITICAL: CRITICAL - degraded: The following units failed: generate-mysqld-exporter-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:33:01] (CirrusSearchJVMGCOldPoolFlatlined) firing: (72) Elasticsearch instance elastic2025-production-search-codfw is showing memory pressure in the old pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://alerts.wikimedia.org [10:33:16] (CirrusSearchJVMGCOldPoolFlatlined) firing: (72) Elasticsearch instance elastic2025-production-search-codfw is showing memory pressure in the old pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://alerts.wikimedia.org [10:33:19] marostegui: ack, thanks [10:33:50] PROBLEM - Confd template for /srv/config-master/pybal/ulsfo/ncredir-https on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/ulsfo/ncredir-https is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [10:34:59] (03PS1) 10Arturo Borrero Gonzalez: wmnet: make cloudmetrics1001 the backed of grafana/graphite endpoints [dns] - 10https://gerrit.wikimedia.org/r/756957 (https://phabricator.wikimedia.org/T300011) [10:36:07] !log nodetool removenode for restbase2011-c [10:36:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:36:57] (03PS2) 10Hnowlan: maps: add cassandra toggle, disable cassandra on maps hosts [puppet] - 10https://gerrit.wikimedia.org/r/753057 (https://phabricator.wikimedia.org/T298246) [10:37:33] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host es2029.codfw.wmnet with OS bullseye [10:37:36] Logged 
the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:37:47] (03PS3) 10Hnowlan: maps: add cassandra toggle, disable cassandra on maps hosts [puppet] - 10https://gerrit.wikimedia.org/r/753057 (https://phabricator.wikimedia.org/T298246) [10:38:01] (CirrusSearchJVMGCOldPoolFlatlined) firing: (72) Elasticsearch instance elastic2025-production-search-codfw is showing memory pressure in the old pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://alerts.wikimedia.org [10:38:16] (CirrusSearchJVMGCOldPoolFlatlined) firing: (72) Elasticsearch instance elastic2025-production-search-codfw is showing memory pressure in the old pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://alerts.wikimedia.org [10:39:20] (03PS1) 10Arturo Borrero Gonzalez: wmcs: monitoring: make cloudmetrics1001 the primary [puppet] - 10https://gerrit.wikimedia.org/r/756958 (https://phabricator.wikimedia.org/T300011) [10:40:23] (03PS2) 10Hashar: gerrit: move CI result table to a tab [puppet] - 10https://gerrit.wikimedia.org/r/756685 [10:40:35] (03PS2) 10Arturo Borrero Gonzalez: wmnet: make cloudmetrics1001 the backend of grafana/graphite endpoints [dns] - 10https://gerrit.wikimedia.org/r/756957 (https://phabricator.wikimedia.org/T300011) [10:40:59] (03CR) 10Hashar: [C: 04-1] "Thank you for your hints Ben, I have incorporated them in the next patchset. 
Feel free to drop yourselves from the reviewer list, I am pro" [puppet] - 10https://gerrit.wikimedia.org/r/756685 (owner: 10Hashar) [10:41:20] (03CR) 10Hashar: [C: 04-1] "-1 it is still a WIP" [puppet] - 10https://gerrit.wikimedia.org/r/756685 (owner: 10Hashar) [10:41:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143', diff saved to https://phabricator.wikimedia.org/P19120 and previous config saved to /var/cache/conftool/dbconfig/20220125-104124-marostegui.json [10:41:26] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host es2020.codfw.wmnet with OS bullseye [10:41:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:41:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:41:30] PROBLEM - Confd template for /srv/config-master/pybal/drmrs/ncredir on puppetmaster1001 is CRITICAL: File not found: /srv/config-master/pybal/drmrs/ncredir https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [10:41:50] (03PS2) 10Arturo Borrero Gonzalez: wmcs: monitoring: make cloudmetrics1001 the primary [puppet] - 10https://gerrit.wikimedia.org/r/756958 (https://phabricator.wikimedia.org/T300011) [10:42:10] (03PS1) 10Marostegui: Revert "es2020: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/756939 [10:43:01] (CirrusSearchJVMGCOldPoolFlatlined) firing: (72) Elasticsearch instance elastic2025-production-search-codfw is showing memory pressure in the old pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://alerts.wikimedia.org [10:43:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool es2020', diff saved to https://phabricator.wikimedia.org/P19121 and previous config saved to /var/cache/conftool/dbconfig/20220125-104331-marostegui.json [10:43:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:43:35] (03CR) 10Marostegui: [C: 03+2] Revert "es2020: Disable notifications" 
[puppet] - 10https://gerrit.wikimedia.org/r/756939 (owner: 10Marostegui) [10:43:56] (03PS1) 10Ladsgroup: Revert "es2029: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/756940 [10:44:05] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] Revert "es2029: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/756940 (owner: 10Ladsgroup) [10:44:25] (03PS1) 10Kosta Harlan: Add an image: update onboarding images for desktop [extensions/GrowthExperiments] (wmf/1.38.0-wmf.18) - 10https://gerrit.wikimedia.org/r/756941 (https://phabricator.wikimedia.org/T298109) [10:44:36] (03PS1) 10Kosta Harlan: Add an image: update onboarding images for desktop [extensions/GrowthExperiments] (wmf/1.38.0-wmf.19) - 10https://gerrit.wikimedia.org/r/756942 (https://phabricator.wikimedia.org/T298109) [10:45:21] !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: name=restbase2011.eqiad.wmnet [10:45:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:45:34] PROBLEM - Confd template for /srv/config-master/pybal/drmrs/ncredir-https on puppetmaster1001 is CRITICAL: File not found: /srv/config-master/pybal/drmrs/ncredir-https https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [10:46:08] (03PS1) 10Elukey: logstash: improve filter for ORES [puppet] - 10https://gerrit.wikimedia.org/r/756959 (https://phabricator.wikimedia.org/T299999) [10:46:40] (03PS4) 10Hnowlan: maps: add cassandra toggle, disable cassandra on maps hosts [puppet] - 10https://gerrit.wikimedia.org/r/753057 (https://phabricator.wikimedia.org/T298246) [10:47:22] (03CR) 10Giuseppe Lavagetto: "The diffs in the CI runs are due to the changes in fixtures." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/755527 (https://phabricator.wikimedia.org/T291959) (owner: 10Giuseppe Lavagetto) [10:47:28] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] wmcs: monitoring: make cloudmetrics1001 the primary [puppet] - 10https://gerrit.wikimedia.org/r/756958 (https://phabricator.wikimedia.org/T300011) (owner: 10Arturo Borrero Gonzalez) [10:47:33] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] wmnet: make cloudmetrics1001 the backend of grafana/graphite endpoints [dns] - 10https://gerrit.wikimedia.org/r/756957 (https://phabricator.wikimedia.org/T300011) (owner: 10Arturo Borrero Gonzalez) [10:47:54] (03CR) 10Jbond: [C: 03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/755756 (owner: 10Volans) [10:47:57] (03CR) 10jerkins-bot: [V: 04-1] logstash: improve filter for ORES [puppet] - 10https://gerrit.wikimedia.org/r/756959 (https://phabricator.wikimedia.org/T299999) (owner: 10Elukey) [10:48:01] (CirrusSearchJVMGCOldPoolFlatlined) firing: (72) Elasticsearch instance elastic2025-production-search-codfw is showing memory pressure in the old pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://alerts.wikimedia.org [10:48:06] (03PS1) 10Ladsgroup: es2027: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/756960 (https://phabricator.wikimedia.org/T299911) [10:49:19] (03PS2) 10Ladsgroup: es2027: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/756960 (https://phabricator.wikimedia.org/T299911) [10:49:21] (03PS1) 10Marostegui: es2021: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/756961 (https://phabricator.wikimedia.org/T300005) [10:49:24] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] es2027: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/756960 (https://phabricator.wikimedia.org/T299911) (owner: 10Ladsgroup) [10:49:40] (03CR) 10Hnowlan: [C: 03+2] maps: add cassandra toggle, disable cassandra on maps hosts [puppet] - 
10https://gerrit.wikimedia.org/r/753057 (https://phabricator.wikimedia.org/T298246) (owner: 10Hnowlan) [10:50:01] !log disabling puppet on all maps hosts to test cassandra removal [10:50:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:50:21] (03CR) 10Marostegui: [C: 03+2] es2021: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/756961 (https://phabricator.wikimedia.org/T300005) (owner: 10Marostegui) [10:50:38] hnowlan: ok to merge your change? [10:50:54] marostegui: yes, please [10:51:04] hnowlan: done! [10:51:06] thanks! [10:52:55] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on es2027.codfw.wmnet with reason: reimage for upgrade - T299911 [10:52:57] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2027.codfw.wmnet with reason: reimage for upgrade - T299911 [10:52:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:52:59] T299911: Upgrade es3 to Bullseye - https://phabricator.wikimedia.org/T299911 [10:53:01] (CirrusSearchJVMGCOldPoolFlatlined) firing: (72) Elasticsearch instance elastic2025-production-search-codfw is showing memory pressure in the old pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://alerts.wikimedia.org [10:53:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:53:25] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.reimage for host es2027.codfw.wmnet with OS bullseye [10:53:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:55:44] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host es2021.codfw.wmnet with OS bullseye [10:55:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:56:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143 (T285149)', diff saved to https://phabricator.wikimedia.org/P19122 
and previous config saved to /var/cache/conftool/dbconfig/20220125-105628-marostegui.json [10:56:30] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1146.eqiad.wmnet with reason: Maintenance [10:56:32] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1146.eqiad.wmnet with reason: Maintenance [10:56:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:56:33] T285149: Schema change for dropping rev_page_id index - https://phabricator.wikimedia.org/T285149 [10:56:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:56:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:56:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1146:3314 (T285149)', diff saved to https://phabricator.wikimedia.org/P19123 and previous config saved to /var/cache/conftool/dbconfig/20220125-105636-marostegui.json [10:56:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:57:28] (03PS2) 10Elukey: logstash: improve filter for ORES [puppet] - 10https://gerrit.wikimedia.org/r/756959 (https://phabricator.wikimedia.org/T299999) [10:57:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314 (T285149)', diff saved to https://phabricator.wikimedia.org/P19124 and previous config saved to /var/cache/conftool/dbconfig/20220125-105744-marostegui.json [10:57:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:58:01] (CirrusSearchJVMGCOldPoolFlatlined) firing: (49) Elasticsearch instance elastic2025-production-search-omega-codfw is showing memory pressure in the old pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://alerts.wikimedia.org [10:58:02] (03PS4) 10Jelto: gitlab: update cloud hiera, refactor naming [puppet] - 10https://gerrit.wikimedia.org/r/754063 (https://phabricator.wikimedia.org/T297411) 
[11:00:04] (03CR) 10jerkins-bot: [V: 04-1] logstash: improve filter for ORES [puppet] - 10https://gerrit.wikimedia.org/r/756959 (https://phabricator.wikimedia.org/T299999) (owner: 10Elukey) [11:00:47] RECOVERY - Check systemd state on prometheus2005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:02:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org [11:03:01] (CirrusSearchJVMGCOldPoolFlatlined) firing: (42) Elasticsearch instance elastic2025-production-search-omega-codfw is showing memory pressure in the old pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://alerts.wikimedia.org [11:03:39] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33415/console" [puppet] - 10https://gerrit.wikimedia.org/r/754063 (https://phabricator.wikimedia.org/T297411) (owner: 10Jelto) [11:03:43] PROBLEM - Check systemd state on prometheus2006 is CRITICAL: CRITICAL - degraded: The following units failed: generate-mysqld-exporter-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:04:45] (03CR) 10JMeybohm: [C: 04-1] "PS2 was correct, but with PS3 you reverted the Bug # again." 
[puppet] - 10https://gerrit.wikimedia.org/r/754960 (https://phabricator.wikimedia.org/T288345) (owner: 10AOkoth) [11:05:19] (03PS3) 10Elukey: logstash: improve filter for ORES [puppet] - 10https://gerrit.wikimedia.org/r/756959 (https://phabricator.wikimedia.org/T299999) [11:06:35] PROBLEM - cassandra service on maps1005 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:07:05] PROBLEM - cassandra CQL 10.64.0.12:9042 on maps1005 is CRITICAL: connect to address 10.64.0.12 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [11:07:28] (03CR) 10jerkins-bot: [V: 04-1] logstash: improve filter for ORES [puppet] - 10https://gerrit.wikimedia.org/r/756959 (https://phabricator.wikimedia.org/T299999) (owner: 10Elukey) [11:07:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org [11:07:57] !log temp disable alerting on prometheus200[56] - T296199 [11:08:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:08:02] T296199: Prometheus hardware refresh (+ Bullseye upgrade) - https://phabricator.wikimedia.org/T296199 [11:08:57] ACKNOWLEDGEMENT - cassandra CQL 10.64.0.12:9042 on maps1005 is CRITICAL: connect to address 10.64.0.12 and port 9042: Connection refused Hnowlan Disabling Cassandra on all maps hosts before removal https://phabricator.wikimedia.org/T93886 [11:08:57] ACKNOWLEDGEMENT - cassandra service on maps1005 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed Hnowlan Disabling Cassandra on all maps hosts before removal https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:08:57] ACKNOWLEDGEMENT - cassandra CQL 10.64.0.18:9042 on maps1006 is CRITICAL: connect to address 10.64.0.18 and port 
9042: Connection refused Hnowlan Disabling Cassandra on all maps hosts before removal https://phabricator.wikimedia.org/T93886 [11:08:57] ACKNOWLEDGEMENT - cassandra CQL 10.64.16.6:9042 on maps1007 is CRITICAL: connect to address 10.64.16.6 and port 9042: Connection refused Hnowlan Disabling Cassandra on all maps hosts before removal https://phabricator.wikimedia.org/T93886 [11:08:57] ACKNOWLEDGEMENT - cassandra service on maps1007 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed Hnowlan Disabling Cassandra on all maps hosts before removal https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:08:57] ACKNOWLEDGEMENT - cassandra CQL 10.64.16.27:9042 on maps1008 is CRITICAL: connect to address 10.64.16.27 and port 9042: Connection refused Hnowlan Disabling Cassandra on all maps hosts before removal https://phabricator.wikimedia.org/T93886 [11:08:58] ACKNOWLEDGEMENT - cassandra service on maps1008 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed Hnowlan Disabling Cassandra on all maps hosts before removal https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:08:58] ACKNOWLEDGEMENT - cassandra CQL 10.64.48.6:9042 on maps1010 is CRITICAL: connect to address 10.64.48.6 and port 9042: Connection refused Hnowlan Disabling Cassandra on all maps hosts before removal https://phabricator.wikimedia.org/T93886 [11:08:59] ACKNOWLEDGEMENT - cassandra service on maps1010 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed Hnowlan Disabling Cassandra on all maps hosts before removal https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:09:07] (03CR) 10Jelto: [V: 03+1] gitlab: update cloud hiera, refactor naming (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/754063 (https://phabricator.wikimedia.org/T297411) (owner: 10Jelto) [11:10:20] (03CR) 10jerkins-bot: [V: 04-1] Add an image: update onboarding images for desktop [extensions/GrowthExperiments] (wmf/1.38.0-wmf.19) 
- 10https://gerrit.wikimedia.org/r/756942 (https://phabricator.wikimedia.org/T298109) (owner: 10Kosta Harlan) [11:12:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314', diff saved to https://phabricator.wikimedia.org/P19125 and previous config saved to /var/cache/conftool/dbconfig/20220125-111249-marostegui.json [11:12:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:13:01] (CirrusSearchJVMGCOldPoolFlatlined) resolved: (39) Elasticsearch instance elastic2025-production-search-omega-codfw is showing memory pressure in the old pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://alerts.wikimedia.org [11:14:02] (03CR) 10Jbond: "LGTM, it would be nice to have some CI on this repo but not urgent" [software/cfssl-issuer] - 10https://gerrit.wikimedia.org/r/756546 (https://phabricator.wikimedia.org/T299906) (owner: 10JMeybohm) [11:14:07] (03CR) 10Jbond: [C: 03+1] Make a bundle signer return it's root CA [software/cfssl-issuer] - 10https://gerrit.wikimedia.org/r/756546 (https://phabricator.wikimedia.org/T299906) (owner: 10JMeybohm) [11:15:12] (03CR) 10Jbond: [C: 03+1] Add ca to multirootca.conf in simple-cfssl (031 comment) [software/cfssl-issuer] - 10https://gerrit.wikimedia.org/r/756616 (https://phabricator.wikimedia.org/T299906) (owner: 10JMeybohm) [11:15:19] (03PS2) 10Filippo Giunchedi: site: add Prometheus role to eqiad hardware [puppet] - 10https://gerrit.wikimedia.org/r/756604 (https://phabricator.wikimedia.org/T296199) [11:15:21] (03PS1) 10Filippo Giunchedi: hieradata: temp disable alerting for new prometheus hw [puppet] - 10https://gerrit.wikimedia.org/r/756965 (https://phabricator.wikimedia.org/T296199) [11:18:43] (03PS14) 10JMeybohm: Add basic ingress support to chart common_templates [deployment-charts] - 10https://gerrit.wikimedia.org/r/732374 (https://phabricator.wikimedia.org/T290966) [11:19:14] (03CR) 10JMeybohm: Add basic ingress support to chart 
common_templates (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/732374 (https://phabricator.wikimedia.org/T290966) (owner: 10JMeybohm) [11:19:45] !log installing apache security updates [11:19:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:20:08] (03CR) 10Kosta Harlan: "recheck" [extensions/GrowthExperiments] (wmf/1.38.0-wmf.19) - 10https://gerrit.wikimedia.org/r/756942 (https://phabricator.wikimedia.org/T298109) (owner: 10Kosta Harlan) [11:20:53] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (NOOP 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33416/console" [puppet] - 10https://gerrit.wikimedia.org/r/756965 (https://phabricator.wikimedia.org/T296199) (owner: 10Filippo Giunchedi) [11:21:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317 (T299827)', diff saved to https://phabricator.wikimedia.org/P19126 and previous config saved to /var/cache/conftool/dbconfig/20220125-112111-marostegui.json [11:21:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:21:16] T299827: Add gb_by_central_id column to globalblocks table - https://phabricator.wikimedia.org/T299827 [11:22:21] (03PS2) 10Filippo Giunchedi: hieradata: temp disable alerting for new prometheus hw [puppet] - 10https://gerrit.wikimedia.org/r/756965 (https://phabricator.wikimedia.org/T296199) [11:25:02] (03CR) 10Giuseppe Lavagetto: [C: 03+1] Add basic ingress support to chart common_templates (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/732374 (https://phabricator.wikimedia.org/T290966) (owner: 10JMeybohm) [11:26:26] (03CR) 10Filippo Giunchedi: [C: 03+2] hieradata: temp disable alerting for new prometheus hw [puppet] - 10https://gerrit.wikimedia.org/r/756965 (https://phabricator.wikimedia.org/T296199) (owner: 10Filippo Giunchedi) [11:27:25] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.reimage 
(exit_code=0) for host es2027.codfw.wmnet with OS bullseye [11:27:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:27:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314', diff saved to https://phabricator.wikimedia.org/P19127 and previous config saved to /var/cache/conftool/dbconfig/20220125-112753-marostegui.json [11:27:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:29:11] !log btullis@puppetmaster1001 conftool action : set/pooled=yes; selector: name=aqs1011.eqiad.wmnet [11:29:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:31:05] (03PS1) 10Jbond: P:base::firewall: Add proemethous hosts to catch all ferm rule [puppet] - 10https://gerrit.wikimedia.org/r/756966 (https://phabricator.wikimedia.org/T291946) [11:31:53] (03CR) 10jerkins-bot: [V: 04-1] P:base::firewall: Add proemethous hosts to catch all ferm rule [puppet] - 10https://gerrit.wikimedia.org/r/756966 (https://phabricator.wikimedia.org/T291946) (owner: 10Jbond) [11:32:04] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host es2021.codfw.wmnet with OS bullseye [11:32:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:32:28] jouncebot: nowandnext [11:32:28] No deployments scheduled for the next 0 hour(s) and 27 minute(s) [11:32:28] In 0 hour(s) and 27 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220125T1200) [11:33:29] (03PS2) 10Jbond: P:base::firewall: Add proemethous hosts to catch all ferm rule [puppet] - 10https://gerrit.wikimedia.org/r/756966 (https://phabricator.wikimedia.org/T291946) [11:34:10] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33418/console" [puppet] - 10https://gerrit.wikimedia.org/r/756966 (https://phabricator.wikimedia.org/T291946) (owner: 
10Jbond) [11:35:43] 10SRE, 10SRE-swift-storage, 10Data-Persistence-Backup, 10media-backups, and 3 others: WMF media storage must be adequately backed up - https://phabricator.wikimedia.org/T262668 (10jcrespo) 05In progress→03Open [11:36:08] 10SRE, 10SRE-swift-storage, 10Data-Persistence-Backup, 10media-backups, and 3 others: WMF media storage must be adequately backed up - https://phabricator.wikimedia.org/T262668 (10jcrespo) [11:36:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317', diff saved to https://phabricator.wikimedia.org/P19128 and previous config saved to /var/cache/conftool/dbconfig/20220125-113616-marostegui.json [11:36:19] (03CR) 10MarcoAurelio: "If we're doing changes to recountCategories (756415), shall we wait until that is reviewed/merged/tested? I'd say it'd not be bad to test " [puppet] - 10https://gerrit.wikimedia.org/r/756069 (https://phabricator.wikimedia.org/T299823) (owner: 10MarcoAurelio) [11:36:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:39:12] RECOVERY - Check systemd state on prometheus2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:42:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314 (T285149)', diff saved to https://phabricator.wikimedia.org/P19129 and previous config saved to /var/cache/conftool/dbconfig/20220125-114258-marostegui.json [11:43:01] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1121.eqiad.wmnet with reason: Maintenance [11:43:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:43:02] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1121.eqiad.wmnet with reason: Maintenance [11:43:03] T285149: Schema change for dropping rev_page_id index - https://phabricator.wikimedia.org/T285149 [11:43:03] !log 
marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [11:43:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:43:07] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [11:43:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:43:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:43:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1121 (T285149)', diff saved to https://phabricator.wikimedia.org/P19130 and previous config saved to /var/cache/conftool/dbconfig/20220125-114311-marostegui.json [11:43:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:43:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:44:22] (03PS1) 10Ladsgroup: Revert "es2027: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/756943 [11:44:34] (03PS2) 10Ladsgroup: Revert "es2027: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/756943 [11:45:05] 10SRE-OnFire (FY2021/2022-Q3): incidents occurring during Q2 have been scored with the scorecard - https://phabricator.wikimedia.org/T292254 (10jcrespo) @herron @lmata What is the relationship between the columns at #sre-onfire and #sre-onfire-incident-docs "in review"? Is the first the review of incident docs a... 
[11:45:44] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] Revert "es2027: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/756943 (owner: 10Ladsgroup) [11:48:57] (03PS3) 10Jbond: P:base::firewall: Add proemethous hosts to catch all ferm rule [puppet] - 10https://gerrit.wikimedia.org/r/756966 (https://phabricator.wikimedia.org/T291946) [11:49:32] (03CR) 10jerkins-bot: [V: 04-1] P:base::firewall: Add proemethous hosts to catch all ferm rule [puppet] - 10https://gerrit.wikimedia.org/r/756966 (https://phabricator.wikimedia.org/T291946) (owner: 10Jbond) [11:50:57] (03PS4) 10Jbond: P:base::firewall: Add proemethous hosts to catch all ferm rule [puppet] - 10https://gerrit.wikimedia.org/r/756966 (https://phabricator.wikimedia.org/T291946) [11:51:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317', diff saved to https://phabricator.wikimedia.org/P19131 and previous config saved to /var/cache/conftool/dbconfig/20220125-115120-marostegui.json [11:51:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:51:39] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33420/console" [puppet] - 10https://gerrit.wikimedia.org/r/756966 (https://phabricator.wikimedia.org/T291946) (owner: 10Jbond) [11:53:52] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 2 (cloudmetrics1001, ...), Fresh: 105 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [11:57:00] (03PS2) 10Lucas Werkmeister (WMDE): Enable statement usage tracking for Armenian Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/755330 (https://phabricator.wikimedia.org/T296382) (owner: 10Noa wmde) [11:57:43] !log oblivian@puppetmaster1001 conftool action : set/weight=1; selector: dc=eqiad,cluster=appserver,service=canary [11:57:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:00:05] Amir1, Lucas_WMDE, awight, and 
Urbanecm: That opportune time is upon us again. Time for a UTC morning backport window deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220125T1200). [12:00:05] kostajh and Lucas_WMDE: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [12:00:08] o/ [12:00:14] I can deploy [12:01:01] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33424/console" [puppet] - 10https://gerrit.wikimedia.org/r/756966 (https://phabricator.wikimedia.org/T291946) (owner: 10Jbond) [12:01:38] hi [12:01:55] thanks Lucas_WMDE [12:02:04] (03CR) 10Jbond: [V: 03+1] "This uses the role_hosts funct" [puppet] - 10https://gerrit.wikimedia.org/r/756966 (https://phabricator.wikimedia.org/T291946) (owner: 10Jbond) [12:02:15] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Add an image: update onboarding images for desktop [extensions/GrowthExperiments] (wmf/1.38.0-wmf.18) - 10https://gerrit.wikimedia.org/r/756941 (https://phabricator.wikimedia.org/T298109) (owner: 10Kosta Harlan) [12:02:26] let’s start by +2ing the backports, they’ll take a bit in gate-and-submit anyways [12:02:31] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Add an image: update onboarding images for desktop [extensions/GrowthExperiments] (wmf/1.38.0-wmf.19) - 10https://gerrit.wikimedia.org/r/756942 (https://phabricator.wikimedia.org/T298109) (owner: 10Kosta Harlan) [12:02:50] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Enable statement usage tracking for Armenian Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/755330 (https://phabricator.wikimedia.org/T296382) (owner: 10Noa wmde) [12:03:05] kostajh: do you want to deploy your backports yourself (once they’re merged)? 
[12:04:02] Lucas_WMDE: I could do that, I don't think I need to sync the files in any particular order, do I? [12:04:08] (03Merged) 10jenkins-bot: Enable statement usage tracking for Armenian Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/755330 (https://phabricator.wikimedia.org/T296382) (owner: 10Noa wmde) [12:04:20] I thought it probably makes sense to sync images/ before modules/ [12:04:27] so the files definitely exist [12:04:43] For context, the code in those patches is behind a feature flag [12:04:46] ah [12:05:06] and completely disabled for now? [12:05:10] or only enabled for some users? [12:05:12] (03PS5) 10Jbond: P:base::firewall: Add prometheus hosts to catch all ferm rule [puppet] - 10https://gerrit.wikimedia.org/r/756966 (https://phabricator.wikimedia.org/T291946) [12:05:37] the feature is presented to some users on mobile; some experienced users/testers can manually opt in to the feature on desktop; this patch provides some onboarding specific to desktop users [12:06:10] we will probably start showing the feature to desktop users later today or tomorrow, hence these two backports [12:06:14] I would probably still sync it in two steps, because why not (we should have enough time in the window too) [12:06:17] (03CR) 10Jbond: P:base::firewall: Add prometheus hosts to catch all ferm rule (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/756966 (https://phabricator.wikimedia.org/T291946) (owner: 10Jbond) [12:06:20] but I won’t complain if you do it all at once either [12:06:21] ok [12:06:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317 (T299827)', diff saved to https://phabricator.wikimedia.org/P19132 and previous config saved to /var/cache/conftool/dbconfig/20220125-120625-marostegui.json [12:06:27] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1127.eqiad.wmnet with reason: Maintenance [12:06:28] !log marostegui@cumin1001 END (PASS) - Cookbook
sre.hosts.downtime (exit_code=0) for 6:00:00 on db1127.eqiad.wmnet with reason: Maintenance [12:06:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:06:30] T299827: Add gb_by_central_id column to globalblocks table - https://phabricator.wikimedia.org/T299827 [12:06:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:06:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1127 (T299827)', diff saved to https://phabricator.wikimedia.org/P19133 and previous config saved to /var/cache/conftool/dbconfig/20220125-120632-marostegui.json [12:06:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:06:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:06:54] I have both kids at home due to covid quarantine so I might ask you to take over at some point. but let's see how it goes :) [12:07:05] sure :) [12:07:31] testing my config change on mwdebug1001 [12:08:41] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [12:08:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:08:45] seems to work, syncing [12:10:04] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:755330|Enable statement usage tracking for Armenian Wikipedia (hywiki) (T296382)]] (duration: 00m 50s) [12:10:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:10:09] T296382: Enable statement usage tracking on hywiki - https://phabricator.wikimedia.org/T296382 [12:12:40] just waiting for Zuul now [12:12:45] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [12:12:47] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [12:12:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:12:51] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:13:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T299827)', diff saved to https://phabricator.wikimedia.org/P19134 and previous config saved to /var/cache/conftool/dbconfig/20220125-121343-marostegui.json [12:13:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:13:47] T299827: Add gb_by_central_id column to globalblocks table - https://phabricator.wikimedia.org/T299827 [12:15:01] (03CR) 10JMeybohm: [C: 03+2] Add basic ingress support to chart common_templates [deployment-charts] - 10https://gerrit.wikimedia.org/r/732374 (https://phabricator.wikimedia.org/T290966) (owner: 10JMeybohm) [12:16:44] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [12:16:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:16:50] (03CR) 10JMeybohm: Add ca to multirootca.conf in simple-cfssl (031 comment) [software/cfssl-issuer] - 10https://gerrit.wikimedia.org/r/756616 (https://phabricator.wikimedia.org/T299906) (owner: 10JMeybohm) [12:17:45] !log removal of restbase2011 from cassandra cluster complete [12:17:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:18:39] (03Merged) 10jenkins-bot: Add basic ingress support to chart common_templates [deployment-charts] - 10https://gerrit.wikimedia.org/r/732374 (https://phabricator.wikimedia.org/T290966) (owner: 10JMeybohm) [12:20:45] (03CR) 10JMeybohm: Make a bundle signer return it's root CA (032 comments) [software/cfssl-issuer] - 10https://gerrit.wikimedia.org/r/756546 (https://phabricator.wikimedia.org/T299906) (owner: 10JMeybohm) [12:20:51] 10SRE, 10ops-eqiad: Installation issues on PowerEdge R440 eqiad Ganeti servers with buster / firmware update needed - https://phabricator.wikimedia.org/T299527 (10MoritzMuehlenhoff) >>! 
In T299527#7645818, @Cmjohnson wrote: > @MoritzMuehlenhoff 1013 is finished, ganeti1014 will need me to do a hard power cycl... [12:23:57] (03CR) 10Kosta Harlan: [C: 03+2] linkrecommendation: Add MEDIAWIKI_PROXY_API_BASE_URL [deployment-charts] - 10https://gerrit.wikimedia.org/r/752608 (https://phabricator.wikimedia.org/T298857) (owner: 10Gergő Tisza) [12:24:51] (03PS8) 10Jbond: O:cluster::management: Add reposync [puppet] - 10https://gerrit.wikimedia.org/r/747855 (https://phabricator.wikimedia.org/T229397) [12:25:37] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33426/console" [puppet] - 10https://gerrit.wikimedia.org/r/747855 (https://phabricator.wikimedia.org/T229397) (owner: 10Jbond) [12:25:58] 10SRE-tools, 10Infrastructure-Foundations, 10cloud-services-team (Kanban): spicerack: introduce GridEngine controller - https://phabricator.wikimedia.org/T300032 (10aborrero) [12:26:28] (03Merged) 10jenkins-bot: Add an image: update onboarding images for desktop [extensions/GrowthExperiments] (wmf/1.38.0-wmf.18) - 10https://gerrit.wikimedia.org/r/756941 (https://phabricator.wikimedia.org/T298109) (owner: 10Kosta Harlan) [12:26:31] (03Merged) 10jenkins-bot: Add an image: update onboarding images for desktop [extensions/GrowthExperiments] (wmf/1.38.0-wmf.19) - 10https://gerrit.wikimedia.org/r/756942 (https://phabricator.wikimedia.org/T298109) (owner: 10Kosta Harlan) [12:26:43] yay zuul [12:26:55] Lucas_WMDE: OK i'll start with wmf.19 [12:26:59] \o/ [12:27:25] ok [12:27:48] (03Merged) 10jenkins-bot: linkrecommendation: Add MEDIAWIKI_PROXY_API_BASE_URL [deployment-charts] - 10https://gerrit.wikimedia.org/r/752608 (https://phabricator.wikimedia.org/T298857) (owner: 10Gergő Tisza) [12:28:07] Lucas_WMDE: the directory for wmf.19 doesn't exist yet, so I think there's nothing more to do? 
[12:28:14] probably not [12:28:20] ok, that was easy [12:28:22] on to the next [12:28:24] ^^ [12:28:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P19135 and previous config saved to /var/cache/conftool/dbconfig/20220125-122848-marostegui.json [12:28:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:29:56] 10SRE-tools, 10Infrastructure-Foundations, 10cloud-services-team (Kanban): spicerack: introduce GridEngine controller - https://phabricator.wikimedia.org/T300032 (10aborrero) [12:30:13] (03CR) 10Jbond: reposync: add initial repo sync class (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/747091 (owner: 10Jbond) [12:31:56] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [12:31:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:33:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depool es1031 (T299911)', diff saved to https://phabricator.wikimedia.org/P19136 and previous config saved to /var/cache/conftool/dbconfig/20220125-123303-ladsgroup.json [12:33:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:33:07] T299911: Upgrade es3 to Bullseye - https://phabricator.wikimedia.org/T299911 [12:33:09] (03CR) 10Volans: "reply inline" [software/spicerack] - 10https://gerrit.wikimedia.org/r/755756 (owner: 10Volans) [12:33:12] Lucas_WMDE: it looks good. 
so, I need to sync the image files first [12:33:13] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [12:33:14] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [12:33:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:33:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:33:20] (03PS9) 10Jbond: O:cluster::management: Add reposync [puppet] - 10https://gerrit.wikimedia.org/r/747855 (https://phabricator.wikimedia.org/T229397) [12:33:21] alright [12:33:40] is the syntax for that documented somewhere? [12:34:02] ah, I see it in https://wikitech.wikimedia.org/wiki/Backport_windows/Deployers#Full_deployment [12:34:11] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33427/console" [puppet] - 10https://gerrit.wikimedia.org/r/747855 (https://phabricator.wikimedia.org/T229397) (owner: 10Jbond) [12:34:29] (03PS4) 10Lucas Werkmeister (WMDE): Introduce $wmgEntityUsageModifierLimitsStatement [mediawiki-config] - 10https://gerrit.wikimedia.org/r/754937 (https://phabricator.wikimedia.org/T296384) (owner: 10Noa wmde) [12:34:31] (03PS5) 10Lucas Werkmeister (WMDE): Enable usage tracking for statement for cebwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/754933 (https://phabricator.wikimedia.org/T296384) (owner: 10Noa wmde) [12:35:09] Lucas_WMDE: something like this? 
`scap sync-file php-1.38.0-wmf.18/extensions/GrowthExperiments/images 'Backport (1/2): [[gerrit:756941|Add an image: update onboarding images for desktop (T298109)]]'` [12:35:10] T298109: Add an image: onboarding graphics (desktop) - https://phabricator.wikimedia.org/T298109 [12:35:11] yup, you’ll just need to tab-complete your way into the right directory [12:35:16] hang on [12:35:20] yup, that looks good to me [12:35:48] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [12:35:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:36:32] Lucas_WMDE: thanks. I nearly ran that from `/srv/mediawiki-staging/php-1.38.0-wmf.18` instead of `/srv/mediawiki-staging`. Would that have caused a problem? I assume scap would just not find the file and stop before doing anything problematic [12:36:38] !log kharlan@deploy1002 Synchronized php-1.38.0-wmf.18/extensions/GrowthExperiments/images: Backport (1/2): [[gerrit:756941|Add an image: update onboarding images for desktop (T298109)]] (duration: 00m 50s) [12:36:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:36:46] I don’t know if that would’ve caused a problem [12:37:12] (03PS14) 10Jbond: reposync: add initial repo sync class [puppet] - 10https://gerrit.wikimedia.org/r/747091 [12:37:41] !log kharlan@deploy1002 Synchronized php-1.38.0-wmf.18/extensions/GrowthExperiments/modules: Backport (2/2): [[gerrit:756941|Add an image: update onboarding images for desktop (T298109)]] (duration: 00m 49s) [12:37:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:37:48] OK, I'm done [12:37:52] cool [12:38:02] I don’t think there’s anything else to do [12:38:16] thought about backporting some more fixes for those annoying numRows warnings that keep filling up logstash [12:38:41] but one fix would require a force-merge due to broken CI, and the other is in FlaggedRevs and also required a follow-up 
fix because the initial patch was broken [12:38:45] so I’ll just not do that [12:38:56] !log UTC morning backport window done [12:38:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:39:21] ok [12:39:22] thanks! [12:41:34] PROBLEM - Check for large files in client bucket on ncredir6001 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.136.0.20: Connection reset by peer https://wikitech.wikimedia.org/wiki/Puppet%23check_client_bucket_large_file [12:42:48] (03PS1) 10Ladsgroup: es1031: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/756971 (https://phabricator.wikimedia.org/T299911) [12:43:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1121 (T285149)', diff saved to https://phabricator.wikimedia.org/P19138 and previous config saved to /var/cache/conftool/dbconfig/20220125-124330-marostegui.json [12:43:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:43:35] T285149: Schema change for dropping rev_page_id index - https://phabricator.wikimedia.org/T285149 [12:43:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P19139 and previous config saved to /var/cache/conftool/dbconfig/20220125-124352-marostegui.json [12:43:54] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] es1031: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/756971 (https://phabricator.wikimedia.org/T299911) (owner: 10Ladsgroup) [12:43:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:44:02] PROBLEM - Check size of conntrack table on ncredir6001 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.136.0.20: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [12:46:26] PROBLEM - Check systemd state on ncredir6001 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.136.0.20: Connection reset by peer 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:46:43] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] wmcs: vps: create_instance_with_prefix: drop unused project parameter [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/756587 (owner: 10Arturo Borrero Gonzalez) [12:46:58] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] wmcs: openstack: fix security group functions [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/756588 (owner: 10Arturo Borrero Gonzalez) [12:47:41] (KubernetesRsyslogDown) firing: rsyslog on kubestage1004:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org [12:48:52] PROBLEM - Check the NTP synchronisation status of timesyncd on ncredir6001 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.136.0.20: Connection reset by peer https://wikitech.wikimedia.org/wiki/NTP [12:50:47] !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/api-gateway: sync on production [12:50:49] !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/api-gateway: sync on staging [12:50:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:50:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:51:24] PROBLEM - Check whether ferm is active by checking the default input chain on ncredir6001 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.136.0.20: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [12:51:24] PROBLEM - configured eth on ncredir6001 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.136.0.20: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [12:51:50] RECOVERY - Check for large files in client bucket on ncredir6001 is OK: OK: client bucket file ok https://wikitech.wikimedia.org/wiki/Puppet%23check_client_bucket_large_file [12:51:57] !log hnowlan@deploy1002 helmfile [codfw] DONE 
helmfile.d/services/api-gateway: sync on production [12:51:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:52:28] RECOVERY - Check size of conntrack table on ncredir6001 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [12:52:52] RECOVERY - Check systemd state on ncredir6001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:53:06] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33429/console" [puppet] - 10https://gerrit.wikimedia.org/r/747855 (https://phabricator.wikimedia.org/T229397) (owner: 10Jbond) [12:54:25] (03PS10) 10Jbond: O:cluster::management: Add reposync [puppet] - 10https://gerrit.wikimedia.org/r/747855 (https://phabricator.wikimedia.org/T229397) [12:55:28] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33430/console" [puppet] - 10https://gerrit.wikimedia.org/r/747855 (https://phabricator.wikimedia.org/T229397) (owner: 10Jbond) [12:55:42] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/api-gateway: sync on production [12:55:43] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/api-gateway: sync on staging [12:55:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:55:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:56:22] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/api-gateway: sync on production [12:56:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:58:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1121', diff saved to https://phabricator.wikimedia.org/P19140 and previous config saved to /var/cache/conftool/dbconfig/20220125-125835-marostegui.json 
[12:58:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:58:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T299827)', diff saved to https://phabricator.wikimedia.org/P19141 and previous config saved to /var/cache/conftool/dbconfig/20220125-125857-marostegui.json [12:59:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:59:01] T299827: Add gb_by_central_id column to globalblocks table - https://phabricator.wikimedia.org/T299827 [12:59:03] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2121.codfw.wmnet with reason: Maintenance [12:59:04] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2121.codfw.wmnet with reason: Maintenance [12:59:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:59:05] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 10 hosts with reason: Maintenance [12:59:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:59:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:59:13] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 10 hosts with reason: Maintenance [12:59:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:59:17] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance [12:59:19] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance [12:59:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:59:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:59:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1170:3317 (T299827)', diff saved to 
https://phabricator.wikimedia.org/P19142 and previous config saved to /var/cache/conftool/dbconfig/20220125-125923-marostegui.json [12:59:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T299827)', diff saved to https://phabricator.wikimedia.org/P19143 and previous config saved to /var/cache/conftool/dbconfig/20220125-130032-marostegui.json [13:00:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:01:32] (03CR) 10Jbond: [V: 03+1 C: 03+2] O:cluster::management: Add reposync (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/747855 (https://phabricator.wikimedia.org/T229397) (owner: 10Jbond) [13:01:37] (03CR) 10Jbond: [C: 03+2] reposync: add initial repo sync class [puppet] - 10https://gerrit.wikimedia.org/r/747091 (owner: 10Jbond) [13:02:16] (03CR) 10Jbond: [V: 03+1 C: 03+2] O:cluster::management: Add reposync (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/747855 (https://phabricator.wikimedia.org/T229397) (owner: 10Jbond) [13:03:52] (03CR) 10MF-Warburg: [C: 03+1] incubatorwiki: Increase AbuseFilter thresholds [mediawiki-config] - 10https://gerrit.wikimedia.org/r/756572 (https://phabricator.wikimedia.org/T299868) (owner: 10MarcoAurelio) [13:06:24] PROBLEM - MariaDB read only es3 on es1031 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [13:06:33] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on es1031.eqiad.wmnet with reason: reimage for upgrade - T299911 [13:06:35] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es1031.eqiad.wmnet with reason: reimage for upgrade - T299911 [13:06:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:06:38] T299911: Upgrade es3 to Bullseye - 
https://phabricator.wikimedia.org/T299911 [13:06:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:16] PROBLEM - SSH on mw2258.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:12:15] (03PS1) 10Jbond: C:reposync: create hooks directory [puppet] - 10https://gerrit.wikimedia.org/r/756974 [13:13:09] (03CR) 10Jbond: [C: 03+2] C:reposync: create hooks directory [puppet] - 10https://gerrit.wikimedia.org/r/756974 (owner: 10Jbond) [13:13:35] (03PS2) 10Majavah: wikitech: use ldap-rw.$SITE for ldap access [mediawiki-config] - 10https://gerrit.wikimedia.org/r/752296 (https://phabricator.wikimedia.org/T295150) [13:13:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1121', diff saved to https://phabricator.wikimedia.org/P19144 and previous config saved to /var/cache/conftool/dbconfig/20220125-131340-marostegui.json [13:13:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:55] (03PS1) 10Kosta Harlan: linkrecommendation: Bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/756975 (https://phabricator.wikimedia.org/T298857) [13:15:07] (03CR) 10DCausse: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/756643 (https://phabricator.wikimedia.org/T299797) (owner: 10Bking) [13:15:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P19145 and previous config saved to /var/cache/conftool/dbconfig/20220125-131537-marostegui.json [13:15:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:57] (03CR) 10Majavah: [C: 03+2] wikitech: use ldap-rw.$SITE for ldap access [mediawiki-config] - 10https://gerrit.wikimedia.org/r/752296 (https://phabricator.wikimedia.org/T295150) (owner: 10Majavah) [13:16:09] (03CR) 10Kosta Harlan: [C: 03+2] 
linkrecommendation: Bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/756975 (https://phabricator.wikimedia.org/T298857) (owner: 10Kosta Harlan) [13:16:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Promote es1030 to es2 master T299889', diff saved to https://phabricator.wikimedia.org/P19146 and previous config saved to /var/cache/conftool/dbconfig/20220125-131622-marostegui.json [13:16:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:16:26] T299889: Upgrade es2 to Bullseye - https://phabricator.wikimedia.org/T299889 [13:16:40] (03Merged) 10jenkins-bot: wikitech: use ldap-rw.$SITE for ldap access [mediawiki-config] - 10https://gerrit.wikimedia.org/r/752296 (https://phabricator.wikimedia.org/T295150) (owner: 10Majavah) [13:16:50] 10SRE, 10SRE-Access-Requests, 10Research: Access to analytics-privatedata-users for Research intern AniketArs - https://phabricator.wikimedia.org/T299919 (10AniketArs) Hi Jhathaway, 1. aniket.code.ars@gmail.com 2. Mainly i will be generating `embeddings` of image using one of `tensorflow model` so that i can... 
[13:17:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool es1026 T299889', diff saved to https://phabricator.wikimedia.org/P19147 and previous config saved to /var/cache/conftool/dbconfig/20220125-131727-marostegui.json [13:17:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:18:05] (03PS1) 10Marostegui: es1026: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/756976 (https://phabricator.wikimedia.org/T299889) [13:18:59] (03CR) 10Marostegui: [C: 03+2] es1026: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/756976 (https://phabricator.wikimedia.org/T299889) (owner: 10Marostegui) [13:19:19] !log taavi@deploy1002 Synchronized wmf-config/wikitech.php: Config: [[gerrit:752296|wikitech: use ldap-rw.$SITE for ldap access (T295150)]] (duration: 00m 49s) [13:19:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:24] T295150: Rename ldap-labs cluster - https://phabricator.wikimedia.org/T295150 [13:19:42] (03Merged) 10jenkins-bot: linkrecommendation: Bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/756975 (https://phabricator.wikimedia.org/T298857) (owner: 10Kosta Harlan) [13:20:10] RECOVERY - Check the NTP synchronisation status of timesyncd on ncredir6001 is OK: OK: synced at Tue 2022-01-25 13:20:09 UTC. 
https://wikitech.wikimedia.org/wiki/NTP [13:20:35] !log kharlan@deploy1002 helmfile [staging] START helmfile.d/services/linkrecommendation: apply on staging [13:20:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:37] !log kharlan@deploy1002 helmfile [staging] DONE helmfile.d/services/linkrecommendation: apply on internal [13:20:38] !log kharlan@deploy1002 helmfile [staging] DONE helmfile.d/services/linkrecommendation: apply on external [13:20:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:23] (03CR) 10Filippo Giunchedi: "LGTM! Modulo comment re: role_hosts" [puppet] - 10https://gerrit.wikimedia.org/r/756966 (https://phabricator.wikimedia.org/T291946) (owner: 10Jbond) [13:21:24] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [13:21:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:48] RECOVERY - Check whether ferm is active by checking the default input chain on ncredir6001 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [13:21:48] RECOVERY - configured eth on ncredir6001 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [13:22:01] !log kharlan@deploy1002 helmfile [staging] DONE helmfile.d/services/linkrecommendation: sync on staging [13:22:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:06] \o, I'm working on a deployment of the linkrecommendation chart, but getting timeouts when attempting to verify the deployment on staging [13:25:03] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.reimage for host es1031.eqiad.wmnet with OS bullseye [13:25:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:53] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE 
helmfile.d/services/mwdebug: sync on pinkunicorn [13:25:54] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [13:25:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:51] !log volans@cumin1001 START - Cookbook sre.hosts.reimage for host es1022.eqiad.wmnet with OS bullseye [13:26:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:59] 10SRE, 10ops-eqiad, 10DBA: es1022 troubles with PXE - https://phabricator.wikimedia.org/T299123 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by volans@cumin1001 for host es1022.eqiad.wmnet with OS bullseye [13:27:08] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [13:27:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:28:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1121 (T285149)', diff saved to https://phabricator.wikimedia.org/P19148 and previous config saved to /var/cache/conftool/dbconfig/20220125-132844-marostegui.json [13:28:46] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1160.eqiad.wmnet with reason: Maintenance [13:28:48] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1160.eqiad.wmnet with reason: Maintenance [13:28:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:28:49] T285149: Schema change for dropping rev_page_id index - https://phabricator.wikimedia.org/T285149 [13:28:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:28:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1160 (T285149)', diff saved to https://phabricator.wikimedia.org/P19149 and previous config saved to 
/var/cache/conftool/dbconfig/20220125-132852-marostegui.json [13:28:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:28:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:29:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1160 (T285149)', diff saved to https://phabricator.wikimedia.org/P19150 and previous config saved to /var/cache/conftool/dbconfig/20220125-132958-marostegui.json [13:30:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:26] !log jelto@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on gitlab-runner1001.eqiad.wmnet with reason: move gitlab-runner1001 to new ganeti row [13:30:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:28] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on gitlab-runner1001.eqiad.wmnet with reason: move gitlab-runner1001 to new ganeti row [13:30:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P19151 and previous config saved to /var/cache/conftool/dbconfig/20220125-133042-marostegui.json [13:30:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:53] !log oblivian@puppetmaster1001 conftool action : set/weight=1; selector: dc=drmrs,cluster=ncredir [13:30:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:15] !log oblivian@puppetmaster1001 conftool action : set/pooled=yes; selector: dc=drmrs,cluster=ncredir,name=ncredir6001.drmrs.wmnet [13:31:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:33:24] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 4 days, 0:00:00 on ganeti1005.eqiad.wmnet with reason: Remove from Ganeti 
cluster for reimage [13:33:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:33:27] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4 days, 0:00:00 on ganeti1005.eqiad.wmnet with reason: Remove from Ganeti cluster for reimage [13:33:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:33:32] <_joe_> !log restarted pybal on lvs6003 [13:33:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:00] RECOVERY - PyBal connections to etcd on lvs6003 is OK: OK: 16 connections established with conf1006.eqiad.wmnet:4001 (min=16) https://wikitech.wikimedia.org/wiki/PyBal [13:35:18] RECOVERY - PyBal IPVS diff check on lvs6003 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [13:36:59] (03PS6) 10Jbond: P:base::firewall: Add prometheus hosts to catch all ferm rule [puppet] - 10https://gerrit.wikimedia.org/r/756966 (https://phabricator.wikimedia.org/T291946) [13:37:02] (03PS1) 10Tks4Fish: bgwiki: Add 'wgNamespaceRobotPolicies' for Draft (Talk) namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/756978 (https://phabricator.wikimedia.org/T299224) [13:38:56] !log jelto@cumin1001 START - Cookbook sre.hosts.decommission for hosts gitlab-runner1001.eqiad.wmnet [13:38:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:34] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33431/console" [puppet] - 10https://gerrit.wikimedia.org/r/756966 (https://phabricator.wikimedia.org/T291946) (owner: 10Jbond) [13:40:46] (03CR) 10Jbond: [V: 03+1] "updated" [puppet] - 10https://gerrit.wikimedia.org/r/756966 (https://phabricator.wikimedia.org/T291946) (owner: 10Jbond) [13:41:58] 10SRE, 10ops-eqiad: Installation issues on PowerEdge R440 eqiad Ganeti servers with buster / firmware update needed - 
https://phabricator.wikimedia.org/T299527 (10MoritzMuehlenhoff) [13:42:13] 10SRE, 10ops-eqiad: Installation issues on PowerEdge R440 eqiad Ganeti servers with buster / firmware update needed - https://phabricator.wikimedia.org/T299527 (10MoritzMuehlenhoff) One more server is ready and downtimed; ganeti1005 [13:43:30] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host es1026.eqiad.wmnet with OS bullseye [13:43:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:55] (03PS1) 10Filippo Giunchedi: prometheus: refactor rsync in a standalone profile [puppet] - 10https://gerrit.wikimedia.org/r/756979 (https://phabricator.wikimedia.org/T296199) [13:44:50] (03PS7) 10Jbond: WIP: add reposync [software/spicerack] - 10https://gerrit.wikimedia.org/r/747116 [13:45:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1160', diff saved to https://phabricator.wikimedia.org/P19153 and previous config saved to /var/cache/conftool/dbconfig/20220125-134503-marostegui.json [13:45:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T299827)', diff saved to https://phabricator.wikimedia.org/P19154 and previous config saved to /var/cache/conftool/dbconfig/20220125-134547-marostegui.json [13:45:51] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1174.eqiad.wmnet with reason: Maintenance [13:45:53] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1174.eqiad.wmnet with reason: Maintenance [13:45:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:55] T299827: Add gb_by_central_id column to globalblocks table - https://phabricator.wikimedia.org/T299827 [13:45:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:58] !log marostegui@cumin1001 
dbctl commit (dc=all): 'Depooling db1174 (T299827)', diff saved to https://phabricator.wikimedia.org/P19155 and previous config saved to /var/cache/conftool/dbconfig/20220125-134557-marostegui.json [13:45:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:35] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/756966 (https://phabricator.wikimedia.org/T291946) (owner: 10Jbond) [13:46:44] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts gitlab-runner1001.eqiad.wmnet [13:46:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T299827)', diff saved to https://phabricator.wikimedia.org/P19156 and previous config saved to /var/cache/conftool/dbconfig/20220125-134704-marostegui.json [13:47:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:48:12] PROBLEM - PyBal BGP sessions are established on lvs6003 is CRITICAL: 0 le 0 https://wikitech.wikimedia.org/wiki/PyBal https://grafana.wikimedia.org/dashboard/db/pybal-bgp?var-datasource=drmrs+prometheus/ops [13:48:41] !log volans@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host es1022.eqiad.wmnet with OS bullseye [13:48:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:48:49] 10SRE, 10ops-eqiad, 10DBA: es1022 troubles with PXE - https://phabricator.wikimedia.org/T299123 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by volans@cumin1001 for host es1022.eqiad.wmnet with OS bullseye executed with errors: - es1022 (**FAIL**) - Downtimed on Icinga - Disabled P... 
[13:50:45] !log volans@cumin1001 START - Cookbook sre.hosts.reimage for host es1022.eqiad.wmnet with OS bullseye [13:50:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:49] 10SRE, 10ops-codfw: Reset db2086's idrac - https://phabricator.wikimedia.org/T299882 (10Papaul) @maostegui I have time to do this today. I will ping you once on site to depool the server. Thanks [13:50:54] 10SRE, 10ops-eqiad, 10DBA: es1022 troubles with PXE - https://phabricator.wikimedia.org/T299123 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by volans@cumin1001 for host es1022.eqiad.wmnet with OS bullseye [13:50:58] 10SRE, 10ops-codfw: Reset db2086's idrac - https://phabricator.wikimedia.org/T299882 (10Papaul) p:05Triage→03Medium [13:51:12] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={gitlab,gitlab_runner} site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:51:23] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33432/console" [puppet] - 10https://gerrit.wikimedia.org/r/756979 (https://phabricator.wikimedia.org/T296199) (owner: 10Filippo Giunchedi) [13:52:09] RECOVERY - PyBal IPVS diff check on lvs6001 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [13:52:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2086 (s7,s8) T299882', diff saved to https://phabricator.wikimedia.org/P19157 and previous config saved to /var/cache/conftool/dbconfig/20220125-135212-marostegui.json [13:52:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:17] T299882: Reset db2086's idrac - https://phabricator.wikimedia.org/T299882 [13:53:19] (03PS1) 10Marostegui: db2086: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/756981 
(https://phabricator.wikimedia.org/T299882) [13:53:29] RECOVERY - PyBal connections to etcd on lvs6001 is OK: OK: 12 connections established with conf1006.eqiad.wmnet:4001 (min=12) https://wikitech.wikimedia.org/wiki/PyBal [13:53:37] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:53:38] 10SRE, 10ops-codfw, 10Patch-For-Review: Reset db2086's idrac - https://phabricator.wikimedia.org/T299882 (10Marostegui) @Papaul the host is now depooled and with mysql stopped. You can reboot it or power it off anytime you want [13:54:09] (03CR) 10Marostegui: [C: 03+2] db2086: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/756981 (https://phabricator.wikimedia.org/T299882) (owner: 10Marostegui) [13:55:43] !log volans@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host es1022.eqiad.wmnet with OS bullseye [13:55:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:50] 10SRE, 10ops-eqiad, 10DBA: es1022 troubles with PXE - https://phabricator.wikimedia.org/T299123 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by volans@cumin1001 for host es1022.eqiad.wmnet with OS bullseye executed with errors: - es1022 (**FAIL**) - Removed from Puppet and PuppetDB i... 
[13:55:53] PROBLEM - PyBal BGP sessions are established on lvs6001 is CRITICAL: 0 le 0 https://wikitech.wikimedia.org/wiki/PyBal https://grafana.wikimedia.org/dashboard/db/pybal-bgp?var-datasource=drmrs+prometheus/ops [13:56:04] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host es1031.eqiad.wmnet with OS bullseye [13:56:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:49] (03CR) 10jerkins-bot: [V: 04-1] WIP: add reposync [software/spicerack] - 10https://gerrit.wikimedia.org/r/747116 (owner: 10Jbond) [13:57:07] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33433/console" [puppet] - 10https://gerrit.wikimedia.org/r/756979 (https://phabricator.wikimedia.org/T296199) (owner: 10Filippo Giunchedi) [13:59:26] I ran some commands on the wrong hosts, if you see anything about es1034, let me know [13:59:44] thankfully I didn't restart mysql [14:00:06] jouncebot: next [14:00:06] In 2 hour(s) and 59 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220125T1700) [14:00:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1160', diff saved to https://phabricator.wikimedia.org/P19158 and previous config saved to /var/cache/conftool/dbconfig/20220125-140008-marostegui.json [14:00:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:01:55] (03PS1) 10Majavah: openstack: haproxy site definition is not a profile [puppet] - 10https://gerrit.wikimedia.org/r/756982 [14:01:57] (03PS1) 10Majavah: openstack::haproxy: add more flexibility for frontends [puppet] - 10https://gerrit.wikimedia.org/r/756983 [14:02:02] (03CR) 10Filippo Giunchedi: [V: 03+1 C: 03+2] cirrus: move to search.d.w cert [puppet] - 10https://gerrit.wikimedia.org/r/756595 (https://phabricator.wikimedia.org/T299633) (owner: 10Filippo Giunchedi) 
[14:02:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P19159 and previous config saved to /var/cache/conftool/dbconfig/20220125-140209-marostegui.json [14:02:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:02] (03CR) 10Ottomata: P:installserver::proxy: Add domain whitelist to proxy (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/753029 (https://phabricator.wikimedia.org/T298087) (owner: 10Jbond) [14:03:04] (03CR) 10jerkins-bot: [V: 04-1] openstack::haproxy: add more flexibility for frontends [puppet] - 10https://gerrit.wikimedia.org/r/756983 (owner: 10Majavah) [14:03:41] (03CR) 10jerkins-bot: [V: 04-1] openstack: haproxy site definition is not a profile [puppet] - 10https://gerrit.wikimedia.org/r/756982 (owner: 10Majavah) [14:03:41] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={gitlab,gitlab_runner} site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:05:50] (03PS2) 10Majavah: openstack: haproxy site definition is not a profile [puppet] - 10https://gerrit.wikimedia.org/r/756982 [14:05:52] (03PS2) 10Majavah: openstack::haproxy: add more flexibility for frontends [puppet] - 10https://gerrit.wikimedia.org/r/756983 [14:06:51] (03CR) 10Jbond: [V: 03+1 C: 03+2] P:base::firewall: Add prometheus hosts to catch all ferm rule [puppet] - 10https://gerrit.wikimedia.org/r/756966 (https://phabricator.wikimedia.org/T291946) (owner: 10Jbond) [14:06:53] (03CR) 10jerkins-bot: [V: 04-1] openstack::haproxy: add more flexibility for frontends [puppet] - 10https://gerrit.wikimedia.org/r/756983 (owner: 10Majavah) [14:07:37] (03PS1) 10Eigyan: [wmf-config] Undeploy gdi survey on cawiki beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/756985 (https://phabricator.wikimedia.org/T299913) [14:07:45] RECOVERY - 
Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:08:48] (03CR) 10Majavah: "pcc: https://puppet-compiler.wmflabs.org/pcc-worker1001/33436/" [puppet] - 10https://gerrit.wikimedia.org/r/756982 (owner: 10Majavah) [14:09:23] (03PS3) 10Majavah: openstack::haproxy: add more flexibility for frontends [puppet] - 10https://gerrit.wikimedia.org/r/756983 [14:09:41] RECOVERY - SSH on mw2258.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:10:23] (03PS1) 10Jbond: Revert "P:base::firewall: Add prometheus hosts to catch all ferm rule" [puppet] - 10https://gerrit.wikimedia.org/r/756987 [14:10:37] (03CR) 10Jbond: [V: 03+2 C: 03+2] Revert "P:base::firewall: Add prometheus hosts to catch all ferm rule" [puppet] - 10https://gerrit.wikimedia.org/r/756987 (owner: 10Jbond) [14:10:55] (03PS1) 10Jbond: P:base::firewall: Add prometheus hosts to catch all ferm rule [puppet] - 10https://gerrit.wikimedia.org/r/756988 (https://phabricator.wikimedia.org/T291946) [14:10:57] (03PS1) 10Majavah: Review access change [puppet] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/756989 [14:12:30] (03PS1) 10Marostegui: Revert "es1026: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/756990 [14:12:47] (Device rebooted) firing: Device rebooted - https://alerts.wikimedia.org [14:13:12] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host es1026.eqiad.wmnet with OS bullseye [14:13:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:19] PROBLEM - Check systemd state on kubestagetcd2002 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:13:25] PROBLEM - Check systemd 
state on mw2367 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:13:31] PROBLEM - Widespread puppet agent failures on alert1001 is CRITICAL: 0.04454 ge 0.01 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [14:13:37] PROBLEM - Check systemd state on clouddb1018 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:13:37] PROBLEM - Check systemd state on ganeti2017 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:13:39] PROBLEM - Check systemd state on ms-be2056 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:13:41] PROBLEM - Check systemd state on mc2021 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:13:42] PROBLEM - Check systemd state on mw2381 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:13:46] ^^ this was me; the change has been reverted, will clean up now [14:14:03] PROBLEM - Check systemd state on otrs1001 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:14:05] PROBLEM - Check systemd state on db1103 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:14:07] PROBLEM - Check systemd state on kubernetes1014 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:14:07] PROBLEM - Check systemd state on ms-be1042 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:14:10] (03CR) 10Majavah: "pcc: https://puppet-compiler.wmflabs.org/pcc-worker1003/33438/" [puppet] - 10https://gerrit.wikimedia.org/r/756983 (owner: 10Majavah) [14:14:15] PROBLEM - Check systemd state on mw2294 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:14:19] PROBLEM - Check systemd state on mx2001 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:14:25] PROBLEM - Check systemd state on restbase1022 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:14:27] PROBLEM - Check systemd state on db1164 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:14:29] PROBLEM - Check systemd state on analytics1058 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:14:31] PROBLEM - Check systemd state on grafana2001 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:14:31] PROBLEM - Check systemd state on db1154 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:14:32] PROBLEM - Check systemd state on mc1054 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:14:33] PROBLEM - Check systemd state on restbase2015 is CRITICAL: 
CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:14:33] PROBLEM - Check systemd state on irc1001 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:14:35] PROBLEM - Check systemd state on db2149 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:14:37] PROBLEM - Check systemd state on ms-be1059 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:14:37] PROBLEM - Check systemd state on ganeti2023 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:14:41] PROBLEM - Check systemd state on ores2005 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:14:42] PROBLEM - Check systemd state on db2079 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:14:42] PROBLEM - Check systemd state on ldap-replica1003 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:14:51] PROBLEM - Check systemd state on logstash2024 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:14:51] PROBLEM - Check systemd state on db1119 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:14:52] PROBLEM - Check systemd state on db1109 is CRITICAL: CRITICAL - degraded: The following units 
failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:14:53] PROBLEM - Check systemd state on dragonfly-supernode2001 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:14:55] PROBLEM - Check systemd state on kubernetes2015 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:14:55] PROBLEM - Check systemd state on mc2020 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:14:57] PROBLEM - Check systemd state on ms-be1036 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:14:57] PROBLEM - Check systemd state on logstash1034 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:14:57] PROBLEM - Check systemd state on ms-be2042 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:15:00] (03PS2) 10Majavah: Review access change [puppet] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/756989 [14:15:01] PROBLEM - Check systemd state on mw1406 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:15:02] PROBLEM - Check systemd state on restbase1030 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:15:02] PROBLEM - Check systemd state on restbase1021 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state 
[14:15:07] PROBLEM - Check systemd state on druid1006 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:15:07] PROBLEM - Check systemd state on prometheus2004 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:15:09] PROBLEM - Check systemd state on wtp1030 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:15:11] PROBLEM - Check systemd state on miscweb1002 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:15:13] PROBLEM - Check systemd state on parse2008 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:15:13] PROBLEM - Check systemd state on people2002 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:15:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1160 (T285149)', diff saved to https://phabricator.wikimedia.org/P19160 and previous config saved to /var/cache/conftool/dbconfig/20220125-141513-marostegui.json [14:15:14] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1149.eqiad.wmnet with reason: Maintenance [14:15:15] PROBLEM - Check systemd state on poolcounter1004 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:15:16] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1149.eqiad.wmnet with reason: Maintenance [14:15:17] PROBLEM - Check systemd state on parse2020 is CRITICAL: CRITICAL - 
degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:15:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:18] T285149: Schema change for dropping rev_page_id index - https://phabricator.wikimedia.org/T285149 [14:15:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1149 (T285149)', diff saved to https://phabricator.wikimedia.org/P19161 and previous config saved to /var/cache/conftool/dbconfig/20220125-141520-marostegui.json [14:15:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1026 (re)pooling @ 1%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19162 and previous config saved to /var/cache/conftool/dbconfig/20220125-141520-root.json [14:15:22] RECOVERY - Check systemd state on clouddb1018 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:15:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:25] PROBLEM - Check systemd state on mw2312 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:15:25] PROBLEM - Check systemd state on mw2278 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:15:25] PROBLEM - Check systemd state on mw2368 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:15:25] PROBLEM - Check systemd state on ml-etcd2002 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:15:27] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:29] (03CR) 10Marostegui: [C: 03+2] Revert "es1026: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/756990 (owner: 10Marostegui) [14:15:29] PROBLEM - Check systemd state on ping1002 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:15:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es1031 (T299911)', diff saved to https://phabricator.wikimedia.org/P19163 and previous config saved to /var/cache/conftool/dbconfig/20220125-141538-ladsgroup.json [14:15:39] PROBLEM - Check systemd state on restbase2017 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:15:41] PROBLEM - Check systemd state on mw1405 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:15:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:43] T299911: Upgrade es3 to Bullseye - https://phabricator.wikimedia.org/T299911 [14:15:45] PROBLEM - Check systemd state on wdqs1012 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:15:47] RECOVERY - Check systemd state on db1103 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:15:52] PROBLEM - Check systemd state on mw2291 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:15:53] PROBLEM - Check systemd state on mw1391 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:16:01] (03CR) 10Majavah: "This access change lets members of the NDA and WMF groups (who have permissions to launch manual Jenkins jobs which lets them use the `pcc" [puppet] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/756989 (owner: 10Majavah) [14:16:05] PROBLEM - Check systemd state on parse2019 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:16:05] PROBLEM - Check systemd state on mw2352 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:16:07] PROBLEM - Check systemd state on mw1360 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:16:09] RECOVERY - Check systemd state on db1164 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:16:13] RECOVERY - Check systemd state on analytics1058 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:16:15] RECOVERY - Check systemd state on grafana2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:16:15] RECOVERY - Check systemd state on db1154 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:16:17] RECOVERY - Check systemd state on mc1054 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:16:19] RECOVERY - Check systemd state on irc1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:16:21] RECOVERY - Check systemd state on 
db2149 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:16:23] RECOVERY - Check systemd state on ganeti2023 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:16:31] PROBLEM - Check systemd state on parse2004 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:16:31] RECOVERY - Check systemd state on ldap-replica1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:16:31] RECOVERY - Check systemd state on db2079 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:16:43] RECOVERY - Check systemd state on logstash2024 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:16:45] RECOVERY - Check systemd state on db1119 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:16:45] RECOVERY - Check systemd state on db1109 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:16:47] RECOVERY - Check systemd state on dragonfly-supernode2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:16:49] RECOVERY - Check systemd state on kubernetes2015 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:16:49] RECOVERY - Check systemd state on mc2020 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:16:51] PROBLEM - Check systemd state on mw1321 is CRITICAL: CRITICAL - 
degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:16:51] RECOVERY - Check systemd state on ms-be1036 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:16:52] RECOVERY - Check systemd state on logstash1034 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:16:52] RECOVERY - Check systemd state on ms-be2042 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:16:59] PROBLEM - Check systemd state on mwlog2002 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:16:59] RECOVERY - Check systemd state on kubestagetcd2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:17:03] RECOVERY - Check systemd state on druid1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:17:07] PROBLEM - Check systemd state on prometheus2003 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:17:09] RECOVERY - Check systemd state on miscweb1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:17:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P19164 and previous config saved to /var/cache/conftool/dbconfig/20220125-141714-marostegui.json [14:17:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:20] what did I do? 
[14:17:21] (03CR) 10EllenR: [C: 03+1] [wmf-config] Undeploy gdi survey on cawiki beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/756985 (https://phabricator.wikimedia.org/T299913) (owner: 10Eigyan) [14:17:21] RECOVERY - Check systemd state on ganeti2017 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:17:25] RECOVERY - Check systemd state on ms-be2056 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:17:25] RECOVERY - Check systemd state on mc2021 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:17:27] RECOVERY - Check systemd state on mw2312 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:17:27] RECOVERY - Check systemd state on mw2278 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:17:27] RECOVERY - Check systemd state on mw2368 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:17:27] RECOVERY - Check systemd state on mw2381 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:17:27] RECOVERY - Check systemd state on ml-etcd2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:17:45] RECOVERY - Check systemd state on mw1405 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:17:47] (Device rebooted) resolved: Device rebooted - https://alerts.wikimedia.org [14:17:47] RECOVERY - Check systemd state on otrs1001 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:17:53] RECOVERY - Check systemd state on kubernetes1014 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:17:53] RECOVERY - Check systemd state on ms-be1042 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:17:55] RECOVERY - Check systemd state on mw2291 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:17:55] RECOVERY - Check systemd state on mw1391 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:18:01] RECOVERY - Check systemd state on mw2294 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:18:05] RECOVERY - Check systemd state on mx2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:18:07] RECOVERY - Check systemd state on parse2019 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:18:09] RECOVERY - Check systemd state on mw2352 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:18:11] RECOVERY - Check systemd state on mw1360 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:18:13] RECOVERY - Check systemd state on restbase1022 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:18:21] RECOVERY - Check systemd state on restbase2015 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:18:25] 
RECOVERY - Check systemd state on ms-be1059 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:18:33] RECOVERY - Check systemd state on parse2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:18:33] RECOVERY - Check systemd state on ores2005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:18:55] RECOVERY - Check systemd state on mw1321 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:19:01] RECOVERY - Check systemd state on mw1406 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:19:05] RECOVERY - Check systemd state on restbase1030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:19:05] RECOVERY - Check systemd state on restbase1021 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:19:07] RECOVERY - Check systemd state on mwlog2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:19:11] RECOVERY - Check systemd state on prometheus2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:19:13] RECOVERY - Check systemd state on wtp1030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:19:13] RECOVERY - Check systemd state on mw2367 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:19:13] RECOVERY - Check systemd state on prometheus2003 is OK: OK - running: The 
system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:19:17] RECOVERY - Check systemd state on parse2008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:19:17] RECOVERY - Check systemd state on people2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:19:17] RECOVERY - Check systemd state on poolcounter1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:19:18] (03PS1) 10Ladsgroup: Revert "es1031: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/756991 [14:19:19] RECOVERY - Check systemd state on parse2020 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:19:31] RECOVERY - Check systemd state on ping1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:19:39] Amir1: you didn't do anything :-) these were caused by https://gerrit.wikimedia.org/r/756966 [14:19:45] RECOVERY - Check systemd state on restbase2017 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:19:49] RECOVERY - Check systemd state on wdqs1012 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:19:52] (03PS2) 10Ladsgroup: Revert "es1031: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/756991 [14:19:56] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] Revert "es1031: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/756991 (owner: 10Ladsgroup) [14:20:22] moritzm: Every issue that happens is my fault even if proven otherwise [14:21:04] (03CR) 10Bking: [C: 03+2] deployment-prep: add 3 elastic nodes in 
preparation for bullseye upgrade (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/756643 (https://phabricator.wikimedia.org/T299797) (owner: 10Bking) [14:21:15] sorry for the scare Amir1, this one's on me ;) [14:23:18] !log jelto@cumin1001 START - Cookbook sre.dns.wipe-cache gitlab-runner1001.eqiad.wmnet on all recursors [14:23:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:21] !log jelto@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) gitlab-runner1001.eqiad.wmnet on all recursors [14:23:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:40] (03PS3) 10Majavah: Review access change [puppet] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/756989 [14:24:36] (03PS5) 10Bking: deployment-prep: add 3 elastic nodes in preparation for bullseye upgrade [puppet] - 10https://gerrit.wikimedia.org/r/756643 (https://phabricator.wikimedia.org/T299797) [14:24:55] (03CR) 10Bking: [V: 03+2 C: 03+2] deployment-prep: add 3 elastic nodes in preparation for bullseye upgrade [puppet] - 10https://gerrit.wikimedia.org/r/756643 (https://phabricator.wikimedia.org/T299797) (owner: 10Bking) [14:25:44] (03PS2) 10Jbond: P:base::firewall: Add prometheus hosts to catch all ferm rule [puppet] - 10https://gerrit.wikimedia.org/r/756988 (https://phabricator.wikimedia.org/T291946) [14:26:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove logpager from s8 eqiad T263127', diff saved to https://phabricator.wikimedia.org/P19165 and previous config saved to /var/cache/conftool/dbconfig/20220125-142614-marostegui.json [14:26:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:26:19] T263127: Remove groups from db configs - https://phabricator.wikimedia.org/T263127 [14:26:38] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33439/console" [puppet] - 10https://gerrit.wikimedia.org/r/756988 
(https://phabricator.wikimedia.org/T291946) (owner: 10Jbond) [14:27:38] (03CR) 10Jbond: [V: 03+1 C: 03+2] P:base::firewall: Add prometheus hosts to catch all ferm rule [puppet] - 10https://gerrit.wikimedia.org/r/756988 (https://phabricator.wikimedia.org/T291946) (owner: 10Jbond) [14:30:13] (03PS1) 10Ladsgroup: es1034: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/757008 (https://phabricator.wikimedia.org/T299911) [14:30:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1026 (re)pooling @ 5%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19166 and previous config saved to /var/cache/conftool/dbconfig/20220125-143024-root.json [14:30:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es1031', diff saved to https://phabricator.wikimedia.org/P19167 and previous config saved to /var/cache/conftool/dbconfig/20220125-143043-ladsgroup.json [14:30:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T299827)', diff saved to https://phabricator.wikimedia.org/P19168 and previous config saved to /var/cache/conftool/dbconfig/20220125-143218-marostegui.json [14:32:21] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1158.eqiad.wmnet with reason: Maintenance [14:32:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:23] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1158.eqiad.wmnet with reason: Maintenance [14:32:23] T299827: Add gb_by_central_id column to globalblocks table - https://phabricator.wikimedia.org/T299827 [14:32:24] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: 
Maintenance [14:32:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:28] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [14:32:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1158 (T299827)', diff saved to https://phabricator.wikimedia.org/P19169 and previous config saved to /var/cache/conftool/dbconfig/20220125-143232-marostegui.json [14:32:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:45] (Device rebooted) firing: Device rebooted - https://alerts.wikimedia.org [14:33:20] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] es1034: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/757008 (https://phabricator.wikimedia.org/T299911) (owner: 10Ladsgroup) [14:33:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T299827)', diff saved to https://phabricator.wikimedia.org/P19170 and previous config saved to /var/cache/conftool/dbconfig/20220125-143338-marostegui.json [14:33:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:09] (03PS1) 10Dzahn: add emoji to index.html for demonstration purposes [container/miscweb] - 10https://gerrit.wikimedia.org/r/757009 [14:36:55] (LogstashKafkaConsumerLag) firing: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org [14:37:26] RECOVERY - k8s API server requests latencies on ml-serve-ctrl2002 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [14:37:45] (Device rebooted) resolved: Device rebooted - https://alerts.wikimedia.org [14:42:50] (Device rebooted) firing: Device rebooted - https://alerts.wikimedia.org [14:43:00] RECOVERY - Widespread puppet agent failures on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.004345 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [14:43:00] PROBLEM - Elasticsearch HTTPS for relforge-eqiad on relforge1003 is CRITICAL: SSL CRITICAL - failed to verify search.svc.eqiad.wmnet against relforge.svc.eqiad.wmnet, relforge1003.eqiad.wmnet, relforge1004.eqiad.wmnet https://wikitech.wikimedia.org/wiki/Search [14:43:46] godog: ^ see that ssl alert, maybe related to your search ssl change? [14:44:04] taavi: mmhh yes, thank you, taking a look [14:44:16] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /_info (retrieve service info) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [14:45:22] PROBLEM - Elasticsearch HTTPS for relforge-eqiad-small-alpha on relforge1003 is CRITICAL: SSL CRITICAL - failed to verify search.svc.eqiad.wmnet against relforge.svc.eqiad.wmnet, relforge1003.eqiad.wmnet, relforge1004.eqiad.wmnet https://wikitech.wikimedia.org/wiki/Search [14:45:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1026 (re)pooling @ 10%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19171 and previous config saved to /var/cache/conftool/dbconfig/20220125-144528-root.json [14:45:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es1031', diff saved to https://phabricator.wikimedia.org/P19172 and previous config saved to /var/cache/conftool/dbconfig/20220125-144548-ladsgroup.json [14:45:50] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:08] PROBLEM - Check systemd state on kubernetes2015 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:47:50] (Device rebooted) resolved: Device rebooted - https://alerts.wikimedia.org [14:48:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P19173 and previous config saved to /var/cache/conftool/dbconfig/20220125-144843-marostegui.json [14:48:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:14] RECOVERY - Check systemd state on kubernetes2015 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:49:15] 10SRE, 10SRE-tools, 10Infrastructure-Foundations: sre.ganeti.makevm: Allow passing a secondary disk - https://phabricator.wikimedia.org/T300046 (10MoritzMuehlenhoff) [14:49:24] 10SRE, 10SRE-tools, 10Infrastructure-Foundations: sre.ganeti.makevm: Allow passing a secondary disk - https://phabricator.wikimedia.org/T300046 (10MoritzMuehlenhoff) p:05Triage→03Low [14:50:42] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [14:50:44] (03PS4) 10Elukey: logstash: improve filter for ORES [puppet] - 10https://gerrit.wikimedia.org/r/756959 (https://phabricator.wikimedia.org/T299999) [14:51:55] (LogstashKafkaConsumerLag) firing: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org [14:52:57] (03CR) 10jerkins-bot: [V: 04-1] logstash: improve filter for ORES [puppet] - 10https://gerrit.wikimedia.org/r/756959 (https://phabricator.wikimedia.org/T299999) (owner: 10Elukey) [14:53:45] PROBLEM - Elasticsearch HTTPS for relforge-eqiad on relforge1004 is CRITICAL: SSL CRITICAL - 
failed to verify search.svc.eqiad.wmnet against relforge.svc.eqiad.wmnet, relforge1003.eqiad.wmnet, relforge1004.eqiad.wmnet https://wikitech.wikimedia.org/wiki/Search [14:54:18] (03PS1) 10Jbond: WIP: drop individual promethus node ferm rules [puppet] - 10https://gerrit.wikimedia.org/r/757010 [14:55:19] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (bad URL) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [14:56:19] (03CR) 10jerkins-bot: [V: 04-1] WIP: drop individual promethus node ferm rules [puppet] - 10https://gerrit.wikimedia.org/r/757010 (owner: 10Jbond) [14:56:21] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [14:56:55] (LogstashKafkaConsumerLag) resolved: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org [14:58:21] RECOVERY - Confd template for /srv/config-master/pybal/codfw/ncredir on puppetmaster1001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [14:58:25] RECOVERY - Confd template for /srv/config-master/pybal/codfw/ncredir-https on puppetmaster1001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [14:58:25] RECOVERY - Confd template for /srv/config-master/pybal/ulsfo/ncredir-https on puppetmaster1001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [14:58:25] RECOVERY - Confd template for /srv/config-master/pybal/drmrs/ncredir on puppetmaster1001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [14:59:02] (03PS1) 10Kosta Harlan: Revert "linkrecommendation: Bump chart version" [deployment-charts] - 10https://gerrit.wikimedia.org/r/756992 [14:59:09] (03CR) 10Dzahn: [C: 03+2] add emoji to index.html for demonstration purposes [container/miscweb] - 10https://gerrit.wikimedia.org/r/757009 (owner: 
10Dzahn) [14:59:11] RECOVERY - Confd template for /srv/config-master/pybal/eqsin/ncredir on puppetmaster1001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [14:59:12] (03CR) 10Kosta Harlan: [C: 03+2] Revert "linkrecommendation: Bump chart version" [deployment-charts] - 10https://gerrit.wikimedia.org/r/756992 (owner: 10Kosta Harlan) [14:59:13] RECOVERY - Confd template for /srv/config-master/pybal/eqiad/ncredir on puppetmaster1001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [14:59:15] RECOVERY - Confd template for /srv/config-master/pybal/eqsin/ncredir-https on puppetmaster1001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [14:59:17] RECOVERY - Confd template for /srv/config-master/pybal/drmrs/ncredir-https on puppetmaster1001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [14:59:50] 10SRE, 10Infrastructure-Foundations, 10Maps, 10netbox: Postgres puppet modules use MD5 for users by default - https://phabricator.wikimedia.org/T300048 (10hnowlan) [14:59:56] (03PS5) 10Elukey: logstash: improve filter for ORES [puppet] - 10https://gerrit.wikimedia.org/r/756959 (https://phabricator.wikimedia.org/T299999) [15:00:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1026 (re)pooling @ 20%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19174 and previous config saved to /var/cache/conftool/dbconfig/20220125-150031-root.json [15:00:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es1031 (T299911)', diff saved to https://phabricator.wikimedia.org/P19175 and previous config saved to /var/cache/conftool/dbconfig/20220125-150052-ladsgroup.json [15:00:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:58] T299911: Upgrade es3 to Bullseye - 
https://phabricator.wikimedia.org/T299911 [15:01:55] did the ferm spam earlier actually disrupt some traffic for etcd? [15:02:03] I'm trying to dig through some debugging and that was a hint [15:02:23] Expecting some cassandra failures on maps2* hosts as the service is removed from puppet - they'll self-resolve after a time [15:02:28] (03CR) 10jerkins-bot: [V: 04-1] logstash: improve filter for ORES [puppet] - 10https://gerrit.wikimedia.org/r/756959 (https://phabricator.wikimedia.org/T299999) (owner: 10Elukey) [15:02:50] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on es1034.eqiad.wmnet with reason: Maintenance [15:02:52] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es1034.eqiad.wmnet with reason: Maintenance [15:02:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:53] (Device rebooted) firing: (2) Device rebooted - https://alerts.wikimedia.org [15:02:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling es1034 (T299911)', diff saved to https://phabricator.wikimedia.org/P19176 and previous config saved to /var/cache/conftool/dbconfig/20220125-150256-ladsgroup.json [15:03:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:01] (03Merged) 10jenkins-bot: Revert "linkrecommendation: Bump chart version" [deployment-charts] - 10https://gerrit.wikimedia.org/r/756992 (owner: 10Kosta Harlan) [15:03:03] (03PS1) 10Filippo Giunchedi: elasticsearch: keep certificate_name as server_name [puppet] - 10https://gerrit.wikimedia.org/r/757013 (https://phabricator.wikimedia.org/T299633) [15:03:09] (03Merged) 10jenkins-bot: add emoji to index.html for demonstration purposes [container/miscweb] - 10https://gerrit.wikimedia.org/r/757009 (owner: 10Dzahn) [15:03:13] !log lvs600[13]: restarting pybal [15:03:13] RECOVERY - Confd template 
for /srv/config-master/pybal/esams/ncredir on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [15:03:13] RECOVERY - Confd template for /srv/config-master/pybal/eqiad/ncredir on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [15:03:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:19] PROBLEM - Elasticsearch HTTPS for relforge-eqiad-small-alpha on relforge1004 is CRITICAL: SSL CRITICAL - failed to verify search.svc.eqiad.wmnet against relforge.svc.eqiad.wmnet, relforge1003.eqiad.wmnet, relforge1004.eqiad.wmnet https://wikitech.wikimedia.org/wiki/Search [15:03:19] RECOVERY - Confd template for /srv/config-master/pybal/esams/ncredir-https on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [15:03:21] RECOVERY - Confd template for /srv/config-master/pybal/eqiad/ncredir-https on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [15:03:22] RECOVERY - Confd template for /srv/config-master/pybal/eqsin/ncredir-https on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [15:03:23] (03CR) 10jerkins-bot: [V: 04-1] elasticsearch: keep certificate_name as server_name [puppet] - 10https://gerrit.wikimedia.org/r/757013 (https://phabricator.wikimedia.org/T299633) (owner: 10Filippo Giunchedi) [15:03:33] !log kharlan@deploy1002 helmfile [staging] START helmfile.d/services/linkrecommendation: apply on staging [15:03:33] RECOVERY - Confd template for /srv/config-master/pybal/ulsfo/ncredir-https on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [15:03:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:36] !log kharlan@deploy1002 helmfile [staging] DONE helmfile.d/services/linkrecommendation: apply on internal [15:03:37] !log 
kharlan@deploy1002 helmfile [staging] DONE helmfile.d/services/linkrecommendation: apply on external [15:03:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P19177 and previous config saved to /var/cache/conftool/dbconfig/20220125-150348-marostegui.json [15:03:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:52] !log kharlan@deploy1002 helmfile [staging] DONE helmfile.d/services/linkrecommendation: sync on staging [15:03:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:56] RECOVERY - Host ncredir-lb.drmrs.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 86.14 ms [15:03:57] RECOVERY - Host ncredir-lb.drmrs.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 86.15 ms [15:04:15] (03PS2) 10Filippo Giunchedi: elasticsearch: keep certificate_name as server_name [puppet] - 10https://gerrit.wikimedia.org/r/757013 (https://phabricator.wikimedia.org/T299633) [15:04:37] PROBLEM - cassandra service on maps2009 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [15:04:38] !log lvs6002: restarting pybal [15:04:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:04:43] PROBLEM - cassandra CQL 10.192.0.155:9042 on maps2005 is CRITICAL: connect to address 10.192.0.155 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [15:04:45] RECOVERY - PyBal BGP sessions are established on lvs6003 is OK: (C)0 le (W)0 le 1 https://wikitech.wikimedia.org/wiki/PyBal https://grafana.wikimedia.org/dashboard/db/pybal-bgp?var-datasource=drmrs+prometheus/ops [15:04:46] PROBLEM - LVS ncredir drmrs port 80/tcp - Non 
canonical domains redirect service IPv4 #page on ncredir-lb.drmrs.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [15:04:46] PROBLEM - cassandra service on maps2005 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [15:04:46] PROBLEM - Check systemd state on maps2007 is CRITICAL: CRITICAL - degraded: The following units failed: cassandra.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:04:50] PROBLEM - LVS ncredir drmrs port 80/tcp - Non canonical domains redirect service IPv6 #page on ncredir-lb.drmrs.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [15:04:57] PROBLEM - cassandra CQL 10.192.16.107:9042 on maps2009 is CRITICAL: connect to address 10.192.16.107 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [15:05:03] ignore those pages, sorry [15:05:06] * volans ignoring [15:05:17] PROBLEM - cassandra service on maps2006 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [15:05:19] RECOVERY - PyBal BGP sessions are established on lvs6001 is OK: (C)0 le (W)0 le 1 https://wikitech.wikimedia.org/wiki/PyBal https://grafana.wikimedia.org/dashboard/db/pybal-bgp?var-datasource=drmrs+prometheus/ops [15:05:31] PROBLEM - cassandra CQL 10.192.16.31:9042 on maps2006 is CRITICAL: connect to address 10.192.16.31 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [15:05:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149 (T285149)', diff saved to https://phabricator.wikimedia.org/P19178 and previous config saved to /var/cache/conftool/dbconfig/20220125-150539-marostegui.json [15:05:43] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:05:43] T285149: Schema change for dropping rev_page_id index - https://phabricator.wikimedia.org/T285149 [15:05:45] RECOVERY - snapshot of s4 in codfw on alert1001 is OK: Last snapshot for s4 at codfw (db2139.codfw.wmnet:3314) taken on 2022-01-25 13:09:02 (1595 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [15:06:05] PROBLEM - Check systemd state on maps2008 is CRITICAL: CRITICAL - degraded: The following units failed: cassandra.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:06:44] (03CR) 10Bking: [C: 03+1] elasticsearch: keep certificate_name as server_name [puppet] - 10https://gerrit.wikimedia.org/r/757013 (https://phabricator.wikimedia.org/T299633) (owner: 10Filippo Giunchedi) [15:07:07] RECOVERY - PyBal BGP sessions are established on lvs6002 is OK: (C)0 le (W)0 le 1 https://wikitech.wikimedia.org/wiki/PyBal https://grafana.wikimedia.org/dashboard/db/pybal-bgp?var-datasource=drmrs+prometheus/ops [15:07:21] RECOVERY - Confd template for /srv/config-master/pybal/eqsin/ncredir on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [15:07:53] (Device rebooted) firing: (2) Device rebooted - https://alerts.wikimedia.org [15:07:59] PROBLEM - cassandra CQL 10.192.48.165:9042 on maps2008 is CRITICAL: connect to address 10.192.48.165 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [15:08:01] PROBLEM - Check systemd state on maps2010 is CRITICAL: CRITICAL - degraded: The following units failed: cassandra.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:08:52] ACKNOWLEDGEMENT - LVS ncredir drmrs port 80/tcp - Non canonical domains redirect service IPv4 #page on ncredir-lb.drmrs.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds Brandon Black new service instance / datacenter, still provisioning and troubleshooting. 
https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [15:08:53] ACKNOWLEDGEMENT - LVS ncredir-https drmrs port 443/tcp - Non canonical redirect service IPv4 #page on ncredir-lb.drmrs.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds Brandon Black new service instance / datacenter, still provisioning and troubleshooting. https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [15:08:54] ACKNOWLEDGEMENT - LVS ncredir drmrs port 80/tcp - Non canonical domains redirect service IPv6 #page on ncredir-lb.drmrs.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds Brandon Black new service instance / datacenter, still provisioning and troubleshooting. https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [15:08:55] ACKNOWLEDGEMENT - LVS ncredir-https drmrs port 443/tcp - Non canonical redirect service IPv6 #page on ncredir-lb.drmrs.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds Brandon Black new service instance / datacenter, still provisioning and troubleshooting. 
https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [15:09:01] !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: name=maps1009.eqiad.wmnet [15:09:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:09:09] !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: name=maps2009.codfw.wmnet [15:09:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:09:13] (03PS1) 10Majavah: hieradata: cloud: Set monitoring_hosts as empty [puppet] - 10https://gerrit.wikimedia.org/r/757014 [15:10:33] PROBLEM - cassandra CQL 10.192.32.46:9042 on maps2007 is CRITICAL: connect to address 10.192.32.46 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [15:10:35] PROBLEM - cassandra service on maps2008 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [15:11:28] (03CR) 10Majavah: "PCC shows an expected change in a random cloud host and no-op on a random production host: https://puppet-compiler.wmflabs.org/pcc-worker1" [puppet] - 10https://gerrit.wikimedia.org/r/757014 (owner: 10Majavah) [15:12:01] ACKNOWLEDGEMENT - cassandra CQL 10.192.0.155:9042 on maps2005 is CRITICAL: connect to address 10.192.0.155 and port 9042: Connection refused Hnowlan Cassandra is being removed from maps https://phabricator.wikimedia.org/T93886 [15:12:01] ACKNOWLEDGEMENT - cassandra service on maps2005 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed Hnowlan Cassandra is being removed from maps https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [15:12:01] ACKNOWLEDGEMENT - cassandra CQL 10.192.16.31:9042 on maps2006 is CRITICAL: connect to address 10.192.16.31 and port 9042: Connection refused Hnowlan Cassandra is being removed from maps https://phabricator.wikimedia.org/T93886 [15:12:01] ACKNOWLEDGEMENT - cassandra service on maps2006 is CRITICAL: CRITICAL - 
Expecting active but unit cassandra is failed Hnowlan Cassandra is being removed from maps https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [15:12:01] ACKNOWLEDGEMENT - Check systemd state on maps2007 is CRITICAL: CRITICAL - degraded: The following units failed: cassandra.service Hnowlan Cassandra is being removed from maps https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:12:01] ACKNOWLEDGEMENT - cassandra CQL 10.192.32.46:9042 on maps2007 is CRITICAL: connect to address 10.192.32.46 and port 9042: Connection refused Hnowlan Cassandra is being removed from maps https://phabricator.wikimedia.org/T93886 [15:12:02] ACKNOWLEDGEMENT - cassandra service on maps2007 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed Hnowlan Cassandra is being removed from maps https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [15:12:02] ACKNOWLEDGEMENT - Check systemd state on maps2008 is CRITICAL: CRITICAL - degraded: The following units failed: cassandra.service Hnowlan Cassandra is being removed from maps https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:12:03] ACKNOWLEDGEMENT - cassandra CQL 10.192.48.165:9042 on maps2008 is CRITICAL: connect to address 10.192.48.165 and port 9042: Connection refused Hnowlan Cassandra is being removed from maps https://phabricator.wikimedia.org/T93886 [15:12:03] ACKNOWLEDGEMENT - cassandra service on maps2008 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed Hnowlan Cassandra is being removed from maps https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [15:12:03] ACKNOWLEDGEMENT - cassandra CQL 10.192.16.107:9042 on maps2009 is CRITICAL: connect to address 10.192.16.107 and port 9042: Connection refused Hnowlan Cassandra is being removed from maps https://phabricator.wikimedia.org/T93886 [15:12:04] ACKNOWLEDGEMENT - cassandra service on maps2009 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed Hnowlan 
Cassandra is being removed from maps https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [15:12:05] ACKNOWLEDGEMENT - Check systemd state on maps2010 is CRITICAL: CRITICAL - degraded: The following units failed: cassandra.service Hnowlan Cassandra is being removed from maps https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:12:53] (Device rebooted) resolved: Device rebooted - https://alerts.wikimedia.org [15:13:31] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (DIFF 2 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33441/console" [puppet] - 10https://gerrit.wikimedia.org/r/757013 (https://phabricator.wikimedia.org/T299633) (owner: 10Filippo Giunchedi) [15:14:07] RECOVERY - Confd template for /srv/config-master/pybal/esams/ncredir on puppetmaster1001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [15:14:07] RECOVERY - Confd template for /srv/config-master/pybal/drmrs/ncredir on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [15:15:10] (03CR) 10Filippo Giunchedi: [V: 03+1 C: 03+2] elasticsearch: keep certificate_name as server_name [puppet] - 10https://gerrit.wikimedia.org/r/757013 (https://phabricator.wikimedia.org/T299633) (owner: 10Filippo Giunchedi) [15:15:25] PROBLEM - cassandra CQL 10.192.48.166:9042 on maps2010 is CRITICAL: connect to address 10.192.48.166 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [15:15:25] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: cassandra.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:15:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1026 (re)pooling @ 25%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19179 and previous config saved to /var/cache/conftool/dbconfig/20220125-151536-root.json [15:15:39] Logged the message 
at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:23] RECOVERY - Confd template for /srv/config-master/pybal/drmrs/ncredir-https on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [15:17:01] RECOVERY - Check systemd state on maps2008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:17:13] !log jhathaway@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mx1001.wikimedia.org with reason: kernel testing [15:17:15] !log jhathaway@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mx1001.wikimedia.org with reason: kernel testing [15:17:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:17] RECOVERY - Check systemd state on maps2007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:17:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:27] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:17:29] RECOVERY - Check systemd state on maps2010 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:17:45] (Device rebooted) firing: Device rebooted - https://alerts.wikimedia.org [15:17:51] PROBLEM - cassandra service on maps2010 is CRITICAL: CRITICAL - Expecting active but unit cassandra is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [15:18:17] !log mmandere@cumin1001 conftool action : select; selector: cluster=necredir,dc=drmrs [15:18:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:18:37] RECOVERY - Confd template for /srv/config-master/pybal/codfw/ncredir on puppetmaster2001 is OK: No errors detected 
https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [15:18:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T299827)', diff saved to https://phabricator.wikimedia.org/P19180 and previous config saved to /var/cache/conftool/dbconfig/20220125-151852-marostegui.json [15:18:54] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1181.eqiad.wmnet with reason: Maintenance [15:18:56] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1181.eqiad.wmnet with reason: Maintenance [15:18:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:18:57] T299827: Add gb_by_central_id column to globalblocks table - https://phabricator.wikimedia.org/T299827 [15:18:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:19:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1181 (T299827)', diff saved to https://phabricator.wikimedia.org/P19181 and previous config saved to /var/cache/conftool/dbconfig/20220125-151900-marostegui.json [15:19:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:19:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:20:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181 (T299827)', diff saved to https://phabricator.wikimedia.org/P19182 and previous config saved to /var/cache/conftool/dbconfig/20220125-152006-marostegui.json [15:20:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:20:19] PROBLEM - Check systemd state on kubernetes1007 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:20:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149', diff saved to https://phabricator.wikimedia.org/P19183 and previous config 
saved to /var/cache/conftool/dbconfig/20220125-152044-marostegui.json [15:20:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:21:12] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.reimage for host es1034.eqiad.wmnet with OS bullseye [15:21:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:21:59] !log mmandere@cumin1001 conftool action : set/pooled=yes; selector: name=ncredir6002.* [15:22:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:22:45] (Device rebooted) firing: (2) Device rebooted - https://alerts.wikimedia.org [15:24:14] (03Abandoned) 10Hnowlan: image-suggestion-api: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/739618 (owner: 10PipelineBot) [15:24:20] (03Abandoned) 10Hnowlan: image-suggestion-api: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/736565 (owner: 10PipelineBot) [15:24:24] (03Abandoned) 10Hnowlan: image-suggestion-api: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/734438 (owner: 10PipelineBot) [15:24:55] PROBLEM - Confd template for /srv/config-master/pybal/esams/ncredir on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/esams/ncredir is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [15:24:55] PROBLEM - Confd template for /srv/config-master/pybal/drmrs/ncredir on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/drmrs/ncredir is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [15:24:58] (03CR) 10Hnowlan: [C: 03+2] image-suggestion-api: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/742271 (owner: 10PipelineBot) [15:24:58] !log mmandere@cumin1001 START - Cookbook sre.ganeti.makevm for new host ncredir6002.drmrs.wmnet [15:25:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:25:18] PROBLEM - Confd template for 
/srv/config-master/pybal/ulsfo/ncredir on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/ulsfo/ncredir is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [15:25:18] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes1007 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [15:25:20] PROBLEM - Confd template for /srv/config-master/pybal/codfw/ncredir-https on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/ncredir-https is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [15:25:22] PROBLEM - Confd template for /srv/config-master/pybal/drmrs/ncredir on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/drmrs/ncredir is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [15:25:22] PROBLEM - Confd template for /srv/config-master/pybal/ulsfo/ncredir-https on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/ulsfo/ncredir-https is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [15:25:26] PROBLEM - Confd template for /srv/config-master/pybal/ulsfo/ncredir-https on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/ulsfo/ncredir-https is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [15:25:44] PROBLEM - Confd template for /srv/config-master/pybal/eqsin/ncredir on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqsin/ncredir is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [15:25:44] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/ncredir-https on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/ncredir-https is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [15:25:44] PROBLEM - Confd template for 
/srv/config-master/pybal/eqiad/ncredir-https on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/ncredir-https is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [15:25:45] PROBLEM - Confd template for /srv/config-master/pybal/esams/ncredir-https on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/esams/ncredir-https is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [15:26:04] RECOVERY - Check systemd state on kubernetes1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:26:26] RECOVERY - Elasticsearch HTTPS for relforge-eqiad-small-alpha on relforge1003 is OK: SSL OK - Certificate relforge.svc.eqiad.wmnet valid until 2026-03-18 02:55:32 +0000 (expires in 1512 days) https://wikitech.wikimedia.org/wiki/Search [15:26:26] RECOVERY - Elasticsearch HTTPS for relforge-eqiad on relforge1003 is OK: SSL OK - Certificate relforge.svc.eqiad.wmnet valid until 2026-03-18 02:55:32 +0000 (expires in 1512 days) https://wikitech.wikimedia.org/wiki/Search [15:26:26] RECOVERY - Elasticsearch HTTPS for relforge-eqiad on relforge1004 is OK: SSL OK - Certificate relforge.svc.eqiad.wmnet valid until 2026-03-18 02:55:32 +0000 (expires in 1512 days) https://wikitech.wikimedia.org/wiki/Search [15:26:26] RECOVERY - Elasticsearch HTTPS for relforge-eqiad-small-alpha on relforge1004 is OK: SSL OK - Certificate relforge.svc.eqiad.wmnet valid until 2026-03-18 02:55:32 +0000 (expires in 1512 days) https://wikitech.wikimedia.org/wiki/Search [15:26:40] that was me ^ [15:27:37] (03PS6) 10Elukey: logstash: improve filter for ORES [puppet] - 10https://gerrit.wikimedia.org/r/756959 (https://phabricator.wikimedia.org/T299999) [15:27:45] (Device rebooted) firing: (2) Device rebooted - https://alerts.wikimedia.org [15:28:14] PROBLEM - Confd template for /srv/config-master/pybal/eqsin/ncredir-https on puppetmaster1001 is CRITICAL: Compilation of 
file /srv/config-master/pybal/eqsin/ncredir-https is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [15:28:14] PROBLEM - Confd template for /srv/config-master/pybal/codfw/ncredir on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/ncredir is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [15:28:45] (03Merged) 10jenkins-bot: image-suggestion-api: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/742271 (owner: 10PipelineBot) [15:29:37] (03CR) 10jerkins-bot: [V: 04-1] logstash: improve filter for ORES [puppet] - 10https://gerrit.wikimedia.org/r/756959 (https://phabricator.wikimedia.org/T299999) (owner: 10Elukey) [15:30:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1026 (re)pooling @ 40%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19184 and previous config saved to /var/cache/conftool/dbconfig/20220125-153040-root.json [15:30:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:31:21] (03PS7) 10Elukey: logstash: improve filter for ORES [puppet] - 10https://gerrit.wikimedia.org/r/756959 (https://phabricator.wikimedia.org/T299999) [15:31:48] !log centrallog1001:~# lvextend --resizefs --size +23G /dev/centrallog1001-vg/data [15:31:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:32:45] (Device rebooted) firing: (2) Device rebooted - https://alerts.wikimedia.org [15:32:58] !log jelto@cumin1001 START - Cookbook sre.ganeti.makevm for new host gitlab-runner1001.eqiad.wmnet [15:33:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:33:04] PROBLEM - Confd template for /srv/config-master/pybal/drmrs/ncredir-https on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/drmrs/ncredir-https is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [15:33:30] (03CR) 10jerkins-bot: [V: 04-1] logstash: improve filter for ORES [puppet] 
- 10https://gerrit.wikimedia.org/r/756959 (https://phabricator.wikimedia.org/T299999) (owner: 10Elukey) [15:34:19] !log volans@cumin1001 START - Cookbook sre.hosts.reimage for host es1022.eqiad.wmnet with OS bullseye [15:34:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:34:26] 10SRE, 10ops-eqiad, 10DBA: es1022 troubles with PXE - https://phabricator.wikimedia.org/T299123 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by volans@cumin1001 for host es1022.eqiad.wmnet with OS bullseye [15:35:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181', diff saved to https://phabricator.wikimedia.org/P19185 and previous config saved to /var/cache/conftool/dbconfig/20220125-153511-marostegui.json [15:35:14] (03PS8) 10Elukey: logstash: improve filter for ORES [puppet] - 10https://gerrit.wikimedia.org/r/756959 (https://phabricator.wikimedia.org/T299999) [15:35:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:35:32] PROBLEM - Confd template for /srv/config-master/pybal/drmrs/ncredir-https on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/drmrs/ncredir-https is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [15:35:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149', diff saved to https://phabricator.wikimedia.org/P19186 and previous config saved to /var/cache/conftool/dbconfig/20220125-153548-marostegui.json [15:35:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:37:21] (03CR) 10jerkins-bot: [V: 04-1] logstash: improve filter for ORES [puppet] - 10https://gerrit.wikimedia.org/r/756959 (https://phabricator.wikimedia.org/T299999) (owner: 10Elukey) [15:37:45] (Device rebooted) resolved: Device rebooted - https://alerts.wikimedia.org [15:38:12] !log mmandere@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host 
ncredir6002.drmrs.wmnet [15:38:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:39:51] PROBLEM - Confd template for /srv/config-master/pybal/esams/ncredir-https on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/esams/ncredir-https is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [15:40:11] (03PS9) 10Elukey: logstash: improve filter for ORES [puppet] - 10https://gerrit.wikimedia.org/r/756959 (https://phabricator.wikimedia.org/T299999) [15:41:27] (03PS2) 10Arturo Borrero Gonzalez: wmcs: monitoring: sharpen primary/backup rsync setup [puppet] - 10https://gerrit.wikimedia.org/r/756954 (https://phabricator.wikimedia.org/T300011) [15:42:02] (03CR) 10jerkins-bot: [V: 04-1] wmcs: monitoring: sharpen primary/backup rsync setup [puppet] - 10https://gerrit.wikimedia.org/r/756954 (https://phabricator.wikimedia.org/T300011) (owner: 10Arturo Borrero Gonzalez) [15:42:23] (03CR) 10jerkins-bot: [V: 04-1] logstash: improve filter for ORES [puppet] - 10https://gerrit.wikimedia.org/r/756959 (https://phabricator.wikimedia.org/T299999) (owner: 10Elukey) [15:42:46] (Device rebooted) firing: Device rebooted - https://alerts.wikimedia.org [15:43:06] (03PS10) 10Elukey: logstash: improve filter for ORES [puppet] - 10https://gerrit.wikimedia.org/r/756959 (https://phabricator.wikimedia.org/T299999) [15:43:08] (03PS3) 10Arturo Borrero Gonzalez: wmcs: monitoring: sharpen primary/backup rsync setup [puppet] - 10https://gerrit.wikimedia.org/r/756954 (https://phabricator.wikimedia.org/T300011) [15:43:33] RECOVERY - Confd template for /srv/config-master/pybal/drmrs/ncredir-https on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [15:45:36] (03CR) 10Andrew Bogott: wmcs: monitoring: sharpen primary/backup rsync setup (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/756954 (https://phabricator.wikimedia.org/T300011) (owner: 10Arturo Borrero Gonzalez) [15:45:44] !log 
marostegui@cumin1001 dbctl commit (dc=all): 'es1026 (re)pooling @ 50%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19187 and previous config saved to /var/cache/conftool/dbconfig/20220125-154543-root.json [15:45:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:45:51] PROBLEM - Confd template for /srv/config-master/pybal/ulsfo/ncredir on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/ulsfo/ncredir is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [15:45:51] PROBLEM - Confd template for /srv/config-master/pybal/eqsin/ncredir on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqsin/ncredir is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [15:46:20] 10SRE, 10ops-eqiad, 10DBA: es1022 troubles with PXE - https://phabricator.wikimedia.org/T299123 (10Volans) The issue here was that both `NIC.Integrated.1-1-1` and `NIC.Integrated.1-3-1` had the `LegacyBootProto` set to `PXE` while the host has a cable only on the 3rd NIC (see [[ https://netbox.wikimedia.org/... 
[15:47:46] (Device rebooted) resolved: Device rebooted - https://alerts.wikimedia.org [15:47:59] PROBLEM - Confd template for /srv/config-master/pybal/drmrs/ncredir-https on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/drmrs/ncredir-https is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [15:48:19] PROBLEM - Confd template for /srv/config-master/pybal/eqsin/ncredir-https on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqsin/ncredir-https is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [15:50:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181', diff saved to https://phabricator.wikimedia.org/P19189 and previous config saved to /var/cache/conftool/dbconfig/20220125-155017-marostegui.json [15:50:19] (03PS1) 10Filippo Giunchedi: ssl: remove search.svc keypair [puppet] - 10https://gerrit.wikimedia.org/r/757020 (https://phabricator.wikimedia.org/T299633) [15:50:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:50:42] 10SRE, 10SRE-swift-storage: reimaging swift backends should set swift UID/GID to match filesystems - https://phabricator.wikimedia.org/T300057 (10MatthewVernon) [15:50:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149 (T285149)', diff saved to https://phabricator.wikimedia.org/P19190 and previous config saved to /var/cache/conftool/dbconfig/20220125-155053-marostegui.json [15:50:55] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1148.eqiad.wmnet with reason: Maintenance [15:50:56] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1148.eqiad.wmnet with reason: Maintenance [15:50:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:50:58] T285149: Schema change for dropping rev_page_id index - https://phabricator.wikimedia.org/T285149 [15:50:59] Logged 
the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:51:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1148 (T285149)', diff saved to https://phabricator.wikimedia.org/P19191 and previous config saved to /var/cache/conftool/dbconfig/20220125-155101-marostegui.json [15:51:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:51:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:52:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148 (T285149)', diff saved to https://phabricator.wikimedia.org/P19192 and previous config saved to /var/cache/conftool/dbconfig/20220125-155207-marostegui.json [15:52:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:52:29] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host es1034.eqiad.wmnet with OS bullseye [15:52:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:52:49] (Device rebooted) firing: Device rebooted - https://alerts.wikimedia.org [15:53:28] PROBLEM - Confd template for /srv/config-master/pybal/codfw/ncredir on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/ncredir is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [15:53:28] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/ncredir on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/ncredir is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [15:53:30] PROBLEM - Confd template for /srv/config-master/pybal/esams/ncredir on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/esams/ncredir is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [15:53:30] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/ncredir on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/ncredir 
is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [15:54:09] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (NOOP 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33442/console" [puppet] - 10https://gerrit.wikimedia.org/r/757020 (https://phabricator.wikimedia.org/T299633) (owner: 10Filippo Giunchedi) [15:55:34] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes1007 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [15:55:36] (03PS4) 10Andrew Bogott: wmcs: monitoring: sharpen primary/backup rsync setup [puppet] - 10https://gerrit.wikimedia.org/r/756954 (https://phabricator.wikimedia.org/T300011) (owner: 10Arturo Borrero Gonzalez) [15:56:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es1034 (T299911)', diff saved to https://phabricator.wikimedia.org/P19193 and previous config saved to /var/cache/conftool/dbconfig/20220125-155604-ladsgroup.json [15:56:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:09] T299911: Upgrade es3 to Bullseye - https://phabricator.wikimedia.org/T299911 [15:57:49] (Device rebooted) resolved: Device rebooted - https://alerts.wikimedia.org [15:58:52] PROBLEM - Host db2086.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:59:08] 10SRE-OnFire (FY2021/2022-Q3): incidents occurring during Q2 have been scored with the scorecard - https://phabricator.wikimedia.org/T292254 (10herron) >>! In T292254#7648473, @jcrespo wrote: > @herron @lmata What is the relationship between the columns at #sre-onfire and #sre-onfire-incident-docs "in review"? I... 
[16:00:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1026 (re)pooling @ 60%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19194 and previous config saved to /var/cache/conftool/dbconfig/20220125-160047-root.json [16:00:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:01:22] 10SRE-OnFire (FY2021/2022-Q3): incidents occurring during Q2 have been scored with the scorecard - https://phabricator.wikimedia.org/T292254 (10jcrespo) >>! In T292254#7649349, @herron wrote: > Since we're evaluating using the scorecard from Q2 2021/2022 onward I didn't want to mix the queues between stalled doc... [16:01:42] (03CR) 10Accraze: [C: 03+1] "Nice one Luca!" [puppet] - 10https://gerrit.wikimedia.org/r/756959 (https://phabricator.wikimedia.org/T299999) (owner: 10Elukey) [16:01:48] (03PS5) 10Andrew Bogott: wmcs: monitoring: sharpen primary/backup rsync setup [puppet] - 10https://gerrit.wikimedia.org/r/756954 (https://phabricator.wikimedia.org/T300011) (owner: 10Arturo Borrero Gonzalez) [16:02:52] 10SRE-OnFire (FY2021/2022-Q3): incidents occurring during Q2 have been scored with the scorecard - https://phabricator.wikimedia.org/T292254 (10lmata) @herron @jcrespo +1 to untagging stalled tasks and sunsetting the old docs tag; I think we can revisit having a different tag for docs once the process is more ma... 
[16:02:53] (Device rebooted) firing: Device rebooted - https://alerts.wikimedia.org [16:03:49] (03CR) 10Cwhite: [C: 04-1] logstash: improve filter for ORES (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/756959 (https://phabricator.wikimedia.org/T299999) (owner: 10Elukey) [16:05:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181 (T299827)', diff saved to https://phabricator.wikimedia.org/P19195 and previous config saved to /var/cache/conftool/dbconfig/20220125-160522-marostegui.json [16:05:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:05:26] T299827: Add gb_by_central_id column to globalblocks table - https://phabricator.wikimedia.org/T299827 [16:05:57] !log volans@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host es1022.eqiad.wmnet with OS bullseye [16:05:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:06:04] 10SRE, 10ops-eqiad, 10DBA: es1022 troubles with PXE - https://phabricator.wikimedia.org/T299123 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by volans@cumin1001 for host es1022.eqiad.wmnet with OS bullseye completed: - es1022 (**WARN**) - Removed from Puppet and PuppetDB if present... [16:06:36] 10SRE, 10ops-codfw: Test Dell switches cabling - https://phabricator.wikimedia.org/T290133 (10Papaul) Test transceiver QSFP-PLR4 40G QSFP+ FROM FS ` dell-spine1# show interface transceiver Ethernet 4 Ethernet4 --------------------------------------------------------------------- Attribute :... 
[16:06:42] !log razzi@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on an-test-coord1001.eqiad.wmnet with reason: Still troubleshooting mariadb issues [16:06:44] !log razzi@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on an-test-coord1001.eqiad.wmnet with reason: Still troubleshooting mariadb issues [16:06:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:06:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:07:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148', diff saved to https://phabricator.wikimedia.org/P19196 and previous config saved to /var/cache/conftool/dbconfig/20220125-160712-marostegui.json [16:07:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:07:52] (Device rebooted) resolved: Device rebooted - https://alerts.wikimedia.org [16:08:04] 10Puppet, 10Infrastructure-Foundations, 10Release-Engineering-Team (Yak Shaving 🐃🪒), 10User-brennen: logspam-watch: UTF-8 errors for some input - https://phabricator.wikimedia.org/T292246 (10brennen) 05Open→03Resolved [16:10:42] RECOVERY - Host db2086.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.53 ms [16:11:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es1034', diff saved to https://phabricator.wikimedia.org/P19197 and previous config saved to /var/cache/conftool/dbconfig/20220125-161108-ladsgroup.json [16:11:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:11:44] (03PS1) 10MMandere: install_server: Add drmrs ncredir second instance [puppet] - 10https://gerrit.wikimedia.org/r/757024 (https://phabricator.wikimedia.org/T282787) [16:12:02] (03CR) 10Jbond: [C: 03+1] "policy wise i think this is fine, however im not familiar with gerrit syntax so have added antonie" [puppet] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/756989 (owner: 10Majavah) 
[16:12:37] 10SRE, 10SRE-Access-Requests: Requesting update to SSH key and Kerberos for Joseph Seddon - https://phabricator.wikimedia.org/T299988 (10MarkTraceur) In case this is a thing that needs a manager to approve, I am here to approve it! [16:12:45] 10SRE, 10ops-eqiad, 10DBA: es1022 troubles with PXE - https://phabricator.wikimedia.org/T299123 (10Volans) 05Open→03Resolved Root cause found, problem solved, host reimaged. Resolving. [16:12:45] (Device rebooted) firing: Device rebooted - https://alerts.wikimedia.org [16:12:53] (03PS1) 10MVernon: install_server: swift UID/GID should match filesystems (if present) [puppet] - 10https://gerrit.wikimedia.org/r/757025 (https://phabricator.wikimedia.org/T300057) [16:13:02] (03CR) 10Andrew Bogott: [C: 03+2] wmcs: monitoring: sharpen primary/backup rsync setup [puppet] - 10https://gerrit.wikimedia.org/r/756954 (https://phabricator.wikimedia.org/T300011) (owner: 10Arturo Borrero Gonzalez) [16:14:02] (03CR) 10Elukey: logstash: improve filter for ORES (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/756959 (https://phabricator.wikimedia.org/T299999) (owner: 10Elukey) [16:15:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1026 (re)pooling @ 75%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19198 and previous config saved to /var/cache/conftool/dbconfig/20220125-161550-root.json [16:15:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:17:05] 10SRE, 10ops-eqiad, 10DBA: es1022 troubles with PXE - https://phabricator.wikimedia.org/T299123 (10Marostegui) Thank you @Volans for troubleshooting this issue! 
[16:17:19] (03PS1) 10Majavah: hieradata: remove cloud-cumin-01,02 [puppet] - 10https://gerrit.wikimedia.org/r/757026 (https://phabricator.wikimedia.org/T255980) [16:18:36] 10SRE, 10ops-codfw: Test Dell switches cabling - https://phabricator.wikimedia.org/T290133 (10Papaul) Testing transceiver 10GB SFP+ from Finisar ` dell-leaf2# show interface transceiver Ethernet 0 Ethernet0 --------------------------------------------------------------------- Attribute : Value... [16:18:47] !log updating firmware ganeti1014 T299527 [16:18:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:52] T299527: Installation issues on PowerEdge R440 eqiad Ganeti servers with buster / firmware update needed - https://phabricator.wikimedia.org/T299527 [16:19:19] 10SRE, 10ops-codfw: Test Dell switches cabling - https://phabricator.wikimedia.org/T290133 (10Papaul) Testing transceiver 10G SFP+ from FS ~~~ dell-leaf2# show interface transceiver Ethernet 1 Ethernet1 --------------------------------------------------------------------- Attribute : Value/Sta... 
[16:20:40] (03CR) 10BBlack: [C: 03+1] install_server: Add drmrs ncredir second instance [puppet] - 10https://gerrit.wikimedia.org/r/757024 (https://phabricator.wikimedia.org/T282787) (owner: 10MMandere) [16:21:19] (03PS11) 10Elukey: logstash: improve filter for ORES [puppet] - 10https://gerrit.wikimedia.org/r/756959 (https://phabricator.wikimedia.org/T299999) [16:21:41] !log updating firmware ganeti1005 T299527 [16:21:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:22:07] (03CR) 10MMandere: [C: 03+2] install_server: Add drmrs ncredir second instance [puppet] - 10https://gerrit.wikimedia.org/r/757024 (https://phabricator.wikimedia.org/T282787) (owner: 10MMandere) [16:22:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148', diff saved to https://phabricator.wikimedia.org/P19199 and previous config saved to /var/cache/conftool/dbconfig/20220125-162217-marostegui.json [16:22:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:22:46] (Device rebooted) resolved: Device rebooted - https://alerts.wikimedia.org [16:23:21] (03CR) 10jerkins-bot: [V: 04-1] logstash: improve filter for ORES [puppet] - 10https://gerrit.wikimedia.org/r/756959 (https://phabricator.wikimedia.org/T299999) (owner: 10Elukey) [16:24:45] 10SRE, 10ops-codfw: Reset db2086's idrac - https://phabricator.wikimedia.org/T299882 (10Papaul) Before firmware upgrade ` BIOS Version 2.4.3 Firmware Version 2.40.40.40 ` After firmware upgrade ` [16:26:07] 10SRE, 10ops-codfw, 10ops-eqiad: Installation issues on PowerEdge R440 Restbase servers with buster / firmware update needed - https://phabricator.wikimedia.org/T299652 (10Cmjohnson) @hnowlan Can i just update the nic firmware or does this need scheduled downtime? 
[16:26:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es1034', diff saved to https://phabricator.wikimedia.org/P19200 and previous config saved to /var/cache/conftool/dbconfig/20220125-162613-ladsgroup.json [16:26:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:26:43] 10SRE, 10ops-codfw: Test Dell switches cabling - https://phabricator.wikimedia.org/T290133 (10Papaul) Testing Juniper DAC ` dell-leaf2# show interface transceiver Ethernet 2 Ethernet2 --------------------------------------------------------------------- Attribute : Value/State ---------------... [16:27:35] (03PS12) 10Elukey: logstash: improve filter for ORES [puppet] - 10https://gerrit.wikimedia.org/r/756959 (https://phabricator.wikimedia.org/T299999) [16:28:16] PROBLEM - Host restbase2011 is DOWN: PING CRITICAL - Packet loss = 100% [16:28:16] (03CR) 10Cwhite: [C: 03+2] logstash: remove event.duration when value is hyphen [puppet] - 10https://gerrit.wikimedia.org/r/756683 (owner: 10Cwhite) [16:30:04] RECOVERY - Confd template for /srv/config-master/pybal/eqsin/ncredir-https on puppetmaster1001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [16:30:06] RECOVERY - Confd template for /srv/config-master/pybal/esams/ncredir-https on puppetmaster1001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [16:30:22] RECOVERY - Confd template for /srv/config-master/pybal/esams/ncredir on puppetmaster1001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [16:30:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1026 (re)pooling @ 100%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19201 and previous config saved to /var/cache/conftool/dbconfig/20220125-163054-root.json [16:30:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:30:58] RECOVERY - Confd template for 
/srv/config-master/pybal/drmrs/ncredir-https on puppetmaster1001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [16:31:00] (03CR) 10Elukey: logstash: improve filter for ORES (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/756959 (https://phabricator.wikimedia.org/T299999) (owner: 10Elukey) [16:31:16] RECOVERY - Confd template for /srv/config-master/pybal/esams/ncredir-https on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [16:31:16] RECOVERY - Confd template for /srv/config-master/pybal/eqiad/ncredir on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [16:31:24] RECOVERY - Confd template for /srv/config-master/pybal/ulsfo/ncredir on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [16:31:24] RECOVERY - Confd template for /srv/config-master/pybal/codfw/ncredir on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [16:31:26] RECOVERY - Confd template for /srv/config-master/pybal/eqiad/ncredir-https on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [16:31:32] RECOVERY - Confd template for /srv/config-master/pybal/eqsin/ncredir-https on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [16:31:35] (03PS2) 10Jbond: WIP: drop individual promethus node ferm rules [puppet] - 10https://gerrit.wikimedia.org/r/757010 [16:31:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1022 (re)pooling @ 1%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19202 and previous config saved to /var/cache/conftool/dbconfig/20220125-163141-root.json [16:31:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:32:22] RECOVERY - Confd template for /srv/config-master/pybal/ulsfo/ncredir-https on 
puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [16:33:00] (03CR) 10jerkins-bot: [V: 04-1] WIP: drop individual promethus node ferm rules [puppet] - 10https://gerrit.wikimedia.org/r/757010 (owner: 10Jbond) [16:33:04] RECOVERY - Confd template for /srv/config-master/pybal/drmrs/ncredir on puppetmaster1001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [16:33:04] RECOVERY - Confd template for /srv/config-master/pybal/eqiad/ncredir on puppetmaster1001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [16:33:04] RECOVERY - Confd template for /srv/config-master/pybal/codfw/ncredir on puppetmaster1001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [16:33:04] RECOVERY - Confd template for /srv/config-master/pybal/codfw/ncredir-https on puppetmaster1001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [16:33:04] RECOVERY - Confd template for /srv/config-master/pybal/eqiad/ncredir-https on puppetmaster1001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [16:33:04] RECOVERY - Confd template for /srv/config-master/pybal/ulsfo/ncredir on puppetmaster1001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [16:33:04] RECOVERY - Confd template for /srv/config-master/pybal/eqsin/ncredir on puppetmaster1001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [16:33:05] (03PS1) 10Ladsgroup: Revert "es1034: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/756993 [16:33:05] RECOVERY - Confd template for /srv/config-master/pybal/ulsfo/ncredir-https on puppetmaster1001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [16:33:06] RECOVERY - Confd template for /srv/config-master/pybal/eqsin/ncredir on puppetmaster2001 is OK: No errors detected 
https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [16:33:06] RECOVERY - Confd template for /srv/config-master/pybal/esams/ncredir on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [16:33:13] (03PS2) 10Ladsgroup: Revert "es1034: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/756993 [16:33:17] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] Revert "es1034: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/756993 (owner: 10Ladsgroup) [16:33:38] 10SRE, 10ops-codfw: Test Dell switches cabling - https://phabricator.wikimedia.org/T290133 (10Papaul) Testing Juniper 40G transceiver ` dell-spine1# show interface transceiver Ethernet 12 Ethernet12 --------------------------------------------------------------------- Attribute : Value/State... [16:33:41] 10SRE, 10ops-codfw, 10ops-eqiad: Installation issues on PowerEdge R440 Restbase servers with buster / firmware update needed - https://phabricator.wikimedia.org/T299652 (10hnowlan) >>! In T299652#7649471, @Cmjohnson wrote: > @hnowlan Can i just update the nic firmware or does this need scheduled downtime? I... 
[16:37:04] (03PS1) 10Jcrespo: exim: Silently block spam email from given source [puppet] - 10https://gerrit.wikimedia.org/r/757031 [16:37:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148 (T285149)', diff saved to https://phabricator.wikimedia.org/P19203 and previous config saved to /var/cache/conftool/dbconfig/20220125-163721-marostegui.json [16:37:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:37:28] T285149: Schema change for dropping rev_page_id index - https://phabricator.wikimedia.org/T285149 [16:37:30] RECOVERY - Maps - OSM synchronization lag - eqiad on alert1001 is OK: (C)2.592e+05 ge (W)1.764e+05 ge 5.985e+04 https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=11&fullscreen&orgId=1 [16:37:48] RECOVERY - Confd template for /srv/config-master/pybal/drmrs/ncredir on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [16:39:34] 10SRE, 10LDAP-Access-Requests, 10WMF-NDA-Requests: Clarify whether members of ldap/nda should be added to #WMF-NDA - https://phabricator.wikimedia.org/T299839 (10Volans) Adding #WMF-NDA-Requests, @mark, @faidon and @MoritzMuehlenhoff for SRE, #security and @KFrancis for feedback. One thing to clarify is how... 
[16:39:44] PROBLEM - BGP status on cr3-eqsin is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:40:08] RECOVERY - Confd template for /srv/config-master/pybal/drmrs/ncredir-https on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [16:40:29] 10SRE, 10ops-eqiad: Installation issues on PowerEdge R440 eqiad Ganeti servers with buster / firmware update needed - https://phabricator.wikimedia.org/T299527 (10Cmjohnson) [16:40:44] 10SRE, 10ops-eqiad: Installation issues on PowerEdge R440 eqiad Ganeti servers with buster / firmware update needed - https://phabricator.wikimedia.org/T299527 (10Cmjohnson) @MoritzMuehlenhoff both 1014 and 1005 have been updated. [16:41:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es1034 (T299911)', diff saved to https://phabricator.wikimedia.org/P19204 and previous config saved to /var/cache/conftool/dbconfig/20220125-164118-ladsgroup.json [16:41:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:41:23] T299911: Upgrade es3 to Bullseye - https://phabricator.wikimedia.org/T299911 [16:42:49] 10SRE, 10ops-codfw, 10ops-eqiad: Installation issues on PowerEdge R440 Restbase servers with buster / firmware update needed - https://phabricator.wikimedia.org/T299652 (10Cmjohnson) @hnowlan Yes, we can do them in any order you see fit. I would like to trial run one first to make sure the idrac update works... 
[16:43:06] (03CR) 10Jcrespo: "Hello, Faidon," [puppet] - 10https://gerrit.wikimedia.org/r/757031 (owner: 10Jcrespo) [16:43:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Make es1031 master of es3 T299911', diff saved to https://phabricator.wikimedia.org/P19206 and previous config saved to /var/cache/conftool/dbconfig/20220125-164324-ladsgroup.json [16:43:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:46:15] !log deploy updated patch for T285116 [16:46:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:46:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1022 (re)pooling @ 5%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19207 and previous config saved to /var/cache/conftool/dbconfig/20220125-164645-root.json [16:46:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:46:52] RECOVERY - Confd template for /srv/config-master/pybal/codfw/ncredir-https on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [16:47:26] (03PS1) 10Ladsgroup: es1028: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/757032 (https://phabricator.wikimedia.org/T299911) [16:47:41] (KubernetesRsyslogDown) firing: rsyslog on kubestage1004:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org [16:47:51] 10SRE, 10ops-codfw, 10ops-eqiad: Installation issues on PowerEdge R440 Restbase servers with buster / firmware update needed - https://phabricator.wikimedia.org/T299652 (10hnowlan) >>! In T299652#7649526, @Cmjohnson wrote: > @hnowlan Yes, we can do them in any order you see fit. I would like to trial run one... 
[16:48:02] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] es1028: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/757032 (https://phabricator.wikimedia.org/T299911) (owner: 10Ladsgroup) [16:48:54] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on es1028.eqiad.wmnet with reason: Maintenance [16:48:56] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es1028.eqiad.wmnet with reason: Maintenance [16:48:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:48:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:49:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling es1028 (T299911)', diff saved to https://phabricator.wikimedia.org/P19208 and previous config saved to /var/cache/conftool/dbconfig/20220125-164900-ladsgroup.json [16:49:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:49:04] T299911: Upgrade es3 to Bullseye - https://phabricator.wikimedia.org/T299911 [16:52:30] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [16:52:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:52:33] RECOVERY - LVS ncredir drmrs port 80/tcp - Non canonical domains redirect service IPv6 #page on ncredir-lb.drmrs.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 159 bytes in 0.174 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [16:55:12] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [16:55:14] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [16:55:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:55:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:55:41] (03CR) 10Filippo Giunchedi: "Thanks for 
tackling this!" [puppet] - 10https://gerrit.wikimedia.org/r/757025 (https://phabricator.wikimedia.org/T300057) (owner: 10MVernon) [16:56:26] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [16:56:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:57:53] (03CR) 10Cwhite: [C: 03+2] elasticsearch: write curator logs to stdout [puppet] - 10https://gerrit.wikimedia.org/r/756053 (https://phabricator.wikimedia.org/T297239) (owner: 10Cwhite) [16:59:29] RECOVERY - LVS ncredir drmrs port 80/tcp - Non canonical domains redirect service IPv4 #page on ncredir-lb.drmrs.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 159 bytes in 0.173 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [17:00:05] jbond and rzl: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Puppet request window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220125T1700). [17:00:05] RoanKattouw and Lucas_WMDE: A patch you scheduled for Puppet request window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[17:01:39] (03PS1) 10Ahmon Dancy: Merge remote-tracking branch 'origin/master' into train-dev [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/757036 [17:01:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1022 (re)pooling @ 10%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19209 and previous config saved to /var/cache/conftool/dbconfig/20220125-170148-root.json [17:01:50] o/ [17:01:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:01:58] p/ [17:02:01] * o/ [17:02:02] !log upgrade elasticsearch-curator on apifeatureusage1001 [17:02:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:02:13] (03PS2) 10MVernon: install_server: swift UID/GID should match filesystems (if present) [puppet] - 10https://gerrit.wikimedia.org/r/757025 (https://phabricator.wikimedia.org/T300057) [17:03:34] (03PS1) 10SBassett: Escape various messages in WikibaseMediaInfo [extensions/WikibaseMediaInfo] (wmf/1.38.0-wmf.19) - 10https://gerrit.wikimedia.org/r/756994 (https://phabricator.wikimedia.org/T299289) [17:05:39] (03CR) 10MVernon: install_server: swift UID/GID should match filesystems (if present) (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/757025 (https://phabricator.wikimedia.org/T300057) (owner: 10MVernon) [17:06:36] (03PS2) 10Ahmon Dancy: Merge remote-tracking branch 'origin/master' into train-dev [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/757036 [17:07:34] 10SRE, 10ops-codfw: Reset db2086's idrac - https://phabricator.wikimedia.org/T299882 (10Papaul) 05Open→03Resolved @Marostegui this is complete [17:08:12] 10SRE, 10ops-codfw: Reset db2086's idrac - https://phabricator.wikimedia.org/T299882 (10Marostegui) Thank you! [17:10:31] RoanKattouw, Lucas_WMDE: whoops, sorry I'm late! looking now [17:10:52] thanks! [17:10:56] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/757025 (https://phabricator.wikimedia.org/T300057) (owner: 10MVernon) [17:13:51] (Device rebooted) firing: Device rebooted - https://alerts.wikimedia.org [17:14:01] (03CR) 10SBassett: [C: 03+2] Escape various messages in WikibaseMediaInfo [extensions/WikibaseMediaInfo] (wmf/1.38.0-wmf.19) - 10https://gerrit.wikimedia.org/r/756994 (https://phabricator.wikimedia.org/T299289) (owner: 10SBassett) [17:14:15] (03CR) 10SBassett: Escape various messages in WikibaseMediaInfo [extensions/WikibaseMediaInfo] (wmf/1.38.0-wmf.19) - 10https://gerrit.wikimedia.org/r/756994 (https://phabricator.wikimedia.org/T299289) (owner: 10SBassett) [17:14:42] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, 10netops: Q2:(Need By: TBD) Rows E/F network racking task - https://phabricator.wikimedia.org/T292095 (10Cmjohnson) cr1-eqiad et-1/0/2 ----> lsw1-e1-eqiad et-0/0/48 connected using patch panel number 2190001 and cable ID's lsw1-demarc (new ca... [17:16:05] (03CR) 10RLazarus: [C: 03+2] doc.wikimedia.org CSP: Allow XHR requests to Wikipedia and Wikidata [puppet] - 10https://gerrit.wikimedia.org/r/754048 (https://phabricator.wikimedia.org/T285570) (owner: 10Catrope) [17:16:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1022 (re)pooling @ 20%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19210 and previous config saved to /var/cache/conftool/dbconfig/20220125-171652-root.json [17:16:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:17:38] (03Abandoned) 10Muehlenhoff: oozie: Pass -Dlog4j2.formatMsgNoLookups=true to JVM opts [puppet] - 10https://gerrit.wikimedia.org/r/746802 (owner: 10Muehlenhoff) [17:17:59] (03PS1) 10Cwhite: opensearch: write curator logs to stdout [puppet] - 10https://gerrit.wikimedia.org/r/757040 [17:18:51] (Device rebooted) resolved: (2) Device rebooted - https://alerts.wikimedia.org [17:19:35] (03CR) 10Cwhite: [C: 03+2] opensearch: write curator logs 
to stdout [puppet] - 10https://gerrit.wikimedia.org/r/757040 (owner: 10Cwhite) [17:19:49] RoanKattouw: merged yours, and puppet is running on doc*, stand by to test :) [17:20:26] RoanKattouw: okay, give it a try [17:21:38] Lucas_WMDE: check me, it looks like the only affected host should be labstore1006.wm.o, is that correct? [17:21:50] I think so, yes [17:21:52] rzl: It works, thank you! [17:22:01] that rsync is only supposed to run on one host and IIRC it was indeed 1006 [17:22:30] (03CR) 10RLazarus: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33444/console" [puppet] - 10https://gerrit.wikimedia.org/r/755352 (https://phabricator.wikimedia.org/T299358) (owner: 10Lucas Werkmeister (WMDE)) [17:22:33] Lucas_WMDE: okay perfect -- PCC in parse_commit mode was trying to run on a hojillion machines for some reason :) [17:22:38] aha there we go [17:22:42] RoanKattouw: great, thanks! [17:23:16] (03CR) 10RLazarus: [V: 03+1 C: 03+2] nginxlogs: Move rsync globs to --include/--exclude [puppet] - 10https://gerrit.wikimedia.org/r/755352 (https://phabricator.wikimedia.org/T299358) (owner: 10Lucas Werkmeister (WMDE)) [17:23:26] (03CR) 10Ahmon Dancy: [C: 03+2] Merge remote-tracking branch 'origin/master' into train-dev [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/757036 (owner: 10Ahmon Dancy) [17:23:45] (Device rebooted) firing: Device rebooted - https://alerts.wikimedia.org [17:24:49] (03Merged) 10jenkins-bot: Merge remote-tracking branch 'origin/master' into train-dev [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/757036 (owner: 10Ahmon Dancy) [17:25:03] Lucas_WMDE: merged -- would you like me to run puppet on labstore1006? 
it looks like there's not much to test, unless you want to make sure the systemd unit looks as expected [17:25:25] I don’t think there’s much to test, the next run would be in 8 hours or so [17:25:38] and I could check if the files appear as expected [17:25:56] (so far we don’t even know that this was the only / relevant problem, but I’m pretty sure the code as it stood didn’t do the right thing) [17:26:36] nod [17:26:53] well, I'll leave it there :) good luck with the next run! let me know if I can do anything further [17:27:00] ok thanks! [17:27:17] (03CR) 10jerkins-bot: [V: 04-1] Escape various messages in WikibaseMediaInfo [extensions/WikibaseMediaInfo] (wmf/1.38.0-wmf.19) - 10https://gerrit.wikimedia.org/r/756994 (https://phabricator.wikimedia.org/T299289) (owner: 10SBassett) [17:28:45] (Device rebooted) firing: (2) Device rebooted - https://alerts.wikimedia.org [17:29:38] (03PS1) 10Cwhite: opensearch: restore curator logging filters [puppet] - 10https://gerrit.wikimedia.org/r/757041 [17:30:02] (03PS2) 10Cwhite: logstash: switch to opensearch output plugin on production logstash [puppet] - 10https://gerrit.wikimedia.org/r/755812 (https://phabricator.wikimedia.org/T299168) [17:31:08] (03CR) 10SBassett: "recheck" [extensions/WikibaseMediaInfo] (wmf/1.38.0-wmf.19) - 10https://gerrit.wikimedia.org/r/756994 (https://phabricator.wikimedia.org/T299289) (owner: 10SBassett) [17:31:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1022 (re)pooling @ 25%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19211 and previous config saved to /var/cache/conftool/dbconfig/20220125-173156-root.json [17:31:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:33:09] (03CR) 10Cwhite: [C: 03+2] opensearch: restore curator logging filters [puppet] - 10https://gerrit.wikimedia.org/r/757041 (owner: 10Cwhite) [17:33:19] (03PS1) 10Elukey: logstash: add docker support in the Makefile [puppet] - 
10https://gerrit.wikimedia.org/r/757042 (https://phabricator.wikimedia.org/T300051) [17:33:45] (Device rebooted) firing: (2) Device rebooted - https://alerts.wikimedia.org [17:36:36] (03CR) 10Elukey: "I am wondering if this could be made more DRY sharing the target, but I am not familiar with a way to do it (something matching podman|doc" [puppet] - 10https://gerrit.wikimedia.org/r/757042 (https://phabricator.wikimedia.org/T300051) (owner: 10Elukey) [17:36:54] (03CR) 10SBassett: [C: 03+2] Escape various messages in WikibaseMediaInfo [extensions/WikibaseMediaInfo] (wmf/1.38.0-wmf.19) - 10https://gerrit.wikimedia.org/r/756994 (https://phabricator.wikimedia.org/T299289) (owner: 10SBassett) [17:37:28] (03PS1) 10JHathaway: icinga: add additional users to fr-tech-ops [puppet] - 10https://gerrit.wikimedia.org/r/757043 (https://phabricator.wikimedia.org/T298649) [17:38:27] (03CR) 10JHathaway: "kindly review" [puppet] - 10https://gerrit.wikimedia.org/r/757043 (https://phabricator.wikimedia.org/T298649) (owner: 10JHathaway) [17:38:45] (Device rebooted) resolved: Device rebooted - https://alerts.wikimedia.org [17:40:41] 10SRE, 10Codex, 10WVUI, 10ContentSecurityPolicy, and 2 others: WVUI and Codex demos: CSP stopping typeahead input demos working - https://phabricator.wikimedia.org/T285570 (10Catrope) 05Open→03Resolved a:03Catrope [17:43:42] 10SRE, 10Codex, 10WVUI, 10ContentSecurityPolicy, 10SecTeam-Processed: WVUI and Codex demos: CSP stopping typeahead input demos working - https://phabricator.wikimedia.org/T285570 (10sbassett) [17:45:20] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.reimage for host es1028.eqiad.wmnet with OS bullseye [17:45:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:47:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1022 (re)pooling @ 40%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19212 and previous config saved to /var/cache/conftool/dbconfig/20220125-174659-root.json 
[17:47:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:48:34] (03PS2) 10Elukey: logstash: add docker support in the Makefile [puppet] - 10https://gerrit.wikimedia.org/r/757042 (https://phabricator.wikimedia.org/T300051) [17:48:51] 10SRE, 10Codex, 10WVUI, 10ContentSecurityPolicy, 10SecTeam-Processed: WVUI and Codex demos: CSP stopping typeahead input demos working - https://phabricator.wikimedia.org/T285570 (10Jdforrester-WMF) > `Refused to load the image 'https://upload.wikimedia.org/wikipedia/commons/thumb/7/78/…' because it viol... [17:53:52] (03Merged) 10jenkins-bot: Escape various messages in WikibaseMediaInfo [extensions/WikibaseMediaInfo] (wmf/1.38.0-wmf.19) - 10https://gerrit.wikimedia.org/r/756994 (https://phabricator.wikimedia.org/T299289) (owner: 10SBassett) [17:56:44] (03CR) 10Dzahn: [C: 03+1] "looks good now https://puppet-compiler.wmflabs.org/pcc-worker1003/33445/" [puppet] - 10https://gerrit.wikimedia.org/r/754063 (https://phabricator.wikimedia.org/T297411) (owner: 10Jelto) [17:57:13] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [17:57:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:57:19] (03CR) 10Dzahn: [C: 03+2] gitlab: update cloud hiera, refactor naming [puppet] - 10https://gerrit.wikimedia.org/r/754063 (https://phabricator.wikimedia.org/T297411) (owner: 10Jelto) [17:58:50] (03PS1) 10DCausse: aptrepo: add an elastic68 component [puppet] - 10https://gerrit.wikimedia.org/r/757046 (https://phabricator.wikimedia.org/T295666) [17:59:53] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [17:59:54] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [17:59:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:59:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[18:00:05] chrisalbon and accraze: That opportune time is upon us again. Time for a Services – Graphoid / ORES deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220125T1800). [18:00:43] (03CR) 10Dzahn: "noop (but compiler finds test instance) in cloud https://puppet-compiler.wmflabs.org/pcc-worker1003/33447/" [puppet] - 10https://gerrit.wikimedia.org/r/754063 (https://phabricator.wikimedia.org/T297411) (owner: 10Jelto) [18:00:50] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [18:00:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:02:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1022 (re)pooling @ 50%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19213 and previous config saved to /var/cache/conftool/dbconfig/20220125-180203-root.json [18:02:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:03:08] (03CR) 10Dzahn: "noop confirmed on gitlab1001 and gitlab2001 in prod" [puppet] - 10https://gerrit.wikimedia.org/r/754063 (https://phabricator.wikimedia.org/T297411) (owner: 10Jelto) [18:10:04] 10SRE, 10Codex, 10WVUI, 10ContentSecurityPolicy, 10SecTeam-Processed: WVUI and Codex demos: CSP stopping typeahead input demos working - https://phabricator.wikimedia.org/T285570 (10Catrope) 05Resolved→03Open Ugh. Yes. Thanks for pointing that out. 
[18:11:26] (03PS1) 10Catrope: doc.wikimedia.org CSP: Also allow images from upload.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/757049 (https://phabricator.wikimedia.org/T285570) [18:13:32] jouncebot next [18:13:32] In 0 hour(s) and 46 minute(s): Pre MediaWiki train break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220125T1900) [18:13:36] 10SRE, 10Codex, 10Security-Team, 10WVUI, and 3 others: WVUI and Codex demos: CSP stopping typeahead input demos working - https://phabricator.wikimedia.org/T285570 (10Catrope) Re-adding #security-team because we unfortunately needed a follow-up [18:14:56] !log train 1.38.0-wmf.19 (T293960): no open blockers, starting stage-train script shortly [18:15:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:15:01] T293960: 1.38.0-wmf.19 deployment blockers - https://phabricator.wikimedia.org/T293960 [18:17:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1022 (re)pooling @ 60%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19214 and previous config saved to /var/cache/conftool/dbconfig/20220125-181706-root.json [18:17:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:17:12] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host es1028.eqiad.wmnet with OS bullseye [18:17:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:21:07] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [18:21:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:23:57] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [18:23:59] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [18:24:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:24:02] Logged 
the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:24:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es1028 (T299911)', diff saved to https://phabricator.wikimedia.org/P19215 and previous config saved to /var/cache/conftool/dbconfig/20220125-182435-ladsgroup.json [18:24:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:24:39] T299911: Upgrade es3 to Bullseye - https://phabricator.wikimedia.org/T299911 [18:25:09] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [18:25:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:26:48] (03CR) 10DCausse: [C: 03+1] Upgrade to elasticsearch 6.8.23 [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/755750 (https://phabricator.wikimedia.org/T294499) (owner: 10EJoseph) [18:28:32] !log installing policykit-1 security updates on buster [18:28:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:29:29] !log jgiannelos@deploy1002 helmfile [staging] START helmfile.d/services/proton: apply on production [18:29:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:30:02] 10SRE, 10SRE-Access-Requests: Bing Webmaster Tools access request for Andrew Green - https://phabricator.wikimedia.org/T298723 (10faidon) Hi @AndyRussG - you mentioned that "//[Bing] has an option to import domain verifications from Google Search Console//"; is there another option, such as doing the Bing doma... 
[18:30:47] !log jgiannelos@deploy1002 helmfile [staging] DONE helmfile.d/services/proton: sync on production [18:30:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:31:53] !log jgiannelos@deploy1002 helmfile [codfw] START helmfile.d/services/proton: apply on production [18:31:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:32:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1022 (re)pooling @ 75%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19216 and previous config saved to /var/cache/conftool/dbconfig/20220125-183210-root.json [18:32:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:33:55] !log jgiannelos@deploy1002 helmfile [codfw] DONE helmfile.d/services/proton: sync on production [18:33:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:34:09] !log jgiannelos@deploy1002 helmfile [eqiad] START helmfile.d/services/proton: apply on production [18:34:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:36:38] jouncebot: nowandnext [18:36:38] For the next 0 hour(s) and 23 minute(s): Services – Graphoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220125T1800) [18:36:38] In 0 hour(s) and 23 minute(s): Pre MediaWiki train break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220125T1900) [18:38:12] !log jgiannelos@deploy1002 helmfile [eqiad] DONE helmfile.d/services/proton: sync on production [18:38:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:38:31] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, 10netops: Q2:(Need By: TBD) Rows E/F network racking task - https://phabricator.wikimedia.org/T292095 (10cmooney) @Cmjohnson thanks! Unfortunately seems to be some kind of problem. Neither are showing up. **cr1-eqiad et-1/0/2 ----> lsw1-e1-eq... 
[18:38:42] (03CR) 10Volans: "LGTM but missing one bit, see inline." [puppet] - 10https://gerrit.wikimedia.org/r/757043 (https://phabricator.wikimedia.org/T298649) (owner: 10JHathaway) [18:39:38] (03PS1) 10Ladsgroup: Revert "es1028: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/756996 [18:39:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es1028', diff saved to https://phabricator.wikimedia.org/P19217 and previous config saved to /var/cache/conftool/dbconfig/20220125-183940-ladsgroup.json [18:39:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:39:44] (03PS2) 10Ladsgroup: Revert "es1028: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/756996 [18:41:06] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] Revert "es1028: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/756996 (owner: 10Ladsgroup) [18:44:33] !log jelto@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host gitlab-runner1001.eqiad.wmnet [18:44:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:45:27] 10SRE, 10SRE-Access-Requests: Bing Webmaster Tools access request for Andrew Green - https://phabricator.wikimedia.org/T298723 (10AndyRussG) Thanks so much, @faidon! Yes, there is another option, and yes, agreed it would definitely be preferable. The other option is, "Add your site manually: Add your site to B... 
[18:47:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1022 (re)pooling @ 100%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19218 and previous config saved to /var/cache/conftool/dbconfig/20220125-184714-root.json [18:47:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:47:21] (03PS3) 10Jbond: profile: drop individual prometheus node ferm rules [puppet] - 10https://gerrit.wikimedia.org/r/757010 [18:50:19] (03PS4) 10Jbond: profile: drop individual prometheus node ferm rules [puppet] - 10https://gerrit.wikimedia.org/r/757010 [18:54:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es1028', diff saved to https://phabricator.wikimedia.org/P19219 and previous config saved to /var/cache/conftool/dbconfig/20220125-185444-ladsgroup.json [18:54:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:55:28] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [18:55:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:56:30] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [18:56:31] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [18:56:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:56:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:57:27] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [18:57:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:57:45] (03PS1) 10Brennen Bearnes: testwikis wikis to 1.38.0-wmf.19 refs T293960 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/757055 [18:57:47] (03CR) 10Brennen Bearnes: [C: 03+2] testwikis wikis to 1.38.0-wmf.19 refs T293960 [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/757055 (owner: 10Brennen Bearnes) [18:58:29] (03Merged) 10jenkins-bot: testwikis wikis to 1.38.0-wmf.19 refs T293960 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/757055 (owner: 10Brennen Bearnes) [18:58:34] !log brennen@deploy1002 Started scap: testwikis wikis to 1.38.0-wmf.19 refs T293960 [18:58:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:58:38] T293960: 1.38.0-wmf.19 deployment blockers - https://phabricator.wikimedia.org/T293960 [19:00:04] Deploy window Pre MediaWiki train break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220125T1900) [19:02:32] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [19:02:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:03:40] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [19:03:41] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [19:03:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:03:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:04:05] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 4 days, 0:00:00 on ganeti1006.eqiad.wmnet with reason: Remove from Ganeti cluster for reimage [19:04:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:04:07] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4 days, 0:00:00 on ganeti1006.eqiad.wmnet with reason: Remove from Ganeti cluster for reimage [19:04:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:04:49] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [19:04:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:05:24] (03PS3) 10Ahmon Dancy: 
MWMultiVersion.php: Flexible wikiversions file selection [mediawiki-config] - 10https://gerrit.wikimedia.org/r/756718 [19:05:46] 10SRE, 10ops-eqiad: Installation issues on PowerEdge R440 eqiad Ganeti servers with buster / firmware update needed - https://phabricator.wikimedia.org/T299527 (10MoritzMuehlenhoff) >>! In T299527#7649517, @Cmjohnson wrote: > @MoritzMuehlenhoff both 1014 and 1005 have been updated. Thanks, an additional serv... [19:05:58] 10SRE, 10ops-eqiad: Installation issues on PowerEdge R440 eqiad Ganeti servers with buster / firmware update needed - https://phabricator.wikimedia.org/T299527 (10MoritzMuehlenhoff) [19:06:40] (03PS4) 10Ahmon Dancy: MWMultiVersion.php: Flexible wikiversions file selection [mediawiki-config] - 10https://gerrit.wikimedia.org/r/756718 [19:07:41] (03CR) 10Ahmon Dancy: [C: 03+1] MWMultiVersion.php: Flexible wikiversions file selection [mediawiki-config] - 10https://gerrit.wikimedia.org/r/756718 (owner: 10Ahmon Dancy) [19:08:53] (03PS5) 10Jbond: profile: drop individual prometheus node ferm rules [puppet] - 10https://gerrit.wikimedia.org/r/757010 [19:09:05] (03CR) 10SBassett: doc.wikimedia.org CSP: Also allow images from upload.wikimedia.org (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/757049 (https://phabricator.wikimedia.org/T285570) (owner: 10Catrope) [19:09:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es1028 (T299911)', diff saved to https://phabricator.wikimedia.org/P19220 and previous config saved to /var/cache/conftool/dbconfig/20220125-190949-ladsgroup.json [19:09:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:09:54] T299911: Upgrade es3 to Bullseye - https://phabricator.wikimedia.org/T299911 [19:10:46] 10SRE, 10Codex, 10WVUI, 10ContentSecurityPolicy, and 2 others: WVUI and Codex demos: CSP stopping typeahead input demos working - https://phabricator.wikimedia.org/T285570 (10sbassett) >>! 
In T285570#7649901, @Catrope wrote: > Re-adding #security-team because we unfortunately needed a follow-up I left a c... [19:12:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Make es1028 master of es3 T299911', diff saved to https://phabricator.wikimedia.org/P19221 and previous config saved to /var/cache/conftool/dbconfig/20220125-191238-ladsgroup.json [19:12:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:15:44] PROBLEM - Maps - OSM synchronization lag - eqiad on alert1001 is CRITICAL: 7.569e+05 ge 2.592e+05 https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=11&fullscreen&orgId=1 [19:21:24] (03CR) 10Ahmon Dancy: "Hi Timo. I was looking at updateinterwikicache.py (in the scap codebase) recently and felt some of the same cringes that you described in" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/599147 (https://phabricator.wikimedia.org/T247107) (owner: 10Krinkle) [19:22:52] (03PS1) 10Dzahn: miscweb: bump version to 2022-01-25-150544-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/757060 [19:23:04] (03CR) 10jerkins-bot: [V: 04-1] miscweb: bump version to 2022-01-25-150544-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/757060 (owner: 10Dzahn) [19:25:02] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [19:25:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:30:42] (03PS2) 10Catrope: doc.wikimedia.org CSP: Also allow images from upload.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/757049 (https://phabricator.wikimedia.org/T285570) [19:31:11] (03CR) 10Catrope: doc.wikimedia.org CSP: Also allow images from upload.wikimedia.org (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/757049 (https://phabricator.wikimedia.org/T285570) (owner: 10Catrope) [19:31:14] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE 
helmfile.d/services/mwdebug: sync on pinkunicorn [19:31:15] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [19:31:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:31:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:33:08] 10SRE, 10ops-codfw, 10ops-eqiad: Installation issues on PowerEdge R440 Restbase servers with buster / firmware update needed - https://phabricator.wikimedia.org/T299652 (10Cmjohnson) let's go with restbase1019 @hnowlan [19:35:33] !log updating firmware ganeti1006 T299527 [19:35:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:35:37] T299527: Installation issues on PowerEdge R440 eqiad Ganeti servers with buster / firmware update needed - https://phabricator.wikimedia.org/T299527 [19:36:32] (03CR) 10Majavah: scap: Remove commit and sync steps from 'update-interwiki-cache' (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/599147 (https://phabricator.wikimedia.org/T247107) (owner: 10Krinkle) [19:37:43] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [19:37:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:38:20] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host backup1008.eqiad.wmnet with OS buster [19:38:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:38:27] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence-Backup: (Need By: TBD) rack/setup/install backup1008 - https://phabricator.wikimedia.org/T294974 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host backup1008.eqiad.wmnet with OS buster [19:39:16] (03CR) 10Jdlrobson: "recheck" [skins/Vector] (wmf/1.38.0-wmf.18) - 10https://gerrit.wikimedia.org/r/756696 (https://phabricator.wikimedia.org/T299971) (owner: 10Jdlrobson) [19:41:08] 10SRE, 
10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, 10netops: Q2:(Need By: TBD) Rows E/F network racking task - https://phabricator.wikimedia.org/T292095 (10Cmjohnson) a:05Jclark-ctr→03Cmjohnson [19:44:41] (03CR) 10Jbond: "Sorry its big but its all very simlar and simple, also pcc passes without issue 😊" [puppet] - 10https://gerrit.wikimedia.org/r/757010 (owner: 10Jbond) [19:45:13] (03PS6) 10Jbond: profile: drop individual prometheus node ferm rules [puppet] - 10https://gerrit.wikimedia.org/r/757010 [19:49:49] 10SRE, 10ops-eqiad: Installation issues on PowerEdge R440 eqiad Ganeti servers with buster / firmware update needed - https://phabricator.wikimedia.org/T299527 (10Cmjohnson) [19:50:05] 10SRE, 10ops-eqiad: Installation issues on PowerEdge R440 eqiad Ganeti servers with buster / firmware update needed - https://phabricator.wikimedia.org/T299527 (10Cmjohnson) 1006 has been updated [19:50:35] !log brennen@deploy1002 Finished scap: testwikis wikis to 1.38.0-wmf.19 refs T293960 (duration: 52m 01s) [19:50:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:50:40] T293960: 1.38.0-wmf.19 deployment blockers - https://phabricator.wikimedia.org/T293960 [19:53:56] (03CR) 10SBassett: doc.wikimedia.org CSP: Also allow images from upload.wikimedia.org (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/757049 (https://phabricator.wikimedia.org/T285570) (owner: 10Catrope) [19:54:07] PROBLEM - SSH on mw2257.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:00:04] brennen and jeena: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) MediaWiki train - Utc-7 Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220125T2000). 
[20:00:59] o/
[20:01:20] o/
[20:01:22] !log train 1.38.0-wmf.19 (T293960): testwiki sync finished, still no open blockers, proceeding to group0
[20:01:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:01:27] T293960: 1.38.0-wmf.19 deployment blockers - https://phabricator.wikimedia.org/T293960
[20:02:38] (PS1) Brennen Bearnes: group0 wikis to 1.38.0-wmf.19 refs T293960 [mediawiki-config] - https://gerrit.wikimedia.org/r/757077
[20:02:40] (CR) Brennen Bearnes: [C: +2] group0 wikis to 1.38.0-wmf.19 refs T293960 [mediawiki-config] - https://gerrit.wikimedia.org/r/757077 (owner: Brennen Bearnes)
[20:03:45] (Merged) jenkins-bot: group0 wikis to 1.38.0-wmf.19 refs T293960 [mediawiki-config] - https://gerrit.wikimedia.org/r/757077 (owner: Brennen Bearnes)
[20:03:57] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host backup1008.eqiad.wmnet with OS buster
[20:03:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:04:04] SRE, ops-eqiad, DC-Ops, Data-Persistence-Backup: (Need By: TBD) rack/setup/install backup1008 - https://phabricator.wikimedia.org/T294974 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host backup1008.eqiad.wmnet with OS buster completed: - backup1008...
[20:05:07] !log brennen@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.38.0-wmf.19 refs T293960
[20:05:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:08:09] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn
[20:08:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:10:53] (PS1) Jdlrobson: Do not load common.js twice [skins/Vector] (wmf/1.38.0-wmf.18) - https://gerrit.wikimedia.org/r/756997 (https://phabricator.wikimedia.org/T300070)
[20:11:51] (PS1) Jdlrobson: Do not load common.js twice [skins/Vector] (wmf/1.38.0-wmf.19) - https://gerrit.wikimedia.org/r/756998 (https://phabricator.wikimedia.org/T300070)
[20:12:23] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[20:12:24] (Abandoned) BBlack: drmrs: ncredir puppetization [puppet] - https://gerrit.wikimedia.org/r/748790 (https://phabricator.wikimedia.org/T282787) (owner: BBlack)
[20:12:25] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn
[20:12:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:12:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:13:48] SRE, ops-eqiad, DC-Ops, Data-Persistence-Backup: (Need By: TBD) rack/setup/install backup1008 - https://phabricator.wikimedia.org/T294974 (Cmjohnson)
[20:14:09] (PS1) Jdlrobson: Enable migration mode on Italian and MediaWIki.org [mediawiki-config] - https://gerrit.wikimedia.org/r/757087 (https://phabricator.wikimedia.org/T299927)
[20:14:18] SRE, ops-eqiad, DC-Ops, Data-Persistence-Backup: (Need By: TBD) rack/setup/install backup1008 - https://phabricator.wikimedia.org/T294974 (Cmjohnson) Open→Resolved ready!
[20:16:33] (CR) Cwhite: [C: +2] logstash: switch to opensearch output plugin on production logstash [puppet] - https://gerrit.wikimedia.org/r/755812 (https://phabricator.wikimedia.org/T299168) (owner: Cwhite)
[20:17:32] !log begin transition to logstash output opensearch plugin T299168
[20:17:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:17:36] T299168: Upgrade OpenSearch - https://phabricator.wikimedia.org/T299168
[20:17:56] SRE, Patch-For-Review, Service-deployment-requests: New Service Request miscweb - https://phabricator.wikimedia.org/T281538 (Dzahn) https://wikitech.wikimedia.org/wiki/Miscweb#How_this_service_was_made
[20:18:22] (PS5) BBlack: drmrs: various minor global config [puppet] - https://gerrit.wikimedia.org/r/748757 (https://phabricator.wikimedia.org/T282787)
[20:18:54] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[20:18:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:20:31] (CR) jerkins-bot: [V: -1] Enable migration mode on Italian and MediaWIki.org [mediawiki-config] - https://gerrit.wikimedia.org/r/757087 (https://phabricator.wikimedia.org/T299927) (owner: Jdlrobson)
[20:25:13] (PS1) Cmjohnson: Updating netboot.cfg and site.pp with cloudvirt1047 [puppet] - https://gerrit.wikimedia.org/r/757091 (https://phabricator.wikimedia.org/T293391)
[20:27:33] SRE, ops-codfw: Degraded RAID on restbase2011 - https://phabricator.wikimedia.org/T299871 (Eevans) >>! In T299871#7646679, @wiki_willy wrote: > Hi @Eevans - since the refresh for this host was just installed via T294377, are you ok if we ignore this alert and resolve the ticket? Thanks, Willy Yes, that...
[20:28:39] (CR) Cmjohnson: [C: +2] Updating netboot.cfg and site.pp with cloudvirt1047 [puppet] - https://gerrit.wikimedia.org/r/757091 (https://phabricator.wikimedia.org/T293391) (owner: Cmjohnson)
[20:28:53] (PS17) Herron: prometheus: add blackbox generic "watchrat" http/s static check support [puppet] - https://gerrit.wikimedia.org/r/747550 (https://phabricator.wikimedia.org/T292603)
[20:29:02] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn
[20:29:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:29:07] (CR) Herron: prometheus: add blackbox generic "watchrat" http/s static check support (2 comments) [puppet] - https://gerrit.wikimedia.org/r/747550 (https://phabricator.wikimedia.org/T292603) (owner: Herron)
[20:29:15] (PS18) Herron: prometheus: add blackbox generic "watchrat" http/s static check support [puppet] - https://gerrit.wikimedia.org/r/747550 (https://phabricator.wikimedia.org/T292603)
[20:30:59] SRE, ops-eqiad, DC-Ops, Patch-For-Review, cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudvirt1047.eqiad.wmnet - https://phabricator.wikimedia.org/T293391 (Cmjohnson)
[20:31:37] (PS2) Jdlrobson: Enable migration mode on Italian and MediaWIki.org [mediawiki-config] - https://gerrit.wikimedia.org/r/757087 (https://phabricator.wikimedia.org/T299927)
[20:35:15] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[20:35:16] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn
[20:35:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:35:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:38:03] (CR) Herron: "LGTM overall! please see comment inline" [puppet] - https://gerrit.wikimedia.org/r/756979 (https://phabricator.wikimedia.org/T296199) (owner: Filippo Giunchedi)
[20:41:32] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[20:41:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:42:21] (CR) BBlack: [C: +2] drmrs: various minor global config [puppet] - https://gerrit.wikimedia.org/r/748757 (https://phabricator.wikimedia.org/T282787) (owner: BBlack)
[20:42:44] ^ this may cause some alert spam, pre-apoligies (will be trying to prevent it, but it's never perfect)
[20:47:41] (KubernetesRsyslogDown) firing: rsyslog on kubestage1004:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org
[20:50:47] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to Superset for Margeigh Novotny - https://phabricator.wikimedia.org/T299072 (jhathaway)
[20:52:31] (PS1) Andrew Bogott: Revert "wmcs: monitoring: make cloudmetrics1001 the primary" [puppet] - https://gerrit.wikimedia.org/r/757097 (https://phabricator.wikimedia.org/T300011)
[20:55:17] (PS2) Andrew Bogott: Revert "wmcs: monitoring: make cloudmetrics1001 the primary" [puppet] - https://gerrit.wikimedia.org/r/757097 (https://phabricator.wikimedia.org/T300011)
[20:57:15] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to Superset for Margeigh Novotny - https://phabricator.wikimedia.org/T299072 (jhathaway) @odimitrijevic would you kindly approve her access. We are skipping the managerial approval as okayed by @faidon on IRC.
[20:57:24] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to Superset for Margeigh Novotny - https://phabricator.wikimedia.org/T299072 (jhathaway) a:jhathaway
[21:00:14] (PS3) Andrew Bogott: Revert "wmcs: monitoring: make cloudmetrics1001 the primary" [puppet] - https://gerrit.wikimedia.org/r/757097 (https://phabricator.wikimedia.org/T300011)
[21:03:07] !log end transition to logstash output opensearch plugin T299168
[21:03:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:03:12] T299168: Upgrade OpenSearch - https://phabricator.wikimedia.org/T299168
[21:04:13] (CR) Andrew Bogott: [C: +2] Revert "wmcs: monitoring: make cloudmetrics1001 the primary" [puppet] - https://gerrit.wikimedia.org/r/757097 (https://phabricator.wikimedia.org/T300011) (owner: Andrew Bogott)
[21:11:38] (PS1) Andrew Bogott: Revert "wmnet: make cloudmetrics1001 the backend of grafana/graphite endpoints" [dns] - https://gerrit.wikimedia.org/r/757101 (https://phabricator.wikimedia.org/T300011)
[21:13:27] (CR) Andrew Bogott: [C: +2] Revert "wmnet: make cloudmetrics1001 the backend of grafana/graphite endpoints" [dns] - https://gerrit.wikimedia.org/r/757101 (https://phabricator.wikimedia.org/T300011) (owner: Andrew Bogott)
[21:15:01] (PS1) Andrew Bogott: Make cloudmetrics1003 the primary cloudmetrics host, again. [dns] - https://gerrit.wikimedia.org/r/757102 (https://phabricator.wikimedia.org/T297814)
[21:16:21] (CR) Andrew Bogott: [C: +2] Make cloudmetrics1003 the primary cloudmetrics host, again. [dns] - https://gerrit.wikimedia.org/r/757102 (https://phabricator.wikimedia.org/T297814) (owner: Andrew Bogott)
[21:20:37] !log bblack@cumin1001 conftool action : set/weight=1; selector: dc=drmrs,service=ats-tls
[21:20:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:20:48] !log bblack@cumin1001 conftool action : set/weight=1; selector: dc=drmrs,service=varnish-fe
[21:20:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:20:57] !log bblack@cumin1001 conftool action : set/weight=100; selector: dc=drmrs,service=ats-be
[21:21:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:26:36] (PS4) RLazarus: imagecatalog: Add an hourly systemd timer to scan for what's currently running [puppet] - https://gerrit.wikimedia.org/r/748876 (https://phabricator.wikimedia.org/T287130)
[21:27:58] (CR) RLazarus: [C: +2] imagecatalog: Add an hourly systemd timer to scan for what's currently running [puppet] - https://gerrit.wikimedia.org/r/748876 (https://phabricator.wikimedia.org/T287130) (owner: RLazarus)
[21:42:32] (PS4) AGueyte: WIP: Update Event Stream for IPInfo events [mediawiki-config] - https://gerrit.wikimedia.org/r/756635 (https://phabricator.wikimedia.org/T296415)
[21:43:07] (CR) AGueyte: WIP: Update Event Stream for IPInfo events (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/756635 (https://phabricator.wikimedia.org/T296415) (owner: AGueyte)
[21:43:34] (CR) jerkins-bot: [V: -1] WIP: Update Event Stream for IPInfo events [mediawiki-config] - https://gerrit.wikimedia.org/r/756635 (https://phabricator.wikimedia.org/T296415) (owner: AGueyte)
[21:51:06] (PS5) AGueyte: WIP: Update Event Stream for IPInfo events [mediawiki-config] - https://gerrit.wikimedia.org/r/756635 (https://phabricator.wikimedia.org/T296415)
[21:51:49] (CR) jerkins-bot: [V: -1] WIP: Update Event Stream for IPInfo events [mediawiki-config] - https://gerrit.wikimedia.org/r/756635 (https://phabricator.wikimedia.org/T296415) (owner: AGueyte)
[21:53:01] (PS1) Andrew Bogott: Make cloudmetrics1003/1004 monitoring hosts, 1001/1002 spare systems. [puppet] - https://gerrit.wikimedia.org/r/757110 (https://phabricator.wikimedia.org/T300011)
[21:54:04] (CR) Andrew Bogott: [C: +2] Make cloudmetrics1003/1004 monitoring hosts, 1001/1002 spare systems. [puppet] - https://gerrit.wikimedia.org/r/757110 (https://phabricator.wikimedia.org/T300011) (owner: Andrew Bogott)
[21:55:31] RECOVERY - SSH on mw2257.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[22:01:35] (PS2) CDanis: Add a start_timestamp constraint [software/statograph] - https://gerrit.wikimedia.org/r/756041 (https://phabricator.wikimedia.org/T298619)
[22:02:27] (KubernetesRsyslogDown) resolved: rsyslog on kubestage1004:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org
[22:03:05] SRE, SRE-Access-Requests, Research: Access to analytics-privatedata-users for Research intern AniketArs - https://phabricator.wikimedia.org/T299919 (jhathaway) @AniketArs thanks for the followup. Unfortunately since you pasted your private ssh key in your first message, you will need to regenerate bo...
[22:03:24] (PS3) CDanis: Add a start_timestamp constraint [software/statograph] - https://gerrit.wikimedia.org/r/756041 (https://phabricator.wikimedia.org/T298619)
[22:03:51] PROBLEM - Check systemd state on deploy2002 is CRITICAL: CRITICAL - degraded: The following units failed: imagecatalog_record.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:04:21] ^ that's me, looking
[22:05:21] (PS4) CDanis: Add a start_timestamp constraint [software/statograph] - https://gerrit.wikimedia.org/r/756041 (https://phabricator.wikimedia.org/T298619)
[22:07:37] (CR) CDanis: [C: +2] Add a start_timestamp constraint (2 comments) [software/statograph] - https://gerrit.wikimedia.org/r/756041 (https://phabricator.wikimedia.org/T298619) (owner: CDanis)
[22:09:50] (Merged) jenkins-bot: Add a start_timestamp constraint [software/statograph] - https://gerrit.wikimedia.org/r/756041 (https://phabricator.wikimedia.org/T298619) (owner: CDanis)
[22:14:07] (PS2) CDanis: sre.network.cf: Provide some advice in the event of errors [cookbooks] - https://gerrit.wikimedia.org/r/691275
[22:14:11] SRE, SRE-Access-Requests: Requesting update to SSH key and Kerberos for Joseph Seddon - https://phabricator.wikimedia.org/T299988 (jhathaway) @Seddon this is my first time handling access requests, so I apologize if I get something wrong. My understanding is that for shell access we ask folks to sign the...
[22:14:37] SRE, SRE-Access-Requests: Requesting update to SSH key and Kerberos for Joseph Seddon - https://phabricator.wikimedia.org/T299988 (jhathaway) a:jhathaway
[22:15:39] (PS3) CDanis: sre.network.cf: Provide some advice in the event of errors [cookbooks] - https://gerrit.wikimedia.org/r/691275
[22:15:52] (CR) CDanis: sre.network.cf: Provide some advice in the event of errors (2 comments) [cookbooks] - https://gerrit.wikimedia.org/r/691275 (owner: CDanis)
[22:16:27] (Abandoned) CDanis: esams-offline: route heavy EU bytes users to codfw [dns] - https://gerrit.wikimedia.org/r/574493 (owner: CDanis)
[22:23:09] SRE, SRE-Access-Requests: Requesting Google Search Console Access for a Service Account - https://phabricator.wikimedia.org/T300004 (jhathaway) a:jhathaway
[22:23:32] SRE, SRE-Access-Requests: Requesting Google Search Console Access for a Service Account - https://phabricator.wikimedia.org/T300004 (jhathaway) @SCherukuwada this sounds reasonable, but since this is my first time approving Google Search Console access I am going to discuss with @Volans before approving,...
[22:35:28] SRE, SRE-OnFire, Sustainability (Incident Followup): Incident: 2021-12-03 mx2001->Gmail delivery issues - https://phabricator.wikimedia.org/T297127 (Dzahn) >>! In T297127#7618429, @Platonides wrote: > I would have expected the wikitech timeline to contain a final entry for "mx2001" back into (i.e. T2...
[22:36:29] (PS2) JHathaway: icinga: add additional users to fr-tech-ops [puppet] - https://gerrit.wikimedia.org/r/757043 (https://phabricator.wikimedia.org/T298649)
[22:37:25] (CR) JHathaway: icinga: add additional users to fr-tech-ops (1 comment) [puppet] - https://gerrit.wikimedia.org/r/757043 (https://phabricator.wikimedia.org/T298649) (owner: JHathaway)
[22:51:45] (PS3) Ryan Kemper: elasticsearch: hiera for new eqiad nodes (step 1) [puppet] - https://gerrit.wikimedia.org/r/736116 (https://phabricator.wikimedia.org/T294805)
[22:51:47] (PS3) Ryan Kemper: elasticsearch: activate role (step 2) [puppet] - https://gerrit.wikimedia.org/r/736117 (https://phabricator.wikimedia.org/T294805)
[22:51:49] (PS3) Ryan Kemper: elasticsearch: new master config (step 3) [puppet] - https://gerrit.wikimedia.org/r/736118 (https://phabricator.wikimedia.org/T294805)
[22:51:51] (PS5) Ryan Kemper: elasticsearch: decom elastic10[32-47] (step 4) [puppet] - https://gerrit.wikimedia.org/r/736119 (https://phabricator.wikimedia.org/T294805)
[22:52:07] (CR) Ryan Kemper: elasticsearch: hiera for new eqiad nodes (step 1) (1 comment) [puppet] - https://gerrit.wikimedia.org/r/736116 (https://phabricator.wikimedia.org/T294805) (owner: Ryan Kemper)
[22:52:23] (CR) Ryan Kemper: elasticsearch: hiera for new eqiad nodes (step 1) (1 comment) [puppet] - https://gerrit.wikimedia.org/r/736116 (https://phabricator.wikimedia.org/T294805) (owner: Ryan Kemper)
[22:52:52] (PS4) Ryan Kemper: elasticsearch: hiera for new eqiad nodes (step 1) [puppet] - https://gerrit.wikimedia.org/r/736116 (https://phabricator.wikimedia.org/T294805)
[22:52:54] (PS4) Ryan Kemper: elasticsearch: activate role (step 2) [puppet] - https://gerrit.wikimedia.org/r/736117 (https://phabricator.wikimedia.org/T294805)
[22:52:56] (PS4) Ryan Kemper: elasticsearch: new master config (step 3) [puppet] - https://gerrit.wikimedia.org/r/736118 (https://phabricator.wikimedia.org/T294805)
[22:52:58] (PS6) Ryan Kemper: elasticsearch: decom elastic10[32-47] (step 4) [puppet] - https://gerrit.wikimedia.org/r/736119 (https://phabricator.wikimedia.org/T294805)
[22:57:02] SRE, SRE-Access-Requests: NRodriguez uses the same SSH key(s) in WMCS and production - https://phabricator.wikimedia.org/T299336 (NRodriguez) Big apologies for the hiccup, I've generated a new key: > ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAINSGlZdKKkUD0ra0jpnABXYQXRLowZe/q3fm49cDVGkM nrodriguez@wikimedia.org
[22:57:35] SRE, SRE-Access-Requests: NRodriguez uses the same SSH key(s) in WMCS and production - https://phabricator.wikimedia.org/T299336 (NRodriguez) a:NRodriguez→Jelto
[22:59:27] (PS7) Ryan Kemper: elasticsearch: decom elastic10[32-47] (step 4) [puppet] - https://gerrit.wikimedia.org/r/736119 (https://phabricator.wikimedia.org/T294805)
[23:00:51] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 107 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[23:13:15] (PS5) Ryan Kemper: elasticsearch: new master config (step 3) [puppet] - https://gerrit.wikimedia.org/r/736118 (https://phabricator.wikimedia.org/T294805)
[23:13:17] (PS8) Ryan Kemper: elasticsearch: decom elastic10[32-47] (step 4) [puppet] - https://gerrit.wikimedia.org/r/736119 (https://phabricator.wikimedia.org/T294805)
[23:13:47] (CR) Ryan Kemper: elasticsearch: new master config (step 3) (1 comment) [puppet] - https://gerrit.wikimedia.org/r/736118 (https://phabricator.wikimedia.org/T294805) (owner: Ryan Kemper)
[23:16:26] (CR) Ryan Kemper: [C: +2] elasticsearch: hiera for new eqiad nodes (step 1) [puppet] - https://gerrit.wikimedia.org/r/736116 (https://phabricator.wikimedia.org/T294805) (owner: Ryan Kemper)
[23:20:23] !log T294805 [Elastic] Merged https://gerrit.wikimedia.org/r/736116, step 1 of bringing new eqiad 10G refresh hosts into service
[23:20:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:20:28] T294805: Service implementation for elastic10[68-83].eqiad.wmnet - https://phabricator.wikimedia.org/T294805
[23:33:48] (PS9) Ryan Kemper: elasticsearch: decom elastic10[32-47] (step 4) [puppet] - https://gerrit.wikimedia.org/r/736119 (https://phabricator.wikimedia.org/T294805)
[23:42:22] !log T294805 [Elastic] Step 2: Disabling puppet in advance of merge of https://gerrit.wikimedia.org/r/c/operations/puppet/+/736117
[23:42:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:42:26] T294805: Service implementation for elastic10[68-83].eqiad.wmnet - https://phabricator.wikimedia.org/T294805
[23:42:50] (CR) Ryan Kemper: [C: +2] elasticsearch: activate role (step 2) [puppet] - https://gerrit.wikimedia.org/r/736117 (https://phabricator.wikimedia.org/T294805) (owner: Ryan Kemper)
[23:44:12] (PS1) Ryan Kemper: Revert "elasticsearch: activate role (step 2)" [puppet] - https://gerrit.wikimedia.org/r/757002
[23:45:11] (CR) Ryan Kemper: [C: +2] Revert "elasticsearch: activate role (step 2)" [puppet] - https://gerrit.wikimedia.org/r/757002 (owner: Ryan Kemper)
[23:50:23] (PS1) Ryan Kemper: Revert "Revert "elasticsearch: activate role (step 2)"" [puppet] - https://gerrit.wikimedia.org/r/757003
[23:54:32] (PS2) Ryan Kemper: elasticsearch: activate role (step 2) [puppet] - https://gerrit.wikimedia.org/r/757003 (https://phabricator.wikimedia.org/T294805)