[00:00:05] RoanKattouw and Urbanecm: #bothumor I � Unicode. All rise for UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220126T0000).
[00:00:05] dontpanic and Jdlrobson: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[00:00:47] (PS6) Ryan Kemper: elasticsearch: new master config (step 3) [puppet] - https://gerrit.wikimedia.org/r/736118 (https://phabricator.wikimedia.org/T294805)
[00:00:49] (PS10) Ryan Kemper: elasticsearch: decom elastic10[32-47] (step 4) [puppet] - https://gerrit.wikimedia.org/r/736119 (https://phabricator.wikimedia.org/T294805)
[00:02:10] present
[00:03:05] (CR) Ryan Kemper: [C: +2] elasticsearch: activate role (step 2) [puppet] - https://gerrit.wikimedia.org/r/757003 (https://phabricator.wikimedia.org/T294805) (owner: Ryan Kemper)
[00:03:24] I can deploy
[00:03:36] !log T294805 Merged https://gerrit.wikimedia.org/r/c/operations/puppet/+/757003; running puppet on `elastic1068` to make it join the fleet
[00:03:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:03:42] T294805: Service implementation for elastic10[68-83].eqiad.wmnet - https://phabricator.wikimedia.org/T294805
[00:03:42] thanks RoanKattouw
[00:04:29] (PS1) Catrope: Untie Wikimedia message boxes from on-wiki messageboxes [core] (wmf/1.38.0-wmf.18) - https://gerrit.wikimedia.org/r/757004 (https://phabricator.wikimedia.org/T270796)
[00:05:55] (CR) Catrope: [C: +2] Do not load common.js twice [skins/Vector] (wmf/1.38.0-wmf.18) - https://gerrit.wikimedia.org/r/756997 (https://phabricator.wikimedia.org/T300070) (owner: Jdlrobson)
[00:06:01] (CR) Catrope: [C: +2] Do not load common.js twice [skins/Vector] (wmf/1.38.0-wmf.19) - https://gerrit.wikimedia.org/r/756998 (https://phabricator.wikimedia.org/T300070) (owner: 
Jdlrobson)
[00:06:28] Jdlrobson: https://gerrit.wikimedia.org/r/c/mediawiki/core/+/755997 is unmerged in master and has an unmerged Depends-On
[00:07:30] I created a cherry-pick without noticing those issues, but 1) it's best practice to ensure patches are merged before submitting them for deployment, 2) with the unmerged Depends-On CI will refuse to merge the patch unless I manually remove that directive
[00:08:03] (Abandoned) Catrope: Untie Wikimedia message boxes from on-wiki messageboxes [core] (wmf/1.38.0-wmf.18) - https://gerrit.wikimedia.org/r/757004 (https://phabricator.wikimedia.org/T270796) (owner: Catrope)
[00:08:45] dontpanic: Are you here for your bgwiki deployment?
[00:08:53] (CR) Cwhite: logstash: add docker support in the Makefile (1 comment) [puppet] - https://gerrit.wikimedia.org/r/757042 (https://phabricator.wikimedia.org/T300051) (owner: Elukey)
[00:09:18] (PS1) Ryan Kemper: Revert "elasticsearch: activate role (step 2)" [puppet] - https://gerrit.wikimedia.org/r/757005 (https://phabricator.wikimedia.org/T294805)
[00:10:01] PROBLEM - dump of es4 in eqiad on alert1001 is CRITICAL: dump for es4 at eqiad taken more than 8 days ago: Most recent backup 2022-01-18 00:00:01 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting
[00:10:13] (CR) Ryan Kemper: [C: +2] Revert "elasticsearch: activate role (step 2)" [puppet] - https://gerrit.wikimedia.org/r/757005 (https://phabricator.wikimedia.org/T294805) (owner: Ryan Kemper)
[00:10:29] RoanKattouw: looks like i put the wrong patch in the deploy calendar sorry
[00:10:30] Jdlrobson: How would you like to proceed? Your "do not load twice" patches are good to go once they go through CI (usually takes ~20 mins), but the "untie messageboxes" patch needs review and has a dependency. 
The config patch can be deployed in theory, but based on how you ordered your request I'm guessing you wanted that to go out last after all the bug fixes
[00:10:34] Oooh OK
[00:11:16] oh sorry i misunderstood
[00:11:34] So I'm looking to backport the do not load twice patches
[00:11:43] !log T294805 Reverted https://gerrit.wikimedia.org/r/c/operations/puppet/+/757003 (elasticsearch-oss dependency issues, will pick this back up tomorrow); re-enabling puppet across elastic1*
[00:11:46] and this config patch: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/757087
[00:11:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:11:52] T294805: Service implementation for elastic10[68-83].eqiad.wmnet - https://phabricator.wikimedia.org/T294805
[00:12:00] not sure "Untie Wikimedia message boxes from on-wiki messageboxes" should not be on the calendar
[00:12:18] OK, I'll just skip the untie patch then
[00:12:30] https://gerrit.wikimedia.org/r/c/mediawiki/skins/Vector/+/756696
[00:12:33] is the one I meant to link to
[00:12:35] Updating calendar
[00:12:44] (CR) Catrope: [C: +2] Fix bug in SkinVersionLookup [skins/Vector] (wmf/1.38.0-wmf.18) - https://gerrit.wikimedia.org/r/756696 (https://phabricator.wikimedia.org/T299971) (owner: Jdlrobson)
[00:12:59] I've updated calendar
[00:13:03] Thanks!
[00:13:11] copy and paste fail :)
[00:15:38] (CR) Cwhite: "Good enough to ship." 
[puppet] - https://gerrit.wikimedia.org/r/756959 (https://phabricator.wikimedia.org/T299999) (owner: Elukey)
[00:17:53] PROBLEM - Check systemd state on ms-be2042 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:18:40] (CR) Cwhite: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/756979 (https://phabricator.wikimedia.org/T296199) (owner: Filippo Giunchedi)
[00:29:58] (CR) jerkins-bot: [V: -1] Do not load common.js twice [skins/Vector] (wmf/1.38.0-wmf.18) - https://gerrit.wikimedia.org/r/756997 (https://phabricator.wikimedia.org/T300070) (owner: Jdlrobson)
[00:32:34] RoanKattouw: CI error is unrelated
[00:32:53] Error in "add link.link inspector appears after clicking through task from Special:Homepage" has been flaking a bit recently
[00:33:02] will likely need to try merging again
[00:33:02] :/
[00:33:27] (CR) Catrope: [V: +2 C: +2] Do not load common.js twice [skins/Vector] (wmf/1.38.0-wmf.18) - https://gerrit.wikimedia.org/r/756997 (https://phabricator.wikimedia.org/T300070) (owner: Jdlrobson)
[00:34:03] I've just overridden it, I didn't want to wait another 20 mins
[00:34:13] Or, almost 30 rather
[00:35:19] PROBLEM - Check whether ferm is active by checking the default input chain on ms-be2042 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[00:36:35] (PS1) Ebernhardson: Correct wcqs event stream name [mediawiki-config] - https://gerrit.wikimedia.org/r/757122
[00:37:41] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn
[00:37:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:37:52] RoanKattouw: i know the feeling
[00:39:52] (Merged) jenkins-bot: Do not load common.js twice [skins/Vector] (wmf/1.38.0-wmf.19) - 
https://gerrit.wikimedia.org/r/756998 (https://phabricator.wikimedia.org/T300070) (owner: Jdlrobson)
[00:39:58] (Merged) jenkins-bot: Fix bug in SkinVersionLookup [skins/Vector] (wmf/1.38.0-wmf.18) - https://gerrit.wikimedia.org/r/756696 (https://phabricator.wikimedia.org/T299971) (owner: Jdlrobson)
[00:40:44] SRE, SRE-Access-Requests: Requesting update to SSH key and Kerberos for Joseph Seddon - https://phabricator.wikimedia.org/T299988 (Seddon)
[00:41:27] RECOVERY - Check systemd state on ms-be2042 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:44:21] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[00:44:22] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn
[00:44:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:44:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:45:12] Alright, looks like everything is finally merged
[00:46:45] Jdlrobson: OK, your wmf.18/wmf.19 patches are on mwdebug1002, please test
[00:46:57] If things look good I'll merge the config patch and get it up there too
[00:47:08] Or, alternatively, if it's hard to test without the config patch I can merge it now
[00:47:26] RoanKattouw: looking
[00:48:01] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[00:48:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:51:20] ok SkinVersionLookup is good. 
Just testing common.js now
[00:51:59] common.js also good
[00:52:17] OK great
[00:52:18] Ah we didn't merge https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/757087 yet
[00:52:21] okay so that's all good to ship
[00:52:32] OK, I'll deploy those and merge the config patch in the meantime
[00:52:38] sweet
[00:53:06] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn
[00:53:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:53:56] (PS3) Catrope: Enable migration mode on Italian and MediaWIki.org [mediawiki-config] - https://gerrit.wikimedia.org/r/757087 (https://phabricator.wikimedia.org/T299927) (owner: Jdlrobson)
[00:54:05] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[00:54:06] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn
[00:54:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:54:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:54:12] (CR) Catrope: [C: +2] Enable migration mode on Italian and MediaWIki.org [mediawiki-config] - https://gerrit.wikimedia.org/r/757087 (https://phabricator.wikimedia.org/T299927) (owner: Jdlrobson)
[00:55:04] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[00:55:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:55:26] (Merged) jenkins-bot: Enable migration mode on Italian and MediaWIki.org [mediawiki-config] - https://gerrit.wikimedia.org/r/757087 (https://phabricator.wikimedia.org/T299927) (owner: Jdlrobson)
[00:56:21] !log catrope@deploy1002 Synchronized php-1.38.0-wmf.19/skins/Vector/: Backport: [[gerrit:756998|Do not load common.js twice (T300070)]] (duration: 02m 43s)
[00:56:24] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:56:25] T300070: [Regression] User common.js/css is being loaded twice - https://phabricator.wikimedia.org/T300070
[00:59:08] Jdlrobson: The config patch is now ready for testing
[00:59:38] RoanKattouw: testing thanks
[01:00:07] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn
[01:00:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:00:56] !log catrope@deploy1002 Synchronized php-1.38.0-wmf.18/skins/Vector/: Backport: [[gerrit:756997|Do not load common.js twice (T300070)]] and [[gerrit:756696|Fix bug in SkinVersionLookup (T299971)]] (duration: 00m 51s)
[01:01:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:01:01] T299971: [subtask] Problem in Legacy Vector calculation ([{reqId}] {exception_url} PHP Notice: Undefined index: data-user-page ) - https://phabricator.wikimedia.org/T299971
[01:01:21] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[01:01:22] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn
[01:01:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:01:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:01:33] RECOVERY - Check systemd state on mwmaint1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:02:37] Good to sync RoanKattouw
[01:03:59] !log catrope@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:757087|Enable migration mode on Italian and MediaWIki.org (T299927)]] (duration: 00m 54s)
[01:04:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:04:03] T299927: Deploy new Vector skin to all projects - https://phabricator.wikimedia.org/T299927
[01:05:22] !log 
mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[01:05:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:06:43] RECOVERY - Check whether ferm is active by checking the default input chain on ms-be2042 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[01:07:33] RoanKattouw: thank you!
[01:16:09] (PS1) Ebernhardson: rdf query service: Use constant filename for defaults [puppet] - https://gerrit.wikimedia.org/r/757124 (https://phabricator.wikimedia.org/T299222)
[01:16:45] (CR) jerkins-bot: [V: -1] rdf query service: Use constant filename for defaults [puppet] - https://gerrit.wikimedia.org/r/757124 (https://phabricator.wikimedia.org/T299222) (owner: Ebernhardson)
[01:18:32] (PS2) Ebernhardson: rdf query service: Use constant filename for defaults [puppet] - https://gerrit.wikimedia.org/r/757124 (https://phabricator.wikimedia.org/T299222)
[01:21:57] (PS3) Ebernhardson: rdf query service: Use constant filename for defaults [puppet] - https://gerrit.wikimedia.org/r/757124 (https://phabricator.wikimedia.org/T299222)
[01:35:43] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=sidekiq site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[01:38:07] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[02:04:49] RECOVERY - Check systemd state on deploy2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:26:13] (IcingaOverload) firing: Checks are taking long to execute on alert2001:9245 - https://grafana.wikimedia.org/d/rsCfQfuZz/icinga - https://alerts.wikimedia.org
[02:36:13] (IcingaOverload) resolved: Checks are taking long to execute on alert2001:9245 - https://grafana.wikimedia.org/d/rsCfQfuZz/icinga - https://alerts.wikimedia.org
[02:44:50] (CR) Ryan Kemper: [C: +2] ssl: remove search.svc keypair [puppet] - https://gerrit.wikimedia.org/r/757020 (https://phabricator.wikimedia.org/T299633) (owner: Filippo Giunchedi)
[02:49:11] !log [WDQS] T299098 `ryankemper@wdqs2003:~$ sudo pool` (forgot to pool after dcops fixed hw issue)
[02:49:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:49:16] T299098: hw troubleshooting: IPMI Power Supply Failure (PS2) for wdqs2003.codfw.wmnet - https://phabricator.wikimedia.org/T299098
[03:08:47] (Abandoned) Krinkle: scap: Remove commit and sync steps from 'update-interwiki-cache' [mediawiki-config] - https://gerrit.wikimedia.org/r/599147 (https://phabricator.wikimedia.org/T247107) (owner: Krinkle)
[03:08:57] PROBLEM - Check systemd state on deploy2002 is CRITICAL: CRITICAL - degraded: The following units failed: imagecatalog_record.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:14:43] (CR) Krinkle: "Would this result in a mw-cli process mixing code from both srv/mediawiki and -staging at the same time? e.g. wikiversions from one but co" [mediawiki-config] - https://gerrit.wikimedia.org/r/756718 (owner: Ahmon Dancy)
[03:30:26] !log [WDQS Deploy] Gearing up for deploy of wdqs `0.3.100`. 
Pre-deploy tests passing on canary `wdqs1003`
[03:30:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:31:39] !log ryankemper@deploy1002 Started deploy [wdqs/wdqs@dc7c5ac]: 0.3.100
[03:31:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:32:39] !log [WDQS Deploy] Tests passing following deploy of `0.3.100` on canary `wdqs1003`; proceeding to rest of fleet
[03:32:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:40:15] !log ryankemper@deploy1002 Finished deploy [wdqs/wdqs@dc7c5ac]: 0.3.100 (duration: 08m 35s)
[03:40:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:42:33] !log [WDQS Deploy] Restarted `wdqs-updater` across all hosts, 4 hosts at a time: `sudo -E cumin -b 4 'A:wdqs-all' 'systemctl restart wdqs-updater'`
[03:42:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:42:36] !log [WDQS Deploy] Restarted `wdqs-categories` across all test hosts simultaneously: `sudo -E cumin 'A:wdqs-test' 'systemctl restart wdqs-categories'`
[03:42:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:42:39] !log [WDQS Deploy] Restarting `wdqs-categories` across lvs-managed hosts, one node at a time: `sudo -E cumin -b 1 'A:wdqs-all and not A:wdqs-test' 'depool && sleep 45 && systemctl restart wdqs-categories && sleep 45 && pool'`
[03:42:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:38:47] SRE, docker-pkg, serviceops, Patch-For-Review, Release Pipeline (Blubber): Container image lifecycle management - https://phabricator.wikimedia.org/T287130 (RLazarus) The hourly `imagecatalog record` timer is working on deploy1002. It's failing on deploy2002, because something keeps overwriti... 
[04:40:08] ACKNOWLEDGEMENT - Check systemd state on deploy2002 is CRITICAL: CRITICAL - degraded: The following units failed: imagecatalog_record.service RLazarus https://phabricator.wikimedia.org/T287130#7651203 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:45:57] SRE, SRE-tools, Infrastructure-Foundations: Pairing tool for new SREs using sudo under supervision - https://phabricator.wikimedia.org/T299989 (RLazarus) Thanks for checking it out! Even if sudo_pair were only available in bullseye at first, that would be a huge step forward, because we'd be able to...
[04:56:20] !log [WDQS Deploy] Deploy complete. Successful test query placed on query.wikidata.org, there's no relevant criticals in Icinga, and Grafana looks good
[04:56:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:59:56] !log ryankemper@deploy1002 Started deploy [wdqs/wdqs@dc7c5ac] (wcqs): Deploy 0.3.100 to WCQS
[04:59:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:02:17] !log ryankemper@deploy1002 Finished deploy [wdqs/wdqs@dc7c5ac] (wcqs): Deploy 0.3.100 to WCQS (duration: 02m 21s)
[05:02:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:02:58] (PS1) Andrew Bogott: profile::wmcs::monitoring: update comment about time interval [puppet] - https://gerrit.wikimedia.org/r/757155
[05:20:15] (CR) Andrew Bogott: [C: +2] profile::wmcs::monitoring: update comment about time interval [puppet] - https://gerrit.wikimedia.org/r/757155 (owner: Andrew Bogott)
[06:17:31] (PS2) Juan90264: bgwiki: Add 'wgNamespaceRobotPolicies' for Draft (Talk) namespace [mediawiki-config] - https://gerrit.wikimedia.org/r/756978 (https://phabricator.wikimedia.org/T299224) (owner: Tks4Fish)
[06:22:06] (PS1) Marostegui: Revert "db2086: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/757129
[06:22:22] (PS1) Marostegui: Revert "es1022: Disable 
notifications" [puppet] - https://gerrit.wikimedia.org/r/757130
[06:23:07] (CR) Marostegui: [C: +2] Revert "db2086: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/757129 (owner: Marostegui)
[06:23:16] (CR) Marostegui: [C: +2] Revert "es1022: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/757130 (owner: Marostegui)
[06:24:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db2086 (s7,s8) T299882', diff saved to https://phabricator.wikimedia.org/P19229 and previous config saved to /var/cache/conftool/dbconfig/20220126-062406-marostegui.json
[06:24:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:24:12] T299882: Reset db2086's idrac - https://phabricator.wikimedia.org/T299882
[06:24:22] SRE, ops-codfw: Reset db2086's idrac - https://phabricator.wikimedia.org/T299882 (Marostegui) Host repooled
[06:26:13] (IcingaOverload) firing: Checks are taking long to execute on alert2001:9245 - https://grafana.wikimedia.org/d/rsCfQfuZz/icinga - https://alerts.wikimedia.org
[06:30:28] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1139.eqiad.wmnet with reason: Maintenance
[06:30:30] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1139.eqiad.wmnet with reason: Maintenance
[06:30:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:30:32] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1169.eqiad.wmnet with reason: Maintenance
[06:30:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:30:33] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1169.eqiad.wmnet with reason: Maintenance
[06:30:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:30:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:30:38] 
!log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1169 (T285149)', diff saved to https://phabricator.wikimedia.org/P19230 and previous config saved to /var/cache/conftool/dbconfig/20220126-063037-marostegui.json
[06:30:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:30:41] T285149: Schema change for dropping rev_page_id index - https://phabricator.wikimedia.org/T285149
[06:31:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool es1020 T300005', diff saved to https://phabricator.wikimedia.org/P19231 and previous config saved to /var/cache/conftool/dbconfig/20220126-063149-marostegui.json
[06:31:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:31:54] T300005: Upgrade es4 to Bullseye - https://phabricator.wikimedia.org/T300005
[06:33:59] (PS1) Marostegui: es1020: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/757287 (https://phabricator.wikimedia.org/T300005)
[06:34:39] (CR) Marostegui: [C: +2] es1020: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/757287 (https://phabricator.wikimedia.org/T300005) (owner: Marostegui)
[06:36:13] (IcingaOverload) resolved: Checks are taking long to execute on alert2001:9245 - https://grafana.wikimedia.org/d/rsCfQfuZz/icinga - https://alerts.wikimedia.org
[06:36:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T285149)', diff saved to https://phabricator.wikimedia.org/P19232 and previous config saved to /var/cache/conftool/dbconfig/20220126-063644-marostegui.json
[06:36:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:36:49] T285149: Schema change for dropping rev_page_id index - https://phabricator.wikimedia.org/T285149
[06:39:15] (PS1) Marostegui: mariadb: x1 codfw disable notifications [puppet] - https://gerrit.wikimedia.org/r/757288 (https://phabricator.wikimedia.org/T300099)
[06:40:01] (CR) Marostegui: [C: +2] 
mariadb: x1 codfw disable notifications [puppet] - https://gerrit.wikimedia.org/r/757288 (https://phabricator.wikimedia.org/T300099) (owner: Marostegui)
[06:41:55] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host db2096.codfw.wmnet with OS bullseye
[06:41:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:43:28] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host db2115.codfw.wmnet with OS bullseye
[06:43:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:45:38] ACKNOWLEDGEMENT - MariaDB Replica IO: x1 on db2101 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db2096.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db2096.codfw.wmnet (111 Connection refused) Marostegui known https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[06:46:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove recentchangeslinked from s8 eqiad T263127', diff saved to https://phabricator.wikimedia.org/P19233 and previous config saved to /var/cache/conftool/dbconfig/20220126-064653-marostegui.json
[06:46:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:46:58] T263127: Remove groups from db configs - https://phabricator.wikimedia.org/T263127
[06:51:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P19234 and previous config saved to /var/cache/conftool/dbconfig/20220126-065149-marostegui.json
[06:51:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:06:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P19235 and previous config saved to /var/cache/conftool/dbconfig/20220126-070654-marostegui.json 
[07:06:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:07:35] (CR) Elukey: logstash: improve filter for ORES (1 comment) [puppet] - https://gerrit.wikimedia.org/r/756959 (https://phabricator.wikimedia.org/T299999) (owner: Elukey)
[07:14:31] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2096.codfw.wmnet with OS bullseye
[07:14:32] (PS3) Elukey: logstash: add docker support in the Makefile [puppet] - https://gerrit.wikimedia.org/r/757042 (https://phabricator.wikimedia.org/T300051)
[07:14:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:14:50] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host db2131.codfw.wmnet with OS bullseye
[07:14:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:15:21] (CR) Elukey: logstash: add docker support in the Makefile (1 comment) [puppet] - https://gerrit.wikimedia.org/r/757042 (https://phabricator.wikimedia.org/T300051) (owner: Elukey)
[07:17:07] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2115.codfw.wmnet with OS bullseye
[07:17:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:22:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T285149)', diff saved to https://phabricator.wikimedia.org/P19236 and previous config saved to /var/cache/conftool/dbconfig/20220126-072200-marostegui.json
[07:22:02] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1140.eqiad.wmnet with reason: Maintenance
[07:22:03] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1140.eqiad.wmnet with reason: Maintenance
[07:22:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:22:05] T285149: Schema change for dropping rev_page_id index - 
https://phabricator.wikimedia.org/T285149
[07:22:05] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1163.eqiad.wmnet with reason: Maintenance
[07:22:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:22:07] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1163.eqiad.wmnet with reason: Maintenance
[07:22:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:22:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:22:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1163 (T285149)', diff saved to https://phabricator.wikimedia.org/P19237 and previous config saved to /var/cache/conftool/dbconfig/20220126-072211-marostegui.json
[07:22:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:22:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:23:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1163 (T285149)', diff saved to https://phabricator.wikimedia.org/P19238 and previous config saved to /var/cache/conftool/dbconfig/20220126-072317-marostegui.json
[07:23:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:38:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1163', diff saved to https://phabricator.wikimedia.org/P19239 and previous config saved to /var/cache/conftool/dbconfig/20220126-073822-marostegui.json
[07:38:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:40:01] (CR) Legoktm: [V: +1 C: +1] mediawiki::maintenance: Run recountCategories.php monthly on all wikis (1 comment) [puppet] - https://gerrit.wikimedia.org/r/756069 (https://phabricator.wikimedia.org/T299823) (owner: MarcoAurelio)
[07:42:21] (PS1) Majavah: Update interwiki cache [mediawiki-config] - 
https://gerrit.wikimedia.org/r/757377
[07:42:56] (CR) Majavah: [C: +2] Update interwiki cache [mediawiki-config] - https://gerrit.wikimedia.org/r/757377 (owner: Majavah)
[07:43:08] (CR) Elukey: "Adding Ben to this code review as well. We are planning to deploy https://gerrit.wikimedia.org/r/c/operations/puppet/+/742747, that change" [puppet] - https://gerrit.wikimedia.org/r/755435 (https://phabricator.wikimedia.org/T299401) (owner: Phuedx)
[07:43:30] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host es1020.eqiad.wmnet with OS bullseye
[07:43:30] (CR) Elukey: [V: +1] "To avoid forgetting: let's deploy this change with https://gerrit.wikimedia.org/r/c/operations/puppet/+/755435" [puppet] - https://gerrit.wikimedia.org/r/742747 (https://phabricator.wikimedia.org/T296064) (owner: Elukey)
[07:43:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:43:39] (Merged) jenkins-bot: Update interwiki cache [mediawiki-config] - https://gerrit.wikimedia.org/r/757377 (owner: Majavah)
[07:45:42] !log taavi@deploy1002 Synchronized wmf-config/interwiki.php: Config: [[gerrit:757377|Update interwiki cache]] (duration: 00m 52s)
[07:45:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:49:03] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn
[07:49:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:49:37] !log marostegui@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host es1020.eqiad.wmnet with OS bullseye
[07:49:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:50:13] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[07:50:15] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn
[07:50:16] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:50:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:50:50] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2131.codfw.wmnet with OS bullseye [07:50:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:51:24] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [07:51:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:52:23] 10SRE, 10ops-eqiad, 10DBA: es1022 troubles with PXE - https://phabricator.wikimedia.org/T299123 (10Marostegui) I think es1020 is having the same issue - @Volans do you have the fixing command somewhere? Given that es1020 and es1022 are from the same batch it could make sense. Also, expecting es1021 (current... [07:53:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1163', diff saved to https://phabricator.wikimedia.org/P19240 and previous config saved to /var/cache/conftool/dbconfig/20220126-075326-marostegui.json [07:53:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:55:56] (03PS1) 10Jelto: install_server: update MAC address of gitlab-runner1001 [puppet] - 10https://gerrit.wikimedia.org/r/757378 (https://phabricator.wikimedia.org/T295481) [08:01:37] 10SRE, 10Data-Engineering, 10observability, 10serviceops: Upgrade Kafka to 2.x - https://phabricator.wikimedia.org/T300102 (10elukey) [08:06:28] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/757378 (https://phabricator.wikimedia.org/T295481) (owner: 10Jelto) [08:08:27] (03CR) 10Jelto: [C: 03+2] install_server: update MAC address of gitlab-runner1001 [puppet] - 10https://gerrit.wikimedia.org/r/757378 (https://phabricator.wikimedia.org/T295481) (owner: 10Jelto) [08:08:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance 
db1163 (T285149)', diff saved to https://phabricator.wikimedia.org/P19241 and previous config saved to /var/cache/conftool/dbconfig/20220126-080831-marostegui.json [08:08:33] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1133.eqiad.wmnet with reason: Maintenance [08:08:35] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1133.eqiad.wmnet with reason: Maintenance [08:08:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:08:37] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1134.eqiad.wmnet with reason: Maintenance [08:08:37] T285149: Schema change for dropping rev_page_id index - https://phabricator.wikimedia.org/T285149 [08:08:38] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1134.eqiad.wmnet with reason: Maintenance [08:08:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:08:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:08:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1134 (T285149)', diff saved to https://phabricator.wikimedia.org/P19242 and previous config saved to /var/cache/conftool/dbconfig/20220126-080842-marostegui.json [08:08:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:08:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:08:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:09:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134 (T285149)', diff saved to https://phabricator.wikimedia.org/P19243 and previous config saved to /var/cache/conftool/dbconfig/20220126-080948-marostegui.json [08:09:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:17:26] (03PS1) 10Gergő Tisza: linkrecommendation: Disable 
FLASK_PROFILE for staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/757383 (https://phabricator.wikimedia.org/T296334) [08:18:27] (03PS1) 10Marostegui: Revert "mariadb: x1 codfw disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/757131 [08:18:50] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti1013.eqiad.wmnet to ganeti01.svc.eqiad.wmnet [08:18:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:19:05] (03CR) 10Marostegui: [C: 03+2] Revert "mariadb: x1 codfw disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/757131 (owner: 10Marostegui) [08:20:27] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti1013.eqiad.wmnet to ganeti01.svc.eqiad.wmnet [08:20:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:21:01] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Migrate eqiad Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296721 (10MoritzMuehlenhoff) [08:24:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134', diff saved to https://phabricator.wikimedia.org/P19244 and previous config saved to /var/cache/conftool/dbconfig/20220126-082453-marostegui.json [08:24:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:29:53] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti1014.eqiad.wmnet with OS buster [08:29:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:30:00] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Migrate eqiad Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296721 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti1014.eqiad.wmnet with OS buster [08:31:55] !log sign puppet cert for gitlab-runner1001.eqiad.wmnet [08:31:57] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:39:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134', diff saved to https://phabricator.wikimedia.org/P19245 and previous config saved to /var/cache/conftool/dbconfig/20220126-083958-marostegui.json [08:40:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:41:26] !log draining instances off ganeti1015 for reimage [08:41:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:46:43] (03CR) 10MVernon: [C: 03+2] install_server: swift UID/GID should match filesystems (if present) [puppet] - 10https://gerrit.wikimedia.org/r/757025 (https://phabricator.wikimedia.org/T300057) (owner: 10MVernon) [08:48:36] PROBLEM - Host gitlab-runner1001 is DOWN: PING CRITICAL - Packet loss = 100% [08:49:02] ^ that's me [08:49:48] 10SRE, 10SRE-swift-storage, 10Patch-For-Review: 'swift' user/group IDs should be consistent across the fleet - https://phabricator.wikimedia.org/T123918 (10MatthewVernon) [08:50:42] 10SRE, 10SRE-swift-storage, 10Patch-For-Review: reimaging swift backends should set swift UID/GID to match filesystems - https://phabricator.wikimedia.org/T300057 (10MatthewVernon) 05Open→03Resolved Fixed by CR 757025 (hopefully!) [08:51:52] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={gitlab,gitlab_runner} site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [08:53:03] (03CR) 10Filippo Giunchedi: [V: 03+1] prometheus: refactor rsync in a standalone profile (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/756979 (https://phabricator.wikimedia.org/T296199) (owner: 10Filippo Giunchedi) [08:53:10] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [08:55:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134 (T285149)', diff saved to https://phabricator.wikimedia.org/P19246 and previous config saved to /var/cache/conftool/dbconfig/20220126-085503-marostegui.json [08:55:05] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1135.eqiad.wmnet with reason: Maintenance [08:55:06] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1135.eqiad.wmnet with reason: Maintenance [08:55:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:55:08] T285149: Schema change for dropping rev_page_id index - https://phabricator.wikimedia.org/T285149 [08:55:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:55:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1135 (T285149)', diff saved to https://phabricator.wikimedia.org/P19247 and previous config saved to /var/cache/conftool/dbconfig/20220126-085510-marostegui.json [08:55:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:55:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:55:27] (03PS2) 10Filippo Giunchedi: prometheus: refactor rsync in a standalone profile [puppet] - 10https://gerrit.wikimedia.org/r/756979 (https://phabricator.wikimedia.org/T296199) [08:56:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135 (T285149)', diff saved to https://phabricator.wikimedia.org/P19248 and previous config saved to /var/cache/conftool/dbconfig/20220126-085616-marostegui.json [08:56:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:56:49] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host 
ganeti1014.eqiad.wmnet with OS buster [08:56:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:56:55] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Migrate eqiad Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296721 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti1014.eqiad.wmnet with OS buster completed: - ganeti1014 (**PASS**)... [08:57:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1120 T300099', diff saved to https://phabricator.wikimedia.org/P19249 and previous config saved to /var/cache/conftool/dbconfig/20220126-085733-marostegui.json [08:57:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:57:37] T300099: Upgrade x1 to Bullseye - https://phabricator.wikimedia.org/T300099 [08:58:33] (03PS1) 10Marostegui: db1120: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/757386 (https://phabricator.wikimedia.org/T300099) [08:58:38] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={gitlab,gitlab_runner} site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [08:59:27] (03CR) 10Marostegui: [C: 03+2] db1120: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/757386 (https://phabricator.wikimedia.org/T300099) (owner: 10Marostegui) [09:00:00] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host db1120.eqiad.wmnet with OS bullseye [09:00:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:00:10] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:00:12] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti1005.eqiad.wmnet with OS buster [09:00:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:00:20] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Migrate eqiad Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296721 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti1005.eqiad.wmnet with OS buster [09:01:12] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM! See inline for an extra variable" [puppet] - 10https://gerrit.wikimedia.org/r/747550 (https://phabricator.wikimedia.org/T292603) (owner: 10Herron) [09:03:09] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33451/console" [puppet] - 10https://gerrit.wikimedia.org/r/756979 (https://phabricator.wikimedia.org/T296199) (owner: 10Filippo Giunchedi) [09:03:20] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={gitlab,gitlab_runner} site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:04:46] (03PS3) 10Filippo Giunchedi: prometheus: refactor rsync in a standalone profile [puppet] - 10https://gerrit.wikimedia.org/r/756979 (https://phabricator.wikimedia.org/T296199) [09:05:05] (03CR) 10Kosta Harlan: [C: 03+2] linkrecommendation: Disable FLASK_PROFILE for staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/757383 (https://phabricator.wikimedia.org/T296334) (owner: 10Gergő Tisza) [09:05:34] (03CR) 10Elukey: "Left a question just for curiosity, but from what I can see the change looks good (even if my understanding of the codebase is very limite" [software/cfssl-issuer] - 
10https://gerrit.wikimedia.org/r/756546 (https://phabricator.wikimedia.org/T299906) (owner: 10JMeybohm) [09:06:21] !log uploaded scap 4.2.0 to apt.wikimedia.org [09:06:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:08:42] (03Merged) 10jenkins-bot: linkrecommendation: Disable FLASK_PROFILE for staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/757383 (https://phabricator.wikimedia.org/T296334) (owner: 10Gergő Tisza) [09:09:18] (03PS1) 10Marostegui: db1128: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/757388 (https://phabricator.wikimedia.org/T299624) [09:10:04] (03CR) 10Filippo Giunchedi: [C: 03+1] "Very nice, thank you John for working on this!" [puppet] - 10https://gerrit.wikimedia.org/r/757010 (owner: 10Jbond) [09:10:43] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: refactor rsync in a standalone profile [puppet] - 10https://gerrit.wikimedia.org/r/756979 (https://phabricator.wikimedia.org/T296199) (owner: 10Filippo Giunchedi) [09:10:45] (03CR) 10Marostegui: [C: 03+2] db1128: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/757388 (https://phabricator.wikimedia.org/T299624) (owner: 10Marostegui) [09:11:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135', diff saved to https://phabricator.wikimedia.org/P19250 and previous config saved to /var/cache/conftool/dbconfig/20220126-091121-marostegui.json [09:11:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:11:36] (03CR) 10JMeybohm: Make a bundle signer return it's root CA (031 comment) [software/cfssl-issuer] - 10https://gerrit.wikimedia.org/r/756546 (https://phabricator.wikimedia.org/T299906) (owner: 10JMeybohm) [09:15:08] (03PS1) 10Marostegui: mariadb: Promote db1128 to m1 master [puppet] - 10https://gerrit.wikimedia.org/r/757389 (https://phabricator.wikimedia.org/T299624) [09:16:10] (03PS2) 10Marostegui: mariadb: Promote db1128 to m1 master [puppet] - 
10https://gerrit.wikimedia.org/r/757389 (https://phabricator.wikimedia.org/T299624) [09:16:39] (03CR) 10Marostegui: [C: 04-2] "Wait for the failover day" [puppet] - 10https://gerrit.wikimedia.org/r/757389 (https://phabricator.wikimedia.org/T299624) (owner: 10Marostegui) [09:20:27] (03PS1) 10Marostegui: Revert "db1120: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/757132 [09:21:14] (03CR) 10Marostegui: [C: 03+2] Revert "db1120: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/757132 (owner: 10Marostegui) [09:21:49] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1120.eqiad.wmnet with OS bullseye [09:21:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:21:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1120 (re)pooling @ 1%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19251 and previous config saved to /var/cache/conftool/dbconfig/20220126-092158-root.json [09:22:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:24:39] !log uploaded scap 4.2.0 to apt.wikimedia.org - T300058 [09:24:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:24:43] T300058: Deploy Scap version 4.2.0 - https://phabricator.wikimedia.org/T300058 [09:25:09] !log updated scap to 4.2.0 on A:mw-canary, A:parsoid-canary, A:mw-jobrunner-canary - T300058 [09:25:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:25:23] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ganeti1005.eqiad.wmnet with OS buster [09:25:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:25:28] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Migrate eqiad Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296721 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for 
host ganeti1005.eqiad.wmnet with OS buster executed with errors: - ganeti1005 (*... [09:26:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135', diff saved to https://phabricator.wikimedia.org/P19252 and previous config saved to /var/cache/conftool/dbconfig/20220126-092626-marostegui.json [09:26:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:28:00] !log begin rsync prometheus2004 -> 2005 - T296199 [09:28:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:28:04] T296199: Prometheus hardware refresh (+ Bullseye upgrade) - https://phabricator.wikimedia.org/T296199 [09:30:16] (ThanosSidecarPrometheusDown) firing: Thanos Sidecar cannot connect to Prometheus - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar - https://alerts.wikimedia.org [09:30:28] (ThanosRuleHighRuleEvaluationFailures) firing: Thanos Rule is failing to evaluate rules. 
- https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/35da848f5f92b2dc612e0c3a0577b8a1/thanos-rule - https://alerts.wikimedia.org [09:30:42] ah yes that's expected ^ I've stopped prometheus on prometheus2005 [09:32:44] !log updated scap to 4.2.0 on A:restbase-canary - T300058 [09:32:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:32:48] T300058: Deploy Scap version 4.2.0 - https://phabricator.wikimedia.org/T300058 [09:33:04] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti1005.eqiad.wmnet with OS buster [09:33:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:33:11] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Migrate eqiad Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296721 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti1005.eqiad.wmnet with OS buster [09:33:12] RECOVERY - Host gitlab-runner1001 is UP: PING OK - Packet loss = 0%, RTA = 0.65 ms [09:33:29] (03PS1) 10Kosta Harlan: linkrecommendation: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/757391 [09:35:28] (ThanosRuleHighRuleEvaluationFailures) firing: (2) Thanos Rule is failing to evaluate rules. 
- https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/35da848f5f92b2dc612e0c3a0577b8a1/thanos-rule - https://alerts.wikimedia.org [09:35:41] (03CR) 10Kosta Harlan: [C: 03+2] "Let's try again" [deployment-charts] - 10https://gerrit.wikimedia.org/r/757391 (owner: 10Kosta Harlan) [09:37:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1120 (re)pooling @ 5%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19254 and previous config saved to /var/cache/conftool/dbconfig/20220126-093702-root.json [09:37:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:39:47] (03Merged) 10jenkins-bot: linkrecommendation: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/757391 (owner: 10Kosta Harlan) [09:40:28] (ThanosRuleHighRuleEvaluationFailures) resolved: (2) Thanos Rule is failing to evaluate rules. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/35da848f5f92b2dc612e0c3a0577b8a1/thanos-rule - https://alerts.wikimedia.org [09:41:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135 (T285149)', diff saved to https://phabricator.wikimedia.org/P19255 and previous config saved to /var/cache/conftool/dbconfig/20220126-094131-marostegui.json [09:41:32] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1099.eqiad.wmnet with reason: Maintenance [09:41:34] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1099.eqiad.wmnet with reason: Maintenance [09:41:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:41:36] T285149: Schema change for dropping rev_page_id index - https://phabricator.wikimedia.org/T285149 [09:41:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:41:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1099:3311 (T285149)', diff saved to 
https://phabricator.wikimedia.org/P19256 and previous config saved to /var/cache/conftool/dbconfig/20220126-094138-marostegui.json [09:41:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:41:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:42:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311 (T285149)', diff saved to https://phabricator.wikimedia.org/P19257 and previous config saved to /var/cache/conftool/dbconfig/20220126-094244-marostegui.json [09:42:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:43:16] (03CR) 10DCausse: [C: 03+1] Correct wcqs event stream name [mediawiki-config] - 10https://gerrit.wikimedia.org/r/757122 (owner: 10Ebernhardson) [09:47:01] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti1005.eqiad.wmnet with OS buster [09:47:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:47:06] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Migrate eqiad Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296721 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti1005.eqiad.wmnet with OS buster completed: - ganeti1005 (**PASS**)... 
[09:52:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1120 (re)pooling @ 10%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19258 and previous config saved to /var/cache/conftool/dbconfig/20220126-095205-root.json [09:52:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:52:37] !log kharlan@deploy1002 helmfile [staging] START helmfile.d/services/linkrecommendation: apply on staging [09:52:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:52:40] !log kharlan@deploy1002 helmfile [staging] DONE helmfile.d/services/linkrecommendation: apply on internal [09:52:41] !log kharlan@deploy1002 helmfile [staging] DONE helmfile.d/services/linkrecommendation: apply on external [09:52:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:52:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:53:49] !log kharlan@deploy1002 helmfile [staging] DONE helmfile.d/services/linkrecommendation: sync on staging [09:53:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:55:33] heads up all, I'm about to deploy a change which updates the prometheus ferm rules. This should be a no-op, but if you see any issues let me know (https://gerrit.wikimedia.org/r/c/operations/puppet/+/757010) [09:55:53] (03CR) 10Jbond: [C: 03+2] profile: drop individual prometheus node ferm rules [puppet] - 10https://gerrit.wikimedia.org/r/757010 (owner: 10Jbond) [09:57:37] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti1006.eqiad.wmnet with OS buster [09:57:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:57:44] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Migrate eqiad Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296721 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti1006.eqiad.wmnet with OS 
buster [09:57:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311', diff saved to https://phabricator.wikimedia.org/P19259 and previous config saved to /var/cache/conftool/dbconfig/20220126-095749-marostegui.json [09:57:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:01:39] !log volans@cumin1001 START - Cookbook sre.hosts.provision for host es1020.mgmt.eqiad.wmnet with reboot policy GRACEFUL [10:01:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:05:55] (LogstashKafkaConsumerLag) firing: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org [10:07:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1120 (re)pooling @ 20%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19260 and previous config saved to /var/cache/conftool/dbconfig/20220126-100709-root.json [10:07:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:08:30] (03CR) 10Btullis: [C: 03+2] profile::cache::kafka::webrequest: Log Sec-CH-UA* headers [puppet] - 10https://gerrit.wikimedia.org/r/755435 (https://phabricator.wikimedia.org/T299401) (owner: 10Phuedx) [10:08:51] (03CR) 10Btullis: [C: 03+2] varnishkafka: use new ca bundle instead of the Puppet one [puppet] - 10https://gerrit.wikimedia.org/r/742747 (https://phabricator.wikimedia.org/T296064) (owner: 10Elukey) [10:08:59] (03PS5) 10Btullis: varnishkafka: use new ca bundle instead of the Puppet one [puppet] - 10https://gerrit.wikimedia.org/r/742747 (https://phabricator.wikimedia.org/T296064) (owner: 10Elukey) [10:12:26] !log volans@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host es1020.mgmt.eqiad.wmnet with reboot policy GRACEFUL [10:12:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:12:54] !log marostegui@cumin1001 dbctl commit (dc=all): 
'Repooling after maintenance db1099:3311', diff saved to https://phabricator.wikimedia.org/P19261 and previous config saved to /var/cache/conftool/dbconfig/20220126-101253-marostegui.json [10:12:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:16:03] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host es1020.eqiad.wmnet with OS bullseye [10:16:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:16:47] PROBLEM - k8s API server requests latencies on ml-serve-ctrl1002 is CRITICAL: instance=10.64.48.64 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [10:17:15] (03PS1) 10Majavah: P:nftables::basefirewall: drop prometheus port filtering [puppet] - 10https://gerrit.wikimedia.org/r/757394 [10:18:55] (03CR) 10Majavah: "PCC: https://puppet-compiler.wmflabs.org/pcc-worker1003/33452/" [puppet] - 10https://gerrit.wikimedia.org/r/757394 (owner: 10Majavah) [10:19:09] (03CR) 10jerkins-bot: [V: 04-1] P:nftables::basefirewall: drop prometheus port filtering [puppet] - 10https://gerrit.wikimedia.org/r/757394 (owner: 10Majavah) [10:20:24] (03PS2) 10Majavah: P:nftables::basefirewall: drop prometheus port filtering [puppet] - 10https://gerrit.wikimedia.org/r/757394 [10:20:49] 10SRE, 10ops-eqiad, 10DBA: es1022 troubles with PXE - https://phabricator.wikimedia.org/T299123 (10Volans) I've fixed es1020 manually and checked the other hosts in the same batch ( https://netbox.wikimedia.org/dcim/devices/?cf_ticket=T235659 ), apart from es1024 all of them have the same misconfiguration. I test... 
[10:20:55] (LogstashKafkaConsumerLag) firing: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org [10:22:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1120 (re)pooling @ 25%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19263 and previous config saved to /var/cache/conftool/dbconfig/20220126-102213-root.json [10:22:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:23:26] btullis, elukey: o/ I'm here with coffee if there's anything that I can help with in the varnishkafka deployment [10:24:39] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on es2025.codfw.wmnet with reason: Maintenance [10:24:41] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2025.codfw.wmnet with reason: Maintenance [10:24:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:24:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:24:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling es2025 (T300006)', diff saved to https://phabricator.wikimedia.org/P19264 and previous config saved to /var/cache/conftool/dbconfig/20220126-102445-ladsgroup.json [10:25:18] phuedx: hi! Can you join #wikimedia-analytics? [10:25:36] phuedx: Thanks. Seems to be going well - we have deployed the change to cp3050. The only thing is some double-quoting in ch_ua_platform. 
[10:25:46] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti1006.eqiad.wmnet with OS buster [10:25:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:25:49] T300006: Upgrade es5 to Bullseye - https://phabricator.wikimedia.org/T300006 [10:25:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:25:52] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Migrate eqiad Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296721 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti1006.eqiad.wmnet with OS buster completed: - ganeti1006 (**PASS**)... [10:25:55] (LogstashKafkaConsumerLag) resolved: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org [10:25:59] (03PS1) 10Ladsgroup: es2025: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/757396 (https://phabricator.wikimedia.org/T300006) [10:26:18] (03PS2) 10Ladsgroup: es2025: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/757396 (https://phabricator.wikimedia.org/T300006) [10:26:43] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] es2025: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/757396 (https://phabricator.wikimedia.org/T300006) (owner: 10Ladsgroup) [10:27:53] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/757043 (https://phabricator.wikimedia.org/T298649) (owner: 10JHathaway) [10:27:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311 (T285149)', diff saved to https://phabricator.wikimedia.org/P19265 and previous config saved to /var/cache/conftool/dbconfig/20220126-102758-marostegui.json [10:28:00] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1184.eqiad.wmnet with reason: Maintenance [10:28:01] !log marostegui@cumin1001 END 
(PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1184.eqiad.wmnet with reason: Maintenance [10:28:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:28:04] T285149: Schema change for dropping rev_page_id index - https://phabricator.wikimedia.org/T285149 [10:28:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:28:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1184 (T285149)', diff saved to https://phabricator.wikimedia.org/P19266 and previous config saved to /var/cache/conftool/dbconfig/20220126-102805-marostegui.json [10:28:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:28:08] (03PS1) 10Majavah: hieradata: add new bullseye eqiad1 bastions [puppet] - 10https://gerrit.wikimedia.org/r/757397 [10:28:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:28:39] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.reimage for host es2025.codfw.wmnet with OS bullseye [10:28:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:29:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184 (T285149)', diff saved to https://phabricator.wikimedia.org/P19267 and previous config saved to /var/cache/conftool/dbconfig/20220126-102911-marostegui.json [10:29:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:30:23] (03PS1) 10Filippo Giunchedi: hieradata: force Host header for thanos-swift [puppet] - 10https://gerrit.wikimedia.org/r/757398 (https://phabricator.wikimedia.org/T291946) [10:32:06] !log oblivian@deploy1002 Started deploy [docker-pkg/deploy@62a5e87]: redeploy of 3.0.2, including build2001 [10:32:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:33:12] !log oblivian@deploy1002 Finished deploy [docker-pkg/deploy@62a5e87]: redeploy of 3.0.2, including build2001 (duration: 01m 05s) 
[10:33:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:34:08] !log oblivian@deploy1002 Started deploy [docker-pkg/deploy@62a5e87]: redeploy of 3.0.2, including build2001 [10:34:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:34:41] !log oblivian@deploy1002 Finished deploy [docker-pkg/deploy@62a5e87]: redeploy of 3.0.2, including build2001 (duration: 00m 33s) [10:34:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:35:21] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33453/console" [puppet] - 10https://gerrit.wikimedia.org/r/757398 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi) [10:37:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1120 (re)pooling @ 40%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19268 and previous config saved to /var/cache/conftool/dbconfig/20220126-103716-root.json [10:37:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:37:54] (03CR) 10Filippo Giunchedi: [V: 03+1 C: 03+2] hieradata: force Host header for thanos-swift [puppet] - 10https://gerrit.wikimedia.org/r/757398 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi) [10:41:04] !log hnowlan@deploy1002 Started deploy [restbase/deploy@0848b15] (dev-cluster): (no justification provided) [10:41:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:42:07] PROBLEM - k8s API server requests latencies on ml-serve-ctrl1001 is CRITICAL: instance=10.64.16.202 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [10:44:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184', diff saved to https://phabricator.wikimedia.org/P19269 and previous config saved to 
/var/cache/conftool/dbconfig/20220126-104416-marostegui.json [10:44:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:46:05] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:48:23] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host es1020.eqiad.wmnet with OS bullseye [10:48:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:49:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1020 (re)pooling @ 1%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19270 and previous config saved to /var/cache/conftool/dbconfig/20220126-104955-root.json [10:49:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:52:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1120 (re)pooling @ 50%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19271 and previous config saved to /var/cache/conftool/dbconfig/20220126-105220-root.json [10:52:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:59:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184', diff saved to https://phabricator.wikimedia.org/P19272 and previous config saved to /var/cache/conftool/dbconfig/20220126-105921-marostegui.json [10:59:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:03:20] !log hnowlan@deploy1002 Finished deploy [restbase/deploy@0848b15] (dev-cluster): (no justification provided) (duration: 22m 16s) [11:03:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:04:50] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host es2025.codfw.wmnet with OS bullseye [11:04:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[11:04:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1020 (re)pooling @ 5%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19273 and previous config saved to /var/cache/conftool/dbconfig/20220126-110458-root.json [11:05:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:07:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1120 (re)pooling @ 60%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19274 and previous config saved to /var/cache/conftool/dbconfig/20220126-110723-root.json [11:07:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:08:11] (03PS14) 10Elukey: P:rsyslog::kafka_shipper: move Kafka TLS CA settings to the new bundle [puppet] - 10https://gerrit.wikimedia.org/r/739463 (https://phabricator.wikimedia.org/T291905) [11:10:56] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33454/console" [puppet] - 10https://gerrit.wikimedia.org/r/739463 (https://phabricator.wikimedia.org/T291905) (owner: 10Elukey) [11:12:01] ACKNOWLEDGEMENT - dump of es4 in eqiad on alert1001 is CRITICAL: dump for es4 at eqiad taken more than 8 days ago: Most recent backup 2022-01-18 00:00:01 Jcrespo rerunning after hw troubles - The acknowledgement expires at: 2022-01-27 07:00:00. 
https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [11:12:05] (03CR) 10Elukey: [V: 03+1] P:rsyslog::kafka_shipper: move Kafka TLS CA settings to the new bundle (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/739463 (https://phabricator.wikimedia.org/T291905) (owner: 10Elukey) [11:13:17] (03PS15) 10Elukey: P:rsyslog::kafka_shipper: move Kafka TLS CA settings to the new bundle [puppet] - 10https://gerrit.wikimedia.org/r/739463 (https://phabricator.wikimedia.org/T291905) [11:14:13] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33455/console" [puppet] - 10https://gerrit.wikimedia.org/r/739463 (https://phabricator.wikimedia.org/T291905) (owner: 10Elukey) [11:14:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184 (T285149)', diff saved to https://phabricator.wikimedia.org/P19275 and previous config saved to /var/cache/conftool/dbconfig/20220126-111425-marostegui.json [11:14:27] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [11:14:29] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [11:14:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:14:30] (03PS1) 10JMeybohm: Enable IPv6DualStack for kubelet on staging masters [puppet] - 10https://gerrit.wikimedia.org/r/757407 (https://phabricator.wikimedia.org/T290967) [11:14:31] T285149: Schema change for dropping rev_page_id index - https://phabricator.wikimedia.org/T285149 [11:14:32] (03PS1) 10JMeybohm: Split profile::kubernetes::master_hosts by DC [puppet] - 10https://gerrit.wikimedia.org/r/757408 (https://phabricator.wikimedia.org/T290967) [11:14:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:14:35] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:14:36] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2103.codfw.wmnet with reason: Maintenance [11:14:38] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2103.codfw.wmnet with reason: Maintenance [11:14:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:14:39] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 14 hosts with reason: Maintenance [11:14:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:14:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:14:50] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 14 hosts with reason: Maintenance [11:14:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:14:53] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1106.eqiad.wmnet with reason: Maintenance [11:14:55] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1106.eqiad.wmnet with reason: Maintenance [11:14:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:14:56] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [11:14:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:15:00] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [11:15:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:15:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:15:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling 
db1106 (T285149)', diff saved to https://phabricator.wikimedia.org/P19276 and previous config saved to /var/cache/conftool/dbconfig/20220126-111504-marostegui.json [11:15:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:15:47] (03CR) 10Elukey: [V: 03+1] "Folks: I have removed the varnishkafka bits (already deployed with another change) and adjusted an extra space in the config." [puppet] - 10https://gerrit.wikimedia.org/r/739463 (https://phabricator.wikimedia.org/T291905) (owner: 10Elukey) [11:16:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106 (T285149)', diff saved to https://phabricator.wikimedia.org/P19277 and previous config saved to /var/cache/conftool/dbconfig/20220126-111610-marostegui.json [11:16:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:16:41] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (DIFF 2 NOOP 8): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33456/console" [puppet] - 10https://gerrit.wikimedia.org/r/757408 (https://phabricator.wikimedia.org/T290967) (owner: 10JMeybohm) [11:20:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1020 (re)pooling @ 10%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19278 and previous config saved to /var/cache/conftool/dbconfig/20220126-112002-root.json [11:20:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:21:50] 10SRE, 10MassMessage, 10WMF-JobQueue, 10Platform Team Workboards (Clinic Duty Team): Same MassMessage is being sent more than once - https://phabricator.wikimedia.org/T93049 (10Nikerabbit) [11:22:06] (03PS1) 10Volans: sre.hosts.provision: disable PXE on all other NICs [cookbooks] - 10https://gerrit.wikimedia.org/r/757410 (https://phabricator.wikimedia.org/T299123) [11:22:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1120 (re)pooling @ 75%: repooling after reimage', diff saved to 
https://phabricator.wikimedia.org/P19279 and previous config saved to /var/cache/conftool/dbconfig/20220126-112227-root.json [11:22:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:23:43] (03PS1) 10Ladsgroup: rdbms: Pass commented SQL to the GeneralizedSql for logging [core] (wmf/1.38.0-wmf.19) - 10https://gerrit.wikimedia.org/r/757133 (https://phabricator.wikimedia.org/T298687) [11:23:53] jouncebot: nowandnext [11:23:54] No deployments scheduled for the next 0 hour(s) and 36 minute(s) [11:23:54] In 0 hour(s) and 36 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220126T1200) [11:24:02] (03CR) 10Ladsgroup: [C: 03+2] rdbms: Pass commented SQL to the GeneralizedSql for logging [core] (wmf/1.38.0-wmf.19) - 10https://gerrit.wikimedia.org/r/757133 (https://phabricator.wikimedia.org/T298687) (owner: 10Ladsgroup) [11:24:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es2025 (T300006)', diff saved to https://phabricator.wikimedia.org/P19280 and previous config saved to /var/cache/conftool/dbconfig/20220126-112439-ladsgroup.json [11:24:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:24:43] T300006: Upgrade es5 to Bullseye - https://phabricator.wikimedia.org/T300006 [11:25:40] (03PS1) 10Cparle: Deal with change in MachineVision handler constructor [mediawiki-config] - 10https://gerrit.wikimedia.org/r/757413 [11:26:31] (03CR) 10Matthias Mullie: [C: 03+1] "LGTM!" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/757413 (owner: 10Cparle) [11:26:33] (03PS1) 104nn1l2: fawiki: Add unwatchedpages permission to patrollers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/757414 (https://phabricator.wikimedia.org/T300126) [11:26:38] (03PS1) 10Vgutierrez: cache: Provide a text_envoy role [puppet] - 10https://gerrit.wikimedia.org/r/757415 (https://phabricator.wikimedia.org/T271421) [11:27:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es2025 (T300006)', diff saved to https://phabricator.wikimedia.org/P19281 and previous config saved to /var/cache/conftool/dbconfig/20220126-112719-ladsgroup.json [11:27:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:43] PROBLEM - Postgres Replication Lag on maps1008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 11449147416 and 1398 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [11:30:38] (03CR) 10Jbond: [C: 03+1] "LGTM but will leave to arturo to merge" [puppet] - 10https://gerrit.wikimedia.org/r/757394 (owner: 10Majavah) [11:31:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106', diff saved to https://phabricator.wikimedia.org/P19282 and previous config saved to /var/cache/conftool/dbconfig/20220126-113115-marostegui.json [11:31:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:31:29] (03PS1) 10Marostegui: change_qci_timestamp_T298559.py: New schema change [software/schema-changes] - 10https://gerrit.wikimedia.org/r/757416 (https://phabricator.wikimedia.org/T298559) [11:32:05] (03PS2) 10Marostegui: change_qci_timestamp_T298559.py: New schema change [software/schema-changes] - 10https://gerrit.wikimedia.org/r/757416 (https://phabricator.wikimedia.org/T298559) [11:32:20] (03CR) 10Marostegui: "This seems to be working fine on cumin1001" [software/schema-changes] - 
10https://gerrit.wikimedia.org/r/757416 (https://phabricator.wikimedia.org/T298559) (owner: 10Marostegui) [11:32:47] (03PS1) 10Ladsgroup: es2024: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/757417 (https://phabricator.wikimedia.org/T300006) [11:33:28] (03PS1) 10Marostegui: Revert "es1020: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/757134 [11:34:26] (03CR) 10Marostegui: [C: 03+2] Revert "es1020: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/757134 (owner: 10Marostegui) [11:34:40] (03CR) 10Ladsgroup: [C: 03+1] change_qci_timestamp_T298559.py: New schema change [software/schema-changes] - 10https://gerrit.wikimedia.org/r/757416 (https://phabricator.wikimedia.org/T298559) (owner: 10Marostegui) [11:34:51] (03CR) 10Marostegui: [V: 03+2 C: 03+2] change_qci_timestamp_T298559.py: New schema change [software/schema-changes] - 10https://gerrit.wikimedia.org/r/757416 (https://phabricator.wikimedia.org/T298559) (owner: 10Marostegui) [11:35:04] (03CR) 10Volans: "LGTM, just missing a parameter in a call" [cookbooks] - 10https://gerrit.wikimedia.org/r/691275 (owner: 10CDanis) [11:35:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1020 (re)pooling @ 20%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19283 and previous config saved to /var/cache/conftool/dbconfig/20220126-113505-root.json [11:35:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:35:32] (03PS2) 10Ladsgroup: es2024: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/757417 (https://phabricator.wikimedia.org/T300006) [11:35:36] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] es2024: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/757417 (https://phabricator.wikimedia.org/T300006) (owner: 10Ladsgroup) [11:36:20] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on es2024.codfw.wmnet with reason: Maintenance [11:36:21] !log 
ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2024.codfw.wmnet with reason: Maintenance [11:36:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:36:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:36:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling es2024 (T300006)', diff saved to https://phabricator.wikimedia.org/P19284 and previous config saved to /var/cache/conftool/dbconfig/20220126-113626-ladsgroup.json [11:36:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:36:30] T300006: Upgrade es5 to Bullseye - https://phabricator.wikimedia.org/T300006 [11:37:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1120 (re)pooling @ 100%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19285 and previous config saved to /var/cache/conftool/dbconfig/20220126-113730-root.json [11:37:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:37:39] (03PS1) 10Ladsgroup: Revert "es2025: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/757135 [11:38:00] (03PS2) 10Ladsgroup: Revert "es2025: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/757135 [11:38:04] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] Revert "es2025: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/757135 (owner: 10Ladsgroup) [11:39:31] (03Merged) 10jenkins-bot: rdbms: Pass commented SQL to the GeneralizedSql for logging [core] (wmf/1.38.0-wmf.19) - 10https://gerrit.wikimedia.org/r/757133 (https://phabricator.wikimedia.org/T298687) (owner: 10Ladsgroup) [11:40:35] (03CR) 10Elukey: [C: 03+1] "Wasn't aware of the option, but I see that the rest of the cluster use it, so LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/757407 (https://phabricator.wikimedia.org/T290967) (owner: 10JMeybohm) [11:41:20] !log installing libxfont security updates [11:41:22] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:41:29] !log ladsgroup@deploy1002 Synchronized php-1.38.0-wmf.19/includes/libs/rdbms/database/Database.php: Backport: [[gerrit:757133|rdbms: Pass commented SQL to the GeneralizedSql for logging (T298687)]] (duration: 00m 54s) [11:41:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:41:33] T298687: Query caller comments /* Class::method */ not available in slow queries logstash dashboard - https://phabricator.wikimedia.org/T298687 [11:42:07] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.reimage for host es2024.codfw.wmnet with OS bullseye [11:42:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:42:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1137 T300099', diff saved to https://phabricator.wikimedia.org/P19286 and previous config saved to /var/cache/conftool/dbconfig/20220126-114236-marostegui.json [11:42:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:42:41] T300099: Upgrade x1 to Bullseye - https://phabricator.wikimedia.org/T300099 [11:43:34] (03CR) 10Elukey: "I see from the pcc that no worker has been tested, but from profile::kubernetes::node IIUC the new settings should affect the ferm rules o" [puppet] - 10https://gerrit.wikimedia.org/r/757408 (https://phabricator.wikimedia.org/T290967) (owner: 10JMeybohm) [11:43:44] (03PS1) 10Marostegui: db1137: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/757419 (https://phabricator.wikimedia.org/T300099) [11:43:45] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [11:43:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:44:34] (03CR) 10Marostegui: [C: 03+2] db1137: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/757419 (https://phabricator.wikimedia.org/T300099) (owner: 10Marostegui) [11:44:51] !log 
marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host db1137.eqiad.wmnet with OS bullseye [11:44:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:45:01] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [11:45:02] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [11:45:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:45:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:46:08] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [11:46:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:46:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106', diff saved to https://phabricator.wikimedia.org/P19287 and previous config saved to /var/cache/conftool/dbconfig/20220126-114619-marostegui.json [11:46:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:47:23] PROBLEM - Postgres Replication Lag on maps1007 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 46584560 and 346 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [11:50:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1020 (re)pooling @ 25%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19288 and previous config saved to /var/cache/conftool/dbconfig/20220126-115009-root.json [11:50:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:51:12] (03PS1) 104nn1l2: commonswiki: Add www.kew.org to the wgCopyUploadsDomains allowlist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/757423 (https://phabricator.wikimedia.org/T300101) [11:52:20] (03CR) 10jerkins-bot: [V: 04-1] commonswiki: Add www.kew.org to the wgCopyUploadsDomains allowlist 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/757423 (https://phabricator.wikimedia.org/T300101) (owner: 104nn1l2) [11:53:51] (03PS1) 10Hnowlan: maps: tweak postgres configuration settings to use more resources [puppet] - 10https://gerrit.wikimedia.org/r/757424 (https://phabricator.wikimedia.org/T298246) [11:55:57] (03PS2) 104nn1l2: commonswiki: Add www.kew.org to the wgCopyUploadsDomains allowlist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/757423 (https://phabricator.wikimedia.org/T300101) [12:00:05] Amir1, Lucas_WMDE, awight, and Urbanecm: My dear minions, it's time we take the moon! Just kidding. Time for UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220126T1200). [12:00:05] dcausse, cormacparle, and nn1l2: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [12:00:09] hi [12:00:17] o/ [12:00:38] matthiasmullie: is gonna handle our deployment [12:00:40] o/ @cormacparle is not around, I'm here instead to look after that patch [12:01:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106 (T285149)', diff saved to https://phabricator.wikimedia.org/P19290 and previous config saved to /var/cache/conftool/dbconfig/20220126-120125-marostegui.json [12:01:27] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1119.eqiad.wmnet with reason: Maintenance [12:01:28] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1119.eqiad.wmnet with reason: Maintenance [12:01:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:01:31] T285149: Schema change for dropping rev_page_id index - https://phabricator.wikimedia.org/T285149 [12:01:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:01:33] !log 
marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1119 (T285149)', diff saved to https://phabricator.wikimedia.org/P19291 and previous config saved to /var/cache/conftool/dbconfig/20220126-120132-marostegui.json [12:01:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:01:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:01:46] I can deploy I guess [12:01:51] o/ (late ^^) [12:02:01] :) [12:02:55] matthiasmullie: I'll start with your patch [12:03:42] (03CR) 10DCausse: [C: 03+2] Deal with change in MachineVision handler constructor [mediawiki-config] - 10https://gerrit.wikimedia.org/r/757413 (owner: 10Cparle) [12:03:43] cool - can move straight to prod, can't be tested on mwdebug [12:03:47] (03PS1) 10Kosta Harlan: linkrecommendation: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/757426 [12:03:49] ok [12:04:03] (03CR) 10Kosta Harlan: [C: 03+2] linkrecommendation: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/757426 (owner: 10Kosta Harlan) [12:04:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119 (T285149)', diff saved to https://phabricator.wikimedia.org/P19292 and previous config saved to /var/cache/conftool/dbconfig/20220126-120439-marostegui.json [12:04:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:05:04] (03Merged) 10jenkins-bot: Deal with change in MachineVision handler constructor [mediawiki-config] - 10https://gerrit.wikimedia.org/r/757413 (owner: 10Cparle) [12:05:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1020 (re)pooling @ 40%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19293 and previous config saved to /var/cache/conftool/dbconfig/20220126-120513-root.json [12:05:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:05:26] RECOVERY - Postgres Replication Lag on maps1007 is OK: 
POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 0 and 1430 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:08:34] (03Merged) 10jenkins-bot: linkrecommendation: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/757426 (owner: 10Kosta Harlan) [12:08:57] (03PS1) 10Marostegui: Revert "db1137: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/757136 [12:09:02] !log dcausse@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:757413|Deal with change in MachineVision handler constructor]] (duration: 00m 51s) [12:09:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:09:28] matthiasmullie: should be done ^ [12:09:32] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1137.eqiad.wmnet with OS bullseye [12:09:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:09:38] !log kharlan@deploy1002 helmfile [staging] START helmfile.d/services/linkrecommendation: apply on staging [12:09:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:09:40] !log kharlan@deploy1002 helmfile [staging] DONE helmfile.d/services/linkrecommendation: apply on external [12:09:41] !log kharlan@deploy1002 helmfile [staging] DONE helmfile.d/services/linkrecommendation: apply on internal [12:09:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:09:43] @dcausse thanks [12:09:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:09:49] (03PS8) 10Jbond: WIP: add reposync [software/spicerack] - 10https://gerrit.wikimedia.org/r/747116 [12:10:02] (03CR) 10Marostegui: [C: 03+2] Revert "db1137: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/757136 (owner: 10Marostegui) [12:10:04] nn1l2: hi, I'm looking at your patches [12:10:15] thanks! 
[12:10:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1137 (re)pooling @ 1%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19294 and previous config saved to /var/cache/conftool/dbconfig/20220126-121032-root.json [12:10:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:10:40] !log kharlan@deploy1002 helmfile [staging] DONE helmfile.d/services/linkrecommendation: sync on staging [12:10:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:11:34] (03CR) 10DCausse: [C: 03+2] fawiki: Add unwatchedpages permission to patrollers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/757414 (https://phabricator.wikimedia.org/T300126) (owner: 104nn1l2) [12:11:35] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [12:11:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:11:59] 10SRE, 10observability: Move Kafka logging to the new intermediate PKI - https://phabricator.wikimedia.org/T300130 (10elukey) [12:12:47] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [12:12:49] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [12:12:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:12:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:13:59] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [12:14:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:14:30] nn1l2: regarding https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/757423 could I get +1 from someone [12:14:58] why exactly? 
[12:14:59] Lucas_WMDE: I'm not familiar with this; do you think this needs some more approval? ^ [12:15:13] * Lucas_WMDE looks [12:15:14] nn1l2: I'm not familiar with this setting [12:15:14] It used to be done by the deployer himself? [12:15:22] (03CR) 10Majavah: [C: 03+1] commonswiki: Add www.kew.org to the wgCopyUploadsDomains allowlist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/757423 (https://phabricator.wikimedia.org/T300101) (owner: 104nn1l2) [12:15:28] thanks! [12:15:33] I don’t think those usually need much approval [12:15:43] I’d quickly check on the website if their content is indeed under some CC license [12:15:59] (03CR) 10jerkins-bot: [V: 04-1] WIP: add reposync [software/spicerack] - 10https://gerrit.wikimedia.org/r/747116 (owner: 10Jbond) [12:16:05] yeah, I usually do what Lucas_WMDE says [12:16:21] the URL in the task description says CC BY 4.0, lgtm [12:16:30] GBIF is completely trustworthy [12:16:38] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] "looks like the site indeed publishes CC-licensed images, cool" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/757423 (https://phabricator.wikimedia.org/T300101) (owner: 104nn1l2) [12:16:39] look at the phab ticket [12:16:53] their website says nothing though [12:17:00] I'm an admin on Commons [12:17:15] I do care about copyright and free licenses :) [12:18:08] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host es2024.codfw.wmnet with OS bullseye [12:18:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:18:19] (03PS2) 10DCausse: fawiki: Add unwatchedpages permission to patrollers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/757414 (https://phabricator.wikimedia.org/T300126) (owner: 104nn1l2) [12:18:26] (03CR) 10DCausse: fawiki: Add unwatchedpages permission to patrollers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/757414 (https://phabricator.wikimedia.org/T300126) (owner: 104nn1l2) [12:18:28] (03CR) 
10DCausse: [C: 03+2] fawiki: Add unwatchedpages permission to patrollers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/757414 (https://phabricator.wikimedia.org/T300126) (owner: 104nn1l2) [12:19:12] nn1l2: sorry about that I don't do deploys that often [12:19:16] (03Merged) 10jenkins-bot: fawiki: Add unwatchedpages permission to patrollers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/757414 (https://phabricator.wikimedia.org/T300126) (owner: 104nn1l2) [12:19:28] no problem [12:19:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119', diff saved to https://phabricator.wikimedia.org/P19296 and previous config saved to /var/cache/conftool/dbconfig/20220126-121944-marostegui.json [12:19:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:20:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1020 (re)pooling @ 50%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19297 and previous config saved to /var/cache/conftool/dbconfig/20220126-122016-root.json [12:20:18] (03CR) 10Jgiannelos: [C: 03+1] maps: tweak postgres configuration settings to use more resources [puppet] - 10https://gerrit.wikimedia.org/r/757424 (https://phabricator.wikimedia.org/T298246) (owner: 10Hnowlan) [12:20:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:20:28] nn1l2: do you need your patches to be tested on mwdebug first? 
[12:20:38] yes please [12:22:47] !log installing apache security updates [12:22:48] nn1l2: the first one regarding fawiki is available on mwdebug1001, please let me know if it's OK to move forward [12:22:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:23:46] (03CR) 10MarcoAurelio: mediawiki::maintenance: Run recountCategories.php monthly on all wikis (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/756069 (https://phabricator.wikimedia.org/T299823) (owner: 10MarcoAurelio) [12:23:48] (03PS1) 10Kosta Harlan: linkrecommendation: Set log level to WARN [deployment-charts] - 10https://gerrit.wikimedia.org/r/757430 [12:24:08] : LGTM, good to go [12:24:11] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [12:24:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:24:15] thanks, shipping [12:25:25] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [12:25:27] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [12:25:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:25:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:25:32] !log dcausse@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:757414|fawiki: Add unwatchedpages permission to patrollers (T300126)]] (duration: 00m 51s) [12:25:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:25:36] T300126: Add unwatchedpages permission to patrollers on Farsi Wikipedia - https://phabricator.wikimedia.org/T300126 [12:25:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1137 (re)pooling @ 5%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19298 and previous config saved to /var/cache/conftool/dbconfig/20220126-122536-root.json [12:25:39] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:26:03] (03PS3) 10DCausse: commonswiki: Add www.kew.org to the wgCopyUploadsDomains allowlist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/757423 (https://phabricator.wikimedia.org/T300101) (owner: 104nn1l2) [12:26:24] (03CR) 10Kosta Harlan: [C: 03+2] linkrecommendation: Set log level to WARN [deployment-charts] - 10https://gerrit.wikimedia.org/r/757430 (owner: 10Kosta Harlan) [12:26:43] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [12:26:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:27:11] (03CR) 10DCausse: [C: 03+2] commonswiki: Add www.kew.org to the wgCopyUploadsDomains allowlist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/757423 (https://phabricator.wikimedia.org/T300101) (owner: 104nn1l2) [12:27:57] (03Merged) 10jenkins-bot: commonswiki: Add www.kew.org to the wgCopyUploadsDomains allowlist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/757423 (https://phabricator.wikimedia.org/T300101) (owner: 104nn1l2) [12:28:12] !log kharlan@deploy1002 helmfile [staging] START helmfile.d/services/linkrecommendation: apply on staging [12:28:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:28:15] !log kharlan@deploy1002 helmfile [staging] DONE helmfile.d/services/linkrecommendation: apply on internal [12:28:16] !log kharlan@deploy1002 helmfile [staging] DONE helmfile.d/services/linkrecommendation: apply on external [12:28:17] !log kharlan@deploy1002 helmfile [staging] DONE helmfile.d/services/linkrecommendation: apply on staging [12:28:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:28:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:28:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:29:04] nn1l2: the second one regarding commons is available on 
mwdebug1001, please let me know if it's OK [12:29:59] (03Merged) 10jenkins-bot: linkrecommendation: Set log level to WARN [deployment-charts] - 10https://gerrit.wikimedia.org/r/757430 (owner: 10Kosta Harlan) [12:30:11] dcausse: LGTM, file uploaded successfully: https://commons.wikimedia.org/wiki/File:Swertia_chirata_Buch.-Ham._ex_Wall._685756.jpg [12:30:38] cool, shipping then [12:31:04] (03CR) 10Jbond: [C: 03+1] "lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/757410 (https://phabricator.wikimedia.org/T299123) (owner: 10Volans) [12:31:12] (03PS1) 10Volans: setup.py: temporary limit dnspython [software/spicerack] - 10https://gerrit.wikimedia.org/r/757431 [12:31:48] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [12:31:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:31:57] (03CR) 10Volans: [C: 03+2] sre.hosts.provision: disable PXE on all other NICs [cookbooks] - 10https://gerrit.wikimedia.org/r/757410 (https://phabricator.wikimedia.org/T299123) (owner: 10Volans) [12:32:22] (03CR) 10Jbond: [C: 03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/757431 (owner: 10Volans) [12:32:49] !log dcausse@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:757423|commonswiki: Add www.kew.org to the wgCopyUploadsDomains allowlist (T300101)]] (duration: 00m 51s) [12:32:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:32:53] T300101: Add www.kew.org to the wgCopyUploadsDomains allowlist of Wikimedia Commons - https://phabricator.wikimedia.org/T300101 [12:32:58] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [12:32:59] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [12:33:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:33:03] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:33:38] nn1l2: should be good, thanks for your patience! :) [12:33:49] shipping mine now [12:33:52] Thank you! [12:34:13] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [12:34:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:34:22] (03PS2) 10DCausse: Correct wcqs event stream name [mediawiki-config] - 10https://gerrit.wikimedia.org/r/757122 (owner: 10Ebernhardson) [12:34:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119', diff saved to https://phabricator.wikimedia.org/P19299 and previous config saved to /var/cache/conftool/dbconfig/20220126-123448-marostegui.json [12:34:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:35:11] (03Merged) 10jenkins-bot: sre.hosts.provision: disable PXE on all other NICs [cookbooks] - 10https://gerrit.wikimedia.org/r/757410 (https://phabricator.wikimedia.org/T299123) (owner: 10Volans) [12:35:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1020 (re)pooling @ 60%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19300 and previous config saved to /var/cache/conftool/dbconfig/20220126-123520-root.json [12:35:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:35:42] (03CR) 10DCausse: [C: 03+2] Correct wcqs event stream name [mediawiki-config] - 10https://gerrit.wikimedia.org/r/757122 (owner: 10Ebernhardson) [12:37:37] (03Merged) 10jenkins-bot: Correct wcqs event stream name [mediawiki-config] - 10https://gerrit.wikimedia.org/r/757122 (owner: 10Ebernhardson) [12:38:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es2024 (T300006)', diff saved to https://phabricator.wikimedia.org/P19301 and previous config saved to /var/cache/conftool/dbconfig/20220126-123839-ladsgroup.json [12:38:43] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:38:44] T300006: Upgrade es5 to Bullseye - https://phabricator.wikimedia.org/T300006 [12:40:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1137 (re)pooling @ 10%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19302 and previous config saved to /var/cache/conftool/dbconfig/20220126-124040-root.json [12:40:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:41:38] !log dcausse@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:757122|Correct wcqs event stream name]] (duration: 00m 51s) [12:41:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:43:15] !log UTC morning backport done [12:43:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:44:27] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [12:44:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:45:37] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [12:45:39] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [12:45:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:45:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:46:49] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [12:46:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:47:59] 10SRE, 10SRE-Access-Requests, 10Research: Access to analytics-privatedata-users for Research intern AniketArs - https://phabricator.wikimedia.org/T299919 (10AniketArs) @jhathaway Sorry for the key error, I again generated keys which I'm sharing with you(public one) {F34931520} this time it is a correct key [12:49:54] !log 
marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119 (T285149)', diff saved to https://phabricator.wikimedia.org/P19303 and previous config saved to /var/cache/conftool/dbconfig/20220126-124953-marostegui.json [12:49:55] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1105.eqiad.wmnet with reason: Maintenance [12:49:57] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1105.eqiad.wmnet with reason: Maintenance [12:49:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:49:58] T285149: Schema change for dropping rev_page_id index - https://phabricator.wikimedia.org/T285149 [12:50:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:50:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1105:3311 (T285149)', diff saved to https://phabricator.wikimedia.org/P19304 and previous config saved to /var/cache/conftool/dbconfig/20220126-125001-marostegui.json [12:50:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:50:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:50:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1020 (re)pooling @ 75%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19305 and previous config saved to /var/cache/conftool/dbconfig/20220126-125023-root.json [12:50:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:51:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311 (T285149)', diff saved to https://phabricator.wikimedia.org/P19306 and previous config saved to /var/cache/conftool/dbconfig/20220126-125107-marostegui.json [12:51:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:52:25] (03CR) 10Arturo Borrero Gonzalez: "I'm not against this change, but I'd like to clarify one 
thing first:" [puppet] - 10https://gerrit.wikimedia.org/r/757394 (owner: 10Majavah) [12:55:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1137 (re)pooling @ 20%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19307 and previous config saved to /var/cache/conftool/dbconfig/20220126-125543-root.json [12:55:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:56:03] (03CR) 10JMeybohm: [V: 03+1] Split profile::kubernetes::master_hosts by DC (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/757408 (https://phabricator.wikimedia.org/T290967) (owner: 10JMeybohm) [12:58:04] (03PS1) 10Ladsgroup: Revert "es2024: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/757138 [12:58:31] (03PS2) 10Ladsgroup: Revert "es2024: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/757138 [12:58:35] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] Revert "es2024: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/757138 (owner: 10Ladsgroup) [12:58:38] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33458/console" [puppet] - 10https://gerrit.wikimedia.org/r/757408 (https://phabricator.wikimedia.org/T290967) (owner: 10JMeybohm) [12:58:54] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (DIFF 2 NOOP 8): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33457/console" [puppet] - 10https://gerrit.wikimedia.org/r/757407 (https://phabricator.wikimedia.org/T290967) (owner: 10JMeybohm) [13:03:33] (03PS1) 10Kosta Harlan: linkrecommendation: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/757432 [13:05:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1020 (re)pooling @ 100%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19308 and previous config saved to /var/cache/conftool/dbconfig/20220126-130527-root.json [13:05:31] Logged the message 
at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:05:52] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1014.eqiad.wmnet [13:05:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:06:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311', diff saved to https://phabricator.wikimedia.org/P19309 and previous config saved to /var/cache/conftool/dbconfig/20220126-130611-marostegui.json [13:06:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:10:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1137 (re)pooling @ 25%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19310 and previous config saved to /var/cache/conftool/dbconfig/20220126-131047-root.json [13:10:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:22] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1014.eqiad.wmnet [13:11:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:35] (03CR) 10Kosta Harlan: [C: 03+2] linkrecommendation: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/757432 (owner: 10Kosta Harlan) [13:15:19] (03Merged) 10jenkins-bot: linkrecommendation: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/757432 (owner: 10Kosta Harlan) [13:16:15] !log kharlan@deploy1002 helmfile [staging] START helmfile.d/services/linkrecommendation: apply on staging [13:16:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:16:17] !log kharlan@deploy1002 helmfile [staging] DONE helmfile.d/services/linkrecommendation: apply on internal [13:16:18] !log kharlan@deploy1002 helmfile [staging] DONE helmfile.d/services/linkrecommendation: apply on external [13:16:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:16:23] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:17:40] !log kharlan@deploy1002 helmfile [staging] DONE helmfile.d/services/linkrecommendation: sync on staging [13:17:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:15] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1102.eqiad.wmnet with reason: Maintenance [13:19:17] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1102.eqiad.wmnet with reason: Maintenance [13:19:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:28] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1139.eqiad.wmnet with reason: Maintenance [13:19:30] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1139.eqiad.wmnet with reason: Maintenance [13:19:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:41] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [13:19:42] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [13:19:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:54] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1105.eqiad.wmnet with reason: Maintenance [13:19:55] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1105.eqiad.wmnet with reason: Maintenance [13:19:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[13:19:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1105:3312 (T298559)', diff saved to https://phabricator.wikimedia.org/P19311 and previous config saved to /var/cache/conftool/dbconfig/20220126-131959-marostegui.json [13:20:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:04] T298559: Fix mismatching field type of querycache_info.qci_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298559 [13:21:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312 (T298559)', diff saved to https://phabricator.wikimedia.org/P19313 and previous config saved to /var/cache/conftool/dbconfig/20220126-132114-marostegui.json [13:21:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311', diff saved to https://phabricator.wikimedia.org/P19314 and previous config saved to /var/cache/conftool/dbconfig/20220126-132122-marostegui.json [13:21:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:22:00] (03CR) 10Volans: [C: 03+2] setup.py: temporary limit dnspython [software/spicerack] - 10https://gerrit.wikimedia.org/r/757431 (owner: 10Volans) [13:24:33] (03PS1) 10JMeybohm: Enable overlayfs on kubernetes masters [puppet] - 10https://gerrit.wikimedia.org/r/757433 (https://phabricator.wikimedia.org/T290967) [13:26:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove recentchanges from s8 eqiad T263127', diff saved to https://phabricator.wikimedia.org/P19315 and previous config saved to /var/cache/conftool/dbconfig/20220126-132600-marostegui.json [13:26:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1137 (re)pooling @ 40%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19316 and previous config saved to 
/var/cache/conftool/dbconfig/20220126-132603-root.json [13:26:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:05] T263127: Remove groups from db configs - https://phabricator.wikimedia.org/T263127 [13:26:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:27:12] (03PS1) 10JMeybohm: Upgrade codfw kubernetes masters to tainted full nodes [puppet] - 10https://gerrit.wikimedia.org/r/757434 (https://phabricator.wikimedia.org/T290967) [13:27:46] (03Merged) 10jenkins-bot: setup.py: temporary limit dnspython [software/spicerack] - 10https://gerrit.wikimedia.org/r/757431 (owner: 10Volans) [13:28:43] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1005.eqiad.wmnet [13:28:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:38] (03PS1) 10Volans: redfish: better support of parsing JSON responses [software/spicerack] - 10https://gerrit.wikimedia.org/r/757435 (https://phabricator.wikimedia.org/T299123) [13:33:15] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1005.eqiad.wmnet [13:33:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:27] (03PS1) 104nn1l2: fawiki: Add unwatchedpages permission to eliminators [mediawiki-config] - 10https://gerrit.wikimedia.org/r/757436 (https://phabricator.wikimedia.org/T300126) [13:36:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312', diff saved to https://phabricator.wikimedia.org/P19317 and previous config saved to /var/cache/conftool/dbconfig/20220126-133619-marostegui.json [13:36:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311 (T285149)', diff saved to https://phabricator.wikimedia.org/P19318 and previous config saved to 
/var/cache/conftool/dbconfig/20220126-133627-marostegui.json [13:36:29] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1164.eqiad.wmnet with reason: Maintenance [13:36:30] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1164.eqiad.wmnet with reason: Maintenance [13:36:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:32] T285149: Schema change for dropping rev_page_id index - https://phabricator.wikimedia.org/T285149 [13:36:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:35] (03PS1) 10JMeybohm: Add k8s masters in codfw eBGP config [homer/public] - 10https://gerrit.wikimedia.org/r/757437 (https://phabricator.wikimedia.org/T290967) [13:36:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1164 (T285149)', diff saved to https://phabricator.wikimedia.org/P19319 and previous config saved to /var/cache/conftool/dbconfig/20220126-133634-marostegui.json [13:36:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:37] (03PS1) 10JMeybohm: Add k8s masters in eqiad eBGP config [homer/public] - 10https://gerrit.wikimedia.org/r/757438 (https://phabricator.wikimedia.org/T290967) [13:36:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:37:39] (03CR) 10Jbond: [C: 03+1] P:nftables::basefirewall: drop prometheus port filtering (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/757394 (owner: 10Majavah) [13:37:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1164 (T285149)', diff saved to https://phabricator.wikimedia.org/P19320 and previous config saved to /var/cache/conftool/dbconfig/20220126-133740-marostegui.json [13:37:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:40:24] (03PS1) 10JMeybohm: Add keys needed for k8s node profile to main master nodes [labs/private] - 
10https://gerrit.wikimedia.org/r/757441 (https://phabricator.wikimedia.org/T290967) [13:41:05] (03CR) 10JMeybohm: [V: 03+2 C: 03+2] Add keys needed for k8s node profile to main master nodes [labs/private] - 10https://gerrit.wikimedia.org/r/757441 (https://phabricator.wikimedia.org/T290967) (owner: 10JMeybohm) [13:41:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1137 (re)pooling @ 50%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19321 and previous config saved to /var/cache/conftool/dbconfig/20220126-134106-root.json [13:41:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:05] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (DIFF 6 NOOP 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33460/console" [puppet] - 10https://gerrit.wikimedia.org/r/757434 (https://phabricator.wikimedia.org/T290967) (owner: 10JMeybohm) [13:45:46] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1006.eqiad.wmnet [13:45:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:48:41] (03PS2) 10Volans: spicerack: allow to execute another cookbook [software/spicerack] - 10https://gerrit.wikimedia.org/r/755756 [13:48:46] (03CR) 10Volans: "Addressed comments" [software/spicerack] - 10https://gerrit.wikimedia.org/r/755756 (owner: 10Volans) [13:50:16] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1006.eqiad.wmnet [13:50:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312', diff saved to https://phabricator.wikimedia.org/P19322 and previous config saved to /var/cache/conftool/dbconfig/20220126-135124-marostegui.json [13:51:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:44] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 4 
days, 0:00:00 on ganeti1015.eqiad.wmnet with reason: Remove from Ganeti cluster for reimage [13:52:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1164', diff saved to https://phabricator.wikimedia.org/P19323 and previous config saved to /var/cache/conftool/dbconfig/20220126-135245-marostegui.json [13:52:47] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4 days, 0:00:00 on ganeti1015.eqiad.wmnet with reason: Remove from Ganeti cluster for reimage [13:52:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:37] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/739463 (https://phabricator.wikimedia.org/T291905) (owner: 10Elukey) [13:53:43] 10SRE, 10ops-eqiad: Installation issues on PowerEdge R440 eqiad Ganeti servers with buster / firmware update needed - https://phabricator.wikimedia.org/T299527 (10MoritzMuehlenhoff) [13:54:03] 10SRE, 10ops-eqiad: Installation issues on PowerEdge R440 eqiad Ganeti servers with buster / firmware update needed - https://phabricator.wikimedia.org/T299527 (10MoritzMuehlenhoff) One more server is ready and downtimed; ganeti1015 [13:54:31] (03CR) 10Jbond: [C: 03+1] spicerack: allow to execute another cookbook [software/spicerack] - 10https://gerrit.wikimedia.org/r/755756 (owner: 10Volans) [13:54:43] PROBLEM - SSH on kubernetes1004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:54:48] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti1014.eqiad.wmnet to ganeti01.svc.eqiad.wmnet [13:54:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:43] (03PS1) 10Marostegui: es1025: Disable notifications 
[puppet] - 10https://gerrit.wikimedia.org/r/757446 (https://phabricator.wikimedia.org/T300006) [13:55:51] (03CR) 10Jbond: [C: 03+1] ":facepalm: lgtm" [software/spicerack] - 10https://gerrit.wikimedia.org/r/757435 (https://phabricator.wikimedia.org/T299123) (owner: 10Volans) [13:56:09] 10SRE, 10ops-codfw: Degraded RAID on restbase2011 - https://phabricator.wikimedia.org/T299871 (10Papaul) 05Open→03Resolved [13:56:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1137 (re)pooling @ 60%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19324 and previous config saved to /var/cache/conftool/dbconfig/20220126-135610-root.json [13:56:11] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti1014.eqiad.wmnet to ganeti01.svc.eqiad.wmnet [13:56:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:30] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Migrate eqiad Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296721 (10MoritzMuehlenhoff) [13:56:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool es1025', diff saved to https://phabricator.wikimedia.org/P19325 and previous config saved to /var/cache/conftool/dbconfig/20220126-135635-marostegui.json [13:56:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org [13:57:14] (03CR) 10Volans: [C: 03+2] redfish: better support of parsing JSON responses [software/spicerack] - 10https://gerrit.wikimedia.org/r/757435 (https://phabricator.wikimedia.org/T299123) (owner: 10Volans) [13:58:45] (03CR) 10Marostegui: [C: 03+2] es1025: 
Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/757446 (https://phabricator.wikimedia.org/T300006) (owner: 10Marostegui) [13:58:50] 10SRE, 10ops-codfw: Test Dell switches cabling - https://phabricator.wikimedia.org/T290133 (10Papaul) [14:01:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org [14:03:26] (03Merged) 10jenkins-bot: redfish: better support of parsing JSON responses [software/spicerack] - 10https://gerrit.wikimedia.org/r/757435 (https://phabricator.wikimedia.org/T299123) (owner: 10Volans) [14:06:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312 (T298559)', diff saved to https://phabricator.wikimedia.org/P19326 and previous config saved to /var/cache/conftool/dbconfig/20220126-140629-marostegui.json [14:06:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:34] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2104.codfw.wmnet with reason: Maintenance [14:06:34] T298559: Fix mismatching field type of querycache_info.qci_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298559 [14:06:35] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2104.codfw.wmnet with reason: Maintenance [14:06:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:37] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 8 hosts with reason: Maintenance [14:06:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:43] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 8 hosts with 
reason: Maintenance [14:06:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:07:06] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1129.eqiad.wmnet with reason: Maintenance [14:07:08] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1129.eqiad.wmnet with reason: Maintenance [14:07:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:07:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:07:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1129 (T298559)', diff saved to https://phabricator.wikimedia.org/P19327 and previous config saved to /var/cache/conftool/dbconfig/20220126-140712-marostegui.json [14:07:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:07:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1164', diff saved to https://phabricator.wikimedia.org/P19328 and previous config saved to /var/cache/conftool/dbconfig/20220126-140751-marostegui.json [14:07:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T298559)', diff saved to https://phabricator.wikimedia.org/P19329 and previous config saved to /var/cache/conftool/dbconfig/20220126-140827-marostegui.json [14:08:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:44] (03CR) 10Ayounsi: [C: 03+1] Add k8s masters in eqiad eBGP config [homer/public] - 10https://gerrit.wikimedia.org/r/757438 (https://phabricator.wikimedia.org/T290967) (owner: 10JMeybohm) [14:11:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1137 (re)pooling @ 75%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19330 and previous config saved to /var/cache/conftool/dbconfig/20220126-141113-root.json 
[14:11:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:49] (03PS3) 10Majavah: P:nftables::basefirewall: drop prometheus port filtering [puppet] - 10https://gerrit.wikimedia.org/r/757394 [14:12:14] (03CR) 10Majavah: P:nftables::basefirewall: drop prometheus port filtering (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/757394 (owner: 10Majavah) [14:16:51] (03PS3) 10Volans: spicerack: allow to execute another cookbook [software/spicerack] - 10https://gerrit.wikimedia.org/r/755756 [14:16:56] (03CR) 10Volans: [C: 03+2] spicerack: allow to execute another cookbook [software/spicerack] - 10https://gerrit.wikimedia.org/r/755756 (owner: 10Volans) [14:18:43] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] P:nftables::basefirewall: drop prometheus port filtering (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/757394 (owner: 10Majavah) [14:21:06] (03PS1) 10Filippo Giunchedi: service catalog: introduce 'page' field [puppet] - 10https://gerrit.wikimedia.org/r/757447 (https://phabricator.wikimedia.org/T291946) [14:21:54] (03PS1) 10Ottomata: Add comments about eventgate stream config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/757448 [14:22:27] (03CR) 10Ottomata: "Ah, only eventgate-analytics-external uses dynamic stream config. 
So, when we change stream configs for any other eventgate, like eventga" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/757122 (owner: 10Ebernhardson) [14:22:55] !log otto@deploy1002 helmfile [staging] START helmfile.d/services/eventgate-analytics: apply on production [14:22:55] !log otto@deploy1002 helmfile [staging] START helmfile.d/services/eventgate-analytics: apply on canary [14:22:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1164 (T285149)', diff saved to https://phabricator.wikimedia.org/P19331 and previous config saved to /var/cache/conftool/dbconfig/20220126-142255-marostegui.json [14:22:57] !log otto@deploy1002 helmfile [staging] DONE helmfile.d/services/eventgate-analytics: apply on production [14:22:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:58] !log otto@deploy1002 helmfile [staging] DONE helmfile.d/services/eventgate-analytics: apply on canary [14:23:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:04] T285149: Schema change for dropping rev_page_id index - https://phabricator.wikimedia.org/T285149 [14:23:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:07] (03Merged) 10jenkins-bot: spicerack: allow to execute another cookbook [software/spicerack] - 10https://gerrit.wikimedia.org/r/755756 (owner: 10Volans) [14:23:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P19332 and previous config saved to /var/cache/conftool/dbconfig/20220126-142332-marostegui.json [14:23:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:59] (03PS2) 10Ottomata: Add comments about eventgate stream config [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/757448 [14:24:39] !log otto@deploy1002 helmfile [staging] START helmfile.d/services/eventgate-analytics: sync on canary [14:24:40] !log otto@deploy1002 helmfile [staging] START helmfile.d/services/eventgate-analytics: sync on production [14:24:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:24:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:25:04] !log otto@deploy1002 helmfile [staging] DONE helmfile.d/services/eventgate-analytics: sync on production [14:25:06] !log otto@deploy1002 helmfile [staging] DONE helmfile.d/services/eventgate-analytics: sync on canary [14:25:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:25:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:26:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1137 (re)pooling @ 100%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19333 and previous config saved to /var/cache/conftool/dbconfig/20220126-142620-root.json [14:26:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:26:45] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (NOOP 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33461/console" [puppet] - 10https://gerrit.wikimedia.org/r/757447 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi) [14:32:21] (03PS1) 10Muehlenhoff: Create a separate puppetboard-idptest.wikimedia.org vhost in idp-staging [puppet] - 10https://gerrit.wikimedia.org/r/757450 [14:32:55] (03CR) 10jerkins-bot: [V: 04-1] Create a separate puppetboard-idptest.wikimedia.org vhost in idp-staging [puppet] - 10https://gerrit.wikimedia.org/r/757450 (owner: 10Muehlenhoff) [14:32:57] (03CR) 10Volans: "@amir @kormat: do you have any comment/feedback on this one?" 
[cookbooks] - 10https://gerrit.wikimedia.org/r/754872 (https://phabricator.wikimedia.org/T239814) (owner: 10Volans) [14:33:33] (03CR) 10Elukey: [C: 03+1] Split profile::kubernetes::master_hosts by DC [puppet] - 10https://gerrit.wikimedia.org/r/757408 (https://phabricator.wikimedia.org/T290967) (owner: 10JMeybohm) [14:33:35] volans: my comment is that amir should comment. ;) [14:34:11] (03CR) 10Elukey: [C: 03+1] Enable overlayfs on kubernetes masters [puppet] - 10https://gerrit.wikimedia.org/r/757433 (https://phabricator.wikimedia.org/T290967) (owner: 10JMeybohm) [14:34:47] (03PS2) 10Muehlenhoff: Create a separate puppetboard-idptest.wikimedia.org vhost in idp-staging [puppet] - 10https://gerrit.wikimedia.org/r/757450 [14:35:23] !log roll restarting eventgate-analytics to pick up stream config change https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/757122 [14:35:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:13] !log otto@deploy1002 helmfile [codfw] START helmfile.d/services/eventgate-analytics: sync on production [14:36:13] !log otto@deploy1002 helmfile [codfw] START helmfile.d/services/eventgate-analytics: sync on canary [14:36:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:39] !log otto@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventgate-analytics: sync on canary [14:36:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:01] (03CR) 10Elukey: [C: 03+2] helmfile.d: move eventgate* to the WMF CA cert bundle [deployment-charts] - 10https://gerrit.wikimedia.org/r/753425 (https://phabricator.wikimedia.org/T296064) (owner: 10Elukey) [14:37:03] !log otto@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventgate-analytics: sync on production [14:37:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:40] !log 
otto@deploy1002 helmfile [eqiad] START helmfile.d/services/eventgate-analytics: sync on production [14:37:41] !log otto@deploy1002 helmfile [eqiad] START helmfile.d/services/eventgate-analytics: sync on canary [14:37:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:43] !log otto@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics: sync on canary [14:37:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:22] !log otto@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics: sync on production [14:38:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P19334 and previous config saved to /var/cache/conftool/dbconfig/20220126-143837-marostegui.json [14:38:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org [14:40:01] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/757450 (owner: 10Muehlenhoff) [14:40:53] (03CR) 10Ottomata: [C: 03+2] Add comments about eventgate stream config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/757448 (owner: 10Ottomata) [14:41:39] !log deploying new CA certs for all eventgate services... 
T296064 [14:41:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:41:44] !log otto@deploy1002 helmfile [staging] START helmfile.d/services/eventgate-logging-external: apply on production [14:41:45] T296064: Move Kafka Jumbo's TLS clients to the new bundle - https://phabricator.wikimedia.org/T296064 [14:41:46] !log otto@deploy1002 helmfile [staging] DONE helmfile.d/services/eventgate-logging-external: apply on canary [14:41:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:41:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:15] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.5323 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [14:42:22] (03CR) 10Alexandros Kosiaris: [C: 04-2] "This is an upstream module, we just import it as is, we don't meddle with it (similarly modules/stdlib is not messed with) as it makes upg" [puppet] - 10https://gerrit.wikimedia.org/r/756698 (https://phabricator.wikimedia.org/T283273) (owner: 10EpicPupper) [14:42:22] !log otto@deploy1002 helmfile [staging] DONE helmfile.d/services/eventgate-logging-external: sync on production [14:42:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:43:08] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [14:43:09] (03PS19) 10Herron: prometheus: add blackbox generic "watchrat" http/s static check support [puppet] - 10https://gerrit.wikimedia.org/r/747550 (https://phabricator.wikimedia.org/T292603) [14:43:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:16] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [14:44:17] !log mwdebug-deploy@deploy1002 helmfile [codfw] START 
helmfile.d/services/mwdebug: apply on pinkunicorn [14:44:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org [14:45:23] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [14:45:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:49] (03PS20) 10Herron: prometheus: add blackbox generic "watchrat" http/s static check support [puppet] - 10https://gerrit.wikimedia.org/r/747550 (https://phabricator.wikimedia.org/T292603) [14:46:43] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC is correctly noop on alert1001, let me know what you think!" [puppet] - 10https://gerrit.wikimedia.org/r/757447 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi) [14:46:55] (LogstashIngestSpike) firing: Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen - https://alerts.wikimedia.org [14:47:52] (03PS2) 10Giuseppe Lavagetto: deployment-prep: install php 7.4 everywhere [puppet] - 10https://gerrit.wikimedia.org/r/755536 (https://phabricator.wikimedia.org/T295578) [14:47:54] (03PS1) 10Giuseppe Lavagetto: thanos::frontend: fix envoy configuration [puppet] - 10https://gerrit.wikimedia.org/r/757452 (https://phabricator.wikimedia.org/T300119) [14:48:05] RECOVERY - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is OK: (C)0.3 gt (W)0.1 gt 0.01613 https://bit.ly/wmf-fpmsat 
https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [14:48:22] (03CR) 10Elukey: [C: 04-1] "Found two typos but the rest looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/757434 (https://phabricator.wikimedia.org/T290967) (owner: 10JMeybohm) [14:48:58] (03CR) 10Herron: prometheus: add blackbox generic "watchrat" http/s static check support (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/747550 (https://phabricator.wikimedia.org/T292603) (owner: 10Herron) [14:49:45] (03PS2) 10Alexandros Kosiaris: ttyS0-115200: Add a comment about this being VM specific [puppet] - 10https://gerrit.wikimedia.org/r/754884 [14:50:08] !log volans@cumin1001 START - Cookbook sre.hosts.provision for host es1025.mgmt.eqiad.wmnet with reboot policy GRACEFUL [14:50:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:20] !log otto@deploy1002 helmfile [codfw] START helmfile.d/services/eventgate-logging-external: apply on production [14:50:20] !log otto@deploy1002 helmfile [codfw] START helmfile.d/services/eventgate-logging-external: apply on canary [14:50:20] 10SRE, 10Data-Engineering, 10observability, 10serviceops: Upgrade Kafka to 2.x - https://phabricator.wikimedia.org/T300102 (10Ottomata) > A better and more stable Kafka Mirror Maker (even if after all the work that Andrew did we have something very stable as well now) This really does look great, and has s... 
[14:50:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:25] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [14:50:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:33] (03PS2) 10Giuseppe Lavagetto: thanos::frontend: fix envoy configuration [puppet] - 10https://gerrit.wikimedia.org/r/757452 (https://phabricator.wikimedia.org/T300119) [14:50:56] !log otto@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventgate-logging-external: sync on canary [14:50:57] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/754884 (owner: 10Alexandros Kosiaris) [14:50:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:31] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [14:51:32] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [14:51:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:41] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33462/console" [puppet] - 10https://gerrit.wikimedia.org/r/757452 (https://phabricator.wikimedia.org/T300119) (owner: 10Giuseppe Lavagetto) [14:51:55] (LogstashIngestSpike) resolved: Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen - https://alerts.wikimedia.org [14:52:16] !log joal@deploy1002 Started deploy [analytics/refinery@ab7f732]: Regular analytics weekly train [analytics/refinery@ab7f732] [14:52:18] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:35] !log otto@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventgate-logging-external: sync on production [14:52:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:38] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [14:52:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:31] !log otto@deploy1002 helmfile [eqiad] START helmfile.d/services/eventgate-logging-external: apply on production [14:53:31] !log otto@deploy1002 helmfile [eqiad] START helmfile.d/services/eventgate-logging-external: apply on canary [14:53:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T298559)', diff saved to https://phabricator.wikimedia.org/P19335 and previous config saved to /var/cache/conftool/dbconfig/20220126-145342-marostegui.json [14:53:43] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1162.eqiad.wmnet with reason: Maintenance [14:53:45] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1162.eqiad.wmnet with reason: Maintenance [14:53:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:47] T298559: Fix mismatching field type of querycache_info.qci_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298559 [14:53:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1162 (T298559)', diff saved to https://phabricator.wikimedia.org/P19336 and previous config saved to /var/cache/conftool/dbconfig/20220126-145349-marostegui.json [14:53:52] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:15] !log otto@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventgate-logging-external: sync on canary [14:54:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:39] !log otto@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventgate-logging-external: sync on production [14:54:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:52] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti1005.eqiad.wmnet to ganeti01.svc.eqiad.wmnet [14:54:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162 (T298559)', diff saved to https://phabricator.wikimedia.org/P19337 and previous config saved to /var/cache/conftool/dbconfig/20220126-145505-marostegui.json [14:55:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:28] (03CR) 10Alexandros Kosiaris: [C: 03+2] "Thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/754884 (owner: 10Alexandros Kosiaris) [14:55:32] !log otto@deploy1002 helmfile [staging] START helmfile.d/services/eventgate-analytics-external: apply on production [14:55:34] !log otto@deploy1002 helmfile [staging] DONE helmfile.d/services/eventgate-analytics-external: apply on canary [14:55:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:57] !log otto@deploy1002 helmfile [staging] DONE helmfile.d/services/eventgate-analytics-external: sync on production [14:55:58] !log elukey@cp4035:~$ sudo systemctl restart varnishkafka-webrequest.service - metrics showing messages stuck for a poll() [14:55:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:02] PROBLEM - Postgres Replication Lag on maps1008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 350191765584 and 13837 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [14:56:27] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Migrate eqiad Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296721 (10MoritzMuehlenhoff) [14:56:29] !log otto@deploy1002 helmfile [codfw] START helmfile.d/services/eventgate-analytics-external: apply on canary [14:56:29] !log otto@deploy1002 helmfile [codfw] START helmfile.d/services/eventgate-analytics-external: apply on production [14:56:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:53] !log otto@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventgate-analytics-external: sync on canary [14:56:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:56] !log jmm@cumin2002 END (PASS) - 
Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti1005.eqiad.wmnet to ganeti01.svc.eqiad.wmnet [14:56:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:51] !log otto@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventgate-analytics-external: sync on production [14:57:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:58:00] !log otto@deploy1002 helmfile [eqiad] START helmfile.d/services/eventgate-analytics-external: apply on production [14:58:00] !log otto@deploy1002 helmfile [eqiad] START helmfile.d/services/eventgate-analytics-external: apply on canary [14:58:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:58:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:58:23] !log otto@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics-external: sync on canary [14:58:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:58:58] 10SRE, 10Patch-For-Review, 10SRE Observability (FY2021/2022-Q3): Decommission old ELK5 Logstash cluster - https://phabricator.wikimedia.org/T281266 (10herron) >>! In T281266#7634510, @herron wrote: >>>! In T281266#7634280, @gerritbot wrote: >> Change 755480 had a related patch set uploaded (by Herron; author... 
[15:00:04] !log otto@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics-external: sync on production [15:00:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:19] !log volans@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host es1025.mgmt.eqiad.wmnet with reboot policy GRACEFUL [15:03:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:05:32] PROBLEM - Postgres Replication Lag on maps1008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 366712625288 and 14407 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [15:06:09] !log elukey@cp4035:~$ sudo systemctl restart varnishkafka-eventlogging.service - metrics showing messages stuck for a poll() [15:06:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:08:00] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host es1025.eqiad.wmnet with OS bullseye [15:08:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:08:54] !log joal@deploy1002 Finished deploy [analytics/refinery@ab7f732]: Regular analytics weekly train [analytics/refinery@ab7f732] (duration: 16m 38s) [15:08:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162', diff saved to https://phabricator.wikimedia.org/P19338 and previous config saved to /var/cache/conftool/dbconfig/20220126-151009-marostegui.json [15:10:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:15] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti1006.eqiad.wmnet to ganeti01.svc.eqiad.wmnet [15:10:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:11:39] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host 
ganeti1006.eqiad.wmnet to ganeti01.svc.eqiad.wmnet [15:11:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:08] (03PS9) 10Jbond: WIP: add reposync [software/spicerack] - 10https://gerrit.wikimedia.org/r/747116 [15:12:22] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Migrate eqiad Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296721 (10MoritzMuehlenhoff) [15:12:26] PROBLEM - Postgres Replication Lag on maps1008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 384001107520 and 14820 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [15:14:15] !log joal@deploy1002 Started deploy [analytics/refinery@ab7f732] (thin): Regular analytics weekly train THIN [analytics/refinery@ab7f732] [15:14:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:14:22] !log joal@deploy1002 Finished deploy [analytics/refinery@ab7f732] (thin): Regular analytics weekly train THIN [analytics/refinery@ab7f732] (duration: 00m 07s) [15:14:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:31] 10SRE, 10ops-codfw, 10ops-eqiad: Installation issues on PowerEdge R440 Restbase servers with buster / firmware update needed - https://phabricator.wikimedia.org/T299652 (10hnowlan) >>! In T299652#7650156, @Cmjohnson wrote: > lets go with restbase1019 @hnowlan Sounds good - let me know whenever suits and I c... 
[15:17:39] (03CR) 10jerkins-bot: [V: 04-1] WIP: add reposync [software/spicerack] - 10https://gerrit.wikimedia.org/r/747116 (owner: 10Jbond) [15:19:15] (03CR) 10Jbond: WIP: add reposync (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/747116 (owner: 10Jbond) [15:20:03] !log joal@deploy1002 Started deploy [analytics/refinery@ab7f732] (hadoop-test): Regular analytics weekly train HADOOP-TEST [analytics/refinery@ab7f732] [15:20:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:24:45] !log paused (for meetings) in deploying new CA certs for all eventgate services, still TODO: eventgate-analytics-external, eventgate-main - T296064 [15:24:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:24:54] T296064: Move Kafka Jumbo's TLS clients to the new bundle - https://phabricator.wikimedia.org/T296064 [15:25:13] (IcingaOverload) firing: Checks are taking long to execute on alert2001:9245 - https://grafana.wikimedia.org/d/rsCfQfuZz/icinga - https://alerts.wikimedia.org [15:25:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162', diff saved to https://phabricator.wikimedia.org/P19340 and previous config saved to /var/cache/conftool/dbconfig/20220126-152514-marostegui.json [15:25:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:25:34] !log joal@deploy1002 Finished deploy [analytics/refinery@ab7f732] (hadoop-test): Regular analytics weekly train HADOOP-TEST [analytics/refinery@ab7f732] (duration: 05m 30s) [15:25:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:25:38] (03CR) 10Kormat: [C: 03+1] mariadb: Promote db1128 to m1 master [puppet] - 10https://gerrit.wikimedia.org/r/757389 (https://phabricator.wikimedia.org/T299624) (owner: 10Marostegui) [15:25:46] PROBLEM - Postgres Replication Lag on maps1008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 404005563952 
and 15622 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [15:28:56] PROBLEM - Postgres Replication Lag on maps1008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 412252211544 and 15812 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [15:29:18] !log add pay-lvs1003/4 to pfw3-eqiad BGP [15:29:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:30:13] (IcingaOverload) resolved: Checks are taking long to execute on alert2001:9245 - https://grafana.wikimedia.org/d/rsCfQfuZz/icinga - https://alerts.wikimedia.org [15:35:12] 10SRE, 10SRE-Access-Requests: Requesting Google Search Console Access for a Service Account - https://phabricator.wikimedia.org/T300004 (10jhathaway) @SCherukuwada we ready to grant access, once @dr0ptp4kt approves [15:36:08] PROBLEM - BGP status on pfw3-eqiad is CRITICAL: BGP CRITICAL - AS64600/IPv4: Connect - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:37:00] 10SRE, 10SRE-Access-Requests: NRodriguez uses the same SSH key(s) in WMCS and production - https://phabricator.wikimedia.org/T299336 (10jhathaway) a:05Jelto→03jhathaway [15:37:01] (03PS1) 10Marostegui: Revert "es1025: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/757144 [15:37:53] (03CR) 10Marostegui: [C: 03+2] Revert "es1025: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/757144 (owner: 10Marostegui) [15:38:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1025 (re)pooling @ 1%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19341 and previous config saved to /var/cache/conftool/dbconfig/20220126-153831-root.json [15:38:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:38:35] expected ^ [15:40:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162 (T298559)', diff saved to https://phabricator.wikimedia.org/P19342 and 
previous config saved to /var/cache/conftool/dbconfig/20220126-154019-marostegui.json [15:40:21] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1182.eqiad.wmnet with reason: Maintenance [15:40:22] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1182.eqiad.wmnet with reason: Maintenance [15:40:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:40:25] T298559: Fix mismatching field type of querycache_info.qci_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298559 [15:40:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:40:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1182 (T298559)', diff saved to https://phabricator.wikimedia.org/P19343 and previous config saved to /var/cache/conftool/dbconfig/20220126-154026-marostegui.json [15:40:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:40:33] !log depool cp4035 [15:40:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:40:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:40:38] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host es1025.eqiad.wmnet with OS bullseye [15:40:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:42:03] !log restarting varnish-frontend on cp4035 [15:42:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:42:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T298559)', diff saved to https://phabricator.wikimedia.org/P19344 and previous config saved to /var/cache/conftool/dbconfig/20220126-154242-marostegui.json [15:42:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:44:12] 10SRE, 10SRE-Access-Requests: Requesting Google Search Console Access for a Service 
Account - https://phabricator.wikimedia.org/T300004 (10SCherukuwada) @dr0ptp4kt is aware of the request; he's just somewhat swamped for now. FWIW, I'm not in any sort of hurry. :-) [15:44:47] (03PS1) 10Thiemo Kreuz (WMDE): Don't wrap unknown actions with confirmation [extensions/VisualEditor] (wmf/1.38.0-wmf.19) - 10https://gerrit.wikimedia.org/r/757145 (https://phabricator.wikimedia.org/T300095) [15:45:57] (03CR) 10Thiemo Kreuz (WMDE): [C: 03+1] Don't wrap unknown actions with confirmation [extensions/VisualEditor] (wmf/1.38.0-wmf.19) - 10https://gerrit.wikimedia.org/r/757145 (https://phabricator.wikimedia.org/T300095) (owner: 10Thiemo Kreuz (WMDE)) [15:47:59] !log pool cp4035 [15:48:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:50:17] (03PS2) 10JMeybohm: Upgrade codfw kubernetes masters to tainted full nodes [puppet] - 10https://gerrit.wikimedia.org/r/757434 (https://phabricator.wikimedia.org/T290967) [15:50:48] RECOVERY - BGP status on pfw3-eqiad is OK: BGP OK - up: 5, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:50:55] (03CR) 10JMeybohm: Upgrade codfw kubernetes masters to tainted full nodes (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/757434 (https://phabricator.wikimedia.org/T290967) (owner: 10JMeybohm) [15:51:45] (03CR) 10JMeybohm: [V: 03+1 C: 03+2] Enable IPv6DualStack for kubelet on staging masters [puppet] - 10https://gerrit.wikimedia.org/r/757407 (https://phabricator.wikimedia.org/T290967) (owner: 10JMeybohm) [15:51:50] (03CR) 10JMeybohm: [V: 03+1 C: 03+2] Split profile::kubernetes::master_hosts by DC [puppet] - 10https://gerrit.wikimedia.org/r/757408 (https://phabricator.wikimedia.org/T290967) (owner: 10JMeybohm) [15:51:54] (03CR) 10JMeybohm: [C: 03+2] Enable overlayfs on kubernetes masters [puppet] - 10https://gerrit.wikimedia.org/r/757433 (https://phabricator.wikimedia.org/T290967) (owner: 10JMeybohm) [15:53:35] !log marostegui@cumin1001 dbctl commit 
(dc=all): 'es1025 (re)pooling @ 5%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19345 and previous config saved to /var/cache/conftool/dbconfig/20220126-155334-root.json [15:53:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:53:50] (03CR) 10Cwhite: [V: 03+2 C: 03+2] logstash: add docker support in the Makefile (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/757042 (https://phabricator.wikimedia.org/T300051) (owner: 10Elukey) [15:54:01] !log upgrading varnishkafka to version 1.1.0 on cp[6002,6005,6009-6013].drmrs.wmnet,cp1087.eqiad.wmnet,cp[4021,4033-4034,4036].ulsfo.wmnet [15:54:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:54:10] (03PS1) 10Jgreen: nsca_frack.cfg.erb add hosts pay-lvs100[34], remove pay-lvs100[12] [puppet] - 10https://gerrit.wikimedia.org/r/757457 (https://phabricator.wikimedia.org/T147932) [15:54:38] RECOVERY - SSH on kubernetes1004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:55:00] (03PS4) 10AOkoth: kuberenetes: disable mwautopull timer [puppet] - 10https://gerrit.wikimedia.org/r/754960 (https://phabricator.wikimedia.org/T284628) [15:55:42] 10SRE, 10Traffic: Create Ganeti VMs for Wikidough in drmrs - https://phabricator.wikimedia.org/T300156 (10ssingh) [15:55:44] (03CR) 10AOkoth: kuberenetes: disable mwautopull timer (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/754960 (https://phabricator.wikimedia.org/T284628) (owner: 10AOkoth) [15:56:16] (03CR) 10JMeybohm: [C: 03+1] kuberenetes: disable mwautopull timer [puppet] - 10https://gerrit.wikimedia.org/r/754960 (https://phabricator.wikimedia.org/T284628) (owner: 10AOkoth) [15:57:37] (03CR) 10Jgreen: [C: 03+2] nsca_frack.cfg.erb add hosts pay-lvs100[34], remove pay-lvs100[12] [puppet] - 10https://gerrit.wikimedia.org/r/757457 (https://phabricator.wikimedia.org/T147932) (owner: 10Jgreen) 
[15:57:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P19346 and previous config saved to /var/cache/conftool/dbconfig/20220126-155747-marostegui.json [15:57:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:01:14] 10SRE, 10Traffic: Create Ganeti VMs for durum in drmrs - https://phabricator.wikimedia.org/T300158 (10ssingh) [16:02:21] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, 10netops: Q2:(Need By: TBD) Rows E/F network racking task - https://phabricator.wikimedia.org/T292095 (10Cmjohnson) @cmooney I went through all the cabling and confirmed the correct patches. the connections at the demarc are pretty foolproof wit... [16:05:17] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, 10netops: Q2:(Need By: TBD) Rows E/F network racking task - https://phabricator.wikimedia.org/T292095 (10Cmjohnson) @cmooney it appears to be disabled cmjohnson@re0.cr1-eqiad> show interfaces descriptions Interface Admin Link Description... [16:06:42] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1047.eqiad.wmnet with OS bullseye [16:06:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:06:48] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudvirt1047.eqiad.wmnet - https://phabricator.wikimedia.org/T293391 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudvirt1047.eqiad.wmnet with O... 
[16:07:45] (Device rebooted) firing: Device rebooted - https://alerts.wikimedia.org [16:08:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1025 (re)pooling @ 10%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19347 and previous config saved to /var/cache/conftool/dbconfig/20220126-160838-root.json [16:08:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:12:45] (Device rebooted) resolved: Device rebooted - https://alerts.wikimedia.org [16:12:52] (03CR) 10Filippo Giunchedi: "LGTM overall, my understanding is that we're essentially dropping SNI support (?) we'd need to validate that all clients are indeed sendin" [puppet] - 10https://gerrit.wikimedia.org/r/757452 (https://phabricator.wikimedia.org/T300119) (owner: 10Giuseppe Lavagetto) [16:12:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P19348 and previous config saved to /var/cache/conftool/dbconfig/20220126-161252-marostegui.json [16:12:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:13:03] (03CR) 10Scardenasmolinar: [C: 03+1] "LGTM!" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/756985 (https://phabricator.wikimedia.org/T299913) (owner: 10Eigyan) [16:14:16] (03CR) 10Filippo Giunchedi: [C: 03+1] "Forgot to flag another variable to remove, but LGTM nevertheless" [puppet] - 10https://gerrit.wikimedia.org/r/747550 (https://phabricator.wikimedia.org/T292603) (owner: 10Herron) [16:17:56] !log cmjohnson@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudvirt1047.eqiad.wmnet with OS bullseye [16:17:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:02] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudvirt1047.eqiad.wmnet - https://phabricator.wikimedia.org/T293391 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudvirt1047.eqiad.wmnet with OS bu... [16:18:37] PROBLEM - Postgres Replication Lag on maps1008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 453009151984 and 18792 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [16:23:15] !log restart varnishkafka instances on cp1087 [16:23:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:23:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1025 (re)pooling @ 20%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19349 and previous config saved to /var/cache/conftool/dbconfig/20220126-162342-root.json [16:23:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:26:03] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [16:26:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:26:57] PROBLEM - Postgres Replication Lag on maps1008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 453009151984 and 19293 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring 
[16:27:20] (03CR) 10Jbond: "lgtm couple of minor nits" [puppet] - 10https://gerrit.wikimedia.org/r/757450 (owner: 10Muehlenhoff) [16:27:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T298559)', diff saved to https://phabricator.wikimedia.org/P19350 and previous config saved to /var/cache/conftool/dbconfig/20220126-162756-marostegui.json [16:27:59] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1156.eqiad.wmnet with reason: Maintenance [16:28:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:28:01] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1156.eqiad.wmnet with reason: Maintenance [16:28:02] T298559: Fix mismatching field type of querycache_info.qci_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298559 [16:28:02] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [16:28:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:28:06] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [16:28:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:28:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:28:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1156 (T298559)', diff saved to https://phabricator.wikimedia.org/P19351 and previous config saved to /var/cache/conftool/dbconfig/20220126-162810-marostegui.json [16:28:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:28:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:30:46] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) 
[16:30:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:32:19] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 238, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:32:19] (03CR) 10Ladsgroup: exim: add the ability to silently drop senders (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/748884 (https://phabricator.wikimedia.org/T298038) (owner: 10JHathaway) [16:32:46] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, 10netops: Q2:(Need By: TBD) Rows E/F network racking task - https://phabricator.wikimedia.org/T292095 (10cmooney) @Cmjohnson thanks. The interfaces on the CR are down by default. Not sure if you changed anything but there is no improvement rig... [16:33:35] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on restbase1019.eqiad.wmnet with reason: Firmware upgrade [16:33:37] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on restbase1019.eqiad.wmnet with reason: Firmware upgrade [16:33:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:33:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:34:08] !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: name=restbase1019.eqiad.wmnet [16:34:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:36:19] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 233, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:37:46] (Device rebooted) firing: Device rebooted - https://alerts.wikimedia.org [16:38:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1025 (re)pooling @ 25%: repooling after reimage', diff saved to 
https://phabricator.wikimedia.org/P19352 and previous config saved to /var/cache/conftool/dbconfig/20220126-163845-root.json [16:38:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:42:46] (Device rebooted) resolved: Device rebooted - https://alerts.wikimedia.org [16:43:43] (03PS4) 10JMeybohm: Make a bundle signer return it's root CA [software/cfssl-issuer] - 10https://gerrit.wikimedia.org/r/756546 (https://phabricator.wikimedia.org/T299906) [16:44:14] (03CR) 10JMeybohm: [V: 03+2 C: 03+2] Make a bundle signer return it's root CA [software/cfssl-issuer] - 10https://gerrit.wikimedia.org/r/756546 (https://phabricator.wikimedia.org/T299906) (owner: 10JMeybohm) [16:44:26] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, 10netops: Q2:(Need By: TBD) Rows E/F network racking task - https://phabricator.wikimedia.org/T292095 (10Cmjohnson) @cmooney, I have a light meter and I see light from lsw-f and lsw-e to the demarc, and then I see light to cr1 and cr2 from old c... 
[16:44:28] (03CR) 10JMeybohm: [V: 03+2 C: 03+2] Add ca to multirootca.conf in simple-cfssl [software/cfssl-issuer] - 10https://gerrit.wikimedia.org/r/756616 (https://phabricator.wikimedia.org/T299906) (owner: 10JMeybohm) [16:47:26] (03PS1) 10JMeybohm: cfssl-issuer: Update to v0.2.1 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/757462 (https://phabricator.wikimedia.org/T299906) [16:47:44] 10SRE, 10ops-eqiad, 10DC-Ops, 10decommission-hardware: decommission sodium.wikimedia.org - https://phabricator.wikimedia.org/T299785 (10Cmjohnson) [16:47:51] 10SRE, 10ops-eqiad, 10DC-Ops, 10decommission-hardware: decommission sodium.wikimedia.org - https://phabricator.wikimedia.org/T299785 (10Cmjohnson) 05Open→03Resolved [16:47:51] !log draining instances off ganeti1007 for reimage [16:47:53] 10SRE, 10Infrastructure-Foundations: decom sodium - https://phabricator.wikimedia.org/T298727 (10Cmjohnson) [16:47:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:48:15] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, 10netops: Q2:(Need By: TBD) Rows E/F network racking task - https://phabricator.wikimedia.org/T292095 (10cmooney) @Cmjohnson thanks ok. yeah it is odd. All the switch->switch links have come up ok (using the same CWDM4 optics), so it'd be unus... 
[16:48:15] (03CR) 10JMeybohm: [V: 03+2 C: 03+2] cfssl-issuer: Update to v0.2.1 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/757462 (https://phabricator.wikimedia.org/T299906) (owner: 10JMeybohm) [16:51:23] !log [WCQS Deploy] Restarted updaters across fleet: `ryankemper@cumin1001:~$ sudo cumin -b 6 'wcqs*' 'sudo systemctl restart wcqs-updater'` [16:51:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:51:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T298559)', diff saved to https://phabricator.wikimedia.org/P19353 and previous config saved to /var/cache/conftool/dbconfig/20220126-165130-marostegui.json [16:51:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:51:35] T298559: Fix mismatching field type of querycache_info.qci_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298559 [16:53:12] !log published image docker-registry.discovery.wmnet/cfssl-issuer:0.2.1-1 - T299906 [16:53:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:53:17] T299906: Extend cfssl-issuer to return the Root CA certificate - https://phabricator.wikimedia.org/T299906 [16:53:22] elukey: ^ [16:53:43] (03CR) 10Muehlenhoff: aptrepo: add an elastic68 component (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/757046 (https://phabricator.wikimedia.org/T295666) (owner: 10DCausse) [16:53:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1025 (re)pooling @ 40%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19354 and previous config saved to /var/cache/conftool/dbconfig/20220126-165349-root.json [16:53:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:54:09] jayme: \o/ [16:54:15] PROBLEM - Postgres Replication Lag on maps1008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 453009151984 and 20931 seconds 
https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [16:54:36] (03PS10) 10Jbond: RepoSync: add new class to mana syncing repositories [software/spicerack] - 10https://gerrit.wikimedia.org/r/747116 [16:55:11] elukey: lmk if you run into any issues when testing. You'll need to force a refresh of existing certs, though [16:55:37] jayme: do I need to force the new docker image in helmfiles? [16:55:47] yesyes, sure [16:56:05] ok sending a patch :) [16:56:09] you need to overwrite the version in values [16:56:13] yep yep [16:56:38] if you figure out a nice way to force-refresh certs, that would be a good addition to https://wikitech.wikimedia.org/wiki/Kubernetes/cert-manager I guess [16:57:12] 10ops-eqiad, 10decommission-hardware: decommission pay-lvs1001.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T300165 (10Jgreen) [16:58:33] 10ops-eqiad, 10decommission-hardware: decommission pay-lvs1002.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T300168 (10Jgreen) [16:59:33] 10SRE, 10SRE-Access-Requests: Requesting Google Search Console Access for a Service Account - https://phabricator.wikimedia.org/T300004 (10dr0ptp4kt) Approved, conditioned on the data remaining inaccessible except for those with NDA, need to know, and suitable strong authentication requirements where identitie... 
[17:00:41] (03CR) 10jerkins-bot: [V: 04-1] RepoSync: add new class to mana syncing repositories [software/spicerack] - 10https://gerrit.wikimedia.org/r/747116 (owner: 10Jbond) [17:00:57] PROBLEM - Postgres Replication Lag on maps1008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 453009151984 and 21333 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [17:06:12] (03PS1) 10Elukey: helmfile.d: set new cfssl-issuer version [deployment-charts] - 10https://gerrit.wikimedia.org/r/757465 (https://phabricator.wikimedia.org/T299906) [17:06:21] jayme: --^ [17:06:36] 👀 [17:06:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P19355 and previous config saved to /var/cache/conftool/dbconfig/20220126-170635-marostegui.json [17:06:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:08:47] 10SRE, 10ops-codfw, 10ops-eqiad: Installation issues on PowerEdge R440 Restbase servers with buster / firmware update needed - https://phabricator.wikimedia.org/T299652 (10Cmjohnson) @hnowlan the BIOS and network firmware have been updated on restbase1019. The current idrac is too old to update, my oldest v... 
[17:08:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1025 (re)pooling @ 50%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19356 and previous config saved to /var/cache/conftool/dbconfig/20220126-170852-root.json [17:08:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:09:53] (03CR) 10Jcrespo: "I don't have any opinion on this patch, moving myself to CC- whatever best method exists, I will just use it :-)" [puppet] - 10https://gerrit.wikimedia.org/r/748884 (https://phabricator.wikimedia.org/T298038) (owner: 10JHathaway) [17:09:55] PROBLEM - Postgres Replication Lag on maps1008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 453009151984 and 21870 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [17:11:08] maps postgres spam is somewhat expected, apologies for the noise [17:11:26] elukey: hmm...the CI has not detected a diff [17:11:54] (03PS2) 10Dzahn: delete bugzilla_static after it moved from puppet to k8s [puppet] - 10https://gerrit.wikimedia.org/r/755761 (https://phabricator.wikimedia.org/T281538) [17:14:07] in theory we have [17:14:08] image: "{{ .Values.image.repository }}/{{ .Values.image.name }}:{{ .Values.image.tag | default .Chart.AppVersion }}" [17:15:42] I agree with that theory :) [17:16:35] PROBLEM - Postgres Replication Lag on maps1008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 453009786088 and 22270 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [17:19:01] elukey: I do see a diff when running rake locally [17:19:29] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 239, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:20:14] jayme: very weird :D [17:21:11] (03PS1) 10Brennen Bearnes: Fix empty div when there's no sitenotice. 
[core] (wmf/1.38.0-wmf.19) - 10https://gerrit.wikimedia.org/r/757467 (https://phabricator.wikimedia.org/T300096) [17:21:14] elukey: well..not with your change. But with "tag: blabla" [17:21:27] !log hnowlan@cumin1001 START - Cookbook sre.hosts.reimage for host restbase1019.eqiad.wmnet with OS buster [17:21:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:21:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P19357 and previous config saved to /var/cache/conftool/dbconfig/20220126-172141-marostegui.json [17:21:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:22:39] jayme: .... [17:23:01] (03PS2) 10Elukey: helmfile.d: set new cfssl-issuer version [deployment-charts] - 10https://gerrit.wikimedia.org/r/757465 (https://phabricator.wikimedia.org/T299906) [17:23:34] elukey: I *think* the leading zero makes "default" believe it's empty [17:23:42] yeah..that does not work as well :) [17:23:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1025 (re)pooling @ 60%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19359 and previous config saved to /var/cache/conftool/dbconfig/20220126-172358-root.json [17:24:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:24:15] (03PS1) 10MSantos: maps: re-enable OSM sync for eqiad [puppet] - 10https://gerrit.wikimedia.org/r/757487 (https://phabricator.wikimedia.org/T299049) [17:25:20] hnowlan: so double quotes? 
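Note on the template quoted at 17:14:08: it relies on the `default` template function, which (per the Sprig documentation) falls back only when the piped value is "empty", meaning nil, false, 0, an empty string, or an empty list/dict. The Python below is a rough model of that rule for illustration only, not Helm's implementation; `is_empty`, `sprig_default`, and `render_image` are hypothetical names, and the registry host in the usage lines is made up. Under this model, a tag that reaches the template as a non-empty string (for example "0.2.1-1") is kept as-is, leading zero included.

```python
def is_empty(value):
    """Model of Sprig's notion of 'empty': None, False, 0, "", [] and {}."""
    if value is None:
        return True
    if isinstance(value, bool):
        return value is False
    if isinstance(value, (int, float)):
        return value == 0
    if isinstance(value, (str, list, tuple, dict)):
        return len(value) == 0
    return False


def sprig_default(fallback, given):
    """Mimic `given | default fallback`: use fallback only if given is empty."""
    return fallback if is_empty(given) else given


def render_image(repository, name, tag, app_version):
    """Mimic the image line quoted in the log, with app_version as the fallback."""
    return f"{repository}/{name}:{sprig_default(app_version, tag)}"


# Hypothetical values: an explicit tag wins, an empty tag falls back.
print(render_image("registry.example", "cfssl-issuer", "0.2.1-1", "0.2.0-1"))
print(render_image("registry.example", "cfssl-issuer", "", "0.2.0-1"))
```

In this model, setting a tag identical to the fallback renders the same image string as leaving it unset, so a rendered-manifest diff would show no change in that case.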
[17:25:21] checking [17:25:43] err sorry jayme --^ [17:25:44] :) [17:27:39] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 234, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:28:09] elukey: will not work as well [17:28:53] ah..dammit :D [17:29:14] jouncebot now [17:29:14] No deployments scheduled for the next 1 hour(s) and 30 minute(s) [17:29:29] I should really leave the desk... 0.2.0-1 is the version we're currently running elukey [17:29:33] (03CR) 10Hnowlan: [C: 03+2] maps: re-enable OSM sync for eqiad [puppet] - 10https://gerrit.wikimedia.org/r/757487 (https://phabricator.wikimedia.org/T299049) (owner: 10MSantos) [17:30:17] all clear to deploy a backport for a train blocker? [17:30:17] 10SRE, 10ops-codfw: Test Dell switches cabling - https://phabricator.wikimedia.org/T290133 (10Papaul) [17:30:20] (03CR) 10JMeybohm: [C: 04-1] helmfile.d: set new cfssl-issuer version (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/757465 (https://phabricator.wikimedia.org/T299906) (owner: 10Elukey) [17:30:46] 10SRE, 10ops-codfw: Test Dell switches testing - https://phabricator.wikimedia.org/T290133 (10Papaul) [17:31:04] 10SRE, 10ops-codfw: Dell switches testing - https://phabricator.wikimedia.org/T290133 (10Papaul) [17:31:44] (03PS1) 10JHathaway: NRodriguez: add new production ssh key [puppet] - 10https://gerrit.wikimedia.org/r/757488 (https://phabricator.wikimedia.org/T299336) [17:32:21] jayme: ack sorry I see now, 0.2.0-1-20220123 is the right one [17:32:23] thanks [17:32:24] (03CR) 10JHathaway: "kindly review" [puppet] - 10https://gerrit.wikimedia.org/r/757488 (https://phabricator.wikimedia.org/T299336) (owner: 10JHathaway) [17:32:31] RECOVERY - Check systemd state on maps1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:32:45] (03PS21) 10Herron: prometheus: 
add blackbox generic "watchrat" http/s static check support [puppet] - 10https://gerrit.wikimedia.org/r/747550 (https://phabricator.wikimedia.org/T292603) [17:33:40] elukey: or just 0.2.1-1 [17:34:09] jayme: but I don't see it in https://docker-registry.wikimedia.org/cfssl-issuer/tags/ [17:34:09] 0.2.0-1-20220123 is the old one as well :) [17:34:38] elukey: 0.2.1-1 is there [17:35:08] not for me, maybe I have a chached version [17:35:16] anyway, changing the version, thanks :) [17:35:43] (03PS3) 10Elukey: helmfile.d: set new cfssl-issuer version [deployment-charts] - 10https://gerrit.wikimedia.org/r/757465 (https://phabricator.wikimedia.org/T299906) [17:36:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T298559)', diff saved to https://phabricator.wikimedia.org/P19360 and previous config saved to /var/cache/conftool/dbconfig/20220126-173647-marostegui.json [17:36:49] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance [17:36:50] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance [17:36:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:36:53] T298559: Fix mismatching field type of querycache_info.qci_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298559 [17:36:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:36:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1170:3312 (T298559)', diff saved to https://phabricator.wikimedia.org/P19361 and previous config saved to /var/cache/conftool/dbconfig/20220126-173654-marostegui.json [17:36:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:37:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:37:58] 10SRE, 10SRE-Access-Requests: Requesting update to SSH key and Kerberos 
for Joseph Seddon - https://phabricator.wikimedia.org/T299988 (10jhathaway) [17:38:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312 (T298559)', diff saved to https://phabricator.wikimedia.org/P19363 and previous config saved to /var/cache/conftool/dbconfig/20220126-173810-marostegui.json [17:38:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:39:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1025 (re)pooling @ 75%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19364 and previous config saved to /var/cache/conftool/dbconfig/20220126-173901-root.json [17:39:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:39:27] (03CR) 10Herron: [C: 03+2] prometheus: add blackbox generic "watchrat" http/s static check support [puppet] - 10https://gerrit.wikimedia.org/r/747550 (https://phabricator.wikimedia.org/T292603) (owner: 10Herron) [17:39:47] (03PS1) 10Volans: [WIP] team-sre: add hardware-related checks [alerts] - 10https://gerrit.wikimedia.org/r/757489 (https://phabricator.wikimedia.org/T294564) [17:40:00] (03PS11) 10Jbond: RepoSync: add new class to mana syncing repositories [software/spicerack] - 10https://gerrit.wikimedia.org/r/747116 [17:40:34] (03CR) 10Herron: [C: 03+2] prometheus: add blackbox generic "watchrat" http/s static check support (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/747550 (https://phabricator.wikimedia.org/T292603) (owner: 10Herron) [17:43:49] (03CR) 10JMeybohm: [C: 03+1] helmfile.d: set new cfssl-issuer version [deployment-charts] - 10https://gerrit.wikimedia.org/r/757465 (https://phabricator.wikimedia.org/T299906) (owner: 10Elukey) [17:46:35] (03CR) 10jerkins-bot: [V: 04-1] RepoSync: add new class to mana syncing repositories [software/spicerack] - 10https://gerrit.wikimedia.org/r/747116 (owner: 10Jbond) [17:49:19] (03CR) 10Elukey: [C: 03+2] helmfile.d: set new cfssl-issuer version 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/757465 (https://phabricator.wikimedia.org/T299906) (owner: 10Elukey) [17:49:59] 10SRE, 10ops-codfw, 10ops-eqiad: Installation issues on PowerEdge R440 Restbase servers with buster / firmware update needed - https://phabricator.wikimedia.org/T299652 (10hnowlan) [17:50:31] PROBLEM - Check systemd state on maps1009 is CRITICAL: CRITICAL - degraded: The following units failed: send_tile_invalidations.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:51:23] 10SRE, 10SRE-Access-Requests: Requesting update to SSH key and Kerberos for Joseph Seddon - https://phabricator.wikimedia.org/T299988 (10jhathaway) [17:52:40] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [17:52:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:52:53] (Device rebooted) firing: Device rebooted - https://alerts.wikimedia.org [17:53:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312', diff saved to https://phabricator.wikimedia.org/P19365 and previous config saved to /var/cache/conftool/dbconfig/20220126-175315-marostegui.json [17:53:17] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. 
[17:53:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:53:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:54:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1025 (re)pooling @ 100%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19366 and previous config saved to /var/cache/conftool/dbconfig/20220126-175405-root.json [17:54:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:54:14] 10SRE, 10Scap, 10Release-Engineering-Team (Onboarding 🚀), 10Sustainability (Incident Followup), 10User-brennen: scap's logstash_checker.py is blissfully unaware of any logstash indexing latency - https://phabricator.wikimedia.org/T255197 (10thcipriani) [17:57:53] (Device rebooted) resolved: Device rebooted - https://alerts.wikimedia.org [17:59:03] 10SRE, 10Scap, 10Release-Engineering-Team (Onboarding 🚀): Remove trusty-specific hacks from logstash_checker.py - https://phabricator.wikimedia.org/T216380 (10thcipriani) [17:59:18] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [17:59:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:59:25] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [17:59:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:00:44] (03CR) 10Volans: [C: 03+1] "LGTM if the key was verified by Jesse" [puppet] - 10https://gerrit.wikimedia.org/r/757488 (https://phabricator.wikimedia.org/T299336) (owner: 10JHathaway) [18:01:12] (03Abandoned) 10EpicPupper: puppet: Change IRC network refs from freenode to Libera [puppet] - 10https://gerrit.wikimedia.org/r/756698 (https://phabricator.wikimedia.org/T283273) (owner: 10EpicPupper) [18:02:32] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality' for release 'main' . 
[18:02:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:05:44] 10SRE, 10SRE-Access-Requests: Requesting update to SSH key and Kerberos for Joseph Seddon - https://phabricator.wikimedia.org/T299988 (10jhathaway) ssh key confirmed via gchat [18:08:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312', diff saved to https://phabricator.wikimedia.org/P19368 and previous config saved to /var/cache/conftool/dbconfig/20220126-180819-marostegui.json [18:08:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:08:38] RECOVERY - k8s API server requests latencies on ml-serve-ctrl1002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [18:09:58] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, 10netops: Q2:(Need By: TBD) Rows E/F network racking task - https://phabricator.wikimedia.org/T292095 (10cmooney) @Cmjohnson reversed the fibers and we got the links up: ` cmooney@re0.cr1-eqiad> show interfaces diagnostics optics et-1/0/2 | mat... [18:10:02] RECOVERY - k8s API server requests latencies on ml-serve-ctrl1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [18:14:43] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host restbase1019.eqiad.wmnet with OS buster [18:14:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:14:54] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=restbase1019.eqiad.wmnet [18:14:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:16:13] PROBLEM - Postgres Replication Lag on maps1008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 456019324464 and 25849 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [18:23:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312 (T298559)', diff saved to https://phabricator.wikimedia.org/P19369 and previous config saved to /var/cache/conftool/dbconfig/20220126-182325-marostegui.json [18:23:27] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1146.eqiad.wmnet with reason: Maintenance [18:23:29] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1146.eqiad.wmnet with reason: Maintenance [18:23:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:23:31] T298559: Fix mismatching field type of querycache_info.qci_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298559 [18:23:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:23:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1146:3312 (T298559)', diff saved to https://phabricator.wikimedia.org/P19370 and previous config saved to /var/cache/conftool/dbconfig/20220126-182333-marostegui.json [18:23:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:23:39] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:24:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312 (T298559)', diff saved to https://phabricator.wikimedia.org/P19371 and previous config saved to /var/cache/conftool/dbconfig/20220126-182448-marostegui.json [18:24:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:28:53] (03PS1) 10Majavah: exec-manage: update example host and fix list command [puppet] - 10https://gerrit.wikimedia.org/r/757496 [18:29:08] (03PS1) 10Arturo Borrero Gonzalez: wmcs: toolforge: grid: allow depooling multiple nodes at once [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/757497 [18:32:17] (03PS1) 10Herron: wip [puppet] - 10https://gerrit.wikimedia.org/r/757498 [18:34:21] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] wmcs: toolforge: grid: allow depooling multiple nodes at once [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/757497 (owner: 10Arturo Borrero Gonzalez) [18:34:56] (03PS1) 10Bartosz Dziewoński: Do not duplicate categories in primary action tabs space [skins/Timeless] (wmf/1.38.0-wmf.19) - 10https://gerrit.wikimedia.org/r/757470 (https://phabricator.wikimedia.org/T300100) [18:39:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312', diff saved to https://phabricator.wikimedia.org/P19372 and previous config saved to /var/cache/conftool/dbconfig/20220126-183953-marostegui.json [18:39:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:43:04] (03PS1) 10Arturo Borrero Gonzalez: toolforge: automated-tests: introduce check to verify default grid release [puppet] - 10https://gerrit.wikimedia.org/r/757499 (https://phabricator.wikimedia.org/T277653) [18:45:20] (03PS2) 10Arturo Borrero Gonzalez: toolforge: automated-tests: introduce check to verify default grid release [puppet] - 10https://gerrit.wikimedia.org/r/757499 (https://phabricator.wikimedia.org/T277653) [18:52:47] 
(03PS1) 10Clare Ming: Update config for idwiki: [mediawiki-config] - 10https://gerrit.wikimedia.org/r/757500 (https://phabricator.wikimedia.org/T299676) [18:54:17] ACKNOWLEDGEMENT - Host restbase2011 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn https://phabricator.wikimedia.org/T299928 (unhandled alert) [18:54:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312', diff saved to https://phabricator.wikimedia.org/P19373 and previous config saved to /var/cache/conftool/dbconfig/20220126-185457-marostegui.json [18:54:59] (03PS12) 10Jbond: RepoSync: add new class to mana syncing repositories [software/spicerack] - 10https://gerrit.wikimedia.org/r/747116 [18:55:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:55:09] (03PS3) 10Dzahn: delete bugzilla_static after it moved from puppet to k8s [puppet] - 10https://gerrit.wikimedia.org/r/755761 (https://phabricator.wikimedia.org/T281538) [18:55:46] (03CR) 10Jbond: RepoSync: add new class to mana syncing repositories (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/747116 (owner: 10Jbond) [18:56:10] (03CR) 10Jbond: RepoSync: add new class to mana syncing repositories (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/747116 (owner: 10Jbond) [18:57:45] (03PS4) 10Dzahn: delete bugzilla_static after it moved from puppet to k8s [puppet] - 10https://gerrit.wikimedia.org/r/755761 (https://phabricator.wikimedia.org/T281538) [18:58:21] (03CR) 10jerkins-bot: [V: 04-1] delete bugzilla_static after it moved from puppet to k8s [puppet] - 10https://gerrit.wikimedia.org/r/755761 (https://phabricator.wikimedia.org/T281538) (owner: 10Dzahn) [18:59:29] (03PS5) 10Dzahn: delete bugzilla_static after it moved from puppet to k8s [puppet] - 10https://gerrit.wikimedia.org/r/755761 (https://phabricator.wikimedia.org/T281538) [19:00:04] brennen and jeena: My dear minions, it's time we take the moon! Just kidding. 
Time for Train log triage with CPT deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220126T1900). [19:00:04] RoanKattouw and Urbanecm: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC evening backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220126T1900). [19:00:04] dontpanic, eigyan, MatmaRex, and nn1l2: A patch you scheduled for UTC evening backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [19:00:11] hi [19:00:17] hi [19:00:44] greetings all [19:00:46] PROBLEM - Postgres Replication Lag on maps1008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 464343395888 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [19:01:49] (03CR) 10jerkins-bot: [V: 04-1] RepoSync: add new class to mana syncing repositories [software/spicerack] - 10https://gerrit.wikimedia.org/r/747116 (owner: 10Jbond) [19:03:20] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/pcc-worker1003/33464/miscweb2002.codfw.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/755761 (https://phabricator.wikimedia.org/T281538) (owner: 10Dzahn) [19:03:22] (03PS1) 10Elukey: knative-serving: add more SANs to the Istio Egress gw's Certificate [deployment-charts] - 10https://gerrit.wikimedia.org/r/757502 (https://phabricator.wikimedia.org/T298976) [19:05:34] PROBLEM - Postgres Replication Lag on maps1008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 465330368048 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [19:05:56] (03PS2) 104nn1l2: fawiki: Add unwatchedpages permission to eliminators [mediawiki-config] - 10https://gerrit.wikimedia.org/r/757436 (https://phabricator.wikimedia.org/T300126) [19:07:16] is anyone around who could deploy the backports in this window? 
[19:07:21] (PS3) 4nn1l2: fawiki: Add unwatchedpages permission to eliminators [mediawiki-config] - https://gerrit.wikimedia.org/r/757436 (https://phabricator.wikimedia.org/T300126) [19:07:29] I can in a few minutes [19:07:37] a board meeting is running over a bit [19:07:49] (unless no one beats me, of course) [19:10:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312 (T298559)', diff saved to https://phabricator.wikimedia.org/P19374 and previous config saved to /var/cache/conftool/dbconfig/20220126-191002-marostegui.json [19:10:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:10:07] T298559: Fix mismatching field type of querycache_info.qci_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298559 [19:10:35] I’m also still around [19:10:41] * Lucas_WMDE peeks at the window [19:12:01] Lucas_WMDE: can you deploy? [19:12:11] sure [19:12:13] thanks [19:12:13] I can at least start [19:12:29] (CR) Lucas Werkmeister (WMDE): [C: +2] Do not duplicate categories in primary action tabs space [skins/Timeless] (wmf/1.38.0-wmf.19) - https://gerrit.wikimedia.org/r/757470 (https://phabricator.wikimedia.org/T300100) (owner: Bartosz Dziewoński) [19:12:53] (CR) Lucas Werkmeister (WMDE): [C: +2] [wmf-config] Undeploy gdi survey on cawiki beta [mediawiki-config] - https://gerrit.wikimedia.org/r/756985 (https://phabricator.wikimedia.org/T299913) (owner: Eigyan) [19:13:08] (CR) Lucas Werkmeister (WMDE): [C: +2] Fix empty div when there's no sitenotice. [core] (wmf/1.38.0-wmf.19) - https://gerrit.wikimedia.org/r/757467 (https://phabricator.wikimedia.org/T300096) (owner: Brennen Bearnes) [19:13:28] dontpanic: are you also here?
I haven’t seen you o/ yet I think [19:13:33] (03Merged) 10jenkins-bot: [wmf-config] Undeploy gdi survey on cawiki beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/756985 (https://phabricator.wikimedia.org/T299913) (owner: 10Eigyan) [19:14:19] (03CR) 10Elukey: [C: 03+2] knative-serving: add more SANs to the Istio Egress gw's Certificate [deployment-charts] - 10https://gerrit.wikimedia.org/r/757502 (https://phabricator.wikimedia.org/T298976) (owner: 10Elukey) [19:14:25] (03PS1) 10Dzahn: httpbb: move tests for static-bugzilla to new file for miscweb-k8s [puppet] - 10https://gerrit.wikimedia.org/r/757505 (https://phabricator.wikimedia.org/T300171) [19:15:07] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings-labs.php: Config: [[gerrit:756985|[wmf-config] Undeploy gdi survey on cawiki beta (T299913)]] (no-op sync, beta only) (duration: 00m 52s) [19:15:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:15:12] T299913: Undeploy the cawiki test survey - https://phabricator.wikimedia.org/T299913 [19:15:12] thank you Lucas_WMDE [19:15:15] np [19:15:15] (03Merged) 10jenkins-bot: Do not duplicate categories in primary action tabs space [skins/Timeless] (wmf/1.38.0-wmf.19) - 10https://gerrit.wikimedia.org/r/757470 (https://phabricator.wikimedia.org/T300100) (owner: 10Bartosz Dziewoński) [19:15:23] (it’ll take a few more minutes to actually be effective on Beta) [19:15:32] alright, apparently Timeless has fast CI [19:15:36] (heh) [19:15:43] urbanecm, Lucas_WMDE: i need to be afk for about 30, but i can take over if anything is left pre-train-window. [19:15:51] ok [19:16:27] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. 
[19:16:29] (03CR) 10Dzahn: "[deploy1002:~] $ httpbb /srv/deployment/httpbb-tests/miscweb/test_miscweb.yaml --hosts miscweb2002.codfw.wmnet" [puppet] - 10https://gerrit.wikimedia.org/r/755761 (https://phabricator.wikimedia.org/T281538) (owner: 10Dzahn) [19:16:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:16:33] MatmaRex: the Timeless patch should be on mwdebug1001, can you test it there? [19:17:01] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [19:17:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:17:11] seems to work on my end at least [19:17:13] using https://www.mediawiki.org/wiki/Extension:GraphViz?useskin=timeless [19:17:33] looking [19:17:41] yes, looks fixed [19:17:46] (03CR) 10Dzahn: "this also removed monitoring which included the only check for general expirty of TLS cert for traffic but I have made tickets about that " [puppet] - 10https://gerrit.wikimedia.org/r/755761 (https://phabricator.wikimedia.org/T281538) (owner: 10Dzahn) [19:18:08] (03CR) 10Jbond: [C: 04-1] exim: Silently block spam email from given source (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/757031 (owner: 10Jcrespo) [19:18:26] ok, syncing [19:18:29] (03CR) 10Jbond: [C: 03+1] exim: add the ability to silently drop senders (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/748884 (https://phabricator.wikimedia.org/T298038) (owner: 10JHathaway) [19:18:31] I'm arriving [19:18:43] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Don't wrap unknown actions with confirmation [extensions/VisualEditor] (wmf/1.38.0-wmf.19) - 10https://gerrit.wikimedia.org/r/757145 (https://phabricator.wikimedia.org/T300095) (owner: 10Thiemo Kreuz (WMDE)) [19:18:55] !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. 
[19:18:57] ah, I didn’t see your notice in the calendar, sorry [19:18:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:19:00] good that you’re here now :) [19:19:13] !log lucaswerkmeister-wmde@deploy1002 Synchronized php-1.38.0-wmf.19/skins/Timeless/includes/TimelessTemplate.php: Backport: [[gerrit:757470|Do not duplicate categories in primary action tabs space (T300100)]] (duration: 00m 51s) [19:19:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:19:17] T300100: In Timeless categories are showing up where page/action tabs are - https://phabricator.wikimedia.org/T300100 [19:19:31] !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. [19:19:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:19:40] !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [19:19:41] alright, let’s do dontpanic’s config change while waiting for MediaWiki core CI [19:19:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:19:48] (PS3) Lucas Werkmeister (WMDE): bgwiki: Add 'wgNamespaceRobotPolicies' for Draft (Talk) namespace [mediawiki-config] - https://gerrit.wikimedia.org/r/756978 (https://phabricator.wikimedia.org/T299224) (owner: Tks4Fish) [19:19:48] !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. [19:19:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:19:54] my changes have taken effect thanks team!
[19:19:55] (LogstashIndexingFailures) firing: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org [19:19:56] (PS1) Dzahn: backup: remove fileset for static-bugzilla [puppet] - https://gerrit.wikimedia.org/r/757509 (https://phabricator.wikimedia.org/T300171) [19:20:10] !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [19:20:11] fully here now [19:20:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:20:25] eigyan: great! [19:20:27] dontpanic: ok [19:20:32] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [19:20:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:20:47] !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. [19:20:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:21:31] (CR) Lucas Werkmeister (WMDE): [C: +2] bgwiki: Add 'wgNamespaceRobotPolicies' for Draft (Talk) namespace [mediawiki-config] - https://gerrit.wikimedia.org/r/756978 (https://phabricator.wikimedia.org/T299224) (owner: Tks4Fish) [19:22:20] !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality' for release 'main' . [19:22:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:22:57] !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' .
[19:23:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:23:09] (Merged) jenkins-bot: bgwiki: Add 'wgNamespaceRobotPolicies' for Draft (Talk) namespace [mediawiki-config] - https://gerrit.wikimedia.org/r/756978 (https://phabricator.wikimedia.org/T299224) (owner: Tks4Fish) [19:23:10] !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . [19:23:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:23:40] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . [19:23:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:23:51] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . [19:23:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:24:25] dontpanic: your change is on mwdebug1001, can you test it there?
[19:24:35] (CR) Elukey: [C: +2] ml-services: add draftquality transformer [deployment-charts] - https://gerrit.wikimedia.org/r/756064 (https://phabricator.wikimedia.org/T298989) (owner: Accraze) [19:24:40] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [19:24:41] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [19:24:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:24:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:24:55] (LogstashIndexingFailures) resolved: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org [19:25:26] Lucas_WMDE: I'm not sure it's testable, let me search again [19:25:55] (LogstashIndexingFailures) firing: (2) Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org [19:25:55] * Lucas_WMDE looks up what that setting does [19:26:04] sounds like it should be testable, might require a purge [19:27:05] seems to work on https://bg.wikipedia.org/wiki/%D0%A7%D0%B5%D1%80%D0%BD%D0%BE%D0%B2%D0%B0:%D0%A2%D0%B5%D1%81%D1%82 (even without purge) [19:27:08] yep, it's working [19:27:14] yay [19:27:35] syncing [19:27:40] cool, ty :) [19:28:25] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:756978|bgwiki: Add 'wgNamespaceRobotPolicies' for Draft (Talk) namespace (T299224)]] (duration: 00m 52s) [19:28:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:28:30] T299224: bgwiki: Add draft namespace - https://phabricator.wikimedia.org/T299224 [19:28:39] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE
helmfile.d/services/mwdebug: sync on pinkunicorn [19:28:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:28:55] alright, the core change is almost done merging so let’s wait for that and do nn1l2’s config change later [19:29:13] thanks! [19:29:20] thanks a ton Lucas_WMDE :) [19:29:26] np :) [19:29:59] (03Merged) 10jenkins-bot: Fix empty div when there's no sitenotice. [core] (wmf/1.38.0-wmf.19) - 10https://gerrit.wikimedia.org/r/757467 (https://phabricator.wikimedia.org/T300096) (owner: 10Brennen Bearnes) [19:30:55] (LogstashIndexingFailures) resolved: (2) Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org [19:30:58] MatmaRex: the core change (sitenotice) should be on mwdebug1001, please test [19:31:52] Lucas_WMDE: looks fixed, using the example from https://phabricator.wikimedia.org/T300096#7651228 [19:31:57] ok, syncing [19:33:14] !log lucaswerkmeister-wmde@deploy1002 Synchronized php-1.38.0-wmf.19/includes/skins/Skin.php: Backport: [[gerrit:757467|Fix empty div when there's no sitenotice. 
(T300096)]] (duration: 00m 51s) [19:33:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:33:19] T300096: Erroneous [dismiss] button shown to all readers - https://phabricator.wikimedia.org/T300096 [19:33:41] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [19:33:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:34:54] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [19:34:55] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [19:34:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:34:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:35:01] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] fawiki: Add unwatchedpages permission to eliminators (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/757436 (https://phabricator.wikimedia.org/T300126) (owner: 104nn1l2) [19:35:04] (03CR) 10Jcrespo: [C: 04-2] exim: Silently block spam email from given source (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/757031 (owner: 10Jcrespo) [19:35:13] (03Abandoned) 10Jcrespo: exim: Silently block spam email from given source [puppet] - 10https://gerrit.wikimedia.org/r/757031 (owner: 10Jcrespo) [19:35:28] (03Merged) 10jenkins-bot: Don't wrap unknown actions with confirmation [extensions/VisualEditor] (wmf/1.38.0-wmf.19) - 10https://gerrit.wikimedia.org/r/757145 (https://phabricator.wikimedia.org/T300095) (owner: 10Thiemo Kreuz (WMDE)) [19:36:13] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [19:36:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:36:50] MatmaRex: and now the VE change is on mwdebug1001, please test :) [19:37:38] Lucas_WMDE: looks good as well [19:37:43] ok [19:38:55] 
syncing [19:39:02] (after it took me a few seconds to tab-complete the right path ^^) [19:39:37] (03PS4) 10Lucas Werkmeister (WMDE): fawiki: Add unwatchedpages permission to eliminators [mediawiki-config] - 10https://gerrit.wikimedia.org/r/757436 (https://phabricator.wikimedia.org/T300126) (owner: 104nn1l2) [19:39:38] !log lucaswerkmeister-wmde@deploy1002 Synchronized php-1.38.0-wmf.19/extensions/VisualEditor/modules/ve-mw/ui/dialogs/ve.ui.MWTransclusionDialog.js: Backport: [[gerrit:757145|Don't wrap unknown actions with confirmation (T300095)]] (duration: 00m 51s) [19:39:40] 10SRE, 10SRE-Access-Requests: Requesting update to SSH key and Kerberos for Joseph Seddon - https://phabricator.wikimedia.org/T299988 (10jhathaway) @MarkTraceur & @Ottomata please approve [19:39:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:39:43] T300095: Unable to insert a template using the template editor - https://phabricator.wikimedia.org/T300095 [19:40:34] fyi just broke puppet fixing it now [19:41:02] thanks Lucas_WMDE [19:41:16] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [19:41:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:41:20] (03CR) 10JHathaway: [C: 03+2] icinga: add additional users to fr-tech-ops [puppet] - 10https://gerrit.wikimedia.org/r/757043 (https://phabricator.wikimedia.org/T298649) (owner: 10JHathaway) [19:41:30] 10SRE, 10serviceops, 10Patch-For-Review: Remove mediawiki::packages::fonts from non thumbor servers - https://phabricator.wikimedia.org/T294378 (10Dzahn) I checked the "status code 5xx" graph in the "RED dashboard" for appservers, looked at the last 30 days, especially if anything went up since Jan 17 when I... [19:41:36] 10SRE, 10SRE-Access-Requests: Requesting update to SSH key and Kerberos for Joseph Seddon - https://phabricator.wikimedia.org/T299988 (10Ottomata) Approved. 
[19:41:50] (CR) Lucas Werkmeister (WMDE): [C: +2] fawiki: Add unwatchedpages permission to eliminators [mediawiki-config] - https://gerrit.wikimedia.org/r/757436 (https://phabricator.wikimedia.org/T300126) (owner: 4nn1l2) [19:42:20] puppet fixed now [19:42:29] !log accraze@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . [19:42:31] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [19:42:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:42:32] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [19:42:33] (Merged) jenkins-bot: fawiki: Add unwatchedpages permission to eliminators [mediawiki-config] - https://gerrit.wikimedia.org/r/757436 (https://phabricator.wikimedia.org/T300126) (owner: 4nn1l2) [19:42:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:42:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:43:18] SRE, SRE-Access-Requests, Fundraising-Backlog, observability, and 2 others: Fundraising-Tech engineers unable to ACK icinga alerts on fr-tech host groups - https://phabricator.wikimedia.org/T298649 (jhathaway) @jgleeson this change has been merged in so all users should be able to ack alerts. I a...
[19:43:36] 10SRE, 10SRE-Access-Requests, 10Fundraising-Backlog, 10observability, and 2 others: Fundraising-Tech engineers unable to ACK icinga alerts on fr-tech host groups - https://phabricator.wikimedia.org/T298649 (10jhathaway) 05In progress→03Resolved [19:43:38] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [19:43:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:43:44] nn1l2: the fawiki change is on mwdebug1001, please test [19:43:48] * Lucas_WMDE also tests [19:44:04] PROBLEM - Widespread puppet agent failures- no resources reported on alert1001 is CRITICAL: 0.07209 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [19:44:31] looks correct at https://fa.wikipedia.org/w/api.php?action=query&list=allusers&aufrom=LordProfo&auto=LordProfo&auprop=rights afaict [19:44:43] Good to go [19:44:43] no change apart from the new right, reupload-own still included [19:44:46] ok [19:45:50] reupload-own is on every user on every WMF project by default [19:45:52] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:757436|fawiki: Add unwatchedpages permission to eliminators (T300126)]] (duration: 00m 51s) [19:45:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:45:57] T300126: Add unwatchedpages permission to patrollers & eliminators on Farsi Wikipedia - https://phabricator.wikimedia.org/T300126 [19:46:28] !log UTC evening backport window done [19:46:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:47:14] brennen: train should be ok to proceed, I think [19:48:34] * Lucas_WMDE is excited for those “numRows was deprecated in” warnings to finally disappear from logspam-watch [19:48:41] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [19:48:43] Logged 
the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:48:58] thanks Lucas_WMDE [19:49:00] (back) [19:50:01] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [19:50:02] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [19:50:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:50:06] !log accraze@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . [19:50:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:50:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:50:49] alright, I’m signing off for today then :) [19:50:54] see you and good luck with the train on wmf.1 [19:51:01] *group1 ^^ [19:51:16] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [19:51:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:53:17] RECOVERY - Widespread puppet agent failures- no resources reported on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.002168 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [19:53:37] !log labweb1001, labweb1002, cloudweb2001-dev (wikitech hosts) - apt-get remove --purge fonts*; apt-get remove --purge xfonts* | purging font packages that had been installed as dependencies (T294378) [19:53:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:53:41] T294378: Remove mediawiki::packages::fonts from non thumbor servers - https://phabricator.wikimedia.org/T294378 [19:54:51] (03PS4) 10Ebernhardson: rdf query service: Use constant filename for defaults [puppet] - 10https://gerrit.wikimedia.org/r/757124 (https://phabricator.wikimedia.org/T299222) [19:56:39] 10SRE, 10SRE-Access-Requests, 10Research: Access to 
analytics-privatedata-users for Research intern AniketArs - https://phabricator.wikimedia.org/T299919 (10jhathaway) @AniketArs as an added layer of insurance can you provide that public key via a Gerrit patch, wikitech user page, or Phabricator post with Ad... [20:00:04] brennen and jeena: #bothumor My software never has bugs. It just develops random features. Rise for MediaWiki train - Utc-7 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220126T2000). [20:00:10] o/ [20:00:24] o/ [20:00:53] !log mw131* - purging remaining font packages [20:00:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:00:58] (03PS1) 10Jbond: C:network: add mx context to abuse_nets [puppet] - 10https://gerrit.wikimedia.org/r/757516 (https://phabricator.wikimedia.org/T270618) [20:01:00] (03PS1) 10Jbond: O:mail::mx: Add mx specific block list [puppet] - 10https://gerrit.wikimedia.org/r/757517 (https://phabricator.wikimedia.org/T270618) [20:01:21] !log train 1.38.0-wmf.19 (T293960): all known blockers patched, logs for wmf.19 quiet - proceeding to group1 [20:01:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:01:25] T293960: 1.38.0-wmf.19 deployment blockers - https://phabricator.wikimedia.org/T293960 [20:02:01] 10SRE, 10SRE-Access-Requests, 10Research: Access to analytics-privatedata-users for Research intern AniketArs - https://phabricator.wikimedia.org/T299919 (10jhathaway) a:03jhathaway [20:03:29] (03CR) 10jerkins-bot: [V: 04-1] O:mail::mx: Add mx specific block list [puppet] - 10https://gerrit.wikimedia.org/r/757517 (https://phabricator.wikimedia.org/T270618) (owner: 10Jbond) [20:03:42] (03CR) 10Jbond: [C: 03+2] C:network: add mx context to abuse_nets [puppet] - 10https://gerrit.wikimedia.org/r/757516 (https://phabricator.wikimedia.org/T270618) (owner: 10Jbond) [20:04:05] (03PS1) 10Brennen Bearnes: group1 wikis to 1.38.0-wmf.19 refs T293960 [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/757518 [20:04:07] (03CR) 10Brennen Bearnes: [C: 03+2] group1 wikis to 1.38.0-wmf.19 refs T293960 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/757518 (owner: 10Brennen Bearnes) [20:04:24] (03PS2) 10Jbond: O:mail::mx: Add mx specific block list [puppet] - 10https://gerrit.wikimedia.org/r/757517 (https://phabricator.wikimedia.org/T270618) [20:05:01] (03Merged) 10jenkins-bot: group1 wikis to 1.38.0-wmf.19 refs T293960 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/757518 (owner: 10Brennen Bearnes) [20:06:18] (03CR) 10jerkins-bot: [V: 04-1] O:mail::mx: Add mx specific block list [puppet] - 10https://gerrit.wikimedia.org/r/757517 (https://phabricator.wikimedia.org/T270618) (owner: 10Jbond) [20:06:42] !log brennen@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.38.0-wmf.19 refs T293960 [20:06:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:06:47] T293960: 1.38.0-wmf.19 deployment blockers - https://phabricator.wikimedia.org/T293960 [20:07:37] !log brennen@deploy1002 Synchronized php: group1 wikis to 1.38.0-wmf.19 refs T293960 (duration: 00m 54s) [20:07:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:07:46] (03PS2) 10Clare Ming: Update config for idwiki: [mediawiki-config] - 10https://gerrit.wikimedia.org/r/757500 (https://phabricator.wikimedia.org/T299676) [20:09:00] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33465/console" [puppet] - 10https://gerrit.wikimedia.org/r/757517 (https://phabricator.wikimedia.org/T270618) (owner: 10Jbond) [20:09:02] (03CR) 10Jbond: O:mail::mx: Add mx specific block list [puppet] - 10https://gerrit.wikimedia.org/r/757517 (https://phabricator.wikimedia.org/T270618) (owner: 10Jbond) [20:11:27] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [20:11:30] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:12:33] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [20:12:34] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [20:12:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:12:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:13:46] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [20:13:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:14:29] (03PS5) 10AOkoth: kuberenetes: disable mwautopull timer [puppet] - 10https://gerrit.wikimedia.org/r/754960 (https://phabricator.wikimedia.org/T288345) [20:14:33] (03PS1) 10AOkoth: otrs: rename profile to vrts [puppet] - 10https://gerrit.wikimedia.org/r/757519 (https://phabricator.wikimedia.org/T293942) [20:14:40] (03PS3) 10Jbond: O:mail::mx: Add mx specific block list [puppet] - 10https://gerrit.wikimedia.org/r/757517 (https://phabricator.wikimedia.org/T270618) [20:15:16] 10SRE, 10SRE-Access-Requests, 10Fundraising-Backlog, 10observability, 10serviceops-radar: Fundraising-Tech engineers unable to ACK icinga alerts on fr-tech host groups - https://phabricator.wikimedia.org/T298649 (10jgleeson) Thanks so much @jhathaway and everyone else who help along the way! :) [20:15:29] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33466/console" [puppet] - 10https://gerrit.wikimedia.org/r/757517 (https://phabricator.wikimedia.org/T270618) (owner: 10Jbond) [20:15:46] (03CR) 10jerkins-bot: [V: 04-1] otrs: rename profile to vrts [puppet] - 10https://gerrit.wikimedia.org/r/757519 (https://phabricator.wikimedia.org/T293942) (owner: 10AOkoth) [20:16:39] jeena: do these DBTransactionSizeErrors ring any bells? 
feel like this is ticking up, and i don't see any for .18 in the last several hours [20:17:23] oh yeah there are more now...I don't recall this happening last time [20:17:29] (03CR) 10jerkins-bot: [V: 04-1] O:mail::mx: Add mx specific block list [puppet] - 10https://gerrit.wikimedia.org/r/757517 (https://phabricator.wikimedia.org/T270618) (owner: 10Jbond) [20:17:32] (03CR) 10Jbond: exim: Silently block spam email from given source (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/757031 (owner: 10Jcrespo) [20:17:36] 10SRE, 10serviceops, 10Patch-For-Review: Remove mediawiki::packages::fonts from non thumbor servers - https://phabricator.wikimedia.org/T294378 (10Dzahn) purged remaining font files on mw13* [20:17:37] gonna roll back. [20:20:44] !log brennen@deploy1002 rebuilt and synchronized wikiversions files: Revert "group1 wikis to 1.38.0-wmf.19 refs T293960" [20:20:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:20:48] T293960: 1.38.0-wmf.19 deployment blockers - https://phabricator.wikimedia.org/T293960 [20:21:50] !log train 1.38.0-wmf.19 (T293960): rolling back due to increase in DBTransactionSizeErrors [20:21:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:23:10] (03PS1) 10Brennen Bearnes: Revert "group1 wikis to 1.38.0-wmf.19 refs T293960" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/757521 [20:23:12] (03CR) 10Brennen Bearnes: [C: 03+2] Revert "group1 wikis to 1.38.0-wmf.19 refs T293960" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/757521 (owner: 10Brennen Bearnes) [20:23:55] (03Merged) 10jenkins-bot: Revert "group1 wikis to 1.38.0-wmf.19 refs T293960" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/757521 (owner: 10Brennen Bearnes) [20:24:17] (03PS2) 10AOkoth: otrs: rename profile to vrts [puppet] - 10https://gerrit.wikimedia.org/r/757519 (https://phabricator.wikimedia.org/T293942) [20:24:54] 10SRE, 10Infrastructure-Foundations: check_user - authorization 
error - https://phabricator.wikimedia.org/T300193 (10jhathaway) [20:25:33] (03CR) 10Jcrespo: exim: Silently block spam email from given source (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/757031 (owner: 10Jcrespo) [20:26:09] "a many much better way" English I do not [20:26:33] it's 9pm here, I think a good time to stop working :-) [20:27:11] (03PS4) 10Jbond: O:mail::mx: Add mx specific block list [puppet] - 10https://gerrit.wikimedia.org/r/757517 (https://phabricator.wikimedia.org/T270618) [20:27:51] brennen: added to blockers in phab [20:28:58] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [20:29:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:29:59] jeena: thanks! you're a step ahead of me. [20:30:10] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [20:30:11] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [20:30:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:30:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:30:17] teamwork! [20:31:27] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [20:31:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:35:31] hrm, next weird thing: if you look at mediawiki-errors in logstash, a couple of deprecation warnings jumped up right after deploy and have stayed there (but in wmf.18) [20:36:46] coincidence? [20:37:22] (03PS3) 10AOkoth: otrs: rename profile to vrts [puppet] - 10https://gerrit.wikimedia.org/r/757519 (https://phabricator.wikimedia.org/T293942) [20:38:50] (03CR) 10RLazarus: [C: 03+1] "LGTM, thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/757505 (https://phabricator.wikimedia.org/T300171) (owner: 10Dzahn) [20:38:52] I don't see any big jumps on the graph [20:40:21] PROBLEM - Postgres Replication Lag on maps1008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 486689386032 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [20:40:30] jeena: https://logstash.wikimedia.org/goto/7c098f3b2be80f27b3b2b0f7e6a2cc11 [20:40:38] (maybe that will work) [20:45:13] well, that should be fixed in .19 so we should expect those in .18 anyway I think https://phabricator.wikimedia.org/T299721 [20:46:12] yeah, makes sense i guess. and they seem to have finally dropped back off. [20:46:26] filing that under "not to worry about". [20:46:41] the spike just seemed odd. [20:46:50] yeah I'm not that concerned about it [21:00:04] brennen and jeena: How many deployers does it take to do MediaWiki train - Utc-7 Version deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220126T2000). [21:00:04] chrisalbon and accraze: How many deployers does it take to do Services – Graphoid / ORES deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220126T2100). [21:01:50] (03PS6) 10AGueyte: Update Event Stream for IPInfo events [mediawiki-config] - 10https://gerrit.wikimedia.org/r/756635 (https://phabricator.wikimedia.org/T296415) [21:10:09] (03PS7) 10AGueyte: Update Event Stream for IPInfo events [mediawiki-config] - 10https://gerrit.wikimedia.org/r/756635 (https://phabricator.wikimedia.org/T296415) [21:14:16] 10SRE, 10SRE-Access-Requests: Requesting Google Search Console Access for a Service Account - https://phabricator.wikimedia.org/T300004 (10jhathaway) 05Open→03Resolved @SCherukuwada I have added wmf-search-console-account@wmf-sc-experiments.iam.gserviceaccount.com to the following: - https://en.wikipedi... 
[21:32:53] (Device rebooted) firing: Device rebooted - https://alerts.wikimedia.org [21:37:53] (Device rebooted) resolved: Device rebooted - https://alerts.wikimedia.org [21:45:36] (03PS1) 10RLazarus: imagecatalog: Only run on the active deployment host [puppet] - 10https://gerrit.wikimedia.org/r/757530 (https://phabricator.wikimedia.org/T287130) [21:46:47] (03CR) 10RLazarus: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33470/console" [puppet] - 10https://gerrit.wikimedia.org/r/757530 (https://phabricator.wikimedia.org/T287130) (owner: 10RLazarus) [21:48:01] (03CR) 10RLazarus: imagecatalog: Only run on the active deployment host [puppet] - 10https://gerrit.wikimedia.org/r/757530 (https://phabricator.wikimedia.org/T287130) (owner: 10RLazarus) [22:08:23] 10SRE, 10SRE-Access-Requests: Requesting LDAP-only access to analytics-privatedata-users for Madalina Ana - https://phabricator.wikimedia.org/T299587 (10nshahquinn-wmf) @jhathaway @Volans this should be ready to go now. 
[22:09:08] (03CR) 10Nray: [C: 03+1] Update config for idwiki: [mediawiki-config] - 10https://gerrit.wikimedia.org/r/757500 (https://phabricator.wikimedia.org/T299676) (owner: 10Clare Ming) [22:14:07] (03CR) 10Cwhite: [C: 03+2] logstash: improve filter for ORES [puppet] - 10https://gerrit.wikimedia.org/r/756959 (https://phabricator.wikimedia.org/T299999) (owner: 10Elukey) [22:18:17] (03CR) 10Ahmon Dancy: [C: 03+1] MWMultiVersion.php: Flexible wikiversions file selection (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/756718 (owner: 10Ahmon Dancy) [22:21:46] (03CR) 10Ebernhardson: [C: 03+1] rdf-streaming-updater: add the reconciliation stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/753788 (https://phabricator.wikimedia.org/T279541) (owner: 10DCausse) [22:29:43] PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is CRITICAL: 46.86 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [22:32:01] RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is OK: (C)60 le (W)70 le 97.61 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [22:32:47] (03PS1) 10Cwhite: logstash: correct ores grok tagging and id [puppet] - 10https://gerrit.wikimedia.org/r/757535 [22:36:06] (03CR) 10Cwhite: [C: 03+2] logstash: correct ores grok tagging and id [puppet] - 10https://gerrit.wikimedia.org/r/757535 (owner: 10Cwhite) [22:38:17] (03CR) 10JHathaway: exim: add the ability to silently drop senders (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/748884 (https://phabricator.wikimedia.org/T298038) (owner: 10JHathaway) [22:39:27] (03CR) 10Jdlrobson: [C: 03+1] Update config for idwiki: [mediawiki-config] - 10https://gerrit.wikimedia.org/r/757500 
(https://phabricator.wikimedia.org/T299676) (owner: 10Clare Ming) [22:51:57] (03Abandoned) 10Jdlrobson: Opt in link should be different in migration mode [skins/Vector] (wmf/1.38.0-wmf.18) - 10https://gerrit.wikimedia.org/r/756697 (https://phabricator.wikimedia.org/T299927) (owner: 10Jdlrobson) [22:52:46] (03CR) 10JHathaway: [C: 03+1] "looks good, though I find the commit message a bit confusing, wouldn't an abuse_network with context mx prevent those networks from being " [puppet] - 10https://gerrit.wikimedia.org/r/757517 (https://phabricator.wikimedia.org/T270618) (owner: 10Jbond) [22:58:43] (03CR) 10Ladsgroup: "This change is ready for review." [core] (wmf/1.38.0-wmf.19) - 10https://gerrit.wikimedia.org/r/757473 (https://phabricator.wikimedia.org/T300194) (owner: 10Ladsgroup) [23:00:02] jouncebot next [23:00:02] In 0 hour(s) and 59 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220127T0000) [23:02:15] brennen: do you think we can push the train now? [23:02:22] (once the patch is merged ofc) [23:02:35] Amir1: yeah, assuming it passes, i think we can give it one more shot. [23:03:28] (03CR) 10Ladsgroup: [C: 03+2] Revert "rdbms: cleanup the use of QUERY_ flags to query() in Database" [core] (wmf/1.38.0-wmf.19) - 10https://gerrit.wikimedia.org/r/757473 (https://phabricator.wikimedia.org/T300194) (owner: 10Ladsgroup) [23:03:58] brennen: I +2'ed so it wouldn't take forty minutes just to pass jenkins [23:04:02] only twenty minutes [23:04:08] in the mean time, ICE CREAM [23:04:08] ::nod:: [23:04:21] thanks Amir1, and happy ice cream. :) [23:04:51] Thanks ^^ [23:05:18] (03CR) 10Cwhite: "Looks like a good start to me." 
[puppet] - 10https://gerrit.wikimedia.org/r/757498 (https://phabricator.wikimedia.org/T300056) (owner: 10Herron) [23:08:25] (03CR) 10Ladsgroup: exim: add the ability to silently drop senders (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/748884 (https://phabricator.wikimedia.org/T298038) (owner: 10JHathaway) [23:08:58] (03CR) 10Cwhite: [C: 03+1] "Looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/757447 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi) [23:09:02] (ci is failing) [23:09:15] (03CR) 10Ryan Kemper: "We'll deploy this now since it's taking a bit longer than intended to get the elastic hosts into service (dealing w/ puppet / elasticsearc" [puppet] - 10https://gerrit.wikimedia.org/r/752724 (https://phabricator.wikimedia.org/T295705) (owner: 10Ebernhardson) [23:09:25] (03CR) 10Bking: [C: 03+2] cirrussearch: Reenable saneitizer [puppet] - 10https://gerrit.wikimedia.org/r/752724 (https://phabricator.wikimedia.org/T295705) (owner: 10Ebernhardson) [23:12:18] jbond: heads up, we're merging your patch for https://gerrit.wikimedia.org/r/c/operations/puppet/+/757516/ [23:15:07] zabe: both seem random failure :( [23:15:22] the tests are passing [23:16:27] (03CR) 10jerkins-bot: [V: 04-1] Revert "rdbms: cleanup the use of QUERY_ flags to query() in Database" [core] (wmf/1.38.0-wmf.19) - 10https://gerrit.wikimedia.org/r/757473 (https://phabricator.wikimedia.org/T300194) (owner: 10Ladsgroup) [23:18:51] (03CR) 10Ladsgroup: [C: 03+2] "...." [core] (wmf/1.38.0-wmf.19) - 10https://gerrit.wikimedia.org/r/757473 (https://phabricator.wikimedia.org/T300194) (owner: 10Ladsgroup) [23:19:03] yep [23:19:04] (03CR) 10Cwhite: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/739463 (https://phabricator.wikimedia.org/T291905) (owner: 10Elukey) [23:26:21] CI generally seems to be suffering atm [23:26:51] yeah [23:29:06] 'No such file or directory' showing up everywhere. 
[23:29:21] I'm this close to force-merging the patch [23:30:39] Amir1: we're getting late enough in the day and wmf.19 has had little enough exposure to surface other potential issues that we might save ourselves some pain leaving it for the day. [23:30:56] (03CR) 10jerkins-bot: [V: 04-1] Revert "rdbms: cleanup the use of QUERY_ flags to query() in Database" [core] (wmf/1.38.0-wmf.19) - 10https://gerrit.wikimedia.org/r/757473 (https://phabricator.wikimedia.org/T300194) (owner: 10Ladsgroup) [23:31:26] i don't know what's up with CI and i'd rather not get into some situation where we're force-merging multiple changes. [23:32:05] yeah [23:32:15] I need to call it a day as well [23:32:16] `FileNotFoundError: [Errno 2] No such file or directory: './node_modules/.bin/grunt': './node_modules/.bin/grunt'` on https://integration.wikimedia.org/ci/job/wmf-quibble-core-vendor-mysql-php72-docker/71031/consoleFull but no obvious `npm install` errors [23:32:53] Amir1: thanks for the assist. we'll pick up the train in the US morning; i'll get at it fairly early. [23:34:35] !log train 1.38.0-wmf.19 (T293960): parking the train at group0 until US morning; we have a probable fix for T300194 but CI is having issues [23:34:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:34:42] T300194: Wikimedia\Rdbms\DBTransactionSizeError: Transaction spent 3.6s in writes, exceeding the 3s limit - https://phabricator.wikimedia.org/T300194 [23:34:42] T293960: 1.38.0-wmf.19 deployment blockers - https://phabricator.wikimedia.org/T293960 [23:34:44] created T300214 for the CI issues [23:34:44] T300214: 'No such file or directory' CI failures in multplie repos - https://phabricator.wikimedia.org/T300214 [23:34:49] thx zabe [23:57:00] (03PS1) 104nn1l2: commonswiki: Add leg.journals.isu.ac.ir to the wgCopyUploadsDomains allowlist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/757549 (https://phabricator.wikimedia.org/T300217)