[00:00:05] twentyafterfour: (Dis)respected human, time to deploy Phabricator update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210902T0000). Please do the needful.
[00:01:18] (PS1) Ladsgroup: mailman: Drop listinfo files [puppet] - https://gerrit.wikimedia.org/r/716077 (https://phabricator.wikimedia.org/T282303)
[00:04:44] (CR) Ladsgroup: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/716077 (https://phabricator.wikimedia.org/T282303) (owner: Ladsgroup)
[00:06:21] (CR) Ladsgroup: "PCC NOOP: https://puppet-compiler.wmflabs.org/compiler1003/894/" [puppet] - https://gerrit.wikimedia.org/r/716077 (https://phabricator.wikimedia.org/T282303) (owner: Ladsgroup)
[00:19:50] (CR) Thcipriani: [C: +1] add deployment and perf-roots shell groups to parsoid hosts [puppet] - https://gerrit.wikimedia.org/r/715988 (https://phabricator.wikimedia.org/T290144) (owner: Dzahn)
[01:05:23] (PS2) Bstorm: Use common k8s labels [software/tools-webservice] - https://gerrit.wikimedia.org/r/637813 (https://phabricator.wikimedia.org/T266844) (owner: Legoktm)
[01:27:01] PROBLEM - snapshot of s7 in eqiad on alert1001 is CRITICAL: snapshot for s7 at eqiad taken more than 3 days ago: Most recent backup 2021-08-30 00:55:54 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting
[01:38:41] (CR) Juan90264: Adding square wordmark for ptwikinews (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/704171 (https://phabricator.wikimedia.org/T281591) (owner: Juan90264)
[01:54:58] (PS1) Krinkle: blameStartupRegistry: Call StartupModule::getScript instead of hardcoding [extensions/WikimediaMaintenance] (wmf/1.37.0-wmf.21) - https://gerrit.wikimedia.org/r/716092 (https://phabricator.wikimedia.org/T290213)
[01:55:52] Amir1: any testing/shelling on deploy or all clear?
[01:56:39] Krinkle: nothing on shell-side, I was testing something that got deployed through wmf.21
[01:57:31] ack
[01:57:35] (CR) Krinkle: [C: +2] blameStartupRegistry: Call StartupModule::getScript instead of hardcoding [extensions/WikimediaMaintenance] (wmf/1.37.0-wmf.21) - https://gerrit.wikimedia.org/r/716092 (https://phabricator.wikimedia.org/T290213) (owner: Krinkle)
[02:02:23] (Merged) jenkins-bot: blameStartupRegistry: Call StartupModule::getScript instead of hardcoding [extensions/WikimediaMaintenance] (wmf/1.37.0-wmf.21) - https://gerrit.wikimedia.org/r/716092 (https://phabricator.wikimedia.org/T290213) (owner: Krinkle)
[02:04:45] * Krinkle tests on mwmaint2002
[02:05:34] !log krinkle@deploy1002 Synchronized php-1.37.0-wmf.21/extensions/WikimediaMaintenance/blameStartupRegistry.php: I63bf1922af593b7a144ef5f6d036f9a5e23cec09 (duration: 01m 07s)
[02:05:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:06:16] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[02:06:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:09:37] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[02:09:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:59:39] RECOVERY - snapshot of s7 in eqiad on alert1001 is OK: Last snapshot for s7 at eqiad (db1171.eqiad.wmnet:3317) taken on 2021-09-02 00:42:46 (1065 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting
[03:58:15] PROBLEM - SSH on analytics1069.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:16:19] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:21:44] (PS4) Krinkle: Update configuration related to disabling Score functionality [mediawiki-config] - https://gerrit.wikimedia.org/r/715194 (owner: Legoktm)
[04:22:15] (CR) Krinkle: [C: +1] "tiny tweak as the reference to (nice) error message threw me off. feel free to undo though." [mediawiki-config] - https://gerrit.wikimedia.org/r/715194 (owner: Legoktm)
[04:26:07] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event_sanitized_analytics_delayed.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:41:05] PROBLEM - MariaDB memory on dbstore1007 is CRITICAL: CRIT Memory 95% used. Largest process: mysqld (17612) = 40.5% https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[04:46:35] (PS1) Marostegui: Revert "db1160: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/716096
[04:48:10] (CR) Marostegui: [C: +2] Revert "db1160: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/716096 (owner: Marostegui)
[04:50:32] !log Remove flaggedrevs_stats2 and flaggedrevs_stats on ruwiki - T289050
[04:50:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:50:38] T289050: MyISAM flaggedrevs_stats tables on several sections - https://phabricator.wikimedia.org/T289050
[05:09:53] PROBLEM - mailman3_queue_size on lists1001 is CRITICAL: CRITICAL: 1 mailman3 queues above limits: bounces is 743 (limit: 25) https://wikitech.wikimedia.org/wiki/Mailman/Monitoring https://grafana.wikimedia.org/d/GvuAmuuGk/mailman3
[05:10:03] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (gerrit1001), Fresh: 102 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[05:10:05] wut
[05:37:15] ACKNOWLEDGEMENT - mailman3_queue_size on lists1001 is CRITICAL: CRITICAL: 1 mailman3 queues above limits: bounces is 463 (limit: 25) Legoktm T290223 https://wikitech.wikimedia.org/wiki/Mailman/Monitoring https://grafana.wikimedia.org/d/GvuAmuuGk/mailman3
[05:46:04] RECOVERY - mailman3_queue_size on lists1001 is OK: OK: mailman3 queues are below the limits https://wikitech.wikimedia.org/wiki/Mailman/Monitoring https://grafana.wikimedia.org/d/GvuAmuuGk/mailman3
[06:11:00] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 103 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[06:32:13] (CR) Elukey: update celery worker to allow for celery v5 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/715974 (https://phabricator.wikimedia.org/T288528) (owner: Michael DiPietro)
[06:36:56] (CR) Jgiannelos: [C: -1] maps: import script is overwritting log (1 comment) [puppet] - https://gerrit.wikimedia.org/r/716068 (owner: MSantos)
[06:40:54] PROBLEM - Stale file for node-exporter textfile in eqiad on alert1001 is CRITICAL: cluster=labsnfs file=node_directory_size_bytes.prom instance=labstore1004 job=node site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Stale_file_for_node-exporter_textfile https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile
[06:42:00] SRE, SRE-Access-Requests, Infrastructure-Foundations, Parsoid, and 3 others: Deployers unable to ssh to parse* hosts - https://phabricator.wikimedia.org/T290144 (fgiunchedi) p:Triage→Medium
[06:42:26] SRE, SRE-Access-Requests: Requesting access to Stat1007 for jmando - https://phabricator.wikimedia.org/T289606 (fgiunchedi) a:odimitrijevic→JMando
[06:42:38] SRE, SRE-Access-Requests, Trust-and-Safety: Requesting access to restricted and analytics-privatedata-users for Nathan Forrester - https://phabricator.wikimedia.org/T289259 (fgiunchedi) a:odimitrijevic→NForrester
[06:54:24] (PS1) Marostegui: mariadb: Set pc2007 to spare [puppet] - https://gerrit.wikimedia.org/r/716205 (https://phabricator.wikimedia.org/T289112)
[06:54:56] (PS1) Marostegui: ProductionServices.php: Remove pc2007 [mediawiki-config] - https://gerrit.wikimedia.org/r/716206 (https://phabricator.wikimedia.org/T289112)
[06:55:34] (PS1) Filippo Giunchedi: admin: add katelevan [puppet] - https://gerrit.wikimedia.org/r/716207 (https://phabricator.wikimedia.org/T289258)
[06:56:55] (CR) Marostegui: [C: +2] ProductionServices.php: Remove pc2007 [mediawiki-config] - https://gerrit.wikimedia.org/r/716206 (https://phabricator.wikimedia.org/T289112) (owner: Marostegui)
[06:57:38] (Merged) jenkins-bot: ProductionServices.php: Remove pc2007 [mediawiki-config] - https://gerrit.wikimedia.org/r/716206 (https://phabricator.wikimedia.org/T289112) (owner: Marostegui)
[06:57:43] (PS1) Filippo Giunchedi: admin: add cmaslak [puppet] - https://gerrit.wikimedia.org/r/716208 (https://phabricator.wikimedia.org/T289257)
[06:58:52] (CR) Marostegui: [C: +2] mariadb: Set pc2007 to spare [puppet] - https://gerrit.wikimedia.org/r/716205 (https://phabricator.wikimedia.org/T289112) (owner: Marostegui)
[06:59:02] SRE, SRE-Access-Requests, Analytics: Requesting access to analytics-privatedata-users group for Abban Dunne - https://phabricator.wikimedia.org/T289775 (fgiunchedi) @odimitrijevic hello, a friendly reminder this task is pending approval (cc @Ottomata too)
[06:59:18] !log marostegui@deploy1002 Synchronized wmf-config/ProductionServices.php: Remove pc2007 T289112 (duration: 01m 06s)
[06:59:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:59:23] T289112: decommission pc2007.codfw.wmnet - https://phabricator.wikimedia.org/T289112
[07:00:22] !log Stop mariadb on pc2007 before decommissioning T289112
[07:00:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:00:56] RECOVERY - SSH on analytics1069.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[07:02:51] SRE, LDAP-Access-Requests: Grant Access to Logstash for SimoneThisDot - https://phabricator.wikimedia.org/T289783 (fgiunchedi) @KFrancis I believe your understanding is correct, as per this comment: >>! In T289783#7312746, @dr0ptp4kt wrote: > @jcrespo Yes, confirmed on the contract with This Dot (Simone...
[07:04:32] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[07:04:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:06:16] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[07:06:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:06:32] SRE, SRE-Access-Requests: Requesting access to production shell for Mew Ophaswongse - https://phabricator.wikimedia.org/T290200 (fgiunchedi)
[07:09:34] SRE, SRE-Access-Requests: Requesting access to production shell for Mew Ophaswongse - https://phabricator.wikimedia.org/T290200 (fgiunchedi)
[07:15:44] SRE, SRE-Access-Requests: Requesting access to production shell for Mew Ophaswongse - https://phabricator.wikimedia.org/T290200 (fgiunchedi) @Ottomata @odimitrijevic @thcipriani @DMburugu hello, we're seeking approval for this access request, thank you! @mewoph is access to hadoop data something you'll...
[07:20:01] SRE, SRE-Access-Requests: Requesting access to production shell for Mew Ophaswongse - https://phabricator.wikimedia.org/T290200 (fgiunchedi) p:Triage→Medium
[07:29:28] (PS4) Vgutierrez: haproxy: Basic TLS terminator based on HAProxy [puppet] - https://gerrit.wikimedia.org/r/715932 (https://phabricator.wikimedia.org/T290005)
[07:29:30] (PS2) Vgutierrez: haproxy: Allow configuring TLS options [puppet] - https://gerrit.wikimedia.org/r/716000 (https://phabricator.wikimedia.org/T290005)
[07:44:20] !log Remove flaggedrevs_stats2 and flaggedrevs_stats on arwiki - T289050
[07:44:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:44:24] T289050: MyISAM flaggedrevs_stats tables on several sections - https://phabricator.wikimedia.org/T289050
[07:44:43] !log Remove flaggedrevs_stats2 and flaggedrevs_stats on huwiki - T289050
[07:44:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:47:31] (CR) Filippo Giunchedi: "LGTM overall, see inline" [puppet] - https://gerrit.wikimedia.org/r/715779 (https://phabricator.wikimedia.org/T287142) (owner: Herron)
[07:47:37] (CR) Filippo Giunchedi: [C: +1] thanos: add recording rules for etcd error slo [puppet] - https://gerrit.wikimedia.org/r/714814 (https://phabricator.wikimedia.org/T289615) (owner: Herron)
[07:51:32] !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1064.eqiad.wmnet
[07:51:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:51:57] (PS1) Volans: pylint: remove unnecessary disable comments [software/spicerack] - https://gerrit.wikimedia.org/r/716211
[07:51:59] (PS1) Volans: remote: add support for the installer key [software/spicerack] - https://gerrit.wikimedia.org/r/716212
[07:52:01] (PS1) Volans: puppet: minor improvements [software/spicerack] - https://gerrit.wikimedia.org/r/716213
[07:55:19] (CR) Marostegui: "This is fine with me, but please ping me before deploying this. Even if only will trigger a reload, I would like to disable puppet on our " [puppet] - https://gerrit.wikimedia.org/r/715742 (https://phabricator.wikimedia.org/T290005) (owner: Vgutierrez)
[07:56:00] (PS1) Volans: sre.experimental.reimage: add reimage cookbook [cookbooks] - https://gerrit.wikimedia.org/r/716216 (https://phabricator.wikimedia.org/T205885)
[07:56:10] (CR) Vgutierrez: [V: +1] haproxy: Use systemd::service (1 comment) [puppet] - https://gerrit.wikimedia.org/r/715742 (https://phabricator.wikimedia.org/T290005) (owner: Vgutierrez)
[07:57:42] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1064.eqiad.wmnet
[07:57:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:57:51] SRE, ops-eqiad, DC-Ops: Q1:(Need By: ASAP) rack/setup/install ms-be10[64-67] - https://phabricator.wikimedia.org/T285808 (fgiunchedi) >>! In T285808#7325313, @Cmjohnson wrote: > @fgiunchedi ms-be1064/65/66 are installed and are ready for you to take over, 1067 is not racked yet until we can space in...
[08:07:26] SRE, Wikimedia-Mailing-lists: Outlook/Microsoft bounced all? daily-article-l deliveries for Sept. 2 - https://phabricator.wikimedia.org/T290223 (fgiunchedi) p:Triage→Medium
[08:07:36] SRE, Performance-Team: Switch to encrypted kafka for coal/navtiming/statsv - https://phabricator.wikimedia.org/T290131 (fgiunchedi) p:Triage→Medium
[08:09:37] (PS1) Marostegui: mariadb: Promote db2110 to s4 master [puppet] - https://gerrit.wikimedia.org/r/716217 (https://phabricator.wikimedia.org/T289650)
[08:09:53] (CR) Marostegui: [C: -2] "Wait for the failover day" [puppet] - https://gerrit.wikimedia.org/r/716217 (https://phabricator.wikimedia.org/T289650) (owner: Marostegui)
[08:10:13] (CR) Mvolz: zotero: Remove HTTP service from kubernetes (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/715450 (https://phabricator.wikimedia.org/T255869) (owner: JMeybohm)
[08:12:23] (PS1) Marostegui: wmnet: Switchover db2090 with db2110 [dns] - https://gerrit.wikimedia.org/r/716218 (https://phabricator.wikimedia.org/T289650)
[08:12:51] (CR) Marostegui: [C: -2] "wait for the failover day" [dns] - https://gerrit.wikimedia.org/r/716218 (https://phabricator.wikimedia.org/T289650) (owner: Marostegui)
[08:14:33] (CR) David Caro: update celery worker to allow for celery v5 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/715974 (https://phabricator.wikimedia.org/T288528) (owner: Michael DiPietro)
[08:14:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2140 for upgrade', diff saved to https://phabricator.wikimedia.org/P17135 and previous config saved to /var/cache/conftool/dbconfig/20210902-081436-marostegui.json
[08:14:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:14:58] !log Upgrade db2140
[08:15:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:16:19] (CR) Elukey: [C: +1] "Super ignorant about JS code but the structure of the ops-maint-gcal.js seems sound. I also tested the code with FF and Chrome: the former" [software] - https://gerrit.wikimedia.org/r/715980 (owner: Filippo Giunchedi)
[08:18:19] (PS3) Jbond: admin: utils add helper script for dealing with data.yaml [puppet] - https://gerrit.wikimedia.org/r/715958
[08:19:02] (CR) Jelto: [V: +1] "I migrated the logic of https://gerrit.wikimedia.org/r/plugins/gitiles/operations/gitlab-ansible/+/refs/heads/master/roles/gitlab_server/t" [puppet] - https://gerrit.wikimedia.org/r/712322 (https://phabricator.wikimedia.org/T274463) (owner: Jelto)
[08:19:33] (CR) jerkins-bot: [V: -1] admin: utils add helper script for dealing with data.yaml [puppet] - https://gerrit.wikimedia.org/r/715958 (owner: Jbond)
[08:20:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2140 (re)pooling @ 10%: Slowly repool after reimage T288803', diff saved to https://phabricator.wikimedia.org/P17136 and previous config saved to /var/cache/conftool/dbconfig/20210902-082012-root.json
[08:20:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:20:17] T288803: Upgrade s4 to Buster + MariaDB 10.4 - https://phabricator.wikimedia.org/T288803
[08:29:16] (PS4) Jbond: admin: utils add helper script for dealing with data.yaml [puppet] - https://gerrit.wikimedia.org/r/715958
[08:29:52] (PS1) Muehlenhoff: Remove access for fdans [puppet] - https://gerrit.wikimedia.org/r/716220
[08:34:59] (CR) Muehlenhoff: [C: +2] Remove access for fdans [puppet] - https://gerrit.wikimedia.org/r/716220 (owner: Muehlenhoff)
[08:35:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2140 (re)pooling @ 25%: Slowly repool after reimage T288803', diff saved to https://phabricator.wikimedia.org/P17138 and previous config saved to /var/cache/conftool/dbconfig/20210902-083515-root.json
[08:35:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:35:21] T288803: Upgrade s4 to Buster + MariaDB 10.4 - https://phabricator.wikimedia.org/T288803
[08:37:20] (CR) Majavah: [C: +2] "Re-tested on toolsbeta (since I'm not sure if I did that initially) and works fine. Let's ship it with the next release." [software/tools-webservice] - https://gerrit.wikimedia.org/r/713661 (https://phabricator.wikimedia.org/T278748) (owner: Majavah)
[08:38:01] (Merged) jenkins-bot: Replace distro with os release [software/tools-webservice] - https://gerrit.wikimedia.org/r/713661 (https://phabricator.wikimedia.org/T278748) (owner: Majavah)
[08:38:20] (CR) Jbond: "see inline" [puppet] - https://gerrit.wikimedia.org/r/715988 (https://phabricator.wikimedia.org/T290144) (owner: Dzahn)
[08:41:03] (PS2) Filippo Giunchedi: clinic-duty: add ops-maintenance calendar link generator [software] - https://gerrit.wikimedia.org/r/715980
[08:41:37] (CR) Filippo Giunchedi: "Thank you for the kind works Effie and Luca!" [software] - https://gerrit.wikimedia.org/r/715980 (owner: Filippo Giunchedi)
[08:42:12] (PS1) Majavah: d/changelog: prepare 0.23 release [software/tools-manifest] - https://gerrit.wikimedia.org/r/716221
[08:42:34] (PS1) Jbond: P:envoy::builder: disable timer logging [puppet] - https://gerrit.wikimedia.org/r/716222
[08:43:08] (CR) Jbond: systemd::timer::job: switch monitoring_enabled default to false (1 comment) [puppet] - https://gerrit.wikimedia.org/r/636628 (https://phabricator.wikimedia.org/T265138) (owner: Jbond)
[08:44:30] (CR) Volans: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/715958 (owner: Jbond)
[08:44:47] (CR) jerkins-bot: [V: -1] P:envoy::builder: disable timer logging [puppet] - https://gerrit.wikimedia.org/r/716222 (owner: Jbond)
[08:47:01] (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/716207 (https://phabricator.wikimedia.org/T289258) (owner: Filippo Giunchedi)
[08:48:03] (CR) Jbond: "LGTM" [puppet] - https://gerrit.wikimedia.org/r/716208 (https://phabricator.wikimedia.org/T289257) (owner: Filippo Giunchedi)
[08:49:10] (PS1) Muehlenhoff: Remove access for gilles [puppet] - https://gerrit.wikimedia.org/r/716223
[08:49:37] (CR) Jbond: [C: +2] admin: utils add helper script for dealing with data.yaml [puppet] - https://gerrit.wikimedia.org/r/715958 (owner: Jbond)
[08:50:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2140 (re)pooling @ 50%: Slowly repool after reimage T288803', diff saved to https://phabricator.wikimedia.org/P17140 and previous config saved to /var/cache/conftool/dbconfig/20210902-085019-root.json
[08:50:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:50:24] T288803: Upgrade s4 to Buster + MariaDB 10.4 - https://phabricator.wikimedia.org/T288803
[08:51:37] moritzm: he still has phab admin + active account fyi
[08:53:21] (PS2) Jbond: P:envoy::builder: disable timer logging [puppet] - https://gerrit.wikimedia.org/r/716222
[08:53:58] (PS6) Jbond: P:standard: move admin to its own profile [puppet] - https://gerrit.wikimedia.org/r/715003 (https://phabricator.wikimedia.org/T289661)
[08:54:07] (CR) Filippo Giunchedi: [C: +2] admin: add katelevan [puppet] - https://gerrit.wikimedia.org/r/716207 (https://phabricator.wikimedia.org/T289258) (owner: Filippo Giunchedi)
[08:54:09] (CR) Filippo Giunchedi: [C: +2] admin: add cmaslak [puppet] - https://gerrit.wikimedia.org/r/716208 (https://phabricator.wikimedia.org/T289257) (owner: Filippo Giunchedi)
[08:54:18] RhinosF1: ack, I've already removed the WMF-NDA membeship on Phab, will check about Administrator status
[08:54:23] (PS2) Filippo Giunchedi: admin: add katelevan [puppet] - https://gerrit.wikimedia.org/r/716207 (https://phabricator.wikimedia.org/T289258)
[08:55:05] moritzm: ack, I assume his own team will deal with the cloud vps projects he's in as they all look perf team work related
[08:55:07] !log Remove flaggedrevs_stats2 and flaggedrevs_stats from eowiki,idwiki,plwiki,trwiki - T289050
[08:55:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:55:11] T289050: MyISAM flaggedrevs_stats tables on several sections - https://phabricator.wikimedia.org/T289050
[08:55:30] (CR) Muehlenhoff: [C: +2] Remove access for gilles [puppet] - https://gerrit.wikimedia.org/r/716223 (owner: Muehlenhoff)
[08:55:44] (CR) Filippo Giunchedi: "LGTM, modulo the change for multiple dest hosts in I3964a58b7" [puppet] - https://gerrit.wikimedia.org/r/715638 (https://phabricator.wikimedia.org/T289857) (owner: Legoktm)
[08:56:12] (PS3) Filippo Giunchedi: admin: add katelevan [puppet] - https://gerrit.wikimedia.org/r/716207 (https://phabricator.wikimedia.org/T289258)
[08:57:32] (PS7) Jbond: P:standard: move admin to its own profile [puppet] - https://gerrit.wikimedia.org/r/715003 (https://phabricator.wikimedia.org/T289661)
[08:57:48] (PS2) Filippo Giunchedi: admin: add cmaslak [puppet] - https://gerrit.wikimedia.org/r/716208 (https://phabricator.wikimedia.org/T289257)
[08:57:55] SRE, ops-eqiad, DC-Ops, serviceops: Q1:(Need By: TBD) rack/setup/install kubernetes10[19-22] - https://phabricator.wikimedia.org/T290202 (JMeybohm) The latest kubernetes node there is is kubernetes1017, so I'd say the new nodes should be `kubernetes10[18-21]`. We also need them to run **Stretch**...
[08:58:47] (PS10) Jbond: P:base: move production specific code to there own profile [puppet] - https://gerrit.wikimedia.org/r/714975 (https://phabricator.wikimedia.org/T289661)
[08:59:03] (PS3) Vgutierrez: haproxy: Allow configuring TLS options [puppet] - https://gerrit.wikimedia.org/r/716000 (https://phabricator.wikimedia.org/T290005)
[08:59:03] RhinosF1: ack, indeed. for the standard offboarding we only pull PII-sensitive access like production, Cloud VPS is self-managed by the respective project admins
[08:59:05] (PS1) Vgutierrez: haproxy: STEK support [puppet] - https://gerrit.wikimedia.org/r/716224 (https://phabricator.wikimedia.org/T290005)
[08:59:23] moritzm: ty for confirming
[09:03:59] SRE, SRE-Access-Requests, Trust-and-Safety, Patch-For-Review: Requesting access to restricted and analytics-privatedata-users for Kate Levan - https://phabricator.wikimedia.org/T289258 (fgiunchedi) @KLevan access has been set up, please confirm the following: * SSH access is working * the kerber...
[09:04:28] (CR) Elukey: update celery worker to allow for celery v5 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/715974 (https://phabricator.wikimedia.org/T288528) (owner: Michael DiPietro)
[09:05:14] SRE, SRE-Access-Requests, Trust-and-Safety, Patch-For-Review: Requesting access to restricted and analytics-privatedata-users for Chmielko Maslak - https://phabricator.wikimedia.org/T289257 (fgiunchedi) @chmielkomaslak access has been set up, please confirm the following: * SSH access is workin...
[09:05:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2140 (re)pooling @ 75%: Slowly repool after reimage T288803', diff saved to https://phabricator.wikimedia.org/P17141 and previous config saved to /var/cache/conftool/dbconfig/20210902-090523-root.json
[09:05:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:05:28] T288803: Upgrade s4 to Buster + MariaDB 10.4 - https://phabricator.wikimedia.org/T288803
[09:07:18] (CR) Kormat: [C: +1] mariadb: Promote db2110 to s4 master [puppet] - https://gerrit.wikimedia.org/r/716217 (https://phabricator.wikimedia.org/T289650) (owner: Marostegui)
[09:07:38] (CR) David Caro: [C: +1] haproxy: Use systemd::service [puppet] - https://gerrit.wikimedia.org/r/715742 (https://phabricator.wikimedia.org/T290005) (owner: Vgutierrez)
[09:07:49] (CR) Kormat: [C: +1] wmnet: Switchover db2090 with db2110 [dns] - https://gerrit.wikimedia.org/r/716218 (https://phabricator.wikimedia.org/T289650) (owner: Marostegui)
[09:08:29] SRE, SRE-Access-Requests, Trust-and-Safety, Patch-For-Review: Requesting access to restricted and analytics-privatedata-users for Chmielko Maslak - https://phabricator.wikimedia.org/T289257 (fgiunchedi) a:odimitrijevic→chmielkomaslak
[09:08:41] SRE, SRE-Access-Requests, Trust-and-Safety, Patch-For-Review: Requesting access to restricted and analytics-privatedata-users for Kate Levan - https://phabricator.wikimedia.org/T289258 (fgiunchedi) a:odimitrijevic→KLevan
[09:10:48] (CR) Elukey: [C: +1] "Te-tested with FF/Chrome, all good!" [software] - https://gerrit.wikimedia.org/r/715980 (owner: Filippo Giunchedi)
[09:13:13] (CR) Muehlenhoff: [C: +1] add deployment and perf-roots shell groups to parsoid hosts (1 comment) [puppet] - https://gerrit.wikimedia.org/r/715988 (https://phabricator.wikimedia.org/T290144) (owner: Dzahn)
[09:14:47] (CR) Jbond: [C: +1] add deployment and perf-roots shell groups to parsoid hosts (1 comment) [puppet] - https://gerrit.wikimedia.org/r/715988 (https://phabricator.wikimedia.org/T290144) (owner: Dzahn)
[09:15:43] (CR) JMeybohm: [C: +1] "I'm fine with trying that, but please keep a close eye on the nodes when you merge that as it might hit them hard as well." [puppet] - https://gerrit.wikimedia.org/r/715993 (https://phabricator.wikimedia.org/T284628) (owner: Legoktm)
[09:16:27] (PS1) Muehlenhoff: Add Subbu as approval contact for Parsoid-related groups [puppet] - https://gerrit.wikimedia.org/r/716227
[09:17:12] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (gerrit1001), Fresh: 102 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[09:20:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2140 (re)pooling @ 100%: Slowly repool after reimage T288803', diff saved to https://phabricator.wikimedia.org/P17142 and previous config saved to /var/cache/conftool/dbconfig/20210902-092026-root.json
[09:20:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:20:32] T288803: Upgrade s4 to Buster + MariaDB 10.4 - https://phabricator.wikimedia.org/T288803
[09:23:00] (PS11) Jbond: P:base: move production specific code to there own profile [puppet] - https://gerrit.wikimedia.org/r/714975 (https://phabricator.wikimedia.org/T289661)
[09:24:16] (CR) jerkins-bot: [V: -1] P:base: move production specific code to there own profile [puppet] - https://gerrit.wikimedia.org/r/714975 (https://phabricator.wikimedia.org/T289661) (owner: Jbond)
[09:24:39] (CR) MVernon: dbtools: make mariadb service Wants prometheus-mysqld-exporter (1 comment) [software] - https://gerrit.wikimedia.org/r/715926 (https://phabricator.wikimedia.org/T289488) (owner: MVernon)
[09:26:04] PROBLEM - Host mw2264 is DOWN: PING CRITICAL - Packet loss = 100%
[09:28:10] (CR) Muehlenhoff: [C: +1] "Looks good. Looking at the puppetised service unit it seems we don't actually use any WMF-specific settings/options, so in a later step we" [puppet] - https://gerrit.wikimedia.org/r/715742 (https://phabricator.wikimedia.org/T290005) (owner: Vgutierrez)
[09:31:06] (PS12) Jbond: P:base: move production specific code to there own profile [puppet] - https://gerrit.wikimedia.org/r/714975 (https://phabricator.wikimedia.org/T289661)
[09:31:25] (PS2) Vgutierrez: haproxy: STEK support [puppet] - https://gerrit.wikimedia.org/r/716224 (https://phabricator.wikimedia.org/T290005)
[09:32:23] (CR) jerkins-bot: [V: -1] P:base: move production specific code to there own profile [puppet] - https://gerrit.wikimedia.org/r/714975 (https://phabricator.wikimedia.org/T289661) (owner: Jbond)
[09:33:07] (PS11) Jbond: O:base::resolving: drop the domain keyword and use the domain fact [puppet] - https://gerrit.wikimedia.org/r/690515 (https://phabricator.wikimedia.org/T171498)
[09:34:03] (CR) jerkins-bot: [V: -1] haproxy: STEK support [puppet] - https://gerrit.wikimedia.org/r/716224 (https://phabricator.wikimedia.org/T290005) (owner: Vgutierrez)
[09:37:57] (PS16) Jbond: O:base::resolving: make nameservers mandatory [puppet] - https://gerrit.wikimedia.org/r/690529 (https://phabricator.wikimedia.org/T171498)
[09:38:13] (PS18) Jbond: O:base::resolver: unify resolv.conf templates [puppet] - https://gerrit.wikimedia.org/r/690522 (https://phabricator.wikimedia.org/T171498)
[09:38:39] (PS3) Vgutierrez: haproxy: STEK support [puppet] - https://gerrit.wikimedia.org/r/716224 (https://phabricator.wikimedia.org/T290005)
[09:42:44] (PS1) Muehlenhoff: Remove LDAP access for nnair [puppet] - https://gerrit.wikimedia.org/r/716230
[09:45:15] (PS2) Muehlenhoff: Remove LDAP access for nnair [puppet] - https://gerrit.wikimedia.org/r/716230
[09:45:18] (PS5) Jbond: resolvconf: create new class [puppet] - https://gerrit.wikimedia.org/r/691080 (https://phabricator.wikimedia.org/T171498)
[09:48:20] (CR) jerkins-bot: [V: -1] resolvconf: create new class [puppet] - https://gerrit.wikimedia.org/r/691080 (https://phabricator.wikimedia.org/T171498) (owner: Jbond)
[09:50:52] (CR) Muehlenhoff: [C: +2] Remove LDAP access for nnair [puppet] - https://gerrit.wikimedia.org/r/716230 (owner: Muehlenhoff)
[09:52:15] SRE, Wikimedia-Mailing-lists: Outlook/Microsoft bounced all? daily-article-l deliveries for Sept. 2 - https://phabricator.wikimedia.org/T290223 (fgiunchedi) I took a quick look at `/root/invalid-spam.pck` (needs to be readable by user `list`) and indeed the `X-Spam-Report` header is marked as encoded wit...
[09:55:53] !log hashar@deploy1002 Started deploy [integration/docroot@973ac8a]: Support listing files on index pages - T289196
[09:55:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:55:58] T289196: doc.wm.o says "empty directory" for non-empty demos directory - https://phabricator.wikimedia.org/T289196
[09:56:00] !log hashar@deploy1002 Finished deploy [integration/docroot@973ac8a]: Support listing files on index pages - T289196 (duration: 00m 07s)
[09:56:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2073 for upgrade', diff saved to https://phabricator.wikimedia.org/P17145 and previous config saved to /var/cache/conftool/dbconfig/20210902-095601-marostegui.json
[09:56:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:56:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:57:08] !log Upgrade db2073
[09:57:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:00:04] mvolz: May I have your attention please! Services – Citoid / Zotero. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210902T1000)
[10:00:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2073 (re)pooling @ 10%: Slowly repool after reimage T288803', diff saved to https://phabricator.wikimedia.org/P17147 and previous config saved to /var/cache/conftool/dbconfig/20210902-100007-root.json
[10:00:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:00:12] T288803: Upgrade s4 to Buster + MariaDB 10.4 - https://phabricator.wikimedia.org/T288803
[10:00:21] (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/716227 (owner: Muehlenhoff)
[10:02:54] (CR) Jbond: [C: +1] prometheus: couple mysqld exporter service to mariadb service [puppet] - https://gerrit.wikimedia.org/r/714358 (https://phabricator.wikimedia.org/T289488) (owner: MVernon)
[10:04:44] (CR) MVernon: Fix dnspython 2 compat (1 comment) [debs/python-eventlet] (debian/bullseye) - https://gerrit.wikimedia.org/r/715199 (https://phabricator.wikimedia.org/T283714) (owner: Filippo Giunchedi)
[10:08:08] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=rails site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[10:08:53] (CR) Mvolz: [C: +2] Update Zotero to c4d40f374d2 [deployment-charts] - https://gerrit.wikimedia.org/r/715232 (owner: Mvolz)
[10:09:51] (CR) Hashar: [C: +1] "The svg logo looks fine to me." [mediawiki-config] - https://gerrit.wikimedia.org/r/704376 (https://phabricator.wikimedia.org/T284877) (owner: Juan90264)
[10:10:02] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[10:11:36] (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/712322 (https://phabricator.wikimedia.org/T274463) (owner: Jelto)
[10:11:56] (Merged) jenkins-bot: Update Zotero to c4d40f374d2 [deployment-charts] - https://gerrit.wikimedia.org/r/715232 (owner: Mvolz)
[10:13:28] (CR) Jbond: [C: +1] "lgtm" [software/spicerack] - https://gerrit.wikimedia.org/r/716213 (owner: Volans)
[10:13:39] (CR) Mvolz: [C: +2] Updated outdated helm commands in NOTES.txt [deployment-charts] - https://gerrit.wikimedia.org/r/691599 (owner: Mvolz)
[10:15:06] (CR) Jbond: [C: +1] "lgtm" [software/spicerack] - https://gerrit.wikimedia.org/r/716212 (owner: Volans)
[10:15:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2073 (re)pooling @ 25%: Slowly repool after reimage T288803', diff saved to https://phabricator.wikimedia.org/P17150 and previous config saved to /var/cache/conftool/dbconfig/20210902-101511-root.json
[10:15:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:15:16] T288803: Upgrade s4 to Buster + MariaDB 10.4 - https://phabricator.wikimedia.org/T288803
[10:16:44] (CR) Jbond: [C: +1] "lgtm" [software/spicerack] - https://gerrit.wikimedia.org/r/716211 (owner: Volans)
[10:17:08] (Merged) jenkins-bot: Updated outdated helm commands in NOTES.txt [deployment-charts] - https://gerrit.wikimedia.org/r/691599 (owner: Mvolz)
[10:19:47] !log mvolz@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'zotero' for release 'staging' .
[10:19:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:21:32] (03CR) 10Volans: [C: 03+2] pylint: remove unnecessary disable comments [software/spicerack] - 10https://gerrit.wikimedia.org/r/716211 (owner: 10Volans) [10:21:45] (03CR) 10Volans: [C: 03+2] remote: add support for the installer key [software/spicerack] - 10https://gerrit.wikimedia.org/r/716212 (owner: 10Volans) [10:21:54] (03CR) 10Volans: [C: 03+2] puppet: minor improvements [software/spicerack] - 10https://gerrit.wikimedia.org/r/716213 (owner: 10Volans) [10:22:44] (03CR) 10Kormat: [C: 03+2] prometheus: couple mysqld exporter service to mariadb service [puppet] - 10https://gerrit.wikimedia.org/r/714358 (https://phabricator.wikimedia.org/T289488) (owner: 10MVernon) [10:23:15] !log mvolz@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'zotero' for release 'production' . [10:23:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:23:18] (03PS4) 10Vgutierrez: haproxy: STEK support [puppet] - 10https://gerrit.wikimedia.org/r/716224 (https://phabricator.wikimedia.org/T290005) [10:23:50] (03CR) 10Kormat: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/714358 (https://phabricator.wikimedia.org/T289488) (owner: 10MVernon) [10:23:55] (03PS1) 10Elukey: Introduce the secrets helm chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/716235 [10:24:28] (03CR) 10MSantos: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/643717 (owner: 10Hnowlan) [10:24:39] (03CR) 10Kormat: "It would be good to see a PCC run to make sure this doesn't have any surprises." [puppet] - 10https://gerrit.wikimedia.org/r/715934 (owner: 10MVernon) [10:24:42] !log mvolz@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'zotero' for release 'production' . 
[10:24:44] (03CR) 10MVernon: [C: 03+2] prometheus: couple mysqld exporter service to mariadb service [puppet] - 10https://gerrit.wikimedia.org/r/714358 (https://phabricator.wikimedia.org/T289488) (owner: 10MVernon) [10:24:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:24:58] (03PS2) 10Elukey: Introduce the secrets helm chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/716235 [10:25:15] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/708339 (https://phabricator.wikimedia.org/T287504) (owner: 10Dduvall) [10:26:19] (03CR) 10jerkins-bot: [V: 04-1] Introduce the secrets helm chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/716235 (owner: 10Elukey) [10:26:37] (03PS3) 10MSantos: maps: import script is overwritting log [puppet] - 10https://gerrit.wikimedia.org/r/716068 [10:26:42] (03CR) 10Jbond: "LGTM but seem comment on task, im not sure its a good idea to apply this as it just hides a bigger underling issue" [puppet] - 10https://gerrit.wikimedia.org/r/715220 (https://phabricator.wikimedia.org/T165885) (owner: 10Dzahn) [10:26:44] (03CR) 10MSantos: maps: import script is overwritting log (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/716068 (owner: 10MSantos) [10:28:39] (03Merged) 10jenkins-bot: pylint: remove unnecessary disable comments [software/spicerack] - 10https://gerrit.wikimedia.org/r/716211 (owner: 10Volans) [10:28:41] (03Merged) 10jenkins-bot: remote: add support for the installer key [software/spicerack] - 10https://gerrit.wikimedia.org/r/716212 (owner: 10Volans) [10:28:43] (03Merged) 10jenkins-bot: puppet: minor improvements [software/spicerack] - 10https://gerrit.wikimedia.org/r/716213 (owner: 10Volans) [10:30:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2073 (re)pooling @ 50%: Slowly repool after reimage T288803', diff saved to https://phabricator.wikimedia.org/P17152 and previous config saved to /var/cache/conftool/dbconfig/20210902-103014-root.json 
[10:30:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:30:20] T288803: Upgrade s4 to Buster + MariaDB 10.4 - https://phabricator.wikimedia.org/T288803 [10:33:49] (03PS1) 10MSantos: maps: re-enable OSM sync and tile generation in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/716239 [10:36:05] (03CR) 10Elukey: [C: 04-1] "still WIP :)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/716235 (owner: 10Elukey) [10:38:37] !log REINDEX database gis in maps1009 while it's in depooled state [10:38:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:40:03] (03CR) 10MSantos: [C: 04-1] "Hold until REINDEX is done in maps1009." [puppet] - 10https://gerrit.wikimedia.org/r/716239 (owner: 10MSantos) [10:43:58] (03PS2) 10Urbanecm: dewiki: Enable Growth features for 30% of newcomers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715957 (https://phabricator.wikimedia.org/T288420) [10:45:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2073 (re)pooling @ 75%: Slowly repool after reimage T288803', diff saved to https://phabricator.wikimedia.org/P17155 and previous config saved to /var/cache/conftool/dbconfig/20210902-104518-root.json [10:45:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:45:24] T288803: Upgrade s4 to Buster + MariaDB 10.4 - https://phabricator.wikimedia.org/T288803 [10:45:54] (03PS3) 10Urbanecm: dewiki: Enable Growth features for 30% of newcomers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715957 (https://phabricator.wikimedia.org/T288420) [10:46:25] apergos: this ^^ can be used for the training btw, if someone signed up. Otherwise I'll just roll it out myself :). [10:46:50] no one has signed up [10:47:10] okay, good to know. 
[10:47:36] there are no trainees signed up [10:47:49] (sorry for the repeat, trying to multitask) [10:47:58] np [10:48:11] there are no patches in the window either [10:48:15] so you can self-serve [10:48:48] :-) [10:49:00] (03PS5) 10Vgutierrez: haproxy: STEK support [puppet] - 10https://gerrit.wikimedia.org/r/716224 (https://phabricator.wikimedia.org/T290005) [10:51:03] (03CR) 10JMeybohm: zotero: Remove HTTP service from kubernetes (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/715450 (https://phabricator.wikimedia.org/T255869) (owner: 10JMeybohm) [10:52:08] unfortunately there is ameeting just before this window that usually extends til right before the irc ping [10:52:30] so I am usually trying to check patches, training request queue etc during the second half of that meeting :-/ I'm sure people in it love me for that... [10:53:09] don't forget to add your patch to the dpeloyment calendar just for the record! [10:53:53] sure! [10:56:42] 10SRE, 10Traffic, 10Patch-For-Review: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10Vgutierrez) [10:57:28] (03PS3) 10MVernon: dbtools: make mariadb service Wants prometheus-mysqld-exporter [software] - 10https://gerrit.wikimedia.org/r/715926 (https://phabricator.wikimedia.org/T289488) [10:58:53] (03CR) 10MVernon: dbtools: make mariadb service Wants prometheus-mysqld-exporter (031 comment) [software] - 10https://gerrit.wikimedia.org/r/715926 (https://phabricator.wikimedia.org/T289488) (owner: 10MVernon) [11:00:04] Amir1, Lucas_WMDE, and apergos: #bothumor I � Unicode. All rise for EU Backport and Config training deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210902T1100). 
[11:00:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2073 (re)pooling @ 100%: Slowly repool after reimage T288803', diff saved to https://phabricator.wikimedia.org/P17158 and previous config saved to /var/cache/conftool/dbconfig/20210902-110022-root.json [11:00:23] * urbanecm deploying [11:00:25] yeah yeah see discussion above [11:00:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:00:27] T288803: Upgrade s4 to Buster + MariaDB 10.4 - https://phabricator.wikimedia.org/T288803 [11:00:34] thank you anyways jouncebot [11:00:35] (03CR) 10Urbanecm: [C: 03+2] dewiki: Enable Growth features for 30% of newcomers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715957 (https://phabricator.wikimedia.org/T288420) (owner: 10Urbanecm) [11:01:29] (03Merged) 10jenkins-bot: dewiki: Enable Growth features for 30% of newcomers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715957 (https://phabricator.wikimedia.org/T288420) (owner: 10Urbanecm) [11:01:47] if only core merged that fast [11:02:25] hehe [11:02:46] i'm afraid k8s building will add (well, actually added) some time to the merge time for config :/ [11:04:10] !log metawiki: Server-side page move from VRT -> Volunteer Response Team (T290083) [11:04:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:04:15] T290083: Move VRT to Volunteer Response Team on metawiki - https://phabricator.wikimedia.org/T290083 [11:04:45] (03PS2) 10MVernon: mariadb::misc::db_inventory: use mariadb::service [puppet] - 10https://gerrit.wikimedia.org/r/715934 [11:05:14] (03CR) 10MVernon: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/715934 (owner: 10MVernon) [11:05:35] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
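(Editor's note on the db2073 repool entries above.) The `dbctl commit` messages show the replica being brought back in stages: 10%, 25%, 50%, 75%, then 100%, roughly fifteen minutes apart. The sketch below only reads those steps back out of the log text and measures the gaps between them. It does not call dbctl or conftool. The regular expression and the function names `parse_repool_steps` and `step_gaps_seconds` are assumptions made for illustration, and the sample lines are shortened copies of the entries above.

```python
import re
from datetime import datetime

# Matches SAL entries of the form seen in this log, e.g.
# [10:15:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2073 (re)pooling @ 25%: ...
REPOOL_RE = re.compile(
    r"\[(?P<time>\d{2}:\d{2}:\d{2})\] !log \S+ dbctl commit \(dc=all\): "
    r"'(?P<host>db\d+) \(re\)pooling @ (?P<pct>\d+)%"
)


def parse_repool_steps(log_text):
    """Return (host, percent, time) tuples in the order they appear."""
    steps = []
    for match in REPOOL_RE.finditer(log_text):
        when = datetime.strptime(match.group("time"), "%H:%M:%S")
        steps.append((match.group("host"), int(match.group("pct")), when))
    return steps


def step_gaps_seconds(steps, host):
    """Seconds between consecutive repool steps for one host."""
    times = [when for h, _pct, when in steps if h == host]
    return [int((b - a).total_seconds()) for a, b in zip(times, times[1:])]


# Shortened copies of the db2073 entries from this log.
SAMPLE = """\
[10:00:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2073 (re)pooling @ 10%: Slowly repool after reimage T288803'
[10:15:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2073 (re)pooling @ 25%: Slowly repool after reimage T288803'
[10:30:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2073 (re)pooling @ 50%: Slowly repool after reimage T288803'
[10:45:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2073 (re)pooling @ 75%: Slowly repool after reimage T288803'
[11:00:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2073 (re)pooling @ 100%: Slowly repool after reimage T288803'
"""

if __name__ == "__main__":
    steps = parse_repool_steps(SAMPLE)
    for host, pct, when in steps:
        print(f"{when:%H:%M:%S} {host} {pct}%")
    print(step_gaps_seconds(steps, "db2073"))
```

The same pattern should match the db2106 and db2119 repool entries later in the log, since they use the same message format.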
[11:05:37] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 3ce5d80eb6f8ad720b5d9c0b6ad7840dd869735e: dewiki: Enable Growth features for 30% of newcomers (T288420) (duration: 01m 58s) [11:05:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:05:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:05:42] T288420: Deploy Growth features on German Wikipedia - https://phabricator.wikimedia.org/T288420 [11:06:01] mw2264 timed outs [11:07:10] and https://config-master.wikimedia.org/pybal/codfw/jobrunner says it's running? [11:07:16] *pooled [11:07:20] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [11:07:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:07:47] `ssh: connect to host mw2264.codfw.wmnet port 22: Connection timed out` <== this is what scap says [11:10:20] huh [11:10:28] * apergos hunts around in phab [11:10:51] also crossposted to -sre [11:11:59] nada there [11:12:01] nothing in SAL [11:12:25] 12:26:04 <+icinga-wm> PROBLEM - Host mw2264 is DOWN: PING CRITICAL - Packet loss = 100% [11:12:43] no surprise it doesn't respond to ssh then [11:13:00] (times are in my local time, so about 2 hours ago) [11:13:57] no response to ping from bast200x [11:14:13] ah I see there was an alert [11:14:41] who is on clinic duty? mmm [11:14:55] hey godog ^^ care to look or poke someone? [11:16:39] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to me!" 
[puppet] - 10https://gerrit.wikimedia.org/r/715728 (https://phabricator.wikimedia.org/T263578) (owner: 10Jbond) [11:18:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2106 for upgrade', diff saved to https://phabricator.wikimedia.org/P17160 and previous config saved to /var/cache/conftool/dbconfig/20210902-111843-marostegui.json [11:18:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:21:13] mw2264 is even unreachable on the mgmt port, will need DC ops [11:21:39] it could be depooled for now I guess [11:21:49] (note my nice use of the passive voice there :-p) [11:22:26] !log jmm@puppetmaster1001 conftool action : set/pooled=inactive; selector: name=mw2264.codfw.wmnet [11:22:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:22:56] yeah, marked it as inactive in conftool [11:24:01] urbanecm: care to do a test scap? ^^ [11:24:17] sure [11:25:02] 10ops-codfw, 10serviceops: mw2264 went down - https://phabricator.wikimedia.org/T290242 (10MoritzMuehlenhoff) [11:26:00] !log urbanecm@deploy1002 Synchronized README: testing scap (duration: 01m 06s) [11:26:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:26:07] completed without issues this time [11:26:09] thanks moritzm [11:26:14] \o/ [11:26:35] that's it for the window I guess? [11:26:39] looks so [11:26:55] wandering off them, see you same time same bat channel next week! [11:27:03] ttyl! 
[11:27:04] *off then [11:28:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2106 (re)pooling @ 10%: Slowly repool after reimage T288803', diff saved to https://phabricator.wikimedia.org/P17161 and previous config saved to /var/cache/conftool/dbconfig/20210902-112812-root.json [11:28:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:17] T288803: Upgrade s4 to Buster + MariaDB 10.4 - https://phabricator.wikimedia.org/T288803 [11:35:57] (03CR) 10Jbond: [C: 04-1] "-1 is marked looks like a minor error, everything else just nits or clarification" [cookbooks] - 10https://gerrit.wikimedia.org/r/716216 (https://phabricator.wikimedia.org/T205885) (owner: 10Volans) [11:37:36] PROBLEM - MariaDB Replica Lag: s1 on db1099 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 3158.75 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:40:27] Emperor: ^ is that from your tests? [11:40:43] if you are still testing, you might want to extend the downtime [11:40:58] (03CR) 10Mvolz: [C: 03+1] "cool, thanks ^-^" [deployment-charts] - 10https://gerrit.wikimedia.org/r/715450 (https://phabricator.wikimedia.org/T255869) (owner: 10JMeybohm) [11:43:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2106 (re)pooling @ 25%: Slowly repool after reimage T288803', diff saved to https://phabricator.wikimedia.org/P17162 and previous config saved to /var/cache/conftool/dbconfig/20210902-114315-root.json [11:43:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:43:21] T288803: Upgrade s4 to Buster + MariaDB 10.4 - https://phabricator.wikimedia.org/T288803 [11:47:06] (03PS8) 10Jbond: admin: drop dependencies between adminuser and admingroup [puppet] - 10https://gerrit.wikimedia.org/r/715728 (https://phabricator.wikimedia.org/T263578) [11:48:47] (03CR) 10Alexandros Kosiaris: [C: 03+1] wmflib: add 'aliases' to Service [puppet] - 10https://gerrit.wikimedia.org/r/714965 (owner: 
10Filippo Giunchedi) [11:49:39] (03CR) 10Alexandros Kosiaris: [C: 03+1] hieradata: add aliases for a few services [puppet] - 10https://gerrit.wikimedia.org/r/714966 (owner: 10Filippo Giunchedi) [11:51:59] (03CR) 10Hnowlan: [C: 03+2] maps: import script is overwritting log [puppet] - 10https://gerrit.wikimedia.org/r/716068 (owner: 10MSantos) [11:53:19] marostegui: no, we're done testing, and I'd put everything back as it was :( [11:54:09] Emperor: Replication is stopped, if you restarted mysqld, you need to start it again [11:54:22] oh, bother. [11:54:29] You can check it by running: show slave status\G and checking for Seconds Behind the Mater [11:54:38] You'll see it is NULL [11:54:45] To start it just run: start slave; [11:55:04] And if you run show slave status\G again, you'll see how NULL has gone into seconds and it will be decreasing [11:56:07] err, how do I connect to mysql on a multi-instance server? I presume I need some runes to tell mysql which I want to talk to? [11:56:44] ah yes, you need to connect to the specific socket using: mysql -S /run/mysqld/mysqld.s1.sock [11:56:58] You should check all the instances and double check if replication is running on all [11:57:05] (03CR) 10Muehlenhoff: "Looks good, one thing to update inline." 
[puppet] - 10https://gerrit.wikimedia.org/r/715003 (https://phabricator.wikimedia.org/T289661) (owner: 10Jbond) [11:57:15] You can see the instances on the MOTD or simply by ls /run/mysqld/ [11:58:04] I only touched s1 on that system (which is now replicating and catching up) [11:58:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2106 (re)pooling @ 50%: Slowly repool after reimage T288803', diff saved to https://phabricator.wikimedia.org/P17163 and previous config saved to /var/cache/conftool/dbconfig/20210902-115819-root.json [11:58:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:58:25] T288803: Upgrade s4 to Buster + MariaDB 10.4 - https://phabricator.wikimedia.org/T288803 [11:58:27] but double-checking s8 just in case [11:58:30] Emperor: Good! [11:58:40] s8 is up to date [11:58:55] Excellent, so s1 will eventually catch up [11:59:02] I'll ACK the alert [11:59:09] Sounds good - thank you [12:00:59] ACKNOWLEDGEMENT - MariaDB Replica Lag: s1 on db1099 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 2298.30 seconds MVernon mea culpa, replication restarted https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [12:04:47] (03PS5) 10Michael DiPietro: update celery worker to allow for celery v5 [puppet] - 10https://gerrit.wikimedia.org/r/715974 (https://phabricator.wikimedia.org/T288528) [12:06:38] RECOVERY - MariaDB Replica Lag: s1 on db1099 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [12:08:35] Emperor: ^ \o/ [12:13:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2106 (re)pooling @ 75%: Slowly repool after reimage T288803', diff saved to https://phabricator.wikimedia.org/P17164 and previous config saved to /var/cache/conftool/dbconfig/20210902-121323-root.json [12:13:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:13:28] T288803: Upgrade s4 to Buster + MariaDB 
10.4 - https://phabricator.wikimedia.org/T288803 [12:17:42] (03CR) 10Filippo Giunchedi: [C: 03+2] wmflib: add 'aliases' to Service [puppet] - 10https://gerrit.wikimedia.org/r/714965 (owner: 10Filippo Giunchedi) [12:18:01] apergos: looks like we're set re: faulty mw host (?) [12:18:06] (03CR) 10Filippo Giunchedi: [C: 03+2] hieradata: add aliases for a few services [puppet] - 10https://gerrit.wikimedia.org/r/714966 (owner: 10Filippo Giunchedi) [12:18:35] godog, yeah morti tzm was around and took care of the immediate issue + made a task, all good! [12:18:42] (03CR) 10Filippo Giunchedi: [C: 03+2] pontoon: extend service_names to include aliases [puppet] - 10https://gerrit.wikimedia.org/r/714968 (owner: 10Filippo Giunchedi) [12:19:21] apergos: cheers [12:20:16] (03CR) 10Muehlenhoff: "Some random comments, didn't make a full pass yet." [puppet] - 10https://gerrit.wikimedia.org/r/714975 (https://phabricator.wikimedia.org/T289661) (owner: 10Jbond) [12:21:07] (03CR) 10Michael DiPietro: "New pcc output:" [puppet] - 10https://gerrit.wikimedia.org/r/715974 (https://phabricator.wikimedia.org/T288528) (owner: 10Michael DiPietro) [12:24:10] (03CR) 10Elukey: update celery worker to allow for celery v5 (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/715974 (https://phabricator.wikimedia.org/T288528) (owner: 10Michael DiPietro) [12:25:44] PROBLEM - Check systemd state on grafana2001 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-var-lib-grafana.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:27:08] (03CR) 10Elukey: "Michael, there is still one blocker that I'd like to fix before we proceed. 
From the pcc it seems that the systemd unit for ORES gets chan" [puppet] - 10https://gerrit.wikimedia.org/r/715974 (https://phabricator.wikimedia.org/T288528) (owner: 10Michael DiPietro) [12:28:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2106 (re)pooling @ 100%: Slowly repool after reimage T288803', diff saved to https://phabricator.wikimedia.org/P17165 and previous config saved to /var/cache/conftool/dbconfig/20210902-122826-root.json [12:28:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:28:31] (03CR) 10Kormat: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/715934 (owner: 10MVernon) [12:28:32] T288803: Upgrade s4 to Buster + MariaDB 10.4 - https://phabricator.wikimedia.org/T288803 [12:29:02] (03CR) 10Kormat: [C: 03+1] dbtools: make mariadb service Wants prometheus-mysqld-exporter [software] - 10https://gerrit.wikimedia.org/r/715926 (https://phabricator.wikimedia.org/T289488) (owner: 10MVernon) [12:31:32] RECOVERY - Check systemd state on grafana2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:32:13] (03CR) 10Alexandros Kosiaris: [C: 04-1] blubberoid: Remove HTTP service from kubernetes (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/715447 (https://phabricator.wikimedia.org/T236017) (owner: 10JMeybohm) [12:32:44] (03CR) 10Elukey: "Tobias: we could roll this thing out with a simple procedure(puppet disabled on all nodes, then one by one depool run-puppet repool), what" [puppet] - 10https://gerrit.wikimedia.org/r/715974 (https://phabricator.wikimedia.org/T288528) (owner: 10Michael DiPietro) [12:32:57] (03CR) 10Alexandros Kosiaris: [C: 03+1] blubberoid: Remove HTTP service from kubernetes (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/715447 (https://phabricator.wikimedia.org/T236017) (owner: 10JMeybohm) [12:33:24] (03CR) 10Alexandros Kosiaris: [C: 03+1] termbox: Remove HTTP service 
from kubernetes [deployment-charts] - 10https://gerrit.wikimedia.org/r/715448 (https://phabricator.wikimedia.org/T254581) (owner: 10JMeybohm) [12:33:48] (03CR) 10Alexandros Kosiaris: [C: 03+1] citoid: Remove HTTP service from kubernetes [deployment-charts] - 10https://gerrit.wikimedia.org/r/715449 (https://phabricator.wikimedia.org/T255868) (owner: 10JMeybohm) [12:34:14] 10SRE, 10SRE-Access-Requests: Requesting access to production shell for Mew Ophaswongse - https://phabricator.wikimedia.org/T290200 (10thcipriani) Approved for `restricted` [12:38:03] (03CR) 10Alexandros Kosiaris: [C: 03+1] zotero: Remove HTTP service from kubernetes (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/715450 (https://phabricator.wikimedia.org/T255869) (owner: 10JMeybohm) [12:38:15] (03CR) 10Alexandros Kosiaris: [C: 03+1] mathoid: Remove HTTP service from kubernetes [deployment-charts] - 10https://gerrit.wikimedia.org/r/715451 (https://phabricator.wikimedia.org/T255875) (owner: 10JMeybohm) [12:39:09] (03CR) 10Alexandros Kosiaris: [C: 03+1] wikifeeds: Remove HTTP service from kubernetes [deployment-charts] - 10https://gerrit.wikimedia.org/r/715452 (https://phabricator.wikimedia.org/T255878) (owner: 10JMeybohm) [12:39:50] (03CR) 10Alexandros Kosiaris: [C: 03+1] "Aside from Jelto's comment, rest LGTM." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/715453 (https://phabricator.wikimedia.org/T255879) (owner: 10JMeybohm) [12:41:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2119 for upgrade', diff saved to https://phabricator.wikimedia.org/P17166 and previous config saved to /var/cache/conftool/dbconfig/20210902-124102-marostegui.json [12:41:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:41:32] (03PS8) 10Jbond: P:standard: move admin to its own profile [puppet] - 10https://gerrit.wikimedia.org/r/715003 (https://phabricator.wikimedia.org/T289661) [12:41:44] (03CR) 10Jbond: "fixed thanks" [puppet] - 10https://gerrit.wikimedia.org/r/715003 (https://phabricator.wikimedia.org/T289661) (owner: 10Jbond) [12:42:07] (03PS6) 10Michael DiPietro: update celery worker to allow for celery v5 [puppet] - 10https://gerrit.wikimedia.org/r/715974 (https://phabricator.wikimedia.org/T288528) [12:42:14] !log Upgrade db2119 [12:42:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:44:20] (03PS7) 10Michael DiPietro: update celery worker to allow for celery v5 [puppet] - 10https://gerrit.wikimedia.org/r/715974 (https://phabricator.wikimedia.org/T288528) [12:44:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2119 (re)pooling @ 10%: Slowly repool after reimage T288803', diff saved to https://phabricator.wikimedia.org/P17167 and previous config saved to /var/cache/conftool/dbconfig/20210902-124434-root.json [12:44:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:44:40] T288803: Upgrade s4 to Buster + MariaDB 10.4 - https://phabricator.wikimedia.org/T288803 [12:44:51] (03PS4) 10Alexandros Kosiaris: facter networking: override the networking.ip6 fact [puppet] - 10https://gerrit.wikimedia.org/r/715943 (owner: 10Jbond) [12:44:53] (03PS3) 10Alexandros Kosiaris: facter networking: filter out cali/tap interfaces [puppet] - 10https://gerrit.wikimedia.org/r/715949 
(https://phabricator.wikimedia.org/T265904) (owner: 10Jbond) [12:45:19] (03CR) 10Alexandros Kosiaris: [C: 03+1] "I went ahead and amended the commit message a bit, I hope that's ok. +1 from me now." [puppet] - 10https://gerrit.wikimedia.org/r/715949 (https://phabricator.wikimedia.org/T265904) (owner: 10Jbond) [12:45:27] (03CR) 10JMeybohm: [C: 04-1] "I would assume it's possible to merge this prior to having the deploy users created but you should definitely double check." [deployment-charts] - 10https://gerrit.wikimedia.org/r/715498 (https://phabricator.wikimedia.org/T251305) (owner: 10Jelto) [12:45:57] (03PS8) 10Michael DiPietro: update celery worker to allow for celery v5 [puppet] - 10https://gerrit.wikimedia.org/r/715974 (https://phabricator.wikimedia.org/T288528) [12:50:10] (03CR) 10Alexandros Kosiaris: "Couple of nits (one echoing Keith), but otherwise LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/715943 (owner: 10Jbond) [12:51:01] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/715003 (https://phabricator.wikimedia.org/T289661) (owner: 10Jbond) [12:51:48] 10SRE, 10Acme-chief, 10Traffic: Support OCSP stapling from prefetched responses in HAProxy - https://phabricator.wikimedia.org/T290249 (10Vgutierrez) [12:52:07] 10SRE, 10Acme-chief, 10Traffic: Support OCSP stapling from prefetched responses in HAProxy - https://phabricator.wikimedia.org/T290249 (10Vgutierrez) p:05Triage→03Medium [12:53:07] (03PS3) 10JMeybohm: termbox: Remove HTTP service from kubernetes [deployment-charts] - 10https://gerrit.wikimedia.org/r/715448 (https://phabricator.wikimedia.org/T254581) [12:53:12] (03PS3) 10JMeybohm: citoid: Remove HTTP service from kubernetes [deployment-charts] - 10https://gerrit.wikimedia.org/r/715449 (https://phabricator.wikimedia.org/T255868) [12:53:17] (03PS3) 10JMeybohm: zotero: Remove HTTP service from kubernetes [deployment-charts] - 10https://gerrit.wikimedia.org/r/715450 (https://phabricator.wikimedia.org/T255869) 
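(Editor's note on the db1099 replication exchange earlier in this log, 11:54 to 11:58.) The advice there was to connect to the right instance with `mysql -S /run/mysqld/mysqld.s1.sock`, run `show slave status\G`, and read the seconds-behind-master value: NULL means replication is stopped, and after `start slave;` it becomes a number that decreases. The sketch below parses the vertical (`\G`) output and classifies it. It is a minimal illustration, not the team's tooling. The status text in `STOPPED` and `CATCHING_UP` is made up for the example and was not captured from db1099. The socket path pattern for sections other than s1 is an assumption extrapolated from the one path quoted in the log, and all function names are hypothetical.

```python
def socket_for(section):
    # Assumption: other sections follow the s1 path quoted in the log.
    return f"/run/mysqld/mysqld.{section}.sock"


def status_command(section):
    # Argument list for the mysql client; -S selects the socket and
    # -e runs one statement. Returned only, never executed here.
    return ["mysql", "-S", socket_for(section), "-e", "SHOW SLAVE STATUS\\G"]


def parse_vertical_status(text):
    """Turn 'Key: value' lines of \\G output into a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition(":")
        if sep and not key.startswith("*"):
            fields[key.strip()] = value.strip()
    return fields


def replication_state(fields):
    """Classify replication from Seconds_Behind_Master."""
    lag = fields.get("Seconds_Behind_Master")
    if lag is None or lag == "NULL":
        return "stopped"
    if int(lag) == 0:
        return "caught up"
    return f"lagging by {lag}s"


# Illustrative output only (not taken from a real host).
STOPPED = """\
*************************** 1. row ***************************
     Slave_IO_Running: No
    Slave_SQL_Running: No
Seconds_Behind_Master: NULL
"""

CATCHING_UP = """\
*************************** 1. row ***************************
     Slave_IO_Running: Yes
    Slave_SQL_Running: Yes
Seconds_Behind_Master: 2298
"""

if __name__ == "__main__":
    print(status_command("s1"))
    print(replication_state(parse_vertical_status(STOPPED)))
    print(replication_state(parse_vertical_status(CATCHING_UP)))
```

On a multi-instance host the check would be repeated per socket, which matches the suggestion in the exchange to verify every instance listed under /run/mysqld/.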
[12:53:22] (03PS3) 10JMeybohm: mathoid: Remove HTTP service from kubernetes [deployment-charts] - 10https://gerrit.wikimedia.org/r/715451 (https://phabricator.wikimedia.org/T255875) [12:53:27] (03PS3) 10JMeybohm: wikifeeds: Remove HTTP service from kubernetes [deployment-charts] - 10https://gerrit.wikimedia.org/r/715452 (https://phabricator.wikimedia.org/T255878) [12:53:32] (03PS3) 10JMeybohm: cxserver: Remove HTTP service from kubernetes [deployment-charts] - 10https://gerrit.wikimedia.org/r/715453 (https://phabricator.wikimedia.org/T255879) [12:54:19] (03PS4) 10JMeybohm: cxserver: Remove HTTP service from kubernetes [deployment-charts] - 10https://gerrit.wikimedia.org/r/715453 (https://phabricator.wikimedia.org/T255879) [12:54:59] (03PS6) 10Jbond: resolvconf: create new class [puppet] - 10https://gerrit.wikimedia.org/r/691080 (https://phabricator.wikimedia.org/T171498) [12:55:17] 10SRE, 10Discovery-Search: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10Reedy) If this is an actual blocker to {T263142}, can we put it in the tree as such? Thanks! 
[12:55:37] !log disable puppet fleet wide to roll out 715728 [12:55:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:55:41] (03CR) 10JMeybohm: cxserver: Remove HTTP service from kubernetes (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/715453 (https://phabricator.wikimedia.org/T255879) (owner: 10JMeybohm) [12:55:51] (03CR) 10jerkins-bot: [V: 04-1] resolvconf: create new class [puppet] - 10https://gerrit.wikimedia.org/r/691080 (https://phabricator.wikimedia.org/T171498) (owner: 10Jbond) [12:56:50] 10SRE, 10Discovery-Search: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10Reedy) [12:58:12] (03CR) 10Jbond: [C: 03+2] admin: drop dependencies between adminuser and admingroup [puppet] - 10https://gerrit.wikimedia.org/r/715728 (https://phabricator.wikimedia.org/T263578) (owner: 10Jbond) [12:59:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2119 (re)pooling @ 25%: Slowly repool after reimage T288803', diff saved to https://phabricator.wikimedia.org/P17168 and previous config saved to /var/cache/conftool/dbconfig/20210902-125937-root.json [12:59:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:59:43] T288803: Upgrade s4 to Buster + MariaDB 10.4 - https://phabricator.wikimedia.org/T288803 [12:59:50] (03PS3) 10Elukey: Introduce the secrets helm chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/716235 [13:00:56] (03CR) 10jerkins-bot: [V: 04-1] Introduce the secrets helm chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/716235 (owner: 10Elukey) [13:06:18] 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10decommission-hardware, 10netops: Decommission asw-b-eqiad - https://phabricator.wikimedia.org/T208788 (10Jclark-ctr) [13:07:21] 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10decommission-hardware, 10netops: Decommission asw-b-eqiad - https://phabricator.wikimedia.org/T208788 (10Jclark-ctr) 
05Open→03Resolved Performed factory reset and removed from racks all asw switches in row B [13:07:35] (03PS4) 10Elukey: Introduce the secrets helm chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/716235 [13:09:01] (03CR) 10jerkins-bot: [V: 04-1] Introduce the secrets helm chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/716235 (owner: 10Elukey) [13:09:05] 10SRE, 10ops-codfw, 10serviceops: mw2264 went down - https://phabricator.wikimedia.org/T290242 (10Dzahn) ` Record: 5 Date/Time: 08/30/2021 10:22:49 Source: system Severity: Non-Critical Description: Correctable memory error rate exceeded for DIMM_B1. -------------------------------------------... [13:09:06] !lof reimage sretest1001 [13:11:04] jbond: you have a typo in that command ftr [13:12:41] (03CR) 10MVernon: [C: 03+2] mariadb::misc::db_inventory: use mariadb::service [puppet] - 10https://gerrit.wikimedia.org/r/715934 (owner: 10MVernon) [13:13:14] 10SRE, 10Discovery-Search: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10MoritzMuehlenhoff) Two things here: - Java 17 is included in Bullseye, but 17 is not yet a GA release. As such there won't be any updates in Debian until that changes (plus relyi... 
[13:14:10] urbanecm: thanks :) [13:14:13] !log reimage sretest1001 [13:14:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2119 (re)pooling @ 50%: Slowly repool after reimage T288803', diff saved to https://phabricator.wikimedia.org/P17169 and previous config saved to /var/cache/conftool/dbconfig/20210902-131441-root.json [13:14:44] !log reimage sretest1002 (not sretest1001) [13:14:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:46] T288803: Upgrade s4 to Buster + MariaDB 10.4 - https://phabricator.wikimedia.org/T288803 [13:14:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:53] two typoes :facepalm: [13:15:06] (03CR) 10MVernon: [C: 03+2] dbtools: make mariadb service Wants prometheus-mysqld-exporter [software] - 10https://gerrit.wikimedia.org/r/715926 (https://phabricator.wikimedia.org/T289488) (owner: 10MVernon) [13:16:19] 10SRE, 10ops-codfw, 10serviceops: mw2264 went down - https://phabricator.wikimedia.org/T290242 (10Dzahn) @Papaul Hi, this host went down as described above and I pasted the relevant entries from 'racadm getsel' above. As you can see it looks like the DIMM B1 is broken. It is in rack B3 and purchase date wa... 
[13:16:42] 10SRE, 10ops-codfw, 10serviceops: mw2264 went down - https://phabricator.wikimedia.org/T290242 (10Dzahn) p:05Triage→03Medium [13:20:46] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 103 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [13:21:54] (03CR) 10JMeybohm: [C: 03+1] "This LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/712368 (https://phabricator.wikimedia.org/T271421) (owner: 10Vgutierrez) [13:22:12] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest1002.eqiad.wmnet with reason: REIMAGE [13:22:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:37] (03CR) 10JMeybohm: [C: 03+2] citoid: Remove HTTP service from kubernetes [deployment-charts] - 10https://gerrit.wikimedia.org/r/715449 (https://phabricator.wikimedia.org/T255868) (owner: 10JMeybohm) [13:23:42] (03CR) 10JMeybohm: [C: 03+2] termbox: Remove HTTP service from kubernetes [deployment-charts] - 10https://gerrit.wikimedia.org/r/715448 (https://phabricator.wikimedia.org/T254581) (owner: 10JMeybohm) [13:23:46] (03CR) 10JMeybohm: [C: 03+2] blubberoid: Remove HTTP service from kubernetes [deployment-charts] - 10https://gerrit.wikimedia.org/r/715447 (https://phabricator.wikimedia.org/T236017) (owner: 10JMeybohm) [13:24:23] !log jbond@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on sretest1002.eqiad.wmnet with reason: REIMAGE [13:24:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:28] (03Merged) 10jenkins-bot: blubberoid: Remove HTTP service from kubernetes [deployment-charts] - 10https://gerrit.wikimedia.org/r/715447 (https://phabricator.wikimedia.org/T236017) (owner: 10JMeybohm) [13:26:51] (03Merged) 10jenkins-bot: termbox: Remove HTTP service from kubernetes [deployment-charts] - 10https://gerrit.wikimedia.org/r/715448 (https://phabricator.wikimedia.org/T254581) (owner: 10JMeybohm) [13:26:53] (03Merged) 10jenkins-bot: citoid: Remove HTTP 
service from kubernetes [deployment-charts] - 10https://gerrit.wikimedia.org/r/715449 (https://phabricator.wikimedia.org/T255868) (owner: 10JMeybohm) [13:29:16] !log jayme@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'blubberoid' for release 'staging' . [13:29:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:29:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2119 (re)pooling @ 75%: Slowly repool after reimage T288803', diff saved to https://phabricator.wikimedia.org/P17171 and previous config saved to /var/cache/conftool/dbconfig/20210902-132945-root.json [13:29:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:29:49] T288803: Upgrade s4 to Buster + MariaDB 10.4 - https://phabricator.wikimedia.org/T288803 [13:35:49] !log jayme@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'blubberoid' for release 'production' . [13:35:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:37] !log jayme@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'blubberoid' for release 'production' . [13:36:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:43] (03CR) 10Volans: "Replies inline, I'll send the new PS in a bit" [cookbooks] - 10https://gerrit.wikimedia.org/r/716216 (https://phabricator.wikimedia.org/T205885) (owner: 10Volans) [13:38:10] !log jayme@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'termbox' for release 'staging' . [13:38:10] !log jayme@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'termbox' for release 'test' . [13:38:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:38:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:10] !log jayme@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'termbox' for release 'production' . 
[13:39:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:31] (03PS1) 10MVernon: prometheus: couple mysqld export service to mariadb (multi-instance) [puppet] - 10https://gerrit.wikimedia.org/r/716306 (https://phabricator.wikimedia.org/T289488) [13:39:34] !log jayme@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'termbox' for release 'production' . [13:39:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:40:20] 10SRE, 10SRE-Access-Requests, 10Analytics: Requesting access to analytics-privatedata-users group for Abban Dunne - https://phabricator.wikimedia.org/T289775 (10Ottomata) Approved [13:41:16] !log jayme@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'citoid' for release 'staging' . [13:41:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:23] 10SRE, 10SRE-Access-Requests: Requesting access to production shell for Mew Ophaswongse - https://phabricator.wikimedia.org/T290200 (10Ottomata) Approved from analytics if @DMburugu approves. [13:41:28] (03PS4) 10JMeybohm: zotero: Remove HTTP service from kubernetes [deployment-charts] - 10https://gerrit.wikimedia.org/r/715450 (https://phabricator.wikimedia.org/T255869) [13:41:34] (03PS4) 10JMeybohm: mathoid: Remove HTTP service from kubernetes [deployment-charts] - 10https://gerrit.wikimedia.org/r/715451 (https://phabricator.wikimedia.org/T255875) [13:41:44] (03PS4) 10JMeybohm: wikifeeds: Remove HTTP service from kubernetes [deployment-charts] - 10https://gerrit.wikimedia.org/r/715452 (https://phabricator.wikimedia.org/T255878) [13:41:50] (03PS5) 10JMeybohm: cxserver: Remove HTTP service from kubernetes [deployment-charts] - 10https://gerrit.wikimedia.org/r/715453 (https://phabricator.wikimedia.org/T255879) [13:42:07] !log jayme@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'citoid' for release 'production' . 
[13:42:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:42:32] !log jayme@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'citoid' for release 'production' . [13:42:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:44:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2119 (re)pooling @ 100%: Slowly repool after reimage T288803', diff saved to https://phabricator.wikimedia.org/P17172 and previous config saved to /var/cache/conftool/dbconfig/20210902-134448-root.json [13:44:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:44:54] T288803: Upgrade s4 to Buster + MariaDB 10.4 - https://phabricator.wikimedia.org/T288803 [13:46:38] (03CR) 10JMeybohm: [C: 03+2] cxserver: Remove HTTP service from kubernetes [deployment-charts] - 10https://gerrit.wikimedia.org/r/715453 (https://phabricator.wikimedia.org/T255879) (owner: 10JMeybohm) [13:46:40] (03CR) 10JMeybohm: [C: 03+2] wikifeeds: Remove HTTP service from kubernetes [deployment-charts] - 10https://gerrit.wikimedia.org/r/715452 (https://phabricator.wikimedia.org/T255878) (owner: 10JMeybohm) [13:46:42] (03CR) 10JMeybohm: [C: 03+2] mathoid: Remove HTTP service from kubernetes [deployment-charts] - 10https://gerrit.wikimedia.org/r/715451 (https://phabricator.wikimedia.org/T255875) (owner: 10JMeybohm) [13:46:44] (03CR) 10JMeybohm: [C: 03+2] zotero: Remove HTTP service from kubernetes [deployment-charts] - 10https://gerrit.wikimedia.org/r/715450 (https://phabricator.wikimedia.org/T255869) (owner: 10JMeybohm) [13:48:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2136 for upgrade', diff saved to https://phabricator.wikimedia.org/P17173 and previous config saved to /var/cache/conftool/dbconfig/20210902-134838-marostegui.json [13:48:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:49:44] (03Merged) 10jenkins-bot: zotero: Remove HTTP service from kubernetes 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/715450 (https://phabricator.wikimedia.org/T255869) (owner: 10JMeybohm) [13:49:46] (03Merged) 10jenkins-bot: mathoid: Remove HTTP service from kubernetes [deployment-charts] - 10https://gerrit.wikimedia.org/r/715451 (https://phabricator.wikimedia.org/T255875) (owner: 10JMeybohm) [13:49:48] (03Merged) 10jenkins-bot: wikifeeds: Remove HTTP service from kubernetes [deployment-charts] - 10https://gerrit.wikimedia.org/r/715452 (https://phabricator.wikimedia.org/T255878) (owner: 10JMeybohm) [13:49:50] (03Merged) 10jenkins-bot: cxserver: Remove HTTP service from kubernetes [deployment-charts] - 10https://gerrit.wikimedia.org/r/715453 (https://phabricator.wikimedia.org/T255879) (owner: 10JMeybohm) [13:50:25] (03CR) 10Effie Mouzeli: "I knew this day would come: https://people.wikimedia.org/~jiji/haproxy1.png" [puppet] - 10https://gerrit.wikimedia.org/r/715932 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [13:52:02] ^ lol [13:55:59] !log jayme@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'zotero' for release 'staging' . [13:56:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:58] (03PS1) 10Hashar: README.md: add deploy_artifacts.py [software/gerrit] (deploy/wmf/stable-3.3) - 10https://gerrit.wikimedia.org/r/716316 [13:57:00] (03PS1) 10Hashar: Gerrit v3.3.6 [software/gerrit] (deploy/wmf/stable-3.3) - 10https://gerrit.wikimedia.org/r/716317 (https://phabricator.wikimedia.org/T290236) [13:57:59] !log jayme@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'zotero' for release 'production' . [13:58:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:16] (03CR) 10Hashar: "Bug fix release! 
https://www.gerritcodereview.com/3.3.html#336" [software/gerrit] (deploy/wmf/stable-3.3) - 10https://gerrit.wikimedia.org/r/716317 (https://phabricator.wikimedia.org/T290236) (owner: 10Hashar) [14:00:11] !log jayme@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'zotero' for release 'production' . [14:00:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2136 (re)pooling @ 10%: Slowly repool after reimage T288803', diff saved to https://phabricator.wikimedia.org/P17174 and previous config saved to /var/cache/conftool/dbconfig/20210902-140357-root.json [14:04:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:04] T288803: Upgrade s4 to Buster + MariaDB 10.4 - https://phabricator.wikimedia.org/T288803 [14:05:52] (03CR) 10Dzahn: [C: 03+1] Add Subbu as approval contact for Parsoid-related groups [puppet] - 10https://gerrit.wikimedia.org/r/716227 (owner: 10Muehlenhoff) [14:08:54] (03CR) 10Jbond: "See inline, fyi you didn't push a PS2 (some comments suggested that you had already made the changes)" [cookbooks] - 10https://gerrit.wikimedia.org/r/716216 (https://phabricator.wikimedia.org/T205885) (owner: 10Volans) [14:11:19] (03CR) 10Subramanya Sastry: [C: 03+1] Add Subbu as approval contact for Parsoid-related groups [puppet] - 10https://gerrit.wikimedia.org/r/716227 (owner: 10Muehlenhoff) [14:13:28] !log installing ffmpeg security updates [14:13:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:48] (03PS3) 10Jelto: helmfile.d admin add dedicated deploy user [deployment-charts] - 10https://gerrit.wikimedia.org/r/715498 (https://phabricator.wikimedia.org/T251305) [14:13:50] (03CR) 10David Caro: [C: 03+2] P:standard: move admin to its own profile [puppet] - 10https://gerrit.wikimedia.org/r/715003 (https://phabricator.wikimedia.org/T289661) (owner: 10Jbond) [14:14:23] (03CR) 10Dzahn: "Before the full 
backup was daily and the partial was hourly, but the calendar syntax after looks like both are running once per day?" [puppet] - 10https://gerrit.wikimedia.org/r/712322 (https://phabricator.wikimedia.org/T274463) (owner: 10Jelto) [14:15:38] 10SRE, 10LDAP-Access-Requests: Grant Access to Logstash for SimoneThisDot - https://phabricator.wikimedia.org/T289783 (10KFrancis) @jcrespo Hi all, I am confirming Simone does not need a separate NDA. Please proceed with any needed access request. Thanks! [14:17:40] (03CR) 10Jelto: "fixed in Patch Set 3" [deployment-charts] - 10https://gerrit.wikimedia.org/r/715498 (https://phabricator.wikimedia.org/T251305) (owner: 10Jelto) [14:18:08] (03CR) 10Subramanya Sastry: [C: 03+1] add deployment and perf-roots shell groups to parsoid hosts [puppet] - 10https://gerrit.wikimedia.org/r/715988 (https://phabricator.wikimedia.org/T290144) (owner: 10Dzahn) [14:18:52] (03PS5) 10Herron: thanos: add thanos::recording_rule [puppet] - 10https://gerrit.wikimedia.org/r/715779 (https://phabricator.wikimedia.org/T287142) [14:19:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2136 (re)pooling @ 25%: Slowly repool after reimage T288803', diff saved to https://phabricator.wikimedia.org/P17175 and previous config saved to /var/cache/conftool/dbconfig/20210902-141901-root.json [14:19:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:09] T288803: Upgrade s4 to Buster + MariaDB 10.4 - https://phabricator.wikimedia.org/T288803 [14:20:28] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10Krinkle) [14:22:15] !log installing exiv2 security updates [14:22:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:30] (03CR) 10Dzahn: [C: 03+2] Add Subbu as approval contact for Parsoid-related groups [puppet] - 10https://gerrit.wikimedia.org/r/716227 (owner: 10Muehlenhoff) 
[14:22:37] (03PS2) 10Dzahn: Add Subbu as approval contact for Parsoid-related groups [puppet] - 10https://gerrit.wikimedia.org/r/716227 (owner: 10Muehlenhoff) [14:26:05] (03CR) 10Herron: thanos: add thanos::recording_rule (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/715779 (https://phabricator.wikimedia.org/T287142) (owner: 10Herron) [14:26:23] (03CR) 10Dzahn: "please also merge this on the puppetmaster (if it's ready for that)" [puppet] - 10https://gerrit.wikimedia.org/r/715003 (https://phabricator.wikimedia.org/T289661) (owner: 10Jbond) [14:28:14] (03CR) 10JMeybohm: [C: 04-1] "This is what I had in mind as well." [deployment-charts] - 10https://gerrit.wikimedia.org/r/716235 (owner: 10Elukey) [14:28:59] (03CR) 10Brennen Bearnes: [C: 03+1] "I am in favor of the upgrade, but probably won't be around to pair on it after 16:30 UTC today, 'til Tuesday 2021-09-07." [software/gerrit] (deploy/wmf/stable-3.3) - 10https://gerrit.wikimedia.org/r/716317 (https://phabricator.wikimedia.org/T290236) (owner: 10Hashar) [14:30:39] (03CR) 10JMeybohm: [C: 04-1] Introduce the secrets helm chart (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/716235 (owner: 10Elukey) [14:30:41] (03CR) 10Filippo Giunchedi: "See inline" [puppet] - 10https://gerrit.wikimedia.org/r/715779 (https://phabricator.wikimedia.org/T287142) (owner: 10Herron) [14:32:01] !log jayme@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'wikifeeds' for release 'staging' . 
[14:32:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:59] (03PS1) 10Muehlenhoff: Update repository hook for Gitlab 14 [puppet] - 10https://gerrit.wikimedia.org/r/716346 (https://phabricator.wikimedia.org/T289802) [14:33:13] (03CR) 10Herron: thanos: add thanos::recording_rule (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/715779 (https://phabricator.wikimedia.org/T287142) (owner: 10Herron) [14:33:59] !log jayme@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'wikifeeds' for release 'production' . [14:34:00] (03CR) 10Jbond: sre.experimental.reimage: add reimage cookbook (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/716216 (https://phabricator.wikimedia.org/T205885) (owner: 10Volans) [14:34:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2136 (re)pooling @ 50%: Slowly repool after reimage T288803', diff saved to https://phabricator.wikimedia.org/P17176 and previous config saved to /var/cache/conftool/dbconfig/20210902-143405-root.json [14:34:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:10] T288803: Upgrade s4 to Buster + MariaDB 10.4 - https://phabricator.wikimedia.org/T288803 [14:34:12] (03PS7) 10Herron: thanos: add recording rules for etcd error slo [puppet] - 10https://gerrit.wikimedia.org/r/714814 (https://phabricator.wikimedia.org/T289615) [14:35:55] !log jayme@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'wikifeeds' for release 'production' . 
[14:35:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:17] (03PS6) 10Herron: thanos: add thanos::recording_rule [puppet] - 10https://gerrit.wikimedia.org/r/715779 (https://phabricator.wikimedia.org/T287142) [14:36:33] (03PS8) 10Herron: thanos: add recording rules for etcd error slo [puppet] - 10https://gerrit.wikimedia.org/r/714814 (https://phabricator.wikimedia.org/T289615) [14:38:09] !log jayme@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'cxserver' for release 'staging' . [14:38:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:46] !log jayme@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'cxserver' for release 'production' . [14:38:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:01] (03CR) 10Herron: thanos: add thanos::recording_rule (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/715779 (https://phabricator.wikimedia.org/T287142) (owner: 10Herron) [14:39:31] !log jayme@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'cxserver' for release 'production' . [14:39:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:17] 10SRE, 10Prod-Kubernetes, 10serviceops, 10Kubernetes: Add TLS termination to services running on kubernetes - https://phabricator.wikimedia.org/T235411 (10JMeybohm) [14:40:23] PROBLEM - Unmerged changes on repository puppet on puppetmaster1001 is CRITICAL: There are 2 unmerged changes in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). 
https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [14:40:40] 10SRE, 10Prod-Kubernetes, 10serviceops, 10Kubernetes: Add TLS termination to services running on kubernetes - https://phabricator.wikimedia.org/T235411 (10JMeybohm) [14:40:48] 10SRE, 10Prod-Kubernetes, 10serviceops, 10Kubernetes, 10Patch-For-Review: Move cxserver to use TLS only - https://phabricator.wikimedia.org/T255879 (10JMeybohm) 05Open→03Resolved [14:40:51] 10SRE, 10Prod-Kubernetes, 10serviceops, 10Kubernetes, 10Patch-For-Review: Move wikifeeds to use TLS only - https://phabricator.wikimedia.org/T255878 (10JMeybohm) 05Open→03Resolved a:03JMeybohm [14:40:59] 10SRE, 10Prod-Kubernetes, 10serviceops, 10Kubernetes: Add TLS termination to services running on kubernetes - https://phabricator.wikimedia.org/T235411 (10JMeybohm) [14:41:07] 10SRE, 10Prod-Kubernetes, 10serviceops, 10Kubernetes: Add TLS termination to services running on kubernetes - https://phabricator.wikimedia.org/T235411 (10JMeybohm) [14:41:18] 10SRE, 10Prod-Kubernetes, 10serviceops, 10Kubernetes: Add TLS termination to services running on kubernetes - https://phabricator.wikimedia.org/T235411 (10JMeybohm) [14:41:24] 10SRE, 10Citoid, 10Prod-Kubernetes, 10serviceops, and 2 others: Move citoid to use TLS only - https://phabricator.wikimedia.org/T255868 (10JMeybohm) 05Open→03Resolved [14:41:30] 10SRE, 10Prod-Kubernetes, 10serviceops, 10Kubernetes: Add TLS termination to services running on kubernetes - https://phabricator.wikimedia.org/T235411 (10JMeybohm) [14:41:41] 10SRE, 10Prod-Kubernetes, 10Traffic, 10serviceops, and 3 others: Move termbox to use TLS only - https://phabricator.wikimedia.org/T254581 (10JMeybohm) 05Open→03Resolved [14:41:52] 10SRE, 10Prod-Kubernetes, 10serviceops, 10Kubernetes, 10Patch-For-Review: Move zotero to use TLS only - https://phabricator.wikimedia.org/T255869 (10JMeybohm) 05Open→03Resolved [14:42:33] 10SRE, 10serviceops, 10Kubernetes, 10Patch-For-Review, 10Release Pipeline (Blubber): Move blubberoid to 
use TLS only. - https://phabricator.wikimedia.org/T236017 (10JMeybohm) 05Open→03Resolved [14:43:28] (03PS2) 10Volans: sre.experimental.reimage: add reimage cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/716216 (https://phabricator.wikimedia.org/T205885) [14:43:36] (03CR) 10Volans: "replies inline, PS sent" [cookbooks] - 10https://gerrit.wikimedia.org/r/716216 (https://phabricator.wikimedia.org/T205885) (owner: 10Volans) [14:44:03] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [14:44:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:18] (03PS1) 10Alexandros Kosiaris: Revert "mathoid: Pin the chart version" [deployment-charts] - 10https://gerrit.wikimedia.org/r/716361 [14:44:47] (03PS1) 10Zabe: prometheus_local_crontabs: remove absented cron [puppet] - 10https://gerrit.wikimedia.org/r/716362 (https://phabricator.wikimedia.org/T273673) [14:45:22] 10SRE, 10SRE-Access-Requests: Requesting access to production shell for Mew Ophaswongse - https://phabricator.wikimedia.org/T290200 (10mewoph) >>! In T290200#7327520, @fgiunchedi wrote: > @mewoph is access to hadoop data something you'll need? I'll use this information to determine whether to setup kerberos fo... 
[14:45:22] (03CR) 10Effie Mouzeli: Automatically pull latest MediaWiki image onto staging cluster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/715993 (https://phabricator.wikimedia.org/T284628) (owner: 10Legoktm) [14:47:53] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:47:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:58] (03CR) 10Ahmon Dancy: [C: 03+1] Gerrit v3.3.6 (031 comment) [software/gerrit] (deploy/wmf/stable-3.3) - 10https://gerrit.wikimedia.org/r/716317 (https://phabricator.wikimedia.org/T290236) (owner: 10Hashar) [14:48:34] (03CR) 10JMeybohm: [C: 03+1] helmfile.d admin add dedicated deploy user [deployment-charts] - 10https://gerrit.wikimedia.org/r/715498 (https://phabricator.wikimedia.org/T251305) (owner: 10Jelto) [14:48:36] 10Puppet, 10SRE, 10Infrastructure-Foundations: Search public keys in additional places for sslcert::certificate - https://phabricator.wikimedia.org/T290261 (10fgiunchedi) [14:48:38] dcaro, mutante: there are 2 unmerged patches of yours on puppetmaster (merged on gerrit). Are you about to merge them? 
[14:48:47] (03CR) 10Ahmon Dancy: [C: 03+2] README.md: add deploy_artifacts.py [software/gerrit] (deploy/wmf/stable-3.3) - 10https://gerrit.wikimedia.org/r/716316 (owner: 10Hashar) [14:48:55] (03Merged) 10jenkins-bot: README.md: add deploy_artifacts.py [software/gerrit] (deploy/wmf/stable-3.3) - 10https://gerrit.wikimedia.org/r/716316 (owner: 10Hashar) [14:49:03] volans: I am waiting for dcaro and left a comment on that gerrit [14:49:04] !log jiji@cumin1001 START - Cookbook sre.hosts.decommission for hosts mc1033.eqiad.wmnet [14:49:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2136 (re)pooling @ 75%: Slowly repool after reimage T288803', diff saved to https://phabricator.wikimedia.org/P17177 and previous config saved to /var/cache/conftool/dbconfig/20210902-144908-root.json [14:49:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:13] T288803: Upgrade s4 to Buster + MariaDB 10.4 - https://phabricator.wikimedia.org/T288803 [14:49:18] (03PS1) 10Filippo Giunchedi: POC sslcert: additional search paths for certificates [puppet] - 10https://gerrit.wikimedia.org/r/716370 (https://phabricator.wikimedia.org/T290261) [14:49:45] 10SRE, 10ops-eqiad, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install cloudcephosd102[1-4].eqiad.wmnet - https://phabricator.wikimedia.org/T284471 (10Cmjohnson) [14:50:03] 10SRE, 10ops-eqiad, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install cloudcephosd102[1-4].eqiad.wmnet - https://phabricator.wikimedia.org/T284471 (10Cmjohnson) Updated the network ports with vlans for both NICs [14:50:52] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Search public keys in additional places for sslcert::certificate - https://phabricator.wikimedia.org/T290261 (10fgiunchedi) >>! In T290261#7328600, @gerritbot wrote: > Change 716370 had a related patch set uploaded (by Filippo Giunchedi; au... 
[14:50:57] !log jiji@cumin1001 START - Cookbook sre.hosts.decommission for hosts mc1034.eqiad.wmnet [14:51:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:52] (03PS2) 10Jdlrobson: Fix Wikidata API url [mediawiki-config] - 10https://gerrit.wikimedia.org/r/716073 [14:56:43] (03CR) 10Hashar: Gerrit v3.3.6 (031 comment) [software/gerrit] (deploy/wmf/stable-3.3) - 10https://gerrit.wikimedia.org/r/716317 (https://phabricator.wikimedia.org/T290236) (owner: 10Hashar) [14:56:53] volans: ack, you can merge [14:57:10] I don't have any patch to merge, was checking the Icinga alert [14:57:21] then I'll merge :) [14:57:28] go ahead :) [14:57:59] (03CR) 10Brennen Bearnes: [C: 03+1] Update repository hook for Gitlab 14 [puppet] - 10https://gerrit.wikimedia.org/r/716346 (https://phabricator.wikimedia.org/T289802) (owner: 10Muehlenhoff) [14:58:02] dcaro: you can type "multiple" it's ok [14:58:07] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/715779 (https://phabricator.wikimedia.org/T287142) (owner: 10Herron) [14:58:12] just make sure the admin module change works [14:58:16] ack [14:58:29] RECOVERY - Unmerged changes on repository puppet on puppetmaster1001 is OK: No changes to merge. https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [14:59:50] dcaro: looks like noop on bast3004 :) thanks, not a trivial one [15:00:21] (03PS4) 10Dzahn: add deployment and perf-roots shell groups to parsoid hosts [puppet] - 10https://gerrit.wikimedia.org/r/715988 (https://phabricator.wikimedia.org/T290144) [15:00:51] 10SRE, 10LDAP-Access-Requests: Grant Access to Logstash for SimoneThisDot - https://phabricator.wikimedia.org/T289783 (10fgiunchedi) 05Open→03Resolved I believe this task can be resolved! 
Feel free to reopen if sth is amiss [15:00:55] mutante: it should be a noop everywhere yep [15:00:57] (03CR) 10Bstorm: [C: 03+2] prometheus_local_crontabs: remove absented cron [puppet] - 10https://gerrit.wikimedia.org/r/716362 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [15:01:15] great! [15:02:50] (03PS1) 10Alexandros Kosiaris: Add a timeout parameter [software/benchmw] - 10https://gerrit.wikimedia.org/r/716371 [15:03:10] (03CR) 10Dzahn: [C: 03+2] add deployment and perf-roots shell groups to parsoid hosts [puppet] - 10https://gerrit.wikimedia.org/r/715988 (https://phabricator.wikimedia.org/T290144) (owner: 10Dzahn) [15:04:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2136 (re)pooling @ 100%: Slowly repool after reimage T288803', diff saved to https://phabricator.wikimedia.org/P17178 and previous config saved to /var/cache/conftool/dbconfig/20210902-150412-root.json [15:04:14] (03PS1) 10Mforns: Fix again --until for monitor_refine_event_sanitized_analytics_delayed [puppet] - 10https://gerrit.wikimedia.org/r/716372 [15:04:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:04:18] T288803: Upgrade s4 to Buster + MariaDB 10.4 - https://phabricator.wikimedia.org/T288803 [15:05:42] 10SRE, 10SRE-Access-Requests, 10Infrastructure-Foundations, 10Parsoid, and 3 others: Deployers unable to ssh to parse* hosts - https://phabricator.wikimedia.org/T290144 (10Dzahn) After merging the change above I ran puppet on parse2001 and saw all the deployer shell accounts being created. On all other ho... 
[15:06:08] (03CR) 10Ottomata: [C: 03+2] Fix again --until for monitor_refine_event_sanitized_analytics_delayed [puppet] - 10https://gerrit.wikimedia.org/r/716372 (owner: 10Mforns) [15:07:20] (03PS3) 10Filippo Giunchedi: clinic-duty: add ops-maintenance calendar link generator [software] - 10https://gerrit.wikimedia.org/r/715980 [15:07:50] Krinkle: you can now ssh to parse* and wtp*,as member of perf-roots [15:08:12] (ran puppet on parse2001, others will follow [15:08:14] (03CR) 10Filippo Giunchedi: [C: 03+2] clinic-duty: add ops-maintenance calendar link generator [software] - 10https://gerrit.wikimedia.org/r/715980 (owner: 10Filippo Giunchedi) [15:08:23] mutante: thx [15:12:16] 10SRE, 10ops-codfw: Test Dell switches cabling - https://phabricator.wikimedia.org/T290133 (10Papaul) a:05Papaul→03ayounsi [15:13:10] 10SRE, 10ops-codfw, 10serviceops: mw2264 went down - https://phabricator.wikimedia.org/T290242 (10Papaul) a:03Papaul [15:15:15] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mc1034.eqiad.wmnet [15:15:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:23] 10SRE, 10ops-eqiad, 10decommission-hardware: Decommission mc[1019-1023,1025-1026,1028-1036].eqiad.wmnet - https://phabricator.wikimedia.org/T289657 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jiji@cumin1001 for hosts: `mc1034.eqiad.wmnet` - mc1034.eqiad.wmnet (**PASS**) - Downtimed... 
[15:16:21] !log jiji@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=99) for hosts mc1033.eqiad.wmnet [15:16:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:34] (03PS5) 10Elukey: Introduce the secrets helm chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/716235 [15:17:35] (03CR) 10jerkins-bot: [V: 04-1] Introduce the secrets helm chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/716235 (owner: 10Elukey) [15:17:52] (03CR) 10Elukey: Introduce the secrets helm chart (034 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/716235 (owner: 10Elukey) [15:21:53] (03PS6) 10Elukey: Introduce the secrets helm chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/716235 [15:26:09] !log jiji@cumin1001 START - Cookbook sre.hosts.decommission for hosts mc1019.eqiad.wmnet [15:26:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:27:13] (03CR) 10Elukey: Introduce the secrets helm chart (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/716235 (owner: 10Elukey) [15:28:38] !log dzahn@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'miscweb' for release 'main' . [15:28:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:31:23] (03PS7) 10Elukey: Introduce the secrets helm chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/716235 [15:31:39] !log dzahn@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'miscweb' for release 'main' . 
[15:31:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:32:06] (03CR) 10Elukey: Introduce the secrets helm chart (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/716235 (owner: 10Elukey) [15:37:31] (03PS1) 10Effie Mouzeli: mwdebug: increase the number of php-fpm workers in codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/716388 [15:38:08] (03CR) 10JMeybohm: [C: 04-1] Introduce the secrets helm chart (034 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/716235 (owner: 10Elukey) [15:40:43] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mc1019.eqiad.wmnet [15:40:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:40:47] 10SRE, 10ops-eqiad, 10decommission-hardware: Decommission mc[1019-1023,1025-1026,1028-1036].eqiad.wmnet - https://phabricator.wikimedia.org/T289657 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jiji@cumin1001 for hosts: `mc1019.eqiad.wmnet` - mc1019.eqiad.wmnet (**PASS**) - Downtimed... [15:47:54] 10SRE, 10Scap, 10Python3-Porting, 10Release-Engineering-Team (Doing): Porting scap to Python 3 - https://phabricator.wikimedia.org/T279628 (10dancy) 05Open→03Resolved a:03dancy Marking this resolved since the code has been updated and being used in Beta cluster. Not deployed to production yet but tha... 
[15:51:29] (03CR) 10Vgutierrez: [C: 03+2] envoyproxy: Provide support for UDS upstreams (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/712368 (https://phabricator.wikimedia.org/T271421) (owner: 10Vgutierrez) [15:52:07] (03PS2) 10Effie Mouzeli: mwdebug: increase the number of workers and replicas in codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/716388 [15:53:06] !log jiji@cumin1001 START - Cookbook sre.hosts.decommission for hosts mc1020.eqiad.wmnet [15:53:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:53:16] (03PS10) 10Vgutierrez: envoyproxy: Support alpn_protocols configuration [puppet] - 10https://gerrit.wikimedia.org/r/713238 (https://phabricator.wikimedia.org/T271421) [15:54:12] (03CR) 10Alexandros Kosiaris: [C: 03+1] "Go for it! 96 fpm workers in total like a standard codfw mw appserver box" [deployment-charts] - 10https://gerrit.wikimedia.org/r/716388 (owner: 10Effie Mouzeli) [15:55:12] (03CR) 10Effie Mouzeli: [C: 03+2] mwdebug: increase the number of workers and replicas in codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/716388 (owner: 10Effie Mouzeli) [15:57:22] (03CR) 10Elukey: Introduce the secrets helm chart (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/716235 (owner: 10Elukey) [15:58:33] (03Merged) 10jenkins-bot: mwdebug: increase the number of workers and replicas in codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/716388 (owner: 10Effie Mouzeli) [16:00:04] jbond and rzl: My dear minions, it's time we take the moon! Just kidding. Time for Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210902T1600). 
[16:04:20] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mc1020.eqiad.wmnet [16:04:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:04:24] 10SRE, 10ops-eqiad, 10decommission-hardware: Decommission mc[1019-1023,1025-1026,1028-1036].eqiad.wmnet - https://phabricator.wikimedia.org/T289657 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jiji@cumin1001 for hosts: `mc1020.eqiad.wmnet` - mc1020.eqiad.wmnet (**PASS**) - Downtimed... [16:09:16] jouncebot: now [16:09:17] For the next 0 hour(s) and 50 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210902T1600) [16:09:27] !log jiji@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [16:09:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:12:24] (03PS1) 10Effie Mouzeli: mwdebug: fix typo [deployment-charts] - 10https://gerrit.wikimedia.org/r/716410 [16:12:57] (03CR) 10Effie Mouzeli: [C: 03+2] mwdebug: fix typo [deployment-charts] - 10https://gerrit.wikimedia.org/r/716410 (owner: 10Effie Mouzeli) [16:13:35] (03CR) 10Legoktm: [C: 03+2] Automatically pull latest MediaWiki image onto staging cluster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/715993 (https://phabricator.wikimedia.org/T284628) (owner: 10Legoktm) [16:15:54] (03Merged) 10jenkins-bot: mwdebug: fix typo [deployment-charts] - 10https://gerrit.wikimedia.org/r/716410 (owner: 10Effie Mouzeli) [16:18:34] !log jiji@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[16:18:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:20:50] PROBLEM - Check systemd state on kubestage1002 is CRITICAL: CRITICAL - degraded: The following units failed: mwautopull.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:21:05] ^ me, known [16:26:14] (03PS21) 10Dduvall: gitlab: Provide profile for docker based GitLab runners [puppet] - 10https://gerrit.wikimedia.org/r/708339 (https://phabricator.wikimedia.org/T287504) [16:26:14] ACKNOWLEDGEMENT - Check systemd state on kubestage1002 is CRITICAL: CRITICAL - degraded: The following units failed: mwautopull.service Legoktm fixing with dancy https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:26:43] (03CR) 10Dduvall: gitlab: Provide profile for docker based GitLab runners (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/708339 (https://phabricator.wikimedia.org/T287504) (owner: 10Dduvall) [16:28:36] jbond: ^ added the Stdlib::HTTPSUrl typing you requested. is it too late to add this to the puppet request window? 
:) [16:28:54] dduvall: give me 5 mins [16:28:58] no prob [16:29:15] (03PS1) 10Effie Mouzeli: Decommission mc1019, mc1020, mc1033, mc1034 [puppet] - 10https://gerrit.wikimedia.org/r/716413 (https://phabricator.wikimedia.org/T289657) [16:29:33] (03CR) 10jerkins-bot: [V: 04-1] Decommission mc1019, mc1020, mc1033, mc1034 [puppet] - 10https://gerrit.wikimedia.org/r/716413 (https://phabricator.wikimedia.org/T289657) (owner: 10Effie Mouzeli) [16:30:24] 10SRE, 10ops-eqiad, 10decommission-hardware, 10Patch-For-Review: Decommission mc[1019-1023,1025-1026,1028-1036].eqiad.wmnet - https://phabricator.wikimedia.org/T289657 (10jijiki) [16:31:02] 10SRE, 10ops-eqiad, 10decommission-hardware, 10Patch-For-Review: Decommission mc[1019-1023,1025-1026,1028-1036].eqiad.wmnet - https://phabricator.wikimedia.org/T289657 (10jijiki) @wiki_willy you can remove 1033 and 1034 [16:32:40] 10SRE, 10ops-eqiad, 10decommission-hardware, 10Patch-For-Review: Decommission mc[1019-1023,1025-1026,1028-1036].eqiad.wmnet - https://phabricator.wikimedia.org/T289657 (10wiki_willy) Awesome, thanks so much @jijiki. (fyi for @Cmjohnson and @Jclark-ctr) >>! In T289657#7328872, @jijiki wrote: > @wiki_wil... [16:34:20] (03CR) 10Jbond: [C: 03+2] gitlab: Provide profile for docker based GitLab runners [puppet] - 10https://gerrit.wikimedia.org/r/708339 (https://phabricator.wikimedia.org/T287504) (owner: 10Dduvall) [16:34:52] dduvall: merged [16:34:55] jbond: thank you! [16:35:00] (03PS9) 10Michael DiPietro: update celery worker to allow for celery v5 [puppet] - 10https://gerrit.wikimedia.org/r/715974 (https://phabricator.wikimedia.org/T288528) [16:35:30] (03CR) 10jerkins-bot: [V: 04-1] update celery worker to allow for celery v5 [puppet] - 10https://gerrit.wikimedia.org/r/715974 (https://phabricator.wikimedia.org/T288528) (owner: 10Michael DiPietro) [16:36:36] (03PS1) 10RLazarus: Send the "No Gerrit patches" convenience message for Puppet windows. 
[wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/716428 [16:37:08] (03CR) 10jerkins-bot: [V: 04-1] Send the "No Gerrit patches" convenience message for Puppet windows. [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/716428 (owner: 10RLazarus) [16:37:47] love you too jerkins <3 [16:38:00] haha.. sorry man. [16:38:09] Damn style enforcement. [16:38:23] (03PS2) 10RLazarus: Send the "No Gerrit patches" convenience message for Puppet windows. [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/716428 [16:38:57] (03CR) 10jerkins-bot: [V: 04-1] Send the "No Gerrit patches" convenience message for Puppet windows. [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/716428 (owner: 10RLazarus) [16:39:53] (03CR) 10Effie Mouzeli: [C: 03+1] toolhub: Add helmfile.d [deployment-charts] - 10https://gerrit.wikimedia.org/r/714867 (https://phabricator.wikimedia.org/T280881) (owner: 10BryanDavis) [16:39:59] (03CR) 10Effie Mouzeli: [C: 03+1] toolhub: Add mcrouter sidecar for memcached access [deployment-charts] - 10https://gerrit.wikimedia.org/r/715286 (https://phabricator.wikimedia.org/T280881) (owner: 10BryanDavis) [16:40:07] Seems like it wants it all on one line (but will then complain about too long line? ) [16:40:24] Maybe you're just like my mother... she's never satisfied.. why do we scream at each other? 
[16:40:26] yeah I am baffled [16:41:29] (03PS10) 10Michael DiPietro: update celery worker to allow for celery v5 [puppet] - 10https://gerrit.wikimedia.org/r/715974 (https://phabricator.wikimedia.org/T288528) [16:41:40] checking the tox settings, maybe you're supposed to configure one rule or the other somewhere, but not both [16:41:53] (03Abandoned) 10Effie Mouzeli: Decommission mc1019, mc1020, mc1033, mc1034 [puppet] - 10https://gerrit.wikimedia.org/r/716413 (https://phabricator.wikimedia.org/T289657) (owner: 10Effie Mouzeli) [16:41:58] (03CR) 10jerkins-bot: [V: 04-1] update celery worker to allow for celery v5 [puppet] - 10https://gerrit.wikimedia.org/r/715974 (https://phabricator.wikimedia.org/T288528) (owner: 10Michael DiPietro) [16:42:02] * dancy counts the hundreds of hours wasted on automated style checks. [16:42:09] (03PS1) 10Urbanecm: Growth: Define wgGEMentorDashboardDiscoveryEnabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/716431 (https://phabricator.wikimedia.org/T289054) [16:42:21] (03PS11) 10Vgutierrez: envoyproxy: Support alpn_protocols configuration [puppet] - 10https://gerrit.wikimedia.org/r/713238 (https://phabricator.wikimedia.org/T271421) [16:43:25] yeahhhh [16:43:47] automated style checks are great but they really ought to come with automated fixes [16:44:08] nod [16:44:15] I don't love black's style preferences but I would happily adopt them over digging into this myself [16:44:21] Totally. If you care about it so much jerkins-bot, then fix it yourself! [16:44:34] (03PS1) 10Effie Mouzeli: Decommission mc1019, mc1020, mc1033, mc1034 [puppet] - 10https://gerrit.wikimedia.org/r/716432 (https://phabricator.wikimedia.org/T289657) [16:44:40] patchsets welcome, jerkins!!! [16:44:44] ahaha [16:44:47] * dancy ends rant. 
[16:45:21] (03CR) 10Ssingh: [C: 03+1] envoyproxy: Support alpn_protocols configuration [puppet] - 10https://gerrit.wikimedia.org/r/713238 (https://phabricator.wikimedia.org/T271421) (owner: 10Vgutierrez) [16:45:25] (03CR) 10jerkins-bot: [V: 04-1] Decommission mc1019, mc1020, mc1033, mc1034 [puppet] - 10https://gerrit.wikimedia.org/r/716432 (https://phabricator.wikimedia.org/T289657) (owner: 10Effie Mouzeli) [16:45:29] (03CR) 10Herron: [C: 03+2] thanos: add thanos::recording_rule [puppet] - 10https://gerrit.wikimedia.org/r/715779 (https://phabricator.wikimedia.org/T287142) (owner: 10Herron) [16:45:41] (03PS10) 10Vgutierrez: envoyproxy: Support TLS min/max version config [puppet] - 10https://gerrit.wikimedia.org/r/713246 (https://phabricator.wikimedia.org/T271421) [16:46:16] (03CR) 10Herron: thanos: add thanos::recording_rule [puppet] - 10https://gerrit.wikimedia.org/r/715779 (https://phabricator.wikimedia.org/T287142) (owner: 10Herron) [16:46:30] (03PS11) 10Michael DiPietro: update celery worker to allow for celery v5 [puppet] - 10https://gerrit.wikimedia.org/r/715974 (https://phabricator.wikimedia.org/T288528) [16:47:11] (03CR) 10jerkins-bot: [V: 04-1] update celery worker to allow for celery v5 [puppet] - 10https://gerrit.wikimedia.org/r/715974 (https://phabricator.wikimedia.org/T288528) (owner: 10Michael DiPietro) [16:47:20] (03PS7) 10Herron: thanos: add thanos::recording_rule [puppet] - 10https://gerrit.wikimedia.org/r/715779 (https://phabricator.wikimedia.org/T287142) [16:47:48] PROBLEM - Check systemd state on kubestage1001 is CRITICAL: CRITICAL - degraded: The following units failed: mwautopull.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:48:37] (03CR) 10Herron: [C: 03+2] thanos: add thanos::recording_rule (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/715779 (https://phabricator.wikimedia.org/T287142) (owner: 10Herron) [16:48:48] (03PS1) 10Alexandros Kosiaris: Bump replicas instead of workers to reach 96 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/716433 [16:49:36] (03PS9) 10Herron: thanos: add recording rules for etcd error slo [puppet] - 10https://gerrit.wikimedia.org/r/714814 (https://phabricator.wikimedia.org/T289615) [16:52:00] (03CR) 10Herron: [C: 03+2] thanos: add recording rules for etcd error slo [puppet] - 10https://gerrit.wikimedia.org/r/714814 (https://phabricator.wikimedia.org/T289615) (owner: 10Herron) [16:52:20] (03CR) 10Alexandros Kosiaris: [C: 03+2] Bump replicas instead of workers to reach 96 [deployment-charts] - 10https://gerrit.wikimedia.org/r/716433 (owner: 10Alexandros Kosiaris) [16:52:55] (03CR) 10Herron: [C: 03+2] thanos: add recording rules for etcd error slo (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/714814 (https://phabricator.wikimedia.org/T289615) (owner: 10Herron) [16:53:27] (03CR) 10BryanDavis: [C: 03+2] toolhub: Add mcrouter sidecar for memcached access [deployment-charts] - 10https://gerrit.wikimedia.org/r/715286 (https://phabricator.wikimedia.org/T280881) (owner: 10BryanDavis) [16:53:35] (03CR) 10BryanDavis: [C: 03+2] toolhub: Add helmfile.d [deployment-charts] - 10https://gerrit.wikimedia.org/r/714867 (https://phabricator.wikimedia.org/T280881) (owner: 10BryanDavis) [16:55:16] (03Merged) 10jenkins-bot: Bump replicas instead of workers to reach 96 [deployment-charts] - 10https://gerrit.wikimedia.org/r/716433 (owner: 10Alexandros Kosiaris) [16:56:42] (03Merged) 10jenkins-bot: toolhub: Add mcrouter sidecar for memcached access [deployment-charts] - 10https://gerrit.wikimedia.org/r/715286 (https://phabricator.wikimedia.org/T280881) (owner: 10BryanDavis) [16:56:44] (03Merged) 10jenkins-bot: toolhub: Add helmfile.d [deployment-charts] - 10https://gerrit.wikimedia.org/r/714867 (https://phabricator.wikimedia.org/T280881) (owner: 10BryanDavis) [16:56:49] 10SRE, 10Security-Team, 10observability: icinga notification if elevated writing to badpass.log - https://phabricator.wikimedia.org/T150300 (10Reedy) [16:57:11] 
!log akosiaris@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [16:57:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:57:19] 10SRE, 10Wikimedia-Mailing-lists: Outlook/Microsoft bounced all? daily-article-l deliveries for Sept. 2 - https://phabricator.wikimedia.org/T290223 (10Legoktm) Thanks for taking a look :) I don't really understand why spamassasin added the X-Spam-Report header in the first place, AIUI it's only supposed to do... [16:57:30] (03PS12) 10Michael DiPietro: update celery worker to allow for celery v5 [puppet] - 10https://gerrit.wikimedia.org/r/715974 (https://phabricator.wikimedia.org/T288528) [16:58:12] (03CR) 10jerkins-bot: [V: 04-1] update celery worker to allow for celery v5 [puppet] - 10https://gerrit.wikimedia.org/r/715974 (https://phabricator.wikimedia.org/T288528) (owner: 10Michael DiPietro) [16:58:32] (03PS3) 10RLazarus: Send the "No Gerrit patches" convenience message for Puppet windows. [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/716428 [16:59:08] 10SRE, 10Peek, 10Security-Team, 10PM: Change peek scheduled jobs to systemd timer or k8s cron - https://phabricator.wikimedia.org/T254368 (10Reedy) 05Open→03Resolved a:03chasemp [16:59:36] (03PS13) 10Michael DiPietro: update celery worker to allow for celery v5 [puppet] - 10https://gerrit.wikimedia.org/r/715974 (https://phabricator.wikimedia.org/T288528) [17:00:04] chrisalbon and accraze: It is that lovely time of the day again! You are hereby commanded to deploy Services – Graphoid / ORES. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210902T1700). 
[17:01:06] (03CR) 10jerkins-bot: [V: 04-1] update celery worker to allow for celery v5 [puppet] - 10https://gerrit.wikimedia.org/r/715974 (https://phabricator.wikimedia.org/T288528) (owner: 10Michael DiPietro) [17:03:01] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [17:03:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:03:57] 10Puppet, 10Infrastructure-Foundations, 10Patch-For-Review, 10User-jbond: puppetdb seems to be slow on host reimage - https://phabricator.wikimedia.org/T263578 (10jbond) Just a quick update, i preformed the facts change at roughly 10:30 UTC, you can see in the graph below that we had a spike in command pro... [17:04:30] RECOVERY - Check systemd state on kubestage1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:04:46] RECOVERY - Check systemd state on kubestage1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:04:51] (03PS14) 10Michael DiPietro: update celery worker to allow for celery v5 [puppet] - 10https://gerrit.wikimedia.org/r/715974 (https://phabricator.wikimedia.org/T288528) [17:05:34] (03CR) 10jerkins-bot: [V: 04-1] update celery worker to allow for celery v5 [puppet] - 10https://gerrit.wikimedia.org/r/715974 (https://phabricator.wikimedia.org/T288528) (owner: 10Michael DiPietro) [17:06:18] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[17:06:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:06:25] (03PS15) 10Michael DiPietro: update celery worker to allow for celery v5 [puppet] - 10https://gerrit.wikimedia.org/r/715974 (https://phabricator.wikimedia.org/T288528) [17:09:55] (03PS2) 10Effie Mouzeli: Decommission mc1019, mc1020, mc1033, mc1034 [puppet] - 10https://gerrit.wikimedia.org/r/716432 (https://phabricator.wikimedia.org/T289657) [17:11:56] (03PS1) 10Jbond: lldp: fix confine [puppet] - 10https://gerrit.wikimedia.org/r/716440 [17:13:21] (03CR) 10Andrew Bogott: [C: 03+1] lldp: fix confine [puppet] - 10https://gerrit.wikimedia.org/r/716440 (owner: 10Jbond) [17:13:31] (03PS16) 10Michael DiPietro: update celery worker to allow for celery v5 [puppet] - 10https://gerrit.wikimedia.org/r/715974 (https://phabricator.wikimedia.org/T288528) [17:14:54] (03PS1) 10Milimetric: analytics/data_purge: Finish renaming geoeditors_daily to editors_daily [puppet] - 10https://gerrit.wikimedia.org/r/716441 (https://phabricator.wikimedia.org/T290093) [17:16:35] (03CR) 10Milimetric: analytics/data_purge: Finish renaming geoeditors_daily to editors_daily (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/716441 (https://phabricator.wikimedia.org/T290093) (owner: 10Milimetric) [17:16:57] (03CR) 10Jbond: [C: 03+1] "lgtm thanks <3" [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/716428 (owner: 10RLazarus) [17:17:15] (03PS17) 10Michael DiPietro: update celery worker to allow for celery v5 [puppet] - 10https://gerrit.wikimedia.org/r/715974 (https://phabricator.wikimedia.org/T288528) [17:19:01] 10SRE, 10Analytics, 10Patch-For-Review: Trash cleanup cron spams on an-test hosts - https://phabricator.wikimedia.org/T286442 (10Ottomata) Could we just make the script use /home which is everywhere? 
[17:21:32] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:23:22] (03CR) 10Jbond: "lgtm but i don't know the history adding Valentin and otto who might" [puppet] - 10https://gerrit.wikimedia.org/r/716370 (https://phabricator.wikimedia.org/T290261) (owner: 10Filippo Giunchedi) [17:27:22] (03PS18) 10Michael DiPietro: update celery worker to allow for celery v5 [puppet] - 10https://gerrit.wikimedia.org/r/715974 (https://phabricator.wikimedia.org/T288528) [17:28:52] (03PS8) 10Elukey: Introduce the secrets helm chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/716235 [17:30:53] (03CR) 10jerkins-bot: [V: 04-1] Introduce the secrets helm chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/716235 (owner: 10Elukey) [17:32:53] (03PS1) 10Herron: thanos::rule ensure /etc/thanos-rule/rules directory [puppet] - 10https://gerrit.wikimedia.org/r/716449 [17:34:22] (03CR) 10BryanDavis: [C: 03+2] Send the "No Gerrit patches" convenience message for Puppet windows. [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/716428 (owner: 10RLazarus) [17:35:09] bd808: thanks! [17:35:16] (03Merged) 10jenkins-bot: Send the "No Gerrit patches" convenience message for Puppet windows. [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/716428 (owner: 10RLazarus) [17:36:58] jouncebot: nowandnext [17:36:58] For the next 0 hour(s) and 23 minute(s): Services – Graphoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210902T1700) [17:36:58] In 0 hour(s) and 23 minute(s): Morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210902T1800) [17:37:02] (03CR) 10Elukey: "now it seems that the problem is in another chart, will investigate!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/716235 (owner: 10Elukey) [17:37:30] rzl: I guess we will see when the next puppet window rolls around if it worked the way you hoped :) [17:37:58] (03CR) 10Herron: [C: 03+2] thanos::rule ensure /etc/thanos-rule/rules directory [puppet] - 10https://gerrit.wikimedia.org/r/716449 (owner: 10Herron) [17:38:05] your pessimism that the next window won't have any patches is... probably well-founded [17:38:42] but hey, if anyone lurking has a puppet patch that needs to be merged, let me take this opportunity to advertise the https://wikitech.wikimedia.org/wiki/Puppet_request_window ! [17:39:03] in addition to all its regular benefits, you can try to get a streak going and flummox my ability to test that jouncebot change, which would be very funny [17:40:24] (03CR) 10Bstorm: Route Grid engine web requests via Kubernetes (031 comment) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/697096 (https://phabricator.wikimedia.org/T282975) (owner: 10Majavah) [17:40:58] bd808: o/ if you have a moment, CI for deployment-charts is returning me a weird result for the toolhub chart - https://integration.wikimedia.org/ci/job/helm-lint/5150/console [17:41:44] PROBLEM - kartotherian endpoints health on maps2010 is CRITICAL: /osm-intl/info.json (tile service info for osm-intl) is CRITICAL: Test tile service info for osm-intl returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/kartotherian [17:42:36] elukey: looking... [17:45:34] RECOVERY - kartotherian endpoints health on maps2010 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/kartotherian [17:50:33] it is weird, the mw.mcrouter.pools is indeed not in values.yaml, but was it changed recently? [17:50:44] elukey: I'm confused about why that test passed on https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/714867 but is now failing. 
The failure is for missing yaml data, but that data is supposed to be system wide. I'm just not sure how it ends up in CI. [17:51:20] ah I see in .fixtures there is an example [17:51:28] the values it wants are from /etc/helmfile-defaults/mediawiki/mcrouter_pools.yaml which Puppet maintains on the deploy boxes [17:51:51] yes yes but in this case they are not in the chart, and helm fails to validate it [17:53:01] but it is in the fixtures, so in theory it should pick it up, I don't recall if CI does a validation with and without fixtures [17:53:19] (03CR) 10Elukey: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/716235 (owner: 10Elukey) [17:53:43] maybe it is a race condition in how the linting/validation is done [17:54:26] (03PS1) 10Legoktm: shell: Fix $wgShellboxUrls by passing service name when creating BoxedCommand [core] (wmf/1.37.0-wmf.21) - 10https://gerrit.wikimedia.org/r/716103 (https://phabricator.wikimedia.org/T290193) [17:54:30] I re-ran the job via jenkins ui and it failed in the same way. I'm not re-running the one that passed on my merge to see if it still works or now fails [17:54:34] (03PS1) 10Legoktm: Use the 'score' Shellbox if configured [extensions/Score] (wmf/1.37.0-wmf.21) - 10https://gerrit.wikimedia.org/r/716104 (https://phabricator.wikimedia.org/T290193) [17:55:13] elukey: and it fails... so there is something non-obvious about that test that I did wrong [17:56:31] bd808: going afk for dinner but I'll try to check tomorrow morning with ServiceOps in case you don't find the issue later on, lemme know on IRC how it goes! [17:56:41] I could put stub data into helmfile.d/services/toolhub/.fixtures.yaml to fix it, but not sure if that's right. 
[17:56:44] (03PS10) 10Ryan Kemper: blazegraph: Setup new wcqs instances [puppet] - 10https://gerrit.wikimedia.org/r/713946 (owner: 10Ebernhardson) [17:58:52] (03PS19) 10Michael DiPietro: update celery worker to allow for celery v5 [puppet] - 10https://gerrit.wikimedia.org/r/715974 (https://phabricator.wikimedia.org/T288528) [17:59:16] (03PS20) 10Michael DiPietro: update celery worker to allow for celery v5 [puppet] - 10https://gerrit.wikimedia.org/r/715974 (https://phabricator.wikimedia.org/T288528) [17:59:50] (03CR) 10Ottomata: analytics/data_purge: Finish renaming geoeditors_daily to editors_daily (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/716441 (https://phabricator.wikimedia.org/T290093) (owner: 10Milimetric) [18:00:04] RoanKattouw, Niharika, and Urbanecm: #bothumor I � Unicode. All rise for Morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210902T1800). [18:00:04] No Gerrit patches in the queue for this window AFAICS. [18:01:01] (03CR) 10Ryan Kemper: [C: 03+2] query_service: migrate query-service-gc-log-cleanup cron to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/716039 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [18:03:21] (03PS1) 10Nikki Nikkhoui: Add image suggestion api to lookup table [puppet] - 10https://gerrit.wikimedia.org/r/716461 (https://phabricator.wikimedia.org/T288132) [18:03:37] (03CR) 10Herron: [C: 03+1] profile: adapt alertmanager-webhook-logger to ECS (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/715111 (https://phabricator.wikimedia.org/T289356) (owner: 10Cwhite) [18:05:46] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10puppet-compiler, and 2 others: replace all puppet crons with systemd timers - https://phabricator.wikimedia.org/T273673 (10RKemper) Merged https://gerrit.wikimedia.org/r/c/operations/puppet/+/716039 `Fri 2021-09-03 02:12:00 UTC 8h left n/a... 
[18:06:34] (03CR) 10Joal: "Works for me :) Thanks a lot Dan (not +1ing as Andrew let a comment for modif)" [puppet] - 10https://gerrit.wikimedia.org/r/716441 (https://phabricator.wikimedia.org/T290093) (owner: 10Milimetric) [18:08:34] bd808: my guess is that the gate CI that ran for the helmfile.d patch used the version of the chart before the mcrouter patch (since it hadn't been published yet). And now with the correct chart, it's actually failing [18:10:11] (03PS21) 10Michael DiPietro: update celery worker to allow for celery v5 [puppet] - 10https://gerrit.wikimedia.org/r/715974 (https://phabricator.wikimedia.org/T288528) [18:10:56] (03PS11) 10Ryan Kemper: blazegraph: Setup new wcqs instances [puppet] - 10https://gerrit.wikimedia.org/r/713946 (owner: 10Ebernhardson) [18:13:25] (03PS2) 10Ryan Kemper: airflow: Compress scheduler logs [puppet] - 10https://gerrit.wikimedia.org/r/716018 (owner: 10Ebernhardson) [18:13:56] (03PS22) 10Michael DiPietro: update celery worker to allow for celery v5 [puppet] - 10https://gerrit.wikimedia.org/r/715974 (https://phabricator.wikimedia.org/T288528) [18:14:28] (03CR) 10Ryan Kemper: [C: 03+2] blazegraph: Setup new wcqs instances [puppet] - 10https://gerrit.wikimedia.org/r/713946 (owner: 10Ebernhardson) [18:14:51] (03CR) 10BryanDavis: Introduce the secrets helm chart (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/716235 (owner: 10Elukey) [18:16:53] (03PS23) 10Michael DiPietro: update celery worker to allow for celery v5 [puppet] - 10https://gerrit.wikimedia.org/r/715974 (https://phabricator.wikimedia.org/T288528) [18:18:45] (03Abandoned) 10Nikki Nikkhoui: Add image suggestion api to lookup table [puppet] - 10https://gerrit.wikimedia.org/r/716461 (https://phabricator.wikimedia.org/T288132) (owner: 10Nikki Nikkhoui) [18:19:11] (03CR) 10Ottomata: "OH cool!" 
[puppet] - 10https://gerrit.wikimedia.org/r/716370 (https://phabricator.wikimedia.org/T290261) (owner: 10Filippo Giunchedi) [18:19:33] (03CR) 10Bstorm: "Based on the discussion on task, with proper paths, the tiles can be made to work via toolserver URLs. However, considering the correct UR" [puppet] - 10https://gerrit.wikimedia.org/r/692000 (https://phabricator.wikimedia.org/T282889) (owner: 10Majavah) [18:19:36] (03PS24) 10Michael DiPietro: update celery worker to allow for celery v5 [puppet] - 10https://gerrit.wikimedia.org/r/715974 (https://phabricator.wikimedia.org/T288528) [18:20:42] !log ryankemper@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'linkrecommendation' for release 'external' . [18:20:42] !log ryankemper@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'linkrecommendation' for release 'internal' . [18:20:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:20:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:23:09] PROBLEM - Check systemd state on wcqs1003 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-blazegraph-exporter-wcqs-blazegraph.service,wcqs-blazegraph.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:23:29] PROBLEM - Check systemd state on wcqs1001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-blazegraph-exporter-wcqs-blazegraph.service,wcqs-blazegraph.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:23:52] (03CR) 10Ottomata: Configure event stream for map tile expiration (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715028 (https://phabricator.wikimedia.org/T289771) (owner: 10Jgiannelos) [18:24:18] (03CR) 10Ottomata: Configure event stream for map tile expiration (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/715028 (https://phabricator.wikimedia.org/T289771) (owner: 10Jgiannelos) [18:24:27] 
(03PS1) 10BryanDavis: Revert "toolhub: Add helmfile.d" [deployment-charts] - 10https://gerrit.wikimedia.org/r/716486 [18:24:42] !log ryankemper@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'linkrecommendation' for release 'internal' . [18:24:42] !log ryankemper@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'linkrecommendation' for release 'external' . [18:24:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:24:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:25:09] PROBLEM - Check systemd state on wcqs1002 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-blazegraph-exporter-wcqs-blazegraph.service,wcqs-blazegraph.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:25:11] PROBLEM - Blazegraph Port for wcqs-blazegraph on wcqs1001 is CRITICAL: connect to address 127.0.0.1 and port 9999: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [18:27:25] PROBLEM - Blazegraph process -wcqs-blazegraph- on wcqs1001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 499 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [18:27:26] 10SRE, 10ops-eqiad, 10DC-Ops, 10Platform Team Workboards (Green): eqiad: Server moves to free up space on 10g racks - https://phabricator.wikimedia.org/T267065 (10Jclark-ctr) 05Open→03Resolved no longer needed for 10g space in eqiad [18:27:28] 10SRE, 10ops-eqiad: eqiad: Move maps1001 same rack A4 - https://phabricator.wikimedia.org/T273983 (10Jclark-ctr) [18:27:33] 10SRE, 10ops-eqiad, 10DBA: eqiad: move db1111 to rack A8 - https://phabricator.wikimedia.org/T273982 (10Jclark-ctr) [18:27:36] 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (10Jclark-ctr) [18:27:46] (03CR) 10BryanDavis: 
[C: 03+2] Revert "toolhub: Add helmfile.d" [deployment-charts] - 10https://gerrit.wikimedia.org/r/716486 (owner: 10BryanDavis) [18:28:31] !log [WCQS] Merged & deployed https://gerrit.wikimedia.org/r/c/operations/puppet/+/713946, going to suppress icinga alerts on `wcqs*` hosts because these are still in the process of being spun up properly and aren't serving traffic or anything [18:28:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:29:44] (03CR) 10BryanDavis: Introduce the secrets helm chart (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/716235 (owner: 10Elukey) [18:31:16] !log [WCQS] `wcqs100[1-3],wcqs200[1-3]` downtimed until `2021-09-09 20:29:55` (UTC) [18:31:18] (03Merged) 10jenkins-bot: Revert "toolhub: Add helmfile.d" [deployment-charts] - 10https://gerrit.wikimedia.org/r/716486 (owner: 10BryanDavis) [18:31:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:31:59] (03CR) 10BryanDavis: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/716235 (owner: 10Elukey) [18:35:59] (03CR) 10Michael DiPietro: "New pcc:" [puppet] - 10https://gerrit.wikimedia.org/r/715974 (https://phabricator.wikimedia.org/T288528) (owner: 10Michael DiPietro) [18:37:31] (03CR) 10Ryan Kemper: [C: 03+2] airflow: Compress scheduler logs [puppet] - 10https://gerrit.wikimedia.org/r/716018 (owner: 10Ebernhardson) [18:39:11] (03PS2) 10Ryan Kemper: blazegraph: Setup tls termination for wcqs [puppet] - 10https://gerrit.wikimedia.org/r/713958 (https://phabricator.wikimedia.org/T280001) (owner: 10Ebernhardson) [18:40:28] (03CR) 10Ryan Kemper: [C: 03+2] blazegraph: Setup tls termination for wcqs [puppet] - 10https://gerrit.wikimedia.org/r/713958 (https://phabricator.wikimedia.org/T280001) (owner: 10Ebernhardson) [18:41:35] RECOVERY - Check systemd state on wcqs1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:41:44] 
(03PS1) 10PipelineBot: shellbox-constraints: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/716470 [18:44:30] (03PS1) 10PipelineBot: shellbox: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/716471 [18:46:40] (03CR) 10Andrew Bogott: [C: 03+1] update celery worker to allow for celery v5 [puppet] - 10https://gerrit.wikimedia.org/r/715974 (https://phabricator.wikimedia.org/T288528) (owner: 10Michael DiPietro) [18:47:06] (03PS1) 10PipelineBot: shellbox-timeline: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/716472 [19:00:41] twentyafterfour and dancy: #bothumor My software never has bugs. It just develops random features. Rise for MediaWiki train - American Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210902T1900). [19:06:33] (03PS6) 10Jdlrobson: Adding wordmark for ptwikinews mobile and desktop skins [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704171 (https://phabricator.wikimedia.org/T281591) (owner: 10Juan90264) [19:06:42] (03PS8) 10Jdlrobson: Use the ptwikinews wordmark in new vector and mobile [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704172 (https://phabricator.wikimedia.org/T281591) (owner: 10Juan90264) [19:07:16] (03CR) 10Jdlrobson: "I've merged this into the other patch (https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/704171 ) as we usually update these " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704172 (https://phabricator.wikimedia.org/T281591) (owner: 10Juan90264) [19:07:20] (03PS7) 10Jdlrobson: Adding wordmark for ptwikinews mobile and desktop skins [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704171 (https://phabricator.wikimedia.org/T281591) (owner: 10Juan90264) [19:08:46] (03PS3) 10Jdlrobson: Fix Wikidata beta cluster Nearby API url [mediawiki-config] - 10https://gerrit.wikimedia.org/r/716073 [19:08:50] (03CR) 10jerkins-bot: [V: 04-1] Use the ptwikinews wordmark in new vector and 
mobile [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704172 (https://phabricator.wikimedia.org/T281591) (owner: 10Juan90264) [19:09:16] (03PS1) 10Hashar: Merge branch 'stable-3.3' into wmf/stable-3.3 [software/gerrit/plugins/gitiles] (wmf/stable-3.3) - 10https://gerrit.wikimedia.org/r/716484 [19:13:00] mediawiki-new-errors looks mostly quiet though there are a couple of sporadic errors. 15 in 5 hours seems low enough to move forward. Preparing to deploy wmf.21 to all wikis momentarily [19:13:04] (03CR) 10Hashar: "Upstream has updated the plugin" [software/gerrit/plugins/gitiles] (wmf/stable-3.3) - 10https://gerrit.wikimedia.org/r/716484 (owner: 10Hashar) [19:15:00] (03PS1) 10Hashar: Merge tag 'v3.3.6' into wmf/stable-3.3 [software/gerrit] (wmf/stable-3.3) - 10https://gerrit.wikimedia.org/r/716485 (https://phabricator.wikimedia.org/T290236) [19:15:35] (03CR) 10Hashar: "The submodule bump in our deploy repo is done via https://gerrit.wikimedia.org/r/716485" [software/gerrit/plugins/gitiles] (wmf/stable-3.3) - 10https://gerrit.wikimedia.org/r/716484 (owner: 10Hashar) [19:17:18] (03CR) 10jerkins-bot: [V: 04-1] Merge tag 'v3.3.6' into wmf/stable-3.3 [software/gerrit] (wmf/stable-3.3) - 10https://gerrit.wikimedia.org/r/716485 (https://phabricator.wikimedia.org/T290236) (owner: 10Hashar) [19:21:05] (03PS1) 1020after4: all wikis to 1.37.0-wmf.21 refs T281162 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/716518 [19:21:07] (03CR) 1020after4: [C: 03+2] all wikis to 1.37.0-wmf.21 refs T281162 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/716518 (owner: 1020after4) [19:25:45] (03Merged) 10jenkins-bot: all wikis to 1.37.0-wmf.21 refs T281162 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/716518 (owner: 1020after4) [19:27:58] !log twentyafterfour@deploy1002 rebuilt and synchronized wikiversions files: all wikis to 1.37.0-wmf.21 refs T281162 [19:28:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:28:05] T281162: 
1.37.0-wmf.21 deployment blockers - https://phabricator.wikimedia.org/T281162 [19:32:19] (03PS1) 10BryanDavis: toolhub: Add helmfile.d (second attempt) [deployment-charts] - 10https://gerrit.wikimedia.org/r/716521 (https://phabricator.wikimedia.org/T280881) [19:40:12] !log jiji@cumin1001 START - Cookbook sre.hosts.decommission for hosts mc1021.eqiad.wmnet [19:40:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:40:58] (03PS1) 10Cmjohnson: Adding mac addresses to dhcpd file cloudcephosd hosts [puppet] - 10https://gerrit.wikimedia.org/r/716530 (https://phabricator.wikimedia.org/T284471) [19:43:45] (03CR) 10BryanDavis: toolhub: Add helmfile.d (second attempt) (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/716521 (https://phabricator.wikimedia.org/T280881) (owner: 10BryanDavis) [19:45:57] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [19:45:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:46:12] (03CR) 10Cmjohnson: [C: 03+2] Adding mac addresses to dhcpd file cloudcephosd hosts [puppet] - 10https://gerrit.wikimedia.org/r/716530 (https://phabricator.wikimedia.org/T284471) (owner: 10Cmjohnson) [19:48:03] 10SRE, 10ops-eqiad, 10decommission-hardware, 10Patch-For-Review: Decommission mc[1019-1023,1025-1026,1028-1036].eqiad.wmnet - https://phabricator.wikimedia.org/T289657 (10Cmjohnson) [19:48:07] 10SRE, 10ops-eqiad, 10decommission-hardware, 10Patch-For-Review: Decommission mc[1019-1023,1025-1026,1028-1036].eqiad.wmnet - https://phabricator.wikimedia.org/T289657 (10Cmjohnson) mc1019, 1020 and mc1033 and 1034 have been removed from the rack. [19:48:56] 10SRE, 10ops-eqiad, 10DC-Ops: Q1:(Need By: ASAP) rack/setup/install ms-be10[64-67] - https://phabricator.wikimedia.org/T285808 (10Cmjohnson) @Jclark-ctr I removed the 2 servers in D4, can you please rack ms-be1067. 
[19:48:56] !log jiji@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts mc1021.eqiad.wmnet [19:48:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:49:02] 10SRE, 10ops-eqiad, 10decommission-hardware, 10Patch-For-Review: Decommission mc[1019-1023,1025-1026,1028-1036].eqiad.wmnet - https://phabricator.wikimedia.org/T289657 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jiji@cumin1001 for hosts: `mc1021.eqiad.wmnet` - mc1021.eqiad.wmnet (**... [19:49:18] PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [150.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [19:49:50] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:49:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:59:06] (03PS1) 10Ryan Kemper: elasticsearch: begin pull out config [software/spicerack] - 10https://gerrit.wikimedia.org/r/716532 [20:08:22] (03CR) 10Legoktm: [C: 03+1] "LGTM!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/716521 (https://phabricator.wikimedia.org/T280881) (owner: 10BryanDavis) [20:09:25] (03CR) 10BryanDavis: [C: 03+2] "Let's try again!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/716521 (https://phabricator.wikimedia.org/T280881) (owner: 10BryanDavis) [20:12:29] (03PS1) 10Herron: slo_dashboard: switch etcd request slo query to recording rule metrics [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/716535 (https://phabricator.wikimedia.org/T289615) [20:12:35] (03Merged) 10jenkins-bot: toolhub: Add helmfile.d (second attempt) [deployment-charts] - 10https://gerrit.wikimedia.org/r/716521 (https://phabricator.wikimedia.org/T280881) (owner: 10BryanDavis) [20:13:03] (03CR) 10jerkins-bot: [V: 04-1] elasticsearch: begin pull out config [software/spicerack] - 10https://gerrit.wikimedia.org/r/716532 (owner: 10Ryan Kemper) [20:17:06] (03PS1) 10Herron: remove *_sli_metric_prom params [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/716536 [20:18:43] (03CR) 10Herron: [V: 03+2 C: 03+2] remove *_sli_metric_prom params [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/716536 (owner: 10Herron) [20:21:15] (03PS2) 10Herron: slo_dashboard: switch etcd request slo query to recording rule metrics [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/716535 (https://phabricator.wikimedia.org/T289615) [20:22:05] 10SRE, 10SRE-Access-Requests: Overwrote access key - https://phabricator.wikimedia.org/T290279 (10ChristineDeKock) [20:26:18] PROBLEM - Query Service HTTP Port on wdqs1006 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 380 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [20:28:14] RECOVERY - Query Service HTTP Port on wdqs1006 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.013 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [20:28:26] 10SRE, 10SRE-Access-Requests: Overwrote access key - https://phabricator.wikimedia.org/T290279 (10Urbanecm) a:05ssingh→03None Clearing assignee. 
[20:28:34] 10SRE, 10SRE-Access-Requests: Overwrote access key - https://phabricator.wikimedia.org/T290279 (10Urbanecm) [20:28:36] 10SRE, 10SRE-Access-Requests: re-open access to Analytic Cluster for ChristineDeKock - https://phabricator.wikimedia.org/T284987 (10Urbanecm) [20:28:40] PROBLEM - WDQS high update lag on wdqs1006 is CRITICAL: 6.159e+04 ge 4.32e+04 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [20:38:20] (03PS1) 10Bstorm: cloud osmdb: update the filenames in case we re-import the shapefiles [puppet] - 10https://gerrit.wikimedia.org/r/716543 (https://phabricator.wikimedia.org/T285668) [20:58:04] RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [21:02:34] (03PS2) 10Ryan Kemper: elasticsearch: begin pull out config [software/spicerack] - 10https://gerrit.wikimedia.org/r/716532 [21:05:24] (03PS1) 10BryanDavis: readme: Update readme for many changes that went undocumented [deployment-charts] - 10https://gerrit.wikimedia.org/r/716553 [21:08:14] (03CR) 10jerkins-bot: [V: 04-1] elasticsearch: begin pull out config [software/spicerack] - 10https://gerrit.wikimedia.org/r/716532 (owner: 10Ryan Kemper) [21:10:07] (03PS3) 10Ryan Kemper: elasticsearch: begin pull out config [software/spicerack] - 10https://gerrit.wikimedia.org/r/716532 [21:17:49] !log bd808@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'toolhub' for release 'main' . 
[21:17:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:19:55] 10SRE, 10SRE-Access-Requests, 10Infrastructure-Foundations, 10serviceops, and 2 others: Deployers unable to ssh to parse* hosts - https://phabricator.wikimedia.org/T290144 (10Arlolra) [21:20:02] (03CR) 10jerkins-bot: [V: 04-1] elasticsearch: begin pull out config [software/spicerack] - 10https://gerrit.wikimedia.org/r/716532 (owner: 10Ryan Kemper) [21:20:12] well, it almost worked. Now to figure out why it didn't quite work. [21:37:32] !log bd808@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'toolhub' for release 'main' . [21:37:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:46:35] (03PS1) 10Zabe: query_service: remove absented query-service-gc-log-cleanup cron [puppet] - 10https://gerrit.wikimedia.org/r/716563 (https://phabricator.wikimedia.org/T273673) [21:47:57] !log bd808@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'toolhub' for release 'main' . [21:47:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:50:12] (03CR) 10Zabe: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/716563 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [21:51:32] (03CR) 10Zabe: "PCC: https://puppet-compiler.wmflabs.org/compiler1003/896/wdqs2001.codfw.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/716563 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [22:01:44] (03CR) 10Cwhite: [C: 03+2] profile: adapt alertmanager-webhook-logger to ECS [puppet] - 10https://gerrit.wikimedia.org/r/715111 (https://phabricator.wikimedia.org/T289356) (owner: 10Cwhite) [22:28:39] 10SRE, 10Wikimedia-Mailing-lists: Outlook/Microsoft bounced all? daily-article-l deliveries for Sept. 2 - https://phabricator.wikimedia.org/T290223 (10MZMcBride) Thank you all for investigating this issue. 
Confirming that I have not touched this script in months, honestly I'd forgotten all about it again. The... [22:32:32] RECOVERY - Check systemd state on wcqs1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:37:02] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=rails site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:40:52] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:00:04] brennen: #bothumor I � Unicode. All rise for US Backport and Config training deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210902T2300). [23:00:04] Jdlrobson: A patch you scheduled for US Backport and Config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [23:00:37] * thcipriani waves [23:00:59] Jdlrobson: around? [23:01:09] Howd [23:01:10] y [23:01:26] cool, let's get to mergin' [23:01:35] 1 is beta cluster only so hopefully a quick one [23:02:01] (03CR) 10Thcipriani: [C: 03+2] Fix Wikidata beta cluster Nearby API url [mediawiki-config] - 10https://gerrit.wikimedia.org/r/716073 (owner: 10Jdlrobson) [23:02:46] (03Merged) 10jenkins-bot: Fix Wikidata beta cluster Nearby API url [mediawiki-config] - 10https://gerrit.wikimedia.org/r/716073 (owner: 10Jdlrobson) [23:02:56] huh, I thought there was a way to actually preview the svg in gerrit, but now I can't figure it out [23:04:35] there's not annoyingly.. 
best way is to take the raw SVG markup and throw it into a webpage [23:04:48] yep, just looking on codepen :) [23:05:06] (03CR) 10Jdlrobson: "This patch can be safely abandoned now." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704172 (https://phabricator.wikimedia.org/T281591) (owner: 10Juan90264) [23:05:23] (03CR) 10Thcipriani: [C: 03+2] Adding wordmark for ptwikinews mobile and desktop skins [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704171 (https://phabricator.wikimedia.org/T281591) (owner: 10Juan90264) [23:05:31] lgtm [23:06:10] (03Merged) 10jenkins-bot: Adding wordmark for ptwikinews mobile and desktop skins [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704171 (https://phabricator.wikimedia.org/T281591) (owner: 10Juan90264) [23:07:29] 10SRE, 10SRE-Access-Requests: Requesting access to Stat1007 for jmando - https://phabricator.wikimedia.org/T289606 (10EYener) Hi all! Thank you for working on this and granting access for @JMando! We have been working from the level of access authorized, which you've all granted, to trying to confirm that we c... 
[23:08:59] Jdlrobson: live on mwdebug2002, check please [23:09:41] LGTM [23:09:51] cool, going live [23:10:44] Sweeeeett [23:11:38] !log thcipriani@deploy1002 Synchronized static/images/mobile/copyright/wikinews-wordmark-pt.svg: Config: [[gerrit:704171|Adding wordmark for ptwikinews mobile and desktop skins (T281591)]] Part I (duration: 01m 14s) [23:11:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:11:44] T281591: Add wordmark on ptwikinews - https://phabricator.wikimedia.org/T281591 [23:12:54] !log thcipriani@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:704171|Adding wordmark for ptwikinews mobile and desktop skins (T281591)]] Part II (duration: 00m 57s) [23:12:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:12:59] ^ Jdlrobson all done [23:13:53] and I think your beta change is going out with: https://integration.wikimedia.org/ci/job/beta-scap-sync-world/18981/console [23:13:59] but if not that one then the next [23:14:15] thanks thcipriani that's awesome [23:14:18] I appreciate your time [23:14:25] free of charge :) [23:14:33] it's a good deal [23:14:36] :D [23:29:52] PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [150.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [23:48:20] RECOVERY - WDQS high update lag on wdqs1006 is OK: (C)4.32e+04 ge (W)2.16e+04 ge 2.135e+04 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [23:54:16] (03PS1) 10Legoktm: toolhub: Fix mounting of mcrouter-config volume [deployment-charts] - 10https://gerrit.wikimedia.org/r/716633 (https://phabricator.wikimedia.org/T290283) [23:55:21] (03PS2) 10Legoktm: toolhub: Fix mounting of 
mcrouter-config volume [deployment-charts] - 10https://gerrit.wikimedia.org/r/716633 (https://phabricator.wikimedia.org/T290283) [23:55:45] (03PS1) 10Cwhite: logstash: temporarily reroute alertmanager webhook logs [puppet] - 10https://gerrit.wikimedia.org/r/716635 [23:58:51] (03CR) 10jerkins-bot: [V: 04-1] logstash: temporarily reroute alertmanager webhook logs [puppet] - 10https://gerrit.wikimedia.org/r/716635 (owner: 10Cwhite) [23:59:54] a bit late to the window but I'll do a maintenance script patch backport