[00:22:33] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_eventlogging_analytics.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:26:45] PROBLEM - MariaDB memory on dbstore1007 is CRITICAL: CRIT Memory 95% used. Largest process: mysqld (6856) = 43.8% https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[01:49:42] (CR) Tim Starling: [C: +2] Update src/defines.php [mediawiki-config] - https://gerrit.wikimedia.org/r/701467 (owner: Tim Starling)
[01:51:24] (Merged) jenkins-bot: Update src/defines.php [mediawiki-config] - https://gerrit.wikimedia.org/r/701467 (owner: Tim Starling)
[02:01:33] !log tstarling@deploy1002 Synchronized src/defines.php: for consistency only, should have no production impact (duration: 00m 57s)
[02:01:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:26:17] PROBLEM - MariaDB memory on dbstore1007 is CRITICAL: CRIT Memory 95% used. Largest process: mysqld (6856) = 43.7% https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[02:48:07] (PS1) Gergő Tisza: Add a link: Show article extract instead of description in the link inspector [extensions/GrowthExperiments] (wmf/1.37.0-wmf.16) - https://gerrit.wikimedia.org/r/709256 (https://phabricator.wikimedia.org/T287636)
[03:15:31] PROBLEM - MariaDB memory on clouddb1019 is CRITICAL: CRIT Memory 98% used. Largest process: mysqld (18326) = 75.9% https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[03:28:19] PROBLEM - MariaDB memory on dbstore1007 is CRITICAL: CRIT Memory 95% used. Largest process: mysqld (6856) = 43.7% https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[03:43:39] PROBLEM - MariaDB memory on dbstore1007 is CRITICAL: CRIT Memory 95% used. Largest process: mysqld (6856) = 43.7% https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[04:58:53] !log oblivian@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[04:59:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:01:42] !log oblivian@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[05:01:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:17:06] (PS1) Marostegui: dbproxy1018: Depool clouddb1019 [puppet] - https://gerrit.wikimedia.org/r/709258
[05:18:54] (CR) Marostegui: [C: +2] dbproxy1018: Depool clouddb1019 [puppet] - https://gerrit.wikimedia.org/r/709258 (owner: Marostegui)
[05:21:40] (PS1) Marostegui: Revert "dbproxy1018: Depool clouddb1019" [puppet] - https://gerrit.wikimedia.org/r/709268
[05:22:09] RECOVERY - MariaDB memory on clouddb1019 is OK: OK Memory 1% used https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[05:23:38] (CR) Marostegui: [C: +2] Revert "dbproxy1018: Depool clouddb1019" [puppet] - https://gerrit.wikimedia.org/r/709268 (owner: Marostegui)
[05:25:07] (CR) Muehlenhoff: [C: +2] Remove access for gsingers [puppet] - https://gerrit.wikimedia.org/r/709017 (owner: Muehlenhoff)
[05:27:29] (PS1) Marostegui: Revert "mariadb: Move db1125 to m1" [puppet] - https://gerrit.wikimedia.org/r/709269
[05:28:23] (PS2) Marostegui: Revert "mariadb: Move db1125 to m1" [puppet] - https://gerrit.wikimedia.org/r/709269
[05:30:19] (CR) Marostegui: [C: +2] Revert "mariadb: Move db1125 to m1" [puppet] - https://gerrit.wikimedia.org/r/709269 (owner: Marostegui)
[05:31:37] (PS1) Muehlenhoff: Remove LDAP acesss for heather [puppet] - https://gerrit.wikimedia.org/r/709259
[05:34:07] (CR) Muehlenhoff: [C: +2] Remove LDAP acesss for heather [puppet] - https://gerrit.wikimedia.org/r/709259 (owner: Muehlenhoff)
[05:41:10] (PS1) Muehlenhoff: Remove LDAP access for amy-wmde [puppet] - https://gerrit.wikimedia.org/r/709261
[05:48:47] (CR) Muehlenhoff: [C: +2] Remove LDAP access for amy-wmde [puppet] - https://gerrit.wikimedia.org/r/709261 (owner: Muehlenhoff)
[05:52:32] (PS1) Muehlenhoff: Remove LDAP access for criley [puppet] - https://gerrit.wikimedia.org/r/709263
[05:56:20] (CR) Muehlenhoff: [C: +2] Remove LDAP access for criley [puppet] - https://gerrit.wikimedia.org/r/709263 (owner: Muehlenhoff)
[06:01:55] (PS1) Muehlenhoff: Remove access for jbol [puppet] - https://gerrit.wikimedia.org/r/709264
[06:02:24] (CR) jerkins-bot: [V: -1] Remove access for jbol [puppet] - https://gerrit.wikimedia.org/r/709264 (owner: Muehlenhoff)
[06:03:00] (PS2) Muehlenhoff: Remove LDAP access for jbol [puppet] - https://gerrit.wikimedia.org/r/709264
[06:07:34] (CR) Muehlenhoff: [C: +2] Remove LDAP access for jbol [puppet] - https://gerrit.wikimedia.org/r/709264 (owner: Muehlenhoff)
[06:29:42] (PS16) Elukey: WIP - Add kubeflow's kfserving charts [deployment-charts] - https://gerrit.wikimedia.org/r/700470 (https://phabricator.wikimedia.org/T272919)
[06:29:58] (PS17) Elukey: Add kubeflow's kfserving charts [deployment-charts] - https://gerrit.wikimedia.org/r/700470 (https://phabricator.wikimedia.org/T272919)
[06:34:04] PROBLEM - Check systemd state on puppetdb2002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_stockpile_queue.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:35:57] RECOVERY - Check systemd state on puppetdb2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:45:15] PROBLEM - MariaDB memory on dbstore1007 is CRITICAL: CRIT Memory 95% used. Largest process: mysqld (6856) = 43.7% https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[06:52:49] RECOVERY - MariaDB memory on dbstore1007 is OK: OK Memory 51% used https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[07:00:25] SRE, CommRel-Specialists-Support (Jul-Sep-2021), Datacenter-Switchover: CommRel support for September 2021 Switchover - https://phabricator.wikimedia.org/T287546 (MoritzMuehlenhoff) p:Triage→Medium
[07:00:43] SRE, Performance-Team, serviceops: WARNING: opcache cache-hit ratio is below 99.99% on multiple eqiad appservers and parsoid servers - https://phabricator.wikimedia.org/T287792 (MoritzMuehlenhoff) p:Triage→Medium
[07:01:40] SRE, ops-eqiad, cloud-services-team (Kanban): Degraded RAID on cloudcephosd1008 - https://phabricator.wikimedia.org/T287838 (MoritzMuehlenhoff) p:Triage→Medium
[07:02:58] SRE, serviceops: Clean up old Docker images on deneb - https://phabricator.wikimedia.org/T287222 (MoritzMuehlenhoff) p:High→Medium
[07:12:45] !log installing aspell security updates
[07:12:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:24:52] !log installing libsndfile security updates on buster
[07:24:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:39:23] (CR) Klausman: [C: +1] knative-serving: override KUBERNETES_SERVICE_HOST and update images [deployment-charts] - https://gerrit.wikimedia.org/r/708545 (https://phabricator.wikimedia.org/T278192) (owner: Elukey)
[07:39:39] (CR) Klausman: [C: +1] Drop compatibility for k8s 1.12 in CI checks [deployment-charts] - https://gerrit.wikimedia.org/r/708725 (owner: Elukey)
[07:43:54] (PS2) Filippo Giunchedi: hieradata: easier navigation for service::catalog [puppet] - https://gerrit.wikimedia.org/r/709011
[07:45:01] (CR) Filippo Giunchedi: [C: +2] hieradata: easier navigation for service::catalog [puppet] - https://gerrit.wikimedia.org/r/709011 (owner: Filippo Giunchedi)
[07:45:58] (CR) Filippo Giunchedi: [C: +2] pontoon: wait for puppetdb to be up before enabling it [puppet] - https://gerrit.wikimedia.org/r/708033 (owner: Filippo Giunchedi)
[07:47:00] (CR) Filippo Giunchedi: [C: +2] alertmanager: allow alerts from grafana and thanos-query [puppet] - https://gerrit.wikimedia.org/r/709030 (https://phabricator.wikimedia.org/T287142) (owner: Filippo Giunchedi)
[07:47:44] (CR) Filippo Giunchedi: [C: +2] pontoon: allow grafana and thanos to send alerts to am [puppet] - https://gerrit.wikimedia.org/r/709031 (https://phabricator.wikimedia.org/T287142) (owner: Filippo Giunchedi)
[07:48:13] (CR) Filippo Giunchedi: [C: +2] hieradata: add 'role' for prometheus service [puppet] - https://gerrit.wikimedia.org/r/709010 (owner: Filippo Giunchedi)
[07:48:21] (PS2) Filippo Giunchedi: hieradata: add 'role' for prometheus service [puppet] - https://gerrit.wikimedia.org/r/709010
[07:53:17] !log catch up bullseye installs with latest state of testing
[07:53:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:53:35] SRE, serviceops, Patch-For-Review: bring 43 new mediawiki appserver in eqiad into production - https://phabricator.wikimedia.org/T279309 (Jelto)
[08:08:38] SRE, serviceops: Clean up old Docker images on deneb - https://phabricator.wikimedia.org/T287222 (ema) I've done some cleaning in my home too, down to ~500M now.
[08:11:02] (CR) Ema: [C: +1] icinga: remove grafana alerts for Traffic, moved to alertmanager [puppet] - https://gerrit.wikimedia.org/r/708081 (https://phabricator.wikimedia.org/T282806) (owner: Filippo Giunchedi)
[08:23:29] (CR) Volans: [C: +2] "Thanks, LGTM" [software/spicerack] - https://gerrit.wikimedia.org/r/708763 (https://phabricator.wikimedia.org/T286206) (owner: Muehlenhoff)
[08:24:16] (CR) Volans: [C: +1] "LGTM" [cookbooks] - https://gerrit.wikimedia.org/r/708976 (https://phabricator.wikimedia.org/T286206) (owner: Muehlenhoff)
[08:24:25] (CR) Muehlenhoff: [V: +2] ganeti: Add ganeti test cluster to locations [software/spicerack] - https://gerrit.wikimedia.org/r/708763 (https://phabricator.wikimedia.org/T286206) (owner: Muehlenhoff)
[08:24:32] (CR) Filippo Giunchedi: "I'll pause this actually, the feature doesn't work in Karma ATM anyways" [puppet] - https://gerrit.wikimedia.org/r/709032 (https://phabricator.wikimedia.org/T284213) (owner: Filippo Giunchedi)
[08:25:13] (CR) Filippo Giunchedi: [C: +2] icinga: remove grafana alerts for Traffic, moved to alertmanager [puppet] - https://gerrit.wikimedia.org/r/708081 (https://phabricator.wikimedia.org/T282806) (owner: Filippo Giunchedi)
[08:25:18] (PS2) Filippo Giunchedi: icinga: remove grafana alerts for Traffic, moved to alertmanager [puppet] - https://gerrit.wikimedia.org/r/708081 (https://phabricator.wikimedia.org/T282806)
[08:26:23] (CR) Volans: [C: +1] "LGTM" [cookbooks] - https://gerrit.wikimedia.org/r/707457 (https://phabricator.wikimedia.org/T285802) (owner: Legoktm)
[08:32:41] (PS1) Muehlenhoff: Extend Cumin alias for Airflow [puppet] - https://gerrit.wikimedia.org/r/709377
[08:35:10] (CR) Volans: [C: -2] "Unfortunately is not that simple, and those records are the ones actually used, hence this patch can't be merged." [dns] - https://gerrit.wikimedia.org/r/708735 (owner: Muehlenhoff)
[08:38:27] (PS1) Giuseppe Lavagetto: deploy-mwdebug: run every 5 minutes [puppet] - https://gerrit.wikimedia.org/r/709378 (https://phabricator.wikimedia.org/T287570)
[08:41:09] (CR) Giuseppe Lavagetto: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30456/console" [puppet] - https://gerrit.wikimedia.org/r/709378 (https://phabricator.wikimedia.org/T287570) (owner: Giuseppe Lavagetto)
[08:51:54] (CR) Muehlenhoff: [C: +2] addnode cookbook: Also allow ganeti test cluster role [cookbooks] - https://gerrit.wikimedia.org/r/708976 (https://phabricator.wikimedia.org/T286206) (owner: Muehlenhoff)
[08:52:32] (CR) Elukey: [C: +2] knative-serving: override KUBERNETES_SERVICE_HOST and update images [deployment-charts] - https://gerrit.wikimedia.org/r/708545 (https://phabricator.wikimedia.org/T278192) (owner: Elukey)
[08:52:44] (PS4) Elukey: Drop compatibility for k8s 1.12 in CI checks [deployment-charts] - https://gerrit.wikimedia.org/r/708725
[08:57:15] (CR) Elukey: [C: +2] Drop compatibility for k8s 1.12 in CI checks [deployment-charts] - https://gerrit.wikimedia.org/r/708725 (owner: Elukey)
[08:57:18] (CR) Giuseppe Lavagetto: [C: +1] Drop compatibility for k8s 1.12 in CI checks [deployment-charts] - https://gerrit.wikimedia.org/r/708725 (owner: Elukey)
[08:58:26] SRE, ops-eqiad, DBA: Broken RAM on db1127 - https://phabricator.wikimedia.org/T286763 (Marostegui) p:Triage→Medium
[08:58:37] (PS1) Phuedx: vote: Enable Single Transferable Vote [mediawiki-config] - https://gerrit.wikimedia.org/r/709380 (https://phabricator.wikimedia.org/T283728)
[09:03:32] (PS2) Giuseppe Lavagetto: deploy-mwdebug: run every 5 minutes [puppet] - https://gerrit.wikimedia.org/r/709378 (https://phabricator.wikimedia.org/T287570)
[09:13:40] (CR) Giuseppe Lavagetto: [C: +2] deploy-mwdebug: run every 5 minutes [puppet] - https://gerrit.wikimedia.org/r/709378 (https://phabricator.wikimedia.org/T287570) (owner: Giuseppe Lavagetto)
[09:13:57] (CR) Filippo Giunchedi: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30457/console" [puppet] - https://gerrit.wikimedia.org/r/708521 (owner: David Caro)
[09:15:23] (CR) Filippo Giunchedi: "LGTM, though I think if we're supporting this way of passing the configuration then the config_file option should go. IMHO better to have " [puppet] - https://gerrit.wikimedia.org/r/709053 (owner: David Caro)
[09:21:26] (CR) David Caro: [C: -1] "> Patch Set 1:" [puppet] - https://gerrit.wikimedia.org/r/709053 (owner: David Caro)
[09:22:25] (CR) Elukey: Add kubeflow's kfserving charts (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/700470 (https://phabricator.wikimedia.org/T272919) (owner: Elukey)
[09:25:48] SRE, MW-on-K8s, Release-Engineering-Team, serviceops, Patch-For-Review: Ensure the code is deployed to mediawiki on k8s when it is deployed to production - https://phabricator.wikimedia.org/T287570 (Joe) The code should now be deployed when merged/built into an image within 5 minutes. I think...
[09:27:16] (PS1) Jelto: hiera::role::common::idp add gitlab-replica to production idp [puppet] - https://gerrit.wikimedia.org/r/709383 (https://phabricator.wikimedia.org/T285867)
[09:31:09] (CR) Jelto: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30458/console" [puppet] - https://gerrit.wikimedia.org/r/709383 (https://phabricator.wikimedia.org/T285867) (owner: Jelto)
[09:34:10] (PS1) Giuseppe Lavagetto: deploy-mwdebug: use proper syslog identifier [puppet] - https://gerrit.wikimedia.org/r/709384
[09:36:32] (CR) Jelto: [V: +1] "I would like to add gitlab-replica.wikimedia.org to production idp/CAS. Could you please take a look?" [puppet] - https://gerrit.wikimedia.org/r/709383 (https://phabricator.wikimedia.org/T285867) (owner: Jelto)
[09:36:36] (CR) Giuseppe Lavagetto: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/30460/console" [puppet] - https://gerrit.wikimedia.org/r/709384 (owner: Giuseppe Lavagetto)
[09:41:19] (CR) Giuseppe Lavagetto: [V: +1 C: +2] deploy-mwdebug: use proper syslog identifier [puppet] - https://gerrit.wikimedia.org/r/709384 (owner: Giuseppe Lavagetto)
[09:45:39] (PS1) Muehlenhoff: Add record for ganeti testcluster [dns] - https://gerrit.wikimedia.org/r/709386 (https://phabricator.wikimedia.org/T286206)
[09:48:16] jouncebot: next
[09:48:17] In 0 hour(s) and 41 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210802T1030)
[09:53:05] (CR) Muehlenhoff: "With the current config a login to gitlab-replica.w.o would be a separate session to a session at gitlab.w.o. Is that intended? Otherwise " [puppet] - https://gerrit.wikimedia.org/r/709383 (https://phabricator.wikimedia.org/T285867) (owner: Jelto)
[09:56:04] (Abandoned) Muehlenhoff: Remove ganeti01 SVC IPs in eqiad/codfw [dns] - https://gerrit.wikimedia.org/r/708735 (owner: Muehlenhoff)
[10:00:24] (CR) Volans: [C: +1] "LGTM" [dns] - https://gerrit.wikimedia.org/r/709386 (https://phabricator.wikimedia.org/T286206) (owner: Muehlenhoff)
[10:11:35] SRE, DynamicPageList (Wikimedia), MW-1.37-notes (1.37.0-wmf.16; 2021-07-26), Patch-For-Review, Sustainability (Incident Followup): Decide on the future of DPL - https://phabricator.wikimedia.org/T287380 (ssr) Please enable DPL at least at Main Page of RWN. This 1 page seem to be safe for serv...
[10:11:58] (PS1) Giuseppe Lavagetto: trafficserver::text: open mwdebug on k8s again [puppet] - https://gerrit.wikimedia.org/r/709392 (https://phabricator.wikimedia.org/T283056)
[10:16:53] (PS18) Elukey: Add kubeflow's kfserving charts [deployment-charts] - https://gerrit.wikimedia.org/r/700470 (https://phabricator.wikimedia.org/T272919)
[10:17:20] (CR) Muehlenhoff: [C: +2] Add record for ganeti testcluster [dns] - https://gerrit.wikimedia.org/r/709386 (https://phabricator.wikimedia.org/T286206) (owner: Muehlenhoff)
[10:17:27] (CR) Elukey: Add kubeflow's kfserving charts (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/700470 (https://phabricator.wikimedia.org/T272919) (owner: Elukey)
[10:19:56] (CR) Elukey: Add kubeflow's kfserving charts (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/700470 (https://phabricator.wikimedia.org/T272919) (owner: Elukey)
[10:23:59] (PS19) Elukey: Add kubeflow's kfserving charts [deployment-charts] - https://gerrit.wikimedia.org/r/700470 (https://phabricator.wikimedia.org/T272919)
[10:24:07] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=rails site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[10:24:54] ^ 👀
[10:26:01] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[10:28:05] job=temporarily_derailed
[10:28:21] kormat: it's ruby, hardly temporarily
[10:28:41] (Abandoned) David Caro: profile.icinga_exporter: Add label_teams_config_file param [puppet] - https://gerrit.wikimedia.org/r/709052 (owner: David Caro)
[10:28:49] (Abandoned) David Caro: prometheus.icinga-exporter-am: support --labels.team.config-file [puppet] - https://gerrit.wikimedia.org/r/708521 (owner: David Caro)
[10:29:03] (PS2) David Caro: prometheus.icinga_exporter: Add label_teams_config parameter [puppet] - https://gerrit.wikimedia.org/r/709053
[10:29:05] (PS3) David Caro: profile.icinga_exporter: Added label_teams_config parameter [puppet] - https://gerrit.wikimedia.org/r/709054
[10:29:33] (PS6) Muehlenhoff: Default nginx::profile to light flavour [puppet] - https://gerrit.wikimedia.org/r/702669 (https://phabricator.wikimedia.org/T164456)
[10:29:48] (PS4) David Caro: profile.icinga_exporter: Added label_teams_config parameter [puppet] - https://gerrit.wikimedia.org/r/709054
[10:29:55] (CR) David Caro: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/709054 (owner: David Caro)
[10:30:04] jan_drewniak: #bothumor My software never has bugs. It just develops random features. Rise for Wikimedia Portals Update. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210802T1030).
[10:30:52] (CR) David Caro: prometheus.icinga_exporter: Add label_teams_config parameter (1 comment) [puppet] - https://gerrit.wikimedia.org/r/709053 (owner: David Caro)
[10:31:44] joe: you are right. the reduced availability happens quite consistent every ~25 hours for some minutes. I'm currently investigating whats happening there and add it to the gitlab monitoring task (T275170) : https://grafana.wikimedia.org/d/R_1IvBZnz/gitlab-omnibus-overview?viewPanel=15&orgId=1&refresh=1m&from=now-5d&to=now
[10:31:44] T275170: Define monitoring for gitlab - https://phabricator.wikimedia.org/T275170
[10:46:35] (PS1) Jcrespo: dbbackups: Purge db1139:s2, old eqiad stretch backup source [puppet] - https://gerrit.wikimedia.org/r/709393 (https://phabricator.wikimedia.org/T287230)
[10:58:15] (PS1) Jcrespo: dbbackups: Move s4 from db1145 to db1139 and reimage db1145 to busterw [puppet] - https://gerrit.wikimedia.org/r/709395 (https://phabricator.wikimedia.org/T280979)
[11:00:05] Amir1, Lucas_WMDE, awight, and Urbanecm: (Dis)respected human, time to deploy European mid-day backport windowYour patch may or may not be deployed at the sole discretion of the deployer (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210802T1100). Please do the needful.
[11:00:05] mepps, eigyan, and phuedx: A patch you scheduled for European mid-day backport windowYour patch may or may not be deployed at the sole discretion of the deployer is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[11:00:17] o/
[11:00:18] o/
[11:00:19] o/
[11:00:24] o/
[11:00:29] I can deploy today
[11:00:31] ok
[11:01:42] (PS4) Urbanecm: wmf-config: Restore logging for mediamoderation script to better understand high error rate occurring when running script [mediawiki-config] - https://gerrit.wikimedia.org/r/708815 (https://phabricator.wikimedia.org/T287511) (owner: Eigyan)
[11:01:47] (CR) Urbanecm: [C: +2] wmf-config: Restore logging for mediamoderation script to better understand high error rate occurring when running script [mediawiki-config] - https://gerrit.wikimedia.org/r/708815 (https://phabricator.wikimedia.org/T287511) (owner: Eigyan)
[11:02:17] mepps: I'm going to deploy your patch straightly, as there's no way to test it.
[11:02:30] sounds good urbanecm
[11:02:50] (Merged) jenkins-bot: wmf-config: Restore logging for mediamoderation script to better understand high error rate occurring when running script [mediawiki-config] - https://gerrit.wikimedia.org/r/708815 (https://phabricator.wikimedia.org/T287511) (owner: Eigyan)
[11:03:21] (PS2) Urbanecm: votewiki: Enable Single Transferable Vote [mediawiki-config] - https://gerrit.wikimedia.org/r/709380 (https://phabricator.wikimedia.org/T283728) (owner: Phuedx)
[11:03:29] (PS3) Urbanecm: votewiki: Enable Single Transferable Vote [mediawiki-config] - https://gerrit.wikimedia.org/r/709380 (https://phabricator.wikimedia.org/T283728) (owner: Phuedx)
[11:03:35] (CR) Urbanecm: [C: +2] votewiki: Enable Single Transferable Vote [mediawiki-config] - https://gerrit.wikimedia.org/r/709380 (https://phabricator.wikimedia.org/T283728) (owner: Phuedx)
[11:04:13] urbanecm: I have an account on votewiki that I can test that change with
[11:04:45] phuedx: excellent 🙂. I'll ping you when ready for testing.
[11:05:06] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 26bcaafdcd57b1b7a78f9e0ad000325baaf36a72: Restore logging for mediamoderation script to better understand high error rate occurring when running script (T287511) (duration: 00m 57s)
[11:05:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:05:15] T287511: Investigate MediaModeration failures - https://phabricator.wikimedia.org/T287511
[11:05:35] (Merged) jenkins-bot: votewiki: Enable Single Transferable Vote [mediawiki-config] - https://gerrit.wikimedia.org/r/709380 (https://phabricator.wikimedia.org/T283728) (owner: Phuedx)
[11:05:40] mepps: should be live! The beta part will be deployed automatically soon, within 30 minutes.
[11:05:49] yay, thanks urbanecm!
[11:05:54] phuedx: please test at mwdebug2001 and let me know
[11:05:56] any time mepps
[11:08:19] !log installing openjdk-11 security updates
[11:08:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:08:41] urbanecm: Thanks. I see the "Single transferable vote with Droop quota" option available on Special:SecurePoll/create on mwdebug2001. LGTM
[11:08:47] Great, syncing
[11:10:02] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 43020b72e8f466188d738aa73f2023f3017804d0: votewiki: Enable Single Transferable Vote (T283728) (duration: 00m 57s)
[11:10:08] phuedx: should be live
[11:10:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:10:10] T283728: Implement STV tallying in STVTallier::finishTally [XL] - https://phabricator.wikimedia.org/T283728
[11:10:10] Thanks, urbanecm
[11:10:10] anything else?
[11:12:58] doesn't look so
[11:13:02] !log EU B&C window completed
[11:13:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:14:00] (CR) jerkins-bot: [V: -1] Localisation updates from https://translatewiki.net. [software/mailman-templates] - https://gerrit.wikimedia.org/r/709403 (owner: L10n-bot)
[11:14:59] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[11:15:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:16:10] SRE, DBA, Infrastructure-Foundations, Traffic, and 2 others: Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (Marostegui)
[11:16:55] !log dzahn@cumin1001 conftool action : set/pooled=inactive; selector: name=mw1271.eqiad.wmnet
[11:17:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:17:00] (PS1) Marostegui: db1183: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/709410 (https://phabricator.wikimedia.org/T287852)
[11:17:15] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[11:17:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:17:51] (CR) Marostegui: [C: +2] db1183: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/709410 (https://phabricator.wikimedia.org/T287852) (owner: Marostegui)
[11:18:44] !log dzahn@cumin1001 conftool action : set/pooled=inactive; selector: name=mw1288.eqiad.wmnet
[11:18:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:21:59] SRE, DBA, Infrastructure-Foundations, Traffic, and 2 others: Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (Majavah)
[11:22:17] SRE, DBA, Infrastructure-Foundations, Traffic, and 2 others: Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (Majavah)
[11:23:40] SRE, Analytics, DBA, Infrastructure-Foundations, and 2 others: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (Majavah)
[11:24:11] (PS1) Dzahn: site/conftool: decom mw1271 and mw1288, formerly mcrouter proxies [puppet] - https://gerrit.wikimedia.org/r/709411 (https://phabricator.wikimedia.org/T280203)
[11:24:34] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[11:24:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:25:04] (PS1) Marostegui: db1183: Move it from m5 to m2 [puppet] - https://gerrit.wikimedia.org/r/709412 (https://phabricator.wikimedia.org/T287852)
[11:26:06] (CR) Jelto: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/709411 (https://phabricator.wikimedia.org/T280203) (owner: Dzahn)
[11:26:51] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[11:26:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:27:19] !log restarting Jenkins on contint1001
[11:27:24] !log restarting Jenkins on contint2001
[11:27:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:27:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:28:34] (PS1) Reedy: Fix call to PageUpdater::saveRevision() [extensions/SecurePoll] (wmf/1.37.0-wmf.16) - https://gerrit.wikimedia.org/r/709278 (https://phabricator.wikimedia.org/T287782)
[11:28:59] jouncebot: now
[11:28:59] For the next 0 hour(s) and 31 minute(s): European mid-day backport windowYour patch may or may not be deployed at the sole discretion of the deployer (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210802T1100)
[11:29:03] !log restarting gerrit primary server on gerrit1001
[11:29:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:29:18] lol
[11:29:24] got a 503 literally as that happened
[11:29:42] (CR) Marostegui: [C: +1] dbbackups: Purge db1139:s2, old eqiad stretch backup source [puppet] - https://gerrit.wikimedia.org/r/709393 (https://phabricator.wikimedia.org/T287230) (owner: Jcrespo)
[11:31:26] (CR) Reedy: [C: +2] Fix call to PageUpdater::saveRevision() [extensions/SecurePoll] (wmf/1.37.0-wmf.16) - https://gerrit.wikimedia.org/r/709278 (https://phabricator.wikimedia.org/T287782) (owner: Reedy)
[11:31:36] !log dzahn@cumin1001 START - Cookbook sre.hosts.decommission for hosts mw1271.eqiad.wmnet
[11:31:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:37:46] (Merged) jenkins-bot: Fix call to PageUpdater::saveRevision() [extensions/SecurePoll] (wmf/1.37.0-wmf.16) - https://gerrit.wikimedia.org/r/709278 (https://phabricator.wikimedia.org/T287782) (owner: Reedy)
[11:40:27] !log reedy@deploy1002 Synchronized php-1.37.0-wmf.16/extensions/SecurePoll/: T287782 (duration: 00m 56s)
[11:40:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:40:35] T287782: Can't add or edit translations in SecurePoll - https://phabricator.wikimedia.org/T287782
[11:41:23] Reedy: are you going to deploy https://gerrit.wikimedia.org/r/c/mediawiki/extensions/SecurePoll/+/709247 too? or do you want a less hacky solution :D
[11:41:43] lol
[11:42:06] I suspect it may get deployed for the elections, pending a better patch coming later
[11:42:25] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mw1271.eqiad.wmnet
[11:42:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:42:33] SRE, decommission-hardware, serviceops, Patch-For-Review: decom 44 eqiad appservers purchased on 2016-04-12/13 (mw1261 through mw1301) - https://phabricator.wikimedia.org/T280203 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `mw1271.eqiad.wmnet` - m...
[11:42:49] yeah, I don't like it myself either but it's certainly better than delaying the elections
[11:42:55] !log dzahn@cumin1001 START - Cookbook sre.hosts.decommission for hosts mw1288.eqiad.wmnet
[11:42:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:43:37] any chance i could still get a backport deployed in this window? https://gerrit.wikimedia.org/r/708990 sorry i'm late
[11:44:48] bad MatmaRex
[11:44:56] thankfully, jerkins is quick at merging stuff atm
[11:45:03] (CR) Reedy: [C: +2] Styling fixes for mobile visual editor (and editor loading overlay) [extensions/MobileFrontend] (wmf/1.37.0-wmf.16) - https://gerrit.wikimedia.org/r/708990 (https://phabricator.wikimedia.org/T287528) (owner: Jdlrobson)
[11:45:05] (CR) Kormat: [C: +1] db1183: Move it from m5 to m2 [puppet] - https://gerrit.wikimedia.org/r/709412 (https://phabricator.wikimedia.org/T287852) (owner: Marostegui)
[11:45:11] thanks Reedy
[11:45:15] (CR) Marostegui: [C: +2] db1183: Move it from m5 to m2 [puppet] - https://gerrit.wikimedia.org/r/709412 (https://phabricator.wikimedia.org/T287852) (owner: Marostegui)
[11:45:20] i'll add it to the calendar to be proper
[11:45:29] I'm not so fussed about that :P
[11:46:46] Reedy: I was just about to +2 that change. Thanks!
[11:46:48] (03PS1) 10Reedy: Skip wikis without SecurePoll in FormStore::getWikiList() [extensions/SecurePoll] (wmf/1.36.0-wmf.17) - 10https://gerrit.wikimedia.org/r/709450 (https://phabricator.wikimedia.org/T287780) [11:46:57] (03CR) 10Reedy: [C: 03+2] Skip wikis without SecurePoll in FormStore::getWikiList() [extensions/SecurePoll] (wmf/1.36.0-wmf.17) - 10https://gerrit.wikimedia.org/r/709450 (https://phabricator.wikimedia.org/T287780) (owner: 10Reedy) [11:47:06] (03CR) 10Kormat: [C: 03+2] mariadb: Drop absented cron in check_private_data [puppet] - 10https://gerrit.wikimedia.org/r/705901 (https://phabricator.wikimedia.org/T273673) (owner: 10Ladsgroup) [11:49:44] (03CR) 10jerkins-bot: [V: 04-1] Skip wikis without SecurePoll in FormStore::getWikiList() [extensions/SecurePoll] (wmf/1.36.0-wmf.17) - 10https://gerrit.wikimedia.org/r/709450 (https://phabricator.wikimedia.org/T287780) (owner: 10Reedy) [11:50:24] um, wat [11:50:57] composer fails to install mediawiki/mediawiki-codesniffer 37.0.0 somehow [11:51:25] and there's a phan failure too in the actual phan job [11:51:49] why did those not fail on master? [11:51:49] PROBLEM - haproxy failover on dbproxy1015 is CRITICAL: CRITICAL check_failover servers up 1 down 1 https://wikitech.wikimedia.org/wiki/HAProxy [11:51:59] why wmf.17? 
[11:52:03] 1.36 [11:52:03] lmao [11:52:08] (03Abandoned) 10Reedy: Skip wikis without SecurePoll in FormStore::getWikiList() [extensions/SecurePoll] (wmf/1.36.0-wmf.17) - 10https://gerrit.wikimedia.org/r/709450 (https://phabricator.wikimedia.org/T287780) (owner: 10Reedy) [11:52:20] (03PS1) 10Reedy: Skip wikis without SecurePoll in FormStore::getWikiList() [extensions/SecurePoll] (wmf/1.37.0-wmf.16) - 10https://gerrit.wikimedia.org/r/709451 (https://phabricator.wikimedia.org/T287780) [11:52:29] (03CR) 10Reedy: [C: 03+2] Skip wikis without SecurePoll in FormStore::getWikiList() [extensions/SecurePoll] (wmf/1.37.0-wmf.16) - 10https://gerrit.wikimedia.org/r/709451 (https://phabricator.wikimedia.org/T287780) (owner: 10Reedy) [11:52:32] ah [11:52:38] PROBLEM - haproxy failover on dbproxy1013 is CRITICAL: CRITICAL check_failover servers up 1 down 1 https://wikitech.wikimedia.org/wiki/HAProxy [11:52:40] bloody autocomplete [11:53:20] ^ me [11:53:48] you're doing my autocomplete? ;) [11:53:53] haha [11:54:02] ACKNOWLEDGEMENT - haproxy failover on dbproxy1013 is CRITICAL: CRITICAL check_failover servers up 1 down 1 Marostegui known https://wikitech.wikimedia.org/wiki/HAProxy [11:54:02] ACKNOWLEDGEMENT - haproxy failover on dbproxy1015 is CRITICAL: CRITICAL check_failover servers up 1 down 1 Marostegui known https://wikitech.wikimedia.org/wiki/HAProxy [11:54:28] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mw1288.eqiad.wmnet [11:54:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:54:36] 10SRE, 10decommission-hardware, 10serviceops, 10Patch-For-Review: decom 44 eqiad appservers purchased on 2016-04-12/13 (mw1261 through mw1301) - https://phabricator.wikimedia.org/T280203 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `mw1288.eqiad.wmnet` - m... 
[11:55:47] (03PS2) 10Dzahn: site/conftool: decom mw1271 and mw1288, formerly mcrouter proxies [puppet] - 10https://gerrit.wikimedia.org/r/709411 (https://phabricator.wikimedia.org/T280203) [11:57:01] PROBLEM - haproxy failover on dbproxy1017 is CRITICAL: CRITICAL check_failover servers up 1 down 1 https://wikitech.wikimedia.org/wiki/HAProxy [11:57:51] RECOVERY - haproxy failover on dbproxy1013 is OK: OK check_failover servers up 1 down 0 https://wikitech.wikimedia.org/wiki/HAProxy [11:57:53] * mutante checks if it's a good time to add a semi-risky gerrit change :p [11:57:58] RECOVERY - haproxy failover on dbproxy1015 is OK: OK check_failover servers up 1 down 0 https://wikitech.wikimedia.org/wiki/HAProxy [11:58:41] (03CR) 10Dzahn: [C: 03+2] site/conftool: decom mw1271 and mw1288, formerly mcrouter proxies [puppet] - 10https://gerrit.wikimedia.org/r/709411 (https://phabricator.wikimedia.org/T280203) (owner: 10Dzahn) [11:59:46] 10SRE, 10decommission-hardware, 10serviceops, 10Patch-For-Review: decom 44 eqiad appservers purchased on 2016-04-12/13 (mw1261 through mw1301) - https://phabricator.wikimedia.org/T280203 (10Dzahn) [11:59:55] 10SRE, 10decommission-hardware, 10serviceops, 10Patch-For-Review: decom 44 eqiad appservers purchased on 2016-04-12/13 (mw1261 through mw1301) - https://phabricator.wikimedia.org/T280203 (10Dzahn) a:03Dzahn [12:01:09] PROBLEM - haproxy failover on dbproxy1013 is CRITICAL: CRITICAL check_failover servers up 1 down 1 https://wikitech.wikimedia.org/wiki/HAProxy [12:01:15] PROBLEM - haproxy failover on dbproxy1015 is CRITICAL: CRITICAL check_failover servers up 1 down 1 https://wikitech.wikimedia.org/wiki/HAProxy [12:05:25] PROBLEM - haproxy failover on dbproxy1014 is CRITICAL: CRITICAL check_failover servers up 2 down 1 https://wikitech.wikimedia.org/wiki/HAProxy [12:05:30] 10SRE, 10decommission-hardware, 10serviceops, 10Patch-For-Review: decom 44 eqiad appservers purchased on 2016-04-12/13 (mw1261 through mw1301) - 
https://phabricator.wikimedia.org/T280203 (10Dzahn) 05Open→03Stalled everything is done except 4 canary API servers and decom'ing these is blocked by T273915 [12:05:54] (03Merged) 10jenkins-bot: Styling fixes for mobile visual editor (and editor loading overlay) [extensions/MobileFrontend] (wmf/1.37.0-wmf.16) - 10https://gerrit.wikimedia.org/r/708990 (https://phabricator.wikimedia.org/T287528) (owner: 10Jdlrobson) [12:05:57] (03Merged) 10jenkins-bot: Skip wikis without SecurePoll in FormStore::getWikiList() [extensions/SecurePoll] (wmf/1.37.0-wmf.16) - 10https://gerrit.wikimedia.org/r/709451 (https://phabricator.wikimedia.org/T287780) (owner: 10Reedy) [12:06:21] PROBLEM - haproxy failover on dbproxy1021 is CRITICAL: CRITICAL check_failover servers up 1 down 1 https://wikitech.wikimedia.org/wiki/HAProxy [12:06:36] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops-radar: (Need By: TBD) rack/setup/install mw14[14-56] - https://phabricator.wikimedia.org/T273915 (10Dzahn) all decoms in T280203 are done except 4 canary API server and those are now blocked by this ticket [12:07:02] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops-radar: (Need By: TBD) rack/setup/install mw14[14-56] - https://phabricator.wikimedia.org/T273915 (10Dzahn) p:05Medium→03High [12:07:28] jouncebot: now [12:07:28] No deployments scheduled for the next 4 hour(s) and 52 minute(s) [12:08:15] Reedy: ^ the backports merged [12:08:23] Yeah, I know [12:08:27] scap isn't instant :P [12:08:32] ah [12:08:44] !log reedy@deploy1002 Synchronized php-1.37.0-wmf.16/extensions/SecurePoll/: T287780 (duration: 00m 57s) [12:08:45] PROBLEM - haproxy failover on dbproxy1012 is CRITICAL: CRITICAL check_failover servers up 1 down 1 https://wikitech.wikimedia.org/wiki/HAProxy [12:08:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:08:51] T287780: SecurePoll CreatePage can no longer correctly select "remote" wiki databases that aren't in the same cluster - https://phabricator.wikimedia.org/T287780 
[12:10:31] !log reedy@deploy1002 Synchronized php-1.37.0-wmf.16/extensions/MobileFrontend/: T287528 (duration: 00m 57s) [12:10:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:10:38] T287528: [regression-wmf.16] mobile - VE toolbar displayed incorrectly - https://phabricator.wikimedia.org/T287528 [12:11:13] (03PS1) 10MMandere: rake_modules: Add drmrs DC site [puppet] - 10https://gerrit.wikimedia.org/r/709429 (https://phabricator.wikimedia.org/T282787) [12:13:35] RECOVERY - haproxy failover on dbproxy1013 is OK: OK check_failover servers up 2 down 0 https://wikitech.wikimedia.org/wiki/HAProxy [12:13:41] RECOVERY - haproxy failover on dbproxy1021 is OK: OK check_failover servers up 1 down 0 https://wikitech.wikimedia.org/wiki/HAProxy [12:13:43] RECOVERY - haproxy failover on dbproxy1015 is OK: OK check_failover servers up 2 down 0 https://wikitech.wikimedia.org/wiki/HAProxy [12:15:47] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/709429 (https://phabricator.wikimedia.org/T282787) (owner: 10MMandere) [12:16:52] Reedy: should my backport be live now, or not yet? i'm still seeing the old code [12:17:02] it's fully live [12:17:07] RL caches etc? [12:17:54] dunno. 
i'll look into it [12:18:01] just wanted to ask first if it's deployed [12:18:05] (03CR) 10Dzahn: [C: 03+1] rake_modules: Add drmrs DC site [puppet] - 10https://gerrit.wikimedia.org/r/709429 (https://phabricator.wikimedia.org/T282787) (owner: 10MMandere) [12:18:07] PROBLEM - haproxy failover on dbproxy1016 is CRITICAL: CRITICAL check_failover servers up 1 down 1 https://wikitech.wikimedia.org/wiki/HAProxy [12:18:13] It is [12:18:31] PROBLEM - haproxy failover on dbproxy1013 is CRITICAL: CRITICAL check_failover servers up 1 down 1 https://wikitech.wikimedia.org/wiki/HAProxy [12:20:07] !log gerrit servers: disabling puppet [12:20:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:21:45] PROBLEM - haproxy failover on dbproxy1021 is CRITICAL: CRITICAL check_failover servers up 1 down 1 https://wikitech.wikimedia.org/wiki/HAProxy [12:21:46] PROBLEM - haproxy failover on dbproxy1015 is CRITICAL: CRITICAL check_failover servers up 1 down 1 https://wikitech.wikimedia.org/wiki/HAProxy [12:23:21] i guess it was cached, looks good now [12:23:30] (03CR) 10Dzahn: [C: 03+2] gerrit: listen on all address with iptables rule [puppet] - 10https://gerrit.wikimedia.org/r/706049 (https://phabricator.wikimedia.org/T287122) (owner: 10Hashar) [12:23:34] thanks for deploying [12:23:42] (03PS3) 10Dzahn: gerrit: listen on all address with iptables rule [puppet] - 10https://gerrit.wikimedia.org/r/706049 (https://phabricator.wikimedia.org/T287122) (owner: 10Hashar) [12:25:44] RECOVERY - haproxy failover on dbproxy1013 is OK: OK check_failover servers up 2 down 0 https://wikitech.wikimedia.org/wiki/HAProxy [12:25:52] RECOVERY - haproxy failover on dbproxy1021 is OK: OK check_failover servers up 2 down 0 https://wikitech.wikimedia.org/wiki/HAProxy [12:25:52] RECOVERY - haproxy failover on dbproxy1015 is OK: OK check_failover servers up 2 down 0 https://wikitech.wikimedia.org/wiki/HAProxy [12:26:06] RECOVERY - haproxy failover on dbproxy1014 is OK: OK 
check_failover servers up 2 down 0 https://wikitech.wikimedia.org/wiki/HAProxy [12:26:06] RECOVERY - haproxy failover on dbproxy1012 is OK: OK check_failover servers up 2 down 0 https://wikitech.wikimedia.org/wiki/HAProxy [12:26:12] RECOVERY - haproxy failover on dbproxy1016 is OK: OK check_failover servers up 2 down 0 https://wikitech.wikimedia.org/wiki/HAProxy [12:27:32] hashar: do you know why cloning from gerrit-replica does not seem to work anymore? [12:27:39] this is before merging the change [12:27:51] but I want to be able to confirm cloning works before and after [12:28:05] pretty sure a bunch of tools were pulling from replica [12:28:21] hi [12:28:31] maybe I broke it earlier today when restarting [12:28:52] Permission denied (publickey). [12:29:02] when I remove "replica" from the command it works normal [12:29:39] not sure it works over ssh [12:30:24] https://gerrit-replica.wikimedia.org/r/ [12:30:24] PROBLEM - haproxy failover on dbproxy1021 is CRITICAL: CRITICAL check_failover servers up 1 down 1 https://wikitech.wikimedia.org/wiki/HAProxy [12:30:25] PROBLEM - haproxy failover on dbproxy1015 is CRITICAL: CRITICAL check_failover servers up 1 down 1 https://wikitech.wikimedia.org/wiki/HAProxy [12:30:38] PROBLEM - haproxy failover on dbproxy1014 is CRITICAL: CRITICAL check_failover servers up 1 down 1 https://wikitech.wikimedia.org/wiki/HAProxy [12:30:40] PROBLEM - haproxy failover on dbproxy1012 is CRITICAL: CRITICAL check_failover servers up 1 down 1 https://wikitech.wikimedia.org/wiki/HAProxy [12:30:45] PROBLEM - haproxy failover on dbproxy1016 is CRITICAL: CRITICAL check_failover servers up 1 down 1 https://wikitech.wikimedia.org/wiki/HAProxy [12:30:55] git fetch https://gerrit-replica.wikimedia.org/r/mediawiki/core.git [12:30:57] that one works [12:31:05] mutante: seems like there is no ssh access available [12:31:22] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/709429 
(https://phabricator.wikimedia.org/T282787) (owner: 10MMandere) [12:31:52] uhm.. but it listens. unfortunate that we can't test first on the replica then [12:32:06] hmm no [12:32:12] that works fine over ssh [12:32:29] maybe a local ssh config difference? There is an auth failure listed for you on gerrit2001 [12:32:41] oh.. hmm.. sec [12:32:49] dzahn vs mutante maybe? [12:33:20] or your ssh config has some: `Host gerrit.wikimedia.org` which would need to be amended for gerrit-replica [12:33:27] Host gerrit gerrit-test [12:33:36] trying to fix [12:34:14] still permission denied.. hmm [12:34:26] RECOVERY - haproxy failover on dbproxy1014 is OK: OK check_failover servers up 1 down 0 https://wikitech.wikimedia.org/wiki/HAProxy [12:34:26] RECOVERY - haproxy failover on dbproxy1012 is OK: OK check_failover servers up 1 down 0 https://wikitech.wikimedia.org/wiki/HAProxy [12:35:24] ah, also need the negation further up, !gerrit-replica . got it! [12:35:33] (03CR) 10Vgutierrez: [C: 03+1] rake_modules: Add drmrs DC site [puppet] - 10https://gerrit.wikimedia.org/r/709429 (https://phabricator.wikimedia.org/T282787) (owner: 10MMandere) [12:35:43] hashar: ok, so I disabled puppet on both, will merge, then re-enable first on replica [12:36:03] confirming that it still listens on both .. 
then re-enable prod [12:37:12] will need a proper service restart to really check, I will do that (only) on replica [12:38:06] !log gerrit2001 - restarting gerrit after deploying 706049 [12:38:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:38:50] RECOVERY - haproxy failover on dbproxy1016 is OK: OK check_failover servers up 2 down 0 https://wikitech.wikimedia.org/wiki/HAProxy [12:40:10] hashar: I can still clone but it looks like now it listens only on IPv6 [12:40:16] not both like before [12:40:46] 10SRE, 10DC-Ops, 10Traffic, 10Sustainability (Incident Followup): Audit eqiad & codfw LVS network links - https://phabricator.wikimedia.org/T286881 (10Papaul) @Vgutierrez do you have time today 10am CT to move only lvs2010 ens2f1np1 xe-2/0/44 asw-a4-codfw from A2 to a4 xe-4/0/47 [12:40:46] that's the part I wanted to make sure [12:43:03] (03PS1) 10Volans: CHANGELOG: add changelogs for release v0.0.57 [software/spicerack] - 10https://gerrit.wikimedia.org/r/709432 [12:43:05] (03CR) 10Dzahn: "I confirmed I could clone from gerrit-replica via ssh, disabled puppet on both gerrit servers, merged, re-enabled puppet on gerrit2001 and" [puppet] - 10https://gerrit.wikimedia.org/r/706049 (https://phabricator.wikimedia.org/T287122) (owner: 10Hashar) [12:43:42] mutante: gerrit2001 right? [12:43:58] hashar: that's right. you can telnet to 29418, but that is "Trying ::1... [12:44:13] sudo netstat -tulpen | grep 29418 [12:44:15] 10SRE, 10DC-Ops, 10Traffic, 10Sustainability (Incident Followup): Audit eqiad & codfw LVS network links - https://phabricator.wikimedia.org/T286881 (10Vgutierrez) @Papaul we need to coordinate with @ayounsi or @cmooney to let them configure the ports on asw-a4-codfw. For me it's basically a NOOP on lvs201... 
[12:44:16] on 2001 vs 1001 [12:45:02] before the change it was listening on 208.80.153.107:29418 as well, but not now [12:45:37] on gerrit2001 I see that is listening on *:29418 [12:45:59] or is that just ipv6 maybe [12:46:02] I don't, i just see tcp6 [12:46:05] :::* [12:46:33] (03CR) 10MMandere: [C: 03+2] rake_modules: Add drmrs DC site [puppet] - 10https://gerrit.wikimedia.org/r/709429 (https://phabricator.wikimedia.org/T282787) (owner: 10MMandere) [12:46:45] I swear it worked locally [12:47:07] I must have messed up my test case [12:47:16] let's test it on the cloud instance? [12:47:23] ... [12:47:31] eh, nevermind, np IPv6 support [12:47:40] it is broken last time I checked and cloud has no v6 :) [12:47:50] why is it broken again? [12:48:00] did some prod changes not get applied there? [12:48:35] java/gerrit are broken [12:48:51] *:29418 does not mean every protocol [12:48:56] and ipv6 is favored over v4 [12:49:08] *nod* [12:49:14] 10SRE, 10DC-Ops, 10Traffic, 10Sustainability (Incident Followup): Audit eqiad & codfw LVS network links - https://phabricator.wikimedia.org/T286881 (10Papaul) I can set up the interface on as-a4-codfw [12:50:18] 10SRE, 10DC-Ops, 10Traffic, 10Sustainability (Incident Followup): Audit eqiad & codfw LVS network links - https://phabricator.wikimedia.org/T286881 (10Vgutierrez) even better then :) [12:50:19] can't we avoid running the --init command? 
[12:50:58] (03CR) 10Volans: [C: 03+2] CHANGELOG: add changelogs for release v0.0.57 [software/spicerack] - 10https://gerrit.wikimedia.org/r/709432 (owner: 10Volans) [12:51:13] hmm [12:51:14] hold on [12:51:29] nc 208.80.153.106 29418 [12:51:29] SSH-2.0-GerritCodeReview_3.2.11 (APACHE-SSHD-2.4.0) [12:52:47] mutante: so that works for me both with v4 and v6 [12:53:11] matching the behavior I have tested locally [12:54:38] (03CR) 10Klausman: [C: 03+1] Add kubeflow's kfserving charts (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/700470 (https://phabricator.wikimedia.org/T272919) (owner: 10Elukey) [12:54:40] PROBLEM - haproxy failover on dbproxy1020 is CRITICAL: CRITICAL check_failover servers up 1 down 1 https://wikitech.wikimedia.org/wiki/HAProxy [12:55:24] mutante: so that is working ;) [12:55:56] hashar: hmm.. I can confirm that with telnet to the IPv4 IP, but netstat output is not like before [12:56:32] (03Merged) 10jenkins-bot: CHANGELOG: add changelogs for release v0.0.57 [software/spicerack] - 10https://gerrit.wikimedia.org/r/709432 (owner: 10Volans) [12:56:34] RECOVERY - haproxy failover on dbproxy1020 is OK: OK check_failover servers up 1 down 0 https://wikitech.wikimedia.org/wiki/HAProxy [12:57:03] "output of netstat doesn't mean Apache is not listening on IPv4 address. It's a IPv4-mapped IPv6 address." :p [12:57:18] hashar: guess it's normal :) [12:58:31] and looks like iptables is doing its job [12:58:36] I guess the reason the listening connections are only listed as tcp6 sockets is because they really are IPv6 sockets, but with the additional feature that they also accept IPv4 connections, if configured to do so. The sockets are bound to INADDR_ANY6, and when a IPv4 connection comes in the address is mapped to an IPv6 address with the prefix ::ffff:0000/96. [12:59:28] hashar: ok, let's go ahead with 1001 then? 
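The IPv4-mapped addressing described at 12:58:36 can be sketched roughly as follows. This is a minimal illustration using only Python's standard `ipaddress` module, not a description of how Gerrit, Apache SSHD, or the kernel implement it. The helper names `to_mapped` and `from_mapped` are hypothetical, the sample address is the one quoted at 12:45:02, and the prefix is written in its conventional form `::ffff:0:0/96`.

```python
import ipaddress

# The IPv4-mapped IPv6 prefix (RFC 4291, section 2.5.5.2).
MAPPED_PREFIX = ipaddress.IPv6Network("::ffff:0:0/96")


def to_mapped(ipv4: str) -> ipaddress.IPv6Address:
    """Return the IPv4-mapped IPv6 form of a dotted-quad IPv4 address."""
    v4 = ipaddress.IPv4Address(ipv4)
    # The low 32 bits carry the IPv4 address; the prefix supplies the rest.
    return ipaddress.IPv6Address(int(MAPPED_PREFIX.network_address) | int(v4))


def from_mapped(ipv6: str):
    """Return the embedded IPv4 address, or None if the address is not mapped."""
    return ipaddress.IPv6Address(ipv6).ipv4_mapped


if __name__ == "__main__":
    mapped = to_mapped("208.80.153.107")
    print(mapped in MAPPED_PREFIX)     # is the result inside the mapped prefix?
    print(from_mapped(str(mapped)))    # recover the original IPv4 address
    print(from_mapped("::1"))          # loopback is not an IPv4-mapped address
```

A socket listed only as tcp6 on `:::29418` can therefore still report IPv4 peers, provided the socket is configured to accept them; their addresses simply appear in this mapped form.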
[12:59:58] sure [13:00:00] (03PS1) 10Volans: Upstream release v0.0.57 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/709434 [13:00:28] !log gerrit1001 - re-enabling puppet, deploying sshd listening / firewall change [13:00:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:48] RECOVERY - haproxy failover on dbproxy1017 is OK: OK check_failover servers up 1 down 0 https://wikitech.wikimedia.org/wiki/HAProxy [13:01:06] RECOVERY - haproxy failover on dbproxy1021 is OK: OK check_failover servers up 2 down 0 https://wikitech.wikimedia.org/wiki/HAProxy [13:01:10] RECOVERY - haproxy failover on dbproxy1015 is OK: OK check_failover servers up 1 down 0 https://wikitech.wikimedia.org/wiki/HAProxy [13:01:20] hashar: applied, unsure if we want to do the hard service restart (better to make sure but some users get logged out) [13:01:39] opinion on that? [13:01:43] restart definitely [13:01:55] ok [13:01:59] the user being randomly logged out I believe that has been addressed [13:02:14] (03CR) 10Ottomata: admin README - convert to markdown and clarify system user/group docs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/708777 (owner: 10Ottomata) [13:02:16] !log gerrit1001 - restarting service after 706049 [13:02:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:03:19] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on gerrit1001.wikimedia.org with reason: apply 706049 [13:03:20] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on gerrit1001.wikimedia.org with reason: apply 706049 [13:03:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:03:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:03:43] :) [13:03:45] ok, Gerrit is back [13:04:05] tcp6 0 0 :::29418 [13:04:27] I can clone via ssh [13:04:32] <3 the message on the 503 page when Gerrit is down [13:04:51] 
mutante: looks all fine [13:04:55] hashar: great:) [13:05:02] phuedx: :]]] [13:05:04] glad to got that one out of the way [13:05:35] I will verify on gerrit2001 that "gerrit init" doesn't cause any unwanted change to gerrit config [13:05:45] cool! +1 [13:07:55] (03CR) 10Volans: [C: 03+2] Upstream release v0.0.57 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/709434 (owner: 10Volans) [13:08:36] (03PS1) 10ArielGlenn: fix dealing with a stack trace in dumps logs if it's the last thing in the file [puppet] - 10https://gerrit.wikimedia.org/r/709466 [13:10:11] (03CR) 10ArielGlenn: [C: 03+2] fix dealing with a stack trace in dumps logs if it's the last thing in the file [puppet] - 10https://gerrit.wikimedia.org/r/709466 (owner: 10ArielGlenn) [13:11:46] (03PS3) 10Ottomata: admin README - convert to markdown and clarify system user/group docs [puppet] - 10https://gerrit.wikimedia.org/r/708777 [13:12:03] (03CR) 10jerkins-bot: [V: 04-1] admin README - convert to markdown and clarify system user/group docs [puppet] - 10https://gerrit.wikimedia.org/r/708777 (owner: 10Ottomata) [13:13:15] (03CR) 10Dzahn: "we confirmed it works via IPv4 too even though in netstat output there are is now only tcp6 listening, this behaviour should be normal wit" [puppet] - 10https://gerrit.wikimedia.org/r/706049 (https://phabricator.wikimedia.org/T287122) (owner: 10Hashar) [13:13:34] (03Merged) 10jenkins-bot: Upstream release v0.0.57 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/709434 (owner: 10Volans) [13:13:40] (03PS1) 10David Caro: am: match the team regexes on instance names too [debs/prometheus-icinga-exporter] - 10https://gerrit.wikimedia.org/r/709468 [13:14:23] (03PS1) 10Hashar: gerrit: explicitly set `sshd.listenAddress` [puppet] - 10https://gerrit.wikimedia.org/r/709469 (https://phabricator.wikimedia.org/T287122) [13:14:47] ^ that one adds `sshd.listenAddress = *:29418` since `gerrit init` insists on adding it back in [13:15:00] tested live on 
gerrit2001 [13:15:38] (03CR) 10Ottomata: "? I have rebased locally and pushed patchset 3. I can't git review again, because there are no new changes. I can't rebase in the gerri" [puppet] - 10https://gerrit.wikimedia.org/r/708777 (owner: 10Ottomata) [13:16:55] (03PS1) 10David Caro: prometheus: added some wmcs team label configs [puppet] - 10https://gerrit.wikimedia.org/r/709471 [13:17:44] (03PS2) 10Dzahn: gerrit: explicitly set `sshd.listenAddress` [puppet] - 10https://gerrit.wikimedia.org/r/709469 (https://phabricator.wikimedia.org/T287122) (owner: 10Hashar) [13:19:16] (03CR) 10Dzahn: [C: 03+2] gerrit: explicitly set `sshd.listenAddress` [puppet] - 10https://gerrit.wikimedia.org/r/709469 (https://phabricator.wikimedia.org/T287122) (owner: 10Hashar) [13:20:17] mutante: this way we are no more messing up with the gerrit config provided by puppet [13:20:23] hashar: applied on 2001, try one more time? [13:20:39] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Recommendation-API, 10Znuny: Failover m2 master (db1107) to a different host to upgrade its kernel - https://phabricator.wikimedia.org/T287852 (10Marostegui) @MoritzMuehlenhoff @dpifke @Krinkle @bd808 @hnowlan @kostajh I am planning to failover this hos... [13:21:21] !log uploaded spicerack_0.0.57 to apt.wikimedia.org buster-wikimedia,bullseye-wikimedia [13:21:25] no newline or something? heh [13:21:26] mutante: solved! 
The only one left is the file mode change, we want 0444 via puppet while gerrit insists on 0644 [13:21:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:34] hashar: ok :) [13:21:40] 10SRE, 10SRE-tools, 10DBA, 10Infrastructure-Foundations, and 2 others: Failover m2 master (db1107) to a different host to upgrade its kernel - https://phabricator.wikimedia.org/T287852 (10Marostegui) [13:22:14] applied on gerrit1001, restart only on 2001 [13:22:19] that is one less issue we have to deal with when doing upgrades [13:22:50] yep, and netstat looks good to me on 2001 also after service restart [13:22:55] that sshd.listenAddress issue caused us headaches last time [13:23:00] (03PS4) 10Ottomata: admin README - convert to markdown and clarify system user/group docs [puppet] - 10https://gerrit.wikimedia.org/r/708777 [13:23:19] I will continue my tests of the plugins for gerrit 3.3 [13:23:24] (03CR) 10jerkins-bot: [V: 04-1] admin README - convert to markdown and clarify system user/group docs [puppet] - 10https://gerrit.wikimedia.org/r/708777 (owner: 10Ottomata) [13:23:55] ack, great, thanks [13:25:41] 10SRE, 10SRE-tools, 10DBA, 10Infrastructure-Foundations, and 2 others: Failover m2 master (db1107) to a different host to upgrade its kernel - https://phabricator.wikimedia.org/T287852 (10MoritzMuehlenhoff) >>! In T287852#7251796, @Marostegui wrote: > @MoritzMuehlenhoff @dpifke @Krinkle @bd808 @hnowlan @ko... 
[13:29:54] 10SRE, 10SRE-tools, 10DBA, 10Infrastructure-Foundations, and 2 others: Failover m2 master (db1107) to a different host to upgrade its kernel - https://phabricator.wikimedia.org/T287852 (10Marostegui) [13:30:31] 10SRE, 10SRE-tools, 10DBA, 10Infrastructure-Foundations, and 2 others: Failover m2 master (db1107) to a different host to upgrade its kernel - https://phabricator.wikimedia.org/T287852 (10Marostegui) [13:31:02] 10SRE, 10SRE-tools, 10DBA, 10Infrastructure-Foundations, and 2 others: Failover m2 master (db1107) to a different host to upgrade its kernel - https://phabricator.wikimedia.org/T287852 (10Marostegui) [13:44:39] (03PS20) 10Elukey: Add kubeflow's kfserving charts [deployment-charts] - 10https://gerrit.wikimedia.org/r/700470 (https://phabricator.wikimedia.org/T272919) [13:44:46] (03CR) 10Elukey: Add kubeflow's kfserving charts (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/700470 (https://phabricator.wikimedia.org/T272919) (owner: 10Elukey) [13:52:21] (03CR) 10Hashar: [V: 03+2 C: 03+2] Merge 'wmf/stable-3.2' into wmf/stable-3.3 [software/gerrit/plugins/gitiles] (wmf/stable-3.3) - 10https://gerrit.wikimedia.org/r/705929 (https://phabricator.wikimedia.org/T262241) (owner: 10Hashar) [14:01:11] (03Abandoned) 10Hashar: Introduce gr-wikimedia-prettify-ci-comments [software/gerrit/plugins/wikimedia] - 10https://gerrit.wikimedia.org/r/489483 (https://phabricator.wikimedia.org/T215658) (owner: 10Paladox) [14:01:13] (03Abandoned) 10Hashar: Drop ProjectCreatedListener [software/gerrit/plugins/wikimedia] (stable-3.1) - 10https://gerrit.wikimedia.org/r/593966 (owner: 10Paladox) [14:03:01] (03PS2) 10Hashar: Merge branch 'wmf/stable-3.2' into wmf/stable-3.3 [software/gerrit] (wmf/stable-3.3) - 10https://gerrit.wikimedia.org/r/705934 (https://phabricator.wikimedia.org/T262241) [14:03:40] (03CR) 10Hashar: "I have updated our fork of plugins/gitiles to have its stable-3.3 branch to include our patch:" [software/gerrit] (wmf/stable-3.3) - 
10https://gerrit.wikimedia.org/r/705934 (https://phabricator.wikimedia.org/T262241) (owner: 10Hashar) [14:05:47] (03PS3) 10Muehlenhoff: os-updates-report: Adapt to new OS tracking (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/707371 [14:07:11] (03CR) 10jerkins-bot: [V: 04-1] os-updates-report: Adapt to new OS tracking (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/707371 (owner: 10Muehlenhoff) [14:10:49] (03PS1) 10David Caro: wmcs.puppet_alert: Add failed resources to the email [puppet] - 10https://gerrit.wikimedia.org/r/709477 (https://phabricator.wikimedia.org/T287747) [14:17:04] 10Puppet, 10Infrastructure-Foundations: sslcert::x509_to_pkcs12 fails to overwrite a valid output file when its contents should change - https://phabricator.wikimedia.org/T287869 (10BTullis) [14:17:23] (03PS3) 10Hashar: Merge branch 'wmf/stable-3.2' into wmf/stable-3.3 [software/gerrit] (wmf/stable-3.3) - 10https://gerrit.wikimedia.org/r/705934 (https://phabricator.wikimedia.org/T262241) [14:18:23] (03CR) 10Andrew Bogott: [C: 03+1] "I mostly don't understand the add_multi_constructor() bits but overall this seems like a big improvement." [puppet] - 10https://gerrit.wikimedia.org/r/709477 (https://phabricator.wikimedia.org/T287747) (owner: 10David Caro) [14:22:39] (03PS1) 10Btullis: Improve creation of pkcs12 file by checking contents [puppet] - 10https://gerrit.wikimedia.org/r/709478 (https://phabricator.wikimedia.org/T287869) [14:24:33] (03CR) 10jerkins-bot: [V: 04-1] Merge branch 'wmf/stable-3.2' into wmf/stable-3.3 [software/gerrit] (wmf/stable-3.3) - 10https://gerrit.wikimedia.org/r/705934 (https://phabricator.wikimedia.org/T262241) (owner: 10Hashar) [14:29:00] 10SRE, 10ops-eqiad, 10cloud-services-team (Kanban): Degraded RAID on cloudcephosd1008 - https://phabricator.wikimedia.org/T287838 (10Andrew) a:03Jclark-ctr The failed drive is an OS drive, not one containing ceph storage. So neither this failure nor a replacement should cause Ceph thrashing. @Jclark-ctr,... 
[14:34:17] 10Puppet, 10Infrastructure-Foundations, 10Patch-For-Review: sslcert::x509_to_pkcs12 fails to overwrite a valid output file when its contents should change - https://phabricator.wikimedia.org/T287869 (10BTullis) I have attempted a patch for this bug, based on using `openssl` commands to extract the private ke... [14:37:30] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - api_80: Servers mw2404.codfw.wmnet, mw2400.codfw.wmnet are marked down but pooled: appservers-https_443: Servers mw2391.codfw.wmnet are marked down but pooled: apaches_80: Servers mw2388.codfw.wmnet are marked down but pooled: api-https_443: Servers mw2297.codfw.wmnet, mw2296.codfw.wmnet, mw2397.codfw.wmnet, mw2292.codfw.wmnet, mw2252.codfw.wmnet, mw [14:37:30] fw.wmnet, mw2400.codfw.wmnet, mw2405.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [14:37:51] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on lvs2010.codfw.wmnet with reason: NIC maintenance [14:37:51] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on lvs2010.codfw.wmnet with reason: NIC maintenance [14:37:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:23] ^^ lvs2010 issues are expected [14:46:01] (03PS2) 10David Caro: wmcs.puppet_alert: Add failed resources to the email [puppet] - 10https://gerrit.wikimedia.org/r/709477 (https://phabricator.wikimedia.org/T287747) [14:46:03] (03PS1) 10David Caro: wmcs.cloud-init: create a ready file for alerts [puppet] - 10https://gerrit.wikimedia.org/r/709482 [14:46:05] (03PS1) 10David Caro: wmcs.puppet_alert: Don't fail if the host is not ready [puppet] - 10https://gerrit.wikimedia.org/r/709483 (https://phabricator.wikimedia.org/T287747) [14:47:08] RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools 
are healthy https://wikitech.wikimedia.org/wiki/PyBal [14:48:35] (03CR) 10jerkins-bot: [V: 04-1] wmcs.puppet_alert: Don't fail if the host is not ready [puppet] - 10https://gerrit.wikimedia.org/r/709483 (https://phabricator.wikimedia.org/T287747) (owner: 10David Caro) [14:57:30] (03PS1) 10Btullis: Set hive default log4j version to 2 [puppet] - 10https://gerrit.wikimedia.org/r/709484 (https://phabricator.wikimedia.org/T279304) [14:58:47] 10SRE, 10SRE-tools, 10DBA, 10Infrastructure-Foundations, and 2 others: Failover m2 master (db1107) to a different host to upgrade its kernel - https://phabricator.wikimedia.org/T287852 (10dpifke) No objection for xhgui. [15:01:38] (03CR) 10Andrew Bogott: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/709482 (owner: 10David Caro) [15:02:07] (03CR) 10Andrew Bogott: "Actually... maybe we could use a more explicit filename like cloud-init-finished or firstboot-finished rather than just 'ready'?" [puppet] - 10https://gerrit.wikimedia.org/r/709482 (owner: 10David Caro) [15:02:10] (03PS1) 10Hashar: Merge branch 'stable-3.2' into wmf/stable-3.3 [software/gerrit/plugins/zuul] (wmf/stable-3.3) - 10https://gerrit.wikimedia.org/r/709487 [15:02:31] (03CR) 10Hashar: [V: 03+2 C: 03+2] Merge branch 'stable-3.2' into wmf/stable-3.3 [software/gerrit/plugins/zuul] (wmf/stable-3.3) - 10https://gerrit.wikimedia.org/r/709487 (owner: 10Hashar) [15:05:12] 10SRE, 10MW-on-K8s, 10Release-Engineering-Team, 10serviceops, and 2 others: Create a variant of mediawiki-multiversion which installs php-tideways-xhprof - https://phabricator.wikimedia.org/T287495 (10dancy) [15:05:14] (03PS4) 10Hashar: Merge branch 'wmf/stable-3.2' into wmf/stable-3.3 [software/gerrit] (wmf/stable-3.3) - 10https://gerrit.wikimedia.org/r/705934 (https://phabricator.wikimedia.org/T262241) [15:06:14] (03CR) 10Hashar: "The failure was due to plugins/zuul, I have recreated our wmf/stable-3.3 branch to be based on master and merged stable3.2 in which has al" [software/gerrit] 
(wmf/stable-3.3) - 10https://gerrit.wikimedia.org/r/705934 (https://phabricator.wikimedia.org/T262241) (owner: 10Hashar) [15:07:42] * Niharika waves [15:10:45] (03CR) 10Razzi: [C: 03+1] Update sre.kafka.roll-restart cookbooks to new API [cookbooks] - 10https://gerrit.wikimedia.org/r/704932 (https://phabricator.wikimedia.org/T269925) (owner: 10Btullis) [15:13:43] 10SRE, 10SRE-tools, 10DBA, 10Infrastructure-Foundations, and 2 others: Failover m2 master (db1107) to a different host to upgrade its kernel - https://phabricator.wikimedia.org/T287852 (10bd808) iegreview and scholarships should handle the maintenance without major issue. [15:20:31] (03CR) 10Filippo Giunchedi: [C: 03+2] Fix NavtimingStaleBeacon false alarms, add test [alerts] - 10https://gerrit.wikimedia.org/r/702477 (owner: 10Dave Pifke) [15:20:34] (03CR) 10Legoktm: [C: 03+1] "Thanks!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/709235 (https://phabricator.wikimedia.org/T285197) (owner: 10Zabe) [15:20:36] (03PS4) 10Filippo Giunchedi: Fix NavtimingStaleBeacon false alarms, add test [alerts] - 10https://gerrit.wikimedia.org/r/702477 (owner: 10Dave Pifke) [15:20:38] (03CR) 10Filippo Giunchedi: [V: 03+2 C: 03+2] Fix NavtimingStaleBeacon false alarms, add test [alerts] - 10https://gerrit.wikimedia.org/r/702477 (owner: 10Dave Pifke) [15:21:24] (03CR) 10Razzi: [C: 03+1] Extend Cumin alias for Airflow [puppet] - 10https://gerrit.wikimedia.org/r/709377 (owner: 10Muehlenhoff) [15:22:20] (03PS2) 10Jcrespo: dbbackups: Purge db1139:s2, old eqiad stretch backup source [puppet] - 10https://gerrit.wikimedia.org/r/709393 (https://phabricator.wikimedia.org/T287230) [15:23:29] (03CR) 10Andrew Bogott: [C: 03+1] "thx for extra comments!" 
[puppet] - 10https://gerrit.wikimedia.org/r/709477 (https://phabricator.wikimedia.org/T287747) (owner: 10David Caro) [15:25:03] (03CR) 10Volans: "> Patch Set 4:" (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/708478 (owner: 10Elukey) [15:27:12] 10SRE, 10Traffic: DNS Discovery for active/passive failover within a data centre - https://phabricator.wikimedia.org/T287584 (10BTullis) To keep you updated, we're currently investigating the possibility of another solution to the Presto coordinator use case that was set out here. Specifically, for the Presto... [15:38:20] 10SRE, 10ops-codfw, 10DC-Ops, 10Traffic, 10Sustainability (Incident Followup): lvs2007, lvs2009 and lvs2010 connected to the same row A switch - https://phabricator.wikimedia.org/T286879 (10Papaul) 05Open→03Resolved a:03Papaul Resolving this task since all 3 links on lvs2010 moved to asw-a4,asw-b4... [15:38:26] 10SRE, 10DC-Ops, 10Traffic, 10Sustainability (Incident Followup): Audit eqiad & codfw LVS network links - https://phabricator.wikimedia.org/T286881 (10Papaul) [15:40:20] PROBLEM - Prometheus configuration reload failure on prometheus1003 is CRITICAL: instance=127.0.0.1 job=prometheus https://wikitech.wikimedia.org/wiki/Prometheus%23Configuration_reload_failure https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=prometheus1003&var-datasource=eqiad+prometheus/ops [15:41:15] (03CR) 10David Caro: [C: 04-1] wmcs.puppet_alert: Don't fail if the host is not ready (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/709483 (https://phabricator.wikimedia.org/T287747) (owner: 10David Caro) [15:44:26] !log remove s2 from db1139 T287230 [15:44:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:44:34] (03CR) 10Jcrespo: [C: 03+2] dbbackups: Purge db1139:s2, old eqiad stretch backup source [puppet] - 10https://gerrit.wikimedia.org/r/709393 (https://phabricator.wikimedia.org/T287230) (owner: 10Jcrespo) [15:44:34] T287230: Upgrade s2 to Buster + 
MariaDB 10.4 - https://phabricator.wikimedia.org/T287230 [15:45:44] PROBLEM - Prometheus configuration reload failure on prometheus5001 is CRITICAL: instance=127.0.0.1 job=prometheus https://wikitech.wikimedia.org/wiki/Prometheus%23Configuration_reload_failure https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=prometheus5001&var-datasource=eqsin+prometheus/ops [15:47:14] ^I have not yet merged my change, so that is not me [15:48:11] 10SRE, 10Patch-For-Review, 10SRE Observability (FY2021/2022-Q1): Decommission old ELK5 Logstash cluster - https://phabricator.wikimedia.org/T281266 (10Papaul) [15:49:26] I don't see any relevant recent change digging deeper [15:49:28] that's me, [15:49:38] ah, ok, then not touching anything [15:50:03] jynus: please go ahead, no worries [15:50:12] PROBLEM - Prometheus configuration reload failure on prometheus2004 is CRITICAL: instance=127.0.0.1 job=prometheus https://wikitech.wikimedia.org/wiki/Prometheus%23Configuration_reload_failure https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=prometheus2004&var-datasource=codfw+prometheus/ops [15:50:21] I mean that letting you handle, it is not blocking me [15:51:01] ah yeah, thanks [15:51:02] (I merged my change) [15:51:20] PROBLEM - Prometheus configuration reload failure on prometheus2003 is CRITICAL: instance=127.0.0.1 job=prometheus https://wikitech.wikimedia.org/wiki/Prometheus%23Configuration_reload_failure https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=prometheus2003&var-datasource=codfw+prometheus/ops [15:53:08] PROBLEM - Prometheus configuration reload failure on prometheus4001 is CRITICAL: instance=127.0.0.1 job=prometheus https://wikitech.wikimedia.org/wiki/Prometheus%23Configuration_reload_failure https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=prometheus4001&var-datasource=ulsfo+prometheus/ops [15:54:58] PROBLEM - Prometheus configuration reload failure on prometheus3001 is CRITICAL: instance=127.0.0.1 
job=prometheus https://wikitech.wikimedia.org/wiki/Prometheus%23Configuration_reload_failure https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=prometheus3001&var-datasource=esams+prometheus/ops [15:55:06] fix incoming [15:55:09] (03PS1) 10Filippo Giunchedi: Revert "Fix NavtimingStaleBeacon false alarms, add test" [alerts] - 10https://gerrit.wikimedia.org/r/709493 [15:55:51] (03CR) 10Filippo Giunchedi: [V: 03+2 C: 03+2] Revert "Fix NavtimingStaleBeacon false alarms, add test" [alerts] - 10https://gerrit.wikimedia.org/r/709493 (owner: 10Filippo Giunchedi) [15:57:35] PROBLEM - Prometheus configuration reload failure on prometheus1004 is CRITICAL: instance=127.0.0.1 job=prometheus https://wikitech.wikimedia.org/wiki/Prometheus%23Configuration_reload_failure https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=prometheus1004&var-datasource=eqiad+prometheus/ops [15:58:01] (03PS1) 10Elukey: Add kfserving basic helmfile config under admin_ng [deployment-charts] - 10https://gerrit.wikimedia.org/r/709494 (https://phabricator.wikimedia.org/T272919) [15:59:26] RECOVERY - Prometheus configuration reload failure on prometheus1003 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Configuration_reload_failure https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=prometheus1003&var-datasource=eqiad+prometheus/ops [15:59:30] RECOVERY - Prometheus configuration reload failure on prometheus1004 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Configuration_reload_failure https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=prometheus1004&var-datasource=eqiad+prometheus/ops [15:59:42] RECOVERY - Prometheus configuration reload failure on prometheus2004 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Configuration_reload_failure https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=prometheus2004&var-datasource=codfw+prometheus/ops [16:00:13] (03CR) 10Klausman: [C: 03+1] Add kfserving basic helmfile config under admin_ng [deployment-charts] - 10https://gerrit.wikimedia.org/r/709494 (https://phabricator.wikimedia.org/T272919) (owner: 10Elukey) [16:00:42] RECOVERY - Prometheus configuration reload failure on prometheus3001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Configuration_reload_failure https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=prometheus3001&var-datasource=esams+prometheus/ops [16:00:45] RECOVERY - Prometheus configuration reload failure on prometheus4001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Configuration_reload_failure https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=prometheus4001&var-datasource=ulsfo+prometheus/ops [16:00:54] RECOVERY - Prometheus configuration reload failure on prometheus2003 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Configuration_reload_failure https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=prometheus2003&var-datasource=codfw+prometheus/ops [16:01:04] RECOVERY - Prometheus configuration reload failure on prometheus5001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Configuration_reload_failure https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=prometheus5001&var-datasource=eqsin+prometheus/ops [16:03:05] 10SRE, 10Performance-Team, 10serviceops: WARNING: opcache cache-hit ratio is below 99.99% on multiple eqiad appservers and parsoid servers - https://phabricator.wikimedia.org/T287792 (10Legoktm) 05Open→03Resolved a:03Legoktm This appears to have resolved itself, if I had to guess this was caused by cod... 
[16:08:50] (03CR) 10Volans: [C: 03+1] "LGTM at a first pass, to be tested in dry-run once merged." [cookbooks] - 10https://gerrit.wikimedia.org/r/704932 (https://phabricator.wikimedia.org/T269925) (owner: 10Btullis) [16:09:34] (03CR) 10Filippo Giunchedi: "My apologies, I reverted this in I501856fc8e67 since we'll need a Prometheus upgrade first https://phabricator.wikimedia.org/T222113 and I" [alerts] - 10https://gerrit.wikimedia.org/r/702477 (owner: 10Dave Pifke) [16:11:04] 10SRE, 10Infrastructure-Foundations, 10Mail: mx1001 alerting for 2043 mails in exim queue - https://phabricator.wikimedia.org/T287793 (10herron) 05Open→03Resolved a:03herron This alert has cleared and the queue is now ~50% below the icinga threshold. I did notice last week before this task was created... [16:12:51] (03PS2) 10Elukey: Add kfserving basic helmfile config under admin_ng [deployment-charts] - 10https://gerrit.wikimedia.org/r/709494 (https://phabricator.wikimedia.org/T272919) [16:16:36] RECOVERY - Check systemd state on snapshot1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:16:41] (03PS2) 10Jcrespo: dbbackups: Move s4 from db1145 to db1139 and reimage db1145 to buster [puppet] - 10https://gerrit.wikimedia.org/r/709395 (https://phabricator.wikimedia.org/T280979) [16:26:14] (03CR) 10Jcrespo: [V: 03+2 C: 03+2] "Let's get this merged- we discussed in private further improvements, but none are a blocker to have this on trunk already." 
[software/bernard] - 10https://gerrit.wikimedia.org/r/702781 (https://phabricator.wikimedia.org/T285142) (owner: 10H.krishna123) [16:27:26] (03PS1) 10Hashar: Gerrit 3.3.5 + plugins [software/gerrit] (deploy/wmf/stable-3.3) - 10https://gerrit.wikimedia.org/r/709501 (https://phabricator.wikimedia.org/T262241) [16:30:57] (03PS12) 10Btullis: Update sre.kafka.roll-restart cookbooks to new API [cookbooks] - 10https://gerrit.wikimedia.org/r/704932 (https://phabricator.wikimedia.org/T269925) [16:38:32] (03PS1) 10Lucas Werkmeister (WMDE): Stop setting $wgWBRepoSettings['conceptBaseUri'] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/709503 (https://phabricator.wikimedia.org/T257260) [16:38:34] (03PS1) 10Lucas Werkmeister (WMDE): Remove wmgWBRepoConceptBaseUri [mediawiki-config] - 10https://gerrit.wikimedia.org/r/709504 (https://phabricator.wikimedia.org/T257260) [16:38:55] (03CR) 10Lucas Werkmeister (WMDE): [C: 04-2] "DNM before wmf.17 is safely rolled out to all servers and won’t be rolled back again." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/709503 (https://phabricator.wikimedia.org/T257260) (owner: 10Lucas Werkmeister (WMDE)) [16:41:47] (03CR) 10Btullis: [C: 03+2] Update sre.kafka.roll-restart cookbooks to new API [cookbooks] - 10https://gerrit.wikimedia.org/r/704932 (https://phabricator.wikimedia.org/T269925) (owner: 10Btullis) [16:42:29] (03CR) 10Btullis: [V: 03+2 C: 03+2] Update sre.kafka.roll-restart cookbooks to new API [cookbooks] - 10https://gerrit.wikimedia.org/r/704932 (https://phabricator.wikimedia.org/T269925) (owner: 10Btullis) [16:48:42] (03CR) 10Volans: "> Patch Set 1: Code-Review+2" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/708384 (https://phabricator.wikimedia.org/T285803) (owner: 10RLazarus) [16:53:04] (03CR) 10Btullis: [C: 03+2] Extend Cumin alias for Airflow [puppet] - 10https://gerrit.wikimedia.org/r/709377 (owner: 10Muehlenhoff) [17:00:05] ryankemper: I, the Bot under the Fountain, allow thee, The Deployer, to do Wikidata Query Service weekly deploy deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210802T1700). [17:06:02] o/ Who's the best person to talk to about increasing $wgMaxUserDBWriteDuration for a specific wiki. Context: https://phabricator.wikimedia.org/T287859. tl;dr is that creating a SecurePoll poll on votewiki is triggering a DBTransactionSizeError AND the Board Election starts in ~1 day [17:09:04] phuedx: DBAs probably, #wikimedia-databases and tag #DBA in Phabricator [17:09:28] legoktm: Thanks! 
Will do [17:10:23] (03PS2) 10Hashar: Gerrit 3.3.5 + plugins [software/gerrit] (deploy/wmf/stable-3.3) - 10https://gerrit.wikimedia.org/r/709501 (https://phabricator.wikimedia.org/T262241) [17:14:14] phuedx: It's possibly something worth pinging Aaron S about specifically too [17:14:33] (03CR) 10Bstorm: metricsinfra: add karma with cas (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/709066 (https://phabricator.wikimedia.org/T285055) (owner: 10Majavah) [17:20:34] phuedx: I have replied to your ping on the task, but there's not much we can do as DBAs [17:21:48] phuedx: I would prefer not to increase the timeout, and rather try to optimize the code as much as possible to see if the transaction can become smaller (maybe split it in chunks if that's doable?) [17:23:47] marostegui: Thanks for the response. For sure, there are a LOT of writes from what I can tell that might be able to be broken into smaller transactions. The problem is the deadline. I'm also unsure why this wasn't an issue earlier [17:25:46] There's been quite a lot of changes to this sort of stuff (by aaron, hence saying to ping him) [17:26:06] phuedx: 3 seconds for a write is already quite a big timeout I think, trying to fix big transactions by increasing timeouts isn't ideal - but as Reedy says, can we try to get performance involved? [17:26:52] I'm guessing it's related to CreatePage::processInput... 
Which will do many queries on potentially 800+ wikis to set everything up [17:27:24] ^ Very much that [17:27:39] It does look like a _big_ transaction [17:28:22] phuedx: It might take 4 seconds now and 5 in a week and such, that's why I was mentioning trying to split it as much as possible, let's see what Aaron says [17:29:16] (03PS1) 10Hashar: pom: overwrite gerrit.war [software/gerrit] (deploy/wmf/stable-3.2) - 10https://gerrit.wikimedia.org/r/709509 [17:29:51] phuedx: another option might be to start the transactions/atomic parts later (ie not opening it then doing reads which are not locking) [17:30:09] It doesn't help that the error doesn't really give much clue where it's failing in the process [17:30:52] Reedy: I was thinking the same. Figure out what needs to be updated -> begin tx -> write -> end tx [17:30:54] (03CR) 10Hashar: "Found that the hard way when bumping to 3.3.5, `mvn package` did not copy the dependency over `gerrit.war` :-(" [software/gerrit] (deploy/wmf/stable-3.2) - 10https://gerrit.wikimedia.org/r/709509 (owner: 10Hashar) [17:31:24] But as marostegui says, there are stages that could be their own transactions too [17:31:52] yeah, that'd help as it might reduce the time of that massive transaction (I consider 4 seconds massive hehe) [17:33:26] I'm guessing, based on the code... [17:33:27] // Ok, begin the actual work [17:33:27] $dbw->startAtomic( __METHOD__ ); [17:33:37] it's from there to the end, not the foreach with the start/end inside it [17:34:00] ie not the "create the "redirect" polls on all the local wikis" part [17:36:47] phuedx: $wgMaxUserDBWriteDuration can be set to specific wikis or does it need to be done on a per section basis? (ie: s1, s2...)? 
[17:39:20] @marostegui AIUI it's a per-wiki configuration value [17:39:44] Per wiki should work [17:40:46] (03PS2) 10Majavah: metricsinfra: add karma with cas [puppet] - 10https://gerrit.wikimedia.org/r/709066 (https://phabricator.wikimedia.org/T285055) [17:40:52] phuedx: that's good [17:40:55] It can probably be hacked around so it's only done for certain requests on certain wikis [17:41:09] Other than vote-y stuff... votewiki wouldn't be doing too many writes [17:41:18] And these creation events are probably the "biggest" thing it'll do [17:43:44] (03CR) 10Bstorm: metricsinfra: add karma with cas (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/709066 (https://phabricator.wikimedia.org/T285055) (owner: 10Majavah) [17:44:21] (03PS3) 10Majavah: metricsinfra: add karma with cas [puppet] - 10https://gerrit.wikimedia.org/r/709066 (https://phabricator.wikimedia.org/T285055) [17:44:53] marostegui: Of course... the other thing, to be really hacky... We could just increase it for long enough to create this election, and then put it back to normal [17:45:08] There is the immediate problem of the current use case and the long term problem of reasonable (and standardized) configuration. If we choose to solve the former with a workaround then let's plan for rolling that back afterwards and looking for the proper solution in parallel. [17:45:16] AFAIK we shouldn't need to keep the timeout higher than usual for a long time [17:45:35] Reedy: exactly :) [17:46:00] (03CR) 10Bstorm: metricsinfra: add karma with cas (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/709066 (https://phabricator.wikimedia.org/T285055) (owner: 10Majavah) [17:46:08] The other option could be writing a maintenance script to create elections too (as a "later" thing)... 
So we don't need to worry about errant queries running on random appservers [17:46:23] (03CR) 10Bstorm: [C: 03+2] metricsinfra: add karma with cas [puppet] - 10https://gerrit.wikimedia.org/r/709066 (https://phabricator.wikimedia.org/T285055) (owner: 10Majavah) [17:46:47] Reedy: +1 to both the temporary increase and creating a maintenance script [17:47:07] Getting the refactoring done to the UI one should be longer term work too [17:47:41] (03PS1) 10Legoktm: varnish: Improve comments around maps access, retire T261694 [puppet] - 10https://gerrit.wikimedia.org/r/709511 (https://phabricator.wikimedia.org/T261694) [17:47:48] and/or a couple of temporary increases... one for testing that it all works, a second to actually create the actual election [17:47:53] (03PS3) 10Volans: Class API: add rollback() method [software/spicerack] - 10https://gerrit.wikimedia.org/r/705720 [17:48:58] (03CR) 10Volans: "Addressed comments" (034 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/705720 (owner: 10Volans) [17:51:27] 10SRE, 10Maps, 10Product-Infrastructure-Team-Backlog, 10Traffic, and 2 others: Support maps serving for affiliate sites via an allow list - https://phabricator.wikimedia.org/T261694 (10Legoktm) I created https://wikitech.wikimedia.org/wiki/Maps/External_usage just now (please edit/improve!) and submitted t... [17:52:52] (03PS1) 10Majavah: metricsinfra: Add IRC bot for alerting [puppet] - 10https://gerrit.wikimedia.org/r/709514 (https://phabricator.wikimedia.org/T287148) [17:53:39] (03PS2) 10Majavah: metricsinfra: Add IRC bot for alerting [puppet] - 10https://gerrit.wikimedia.org/r/709514 (https://phabricator.wikimedia.org/T287148) [17:54:42] AND to keep it very minimal... 
You could do it with wikimediadebug and only pull it onto one debug host ;) [17:55:28] that should work [17:55:43] (03PS1) 10Urbanecm: Growth features: Enable features in dark mode on a few wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/709515 (https://phabricator.wikimedia.org/T287876) [17:56:53] Reedy: I'll start editing the config on mwdebug2001 right now ;) [17:57:16] !log razzi@cumin1001 START - Cookbook sre.druid.roll-restart-workers for Druid public cluster: Roll restart of Druid's jvm daemons. - razzi@cumin1001 [17:57:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:58:21] To understand this better, is this required to run the transaction once (poll creation) or will further interactions with the poll need this changed setting as well? From the conversation so far it seems like it's the first option. [18:00:05] RoanKattouw, Niharika, and Urbanecm: That opportune time is upon us again. Time for a Morning backport windowYour patch may or may not be deployed at the sole discretion of the deployer deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210802T1800). [18:00:05] tgr and zabe: A patch you scheduled for Morning backport windowYour patch may or may not be deployed at the sole discretion of the deployer is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [18:00:22] o/ [18:00:29] i can deploy today (unless tgr wishes to run the window :)) [18:00:32] o/ [18:00:54] thanks urbanecm! 
[18:01:13] (03CR) 10Urbanecm: [C: 03+2] Add a link: Show article extract instead of description in the link inspector [extensions/GrowthExperiments] (wmf/1.37.0-wmf.16) - 10https://gerrit.wikimedia.org/r/709256 (https://phabricator.wikimedia.org/T287636) (owner: 10Gergő Tisza) [18:01:40] sobanski: In my understanding this only affects the creation, which is why raising the limit for a very short time is something being discussed. [18:02:14] Thanks [18:02:36] phuedx: you mentioned you're messing up with mwdebug2001 -- note it's a deployment window now, and I'll be shortly using scap to sync stuff. Are you done? :) [18:02:52] urbanecm: Ah! My apologies. It was a (bad) joke [18:03:04] sobanski: +1 to zabe [18:03:07] urbanecm: let me know when you are done with the window please, I've got a config patch I will deploy [18:03:08] hi folks. per Toby the board elections are being delayed 2 weeks. so hopefully that creates more room for a smoother solution [18:03:32] phuedx: oh, i took it seriously :D. Going to deploy normally then :) [18:03:34] ottomata: ack [18:04:05] sorry for the fire drill! [18:04:26] (03CR) 10Urbanecm: [C: 03+2] Remove unused eswiki celebration logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/709232 (https://phabricator.wikimedia.org/T280908) (owner: 10Zabe) [18:05:08] (03Merged) 10jenkins-bot: Remove unused eswiki celebration logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/709232 (https://phabricator.wikimedia.org/T280908) (owner: 10Zabe) [18:06:43] zabe: syncing the first patch, no need to test that one [18:07:00] phuedx: considering what tltaylor said, should we wait for Performance team's input before proceeding? 
[18:07:38] (03PS3) 10Urbanecm: Remove unused enwiki celebration logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/709231 (https://phabricator.wikimedia.org/T272108) (owner: 10Zabe) [18:07:41] !log urbanecm@deploy1002 Synchronized static/images/project-logos/: 16f97941b7d8eacc9bddae7bc570e03b031bead2: Remove unused eswiki celebration logos (T280908) (duration: 00m 57s) [18:07:46] (03CR) 10Urbanecm: [C: 03+2] Remove unused enwiki celebration logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/709231 (https://phabricator.wikimedia.org/T272108) (owner: 10Zabe) [18:07:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:07:48] T280908: Change Spanish Wikipedia logo due to its 20th anniversary as of May 1 for one month - https://phabricator.wikimedia.org/T280908 [18:08:35] sobanski: Yes. Thanks for asking :) I was also involved in that conversation but am in a meeting so couldn't post a timely message here [18:08:36] 10SRE, 10ops-eqiad, 10User-fgiunchedi: Disk failed on thanos-be1003 - https://phabricator.wikimedia.org/T285664 (10Cmjohnson) The TSR report I sent them on July 1st was the wrong server. Re-sent the report. The new disk should arrive this week. 
[18:08:50] I'll post more updates on the task [18:08:59] Perfect, thanks [18:09:26] (03Merged) 10jenkins-bot: Remove unused enwiki celebration logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/709231 (https://phabricator.wikimedia.org/T272108) (owner: 10Zabe) [18:10:59] syncing the enwiki patch, too [18:11:47] !log urbanecm@deploy1002 Synchronized static/images/project-logos/: 97b68972108feaf52ab328991f563617f3594d81: Remove unused enwiki celebration logos (T272108) (duration: 00m 57s) [18:11:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:11:54] T272108: Change EnWiki logo's back to the standard one, on or after 2021-02-04 - https://phabricator.wikimedia.org/T272108 [18:12:09] (03PS2) 10Urbanecm: Add media.defense.gov to the wgCopyUploadsDomains allowlist of Wikimedia Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/709233 (https://phabricator.wikimedia.org/T287264) (owner: 10Zabe) [18:12:25] (03CR) 10Urbanecm: [C: 03+2] Add media.defense.gov to the wgCopyUploadsDomains allowlist of Wikimedia Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/709233 (https://phabricator.wikimedia.org/T287264) (owner: 10Zabe) [18:13:41] (03Merged) 10jenkins-bot: Add media.defense.gov to the wgCopyUploadsDomains allowlist of Wikimedia Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/709233 (https://phabricator.wikimedia.org/T287264) (owner: 10Zabe) [18:14:15] syncing whitelist too, trivial change [18:14:56] *allowlist [18:15:03] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 11e96bab3375d604126619169964a2db96808152: Add media.defense.gov to the wgCopyUploadsDomains allowlist of Wikimedia Commons (T287264) (duration: 00m 56s) [18:15:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:15:11] T287264: Add media.defense.gov to the wgCopyUploadsDomains allowlist of Wikimedia Commons - https://phabricator.wikimedia.org/T287264 [18:15:22] (03PS2) 
10Urbanecm: Add tewikisource as import source for tewikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/709234 (https://phabricator.wikimedia.org/T286978) (owner: 10Zabe) [18:15:36] (03CR) 10Urbanecm: [C: 03+2] Add tewikisource as import source for tewikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/709234 (https://phabricator.wikimedia.org/T286978) (owner: 10Zabe) [18:17:41] (03Merged) 10jenkins-bot: Add tewikisource as import source for tewikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/709234 (https://phabricator.wikimedia.org/T286978) (owner: 10Zabe) [18:18:32] and I'm going to test this one myself, as Special:Import requires advanced permissions to load [18:18:46] yes, thanks [18:19:07] works fine, syncing [18:19:31] (03PS2) 10Urbanecm: Enable SUL autologin for wikimania.wikimedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/709235 (https://phabricator.wikimedia.org/T285197) (owner: 10Zabe) [18:19:39] (03CR) 10Urbanecm: [C: 03+2] Enable SUL autologin for wikimania.wikimedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/709235 (https://phabricator.wikimedia.org/T285197) (owner: 10Zabe) [18:20:17] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: cc8ca452e66994c211efd684b7ed3810bdc84aaf: Add tewikisource as import source for tewikibooks (T286978) (duration: 00m 56s) [18:20:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:20:24] T286978: Site configuration change to enable import from tewikisource into tewikibooks - https://phabricator.wikimedia.org/T286978 [18:21:11] (03Merged) 10jenkins-bot: Enable SUL autologin for wikimania.wikimedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/709235 (https://phabricator.wikimedia.org/T285197) (owner: 10Zabe) [18:21:22] (03Merged) 10jenkins-bot: Add a link: Show article extract instead of description in the link inspector [extensions/GrowthExperiments] (wmf/1.37.0-wmf.16) - 
10https://gerrit.wikimedia.org/r/709256 (https://phabricator.wikimedia.org/T287636) (owner: 10Gergő Tisza) [18:22:17] zabe: tgr: your patches were pulled onto mwdebug2001, can you have a look? [18:22:33] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [18:22:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:22:46] (not really sure if there's a reliable way to test that one though) [18:23:05] urbanecm: thanks! works as expected [18:23:40] interestingly i managed to auto-login to wikimaniawiki via my staff account (which was not logged there yet) even w/o mwdebug2001 [18:23:44] tgr: thanks, syncing [18:24:18] tgr: i should do extension.json first and then the modules, right? For the callback to be available? [18:24:29] 10ops-eqiad, 10DBA: Degraded RAID on db1175 - https://phabricator.wikimedia.org/T287137 (10Cmjohnson) a:05Jclark-ctr→03Cmjohnson Ticket created with Dell Create Dispatch: Service Tag: DYV8773 [18:24:42] or can i just sync the extension dir? [18:25:22] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [18:25:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:26:49] Mine looks good? It doesn't look bad. I don't know how to reliably test this. [18:27:27] urbanecm: in theory extension.json first is more correct, although ResourceLoader regenerates things every five minutes so the amount of breakage caused by syncing all at once would be minimal. [18:28:50] zabe: it certainly doesn't look bad, SUL still works and no errors or something logged to logstash [18:28:52] i'll sync it [18:29:13] and thanks for the reply tgr [18:29:14] 10SRE, 10ops-eqiad, 10DBA: Broken RAM on db1127 - https://phabricator.wikimedia.org/T286763 (10Cmjohnson) a:05Jclark-ctr→03Cmjohnson Dispatch created with Dell, You have successfully submitted request SR1066677487. 
[18:30:11] doing extension.json first then the rest just in case [18:30:39] !log urbanecm@deploy1002 Synchronized php-1.37.0-wmf.16/extensions/GrowthExperiments/extension.json: 05cf1d6de1695d2e38531f3fecb26381f4dc0b1d: Add a link: Show article extract instead of description in the link inspector (T287636; 1/2) (duration: 00m 57s) [18:30:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:30:46] T287636: Add a link: link inspector should show article extract, not article description - https://phabricator.wikimedia.org/T287636 [18:31:55] !log urbanecm@deploy1002 Synchronized php-1.37.0-wmf.16/extensions/GrowthExperiments/modules/: 05cf1d6de1695d2e38531f3fecb26381f4dc0b1d: Add a link: Show article extract instead of description in the link inspector (T287636; 2/2) (duration: 00m 56s) [18:32:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:32:32] tgr: should be live :) [18:33:12] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[18:33:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:33:53] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: eec997cf88437fc6e2e27a835301aef968c548c4: Enable SUL autologin for wikimania.wikimedia.org (T285197) (duration: 00m 55s) [18:34:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:34:00] T285197: Enable SUL autologin for wikimania.wikimedia.org - https://phabricator.wikimedia.org/T285197 [18:34:01] zabe: your patch live as well [18:34:11] (03PS3) 10Urbanecm: Add rollbacker group for kswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/709214 (https://phabricator.wikimedia.org/T286789) (owner: 10Zabe) [18:34:15] (03CR) 10Urbanecm: [C: 03+2] Add rollbacker group for kswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/709214 (https://phabricator.wikimedia.org/T286789) (owner: 10Zabe) [18:35:18] (03Merged) 10jenkins-bot: Add rollbacker group for kswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/709214 (https://phabricator.wikimedia.org/T286789) (owner: 10Zabe) [18:35:29] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[18:35:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:35:55] zabe: your patch is at mwdebug2001, please have a look [18:36:36] (03PS2) 10Urbanecm: Growth features: Enable features in dark mode on a few wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/709515 (https://phabricator.wikimedia.org/T287876) [18:36:40] (03CR) 10Urbanecm: [C: 03+2] Growth features: Enable features in dark mode on a few wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/709515 (https://phabricator.wikimedia.org/T287876) (owner: 10Urbanecm) [18:36:53] and also i'm taking the liberty to squeeze this one in [18:37:25] (03Merged) 10jenkins-bot: Growth features: Enable features in dark mode on a few wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/709515 (https://phabricator.wikimedia.org/T287876) (owner: 10Urbanecm) [18:38:12] urbanecm: works the supposed way [18:38:29] thanks, syncing [18:38:51] looks all good, thanks for deploying urbanecm! [18:38:57] any time tgr ! [18:40:13] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: ee47f9d9a867f0bc419928c010579fb4f6fea425: Add rollbacker group for kswiki (T286789) (duration: 00m 56s) [18:40:15] zabe: and all live. Enjoy! 
[18:40:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:40:21] T286789: Create Rollbackers user group on Ks Wiki - https://phabricator.wikimedia.org/T286789 [18:40:41] thanks for your help :) [18:40:45] any time :) [18:40:57] !log Create GrowthExperiments database tables for a bunch of wikis (T287876, T287871, T287878, T287880, T287875, T287879, T287872) [18:41:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:41:11] T287875: Deploy Growth features on Marathi Wikipedia - https://phabricator.wikimedia.org/T287875 [18:41:12] T287880: Deploy Growth features on Georgian Wikipedia - https://phabricator.wikimedia.org/T287880 [18:41:12] T287879: Deploy Growth features on Malayalam Wikipedia - https://phabricator.wikimedia.org/T287879 [18:41:12] T287876: Deploy Growth features on Slovenian Wikipedia - https://phabricator.wikimedia.org/T287876 [18:41:12] T287872: Deploy Growth features on Finnish Wikipedia - https://phabricator.wikimedia.org/T287872 [18:41:13] T287878: Deploy Growth features on Kazakh Wikipedia - https://phabricator.wikimedia.org/T287878 [18:41:13] T287871: Deploy Growth features on Azerbaijani Wikipedia - https://phabricator.wikimedia.org/T287871 [18:45:04] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: db1170 mysql process crashed - https://phabricator.wikimedia.org/T286888 (10Cmjohnson) a:05Jclark-ctr→03Cmjohnson A dell ticket for a new DIMM has been submitted. You have successfully submitted request SR1066678833. 
[18:45:30] !log urbanecm@deploy1002 Synchronized dblists/growthexperiments.dblist: 18cd360773a2a236f9817ac0a4eaf3790b6d8cff: Growth features: Enable features in dark mode on a few wikis (T287876, T287871, T287878, T287880, T287875, T287879, T287872; 1/2) (duration: 00m 56s) [18:45:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:46:36] 10SRE, 10ops-eqiad, 10Discovery, 10Discovery-Search (Current work): Degraded RAID on elastic1039 - https://phabricator.wikimedia.org/T285643 (10Cmjohnson) 05Open→03Resolved [18:46:42] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 18cd360773a2a236f9817ac0a4eaf3790b6d8cff: Growth features: Enable features in dark mode on a few wikis (T287876, T287871, T287878, T287880, T287875, T287879, T287872; 2/2) (duration: 00m 56s) [18:46:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:46:53] T287875: Deploy Growth features on Marathi Wikipedia - https://phabricator.wikimedia.org/T287875 [18:46:54] T287880: Deploy Growth features on Georgian Wikipedia - https://phabricator.wikimedia.org/T287880 [18:46:54] T287879: Deploy Growth features on Malayalam Wikipedia - https://phabricator.wikimedia.org/T287879 [18:46:54] T287876: Deploy Growth features on Slovenian Wikipedia - https://phabricator.wikimedia.org/T287876 [18:46:55] T287872: Deploy Growth features on Finnish Wikipedia - https://phabricator.wikimedia.org/T287872 [18:46:55] T287878: Deploy Growth features on Kazakh Wikipedia - https://phabricator.wikimedia.org/T287878 [18:46:55] T287871: Deploy Growth features on Azerbaijani Wikipedia - https://phabricator.wikimedia.org/T287871 [18:47:01] ottomata: you can go ahead with your stuff [18:47:04] PROBLEM - SSH on cp5005.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:47:10] 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work): hw troubleshooting: 
Disk failure for elastic1039.eqiad.wmnet - https://phabricator.wikimedia.org/T286497 (10Cmjohnson) This server is out of warranty, I may have some spares from decom'd systems on-site. I will check and update the task. [18:47:28] urbanecm: thank you [18:47:34] (03PS9) 10Ottomata: Stream config for android_notification_interaction schema Bug: T287652 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708653 (https://phabricator.wikimedia.org/T287652) (owner: 10Sharvaniharan) [18:47:56] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [18:48:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:48:21] !log razzi@cumin1001 END (PASS) - Cookbook sre.druid.roll-restart-workers (exit_code=0) for Druid public cluster: Roll restart of Druid's jvm daemons. - razzi@cumin1001 [18:48:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:49:12] (03CR) 10Ottomata: [C: 03+2] Stream config for android_notification_interaction schema Bug: T287652 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708653 (https://phabricator.wikimedia.org/T287652) (owner: 10Sharvaniharan) [18:49:12] !log Run extensions/GrowthExperiments/maintenance/initWikiConfig.php on a couple of wikis to init on-wiki config for Growth features (T287876, T287871, T287878, T287880, T287875, T287879, T287872) [18:49:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:49:45] ottomata: the bug: was not detected because of missing empty line, fyi [18:49:46] !log razzi@cumin1001 START - Cookbook sre.druid.roll-restart-workers for Druid analytics cluster: Roll restart of Druid's jvm daemons. - razzi@cumin1001 [18:49:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:50:00] oh [18:50:04] right, ohwell [18:50:09] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[18:50:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:50:40] !log otto@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Stream config for android_notification_interaction - T287652 (duration: 00m 56s) [18:50:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:50:46] T287652: Migrate MobileWikiAppNotificationInteraction to MEP - https://phabricator.wikimedia.org/T287652 [18:54:29] (03PS1) 10Urbanecm: Enable Growth features on a couple of wikis in dark mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/709521 (https://phabricator.wikimedia.org/T287868) [18:54:43] ottomata: if that was all, can i sync ^^ now? [18:54:53] 10SRE, 10ops-eqiad, 10cloud-services-team (Kanban): Degraded RAID on cloudcephosd1008 - https://phabricator.wikimedia.org/T287838 (10Cmjohnson) a:05Jclark-ctr→03Cmjohnson a disk has been ordered through Dell, hopefully, they do not push back because the disk does not show failed in the h/w log I sent the... [18:55:23] urbanecm: yup i'm done [18:55:23] thank you [18:55:27] thanks [18:55:38] (03PS2) 10Urbanecm: Enable Growth features on a couple of wikis in dark mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/709521 (https://phabricator.wikimedia.org/T287868) [18:55:42] (03CR) 10Urbanecm: [C: 03+2] Enable Growth features on a couple of wikis in dark mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/709521 (https://phabricator.wikimedia.org/T287868) (owner: 10Urbanecm) [18:56:32] (03Merged) 10jenkins-bot: Enable Growth features on a couple of wikis in dark mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/709521 (https://phabricator.wikimedia.org/T287868) (owner: 10Urbanecm) [18:57:01] 10SRE, 10ops-eqiad, 10cloud-services-team (Kanban): Degraded RAID on cloudcephosd1018 - https://phabricator.wikimedia.org/T285799 (10Cmjohnson) 05Open→03Resolved @dcaro, sorry for the late response, I was out all month. 
No, there isn't anything left to do, it appears to be working fine now. If it breaks... [18:57:04] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudcephosd10[16-20].eqiad.wmnet - https://phabricator.wikimedia.org/T274945 (10Cmjohnson) [18:57:23] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [18:57:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:58:51] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [18:58:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:59:31] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: bebf4a9819f80e19cbb94f115f47c1ff4d05b7d2: Enable Growth features on a couple of wikis in dark mode (T287868, T287874, T287873; 1/2) (duration: 00m 57s) [18:59:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:59:40] T287868: Deploy Growth features on Kurdish Wikipedia - https://phabricator.wikimedia.org/T287868 [18:59:40] T287873: Deploy Growth features on Lithuanian Wikipedia - https://phabricator.wikimedia.org/T287873 [18:59:40] T287874: Deploy Growth features on Estonian Wikipedia - https://phabricator.wikimedia.org/T287874 [19:00:42] !log urbanecm@deploy1002 Synchronized dblists/growthexperiments.dblist: bebf4a9819f80e19cbb94f115f47c1ff4d05b7d2: Enable Growth features on a couple of wikis in dark mode (T287868, T287874, T287873; 2/2) (duration: 00m 56s) [19:00:45] and, done [19:00:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:00:51] !log Morning B&C window completed [19:00:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:01:23] !log Run extensions/GrowthExperiments/maintenance/initWikiConfig.php on a couple of wikis to init on-wiki config for Growth features (T287868, 
T287874, T287873) [19:01:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:05:30] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [19:05:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:12:50] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:13:35] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [19:13:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:14:44] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:15:58] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [19:16:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:26:48] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:35:38] !log razzi@cumin1001 END (PASS) - Cookbook sre.druid.roll-restart-workers (exit_code=0) for Druid analytics cluster: Roll restart of Druid's jvm daemons. 
- razzi@cumin1001 [19:35:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:42:05] (03PS1) 10Ryan Kemper: analytics: commission new webserver [puppet] - 10https://gerrit.wikimedia.org/r/709530 (https://phabricator.wikimedia.org/T285355) [19:47:56] RECOVERY - SSH on cp5005.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:59:46] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): cloudvirt1038: PCIe error - https://phabricator.wikimedia.org/T276922 (10Jclark-ctr) I have not received any calls or emails from dell yet. only emails received were from previous motherboard replacement with Chris [20:00:05] chrisalbon and accraze: (Dis)respected human, time to deploy Services – Graphoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210802T2000). Please do the needful. [20:01:14] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:13:16] (03PS1) 10Ottomata: Enable canary events by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/709531 (https://phabricator.wikimedia.org/T287789) [20:18:16] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops-radar: (Need By: TBD) rack/setup/install mw14[14-56] - https://phabricator.wikimedia.org/T273915 (10Jclark-ctr) a:05Jclark-ctr→03Cmjohnson mw1451 A2 U8 Port20 Cableid#23000030 mw1452 A2 U22 Port21 Cableid#23000022 mw1453 A8 U4 Port17... 
[20:18:39] (03PS2) 10Ottomata: Enable canary events by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/709531 (https://phabricator.wikimedia.org/T287789) [20:26:38] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:35:56] (03CR) 10Ottomata: [C: 03+1] Set hive default log4j version to 2 [puppet] - 10https://gerrit.wikimedia.org/r/709484 (https://phabricator.wikimedia.org/T279304) (owner: 10Btullis) [20:40:51] (03CR) 10Btullis: [C: 03+2] Set hive default log4j version to 2 [puppet] - 10https://gerrit.wikimedia.org/r/709484 (https://phabricator.wikimedia.org/T279304) (owner: 10Btullis) [20:45:26] (03CR) 10Cwhite: [C: 03+1] prometheus: tweak external url to reflect reality [puppet] - 10https://gerrit.wikimedia.org/r/709032 (https://phabricator.wikimedia.org/T284213) (owner: 10Filippo Giunchedi) [21:00:04] Reedy and sbassett: It is that lovely time of the day again! You are hereby commanded to deploy Weekly Security deployment window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210802T2100). 
[21:01:01] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:02:15] (03CR) 10Cwhite: prometheus.icinga_exporter: Add label_teams_config parameter (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/709053 (owner: 10David Caro) [21:03:05] (03CR) 10RLazarus: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/708384 (https://phabricator.wikimedia.org/T285803) (owner: 10RLazarus) [21:06:30] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install (2) new 10G switches - https://phabricator.wikimedia.org/T277340 (10Jclark-ctr) Replaced Ends on console cable clip did not want to lock in switch [21:06:56] (03CR) 10Cwhite: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/709054 (owner: 10David Caro) [21:07:16] (03CR) 10Cwhite: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/709471 (owner: 10David Caro) [21:16:10] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops, 10User-jijiki: (Need By: TBD) rack/setup/install mc10[37-54].eqiad.wmnet - https://phabricator.wikimedia.org/T274925 (10Jclark-ctr) @jijiki I am only held up on two servers for racking mc1039, mc1040 per racking proposal i am waiting for rack A7 to have 2... [21:16:49] !log removing 7 files for legal compliance [21:16:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:20:08] it's odd to see removals at this level [21:20:39] 10SRE, 10DynamicPageList (Wikimedia), 10MW-1.37-notes (1.37.0-wmf.16; 2021-07-26), 10Patch-For-Review, 10Sustainability (Incident Followup): Decide on the future of DPL - https://phabricator.wikimedia.org/T287380 (10DonTrung) I think that it would be unwise to disable DPL, perhaps it would be better to s... 
[21:26:24] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:26:26] Platonides: it doesn't happen every day, but it does happen -- https://sal.toolforge.org/production?q=%22legal+compliance%22 [21:27:02] it surprises me as a SAL action [21:27:42] I thought these were (almost always?) handled through the provided tools, like OFFICE oversights [21:28:30] but it seems it's not as uncommon [21:30:49] even oversight doesn't fully remove the file from all servers (it just gets hidden from non-oversighters), which is sometimes required and done via a maintenance script [21:31:47] !log removing 1 file for legal compliance [21:31:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:33:45] Platonides: what majavah said basically [21:35:39] what maintenance script is used for this? [21:36:03] eraseArchivedFile.php [21:36:30] https://wikitech.wikimedia.org/wiki/Media_storage#Removing_archived_files [21:36:57] thx [21:40:02] (03CR) 10Cwhite: "One comment inline, but Filippo should weigh in." (031 comment) [debs/prometheus-icinga-exporter] - 10https://gerrit.wikimedia.org/r/709468 (owner: 10David Caro) [21:48:07] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops, 10User-jijiki: (Need By: TBD) rack/setup/install mc10[37-54].eqiad.wmnet - https://phabricator.wikimedia.org/T274925 (10Jclark-ctr) @jijiki If the remaining two work for being racked in A2 instead of A7 they have been racked and can be configured by @Cmjohnso... 
[21:49:27] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops, 10User-jijiki: (Need By: TBD) rack/setup/install mc10[37-54].eqiad.wmnet - https://phabricator.wikimedia.org/T274925 (10Jclark-ctr) a:05Jclark-ctr→03Cmjohnson [21:55:47] 10SRE, 10DynamicPageList (Wikimedia), 10MW-1.37-notes (1.37.0-wmf.16; 2021-07-26), 10Patch-For-Review, 10Sustainability (Incident Followup): Decide on the future of DPL - https://phabricator.wikimedia.org/T287380 (10ssr) Please pay attention to RWN's call for Board of Trustees candidates to share their o... [22:02:38] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:06:14] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install ganeti102[34] - https://phabricator.wikimedia.org/T283036 (10Jclark-ctr) ganeti1023 A8. u8 port21 cableId#23000024 ganeti1024 C5 u27 port33 cableId#1963 [22:06:36] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install ganeti102[34] - https://phabricator.wikimedia.org/T283036 (10Jclark-ctr) a:05Jclark-ctr→03Cmjohnson [22:26:18] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:26:34] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install cloudcephosd102[1-5].eqiad.wmnet - https://phabricator.wikimedia.org/T284471 (10Jclark-ctr) no ports available in D5 Waiting for new switches to be configured T277340 [23:00:05] RoanKattouw, Niharika, and Urbanecm: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Evening backport window. Your patch may or may not be deployed at the sole discretion of the deployer. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210802T2300). [23:00:05] No GERRIT patches in the queue for this window AFAICS. 
[23:01:20] oh, I'm going to sync out some stuff in a few minutes [23:02:31] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:07:28] (03PS3) 10Legoktm: Move ruwikinews to large wikis dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708364 [23:07:30] (03PS2) 10Legoktm: Stop enabling DPL on new wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708374 (https://phabricator.wikimedia.org/T287380) [23:08:26] 10SRE, 10ops-eqiad: Rack/power audit in eqiad c8/d5 - https://phabricator.wikimedia.org/T280977 (10wiki_willy) [23:08:28] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install (2) new 10G switches - https://phabricator.wikimedia.org/T277340 (10wiki_willy) [23:08:49] (03PS1) 10Legoktm: Sort list of wikis in $wmgUseGlobalAbuseFilters [mediawiki-config] - 10https://gerrit.wikimedia.org/r/709560 [23:10:30] 10SRE, 10ops-eqiad, 10DC-Ops: ps1-a7-eqiad power over threshold alerts - https://phabricator.wikimedia.org/T276743 (10wiki_willy) [23:10:34] 10SRE, 10decommission-hardware, 10serviceops, 10Patch-For-Review: decom 44 eqiad appservers purchased on 2016-04-12/13 (mw1261 through mw1301) - https://phabricator.wikimedia.org/T280203 (10wiki_willy) [23:11:48] (03PS2) 10Legoktm: Improve docs on $wmgUseGlobalAbuseFilters and sort list of wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/709560 [23:12:34] (03CR) 10Legoktm: [C: 03+2] Move ruwikinews to large wikis dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708364 (owner: 10Legoktm) [23:13:19] (03Merged) 10jenkins-bot: Move ruwikinews to large wikis dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708364 (owner: 10Legoktm) [23:13:25] (03CR) 10Legoktm: Stop enabling DPL on new wikis (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708374 (https://phabricator.wikimedia.org/T287380) (owner: 10Legoktm) [23:13:27] (03CR) 
10Legoktm: [C: 03+2] Stop enabling DPL on new wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708374 (https://phabricator.wikimedia.org/T287380) (owner: 10Legoktm) [23:14:10] (03Merged) 10jenkins-bot: Stop enabling DPL on new wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708374 (https://phabricator.wikimedia.org/T287380) (owner: 10Legoktm) [23:16:57] !log legoktm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Move ruwikinews to large wikis dblist (1/2) (duration: 00m 57s) [23:17:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:18:36] !log legoktm@deploy1002 Synchronized dblists/: Move ruwikinews to large wikis dblist (2/2) (duration: 00m 56s) [23:18:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:20:34] oops, I mis-staged that [23:21:13] !log Previous sync also deployed c38998f03f "Stop enabling DPL on new wikis" (T287380) [23:21:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:21:20] T287380: Decide on the future of DPL - https://phabricator.wikimedia.org/T287380 [23:26:00] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [23:26:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:26:08] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:27:08] o.O [23:28:24] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [23:28:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:35:55] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[23:36:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:38:03] !log legoktm@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'shellbox-constraints' for release 'main' . [23:38:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:38:21] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [23:38:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:38:57] 10SRE, 10Services, 10Wikibase-Quality-Constraints, 10Wikidata, and 4 others: Deploy Shellbox instance (shellbox-constraints) for Wikidata constraint regexes - https://phabricator.wikimedia.org/T285104 (10Legoktm) [23:38:57] !log legoktm@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'shellbox-constraints' for release 'main' . [23:39:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:46:52] (03PS1) 10BryanDavis: toolhub: initial chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/709565 (https://phabricator.wikimedia.org/T287716) [23:47:48] (03CR) 10jerkins-bot: [V: 04-1] toolhub: initial chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/709565 (https://phabricator.wikimedia.org/T287716) (owner: 10BryanDavis) [23:50:38] !log legoktm@cumin1001 START - Cookbook sre.dns.netbox [23:50:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:58:25] !log legoktm@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [23:58:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log