[00:00:05] RoanKattouw and Urbanecm: That opportune time is upon us again. Time for a UTC late backport window deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211215T0000).
[00:00:05] EricGardner and legoktm: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[00:00:31] o/
[00:00:42] Present for today's late backport window. Just waiting for CI to complete on my cherry-picked patches now
[00:00:47] RECOVERY - Check systemd state on aphlict1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:00:50] Hello, I can deploy today
[00:01:06] I've already +2ed all the patches, just waiting for CI
[00:01:33] bblack: the mobile endpoints are sometimes flakey, I think if there was a real issue more alerts would have fired along with it
[00:01:39] !log lvs1015: start pybal, back to normal
[00:01:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:02:51] PROBLEM - Maps tiles generation on alert1001 is CRITICAL: CRITICAL: 96.18% of data under the critical threshold [5.0] https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1
[00:02:53] RECOVERY - BGP status on cr1-eqiad is OK: BGP OK - up: 99, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[00:03:01] RECOVERY - PyBal connections to etcd on lvs1015 is OK: OK: 77 connections established with conf1004.eqiad.wmnet:4001 (min=77) https://wikitech.wikimedia.org/wiki/PyBal
[00:03:35] RECOVERY - pybal on lvs1015 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal
[00:03:51] RECOVERY - PyBal backends health check on lvs1015 is OK: PYBAL OK - All pools are healthy
https://wikitech.wikimedia.org/wiki/PyBal
[00:05:39] PROBLEM - SSH on contint1001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[00:09:26] (Merged) jenkins-bot: Remove multiple instance of VUEX initialization [extensions/MediaSearch] (wmf/1.38.0-wmf.13) - https://gerrit.wikimedia.org/r/747081 (https://phabricator.wikimedia.org/T297690) (owner: Eric Gardner)
[00:10:01] PROBLEM - SSH on kubernetes1004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[00:13:37] (Merged) jenkins-bot: Don't attempt to scroll to a non-existing result [extensions/MediaSearch] (wmf/1.38.0-wmf.13) - https://gerrit.wikimedia.org/r/747078 (owner: Eric Gardner)
[00:13:41] (Merged) jenkins-bot: Revert "Replace deprecated methods IContextSource::getWikiPage && IContextSource::canUseWikiPage" [core] (wmf/1.38.0-wmf.13) - https://gerrit.wikimedia.org/r/747079 (https://phabricator.wikimedia.org/T297744) (owner: Legoktm)
[00:14:39] (Merged) jenkins-bot: Revert "Replace deprecated methods IContextSource::getWikiPage && IContextSource::canUseWikiPage" [core] (wmf/1.38.0-wmf.12) - https://gerrit.wikimedia.org/r/747080 (https://phabricator.wikimedia.org/T297744) (owner: Legoktm)
[00:15:54] RoanKattouw: Looks like everything completed CI successfully
[00:18:17] Thanks for the ping, I'll deploy in a second
[00:18:22] Was just finishing dinner
[00:19:56] (PS1) Legoktm: docker_registry_ha: Set log level to debug [puppet] - https://gerrit.wikimedia.org/r/747216
[00:22:05] EricGardner, legoktm: All four of your patches are now on mwdebug1002 for testing
[00:22:13] * legoktm tries
[00:22:21] checking now
[00:22:43] RoanKattouw: lgtm!
[00:22:46] (CR) Cwhite: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/747139 (https://phabricator.wikimedia.org/T291946) (owner: Filippo Giunchedi)
[00:23:05] RoanKattouw: everything looks correct here too
[00:24:38] (CR) Cwhite: [C: +1] "Looks good!" [puppet] - https://gerrit.wikimedia.org/r/747140 (https://phabricator.wikimedia.org/T291946) (owner: Filippo Giunchedi)
[00:25:27] OK great. Deploying legoktm's wmf.12 patch now, then I'll do his wmf.13 patch, then EricGardner'S
[00:26:02] !log catrope@deploy1002 Synchronized php-1.38.0-wmf.12/includes/: Backport: [[gerrit:747080|Revert "Replace deprecated methods IContextSource::getWikiPage && IContextSource::canUseWikiPage" (T297744)]] (duration: 01m 11s)
[00:26:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:26:07] T297744: Page tabs like "Edit", "View history" do not appear on Special:WhatLinksHere - https://phabricator.wikimedia.org/T297744
[00:28:11] !log catrope@deploy1002 Synchronized php-1.38.0-wmf.13/includes/: Backport: [[gerrit:747079|Revert "Replace deprecated methods IContextSource::getWikiPage && IContextSource::canUseWikiPage" (T297744)]] (duration: 01m 12s)
[00:28:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:28:40] still looks good with no mwdebug. thanks RoanKattouw!
[00:29:53] !log catrope@deploy1002 Synchronized php-1.38.0-wmf.13/extensions/MediaSearch/resources/components/SearchResults.vue: Backport: [[gerrit:747078|Don't attempt to scroll to a non-existing result]] (duration: 01m 05s)
[00:29:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:31:15] !log catrope@deploy1002 Synchronized php-1.38.0-wmf.13/extensions/MediaSearch/resources/store/index.js: Backport: [[gerrit:747081|Remove multiple instance of VUEX initialization (T297690)]] (duration: 01m 04s)
[00:31:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:31:21] T297690: Remove multiple instance of VUEX initialization - https://phabricator.wikimedia.org/T297690
[00:31:59] RoanKattouw: all looks good here sans mwdebug
[00:32:23] (am checking Mediasearch against testcommons since it is on wmf 13)
[00:33:03] Great! Successful deployment then
[00:33:23] Thanks!
[00:33:54] RoanKattouw: OK for me to run a scap sync command now?
[00:34:17] dancy: Go for it, I'm done
[00:34:21] thx
[00:34:55] ew.. php lint failed
[00:37:05] `Parse error: syntax error, unexpected 'public' (T_PUBLIC), expecting variable (T_VARIABLE) in /srv/mediawiki-staging/php-1.38.0-wmf.9/vendor/symfony/console/Attribute/AsCommand.php on line 21`
[00:37:43] wmf.9 is not live right now so it's not a tragedy but it is making scap fail.
[00:39:48] o.O
[00:40:57] https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/vendor/+/refs/heads/master/symfony/console/Attribute/AsCommand.php is not syntax compatible with PHP 7.2
[00:41:03] I guess it just never gets loaded
[00:41:31] Why is this now a problem?
[00:41:39] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[00:41:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:41:50] ooh, mw image build finished.
[00:41:54] we only check syntax on sync-file/dir, and people usually don't sync vendor/
[00:42:11] aaah I was testing sync-dir.
[00:42:26] well that's confusing. Less checks for the initial train
[00:43:27] I don't remember why we don't do it on full scaps
[00:43:42] maybe because of this problem? :|
[00:43:47] haha probably
[00:47:19] PROBLEM - Check systemd state on logstash1026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:47:58] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[00:48:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:49:57] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=sidekiq site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:52:38] !log dancy@deploy1002 Synchronized /: testing (duration: 00m 37s)
[00:52:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:54:15] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:55:38] (CR) Legoktm: [C: +1] "belated thanks!"
[puppet] - https://gerrit.wikimedia.org/r/736596 (https://phabricator.wikimedia.org/T294802) (owner: Dzahn)
[00:55:49] !log dancy@deploy1002 Started scap: testing
[00:55:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:59:27] !log dancy@deploy1002 Finished scap: testing (duration: 03m 38s)
[00:59:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:00:02] (CR) Dave Pifke: Make fix-staging-perms also fix /srv/patches permissions (1 comment) [puppet] - https://gerrit.wikimedia.org/r/747187 (owner: Urbanecm)
[01:04:05] I'm stepping away for the day.
[01:11:07] RECOVERY - SSH on kubernetes1004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[01:59:40] (PS2) Andrew Bogott: cloudmetrics: remove refs for cloudmetrics1001/1002 and prepare for decom [puppet] - https://gerrit.wikimedia.org/r/745951 (https://phabricator.wikimedia.org/T289888)
[02:00:51] (CR) Andrew Bogott: [C: +2] cloudmetrics: remove refs for cloudmetrics1001/1002 and prepare for decom [puppet] - https://gerrit.wikimedia.org/r/745951 (https://phabricator.wikimedia.org/T289888) (owner: Andrew Bogott)
[02:07:57] RECOVERY - SSH on contint1001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:45:31] (PS2) Urbanecm: Make fix-staging-perms also fix /srv/patches permissions [puppet] - https://gerrit.wikimedia.org/r/747187
[06:45:40] (CR) Urbanecm: Make fix-staging-perms also fix /srv/patches permissions (1 comment) [puppet] - https://gerrit.wikimedia.org/r/747187 (owner: Urbanecm)
[07:04:25] !log Enable full_crc32 on db2094 (s1, s3, s5 and s8) T287244
[07:04:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:04:31] T287244: Considering switching innodb_checksum_algorithm=full_crc32 -
https://phabricator.wikimedia.org/T287244
[07:07:34] (PS1) Marostegui: sanitarium_multiinstance.my.cnf.erb: Enable innodb_checksum_algorithm=full_crc32 [puppet] - https://gerrit.wikimedia.org/r/747413 (https://phabricator.wikimedia.org/T287244)
[07:09:20] (CR) Marostegui: [C: +2] sanitarium_multiinstance.my.cnf.erb: Enable innodb_checksum_algorithm=full_crc32 [puppet] - https://gerrit.wikimedia.org/r/747413 (https://phabricator.wikimedia.org/T287244) (owner: Marostegui)
[07:22:34] RECOVERY - Check systemd state on stat1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:32:26] (CR) Hashar: "recheck after having deployed https://gerrit.wikimedia.org/r/c/integration/config/+/747148" [deployment-charts] - https://gerrit.wikimedia.org/r/746864 (https://phabricator.wikimedia.org/T251305) (owner: Jelto)
[08:37:17] (CR) Muehlenhoff: [C: +2] Add library hint for libsamplerate [puppet] - https://gerrit.wikimedia.org/r/747110 (owner: Muehlenhoff)
[08:44:12] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti2018.codfw.wmnet with OS buster
[08:44:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:44:18] SRE, Infrastructure-Foundations: Migrate codfw Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296622 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti2018.codfw.wmnet with OS buster
[08:53:35] (CR) Filippo Giunchedi: [C: +1] "LGTM, CC'ing Matthew as well" [puppet] - https://gerrit.wikimedia.org/r/746897 (https://phabricator.wikimedia.org/T292700) (owner: Hnowlan)
[09:02:22] (PS4) Elukey: helmfile.d: Configure all ml-services to use the Istio egress gw [deployment-charts] - https://gerrit.wikimedia.org/r/747156 (https://phabricator.wikimedia.org/T294414)
[09:02:24] (PS1) Elukey: Update utils.rb's helm_version function
[deployment-charts] - https://gerrit.wikimedia.org/r/747460 (https://phabricator.wikimedia.org/T251305)
[09:02:26] (PS1) Elukey: knative-serving: refactor istio egress gateway configuration [deployment-charts] - https://gerrit.wikimedia.org/r/747461 (https://phabricator.wikimedia.org/T294414)
[09:06:31] (CR) Sergio Gimeno: [C: +1] betalabs: Enable Watchlist Echo notifications feature [mediawiki-config] - https://gerrit.wikimedia.org/r/747186 (https://phabricator.wikimedia.org/T203941) (owner: Kosta Harlan)
[09:10:05] (PS1) Giuseppe Lavagetto: mediawiki: add ability to inject apache configurations early [deployment-charts] - https://gerrit.wikimedia.org/r/747462 (https://phabricator.wikimedia.org/T297613)
[09:12:02] (CR) jerkins-bot: [V: -1] mediawiki: add ability to inject apache configurations early [deployment-charts] - https://gerrit.wikimedia.org/r/747462 (https://phabricator.wikimedia.org/T297613) (owner: Giuseppe Lavagetto)
[09:13:40] (CR) Giuseppe Lavagetto: [C: +2] Rakefile: remove helm2 from Rakefile, bump scaffold to v2 api [deployment-charts] - https://gerrit.wikimedia.org/r/746864 (https://phabricator.wikimedia.org/T251305) (owner: Jelto)
[09:14:54] (Abandoned) Elukey: Update utils.rb's helm_version function [deployment-charts] - https://gerrit.wikimedia.org/r/747460 (https://phabricator.wikimedia.org/T251305) (owner: Elukey)
[09:17:54] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti2018.codfw.wmnet with OS buster
[09:17:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:17:58] SRE, Infrastructure-Foundations: Migrate codfw Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296622 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti2018.codfw.wmnet with OS buster completed: - ganeti2018 (**PASS**) - Downtimed on Icinga...
[09:18:55] (Merged) jenkins-bot: Rakefile: remove helm2 from Rakefile, bump scaffold to v2 api [deployment-charts] - https://gerrit.wikimedia.org/r/746864 (https://phabricator.wikimedia.org/T251305) (owner: Jelto)
[09:22:05] (PS2) Giuseppe Lavagetto: mediawiki: add ability to inject apache configurations early [deployment-charts] - https://gerrit.wikimedia.org/r/747462 (https://phabricator.wikimedia.org/T297613)
[09:22:40] (CR) jerkins-bot: [V: -1] mediawiki: add ability to inject apache configurations early [deployment-charts] - https://gerrit.wikimedia.org/r/747462 (https://phabricator.wikimedia.org/T297613) (owner: Giuseppe Lavagetto)
[09:23:21] (PS2) Elukey: knative-serving: refactor istio egress gateway configuration [deployment-charts] - https://gerrit.wikimedia.org/r/747461 (https://phabricator.wikimedia.org/T294414)
[09:23:23] (PS5) Elukey: helmfile.d: Configure all ml-services to use the Istio egress gw [deployment-charts] - https://gerrit.wikimedia.org/r/747156 (https://phabricator.wikimedia.org/T294414)
[09:23:42] (CR) DCausse: [C: +1] "lgtm" [cookbooks] - https://gerrit.wikimedia.org/r/745629 (https://phabricator.wikimedia.org/T293638) (owner: Ebernhardson)
[09:24:20] SRE, observability, Patch-For-Review: Open Phab tasks on SMART failure - https://phabricator.wikimedia.org/T196994 (fgiunchedi) This is possible now with Prometheus and Alertmanager, cc T294564
[09:26:51] SRE, SRE-swift-storage: swift-drive-audit unmounting a drive doesn't produce any alerts or notifications - https://phabricator.wikimedia.org/T222362 (fgiunchedi)
[09:27:33] SRE, Infrastructure-Foundations, puppet-compiler, User-herron, User-jbond: Prevent puppet catalog compiler workers from running out of disk space - https://phabricator.wikimedia.org/T222075 (fgiunchedi)
[09:28:10] (PS1) MVernon: admin: add approver for the "restricted" group [puppet] - https://gerrit.wikimedia.org/r/747463
[09:28:19] !log
pool cp4025 - T271421
[09:28:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:28:25] T271421: Test envoyproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T271421
[09:32:15] (CR) Elukey: [C: +2] knative-serving: refactor istio egress gateway configuration [deployment-charts] - https://gerrit.wikimedia.org/r/747461 (https://phabricator.wikimedia.org/T294414) (owner: Elukey)
[09:32:30] SRE, serviceops, User-Elukey: Test memsniff as possible replacement of memkeys - https://phabricator.wikimedia.org/T228970 (fgiunchedi)
[09:33:17] SRE, Traffic, Patch-For-Review: Test envoyproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T271421 (Vgutierrez)
[09:34:31] SRE, observability, service-runner, serviceops-radar, Patch-For-Review: Re-evaluate service-runner's (ab)use of statsd timing metric for nodejs GC stats - https://phabricator.wikimedia.org/T222795 (fgiunchedi) I believe this is now (partially?) done, and service-runner supports Prometheus nat...
[09:36:45] SRE, observability, User-CDanis: Expose pooled status of gdnsd and conftool managed services as metrics - https://phabricator.wikimedia.org/T230733 (fgiunchedi) I have implemented part of this work for service::catalog network probes, specifically I needed to exporter per-service `state` field. Even...
[09:42:59] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'.
[09:43:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:43:30] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'.
[09:43:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:45:41] (PS6) Elukey: helmfile.d: Configure all ml-services to use the Istio egress gw [deployment-charts] - https://gerrit.wikimedia.org/r/747156 (https://phabricator.wikimedia.org/T294414)
[09:47:40] (PS3) Giuseppe Lavagetto: mediawiki: add ability to inject apache configurations early [deployment-charts] - https://gerrit.wikimedia.org/r/747462 (https://phabricator.wikimedia.org/T297613)
[09:49:46] (CR) Elukey: [C: +2] helmfile.d: Configure all ml-services to use the Istio egress gw [deployment-charts] - https://gerrit.wikimedia.org/r/747156 (https://phabricator.wikimedia.org/T294414) (owner: Elukey)
[09:51:22] SRE, Infrastructure-Foundations: Migrate codfw Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296622 (MoritzMuehlenhoff) p:Triage→Medium
[09:51:59] (PS1) Jgiannelos: Enable tegola on enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/747486
[09:52:32] (PS2) Jgiannelos: Enable tegola on enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/747486 (https://phabricator.wikimedia.org/T280767)
[09:53:15] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality' for release 'main' .
[09:53:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:53:22] (PS3) Jgiannelos: Deprecate unused maps event stream [mediawiki-config] - https://gerrit.wikimedia.org/r/747196 (https://phabricator.wikimedia.org/T293366)
[09:56:01] (CirrusSearchJVMGCOldPoolFlatlined) firing: Elasticsearch instance elastic1043-production-search-psi-eqiad is showing memory pressure in the old pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://alerts.wikimedia.org
[09:57:54] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' .
[09:57:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:57:58] (PS1) Jelto: Rakefile/rake_modules: remove unused function helm_version() and cleanup [deployment-charts] - https://gerrit.wikimedia.org/r/747487 (https://phabricator.wikimedia.org/T251305)
[10:00:55] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' .
[10:00:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:01:36] PROBLEM - Debian mirror in sync with upstream on mirror1001 is CRITICAL: /srv/mirrors/debian is over 14 hours old. https://wikitech.wikimedia.org/wiki/Mirrors
[10:05:54] SRE, Graphite, Patch-For-Review, User-fgiunchedi, Wikimedia-Incident: graphite1004 freezing - https://phabricator.wikimedia.org/T297265 (fgiunchedi) Open→Resolved a:fgiunchedi Reverting to `5.10.0-9` has brought back stability, resolving. We still have T297433 to update firmware,...
[10:07:53] (PS1) Btullis: Update the version of log4j2 that is in use [debs/druid] - https://gerrit.wikimedia.org/r/747488 (https://phabricator.wikimedia.org/T297468)
[10:10:09] (CR) Elukey: "Let's also remove the old jars, otherwise there may be conflicts and/or Druid may pick up the wrong version." [debs/druid] - https://gerrit.wikimedia.org/r/747488 (https://phabricator.wikimedia.org/T297468) (owner: Btullis)
[10:11:13] (PS2) Btullis: Update the version of log4j2 that is in use [debs/druid] - https://gerrit.wikimedia.org/r/747488 (https://phabricator.wikimedia.org/T297468)
[10:11:58] SRE, SRE-swift-storage, Data-Persistence-Backup, media-backups, and 3 others: WMF media storage must be adequately backed up - https://phabricator.wikimedia.org/T262668 (jcrespo) This is the list of media backup errors (making it NDA-only, as I haven't checked yet everything there is non-private)...
[10:20:51] (CR) Muehlenhoff: [C: +1] "Looks good to me" [debs/druid] - https://gerrit.wikimedia.org/r/747488 (https://phabricator.wikimedia.org/T297468) (owner: Btullis)
[10:23:54] (CR) Elukey: "Moritz what is the best course of action for this patch? We have a dedicated Debian branch for Druid, and we usually (IIRC) merge master o" [debs/druid] - https://gerrit.wikimedia.org/r/747488 (https://phabricator.wikimedia.org/T297468) (owner: Btullis)
[10:24:48] (CR) Elukey: "I also see the following:" [debs/druid] - https://gerrit.wikimedia.org/r/747488 (https://phabricator.wikimedia.org/T297468) (owner: Btullis)
[10:25:17] (CR) Jcrespo: puppetmaster: Install 'age' on puppetmaster frontends (1 comment) [puppet] - https://gerrit.wikimedia.org/r/747170 (https://phabricator.wikimedia.org/T262668) (owner: Jcrespo)
[10:26:19] (CR) Btullis: "I see that Gerrit allows me to move the change to a different branch. I could move it to the debian branch, but I'm not sure if I have per" [debs/druid] - https://gerrit.wikimedia.org/r/747488 (https://phabricator.wikimedia.org/T297468) (owner: Btullis)
[10:27:11] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'.
[10:27:13] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'.
[10:27:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:27:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:27:39] (CR) Btullis: "I could also update the LICENCE file, which specifically mentions versions 2.8.2 of the log4j2 components.
It won't make any material diff" [debs/druid] - https://gerrit.wikimedia.org/r/747488 (https://phabricator.wikimedia.org/T297468) (owner: Btullis)
[10:33:27] (CR) Elukey: Update the version of log4j2 that is in use (2 comments) [debs/druid] - https://gerrit.wikimedia.org/r/747488 (https://phabricator.wikimedia.org/T297468) (owner: Btullis)
[10:34:55] (PS3) Btullis: Update the version of log4j2 that is in use [debs/druid] - https://gerrit.wikimedia.org/r/747488 (https://phabricator.wikimedia.org/T297468)
[10:37:43] (CR) Elukey: "The debian/README file states that we do:" [debs/druid] - https://gerrit.wikimedia.org/r/747488 (https://phabricator.wikimedia.org/T297468) (owner: Btullis)
[10:37:45] (CR) Kormat: "One minor comment." [puppet] - https://gerrit.wikimedia.org/r/746826 (https://phabricator.wikimedia.org/T296537) (owner: Ladsgroup)
[10:38:26] (CR) Elukey: "Ben can we change LOG4J2_VERSION's value in the jconsole.sh too?" [debs/druid] - https://gerrit.wikimedia.org/r/747488 (https://phabricator.wikimedia.org/T297468) (owner: Btullis)
[10:39:55] (PS4) Btullis: Update the version of log4j2 that is in use [debs/druid] - https://gerrit.wikimedia.org/r/747488 (https://phabricator.wikimedia.org/T297468)
[10:42:20] (CR) Btullis: "OK, that's fine. I've update the patch by adding log4j-1.2-api-2.16.0 as well."
[debs/druid] - https://gerrit.wikimedia.org/r/747488 (https://phabricator.wikimedia.org/T297468) (owner: Btullis)
[10:43:35] (CR) Elukey: [C: +1] Update the version of log4j2 that is in use [debs/druid] - https://gerrit.wikimedia.org/r/747488 (https://phabricator.wikimedia.org/T297468) (owner: Btullis)
[10:59:53] (CR) Jbond: [C: +2] "manually Tested on pcc-workers and working" [software/puppet-compiler] - https://gerrit.wikimedia.org/r/747207 (owner: Jbond)
[11:01:12] (Merged) jenkins-bot: populate_puppetdb: update tp use config class [software/puppet-compiler] - https://gerrit.wikimedia.org/r/747207 (owner: Jbond)
[11:06:01] (CirrusSearchJVMGCOldPoolFlatlined) resolved: Elasticsearch instance elastic1043-production-search-eqiad is showing memory pressure in the old pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://alerts.wikimedia.org
[11:11:10] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2018.codfw.wmnet
[11:11:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:13:28] (PS3) Filippo Giunchedi: prometheus: add mini-textfile-exporter [puppet] - https://gerrit.wikimedia.org/r/747139 (https://phabricator.wikimedia.org/T291946)
[11:13:30] (PS3) Filippo Giunchedi: prometheus: export service catalog metrics [puppet] - https://gerrit.wikimedia.org/r/747140 (https://phabricator.wikimedia.org/T291946)
[11:13:32] (PS1) Filippo Giunchedi: prometheus: consider non-discovery case when sending SNI in blackbox [puppet] - https://gerrit.wikimedia.org/r/747493 (https://phabricator.wikimedia.org/T291946)
[11:13:45] (PS2) Jbond: P:age::store: Add profile and class to configure age secret store [puppet] - https://gerrit.wikimedia.org/r/747193
[11:14:13] (PS3) Jbond: O:puppetmaster: Add age::store to puppetmasters [puppet] - https://gerrit.wikimedia.org/r/747194
[11:15:51] (CR) Filippo Giunchedi: [V: +1] "PCC SUCCESS (DIFF 1):
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33001/console" [puppet] - https://gerrit.wikimedia.org/r/747493 (https://phabricator.wikimedia.org/T291946) (owner: Filippo Giunchedi)
[11:17:52] (PS4) Giuseppe Lavagetto: mediawiki: add ability to inject apache configurations early [deployment-charts] - https://gerrit.wikimedia.org/r/747462 (https://phabricator.wikimedia.org/T297613)
[11:17:57] (CR) Jbond: "This is still very much WIP and low priority but you may be intrested" [puppet] - https://gerrit.wikimedia.org/r/747193 (owner: Jbond)
[11:17:57] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2018.codfw.wmnet
[11:18:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:18:32] SRE, Wikimedia-Incident: Uncached wiki requests partially unavailable due to excessive request rates from a bot - https://phabricator.wikimedia.org/T280232 (jcrespo) Open→Resolved a:jcrespo I don't think it is worth this being open anymore- there indeed is a need to review it to generate a be...
[11:20:44] (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33002/console" [puppet] - https://gerrit.wikimedia.org/r/747194 (owner: Jbond)
[11:21:19] (CR) Arturo Borrero Gonzalez: [C: +2] wmcs-srpeadcheck-tools: add new shorter webgrid names [puppet] - https://gerrit.wikimedia.org/r/731113 (https://phabricator.wikimedia.org/T292465) (owner: David Caro)
[11:21:34] (CR) Arturo Borrero Gonzalez: [C: +2] wmcs-srpeadcheck-tools: add new shorter webgrid names (1 comment) [puppet] - https://gerrit.wikimedia.org/r/731113 (https://phabricator.wikimedia.org/T292465) (owner: David Caro)
[11:24:34] (PS1) Majavah: LabsServices: refresh cloudmetrics server [mediawiki-config] - https://gerrit.wikimedia.org/r/747494 (https://phabricator.wikimedia.org/T289888)
[11:28:16] (PS1) Jbond: role_hosts: capitilise role before testing [puppet] - https://gerrit.wikimedia.org/r/747495
[11:29:14] (CR) Btullis: [C: +2] Update the version of log4j2 that is in use [debs/druid] - https://gerrit.wikimedia.org/r/747488 (https://phabricator.wikimedia.org/T297468) (owner: Btullis)
[11:29:47] (CR) Jbond: [V: +1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33003/console" [puppet] - https://gerrit.wikimedia.org/r/747495 (owner: Jbond)
[11:30:29] (PS4) Jbond: O:puppetmaster: Add age::store to puppetmasters [puppet] - https://gerrit.wikimedia.org/r/747194
[11:30:42] (CR) Btullis: [V: +2 C: +2] Update the version of log4j2 that is in use [debs/druid] - https://gerrit.wikimedia.org/r/747488 (https://phabricator.wikimedia.org/T297468) (owner: Btullis)
[11:31:35] (CR) Jbond: [C: +1] remote: wait_reboot_since, intercept bad uptimes [software/spicerack] - https://gerrit.wikimedia.org/r/747155 (owner: Volans)
[11:31:37] (PS1) Arturo Borrero Gonzalez: cookbook/wmcs: format with black [cookbooks]
(wmcs) - https://gerrit.wikimedia.org/r/747497
[11:32:21] (CR) Jbond: [V: +1 C: +2] role_hosts: capitilise role before testing [puppet] - https://gerrit.wikimedia.org/r/747495 (owner: Jbond)
[11:32:32] (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/747170 (https://phabricator.wikimedia.org/T262668) (owner: Jcrespo)
[11:33:36] (PS7) Arturo Borrero Gonzalez: DONOTMERGE toolforge: new add_grid_webgrid_generic_node recipe [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/726894 (https://phabricator.wikimedia.org/T292465) (owner: David Caro)
[11:34:18] (PS1) Kormat: wmfdb/cli_admin: Expand the description for db-mysql. [software/wmfdb] - https://gerrit.wikimedia.org/r/747498
[11:34:31] (CR) jerkins-bot: [V: -1] cookbook/wmcs: format with black [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/747497 (owner: Arturo Borrero Gonzalez)
[11:35:33] (PS2) Kormat: wmfdb/cli_admin: Expand the description for db-mysql.
[software/wmfdb] - https://gerrit.wikimedia.org/r/747498
[11:35:39] <_joe_> !log uploading php 7.2 7.2.34-18+0~20210223.60+debian10~1.gbpb21322+wmf4 to buster T297667
[11:35:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:35:45] T297667: mysqli/mysqlnd memory leak - https://phabricator.wikimedia.org/T297667
[11:36:31] (CR) jerkins-bot: [V: -1] DONOTMERGE toolforge: new add_grid_webgrid_generic_node recipe [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/726894 (https://phabricator.wikimedia.org/T292465) (owner: David Caro)
[11:36:51] <_joe_> !log upgrading php7.2 on mw1414, T297667
[11:36:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:37:24] (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33004/console" [puppet] - https://gerrit.wikimedia.org/r/747194 (owner: Jbond)
[11:37:48] (PS3) Kormat: wmfdb/cli_admin: Expand the description for db-mysql. [software/wmfdb] - https://gerrit.wikimedia.org/r/747498
[11:39:57] <_joe_> !log repooling mw1414 T297667
[11:40:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:41:49] (PS1) Btullis: Merge branch 'master' into debian [debs/druid] - https://gerrit.wikimedia.org/r/747499
[11:42:10] (Abandoned) Arturo Borrero Gonzalez: cookbook/wmcs: format with black [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/747497 (owner: Arturo Borrero Gonzalez)
[11:42:32] (CR) Kormat: [C: +2] wmfdb/cli_admin: Expand the description for db-mysql. [software/wmfdb] - https://gerrit.wikimedia.org/r/747498 (owner: Kormat)
[11:44:13] (PS2) Btullis: Merge branch 'master' into debian [debs/druid] - https://gerrit.wikimedia.org/r/747499 (https://phabricator.wikimedia.org/T297468)
[11:46:57] (PS1) Kormat: debian: Initial debian packaging.
[software/wmfdb] - 10https://gerrit.wikimedia.org/r/747501 (https://phabricator.wikimedia.org/T297618) [11:48:26] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2150.codfw.wmnet with reason: Maintenance [11:48:28] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2150.codfw.wmnet with reason: Maintenance [11:48:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:48:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:49:12] (03CR) 10Kormat: [C: 03+2] debian: Initial debian packaging. [software/wmfdb] - 10https://gerrit.wikimedia.org/r/747501 (https://phabricator.wikimedia.org/T297618) (owner: 10Kormat) [11:53:16] 10SRE, 10Infrastructure-Foundations: Migrate codfw Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296622 (10MoritzMuehlenhoff) [11:56:03] (03PS5) 10Jbond: O:puppetmaster: Add age::store to puppetmasters [puppet] - 10https://gerrit.wikimedia.org/r/747194 [11:56:10] (03PS1) 10Vgutierrez: cfssl: fix create chained cert typo [puppet] - 10https://gerrit.wikimedia.org/r/747502 [12:00:02] PROBLEM - SSH on bast3005 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [12:00:05] Amir1, Lucas_WMDE, awight, and Urbanecm: #bothumor My software never has bugs. It just develops random features. Rise for UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211215T1200). [12:00:05] Lucas_WMDE and nn1l2: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[12:00:07] o/ [12:00:15] hi [12:00:23] hey [12:00:34] Lucas is not connected it seems [12:00:35] I can deploy today [12:00:38] uh [12:00:41] why am I a guest lol [12:00:43] oh, he is [12:00:44] (Lucas here) [12:00:46] lemme rejoin [12:01:32] welcome back Lucas_WMDE [12:01:35] leaving it to you :)) [12:01:37] ayyyy [12:01:41] thx ^^ [12:01:43] ok [12:01:48] <_joe_> "thanks" [12:01:55] <_joe_> :D [12:02:05] * urbanecm waves to _joe_ [12:02:08] RECOVERY - SSH on bast3005 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [12:02:27] let’s start with nn1l2’s change, mine will take longer to verify [12:02:42] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33005/console" [puppet] - 10https://gerrit.wikimedia.org/r/747194 (owner: 10Jbond) [12:03:14] oof, long diff [12:03:24] let’s look at the diffConfig instead [12:03:37] which looks like entries were only reordered, as you’d expect [12:04:13] (03CR) 10Urbanecm: [C: 03+1] "SGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/745220 (https://phabricator.wikimedia.org/T296643) (owner: 104nn1l2) [12:04:19] yup, patch looks good to me [12:05:21] (03PS7) 10Lucas Werkmeister (WMDE): Remove redundant project namespace aliases [mediawiki-config] - 10https://gerrit.wikimedia.org/r/745220 (https://phabricator.wikimedia.org/T296643) (owner: 104nn1l2) [12:05:32] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Remove redundant project namespace aliases [mediawiki-config] - 10https://gerrit.wikimedia.org/r/745220 (https://phabricator.wikimedia.org/T296643) (owner: 104nn1l2) [12:06:00] (03PS1) 10Majavah: P::toolforge::grid: fix tomcat on buster [puppet] - 10https://gerrit.wikimedia.org/r/747503 (https://phabricator.wikimedia.org/T277653) [12:06:44] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mediawiki: add ability to inject apache configurations early [deployment-charts] - 
10https://gerrit.wikimedia.org/r/747462 (https://phabricator.wikimedia.org/T297613) (owner: 10Giuseppe Lavagetto) [12:06:51] (03PS2) 10Majavah: P::toolforge::grid: fix tomcat on buster [puppet] - 10https://gerrit.wikimedia.org/r/747503 (https://phabricator.wikimedia.org/T277653) [12:07:30] (03Merged) 10jenkins-bot: Remove redundant project namespace aliases [mediawiki-config] - 10https://gerrit.wikimedia.org/r/745220 (https://phabricator.wikimedia.org/T296643) (owner: 104nn1l2) [12:08:04] nn1l2: the change is on mwdebug1001, can you test it? [12:08:11] ok [12:08:26] !log added ganeti2025 to codfw ganeti cluster T282603 [12:08:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:08:31] T282603: (Need By: TBD) rack/setup/install ganeti202[56] - https://phabricator.wikimedia.org/T282603 [12:08:42] tested on da.wiki [12:08:47] WP:V works [12:08:52] LGTM [12:08:54] I’m not seeing any change in https://as.wikipedia.org/w/api.php?action=query&meta=siteinfo&siprop=namespacealiases&formatversion=2 [12:08:56] RECOVERY - Maps tiles generation on alert1001 is OK: OK: Less than 90.00% under the threshold [10.0] https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1 [12:08:59] not even in the order [12:09:01] Do I need to do other test? [12:09:20] (though I’m bewildered where those social media icons on the right come from… does common.js run on api.php with the default format?!) [12:09:26] no other test needed I think [12:09:41] so let's sync [12:10:13] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[12:10:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:10:44] (03Merged) 10jenkins-bot: mediawiki: add ability to inject apache configurations early [deployment-charts] - 10https://gerrit.wikimedia.org/r/747462 (https://phabricator.wikimedia.org/T297613) (owner: 10Giuseppe Lavagetto) [12:11:11] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:745220|Remove redundant project namespace aliases (T296643)]] (no-op) (duration: 01m 07s) [12:11:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:11:16] T296643: Remove redundant project namespace aliases - https://phabricator.wikimedia.org/T296643 [12:11:16] does common.js run on api.php with the default format?!) <====== that, or common.css [12:11:41] clicking it triggers an alert() so I assume it’s not CSS [12:12:12] yeah there’s some ResourceLoader stuff in the [12:12:34] including an enwiki User CSS that’s probably added dynamically :| [12:12:40] :( [12:12:51] Lucas_WMDE: can you fill it? :D [12:13:11] fill? [12:13:18] !log oblivian@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [12:13:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:13:27] (&debug=true also seems to work to trigger RL debug mode, though the API also warns about the unknown param ^^) [12:13:27] as a task [12:13:31] ok [12:13:47] to not run site JS on api.php? [12:14:07] yes. 
i think that's highly unexpected behavior [12:14:21] yeah, the JS is clearly not expecting it [12:14:25] with errors like Uncaught TypeError: mw.user.isAnon is not a function [12:14:29] I’ll file it [12:14:41] thanks [12:16:59] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2150.codfw.wmnet with reason: Maintenance [12:17:01] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2150.codfw.wmnet with reason: Maintenance [12:17:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:17:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:20:55] created https://phabricator.wikimedia.org/T297779 [12:21:12] (03CR) 10Volans: [C: 03+2] remote: wait_reboot_since, intercept bad uptimes [software/spicerack] - 10https://gerrit.wikimedia.org/r/747155 (owner: 10Volans) [12:21:56] alright, let’s move onto the other change [12:22:15] (03PS3) 10Lucas Werkmeister (WMDE): Enable Lexeme Lua access on first four wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/746928 (https://phabricator.wikimedia.org/T294159) [12:22:27] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [12:22:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:23:26] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[12:23:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:24:08] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] "diffConfig LGTM, let’s go" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/746928 (https://phabricator.wikimedia.org/T294159) (owner: 10Lucas Werkmeister (WMDE)) [12:25:26] (03Merged) 10jenkins-bot: Enable Lexeme Lua access on first four wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/746928 (https://phabricator.wikimedia.org/T294159) (owner: 10Lucas Werkmeister (WMDE)) [12:26:08] testing on mwdebug1001 [12:29:07] (03Merged) 10jenkins-bot: remote: wait_reboot_since, intercept bad uptimes [software/spicerack] - 10https://gerrit.wikimedia.org/r/747155 (owner: 10Volans) [12:29:20] looks good as far as I can tell… let’s sync :hypehype: [12:32:05] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [12:32:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:33:14] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[12:33:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:36:13] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:746928|Enable Lexeme Lua access on first four wikis (T294159)]] happy holidays :) (duration: 01m 06s) [12:36:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:36:18] T294159: Enable Lexeme access on first set of projects - https://phabricator.wikimedia.org/T294159 [12:40:19] PROBLEM - ensure kvm processes are running on cloudvirt-wdqs1001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [12:40:42] !log UTC morning backport+config window done [12:40:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:42:26] (03CR) 10Volans: "Reply inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/745629 (https://phabricator.wikimedia.org/T293638) (owner: 10Ebernhardson) [12:43:09] !log dcaro@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on cloudvirt-wdqs1001.eqiad.wmnet with reason: T297454 [12:43:11] !log dcaro@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on cloudvirt-wdqs1001.eqiad.wmnet with reason: T297454 [12:43:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:43:15] T297454: WCQS gives "502 Bad Gateway Error" - https://phabricator.wikimedia.org/T297454 [12:43:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:43:23] !log dcaro@cumin1001 START - Cookbook sre.hosts.downtime for 30 days, 0:00:00 on cloudvirt-wdqs1001.eqiad.wmnet with reason: T297454 [12:43:25] !log dcaro@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 30 days, 0:00:00 on cloudvirt-wdqs1001.eqiad.wmnet with reason: T297454 [12:43:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:43:31] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:48:07] RECOVERY - Debian mirror in sync with upstream on mirror1001 is OK: /srv/mirrors/debian is over 0 hours old. https://wikitech.wikimedia.org/wiki/Mirrors [12:48:18] 10SRE, 10serviceops, 10Wikimedia-production-error: wtp* hosts: Out of memory (allocated 39845888) (tried to allocate 131072 bytes) in OutputHandler.php - https://phabricator.wikimedia.org/T297517 (10Ladsgroup) FWIW the increase in memory slop is back to wmf.9: https://grafana.wikimedia.org/d/000000607/cluste... [12:50:11] (03CR) 10Jcrespo: [C: 03+2] puppetmaster: Install 'age' on puppetmaster frontends [puppet] - 10https://gerrit.wikimedia.org/r/747170 (https://phabricator.wikimedia.org/T262668) (owner: 10Jcrespo) [12:50:12] !log drain primary/secondary instances off ganeti2024 T296622 [12:50:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:50:17] T296622: Migrate codfw Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296622 [13:00:04] Deploy window Pre MediaWiki train break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211215T1300) [13:02:03] (03CR) 10Jcrespo: [C: 03+2] mediabackup: Add an encryption key to store private files securely [puppet] - 10https://gerrit.wikimedia.org/r/747113 (https://phabricator.wikimedia.org/T262668) (owner: 10Jcrespo) [13:02:11] 10SRE, 10Data-Engineering, 10Infrastructure-Foundations, 10Traffic-Icebox, 10netops: Collect netflow data for internal traffic - https://phabricator.wikimedia.org/T263277 (10ayounsi) >>! In T263277#7570288, @JAllemandou wrote: > Am I right in assuming that this data has the same schema as the original `n... 
[13:11:54] 10SRE, 10Data-Engineering, 10Infrastructure-Foundations, 10Traffic-Icebox, 10netops: Collect netflow data for internal traffic - https://phabricator.wikimedia.org/T263277 (10JAllemandou) No need to detail the fields and schema :) About data augmentation, [[ https://github.com/wikimedia/analytics-refiner... [13:15:45] (03CR) 10Jbond: [C: 03+2] cfssl: fix create chained cert typo [puppet] - 10https://gerrit.wikimedia.org/r/747502 (owner: 10Vgutierrez) [13:18:24] (03PS1) 10Giuseppe Lavagetto: mediawiki: bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/747506 [13:18:48] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mediawiki: bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/747506 (owner: 10Giuseppe Lavagetto) [13:21:31] (03PS1) 10Muehlenhoff: Make ganeti2028 a ganeti node [puppet] - 10https://gerrit.wikimedia.org/r/747507 [13:22:47] (03Merged) 10jenkins-bot: mediawiki: bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/747506 (owner: 10Giuseppe Lavagetto) [13:24:01] !log oblivian@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [13:24:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:47] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Persistence-Backup: (Need By: TBD) rack/setup/install backup2008 - https://phabricator.wikimedia.org/T294973 (10Papaul) [13:28:01] 10SRE, 10Data-Engineering, 10Infrastructure-Foundations, 10Traffic-Icebox, 10netops: Collect netflow data for internal traffic - https://phabricator.wikimedia.org/T263277 (10ayounsi) Cool, only `ip_version` and `region` are useful here. 
[13:28:37] !log uploaded wmfdb 0.1 to apt.wm.o for buster+bullseye T297618 [13:28:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:28:43] T297618: Write replacement for wmfmiaradbpy/mysql.py - https://phabricator.wikimedia.org/T297618 [13:29:51] !log oblivian@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [13:29:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:40] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host sretest1002.eqiad.wmnet [13:30:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:50] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1002.eqiad.wmnet [13:35:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:58] (03CR) 10Jbond: "done a quick first pass and looks good to me. Haven't had a chance to read the api docs yet but lgtm" [software/spicerack] - 10https://gerrit.wikimedia.org/r/747157 (https://phabricator.wikimedia.org/T271583) (owner: 10Volans) [13:53:15] (03PS1) 10Kormat: tox: Support python 3.9 [software/wmfdb] - 10https://gerrit.wikimedia.org/r/747510 [13:53:17] (03PS1) 10Kormat: wmfdb/cli_admin: Support wikireplicas auth. [software/wmfdb] - 10https://gerrit.wikimedia.org/r/747511 (https://phabricator.wikimedia.org/T297618) [13:55:08] (03CR) 10Jbond: spicerack.redfish: add support for Redfish API (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/747157 (https://phabricator.wikimedia.org/T271583) (owner: 10Volans) [14:00:04] hashar and dancy: May I have your attention please! MediaWiki train - Utc-0+Utc-7 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211215T1400) [14:01:34] (03PS2) 10Kormat: wmfdb/cli_admin: Support wikireplicas auth. 
[software/wmfdb] - 10https://gerrit.wikimedia.org/r/747511 (https://phabricator.wikimedia.org/T297618) [14:03:25] are you deploying hashar? [14:03:42] (03CR) 10Jbond: sre.hosts.provision: add new cookbook (035 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/747169 (https://phabricator.wikimedia.org/T271583) (owner: 10Volans) [14:04:08] (03CR) 10Kormat: [C: 03+2] wmfdb/cli_admin: Support wikireplicas auth. [software/wmfdb] - 10https://gerrit.wikimedia.org/r/747511 (https://phabricator.wikimedia.org/T297618) (owner: 10Kormat) [14:05:26] (03Merged) 10jenkins-bot: wmfdb/cli_admin: Support wikireplicas auth. [software/wmfdb] - 10https://gerrit.wikimedia.org/r/747511 (https://phabricator.wikimedia.org/T297618) (owner: 10Kormat) [14:09:45] (03PS1) 10Ladsgroup: blameStartupRegistry: Fix clash in $startupBytes variable name [extensions/WikimediaMaintenance] (wmf/1.38.0-wmf.12) - 10https://gerrit.wikimedia.org/r/747472 (https://phabricator.wikimedia.org/T295413) [14:09:49] (03CR) 10Ladsgroup: [C: 03+2] blameStartupRegistry: Fix clash in $startupBytes variable name [extensions/WikimediaMaintenance] (wmf/1.38.0-wmf.12) - 10https://gerrit.wikimedia.org/r/747472 (https://phabricator.wikimedia.org/T295413) (owner: 10Ladsgroup) [14:10:09] (03PS1) 10BBlack: lvs1016: unconfig lvs, move to insetup [puppet] - 10https://gerrit.wikimedia.org/r/747515 (https://phabricator.wikimedia.org/T295804) [14:13:26] (03Merged) 10jenkins-bot: blameStartupRegistry: Fix clash in $startupBytes variable name [extensions/WikimediaMaintenance] (wmf/1.38.0-wmf.12) - 10https://gerrit.wikimedia.org/r/747472 (https://phabricator.wikimedia.org/T295413) (owner: 10Ladsgroup) [14:15:31] !log ladsgroup@deploy1002 Synchronized php-1.38.0-wmf.12/extensions/WikimediaMaintenance/blameStartupRegistry.php: Backport: [[gerrit:747472|blameStartupRegistry: Fix clash in $startupBytes variable name (T295413)]] (duration: 01m 07s) [14:15:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[14:15:36] T295413: Include module bundle size in RL Graphite stats - https://phabricator.wikimedia.org/T295413 [14:16:47] (03PS1) 10Kormat: Prepare for v0.1.1 release. [software/wmfdb] - 10https://gerrit.wikimedia.org/r/747516 [14:18:25] (03CR) 10Kormat: [C: 03+2] Prepare for v0.1.1 release. [software/wmfdb] - 10https://gerrit.wikimedia.org/r/747516 (owner: 10Kormat) [14:19:46] (03Merged) 10jenkins-bot: Prepare for v0.1.1 release. [software/wmfdb] - 10https://gerrit.wikimedia.org/r/747516 (owner: 10Kormat) [14:19:54] (03PS1) 10JMeybohm: cert-manager: Define resources for all deployments [deployment-charts] - 10https://gerrit.wikimedia.org/r/747517 (https://phabricator.wikimedia.org/T294560) [14:20:34] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [14:20:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:31] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [14:21:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:46] (03CR) 10JMeybohm: [C: 03+2] cert-manager: Define resources for all deployments [deployment-charts] - 10https://gerrit.wikimedia.org/r/747517 (https://phabricator.wikimedia.org/T294560) (owner: 10JMeybohm) [14:23:50] !log uploaded wmfdb 0.1.1 to apt.wm.o for buster+bullseye T297618 [14:23:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:55] T297618: Write replacement for wmfmiaradbpy/mysql.py - https://phabricator.wikimedia.org/T297618 [14:24:16] 10SRE, 10Data-Engineering, 10Infrastructure-Foundations, 10Traffic-Icebox, 10netops: Collect netflow data for internal traffic - https://phabricator.wikimedia.org/T263277 (10Ottomata) Would it hurt to keep the same augmentations? If the schema is the sameish (it sounds like it is), we can just apply the... 
[14:27:00] (03Merged) 10jenkins-bot: cert-manager: Define resources for all deployments [deployment-charts] - 10https://gerrit.wikimedia.org/r/747517 (https://phabricator.wikimedia.org/T294560) (owner: 10JMeybohm) [14:29:56] (03PS1) 10Giuseppe Lavagetto: mwdebug: mark internal IPs as trusted proxies [deployment-charts] - 10https://gerrit.wikimedia.org/r/747518 [14:30:04] (03CR) 10jerkins-bot: [V: 04-1] mwdebug: mark internal IPs as trusted proxies [deployment-charts] - 10https://gerrit.wikimedia.org/r/747518 (owner: 10Giuseppe Lavagetto) [14:30:58] 10SRE, 10Data-Engineering, 10Infrastructure-Foundations, 10Traffic-Icebox, 10netops: Collect netflow data for internal traffic - https://phabricator.wikimedia.org/T263277 (10JAllemandou) >>! In T263277#7572471, @Ottomata wrote: > Would it hurt to keep the same augmentations? If the schema is the sameish... [14:31:44] (03PS2) 10Giuseppe Lavagetto: mwdebug: mark internal IPs as trusted proxies [deployment-charts] - 10https://gerrit.wikimedia.org/r/747518 [14:33:28] (03CR) 10Ottomata: Update the version of log4j2 that is in use (031 comment) [debs/druid] - 10https://gerrit.wikimedia.org/r/747488 (https://phabricator.wikimedia.org/T297468) (owner: 10Btullis) [14:33:48] (03PS1) 10Kormat: wmfdb/cli_admin: Fix ordering of --defaults-group-suffix [software/wmfdb] - 10https://gerrit.wikimedia.org/r/747521 (https://phabricator.wikimedia.org/T297618) [14:34:51] 10SRE, 10Data-Engineering, 10Infrastructure-Foundations, 10Traffic-Icebox, 10netops: Collect netflow data for internal traffic - https://phabricator.wikimedia.org/T263277 (10Ottomata) I'd prefer to avoid scheduling another special job for this if we can. Can we make the NetflowTransform functions smart... 
[14:36:54] (03CR) 10Elukey: Update the version of log4j2 that is in use (031 comment) [debs/druid] - 10https://gerrit.wikimedia.org/r/747488 (https://phabricator.wikimedia.org/T297468) (owner: 10Btullis) [14:37:39] 10SRE, 10Data-Engineering, 10Infrastructure-Foundations, 10Traffic-Icebox, 10netops: Collect netflow data for internal traffic - https://phabricator.wikimedia.org/T263277 (10JAllemandou) I have the opposite view: I'd rather have another job instead of custom logic to prevent doing something :) [14:43:20] (03PS2) 10Kormat: wmfdb/cli_admin: Fix ordering of --defaults-group-suffix [software/wmfdb] - 10https://gerrit.wikimedia.org/r/747521 (https://phabricator.wikimedia.org/T297618) [14:43:44] 10SRE, 10Data-Engineering, 10Infrastructure-Foundations, 10Traffic-Icebox, 10netops: Collect netflow data for internal traffic - https://phabricator.wikimedia.org/T263277 (10Ottomata) Yeahh...but then we have to manage and maintain another custom ingestion job. We're trying to reduce the number of those... [14:44:08] (03CR) 10Ottomata: Update the version of log4j2 that is in use (031 comment) [debs/druid] - 10https://gerrit.wikimedia.org/r/747488 (https://phabricator.wikimedia.org/T297468) (owner: 10Btullis) [14:44:16] !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [14:44:20] !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. [14:44:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:48] !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [14:44:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:29] !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. 
[14:45:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:46:02] !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [14:46:05] !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. [14:46:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:46:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:16] !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . [14:47:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:39] !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality' for release 'main' . [14:47:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:17] !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . [14:48:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:28] spam ended :) [14:56:14] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:56:30] RECOVERY - OSPF status on cr3-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:58:38] (03CR) 10Btullis: [V: 03+2 C: 03+2] Merge branch 'master' into debian [debs/druid] - 10https://gerrit.wikimedia.org/r/747499 (https://phabricator.wikimedia.org/T297468) (owner: 10Btullis) [15:02:04] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mwdebug: mark internal IPs as trusted proxies [deployment-charts] - 10https://gerrit.wikimedia.org/r/747518 (owner: 10Giuseppe Lavagetto) [15:03:17] hashar: Is the deployment train happening today? 
[15:06:03] (03Merged) 10jenkins-bot: mwdebug: mark internal IPs as trusted proxies [deployment-charts] - 10https://gerrit.wikimedia.org/r/747518 (owner: 10Giuseppe Lavagetto) [15:07:46] (03PS1) 10Elukey: custom_deploy.d: set 2 replicas for the ml-serve's Istio egress gw pods [deployment-charts] - 10https://gerrit.wikimedia.org/r/747524 [15:08:12] (03PS3) 10Kormat: wmfdb/cli_admin: Fix ordering of --defaults-group-suffix [software/wmfdb] - 10https://gerrit.wikimedia.org/r/747521 (https://phabricator.wikimedia.org/T297618) [15:08:24] (03PS1) 10Jbond: wmflib::deep_merge: add a deep merge that support arrays [puppet] - 10https://gerrit.wikimedia.org/r/747525 [15:09:16] (03CR) 10jerkins-bot: [V: 04-1] wmflib::deep_merge: add a deep merge that support arrays [puppet] - 10https://gerrit.wikimedia.org/r/747525 (owner: 10Jbond) [15:09:20] (03PS1) 10Muehlenhoff: CAS: Update to 6.4.4.1 [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/747526 [15:11:13] (03CR) 10Kormat: [C: 03+2] wmfdb/cli_admin: Fix ordering of --defaults-group-suffix [software/wmfdb] - 10https://gerrit.wikimedia.org/r/747521 (https://phabricator.wikimedia.org/T297618) (owner: 10Kormat) [15:11:15] !log oblivian@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [15:11:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:33] (03Merged) 10jenkins-bot: wmfdb/cli_admin: Fix ordering of --defaults-group-suffix [software/wmfdb] - 10https://gerrit.wikimedia.org/r/747521 (https://phabricator.wikimedia.org/T297618) (owner: 10Kormat) [15:12:44] (03PS2) 10Jbond: wmflib::deep_merge: add a deep merge that support arrays [puppet] - 10https://gerrit.wikimedia.org/r/747525 [15:13:14] (03PS2) 10Muehlenhoff: CAS: Update to 6.4.4.1 [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/747526 [15:14:20] !log oblivian@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[15:14:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:02] (03CR) 10Elukey: [C: 03+2] custom_deploy.d: set 2 replicas for the ml-serve's Istio egress gw pods [deployment-charts] - 10https://gerrit.wikimedia.org/r/747524 (owner: 10Elukey) [15:19:37] (03PS1) 10Kormat: Prepare for v0.1.2 release. [software/wmfdb] - 10https://gerrit.wikimedia.org/r/747527 [15:20:45] (03PS2) 10BBlack: Define enterprise names for redirects [dns] - 10https://gerrit.wikimedia.org/r/747168 (https://phabricator.wikimedia.org/T296445) [15:22:18] (03CR) 10Kormat: [C: 03+2] Prepare for v0.1.2 release. [software/wmfdb] - 10https://gerrit.wikimedia.org/r/747527 (owner: 10Kormat) [15:23:32] PROBLEM - SSH on kubernetes1004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:23:49] 10SRE, 10SRE-swift-storage, 10Data-Persistence-Backup, 10media-backups, and 3 others: WMF media storage must be adequately backed up - https://phabricator.wikimedia.org/T262668 (10jcrespo) A first pass on eqiad finished successfully: 101,970,844 files backed up successfully, with a total size of 373,335,32... 
[15:25:46] (03PS3) 10Jgiannelos: Enable tegola on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/747486 (https://phabricator.wikimedia.org/T280767) [15:26:54] (03PS5) 10Volans: spicerack.redfish: add support for Redfish API [software/spicerack] - 10https://gerrit.wikimedia.org/r/747157 (https://phabricator.wikimedia.org/T271583) [15:27:03] (03CR) 10Volans: "Addressed comments" [software/spicerack] - 10https://gerrit.wikimedia.org/r/747157 (https://phabricator.wikimedia.org/T271583) (owner: 10Volans) [15:32:18] RECOVERY - BGP status on cr4-ulsfo is OK: BGP OK - up: 99, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:33:44] 10SRE, 10ops-codfw, 10serviceops: Installation issues on PowerEdge R440 Kafka main codfw servers with buster / firmware update needed - https://phabricator.wikimedia.org/T297422 (10elukey) @Papaul Hi! Any chance that we could work on this today/tomorrow? [15:35:46] (03CR) 10MSantos: [C: 03+1] Enable tegola on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/747486 (https://phabricator.wikimedia.org/T280767) (owner: 10Jgiannelos) [15:41:52] (03PS1) 10Kormat: setup.py: Update version number. 
[software/wmfdb] - 10https://gerrit.wikimedia.org/r/747530 [15:49:07] (03PS2) 10Volans: sre.hosts.provision: add new cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/747169 (https://phabricator.wikimedia.org/T271583) [15:51:24] (03CR) 10Muehlenhoff: [C: 03+2] Make ganeti2028 a ganeti node [puppet] - 10https://gerrit.wikimedia.org/r/747507 (owner: 10Muehlenhoff) [15:52:19] (03CR) 10jerkins-bot: [V: 04-1] sre.hosts.provision: add new cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/747169 (https://phabricator.wikimedia.org/T271583) (owner: 10Volans) [15:54:53] (03CR) 10Jbond: [C: 03+1] "LGTM" [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/747526 (owner: 10Muehlenhoff) [15:56:19] (03PS8) 10Arturo Borrero Gonzalez: toolforge: new add_grid_webgrid_generic_node recipe [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/726894 (https://phabricator.wikimedia.org/T292465) (owner: 10David Caro) [15:57:53] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] "I think this is good to be merged." [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/726894 (https://phabricator.wikimedia.org/T292465) (owner: 10David Caro) [15:59:12] (03CR) 10jerkins-bot: [V: 04-1] toolforge: new add_grid_webgrid_generic_node recipe [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/726894 (https://phabricator.wikimedia.org/T292465) (owner: 10David Caro) [15:59:26] (03PS1) 10Papaul: Add backup2008 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/747533 (https://phabricator.wikimedia.org/T294973) [15:59:55] (03PS2) 10Kormat: wmfdb: Export a version number. [software/wmfdb] - 10https://gerrit.wikimedia.org/r/747530 [16:01:02] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] P::toolforge::grid: fix tomcat on buster [puppet] - 10https://gerrit.wikimedia.org/r/747503 (https://phabricator.wikimedia.org/T277653) (owner: 10Majavah) [16:01:12] (03CR) 10jerkins-bot: [V: 04-1] wmfdb: Export a version number. 
[software/wmfdb] - 10https://gerrit.wikimedia.org/r/747530 (owner: 10Kormat) [16:03:41] (03CR) 10Papaul: [C: 03+2] Add backup2008 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/747533 (https://phabricator.wikimedia.org/T294973) (owner: 10Papaul) [16:04:07] (03PS2) 10Papaul: Add backup2008 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/747533 (https://phabricator.wikimedia.org/T294973) [16:04:18] (03CR) 10Papaul: [V: 03+2 C: 03+2] Add backup2008 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/747533 (https://phabricator.wikimedia.org/T294973) (owner: 10Papaul) [16:07:35] (03PS3) 10Kormat: wmfdb: Export a version number. [software/wmfdb] - 10https://gerrit.wikimedia.org/r/747530 [16:11:04] (03CR) 10Kormat: [C: 03+2] wmfdb: Export a version number. [software/wmfdb] - 10https://gerrit.wikimedia.org/r/747530 (owner: 10Kormat) [16:11:06] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host backup2008.codfw.wmnet with OS buster [16:11:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:11:15] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Persistence-Backup, 10Patch-For-Review: (Need By: TBD) rack/setup/install backup2008 - https://phabricator.wikimedia.org/T294973 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host backup2008.codfw.wmnet with OS buster [16:12:20] !log shutdown kafka-main2003 to allow work for DCops (firmware upgrade) [16:12:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:12:24] (03Merged) 10jenkins-bot: wmfdb: Export a version number. 
[software/wmfdb] - 10https://gerrit.wikimedia.org/r/747530 (owner: 10Kormat) [16:13:12] jouncebot: nownandnext [16:13:19] jouncebot: nowandnext [16:13:19] No deployments scheduled for the next 2 hour(s) and 46 minute(s) [16:13:19] In 2 hour(s) and 46 minute(s): UTC evening backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211215T1900) [16:13:19] In 2 hour(s) and 46 minute(s): Train log triage with CPT (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211215T1900) [16:14:19] (03CR) 10Muehlenhoff: [V: 03+2 C: 03+2] CAS: Update to 6.4.4.1 [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/747526 (owner: 10Muehlenhoff) [16:14:34] !log Deployed security patch for T297731 [16:14:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:15:04] --help [16:16:34] logmsgbot: i hear you [16:16:36] nope :D [16:19:37] I didn't send that one, the first one is me [16:20:14] (03PS1) 10Jelto: gitlab_runner: use config template for registering new runners [puppet] - 10https://gerrit.wikimedia.org/r/747539 (https://phabricator.wikimedia.org/T295481) [16:22:20] I assume someone tried `dologmsg --help` somewhere and found out that there is no cli arg processing there :) [16:23:01] hehe [16:23:12] I think I made the same mistake with the Toolforge dologmsg and then added --help support to it [16:23:26] RECOVERY - SSH on kubernetes1004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:23:26] (03PS1) 10Btullis: Add a single node from the aqs_next cluster to the pool [puppet] - 10https://gerrit.wikimedia.org/r/747540 (https://phabricator.wikimedia.org/T297803) [16:23:27] Why did --help not actually log [16:23:42] RhinosF1: it did "[16:15] --help" [16:23:49] (WdqsStreamingUpdaterFlinkProcessingLatencyIsHigh) firing: Processing latency of WDQS_Streaming_Updater in codfw (k8s) is above 5 minutes - 
https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://alerts.wikimedia.org [16:23:50] No to wiki / Sal [16:23:56] looks like dologmsg doesn’t add the !log, it expects that to be included in the command line arguments [16:24:06] ah, because that does not start with `!log` [16:24:13] (03PS2) 10Jelto: gitlab_runner: use config template for registering new runners [puppet] - 10https://gerrit.wikimedia.org/r/747539 (https://phabricator.wikimedia.org/T295481) [16:25:25] yeah, dologmsg in prod is just an nc relay to the logmsgbot input port. It is not hard coded for only sending !log messages (although that is most certainly the common use case) [16:27:17] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2028.codfw.wmnet [16:27:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:28:49] (WdqsStreamingUpdaterFlinkJobUnstable) firing: WDQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://alerts.wikimedia.org [16:30:27] !log mbsantos@deploy1002 Started deploy [kartotherian/deploy@132455b] (codfw): apply overzoom on tegola [16:30:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:31:15] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2028.codfw.wmnet [16:31:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:32:33] !log ladsgroup: Deployed security patch for T297731 [16:32:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:33:49] (WdqsStreamingUpdaterFlinkJobUnstable) resolved: WDQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://alerts.wikimedia.org [16:33:49] (WdqsStreamingUpdaterFlinkProcessingLatencyIsHigh) resolved: Processing latency of WDQS_Streaming_Updater in codfw (k8s) is 
above 5 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://alerts.wikimedia.org [16:34:25] (03CR) 10Elukey: [C: 03+1] "LGTM, I verified that the aqs vip is among aqs1010's loopback addresses." [puppet] - 10https://gerrit.wikimedia.org/r/747540 (https://phabricator.wikimedia.org/T297803) (owner: 10Btullis) [16:34:39] !log mbsantos@deploy1002 Finished deploy [kartotherian/deploy@132455b] (codfw): apply overzoom on tegola (duration: 04m 11s) [16:34:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:36:57] !log mbsantos@deploy1002 Started deploy [kartotherian/deploy@132455b] (eqiad): apply overzoom on tegola [16:37:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:39:30] !log mbsantos@deploy1002 Finished deploy [kartotherian/deploy@132455b] (eqiad): apply overzoom on tegola (duration: 02m 33s) [16:39:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:39:58] nemo-yiannis: ^ overzoom fix deployed to maps eqiad and codfw succesfully [16:41:49] (WdqsStreamingUpdaterFlinkJobUnstable) firing: WDQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://alerts.wikimedia.org [16:41:49] (WdqsStreamingUpdaterFlinkProcessingLatencyIsHigh) firing: Processing latency of WDQS_Streaming_Updater in codfw (k8s) is above 5 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://alerts.wikimedia.org [16:42:28] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 132, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:42:31] !log import wmf-log4j 2.16.0-1 for stretch-wikimedia (stub package to provide log4j jars for the ELK5 cluster) [16:42:35] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:44:54] (03PS1) 10Ayounsi: Rename cr1-codfw:xe-5/1/2 to xe-1/0/1:2 [homer/public] - 10https://gerrit.wikimedia.org/r/747546 (https://phabricator.wikimedia.org/T289241) [16:45:10] (03PS3) 10Jelto: gitlab_runner: use config template for registering new runners [puppet] - 10https://gerrit.wikimedia.org/r/747539 (https://phabricator.wikimedia.org/T295481) [16:48:29] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM Thanks!" [homer/public] - 10https://gerrit.wikimedia.org/r/747546 (https://phabricator.wikimedia.org/T289241) (owner: 10Ayounsi) [16:51:49] (WdqsStreamingUpdaterFlinkJobUnstable) resolved: WDQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://alerts.wikimedia.org [16:51:58] (03CR) 10Cathal Mooney: [C: 03+2] Rename cr1-codfw:xe-5/1/2 to xe-1/0/1:2 [homer/public] - 10https://gerrit.wikimedia.org/r/747546 (https://phabricator.wikimedia.org/T289241) (owner: 10Ayounsi) [16:52:32] (03Merged) 10jenkins-bot: Rename cr1-codfw:xe-5/1/2 to xe-1/0/1:2 [homer/public] - 10https://gerrit.wikimedia.org/r/747546 (https://phabricator.wikimedia.org/T289241) (owner: 10Ayounsi) [16:53:45] (03PS4) 10Jelto: gitlab_runner: use config template for registering new runners [puppet] - 10https://gerrit.wikimedia.org/r/747539 (https://phabricator.wikimedia.org/T295481) [16:55:00] (03CR) 10jerkins-bot: [V: 04-1] gitlab_runner: use config template for registering new runners [puppet] - 10https://gerrit.wikimedia.org/r/747539 (https://phabricator.wikimedia.org/T295481) (owner: 10Jelto) [16:55:57] (03PS5) 10Jelto: gitlab_runner: use config template for registering new runners [puppet] - 10https://gerrit.wikimedia.org/r/747539 (https://phabricator.wikimedia.org/T295481) [16:56:49] (WdqsStreamingUpdaterFlinkProcessingLatencyIsHigh) resolved: Processing latency of WDQS_Streaming_Updater in codfw (k8s) is above 5 minutes - 
https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://alerts.wikimedia.org [16:59:05] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33010/console" [puppet] - 10https://gerrit.wikimedia.org/r/747539 (https://phabricator.wikimedia.org/T295481) (owner: 10Jelto) [16:59:14] (03CR) 10Jbond: "I noticed the following in the spec:" [software/spicerack] - 10https://gerrit.wikimedia.org/r/747157 (https://phabricator.wikimedia.org/T271583) (owner: 10Volans) [16:59:25] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] P::toolforge::grid: fix tomcat on buster [puppet] - 10https://gerrit.wikimedia.org/r/747503 (https://phabricator.wikimedia.org/T277653) (owner: 10Majavah) [17:00:32] papaul: merged a puppet change for you, insetup() for backup20088 [17:05:41] (03PS8) 10Ladsgroup: mariadb: Make centralauth GRANTs conditional to s7 [puppet] - 10https://gerrit.wikimedia.org/r/746826 (https://phabricator.wikimedia.org/T296537) [17:06:48] (03PS2) 10Ryan Kemper: query_service: Collect wdqs and wcqs jmx metrics separately [puppet] - 10https://gerrit.wikimedia.org/r/742566 (https://phabricator.wikimedia.org/T280008) (owner: 10Ebernhardson) [17:09:25] (03PS1) 10Ladsgroup: Set dummy wikiuser and wikiadmin passwords [labs/private] - 10https://gerrit.wikimedia.org/r/747548 (https://phabricator.wikimedia.org/T296537) [17:09:47] (03CR) 10Ladsgroup: mariadb: Make centralauth GRANTs conditional to s7 (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/746826 (https://phabricator.wikimedia.org/T296537) (owner: 10Ladsgroup) [17:10:28] (03PS1) 10Eigyan: wmf-config: Add audience to gdi-survey on cawiki beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/747549 (https://phabricator.wikimedia.org/T297623) [17:10:43] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] Set dummy wikiuser and wikiadmin passwords [labs/private] - 10https://gerrit.wikimedia.org/r/747548 
(https://phabricator.wikimedia.org/T296537) (owner: 10Ladsgroup) [17:12:38] (03CR) 10Ryan Kemper: [C: 03+2] query_service: Collect wdqs and wcqs jmx metrics separately [puppet] - 10https://gerrit.wikimedia.org/r/742566 (https://phabricator.wikimedia.org/T280008) (owner: 10Ebernhardson) [17:14:15] PROBLEM - Kafka MirrorMaker main-eqiad_to_main-codfw@0 on kafka-main2003 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_main-codfw@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [17:15:57] (03PS9) 10Arturo Borrero Gonzalez: toolforge: new add_grid_webgrid_generic_node recipe [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/726894 (https://phabricator.wikimedia.org/T292465) (owner: 10David Caro) [17:16:50] (03PS1) 10Herron: wip [puppet] - 10https://gerrit.wikimedia.org/r/747550 [17:17:30] (03CR) 10Jbond: mariadb: Make centralauth GRANTs conditional to s7 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/746826 (https://phabricator.wikimedia.org/T296537) (owner: 10Ladsgroup) [17:19:24] 10SRE, 10DBA, 10Platform Engineering, 10Sustainability (Incident Followup): Improve slow read query handling - https://phabricator.wikimedia.org/T293530 (10Ladsgroup) [17:19:50] (03CR) 10jerkins-bot: [V: 04-1] toolforge: new add_grid_webgrid_generic_node recipe [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/726894 (https://phabricator.wikimedia.org/T292465) (owner: 10David Caro) [17:20:03] (03PS2) 10JMeybohm: RBAC: Add ClusterRole and ClusterRoleBinding for imagecatalog [deployment-charts] - 10https://gerrit.wikimedia.org/r/745196 (https://phabricator.wikimedia.org/T287130) [17:20:46] (03CR) 10Ladsgroup: mariadb: Make centralauth GRANTs conditional to s7 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/746826 (https://phabricator.wikimedia.org/T296537) (owner: 10Ladsgroup) [17:21:12] (03PS1) 10Ayounsi: Cleanup transport-in filters for 
codfw/eqiad [homer/public] - 10https://gerrit.wikimedia.org/r/747551 [17:21:17] (03CR) 10Jsn.sherman: [C: 03+1] "Looks good to me!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/747549 (https://phabricator.wikimedia.org/T297623) (owner: 10Eigyan) [17:22:15] (03CR) 10Ayounsi: "Diff:" [homer/public] - 10https://gerrit.wikimedia.org/r/747551 (owner: 10Ayounsi) [17:22:22] (03CR) 10BBlack: [C: 03+2] lvs1016: unconfig lvs, move to insetup [puppet] - 10https://gerrit.wikimedia.org/r/747515 (https://phabricator.wikimedia.org/T295804) (owner: 10BBlack) [17:22:27] (03PS2) 10BBlack: lvs1016: unconfig lvs, move to insetup [puppet] - 10https://gerrit.wikimedia.org/r/747515 (https://phabricator.wikimedia.org/T295804) [17:23:11] 10SRE, 10serviceops, 10Kubernetes, 10Patch-For-Review: Migrate to helm v3 - https://phabricator.wikimedia.org/T251305 (10dduvall) The removal of tiller has broken PipelineLib's `deploy` functionality. For example, https://integration.wikimedia.org/ci/job/blubber-pipeline-rehearse/84/console We'll need to... 
[17:24:16] 10SRE, 10ops-codfw, 10serviceops: Installation issues on PowerEdge R440 Kafka main codfw servers with buster / firmware update needed - https://phabricator.wikimedia.org/T297422 (10Papaul) a:03Papaul [17:24:30] (03CR) 10EllenR: [C: 03+1] wmf-config: Add audience to gdi-survey on cawiki beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/747549 (https://phabricator.wikimedia.org/T297623) (owner: 10Eigyan) [17:25:49] !log bblack@cumin1001 START - Cookbook sre.hosts.reimage for host lvs1016.eqiad.wmnet with OS buster [17:25:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:25:57] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic-Icebox, 10Patch-For-Review: Q2:(Need By: ASAP) rack/setup/install lvs10[17-20] - https://phabricator.wikimedia.org/T295804 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bblack@cumin1001 for host lvs1016.eqiad.wmnet with OS buster [17:26:54] (03CR) 10Scardenasmolinar: [C: 03+1] "LGTM!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/747549 (https://phabricator.wikimedia.org/T297623) (owner: 10Eigyan) [17:26:58] (03CR) 10JMeybohm: [C: 03+2] RBAC: Add ClusterRole and ClusterRoleBinding for imagecatalog [deployment-charts] - 10https://gerrit.wikimedia.org/r/745196 (https://phabricator.wikimedia.org/T287130) (owner: 10JMeybohm) [17:30:10] (03Merged) 10jenkins-bot: RBAC: Add ClusterRole and ClusterRoleBinding for imagecatalog [deployment-charts] - 10https://gerrit.wikimedia.org/r/745196 (https://phabricator.wikimedia.org/T287130) (owner: 10JMeybohm) [17:32:08] (03PS4) 10RLazarus: imagecatalog: Install and configure OCI image catalog on deploy hosts [puppet] - 10https://gerrit.wikimedia.org/r/742574 (https://phabricator.wikimedia.org/T287130) [17:32:26] !log bblack@cumin1001 START - Cookbook sre.dns.netbox [17:32:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:33:12] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) 
for host backup2008.codfw.wmnet with OS buster [17:33:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:33:21] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Persistence-Backup, 10Patch-For-Review: (Need By: TBD) rack/setup/install backup2008 - https://phabricator.wikimedia.org/T294973 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host backup2008.codfw.wmnet with OS buster comp... [17:33:31] !log removing grant on letter a on all of s3 hosts (T296537) [17:33:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:33:35] T296537: Check and fix GRANT issues of wikiuser - https://phabricator.wikimedia.org/T296537 [17:33:52] (03CR) 10jerkins-bot: [V: 04-1] imagecatalog: Install and configure OCI image catalog on deploy hosts [puppet] - 10https://gerrit.wikimedia.org/r/742574 (https://phabricator.wikimedia.org/T287130) (owner: 10RLazarus) [17:35:10] (03PS5) 10RLazarus: imagecatalog: Install and configure OCI image catalog on deploy hosts [puppet] - 10https://gerrit.wikimedia.org/r/742574 (https://phabricator.wikimedia.org/T287130) [17:35:46] jouncebot nowandnext [17:35:46] No deployments scheduled for the next 1 hour(s) and 24 minute(s) [17:35:47] In 1 hour(s) and 24 minute(s): UTC evening backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211215T1900) [17:35:47] In 1 hour(s) and 24 minute(s): Train log triage with CPT (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211215T1900) [17:35:49] !log bblack@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:35:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:36:09] I'm going to run a couple of test syncs to collect dat. [17:36:11] *data [17:37:25] !log jayme@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. 
[17:37:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:37:33] !log jayme@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [17:37:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:37:41] PROBLEM - Kafka Broker Under Replicated Partitions on kafka-main2002 is CRITICAL: 269 ge 10 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=codfw+prometheus/ops&var-kafka_cluster=main-codfw&var-kafka_broker=kafka-main2002 [17:37:45] !log dancy@deploy1002 Synchronized README: testing (duration: 01m 06s) [17:37:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:38:16] (03PS6) 10RLazarus: imagecatalog: Install and configure OCI image catalog on deploy hosts [puppet] - 10https://gerrit.wikimedia.org/r/742574 (https://phabricator.wikimedia.org/T287130) [17:38:22] (03CR) 10Herron: "This change is ready for review." 
[puppet] - 10https://gerrit.wikimedia.org/r/747550 (https://phabricator.wikimedia.org/T292603) (owner: 10Herron) [17:38:40] !log dancy@deploy1002 Started scap: testing [17:38:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:38:57] PROBLEM - Kafka Broker Under Replicated Partitions on kafka-main2004 is CRITICAL: 63 ge 10 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=codfw+prometheus/ops&var-kafka_cluster=main-codfw&var-kafka_broker=kafka-main2004 [17:40:14] (03CR) 10Ladsgroup: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/746826 (https://phabricator.wikimedia.org/T296537) (owner: 10Ladsgroup) [17:40:47] !log dancy@deploy1002 Finished scap: testing (duration: 02m 07s) [17:40:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:41:37] Amir1: can you expound on what makes you fear scap sync-world? [17:41:57] dancy: there might be files that have been rebased but not synced [17:42:16] it can happen quite often tbh [17:42:49] !log jayme@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'. [17:42:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:43:01] !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'. [17:43:05] (03PS1) 10Andrew Bogott: Cloudmetrics/statsd: exchange cloudmetrics1003 and 1004 [puppet] - 10https://gerrit.wikimedia.org/r/747554 [17:43:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:43:10] !log jayme@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'. [17:43:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:43:18] !log jayme@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. 
[17:43:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:43:38] !log jayme@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. [17:43:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:43:48] !log jayme@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. [17:43:48] (03PS7) 10RLazarus: imagecatalog: Install and configure OCI image catalog on deploy hosts [puppet] - 10https://gerrit.wikimedia.org/r/742574 (https://phabricator.wikimedia.org/T287130) [17:43:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:43:56] !log jayme@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. [17:43:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:44:08] !log jayme@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. [17:44:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:44:32] !log deployed imagecatalog RBAC rules to all k8s clusters - T287130 [17:44:33] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Persistence-Backup, 10Patch-For-Review: (Need By: TBD) rack/setup/install backup2008 - https://phabricator.wikimedia.org/T294973 (10Papaul) @jcrespo this is complete ` Disk /dev/sda: 446.6 GiB, 479559942144 bytes, 936640512 sectors Disk /dev/sdb: 446.6 GiB, 47955994... 
[17:44:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:44:37] T287130: Container image lifecycle management - https://phabricator.wikimedia.org/T287130 [17:44:42] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Persistence-Backup, 10Patch-For-Review: (Need By: TBD) rack/setup/install backup2008 - https://phabricator.wikimedia.org/T294973 (10Papaul) [17:44:59] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Persistence-Backup, 10Patch-For-Review: (Need By: TBD) rack/setup/install backup2008 - https://phabricator.wikimedia.org/T294973 (10Papaul) 05Open→03Resolved [17:45:03] (03PS1) 10Ladsgroup: Revert "Set dummy wikiuser and wikiadmin passwords" [labs/private] - 10https://gerrit.wikimedia.org/r/747479 [17:45:13] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] Revert "Set dummy wikiuser and wikiadmin passwords" [labs/private] - 10https://gerrit.wikimedia.org/r/747479 (owner: 10Ladsgroup) [17:45:31] (03PS2) 10Andrew Bogott: Cloudmetrics/statsd: exchange cloudmetrics1003 and 1004 [puppet] - 10https://gerrit.wikimedia.org/r/747554 (https://phabricator.wikimedia.org/T297814) [17:45:33] (03PS1) 10Andrew Bogott: make cloudmetrics1004 the primary cloudmetrics endpoint [dns] - 10https://gerrit.wikimedia.org/r/747555 (https://phabricator.wikimedia.org/T297814) [17:45:44] (03Abandoned) 10JMeybohm: Use dedicated imagecatalog kubernetes user [puppet] - 10https://gerrit.wikimedia.org/r/745202 (https://phabricator.wikimedia.org/T287130) (owner: 10JMeybohm) [17:45:44] RECOVERY - Kafka Broker Under Replicated Partitions on kafka-main2004 is OK: (C)10 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=codfw+prometheus/ops&var-kafka_cluster=main-codfw&var-kafka_broker=kafka-main2004 [17:46:00] 10SRE, 10ops-codfw, 10serviceops: Installation issues on PowerEdge R440 Kafka main codfw servers with buster / firmware update needed - https://phabricator.wikimedia.org/T297422 
(10Papaul) 05Open→03Resolved This is complete [17:46:00] RECOVERY - Kafka Broker Under Replicated Partitions on kafka-main2002 is OK: (C)10 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=codfw+prometheus/ops&var-kafka_cluster=main-codfw&var-kafka_broker=kafka-main2002 [17:47:02] !log kafka-main2003 up and running (dcops maintenance done) [17:47:04] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33013/console" [puppet] - 10https://gerrit.wikimedia.org/r/742574 (https://phabricator.wikimedia.org/T287130) (owner: 10RLazarus) [17:47:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:47:12] 10SRE, 10Service-deployment-requests: New Service Request memcached-wikifunctions - https://phabricator.wikimedia.org/T297815 (10Jdforrester-WMF) [17:47:52] (03CR) 10Andrew Bogott: [C: 03+2] Cloudmetrics/statsd: exchange cloudmetrics1003 and 1004 [puppet] - 10https://gerrit.wikimedia.org/r/747554 (https://phabricator.wikimedia.org/T297814) (owner: 10Andrew Bogott) [17:48:36] !log dancy@deploy1002 Started scap: testing [17:48:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:48:42] RECOVERY - ensure kvm processes are running on cloudvirt-wdqs1001 is OK: PROCS OK: 1 process with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [17:48:44] (03CR) 10Andrew Bogott: [C: 03+2] make cloudmetrics1004 the primary cloudmetrics endpoint [dns] - 10https://gerrit.wikimedia.org/r/747555 (https://phabricator.wikimedia.org/T297814) (owner: 10Andrew Bogott) [17:48:45] 10SRE, 10Abstract Wikipedia team (Phase λ – Launch), 10Service-deployment-requests: New Service Request: function-orchestrator and function-evaluator (for Wikifunctions launch) - https://phabricator.wikimedia.org/T297314 
(10Jdforrester-WMF) [17:48:46] !log bblack@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host lvs1016.eqiad.wmnet with OS buster [17:48:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:48:54] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic-Icebox, 10Patch-For-Review: Q2:(Need By: ASAP) rack/setup/install lvs10[17-20] - https://phabricator.wikimedia.org/T295804 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bblack@cumin1001 for host lvs1016.eqiad.wmnet with OS buster completed:... [17:50:24] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:50:37] !log dancy@deploy1002 Finished scap: testing (duration: 02m 01s) [17:50:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:51:04] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 45, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:51:46] RECOVERY - Kafka MirrorMaker main-eqiad_to_main-codfw@0 on kafka-main2003 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_main-codfw@0/producer\.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker [17:54:51] (03CR) 10JMeybohm: [V: 03+1] imagecatalog: Install and configure OCI image catalog on deploy hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/742574 (https://phabricator.wikimedia.org/T287130) (owner: 10RLazarus) [17:58:53] (03PS10) 10Arturo Borrero Gonzalez: toolforge: new add_grid_webgrid_generic_node recipe [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/726894 (https://phabricator.wikimedia.org/T292465) (owner: 10David Caro) [17:59:18] (03CR) 10JMeybohm: [V: 03+1 C: 03+1] 
"Apart from the comment nit this looks good to me. The k8s part (users and RBAC) has been deployed to all clusters so I think you're good t" [puppet] - 10https://gerrit.wikimedia.org/r/742574 (https://phabricator.wikimedia.org/T287130) (owner: 10RLazarus) [18:02:42] (03CR) 10jerkins-bot: [V: 04-1] toolforge: new add_grid_webgrid_generic_node recipe [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/726894 (https://phabricator.wikimedia.org/T292465) (owner: 10David Caro) [18:05:28] 10SRE, 10Data-Engineering, 10Infrastructure-Foundations, 10Traffic-Icebox, 10netops: Collect netflow data for internal traffic - https://phabricator.wikimedia.org/T263277 (10JAllemandou) >>! In T263277#7572522, @Ottomata wrote: > The custom logic could even just be varied on the hardcoded stream / tablen... [18:17:55] (03PS8) 10RLazarus: imagecatalog: Install and configure OCI image catalog on deploy hosts [puppet] - 10https://gerrit.wikimedia.org/r/742574 (https://phabricator.wikimedia.org/T287130) [18:19:20] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [18:19:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:24:49] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:24:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:25:02] 10SRE, 10ops-eqiad, 10decommission-hardware, 10cloud-services-team (Kanban): decommission cloudvirt10[2,3,4].eqiad.wmnet - https://phabricator.wikimedia.org/T296792 (10Cmjohnson) [18:25:11] 10SRE, 10ops-eqiad, 10decommission-hardware, 10cloud-services-team (Kanban): decommission cloudvirt10[2,3,4].eqiad.wmnet - https://phabricator.wikimedia.org/T296792 (10Cmjohnson) 05Open→03Resolved [18:26:25] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host an-test-coord1002.eqiad.wmnet with OS buster [18:26:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:26:32] 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need 
By: TBD) rack/setup/install an-test-coord1002 - https://phabricator.wikimedia.org/T293938 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-test-coord1002.eqiad.wmnet with OS buster [18:29:03] (03CR) 10Ladsgroup: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/746826 (https://phabricator.wikimedia.org/T296537) (owner: 10Ladsgroup) [18:29:45] (03CR) 10RLazarus: [C: 03+2] imagecatalog: Install and configure OCI image catalog on deploy hosts [puppet] - 10https://gerrit.wikimedia.org/r/742574 (https://phabricator.wikimedia.org/T287130) (owner: 10RLazarus) [18:30:47] (03CR) 10RLazarus: [C: 03+2] imagecatalog: Install and configure OCI image catalog on deploy hosts (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/742574 (https://phabricator.wikimedia.org/T287130) (owner: 10RLazarus) [18:31:44] 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: TBD) rack/setup/install an-test-coord1002 - https://phabricator.wikimedia.org/T293938 (10Cmjohnson) This is the error I am getting, I verified there are disks in the server. I also checked BIOS and it's set to auto but I do see the disks. I am n... 
[18:33:43] (03PS1) 10Jbond: P:mariada::core: update to use profile instead of core module [puppet] - 10https://gerrit.wikimedia.org/r/747565 [18:33:49] (03PS1) 10RLazarus: imagecatalog: Fix outdated TODO comment [puppet] - 10https://gerrit.wikimedia.org/r/747566 (https://phabricator.wikimedia.org/T287130) [18:34:46] (03CR) 10jerkins-bot: [V: 04-1] P:mariada::core: update to use profile instead of core module [puppet] - 10https://gerrit.wikimedia.org/r/747565 (owner: 10Jbond) [18:36:19] (03CR) 10RLazarus: [C: 03+2] imagecatalog: Fix outdated TODO comment [puppet] - 10https://gerrit.wikimedia.org/r/747566 (https://phabricator.wikimedia.org/T287130) (owner: 10RLazarus) [18:36:26] (03PS2) 10Jbond: P:mariada::core: update to use profile instead of core module [puppet] - 10https://gerrit.wikimedia.org/r/747565 [18:41:05] (03CR) 10Jbond: mariadb: Make centralauth GRANTs conditional to s7 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/746826 (https://phabricator.wikimedia.org/T296537) (owner: 10Ladsgroup) [18:45:04] (03PS3) 10Jbond: P:mariada::core: update to use profile instead of core module [puppet] - 10https://gerrit.wikimedia.org/r/747565 [18:46:28] (03CR) 10Majavah: [C: 04-1] wmf-config: Add audience to gdi-survey on cawiki beta (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/747549 (https://phabricator.wikimedia.org/T297623) (owner: 10Eigyan) [18:48:43] (03PS4) 10Jbond: P:mariada::core: update to use profile instead of core module [puppet] - 10https://gerrit.wikimedia.org/r/747565 [18:49:44] jbond: typo --^ :) [18:49:56] (mariada) [18:50:42] (03PS5) 10Jbond: P:mariada::core: update to use profile instead of core module [puppet] - 10https://gerrit.wikimedia.org/r/747565 [18:51:32] (03CR) 10jerkins-bot: [V: 04-1] P:mariada::core: update to use profile instead of core module [puppet] - 10https://gerrit.wikimedia.org/r/747565 (owner: 10Jbond) [18:51:44] !log cmjohnson@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for 
host an-test-coord1002.eqiad.wmnet with OS buster [18:51:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:51:51] 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: TBD) rack/setup/install an-test-coord1002 - https://phabricator.wikimedia.org/T293938 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-test-coord1002.eqiad.wmnet with OS buster executed with... [18:52:16] (03PS6) 10Jbond: P:mariada::core: update to use profile instead of core module [puppet] - 10https://gerrit.wikimedia.org/r/747565 [18:53:21] (03CR) 10Elukey: P:mariada::core: update to use profile instead of core module (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/747565 (owner: 10Jbond) [18:53:33] 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: TBD) rack/setup/install an-test-coord1002 - https://phabricator.wikimedia.org/T293938 (10Cmjohnson) @papaul or @robh could you look at this and let me know what I am missing. [18:54:39] (03PS7) 10Jbond: P:mariada::core: update to use profile instead of core module [puppet] - 10https://gerrit.wikimedia.org/r/747565 [18:59:06] (03PS8) 10Jbond: P:mariada::core: update to use profile instead of core module [puppet] - 10https://gerrit.wikimedia.org/r/747565 [19:00:04] RoanKattouw and Urbanecm: (Dis)respected human, time to deploy UTC evening backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211215T1900). Please do the needful. [19:00:04] nemo-yiannis and eigyan: A patch you scheduled for UTC evening backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [19:00:04] hashar and dancy: #bothumor My software never has bugs. It just develops random features. Rise for Train log triage with CPT. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211215T1900). 
[19:01:47] hello I am here [19:01:58] hey [19:03:02] (03PS1) 10Legoktm: mediawiki: Enable php-yaml on jobrunners, parsoid, and maintenance servers [puppet] - 10https://gerrit.wikimedia.org/r/747570 (https://phabricator.wikimedia.org/T296331) [19:03:17] (03PS2) 10Legoktm: docker_registry_ha: Set log level to debug [puppet] - 10https://gerrit.wikimedia.org/r/747216 [19:03:19] (03PS2) 10Legoktm: mediawiki: Enable php-yaml on jobrunners, parsoid, and maintenance servers [puppet] - 10https://gerrit.wikimedia.org/r/747570 (https://phabricator.wikimedia.org/T296331) [19:03:37] (03PS3) 10Legoktm: mediawiki: Enable php-yaml on jobrunners, parsoid, and maintenance servers [puppet] - 10https://gerrit.wikimedia.org/r/747570 (https://phabricator.wikimedia.org/T296331) [19:03:59] (03Abandoned) 10Legoktm: docker_registry_ha: Set log level to debug [puppet] - 10https://gerrit.wikimedia.org/r/747216 (owner: 10Legoktm) [19:04:28] dancy: thcipriani: I can't find the event for the train log triage with cpt :/ [19:04:39] It's tomorrow on my calendar. [19:05:04] ahhh so the deployments page is off by one day nice [19:05:39] (03PS9) 10Jbond: P:mariada::core: update to use profile instead of core module [puppet] - 10https://gerrit.wikimedia.org/r/747565 [19:05:53] eigyan: hello! [19:05:58] dancy: thank you [19:06:20] I will run the backport window since I haven't done that in ages [19:07:07] Greetings hashar [19:07:19] enjoy!
I use https://deploy-commands.toolforge.org/ [19:08:13] (03PS10) 10Jbond: P:mariada::core: update to use profile instead of core module [puppet] - 10https://gerrit.wikimedia.org/r/747565 [19:09:12] eigyan: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/747549/comment/321b1ac5_e608d5df/ :( [19:09:26] I am willing to let it through as to not loose all the cr+1 already given [19:10:39] (03CR) 10Jbond: P:mariada::core: update to use profile instead of core module (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/747565 (owner: 10Jbond) [19:10:59] (03PS2) 10Hashar: wmf-config: Add audience to gdi-survey on cawiki beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/747549 (https://phabricator.wikimedia.org/T297623) (owner: 10Eigyan) [19:11:10] (03CR) 10Hashar: wmf-config: Add audience to gdi-survey on cawiki beta (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/747549 (https://phabricator.wikimedia.org/T297623) (owner: 10Eigyan) [19:11:13] (03CR) 10Eigyan: wmf-config: Add audience to gdi-survey on cawiki beta (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/747549 (https://phabricator.wikimedia.org/T297623) (owner: 10Eigyan) [19:11:29] eigyan: so that change will eventually land on the beta cluster and is not going to affect anything on production [19:11:33] hashar: I'm surprised Jenkins let it through with a tabs + spaces conflict [19:11:44] me too [19:11:48] typically they can be deployed at anytime if you manage to catch someone having the +2 right [19:11:58] yeah space should be found surely [19:12:06] but I guess it is a bug in phpcodesniffer maybe :\ [19:13:18] one can file a bug about it against #mediawiki-codesniffer in phabricator [19:13:57] and state that "\t\t // comment" has an extra space which is not caught ( or point to https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/747549/comment/321b1ac5_e608d5df/ [19:15:10] (03CR) 10Hashar: [C: 03+2] wmf-config: Add audience to 
gdi-survey on cawiki beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/747549 (https://phabricator.wikimedia.org/T297623) (owner: 10Eigyan) [19:16:58] (03Merged) 10jenkins-bot: wmf-config: Add audience to gdi-survey on cawiki beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/747549 (https://phabricator.wikimedia.org/T297623) (owner: 10Eigyan) [19:17:12] I feel rusty [19:17:25] thank you hashar will this block my change [19:17:55] My goal for next quarter: `scap backport 747549` [19:18:13] oh I see it merged -> thank you for your assistance hashar [19:18:17] maybe a prototype this quarter, but not too many work days left. [19:19:12] eigyan: that will be deployed to the beta cluster automatically by Jenkins in a few minutes [19:19:36] (03CR) 10Hashar: [C: 03+2] "Lets fly!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/747486 (https://phabricator.wikimedia.org/T280767) (owner: 10Jgiannelos) [19:20:14] Oh, you're doing it all the way from master. That's a process. :-) [19:20:34] You still got it, hashar! [19:20:44] (03Merged) 10jenkins-bot: Enable tegola on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/747486 (https://phabricator.wikimedia.org/T280767) (owner: 10Jgiannelos) [19:21:43] I am in https://meet.google.com/xcm-ziut-ekz with Yiannis :) [19:21:50] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[19:21:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:22:09] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on ganeti2024.codfw.wmnet with reason: Temporarily remove node from Ganeti for reimage [19:22:12] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on ganeti2024.codfw.wmnet with reason: Temporarily remove node from Ganeti for reimage [19:22:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:22:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:22:55] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [19:22:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:23:15] ^^ those are all automatic right? [19:23:20] yes [19:23:23] to catch up with manually entered scap commands [19:23:26] rad [19:23:40] next step curl deployments && helm [19:23:41] :D [19:25:28] !log hashar@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Enable tegola on enwiki T2980767 (duration: 01m 06s) [19:25:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:25:32] Stepping out to pick up a child. [19:25:44] change is live! [19:27:36] so great to see Maps being taken care of :) [19:27:39] cc mbsantos [19:27:47] !log UTC evening backport window completed [19:27:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:28:05] yay! thanks hashar and nemo-yiannis [19:28:12] eigyan: hopefully your change is live on beta cluster now [19:28:31] mbsantos: you are welcome and thank you for Maps ! :D [19:29:06] * hashar checks logs [19:29:07] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[19:29:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:30:17] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [19:30:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:33:08] (03CR) 10Volans: "Thanks for all the comments, addressed." [software/spicerack] - 10https://gerrit.wikimedia.org/r/747157 (https://phabricator.wikimedia.org/T271583) (owner: 10Volans) [19:33:10] (03PS6) 10Volans: spicerack.redfish: add support for Redfish API [software/spicerack] - 10https://gerrit.wikimedia.org/r/747157 (https://phabricator.wikimedia.org/T271583) [19:38:44] (03CR) 10Dzahn: "Hello Lucas, Adam, Michael. I was wondering how annoying on a scale of 1 to 10 it would be for you if after a merge in the WDQS GUI repos " [puppet] - 10https://gerrit.wikimedia.org/r/745634 (https://phabricator.wikimedia.org/T218900) (owner: 10Dzahn) [19:40:32] PROBLEM - logstash JSON linesTCP port on logstash2004 is CRITICAL: connect to address 127.0.0.1 and port 11514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [19:45:34] !log pt1979@cumin1001 START - Cookbook sre.hosts.reimage for host an-test-coord1002.eqiad.wmnet with OS buster [19:45:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:45:43] 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: TBD) rack/setup/install an-test-coord1002 - https://phabricator.wikimedia.org/T293938 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1001 for host an-test-coord1002.eqiad.wmnet with OS buster [19:47:12] RECOVERY - logstash JSON linesTCP port on logstash2004 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11514 https://wikitech.wikimedia.org/wiki/Logstash [19:51:59] (03PS1) 10Ladsgroup: miscweb: Set up static_tendril microsite [puppet] - 10https://gerrit.wikimedia.org/r/747600 
(https://phabricator.wikimedia.org/T297605) [19:52:38] !log pt1979@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host an-test-coord1002.eqiad.wmnet with OS buster [19:52:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:52:46] 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: TBD) rack/setup/install an-test-coord1002 - https://phabricator.wikimedia.org/T293938 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1001 for host an-test-coord1002.eqiad.wmnet with OS buster executed with err... [19:52:56] (03CR) 10jerkins-bot: [V: 04-1] miscweb: Set up static_tendril microsite [puppet] - 10https://gerrit.wikimedia.org/r/747600 (https://phabricator.wikimedia.org/T297605) (owner: 10Ladsgroup) [19:52:58] 10SRE, 10API Platform, 10Desktop Improvements, 10MediaWiki-REST-API, and 10 others: CVE-2021-44854: Rest API incorrectly publicly caches results from private wikis - https://phabricator.wikimedia.org/T292763 (10Reedy) [19:53:23] jouncebot: now [19:53:23] For the next 0 hour(s) and 6 minute(s): UTC evening backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211215T1900) [19:53:23] For the next 0 hour(s) and 6 minute(s): Train log triage with CPT (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211215T1900) [19:53:29] :D [19:54:05] (03PS2) 10Ladsgroup: miscweb: Set up static_tendril microsite [puppet] - 10https://gerrit.wikimedia.org/r/747600 (https://phabricator.wikimedia.org/T297605) [19:58:41] 10SRE, 10Application Security Reviews, 10Security Awareness, 10Security-Team, and 3 others: Wikipedia Birthday 2022 - https://phabricator.wikimedia.org/T297816 (10Dzahn) [20:00:04] hashar and dancy: May I have your attention please! MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot). 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211215T2000) [20:00:45] (03PS1) 10Hashar: group1 wikis to 1.38.0-wmf.13 refs T293954 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/747601 [20:00:47] (03CR) 10Hashar: [C: 03+2] group1 wikis to 1.38.0-wmf.13 refs T293954 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/747601 (owner: 10Hashar) [20:00:59] * hashar whistles a little tune [20:01:22] (03PS1) 10Ladsgroup: trafficserver: Point dbtree.wm.o to miscweb instead of dbmonitor [puppet] - 10https://gerrit.wikimedia.org/r/747602 (https://phabricator.wikimedia.org/T297605) [20:01:47] (03Merged) 10jenkins-bot: group1 wikis to 1.38.0-wmf.13 refs T293954 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/747601 (owner: 10Hashar) [20:03:16] !log hashar@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.38.0-wmf.13 refs T293954 [20:03:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:03:23] T293954: 1.38.0-wmf.13 deployment blockers - https://phabricator.wikimedia.org/T293954 [20:04:22] !log hashar@deploy1002 Synchronized php: group1 wikis to 1.38.0-wmf.13 refs T293954 (duration: 01m 05s) [20:04:22] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [20:04:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:04:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:05:14] hmm [20:05:33] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[20:05:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:06:09] at this point we can really just click a `[deploy]` button [20:07:07] so we are live on group1 [20:07:32] 👍🏾 [20:08:10] [e26c0a58-bc74-4314-8ec9-5bc87cb3bc63] /wiki/Fitxer:Passeig_Bertrand_27_P1370597.JPG Error: Call to a member function getId() on null [20:08:27] search=the&title=Special:MediaSearch PHP Notice: Array to string conversion [20:08:45] so hmm [20:09:55] (03PS6) 10Jbond: O:puppetmaster: Add age::store to puppetmasters [puppet] - 10https://gerrit.wikimedia.org/r/747194 [20:10:06] I somehow missed those this afternoon [20:10:59] (03PS1) 10Jdlrobson: Remove migration script [core] (wmf/1.38.0-wmf.13) - 10https://gerrit.wikimedia.org/r/747483 (https://phabricator.wikimedia.org/T297484) [20:11:40] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [20:11:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:11:56] (03CR) 10Ladsgroup: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/747565 (owner: 10Jbond) [20:11:58] Hey hashar [20:12:01] you are running the train right? [20:12:32] yes! [20:12:42] landed on group 1 wikis; there are a few errors here and there, I am looking at them [20:12:46] trying to gauge the impact [20:12:49] 10SRE, 10Application Security Reviews, 10Security Awareness, 10Security-Team, and 3 others: Wikipedia Birthday 2022 - https://phabricator.wikimedia.org/T297816 (10MSRamos) Hi. I'm already in contact with the security team and have already received a Privacy Engineering review because I was told that this w... [20:12:51] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [20:12:51] It's probably fine, but as a precaution we'd like to backport https://gerrit.wikimedia.org/r/c/mediawiki/core/+/747483 to wmf13 ASAP.
[20:12:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:12:55] I haven't looked at the javascript / client side ones yet [20:13:46] The short of it is that for some logged in users, we are establishing a primary DB connection for GET requests. [20:13:54] Jdlrobson: let's go for it [20:14:12] I mean I can +2 it right now and sync whenever it is merged [20:14:26] 10SRE, 10Application Security Reviews, 10Security Awareness, 10Security-Team, and 3 others: Wikipedia Birthday 2022 - https://phabricator.wikimedia.org/T297816 (10RhinosF1) > So, no, the assessment made before, when this tasked was closed, that this wasn't submitted to any team is incorrect How is anyone o... [20:14:34] I dunno if there's any DBPerf logs showing up, but given the last few trains have been risky, seems like we should err on the side of caution. [20:14:42] sure thing [20:14:51] Cool [20:15:28] I am pretty sure we have DB related perf metrics somewhere [20:15:38] (03CR) 10Hashar: [C: 03+2] Remove migration script [core] (wmf/1.38.0-wmf.13) - 10https://gerrit.wikimedia.org/r/747483 (https://phabricator.wikimedia.org/T297484) (owner: 10Jdlrobson) [20:18:12] 10SRE, 10Application Security Reviews, 10Security Awareness, 10Security-Team, and 3 others: Wikipedia Birthday 2022 - https://phabricator.wikimedia.org/T297816 (10MSRamos) I was under the impression that the Application Security team had received a heads-up regarding the project. As for tags, this being my... [20:22:43] hashar bbiab. [20:24:42] PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [400.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [20:24:58] (03CR) 10Dzahn: "I like to see this!
thank you, will review this afternoon PST :)" [puppet] - 10https://gerrit.wikimedia.org/r/747600 (https://phabricator.wikimedia.org/T297605) (owner: 10Ladsgroup) [20:25:08] * hashar messes up with logstash [20:27:09] 10SRE, 10Application Security Reviews, 10Security-Team, 10Traffic, and 2 others: Wikipedia Birthday 2022 - https://phabricator.wikimedia.org/T297816 (10sbassett) Hey @MSRamos - Yes, the #security-team did get a heads up about this. Typically we follow [[ https://www.mediawiki.org/wiki/Security/SOP/Applic... [20:27:50] (03PS4) 10Ebernhardson: sre.wdqs: Integrate wcqs with wdqs cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/745629 (https://phabricator.wikimedia.org/T293638) [20:27:52] (03CR) 10Ebernhardson: sre.wdqs: Integrate wcqs with wdqs cookbooks (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/745629 (https://phabricator.wikimedia.org/T293638) (owner: 10Ebernhardson) [20:28:17] tgr_: growthexperiments might have a breakage ( https://phabricator.wikimedia.org/T297827 ), I am not sure who else to ping :) [20:28:32] 10SRE, 10Application Security Reviews, 10Security-Team, 10Traffic, and 2 others: Application Security Review Request : Wikipedia Birthday 2022 - https://phabricator.wikimedia.org/T297816 (10sbassett) [20:30:29] hashar: what's up with growthexperiments? [20:30:39] (we're all in a meeting rn, ftr) [20:31:03] there is some stacktrace on cawiki that originates from growtheperiments https://phabricator.wikimedia.org/T297827 [20:31:10] it mean some feature is broken [20:31:13] let me have a quick look [20:31:20] (03PS9) 10Ladsgroup: mariadb: Make centralauth GRANTs conditional to s7 [puppet] - 10https://gerrit.wikimedia.org/r/746826 (https://phabricator.wikimedia.org/T296537) [20:31:36] looks like an UBN in my own code :/ [20:31:51] (03Abandoned) 10Ladsgroup: P:mariada::core: update to use profile instead of core module [puppet] - 10https://gerrit.wikimedia.org/r/747565 (owner: 10Jbond) [20:31:54] ah great! 
[20:31:58] I mean [20:32:18] finding the author or someone knowing about the code is like 80% of the work done from my perspective :] [20:32:30] hehe [20:32:50] 10SRE, 10Application Security Reviews, 10Security-Team, 10Traffic, and 2 others: Application Security Review Request : Wikipedia Birthday 2022 - https://phabricator.wikimedia.org/T297816 (10Dzahn) Hi @MSRamos, so there really is more to this than just the security review. One thing for example is the fol... [20:35:19] 10SRE, 10Application Security Reviews, 10Security-Team, 10Traffic, and 2 others: Application Security Review Request : Wikipedia Birthday 2022 - https://phabricator.wikimedia.org/T297816 (10sbassett) >>! In T297816#7573659, @Dzahn wrote: > wikimediafoundation.org is not running on WIkimedia's servers. That... [20:36:26] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:36:52] hashar: ehm....cawiki is FULLY down [20:36:54] can you rollback train? [20:37:04] (03CR) 10Ladsgroup: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/746826 (https://phabricator.wikimedia.org/T296537) (owner: 10Ladsgroup) [20:37:06] (03Merged) 10jenkins-bot: Remove migration script [core] (wmf/1.38.0-wmf.13) - 10https://gerrit.wikimedia.org/r/747483 (https://phabricator.wikimedia.org/T297484) (owner: 10Jdlrobson) [20:37:26] (I'd normally revert the patch, but it's in the bottom of the dependency tree, so not exactly easy w/o investigating) [20:38:12] cawiki is not down for me [20:38:24] it is down for my staff acc [20:38:25] weird [20:38:32] https://usercontent.irccloud-cdn.com/file/XfRIxLjn/image.png [20:38:43] works for me [20:38:47] maybe specific permission? [20:38:49] hashar: back [20:38:59] urbanecm: paste req id ? [20:39:05] so I don't have to type it out of the screenshot... 
[20:39:14] 3e66e1d5-05e0-4d90-8887-88a8cf3d3e95 [20:39:26] I'll file a task using phatality [20:39:27] it's very likely related to growth features enabled [20:39:32] it's https://phabricator.wikimedia.org/T297827 dancy [20:39:38] ah good [20:39:47] hashar: I'll do the rollback [20:40:05] it breaks for any newcomer trying to register [20:40:05] thanks [20:40:09] Call to a member function getId() on null [20:40:20] (03CR) 10jerkins-bot: [V: 04-1] sre.wdqs: Integrate wcqs with wdqs cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/745629 (https://phabricator.wikimedia.org/T293638) (owner: 10Ebernhardson) [20:40:26] ok yeah, same trace as the task [20:40:51] rollback all wikis? [20:41:05] I think just group1 [20:41:06] group0 _should_ be ok [20:41:07] 10SRE, 10Application Security Reviews, 10Security-Team, 10Traffic, and 2 others: Application Security Review Request : Wikipedia Birthday 2022 - https://phabricator.wikimedia.org/T297816 (10Dzahn) Ok, thank you @sbassett. (This is a good example why this kind of thing needs clarification and planning ear... [20:41:14] Growth is at Wikipedias only and frwiktionary [20:41:16] not sure any group0 has growthexperiments or that mentorship program enabled [20:41:31] at least the issue was not happening from group0 wikis [20:41:49] hashar: I didn't make any changes yet.. 
turning back over to you [20:42:53] I was looking at the log trying to file a few [20:42:55] rolling back now [20:43:26] 10SRE, 10Application Security Reviews, 10Security-Team, 10Traffic, and 2 others: Application Security Review Request : Wikipedia Birthday 2022 - https://phabricator.wikimedia.org/T297816 (10sbassett) [20:43:32] PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [400.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [20:43:49] 10SRE, 10Application Security Reviews, 10Security-Team, 10Traffic, and 2 others: Application Security Review Request : Wikipedia Birthday 2022 - https://phabricator.wikimedia.org/T297816 (10sbassett) [20:44:15] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [20:44:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:44:23] (03PS1) 10Hashar: Revert "group1 wikis to 1.38.0-wmf.13 refs T293954" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/747606 (https://phabricator.wikimedia.org/T293954) [20:44:30] 10SRE, 10Application Security Reviews, 10Security-Team, 10Traffic, and 2 others: Application Security Review Request : Wikipedia Birthday 2022 - https://phabricator.wikimedia.org/T297816 (10MSRamos) [20:45:12] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[20:45:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:45:19] (03CR) 10Hashar: [C: 03+2] "I have done it directly on the deployment server" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/747606 (https://phabricator.wikimedia.org/T293954) (owner: 10Hashar) [20:46:11] !log hashar@deploy1002 rebuilt and synchronized wikiversions files: Revert "group1 wikis to 1.38.0-wmf.13 refs T293954 [20:46:14] (03Merged) 10jenkins-bot: Revert "group1 wikis to 1.38.0-wmf.13 refs T293954" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/747606 (https://phabricator.wikimedia.org/T293954) (owner: 10Hashar) [20:46:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:46:17] T293954: 1.38.0-wmf.13 deployment blockers - https://phabricator.wikimedia.org/T293954 [20:46:46] rolled back [20:46:54] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:47:32] thanks hashar [20:47:45] i realized a certain flaw in the way how Growth handles config [20:47:47] we...put it on wiki [20:48:04] (which is not easy to edit if i can't access the wiki, heh :D) [20:49:40] Jdlrobson: I will sync your change even if we have rolled back wmf.13 [20:50:19] 10SRE, 10Application Security Reviews, 10Security-Team, 10Traffic, and 2 others: Application Security Review Request : Wikipedia Birthday 2022 - https://phabricator.wikimedia.org/T297816 (10MSRamos) @sbassett @Dzahn The landing page will live on wikimediafoundation.org, like our 20th Birthday Landing Page... 
[20:50:26] sounds good thanks [20:51:20] 10SRE, 10Application Security Reviews, 10Security-Team, 10Traffic, and 2 others: Application Security Review Request : Wikipedia Birthday 2022 - https://phabricator.wikimedia.org/T297816 (10sbassett) [20:51:21] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [20:51:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:52:09] !log hashar@deploy1002 Synchronized php-1.38.0-wmf.13/includes/skins/Skin.php: Remove migration script - T297484 (duration: 01m 06s) [20:52:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:52:14] T297484: Update how destination of top-right search form is set - https://phabricator.wikimedia.org/T297484 [20:52:30] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [20:52:32] 10SRE, 10Application Security Reviews, 10Security-Team, 10Traffic, and 2 others: Application Security Review Request : Wikipedia Birthday 2022 - https://phabricator.wikimedia.org/T297816 (10sbassett) >>! In T297816#7573717, @MSRamos wrote: > @sbassett @Dzahn The landing page will live on wikimediafoundatio... 
[20:52:32] sorry for the train blocker, btw :( [20:52:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:52:44] (03PS5) 10Ebernhardson: sre.wdqs: Integrate wcqs with wdqs cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/745629 (https://phabricator.wikimedia.org/T293638) [20:53:11] urbanecm: no worries, I am more than happy you have stepped in and immediately identified it was a good reason to rollback [20:53:32] better than discovering that I gotta rollback tomorrow morning cause nobody noticed and the Catalan news channel complain that wikipedia is broken :] [20:53:41] heh, definitely [20:53:58] rolling back is a feature :] [20:55:26] the other blocker is an array to string conversion in MediaSearch https://phabricator.wikimedia.org/T297828 [20:55:46] filtering it out from logstash [20:56:15] that's not me though :) [20:58:23] 10SRE, 10Application Security Reviews, 10Security-Team, 10Traffic, and 2 others: Application Security Review Request : Wikipedia Birthday 2022 - https://phabricator.wikimedia.org/T297816 (10MSRamos) @sbassett Correct! It lives under our Corp website umbrella. Happy to clarify if any other questions arise!... [21:00:05] hashar and dancy: It is that lovely time of the day again! You are hereby commanded to deploy MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211215T2000). [21:00:05] chrisalbon and accraze: #bothumor My software never has bugs. It just develops random features. Rise for Services – Graphoid / ORES. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211215T2100). 
[21:02:16] (03PS1) 10RLazarus: imagecatalog: 0770, not 0440, so the init command can create a DB [puppet] - 10https://gerrit.wikimedia.org/r/747610 (https://phabricator.wikimedia.org/T287130) [21:03:42] what [21:03:49] (03CR) 10Legoktm: [C: 03+1] imagecatalog: 0770, not 0440, so the init command can create a DB [puppet] - 10https://gerrit.wikimedia.org/r/747610 (https://phabricator.wikimedia.org/T287130) (owner: 10RLazarus) [21:04:17] (03CR) 10RLazarus: [C: 03+2] imagecatalog: 0770, not 0440, so the init command can create a DB [puppet] - 10https://gerrit.wikimedia.org/r/747610 (https://phabricator.wikimedia.org/T287130) (owner: 10RLazarus) [21:06:43] so hmm train got rolled back [21:06:53] I have a meeting and after that I will claim it a night [21:07:08] happy meeting & night hashar :)) [21:13:08] 10SRE-Access-Requests: Requesting wmf LDAP and analytics-private-data access for Mary Munyoki - https://phabricator.wikimedia.org/T297842 (10nshahquinn-wmf) [21:13:46] 10SRE-Access-Requests: Requesting wmf LDAP and analytics-private-data access for Mary Munyoki - https://phabricator.wikimedia.org/T297842 (10nshahquinn-wmf) Mary still needs to sign L3; that should happen shortly. [21:14:37] will catch up tomorrow to find out what has to be pushed / deployed [21:16:40] I'll likely self deploy the fix to "my" bug once there is any :)) [21:28:29] 10SRE-Access-Requests: Requesting wmf LDAP and analytics-private-data access for Mary Munyoki - https://phabricator.wikimedia.org/T297842 (10nshahquinn-wmf) @Arrbee could you approve this access for Mary? 
[21:37:12] train blocked announcement is going to be sent (I am in a meeting, can't really do it) [21:49:31] bed time & [22:11:06] (03CR) 10Ladsgroup: "It worked on some, not all https://puppet-compiler.wmflabs.org/pcc-worker1002/33024/db1123.eqiad.wmnet/change.db1123.eqiad.wmnet.err" [puppet] - 10https://gerrit.wikimedia.org/r/746826 (https://phabricator.wikimedia.org/T296537) (owner: 10Ladsgroup) [22:23:27] (03PS10) 10Ladsgroup: mariadb: Make centralauth GRANTs conditional to s7 [puppet] - 10https://gerrit.wikimedia.org/r/746826 (https://phabricator.wikimedia.org/T296537) [22:27:21] (03CR) 10Ladsgroup: "PCC is finally happy: https://puppet-compiler.wmflabs.org/pcc-worker1001/33025/" [puppet] - 10https://gerrit.wikimedia.org/r/746826 (https://phabricator.wikimedia.org/T296537) (owner: 10Ladsgroup) [22:29:13] (IcingaOverload) firing: Checks are taking long to execute on alert2001:9245 - https://grafana.wikimedia.org/d/rsCfQfuZz/icinga - https://alerts.wikimedia.org [22:34:13] (IcingaOverload) resolved: Checks are taking long to execute on alert2001:9245 - https://grafana.wikimedia.org/d/rsCfQfuZz/icinga - https://alerts.wikimedia.org [22:38:44] 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search, 10Elasticsearch: Q2:(Need By: 2021-12-17) rack/setup/install elastic108[4-8] - https://phabricator.wikimedia.org/T294152 (10Jclark-ctr) a:05Jclark-ctr→03Cmjohnson elastic1084 A4 U3 Cableid#1214202101 Port#12 elastic1085 B7 U12 Cab...
[22:39:15] SRE, ops-eqiad, DC-Ops, Discovery-Search, Elasticsearch: Q2:(Need By: 2021-12-17) rack/setup/install elastic108[4-8] - https://phabricator.wikimedia.org/T294152 (Jclark-ctr)
[22:45:50] (PS1) Urbanecm: MentorPageMentorManager: Do not fail hard with no mentor list configured [extensions/GrowthExperiments] (wmf/1.38.0-wmf.13) - https://gerrit.wikimedia.org/r/747484 (https://phabricator.wikimedia.org/T297827)
[22:47:52] (CR) Dzahn: "0660 and 0770 are basically the same because puppet always adds the +1/x for a directory" [puppet] - https://gerrit.wikimedia.org/r/747610 (https://phabricator.wikimedia.org/T287130) (owner: RLazarus)
[22:48:12] (PS1) Ladsgroup: auto_schema: Move away from mysql.py [software] - https://gerrit.wikimedia.org/r/747624 (https://phabricator.wikimedia.org/T288235)
[22:49:55] (CR) Legoktm: [C: +2] mediawiki: Enable php-yaml on jobrunners, parsoid, and maintenance servers [puppet] - https://gerrit.wikimedia.org/r/747570 (https://phabricator.wikimedia.org/T296331) (owner: Legoktm)
[22:50:12] thcipriani: almost choked on my tea reading the train blocker mail, outstanding
[22:50:35] !log installing php-yaml on parsoid, jobrunners and maint servers
[22:50:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:53:47] SRE, Application Security Reviews, Security-Team, Traffic, and 2 others: Application Security Review Request : Wikipedia Birthday 2022 - https://phabricator.wikimedia.org/T297816 (Dzahn) @MSRamos Thank you for the clarification! Yes, as Scott said this makes this a bit less of a concern. You can...
[22:54:04] SRE, SRE-swift-storage, ops-eqiad, DC-Ops: Q2:(Need By: TBD) rack/setup/install ms-fe1009-1012 - https://phabricator.wikimedia.org/T294137 (Jclark-ctr) @LSobanski Quick question we are limited in space to keep in same row as what we are replacing it ms-fe10[09..10] would share in same rack also...
[22:54:07] SRE, Application Security Reviews, Security-Team, secscrum, serviceops-radar: Application Security Review Request : Wikipedia Birthday 2022 - https://phabricator.wikimedia.org/T297816 (Dzahn)
[23:00:09] (PS1) Ladsgroup: auto_schema: Add a timeout for depooling + downtime replicas for longer [software] - https://gerrit.wikimedia.org/r/747627 (https://phabricator.wikimedia.org/T288235)
[23:05:18] (PS2) Ladsgroup: auto_schema: Add a timeout for depooling + downtime replicas for longer [software] - https://gerrit.wikimedia.org/r/747627 (https://phabricator.wikimedia.org/T288235)
[23:10:30] !log milimetric@deploy1002 Started deploy [analytics/refinery@0d74de0]: Pushing 0.1.23 for SparkSQLNCLIDriver job
[23:10:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:10:37] !log milimetric@deploy1002 Started deploy [analytics/refinery@0d74de0]: Pushing 0.1.23 for SparkSQLNCLIDriver job
[23:10:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:11:26] SRE, Infrastructure-Foundations, Mail: Forwards from VRT not making it to donate@ - https://phabricator.wikimedia.org/T297307 (Dzahn) Since T128647 back in 2016, all fundraising related aliases should have been moved over to OIT (now ITS). See details from T128647#2087211 ff how Google groups were cr...
[23:15:18] SRE, Infrastructure-Foundations, Mail: Forwards from VRT not making it to donate@ - https://phabricator.wikimedia.org/T297307 (Dzahn) > However, the donor was sent to donations@ (rather than donate@ or fundraising@), This and the part that they are identical on our side makes me think it could be in...
[23:16:05] SRE, Infrastructure-Foundations, Mail, Znuny: Forwards from VRT not making it to donate@ - https://phabricator.wikimedia.org/T297307 (Dzahn)
[23:21:57] PROBLEM - etcd request latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 operation={get,list,listWithCount,update} https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28
[23:22:07] PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[23:26:12] !log milimetric@deploy1002 Finished deploy [analytics/refinery@0d74de0]: Pushing 0.1.23 for SparkSQLNCLIDriver job (duration: 15m 35s)
[23:26:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:26:21] RECOVERY - etcd request latencies on kubemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28
[23:26:33] RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[23:28:19] RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [200.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[23:29:31] (PS2) Jdlrobson: Enable VectorLanguageInMainPageHeader on main page [mediawiki-config] - https://gerrit.wikimedia.org/r/745335 (https://phabricator.wikimedia.org/T293470)
[23:31:11] (PS3) Jdlrobson: Enable VectorLanguageInMainPageHeader on main page [mediawiki-config] - https://gerrit.wikimedia.org/r/745335 (https://phabricator.wikimedia.org/T293470)
[23:37:55] !log milimetric@deploy1002 Started deploy [analytics/refinery@0d74de0] (thin): Pushing 0.1.23 for SparkSQLNCLIDriver job (THIN)
[23:37:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:38:02] !log milimetric@deploy1002 Finished deploy [analytics/refinery@0d74de0] (thin): Pushing 0.1.23 for SparkSQLNCLIDriver job (THIN) (duration: 00m 07s)
[23:38:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:40:56] (PS1) Legoktm: Add "all-mw" cumin alias [puppet] - https://gerrit.wikimedia.org/r/747632 (https://phabricator.wikimedia.org/T294802)
[23:44:39] (CR) Dzahn: "How about saying it's just "A:all-mw-eqiad plus A:all-mw-codfw"" [puppet] - https://gerrit.wikimedia.org/r/747632 (https://phabricator.wikimedia.org/T294802) (owner: Legoktm)
[23:46:11] mutante: having an A:{foo} and then A:{foo}-{eqiad,codfw} seems to be the pattern in other per-DC aliases
[23:47:56] (PS1) Legoktm: mediawiki: Enable php-yaml unconditionally [puppet] - https://gerrit.wikimedia.org/r/747633 (https://phabricator.wikimedia.org/T296331)
[23:48:23] (PS1) Cwhite: logstash: add optional document_type parameter to es output config [puppet] - https://gerrit.wikimedia.org/r/747634 (https://phabricator.wikimedia.org/T297239)
[23:48:25] (PS1) Cwhite: role: add apifeatureusage role [puppet] - https://gerrit.wikimedia.org/r/747635 (https://phabricator.wikimedia.org/T297239)
[23:48:31] (PS1) Cwhite: apifeatureusage: clean up legacy apifeatureusage config [puppet] - https://gerrit.wikimedia.org/r/747636 (https://phabricator.wikimedia.org/T297239)
[23:48:50] (CR) Dzahn: [C: +1] Add "all-mw" cumin alias [puppet] - https://gerrit.wikimedia.org/r/747632 (https://phabricator.wikimedia.org/T294802) (owner: Legoktm)
[23:49:26] (CR) Legoktm: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33026/console" [puppet] - https://gerrit.wikimedia.org/r/747633 (https://phabricator.wikimedia.org/T296331) (owner: Legoktm)
[23:49:46] legoktm: first wanted to say "yes, I agree on creating the combined one. Just saying to define it differently.." but yes. yes. ok :) +1
[23:49:59] (CR) jerkins-bot: [V: -1] role: add apifeatureusage role [puppet] - https://gerrit.wikimedia.org/r/747635 (https://phabricator.wikimedia.org/T297239) (owner: Cwhite)
[23:50:21] gotcha. thanks :)
[23:50:25] to me it was at first "highest level is "all-mw" defined as "all-eqiad-mw plus all-codfw-mw" and those based on "mw"
[23:50:28] (CR) Legoktm: [C: +2] Add "all-mw" cumin alias [puppet] - https://gerrit.wikimedia.org/r/747632 (https://phabricator.wikimedia.org/T294802) (owner: Legoktm)
[23:50:29] but either works :)
[23:53:13] (PS2) Cwhite: role: add apifeatureusage role [puppet] - https://gerrit.wikimedia.org/r/747635 (https://phabricator.wikimedia.org/T297239)
[23:53:57] (CR) jerkins-bot: [V: -1] role: add apifeatureusage role [puppet] - https://gerrit.wikimedia.org/r/747635 (https://phabricator.wikimedia.org/T297239) (owner: Cwhite)
[23:54:03] (CR) Legoktm: [V: +1 C: +2] mediawiki: Enable php-yaml unconditionally [puppet] - https://gerrit.wikimedia.org/r/747633 (https://phabricator.wikimedia.org/T296331) (owner: Legoktm)
[23:55:21] (PS3) Cwhite: role: add apifeatureusage role [puppet] - https://gerrit.wikimedia.org/r/747635 (https://phabricator.wikimedia.org/T297239)
[23:55:41] (PS4) Legoktm: fpm-multiversion-base: Add PHP yaml extension [docker-images/production-images] - https://gerrit.wikimedia.org/r/742790 (https://phabricator.wikimedia.org/T296331)
[23:55:52] (CR) Legoktm: [V: +2 C: +2] fpm-multiversion-base: Add PHP yaml extension [docker-images/production-images] - https://gerrit.wikimedia.org/r/742790 (https://phabricator.wikimedia.org/T296331) (owner: Legoktm)
[23:57:46] (CR) jerkins-bot: [V: -1] role: add apifeatureusage role [puppet] - https://gerrit.wikimedia.org/r/747635 (https://phabricator.wikimedia.org/T297239) (owner: Cwhite)