[00:00:05] RoanKattouw and Urbanecm: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211130T0000). [00:00:05] cjming and AntiComposite: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [00:00:11] o/ [00:00:36] Hi [00:00:36] o/ [00:00:44] Wikisource seems to be down again.. [00:00:56] Fine for me [00:01:02] I am in the UK [00:01:05] can you be more specific? [00:01:14] (about how it is down) [00:01:15] Server not found errors in browser [00:01:36] "An error occurred during a connection to en.wikisource.org. " [00:01:44] can you reach any of the other sites? [00:01:50] I can reach discord [00:02:02] wikimedia sites [00:03:11] Wikivoyage seems OK, but I might be seeing cached versions [00:05:17] I can do the backport deployment today [00:05:19] what about a wikisource language you haven't been to before, like https://ml.wikisource.org/wiki/ [00:05:22] RoanKattouw, thanks [00:05:24] cjming: Are you around for your scheduled deployment? [00:05:25] ShakespeareFan00: see https://wikitech-static.wikimedia.org/wiki/Reporting_a_connectivity_issue [00:05:31] I'm here! 
[00:05:55] (03CR) 10Catrope: [C: 03+2] Provide fallback for config variable when not present [extensions/WikimediaEvents] (wmf/1.38.0-wmf.9) - 10https://gerrit.wikimedia.org/r/742517 (owner: 10Clare Ming) [00:06:24] (03CR) 10Catrope: [C: 03+2] Provide fallback for config variable when not present [extensions/WikimediaEvents] (wmf/1.38.0-wmf.9) - 10https://gerrit.wikimedia.org/r/742517 (owner: 10Clare Ming) [00:06:41] 10SRE, 10Analytics, 10Event-Platform, 10Sustainability (Incident Followup): Pool eventgate-main in both datacenters (active/active) - https://phabricator.wikimedia.org/T296699 (10Legoktm) [00:06:52] (03PS2) 10Catrope: allow sysops to set/remove reviewer group on ckbwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/738530 (https://phabricator.wikimedia.org/T294696) (owner: 10AntiCompositeNumber) [00:06:54] AntiComposite: Seems OK now... [00:06:56] (03CR) 10Catrope: [C: 03+2] allow sysops to set/remove reviewer group on ckbwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/738530 (https://phabricator.wikimedia.org/T294696) (owner: 10AntiCompositeNumber) [00:07:42] (03Merged) 10jenkins-bot: allow sysops to set/remove reviewer group on ckbwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/738530 (https://phabricator.wikimedia.org/T294696) (owner: 10AntiCompositeNumber) [00:08:41] (03PS3) 10Clare Ming: Enable scroll tracking for all users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/742524 (https://phabricator.wikimedia.org/T292586) [00:09:11] (03Merged) 10jenkins-bot: Provide fallback for config variable when not present [extensions/WikimediaEvents] (wmf/1.38.0-wmf.9) - 10https://gerrit.wikimedia.org/r/742517 (owner: 10Clare Ming) [00:09:53] 10SRE, 10Analytics, 10Event-Platform, 10Sustainability (Incident Followup): Pool eventgate-main in both datacenters (active/active) - https://phabricator.wikimedia.org/T296699 (10Legoktm) @dcausse the $kafka_reporting_topic variable is still in puppet 
(https://gerrit.wikimedia.org/g/operations/puppet/+/974... [00:10:44] AntiComposite: Your change is on mwdebug1002, please test [00:11:36] cjming: Your WikimediaEvents change is on mwdebug1002, please test [00:11:58] RoanKattouw, ckb:Special:ListGroupRights looks correct to me [00:12:45] (03CR) 10Catrope: [C: 03+2] Enable scroll tracking for all users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/742524 (https://phabricator.wikimedia.org/T292586) (owner: 10Clare Ming) [00:13:18] !log catrope@deploy1002 Synchronized wmf-config/flaggedrevs.php: Config: [[gerrit:738530|allow sysops to set/remove reviewer group on ckbwiki (T294696)]] (duration: 00m 55s) [00:13:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:13:22] T294696: Add "Pending changes reviewer" User group to Central Kurdish Wikipedia - https://phabricator.wikimedia.org/T294696 [00:13:29] RoanKattouw: thanks! i think it's gtg - we should see results immediately in logstash (without patch, it's blowing lots of errors) [00:13:38] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [00:13:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:13:53] (03Merged) 10jenkins-bot: Enable scroll tracking for all users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/742524 (https://phabricator.wikimedia.org/T292586) (owner: 10Clare Ming) [00:14:43] !log catrope@deploy1002 Synchronized php-1.38.0-wmf.9/extensions/WikimediaEvents/modules/ext.wikimediaEvents/readingDepth.js: Backport: [[gerrit:742517|Provide fallback for config variable when not present]] (duration: 00m 55s) [00:14:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:14:49] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[00:14:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:15:39] cjming: OK great! Your WikimediaEvents patch is now deployed, and the config patch for scroll tracking is now on mwdebug1002 for testing [00:16:14] Although since it's a 1% rate thing, it's probably hard to meaningfully test without me deploying it to the real site? [00:16:38] RoanKattouw: ya - not sure how to properly test [00:16:46] OK I'll just roll it out then [00:16:50] ty! [00:17:28] Thanks! [00:17:55] !log catrope@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:742524|Enable scroll tracking for all users (T292586)]] (duration: 00m 55s) [00:17:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:18:00] T292586: Sticky Header: Create schema to track returning to the top of the page - https://phabricator.wikimedia.org/T292586 [00:20:56] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [00:20:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:22:01] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [00:22:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:52:10] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=sidekiq site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [00:54:20] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [01:04:59] (03PS1) 10Gergő Tisza: Newcomer tasks: Fix filtering of non-existent task types [extensions/GrowthExperiments] (wmf/1.38.0-wmf.9) - 10https://gerrit.wikimedia.org/r/742548 (https://phabricator.wikimedia.org/T296366) [01:31:36] PROBLEM - Maps tiles generation on alert1001 is CRITICAL: CRITICAL: 90.35% of data under the critical threshold [5.0] https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1 [02:04:09] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [02:04:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:05:04] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [02:05:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:06:56] (03PS1) 10TrainBranchBot: Branch commit for wmf/1.38.0-wmf.11 [core] (wmf/1.38.0-wmf.11) - 10https://gerrit.wikimedia.org/r/742582 [02:07:00] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/1.38.0-wmf.11 [core] (wmf/1.38.0-wmf.11) - 10https://gerrit.wikimedia.org/r/742582 (owner: 10TrainBranchBot) [02:19:37] 10SRE, 10MediaWiki-Parser: Varnish 503 errors on page with large number of flag icons. - https://phabricator.wikimedia.org/T267804 (10Bawolff) Afaik, on (file object) cache miss, parser doesn't bulk load file objects, but loads them one at a time from db as it encounters them in the wikitext (compared to say h... [02:22:44] 10SRE, 10MediaWiki-Parser: Varnish 503 errors on page with large number of flag icons.
- https://phabricator.wikimedia.org/T267804 (10Bawolff) Basically i suspect this is a dupe of T56033 [02:27:39] (03CR) 10jerkins-bot: [V: 04-1] Branch commit for wmf/1.38.0-wmf.11 [core] (wmf/1.38.0-wmf.11) - 10https://gerrit.wikimedia.org/r/742582 (owner: 10TrainBranchBot) [03:00:04] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211130T0300) [03:11:48] 10SRE, 10DNS, 10Traffic, 10WMF-Communications: Setup subdomain for Foundation messaging site - https://phabricator.wikimedia.org/T296570 (10Varnent) I also have the certificates info from VIP - and can share that with whomever will need it - presuming that is something we will need. [04:48:26] 10SRE, 10foundation.wikimedia.org, 10serviceops, 10User-Urbanecm_WMF (GovWiki): Investigate and restore foundationwiki 302 httpbb test - https://phabricator.wikimedia.org/T296687 (10Krinkle) I suspect the reason redirects like this one broke, as result of that (very good and reasonable) config change, is t...
[05:06:02] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: refinery-sqoop-mediawiki-production-daily.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:19:55] (03PS1) 10Marostegui: control-mariadb-10.4-bullseye: Control file for 10.4 on Bullseye [software] - 10https://gerrit.wikimedia.org/r/742588 (https://phabricator.wikimedia.org/T295965) [06:20:33] (03CR) 10Marostegui: [C: 03+2] control-mariadb-10.4-bullseye: Control file for 10.4 on Bullseye [software] - 10https://gerrit.wikimedia.org/r/742588 (https://phabricator.wikimedia.org/T295965) (owner: 10Marostegui) [06:21:06] (03Merged) 10jenkins-bot: control-mariadb-10.4-bullseye: Control file for 10.4 on Bullseye [software] - 10https://gerrit.wikimedia.org/r/742588 (https://phabricator.wikimedia.org/T295965) (owner: 10Marostegui) [06:51:08] (03CR) 10Giuseppe Lavagetto: [C: 03+1] Packaging fixes: [docker-images/imagecatalog] - 10https://gerrit.wikimedia.org/r/742280 (owner: 10RLazarus) [07:05:28] (03CR) 10Giuseppe Lavagetto: "Not sure why you would need a merge commit here." 
[docker-images/imagecatalog] (debian) - 10https://gerrit.wikimedia.org/r/742573 (owner: 10RLazarus) [07:12:08] (03CR) 10Giuseppe Lavagetto: [C: 03+1] Initial deb package [docker-images/imagecatalog] (debian) - 10https://gerrit.wikimedia.org/r/738500 (owner: 10RLazarus) [07:17:42] (03PS2) 10Elukey: update-wmf-ca-certificates: add group/other read flags to cert bundle [debs/wmf-certificates] - 10https://gerrit.wikimedia.org/r/742485 [07:17:44] (03PS2) 10Elukey: Bump debian/changelog [debs/wmf-certificates] - 10https://gerrit.wikimedia.org/r/742486 [07:18:36] (03CR) 10Elukey: "replaced go+r with 0644 as requested by John :)" [debs/wmf-certificates] - 10https://gerrit.wikimedia.org/r/742485 (owner: 10Elukey) [07:21:12] (03PS1) 10ArielGlenn: remove snapshot02 stretch instance from deployment-prep scap targets [puppet] - 10https://gerrit.wikimedia.org/r/742608 [07:22:17] (03CR) 10ArielGlenn: [C: 03+2] remove snapshot02 stretch instance from deployment-prep scap targets [puppet] - 10https://gerrit.wikimedia.org/r/742608 (owner: 10ArielGlenn) [07:38:06] (03PS1) 10DCausse: [wdqs] cleanup streaming updater config [puppet] - 10https://gerrit.wikimedia.org/r/742669 [07:39:39] (03PS1) 10DCausse: [wdqs] switch wdqs1010 to the streaming updater [puppet] - 10https://gerrit.wikimedia.org/r/742670 [07:44:01] (03PS1) 10Elukey: atskafka: use the same ca certificate as varnishkafka [puppet] - 10https://gerrit.wikimedia.org/r/742671 (https://phabricator.wikimedia.org/T296064) [07:45:33] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32723/console" [puppet] - 10https://gerrit.wikimedia.org/r/742671 (https://phabricator.wikimedia.org/T296064) (owner: 10Elukey) [07:46:12] (03CR) 10Elukey: [V: 03+1 C: 03+2] atskafka: use the same ca certificate as varnishkafka [puppet] - 10https://gerrit.wikimedia.org/r/742671 (https://phabricator.wikimedia.org/T296064) (owner: 10Elukey) [07:52:25] 10SRE, 10Analytics, 
10Event-Platform, 10Sustainability (Incident Followup): Pool eventgate-main in both datacenters (active/active) - https://phabricator.wikimedia.org/T296699 (10dcausse) >>! In T296699#7536195, @Legoktm wrote: > @dcausse the $kafka_reporting_topic variable is still in puppet (https://gerr... [07:52:26] (03CR) 10Giuseppe Lavagetto: [V: 03+1] service::catalog: DRY the wikireplicas section (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/741703 (owner: 10Giuseppe Lavagetto) [08:14:25] !log revoking DROP from wikiadmin on all pooled replicas (T249683) [08:14:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:14:31] T249683: Redefine mysql GRANTs for wikiadmin - https://phabricator.wikimedia.org/T249683 [08:15:25] (03CR) 10Muehlenhoff: [C: 03+2] Retire labnet-users group [puppet] - 10https://gerrit.wikimedia.org/r/742440 (https://phabricator.wikimedia.org/T296574) (owner: 10Muehlenhoff) [08:16:12] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/742499 (https://phabricator.wikimedia.org/T284811) (owner: 10Muehlenhoff) [08:18:36] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for @tmletzko - https://phabricator.wikimedia.org/T296634 (10WMDE-leszek) I approve this request from WMDE side. [08:18:55] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for @JanJaquemot - https://phabricator.wikimedia.org/T296633 (10WMDE-leszek) I approve this request from WMDE side. 
[08:19:53] (03PS5) 10Giuseppe Lavagetto: service::catalog: DRY the wikireplicas section [puppet] - 10https://gerrit.wikimedia.org/r/741703 [08:21:24] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (NOOP 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32724/console" [puppet] - 10https://gerrit.wikimedia.org/r/741703 (owner: 10Giuseppe Lavagetto) [08:22:11] (03CR) 10Muehlenhoff: [C: 03+2] Disable cluster rebalances temporarily [puppet] - 10https://gerrit.wikimedia.org/r/742499 (https://phabricator.wikimedia.org/T284811) (owner: 10Muehlenhoff) [08:22:16] (03CR) 10Giuseppe Lavagetto: [V: 03+1 C: 03+2] service::catalog: DRY the wikireplicas section [puppet] - 10https://gerrit.wikimedia.org/r/741703 (owner: 10Giuseppe Lavagetto) [08:24:06] !log restarting blazegraph on wdqs1006 (jvm stuck for 6hours) [08:24:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:25:44] (03PS1) 10Elukey: Move kafkatee instances to the new CA bundle location [puppet] - 10https://gerrit.wikimedia.org/r/742673 (https://phabricator.wikimedia.org/T296089) [08:25:46] (03PS1) 10Elukey: Move coal, navtiming and statsv to the new canonical CA bundle path [puppet] - 10https://gerrit.wikimedia.org/r/742674 (https://phabricator.wikimedia.org/T296089) [08:26:17] (03CR) 10Elukey: "elukey@cumin1001:~$ sudo cumin 'r:kafkatee::instance'" [puppet] - 10https://gerrit.wikimedia.org/r/742673 (https://phabricator.wikimedia.org/T296089) (owner: 10Elukey) [08:28:19] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 5): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32725/console" [puppet] - 10https://gerrit.wikimedia.org/r/742673 (https://phabricator.wikimedia.org/T296089) (owner: 10Elukey) [08:28:50] (03CR) 10Elukey: Move kafkatee instances to the new CA bundle location [puppet] - 10https://gerrit.wikimedia.org/r/742673 (https://phabricator.wikimedia.org/T296089) (owner: 10Elukey) [08:29:32] (03PS1) 10Ladsgroup: mariadb: 
Fix production grants [puppet] - 10https://gerrit.wikimedia.org/r/742675 (https://phabricator.wikimedia.org/T249683) [08:30:44] (03CR) 10Elukey: "elukey@cumin1001:~$ sudo cumin 'c:profile::webperf::processors'" [puppet] - 10https://gerrit.wikimedia.org/r/742674 (https://phabricator.wikimedia.org/T296089) (owner: 10Elukey) [08:31:58] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32726/console" [puppet] - 10https://gerrit.wikimedia.org/r/742674 (https://phabricator.wikimedia.org/T296089) (owner: 10Elukey) [08:32:58] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32727/console" [puppet] - 10https://gerrit.wikimedia.org/r/742674 (https://phabricator.wikimedia.org/T296089) (owner: 10Elukey) [08:34:43] (03CR) 10Elukey: "This needs https://gerrit.wikimedia.org/r/c/operations/debs/wmf-certificates/+/742485 rolled out before merging :)" [puppet] - 10https://gerrit.wikimedia.org/r/742673 (https://phabricator.wikimedia.org/T296089) (owner: 10Elukey) [08:34:53] (03CR) 10Elukey: [V: 03+1] "This needs https://gerrit.wikimedia.org/r/c/operations/debs/wmf-certificates/+/742485 rolled out before merging :)" [puppet] - 10https://gerrit.wikimedia.org/r/742674 (https://phabricator.wikimedia.org/T296089) (owner: 10Elukey) [08:36:33] (03CR) 10Marostegui: [C: 04-1] "Let's also add the performance_schema and sys one!" 
[puppet] - 10https://gerrit.wikimedia.org/r/742675 (https://phabricator.wikimedia.org/T249683) (owner: 10Ladsgroup) [08:41:44] (03PS2) 10Ladsgroup: mariadb: Fix production grants [puppet] - 10https://gerrit.wikimedia.org/r/742675 (https://phabricator.wikimedia.org/T249683) [08:41:54] (03CR) 10Ladsgroup: mariadb: Fix production grants (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/742675 (https://phabricator.wikimedia.org/T249683) (owner: 10Ladsgroup) [08:42:04] (03CR) 10Marostegui: [C: 03+1] mariadb: Fix production grants [puppet] - 10https://gerrit.wikimedia.org/r/742675 (https://phabricator.wikimedia.org/T249683) (owner: 10Ladsgroup) [08:42:57] (03PS3) 10Muehlenhoff: Add Cumin alias for wcqs hosts [puppet] - 10https://gerrit.wikimedia.org/r/741084 [08:44:51] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] mariadb: Fix production grants [puppet] - 10https://gerrit.wikimedia.org/r/742675 (https://phabricator.wikimedia.org/T249683) (owner: 10Ladsgroup) [08:47:37] (03CR) 10Giuseppe Lavagetto: [C: 03+1] update-wmf-ca-certificates: add group/other read flags to cert bundle [debs/wmf-certificates] - 10https://gerrit.wikimedia.org/r/742485 (owner: 10Elukey) [08:52:29] (03CR) 10Muehlenhoff: [C: 03+2] Add Cumin alias for wcqs hosts [puppet] - 10https://gerrit.wikimedia.org/r/741084 (owner: 10Muehlenhoff) [08:53:28] RECOVERY - Check systemd state on ms-fe2011 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:56:43] (03CR) 10Muehlenhoff: [C: 03+1] "Thanks" [puppet] - 10https://gerrit.wikimedia.org/r/742480 (owner: 10Majavah) [08:56:46] (03CR) 10Muehlenhoff: [C: 03+2] add phab task for role::doc stretch deprecation [puppet] - 10https://gerrit.wikimedia.org/r/742480 (owner: 10Majavah) [09:00:48] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Prod-Kubernetes, and 2 others: Write a cookbook to set a k8s cluster in maintenance mode - https://phabricator.wikimedia.org/T277677 (10Jelto) [09:01:17] 
10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Prod-Kubernetes, and 2 others: Create a cookbook for depooling one or all services from one kubernetes cluster - https://phabricator.wikimedia.org/T260663 (10Jelto) [09:05:47] (03CR) 10Elukey: [V: 03+2 C: 03+2] update-wmf-ca-certificates: add group/other read flags to cert bundle [debs/wmf-certificates] - 10https://gerrit.wikimedia.org/r/742485 (owner: 10Elukey) [09:06:03] (03CR) 10Elukey: [V: 03+2 C: 03+2] Bump debian/changelog [debs/wmf-certificates] - 10https://gerrit.wikimedia.org/r/742486 (owner: 10Elukey) [09:06:48] !log dropping wikiadmin@localhost from all pooled replicas of s6 (T296511) [09:06:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:06:52] T296511: Drop wikiadmin@localhost MySQL user from core dbs - https://phabricator.wikimedia.org/T296511 [09:06:53] 10SRE, 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team, 10Patch-For-Review, 10cloud-services-team (Kanban): Consider removing labnet-users group - https://phabricator.wikimedia.org/T296574 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff The group has been deprecated. [09:42:26] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:44:38] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:45:17] 10SRE, 10LDAP-Access-Requests: Add Ollie Shotton to the ldap/wmde and ldap/nda group - https://phabricator.wikimedia.org/T296715 (10Ollie.Shotton_WMDE) [09:50:54] 10SRE, 10LDAP-Access-Requests: Add Ollie Shotton to the ldap/wmde and ldap/nda group - https://phabricator.wikimedia.org/T296715 (10WMDE-leszek) As WMDE Engineering Manager I confirm @Ollie.Shotton_WMDE's affiliation, and approve the request on WMDE's end. [09:53:38] (03CR) 10Cathal Mooney: [C: 03+2] Add drmrs public prefix to ntp allowed config (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/742462 (https://phabricator.wikimedia.org/T296623) (owner: 10Cathal Mooney) [09:56:21] (03PS1) 10Muehlenhoff: Add Cumin aliases for edge sites [puppet] - 10https://gerrit.wikimedia.org/r/742686 [09:58:05] (03CR) 10Majavah: dynamicproxy: Validate route project (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/742267 (https://phabricator.wikimedia.org/T129800) (owner: 10Majavah) [09:58:07] !log installing remaining ICU security updates [09:58:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:02:22] (03PS1) 10ArielGlenn: fix up filesize config option for commons mediainfo dumps in deployment prep [puppet] - 10https://gerrit.wikimedia.org/r/742687 [10:03:27] (03CR) 10ArielGlenn: [C: 03+2] fix up filesize config option for commons mediainfo dumps in deployment prep [puppet] - 10https://gerrit.wikimedia.org/r/742687 (owner: 10ArielGlenn) [10:23:02] 10SRE, 10Patch-For-Review: Unify WMF internal CA certs bundle generation - https://phabricator.wikimedia.org/T296089 (10elukey) Very weird result in deployment-prep: ` elukey@deployment-webperf11:~$ sudo rm /etc/ssl/localcerts/WMF_TEST_CA.pem elukey@deployment-webperf11:~$ sudo puppet agent -tv Info: Using c... 
[10:24:16] (03PS1) 10Elukey: sslcert::trusted_ca: ensure cert bundle readability for group/others [puppet] - 10https://gerrit.wikimedia.org/r/742690 (https://phabricator.wikimedia.org/T296089) [10:25:15] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32728/console" [puppet] - 10https://gerrit.wikimedia.org/r/742690 (https://phabricator.wikimedia.org/T296089) (owner: 10Elukey) [10:26:41] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (NOOP 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32729/console" [puppet] - 10https://gerrit.wikimedia.org/r/742690 (https://phabricator.wikimedia.org/T296089) (owner: 10Elukey) [10:27:12] jouncebot: nowandnext [10:27:12] No deployments scheduled for the next 1 hour(s) and 32 minute(s) [10:27:12] In 1 hour(s) and 32 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211130T1200) [10:27:55] alright, I’ll finish my termbox deployment from yesterday (shortly before the outage) [10:29:23] !log lucaswerkmeister-wmde@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'termbox' for release 'production' . [10:29:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:30:33] !log lucaswerkmeister-wmde@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'termbox' for release 'production' .
[10:30:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:33:16] alright, all done I think :) [10:36:30] (03CR) 10Ayounsi: [C: 03+1] Modified loopback4 filter to allow NTP commands to run [homer/public] - 10https://gerrit.wikimedia.org/r/742460 (https://phabricator.wikimedia.org/T296623) (owner: 10Cathal Mooney) [10:39:46] (03PS1) 10ArielGlenn: update man pages and bump version for new build [dumps/mwbzutils] - 10https://gerrit.wikimedia.org/r/742692 [10:39:58] !log rollout wmf-certificates 0~20211129-1 fleet wide (add group/others permissions to the cert bundle) [10:40:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:45:53] (03PS1) 10ArielGlenn: version 0.1.4 [debs/mwbzutils] - 10https://gerrit.wikimedia.org/r/742693 [10:58:27] (03PS6) 10Arturo Borrero Gonzalez: ceph: auth: introduce new parameter 'import_to_ceph' [puppet] - 10https://gerrit.wikimedia.org/r/742175 (https://phabricator.wikimedia.org/T293752) [10:58:29] (03PS14) 10Arturo Borrero Gonzalez: ceph: migrate mon auth to the new abstraction [puppet] - 10https://gerrit.wikimedia.org/r/742176 (https://phabricator.wikimedia.org/T293752) [11:00:19] (03CR) 10jerkins-bot: [V: 04-1] ceph: migrate mon auth to the new abstraction [puppet] - 10https://gerrit.wikimedia.org/r/742176 (https://phabricator.wikimedia.org/T293752) (owner: 10Arturo Borrero Gonzalez) [11:01:05] !log restarting tilerator, kartotherian and tileratorui in codfw [11:01:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:01:59] !log restarting tilerator, kartotherian and tileratorui for updates in eqiad [11:02:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:17:26] (03PS15) 10Arturo Borrero Gonzalez: ceph: migrate mon auth to the new abstraction [puppet] - 10https://gerrit.wikimedia.org/r/742176 (https://phabricator.wikimedia.org/T293752) [11:22:34] (03PS16) 10Arturo Borrero Gonzalez: ceph: migrate mon auth to the new 
abstraction [puppet] - 10https://gerrit.wikimedia.org/r/742176 (https://phabricator.wikimedia.org/T293752) [11:26:28] (03PS1) 10Btullis: Remove stray double-quote from btullis dot-profile [puppet] - 10https://gerrit.wikimedia.org/r/742698 [11:29:20] (03CR) 10Btullis: [C: 03+2] Remove stray double-quote from btullis dot-profile [puppet] - 10https://gerrit.wikimedia.org/r/742698 (owner: 10Btullis) [11:36:47] (03PS1) 10Arturo Borrero Gonzalez: ceph: auth: introduce keydata for mon.xxxx entries [labs/private] - 10https://gerrit.wikimedia.org/r/742699 (https://phabricator.wikimedia.org/T293752) [11:40:26] (03CR) 10ArielGlenn: [V: 03+2 C: 03+2] update man pages and bump version for new build [dumps/mwbzutils] - 10https://gerrit.wikimedia.org/r/742692 (owner: 10ArielGlenn) [11:41:23] (03CR) 10ArielGlenn: [C: 03+2] version 0.1.4 [debs/mwbzutils] - 10https://gerrit.wikimedia.org/r/742693 (owner: 10ArielGlenn) [11:43:53] (03PS1) 10Volans: install_server: fix mgmt DHCP [puppet] - 10https://gerrit.wikimedia.org/r/742701 (https://phabricator.wikimedia.org/T271583) [11:45:20] 10SRE, 10Infrastructure-Foundations: Migrate codfw Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296622 (10MoritzMuehlenhoff) >>! In T296622#7534512, @MoritzMuehlenhoff wrote: > These UUIDs refer to ganeti2007.codfw.wmnet and ganeti2008.codfw.wmnet, this needs more investigation. The node cert... [11:46:11] (03CR) 10Volans: "PCC results: https://puppet-compiler.wmflabs.org/compiler1002/32734/install2003.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/742701 (https://phabricator.wikimedia.org/T271583) (owner: 10Volans) [11:47:56] (03CR) 10Ayounsi: [C: 03+1] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/742701 (https://phabricator.wikimedia.org/T271583) (owner: 10Volans) [11:48:33] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence-Backup, 10database-backups: hw troubleshooting: memory stick failure (uncorrectable error + reduced available memory) for db1102 - https://phabricator.wikimedia.org/T296546 (10jcrespo) > Are you ok leaving this server as is, until the refresh happens in Q3... [11:50:17] !log running "sudo gnt-cluster renew-crypto --new-node-certificates --new-rapi-certificate --new-spice-certificate" for Ganeti codfw cluster T296622 [11:50:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:50:22] T296622: Migrate codfw Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296622 [11:50:49] (03CR) 10Volans: [C: 03+2] install_server: fix mgmt DHCP [puppet] - 10https://gerrit.wikimedia.org/r/742701 (https://phabricator.wikimedia.org/T271583) (owner: 10Volans) [11:54:19] (03PS5) 10Jbond: cookbook sre.puppet.netbox: Cookbook for syncing netbox puppet data [cookbooks] - 10https://gerrit.wikimedia.org/r/739234 [11:54:48] PROBLEM - ganeti-confd running on ganeti2019 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (gnt-confd), command name ganeti-confd https://wikitech.wikimedia.org/wiki/Ganeti [11:55:04] PROBLEM - ganeti-mond running on ganeti2022 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), command name ganeti-mond https://wikitech.wikimedia.org/wiki/Ganeti [11:55:42] PROBLEM - ganeti-mond running on ganeti2019 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), command name ganeti-mond https://wikitech.wikimedia.org/wiki/Ganeti [11:55:48] PROBLEM - Check unit status of netbox_ganeti_codfw_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [11:56:12] PROBLEM - ganeti-wconfd running on ganeti2019 is CRITICAL: PROCS CRITICAL: 0 
processes with UID = 114 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti [11:56:14] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_codfw_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:56:16] (03PS1) 10Volans: install_server: notify dhcp server [puppet] - 10https://gerrit.wikimedia.org/r/742702 (https://phabricator.wikimedia.org/T271583) [11:56:38] PROBLEM - ganeti-confd running on ganeti2023 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (gnt-confd), command name ganeti-confd https://wikitech.wikimedia.org/wiki/Ganeti [11:56:42] PROBLEM - HTTPS Ganeti RAPI codfw on ganeti2019 is CRITICAL: connect to address ganeti01.svc.codfw.wmnet and port 5080: Connection refused https://www.mediawiki.org/wiki/Ganeti%23RAPI_daemon [11:56:46] PROBLEM - ganeti-mond running on ganeti2021 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), command name ganeti-mond https://wikitech.wikimedia.org/wiki/Ganeti [11:56:47] moritzm: are the criticals from ganeti expected for the renew crypto? 
[11:56:48] PROBLEM - ganeti-confd running on ganeti2024 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (gnt-confd), command name ganeti-confd https://wikitech.wikimedia.org/wiki/Ganeti [11:56:50] ^ ganeti alerts are expected, the daemons are temporarily stopped during the cert renewal [11:56:54] ok [11:57:01] it's almost done [11:57:02] PROBLEM - ganeti-confd running on ganeti2022 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (gnt-confd), command name ganeti-confd https://wikitech.wikimedia.org/wiki/Ganeti [11:57:03] PROBLEM - ganeti-noded running on ganeti2024 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), command name ganeti-noded https://wikitech.wikimedia.org/wiki/Ganeti [11:58:26] RECOVERY - ganeti-wconfd running on ganeti2019 is OK: PROCS OK: 1 process with UID = 114 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti [11:59:07] 10SRE, 10ops-eqiad: Rack msw2-eqiad in cab A8 for configuration - https://phabricator.wikimedia.org/T296271 (10ayounsi) >>! In T296271#7523887, @Cmjohnson wrote: > I am confused, msw1-eqiad in A8 is already an EX-4300 48T. Do we want to replace with the same switch? This is just the time to configure and te... [12:00:05] Amir1, Lucas_WMDE, awight, and Urbanecm: My dear minions, it's time we take the moon! Just kidding. Time for UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211130T1200). [12:00:05] No Gerrit patches in the queue for this window AFAICS. 
[12:01:06] RECOVERY - ganeti-confd running on ganeti2023 is OK: PROCS OK: 1 process with UID = 113 (gnt-confd), command name ganeti-confd https://wikitech.wikimedia.org/wiki/Ganeti [12:01:14] RECOVERY - ganeti-mond running on ganeti2021 is OK: PROCS OK: 1 process with UID = 0 (root), command name ganeti-mond https://wikitech.wikimedia.org/wiki/Ganeti [12:01:16] RECOVERY - ganeti-confd running on ganeti2024 is OK: PROCS OK: 1 process with UID = 113 (gnt-confd), command name ganeti-confd https://wikitech.wikimedia.org/wiki/Ganeti [12:01:30] RECOVERY - ganeti-confd running on ganeti2022 is OK: PROCS OK: 1 process with UID = 113 (gnt-confd), command name ganeti-confd https://wikitech.wikimedia.org/wiki/Ganeti [12:01:32] RECOVERY - ganeti-noded running on ganeti2024 is OK: PROCS OK: 1 process with UID = 0 (root), command name ganeti-noded https://wikitech.wikimedia.org/wiki/Ganeti [12:01:46] RECOVERY - ganeti-mond running on ganeti2022 is OK: PROCS OK: 1 process with UID = 0 (root), command name ganeti-mond https://wikitech.wikimedia.org/wiki/Ganeti [12:02:26] RECOVERY - ganeti-mond running on ganeti2019 is OK: PROCS OK: 1 process with UID = 0 (root), command name ganeti-mond https://wikitech.wikimedia.org/wiki/Ganeti [12:02:43] (03PS1) 10Jbond: cookbook sre.hosts.decommision: dont sleep in dry-run mode [cookbooks] - 10https://gerrit.wikimedia.org/r/742703 [12:02:52] !log jbond@cumin1001 START - Cookbook sre.hosts.decommission for hosts puppetboard1001.eqiad.wmnet [12:02:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:03:24] RECOVERY - HTTPS Ganeti RAPI codfw on ganeti2019 is OK: HTTP OK: Status line output matched 401 - 309 bytes in 0.016 second response time https://www.mediawiki.org/wiki/Ganeti%23RAPI_daemon [12:03:46] RECOVERY - ganeti-confd running on ganeti2019 is OK: PROCS OK: 1 process with UID = 113 (gnt-confd), command name ganeti-confd https://wikitech.wikimedia.org/wiki/Ganeti [12:06:54] 10SRE, 10Infrastructure-Foundations: 
Migrate codfw Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296622 (10MoritzMuehlenhoff) >>! In T296622#7536998, @MoritzMuehlenhoff wrote: > Looking at the timestamp they were in fact touched, but maybe there's an ordering bug somewhere. I'll re-run > > ` >... [12:07:00] RECOVERY - Check unit status of netbox_ganeti_codfw_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [12:07:22] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:08:51] (03PS1) 10Jbond: scap::dsh: update puppetboard hosts to new host [puppet] - 10https://gerrit.wikimedia.org/r/742704 (https://phabricator.wikimedia.org/T264276) [12:09:42] !log jbond@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=99) for hosts puppetboard1001.eqiad.wmnet [12:09:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:11:37] 10SRE, 10Infrastructure-Foundations: Migrate eqiad Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296721 (10MoritzMuehlenhoff) [12:13:15] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/742673 (https://phabricator.wikimedia.org/T296089) (owner: 10Elukey) [12:13:42] (03CR) 10Muehlenhoff: "But we can simply remove it, can't we? After all with the new setup puppetboard gets deployed via the deb and no longer via scap, right?" 
[puppet] - 10https://gerrit.wikimedia.org/r/742704 (https://phabricator.wikimedia.org/T264276) (owner: 10Jbond) [12:16:22] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/742703 (owner: 10Jbond) [12:18:12] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/742674 (https://phabricator.wikimedia.org/T296089) (owner: 10Elukey) [12:23:16] (03CR) 10Jbond: sslcert::trusted_ca: ensure cert bundle readability for group/others (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/742690 (https://phabricator.wikimedia.org/T296089) (owner: 10Elukey) [12:23:54] (03CR) 10Jbond: [C: 03+1] install_server: notify dhcp server [puppet] - 10https://gerrit.wikimedia.org/r/742702 (https://phabricator.wikimedia.org/T271583) (owner: 10Volans) [12:25:44] (03PS2) 10Jbond: scap::dsh: remove puppetboard; hosts now use apt [puppet] - 10https://gerrit.wikimedia.org/r/742704 (https://phabricator.wikimedia.org/T264276) [12:25:46] (03CR) 10Jbond: "thanks updated" [puppet] - 10https://gerrit.wikimedia.org/r/742704 (https://phabricator.wikimedia.org/T264276) (owner: 10Jbond) [12:25:49] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on db2114.codfw.wmnet with reason: Maintenance T277354 [12:25:51] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2114.codfw.wmnet with reason: Maintenance T277354 [12:25:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:25:54] T277354: "chemical" major mime type was never added to production database - https://phabricator.wikimedia.org/T277354 [12:25:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2114 (T277354)', diff saved to https://phabricator.wikimedia.org/P17900 and previous config saved to /var/cache/conftool/dbconfig/20211130-122555-marostegui.json [12:25:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:26:00] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:26:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'After maintenance db2114 (T277354)', diff saved to https://phabricator.wikimedia.org/P17901 and previous config saved to /var/cache/conftool/dbconfig/20211130-122610-marostegui.json [12:26:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:29:28] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to me. Not 100% sure if there are other steps needed to "close" a scap repo, maybe ping someone from release engineering for co" [puppet] - 10https://gerrit.wikimedia.org/r/742704 (https://phabricator.wikimedia.org/T264276) (owner: 10Jbond) [12:30:00] (03CR) 10Jbond: [V: 03+2 C: 03+2] cookbook sre.hosts.decommision: dont sleep in dry-run mode (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/742703 (owner: 10Jbond) [12:31:03] (03CR) 10Jbond: scap::dsh: remove puppetboard; hosts now use apt (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/742704 (https://phabricator.wikimedia.org/T264276) (owner: 10Jbond) [12:32:16] (03PS1) 10Ayounsi: Allow DHCP reply from install servers to relay [homer/public] - 10https://gerrit.wikimedia.org/r/742708 (https://phabricator.wikimedia.org/T271583) [12:32:38] (03PS7) 10Arturo Borrero Gonzalez: ceph: auth: introduce new parameter 'import_to_ceph' [puppet] - 10https://gerrit.wikimedia.org/r/742175 (https://phabricator.wikimedia.org/T293752) [12:32:40] (03PS17) 10Arturo Borrero Gonzalez: ceph: migrate mon auth to the new abstraction [puppet] - 10https://gerrit.wikimedia.org/r/742176 (https://phabricator.wikimedia.org/T293752) [12:32:42] (03PS1) 10Arturo Borrero Gonzalez: ceph: auth: introduce function to calculate keyring_path [puppet] - 10https://gerrit.wikimedia.org/r/742709 (https://phabricator.wikimedia.org/T293752) [12:33:08] (03Merged) 10jenkins-bot: cookbook sre.hosts.decommision: dont sleep in dry-run mode [cookbooks] - 10https://gerrit.wikimedia.org/r/742703 (owner: 10Jbond) 
[12:33:45] (03CR) 10jerkins-bot: [V: 04-1] ceph: migrate mon auth to the new abstraction [puppet] - 10https://gerrit.wikimedia.org/r/742176 (https://phabricator.wikimedia.org/T293752) (owner: 10Arturo Borrero Gonzalez) [12:34:19] (03CR) 10jerkins-bot: [V: 04-1] ceph: auth: introduce function to calculate keyring_path [puppet] - 10https://gerrit.wikimedia.org/r/742709 (https://phabricator.wikimedia.org/T293752) (owner: 10Arturo Borrero Gonzalez) [12:41:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'After maintenance db2114 (T277354)', diff saved to https://phabricator.wikimedia.org/P17902 and previous config saved to /var/cache/conftool/dbconfig/20211130-124115-marostegui.json [12:41:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:41:20] T277354: "chemical" major mime type was never added to production database - https://phabricator.wikimedia.org/T277354 [12:44:12] (03PS2) 10Arturo Borrero Gonzalez: ceph: auth: introduce function to calculate keyring_path [puppet] - 10https://gerrit.wikimedia.org/r/742709 (https://phabricator.wikimedia.org/T293752) [12:44:14] (03PS18) 10Arturo Borrero Gonzalez: ceph: migrate mon auth to the new abstraction [puppet] - 10https://gerrit.wikimedia.org/r/742176 (https://phabricator.wikimedia.org/T293752) [12:45:08] (03CR) 10jerkins-bot: [V: 04-1] ceph: auth: introduce function to calculate keyring_path [puppet] - 10https://gerrit.wikimedia.org/r/742709 (https://phabricator.wikimedia.org/T293752) (owner: 10Arturo Borrero Gonzalez) [12:45:23] (03CR) 10jerkins-bot: [V: 04-1] ceph: migrate mon auth to the new abstraction [puppet] - 10https://gerrit.wikimedia.org/r/742176 (https://phabricator.wikimedia.org/T293752) (owner: 10Arturo Borrero Gonzalez) [12:56:00] (03CR) 10Jbond: "As mentioned on irc this would probably be better as a puppet function" [puppet] - 10https://gerrit.wikimedia.org/r/742709 (https://phabricator.wikimedia.org/T293752) (owner: 10Arturo Borrero Gonzalez) [12:56:20] !log 
marostegui@cumin1001 dbctl commit (dc=all): 'After maintenance db2114 (T277354)', diff saved to https://phabricator.wikimedia.org/P17903 and previous config saved to /var/cache/conftool/dbconfig/20211130-125620-marostegui.json [12:56:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:56:25] T277354: "chemical" major mime type was never added to production database - https://phabricator.wikimedia.org/T277354 [12:57:09] (03PS3) 10Arturo Borrero Gonzalez: ceph: auth: introduce function to calculate keyring_path [puppet] - 10https://gerrit.wikimedia.org/r/742709 (https://phabricator.wikimedia.org/T293752) [12:57:11] (03PS19) 10Arturo Borrero Gonzalez: ceph: migrate mon auth to the new abstraction [puppet] - 10https://gerrit.wikimedia.org/r/742176 (https://phabricator.wikimedia.org/T293752) [12:57:51] (03CR) 10jerkins-bot: [V: 04-1] ceph: auth: introduce function to calculate keyring_path [puppet] - 10https://gerrit.wikimedia.org/r/742709 (https://phabricator.wikimedia.org/T293752) (owner: 10Arturo Borrero Gonzalez) [12:58:14] (03CR) 10jerkins-bot: [V: 04-1] ceph: migrate mon auth to the new abstraction [puppet] - 10https://gerrit.wikimedia.org/r/742176 (https://phabricator.wikimedia.org/T293752) (owner: 10Arturo Borrero Gonzalez) [13:01:26] (03CR) 10Elukey: [V: 03+1] sslcert::trusted_ca: ensure cert bundle readability for group/others (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/742690 (https://phabricator.wikimedia.org/T296089) (owner: 10Elukey) [13:01:28] (03PS4) 10Arturo Borrero Gonzalez: ceph: auth: introduce function to calculate keyring_path [puppet] - 10https://gerrit.wikimedia.org/r/742709 (https://phabricator.wikimedia.org/T293752) [13:01:30] (03PS20) 10Arturo Borrero Gonzalez: ceph: migrate mon auth to the new abstraction [puppet] - 10https://gerrit.wikimedia.org/r/742176 (https://phabricator.wikimedia.org/T293752) [13:02:21] (03CR) 10jerkins-bot: [V: 04-1] ceph: auth: introduce function to calculate keyring_path 
[puppet] - 10https://gerrit.wikimedia.org/r/742709 (https://phabricator.wikimedia.org/T293752) (owner: 10Arturo Borrero Gonzalez) [13:02:40] (03CR) 10jerkins-bot: [V: 04-1] ceph: migrate mon auth to the new abstraction [puppet] - 10https://gerrit.wikimedia.org/r/742176 (https://phabricator.wikimedia.org/T293752) (owner: 10Arturo Borrero Gonzalez) [13:02:47] (03CR) 10Cathal Mooney: [C: 03+2] Modified loopback4 filter to allow NTP commands to run (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/742460 (https://phabricator.wikimedia.org/T296623) (owner: 10Cathal Mooney) [13:03:15] (03CR) 10Elukey: [V: 03+1] Move coal, navtiming and statsv to the new canonical CA bundle path (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/742674 (https://phabricator.wikimedia.org/T296089) (owner: 10Elukey) [13:03:31] (03Merged) 10jenkins-bot: Modified loopback4 filter to allow NTP commands to run [homer/public] - 10https://gerrit.wikimedia.org/r/742460 (https://phabricator.wikimedia.org/T296623) (owner: 10Cathal Mooney) [13:04:00] (03PS5) 10Arturo Borrero Gonzalez: ceph: auth: introduce function to calculate keyring_path [puppet] - 10https://gerrit.wikimedia.org/r/742709 (https://phabricator.wikimedia.org/T293752) [13:04:34] (03PS1) 10Hnowlan: api-gateway: Create read and write clusters for mw and discovery APIs [deployment-charts] - 10https://gerrit.wikimedia.org/r/742715 (https://phabricator.wikimedia.org/T294445) [13:04:57] (03CR) 10jerkins-bot: [V: 04-1] ceph: auth: introduce function to calculate keyring_path [puppet] - 10https://gerrit.wikimedia.org/r/742709 (https://phabricator.wikimedia.org/T293752) (owner: 10Arturo Borrero Gonzalez) [13:05:01] !log Running homer against CR routers to adjust loopback4 filter enabling local NTP queries for status. 
T296623 [13:05:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:05:06] T296623: Enable NTP for drmrs network devices - https://phabricator.wikimedia.org/T296623 [13:05:41] (03CR) 10Jbond: [C: 03+1] "i see you have allready updated to be simlar to the comment i made earlier so lgtm 😊" [puppet] - 10https://gerrit.wikimedia.org/r/742709 (https://phabricator.wikimedia.org/T293752) (owner: 10Arturo Borrero Gonzalez) [13:08:11] (03PS2) 10Hnowlan: api-gateway: Create read and write clusters for mw and discovery APIs [deployment-charts] - 10https://gerrit.wikimedia.org/r/742715 (https://phabricator.wikimedia.org/T294445) [13:10:36] 10SRE, 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team, 10cloud-services-team (Kanban): Consider removing labnet-users group - https://phabricator.wikimedia.org/T296574 (10hashar) Confirmed: ` $ ssh cloudnet1003.eqiad.wmnet Password: ` Danke Schon. [13:11:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'After maintenance db2114 (T277354)', diff saved to https://phabricator.wikimedia.org/P17904 and previous config saved to /var/cache/conftool/dbconfig/20211130-131124-marostegui.json [13:11:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:31] T277354: "chemical" major mime type was never added to production database - https://phabricator.wikimedia.org/T277354 [13:12:17] (03PS6) 10Arturo Borrero Gonzalez: ceph: auth: introduce function to calculate keyring_path [puppet] - 10https://gerrit.wikimedia.org/r/742709 (https://phabricator.wikimedia.org/T293752) [13:13:04] (03CR) 10Jbond: ceph: auth: introduce function to calculate keyring_path (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/742709 (https://phabricator.wikimedia.org/T293752) (owner: 10Arturo Borrero Gonzalez) [13:15:17] (03PS21) 10Arturo Borrero Gonzalez: ceph: migrate mon auth to the new abstraction [puppet] - 10https://gerrit.wikimedia.org/r/742176 
(https://phabricator.wikimedia.org/T293752) [13:16:18] (03CR) 10jerkins-bot: [V: 04-1] ceph: migrate mon auth to the new abstraction [puppet] - 10https://gerrit.wikimedia.org/r/742176 (https://phabricator.wikimedia.org/T293752) (owner: 10Arturo Borrero Gonzalez) [13:16:34] (03CR) 10Jbond: sslcert::trusted_ca: ensure cert bundle readability for group/others (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/742690 (https://phabricator.wikimedia.org/T296089) (owner: 10Elukey) [13:17:28] (03PS22) 10Arturo Borrero Gonzalez: ceph: migrate mon auth to the new abstraction [puppet] - 10https://gerrit.wikimedia.org/r/742176 (https://phabricator.wikimedia.org/T293752) [13:18:30] (03PS2) 10Elukey: sslcert::trusted_ca: ensure cert bundle readability for group/others [puppet] - 10https://gerrit.wikimedia.org/r/742690 (https://phabricator.wikimedia.org/T296089) [13:19:01] (03CR) 10Elukey: sslcert::trusted_ca: ensure cert bundle readability for group/others (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/742690 (https://phabricator.wikimedia.org/T296089) (owner: 10Elukey) [13:19:16] (03CR) 10jerkins-bot: [V: 04-1] sslcert::trusted_ca: ensure cert bundle readability for group/others [puppet] - 10https://gerrit.wikimedia.org/r/742690 (https://phabricator.wikimedia.org/T296089) (owner: 10Elukey) [13:20:03] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/742709 (https://phabricator.wikimedia.org/T293752) (owner: 10Arturo Borrero Gonzalez) [13:20:52] (03PS3) 10Jbond: sslcert::trusted_ca: ensure cert bundle readability for group/others [puppet] - 10https://gerrit.wikimedia.org/r/742690 (https://phabricator.wikimedia.org/T296089) (owner: 10Elukey) [13:21:06] (03CR) 10Jbond: [C: 03+1] "LGTM (i made a quick fix for the style violations)" [puppet] - 10https://gerrit.wikimedia.org/r/742690 (https://phabricator.wikimedia.org/T296089) (owner: 10Elukey) [13:23:26] (03CR) 10Arturo Borrero Gonzalez: [V: 03+2 C: 03+2] ceph: auth: introduce keydata for 
mon.xxxx entries [labs/private] - 10https://gerrit.wikimedia.org/r/742699 (https://phabricator.wikimedia.org/T293752) (owner: 10Arturo Borrero Gonzalez) [13:23:29] (03CR) 10Elukey: [C: 03+2] "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/742690 (https://phabricator.wikimedia.org/T296089) (owner: 10Elukey) [13:23:32] RECOVERY - exim queue on mx2001 is OK: OK: Less than 2000 mails in exim queue. https://wikitech.wikimedia.org/wiki/Exim [13:25:14] !log elukey@cumin1001 START - Cookbook sre.kafka.roll-restart-brokers for Kafka A:kafka-test-eqiad cluster: Roll restart of jvm daemons for openjdk upgrade. [13:25:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:28] (03CR) 10Arturo Borrero Gonzalez: [V: 03+2 C: 03+2] "turns out the change is a PCC NOOP https://puppet-compiler.wmflabs.org/compiler1001/32736/" [puppet] - 10https://gerrit.wikimedia.org/r/742176 (https://phabricator.wikimedia.org/T293752) (owner: 10Arturo Borrero Gonzalez) [13:26:34] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] ceph: auth: introduce new parameter 'import_to_ceph' [puppet] - 10https://gerrit.wikimedia.org/r/742175 (https://phabricator.wikimedia.org/T293752) (owner: 10Arturo Borrero Gonzalez) [13:26:47] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] ceph: auth: introduce function to calculate keyring_path [puppet] - 10https://gerrit.wikimedia.org/r/742709 (https://phabricator.wikimedia.org/T293752) (owner: 10Arturo Borrero Gonzalez) [13:26:49] (03CR) 10Volans: [C: 03+1] "LGTM, thanks!" 
[homer/public] - 10https://gerrit.wikimedia.org/r/742708 (https://phabricator.wikimedia.org/T271583) (owner: 10Ayounsi) [13:27:14] (03CR) 10Ayounsi: [C: 03+2] Allow DHCP reply from install servers to relay [homer/public] - 10https://gerrit.wikimedia.org/r/742708 (https://phabricator.wikimedia.org/T271583) (owner: 10Ayounsi) [13:27:48] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:27:50] (03Merged) 10jenkins-bot: Allow DHCP reply from install servers to relay [homer/public] - 10https://gerrit.wikimedia.org/r/742708 (https://phabricator.wikimedia.org/T271583) (owner: 10Ayounsi) [13:29:30] (03PS2) 10Elukey: Move kafkatee instances to the new CA bundle location [puppet] - 10https://gerrit.wikimedia.org/r/742673 (https://phabricator.wikimedia.org/T296089) [13:29:39] (03CR) 10Elukey: Move kafkatee instances to the new CA bundle location (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/742673 (https://phabricator.wikimedia.org/T296089) (owner: 10Elukey) [13:30:06] 10SRE, 10SRE-OnFire (FY2021/2022-Q2), 10SRE Observability (FY2021/2022-Q2): Ensure SRE team has a good understanding of how & when to declare an outage on the status page; & it is easy to do so - https://phabricator.wikimedia.org/T285769 (10lmata) [13:30:35] 10SRE, 10SRE-OnFire (FY2021/2022-Q2), 10SRE Observability (FY2021/2022-Q2): Implement an accurate and easy to understand status page for all wikis - https://phabricator.wikimedia.org/T202061 (10lmata) [13:32:09] 10SRE-OnFire (FY2021/2022-Q2), 10SRE Observability (FY2021/2022-Q2): incidents occurring during Q2 have been scored with the scorecard - https://phabricator.wikimedia.org/T292254 (10lmata) [13:34:28] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:37:22] (03CR) 10Jbond: [C: 03+1] Move kafkatee instances to the new CA bundle location [puppet] - 10https://gerrit.wikimedia.org/r/742673 (https://phabricator.wikimedia.org/T296089) (owner: 10Elukey) [13:38:50] (03CR) 10David Caro: ceph: auth: introduce function to calculate keyring_path (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/742709 (https://phabricator.wikimedia.org/T293752) (owner: 10Arturo Borrero Gonzalez) [13:39:23] (03PS1) 10Arturo Borrero Gonzalez: hiera: ceph: typo in mon number [puppet] - 10https://gerrit.wikimedia.org/r/742720 [13:40:17] (03PS1) 10Arturo Borrero Gonzalez: hiera: ceph: auth: fix typo in ceph mon name [labs/private] - 10https://gerrit.wikimedia.org/r/742721 [13:40:24] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] hiera: ceph: typo in mon number [puppet] - 10https://gerrit.wikimedia.org/r/742720 (owner: 10Arturo Borrero Gonzalez) [13:40:48] (03CR) 10Arturo Borrero Gonzalez: [V: 03+2 C: 03+2] hiera: ceph: auth: fix typo in ceph mon name [labs/private] - 10https://gerrit.wikimedia.org/r/742721 (owner: 10Arturo Borrero Gonzalez) [13:41:48] (03CR) 10David Caro: ceph: migrate mon auth to the new abstraction (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/742176 (https://phabricator.wikimedia.org/T293752) (owner: 10Arturo Borrero Gonzalez) [13:42:55] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: hw troubleshooting: disk failure (sdr) for ms-be2059.codfw.wmnet - https://phabricator.wikimedia.org/T295563 (10MatthewVernon) @Papaul you asked for the drive serial. From the above listing `Device Id: 15`, so: ` mvernon@ms-be2059:~$ sudo smartctl -d megar... 
[13:43:56] 10SRE-tools, 10Infrastructure-Foundations, 10User-jbond: Spicerack: improve Icinga module to support mgmt interfaces - https://phabricator.wikimedia.org/T226470 (10Volans) 05Open→03Resolved a:03Volans This has been actually fixed in [[ https://doc.wikimedia.org/spicerack/master/release.html#v0-0-52-202... [13:44:59] (03CR) 10Arturo Borrero Gonzalez: ceph: migrate mon auth to the new abstraction (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/742176 (https://phabricator.wikimedia.org/T293752) (owner: 10Arturo Borrero Gonzalez) [13:45:34] (03PS1) 10Filippo Giunchedi: nagios_common: remove bstorm-email from production contactgroups [puppet] - 10https://gerrit.wikimedia.org/r/742723 [13:45:50] !log elukey@cumin1001 END (PASS) - Cookbook sre.kafka.roll-restart-brokers (exit_code=0) for Kafka A:kafka-test-eqiad cluster: Roll restart of jvm daemons for openjdk upgrade. [13:45:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:08] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] nagios_common: remove bstorm-email from production contactgroups [puppet] - 10https://gerrit.wikimedia.org/r/742723 (owner: 10Filippo Giunchedi) [13:48:54] (03CR) 10Volans: [C: 03+2] install_server: notify dhcp server [puppet] - 10https://gerrit.wikimedia.org/r/742702 (https://phabricator.wikimedia.org/T271583) (owner: 10Volans) [13:49:19] (03CR) 10Filippo Giunchedi: [C: 03+2] nagios_common: remove bstorm-email from production contactgroups [puppet] - 10https://gerrit.wikimedia.org/r/742723 (owner: 10Filippo Giunchedi) [13:53:56] (03PS3) 10Elukey: Move kafkatee instances to the new CA bundle location [puppet] - 10https://gerrit.wikimedia.org/r/742673 (https://phabricator.wikimedia.org/T296089) [13:53:58] (03PS2) 10Elukey: Move coal, navtiming and statsv to the new canonical CA bundle path [puppet] - 10https://gerrit.wikimedia.org/r/742674 (https://phabricator.wikimedia.org/T296089) [13:54:00] (03PS1) 10Elukey: Add helper functions to retrieve CA
bundle paths/passwords [puppet] - 10https://gerrit.wikimedia.org/r/742724 (https://phabricator.wikimedia.org/T296089) [13:54:02] (03PS1) 10Elukey: profile::kafka::broker: use new get ca bundle path helpers [puppet] - 10https://gerrit.wikimedia.org/r/742725 (https://phabricator.wikimedia.org/T296089) [13:55:40] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 5): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32737/console" [puppet] - 10https://gerrit.wikimedia.org/r/742673 (https://phabricator.wikimedia.org/T296089) (owner: 10Elukey) [13:56:55] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32738/console" [puppet] - 10https://gerrit.wikimedia.org/r/742674 (https://phabricator.wikimedia.org/T296089) (owner: 10Elukey) [13:59:39] 10SRE, 10Infrastructure-Foundations, 10netops: Enable NTP for drmrs network devices - https://phabricator.wikimedia.org/T296623 (10cmooney) Ok so this has been addressed for CR routers. You can view the NTP status as follows: ` cmooney@cr2-eqord> show ntp associations remote refid st t... [14:00:53] 10SRE, 10Infrastructure-Foundations, 10netops: Enable NTP for drmrs network devices - https://phabricator.wikimedia.org/T296623 (10cmooney) 05Open→03Resolved Scrap that it does seem to be working, perhaps it only failed to query against itself after the initial change. ` cmooney@mr1-drmrs> show ntp assoc...
[14:03:48] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (NOOP 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32740/console" [puppet] - 10https://gerrit.wikimedia.org/r/742725 (https://phabricator.wikimedia.org/T296089) (owner: 10Elukey) [14:11:29] (03CR) 10Jbond: [C: 03+1] Add helper functions to retrieve CA bundle paths/passwords [puppet] - 10https://gerrit.wikimedia.org/r/742724 (https://phabricator.wikimedia.org/T296089) (owner: 10Elukey) [14:11:50] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/742673 (https://phabricator.wikimedia.org/T296089) (owner: 10Elukey) [14:12:27] 10SRE, 10ops-drmrs, 10Traffic: Degraded RAID on cp6002 - https://phabricator.wikimedia.org/T295747 (10BBlack) 05Open→03Invalid This was a false alarm due to monitoring anomalies while first bringing up the host. [14:13:10] (03CR) 10Jbond: [C: 03+1] Move coal, navtiming and statsv to the new canonical CA bundle path (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/742674 (https://phabricator.wikimedia.org/T296089) (owner: 10Elukey) [14:13:39] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/742725 (https://phabricator.wikimedia.org/T296089) (owner: 10Elukey) [14:14:46] 10SRE, 10Patch-For-Review: Migrate puppetboard to Bullseye - https://phabricator.wikimedia.org/T264276 (10Volans) Uncomplete list of things to cleanup in random order that I think are not needed anymore if we can ditch the old puppetboard puppet module: - Gerrit repositories that could be deleted: https://g...
[14:15:37] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/742704 (https://phabricator.wikimedia.org/T264276) (owner: 10Jbond) [14:16:35] (03CR) 10Elukey: [C: 03+2] Add helper functions to retrieve CA bundle paths/passwords [puppet] - 10https://gerrit.wikimedia.org/r/742724 (https://phabricator.wikimedia.org/T296089) (owner: 10Elukey) [14:16:44] (03CR) 10Elukey: [V: 03+1 C: 03+2] Move kafkatee instances to the new CA bundle location [puppet] - 10https://gerrit.wikimedia.org/r/742673 (https://phabricator.wikimedia.org/T296089) (owner: 10Elukey) [14:17:49] (03CR) 10Jbond: [C: 03+2] scap::dsh: remove puppetboard; hosts now use apt [puppet] - 10https://gerrit.wikimedia.org/r/742704 (https://phabricator.wikimedia.org/T264276) (owner: 10Jbond) [14:19:49] 10SRE, 10Data-Engineering, 10Infrastructure-Foundations, 10Traffic-Icebox, and 2 others: Collect netflow data for internal traffic - https://phabricator.wikimedia.org/T263277 (10BTullis) Tagging #data-engineering because we will likely be managing the Gobblin and/or Druid ingestion parts of this pipeline.... [14:20:04] 10SRE, 10Patch-For-Review: Migrate puppetboard to Bullseye - https://phabricator.wikimedia.org/T264276 (10jbond) Thanks riccardo although unless im missing something: >>! In T264276#7537473, @Volans wrote: > - puppet's `profile::puppetdb::puppetboard_hosts` This is still needed for the ferm ACL's [14:23:34] 10SRE, 10Patch-For-Review: Migrate puppetboard to Bullseye - https://phabricator.wikimedia.org/T264276 (10Volans) >>! In T264276#7537486, @jbond wrote: > Thanks riccardo although unless im missing something: > >>>! In T264276#7537473, @Volans wrote: >> - puppet's `profile::puppetdb::puppetboard_hosts` > Thi... 
[14:24:03] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:24:20] 10SRE, 10Patch-For-Review: Migrate puppetboard to Bullseye - https://phabricator.wikimedia.org/T264276 (10jbond) > Yes but I think still refer to all 4 hosts. I meant cleanup the old hostnames from there, sorry. ack thanks >Forgot to add: manual cleanup of the /srv/deployment/puppetboard directory in the depl... [14:26:05] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:32:02] 10SRE, 10Wikidata, 10Wikidata-Query-Service, 10wdwb-tech, 10Discovery-Search (Current work): Resolve kernel hang on wcqs* instances - https://phabricator.wikimedia.org/T294961 (10MoritzMuehlenhoff) >>! In T294961#7534879, @EBernhardson wrote: > Another round of import tests completed, nothing fell over.... 
[14:34:42] (03PS1) 10Muehlenhoff: Add profile::base::linux419 to the WCQS role [puppet] - 10https://gerrit.wikimedia.org/r/742729 (https://phabricator.wikimedia.org/T294961) [14:36:24] (03PS1) 10Muehlenhoff: Point back irc.wikimedia.org to irc2001 [dns] - 10https://gerrit.wikimedia.org/r/742730 (https://phabricator.wikimedia.org/T296721) [14:40:43] (03PS1) 10Jbond: P:puppetdb: drop old puppetboard hosts [puppet] - 10https://gerrit.wikimedia.org/r/742731 (https://phabricator.wikimedia.org/T264276) [14:41:04] (03CR) 10Jbond: [C: 03+2] P:puppetdb: drop old puppetboard hosts [puppet] - 10https://gerrit.wikimedia.org/r/742731 (https://phabricator.wikimedia.org/T264276) (owner: 10Jbond) [14:41:09] (03CR) 10Jbond: [V: 03+2 C: 03+2] P:puppetdb: drop old puppetboard hosts [puppet] - 10https://gerrit.wikimedia.org/r/742731 (https://phabricator.wikimedia.org/T264276) (owner: 10Jbond) [14:43:26] (03PS1) 10Btullis: Add /srv/spark-tmp to the list of allowed read-write paths [puppet] - 10https://gerrit.wikimedia.org/r/742732 (https://phabricator.wikimedia.org/T295346) [14:43:46] 10Puppet, 10Infrastructure-Foundations, 10User-jbond: Decommission puppetboard[12]001 - https://phabricator.wikimedia.org/T296744 (10jbond) [14:43:50] (03CR) 10Jforrester: [C: 03+2] "Let's land this, even though we're not going to deploy it." 
[core] (wmf/1.38.0-wmf.11) - 10https://gerrit.wikimedia.org/r/742582 (owner: 10TrainBranchBot) [14:43:59] 10Puppet, 10Infrastructure-Foundations, 10User-jbond: Decommission puppetboard[12]001 - https://phabricator.wikimedia.org/T296744 (10jbond) p:05Triage→03Medium [14:44:40] 10Puppet, 10Infrastructure-Foundations, 10User-jbond: Upgrade puppetboard to the latest version - https://phabricator.wikimedia.org/T292522 (10jbond) [14:44:55] 10SRE, 10Patch-For-Review: Migrate puppetboard to Bullseye - https://phabricator.wikimedia.org/T264276 (10jbond) 05In progress→03Resolved Ill close this and haneld the rest of the deom in https://phabricator.wikimedia.org/T296744 [14:50:08] (03PS1) 10Jbond: O:puppetboard: drop ol puppetboard code [puppet] - 10https://gerrit.wikimedia.org/r/742733 (https://phabricator.wikimedia.org/T296744) [14:50:38] (03CR) 10Jforrester: "https://wikifunctions.beta.wmflabs.org/ is responding with "Invalid host name (wikifunctions.beta.wmflabs.org)." which I believe is coming" [puppet] - 10https://gerrit.wikimedia.org/r/714068 (https://phabricator.wikimedia.org/T284162) (owner: 10Jforrester) [14:52:23] (03CR) 10Ottomata: [C: 03+1] Add /srv/spark-tmp to the list of allowed read-write paths [puppet] - 10https://gerrit.wikimedia.org/r/742732 (https://phabricator.wikimedia.org/T295346) (owner: 10Btullis) [14:52:49] (03PS1) 10Milimetric: analytics/systemd/sqoop: Force daily sqoop to overwrite [puppet] - 10https://gerrit.wikimedia.org/r/742734 [14:52:53] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32741/console" [puppet] - 10https://gerrit.wikimedia.org/r/742733 (https://phabricator.wikimedia.org/T296744) (owner: 10Jbond) [14:53:13] (03CR) 10Jbond: [V: 03+1 C: 03+2] O:puppetboard: drop ol puppetboard code [puppet] - 10https://gerrit.wikimedia.org/r/742733 (https://phabricator.wikimedia.org/T296744) (owner: 10Jbond) [14:58:14] (03PS1) 10Jbond: O:puppetboard::ng: rename to
puppetboard [puppet] - 10https://gerrit.wikimedia.org/r/742735 (https://phabricator.wikimedia.org/T296744) [14:58:33] 10Puppet, 10Infrastructure-Foundations, 10User-jbond: Upgrade puppetboard to the latest version - https://phabricator.wikimedia.org/T292522 (10jbond) [14:58:39] 10Puppet, 10Infrastructure-Foundations, 10Patch-For-Review, 10User-jbond: Decommission puppetboard[12]001 - https://phabricator.wikimedia.org/T296744 (10jbond) 05Openβ†’03In progress [14:59:53] (03PS2) 10Jbond: O:puppetboard::ng: rename to puppetboard [puppet] - 10https://gerrit.wikimedia.org/r/742735 (https://phabricator.wikimedia.org/T296744) [15:00:23] PROBLEM - Keyholder SSH agent on deploy1002 is CRITICAL: CRITICAL: Keyholder is not armed. Run keyholder arm to arm it. https://wikitech.wikimedia.org/wiki/Keyholder [15:00:33] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32743/console" [puppet] - 10https://gerrit.wikimedia.org/r/742735 (https://phabricator.wikimedia.org/T296744) (owner: 10Jbond) [15:01:13] (03PS1) 10Majavah: multiversion: Add wikifunctions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/742736 [15:01:20] James_F: ^ [15:01:49] PROBLEM - Keyholder SSH agent on deploy2002 is CRITICAL: CRITICAL: Keyholder is not armed. Run keyholder arm to arm it. https://wikitech.wikimedia.org/wiki/Keyholder [15:01:59] majavah: Oh, right, that's needed now for Beta Cluster. [15:02:04] Duh, sorry. [15:02:07] jouncebot: now [15:02:08] No deployments scheduled for the next 1 hour(s) and 57 minute(s) [15:02:18] (03CR) 10Jbond: [V: 03+1 C: 03+2] O:puppetboard::ng: rename to puppetboard [puppet] - 10https://gerrit.wikimedia.org/r/742735 (https://phabricator.wikimedia.org/T296744) (owner: 10Jbond) [15:02:21] I'm going to sling that out unless someone shouts? 
[15:02:36] James_F: keyholder is not armed at deploy1002 according to icinga [15:02:45] umh [15:02:46] so likely scap won't work [15:02:52] what's going on with keyholder? [15:02:53] Eurgh, yeah, I'll wait then. [15:03:30] did any of you test it actually broke and is not just a monitoring bug? [15:03:37] let me try [15:04:03] I don't see any related puppet changes [15:04:07] 10Puppet, 10Infrastructure-Foundations, 10Patch-For-Review, 10User-jbond: Decommission puppetboard[12]001 - https://phabricator.wikimedia.org/T296744 (10jbond) [15:04:19] what should i see in keyholder status when it's not armed? [15:04:21] i see a list of keys... [15:04:32] under keyholder-proxy: active, and none under keyholder-agent: active [15:04:42] hmm [15:04:45] (03Merged) 10jenkins-bot: Branch commit for wmf/1.38.0-wmf.11 [core] (wmf/1.38.0-wmf.11) - 10https://gerrit.wikimedia.org/r/742582 (owner: 10TrainBranchBot) [15:05:07] and `SSH_AUTH_SOCK=/run/keyholder/proxy.sock ssh -oIdentitiesOnly=yes -oIdentityFile=/etc/keyholder.d/mwdeploy mwdeploy@mwdebug1001` does sth [15:05:16] so...it might be a monitoring issue? [15:05:26] if that works, it should be fine then [15:05:31] Let's try? [15:05:41] +1 for trying from me [15:06:08] majavah: Oh, wait. wikifunctionswiki is a 'special' like wikidata; it won't need a suffix, surely? [15:06:44] So it'll need the static mapping to wikifunctions[wiki] but not the first part? [15:06:55] (03PS2) 10Majavah: multiversion: Add wikifunctions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/742736 [15:07:01] umm, good point [15:07:05] Code review by IRC. ;-) [15:07:30] I'm guessing it'll use the wiki suffix like wikidata / meta / mw.o / other specials [15:07:36] Yeah,. 
[15:07:52] (03PS1) 10Jbond: O:deployment_server: drop puppetboard/deploy [puppet] - 10https://gerrit.wikimedia.org/r/742737 (https://phabricator.wikimedia.org/T296744) [15:08:07] (03CR) 10Jforrester: [C: 03+2] multiversion: Add wikifunctions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/742736 (owner: 10Majavah) [15:08:19] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [15:08:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:08:30] (03CR) 10Jbond: [V: 03+2 C: 03+2] O:deployment_server: drop puppetboard/deploy [puppet] - 10https://gerrit.wikimedia.org/r/742737 (https://phabricator.wikimedia.org/T296744) (owner: 10Jbond) [15:09:03] (03Merged) 10jenkins-bot: multiversion: Add wikifunctions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/742736 (owner: 10Majavah) [15:09:14] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [15:09:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:11:31] (03PS2) 10Majavah: devtools: setup doc1002 like doc [puppet] - 10https://gerrit.wikimedia.org/r/742078 [15:12:08] !log jforrester@deploy1002 Synchronized multiversion/MWMultiVersion.php: Add wikifunctions hard-coded value to setSiteInfoForWiki for Beta Cluster T284162 (duration: 00m 56s) [15:12:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:12] T284162: Create a Beta Cluster version of Wikifunctions.org - https://phabricator.wikimedia.org/T284162 [15:12:21] majavah: Thanks, now on to the next bug. 
:_) [15:13:10] (03PS1) 10Herron: admin: add tillmletzko-wmde to analytics-privateata-users [puppet] - 10https://gerrit.wikimedia.org/r/742738 (https://phabricator.wikimedia.org/T296634) [15:13:12] (03PS1) 10Herron: admin: add janjaquemot to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/742739 (https://phabricator.wikimedia.org/T296633) [15:14:28] (03CR) 10jerkins-bot: [V: 04-1] admin: add janjaquemot to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/742739 (https://phabricator.wikimedia.org/T296633) (owner: 10Herron) [15:14:51] (03PS1) 10Jforrester: Add WikiLambda to i18n extension list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/742740 (https://phabricator.wikimedia.org/T284162) [15:14:53] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users for @tmletzko - https://phabricator.wikimedia.org/T296634 (10herron) [15:15:09] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [15:15:10] (03CR) 10Jforrester: [C: 03+2] "Oops." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/742740 (https://phabricator.wikimedia.org/T284162) (owner: 10Jforrester) [15:15:11] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users for @JanJaquemot - https://phabricator.wikimedia.org/T296633 (10herron) [15:15:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:52] (03Merged) 10jenkins-bot: Add WikiLambda to i18n extension list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/742740 (https://phabricator.wikimedia.org/T284162) (owner: 10Jforrester) [15:16:11] (03PS2) 10Herron: admin: add janjaquemot to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/742739 (https://phabricator.wikimedia.org/T296633) [15:16:18] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [15:16:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:18:15] RECOVERY - Keyholder SSH agent on deploy1002 is OK: OK: Keyholder is armed with all configured keys. https://wikitech.wikimedia.org/wiki/Keyholder [15:18:27] RECOVERY - Keyholder SSH agent on deploy2002 is OK: OK: Keyholder is armed with all configured keys. https://wikitech.wikimedia.org/wiki/Keyholder [15:18:43] majavah: ^^ should be good now [15:19:38] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Thanks for this! I 've got a comment inline about instead of removing the dash and changing the indentation to instead use nindent which s" [deployment-charts] - 10https://gerrit.wikimedia.org/r/742166 (owner: 10Jelto) [15:19:56] (03PS2) 10Ayounsi: Pmacct add sflow listener [puppet] - 10https://gerrit.wikimedia.org/r/742110 (https://phabricator.wikimedia.org/T263277) [15:22:25] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[15:22:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:22:51] (03PS3) 10Elukey: Move coal, navtiming and statsv to the new canonical CA bundle path [puppet] - 10https://gerrit.wikimedia.org/r/742674 (https://phabricator.wikimedia.org/T296089) [15:22:53] (03PS2) 10Elukey: profile::kafka::broker: use new get ca bundle path helpers [puppet] - 10https://gerrit.wikimedia.org/r/742725 (https://phabricator.wikimedia.org/T296089) [15:22:55] (03PS3) 10Elukey: presto: move truststore to the new wmf internal CA bundle [puppet] - 10https://gerrit.wikimedia.org/r/739477 [15:23:31] (03CR) 10Elukey: "Updated the code to the latest version of the bundle :)" [puppet] - 10https://gerrit.wikimedia.org/r/739477 (owner: 10Elukey) [15:23:34] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [15:23:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:24:10] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32744/console" [puppet] - 10https://gerrit.wikimedia.org/r/739477 (owner: 10Elukey) [15:24:28] (03PS1) 10Herron: admin: add mmartorana to deployment and analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/742743 (https://phabricator.wikimedia.org/T295790) [15:25:07] 10SRE, 10SRE-Access-Requests, 10Security-Team, 10Patch-For-Review, 10SecTeam-Processed: Add Manfredi Martorana to deployment and analytics-privatedata-users groups - https://phabricator.wikimedia.org/T295790 (10herron) [15:25:19] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: / (spec from root) is CRITICAL: Test spec from root returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Citoid [15:27:25] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [15:29:41] !log 
mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [15:29:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:32:04] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [15:32:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:32:25] (03PS1) 10Elukey: varnishkafka: use new ca bundle instead of the Puppet one [puppet] - 10https://gerrit.wikimedia.org/r/742747 (https://phabricator.wikimedia.org/T296064) [15:32:38] 10SRE, 10SRE-Access-Requests, 10WMF-NDA-Requests: Add EJoseph to #wmf-nda - https://phabricator.wikimedia.org/T293326 (10herron) Friendly ping @EJoseph cc @Gehel [15:36:40] (03PS2) 10Elukey: varnishkafka: use new ca bundle instead of the Puppet one [puppet] - 10https://gerrit.wikimedia.org/r/742747 (https://phabricator.wikimedia.org/T296064) [15:37:19] !log jbond@cumin1001 START - Cookbook sre.hosts.decommission for hosts puppetboard2001.codfw.wmnet [15:37:19] PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is CRITICAL: 59.91 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [15:37:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:37:45] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32746/console" [puppet] - 10https://gerrit.wikimedia.org/r/742747 (https://phabricator.wikimedia.org/T296064) (owner: 10Elukey) [15:38:13] !log jbond@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=99) for hosts puppetboard2001.codfw.wmnet [15:38:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:38:19] (03CR) 10SBassett: [C: 03+1] admin: add mmartorana to deployment and 
analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/742743 (https://phabricator.wikimedia.org/T295790) (owner: 10Herron) [15:40:46] 10SRE, 10Analytics-Radar, 10Data-Engineering, 10Event-Platform, 10Patch-For-Review: Allow kafka clients to verify brokers hostnames when using SSL - https://phabricator.wikimedia.org/T291905 (10elukey) [15:41:07] !log jbond@cumin1001 START - Cookbook sre.hosts.decommission for hosts puppetboard2001.codfw.wmnet [15:41:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:41:30] !log jbond@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=99) for hosts puppetboard2001.codfw.wmnet [15:41:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:41:57] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 233, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:42:33] (03PS1) 10Vgutierrez: cache::haproxy: Avoid using lua [puppet] - 10https://gerrit.wikimedia.org/r/742749 (https://phabricator.wikimedia.org/T290005) [15:42:45] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 81, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:43:06] (03PS1) 10Jforrester: beta: Drop deployment-deploy01 references from dsh/scap, being decom'ed [puppet] - 10https://gerrit.wikimedia.org/r/742750 (https://phabricator.wikimedia.org/T278689) [15:43:50] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32747/console" [puppet] - 10https://gerrit.wikimedia.org/r/742749 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [15:44:42] (03PS1) 10Herron: admin: add dbad2021 to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/742751 
(https://phabricator.wikimedia.org/T293253) [15:45:25] (03CR) 10Jbond: [C: 03+2] Revert "Revert "Revert "Revert "mx2001: disable ldap validation"""" [puppet] - 10https://gerrit.wikimedia.org/r/739826 (owner: 10Jbond) [15:45:44] ah, the git definition of decisiveness. ;) [15:45:45] RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is OK: (C)60 le (W)70 le 79.35 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [15:46:44] lol [15:47:19] I did a search for long “revert:” chains when GitHub released their commit search feature, some pretty funny results in there [15:49:13] (03PS1) 10Elukey: netflow: move kafka config to new CA bundle [puppet] - 10https://gerrit.wikimedia.org/r/742753 (https://phabricator.wikimedia.org/T296064) [15:50:24] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32748/console" [puppet] - 10https://gerrit.wikimedia.org/r/742753 (https://phabricator.wikimedia.org/T296064) (owner: 10Elukey) [15:52:04] (03PS2) 10Elukey: netflow: move kafka config to new CA bundle [puppet] - 10https://gerrit.wikimedia.org/r/742753 (https://phabricator.wikimedia.org/T296064) [15:52:48] Lucas_WMDE: :D [15:52:50] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32749/console" [puppet] - 10https://gerrit.wikimedia.org/r/742753 (https://phabricator.wikimedia.org/T296064) (owner: 10Elukey) [15:56:10] (03CR) 10Ebernhardson: [C: 03+1] Add profile::base::linux419 to the WCQS role [puppet] - 10https://gerrit.wikimedia.org/r/742729 (https://phabricator.wikimedia.org/T294961) (owner: 10Muehlenhoff) [15:56:24] (03PS1) 10Muehlenhoff: Revert "Prefer mx1001 over mx2001 for weights in MX records" [dns] - 10https://gerrit.wikimedia.org/r/742754 [15:57:15] (03CR) 10Jbond: [C: 03+1] "LGTM" 
[puppet] - 10https://gerrit.wikimedia.org/r/739477 (owner: 10Elukey) [15:57:21] (03CR) 10Jbond: [C: 03+1] varnishkafka: use new ca bundle instead of the Puppet one [puppet] - 10https://gerrit.wikimedia.org/r/742747 (https://phabricator.wikimedia.org/T296064) (owner: 10Elukey) [15:58:19] (03CR) 10Filippo Giunchedi: [C: 03+1] P:rsyslog::kafka_shipper: move Kafka TLS CA settings to the new bundle [puppet] - 10https://gerrit.wikimedia.org/r/739463 (https://phabricator.wikimedia.org/T291905) (owner: 10Elukey) [15:58:53] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/742753 (https://phabricator.wikimedia.org/T296064) (owner: 10Elukey) [15:59:07] (03PS1) 10Jforrester: Add initial namespace aliases for Wikifunctions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/742756 (https://phabricator.wikimedia.org/T284162) [15:59:09] (03PS1) 10Muehlenhoff: Revert "Prefer mx1001 over mx2001 for smart hosts / wiki mail" [puppet] - 10https://gerrit.wikimedia.org/r/742757 [16:00:12] (03CR) 10jerkins-bot: [V: 04-1] Add initial namespace aliases for Wikifunctions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/742756 (https://phabricator.wikimedia.org/T284162) (owner: 10Jforrester) [16:00:15] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/742738 (https://phabricator.wikimedia.org/T296634) (owner: 10Herron) [16:00:31] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/742739 (https://phabricator.wikimedia.org/T296633) (owner: 10Herron) [16:01:33] !log lvs2007 - depooling for network maint - do not push LVS config changes please! 
[16:01:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:01:48] (03CR) 10Jbond: admin: add mmartorana to deployment and analytics-privatedata-users (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/742743 (https://phabricator.wikimedia.org/T295790) (owner: 10Herron) [16:03:01] (03CR) 10Jbond: admin: add dbad2021 to analytics-privatedata-users (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/742751 (https://phabricator.wikimedia.org/T293253) (owner: 10Herron) [16:03:58] PROBLEM - Check systemd state on ms-be2059 is CRITICAL: CRITICAL - degraded: The following units failed: swift-drive-audit.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:04:05] (expect some lvs2007 alerts here while a cable is being replaced) [16:04:45] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: hw troubleshooting: disk failure (sdr) for ms-be2059.codfw.wmnet - https://phabricator.wikimedia.org/T295563 (10Papaul) 05Open→03Resolved @MatthewVernon disk replaced [16:05:00] PROBLEM - PyBal backends health check on lvs2007 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal [16:05:00] (03CR) 10Jbond: [C: 03+1] "lgtm, but hold off untill weekday 01/12/2021 as agreed on irvc" [dns] - 10https://gerrit.wikimedia.org/r/742754 (owner: 10Muehlenhoff) [16:05:46] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:06:00] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/742738 (https://phabricator.wikimedia.org/T296634) (owner: 10Herron) [16:06:19] (03CR) 10Jbond: [C: 03+1] Revert "Prefer mx1001 over mx2001 for smart hosts / wiki mail" [puppet] - 10https://gerrit.wikimedia.org/r/742757 (owner: 10Muehlenhoff) [16:06:49] !log lvs2007 - repooling into service [16:06:52] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:07:13] 10SRE, 10Patch-For-Review: Unify WMF internal CA certs bundle generation - https://phabricator.wikimedia.org/T296089 (10elukey) 05Open→03Resolved a:03elukey Summary: The `profile::base::certificates` code is now able to work transparently for Pontoon/Deployment-Prep/Production: * in production, the pro... [16:07:16] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 108, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:07:49] !log jynus@cumin1001 dbctl commit (dc=all): 'Depool db1163 at 50%', diff saved to https://phabricator.wikimedia.org/P17905 and previous config saved to /var/cache/conftool/dbconfig/20211130-160748-jynus.json [16:07:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:08:04] RECOVERY - PyBal backends health check on lvs2007 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [16:09:21] 10SRE, 10Patch-For-Review, 10Release-Engineering-Team (Radar): Requesting exec access to pods in 'ci' namespace staging kubernetes - https://phabricator.wikimedia.org/T290360 (10herron) 05Open→03Stalled This doesn't appear to be immediately actionable in terms of SRE clinic duty workflow. I'm going to r... 
[16:09:31] 10SRE, 10MW-on-K8s, 10Patch-For-Review, 10Release-Engineering-Team (Doing): Automated validation of mediawiki-multiversion images - https://phabricator.wikimedia.org/T288629 (10herron) [16:13:20] !log cr2-codfw bounce fpc 1 pic 0 (vrrp backup) - T289241 [16:13:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:14:49] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/742739 (https://phabricator.wikimedia.org/T296633) (owner: 10Herron) [16:14:58] !log jynus@cumin1001 dbctl commit (dc=all): 'Depool db1163 at 25%', diff saved to https://phabricator.wikimedia.org/P17906 and previous config saved to /var/cache/conftool/dbconfig/20211130-161457-jynus.json [16:15:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:17:56] (03PS1) 10Herron: admin: create ldap_only entry for user jsn [puppet] - 10https://gerrit.wikimedia.org/r/742759 (https://phabricator.wikimedia.org/T296654) [16:18:30] (03CR) 10Jelto: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/742458 (https://phabricator.wikimedia.org/T295481) (owner: 10Jelto) [16:18:55] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence-Backup, 10database-backups: hw troubleshooting: memory stick failure (uncorrectable error + reduced available memory) for db1102 - https://phabricator.wikimedia.org/T296546 (10RobH) a:05wiki_willy→03Cmjohnson >>! In T296546#7537023, @jcrespo wrote: >>... 
[16:19:06] PROBLEM - Disk space on ms-be2059 is CRITICAL: DISK CRITICAL - /srv/swift-storage/sdr1 is not accessible: Input/output error https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ms-be2059&var-datasource=codfw+prometheus/ops [16:20:05] !log reboot ms-be2059 to fix device enumeration order re T295563 [16:20:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:20:10] T295563: hw troubleshooting: disk failure (sdr) for ms-be2059.codfw.wmnet - https://phabricator.wikimedia.org/T295563 [16:23:08] !log Move cr2-codfw pfw3 link to BO cable - T289241 [16:23:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:24:43] (03CR) 10Milimetric: [C: 03+1] "approval from the folks looking at this data, so it can be merged" [puppet] - 10https://gerrit.wikimedia.org/r/742734 (owner: 10Milimetric) [16:24:51] 10SRE, 10LDAP-Access-Requests: Add Ollie Shotton to the ldap/wmde and ldap/nda group - https://phabricator.wikimedia.org/T296715 (10herron) Thanks @WMDE-leszek I see @KFrancis is already subscribed, great! Next step will be confirming that we have the completed NDA on file, then we'll be ready to create the... [16:25:21] 10SRE, 10LDAP-Access-Requests: Add Ollie Shotton to the ldap/wmde and ldap/nda group - https://phabricator.wikimedia.org/T296715 (10herron) [16:26:29] !log Move cr2-codfw eqord link to BO cable - T289241 [16:26:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:27:52] (03PS1) 10Kormat: wmfdb: Add base exceptions, and basic logging. 
[software/wmfdb] - 10https://gerrit.wikimedia.org/r/742760 [16:28:16] RECOVERY - Check systemd state on ms-be2059 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:29:22] !log Move cr2-codfw lumen transit link to BO cable - T289241 [16:29:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:29:27] (03PS2) 10Herron: admin: add mmartorana to deployment and analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/742743 (https://phabricator.wikimedia.org/T295790) [16:29:44] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence-Backup, 10database-backups: hw troubleshooting: memory stick failure (uncorrectable error + reduced available memory) for db1102 - https://phabricator.wikimedia.org/T296546 (10jcrespo) > My only concern is it may want the memory to be mirrored Indeed. I... [16:30:42] (03PS2) 10Herron: admin: add dbad2021 to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/742751 (https://phabricator.wikimedia.org/T293253) [16:31:09] (03CR) 10Herron: admin: add dbad2021 to analytics-privatedata-users (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/742751 (https://phabricator.wikimedia.org/T293253) (owner: 10Herron) [16:31:47] (03CR) 10Herron: admin: add mmartorana to deployment and analytics-privatedata-users (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/742743 (https://phabricator.wikimedia.org/T295790) (owner: 10Herron) [16:33:49] (03CR) 10Herron: [C: 03+2] admin: add tillmletzko-wmde to analytics-privateata-users [puppet] - 10https://gerrit.wikimedia.org/r/742738 (https://phabricator.wikimedia.org/T296634) (owner: 10Herron) [16:34:42] (03PS3) 10Herron: admin: add janjaquemot to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/742739 (https://phabricator.wikimedia.org/T296633) [16:36:00] (03CR) 10Herron: [C: 03+2] admin: add janjaquemot to analytics-privatedata-users [puppet] - 
10https://gerrit.wikimedia.org/r/742739 (https://phabricator.wikimedia.org/T296633) (owner: 10Herron) [16:38:23] 10SRE-swift-storage: Storage request for datasets published by research team - https://phabricator.wikimedia.org/T294380 (10fkaelin) Thank you for the updates. I am not familiar with the additional steps required to activate/use the credentials in puppet. Would you be able to provide a more concrete pointer/li... [16:40:12] RECOVERY - Disk space on ms-be2059 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ms-be2059&var-datasource=codfw+prometheus/ops [16:41:44] (03CR) 10Ottomata: [C: 03+2] analytics/systemd/sqoop: Force daily sqoop to overwrite [puppet] - 10https://gerrit.wikimedia.org/r/742734 (owner: 10Milimetric) [16:43:52] (03CR) 10Kormat: [V: 03+2 C: 03+2] wmfdb: Add base exceptions, and basic logging. [software/wmfdb] - 10https://gerrit.wikimedia.org/r/742760 (owner: 10Kormat) [16:45:52] (03PS2) 10Jforrester: [Beta Cluster] Add initial namespace aliases for Wikifunctions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/742756 (https://phabricator.wikimedia.org/T284162) [16:47:34] (03CR) 10Majavah: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/741713 (https://phabricator.wikimedia.org/T247653) (owner: 10Majavah) [16:50:49] (03PS1) 10Eigyan: WIP: Deploy GDI sur=vey to cawiki and fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/742763 (https://phabricator.wikimedia.org/T296486) [16:51:45] (03CR) 10jerkins-bot: [V: 04-1] WIP: Deploy GDI sur=vey to cawiki and fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/742763 (https://phabricator.wikimedia.org/T296486) (owner: 10Eigyan) [16:53:08] (03CR) 10Hnowlan: "This doesn't currently work in envoy due to yaml-cpp not supporting anchor merging 😞 https://github.com/envoyproxy/envoy/issues/12926" [deployment-charts] - 10https://gerrit.wikimedia.org/r/742715 
(https://phabricator.wikimedia.org/T294445) (owner: 10Hnowlan) [16:53:10] (03PS2) 10Eigyan: WIP: Deploy GDI survey to cawiki and fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/742763 (https://phabricator.wikimedia.org/T296486) [16:53:22] PROBLEM - very high load average likely xfs on ms-be2059 is CRITICAL: CRITICAL - load average: 145.55, 143.75, 115.62 https://wikitech.wikimedia.org/wiki/Swift [16:54:06] (03CR) 10jerkins-bot: [V: 04-1] WIP: Deploy GDI survey to cawiki and fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/742763 (https://phabricator.wikimedia.org/T296486) (owner: 10Eigyan) [16:54:08] (03PS3) 10Hnowlan: api-gateway: Create read and write clusters for mw and discovery APIs [deployment-charts] - 10https://gerrit.wikimedia.org/r/742715 (https://phabricator.wikimedia.org/T294445) [16:54:28] 10SRE-swift-storage: Storage request for datasets published by research team - https://phabricator.wikimedia.org/T294380 (10Ottomata) Here's how we do it for the analytics_admin swift account: https://github.com/wikimedia/puppet/blob/production/modules/profile/manifests/analytics/cluster/secrets.pp#L55-L64 Wher... [16:57:19] !log jynus@cumin1001 dbctl commit (dc=all): 'Depool db1163 fully', diff saved to https://phabricator.wikimedia.org/P17907 and previous config saved to /var/cache/conftool/dbconfig/20211130-165718-jynus.json [16:57:19] (03PS1) 10Hnowlan: api-gateway: add default routes list to devel settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/742765 [16:57:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:00:05] jbond and rzl: My dear minions, it's time we take the moon! Just kidding. Time for Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211130T1700). [17:00:05] majavah: A patch you scheduled for Puppet request window is about to be deployed. Please be around during the process. 
Note: If you break AND fix the wikis, you will be rewarded with a sticker. [17:00:07] !log move db1139:s1 under db1118 [17:00:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:00:21] hey, I'm here [17:00:24] majavah: 👋 looking [17:01:19] (03CR) 10Jbond: [C: 03+1] devtools: setup doc1002 like doc [puppet] - 10https://gerrit.wikimedia.org/r/742078 (owner: 10Majavah) [17:01:24] I think that worked [17:01:32] see no errors, etc [17:01:45] (03CR) 10Jbond: [C: 03+1] P::doc: sync data to non-active servers [puppet] - 10https://gerrit.wikimedia.org/r/741713 (https://phabricator.wikimedia.org/T247653) (owner: 10Majavah) [17:01:47] (03CR) 10Dave Pifke: "Tested in deployment-prep, LGTM." [puppet] - 10https://gerrit.wikimedia.org/r/742674 (https://phabricator.wikimedia.org/T296089) (owner: 10Elukey) [17:02:13] (03CR) 10Hnowlan: [C: 03+2] api-gateway: add default routes list to devel settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/742765 (owner: 10Hnowlan) [17:04:20] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/742240 (https://phabricator.wikimedia.org/T263830) (owner: 10Majavah) [17:05:23] rzl: sry didn;t see yuor response, anyway all look good to me [17:05:54] jbond: no I appreciate it! I was going to be less confident with the subject area, but if you're happy I'm happy :) [17:06:04] want to merge or shall I? 
[17:06:05] if it helps, I've tested https://gerrit.wikimedia.org/r/c/operations/puppet/+/741713 on the devtools cloud vps project and it worked fine [17:06:23] (03Merged) 10jenkins-bot: api-gateway: add default routes list to devel settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/742765 (owner: 10Hnowlan) [17:06:28] i can merge, one sec [17:06:31] (03CR) 10Jbond: [C: 03+2] devtools: setup doc1002 like doc [puppet] - 10https://gerrit.wikimedia.org/r/742078 (owner: 10Majavah) [17:06:34] (03CR) 10Jbond: [C: 03+2] P::doc: sync data to non-active servers [puppet] - 10https://gerrit.wikimedia.org/r/741713 (https://phabricator.wikimedia.org/T247653) (owner: 10Majavah) [17:06:36] (03CR) 10Jbond: [C: 03+2] trafficserver: Enable tls on integration.wm.o [puppet] - 10https://gerrit.wikimedia.org/r/742240 (https://phabricator.wikimedia.org/T263830) (owner: 10Majavah) [17:07:09] majavah: fyi feel free tojust ad me one pupepet CR's [17:07:29] cool sure [17:08:05] can I ask you to check that the rsync from doc1001 to 1002/2001 starts fine? 
[17:08:30] mergedyes will run puppet there now [17:08:39] fyi i have allready ran on cp1075 and looks good [17:11:33] majavah: looks like its working fine and syncing as we speak [17:11:45] awesome [17:13:28] 10SRE, 10Traffic-Icebox, 10HTTPS: HTTPS for internal service traffic - https://phabricator.wikimedia.org/T108580 (10Majavah) [17:13:34] 10SRE, 10Continuous-Integration-Infrastructure, 10Traffic-Icebox, 10HTTPS: contint.wikimedia.org: add TLS termination - https://phabricator.wikimedia.org/T263830 (10Majavah) 05Open→03Resolved a:03Majavah [17:13:40] 10SRE, 10Traffic-Icebox, 10serviceops, 10Patch-For-Review: Applayer services without TLS - https://phabricator.wikimedia.org/T210411 (10Majavah) [17:13:48] (03PS3) 10Eigyan: WIP: Deploy GDI survey to cawiki and fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/742763 (https://phabricator.wikimedia.org/T296486) [17:15:50] !log jynus@cumin1001 dbctl commit (dc=all): 'Repool db1163 at 5%', diff saved to https://phabricator.wikimedia.org/P17908 and previous config saved to /var/cache/conftool/dbconfig/20211130-171550-jynus.json [17:15:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:21:20] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, see" [puppet] - 10https://gerrit.wikimedia.org/r/742751 (https://phabricator.wikimedia.org/T293253) (owner: 10Herron) [17:23:12] PROBLEM - etcd request latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 operation={list,listWithCount} https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [17:25:14] PROBLEM - k8s API server requests latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [17:26:42] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, see" [puppet] - 10https://gerrit.wikimedia.org/r/742743 
(https://phabricator.wikimedia.org/T295790) (owner: 10Herron) [17:28:07] (03PS1) 10Ottomata: refine - bump jar version to get deduplication logic fix [puppet] - 10https://gerrit.wikimedia.org/r/742770 (https://phabricator.wikimedia.org/T294361) [17:28:36] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=update https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [17:29:01] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/742759 (https://phabricator.wikimedia.org/T296654) (owner: 10Herron) [17:29:28] (03CR) 10Razzi: [C: 03+2] superset: set webserver timeout to 180 seconds [puppet] - 10https://gerrit.wikimedia.org/r/740712 (https://phabricator.wikimedia.org/T294771) (owner: 10Razzi) [17:29:38] RECOVERY - k8s API server requests latencies on kubemaster1002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [17:30:48] RECOVERY - etcd request latencies on kubemaster1002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [17:30:53] 10SRE, 10Analytics-Radar, 10Data-Engineering, 10Event-Platform, 10Patch-For-Review: Allow kafka clients to verify brokers hostnames when using SSL - https://phabricator.wikimedia.org/T291905 (10elukey) [17:31:59] RECOVERY - etcd request latencies on kubemaster1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [17:33:38] (03CR) 10Jelto: "I would like to merge I1327c4a853a99776f65e5d74ae5d8b8774d8bfa0 first to have secure runner settings for production" [puppet] - 10https://gerrit.wikimedia.org/r/740691 (https://phabricator.wikimedia.org/T295481) (owner: 10Dzahn) [17:34:46] !log installing libvorbis security updates [17:34:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:35:18] !log jynus@cumin1001 dbctl commit (dc=all): 'Repool db1163 at 25%', diff saved to https://phabricator.wikimedia.org/P17910 and previous config saved to /var/cache/conftool/dbconfig/20211130-173517-jynus.json [17:35:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:38:42] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:39:35] !log jynus@cumin1001 dbctl commit (dc=all): 'Repool db1163 at 50%', diff saved to https://phabricator.wikimedia.org/P17911 and previous config saved to /var/cache/conftool/dbconfig/20211130-173935-jynus.json [17:39:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:43:08] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:44:21] (03CR) 10Btullis: [C: 03+2] Add /srv/spark-tmp to the list of allowed read-write paths [puppet] - 10https://gerrit.wikimedia.org/r/742732 (https://phabricator.wikimedia.org/T295346) (owner: 10Btullis) [17:44:35] !log jynus@cumin1001 dbctl commit (dc=all): 'Repool db1163 fully', diff saved to https://phabricator.wikimedia.org/P17912 and previous config saved to /var/cache/conftool/dbconfig/20211130-174434-jynus.json [17:44:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:46:41] (03CR) 10Ottomata: [C: 03+2] refine - bump jar version to get deduplication logic fix [puppet] - 10https://gerrit.wikimedia.org/r/742770 (https://phabricator.wikimedia.org/T294361) (owner: 10Ottomata) [17:50:51] (03PS1) 10Ahmon Dancy: mediawiki: Add additional php settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/742773 [17:51:02] (03PS2) 10Vgutierrez: cache::haproxy: Avoid using lua [puppet] - 10https://gerrit.wikimedia.org/r/742749 (https://phabricator.wikimedia.org/T290005) [17:54:17] (03CR) 10Vgutierrez: [C: 03+2] cache::haproxy: Avoid using lua [puppet] - 10https://gerrit.wikimedia.org/r/742749 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [17:56:51] (03PS1) 10Ayounsi: Revert "Turn on prepending for esams and eqiad" [homer/public] - 10https://gerrit.wikimedia.org/r/742563 [17:58:00] (03CR) 10CDanis: [C: 03+1] Revert "Turn on prepending for esams and eqiad" [homer/public] - 10https://gerrit.wikimedia.org/r/742563 (owner: 10Ayounsi) [17:58:13] (03CR) 10Ayounsi: [C: 03+2] Revert "Turn on prepending for esams and eqiad" [homer/public] - 10https://gerrit.wikimedia.org/r/742563 (owner: 10Ayounsi) [17:58:57] (03Merged) 10jenkins-bot: Revert "Turn on prepending for esams and eqiad" [homer/public] - 10https://gerrit.wikimedia.org/r/742563 (owner: 10Ayounsi) [17:59:29] (03PS2) 
10Hnowlan: api-gateway: allow discovery services to set custom rate limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/741937 (https://phabricator.wikimedia.org/T295956) [18:00:05] chrisalbon and accraze: May I have your attention please! Services – Graphoid / ORES. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211130T1800) [18:00:17] (03CR) 10Hnowlan: "This change is ready for review, assuming the ratelimit reads multiple YAML documents as per the standard" [deployment-charts] - 10https://gerrit.wikimedia.org/r/741937 (https://phabricator.wikimedia.org/T295956) (owner: 10Hnowlan) [18:02:50] (03CR) 10RLazarus: [C: 03+2] Packaging fixes: [docker-images/imagecatalog] - 10https://gerrit.wikimedia.org/r/742280 (owner: 10RLazarus) [18:03:34] 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (Kanban): cloud: decide on general idea for having cloud-dedicated hardware provide service in the cloud realm & the internet - https://phabricator.wikimedia.org/T296411 (10aborrero) I did not forget this task, but have been busy the last... [18:04:15] 10SRE, 10Data-Engineering, 10Infrastructure-Foundations, 10Traffic-Icebox, and 2 others: Collect netflow data for internal traffic - https://phabricator.wikimedia.org/T263277 (10ayounsi) @BTullis thanks! Real-time, would be a nice plus, but a hard requirement (unlike netflow). @cmooney [[ https://gerrit.w... 
[18:05:34] (03Merged) 10jenkins-bot: Packaging fixes: [docker-images/imagecatalog] - 10https://gerrit.wikimedia.org/r/742280 (owner: 10RLazarus) [18:08:39] !log restart haproxy on cp3064 - T290005 [18:08:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:08:43] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [18:09:34] !log uploaded php-yaml for component/php72 (T296331) [18:09:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:09:38] T296331: Install php-yaml for use by SettingsLoader - https://phabricator.wikimedia.org/T296331 [18:09:42] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:09:45] (03CR) 10Btullis: Pmacct add sflow listener (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/742110 (https://phabricator.wikimedia.org/T263277) (owner: 10Ayounsi) [18:12:36] (03PS3) 10Ayounsi: Pmacct add sflow listener [puppet] - 10https://gerrit.wikimedia.org/r/742110 (https://phabricator.wikimedia.org/T263277) [18:13:17] (03CR) 10Ayounsi: Pmacct add sflow listener (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/742110 (https://phabricator.wikimedia.org/T263277) (owner: 10Ayounsi) [18:15:32] (03CR) 10Eigyan: "-1 to hold the horses" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/742763 (https://phabricator.wikimedia.org/T296486) (owner: 10Eigyan) [18:21:41] (03CR) 10Ppchelko: [C: 04-1] api-gateway: allow discovery services to set custom rate limits (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/741937 (https://phabricator.wikimedia.org/T295956) (owner: 10Hnowlan) [18:25:14] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:42:00] (03PS1) 10Cwhite: site: consolidate logstash node definitions [puppet] - 10https://gerrit.wikimedia.org/r/742778 (https://phabricator.wikimedia.org/T288621) [18:42:02] (03PS1) 10Cwhite: site: reprovision codfw logging cluster to opensearch [puppet] - 10https://gerrit.wikimedia.org/r/742779 (https://phabricator.wikimedia.org/T288621) [18:42:04] (03PS1) 10Cwhite: hiera: add opensearch production configuration [puppet] - 10https://gerrit.wikimedia.org/r/742780 (https://phabricator.wikimedia.org/T288621) [18:42:06] (03PS1) 10Cwhite: opensearch-dashboards: enable phatality on beta-logs [puppet] - 10https://gerrit.wikimedia.org/r/742781 (https://phabricator.wikimedia.org/T288621) [18:42:54] (03CR) 10Dzahn: [C: 03+2] beta: Drop deployment-deploy01 references from dsh/scap, being decom'ed [puppet] - 10https://gerrit.wikimedia.org/r/742750 (https://phabricator.wikimedia.org/T278689) (owner: 10Jforrester) [18:43:56] (03PS2) 10Cwhite: opensearch-dashboards: enable phatality on beta-logs [puppet] - 10https://gerrit.wikimedia.org/r/742781 (https://phabricator.wikimedia.org/T288621) [18:45:09] (03CR) 10Dzahn: "Is there a reason for it? Are cache hosts removed from deployment-prep entirely?" [puppet] - 10https://gerrit.wikimedia.org/r/742210 (owner: 10Majavah) [18:46:01] (03CR) 10Cwhite: [C: 03+2] opensearch-dashboards: enable phatality on beta-logs [puppet] - 10https://gerrit.wikimedia.org/r/742781 (https://phabricator.wikimedia.org/T288621) (owner: 10Cwhite) [18:46:19] (03PS3) 10Majavah: hieradata: remove beta hosts from cloud-wide cache_hosts [puppet] - 10https://gerrit.wikimedia.org/r/742210 [18:46:21] (03PS2) 10Majavah: hieradata: remove old project-proxies [puppet] - 10https://gerrit.wikimedia.org/r/742211 [18:46:24] (03CR) 10Dzahn: "it's been a couple months, still wanna amend here?" 
[puppet] - 10https://gerrit.wikimedia.org/r/673556 (https://phabricator.wikimedia.org/T277729) (owner: 10Razzi) [18:47:20] (03CR) 10Majavah: hieradata: remove beta hosts from cloud-wide cache_hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/742210 (owner: 10Majavah) [18:47:34] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: Q2) eqiad: Upgrades of Management Switches - https://phabricator.wikimedia.org/T259758 (10Cmjohnson) [18:47:39] (03CR) 10Dzahn: "I see you merged this, did you also do the setup and coordinate with releng?" [puppet] - 10https://gerrit.wikimedia.org/r/742078 (owner: 10Majavah) [18:48:08] (03CR) 10Majavah: devtools: setup doc1002 like doc (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/742078 (owner: 10Majavah) [18:48:23] (03CR) 10Dzahn: "ah, it's cloud only. nevermind" [puppet] - 10https://gerrit.wikimedia.org/r/742078 (owner: 10Majavah) [18:48:46] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: Q2) eqiad: Upgrades of Management Switches - https://phabricator.wikimedia.org/T259758 (10Cmjohnson) Replaced all the mgmt switches, updated netbox and ran homer. Closing this task and creating a decommissioning task to remove the old mgmt switches. 
[18:48:59] (03CR) 10Dzahn: devtools: setup doc1002 like doc (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/742078 (owner: 10Majavah) [18:49:00] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: Q2) eqiad: Upgrades of Management Switches - https://phabricator.wikimedia.org/T259758 (10Cmjohnson) [18:49:09] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: Q2) eqiad: Upgrades of Management Switches - https://phabricator.wikimedia.org/T259758 (10Cmjohnson) 05Open→03Resolved [18:51:00] 10ops-eqiad, 10DC-Ops, 10decommission-hardware: Decommission old MSW in racks A, B, C and D - https://phabricator.wikimedia.org/T296770 (10Cmjohnson) [18:51:52] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:52:27] (03PS2) 10Ahmon Dancy: mediawiki 0.0.37: Add additional php settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/742773 [18:53:11] 10SRE, 10ops-eqiad, 10cloud-services-team (Kanban): Degraded RAID on cloudvirt1018 - https://phabricator.wikimedia.org/T296592 (10Cmjohnson) @Andrew If we're going to leave this be, is it okay to close this task? 
Your server refresh can be tracked in a separate task [18:54:04] (03CR) 10Jhernandez: "We need to figure out why cawiki doesn't show up in the diffs but fawiki does" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/742763 (https://phabricator.wikimedia.org/T296486) (owner: 10Eigyan) [18:54:06] (03PS1) 10Cwhite: hiera: beta-logs: populate scap::deployment_server with placeholder [puppet] - 10https://gerrit.wikimedia.org/r/742784 [18:55:14] (03CR) 10Cwhite: [C: 03+2] hiera: beta-logs: populate scap::deployment_server with placeholder [puppet] - 10https://gerrit.wikimedia.org/r/742784 (owner: 10Cwhite) [18:57:13] (03CR) 10Dzahn: profile::gitlab-runner add hieradata for protected GitLab Runners (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/742458 (https://phabricator.wikimedia.org/T295481) (owner: 10Jelto) [18:57:36] (03CR) 10Dzahn: "all looks good to me except I don't think there is a description parameter yet" [puppet] - 10https://gerrit.wikimedia.org/r/742458 (https://phabricator.wikimedia.org/T295481) (owner: 10Jelto) [18:58:30] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:58:38] (03PS1) 10Razzi: Revert "superset: set webserver timeout to 180 seconds" [puppet] - 10https://gerrit.wikimedia.org/r/742564 [19:00:07] (03PS2) 10Razzi: Revert "superset: set webserver timeout to 180 seconds" [puppet] - 10https://gerrit.wikimedia.org/r/742564 [19:00:25] (03CR) 10Razzi: [V: 03+2 C: 03+2] Revert "superset: set webserver timeout to 180 seconds" [puppet] - 10https://gerrit.wikimedia.org/r/742564 (owner: 10Razzi) [19:01:58] (03PS1) 10Vgutierrez: varnish: Listen on several Unix Domain Sockets [puppet] - 10https://gerrit.wikimedia.org/r/742785 (https://phabricator.wikimedia.org/T290005) [19:02:28] (03CR) 10Dzahn: deployment-prep: Add wikifunctions.beta.wmflabs.org (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/714068 (https://phabricator.wikimedia.org/T284162) (owner: 10Jforrester) [19:03:21] (03CR) 10jerkins-bot: [V: 04-1] varnish: Listen on several Unix Domain Sockets [puppet] - 10https://gerrit.wikimedia.org/r/742785 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [19:07:12] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for @tmletzko - https://phabricator.wikimedia.org/T296634 (10herron) [19:07:33] (03PS2) 10Vgutierrez: varnish: Listen on several Unix Domain Sockets [puppet] - 10https://gerrit.wikimedia.org/r/742785 (https://phabricator.wikimedia.org/T290005) [19:08:23] (03CR) 10jerkins-bot: [V: 04-1] varnish: Listen on several Unix Domain Sockets [puppet] - 10https://gerrit.wikimedia.org/r/742785 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [19:08:53] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for @JanJaquemot - https://phabricator.wikimedia.org/T296633 (10herron) [19:09:04] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for @tmletzko - 
https://phabricator.wikimedia.org/T296634 (10herron) 05Open→03Resolved a:03herron The patchset to enable the requested access has been merged and deployed. I'll transition this to resolved now, but please don'... [19:09:23] (03PS3) 10Vgutierrez: varnish: Listen on several Unix Domain Sockets [puppet] - 10https://gerrit.wikimedia.org/r/742785 (https://phabricator.wikimedia.org/T290005) [19:09:40] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for @JanJaquemot - https://phabricator.wikimedia.org/T296633 (10herron) 05Open→03Resolved a:03herron The patchset to enable the requested access has been merged and deployed. I'll transition this to resolved now, but please d... [19:10:42] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32752/console" [puppet] - 10https://gerrit.wikimedia.org/r/742785 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [19:11:10] (03PS3) 10Herron: admin: add mmartorana to deployment and analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/742743 (https://phabricator.wikimedia.org/T295790) [19:15:20] 10SRE, 10LDAP-Access-Requests: Add Ollie Shotton to the ldap/wmde and ldap/nda group - https://phabricator.wikimedia.org/T296715 (10KFrancis) Hi all, I'll work on the agreement and let you know when it's complete. Thanks! [19:22:58] (03CR) 10Dzahn: "thanks!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/741124 (owner: 10Jelto) [19:25:50] (03CR) 10Dzahn: hieradata: remove beta hosts from cloud-wide cache_hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/742210 (owner: 10Majavah) [19:25:52] (03CR) 10Dzahn: [C: 03+2] hieradata: remove beta hosts from cloud-wide cache_hosts [puppet] - 10https://gerrit.wikimedia.org/r/742210 (owner: 10Majavah) [19:28:27] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Infrastructure-Foundations, 10netops: ulsfo: (2) mx80s to become temp cr[34]-drmrs - https://phabricator.wikimedia.org/T295819 (10RobH) [19:29:33] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Infrastructure-Foundations, 10netops: ulsfo: (2) mx80s to become temp cr[34]-drmrs - https://phabricator.wikimedia.org/T295819 (10RobH) So both of the old mx80s are attached to mgmt, power, and scs. I've not bothered to document these connections in netbox, as they are un... [19:31:36] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Infrastructure-Foundations, 10netops: ulsfo: (2) mx80s to become temp cr[34]-drmrs - https://phabricator.wikimedia.org/T295819 (10RobH) a:05RobH→03ayounsi I'm not sure if this should assign to @ayounsi or @cmooney, either can handle this! This is ready for software up... 
[19:32:28] 10SRE, 10ops-ulsfo, 10DC-Ops: ulsfo cable ids missing - https://phabricator.wikimedia.org/T295198 (10RobH) 05Open→03Resolved fixed all outstanding report errors for ulsfo [19:44:24] (03PS1) 10Legoktm: fpm-multiversion-base: Add PHP Yaml extension [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/742790 (https://phabricator.wikimedia.org/T296331) [19:52:12] (03CR) 10Legoktm: [C: 04-1] "I discussed this with Joe, and he recommended rolling it out on the canaries first for a few days, just in case there are any unexpected i" [puppet] - 10https://gerrit.wikimedia.org/r/740927 (https://phabricator.wikimedia.org/T296331) (owner: 10Dduvall) [20:00:02] (03CR) 10Daniel Kinzler: mediawiki: Install yaml extension for use by SettingsBuilder (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/740927 (https://phabricator.wikimedia.org/T296331) (owner: 10Dduvall) [20:02:26] (03CR) 10Majavah: [C: 04-1] "needs changelog entries I think?" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/742790 (https://phabricator.wikimedia.org/T296331) (owner: 10Legoktm) [20:03:19] (03PS2) 10Legoktm: mediawiki: Install yaml extension for SettingsBuilder on canaries [puppet] - 10https://gerrit.wikimedia.org/r/740927 (https://phabricator.wikimedia.org/T296331) (owner: 10Dduvall) [20:03:22] (03CR) 10Herron: [C: 03+2] admin: add mmartorana to deployment and analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/742743 (https://phabricator.wikimedia.org/T295790) (owner: 10Herron) [20:03:38] (03CR) 10jerkins-bot: [V: 04-1] mediawiki: Install yaml extension for SettingsBuilder on canaries [puppet] - 10https://gerrit.wikimedia.org/r/740927 (https://phabricator.wikimedia.org/T296331) (owner: 10Dduvall) [20:05:55] (03CR) 10Legoktm: mediawiki: Install yaml extension for SettingsBuilder on canaries (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/740927 (https://phabricator.wikimedia.org/T296331) (owner: 10Dduvall) [20:06:45] 
jouncebot: nowandnext [20:06:45] No deployments scheduled for the next 3 hour(s) and 53 minute(s) [20:06:45] In 3 hour(s) and 53 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211201T0000) [20:06:56] (03PS2) 10Urbanecm: uzwiki: Deploy Growth features to newcomers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/742504 (https://phabricator.wikimedia.org/T294245) [20:07:02] (03CR) 10Urbanecm: [C: 03+2] uzwiki: Deploy Growth features to newcomers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/742504 (https://phabricator.wikimedia.org/T294245) (owner: 10Urbanecm) [20:07:43] (03CR) 10Legoktm: "I'll remove the duplicate entries in a follow-up, not sure how I missed it earlier." [puppet] - 10https://gerrit.wikimedia.org/r/738426 (https://phabricator.wikimedia.org/T216088) (owner: 10Muehlenhoff) [20:07:49] (03Merged) 10jenkins-bot: uzwiki: Deploy Growth features to newcomers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/742504 (https://phabricator.wikimedia.org/T294245) (owner: 10Urbanecm) [20:09:53] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 5443b78f197b782238632966891d721859733a74: uzwiki: Deploy Growth features to newcomers (T294245) (duration: 00m 57s) [20:09:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:09:58] T294245: Activation of the visual editor and Growth features by default on Uzbek Wikipedia - https://phabricator.wikimedia.org/T294245 [20:10:17] (03PS3) 10Legoktm: mediawiki: Install yaml extension for SettingsBuilder on canaries [puppet] - 10https://gerrit.wikimedia.org/r/740927 (https://phabricator.wikimedia.org/T296331) (owner: 10Dduvall) [20:10:19] (03PS1) 10Legoktm: Remove duplicate role_contacts entries [puppet] - 10https://gerrit.wikimedia.org/r/742792 [20:11:36] (03CR) 10jerkins-bot: [V: 04-1] mediawiki: Install yaml extension for SettingsBuilder on canaries [puppet] - 10https://gerrit.wikimedia.org/r/740927 
(https://phabricator.wikimedia.org/T296331) (owner: 10Dduvall) [20:12:13] (03PS3) 10Herron: admin: add dbad2021 to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/742751 (https://phabricator.wikimedia.org/T293253) [20:18:08] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users for DAbad - https://phabricator.wikimedia.org/T293253 (10herron) [20:18:26] (03CR) 10Herron: [C: 03+2] admin: add dbad2021 to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/742751 (https://phabricator.wikimedia.org/T293253) (owner: 10Herron) [20:20:44] 10SRE, 10Scap, 10Release-Engineering-Team (Next): Re-imaged mw app servers can end up with missing l10n cache for old versions of MW needed for rollback - https://phabricator.wikimedia.org/T273334 (10Jdforrester-WMF) [20:20:50] 10SRE, 10serviceops, 10Patch-For-Review, 10Release-Engineering-Team (Radar): Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10Jdforrester-WMF) [20:20:58] 10SRE, 10Scap, 10Release-Engineering-Team (Next): Re-imaged mw app servers can end up with missing l10n cache for old versions of MW needed for rollback - https://phabricator.wikimedia.org/T273334 (10Jdforrester-WMF) We proceeded with the wider work without fixing this task, so I'll remove it as a blocker. 
[20:21:02] (03PS2) 10Herron: admin: create ldap_only entry for user jsn [puppet] - 10https://gerrit.wikimedia.org/r/742759 (https://phabricator.wikimedia.org/T296654) [20:21:52] 10SRE-Access-Requests: Requesting access to deployment for Majavah - https://phabricator.wikimedia.org/T296777 (10Majavah) [20:22:07] 10SRE, 10serviceops, 10Patch-For-Review, 10Release-Engineering-Team (Radar): Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10Jdforrester-WMF) [20:22:39] 10SRE, 10serviceops, 10Patch-For-Review, 10Release-Engineering-Team (Radar): Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10Jdforrester-WMF) [20:27:22] (03PS2) 10Legoktm: fpm-multiversion-base: Add PHP Yaml extension [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/742790 (https://phabricator.wikimedia.org/T296331) [20:27:52] (03PS3) 10Legoktm: fpm-multiversion-base: Add PHP yaml extension [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/742790 (https://phabricator.wikimedia.org/T296331) [20:28:01] (03CR) 10Legoktm: fpm-multiversion-base: Add PHP yaml extension (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/742790 (https://phabricator.wikimedia.org/T296331) (owner: 10Legoktm) [20:29:05] (03CR) 10Dzahn: [C: 03+1] admin: create ldap_only entry for user jsn [puppet] - 10https://gerrit.wikimedia.org/r/742759 (https://phabricator.wikimedia.org/T296654) (owner: 10Herron) [20:29:15] (03CR) 10Herron: [C: 03+2] admin: create ldap_only entry for user jsn [puppet] - 10https://gerrit.wikimedia.org/r/742759 (https://phabricator.wikimedia.org/T296654) (owner: 10Herron) [20:33:14] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Grant Access to wmf for Jsn.sherman - https://phabricator.wikimedia.org/T296654 (10herron) 05Open→03Resolved a:03herron Hi @jsn.sherman your account has been added to the `wmf` ldap group. 
I'll transition this task to resolved now, but please don't he... [20:34:48] (03CR) 10Legoktm: [C: 03+2] Remove duplicate role_contacts entries [puppet] - 10https://gerrit.wikimedia.org/r/742792 (owner: 10Legoktm) [20:35:10] 10SRE, 10SRE-Access-Requests, 10Security-Team, 10SecTeam-Processed: Add Manfredi Martorana to deployment and analytics-privatedata-users groups - https://phabricator.wikimedia.org/T295790 (10herron) [20:35:18] herron: I pulled in your change, ok to puppet-merge? [20:35:38] legoktm: yes please do! [20:36:19] {{done}} [20:38:38] ty [20:41:54] (03CR) 10Jforrester: deployment-prep: Add wikifunctions.beta.wmflabs.org (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/714068 (https://phabricator.wikimedia.org/T284162) (owner: 10Jforrester) [20:41:58] 10SRE, 10SRE-Access-Requests, 10Security-Team, 10SecTeam-Processed: Add Manfredi Martorana to deployment and analytics-privatedata-users groups - https://phabricator.wikimedia.org/T295790 (10herron) 05Open→03Resolved Hi @mmartorana, the patch set to enable the access requested has been merged and deplo... [20:42:16] 10SRE, 10Continuous-Integration-Infrastructure, 10Traffic-Icebox, 10HTTPS: contint.wikimedia.org: add TLS termination - https://phabricator.wikimedia.org/T263830 (10Dzahn) Wow @Majavah thanks for closing this! :) Just a bit sad that it was still not triaged in CI infra and people probably won't notice. [20:43:48] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users for DAbad - https://phabricator.wikimedia.org/T293253 (10herron) 05Open→03Resolved Hi @DAbad the patch set to enable the access requested has been merged and deployed, and you should have received an email... 
[20:48:40] 10SRE, 10SRE-Access-Requests: Requesting access to deployment for Majavah - https://phabricator.wikimedia.org/T296777 (10herron) [20:50:15] (03PS4) 10Legoktm: mediawiki: Install yaml extension for SettingsBuilder on canaries [puppet] - 10https://gerrit.wikimedia.org/r/740927 (https://phabricator.wikimedia.org/T296331) (owner: 10Dduvall) [20:50:44] (03PS2) 10DCausse: [wdqs] switch wdqs1010 to the streaming updater [puppet] - 10https://gerrit.wikimedia.org/r/742670 [20:52:21] (03PS2) 10DCausse: [wdqs] cleanup streaming updater config [puppet] - 10https://gerrit.wikimedia.org/r/742669 [20:54:52] (03PS1) 10Dzahn: admin: add taavi to deployers [puppet] - 10https://gerrit.wikimedia.org/r/742798 (https://phabricator.wikimedia.org/T296777) [20:55:23] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Grant Access to wmf for Jsn.sherman - https://phabricator.wikimedia.org/T296654 (10jsn.sherman) Thanks @herron! [20:55:29] (03CR) 10DCausse: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/742669 (owner: 10DCausse) [20:55:40] (03CR) 10Dzahn: "@thcipiriani Would you approve this?" [puppet] - 10https://gerrit.wikimedia.org/r/742798 (https://phabricator.wikimedia.org/T296777) (owner: 10Dzahn) [20:56:16] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to deployment for Majavah - https://phabricator.wikimedia.org/T296777 (10herron) Hi @KFrancis looking at the NDA tracking sheet it looks like we have an NDA on file for Taavi on line 42, but may have an outdated or alternate email address on... [20:58:06] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to deployment for Majavah - https://phabricator.wikimedia.org/T296777 (10Dzahn) @thcipriani Thoughts on this request? [20:58:52] 10SRE, 10Analytics, 10Event-Platform, 10Sustainability (Incident Followup): Pool eventgate-main in both datacenters (active/active) - https://phabricator.wikimedia.org/T296699 (10Legoktm) Ack, thanks! 
I sent a heads-up to the ops list about this just in case some other application has started assuming it... [20:58:58] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to deployment for Majavah - https://phabricator.wikimedia.org/T296777 (10Majavah) >>! In T296777#7538908, @herron wrote: > Hi @KFrancis looking at the NDA tracking sheet it looks like we have an NDA on file for Taavi on line 42, but may have... [21:01:55] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for Daimona - https://phabricator.wikimedia.org/T295993 (10Dzahn) 05Resolved→03Open Not a problem, but in this case we need to move this account out of "wmf" again and back into "nda". Yes, you can create a new account on Wikitech on your own and associat... [21:03:47] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for Daimona - https://phabricator.wikimedia.org/T295993 (10Dzahn) @herron Could you move the existing volunteer account out of the "wmf" group and back into the "nda" group. (status before this ticket was created). And then once Daimona has a new work user,... [21:04:06] (03PS3) 10Ahmon Dancy: mediawiki 0.0.37: Add additional php settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/742773 [21:04:54] (03CR) 10Ahmon Dancy: [C: 03+1] "Tested in train-dev" [deployment-charts] - 10https://gerrit.wikimedia.org/r/742773 (owner: 10Ahmon Dancy) [21:06:11] (03PS5) 10Legoktm: mediawiki: Install yaml extension for SettingsBuilder on canaries [puppet] - 10https://gerrit.wikimedia.org/r/740927 (https://phabricator.wikimedia.org/T296331) (owner: 10Dduvall) [21:07:39] (03CR) 10Legoktm: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32754/console" [puppet] - 10https://gerrit.wikimedia.org/r/740927 (https://phabricator.wikimedia.org/T296331) (owner: 10Dduvall) [21:08:12] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for Daimona - https://phabricator.wikimedia.org/T295993 (10herron) >>! 
In T295993#7538959, @Dzahn wrote: > @herron Could you move the existing volunteer account out of the "wmf" group and back into the "nda" group. (status before this ticket was created). Su... [21:09:35] (03CR) 10Legoktm: [V: 03+1 C: 03+1] "PCC shows mw1413 (appserver) with no changes, while mw1414 (canary appserver) installs the extension." [puppet] - 10https://gerrit.wikimedia.org/r/740927 (https://phabricator.wikimedia.org/T296331) (owner: 10Dduvall) [21:11:42] (03CR) 10Legoktm: [C: 03+1] Point back irc.wikimedia.org to irc2001 [dns] - 10https://gerrit.wikimedia.org/r/742730 (https://phabricator.wikimedia.org/T296721) (owner: 10Muehlenhoff) [21:16:53] (03PS1) 10Ssingh: add CNAME for one.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/742803 (https://phabricator.wikimedia.org/T296570) [21:20:20] (03CR) 10Ssingh: [C: 03+2] add CNAME for one.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/742803 (https://phabricator.wikimedia.org/T296570) (owner: 10Ssingh) [21:30:05] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for Jsn.sherman - https://phabricator.wikimedia.org/T296654 (10Aklapper) @herron: As I've seen process steps skipped several times lately, what could be done to [follow the docs](https://wikitech.wikimedia.org/w/index.php?title=SRE%2FClinic_Duty%2FAccess_reque... [21:31:57] 10SRE, 10DNS, 10Traffic, 10WMF-Communications, 10Patch-For-Review: Setup subdomain for Foundation messaging site - https://phabricator.wikimedia.org/T296570 (10ssingh) 05Openβ†’03Resolved a:03ssingh ` $ dig one.wikimedia.org CNAME +short messaging-wikimedia-org.go-vip.net. 
` [21:34:54] (03CR) 10DCausse: "appears to be a noop according to https://puppet-compiler.wmflabs.org/compiler1002/1110/" [puppet] - 10https://gerrit.wikimedia.org/r/742669 (owner: 10DCausse) [21:43:53] (03PS1) 10Razzi: superset: set webserver timeout to 180 seconds [puppet] - 10https://gerrit.wikimedia.org/r/742808 (https://phabricator.wikimedia.org/T294771) [21:45:49] (03CR) 10Razzi: [C: 03+2] ""Reviewed" this with @ottomata by going over the fix on IRC. Going to go ahead and redeploy this now." [puppet] - 10https://gerrit.wikimedia.org/r/742808 (https://phabricator.wikimedia.org/T294771) (owner: 10Razzi) [21:53:13] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to deployment for Majavah - https://phabricator.wikimedia.org/T296777 (10thcipriani) >>! In T296777#7538923, @Dzahn wrote: > @thcipriani Thoughts on this request? Approved. @Majavah is familiar with scap from working with beta, and is alway... [21:57:27] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to deployment for Majavah - https://phabricator.wikimedia.org/T296777 (10Urbanecm) This has enough approvals already, but I want to explicitly support this anyway. Majavah is really helpful, and deployment access will let them be more helpfu... 
[21:58:51] (03CR) 10Dzahn: [C: 03+1] "has approval from group approver on ticket" [puppet] - 10https://gerrit.wikimedia.org/r/742798 (https://phabricator.wikimedia.org/T296777) (owner: 10Dzahn) [21:59:55] (03CR) 10Urbanecm: [C: 03+1] "LGTM 😊" [puppet] - 10https://gerrit.wikimedia.org/r/742798 (https://phabricator.wikimedia.org/T296777) (owner: 10Dzahn) [22:00:44] (03CR) 10RhinosF1: [C: 03+1] admin: add taavi to deployers [puppet] - 10https://gerrit.wikimedia.org/r/742798 (https://phabricator.wikimedia.org/T296777) (owner: 10Dzahn) [22:12:20] 10SRE, 10ops-eqiad, 10cloud-services-team (Kanban): Degraded RAID on cloudvirt1018 - https://phabricator.wikimedia.org/T296592 (10nskaggs) Are we going to simply decommission this machine then and remove it from the rack? If so, then I would consider that the resolution of this ticket. But I think we should... [22:14:14] 10SRE, 10DNS, 10Traffic, 10WMF-Communications: Setup subdomain for Foundation messaging site - https://phabricator.wikimedia.org/T296570 (10Varnent) Thank you so much @ssingh! [22:18:55] 10SRE, 10ops-eqiad, 10cloud-services-team (Kanban): Degraded RAID on cloudvirt1018 - https://phabricator.wikimedia.org/T296592 (10Andrew) Yep, let's just decom. 
I'll open a ticket for that and then close this [22:20:46] (03CR) 10Dzahn: [C: 03+2] admin: add taavi to deployers [puppet] - 10https://gerrit.wikimedia.org/r/742798 (https://phabricator.wikimedia.org/T296777) (owner: 10Dzahn) [22:21:31] !log welcome Majavah to MediaWiki deployers (T296777) [22:21:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:22:52] majavah: deploy1002: Admin/Admin::Hashuser[taavi]/Admin::User[taavi]/User[taavi]/ensure: created [22:23:09] happy deploying [22:23:28] 10SRE, 10ops-eqiad, 10cloud-services-team (Kanban): Degraded RAID on cloudvirt1018 - https://phabricator.wikimedia.org/T296592 (10Andrew) 05Open→03Resolved https://phabricator.wikimedia.org/T296790 [22:24:52] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to deployment for Majavah - https://phabricator.wikimedia.org/T296777 (10Dzahn) 05Open→03Resolved a:03Dzahn ` Admin/Admin::Hashuser[taavi]/Admin::User[taavi]/User[taavi]/ensure: created [deploy1002:~] $ id taavi uid=21215(taavi) gid=5... 
[22:28:08] Congrats majavah [22:30:12] !log krinkle@deploy1002 Started deploy [integration/docroot@2af7007]: Ia89b6591639e5 [22:30:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:30:19] majavah: happy deploying [22:30:22] !log krinkle@deploy1002 Finished deploy [integration/docroot@2af7007]: Ia89b6591639e5 (duration: 00m 09s) [22:30:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:42:14] 10SRE, 10Analytics, 10Event-Platform, 10Sustainability (Incident Followup): Pool eventgate-main in both datacenters (active/active) - https://phabricator.wikimedia.org/T296699 (10Ottomata) +1 [22:44:08] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_citoid_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:44:43] (03CR) 10Ryan Kemper: "I'd meant to circle back and set the kernel version but was lagging on it, so thanks very much for getting this patch up! Merging now." [puppet] - 10https://gerrit.wikimedia.org/r/742729 (https://phabricator.wikimedia.org/T294961) (owner: 10Muehlenhoff) [22:45:04] (03CR) 10Ryan Kemper: [C: 03+2] Add profile::base::linux419 to the WCQS role [puppet] - 10https://gerrit.wikimedia.org/r/742729 (https://phabricator.wikimedia.org/T294961) (owner: 10Muehlenhoff) [22:46:22] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:48:03] (03CR) 10Ryan Kemper: [C: 03+2] [wdqs] cleanup streaming updater config [puppet] - 10https://gerrit.wikimedia.org/r/742669 (owner: 10DCausse) [22:53:04] PROBLEM - Check systemd state on sodium is CRITICAL: CRITICAL - degraded: The following units failed: update-ubuntu-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:55:13] (03PS2) 10Cwhite: hiera: add opensearch production configuration [puppet] - 10https://gerrit.wikimedia.org/r/742780 (https://phabricator.wikimedia.org/T288621) [22:56:26] mutante: I think majavah still needs to be added to the wmf-deployment gerrit group [23:02:46] (03PS1) 10Ottomata: Airflow 2.2.2 with extra dependencies [debs/airflow] (debian) - 10https://gerrit.wikimedia.org/r/742813 (https://phabricator.wikimedia.org/T295380) [23:06:46] zabe: checking.. ACK.. thanks for the ping [23:06:50] (03PS3) 10Cwhite: hiera: add opensearch production configuration [puppet] - 10https://gerrit.wikimedia.org/r/742780 (https://phabricator.wikimedia.org/T288621) [23:08:14] (03PS2) 10Ottomata: Airflow 2.2.2 with extra dependencies [debs/airflow] (debian) - 10https://gerrit.wikimedia.org/r/742813 (https://phabricator.wikimedia.org/T295380) [23:08:38] I feel like the only tech contributor hat left for majavah to collect is a pay check. 
:) [23:09:25] !log gerrit - added Majavah to wmf-deployment group for T296777 [23:09:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:10:08] expected more bot [23:11:02] 10SRE, 10SRE-Access-Requests: Requesting access to deployment for Majavah - https://phabricator.wikimedia.org/T296777 (10Dzahn) 22:56 < zabe> mutante: I think majavah still needs to be added to the wmf-deployment gerrit group 23:09 < mutante> !log gerrit - added Majavah to wmf-deployment group for T296777 [23:12:03] 10SRE, 10SRE-Access-Requests: Requesting access to deployment for Majavah - https://phabricator.wikimedia.org/T296777 (10Dzahn) {F34797211} [23:12:30] tokens on request ticket for Majavah: https://phab.wmfusercontent.org/file/data/xxzlffr322vrark5zwgk/PHID-FILE-rlgjnsvua3u5bdjapalp/preview-Screenshot_at_2021-11-30_15-11-34.png [23:21:18] (03CR) 10Cwhite: "PCC NOOP: https://puppet-compiler.wmflabs.org/compiler1002/32755/" [puppet] - 10https://gerrit.wikimedia.org/r/742780 (https://phabricator.wikimedia.org/T288621) (owner: 10Cwhite) [23:21:32] 10SRE, 10SRE-swift-storage: Persistent "429, Too Many Requests" for Commons AV1 thumbnail - https://phabricator.wikimedia.org/T296562 (10ToBeFree) Can someone from SRE confirm that the bug exists and is currently reproducible? Is there a kind of desired "minimum response time" from sysadmins in response to per... [23:21:36] (03PS2) 10Cwhite: site: consolidate logstash node definitions [puppet] - 10https://gerrit.wikimedia.org/r/742778 (https://phabricator.wikimedia.org/T288621) [23:21:53] (03CR) 10Krinkle: P::doc: sync data to non-active servers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/741713 (https://phabricator.wikimedia.org/T247653) (owner: 10Majavah) [23:23:10] Can someone look at ToBeFree's error above and at least see if a log exists or something? 
[23:29:01] (03PS3) 10Cwhite: site: consolidate logstash node definitions [puppet] - 10https://gerrit.wikimedia.org/r/742778 (https://phabricator.wikimedia.org/T288621) [23:33:57] (03PS1) 10Clare Ming: Enable A/B test enrollment instrumentation. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/742817 (https://phabricator.wikimedia.org/T292587) [23:37:25] (03CR) 10Nray: Enable A/B test enrollment instrumentation. (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/742817 (https://phabricator.wikimedia.org/T292587) (owner: 10Clare Ming) [23:37:59] (03CR) 10Cwhite: "PCC: https://puppet-compiler.wmflabs.org/compiler1002/32757 (largely noop)" [puppet] - 10https://gerrit.wikimedia.org/r/742778 (https://phabricator.wikimedia.org/T288621) (owner: 10Cwhite) [23:38:17] (03PS3) 10Jforrester: [Beta Cluster] Add initial namespace aliases for Wikifunctions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/742756 (https://phabricator.wikimedia.org/T284162) [23:38:19] (03PS1) 10Jforrester: [Beta Cluster] Add project images for Wikifunctions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/742818 (https://phabricator.wikimedia.org/T284162) [23:38:50] (03CR) 10Jforrester: [C: 03+2] [Beta Cluster] Add initial namespace aliases for Wikifunctions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/742756 (https://phabricator.wikimedia.org/T284162) (owner: 10Jforrester) [23:38:54] (03PS2) 10Clare Ming: Enable A/B test enrollment instrumentation. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/742817 (https://phabricator.wikimedia.org/T292587) [23:39:20] (03CR) 10Jforrester: [C: 03+2] [Beta Cluster] Add project images for Wikifunctions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/742818 (https://phabricator.wikimedia.org/T284162) (owner: 10Jforrester) [23:39:34] (03Merged) 10jenkins-bot: [Beta Cluster] Add initial namespace aliases for Wikifunctions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/742756 (https://phabricator.wikimedia.org/T284162) (owner: 10Jforrester) [23:40:07] (03Merged) 10jenkins-bot: [Beta Cluster] Add project images for Wikifunctions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/742818 (https://phabricator.wikimedia.org/T284162) (owner: 10Jforrester) [23:40:53] (03PS1) 10Dzahn: miscweb: try again to enable TLS, remove nodePort [deployment-charts] - 10https://gerrit.wikimedia.org/r/742819 (https://phabricator.wikimedia.org/T281538) [23:41:08] (03CR) 10Nray: [C: 03+1] Enable A/B test enrollment instrumentation. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/742817 (https://phabricator.wikimedia.org/T292587) (owner: 10Clare Ming) [23:43:44] (03CR) 10Dzahn: "This time using 'helmfile -e staging destroy' in the right env to get rid of remnants that might have lead to previous issues." [deployment-charts] - 10https://gerrit.wikimedia.org/r/742819 (https://phabricator.wikimedia.org/T281538) (owner: 10Dzahn) [23:45:18] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [23:45:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:46:22] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[23:46:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:47:41] 10SRE, 10SRE-Access-Requests: Requesting access to deployment for Majavah - https://phabricator.wikimedia.org/T296777 (10KFrancis) @Dzahn @herron We are good on our current NDA on file for Majavah. Please proceed with any needed access. Thanks! [23:49:30] (03CR) 10Dzahn: [C: 03+2] miscweb: try again to enable TLS, remove nodePort [deployment-charts] - 10https://gerrit.wikimedia.org/r/742819 (https://phabricator.wikimedia.org/T281538) (owner: 10Dzahn) [23:50:55] (03CR) 10Jdlrobson: [C: 03+1] Enable A/B test enrollment instrumentation. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/742817 (https://phabricator.wikimedia.org/T292587) (owner: 10Clare Ming) [23:55:07] (03Merged) 10jenkins-bot: miscweb: try again to enable TLS, remove nodePort [deployment-charts] - 10https://gerrit.wikimedia.org/r/742819 (https://phabricator.wikimedia.org/T281538) (owner: 10Dzahn) [23:56:47] !log dzahn@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'miscweb' for release 'main' . [23:56:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:57:57] !log deploy1002 - kube_env miscweb staging ; helmfile -e staging destroy [23:58:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:59:37] !log dzahn@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'miscweb' for release 'main' . [23:59:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log