[00:00:05] RoanKattouw and Urbanecm: Dear deployers, time to do the UTC late backport window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211201T0000). [00:00:05] tgr and cjming: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [00:00:18] I can deploy today [00:00:33] yay! thanks Roan [00:00:47] o/ [00:01:05] (CR) Catrope: [C: +2] Enable A/B test enrollment instrumentation. [mediawiki-config] - https://gerrit.wikimedia.org/r/742817 (https://phabricator.wikimedia.org/T292587) (owner: Clare Ming) [00:01:31] (CR) Catrope: [C: +2] Newcomer tasks: Fix filtering of non-existent task types [extensions/GrowthExperiments] (wmf/1.38.0-wmf.9) - https://gerrit.wikimedia.org/r/742548 (https://phabricator.wikimedia.org/T296366) (owner: Gergő Tisza) [00:03:13] (Merged) jenkins-bot: Enable A/B test enrollment instrumentation. [mediawiki-config] - https://gerrit.wikimedia.org/r/742817 (https://phabricator.wikimedia.org/T292587) (owner: Clare Ming) [00:07:35] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [00:07:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:08:00] cjming: Your config patch is on mwdebug1002, please test if possible (or tell me to just go ahead if this is another hard-to-test-because-instrumentation thing) [00:08:45] RoanKattouw: thanks - please go ahead [00:08:46] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [00:08:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:10:04] !log catrope@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:742817|Enable A/B test enrollment instrumentation. 
(T292587)]] (duration: 00m 56s) [00:10:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:10:08] T292587: Sticky header: Create A/B test schema and tie to sticky header feature - https://phabricator.wikimedia.org/T292587 [00:11:49] cjming: Deployed [00:12:33] RoanKattouw: thank you! [00:28:38] (Merged) jenkins-bot: Newcomer tasks: Fix filtering of non-existent task types [extensions/GrowthExperiments] (wmf/1.38.0-wmf.9) - https://gerrit.wikimedia.org/r/742548 (https://phabricator.wikimedia.org/T296366) (owner: Gergő Tisza) [00:30:19] tgr: Your patch is on mwdebug1002, please test [00:31:08] thanks RoanKattouw, works as intended [00:32:21] !log catrope@deploy1002 Synchronized php-1.38.0-wmf.9/extensions/GrowthExperiments/includes/NewcomerTasks/NewcomerTasksUserOptionsLookup.php: Backport: [[gerrit:742548|Newcomer tasks: Fix filtering of non-existent task types (T296366)]] (duration: 00m 56s) [00:32:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:32:26] T296366: GrowthExperiments: Call to undefined method StatusValue::getTotalCount() - https://phabricator.wikimedia.org/T296366 [00:35:04] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [00:35:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:35:57] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[00:35:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:39:25] (CR) Cwhite: "PCC legacy nodes (NOOP): https://puppet-compiler.wmflabs.org/compiler1003/32758/" [puppet] - https://gerrit.wikimedia.org/r/742778 (https://phabricator.wikimedia.org/T288621) (owner: Cwhite) [00:45:32] RECOVERY - very high load average likely xfs on ms-be2059 is OK: OK - load average: 60.98, 68.95, 77.08 https://wikitech.wikimedia.org/wiki/Swift [00:51:08] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=sidekiq site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [00:53:22] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [00:56:40] RECOVERY - Maps tiles generation on alert1001 is OK: OK: Less than 90.00% under the threshold [10.0] https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1 [01:00:58] (Restored) RLazarus: Initial deb package [docker-images/imagecatalog] - https://gerrit.wikimedia.org/r/738996 (owner: RLazarus) [01:01:08] (PS2) RLazarus: Initial deb package [docker-images/imagecatalog] - https://gerrit.wikimedia.org/r/738996 [01:03:26] PROBLEM - very high load average likely xfs on ms-be2059 is CRITICAL: CRITICAL - load average: 122.95, 104.75, 89.45 https://wikitech.wikimedia.org/wiki/Swift [01:13:28] SRE, Thumbor, serviceops, User-jijiki: Upgrade Thumbor to Buster - https://phabricator.wikimedia.org/T216815 (AntiCompositeNumber) [01:14:34] PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [400.0] 
https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [01:17:50] (CR) RLazarus: "Moving over here from https://gerrit.wikimedia.org/r/738500 where this was reviewed for the debian branch -- per discussion offline, we de" [docker-images/imagecatalog] - https://gerrit.wikimedia.org/r/738996 (owner: RLazarus) [01:18:46] (Abandoned) RLazarus: Initial deb package [docker-images/imagecatalog] (debian) - https://gerrit.wikimedia.org/r/738500 (owner: RLazarus) [01:21:32] (CR) RLazarus: [C: +2] Initial deb package [docker-images/imagecatalog] - https://gerrit.wikimedia.org/r/738996 (owner: RLazarus) [01:23:24] (Merged) jenkins-bot: Initial deb package [docker-images/imagecatalog] - https://gerrit.wikimedia.org/r/738996 (owner: RLazarus) [01:39:30] (PS1) 4nn1l2: hewiki: add "templateeditor" permission group [mediawiki-config] - https://gerrit.wikimedia.org/r/742833 (https://phabricator.wikimedia.org/T296769) [01:58:16] SRE, Internet-Archive, InternetArchiveBot: Determine appropriate API request limits for InternetArchiveBot - https://phabricator.wikimedia.org/T296577 (Legoktm) To give an idea of how many requests InternetArchiveBot was sending, the following screenshot shows number of requests over a 24 hour period... 
[01:58:36] SRE, Internet-Archive, InternetArchiveBot: Determine appropriate API request limits for InternetArchiveBot - https://phabricator.wikimedia.org/T296577 (Legoktm) [02:09:24] RECOVERY - very high load average likely xfs on ms-be2059 is OK: OK - load average: 70.51, 73.39, 79.68 https://wikitech.wikimedia.org/wiki/Swift [02:10:11] (PS1) Clare Ming: Add mediawiki.web_ui_scroll stream [mediawiki-config] - https://gerrit.wikimedia.org/r/742834 (https://phabricator.wikimedia.org/T292586) [02:13:23] (CR) Clare Ming: "sampling rates were added in https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/742524/" [mediawiki-config] - https://gerrit.wikimedia.org/r/742834 (https://phabricator.wikimedia.org/T292586) (owner: Clare Ming) [02:19:12] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [02:20:14] RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [200.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [02:21:22] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [02:32:28] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [02:34:42] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [03:12:19] (PS1) 4nn1l2: bnwikibooks: add autopatrolled and patroller user groups [mediawiki-config] - https://gerrit.wikimedia.org/r/742835 (https://phabricator.wikimedia.org/T296640) [04:46:50] RECOVERY - Check systemd state on sodium is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:52:58] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: database-backups-snapshots.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:00:50] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:25:18] (CR) Ryan Kemper: query_service: Collect wdqs and wcqs jmx metrics separately (1 comment) [puppet] - https://gerrit.wikimedia.org/r/742566 (https://phabricator.wikimedia.org/T280008) (owner: Ebernhardson) [05:26:14] (PS4) Gergő Tisza: GrowthExperiments configuration fixes [mediawiki-config] - https://gerrit.wikimedia.org/r/739032 (https://phabricator.wikimedia.org/T294737) [05:27:30] (CR) Gergő Tisza: GrowthExperiments configuration fixes (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/739032 (https://phabricator.wikimedia.org/T294737) (owner: Gergő Tisza) [05:28:49] (CR) Ryan Kemper: "Will merge this weds. Small question: is there a reason the patch is titled wdquery_service and not just query_service? 
Does this impact j" [puppet] - https://gerrit.wikimedia.org/r/739942 (https://phabricator.wikimedia.org/T295676) (owner: Ebernhardson) [05:37:52] (PS1) Ryan Kemper: wcqs: move back to lvs_setup [puppet] - https://gerrit.wikimedia.org/r/742841 (https://phabricator.wikimedia.org/T280001) [05:38:28] (CR) jerkins-bot: [V: -1] wcqs: move back to lvs_setup [puppet] - https://gerrit.wikimedia.org/r/742841 (https://phabricator.wikimedia.org/T280001) (owner: Ryan Kemper) [05:40:10] (PS2) Ryan Kemper: wcqs: move back to lvs_setup [puppet] - https://gerrit.wikimedia.org/r/742841 (https://phabricator.wikimedia.org/T280001) [05:40:40] (PS3) Ryan Kemper: wcqs: move back to lvs_setup [puppet] - https://gerrit.wikimedia.org/r/742841 (https://phabricator.wikimedia.org/T280001) [05:40:44] (CR) jerkins-bot: [V: -1] wcqs: move back to lvs_setup [puppet] - https://gerrit.wikimedia.org/r/742841 (https://phabricator.wikimedia.org/T280001) (owner: Ryan Kemper) [05:43:48] Any reason why I'm not able to cherry-pick change using command line on wmf.9? https://usercontent.irccloud-cdn.com/file/75yX40IR/image.png [05:50:13] <_joe_> kart_: read the text, it's stated right there. [06:07:28] Ouch. 
It is referring to change I dropped :/ [06:17:41] (PS2) Giuseppe Lavagetto: mwdebug: reduce number of replicas [deployment-charts] - https://gerrit.wikimedia.org/r/734452 (owner: Effie Mouzeli) [06:17:43] (PS1) Giuseppe Lavagetto: mediawiki: bump chart version [deployment-charts] - https://gerrit.wikimedia.org/r/742842 [06:19:40] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on db2117.codfw.wmnet with reason: Maintenance T277354 [06:19:41] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2117.codfw.wmnet with reason: Maintenance T277354 [06:19:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:19:45] T277354: "chemical" major mime type was never added to production database - https://phabricator.wikimedia.org/T277354 [06:19:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:20:03] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on db2124.codfw.wmnet with reason: Maintenance T277354 [06:20:05] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2124.codfw.wmnet with reason: Maintenance T277354 [06:20:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:20:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:20:27] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on db2141.codfw.wmnet with reason: Maintenance T277354 [06:20:29] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2141.codfw.wmnet with reason: Maintenance T277354 [06:20:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:20:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:23:26] (CR) Giuseppe Lavagetto: [C: +2] mediawiki: bump chart version [deployment-charts] - 
https://gerrit.wikimedia.org/r/742842 (owner: Giuseppe Lavagetto) [06:26:47] (Merged) jenkins-bot: mediawiki: bump chart version [deployment-charts] - https://gerrit.wikimedia.org/r/742842 (owner: Giuseppe Lavagetto) [06:26:57] (CR) Giuseppe Lavagetto: [C: +2] mwdebug: reduce number of replicas [deployment-charts] - https://gerrit.wikimedia.org/r/734452 (owner: Effie Mouzeli) [06:30:36] (Merged) jenkins-bot: mwdebug: reduce number of replicas [deployment-charts] - https://gerrit.wikimedia.org/r/734452 (owner: Effie Mouzeli) [06:49:19] (PS1) Marostegui: Revert "pc2014: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/742850 [06:49:57] (CR) Marostegui: [C: +2] Revert "pc2014: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/742850 (owner: Marostegui) [07:01:52] (CR) Zabe: [C: +1] Add templateeditor group and protection level at viwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/741972 (https://phabricator.wikimedia.org/T296154) (owner: 4nn1l2) [07:20:12] (PS1) Varac: Kubernetes 1.22 support, update chart version [deployment-charts] - https://gerrit.wikimedia.org/r/742909 [07:20:14] (CR) Welcome, new contributor!: "Thank you for making your first contribution to Wikimedia! 
:) To learn how to get your code changes reviewed faster and more likely to get" [deployment-charts] - https://gerrit.wikimedia.org/r/742909 (owner: Varac) [07:28:51] (CR) Elukey: [C: +2] Move coal, navtiming and statsv to the new canonical CA bundle path (1 comment) [puppet] - https://gerrit.wikimedia.org/r/742674 (https://phabricator.wikimedia.org/T296089) (owner: Elukey) [07:32:00] (CR) Elukey: [C: +2] profile::kafka::broker: use new get ca bundle path helpers [puppet] - https://gerrit.wikimedia.org/r/742725 (https://phabricator.wikimedia.org/T296089) (owner: Elukey) [07:37:52] (CR) Elukey: [V: +1 C: +2] presto: move truststore to the new wmf internal CA bundle [puppet] - https://gerrit.wikimedia.org/r/739477 (owner: Elukey) [07:45:00] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (gerrit1001), Fresh: 102 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [07:50:28] (PS1) Elukey: presto: use new ca bundle jks for internal TLS traffic [puppet] - https://gerrit.wikimedia.org/r/742910 [07:51:51] (CR) Elukey: [V: +1] "PCC SUCCESS (DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32759/console" [puppet] - https://gerrit.wikimedia.org/r/742910 (owner: Elukey) [07:54:23] (PS2) Elukey: presto: use new ca bundle jks for internal TLS traffic [puppet] - https://gerrit.wikimedia.org/r/742910 [07:55:21] (CR) Elukey: [V: +1] "PCC SUCCESS (DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32760/console" [puppet] - https://gerrit.wikimedia.org/r/742910 (owner: Elukey) [07:58:21] (PS4) Thiemo Kreuz (WMDE): Make use of the ?? operator in some more situations [mediawiki-config] - https://gerrit.wikimedia.org/r/740305 [07:58:26] (PS2) Thiemo Kreuz (WMDE): Make use of the ?? 
operator in more trivial situations [mediawiki-config] - https://gerrit.wikimedia.org/r/740304 [07:58:32] (PS8) Thiemo Kreuz (WMDE): Use more compact PHP7 syntax where possible [mediawiki-config] - https://gerrit.wikimedia.org/r/737859 [08:01:07] (PS3) Elukey: presto: use new ca bundle jks for internal TLS traffic [puppet] - https://gerrit.wikimedia.org/r/742910 [08:02:00] (CR) Elukey: [V: +1] "PCC SUCCESS (DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32761/console" [puppet] - https://gerrit.wikimedia.org/r/742910 (owner: Elukey) [08:03:00] (CR) Elukey: [V: +1 C: +2] presto: use new ca bundle jks for internal TLS traffic [puppet] - https://gerrit.wikimedia.org/r/742910 (owner: Elukey) [08:09:22] elukey: can you downtime the fr db and dev lag alerts for a few hours? Andy says they can wait until someone is awake and some are sending irc alerts every few minutes [08:11:20] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:13:43] RhinosF1: done it, added 12h of downtime [08:13:55] elukey: thanks [08:14:36] https://phabricator.wikimedia.org/T296811 [08:16:11] (PS1) Muehlenhoff: Update Cumin alias for puppetdb [puppet] - https://gerrit.wikimedia.org/r/742916 [08:16:52] Puppet, DBA, Data-Engineering, Infrastructure-Foundations, Patch-For-Review: Split mariadb::dbstore_multiinstance into 2 separate roles (backup sources and analytics) - https://phabricator.wikimedia.org/T296285 (jcrespo) a:jcrespo→BTullis Reassigning to btullis, as he was the person t... 
[08:20:17] (CR) Filippo Giunchedi: [C: +1] site: consolidate logstash node definitions [puppet] - https://gerrit.wikimedia.org/r/742778 (https://phabricator.wikimedia.org/T288621) (owner: Cwhite) [08:20:55] (CR) Muehlenhoff: [C: +2] Update Cumin alias for puppetdb [puppet] - https://gerrit.wikimedia.org/r/742916 (owner: Muehlenhoff) [08:21:32] (CR) Filippo Giunchedi: [C: +1] hiera: add opensearch production configuration [puppet] - https://gerrit.wikimedia.org/r/742780 (https://phabricator.wikimedia.org/T288621) (owner: Cwhite) [08:22:07] SRE, ops-ulsfo: ps1-22-ulsfo Cord, Master_Cord_A, Active Power alerting - https://phabricator.wikimedia.org/T294891 (ayounsi) p:Medium→High Many sensors are now over threshold, see the red in: https://librenms.wikimedia.org/device/173/ [08:22:25] (CR) Muehlenhoff: Add ownership annotations for more Service SRE services (1 comment) [puppet] - https://gerrit.wikimedia.org/r/738426 (https://phabricator.wikimedia.org/T216088) (owner: Muehlenhoff) [08:32:36] !log oblivian@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [08:32:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:41:58] !log oblivian@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [08:42:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:44:02] (Abandoned) Filippo Giunchedi: rancid: add ability to disable emails [puppet] - https://gerrit.wikimedia.org/r/741919 (owner: Filippo Giunchedi) [08:51:10] !log oblivian@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[08:51:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:52:14] (PS1) Elukey: presto: use system truststore to validate TLS connections [puppet] - https://gerrit.wikimedia.org/r/742921 [08:53:59] (CR) Elukey: [C: +2] presto: use system truststore to validate TLS connections [puppet] - https://gerrit.wikimedia.org/r/742921 (owner: Elukey) [08:54:12] (PS1) Jcrespo: dbbackups: Move db backups from db1140:s1 to db1139:s1 [puppet] - https://gerrit.wikimedia.org/r/742922 (https://phabricator.wikimedia.org/T280979) [08:56:10] !log draining primary/secondary instance off ganeti2010 T296622 [08:56:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:56:14] T296622: Migrate codfw Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296622 [08:56:23] (CR) Jcrespo: [C: +2] dbbackups: Move db backups from db1140:s1 to db1139:s1 [puppet] - https://gerrit.wikimedia.org/r/742922 (https://phabricator.wikimedia.org/T280979) (owner: Jcrespo) [09:00:09] (CR) Vgutierrez: [V: +1 C: +2] varnish: Listen on several Unix Domain Sockets [puppet] - https://gerrit.wikimedia.org/r/742785 (https://phabricator.wikimedia.org/T290005) (owner: Vgutierrez) [09:03:30] !log rolling restart of haproxy and varnish on O:cache::text_haproxy and O:cache::upload_haproxy - T290005 [09:03:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:03:35] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [09:04:03] (PS1) Ladsgroup: logstash: Add maxSeconds and actualSeconds as numeric fields [puppet] - https://gerrit.wikimedia.org/r/742923 (https://phabricator.wikimedia.org/T295706) [09:08:50] (PS1) Ladsgroup: Drop using ft_title and ft_namespace [extensions/FlaggedRevs] (wmf/1.38.0-wmf.9) - https://gerrit.wikimedia.org/r/742853 (https://phabricator.wikimedia.org/T296380) [09:09:42] !log 
marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on db1096.eqiad.wmnet with reason: Maintenance T277354 [09:09:44] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1096.eqiad.wmnet with reason: Maintenance T277354 [09:09:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:09:46] T277354: "chemical" major mime type was never added to production database - https://phabricator.wikimedia.org/T277354 [09:09:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1096:3316 (T277354)', diff saved to https://phabricator.wikimedia.org/P17917 and previous config saved to /var/cache/conftool/dbconfig/20211201-090948-marostegui.json [09:09:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:09:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:11:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'After maintenance db1096:3316 (T277354)', diff saved to https://phabricator.wikimedia.org/P17918 and previous config saved to /var/cache/conftool/dbconfig/20211201-091110-marostegui.json [09:11:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:16:44] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on kubetcd2005.codfw.wmnet with reason: Switch to DRBD for migration [09:16:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:16:46] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubetcd2005.codfw.wmnet with reason: Switch to DRBD for migration [09:16:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:20:42] jouncebot: nowandnext [09:20:43] No deployments scheduled for the next 2 hour(s) and 39 minute(s) [09:20:43] In 2 hour(s) and 39 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211201T1200) [09:20:46] 
cool [09:20:55] (CR) Ladsgroup: [C: +2] Drop using ft_title and ft_namespace [extensions/FlaggedRevs] (wmf/1.38.0-wmf.9) - https://gerrit.wikimedia.org/r/742853 (https://phabricator.wikimedia.org/T296380) (owner: Ladsgroup) [09:23:23] (PS1) Majavah: beta: Update mx host [mediawiki-config] - https://gerrit.wikimedia.org/r/742925 [09:25:25] (Merged) jenkins-bot: Drop using ft_title and ft_namespace [extensions/FlaggedRevs] (wmf/1.38.0-wmf.9) - https://gerrit.wikimedia.org/r/742853 (https://phabricator.wikimedia.org/T296380) (owner: Ladsgroup) [09:26:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'After maintenance db1096:3316 (T277354)', diff saved to https://phabricator.wikimedia.org/P17919 and previous config saved to /var/cache/conftool/dbconfig/20211201-092615-marostegui.json [09:26:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:26:20] T277354: "chemical" major mime type was never added to production database - https://phabricator.wikimedia.org/T277354 [09:27:02] SRE, ops-ulsfo, DC-Ops, Infrastructure-Foundations, netops: ulsfo: (2) mx80s to become temp cr[34]-drmrs - https://phabricator.wikimedia.org/T295819 (ayounsi) Open→Stalled Thanks I had a quick look and they both are healthy, all 8 interfaces show up as well. I'll wait that we make pr... 
[09:29:25] (PS1) Elukey: install_server: add reuse recipe for kafka-main hosts [puppet] - https://gerrit.wikimedia.org/r/742926 (https://phabricator.wikimedia.org/T296641) [09:30:19] I have a beta-only config patch I want to get out at some point, so when prod is clear and someone more experienced is around I'd like to try merging/deploying it [09:30:48] (PS2) Elukey: install_server: add reuse recipe for kafka-main hosts [puppet] - https://gerrit.wikimedia.org/r/742926 (https://phabricator.wikimedia.org/T296641) [09:31:49] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [09:31:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:32:36] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [09:32:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:33:33] (PS1) Muehlenhoff: Set estimated migration dates for elastic clusters [puppet] - https://gerrit.wikimedia.org/r/742930 [09:38:11] I made lots of tests and it works fine, moving forward [09:39:16] !log ladsgroup@deploy1002 Synchronized php-1.38.0-wmf.9/extensions/FlaggedRevs/backend/FlaggedRevision.php: Backport: [[gerrit:742853|Drop using ft_title and ft_namespace (T296380)]] (duration: 00m 56s) [09:39:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:39:20] T296380: flaggedtemplates table is still too big - https://phabricator.wikimedia.org/T296380 [09:40:06] Amir1: isn't it brave to say "[flagged revs] works fine"? 🙂 [09:40:31] urbanecm: in limits of that extension :D [09:40:40] okay okay :D [09:40:54] Amir1: can you please let majavah know when you're done? 
[09:41:12] majavah: I'm done, just checking graphs [09:41:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'After maintenance db1096:3316 (T277354)', diff saved to https://phabricator.wikimedia.org/P17920 and previous config saved to /var/cache/conftool/dbconfig/20211201-094120-marostegui.json [09:41:23] cool [09:41:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:41:25] T277354: "chemical" major mime type was never added to production database - https://phabricator.wikimedia.org/T277354 [09:41:30] can I merge/fetch https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/742925/ then? [09:41:41] majavah: go ahead! :) [09:41:47] (CR) Majavah: [C: +2] beta: Update mx host [mediawiki-config] - https://gerrit.wikimedia.org/r/742925 (owner: Majavah) [09:41:50] should only need a git fetch; git rebase at deployment host [09:42:08] no need to sync (as it only changes -labs.php file) [09:42:11] but you can if you want to try that out [09:42:26] sure if it's harmless [09:42:45] (Merged) jenkins-bot: beta: Update mx host [mediawiki-config] - https://gerrit.wikimedia.org/r/742925 (owner: Majavah) [09:42:57] yeah, syncing -labs.php doesn't do anything. It's never loaded by prod. [09:43:03] apergos: Are you aware of this? https://logstash.wikimedia.org/goto/198875c8ea1f016cbf8936015c684c99 [09:43:07] "git log -p HEAD..@{u}" looks good, so rebasing [09:43:13] only happens in snapshot1013 [09:43:59] !log [urbanecm@mwmaint1002 ~]$ foreachwiki extensions/CheckUser/maintenance/fixTrailingSpacesInLogs.php [09:44:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:44:37] I pulled to mwdebug1001 and it worked too [09:44:50] urbanecm: so something like `scap sync-file wmf-config/CommonSettings-labs.php 'Config: [[gerrit:742925|beta: Update mx host]]'`? [09:45:02] correct [09:45:52] yup or use my tool :D [09:46:02] these are today? 
no that's new afaik Amir1, these are all blobstore not being able to decompress something and they are all warnings [09:46:07] !log taavi@deploy1002 Synchronized wmf-config/CommonSettings-labs.php: Config: [[gerrit:742925|beta: Update mx host]] (duration: 00m 55s) [09:46:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:46:26] neat [09:46:29] apergos: it's an hour ago, okay then [09:46:32] https://deploy-commands.toolforge.org/bacc/742925 [09:46:34] I guess that either MW is chattier than it used to be about those (bad old revision data) or that we have not paid attention [09:46:34] majavah: congrats for your first deployment (AFAICS) :) [09:46:44] it can be ignored however [09:46:45] yeah, this was the first one :-) [09:46:50] wohoooo [09:47:00] majavah: in case you wish to do sth that actually touches prod, I'll be doing https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/734383 [09:47:10] (will require a world-wide namespaceDupes.php though) [09:47:24] apergos: noted, thanks. It's just showing up in error logs, maybe needs a patch [09:47:54] maybe php warnings shouldn't go to logstash. dunno [09:48:27] they definitely should [09:48:43] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [09:48:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:48:50] most of the time it means you're doing something incorrectly [09:49:19] urbanecm: I'd probably prefer to do something that impacts less wikis first [09:49:44] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[09:49:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:49:48] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:50:21] majavah: sure, understandable [09:53:03] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:54:46] well in the current case it is once again the result of having a lot of old bad revision data in these 20 year old piles of data. we will run into this over and over until either the data is cleaned somehow (with what resources? who would take that risk?) or ... well that. basically :-D [09:55:27] (PS1) Jcrespo: mediabackups: Backup commonswiki at codfw [puppet] - https://gerrit.wikimedia.org/r/742934 (https://phabricator.wikimedia.org/T156544) [09:55:45] like for case on handling image metadata we put that blob inside suppressWarnings(), I can look it up for you [09:56:15] I suggest doing that for that specific part because atm it's noisy and it's making seeing the actual errors harder [09:56:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'After maintenance db1096:3316 (T277354)', diff saved to https://phabricator.wikimedia.org/P17921 and previous config saved to /var/cache/conftool/dbconfig/20211201-095624-marostegui.json [09:56:26] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on db1098.eqiad.wmnet with reason: Maintenance T277354 [09:56:28] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1098.eqiad.wmnet with reason: Maintenance T277354 [09:56:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:56:33] T277354: "chemical" major mime type was never added to production database - 
https://phabricator.wikimedia.org/T277354 [09:56:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1098:3316 (T277354)', diff saved to https://phabricator.wikimedia.org/P17922 and previous config saved to /var/cache/conftool/dbconfig/20211201-095632-marostegui.json [09:56:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:56:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:56:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:57:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'After maintenance db1098:3316 (T277354)', diff saved to https://phabricator.wikimedia.org/P17923 and previous config saved to /var/cache/conftool/dbconfig/20211201-095754-marostegui.json [09:57:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:58:40] (03CR) 10Urbanecm: [C: 03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/739032 (https://phabricator.wikimedia.org/T294737) (owner: 10Gergő Tisza) [10:08:03] MediaWiki\Revision\RevisionStoreRecord->getSha1 this is where it should probably suppress the warning. I guess. [10:08:44] but I'm not really sure... what if we hit a legitimate error (i.e. it's new corruption and not from old data)? [10:08:50] how would we know the difference? 
[10:12:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'After maintenance db1098:3316 (T277354)', diff saved to https://phabricator.wikimedia.org/P17924 and previous config saved to /var/cache/conftool/dbconfig/20211201-101259-marostegui.json [10:13:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:13:04] T277354: "chemical" major mime type was never added to production database - https://phabricator.wikimedia.org/T277354 [10:13:59] (03CR) 10Muehlenhoff: [C: 03+2] Set estimated migration dates for elastic clusters [puppet] - 10https://gerrit.wikimedia.org/r/742930 (owner: 10Muehlenhoff) [10:14:37] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:15:18] (03CR) 10Kormat: "Looks good, one comment." [puppet] - 10https://gerrit.wikimedia.org/r/742926 (https://phabricator.wikimedia.org/T296641) (owner: 10Elukey) [10:21:35] (03PS3) 10Elukey: install_server: add reuse recipe for kafka-main hosts [puppet] - 10https://gerrit.wikimedia.org/r/742926 (https://phabricator.wikimedia.org/T296641) [10:21:38] (03CR) 10Elukey: install_server: add reuse recipe for kafka-main hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/742926 (https://phabricator.wikimedia.org/T296641) (owner: 10Elukey) [10:22:55] (03CR) 10Kormat: [C: 03+1] "LGTM. 
Remember to use reuse-parts-test.cfg in netboot.cfg for the first host just to make sure it's doing what we all sincerely hope it's " [puppet] - 10https://gerrit.wikimedia.org/r/742926 (https://phabricator.wikimedia.org/T296641) (owner: 10Elukey) [10:23:29] !log test haproxy_2.2.19-1~bpo10+1 on cp3064 - T290005 [10:23:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:23:34] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [10:24:13] jouncebot: nowandnext [10:24:13] No deployments scheduled for the next 1 hour(s) and 35 minute(s) [10:24:13] In 1 hour(s) and 35 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211201T1200) [10:25:20] I’ll deploy a MediaWiki security patch [10:26:17] (03PS3) 10Arturo Borrero Gonzalez: hieradata: remove old project-proxies [puppet] - 10https://gerrit.wikimedia.org/r/742211 (owner: 10Majavah) [10:28:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'After maintenance db1098:3316 (T277354)', diff saved to https://phabricator.wikimedia.org/P17925 and previous config saved to /var/cache/conftool/dbconfig/20211201-102804-marostegui.json [10:28:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:28:09] T277354: "chemical" major mime type was never added to production database - https://phabricator.wikimedia.org/T277354 [10:29:14] !log Deployed patch for T296578 [10:29:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:31:44] (I’m done) [10:42:50] (03CR) 10Jcrespo: [C: 03+2] mediabackups: Backup commonswiki at codfw [puppet] - 10https://gerrit.wikimedia.org/r/742934 (https://phabricator.wikimedia.org/T156544) (owner: 10Jcrespo) [10:43:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'After maintenance db1098:3316 (T277354)', diff saved to https://phabricator.wikimedia.org/P17926 and previous config saved to 
/var/cache/conftool/dbconfig/20211201-104308-marostegui.json [10:43:10] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on db1113.eqiad.wmnet with reason: Maintenance T277354 [10:43:12] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1113.eqiad.wmnet with reason: Maintenance T277354 [10:43:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:43:14] T277354: "chemical" major mime type was never added to production database - https://phabricator.wikimedia.org/T277354 [10:43:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1113:3316 (T277354)', diff saved to https://phabricator.wikimedia.org/P17927 and previous config saved to /var/cache/conftool/dbconfig/20211201-104316-marostegui.json [10:43:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:43:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:43:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:44:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'After maintenance db1113:3316 (T277354)', diff saved to https://phabricator.wikimedia.org/P17928 and previous config saved to /var/cache/conftool/dbconfig/20211201-104438-marostegui.json [10:44:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:49:47] (03PS1) 10Cathal Mooney: Added option to disable Capirca ACL generation completely. [software/homer] - 10https://gerrit.wikimedia.org/r/742942 [10:51:32] (03PS2) 10Cathal Mooney: Added option to disable Capirca ACL generation completely. 
[software/homer] - 10https://gerrit.wikimedia.org/r/742942 [10:55:57] PROBLEM - ganeti-confd running on ganeti2010 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 112 (gnt-confd), command name ganeti-confd https://wikitech.wikimedia.org/wiki/Ganeti [10:56:07] PROBLEM - ganeti-mond running on ganeti2010 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), command name ganeti-mond https://wikitech.wikimedia.org/wiki/Ganeti [10:56:08] (03CR) 10Volans: [C: 04-1] "Small nits inline, I don't see any problem to include it." [software/homer] - 10https://gerrit.wikimedia.org/r/742942 (owner: 10Cathal Mooney) [10:58:03] PROBLEM - ganeti-noded running on ganeti2010 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), command name ganeti-noded https://wikitech.wikimedia.org/wiki/Ganeti [10:58:32] https://phabricator.wikimedia.org/T296823 fyi Amir1 [10:59:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'After maintenance db1113:3316 (T277354)', diff saved to https://phabricator.wikimedia.org/P17929 and previous config saved to /var/cache/conftool/dbconfig/20211201-105943-marostegui.json [10:59:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:59:48] T277354: "chemical" major mime type was never added to production database - https://phabricator.wikimedia.org/T277354 [10:59:54] (03CR) 10Elukey: [C: 03+2] install_server: add reuse recipe for kafka-main hosts [puppet] - 10https://gerrit.wikimedia.org/r/742926 (https://phabricator.wikimedia.org/T296641) (owner: 10Elukey) [11:04:41] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [11:07:19] PROBLEM - Check systemd state on sodium is CRITICAL: CRITICAL - degraded: The following units failed: update-ubuntu-mirror.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:08:53] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [11:10:31] apergos: Thanks [11:12:40] (03CR) 10Btullis: [C: 03+1] "This works for me, thanks. +1" [puppet] - 10https://gerrit.wikimedia.org/r/742453 (https://phabricator.wikimedia.org/T296285) (owner: 10Kormat) [11:14:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'After maintenance db1113:3316 (T277354)', diff saved to https://phabricator.wikimedia.org/P17930 and previous config saved to /var/cache/conftool/dbconfig/20211201-111448-marostegui.json [11:14:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:14:53] T277354: "chemical" major mime type was never added to production database - https://phabricator.wikimedia.org/T277354 [11:29:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'After maintenance db1113:3316 (T277354)', diff saved to https://phabricator.wikimedia.org/P17931 and previous config saved to /var/cache/conftool/dbconfig/20211201-112952-marostegui.json [11:29:54] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on db1140.eqiad.wmnet with reason: Maintenance T277354 [11:29:56] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1140.eqiad.wmnet with reason: Maintenance T277354 [11:29:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:30:08] !log test HAProxy 2.4.9 on cp3064 - T290005 [11:30:29] T277354: "chemical" major mime type was never added to production database - https://phabricator.wikimedia.org/T277354 [11:30:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:31:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:31:41] Logged 
the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:32:13] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [11:33:45] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db[1155,1165].eqiad.wmnet with reason: Maintenance T277354 [11:33:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:33:50] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db[1155,1165].eqiad.wmnet with reason: Maintenance T277354 [11:33:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1165 (T277354)', diff saved to https://phabricator.wikimedia.org/P17932 and previous config saved to /var/cache/conftool/dbconfig/20211201-113354-marostegui.json [11:34:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:34:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:35:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'After maintenance db1165 (T277354)', diff saved to https://phabricator.wikimedia.org/P17933 and previous config saved to /var/cache/conftool/dbconfig/20211201-113506-marostegui.json [11:35:30] T277354: "chemical" major mime type was never added to production database - https://phabricator.wikimedia.org/T277354 [11:35:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:36:12] (03PS1) 10Majavah: admin: update my dotfiles [puppet] - 10https://gerrit.wikimedia.org/r/742946 [11:38:04] (03PS2) 10Muehlenhoff: Revert "Prefer mx1001 over mx2001 for smart hosts / wiki mail" [puppet] - 10https://gerrit.wikimedia.org/r/742757 [11:42:36] (03CR) 10Muehlenhoff: [C: 03+2] Revert "Prefer mx1001 over mx2001 for smart hosts / wiki mail" [puppet] - 10https://gerrit.wikimedia.org/r/742757 (owner: 10Muehlenhoff) [11:44:53] (03PS2) 10Muehlenhoff: 
Revert "Prefer mx1001 over mx2001 for weights in MX records" [dns] - 10https://gerrit.wikimedia.org/r/742754 [11:47:38] (03CR) 10Giuseppe Lavagetto: [C: 03+2] admin: update my dotfiles [puppet] - 10https://gerrit.wikimedia.org/r/742946 (owner: 10Majavah) [11:50:02] (03CR) 10Muehlenhoff: [C: 03+2] Revert "Prefer mx1001 over mx2001 for weights in MX records" [dns] - 10https://gerrit.wikimedia.org/r/742754 (owner: 10Muehlenhoff) [11:50:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'After maintenance db1165 (T277354)', diff saved to https://phabricator.wikimedia.org/P17934 and previous config saved to /var/cache/conftool/dbconfig/20211201-115011-marostegui.json [11:50:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:50:47] T277354: "chemical" major mime type was never added to production database - https://phabricator.wikimedia.org/T277354 [11:52:09] (03PS1) 10Giuseppe Lavagetto: mediawiki: several fixes to rsyslog [deployment-charts] - 10https://gerrit.wikimedia.org/r/742947 [11:52:36] (03PS2) 10Kormat: cumin: Split out backup sources from db-store alias [puppet] - 10https://gerrit.wikimedia.org/r/742453 (https://phabricator.wikimedia.org/T296285) [11:54:25] (03CR) 10Kormat: [C: 03+2] cumin: Split out backup sources from db-store alias [puppet] - 10https://gerrit.wikimedia.org/r/742453 (https://phabricator.wikimedia.org/T296285) (owner: 10Kormat) [11:55:05] (03PS2) 10Muehlenhoff: Point back irc.wikimedia.org to irc2001 [dns] - 10https://gerrit.wikimedia.org/r/742730 (https://phabricator.wikimedia.org/T296721) [11:57:03] 10Puppet, 10DBA, 10Data-Engineering, 10Infrastructure-Foundations, 10Patch-For-Review: Split mariadb::dbstore_multiinstance into 2 separate roles (backup sources and analytics) - https://phabricator.wikimedia.org/T296285 (10Kormat) 05Open→03Resolved Cumin alias change is merged, so i'm going to optim... 
[11:59:37] (03CR) 10Muehlenhoff: [C: 03+2] Point back irc.wikimedia.org to irc2001 [dns] - 10https://gerrit.wikimedia.org/r/742730 (https://phabricator.wikimedia.org/T296721) (owner: 10Muehlenhoff) [12:00:05] Amir1, Lucas_WMDE, awight, and Urbanecm: That opportune time is upon us again. Time for a UTC morning backport window deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211201T1200). [12:00:05] Inductiveload and nn1l2: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [12:00:07] o/ [12:00:11] hey! [12:00:17] o/ [12:00:18] hi [12:00:24] i can deploy today [12:00:33] inductiveload: it looks like nobody on enwikisource replied to your config change suggestion yet? :/ [12:00:41] no [12:00:47] but also no one opposed [12:00:49] (03CR) 10Jelto: "This change is ready for review." (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/742166 (owner: 10Jelto) [12:01:16] which is about par for the course [12:01:32] (03PS3) 10Urbanecm: Add templateeditor group and protection level at viwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/741972 (https://phabricator.wikimedia.org/T296154) (owner: 104nn1l2) [12:01:36] (03CR) 10Urbanecm: [C: 03+2] Add templateeditor group and protection level at viwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/741972 (https://phabricator.wikimedia.org/T296154) (owner: 104nn1l2) [12:02:14] (03PS1) 10Volans: sre.hosts.reimage: support Ganeti hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/742948 [12:02:32] (03Merged) 10jenkins-bot: Add templateeditor group and protection level at viwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/741972 (https://phabricator.wikimedia.org/T296154) (owner: 104nn1l2) [12:03:14] nn1l2: your patch is at mwdebug1001 [12:03:17] can you test? 
[12:04:03] (03PS2) 10Volans: sre.hosts.reimage: support Ganeti hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/742948 [12:04:04] LGTM https://vi.wikipedia.org/wiki/%C4%90%E1%BA%B7c_bi%E1%BB%87t:Quy%E1%BB%81n_nh%C3%B3m_ng%C6%B0%E1%BB%9Di_d%C3%B9ng [12:04:13] thanks, syncing [12:05:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'After maintenance db1165 (T277354)', diff saved to https://phabricator.wikimedia.org/P17935 and previous config saved to /var/cache/conftool/dbconfig/20211201-120515-marostegui.json [12:05:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:05:40] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 2bd14e8968c90b2562f045457d61b252728e6250: Add templateeditor group and protection level at viwiki (T296154) (duration: 00m 56s) [12:05:51] T277354: "chemical" major mime type was never added to production database - https://phabricator.wikimedia.org/T277354 [12:05:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:05:56] nn1l2: should be live [12:06:11] Lucas_WMDE: I'm okay with doing inductiveload's change, a week should be enough [12:06:15] are you fine with that? 
[12:06:17] ok [12:06:23] no blocker from me [12:06:27] T296154: Template editor group for vi.wiki - https://phabricator.wikimedia.org/T296154 [12:06:28] thanks [12:06:33] (03PS4) 10Urbanecm: enwikisource: enable anonymous talk page mobile tabs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/741097 (https://phabricator.wikimedia.org/T47955) (owner: 10Inductiveload) [12:06:42] (03CR) 10Urbanecm: [C: 03+2] enwikisource: enable anonymous talk page mobile tabs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/741097 (https://phabricator.wikimedia.org/T47955) (owner: 10Inductiveload) [12:06:44] i did try to get input ^_^ [12:06:47] yes, thanks, urbanecm [12:07:10] np [12:07:16] 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (Kanban): cloud: decide on general idea for having cloud-dedicated hardware provide service in the cloud realm & the internet - https://phabricator.wikimedia.org/T296411 (10aborrero) create and shared a spreadsheet trying to capture/compa... [12:07:28] (03Merged) 10jenkins-bot: enwikisource: enable anonymous talk page mobile tabs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/741097 (https://phabricator.wikimedia.org/T47955) (owner: 10Inductiveload) [12:07:49] inductiveload: available at mwdebug1001, can you have a look? [12:08:42] urbanecm: it is working [12:08:47] thanks, syncing [12:09:23] (03PS1) 10Ladsgroup: dbtools: Add auto_schema to reduce the toil in schema changes [software] - 10https://gerrit.wikimedia.org/r/742950 [12:10:05] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: c8ab29b2feb47d611873cf0465b2a2dd5eac0ad2: enwikisource: enable anonymous talk page mobile tabs (T47955) (duration: 00m 56s) [12:10:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:10:09] inductiveload: live [12:10:11] anything else? 
[12:10:19] (03CR) 10Marostegui: [C: 03+1] dbtools: Add auto_schema to reduce the toil in schema changes [software] - 10https://gerrit.wikimedia.org/r/742950 (owner: 10Ladsgroup) [12:10:31] (03CR) 10jerkins-bot: [V: 04-1] dbtools: Add auto_schema to reduce the toil in schema changes [software] - 10https://gerrit.wikimedia.org/r/742950 (owner: 10Ladsgroup) [12:10:37] confirmed, working [12:10:40] T47955: Provide a navigation option (next/previous/index page) for proofread page in mobile view - https://phabricator.wikimedia.org/T47955 [12:10:42] no, that's it for today, thank you very much :-) [12:10:55] great :) [12:11:08] !log EU B&C window done [12:11:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:12:38] (03PS1) 10Volans: dhcp: fix missing semicolon in DHCP config [software/spicerack] - 10https://gerrit.wikimedia.org/r/742951 [12:19:26] well that resolved a UX bug that's been a "thing" at WS since before Phabricator ✌️ [12:19:31] (at least for enWS, RIP the others) [12:19:35] (03PS2) 10Ladsgroup: dbtools: Add auto_schema to reduce the toil in schema changes [software] - 10https://gerrit.wikimedia.org/r/742950 [12:20:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'After maintenance db1165 (T277354)', diff saved to https://phabricator.wikimedia.org/P17936 and previous config saved to /var/cache/conftool/dbconfig/20211201-122020-marostegui.json [12:20:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:20:56] T277354: "chemical" major mime type was never added to production database - https://phabricator.wikimedia.org/T277354 [12:21:29] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] "Fixed pep8, merging" [software] - 10https://gerrit.wikimedia.org/r/742950 (owner: 10Ladsgroup) [12:22:10] (03Merged) 10jenkins-bot: dbtools: Add auto_schema to reduce the toil in schema changes [software] - 10https://gerrit.wikimedia.org/r/742950 (owner: 10Ladsgroup) [12:31:25] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM. 
You could add an extra layer of validation by checking the ultimate Vlan ID derived, and making sure the device primary IP is from " [cookbooks] - 10https://gerrit.wikimedia.org/r/742948 (owner: 10Volans) [12:39:22] (03CR) 10Hnowlan: api-gateway: allow discovery services to set custom rate limits (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/741937 (https://phabricator.wikimedia.org/T295956) (owner: 10Hnowlan) [12:39:35] (03PS3) 10Hnowlan: api-gateway: allow discovery services to set custom rate limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/741937 (https://phabricator.wikimedia.org/T295956) [12:43:36] (03PS1) 10Ladsgroup: auto_schema: Fix asking for replicas that have hanging replicas [software] - 10https://gerrit.wikimedia.org/r/742952 [12:44:48] (03CR) 10Ladsgroup: [C: 03+2] auto_schema: Fix asking for replicas that have hanging replicas [software] - 10https://gerrit.wikimedia.org/r/742952 (owner: 10Ladsgroup) [12:45:25] (03Merged) 10jenkins-bot: auto_schema: Fix asking for replicas that have hanging replicas [software] - 10https://gerrit.wikimedia.org/r/742952 (owner: 10Ladsgroup) [12:49:13] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db1131.eqiad.wmnet with reason: Maintenance T277354 [12:49:15] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1131.eqiad.wmnet with reason: Maintenance T277354 [12:49:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:49:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1131 (T277354)', diff saved to https://phabricator.wikimedia.org/P17937 and previous config saved to /var/cache/conftool/dbconfig/20211201-124919-marostegui.json [12:49:49] T277354: "chemical" major mime type was never added to production database - https://phabricator.wikimedia.org/T277354 [12:49:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:50:17] RECOVERY - 
Backup freshness on backup1001 is OK: Fresh: 103 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [12:50:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:50:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'After maintenance db1131 (T277354)', diff saved to https://phabricator.wikimedia.org/P17938 and previous config saved to /var/cache/conftool/dbconfig/20211201-125031-marostegui.json [12:50:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:59:50] (03PS1) 10Muehlenhoff: Add more Phabricator task references [puppet] - 10https://gerrit.wikimedia.org/r/742954 [13:00:30] (03CR) 10Filippo Giunchedi: "LGTM (post-merge), a question/note: since there doesn't seem to be anything kafka-specific here the recipe could be also named after the m" [puppet] - 10https://gerrit.wikimedia.org/r/742926 (https://phabricator.wikimedia.org/T296641) (owner: 10Elukey) [13:01:27] (03PS2) 10Muehlenhoff: Add more Phabricator task references [puppet] - 10https://gerrit.wikimedia.org/r/742954 [13:01:39] 10SRE, 10Infrastructure-Foundations, 10netops: Represent sub-interface and bridge device assocations in Netbox - https://phabricator.wikimedia.org/T296832 (10cmooney) [13:02:19] (03PS3) 10Jelto: profile::gitlab-runner add hieradata for protected GitLab Runners [puppet] - 10https://gerrit.wikimedia.org/r/742458 (https://phabricator.wikimedia.org/T295481) [13:05:28] (03CR) 10Muehlenhoff: [C: 03+2] Add more Phabricator task references [puppet] - 10https://gerrit.wikimedia.org/r/742954 (owner: 10Muehlenhoff) [13:05:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'After maintenance db1131 (T277354)', diff saved to https://phabricator.wikimedia.org/P17939 and previous config saved to /var/cache/conftool/dbconfig/20211201-130536-marostegui.json [13:05:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:06:12] T277354: "chemical" major mime type was never added to production database - 
https://phabricator.wikimedia.org/T277354 [13:07:48] 10SRE, 10Infrastructure-Foundations, 10netops: Represent sub-interface and bridge device assocations in Netbox - https://phabricator.wikimedia.org/T296832 (10cmooney) [13:08:28] (03PS1) 10Muehlenhoff: Add migration tracking date for aqs [puppet] - 10https://gerrit.wikimedia.org/r/742956 [13:11:43] (03CR) 10Jelto: [C: 03+2] profile::gitlab-runner add hieradata for protected GitLab Runners (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/742458 (https://phabricator.wikimedia.org/T295481) (owner: 10Jelto) [13:12:09] PROBLEM - SSH on kubernetes1004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:15:26] (03PS3) 10Jelto: site: use gitlab_runner role on gitlab-runner1001 [puppet] - 10https://gerrit.wikimedia.org/r/740691 (https://phabricator.wikimedia.org/T295481) (owner: 10Dzahn) [13:18:27] (03CR) 10Ayounsi: [C: 03+1] sre.hosts.reimage: support Ganeti hosts (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/742948 (owner: 10Volans) [13:19:38] !log restore haproxy 2.2.9 on cp3064 - T290005 [13:19:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:14] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [13:20:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'After maintenance db1131 (T277354)', diff saved to https://phabricator.wikimedia.org/P17942 and previous config saved to /var/cache/conftool/dbconfig/20211201-132041-marostegui.json [13:20:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:17] T277354: "chemical" major mime type was never added to production database - https://phabricator.wikimedia.org/T277354 [13:21:41] (03CR) 10Ayounsi: sre.hosts.reimage: support Ganeti hosts (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/742948 (owner: 10Volans) 
[13:21:47] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10Vgutierrez) [13:23:45] (03PS1) 10Vgutierrez: site: Reimage cp1089 as cache::text_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/742958 (https://phabricator.wikimedia.org/T290005) [13:28:49] (03CR) 10Alexandros Kosiaris: [C: 03+1] "LGTM. Thanks for making it consistent across all services!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/742166 (owner: 10Jelto) [13:30:25] !log set "sudo gnt-cluster modify --hypervisor-parameters kvm:machine_version=pc-i440fx-2.8" for ganeti eqiad cluster T294120 [13:30:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:34] T294120: Migrate eqiad Ganeti cluster to KVM machine type pc-i440fx-2.8 - https://phabricator.wikimedia.org/T294120 [13:32:22] (03PS6) 10Jbond: cookbook sre.puppet.netbox: Cookbook for syncing netbox puppet data [cookbooks] - 10https://gerrit.wikimedia.org/r/739234 [13:34:43] 10SRE, 10Infrastructure-Foundations: Migrate eqiad Ganeti cluster to KVM machine type pc-i440fx-2.8 - https://phabricator.wikimedia.org/T294120 (10MoritzMuehlenhoff) [13:35:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'After maintenance db1131 (T277354)', diff saved to https://phabricator.wikimedia.org/P17944 and previous config saved to /var/cache/conftool/dbconfig/20211201-133546-marostegui.json [13:35:48] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db1168.eqiad.wmnet with reason: Maintenance T277354 [13:35:48] 10SRE, 10Infrastructure-Foundations: Migrate eqiad Ganeti cluster to KVM machine type pc-i440fx-2.8 - https://phabricator.wikimedia.org/T294120 (10MoritzMuehlenhoff) [13:35:50] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1168.eqiad.wmnet with reason: Maintenance T277354 [13:35:50] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1168 (T277354)', diff saved to https://phabricator.wikimedia.org/P17945 and previous config saved to /var/cache/conftool/dbconfig/20211201-133554-marostegui.json [13:36:22] T277354: "chemical" major mime type was never added to production database - https://phabricator.wikimedia.org/T277354 [13:36:23] (03CR) 10jerkins-bot: [V: 04-1] cookbook sre.puppet.netbox: Cookbook for syncing netbox puppet data [cookbooks] - 10https://gerrit.wikimedia.org/r/739234 (owner: 10Jbond) [13:36:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:37:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'After maintenance db1168 (T277354)', diff saved to https://phabricator.wikimedia.org/P17946 and previous config saved to /var/cache/conftool/dbconfig/20211201-133705-marostegui.json [13:37:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:37:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:42:27] (03PS3) 10Volans: sre.hosts.reimage: support Ganeti hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/742948 (https://phabricator.wikimedia.org/T296832) [13:42:32] (03CR) 10Volans: "addressed comments" [cookbooks] - 10https://gerrit.wikimedia.org/r/742948 (https://phabricator.wikimedia.org/T296832) (owner: 10Volans) [13:43:18] (03CR) 10Ayounsi: [C: 03+1] sre.hosts.reimage: support Ganeti hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/742948 (https://phabricator.wikimedia.org/T296832) (owner: 10Volans) [13:45:42] (03CR) 10Ayounsi: [C: 03+1] dhcp: fix missing semicolon in DHCP config [software/spicerack] - 10https://gerrit.wikimedia.org/r/742951 (owner: 10Volans) [13:46:35] (03CR) 10Volans: [C: 03+2] sre.hosts.reimage: support Ganeti hosts [cookbooks] - 
10https://gerrit.wikimedia.org/r/742948 (https://phabricator.wikimedia.org/T296832) (owner: 10Volans) [13:47:29] (03CR) 10Volans: [C: 03+2] dhcp: fix missing semicolon in DHCP config [software/spicerack] - 10https://gerrit.wikimedia.org/r/742951 (owner: 10Volans) [13:47:50] (03PS1) 10Kormat: Set line length to 100 [software/wmfdb] - 10https://gerrit.wikimedia.org/r/742960 [13:49:00] (03CR) 10Kormat: [V: 03+2 C: 03+2] Set line length to 100 [software/wmfdb] - 10https://gerrit.wikimedia.org/r/742960 (owner: 10Kormat) [13:49:18] (03CR) 10Jelto: [C: 03+2] charts: fix affinity indentation in charts and scaffold chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/742166 (owner: 10Jelto) [13:49:22] (03CR) 10DCausse: [C: 03+1] Revert "Add repository-swift plugin" [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/741734 (https://phabricator.wikimedia.org/T295705) (owner: 10Ebernhardson) [13:50:09] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mediawiki: several fixes to rsyslog [deployment-charts] - 10https://gerrit.wikimedia.org/r/742947 (owner: 10Giuseppe Lavagetto) [13:50:58] (03Merged) 10jenkins-bot: sre.hosts.reimage: support Ganeti hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/742948 (https://phabricator.wikimedia.org/T296832) (owner: 10Volans) [13:52:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'After maintenance db1168 (T277354)', diff saved to https://phabricator.wikimedia.org/P17947 and previous config saved to /var/cache/conftool/dbconfig/20211201-135210-marostegui.json [13:52:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:47] T277354: "chemical" major mime type was never added to production database - https://phabricator.wikimedia.org/T277354 [13:53:26] (03Merged) 10jenkins-bot: charts: fix affinity indentation in charts and scaffold chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/742166 (owner: 10Jelto) [13:53:35] (03Merged) 10jenkins-bot: mediawiki: several fixes to rsyslog 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/742947 (owner: 10Giuseppe Lavagetto) [13:54:07] (03Merged) 10jenkins-bot: dhcp: fix missing semicolon in DHCP config [software/spicerack] - 10https://gerrit.wikimedia.org/r/742951 (owner: 10Volans) [13:55:26] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Represent sub-interface and bridge device assocations in Netbox - https://phabricator.wikimedia.org/T296832 (10Volans) @cmooney we have the possibility to add custom facts to puppetdb, we already have a bunch of them, or modify existing one... [13:55:28] !log oblivian@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [13:55:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:52] !log oblivian@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [13:56:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:51] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti2010.codfw.wmnet with OS buster [13:58:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:57] 10SRE, 10Infrastructure-Foundations: Migrate codfw Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296622 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti2010.codfw.wmnet with OS buster [14:03:39] (03PS1) 10Giuseppe Lavagetto: mediawiki: properly quote readMode in rsyslog template [deployment-charts] - 10https://gerrit.wikimedia.org/r/742963 [14:03:48] (03CR) 10jerkins-bot: [V: 04-1] mediawiki: properly quote readMode in rsyslog template [deployment-charts] - 10https://gerrit.wikimedia.org/r/742963 (owner: 10Giuseppe Lavagetto) [14:06:38] (03PS2) 10Giuseppe Lavagetto: mediawiki: properly quote readMode in rsyslog template [deployment-charts] - 10https://gerrit.wikimedia.org/r/742963 [14:06:53] (03PS1) 10Elukey: 
install_server: rename reuse-kafka-main recipe [puppet] - 10https://gerrit.wikimedia.org/r/742964 (https://phabricator.wikimedia.org/T296641) [14:07:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'After maintenance db1168 (T277354)', diff saved to https://phabricator.wikimedia.org/P17948 and previous config saved to /var/cache/conftool/dbconfig/20211201-140715-marostegui.json [14:07:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:07:39] (03CR) 10Elukey: install_server: add reuse recipe for kafka-main hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/742926 (https://phabricator.wikimedia.org/T296641) (owner: 10Elukey) [14:07:53] T277354: "chemical" major mime type was never added to production database - https://phabricator.wikimedia.org/T277354 [14:08:25] (03CR) 10Jelto: [C: 03+2] site: use gitlab_runner role on gitlab-runner1001 [puppet] - 10https://gerrit.wikimedia.org/r/740691 (https://phabricator.wikimedia.org/T295481) (owner: 10Dzahn) [14:11:55] (03PS1) 10Jbond: puppetboard: drop definitions for puppetboard [12]001 [puppet] - 10https://gerrit.wikimedia.org/r/742965 (https://phabricator.wikimedia.org/T296744) [14:12:14] (03PS2) 10Jbond: puppetboard: drop definitions for puppetboard [12]001 [puppet] - 10https://gerrit.wikimedia.org/r/742965 (https://phabricator.wikimedia.org/T296744) [14:12:20] (03CR) 10Jbond: [C: 03+2] puppetboard: drop definitions for puppetboard [12]001 [puppet] - 10https://gerrit.wikimedia.org/r/742965 (https://phabricator.wikimedia.org/T296744) (owner: 10Jbond) [14:13:00] !log started commonswiki codfw media backup at 8 threads of parallelism [14:13:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:42] !log jbond@cumin1001 START - Cookbook sre.hosts.decommission for hosts puppetboard2001.codfw.wmnet [14:13:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:15] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mediawiki: 
properly quote readMode in rsyslog template [deployment-charts] - 10https://gerrit.wikimedia.org/r/742963 (owner: 10Giuseppe Lavagetto) [14:17:39] (03Merged) 10jenkins-bot: mediawiki: properly quote readMode in rsyslog template [deployment-charts] - 10https://gerrit.wikimedia.org/r/742963 (owner: 10Giuseppe Lavagetto) [14:18:09] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations: Q2:(Need By: TBD) rack/setup/install ganeti102[5-8] - https://phabricator.wikimedia.org/T293909 (10Jclark-ctr) [14:18:16] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations: Q2:(Need By: TBD) rack/setup/install ganeti102[5-8] - https://phabricator.wikimedia.org/T293909 (10Jclark-ctr) added to netbox [14:20:13] (03CR) 10Muehlenhoff: [C: 03+1] "Makes sense!" [puppet] - 10https://gerrit.wikimedia.org/r/742964 (https://phabricator.wikimedia.org/T296641) (owner: 10Elukey) [14:22:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'After maintenance db1168 (T277354)', diff saved to https://phabricator.wikimedia.org/P17949 and previous config saved to /var/cache/conftool/dbconfig/20211201-142219-marostegui.json [14:22:21] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db1180.eqiad.wmnet with reason: Maintenance T277354 [14:22:23] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1180.eqiad.wmnet with reason: Maintenance T277354 [14:22:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1180 (T277354)', diff saved to https://phabricator.wikimedia.org/P17950 and previous config saved to /var/cache/conftool/dbconfig/20211201-142227-marostegui.json [14:22:56] T277354: "chemical" major mime type was never added to production database - https://phabricator.wikimedia.org/T277354 [14:22:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:09] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:30] 10SRE, 10ops-eqiad, 10DC-Ops, 10Kubernetes: Q2:(Need By: TBD) rack/setup/install kubernetes1022 - https://phabricator.wikimedia.org/T294301 (10Jclark-ctr) [14:23:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'After maintenance db1180 (T277354)', diff saved to https://phabricator.wikimedia.org/P17951 and previous config saved to /var/cache/conftool/dbconfig/20211201-142339-marostegui.json [14:23:44] 10SRE, 10ops-eqiad, 10DC-Ops, 10Kubernetes: Q2:(Need By: TBD) rack/setup/install kubernetes1022 - https://phabricator.wikimedia.org/T294301 (10Jclark-ctr) added host to netbox [14:23:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:24:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:25:37] (03CR) 10Filippo Giunchedi: [C: 03+1] "nice, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/742964 (https://phabricator.wikimedia.org/T296641) (owner: 10Elukey) [14:27:26] (03PS1) 10Jelto: profile::gitlab_runner rename hiera file to gitlab_runner [puppet] - 10https://gerrit.wikimedia.org/r/742966 (https://phabricator.wikimedia.org/T295481) [14:28:19] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts puppetboard2001.codfw.wmnet [14:28:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:28:25] 10SRE, 10Patch-For-Review: Migrate puppetboard to Bullseye - https://phabricator.wikimedia.org/T264276 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jbond@cumin1001 for hosts: `puppetboard2001.codfw.wmnet` - puppetboard2001.codfw.wmnet (**PASS**) - Downtimed host on Icinga - Found Gan... 
[14:29:29] !log jbond@cumin1001 START - Cookbook sre.hosts.decommission for hosts puppetboard1001.eqiad.wmnet [14:29:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:39] (03CR) 10Elukey: [C: 03+2] install_server: rename reuse-kafka-main recipe [puppet] - 10https://gerrit.wikimedia.org/r/742964 (https://phabricator.wikimedia.org/T296641) (owner: 10Elukey) [14:31:21] (03CR) 10Jelto: [C: 03+2] profile::gitlab_runner rename hiera file to gitlab_runner [puppet] - 10https://gerrit.wikimedia.org/r/742966 (https://phabricator.wikimedia.org/T295481) (owner: 10Jelto) [14:31:49] (03CR) 10Majavah: [C: 03+1] hewiki: add "templateeditor" permission group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/742833 (https://phabricator.wikimedia.org/T296769) (owner: 104nn1l2) [14:32:34] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Q2:(Need By: TBD) rack/setup/install ms-fe1009-1012 - https://phabricator.wikimedia.org/T294137 (10Jclark-ctr) [14:32:44] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Q2:(Need By: TBD) rack/setup/install ms-fe1009-1012 - https://phabricator.wikimedia.org/T294137 (10Jclark-ctr) host added to netbox [14:38:01] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts puppetboard1001.eqiad.wmnet [14:38:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:07] 10SRE, 10Patch-For-Review: Migrate puppetboard to Bullseye - https://phabricator.wikimedia.org/T264276 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jbond@cumin1001 for hosts: `puppetboard1001.eqiad.wmnet` - puppetboard1001.eqiad.wmnet (**PASS**) - Downtimed host on Icinga - Found Gan... 
[14:38:11] 10Puppet, 10Infrastructure-Foundations, 10Patch-For-Review, 10User-jbond: Decommission puppetboard[12]001 - https://phabricator.wikimedia.org/T296744 (10jbond) [14:38:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'After maintenance db1180 (T277354)', diff saved to https://phabricator.wikimedia.org/P17953 and previous config saved to /var/cache/conftool/dbconfig/20211201-143843-marostegui.json [14:38:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:19] T277354: "chemical" major mime type was never added to production database - https://phabricator.wikimedia.org/T277354 [14:40:22] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudbackup100[34] - https://phabricator.wikimedia.org/T293934 (10Jclark-ctr) [14:40:39] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudbackup100[34] - https://phabricator.wikimedia.org/T293934 (10Jclark-ctr) Servers added to netbox [14:42:26] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ganeti2010.codfw.wmnet with OS buster [14:42:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:30] 10SRE, 10Infrastructure-Foundations: Migrate codfw Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296622 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti2010.codfw.wmnet with OS buster executed with errors: - ganeti2010 (**FAIL**) - Downtimed... 
[14:44:58] (03PS1) 10Elukey: install_server: set test reuse recipe for kafka-main2003 [puppet] - 10https://gerrit.wikimedia.org/r/742969 (https://phabricator.wikimedia.org/T296641) [14:45:09] (03CR) 10Majavah: P::doc: sync data to non-active servers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/741713 (https://phabricator.wikimedia.org/T247653) (owner: 10Majavah) [14:45:36] (03PS1) 10Jbond: puppetboard: clean up old files [puppet] - 10https://gerrit.wikimedia.org/r/742970 (https://phabricator.wikimedia.org/T296744) [14:46:17] (03CR) 10Jbond: [V: 03+2 C: 03+2] puppetboard: clean up old files [puppet] - 10https://gerrit.wikimedia.org/r/742970 (https://phabricator.wikimedia.org/T296744) (owner: 10Jbond) [14:46:24] (03PS2) 10Jbond: puppetboard: clean up old files [puppet] - 10https://gerrit.wikimedia.org/r/742970 (https://phabricator.wikimedia.org/T296744) [14:46:26] (03CR) 10Jbond: [V: 03+2 C: 03+2] puppetboard: clean up old files [puppet] - 10https://gerrit.wikimedia.org/r/742970 (https://phabricator.wikimedia.org/T296744) (owner: 10Jbond) [14:46:51] (03PS2) 10Elukey: install_server: set test reuse recipe for kafka-main2003 [puppet] - 10https://gerrit.wikimedia.org/r/742969 (https://phabricator.wikimedia.org/T296641) [14:47:45] (03CR) 10Elukey: "This change merged will not mean a straight reimage, we can figure out when/how to do it etc.." 
[puppet] - 10https://gerrit.wikimedia.org/r/742969 (https://phabricator.wikimedia.org/T296641) (owner: 10Elukey) [14:51:22] (03CR) 10Btullis: [C: 03+2] Remove three more HDFS checks from Icinga [puppet] - 10https://gerrit.wikimedia.org/r/734662 (https://phabricator.wikimedia.org/T293399) (owner: 10Btullis) [14:53:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'After maintenance db1180 (T277354)', diff saved to https://phabricator.wikimedia.org/P17954 and previous config saved to /var/cache/conftool/dbconfig/20211201-145348-marostegui.json [14:53:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:24] T277354: "chemical" major mime type was never added to production database - https://phabricator.wikimedia.org/T277354 [15:05:05] (03CR) 10Kormat: [C: 03+1] install_server: set test reuse recipe for kafka-main2003 [puppet] - 10https://gerrit.wikimedia.org/r/742969 (https://phabricator.wikimedia.org/T296641) (owner: 10Elukey) [15:07:33] PROBLEM - Disk space on gitlab-runner1001 is CRITICAL: DISK CRITICAL - /run/docker/netns/b6069efe92fa is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=gitlab-runner1001&var-datasource=eqiad+prometheus/ops [15:08:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'After maintenance db1180 (T277354)', diff saved to https://phabricator.wikimedia.org/P17955 and previous config saved to /var/cache/conftool/dbconfig/20211201-150853-marostegui.json [15:08:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:09:29] T277354: "chemical" major mime type was never added to production database - https://phabricator.wikimedia.org/T277354 [15:10:19] (03PS3) 10Ottomata: Airflow 2.2.2 with extra dependencies [debs/airflow] (debian) - 10https://gerrit.wikimedia.org/r/742813 (https://phabricator.wikimedia.org/T295380) [15:14:01] RECOVERY - SSH on kubernetes1004.mgmt is OK: 
SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:15:08] !log oblivian@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [15:15:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:47] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti2010.codfw.wmnet with OS buster [15:17:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:52] 10SRE, 10Infrastructure-Foundations: Migrate codfw Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296622 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti2010.codfw.wmnet with OS buster [15:27:15] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ganeti2010.codfw.wmnet with OS buster [15:27:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:27:18] 10SRE, 10Infrastructure-Foundations: Migrate codfw Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296622 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti2010.codfw.wmnet with OS buster executed with errors: - ganeti2010 (**FAIL**) - Removed f... 
[15:27:31] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti2010.codfw.wmnet with OS buster [15:27:35] 10SRE, 10Infrastructure-Foundations: Migrate codfw Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296622 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti2010.codfw.wmnet with OS buster [15:27:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:39:58] (03CR) 10Filippo Giunchedi: [C: 03+1] "Chatted about this at the o11y team meeting, seems to be the right thing" [puppet] - 10https://gerrit.wikimedia.org/r/742923 (https://phabricator.wikimedia.org/T295706) (owner: 10Ladsgroup) [15:40:59] 10SRE, 10ops-codfw: Installation issues on ganeti2010 with buster / firmware update - https://phabricator.wikimedia.org/T296856 (10MoritzMuehlenhoff) [15:42:18] (03CR) 10Muehlenhoff: [C: 03+2] Add migration tracking date for aqs [puppet] - 10https://gerrit.wikimedia.org/r/742956 (owner: 10Muehlenhoff) [15:42:30] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ganeti2010.codfw.wmnet with OS buster [15:42:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:42:34] 10SRE, 10Infrastructure-Foundations: Migrate codfw Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296622 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti2010.codfw.wmnet with OS buster executed with errors: - ganeti2010 (**FAIL**) - Removed f... 
[15:43:46] (03PS2) 10Ladsgroup: logstash: Add maxSeconds and actualSeconds as numeric fields [puppet] - 10https://gerrit.wikimedia.org/r/742923 (https://phabricator.wikimedia.org/T295706) [15:43:49] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] logstash: Add maxSeconds and actualSeconds as numeric fields [puppet] - 10https://gerrit.wikimedia.org/r/742923 (https://phabricator.wikimedia.org/T295706) (owner: 10Ladsgroup) [15:44:32] (03PS1) 10Elukey: Test istio egress gateway endpoint for ml-services [deployment-charts] - 10https://gerrit.wikimedia.org/r/742979 (https://phabricator.wikimedia.org/T294414) [15:45:39] (03CR) 10Cwhite: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/742923 (https://phabricator.wikimedia.org/T295706) (owner: 10Ladsgroup) [15:50:52] (03CR) 10Elukey: [C: 03+2] Test istio egress gateway endpoint for ml-services [deployment-charts] - 10https://gerrit.wikimedia.org/r/742979 (https://phabricator.wikimedia.org/T294414) (owner: 10Elukey) [15:52:53] 10SRE, 10ops-codfw: Installation issues on ganeti2010 with buster / firmware update - https://phabricator.wikimedia.org/T296856 (10Papaul) p:05Triage→03Medium [15:53:12] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality' for release 'main' . 
[15:53:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:54:15] !log bounce logstash on eqiad/codfw to apply template changes [15:54:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:54:50] 10SRE, 10ops-codfw: Installation issues on ganeti2010 with buster / firmware update - https://phabricator.wikimedia.org/T296856 (10Papaul) ` BIOS Version 1.7.0 iDRAC Firmware Version 3.30.30.30 [16:07:30] (03PS1) 10Herron: striker: send dev logs to logstash pipeline via localhost [puppet] - 10https://gerrit.wikimedia.org/r/742983 [16:08:04] (03CR) 10jerkins-bot: [V: 04-1] striker: send dev logs to logstash pipeline via localhost [puppet] - 10https://gerrit.wikimedia.org/r/742983 (owner: 10Herron) [16:08:35] !log installing postgresql-9.6 security updates [16:08:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:11:05] (03PS2) 10Herron: striker: send dev logs to logstash pipeline via localhost [puppet] - 10https://gerrit.wikimedia.org/r/742983 [16:16:44] (03PS3) 10Herron: striker: send dev logs to logstash pipeline via localhost [puppet] - 10https://gerrit.wikimedia.org/r/742983 [16:21:56] (03CR) 10Herron: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/742983 (owner: 10Herron) [16:23:07] (03PS1) 10Jelto: profile::gitlab_runner unregister runner gitlab-runner1001 [puppet] - 10https://gerrit.wikimedia.org/r/742986 (https://phabricator.wikimedia.org/T295481) [16:26:40] (03CR) 10Jelto: [C: 03+2] profile::gitlab_runner unregister runner gitlab-runner1001 [puppet] - 10https://gerrit.wikimedia.org/r/742986 (https://phabricator.wikimedia.org/T295481) (owner: 10Jelto) [16:30:31] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for Daimona - https://phabricator.wikimedia.org/T295993 (10Daimona) @Dzahn Thank you for the info; [[https://wikitech.wikimedia.org/wiki/Special:Contributions/ELeoni | here ]] is my work wikitech account. And apologies for the back and forth! 
[16:30:32] RECOVERY - Disk space on gitlab-runner1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=gitlab-runner1001&var-datasource=eqiad+prometheus/ops [16:33:31] PROBLEM - Check systemd state on gitlab-runner1001 is CRITICAL: CRITICAL - degraded: The following units failed: docker-resource-monitor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:36:59] (03CR) 10Herron: "Hello, striker logs are currently being shipped to the legacy logstash cluster which is planned to be retired soon. This will move them o" [puppet] - 10https://gerrit.wikimedia.org/r/742983 (owner: 10Herron) [16:41:22] 10SRE, 10ops-ulsfo: ps1-22-ulsfo Cord, Master_Cord_A, Active Power alerting - https://phabricator.wikimedia.org/T294891 (10RobH) >>! In T294891#7539936, @ayounsi wrote: > Many sensors are now over threshold, see the red in: https://librenms.wikimedia.org/device/173/ So I thought this was already modified by y... [16:48:10] (03PS4) 10Cwhite: site: consolidate logstash node definitions [puppet] - 10https://gerrit.wikimedia.org/r/742778 (https://phabricator.wikimedia.org/T288621) [16:58:42] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:00:10] 10SRE, 10ops-ulsfo: ps1-22-ulsfo Cord, Master_Cord_A, Active Power alerting - https://phabricator.wikimedia.org/T294891 (10RobH) I'm echoing my changes here for sanity checking: * Dual feed of 120V @ 20 AMPs (I'm uncertain if ulsfo is 20 or 30 amp so lets just use 20 as the limit for now.) ** set each feeds m... 
[17:01:54] RECOVERY - Check systemd state on sodium is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:04:58] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:07:34] (03CR) 10Herron: [C: 03+1] site: consolidate logstash node definitions [puppet] - 10https://gerrit.wikimedia.org/r/742778 (https://phabricator.wikimedia.org/T288621) (owner: 10Cwhite) [17:08:48] (03CR) 10Herron: [C: 03+1] hiera: add opensearch production configuration [puppet] - 10https://gerrit.wikimedia.org/r/742780 (https://phabricator.wikimedia.org/T288621) (owner: 10Cwhite) [17:17:51] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "LGTM, but please collect +1 from Andrew Bogott and Bryan Davis." [puppet] - 10https://gerrit.wikimedia.org/r/742983 (owner: 10Herron) [17:19:10] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] striker: send dev logs to logstash pipeline via localhost (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/742983 (owner: 10Herron) [17:19:39] (03CR) 10BryanDavis: [C: 03+1] "None of the current logs are getting into ELK as far as I know (T151422), so any attempt to make that better is worthwhile." 
[puppet] - 10https://gerrit.wikimedia.org/r/742983 (owner: 10Herron) [17:24:34] (03PS4) 10Herron: striker: send dev logs to logstash pipeline via localhost [puppet] - 10https://gerrit.wikimedia.org/r/742983 [17:25:59] (03CR) 10Herron: striker: send dev logs to logstash pipeline via localhost (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/742983 (owner: 10Herron) [17:29:57] (03CR) 10Jdlrobson: [C: 03+1] Add mediawiki.web_ui_scroll stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/742834 (https://phabricator.wikimedia.org/T292586) (owner: 10Clare Ming) [17:40:51] 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (Kanban): cloud: decide on general idea for having cloud-dedicated hardware provide service in the cloud realm & the internet - https://phabricator.wikimedia.org/T296411 (10aborrero) We had a meeting today, rough summary: The idea is rou... [17:41:40] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] striker: send dev logs to logstash pipeline via localhost [puppet] - 10https://gerrit.wikimedia.org/r/742983 (owner: 10Herron) [17:43:35] (03PS3) 10Cathal Mooney: Added option to disable Capirca ACL generation completely. 
[software/homer] - 10https://gerrit.wikimedia.org/r/742942 [17:50:35] (03PS1) 10Majavah: replace deployment-prep mx host [puppet] - 10https://gerrit.wikimedia.org/r/742988 [17:53:35] !log depool cp1089 to be reimaged as cache::text_haproxy - T290005 [17:53:35] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [17:53:49] (03PS1) 10Jelto: admin_ng: remove tiller [deployment-charts] - 10https://gerrit.wikimedia.org/r/742989 (https://phabricator.wikimedia.org/T251305) [17:53:54] vgutierrez: you had a space before the !log and it didn't count [17:54:07] majavah: thanks :) [17:54:12] !log depool cp1089 to be reimaged as cache::text_haproxy - T290005 [17:54:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:54:50] 10SRE, 10serviceops, 10Kubernetes, 10Patch-For-Review: Migrate to helm v3 - https://phabricator.wikimedia.org/T251305 (10Jelto) [17:55:43] (03PS2) 10Vgutierrez: site: Reimage cp1089 as cache::text_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/742958 (https://phabricator.wikimedia.org/T290005) [17:56:38] (03CR) 10Vgutierrez: [C: 03+2] site: Reimage cp1089 as cache::text_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/742958 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [17:58:08] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reimage for host cp1089.eqiad.wmnet with OS buster [17:58:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:58:14] 10SRE, 10Infrastructure-Foundations, 10netops: Represent sub-interface and bridge device assocations in Netbox - https://phabricator.wikimedia.org/T296832 (10cmooney) @volans thanks for the info. Sounds like we have a way forward if we want to do this. And certainly if we expand our use of bridges, sub-int... 
[17:58:24] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vgutierrez@cumin1001 for host cp1089.eqiad.wmnet with OS buster [18:20:06] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=cache_haproxy_tls site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:23:12] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:23:25] (03PS1) 10Dwisehaupt: Move the fundraisingdb-read handle to frdb1004 [dns] - 10https://gerrit.wikimedia.org/r/742991 [18:26:40] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=cache_haproxy_tls site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:27:29] (03CR) 10Nray: [C: 03+1] Add mediawiki.web_ui_scroll stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/742834 (https://phabricator.wikimedia.org/T292586) (owner: 10Clare Ming) [18:27:50] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:30:34] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=cache_haproxy_tls site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:33:26] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:34:41] (03CR) 10BBlack: [C: 03+2] Move the fundraisingdb-read handle to frdb1004 [dns] - 10https://gerrit.wikimedia.org/r/742991 (owner: 10Dwisehaupt) [18:38:18] PROBLEM - Check systemd state on doc1001 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-doc-doc1002.eqiad.wmnet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:39:54] !log pool cp1089 using HAProxy as TLS terminator - T290005 [18:39:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:39:58] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [18:41:04] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp1089.eqiad.wmnet with OS buster [18:41:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:41:15] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vgutierrez@cumin1001 for host cp1089.eqiad.wmnet with OS buster c... 
[18:41:21] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10Vgutierrez) [18:41:37] 10SRE, 10ops-codfw: Installation issues on ganeti2010 with buster / firmware update - https://phabricator.wikimedia.org/T296856 (10Papaul) 05Open→03Resolved this is complete ` BIOS Version 2.12.2 iDRAC Firmware Version 5.00.10.20 [18:41:58] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [18:42:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:44:26] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:44:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:44:39] 10SRE, 10ops-codfw: Installation issues on ganeti2010 with buster / firmware update - https://phabricator.wikimedia.org/T296856 (10MoritzMuehlenhoff) Thanks for the quick turnaround, much appreciated! I'll kick off another reimage attempt tomorrow. [19:00:05] RoanKattouw and Urbanecm: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC evening backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211201T1900). [19:00:05] cjming: A patch you scheduled for UTC evening backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [19:00:11] * urbanecm waves [19:00:14] hey! [19:00:17] o/ [19:00:25] I can deploy if urbanecm is around :P [19:00:34] I am in a deployable condition [19:00:37] note cjming is also a deployer [19:00:57] happy to defer to whoever [19:00:57] ah. do you want to self-service then? 
[19:01:21] sure - since i'm the only one in the queue [19:01:33] (03PS2) 10Clare Ming: Add mediawiki.web_ui_scroll stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/742834 (https://phabricator.wikimedia.org/T292586) [19:01:44] also majavah is a very new deployer. If they wish to try, it might make sense to let them [19:01:48] anyway, leaving it up to you two :)) [19:02:02] oh! np -- please go ahed majavah: [19:02:07] *ahead [19:02:07] sure! thanks [19:02:28] i just rebased - so i'll let you take it from there [19:03:04] (03CR) 10Majavah: [C: 03+2] "deploying" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/742834 (https://phabricator.wikimedia.org/T292586) (owner: 10Clare Ming) [19:03:06] (03PS4) 10Ottomata: Airflow 2.1.4 with extra dependencies [debs/airflow] (debian) - 10https://gerrit.wikimedia.org/r/742813 (https://phabricator.wikimedia.org/T295380) [19:03:47] (03CR) 10Ottomata: "I tried to do airflow 2.2.2 but had issues with certificates and libffi .so. Probably need to upgrade python first." [debs/airflow] (debian) - 10https://gerrit.wikimedia.org/r/742813 (https://phabricator.wikimedia.org/T295380) (owner: 10Ottomata) [19:04:19] (03Merged) 10jenkins-bot: Add mediawiki.web_ui_scroll stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/742834 (https://phabricator.wikimedia.org/T292586) (owner: 10Clare Ming) [19:05:02] cjming: your patch is on mwdebug1001, can you test please? 
[19:05:17] sure thing [19:05:52] 10SRE, 10Traffic: HAProxy fails to reuse connections under some conditions - https://phabricator.wikimedia.org/T296874 (10Vgutierrez) [19:07:37] urbanecm: is there any reason your fix trailing logs script won't work on 1.36 [19:07:52] i don't think it should break any wiki [19:08:01] but you might want to check if there are any entries with spaces in the first place [19:08:09] if not, no reason to bother running it [19:08:47] urbanecm: I have 4000 wikis [19:08:50] Not checking manually [19:08:56] make a for loop ;) [19:09:06] I doubt it because iirc it was only via api [19:09:53] majavah: i think we're gtg - it's a bit hard to test [19:10:19] sure, I'll sync it in that case [19:11:33] !log taavi@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:742834|Add mediawiki.web_ui_scroll stream (T292586)]] (duration: 00m 57s) [19:11:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:11:38] T292586: Sticky Header: Create schema to track returning to the top of the page - https://phabricator.wikimedia.org/T292586 [19:11:58] your patch is live! [19:12:48] majavah: thank you! 
[19:13:22] !log UTC evening deploys done [19:13:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:18:54] !log otto@deploy1002 Started deploy [airflow-dags/analytics@bea2abe] (hadoop-test): (no justification provided) [19:18:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:18:58] !log otto@deploy1002 Finished deploy [airflow-dags/analytics@bea2abe] (hadoop-test): (no justification provided) (duration: 00m 03s) [19:19:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:19:08] RECOVERY - Check systemd state on doc1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:24:14] !log otto@deploy1002 Started deploy [airflow-dags/analytics@bea2abe] (hadoop-test): (no justification provided) [19:24:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:24:17] !log otto@deploy1002 Finished deploy [airflow-dags/analytics@bea2abe] (hadoop-test): (no justification provided) (duration: 00m 03s) [19:24:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:24:52] !log otto@deploy1002 Started deploy [airflow-dags/analytics@bea2abe] (hadoop-test): (no justification provided) [19:24:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:24:55] !log otto@deploy1002 Finished deploy [airflow-dags/analytics@bea2abe] (hadoop-test): (no justification provided) (duration: 00m 03s) [19:24:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:25:28] !log otto@deploy1002 Started deploy [airflow-dags/analytics@bea2abe] (hadoop-test): (no justification provided) [19:25:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:26:26] PROBLEM - MariaDB Replica Lag: s1 on db1139 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1574.02 seconds 
https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [19:27:04] grrrrrrrrr [19:27:55] !log otto@deploy1002 Finished deploy [airflow-dags/analytics@bea2abe] (hadoop-test): (no justification provided) (duration: 02m 26s) [19:27:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:30:11] !log otto@deploy1002 Started deploy [airflow-dags/analytics@bea2abe] (hadoop-test): (no justification provided) [19:30:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:30:33] !log otto@deploy1002 Finished deploy [airflow-dags/analytics@bea2abe] (hadoop-test): (no justification provided) (duration: 00m 22s) [19:30:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:33:34] (03CR) 10Ppchelko: api-gateway: allow discovery services to set custom rate limits (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/741937 (https://phabricator.wikimedia.org/T295956) (owner: 10Hnowlan) [19:38:50] PROBLEM - Check systemd state on ms-fe2010 is CRITICAL: CRITICAL - degraded: The following units failed: swift-proxy.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:39:08] PROBLEM - Check systemd state on ms-fe2012 is CRITICAL: CRITICAL - degraded: The following units failed: swift-proxy.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:39:22] PROBLEM - Check systemd state on ms-fe2011 is CRITICAL: CRITICAL - degraded: The following units failed: swift-proxy.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:41:03] (03PS1) 10Jsn.sherman: Enable TheWikipediaLibrary on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/742996 (https://phabricator.wikimedia.org/T288070) [19:44:18] (03CR) 10Jsn.sherman: [C: 04-1] "-1 while we wait for 1.38.0-wmf.12; 2021-12-06" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/742996 (https://phabricator.wikimedia.org/T288070) (owner: 
10Jsn.sherman) [19:46:55] !log otto@deploy1002 Started deploy [airflow-dags/analytics@2f59257] (hadoop-test): (no justification provided) [19:46:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:46:58] !log otto@deploy1002 Finished deploy [airflow-dags/analytics@2f59257] (hadoop-test): (no justification provided) (duration: 00m 03s) [19:47:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:57:47] (03Abandoned) 10Ppchelko: PageUpdater: apply tags even if RC suppressed. [core] (wmf/1.38.0-wmf.7) - 10https://gerrit.wikimedia.org/r/737085 (https://phabricator.wikimedia.org/T291967) (owner: 10Ppchelko) [20:02:42] (03CR) 10Herron: [C: 03+2] "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/742988 (owner: 10Majavah) [20:09:40] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [20:09:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:10:44] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: check_netbox_uncommitted_dns_changes.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:10:52] 10SRE, 10ops-codfw, 10DC-Ops, 10RESTBase, 10Platform Team Workboards (Platform Engineering Reliability): Q2:(Need By: TBD) rack/setup/install restbase202[456].codfw.wmnet - https://phabricator.wikimedia.org/T294377 (10Papaul) [20:12:56] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:13:07] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:13:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:17:44] PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: An error occurred checking if Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [20:30:06] RECOVERY - 
Uncommitted DNS changes in Netbox on netbox1001 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [20:44:24] RECOVERY - Check systemd state on ms-fe2010 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:44:52] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for Daimona - https://phabricator.wikimedia.org/T295993 (10Dzahn) No problem @Daimona ! @herron could you take this and treat it like any other "add me to wmf group" request for the new account above? Thanks! [20:48:11] !log razzi@deploy1002 Started deploy [analytics/refinery@3b1b794]: Regular analytics weekly train [analytics/refinery@3b1b794] [20:48:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:50:56] PROBLEM - Check systemd state on ms-fe2010 is CRITICAL: CRITICAL - degraded: The following units failed: swift-proxy.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:53:34] RECOVERY - MariaDB Replica Lag: s1 on db1139 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [21:00:05] chrisalbon and accraze: (Dis)respected human, time to deploy Services – Graphoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211201T2100). Please do the needful. [21:00:29] lots of failure to import monotonic on swift proxy [21:00:53] that doesn't look good. 
the package is missing, I am going to try to install it on one host and see if that helps [21:02:56] !log installing python-monotonic on ms-fe2010 [21:02:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:04:36] RECOVERY - Check systemd state on ms-fe2010 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:05:00] my worry is that by fixing it, I may be breaking something else :-( [21:05:02] RECOVERY - Check systemd state on sretest1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:06:13] !log installing python-monotonic on ms-fe2011, ms-fe2012 (breaks swift-proxy) [21:06:13] (03CR) 10Cwhite: [C: 03+2] site: consolidate logstash node definitions [puppet] - 10https://gerrit.wikimedia.org/r/742778 (https://phabricator.wikimedia.org/T288621) (owner: 10Cwhite) [21:06:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:06:21] (03PS5) 10Cwhite: site: consolidate logstash node definitions [puppet] - 10https://gerrit.wikimedia.org/r/742778 (https://phabricator.wikimedia.org/T288621) [21:07:26] PROBLEM - etcd request latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 operation={list,update} https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [21:07:56] PROBLEM - etcd request latencies on kubestagemaster1001 is CRITICAL: instance=10.64.16.203 operation=create https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [21:09:30] !log razzi@deploy1002 Finished deploy [analytics/refinery@3b1b794]: Regular analytics weekly train [analytics/refinery@3b1b794] (duration: 21m 18s) [21:09:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:10:00] RECOVERY - Check systemd state on ms-fe2011 is OK: OK - 
running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:10:00] RECOVERY - Check systemd state on ms-fe2012 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:10:04] RECOVERY - etcd request latencies on kubestagemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [21:10:04] PROBLEM - Check systemd state on archiva1002 is CRITICAL: CRITICAL - degraded: The following units failed: archiva.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:10:49] !log otto@deploy1002 Started deploy [airflow-dags/analytics@2f59257]: (no justification provided) [21:10:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:10:58] !log otto@deploy1002 Started deploy [airflow-dags/analytics@2f59257] (hadoop-test): (no justification provided) [21:11:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:11:14] !log otto@deploy1002 Finished deploy [airflow-dags/analytics@2f59257] (hadoop-test): (no justification provided) (duration: 00m 16s) [21:11:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:11:28] PROBLEM - Check systemd state on sretest1001 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_rasdaemon.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:11:32] (03CR) 10Cwhite: [C: 03+2] hiera: add opensearch production configuration [puppet] - 10https://gerrit.wikimedia.org/r/742780 (https://phabricator.wikimedia.org/T288621) (owner: 10Cwhite) [21:11:42] RECOVERY - etcd request latencies on kubemaster1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [21:12:09] !log otto@deploy1002 Started deploy [airflow-dags/analytics@2f59257] (hadoop-test): (no justification provided) [21:12:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:12:12] !log otto@deploy1002 Finished deploy [airflow-dags/analytics@2f59257] (hadoop-test): (no justification provided) (duration: 00m 03s) [21:12:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:17:34] 10SRE-swift-storage: ms-fe2010, ms-fe2011, ms-fe2012 had its swift-proxy.service failed - https://phabricator.wikimedia.org/T296883 (10jcrespo) [21:21:05] 10SRE-swift-storage: ms-fe2010, ms-fe2011, ms-fe2012 had its swift-proxy.service failed - https://phabricator.wikimedia.org/T296883 (10jcrespo) [21:21:10] 10SRE-swift-storage: ms-fe2010, ms-fe2011, ms-fe2012 had its swift-proxy.service failed - https://phabricator.wikimedia.org/T296883 (10LSobanski) [21:21:12] 10SRE-swift-storage: swift-proxy not starting on ms-fe2009 due to missing python-monotonic - https://phabricator.wikimedia.org/T296289 (10LSobanski) [21:22:43] 10SRE-swift-storage: swift-proxy not starting on ms-fe2009 due to missing python-monotonic - https://phabricator.wikimedia.org/T296289 (10jcrespo) Ah, so I guess not in production, that leaves me less worried. Sorry, I searched for ms-fe and monotonic but couldn't find this ticket. 
[21:31:46] RECOVERY - Check systemd state on archiva1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:35:46] (03PS1) 10Cwhite: hiera: correct opensearch hiera keys [puppet] - 10https://gerrit.wikimedia.org/r/743027 [21:37:47] 10SRE, 10ops-ulsfo: ps1-22-ulsfo Cord, Master_Cord_A, Active Power alerting - https://phabricator.wikimedia.org/T294891 (10wiki_willy) Hi @RobH - on the Digital Realty invoices, it looks like these are 30amp, 208v circuits at ulsfo: MRC 30A 208V AC Primary, 1030223PDU1UPSAPNLA_23 MRC 30A 208V AC Redundant, 10... [21:49:03] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for Daimona - https://phabricator.wikimedia.org/T295993 (10herron) >>! In T295993#7542087, @Dzahn wrote: > No problem @Daimona ! @herron could you take this and treat it like any other "add me to wmf group" request for the new account above? Thanks! This is... [21:49:35] (03CR) 10Cwhite: [C: 03+2] "PCC NOOP https://puppet-compiler.wmflabs.org/compiler1002/32767/" [puppet] - 10https://gerrit.wikimedia.org/r/743027 (owner: 10Cwhite) [22:04:24] (03PS1) 10Cwhite: hiera: correct more opensearch hiera definitions [puppet] - 10https://gerrit.wikimedia.org/r/743032 [22:08:32] (03CR) 10Cwhite: [C: 03+2] "PCC NOOP https://puppet-compiler.wmflabs.org/compiler1001/32771/" [puppet] - 10https://gerrit.wikimedia.org/r/743032 (owner: 10Cwhite) [22:09:22] !log otto@deploy1002 Started deploy [airflow-dags/analytics@2f59257] (hadoop-test): (no justification provided) [22:09:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:09:26] !log otto@deploy1002 Finished deploy [airflow-dags/analytics@2f59257] (hadoop-test): (no justification provided) (duration: 00m 03s) [22:09:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:10:09] !log otto@deploy1002 Started deploy [airflow-dags/analytics@2f59257] (hadoop-test): (no justification provided) [22:10:11] Logged 
the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:10:12] !log otto@deploy1002 Finished deploy [airflow-dags/analytics@2f59257] (hadoop-test): (no justification provided) (duration: 00m 03s) [22:10:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:10:54] !log otto@deploy1002 Started deploy [airflow-dags/analytics@2f59257] (hadoop-test): (no justification provided) [22:10:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:10:57] !log otto@deploy1002 Finished deploy [airflow-dags/analytics@2f59257] (hadoop-test): (no justification provided) (duration: 00m 03s) [22:10:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:11:02] !log otto@deploy1002 Started deploy [airflow-dags/analytics@2f59257] (hadoop-test): (no justification provided) [22:11:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:12:26] !log otto@deploy1002 Finished deploy [airflow-dags/analytics@2f59257] (hadoop-test): (no justification provided) (duration: 01m 23s) [22:12:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:12:34] !log otto@deploy1002 Started deploy [airflow-dags/analytics@2f59257] (hadoop-test): (no justification provided) [22:12:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:12:42] !log otto@deploy1002 Finished deploy [airflow-dags/analytics@2f59257] (hadoop-test): (no justification provided) (duration: 00m 07s) [22:12:44] (03PS2) 10Ahmon Dancy: wmf-beta-update-databases.py: Print error in a better way [puppet] - 10https://gerrit.wikimedia.org/r/742519 [22:12:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:13:37] !log otto@deploy1002 Started deploy [airflow-dags/analytics@2f59257] (hadoop-test): (no justification provided) [22:13:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:13:44] !log otto@deploy1002 
Finished deploy [airflow-dags/analytics@2f59257] (hadoop-test): (no justification provided) (duration: 00m 07s) [22:13:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:15:02] !log otto@deploy1002 Started deploy [airflow-dags/analytics@2f59257] (hadoop-test): (no justification provided) [22:15:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:15:10] !log otto@deploy1002 Finished deploy [airflow-dags/analytics@2f59257] (hadoop-test): (no justification provided) (duration: 00m 07s) [22:15:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:15:54] (03PS4) 10Ahmon Dancy: mediawiki 0.0.40: Add additional php settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/742773 [22:34:13] jouncebot nowandnext [22:34:14] No deployments scheduled for the next 1 hour(s) and 25 minute(s) [22:34:14] In 1 hour(s) and 25 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211202T0000) [22:39:00] (03PS1) 10Ahmon Dancy: Choose wikiversions.php file relative to MWMultiVersion.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/743038 [22:40:10] (03CR) 10jerkins-bot: [V: 04-1] Choose wikiversions.php file relative to MWMultiVersion.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/743038 (owner: 10Ahmon Dancy) [22:40:47] (03CR) 10Thcipriani: [C: 03+1] wmf-beta-update-databases.py: Print error in a better way [puppet] - 10https://gerrit.wikimedia.org/r/742519 (owner: 10Ahmon Dancy) [22:41:43] (03PS2) 10Ahmon Dancy: Choose wikiversions.php file relative to MWMultiVersion.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/743038 [22:46:12] (03PS3) 10Ahmon Dancy: Choose wikiversions.php file relative to MWMultiVersion.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/743038 [22:55:51] 10SRE, 10ops-codfw, 10DC-Ops, 10RESTBase, 10Platform Team Workboards (Platform Engineering Reliability): Q2:(Need By: TBD) 
rack/setup/install restbase202[456].codfw.wmnet - https://phabricator.wikimedia.org/T294377 (10Papaul) @ayounsi running homer on asw-b-codfw gives the error below ` raise HomerError... [23:05:41] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for Daimona - https://phabricator.wikimedia.org/T295993 (10Daimona) >>! In T295993#7542218, @herron wrote: >>>! In T295993#7542087, @Dzahn wrote: >> No problem @Daimona ! @herron could you take this and treat it like any other "add me to wmf group" request fo... [23:06:45] (03PS1) 10Jcrespo: exim: Reenable regular flash sale offers on wmf address [puppet] - 10https://gerrit.wikimedia.org/r/743040 (https://phabricator.wikimedia.org/T132324) [23:07:03] 10SRE, 10ops-ulsfo: ps1-22-ulsfo Cord, Master_Cord_A, Active Power alerting - https://phabricator.wikimedia.org/T294891 (10RobH) Ok, so it turns out it's 3 phase 208V, which is odd as it really comes out as three 120 volt circuits within it. so 208V 3Phase 30A is 3*120*30*0.8 = 8640 WATTS per pdu without halving,... 
[23:09:49] (03PS2) 10Jcrespo: exim: Reenable regular flash sale offers on wmf address [puppet] - 10https://gerrit.wikimedia.org/r/743040 (https://phabricator.wikimedia.org/T132324) [23:19:18] (03PS3) 10Jcrespo: exim: Reenable regular flash sale offers on wmf address [puppet] - 10https://gerrit.wikimedia.org/r/743040 (https://phabricator.wikimedia.org/T132324) [23:41:38] (03PS1) 10Cwhite: hiera: use site-local ldap for opensearch in codfw [puppet] - 10https://gerrit.wikimedia.org/r/743045 [23:41:40] (03PS1) 10Cwhite: hiera: enable opensearch compatibility mode [puppet] - 10https://gerrit.wikimedia.org/r/743046 [23:41:42] (03PS1) 10Cwhite: opensearch: use systemd timer for gc log rotation [puppet] - 10https://gerrit.wikimedia.org/r/743047 [23:41:44] (03PS1) 10Cwhite: profile: allow logstash checker to query opensearch [puppet] - 10https://gerrit.wikimedia.org/r/743048 [23:41:46] (03PS1) 10Cwhite: site: reprovision codfw logging cluster to opensearch [puppet] - 10https://gerrit.wikimedia.org/r/743049 (https://phabricator.wikimedia.org/T288621) [23:48:22] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/mobile-html/{title} (Get page content HTML for test page) timed out before a response was received: /{domain}/v1/page/summary/{title} (Get summary for test page) timed out before a response was received: /{domain}/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received: /{domain}/v1/transform [23:48:22] /mobile-html/{title} (Get preview mobile HTML for test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [23:48:34] PROBLEM - restbase endpoints health on restbase2021 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned 
the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [23:48:38] PROBLEM - restbase endpoints health on restbase2010 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [23:48:58] PROBLEM - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is CRITICAL: /api/rest_v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [23:49:36] PROBLEM - restbase endpoints health on restbase2012 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [23:50:30] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [23:50:42] RECOVERY - restbase endpoints health on restbase2021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [23:50:42] RECOVERY - restbase endpoints health on restbase2010 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [23:50:50] PROBLEM - restbase endpoints health on restbase2020 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [23:51:08] RECOVERY - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [23:51:18] PROBLEM - SSH on contint1001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:51:40] RECOVERY - restbase endpoints health on restbase2012 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [23:52:54] RECOVERY - restbase endpoints health on restbase2020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase