[00:00:10] RECOVERY - Apache HTTP on wtp1026 is OK: HTTP OK: HTTP/1.1 302 Found - 635 bytes in 0.363 second response time https://wikitech.wikimedia.org/wiki/Application_servers [00:05:44] PROBLEM - Check systemd state on ms-be2035 is CRITICAL: CRITICAL - degraded: The following units failed: swift-drive-audit.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:16:44] ACKNOWLEDGEMENT - Check systemd state on ms-be2035 is CRITICAL: CRITICAL - degraded: The following units failed: swift-drive-audit.service daniel_zahn drive-audit[18014]: Errors found but device unavailable: sdi:6 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:20:02] PROBLEM - SSH on analytics1069.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [00:23:49] 10SRE-swift-storage: swift - ms-be2035 - device sdi:6 unavailable - https://phabricator.wikimedia.org/T291896 (10Dzahn) [00:24:45] (03CR) 10Dzahn: "https://phabricator.wikimedia.org/T291896" [puppet] - 10https://gerrit.wikimedia.org/r/715597 (https://phabricator.wikimedia.org/T288806) (owner: 10Zabe) [01:02:16] RECOVERY - Check systemd state on ms-be2035 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:20:36] (03PS2) 10Legoktm: Add toolhub to LVS [puppet] - 10https://gerrit.wikimedia.org/r/711702 (https://phabricator.wikimedia.org/T280881) [01:20:38] (03PS2) 10Legoktm: service: Switch toolhub to lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/711703 (https://phabricator.wikimedia.org/T280881) [01:20:40] (03PS2) 10Legoktm: service: Switch toolhub to monitoring_setup [puppet] - 10https://gerrit.wikimedia.org/r/711704 (https://phabricator.wikimedia.org/T280881) [01:20:42] (03PS2) 10Legoktm: service: Switch toolhub to production [puppet] - 10https://gerrit.wikimedia.org/r/711705 (https://phabricator.wikimedia.org/T280881) 
[01:20:44] (03PS3) 10Legoktm: Add toolhub to cache backends [puppet] - 10https://gerrit.wikimedia.org/r/711648 (https://phabricator.wikimedia.org/T280881) (owner: 10Majavah) [01:20:46] (03CR) 10Legoktm: Add toolhub to LVS (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/711702 (https://phabricator.wikimedia.org/T280881) (owner: 10Legoktm) [01:21:30] (03PS3) 10Legoktm: Add toolhub to LVS [puppet] - 10https://gerrit.wikimedia.org/r/711702 (https://phabricator.wikimedia.org/T280881) [01:21:32] (03PS3) 10Legoktm: service: Switch toolhub to lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/711703 (https://phabricator.wikimedia.org/T280881) [01:21:34] (03PS3) 10Legoktm: service: Switch toolhub to monitoring_setup [puppet] - 10https://gerrit.wikimedia.org/r/711704 (https://phabricator.wikimedia.org/T280881) [01:21:36] (03PS3) 10Legoktm: service: Switch toolhub to production [puppet] - 10https://gerrit.wikimedia.org/r/711705 (https://phabricator.wikimedia.org/T280881) [01:21:38] (03PS4) 10Legoktm: Add toolhub to cache backends [puppet] - 10https://gerrit.wikimedia.org/r/711648 (https://phabricator.wikimedia.org/T280881) (owner: 10Majavah) [01:24:28] (03CR) 10Legoktm: "PS3: Removed ProxyFetch, which isn't needed for k8s applications." 
[puppet] - 10https://gerrit.wikimedia.org/r/711702 (https://phabricator.wikimedia.org/T280881) (owner: 10Legoktm) [01:29:02] PROBLEM - PHP7 rendering on wtp1026 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1940 bytes in 0.002 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [01:31:06] RECOVERY - PHP7 rendering on wtp1026 is OK: HTTP OK: HTTP/1.1 302 Found - 649 bytes in 0.043 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [02:00:05] Deploy window Branching MediaWiki, extensions, skins, and vendor – See Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210928T0200) [02:06:14] PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: generate_os_reports.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:07:17] (03PS1) 10TrainBranchBot: Branch commit for wmf/1.38.0-wmf.2 [core] (wmf/1.38.0-wmf.2) - 10https://gerrit.wikimedia.org/r/724227 [02:07:21] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/1.38.0-wmf.2 [core] (wmf/1.38.0-wmf.2) - 10https://gerrit.wikimedia.org/r/724227 (owner: 10TrainBranchBot) [02:12:11] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [02:12:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:15:12] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[02:15:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:22:04] RECOVERY - SSH on analytics1069.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [02:27:32] (03Merged) 10jenkins-bot: Branch commit for wmf/1.38.0-wmf.2 [core] (wmf/1.38.0-wmf.2) - 10https://gerrit.wikimedia.org/r/724227 (owner: 10TrainBranchBot) [02:39:01] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [02:39:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:42:27] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [02:42:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:56:40] (03PS1) 10Ladsgroup: Enable dispatching via jobs in testwikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/724230 (https://phabricator.wikimedia.org/T48643) [03:57:45] (03CR) 10jerkins-bot: [V: 04-1] Enable dispatching via jobs in testwikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/724230 (https://phabricator.wikimedia.org/T48643) (owner: 10Ladsgroup) [03:59:48] (03PS2) 10Ladsgroup: Enable dispatching via jobs in testwikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/724230 (https://phabricator.wikimedia.org/T48643) [04:18:42] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 131, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:19:06] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 235, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:24:47] (Primary outbound port utilisation over 80% #page) firing: Primary 
outbound port utilisation over 80% #page - https://alerts.wikimedia.org [04:25:23] uh? [04:25:32] It's cr3-eqsin.wikimedia.org [04:25:38] hi [04:27:47] https://wikitech.wikimedia.org/wiki/Network_monitoring#Primary_outbound_port_utilization_over_80% ftr, I guess we don't have that set up in alertmanager yet [04:29:47] (Primary outbound port utilisation over 80% #page) resolved: Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org [04:31:53] haven't been able to find traces of a big upload or anything yet [04:34:29] I think it was /wikipedia/commons/e/eb/Pchum_Ben_Khmer.png [04:35:29] yeah, pretty sure it was that [04:35:50] neat -- I'm interested in how you got there, but maybe for work hours sometime :) [04:36:31] anything still to be done, do you think? [04:37:22] don't think so, thankfully it was a pretty small picture (1.45MB) [04:37:52] ?! 
[04:38:14] *fascinated* by how an upload the size of a floppy disk set off the pager :) [04:38:25] https://grafana.wikimedia.org/d/000000093/varnish-traffic?viewPanel=27&orgId=1&refresh=1m [04:39:00] sampled-1000 has 1126 hits, so that's a lot of requests [04:40:04] ohh okay I'm with you -- for some reason I was picturing saturation because someone was *uploading* something large, but I guess that would have to be *from* eqsin in order to fire the outbound alert [04:40:25] I got led down the garden path of like a huge video transfer or something [04:40:37] rather than lots of downloads for a file -- that makes more sense [04:40:56] ah right, yeah, the photo is from 2019 [04:42:26] for reference, I was tailing sampled-1000, feeding it into my webreq-filter tool, noticed multiple IPs at the top with exactly the same "1.45MB" size plus all the top UAs were various Android devices, which is the same pattern for the other hotlinking incidents, so I grepped for the UA directly and saw enough requests for that image and the size and timing matched up [04:42:48] ahh nice [04:44:24] PROBLEM - PHP7 rendering on wtp1026 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1940 bytes in 0.002 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [04:45:04] wtp1026 has been flapping earlier [04:45:12] * legoktm heads afk o/ [04:45:52] have a good night, thanks! 
[04:46:28] RECOVERY - PHP7 rendering on wtp1026 is OK: HTTP OK: HTTP/1.1 302 Found - 649 bytes in 0.051 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:05:03] (03PS1) 10Marostegui: install_server: Reimage db2080 [puppet] - 10https://gerrit.wikimedia.org/r/724237 (https://phabricator.wikimedia.org/T290868) [05:06:06] (03CR) 10Marostegui: [C: 03+2] install_server: Reimage db2080 [puppet] - 10https://gerrit.wikimedia.org/r/724237 (https://phabricator.wikimedia.org/T290868) (owner: 10Marostegui) [05:10:03] !log Remove flaggedimages from s6 T290340 [05:10:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:10:09] T290340: Drop the flaggedimages table from Wikimedia production - https://phabricator.wikimedia.org/T290340 [05:48:22] RECOVERY - dump of s6 in codfw on alert1001 is OK: Last dump for s6 at codfw (db2141.codfw.wmnet:3316) taken on 2021-09-28 04:23:35 (102 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [06:10:35] 10SRE, 10LDAP-Access-Requests: Grant Access to LDAP-wmf for erayfield - https://phabricator.wikimedia.org/T291126 (10Joe) Thank you @mepps for your approval. @ERayfield can you please confirm your wikitech username and the email you used to register it? I can't seem to find any account with `mail=erayfield@w... [06:19:59] (03PS1) 10Marostegui: db2080: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/724321 (https://phabricator.wikimedia.org/T290868) [06:22:23] (03CR) 10Marostegui: [C: 03+2] db2080: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/724321 (https://phabricator.wikimedia.org/T290868) (owner: 10Marostegui) [06:30:35] (03CR) 10Elukey: "left a note about require vs ensure_package, the rest LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/724105 (owner: 10Hnowlan) [06:41:56] (03CR) 10Volans: [C: 03+2] sre.experimental.reimage: manage the DHCP records [cookbooks] - 10https://gerrit.wikimedia.org/r/723995 (owner: 10Volans) [06:42:37] !log installed spicerack 1.0.2 on cumin2002 [06:42:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:44:28] (03Merged) 10jenkins-bot: sre.experimental.reimage: manage the DHCP records [cookbooks] - 10https://gerrit.wikimedia.org/r/723995 (owner: 10Volans) [06:46:22] (03PS5) 10Giuseppe Lavagetto: ci::master: remove production k8s tokens [puppet] - 10https://gerrit.wikimedia.org/r/724015 [06:48:05] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31322/console" [puppet] - 10https://gerrit.wikimedia.org/r/724015 (owner: 10Giuseppe Lavagetto) [06:49:16] PROBLEM - dump of m5 in eqiad on alert1001 is CRITICAL: Last dump for m5 at eqiad (db1117.eqiad.wmnet:3325) taken on 2021-09-28 06:24:33 is 20 GB, but previous one was 37 GB, a change of 46.4% https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [06:50:43] I guess that's wikitech moved to s6 [06:52:18] !log volans@cumin2002 START - Cookbook sre.experimental.reimage for host sretest1002.eqiad.wmnet [06:52:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:52:23] 10SRE-tools, 10Infrastructure-Foundations: Cookbooks: convert wmf-auto-reimage scripts to Cookbooks - https://phabricator.wikimedia.org/T205885 (10ops-monitoring-bot) Cookbook cookbooks.sre.experimental.reimage was started by volans@cumin2002 for host sretest1002.eqiad.wmnet [06:52:28] !log volans@cumin2002 END (FAIL) - Cookbook sre.experimental.reimage (exit_code=99) for host sretest1002.eqiad.wmnet [06:52:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:52:33] 10SRE-tools, 10Infrastructure-Foundations: Cookbooks: convert 
wmf-auto-reimage scripts to Cookbooks - https://phabricator.wikimedia.org/T205885 (10ops-monitoring-bot) Cookbook cookbooks.sre.experimental.reimage executed with errors: - sretest1002 (**FAIL**) - Downtimed on Icinga - Disabled Puppet - Remov... [06:54:54] !log volans@cumin2002 START - Cookbook sre.experimental.reimage for host sretest1001.eqiad.wmnet [06:54:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:55:00] 10SRE-tools, 10Infrastructure-Foundations: Cookbooks: convert wmf-auto-reimage scripts to Cookbooks - https://phabricator.wikimedia.org/T205885 (10ops-monitoring-bot) Cookbook cookbooks.sre.experimental.reimage was started by volans@cumin2002 for host sretest1001.eqiad.wmnet [06:59:05] (03PS1) 10Volans: dhcp: fix typo in opt82 file path [software/spicerack] - 10https://gerrit.wikimedia.org/r/724337 (https://phabricator.wikimedia.org/T221388) [07:05:05] (03CR) 10Volans: [C: 03+2] "trivial typo, self-merging" [software/spicerack] - 10https://gerrit.wikimedia.org/r/724337 (https://phabricator.wikimedia.org/T221388) (owner: 10Volans) [07:06:46] (03PS6) 10Giuseppe Lavagetto: ci::master: remove production k8s tokens [puppet] - 10https://gerrit.wikimedia.org/r/724015 [07:10:30] (03Merged) 10jenkins-bot: dhcp: fix typo in opt82 file path [software/spicerack] - 10https://gerrit.wikimedia.org/r/724337 (https://phabricator.wikimedia.org/T221388) (owner: 10Volans) [07:11:53] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 236, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:12:34] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 132, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:13:12] (03CR) 10DCausse: Added spicerack.kafka with offset transfer function (031 comment) [software/spicerack] - 
10https://gerrit.wikimedia.org/r/723214 (https://phabricator.wikimedia.org/T276469) (owner: 10ZPapierski) [07:14:48] !log volans@cumin2002 START - Cookbook sre.experimental.reimage for host sretest1002.eqiad.wmnet [07:14:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:14:54] 10SRE-tools, 10Infrastructure-Foundations: Cookbooks: convert wmf-auto-reimage scripts to Cookbooks - https://phabricator.wikimedia.org/T205885 (10ops-monitoring-bot) Cookbook cookbooks.sre.experimental.reimage was started by volans@cumin2002 for host sretest1002.eqiad.wmnet [07:17:00] (03PS1) 10Muehlenhoff: Prefer mx2001 for mail in ulsfo/eqsin [puppet] - 10https://gerrit.wikimedia.org/r/724338 [07:18:54] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/724338 (owner: 10Muehlenhoff) [07:21:21] !log volans@cumin2002 END (PASS) - Cookbook sre.experimental.reimage (exit_code=0) for host sretest1001.eqiad.wmnet [07:21:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:21:26] 10SRE-tools, 10Infrastructure-Foundations: Cookbooks: convert wmf-auto-reimage scripts to Cookbooks - https://phabricator.wikimedia.org/T205885 (10ops-monitoring-bot) Cookbook cookbooks.sre.experimental.reimage completed: - sretest1001 (**PASS**) - Downtimed on Icinga - Disabled Puppet - Removed from Pup... [07:23:33] (03CR) 10Muehlenhoff: "https://wikitech.wikimedia.org/wiki/Network_design#/media/File:Wikimedia_network_overview.png" [puppet] - 10https://gerrit.wikimedia.org/r/724338 (owner: 10Muehlenhoff) [07:24:20] (03CR) 10Elukey: [C: 03+1] "I like the new deployment_server.pp a lot, very easy to follow!" 
[puppet] - 10https://gerrit.wikimedia.org/r/723419 (owner: 10Giuseppe Lavagetto) [07:28:01] (03PS7) 10Giuseppe Lavagetto: ci::master: remove production k8s tokens [puppet] - 10https://gerrit.wikimedia.org/r/724015 [07:38:43] (03CR) 10David Caro: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/724003 (https://phabricator.wikimedia.org/T291585) (owner: 10David Caro) [07:38:55] !log volans@cumin2002 END (PASS) - Cookbook sre.experimental.reimage (exit_code=0) for host sretest1002.eqiad.wmnet [07:39:00] 10SRE-tools, 10Infrastructure-Foundations: Cookbooks: convert wmf-auto-reimage scripts to Cookbooks - https://phabricator.wikimedia.org/T205885 (10ops-monitoring-bot) Cookbook cookbooks.sre.experimental.reimage completed: - sretest1002 (**PASS**) - Downtimed on Icinga - Disabled Puppet - Removed from Pup... [07:39:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:41:02] (03PS8) 10Giuseppe Lavagetto: ci::master: remove production k8s tokens [puppet] - 10https://gerrit.wikimedia.org/r/724015 [07:42:26] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31326/console" [puppet] - 10https://gerrit.wikimedia.org/r/724015 (owner: 10Giuseppe Lavagetto) [07:47:41] (03PS1) 10Volans: CHANGELOG: add changelogs for release v1.0.3 [software/spicerack] - 10https://gerrit.wikimedia.org/r/724341 [07:48:08] (03CR) 10David Caro: [V: 03+1] "PCC SUCCESS (NOOP 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31327/console" [puppet] - 10https://gerrit.wikimedia.org/r/724003 (https://phabricator.wikimedia.org/T291585) (owner: 10David Caro) [07:53:35] (03CR) 10Volans: [C: 03+2] CHANGELOG: add changelogs for release v1.0.3 [software/spicerack] - 10https://gerrit.wikimedia.org/r/724341 (owner: 10Volans) [07:58:01] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 131, down: 1, 
dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:59:54] (03Merged) 10jenkins-bot: CHANGELOG: add changelogs for release v1.0.3 [software/spicerack] - 10https://gerrit.wikimedia.org/r/724341 (owner: 10Volans) [08:00:33] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 132, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:01:07] (03PS1) 10Volans: Upstream release v1.0.3 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/724344 [08:01:32] (03CR) 10Volans: [C: 03+2] Upstream release v1.0.3 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/724344 (owner: 10Volans) [08:01:46] (03CR) 10Legoktm: [C: 03+1] "LGTM, let me know if you want me to +2" [puppet] - 10https://gerrit.wikimedia.org/r/723760 (https://phabricator.wikimedia.org/T287900) (owner: 10Majavah) [08:01:51] (03CR) 10Muehlenhoff: microsites: Switch to wmflib::dir::mkdir_p (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/724053 (owner: 10Muehlenhoff) [08:02:11] (03PS2) 10Muehlenhoff: microsites: Switch to wmflib::dir::mkdir_p [puppet] - 10https://gerrit.wikimedia.org/r/724053 [08:04:26] (03PS9) 10Giuseppe Lavagetto: ci::master: remove production k8s tokens [puppet] - 10https://gerrit.wikimedia.org/r/724015 [08:05:17] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31328/console" [puppet] - 10https://gerrit.wikimedia.org/r/724015 (owner: 10Giuseppe Lavagetto) [08:06:52] (03Merged) 10jenkins-bot: Upstream release v1.0.3 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/724344 (owner: 10Volans) [08:07:26] PROBLEM - dump of m5 in codfw on alert1001 is CRITICAL: Last dump for m5 at codfw (db2078.codfw.wmnet:3325) taken on 2021-09-28 07:31:00 is 20 GB, but previous one was 37 GB, a change of 
46.4% https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [08:08:40] jynus: I guess it's related to wikitech moved to s6, but is there anything we should do about it? ^^^ [08:09:31] it is due to the deletion of the tables in m5 [08:09:35] I did that yesterday [08:10:08] I assumed so yes, but is there a way to tell it that it's ok and move on? [08:12:43] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [08:14:00] (03CR) 10David Caro: "pcc runs:" [puppet] - 10https://gerrit.wikimedia.org/r/724003 (https://phabricator.wikimedia.org/T291585) (owner: 10David Caro) [08:14:31] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [08:17:58] !log uploaded spicerack_1.0.3 to apt.wikimedia.org buster-wikimedia,bullseye-wikimedia [08:18:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:18:57] ACKNOWLEDGEMENT - dump of m5 in codfw on alert1001 is CRITICAL: Last dump for m5 at codfw (db2078.codfw.wmnet:3325) taken on 2021-09-28 07:31:00 is 20 GB, but previous one was 37 GB, a change of 46.4% Jcrespo Deletion of labswiki from m5 - The acknowledgement expires at: 2021-10-05 08:18:22. https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [08:18:57] ACKNOWLEDGEMENT - dump of m5 in eqiad on alert1001 is CRITICAL: Last dump for m5 at eqiad (db1117.eqiad.wmnet:3325) taken on 2021-09-28 06:24:33 is 20 GB, but previous one was 37 GB, a change of 46.4% Jcrespo Deletion of labswiki from m5 - The acknowledgement expires at: 2021-10-05 08:18:22. 
https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [08:24:02] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 131, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:27:14] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 132, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:30:32] !log marostegui@cumin1001 START - Cookbook sre.experimental.reimage for host db2080.codfw.wmnet [08:30:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:30:40] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 131, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:34:01] (03CR) 10Alexandros Kosiaris: [C: 03+2] Lower the default TLS proxy inbound idle timeout [deployment-charts] - 10https://gerrit.wikimedia.org/r/724174 (owner: 10Ppchelko) [08:34:10] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 235, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:34:39] (03CR) 10Alexandros Kosiaris: [C: 03+2] "+2, but I 'll leave the deploy to you." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/724173 (https://phabricator.wikimedia.org/T215001) (owner: 10Ppchelko) [08:35:56] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 132, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:39:29] (03Merged) 10jenkins-bot: Eventgate TLS proxy: lower local idle_timeout [deployment-charts] - 10https://gerrit.wikimedia.org/r/724173 (https://phabricator.wikimedia.org/T215001) (owner: 10Ppchelko) [08:39:31] (03Merged) 10jenkins-bot: Lower the default TLS proxy inbound idle timeout [deployment-charts] - 10https://gerrit.wikimedia.org/r/724174 (owner: 10Ppchelko) [08:40:22] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 131, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:41:32] 10SRE, 10ops-codfw, 10DC-Ops, 10SRE Observability (FY2021/2022-Q1): Q1: (Need By: TBD) rack/setup/install centrallog2002.codfw.wmnet - https://phabricator.wikimedia.org/T289624 (10MoritzMuehlenhoff) >>! In T289624#7382300, @Papaul wrote: > @MoritzMuehlenhoff I don't know is you saw my comment on Sep 10th... 
[08:45:01] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 132, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:50:24] (03CR) 10Majavah: P::toolforge: Use composer package on buster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/723760 (https://phabricator.wikimedia.org/T287900) (owner: 10Majavah) [08:50:27] jouncebot: now [08:50:27] No deployments scheduled for the next 2 hour(s) and 9 minute(s) [08:51:53] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 131, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:57:16] !log upgrade scap on eqiad and codfw - T291095 [08:57:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:57:22] T291095: Deploy Scap version 4.0.0 - https://phabricator.wikimedia.org/T291095 [08:59:07] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 132, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:00:32] !log marostegui@cumin1001 END (PASS) - Cookbook sre.experimental.reimage (exit_code=0) for host db2080.codfw.wmnet [09:00:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:00:42] volans: ^ \o/ [09:02:20] yay [09:03:59] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 131, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:09:11] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 236, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:09:17] RECOVERY - Router interfaces on 
cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 132, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:14:49] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 235, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:14:55] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 131, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:15:33] (03CR) 10Muehlenhoff: Setup systemd timer to sync OS reports (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/724081 (owner: 10Muehlenhoff) [09:22:47] (03CR) 10Giuseppe Lavagetto: profile::kuberentes_deployment_server: re-think user management (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/723419 (owner: 10Giuseppe Lavagetto) [09:23:33] (03PS24) 10Giuseppe Lavagetto: profile::kuberentes_deployment_server: re-think user management [puppet] - 10https://gerrit.wikimedia.org/r/723419 [09:23:35] (03PS10) 10Giuseppe Lavagetto: ci::master: remove production k8s tokens [puppet] - 10https://gerrit.wikimedia.org/r/724015 [09:23:57] !log Deploy schema change on s2 codfw (lag will show up) T283499 [09:24:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:24:03] T283499: Schema change for renaming page_timestamp index on revision table to rev_page_timestamp - https://phabricator.wikimedia.org/T283499 [09:24:46] (03PS25) 10Giuseppe Lavagetto: profile::kuberentes_deployment_server: re-think user management [puppet] - 10https://gerrit.wikimedia.org/r/723419 [09:24:48] (03PS11) 10Giuseppe Lavagetto: ci::master: remove production k8s tokens [puppet] - 10https://gerrit.wikimedia.org/r/724015 [09:26:39] !log Deploy schema change on s4 codfw (lag will show up) T283499 
[09:26:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:26:57] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31331/console" [puppet] - 10https://gerrit.wikimedia.org/r/724015 (owner: 10Giuseppe Lavagetto) [09:27:32] !log elukey@cumin1001 START - Cookbook sre.kafka.roll-restart-mirror-maker restart MirrorMaker for Kafka A:kafka-mirror-maker-test-eqiad cluster: Roll restart of jvm daemons. - elukey@cumin1001 [09:27:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:30:55] (03PS26) 10Giuseppe Lavagetto: profile::kuberentes_deployment_server: re-think user management [puppet] - 10https://gerrit.wikimedia.org/r/723419 [09:30:57] (03PS12) 10Giuseppe Lavagetto: ci::master: remove production k8s tokens [puppet] - 10https://gerrit.wikimedia.org/r/724015 [09:37:42] !log elukey@cumin1001 END (PASS) - Cookbook sre.kafka.roll-restart-mirror-maker (exit_code=0) restart MirrorMaker for Kafka A:kafka-mirror-maker-test-eqiad cluster: Roll restart of jvm daemons. 
- elukey@cumin1001 [09:37:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:38:24] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] kubeadm: Disable PodPresets [puppet] - 10https://gerrit.wikimedia.org/r/724101 (https://phabricator.wikimedia.org/T279106) (owner: 10Majavah) [09:42:01] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host rdb1010.eqiad.wmnet [09:42:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:46:21] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 132, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:46:38] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rdb1010.eqiad.wmnet [09:46:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:47:10] (03PS3) 10Hashar: ci: Apply profile::wmcs::lvm as needed for new integration instances [puppet] - 10https://gerrit.wikimedia.org/r/722476 (https://phabricator.wikimedia.org/T277078) (owner: 10Krinkle) [09:47:56] (03PS1) 10Elukey: role::ores: move celery and cache to rdb2008 [puppet] - 10https://gerrit.wikimedia.org/r/724349 [09:48:34] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host rdb2008.codfw.wmnet [09:48:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:48:43] (03CR) 10Hashar: [C: 03+1] "Following my edit, I have cherry picked the updated version of this change on the puppetmaster." 
[puppet] - 10https://gerrit.wikimedia.org/r/717687 (https://phabricator.wikimedia.org/T284774) (owner: 10Krinkle) [09:48:44] <_joe_> !log removing old builds from compiler1002.puppet-diffs.eqiad1.wikimedia.cloud [09:48:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:48:58] (03CR) 10Elukey: [C: 03+2] role::ores: move celery and cache to rdb2008 [puppet] - 10https://gerrit.wikimedia.org/r/724349 (owner: 10Elukey) [09:50:10] (03CR) 10Michael Große: "Thank you 🙇‍♂️" [puppet] - 10https://gerrit.wikimedia.org/r/719502 (https://phabricator.wikimedia.org/T280247) (owner: 10Ebernhardson) [09:50:53] !log elukey@cumin1001 START - Cookbook sre.ores.roll-restart-workers for ORES codfw cluster: Roll restart of ORES's daemons. - elukey@cumin1001 [09:50:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:52:13] (03PS9) 10Hashar: ci: Add 'bullseye' to docker lsbdistcodename hack [puppet] - 10https://gerrit.wikimedia.org/r/717687 (https://phabricator.wikimedia.org/T284774) (owner: 10Krinkle) [09:52:23] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 131, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:53:46] (03CR) 10Hashar: [C: 03+1] ci: Add 'bullseye' to docker lsbdistcodename hack (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/717687 (https://phabricator.wikimedia.org/T284774) (owner: 10Krinkle) [09:54:12] (03PS1) 10Arturo Borrero Gonzalez: openstack: manila: refresh config file [puppet] - 10https://gerrit.wikimedia.org/r/724350 (https://phabricator.wikimedia.org/T291257) [09:55:53] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 236, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:56:46] (03PS2) 10Arturo Borrero Gonzalez: openstack: manila: refresh config 
file [puppet] - 10https://gerrit.wikimedia.org/r/724350 (https://phabricator.wikimedia.org/T291257) [09:57:27] (03PS1) 10Filippo Giunchedi: prometheus: add instance-specific alerts path [puppet] - 10https://gerrit.wikimedia.org/r/724353 (https://phabricator.wikimedia.org/T289662) [09:57:31] (03PS1) 10Filippo Giunchedi: alerts: add multiple tags match [puppet] - 10https://gerrit.wikimedia.org/r/724354 (https://phabricator.wikimedia.org/T289662) [09:57:35] (03PS1) 10Filippo Giunchedi: prometheus: deploy instance-specific alerts [puppet] - 10https://gerrit.wikimedia.org/r/724355 (https://phabricator.wikimedia.org/T289662) [09:57:37] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rdb2008.codfw.wmnet [09:57:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:58:12] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] "PCC: https://puppet-compiler.wmflabs.org/compiler1002/31335/" [puppet] - 10https://gerrit.wikimedia.org/r/724350 (https://phabricator.wikimedia.org/T291257) (owner: 10Arturo Borrero Gonzalez) [09:58:33] RECOVERY - Disk space on aqs1007 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=aqs1007&var-datasource=eqiad+prometheus/ops [09:59:01] dcausse: FYI the reviews above allow an alert file to be deployed to specific prometheus instances, re: flink absent() mis-firing [09:59:16] that's T289662 [09:59:17] T289662: Add ability to select with site-local Prometheus instance to deploy alerts - https://phabricator.wikimedia.org/T289662 [09:59:34] godog: thanks!! 
[10:00:55] !log Deploy schema change on s7 codfw (lag will show up) T283499 [10:00:59] 10SRE, 10serviceops, 10wikidiff2, 10Community-Tech (CommTech-Sprint-10), 10Platform Team Workboards (Platform Engineering Reliability): Deploy wikidiff2 1.12.0 - https://phabricator.wikimedia.org/T285857 (10Daimona) [10:01:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:01:01] T283499: Schema change for renaming page_timestamp index on revision table to rev_page_timestamp - https://phabricator.wikimedia.org/T283499 [10:01:42] !log Deploy schema change on s5 codfw (lag will show up) T283499 [10:01:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:02:09] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 235, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:02:31] (03PS1) 10Arturo Borrero Gonzalez: openstack: manila: configuration template cleanups [puppet] - 10https://gerrit.wikimedia.org/r/724356 (https://phabricator.wikimedia.org/T291257) [10:04:03] (03PS1) 10Effie Mouzeli: hieradata: Temporary health endpoint for tegola-vector-tiles [puppet] - 10https://gerrit.wikimedia.org/r/724357 [10:04:13] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 236, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:08:23] 10SRE-swift-storage, 10ops-codfw: swift - ms-be2035 - device sdi:6 unavailable - https://phabricator.wikimedia.org/T291896 (10fgiunchedi) Thank you @Dzahn, I've failed the physical disk manually. 
@Papaul please replace this failed 4TB disk (host is OOW though) [10:08:56] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host rdb2010.codfw.wmnet [10:09:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:09:53] dcausse: sure no worries! feel free to comment/review if you have time, from the user's perspective you'll need to add "# deploy-tag: k8s*" at the top of the file for example to deploy to all k8s instances, or a comma-separated list of instances works too [10:10:01] once the reviews are merged that is [10:10:25] !log elukey@cumin1001 END (PASS) - Cookbook sre.ores.roll-restart-workers (exit_code=0) for ORES codfw cluster: Roll restart of ORES's daemons. - elukey@cumin1001 [10:10:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:10:33] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 235, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:11:24] (03CR) 10Giuseppe Lavagetto: [C: 03+2] profile::kuberentes_deployment_server: re-think user management [puppet] - 10https://gerrit.wikimedia.org/r/723419 (owner: 10Giuseppe Lavagetto) [10:11:39] godog: I'll take a look, I might have questions regarding the "resets" function for crashloop detection, my approach does not seem to work well but still not sure to understand why [10:13:35] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rdb2010.codfw.wmnet [10:13:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:13:49] PROBLEM - Check systemd state on ms-be2060 is CRITICAL: CRITICAL - degraded: The following units failed: session-28272.scope https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:14:59] 10SRE, 10Wikifeeds, 10serviceops: wikifeeds in codfw seems failing health checks intermittently - 
https://phabricator.wikimedia.org/T290445 (10akosiaris) 05Open→03Resolved a:03akosiaris And all of the above is mostly irrelevant and I am mostly blind and chasing ghosts (on the plus side I got more acqua... [10:15:49] PROBLEM - Check systemd state on ms-be2035 is CRITICAL: CRITICAL - degraded: The following units failed: srv-swift\x2dstorage-sdi1.mount https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:15:51] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] aptrepo: drop k8s 1.18 updates [puppet] - 10https://gerrit.wikimedia.org/r/719401 (owner: 10Majavah) [10:16:01] dcausse: ack, off the top of my head I think the alert will fire even on regular restarts, I haven't verified it though [10:16:16] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] aptrepo: drop k8s 1.18 repo [puppet] - 10https://gerrit.wikimedia.org/r/719402 (owner: 10Majavah) [10:16:45] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host rdb2007.codfw.wmnet [10:16:48] (03PS4) 10Hnowlan: cassandra: remove variable for enabling jmx exporter [puppet] - 10https://gerrit.wikimedia.org/r/724105 [10:16:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:17:25] (03CR) 10Effie Mouzeli: [C: 03+2] hieradata: Temporary health endpoint for tegola-vector-tiles [puppet] - 10https://gerrit.wikimedia.org/r/724357 (owner: 10Effie Mouzeli) [10:17:41] (03PS1) 10Volans: sre.experimental.reimage: add stretch to OS list [cookbooks] - 10https://gerrit.wikimedia.org/r/724359 [10:17:43] (03PS1) 10Volans: sre.experimental.reimage: remove downtime [cookbooks] - 10https://gerrit.wikimedia.org/r/724360 [10:18:21] (03CR) 10jerkins-bot: [V: 04-1] cassandra: remove variable for enabling jmx exporter [puppet] - 10https://gerrit.wikimedia.org/r/724105 (owner: 10Hnowlan) [10:22:59] (03PS1) 10Majavah: aptrepo: Init thirdparty/kubeadm-k8s-1-20 [puppet] - 10https://gerrit.wikimedia.org/r/724365 (https://phabricator.wikimedia.org/T280402) [10:23:50] !log jiji@cumin1001 END (PASS) 
- Cookbook sre.hosts.reboot-single (exit_code=0) for host rdb2007.codfw.wmnet [10:23:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:24:37] RECOVERY - Check systemd state on ms-be2060 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:26:52] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM." [cookbooks] - 10https://gerrit.wikimedia.org/r/724359 (owner: 10Volans) [10:27:07] (03PS5) 10Hnowlan: cassandra: remove variable for enabling jmx exporter [puppet] - 10https://gerrit.wikimedia.org/r/724105 [10:27:32] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/724360 (owner: 10Volans) [10:29:01] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host rdb2009.codfw.wmnet [10:29:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:30:36] (03PS13) 10Giuseppe Lavagetto: ci::master: remove production k8s tokens [puppet] - 10https://gerrit.wikimedia.org/r/724015 [10:30:56] (03CR) 10Hnowlan: [V: 03+1] "PCC SUCCESS (DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31337/console" [puppet] - 10https://gerrit.wikimedia.org/r/724105 (owner: 10Hnowlan) [10:31:46] (03CR) 10Giuseppe Lavagetto: [C: 03+2] ci::master: remove production k8s tokens [puppet] - 10https://gerrit.wikimedia.org/r/724015 (owner: 10Giuseppe Lavagetto) [10:32:41] (03CR) 10Giuseppe Lavagetto: [C: 03+2] "I am merging this change now because I think it will improve our setup security while not removing any meaningful functionality from the c" [puppet] - 10https://gerrit.wikimedia.org/r/724015 (owner: 10Giuseppe Lavagetto) [10:32:45] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rdb2009.codfw.wmnet [10:32:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:33:34] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - 
degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:35:26] (03CR) 10David Caro: [V: 03+1] "The changes on the pcc are the ones expected, the error was due to lack of space on the compiler node. Will merge after lunch." [puppet] - 10https://gerrit.wikimedia.org/r/724003 (https://phabricator.wikimedia.org/T291585) (owner: 10David Caro) [10:35:41] jouncebot: nowandnext [10:35:42] No deployments scheduled for the next 0 hour(s) and 24 minute(s) [10:35:42] In 0 hour(s) and 24 minute(s): European mid-day backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210928T1100) [10:35:57] (03CR) 10Ladsgroup: [C: 03+2] "This change is ready for review." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/723211 (https://phabricator.wikimedia.org/T291610) (owner: 10Michael Große) [10:36:00] cool. I'm going to deploy 723211 [10:36:09] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:37:49] (03Merged) 10jenkins-bot: Enable new dispatch via job approach on testwikidata and testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/723211 (https://phabricator.wikimedia.org/T291610) (owner: 10Michael Große) [10:38:14] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] aptrepo: Init thirdparty/kubeadm-k8s-1-20 [puppet] - 10https://gerrit.wikimedia.org/r/724365 (https://phabricator.wikimedia.org/T280402) (owner: 10Majavah) [10:39:15] PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:40:43] !log 
jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host rdb1012.eqiad.wmnet [10:40:47] RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:40:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:42:25] (03Abandoned) 10Ladsgroup: Enable dispatching via jobs in testwikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/724230 (https://phabricator.wikimedia.org/T48643) (owner: 10Ladsgroup) [10:43:19] I get this while restarting php-fpm [10:43:22] https://www.irccloud.com/pastebin/fiaMaPQP/ [10:43:53] (03PS1) 10Arturo Borrero Gonzalez: openstack: galera: allow optional access to the database from manila share [puppet] - 10https://gerrit.wikimedia.org/r/724391 (https://phabricator.wikimedia.org/T291257) [10:44:32] effie: ^^^ seems related to the new scap release [10:45:49] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rdb1012.eqiad.wmnet [10:45:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:46:13] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts testvm2002.codfw.wmnet [10:46:17] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[10:46:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:46:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:46:37] (03CR) 10Btullis: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/724105 (owner: 10Hnowlan) [10:48:23] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 132, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:48:47] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 236, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:48:59] PROBLEM - ganeti-wconfd running on ganeti2025 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 114 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti [10:49:18] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[10:49:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:49:40] (03PS1) 10Volans: sre.experimental.reimage: improve logging [cookbooks] - 10https://gerrit.wikimedia.org/r/724392 [10:50:05] (03CR) 10Volans: [C: 03+2] sre.experimental.reimage: add stretch to OS list [cookbooks] - 10https://gerrit.wikimedia.org/r/724359 (owner: 10Volans) [10:50:09] (03CR) 10Volans: [C: 03+2] sre.experimental.reimage: remove downtime [cookbooks] - 10https://gerrit.wikimedia.org/r/724360 (owner: 10Volans) [10:50:48] (03PS2) 10Btullis: Increase the number of the Hadoop HDFS Namenode's service handler threads [puppet] - 10https://gerrit.wikimedia.org/r/723490 (https://phabricator.wikimedia.org/T275767) [10:52:49] (03Merged) 10jenkins-bot: sre.experimental.reimage: add stretch to OS list [cookbooks] - 10https://gerrit.wikimedia.org/r/724359 (owner: 10Volans) [10:53:09] (03Merged) 10jenkins-bot: sre.experimental.reimage: remove downtime [cookbooks] - 10https://gerrit.wikimedia.org/r/724360 (owner: 10Volans) [10:53:33] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host rdb1011.eqiad.wmnet [10:53:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:56:24] 10SRE, 10Infrastructure-Foundations: Tracking task for Bullseye migrations in production - https://phabricator.wikimedia.org/T291916 (10MoritzMuehlenhoff) [10:57:27] (03PS1) 10Lucas Werkmeister (WMDE): Use CONN_TRX_AUTOCOMMIT in SqlSiteLinkConflictLookup [extensions/Wikibase] (wmf/1.38.0-wmf.1) - 10https://gerrit.wikimedia.org/r/724370 (https://phabricator.wikimedia.org/T291377) [10:57:38] (03PS1) 10Lucas Werkmeister (WMDE): Use CONN_TRX_AUTOCOMMIT in SqlSiteLinkConflictLookup [extensions/Wikibase] (wmf/1.38.0-wmf.2) - 10https://gerrit.wikimedia.org/r/724371 (https://phabricator.wikimedia.org/T291377) [10:57:51] 10SRE, 10Infrastructure-Foundations: Tracking task for Bullseye migrations in production - https://phabricator.wikimedia.org/T291916 
(10MoritzMuehlenhoff) [10:57:53] 10SRE, 10Discovery-Search: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10MoritzMuehlenhoff) [10:58:23] PROBLEM - Disk space on ms-be2035 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=89%): /tmp 0 MB (0% inode=89%): /var/tmp 0 MB (0% inode=89%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ms-be2035&var-datasource=codfw+prometheus/ops [10:58:24] 10SRE, 10Infrastructure-Foundations: Tracking task for Bullseye migrations in production - https://phabricator.wikimedia.org/T291916 (10MoritzMuehlenhoff) [10:58:46] (03PS1) 10Giuseppe Lavagetto: ci: add user_defaults for k8s configs [puppet] - 10https://gerrit.wikimedia.org/r/724395 [10:58:53] (03PS1) 10Elukey: Revert "role::ores: move celery and cache to rdb2008" [puppet] - 10https://gerrit.wikimedia.org/r/724372 [10:59:05] 10SRE, 10Infrastructure-Foundations: Tracking task for Bullseye migrations in production - https://phabricator.wikimedia.org/T291916 (10MoritzMuehlenhoff) p:05Triage→03Medium [10:59:10] 10SRE, 10Patch-For-Review: Prepare our base system layer for Debian 11/bullseye - https://phabricator.wikimedia.org/T275873 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff Bullseye preparations have completed and it's in active use, closing. For future migration tracking, T291916 can be used. [10:59:33] (03PS51) 10Btullis: Install Alluxio to the test cluster [puppet] - 10https://gerrit.wikimedia.org/r/712974 (https://phabricator.wikimedia.org/T266641) [11:00:05] Amir1, Lucas_WMDE, awight, and Urbanecm: My dear minions, it's time we take the moon! Just kidding. Time for European mid-day backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210928T1100). [11:00:05] kart_ and Lucas_WMDE: A patch you scheduled for European mid-day backport window is about to be deployed. 
Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:12] o/ [11:00:15] * kart_ is here [11:00:17] * urbanecm waves [11:00:23] Hello urbanecm [11:00:25] I can deploy today [11:00:31] PROBLEM - HP RAID on ms-be2035 is CRITICAL: CRITICAL: Slot 3: OK: 1I:1:1, 1I:1:2, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2 - Failed: 1I:1:3 - Controller: OK - Battery/Capacitor: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [11:00:33] ACKNOWLEDGEMENT - HP RAID on ms-be2035 is CRITICAL: CRITICAL: Slot 3: OK: 1I:1:1, 1I:1:2, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2 - Failed: 1I:1:3 - Controller: OK - Battery/Capacitor: OK nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T291917 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [11:00:36] 10SRE, 10ops-codfw: Degraded RAID on ms-be2035 - https://phabricator.wikimedia.org/T291917 (10ops-monitoring-bot) [11:00:40] (03CR) 10Giuseppe Lavagetto: [C: 03+2] ci: add user_defaults for k8s configs [puppet] - 10https://gerrit.wikimedia.org/r/724395 (owner: 10Giuseppe Lavagetto) [11:01:18] (03PS11) 10ZPapierski: Added spicerack.kafka with offset transfer function [software/spicerack] - 10https://gerrit.wikimedia.org/r/723214 (https://phabricator.wikimedia.org/T276469) [11:01:25] kart_: do you want to start by deploying your change? 
[11:01:34] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rdb1011.eqiad.wmnet [11:01:35] (03CR) 10ZPapierski: Added spicerack.kafka with offset transfer function (0314 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/723214 (https://phabricator.wikimedia.org/T276469) (owner: 10ZPapierski) [11:01:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:01:56] 10SRE-tools, 10Infrastructure-Foundations: Introduce Spicerack.kafka module, along with the method to transfer offset state between consumer groups and clusters - https://phabricator.wikimedia.org/T291681 (10Zbyszko) > * Does this method returns anything? No, there's no need - if there are issues, there will b... [11:02:12] Lucas_WMDE: Sure [11:02:14] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (DIFF 4 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31338/console" [puppet] - 10https://gerrit.wikimedia.org/r/712974 (https://phabricator.wikimedia.org/T266641) (owner: 10Btullis) [11:02:19] ok [11:02:20] (03PS1) 10Alexandros Kosiaris: wikifeeds: Increase capacity by 100% [deployment-charts] - 10https://gerrit.wikimedia.org/r/724397 (https://phabricator.wikimedia.org/T291914) [11:02:23] I’ll already +2 my backports since they’ll take a while to make it through gate-and-submit [11:02:34] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Use CONN_TRX_AUTOCOMMIT in SqlSiteLinkConflictLookup [extensions/Wikibase] (wmf/1.38.0-wmf.1) - 10https://gerrit.wikimedia.org/r/724370 (https://phabricator.wikimedia.org/T291377) (owner: 10Lucas Werkmeister (WMDE)) [11:02:59] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Use CONN_TRX_AUTOCOMMIT in SqlSiteLinkConflictLookup [extensions/Wikibase] (wmf/1.38.0-wmf.2) - 10https://gerrit.wikimedia.org/r/724371 (https://phabricator.wikimedia.org/T291377) (owner: 10Lucas Werkmeister (WMDE)) [11:03:20] Oh mean, should I go ahead and deploy my config change or you can help? 
:) [11:03:36] if you want to & can do it, feel free to go ahead [11:03:40] otherwise I can also help [11:03:53] Please help. It requires some more testing :) [11:03:59] (afaict you’re in the deployers group so I assumed you’d want to do it yourself) [11:04:01] ok sure! [11:04:07] do you want to +2 it or should I? [11:04:30] Basically, it adds and removes config variable - so need some experienced person to deploy. [11:04:47] New addition of config is coming with train this week. [11:04:58] since it’s all in the same file, it should be safe to deploy as a single `scap sync-file` [11:05:25] !log downgrading scap to 3.17.1 on deploy1002 - T291095 [11:05:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:05:32] T291095: Deploy Scap version 4.0.0 - https://phabricator.wikimedia.org/T291095 [11:05:46] (03CR) 10Elukey: [C: 03+2] Revert "role::ores: move celery and cache to rdb2008" [puppet] - 10https://gerrit.wikimedia.org/r/724372 (owner: 10Elukey) [11:07:37] Lucas_WMDE: Please +2 (Or whatever needed). I guess need rebase? [11:07:52] !log elukey@cumin1001 START - Cookbook sre.ores.roll-restart-workers for ORES codfw cluster: Roll restart of ORES's daemons. - elukey@cumin1001 [11:07:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:08:02] (03PS7) 10Lucas Werkmeister (WMDE): Add support for SectionTranslationTargetLanguages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/720982 (https://phabricator.wikimedia.org/T290302) (owner: 10KartikMistry) [11:08:15] PROBLEM - SSH on bast3005 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [11:08:40] I rebased it, I’ll look at the diffConfig CI output before +2ing [11:08:43] PROBLEM - Check systemd state on ms-be2061 is CRITICAL: CRITICAL - degraded: The following units failed: session-67620.scope https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:08:51] Sure! 
[11:08:53] (03CR) 10jerkins-bot: [V: 04-1] Added spicerack.kafka with offset transfer function [software/spicerack] - 10https://gerrit.wikimedia.org/r/723214 (https://phabricator.wikimedia.org/T276469) (owner: 10ZPapierski) [11:08:55] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts testvm2002.codfw.wmnet [11:08:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:09:01] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Create Ganeti test cluster - https://phabricator.wikimedia.org/T286206 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `testvm2002.codfw.wmnet` - testvm2002.codfw.wmnet (**WARN**) - //Host not found on Ici... [11:09:17] !log ladsgroup@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:723211|Enable new dispatch via job approach on testwikidata and testwiki (T291610)]] (duration: 00m 57s) [11:09:22] Lucas_WMDE: curious: What is diffConfig CI output? and how is it useful? [11:09:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:09:26] T291610: Enable new Dispatching on test wikidata - https://phabricator.wikimedia.org/T291610 [11:09:28] <_joe_> Lucas_WMDE: please hold on a sec [11:09:33] ok [11:09:37] <_joe_> uhhh nevermind [11:09:43] <_joe_> I was reading the older backlog [11:09:45] kart_: it tells us how the config for each wiki changes [11:09:47] <_joe_> :D [11:09:59] <_joe_> I see that amir is testing the scap rollback [11:10:13] I see that Amir1 is testing *something* [11:10:13] RECOVERY - SSH on bast3005 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [11:10:15] idk what [11:10:20] Lucas_WMDE: Any link to view it or commands? 
[11:10:25] effie: works fine [11:10:27] (03PS2) 10Arturo Borrero Gonzalez: openstack: galera: allow optional access to the database from manila share [puppet] - 10https://gerrit.wikimedia.org/r/724391 (https://phabricator.wikimedia.org/T291257) [11:10:30] kart_: it’s on https://integration.wikimedia.org/zuul/ [11:10:36] Amir1: cheers, tx [11:10:42] ah, and now it started running too, https://integration.wikimedia.org/ci/job/operations-mw-config-php72-composer-diffConfig-docker/8307/console [11:10:51] Amir1, effie: am I good to go? or should I hold? [11:10:56] (03CR) 10jerkins-bot: [V: 04-1] openstack: galera: allow optional access to the database from manila share [puppet] - 10https://gerrit.wikimedia.org/r/724391 (https://phabricator.wikimedia.org/T291257) (owner: 10Arturo Borrero Gonzalez) [11:11:10] (03CR) 10Hnowlan: [V: 03+1 C: 03+2] cassandra: remove variable for enabling jmx exporter [puppet] - 10https://gerrit.wikimedia.org/r/724105 (owner: 10Hnowlan) [11:11:11] Lucas_WMDE: go ahead please, and if something is wrong, just blame Amir1, thanks [11:11:16] ok [11:11:31] 10Puppet, 10Infrastructure-Foundations: investigate how rspec parses define paramters - https://phabricator.wikimedia.org/T291374 (10jbond) 05Open→03Resolved As pointed out on the issue linked above the issue here is that the facts variable needs to be defined in the os context e.g. it was also pointed ou... [11:11:40] I’d appreciate a heads-up earlier next time, we’re ten minutes into the window and I had no idea you were working on scap… [11:12:17] kart_: e.g. 
if you search for thwiki.json you’ll see that wgContentTranslationEnableSectionTranslation changed from false to true [11:12:23] Lucas_WMDE: Amir just reported it [11:12:52] !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host testvm2001.codfw.wmnet [11:12:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:14:02] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Add support for SectionTranslationTargetLanguages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/720982 (https://phabricator.wikimedia.org/T290302) (owner: 10KartikMistry) [11:14:48] Amir1: are you still deploying something? [11:14:54] no [11:14:56] ok [11:14:57] (03PS3) 10Arturo Borrero Gonzalez: openstack: galera: allow optional access to the database from manila share [puppet] - 10https://gerrit.wikimedia.org/r/724391 (https://phabricator.wikimedia.org/T291257) [11:15:36] (03Merged) 10jenkins-bot: Add support for SectionTranslationTargetLanguages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/720982 (https://phabricator.wikimedia.org/T290302) (owner: 10KartikMistry) [11:16:05] kart_: the change should be on mwdebug1002 now, can you test it using the x-wikimedia-debug extension? [11:16:07] (03CR) 10Majavah: openstack: galera: allow optional access to the database from manila share (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/724391 (https://phabricator.wikimedia.org/T291257) (owner: 10Arturo Borrero Gonzalez) [11:16:20] (https://wikitech.wikimedia.org/wiki/WikimediaDebug#Browser_usage) [11:16:57] Lucas_WMDE: sure. Testing.. 
[11:17:02] ok [11:17:30] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] "PCC: https://puppet-compiler.wmflabs.org/compiler1003/31342/" [puppet] - 10https://gerrit.wikimedia.org/r/724391 (https://phabricator.wikimedia.org/T291257) (owner: 10Arturo Borrero Gonzalez) [11:17:53] PROBLEM - Device not healthy -SMART- on ms-be2035 is CRITICAL: cluster=swift device=None instance=ms-be2035 job=node site=codfw https://wikitech.wikimedia.org/wiki/SMART%23Alerts https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ms-be2035&var-datasource=codfw+prometheus/ops [11:18:04] (03PS1) 10Jbond: apt::package_from_component: update spec tests [puppet] - 10https://gerrit.wikimedia.org/r/724402 [11:18:30] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] openstack: galera: allow optional access to the database from manila share (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/724391 (https://phabricator.wikimedia.org/T291257) (owner: 10Arturo Borrero Gonzalez) [11:22:17] (03CR) 10Jbond: [C: 03+2] apt::package_from_component: update spec tests [puppet] - 10https://gerrit.wikimedia.org/r/724402 (owner: 10Jbond) [11:23:06] (03CR) 10jerkins-bot: [V: 04-1] Use CONN_TRX_AUTOCOMMIT in SqlSiteLinkConflictLookup [extensions/Wikibase] (wmf/1.38.0-wmf.1) - 10https://gerrit.wikimedia.org/r/724370 (https://phabricator.wikimedia.org/T291377) (owner: 10Lucas Werkmeister (WMDE)) [11:23:22] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] "Try again, random failure" [extensions/Wikibase] (wmf/1.38.0-wmf.1) - 10https://gerrit.wikimedia.org/r/724370 (https://phabricator.wikimedia.org/T291377) (owner: 10Lucas Werkmeister (WMDE)) [11:23:28] 10SRE, 10MW-on-K8s, 10serviceops: Repartition mediawiki servers - https://phabricator.wikimedia.org/T291918 (10jijiki) [11:23:46] Lucas_WMDE: still testing.. 
[11:23:48] ok [11:23:49] RECOVERY - Check systemd state on ms-be2061 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:23:56] aharoni: is also here to test :) [11:24:04] I just learned I’ll have to wait at least 20 more minutes for my backports to merge, so, no rush :D [11:24:12] ah!! [11:25:16] (03CR) 10Jbond: [C: 03+1] Setup systemd timer to sync OS reports (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/724081 (owner: 10Muehlenhoff) [11:25:36] (03PS1) 10Elukey: WIP - kubernetes: add token config for revscoring-editquality-deploy [puppet] - 10https://gerrit.wikimedia.org/r/724405 [11:25:41] (03CR) 10Btullis: [V: 03+1 C: 03+2] Install Alluxio to the test cluster [puppet] - 10https://gerrit.wikimedia.org/r/712974 (https://phabricator.wikimedia.org/T266641) (owner: 10Btullis) [11:25:47] !log Deploy schema change on s6 codfw (lag will show up) T283499 [11:25:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:25:56] T283499: Schema change for renaming page_timestamp index on revision table to rev_page_timestamp - https://phabricator.wikimedia.org/T283499 [11:26:08] (03CR) 10Btullis: [V: 03+1 C: 03+2] Install Alluxio to the test cluster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/712974 (https://phabricator.wikimedia.org/T266641) (owner: 10Btullis) [11:26:26] (03CR) 10Jbond: [C: 03+1] sre.experimental.reimage: improve logging [cookbooks] - 10https://gerrit.wikimedia.org/r/724392 (owner: 10Volans) [11:27:10] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31343/console" [puppet] - 10https://gerrit.wikimedia.org/r/724405 (owner: 10Elukey) [11:27:14] !log elukey@cumin1001 END (PASS) - Cookbook sre.ores.roll-restart-workers (exit_code=0) for ORES codfw cluster: Roll restart of ORES's daemons. 
- elukey@cumin1001 [11:27:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:27:29] Lucas_WMDE: All good. Go ahead with deployment, please. [11:27:33] ok! [11:28:15] (03CR) 10jerkins-bot: [V: 04-1] Use CONN_TRX_AUTOCOMMIT in SqlSiteLinkConflictLookup [extensions/Wikibase] (wmf/1.38.0-wmf.2) - 10https://gerrit.wikimedia.org/r/724371 (https://phabricator.wikimedia.org/T291377) (owner: 10Lucas Werkmeister (WMDE)) [11:28:20] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [11:28:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:33] syncing… [11:29:04] !log cleanup unused repo component buster-wikimedia|thirdparty/kubeadm-k8s-1-18 (T280402) [11:29:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:29:09] T280402: Upgrade Toolforge Kubernetes to latest 1.20 - https://phabricator.wikimedia.org/T280402 [11:29:13] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] "random ECONNREFUSED" [extensions/Wikibase] (wmf/1.38.0-wmf.2) - 10https://gerrit.wikimedia.org/r/724371 (https://phabricator.wikimedia.org/T291377) (owner: 10Lucas Werkmeister (WMDE)) [11:29:19] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:720982|Add support for SectionTranslationTargetLanguages (T290302, T290175)]] (duration: 00m 57s) [11:29:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:29:25] T290175: Enable Section Translation for Igbo, Hausa, Yoruba and Thai Wikipedias - https://phabricator.wikimedia.org/T290175 [11:29:26] T290302: Confirm Section Translation can support the new set of languages - https://phabricator.wikimedia.org/T290302 [11:29:35] “/usr/local/sbin/check-and-restart-php php7.2-fpm 100 on wtp1026.eqiad.wmnet returned [4]: NOT restarting php7.2-fpm: free opcache 544 MB” [11:29:43] “1 hosts had failures restarting php-fpm” [11:29:56] ah, 
parsoid [11:29:58] !log Deploy schema change on s3 codfw (lag will show up) T283499 [11:30:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:30:12] Lucas_WMDE: I'm around if you need me for followup. [11:30:24] <_joe_> Lucas_WMDE: uh that's a strange result though [11:30:36] _joe_: I was about to retry that via SSH, should I? [11:31:34] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [11:31:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:32:08] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host testvm2001.codfw.wmnet [11:32:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:32:19] <_joe_> Lucas_WMDE: no need [11:32:51] hm, ok [11:33:01] so I’m good to proceed with other deployments? (once they go through gate-and-submit) [11:33:14] <_joe_> Lucas_WMDE: can you paste the full output somewhere? 
[11:33:46] <_joe_> because the command you pasted returns an exitcode of 0, so I don't think that's what caused the failure [11:34:02] _joe_: https://phabricator.wikimedia.org/P17335 (there’s not much more to it) [11:34:53] I put the full scap output into ~/scap-2021-09-28 [11:35:01] (sorry, that’s ~lucaswerkmeister-wmde/scap-2021-09-28) [11:35:41] (03PS1) 10Majavah: aptrepo: Update 1C0576B1761693CB_pyall.gpg [puppet] - 10https://gerrit.wikimedia.org/r/724406 [11:35:50] added it all to the phab paste [11:36:20] <_joe_> yeah no idea why that output tbh [11:36:24] <_joe_> anyways, go on [11:36:30] ok [11:36:40] <_joe_> sorry for the interruption, but better to check with care [11:37:12] (03PS2) 10Majavah: aptrepo: Update 1C0576B1761693CB_pyall.gpg [puppet] - 10https://gerrit.wikimedia.org/r/724406 [11:37:45] looks like the message comes from https://gerrit.wikimedia.org/g/operations/puppet/+/2ce7ae6aeb09c4acae9a1dca269cc9c1e8caab1d/modules/profile/files/mediawiki/php/php-check-and-restart.sh#27 [11:39:03] I’m guessing “returned [4]” means exit code 4 but I’m not sure where that script would exit with [11:39:05] *with 4 [11:39:24] <_joe_> Lucas_WMDE: exactly [11:39:38] <_joe_> running the script as mwdeploy on the server returns 0 [11:39:40] the “NOT restarting php7.2-fpm” seems to be a normal condition? 
I just normally wouldn’t see it [11:39:42] <_joe_> and has that output [11:39:43] <_joe_> yes [11:39:47] I see [11:39:55] ok, so probably ok to proceed indeed, but strange [11:40:10] <_joe_> APCU_FRAGMENTATION=$(php7adm /apcu-frag |jq .fragmentation 2>&1) is the only place where I could think of having such an exit [11:40:15] <_joe_> at line 31 [11:40:28] ah, the script has set -e at the beginning [11:40:31] then it could be that [11:41:15] 10SRE, 10SRE-swift-storage, 10ops-codfw: swift - ms-be2035 - device sdi:6 unavailable - https://phabricator.wikimedia.org/T291896 (10RhinosF1) {T291917} and another ticket about this host [11:41:19] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] "I commented too early >.< let’s kick off the gate-and-submit again" [extensions/Wikibase] (wmf/1.38.0-wmf.2) - 10https://gerrit.wikimedia.org/r/724371 (https://phabricator.wikimedia.org/T291377) (owner: 10Lucas Werkmeister (WMDE)) [11:41:33] ^ I’ll probably overrun the window a bit [11:41:38] but there’s nothing else after it [11:41:44] 10SRE, 10ops-codfw: Degraded RAID on ms-be2035 - https://phabricator.wikimedia.org/T291917 (10fgiunchedi) [11:41:51] sorry if anyone’s eagerly waiting for me to be done, it’ll take a while longer [11:41:53] 10SRE, 10SRE-swift-storage, 10ops-codfw: swift - ms-be2035 - device sdi:6 unavailable - https://phabricator.wikimedia.org/T291896 (10fgiunchedi) [11:42:47] Ty godog [11:43:11] (03PS2) 10Ema: admin: set krb attribute to 'present' for ema [puppet] - 10https://gerrit.wikimedia.org/r/723536 [11:44:01] (03CR) 10Ema: [C: 03+2] admin: set krb attribute to 'present' for ema [puppet] - 10https://gerrit.wikimedia.org/r/723536 (owner: 10Ema) [11:44:09] 10SRE, 10MW-on-K8s, 10Performance-Team, 10Traffic, and 2 others: Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (10jijiki) >>! In T290536#7376383, @akosiaris wrote: >>>! In T290536#7371552, @jijiki wrote: >>>>! In T290536#7364817, @Joe wrote: >>> We could thus start wit... 
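(Editorial note on the 11:40 exchange about set -e.) The theory discussed there is that a script starting with set -e can exit early when a VAR=$(cmd | filter) assignment fails, and that the script's exit status is then whatever the last command in that pipeline returned. The sketch below only illustrates that shell behaviour; it is not the real php-check-and-restart.sh. The function name false_filter, the fixed status 4 and the variable names are assumptions chosen for the demonstration.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for a check script that begins with `set -e`.
# `false_filter` plays the role of a filter at the end of a pipeline that
# fails with status 4 (an assumed value, picked to mirror "returned [4]").
script='
set -e
false_filter() { cat >/dev/null; return 4; }
# An assignment-only command takes the exit status of its command
# substitution, which is the status of the last command in the pipeline.
VALUE=$(echo "{}" | false_filter 2>&1)
echo "not reached"
'

# Run the stand-in script in a child shell and capture its output and status.
output=$(bash -c "$script") && status=0 || status=$?
echo "exit status: $status, output: '$output'"
```

Under these assumptions the child shell stops at the assignment, so the final echo inside it never runs and the caller only sees the non-zero status. That would match a wrapper reporting a failure even though the visible message looked like a normal condition.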
[11:44:23] (03PS1) 10Btullis: Fix the ferm configuration for alluxio workers [puppet] - 10https://gerrit.wikimedia.org/r/724407 (https://phabricator.wikimedia.org/T266641) [11:45:27] RhinosF1: sure! thanks to you too [11:45:30] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31344/console" [puppet] - 10https://gerrit.wikimedia.org/r/724407 (https://phabricator.wikimedia.org/T266641) (owner: 10Btullis) [11:45:49] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/724392 (owner: 10Volans) [11:46:44] 10SRE, 10MW-on-K8s, 10Performance-Team, 10Traffic, and 2 others: Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (10jijiki) [11:46:48] godog: some days 100,000 emails are useful [11:46:48] (03CR) 10Btullis: [V: 03+1 C: 03+2] Fix the ferm configuration for alluxio workers [puppet] - 10https://gerrit.wikimedia.org/r/724407 (https://phabricator.wikimedia.org/T266641) (owner: 10Btullis) [11:47:02] (03CR) 10Giuseppe Lavagetto: [C: 03+2] ci: Add 'bullseye' to docker lsbdistcodename hack [puppet] - 10https://gerrit.wikimedia.org/r/717687 (https://phabricator.wikimedia.org/T284774) (owner: 10Krinkle) [11:47:49] (03CR) 10Giuseppe Lavagetto: [C: 03+2] ci: Add 'bullseye' to docker lsbdistcodename hack (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/717687 (https://phabricator.wikimedia.org/T284774) (owner: 10Krinkle) [11:48:07] (03Merged) 10jenkins-bot: Use CONN_TRX_AUTOCOMMIT in SqlSiteLinkConflictLookup [extensions/Wikibase] (wmf/1.38.0-wmf.1) - 10https://gerrit.wikimedia.org/r/724370 (https://phabricator.wikimedia.org/T291377) (owner: 10Lucas Werkmeister (WMDE)) [11:48:21] (03PS8) 10Hnowlan: cassandra: use FQDN in CN name for future instances [puppet] - 10https://gerrit.wikimedia.org/r/724061 (https://phabricator.wikimedia.org/T141541) [11:48:30] oh, wmf.2 isn’t actually on the deployment host yet? 
[11:49:17] jeena, if you’re online already: is it okay if I merge something into wmf.2 at the moment? or should I wait? [11:49:30] I remember some issues in the past from trying to backport something early in the train [11:49:51] 10SRE, 10serviceops, 10wikidiff2, 10Community-Tech (CommTech-Sprint-10), 10Platform Team Workboards (Platform Engineering Reliability): Deploy wikidiff2 1.12.0 - https://phabricator.wikimedia.org/T285857 (10Daimona) [11:50:06] 10SRE, 10serviceops, 10wikidiff2, 10Community-Tech (CommTech-Sprint-10), 10Platform Team Workboards (Platform Engineering Reliability): Deploy wikidiff2 1.13.0 - https://phabricator.wikimedia.org/T285857 (10Daimona) [11:51:16] 10SRE, 10serviceops, 10wikidiff2, 10Community-Tech (CommTech-Sprint-10), 10Platform Team Workboards (Platform Engineering Reliability): Deploy wikidiff2 1.13.0 - https://phabricator.wikimedia.org/T285857 (10Daimona) @WDoranWMF, @hnowlan Hello, this is now unblocked and ready to go. Note that the version... [11:51:54] alright, I’m testing the wmf.1 backport on mwdebug1001 [11:51:59] (NB: not mwdebug1002 as earlier) [11:52:00] (03CR) 10Giuseppe Lavagetto: [C: 03+2] ci: Apply profile::wmcs::lvm as needed for new integration instances [puppet] - 10https://gerrit.wikimedia.org/r/722476 (https://phabricator.wikimedia.org/T277078) (owner: 10Krinkle) [11:52:02] (03CR) 10Hnowlan: [V: 03+1] "PCC SUCCESS (DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31345/console" [puppet] - 10https://gerrit.wikimedia.org/r/724061 (https://phabricator.wikimedia.org/T141541) (owner: 10Hnowlan) [11:52:07] (03PS4) 10Giuseppe Lavagetto: ci: Apply profile::wmcs::lvm as needed for new integration instances [puppet] - 10https://gerrit.wikimedia.org/r/722476 (https://phabricator.wikimedia.org/T277078) (owner: 10Krinkle) [11:52:24] works fine \o/ [11:52:28] I’ll sync the wmf.1 backport [11:52:48] but probably cancel the gate-and-submit of the wmf.2 one, to be on the safe side [11:54:22] 
(03PS2) 10Elukey: kubernetes: add token config for revscoring-editquality-deploy [puppet] - 10https://gerrit.wikimedia.org/r/724405 (https://phabricator.wikimedia.org/T251305) [11:54:22] !log lucaswerkmeister-wmde@deploy1002 Synchronized php-1.38.0-wmf.1/extensions/Wikibase/repo/includes/Store/Sql/SqlSiteLinkConflictLookup.php: Backport: [[gerrit:724370|Use CONN_TRX_AUTOCOMMIT in SqlSiteLinkConflictLookup (T291377)]] (duration: 00m 57s) [11:54:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:54:30] T291377: Prevent creation of items having the same sitelinks (duplicates) using memcached and database locks - https://phabricator.wikimedia.org/T291377 [11:54:44] _joe_: this time there was no warning from scap, fyi [11:55:07] <_joe_> Lucas_WMDE: that seems to confirm my theory that it was a bad response from php7adm [11:55:28] (03CR) 10Lucas Werkmeister (WMDE): "Nah, let’s not +2 this yet, the wmf.2 train hasn’t properly started yet." [extensions/Wikibase] (wmf/1.38.0-wmf.2) - 10https://gerrit.wikimedia.org/r/724371 (https://phabricator.wikimedia.org/T291377) (owner: 10Lucas Werkmeister (WMDE)) [11:55:36] ok [11:57:43] 10SRE, 10serviceops, 10Kubernetes, 10Patch-For-Review: Migrate to helm v3 - https://phabricator.wikimedia.org/T251305 (10elukey) Thanks to Joe's refactoring (https://gerrit.wikimedia.org/r/c/operations/puppet/+/7234190) we have now a quick way to define -deploy users with separate permissions. I have creat... [11:57:46] in that case I think we’re done with the deployment window [11:57:52] !log EU backport+config window done [11:57:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:00:05] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[12:00:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:03:14] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [12:03:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:04:24] (03PS4) 10Giuseppe Lavagetto: Add toolhub to LVS [puppet] - 10https://gerrit.wikimedia.org/r/711702 (https://phabricator.wikimedia.org/T280881) (owner: 10Legoktm) [12:06:15] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Add toolhub to LVS [puppet] - 10https://gerrit.wikimedia.org/r/711702 (https://phabricator.wikimedia.org/T280881) (owner: 10Legoktm) [12:06:36] !log Remove flaggedimages from s7 T290340 [12:06:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:06:42] T290340: Drop the flaggedimages table from Wikimedia production - https://phabricator.wikimedia.org/T290340 [12:10:19] !log lucaswerkmeister-wmde@wtp1026:~$ sudo -u mwdeploy /usr/local/sbin/restart-php7.2-fpm # attempt to solve a recurrence of T290120, but it failed [12:10:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:10:26] T290120: Cannot declare class Wikimedia\MWConfig\XWikimediaDebug, because the name is already in use in XWikimediaDebug.php - https://phabricator.wikimedia.org/T290120 [12:10:26] ^ I may have messed up there :/ [12:10:41] _joe_: can you help me? since it’s another wtp host thing [12:10:54] * urbanecm waves, in case i can be of any help [12:10:57] hi [12:11:02] <_joe_> Lucas_WMDE: this only happens on wtp hosts, but is fixed with a restart [12:11:07] <_joe_> Lucas_WMDE: which host? 
[12:11:08] there are lots of “cannot declare class XWikimediaDebug” errors in logstash, and that Phabricator task led me to believe a php-fpm restart would fix it [12:11:12] wtp1026 [12:11:15] but clearly I don’t know how to restart it [12:11:23] sudo -i /usr/local/sbin/restart-php7.2-fpm [12:11:44] !log [urbanecm@wtp1026 ~]$ sudo -i /usr/local/sbin/restart-php7.2-fpm [12:11:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:12:06] hm, when I ran `sudo /usr/local/sbin/restart-php7.2-fpm` I got the password prompt [12:12:09] <_joe_> urbanecm: so we overlapped each other it seems [12:12:10] (without -i) [12:12:21] sorry _joe_ 🙂 [12:12:42] <_joe_> yeah that resulted in the service being depooled [12:12:55] I think it was depooled from my attempt already [12:12:58] 10SRE, 10serviceops, 10wikidiff2, 10Community-Tech (CommTech-Sprint-10), 10Platform Team Workboards (Platform Engineering Reliability): Deploy wikidiff2 1.13.0 - https://phabricator.wikimedia.org/T285857 (10WMDE-Fisch) [12:13:04] “Service restart failed. 
NOT repooling” [12:13:15] <_joe_> Lucas_WMDE: yeah that would make sense [12:13:26] <_joe_> so your attempt should at least have stopped the errors [12:13:46] they’re gone from logstash now it seems [12:14:08] <_joe_> yeah we also restarted php-fpm [12:14:12] `systemctl status php7.2-fpm.service` looks good now [12:14:14] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 232, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:16:16] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 233, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:16:44] I’m still not sure what I should’ve done (other than ask for help directly ^^) [12:17:00] if I can’t sudo the command without -i, shouldn’t I get the same error with -i? [12:17:11] (03CR) 10Alexandros Kosiaris: [C: 03+2] wikifeeds: Increase capacity by 100% [deployment-charts] - 10https://gerrit.wikimedia.org/r/724397 (https://phabricator.wikimedia.org/T291914) (owner: 10Alexandros Kosiaris) [12:17:25] Lucas_WMDE: no, because sudoers is only configured to let you run it with -i, but not without [12:17:25] (well, “error” – the password prompt with the “usual lecture” stuff which really means “this didn’t match a NOPASSWD rule”) [12:17:29] o_O [12:17:33] okay [12:17:37] thanks ^^ [12:18:26] I’ll try to remember that [12:18:36] (or, remember to look at https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#PHP7_opcache_health next time) [12:19:35] (03CR) 10Jelto: [C: 03+1] "lgtm. rolebinding for the -deploy user exists already. 
Don't forget to add the token to private puppet (I guess..)" [puppet] - 10https://gerrit.wikimedia.org/r/724405 (https://phabricator.wikimedia.org/T251305) (owner: 10Elukey) [12:21:58] (03Merged) 10jenkins-bot: wikifeeds: Increase capacity by 100% [deployment-charts] - 10https://gerrit.wikimedia.org/r/724397 (https://phabricator.wikimedia.org/T291914) (owner: 10Alexandros Kosiaris) [12:23:34] urbanecm: should I open an issue for the fact that it let me run the command with `-u mwdeploy`, which then didn’t do the right thing? [12:23:40] (03PS1) 10Muehlenhoff: Update DHCP address for testvm2001 [puppet] - 10https://gerrit.wikimedia.org/r/724410 [12:23:53] (the main error was “Failed to restart php7.2-fpm.service: Access denied”, not sure if I pasted that yet) [12:24:21] Lucas_WMDE: I think that command should check it's running under the right identity (ie. root) [12:24:30] yeah, it looks like it didn’t do that [12:24:37] I’ll open something [12:24:47] (the fact sudo didn't complain with -u mwdeploy is correct -- deployers can run anything as mwdeploy) [12:24:55] *nod* [12:26:21] 10SRE, 10MW-on-K8s, 10Performance-Team, 10Traffic, and 2 others: Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (10akosiaris) >>! In T290536#7383272, @jijiki wrote: > That is a good idea, I started a different task to discuss our options in partitioning our mediawiki... [12:26:54] (03CR) 10Muehlenhoff: [C: 03+2] Update DHCP address for testvm2001 [puppet] - 10https://gerrit.wikimedia.org/r/724410 (owner: 10Muehlenhoff) [12:27:19] !log akosiaris@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'wikifeeds' for release 'production' . [12:27:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:29:49] !log akosiaris@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'wikifeeds' for release 'production' . 
[12:29:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:30:19] 10SRE, 10serviceops: restart-php7.2-fpm attempts to run as non-root but can’t actually restart service, leaving instance depooled - https://phabricator.wikimedia.org/T291921 (10Lucas_Werkmeister_WMDE) [12:30:47] 10SRE, 10serviceops: restart-php7.2-fpm attempts to run as non-root but can’t actually restart service, leaving instance depooled - https://phabricator.wikimedia.org/T291921 (10Lucas_Werkmeister_WMDE) (Just to be clear, in case the task title is ambiguous: I’m aware this is my fault, I’m just suggesting to pre... [12:30:50] created ^ (feel free to retag, I wasn’t sure what to put there) [12:31:19] thanks! [12:35:18] !log btullis@deploy1002 Started deploy [analytics/refinery@380d165]: Regular analytics weekly train [analytics/refinery@380d165] [12:35:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:40:13] (03CR) 10David Caro: [V: 03+1 C: 03+2] ldap::sssd: Don't specify services on bullseye [puppet] - 10https://gerrit.wikimedia.org/r/724003 (https://phabricator.wikimedia.org/T291585) (owner: 10David Caro) [12:46:32] PROBLEM - etcd request latencies on kubestagemaster1001 is CRITICAL: instance=10.64.16.203 operation={get,list,listWithCount,update} https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [12:53:00] !log btullis@deploy1002 Finished deploy [analytics/refinery@380d165]: Regular analytics weekly train [analytics/refinery@380d165] (duration: 17m 42s) [12:53:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:53:15] (03CR) 10Volans: [C: 03+2] sre.experimental.reimage: improve logging [cookbooks] - 10https://gerrit.wikimedia.org/r/724392 (owner: 10Volans) [12:53:28] (03PS1) 10Joal: Add analytics purge for Gobblin old files [puppet] - 10https://gerrit.wikimedia.org/r/724413 (https://phabricator.wikimedia.org/T287084) 
[12:53:56] !log btullis@deploy1002 Started deploy [analytics/refinery@380d165] (thin): Regular analytics weekly train THIN [analytics/refinery@380d165] [12:54:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:54:03] !log btullis@deploy1002 Finished deploy [analytics/refinery@380d165] (thin): Regular analytics weekly train THIN [analytics/refinery@380d165] (duration: 00m 07s) [12:54:04] RECOVERY - etcd request latencies on kubestagemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [12:54:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:54:16] !log btullis@deploy1002 Started deploy [analytics/refinery@380d165] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@380d165] [12:54:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:55:58] (03Merged) 10jenkins-bot: sre.experimental.reimage: improve logging [cookbooks] - 10https://gerrit.wikimedia.org/r/724392 (owner: 10Volans) [12:56:16] (03PS1) 10Marostegui: install_server: Remove db2103 [puppet] - 10https://gerrit.wikimedia.org/r/724414 (https://phabricator.wikimedia.org/T290865) [12:56:51] (03CR) 10Volans: [C: 03+1] "LGTM! thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/724414 (https://phabricator.wikimedia.org/T290865) (owner: 10Marostegui) [12:57:15] (03CR) 10Marostegui: [C: 03+2] install_server: Remove db2103 [puppet] - 10https://gerrit.wikimedia.org/r/724414 (https://phabricator.wikimedia.org/T290865) (owner: 10Marostegui) [12:59:09] (03PS1) 10Papaul: Add mw24[12-19] MAC address and to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/724415 (https://phabricator.wikimedia.org/T290192) [12:59:45] (03CR) 10jerkins-bot: [V: 04-1] Add mw24[12-19] MAC address and to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/724415 (https://phabricator.wikimedia.org/T290192) (owner: 10Papaul) [13:00:32] PROBLEM - etcd request latencies on kubestagemaster1001 is CRITICAL: instance=10.64.16.203 operation={get,list,listWithCount,update} https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [13:01:18] !log btullis@deploy1002 Finished deploy [analytics/refinery@380d165] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@380d165] (duration: 07m 02s) [13:01:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:03:06] RECOVERY - etcd request latencies on kubestagemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [13:03:48] !log marostegui@cumin1001 START - Cookbook sre.experimental.reimage for host db2103.codfw.wmnet [13:03:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:03:58] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, thanks." [puppet] - 10https://gerrit.wikimedia.org/r/724406 (owner: 10Majavah) [13:04:00] (03CR) 10Muehlenhoff: [C: 03+2] aptrepo: Update 1C0576B1761693CB_pyall.gpg [puppet] - 10https://gerrit.wikimedia.org/r/724406 (owner: 10Majavah) [13:06:41] (03CR) 10Volans: "Much nicer! 
Couple of questions inline, I'll pass over the tests now" [software/spicerack] - 10https://gerrit.wikimedia.org/r/723214 (https://phabricator.wikimedia.org/T276469) (owner: 10ZPapierski) [13:08:33] (03PS2) 10Papaul: Add mw24[12-19] MAC address and to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/724415 (https://phabricator.wikimedia.org/T290192) [13:16:04] (03CR) 10Papaul: [C: 03+2] Add mw24[12-19] MAC address and to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/724415 (https://phabricator.wikimedia.org/T290192) (owner: 10Papaul) [13:16:14] (03PS3) 10Papaul: Add mw24[12-19] MAC address and to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/724415 (https://phabricator.wikimedia.org/T290192) [13:16:21] (03CR) 10Papaul: [V: 03+2 C: 03+2] Add mw24[12-19] MAC address and to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/724415 (https://phabricator.wikimedia.org/T290192) (owner: 10Papaul) [13:18:59] 10SRE, 10ops-codfw, 10DC-Ops, 10SRE Observability (FY2021/2022-Q1): Q1: (Need By: TBD) rack/setup/install centrallog2002.codfw.wmnet - https://phabricator.wikimedia.org/T289624 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2002.codfw.wmnet for hosts: ` centrallog2002.codfw.w... [13:19:13] (03CR) 10Volans: [C: 04-1] "The only major thing is the missing dry-run support, all the rest are nits (also from previous comment)" [software/spicerack] - 10https://gerrit.wikimedia.org/r/723214 (https://phabricator.wikimedia.org/T276469) (owner: 10ZPapierski) [13:19:15] (03PS1) 10Muehlenhoff: webserver-misc-apps.discovery: Add os-reports.w.o [puppet] - 10https://gerrit.wikimedia.org/r/724416 [13:19:17] (03CR) 10Ottomata: Eventgate TLS proxy: lower local idle_timeout (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/724173 (https://phabricator.wikimedia.org/T215001) (owner: 10Ppchelko) [13:23:02] (03CR) 10Ottomata: "Why 7 instead of 31 days?" 
[puppet] - 10https://gerrit.wikimedia.org/r/724413 (https://phabricator.wikimedia.org/T287084) (owner: 10Joal) [13:25:53] (03CR) 10Ottomata: [C: 03+1] Increase the number of the Hadoop HDFS Namenode's service handler threads [puppet] - 10https://gerrit.wikimedia.org/r/723490 (https://phabricator.wikimedia.org/T275767) (owner: 10Btullis) [13:30:01] !log otto@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'eventgate-main' for release 'production' . [13:30:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:56] (03CR) 10Herron: [C: 03+2] logstash: make jmx_ params optional [puppet] - 10https://gerrit.wikimedia.org/r/721370 (owner: 10Herron) [13:33:10] !log otto@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'eventgate-main' for release 'canary' . [13:33:10] !log otto@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'eventgate-main' for release 'production' . [13:33:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:33:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:15] (03PS3) 10Herron: logstash::input::gelf: add host param [puppet] - 10https://gerrit.wikimedia.org/r/721346 [13:36:09] !log otto@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-main' for release 'production' . [13:36:09] !log otto@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-main' for release 'canary' . 
[13:36:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:15] 10SRE, 10SRE-OnFire, 10SRE Observability (FY2021/2022-Q2): Implement an accurate and easy to understand status page for all wikis - https://phabricator.wikimedia.org/T202061 (10lmata) [13:36:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:40] !log marostegui@cumin1001 END (PASS) - Cookbook sre.experimental.reimage (exit_code=0) for host db2103.codfw.wmnet [13:36:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:49] 10SRE, 10SRE-OnFire, 10SRE Observability (FY2021/2022-Q2): Ensure SRE team has a good understanding of how & when to declare an outage on the status page; & it is easy to do so - https://phabricator.wikimedia.org/T285769 (10lmata) [13:36:51] volans: ^ \o/ [13:37:09] (03CR) 10Herron: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/721346 (owner: 10Herron) [13:37:21] great!, thanks for testing :) [13:37:30] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on centrallog2002.codfw.wmnet with reason: REIMAGE [13:37:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:38:42] (03CR) 10Herron: [C: 03+2] logstash::input::gelf: add host param [puppet] - 10https://gerrit.wikimedia.org/r/721346 (owner: 10Herron) [13:39:18] PROBLEM - MediaWiki exceptions and fatals per minute for jobrunner on alert1001 is CRITICAL: 141 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [13:39:43] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on centrallog2002.codfw.wmnet with reason: REIMAGE [13:39:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:40:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2103 T290865', diff saved 
to https://phabricator.wikimedia.org/P17337 and previous config saved to /var/cache/conftool/dbconfig/20210928-134012-marostegui.json [13:40:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:40:18] T290865: Upgrade s1 to Buster + MariaDB 10.4 - https://phabricator.wikimedia.org/T290865 [13:40:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2080 T290868', diff saved to https://phabricator.wikimedia.org/P17339 and previous config saved to /var/cache/conftool/dbconfig/20210928-134030-marostegui.json [13:40:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:40:37] T290868: Upgrade s8 to Buster + MariaDB 10.4 - https://phabricator.wikimedia.org/T290868 [13:40:40] RECOVERY - MediaWiki exceptions and fatals per minute for jobrunner on alert1001 is OK: (C)100 gt (W)50 gt 1 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [13:41:08] (03CR) 10Herron: [C: 03+2] logstash: add udp output module [puppet] - 10https://gerrit.wikimedia.org/r/721356 (owner: 10Herron) [13:44:50] (03CR) 10Gehel: "minor comments inline, otherwise LGTM!" 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/723214 (https://phabricator.wikimedia.org/T276469) (owner: 10ZPapierski) [13:44:54] (03CR) 10Elukey: [C: 03+2] kubernetes: add token config for revscoring-editquality-deploy [puppet] - 10https://gerrit.wikimedia.org/r/724405 (https://phabricator.wikimedia.org/T251305) (owner: 10Elukey) [13:46:26] 10SRE, 10Patch-For-Review, 10Performance-Team (Radar), 10SRE Observability (FY2021/2022-Q2): Fully migrate producers off statsd - https://phabricator.wikimedia.org/T205870 (10lmata) [13:47:33] 10SRE, 10ops-codfw, 10DC-Ops, 10SRE Observability (FY2021/2022-Q1): Q1: (Need By: TBD) rack/setup/install centrallog2002.codfw.wmnet - https://phabricator.wikimedia.org/T289624 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['centrallog2002.codfw.wmnet'] ` and were **ALL** successful. [13:50:22] (03CR) 10Herron: "Hey ryankemper, ebernhardson, gehel, looping you in -- This would deploy a shim instance on the elastic hosts to deliver elasticsearch GEL" [puppet] - 10https://gerrit.wikimedia.org/r/721345 (https://phabricator.wikimedia.org/T288620) (owner: 10Herron) [13:51:19] (03PS15) 10Herron: profile::logstash::gelf_relay: ingest GELF logs and output as JSON over UDP [puppet] - 10https://gerrit.wikimedia.org/r/721345 (https://phabricator.wikimedia.org/T288620) [13:51:27] (03CR) 10Jbond: [C: 03+1] webserver-misc-apps.discovery: Add os-reports.w.o [puppet] - 10https://gerrit.wikimedia.org/r/724416 (owner: 10Muehlenhoff) [13:51:37] (03PS8) 10Herron: add logstash gelf relay and enable on one host [puppet] - 10https://gerrit.wikimedia.org/r/721364 [13:52:00] (03PS1) 10Elukey: helmfile.d: move ml-services to the new helm3 deploy user/token config [deployment-charts] - 10https://gerrit.wikimedia.org/r/724421 (https://phabricator.wikimedia.org/T251305) [13:54:56] (03CR) 10Gehel: [C: 03+1] profile::logstash::gelf_relay: ingest GELF logs and output as JSON over UDP (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/721345 
(https://phabricator.wikimedia.org/T288620) (owner: 10Herron) [13:59:10] 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for SCherukuwada - https://phabricator.wikimedia.org/T291948 (10SCherukuwada) [13:59:38] (03CR) 10DCausse: Added spicerack.kafka with offset transfer function (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/723214 (https://phabricator.wikimedia.org/T276469) (owner: 10ZPapierski) [14:00:08] (03PS1) 10DCausse: search-platform: Fix flink app crashloop detection [alerts] - 10https://gerrit.wikimedia.org/r/724423 (https://phabricator.wikimedia.org/T276467) [14:00:53] (03PS2) 10Ottomata: airflow - Add support for configuring connections using LocalFilesystemBackend [puppet] - 10https://gerrit.wikimedia.org/r/698808 (https://phabricator.wikimedia.org/T272973) [14:02:01] (03CR) 10jerkins-bot: [V: 04-1] airflow - Add support for configuring connections using LocalFilesystemBackend [puppet] - 10https://gerrit.wikimedia.org/r/698808 (https://phabricator.wikimedia.org/T272973) (owner: 10Ottomata) [14:03:17] 10SRE, 10ops-codfw, 10DC-Ops, 10SRE Observability (FY2021/2022-Q1): Q1: (Need By: TBD) rack/setup/install centrallog2002.codfw.wmnet - https://phabricator.wikimedia.org/T289624 (10Papaul) [14:03:59] RECOVERY - Disk space on ms-be2035 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ms-be2035&var-datasource=codfw+prometheus/ops [14:04:00] (03CR) 10jerkins-bot: [V: 04-1] search-platform: Fix flink app crashloop detection [alerts] - 10https://gerrit.wikimedia.org/r/724423 (https://phabricator.wikimedia.org/T276467) (owner: 10DCausse) [14:04:02] 10SRE, 10ops-codfw, 10DC-Ops, 10SRE Observability (FY2021/2022-Q1): Q1: (Need By: TBD) rack/setup/install centrallog2002.codfw.wmnet - https://phabricator.wikimedia.org/T289624 (10Papaul) 05Open→03Resolved @lmata this is complete [14:04:16] (03CR) 10Elukey: [C: 03+2] helmfile.d: 
move ml-services to the new helm3 deploy user/token config [deployment-charts] - 10https://gerrit.wikimedia.org/r/724421 (https://phabricator.wikimedia.org/T251305) (owner: 10Elukey) [14:11:55] PROBLEM - SSH on gerrit2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:13:23] (03PS2) 10DCausse: search-platform: Fix flink app crashloop detection [alerts] - 10https://gerrit.wikimedia.org/r/724423 (https://phabricator.wikimedia.org/T276467) [14:16:21] (03PS3) 10DCausse: search-platform: Fix flink app crashloop detection [alerts] - 10https://gerrit.wikimedia.org/r/724423 (https://phabricator.wikimedia.org/T276467) [14:21:48] (03PS1) 10Jelto: profile::gitlab start using gitlab module [puppet] - 10https://gerrit.wikimedia.org/r/724430 (https://phabricator.wikimedia.org/T283076) [14:25:55] (03PS3) 10Ottomata: airflow - Add support for configuring connections using LocalFilesystemBackend [puppet] - 10https://gerrit.wikimedia.org/r/698808 (https://phabricator.wikimedia.org/T272973) [14:26:01] (03PS1) 10Elukey: kubeflow-kfserving-inference: add quote to labels/annotations [deployment-charts] - 10https://gerrit.wikimedia.org/r/724431 [14:26:26] (03CR) 10jerkins-bot: [V: 04-1] airflow - Add support for configuring connections using LocalFilesystemBackend [puppet] - 10https://gerrit.wikimedia.org/r/698808 (https://phabricator.wikimedia.org/T272973) (owner: 10Ottomata) [14:28:13] (03CR) 10Giuseppe Lavagetto: [C: 03+2] service: Switch toolhub to lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/711703 (https://phabricator.wikimedia.org/T280881) (owner: 10Legoktm) [14:28:19] (03PS4) 10Giuseppe Lavagetto: service: Switch toolhub to lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/711703 (https://phabricator.wikimedia.org/T280881) (owner: 10Legoktm) [14:29:31] (03PS1) 10Majavah: aptrepo: fix helm component for k8s 1.20 [puppet] - 10https://gerrit.wikimedia.org/r/724432 
[14:30:23] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] aptrepo: fix helm component for k8s 1.20 [puppet] - 10https://gerrit.wikimedia.org/r/724432 (owner: 10Majavah) [14:31:10] <_joe_> !log restarting pybal on lvs1016 [14:31:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:46] !log add packages for buster-wikimedia|thirdparty/kubeadm-k8s-1-20 (T280402) [14:32:50] <_joe_> !log restarting pybal on lvs2010 [14:32:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:52] T280402: Upgrade Toolforge Kubernetes to latest 1.20 - https://phabricator.wikimedia.org/T280402 [14:32:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:01] !log hnowlan@puppetmaster1001 conftool action : set/pooled=false; selector: dnsdisc=kartotherian,name=codfw [14:33:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:07] (03CR) 10Elukey: [C: 03+2] kubeflow-kfserving-inference: add quote to labels/annotations [deployment-charts] - 10https://gerrit.wikimedia.org/r/724431 (owner: 10Elukey) [14:34:12] <_joe_> !log restarting pybal on lvs1015 [14:34:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:46] (03PS4) 10Ottomata: airflow - Add support for configuring connections using LocalFilesystemBackend [puppet] - 10https://gerrit.wikimedia.org/r/698808 (https://phabricator.wikimedia.org/T272973) [14:36:26] (03CR) 10Ottomata: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31350/console" [puppet] - 10https://gerrit.wikimedia.org/r/698808 (https://phabricator.wikimedia.org/T272973) (owner: 10Ottomata) [14:36:29] <_joe_> !log restarting pybal on lvs2009 [14:36:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:59] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - toolhub_4011: Servers 
kubernetes2015.codfw.wmnet, kubernetes2009.codfw.wmnet, kubernetes2002.codfw.wmnet, kubernetes2014.codfw.wmnet, kubernetes2017.codfw.wmnet, kubernetes2012.codfw.wmnet, kubernetes2016.codfw.wmnet, kubernetes2008.codfw.wmnet, kubernetes2003.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [14:38:36] !log Remove flaggedimages from s5 T290340 [14:38:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:42] T290340: Drop the flaggedimages table from Wikimedia production - https://phabricator.wikimedia.org/T290340 [14:38:52] (03CR) 10Ottomata: [V: 03+1 C: 03+2] airflow - Add support for configuring connections using LocalFilesystemBackend [puppet] - 10https://gerrit.wikimedia.org/r/698808 (https://phabricator.wikimedia.org/T272973) (owner: 10Ottomata) [14:39:20] !log elukey@cumin1001 START - Cookbook sre.hadoop.roll-restart-workers restart workers for Hadoop analytics cluster: Roll restart of jvm daemons for openjdk upgrade. - elukey@cumin1001 [14:39:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:35] (03PS5) 10Ottomata: airflow - Add support for configuring connections using LocalFilesystemBackend [puppet] - 10https://gerrit.wikimedia.org/r/698808 (https://phabricator.wikimedia.org/T272973) [14:39:36] <_joe_> bd808: around? maybe you know why toolhub wasn't deployed to codfw? [14:40:09] (03PS6) 10Ottomata: airflow - Add support for configuring connections using LocalFilesystemBackend [puppet] - 10https://gerrit.wikimedia.org/r/698808 (https://phabricator.wikimedia.org/T272973) [14:40:25] <_joe_> If you're not around, I'll deploy it anyways as else we'd get paged - even if a service is not active, it needs to be up if it's declared in both dc loadbalancers [14:41:01] !log oblivian@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'toolhub' for release 'main' . 
[14:41:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:41:59] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - toolhub_4011: Servers kubernetes2015.codfw.wmnet, kubernetes2009.codfw.wmnet, kubernetes2002.codfw.wmnet, kubernetes2014.codfw.wmnet, kubernetes2017.codfw.wmnet, kubernetes2012.codfw.wmnet, kubernetes2016.codfw.wmnet, kubernetes2008.codfw.wmnet, kubernetes2003.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [14:42:29] (03CR) 10Ottomata: [C: 03+2] airflow - Add support for configuring connections using LocalFilesystemBackend [puppet] - 10https://gerrit.wikimedia.org/r/698808 (https://phabricator.wikimedia.org/T272973) (owner: 10Ottomata) [14:44:07] 10SRE, 10SRE Observability (FY2021/2022-Q3): Tooling for end-of-quarter SLO reporting - https://phabricator.wikimedia.org/T290924 (10lmata) [14:44:22] <_joe_> the pybal backends is me, sadly [14:45:08] (03CR) 10Ppchelko: "I think that's not needed, can be abandoned." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/722845 (owner: 10Effie Mouzeli) [14:46:41] (03PS1) 10Giuseppe Lavagetto: toolhub: temporarily eqiad-only [puppet] - 10https://gerrit.wikimedia.org/r/724442 [14:47:56] (03PS1) 10Gehel: [DNM] quick hack to start discussion [software/spicerack] - 10https://gerrit.wikimedia.org/r/724443 [14:48:04] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31351/console" [puppet] - 10https://gerrit.wikimedia.org/r/724442 (owner: 10Giuseppe Lavagetto) [14:48:19] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] toolhub: temporarily eqiad-only [puppet] - 10https://gerrit.wikimedia.org/r/724442 (owner: 10Giuseppe Lavagetto) [14:49:04] (03CR) 10Muehlenhoff: [C: 03+2] Setup systemd timer to sync OS reports [puppet] - 10https://gerrit.wikimedia.org/r/724081 (owner: 10Muehlenhoff) [14:50:50] (03PS2) 10Gehel: [DNM] quick hack to start discussion [software/spicerack] - 10https://gerrit.wikimedia.org/r/724443 [14:51:31] <_joe_> !log restarting pybals in codfw again [14:51:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:27] RECOVERY - PyBal backends health check on lvs2009 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [14:54:13] PROBLEM - PyBal IPVS diff check on lvs2009 is CRITICAL: CRITICAL: Services in IPVS but unknown to PyBal: set([10.2.1.62:4011]) https://wikitech.wikimedia.org/wiki/PyBal [14:54:49] RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [14:55:26] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10serviceops-radar: SVC DNS zonefiles and source of truth - https://phabricator.wikimedia.org/T270071 (10Volans) [14:56:30] (03PS1) 10Arturo Borrero Gonzalez: openstack: keystone: allow manila-share auth as novaadmin [puppet] - 10https://gerrit.wikimedia.org/r/724445 
(https://phabricator.wikimedia.org/T291257) [14:56:36] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10serviceops-radar: SVC DNS zonefiles and source of truth - https://phabricator.wikimedia.org/T270071 (10Volans) >>! In T270071#7375908, @akosiaris wrote: > I think we first need to recap a bit where we are at and what is still a problem. I think some of th... [14:57:22] (03CR) 10jerkins-bot: [V: 04-1] [DNM] quick hack to start discussion [software/spicerack] - 10https://gerrit.wikimedia.org/r/724443 (owner: 10Gehel) [15:00:21] RECOVERY - PyBal IPVS diff check on lvs2009 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [15:01:34] (03CR) 10Arturo Borrero Gonzalez: "PCC: https://puppet-compiler.wmflabs.org/compiler1003/31352/" [puppet] - 10https://gerrit.wikimedia.org/r/724445 (https://phabricator.wikimedia.org/T291257) (owner: 10Arturo Borrero Gonzalez) [15:03:08] (03CR) 10Andrew Bogott: "Could this be done with a special-purpose service user? Overall I'd like to phase out our use of novaadmin for inter-service communication" [puppet] - 10https://gerrit.wikimedia.org/r/724445 (https://phabricator.wikimedia.org/T291257) (owner: 10Arturo Borrero Gonzalez) [15:05:10] (03PS1) 10Muehlenhoff: Create symlink for latest OS report [puppet] - 10https://gerrit.wikimedia.org/r/724447 [15:06:34] (03PS1) 10Elukey: helmfile.d: add user deploy-kserve [deployment-charts] - 10https://gerrit.wikimedia.org/r/724448 (https://phabricator.wikimedia.org/T286791) [15:07:14] (03CR) 10Ottomata: "Hm, everything looks right, but this isn't quite working and am not sure why. Asking in Airflow slack for help." 
[puppet] - 10https://gerrit.wikimedia.org/r/698808 (https://phabricator.wikimedia.org/T272973) (owner: 10Ottomata) [15:07:19] (03CR) 10Alexandros Kosiaris: [C: 03+2] docker: add security updates to Bullseye base image [puppet] - 10https://gerrit.wikimedia.org/r/720241 (owner: 10Hashar) [15:10:21] (03PS2) 10Elukey: helmfile.d: add user deploy-kserve [deployment-charts] - 10https://gerrit.wikimedia.org/r/724448 (https://phabricator.wikimedia.org/T286791) [15:10:55] (03CR) 10Giuseppe Lavagetto: [C: 04-2] "toolhub isn't working in codfw, let's fix that before we go any further down the patch sequence." [puppet] - 10https://gerrit.wikimedia.org/r/711704 (https://phabricator.wikimedia.org/T280881) (owner: 10Legoktm) [15:10:55] _joe_: Mostly I just had not done a codfw deploy I suppose. No multi-master database solution yet, so we really can't serve from both at once (manual db fail over needed), but having the pods up in codfw shouldn't hurt anything. [15:11:16] <_joe_> bd808: they are going in crashloopbackoff though [15:11:47] <_joe_> yes the service needs to be up in both DCs, even if not serving read-write traffic in codfw, like mediawiki and every other active/passive service [15:12:41] <_joe_> so before we can proceed, that needs to be fixed [15:12:51] So we do have the DB in codfw, but they are RO [15:12:55] RECOVERY - SSH on gerrit2001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:13:15] <_joe_> marostegui: yeah one possibility is that the service tries to write to the db on startup [15:13:21] <_joe_> and crashes [15:13:24] _joe_: *nod* I'm just logging into the deploy server to see the crash logs now [15:13:24] it will fail yeah [15:13:25] <_joe_> but I found no logs [15:16:10] <_joe_> oh wait [15:16:15] <_joe_> what's crashing is toolhub-main-tls-proxy: [15:16:20] I see `'containers with unready status: [toolhub-main-tls-proxy]'` in the `kubectl get po
toolhub-main-99cc49c95-5l99j -o yaml` output [15:16:44] <_joe_> [2021-09-28 15:15:38.029][1][critical][main] [source/server/server.cc:101] error initializing configuration '/etc/envoy/envoy.yaml': Proto constraint validation failed (BootstrapValidationError.StaticResources: ["embedded message failed validation"] | caused by StaticResourcesValidationError.Listeners[i]: ["embedded message failed validation"] | caused by ListenerValidationError.Address: [15:16:46] <_joe_> ["embedded message failed validation"] | caused by AddressValidationError.SocketAddress: ["embedded message failed validation"] | caused by field: "port_specifier", reason: is required): static_resources { [15:17:22] <_joe_> soo, what's happened here that allowed an invalid envoy configuration? [15:18:25] Toolhub's values-codfw.yaml doesn't set any values for envoy things. What is my chart pulling from the shared config... [15:18:32] <_joe_> ut dies [15:18:35] <_joe_> err [15:18:38] <_joe_> it does [15:18:44] <_joe_> found the issue, fixing it [15:20:00] (03PS1) 10Giuseppe Lavagetto: toohub: do not overwrite listeners [deployment-charts] - 10https://gerrit.wikimedia.org/r/724450 [15:20:40] <_joe_> bd808: ^^ [15:20:54] <_joe_> "search-https-codfw" is not a valid listener [15:20:59] _joe_: yeah. nice spotting. 
That was old crap in the codfw file [15:21:13] <_joe_> well at least we can proceed [15:21:19] (03CR) 10Giuseppe Lavagetto: [C: 03+2] toohub: do not overwrite listeners [deployment-charts] - 10https://gerrit.wikimedia.org/r/724450 (owner: 10Giuseppe Lavagetto) [15:23:32] (03PS1) 10Ottomata: CommonSettings-labs.php - test Eventbus x_client_ip_forwarding_enabled in beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/724451 (https://phabricator.wikimedia.org/T288853) [15:23:42] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for SCherukuwada - https://phabricator.wikimedia.org/T291948 (10Joe) a:03Joe [15:23:52] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for SCherukuwada - https://phabricator.wikimedia.org/T291948 (10Joe) p:05Triage→03Medium [15:26:06] (03CR) 10Ottomata: [C: 03+2] CommonSettings-labs.php - test Eventbus x_client_ip_forwarding_enabled in beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/724451 (https://phabricator.wikimedia.org/T288853) (owner: 10Ottomata) [15:30:12] (03CR) 10Jgiannelos: "This change is ready for review." [deployment-charts] - 10https://gerrit.wikimedia.org/r/724127 (owner: 10Jgiannelos) [15:30:38] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for SCherukuwada - https://phabricator.wikimedia.org/T291948 (10Joe) Hi @SCherukuwada, and welcome! Indeed if you just need to access superset, we don't need to define any ssh keys for your access. If that's the case, please confirm... [15:31:27] !log oblivian@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'toolhub' for release 'main' . [15:31:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:31:40] (03CR) 10MSantos: [C: 03+1] "LGTM. Minor nit: If you could add a reason on the commit message or in the code. After that you can merge it." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/724127 (owner: 10Jgiannelos) [15:32:04] (03PS2) 10Jgiannelos: tegola-vector-tiles: Exclude master nodes from postgres proxy [deployment-charts] - 10https://gerrit.wikimedia.org/r/724127 [15:33:49] 10SRE, 10envoy, 10serviceops: The TLS proxy configuration in deployment-charts allows invalid listeners - https://phabricator.wikimedia.org/T291959 (10Joe) [15:33:53] (03PS1) 10Volans: sre.experimental.reimage: fix --new behaviour [cookbooks] - 10https://gerrit.wikimedia.org/r/724453 [15:34:03] 10SRE, 10envoy, 10serviceops: The TLS proxy configuration in deployment-charts allows invalid listeners - https://phabricator.wikimedia.org/T291959 (10Joe) p:05Triage→03High [15:34:25] <_joe_> bd808: ok toolhub now works in codfw, I also created a task to make the charts fail explicitly in case of such an error [15:34:29] <_joe_> so that CI would reject it [15:36:05] (03PS1) 10Giuseppe Lavagetto: Revert "toolhub: temporarily eqiad-only" [puppet] - 10https://gerrit.wikimedia.org/r/724377 [15:36:24] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] Revert "toolhub: temporarily eqiad-only" [puppet] - 10https://gerrit.wikimedia.org/r/724377 (owner: 10Giuseppe Lavagetto) [15:36:30] (03CR) 10jerkins-bot: [V: 04-1] sre.experimental.reimage: fix --new behaviour [cookbooks] - 10https://gerrit.wikimedia.org/r/724453 (owner: 10Volans) [15:36:51] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for SCherukuwada - https://phabricator.wikimedia.org/T291948 (10SCherukuwada) Confirmation: As it stands now, I only need access to superset. I acknowledge that I've read https://wikitech.wikimedia.org/wiki/Analytics/Data_access#Us... [15:37:07] PROBLEM - Host clouddb1020 is DOWN: PING CRITICAL - Packet loss = 100% [15:38:24] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[15:38:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:38:30] PROBLEM - haproxy failover on dbproxy1019 is CRITICAL: CRITICAL check_failover servers up 14 down 2 https://wikitech.wikimedia.org/wiki/HAProxy [15:38:32] (03CR) 10MSantos: [C: 03+2] tegola-vector-tiles: Exclude master nodes from postgres proxy [deployment-charts] - 10https://gerrit.wikimedia.org/r/724127 (owner: 10Jgiannelos) [15:39:06] (03CR) 10Ssingh: haproxy: Basic TLS terminator based on HAProxy (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/715932 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [15:39:36] (03PS1) 10Jgiannelos: tegola-vector-tiles: Use a different cache basepath for each env [deployment-charts] - 10https://gerrit.wikimedia.org/r/724454 [15:39:41] <_joe_> !log restarting pybal on lvs2010 [15:39:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:40:46] (03PS2) 10Volans: sre.experimental.reimage: fix --new behaviour [cookbooks] - 10https://gerrit.wikimedia.org/r/724453 [15:41:31] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [15:41:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:43:23] (03Merged) 10jenkins-bot: tegola-vector-tiles: Exclude master nodes from postgres proxy [deployment-charts] - 10https://gerrit.wikimedia.org/r/724127 (owner: 10Jgiannelos) [15:44:06] (03PS2) 10MSantos: tegola-vector-tiles: Use a different cache basepath for each env [deployment-charts] - 10https://gerrit.wikimedia.org/r/724454 (owner: 10Jgiannelos) [15:44:10] PROBLEM - haproxy failover on dbproxy1018 is CRITICAL: CRITICAL check_failover servers up 14 down 2 https://wikitech.wikimedia.org/wiki/HAProxy [15:44:25] uh? 
[15:44:28] checking [15:44:43] (03CR) 10Papaul: [C: 03+1] sre.experimental.reimage: fix --new behaviour [cookbooks] - 10https://gerrit.wikimedia.org/r/724453 (owner: 10Volans) [15:45:08] bstorm: clouddb1020 has crashed [15:45:18] bd808: ^ (not sure if I should ping someone else) [15:45:20] Was just looking at that [15:45:29] Ah ok :) [15:45:31] Thank you [15:45:42] I think a.rturo was already looking, based on -cloud-admin too [15:45:47] I'll redirect the traffic [15:45:59] (03CR) 10Volans: [C: 03+2] sre.experimental.reimage: fix --new behaviour [cookbooks] - 10https://gerrit.wikimedia.org/r/724453 (owner: 10Volans) [15:46:26] huh? I tested connecting s5,s8.analytics and both seemed up to me [15:46:36] majavah: we have two hosts per section [15:46:49] looks like a thermal event [15:46:56] bstorm: :-/ [15:47:12] as in a volcano eruption? [15:47:22] why do we need to manually switch anything over then? [15:47:23] (03CR) 10Cwhite: [C: 03+1] profile::logstash::gelf_relay: ingest GELF logs and output as JSON over UDP (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/721345 (https://phabricator.wikimedia.org/T288620) (owner: 10Herron) [15:47:29] as in crappy paste on the cpu most likely [15:48:00] 10Puppet, 10Infrastructure-Foundations: investigate how rspec parses define parameters - https://phabricator.wikimedia.org/T291374 (10Aklapper) [15:48:30] Should I create a task for this or WMCS will? [15:49:15] (03Merged) 10jenkins-bot: sre.experimental.reimage: fix --new behaviour [cookbooks] - 10https://gerrit.wikimedia.org/r/724453 (owner: 10Volans) [15:49:57] (03CR) 10Giuseppe Lavagetto: [C: 03+2] "The service is up in both datacenters and responds to requests to the /healthz endpoint correctly." 
[puppet] - 10https://gerrit.wikimedia.org/r/711704 (https://phabricator.wikimedia.org/T280881) (owner: 10Legoktm) [15:50:06] (03PS4) 10Giuseppe Lavagetto: service: Switch toolhub to monitoring_setup [puppet] - 10https://gerrit.wikimedia.org/r/711704 (https://phabricator.wikimedia.org/T280881) (owner: 10Legoktm) [15:50:31] (03CR) 10MSantos: [C: 03+2] tegola-vector-tiles: Use a different cache basepath for each env [deployment-charts] - 10https://gerrit.wikimedia.org/r/724454 (owner: 10Jgiannelos) [15:51:21] (03CR) 10Ssingh: haproxy: Allow configuring TLS options (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/716000 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [15:52:08] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:52:24] (03PS1) 10Bstorm: wikireplicas: depool clouddb1020 [puppet] - 10https://gerrit.wikimedia.org/r/724457 (https://phabricator.wikimedia.org/T291961) [15:53:24] !log pt1979@cumin2002 START - Cookbook sre.experimental.reimage for host mw2412.codfw.wmnet [15:53:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:53:29] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:(Need By: TBD) rack/setup/install mw241[2-9].codfw.wmnet - https://phabricator.wikimedia.org/T290192 (10ops-monitoring-bot) Cookbook cookbooks.sre.experimental.reimage was started by pt1979@cumin2002 for host mw2412.codfw.wmnet [15:54:12] (03CR) 10Bstorm: [C: 03+2] wikireplicas: depool clouddb1020 [puppet] - 10https://gerrit.wikimedia.org/r/724457 (https://phabricator.wikimedia.org/T291961) (owner: 10Bstorm) [15:54:48] 10SRE, 10Analytics, 10Analytics-Kanban, 10Data-Engineering, and 6 others: Migrated Server-side EventLogging events recording http.client_ip as 127.0.0.1 - https://phabricator.wikimedia.org/T288853 (10Ottomata) Tested in beta, I think this is working now. 
[15:54:48] (03Merged) 10jenkins-bot: tegola-vector-tiles: Use a different cache basepath for each env [deployment-charts] - 10https://gerrit.wikimedia.org/r/724454 (owner: 10Jgiannelos) [15:55:16] (03PS1) 10KartikMistry: Enable SectionTranslation in Igbo, Hausa, Yoruba Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/724458 (https://phabricator.wikimedia.org/T290175) [15:56:26] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for SCherukuwada - https://phabricator.wikimedia.org/T291948 (10Ottomata) Approved. SCherukuwada will need analytics-privatedata-users group membership and wmf LDAP membership (if they don't already have it), but no ssh key. [15:56:54] ACKNOWLEDGEMENT - SSH on clouddb1020 is CRITICAL: CRITICAL - Socket timeout after 10 seconds Bstorm Opened T291961 https://wikitech.wikimedia.org/wiki/SSH/monitoring [15:56:55] ACKNOWLEDGEMENT - Host clouddb1020 is DOWN: PING CRITICAL - Packet loss = 100% Bstorm Opened T291961 [15:58:25] RECOVERY - haproxy failover on dbproxy1018 is OK: OK check_failover servers up 14 down 0 https://wikitech.wikimedia.org/wiki/HAProxy [15:58:39] (03PS1) 10BryanDavis: toolhub: bump container version to 2021-09-27-221441-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/724459 [15:59:10] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for SCherukuwada - https://phabricator.wikimedia.org/T291948 (10dr0ptp4kt) Approved [16:00:05] jbond and rzl: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210928T1600). [16:00:05] No Gerrit patches in the queue for this window AFAICS. 
[16:02:23] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: Q1:(Need By: TBD) rack/setup/install cloudswift100[12] - https://phabricator.wikimedia.org/T289882 (10aborrero) a:05nskaggs→03aborrero We will need 2 NICs connected on these servers: * primary NIC, with a public IPv4 address, `cl... [16:05:09] (03CR) 10BryanDavis: [C: 03+2] toolhub: bump container version to 2021-09-27-221441-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/724459 (owner: 10BryanDavis) [16:05:20] (03PS4) 10Giuseppe Lavagetto: service: Switch toolhub to production [puppet] - 10https://gerrit.wikimedia.org/r/711705 (https://phabricator.wikimedia.org/T280881) (owner: 10Legoktm) [16:06:57] (03CR) 10Giuseppe Lavagetto: [C: 03+2] service: Switch toolhub to production [puppet] - 10https://gerrit.wikimedia.org/r/711705 (https://phabricator.wikimedia.org/T280881) (owner: 10Legoktm) [16:07:33] !log jgiannelos@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'tegola-vector-tiles' for release 'main' . [16:07:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:08:06] 10SRE, 10Analytics, 10Analytics-Kanban, 10Data-Engineering, and 6 others: Migrated Server-side EventLogging events recording http.client_ip as 127.0.0.1 - https://phabricator.wikimedia.org/T288853 (10Ottomata) a:05Mholloway→03Ottomata [16:09:40] (03Merged) 10jenkins-bot: toolhub: bump container version to 2021-09-27-221441-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/724459 (owner: 10BryanDavis) [16:09:52] !log jgiannelos@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'tegola-vector-tiles' for release 'main' . [16:09:52] (03CR) 10DCausse: "thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/724354 (https://phabricator.wikimedia.org/T289662) (owner: 10Filippo Giunchedi) [16:09:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:10:44] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.experimental.reimage (exit_code=99) for host mw2412.codfw.wmnet [16:10:47] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:(Need By: TBD) rack/setup/install mw241[2-9].codfw.wmnet - https://phabricator.wikimedia.org/T290192 (10ops-monitoring-bot) Cookbook cookbooks.sre.experimental.reimage executed with errors: - mw2412 (**FAIL**) - Forced PXE for next reboot - Host rebooted v... [16:10:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:12:16] !log jgiannelos@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'tegola-vector-tiles' for release 'main' . [16:12:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:12:25] (03PS12) 10ZPapierski: Added spicerack.kafka with offset transfer function [software/spicerack] - 10https://gerrit.wikimedia.org/r/723214 (https://phabricator.wikimedia.org/T276469) [16:13:30] 10ops-eqiad, 10DC-Ops, 10Data-Services, 10cloud-services-team (Hardware): hw troubleshooting: crash (with thermal event) for clouddb1020.eqiad.wmnet - https://phabricator.wikimedia.org/T291963 (10Bstorm) [16:13:55] !log bd808@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'toolhub' for release 'main' . [16:13:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:14:05] 10ops-eqiad, 10DC-Ops, 10Data-Services, 10cloud-services-team (Hardware): hw troubleshooting: crash (with thermal event) for clouddb1020.eqiad.wmnet - https://phabricator.wikimedia.org/T291963 (10Bstorm) [16:14:46] Lucas_WMDE: Sorry I missed your message! Did you merge already? 
[16:15:02] nope, didn’t do anything new with that [16:15:06] (I left a comment on the train blocker task) [16:15:17] I’m currently in a meeting, but afterwards I might be able to merge if it’s a good time [16:16:54] !log bd808@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'toolhub' for release 'main' . [16:16:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:17:35] 10ops-eqiad, 10DC-Ops, 10Data-Services, 10cloud-services-team (Hardware): hw troubleshooting: crash (with thermal event) for clouddb1020.eqiad.wmnet - https://phabricator.wikimedia.org/T291963 (10Bstorm) This does not seem related to T289159 as it is a different rack, but you never know. [16:17:42] okay, if not there's always the next backport window [16:17:51] yup [16:18:08] (03CR) 10jerkins-bot: [V: 04-1] Added spicerack.kafka with offset transfer function [software/spicerack] - 10https://gerrit.wikimedia.org/r/723214 (https://phabricator.wikimedia.org/T276469) (owner: 10ZPapierski) [16:18:15] it’s not a catastrophe if wmf.2 rolls out without the backport, especially on group0 [16:18:42] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "As this service is active-passive, it uses metafo resources." 
[dns] - 10https://gerrit.wikimedia.org/r/711727 (https://phabricator.wikimedia.org/T280881) (owner: 10Legoktm) [16:19:08] * bd808 does not love the lack of progress feedback from `helmfile apply` [16:19:21] !log mbsantos@deploy1002 Started deploy [kartotherian/deploy@f35571e] (eqiad): tegola: mirror kartotherian/eqiad traffic to codfw/tegola [16:19:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:19:28] (03PS1) 10Volans: sre.experimental.reimage: better check of OS [cookbooks] - 10https://gerrit.wikimedia.org/r/724461 [16:19:39] !log mbsantos@deploy1002 Finished deploy [kartotherian/deploy@f35571e] (eqiad): tegola: mirror kartotherian/eqiad traffic to codfw/tegola (duration: 00m 18s) [16:19:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:21:20] 10SRE, 10serviceops: restart-php7.2-fpm attempts to run as non-root but can’t actually restart service, leaving instance depooled - https://phabricator.wikimedia.org/T291921 (10Joe) p:05Triage→03Medium Lucas is correct, but I think the best fix is to avoid needing the `-i` in the sudo process there, but gi... [16:21:23] bd808: yes me too... I do: "watch kubectl get pods" in a separate pane/window [16:21:46] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [16:21:50] dcausse: *nod* I've been doing similar things [16:21:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:23:26] so, I've tried to deploy with scap to a specific environment but it didn't apply the proper targets https://gerrit.wikimedia.org/r/c/maps/kartotherian/deploy/+/724460 [16:23:47] Does anyone know why it failed?
The command for deployment was: scap deploy --environment eqiad `git log --pretty=format:'%s' -n 1` [16:23:53] (03PS1) 10BryanDavis: toolhub: disable crawler cron in codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/724462 (https://phabricator.wikimedia.org/T288685) [16:24:13] (03PS4) 10Giuseppe Lavagetto: service::catalog: remove ProxyFetch checks from services on k8s [puppet] - 10https://gerrit.wikimedia.org/r/722278 [16:24:21] (03CR) 10Muehlenhoff: [C: 03+2] webserver-misc-apps.discovery: Add os-reports.w.o [puppet] - 10https://gerrit.wikimedia.org/r/724416 (owner: 10Muehlenhoff) [16:25:20] 10ops-eqiad, 10DC-Ops, 10Data-Services, 10cloud-services-team (Hardware): hw troubleshooting: crash (with thermal event) for clouddb1020.eqiad.wmnet - https://phabricator.wikimedia.org/T291963 (10Bstorm) That's a big nope from the server on restarting via console. It has a processor reporting bad voltage a... [16:26:18] 10ops-eqiad, 10DC-Ops, 10Data-Services, 10cloud-services-team (Hardware): hw troubleshooting: crash (with thermal event) for clouddb1020.eqiad.wmnet - https://phabricator.wikimedia.org/T291963 (10Bstorm) [16:26:25] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:26:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:26:45] 10ops-eqiad, 10DC-Ops, 10Data-Services, 10cloud-services-team (Hardware): hw troubleshooting: crash (with thermal event) for clouddb1020.eqiad.wmnet - https://phabricator.wikimedia.org/T291963 (10Bstorm) [16:26:49] 10SRE, 10ops-codfw, 10decommission-hardware, 10Platform Team Workboards (Platform Engineering Reliability): decommission maps2001.codfw.wmnet, maps2002.codfw.wmnet, maps2003.codfw.wmnet, maps2004.codfw.wmnet - https://phabricator.wikimedia.org/T290588 (10Papaul) [16:27:05] !log bd808@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'toolhub' for release 'main' . 
[16:27:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:27:11] 10SRE, 10ops-codfw, 10decommission-hardware, 10Platform Team Workboards (Platform Engineering Reliability): decommission maps2001.codfw.wmnet, maps2002.codfw.wmnet, maps2003.codfw.wmnet, maps2004.codfw.wmnet - https://phabricator.wikimedia.org/T290588 (10Papaul) 05Open→03Resolved complete [16:28:40] !log mbsantos@deploy1002 Started deploy [kartotherian/deploy@3e52e0a]: tegola: use global config var for load tests [16:28:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:28:54] !log mbsantos@deploy1002 Finished deploy [kartotherian/deploy@3e52e0a]: tegola: use global config var for load tests (duration: 00m 14s) [16:28:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:29:25] (03PS1) 10PipelineBot: rdf-streaming-updater: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/724464 [16:30:53] 10Puppet, 10Cloud-VPS, 10Infrastructure-Foundations, 10Patch-For-Review, and 2 others: Audit usages or the realm variable with a view to drop it - https://phabricator.wikimedia.org/T289661 (10dcaro) [16:33:53] (03CR) 10BryanDavis: [C: 03+2] toolhub: disable crawler cron in codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/724462 (https://phabricator.wikimedia.org/T288685) (owner: 10BryanDavis) [16:33:55] (03CR) 10Muehlenhoff: [C: 03+2] Create symlink for latest OS report [puppet] - 10https://gerrit.wikimedia.org/r/724447 (owner: 10Muehlenhoff) [16:35:09] (03CR) 10Cwhite: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/724355 (https://phabricator.wikimedia.org/T289662) (owner: 10Filippo Giunchedi) [16:35:54] (03CR) 10Cwhite: [C: 03+1] prometheus: add instance-specific alerts path [puppet] - 10https://gerrit.wikimedia.org/r/724353 (https://phabricator.wikimedia.org/T289662) (owner: 10Filippo Giunchedi) [16:37:59] (03Merged) 10jenkins-bot: toolhub: disable crawler cron in codfw 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/724462 (https://phabricator.wikimedia.org/T288685) (owner: 10BryanDavis) [16:38:34] !log elukey@cumin1001 END (PASS) - Cookbook sre.hadoop.roll-restart-workers (exit_code=0) restart workers for Hadoop analytics cluster: Roll restart of jvm daemons for openjdk upgrade. - elukey@cumin1001 [16:38:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:39:07] !log bd808@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'toolhub' for release 'main' . [16:39:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:41:06] (03PS1) 10Jgiannelos: tegola-vector-tiles: Fix race condition for DB connections [deployment-charts] - 10https://gerrit.wikimedia.org/r/724467 [16:41:45] (03PS1) 10Volans: sre.experimental.reimage: increase PuppetDB polls [cookbooks] - 10https://gerrit.wikimedia.org/r/724468 [16:42:05] (03PS2) 10Jgiannelos: tegola-vector-tiles: Fix race condition for DB connections [deployment-charts] - 10https://gerrit.wikimedia.org/r/724467 [16:45:43] (03PS2) 10Volans: sre.experimental.reimage: increase PuppetDB polls [cookbooks] - 10https://gerrit.wikimedia.org/r/724468 [16:46:02] (03CR) 10Volans: [C: 03+2] "trivial increase of retries, self-merging" [cookbooks] - 10https://gerrit.wikimedia.org/r/724468 (owner: 10Volans) [16:46:10] !log pt1979@cumin2002 START - Cookbook sre.experimental.reimage for host mw2412.codfw.wmnet [16:46:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:46:16] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:(Need By: TBD) rack/setup/install mw241[2-9].codfw.wmnet - https://phabricator.wikimedia.org/T290192 (10ops-monitoring-bot) Cookbook cookbooks.sre.experimental.reimage was started by pt1979@cumin2002 for host mw2412.codfw.wmnet [16:49:36] (03Merged) 10jenkins-bot: sre.experimental.reimage: increase PuppetDB polls [cookbooks] - 10https://gerrit.wikimedia.org/r/724468 (owner: 10Volans) 
[16:49:56] (03PS2) 10Volans: sre.experimental.reimage: better check of OS [cookbooks] - 10https://gerrit.wikimedia.org/r/724461 [16:50:52] (03CR) 10Volans: sre.experimental.reimage: better check of OS (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/724461 (owner: 10Volans) [16:54:19] (03PS3) 10Jgiannelos: tegola-vector-tiles: Fix race condition for DB connections [deployment-charts] - 10https://gerrit.wikimedia.org/r/724467 [16:59:02] 10SRE, 10Commons, 10MediaWiki-File-management, 10StructuredDataOnCommons, and 2 others: Frequent "Error: 429, Too Many Requests" errors on pages with many (>50) thumbnails - https://phabricator.wikimedia.org/T266155 (10Aklapper) a:05Gilles→03None Resetting inactive assignee account [16:59:06] (03CR) 10MSantos: [C: 03+2] tegola-vector-tiles: Fix race condition for DB connections [deployment-charts] - 10https://gerrit.wikimedia.org/r/724467 (owner: 10Jgiannelos) [16:59:18] (03PS1) 10Muehlenhoff: Create landing page for invidual OS overviews [puppet] - 10https://gerrit.wikimedia.org/r/724470 [17:00:05] chrisalbon and accraze: #bothumor My software never has bugs. It just develops random features. Rise for Services – Graphoid / ORES. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210928T1700). [17:00:29] !log pt1979@cumin2002 END (PASS) - Cookbook sre.experimental.reimage (exit_code=0) for host mw2412.codfw.wmnet [17:00:33] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:(Need By: TBD) rack/setup/install mw241[2-9].codfw.wmnet - https://phabricator.wikimedia.org/T290192 (10ops-monitoring-bot) Cookbook cookbooks.sre.experimental.reimage completed: - mw2412 (**WARN**) - Downtimed on Icinga - //Unable to disable Puppet, the h... 
[17:00:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:03:11] (03Merged) 10jenkins-bot: tegola-vector-tiles: Fix race condition for DB connections [deployment-charts] - 10https://gerrit.wikimedia.org/r/724467 (owner: 10Jgiannelos) [17:04:26] !log pt1979@cumin2002 START - Cookbook sre.experimental.reimage for host mw2413.codfw.wmnet [17:04:26] !log jgiannelos@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'tegola-vector-tiles' for release 'main' . [17:04:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:04:32] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:(Need By: TBD) rack/setup/install mw241[2-9].codfw.wmnet - https://phabricator.wikimedia.org/T290192 (10ops-monitoring-bot) Cookbook cookbooks.sre.experimental.reimage was started by pt1979@cumin2002 for host mw2413.codfw.wmnet [17:04:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:05:07] !log jgiannelos@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'tegola-vector-tiles' for release 'main' . 
[17:05:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:06:14] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:(Need By: TBD) rack/setup/install mw241[2-9].codfw.wmnet - https://phabricator.wikimedia.org/T290192 (10Papaul) [17:09:21] !log mbsantos@deploy1002 Started deploy [kartotherian/deploy@3e52e0a]: tegola: use global config var for load tests [17:09:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:09:32] !log mbsantos@deploy1002 Finished deploy [kartotherian/deploy@3e52e0a]: tegola: use global config var for load tests (duration: 00m 11s) [17:09:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:13:50] !log mbsantos@deploy1002 Started deploy [kartotherian/deploy@1f90e6f]: tegola: hard code threshold because deployment fails [17:13:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:14:07] !log mbsantos@deploy1002 Finished deploy [kartotherian/deploy@1f90e6f]: tegola: hard code threshold because deployment fails (duration: 00m 18s) [17:14:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:18:41] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:19:58] (03PS1) 10Brennen Bearnes: gitlab-runner: restrict allowed images and services [puppet] - 10https://gerrit.wikimedia.org/r/724472 (https://phabricator.wikimedia.org/T291978) [17:20:26] (03PS1) 10Papaul: Add thumbor200[56] to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/724473 (https://phabricator.wikimedia.org/T290190) [17:20:58] (03CR) 10jerkins-bot: [V: 04-1] Add thumbor200[56] to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/724473 (https://phabricator.wikimedia.org/T290190) (owner: 10Papaul) [17:21:27] (03CR) 10jerkins-bot: [V: 04-1] gitlab-runner: restrict allowed images and services [puppet] - 
10https://gerrit.wikimedia.org/r/724472 (https://phabricator.wikimedia.org/T291978) (owner: 10Brennen Bearnes) [17:22:30] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to me, the presence of /target should be totally reliable" [cookbooks] - 10https://gerrit.wikimedia.org/r/724461 (owner: 10Volans) [17:23:05] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:23:59] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.experimental.reimage (exit_code=99) for host mw2413.codfw.wmnet [17:24:03] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:(Need By: TBD) rack/setup/install mw241[2-9].codfw.wmnet - https://phabricator.wikimedia.org/T290192 (10ops-monitoring-bot) Cookbook cookbooks.sre.experimental.reimage executed with errors: - mw2413 (**FAIL**) - Forced PXE for next reboot - Host rebooted v... [17:24:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:24:29] !log mbsantos@deploy1002 Started deploy [kartotherian/deploy@35b9174]: tegola: remove mirror_threshold variable because of parsing errors [17:24:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:24:53] !log mbsantos@deploy1002 Finished deploy [kartotherian/deploy@35b9174]: tegola: remove mirror_threshold variable because of parsing errors (duration: 00m 24s) [17:24:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:25:10] (03CR) 10Ahmon Dancy: gitlab-runner: restrict allowed images and services (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/724472 (https://phabricator.wikimedia.org/T291978) (owner: 10Brennen Bearnes) [17:26:25] !log mbsantos@deploy1002 Started deploy [kartotherian/deploy@0a38bc5]: tegola: use eqiad discovery endpoin [17:26:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:28:44] (03PS2) 10Papaul: Add thumbor200[56] 
to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/724473 (https://phabricator.wikimedia.org/T290190) [17:29:08] !log mbsantos@deploy1002 Finished deploy [kartotherian/deploy@0a38bc5]: tegola: use eqiad discovery endpoin (duration: 02m 43s) [17:29:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:29:24] (03CR) 10jerkins-bot: [V: 04-1] Add thumbor200[56] to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/724473 (https://phabricator.wikimedia.org/T290190) (owner: 10Papaul) [17:31:13] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:(Need By: TBD) rack/setup/install mw241[2-9].codfw.wmnet - https://phabricator.wikimedia.org/T290192 (10Papaul) @Volans mw2413 failed with the same error [17:32:29] !log jgiannelos@deploy1002 Started deploy [kartotherian/deploy@0a38bc5]: tegola: use eqiad discovery endpoin [17:32:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:32:35] !log jgiannelos@deploy1002 Finished deploy [kartotherian/deploy@0a38bc5]: tegola: use eqiad discovery endpoin (duration: 00m 06s) [17:32:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:34:39] (03PS3) 10Papaul: Add thumbor200[56] to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/724473 (https://phabricator.wikimedia.org/T290190) [17:35:39] !log jgiannelos@deploy1002 Started deploy [kartotherian/deploy@0a38bc5]: tegola: use eqiad discovery endpoin [17:35:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:35:45] !log pt1979@cumin2002 START - Cookbook sre.experimental.reimage for host mw2413.codfw.wmnet [17:35:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:35:50] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:(Need By: TBD) rack/setup/install mw241[2-9].codfw.wmnet - https://phabricator.wikimedia.org/T290192 (10ops-monitoring-bot) Cookbook cookbooks.sre.experimental.reimage was started by pt1979@cumin2002 for host mw2413.codfw.wmnet 
[17:35:56] (03CR) 10Papaul: [C: 03+2] Add thumbor200[56] to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/724473 (https://phabricator.wikimedia.org/T290190) (owner: 10Papaul) [17:35:57] !log jgiannelos@deploy1002 Finished deploy [kartotherian/deploy@0a38bc5]: tegola: use eqiad discovery endpoin (duration: 00m 17s) [17:36:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:36:47] !log jgiannelos@deploy1002 Started deploy [kartotherian/deploy@0a38bc5]: tegola: use eqiad discovery endpoin [17:36:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:36:58] !log jgiannelos@deploy1002 Finished deploy [kartotherian/deploy@0a38bc5]: tegola: use eqiad discovery endpoin (duration: 00m 11s) [17:37:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:43:55] (03PS1) 10Cmjohnson: Adding dhcpd file and site.pp for new puppetmaster servers [puppet] - 10https://gerrit.wikimedia.org/r/724478 (https://phabricator.wikimedia.org/T291963) [17:44:15] !log otto@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'production' . [17:44:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:44:34] (03PS2) 10Cmjohnson: Adding dhcpd file and site.pp for new puppetmaster servers [puppet] - 10https://gerrit.wikimedia.org/r/724478 (https://phabricator.wikimedia.org/T291963) [17:44:57] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [17:45:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:46:14] (03CR) 10Cmjohnson: [C: 03+2] Adding dhcpd file and site.pp for new puppetmaster servers [puppet] - 10https://gerrit.wikimedia.org/r/724478 (https://phabricator.wikimedia.org/T291963) (owner: 10Cmjohnson) [17:46:24] !log otto@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'production' . 
[17:46:24] !log otto@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'canary' . [17:46:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:46:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:46:36] (03PS1) 10Bartosz Dziewoński: Fix almost all errors codes being logged as `http-0` [extensions/DiscussionTools] (wmf/1.38.0-wmf.1) - 10https://gerrit.wikimedia.org/r/724378 (https://phabricator.wikimedia.org/T290514) [17:47:05] (03PS1) 10Bartosz Dziewoński: Fix almost all errors codes being logged as `http-0` [extensions/DiscussionTools] (wmf/1.38.0-wmf.2) - 10https://gerrit.wikimedia.org/r/724379 (https://phabricator.wikimedia.org/T290514) [17:48:35] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:48:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:49:42] (03Abandoned) 10Zabe: Fix erroneous en-gb translations [core] (wmf/1.38.0-wmf.1) - 10https://gerrit.wikimedia.org/r/723729 (https://phabricator.wikimedia.org/T291717) (owner: 10Zabe) [17:50:42] !log pt1979@cumin2002 END (PASS) - Cookbook sre.experimental.reimage (exit_code=0) for host mw2413.codfw.wmnet [17:50:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:50:47] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:(Need By: TBD) rack/setup/install mw241[2-9].codfw.wmnet - https://phabricator.wikimedia.org/T290192 (10ops-monitoring-bot) Cookbook cookbooks.sre.experimental.reimage completed: - mw2413 (**WARN**) - Downtimed on Icinga - //Unable to disable Puppet, the h... 
[17:52:22] (03PS1) 10Ottomata: Guard against undefined index notice when setting x-client-ip [extensions/EventBus] (wmf/1.38.0-wmf.2) - 10https://gerrit.wikimedia.org/r/724480 (https://phabricator.wikimedia.org/T288853) [17:52:36] (03PS1) 10Ottomata: Guard against undefined index notice when setting x-client-ip [extensions/EventBus] (wmf/1.38.0-wmf.1) - 10https://gerrit.wikimedia.org/r/724481 (https://phabricator.wikimedia.org/T288853) [17:54:26] (03PS1) 10Jeena Huneidi: testwikis wikis to 1.38.0-wmf.2 refs T281166 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/724482 [17:54:28] (03CR) 10Jeena Huneidi: [C: 03+2] testwikis wikis to 1.38.0-wmf.2 refs T281166 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/724482 (owner: 10Jeena Huneidi) [17:54:37] (03CR) 10Joal: Add analytics purge for Gobblin old files (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/724413 (https://phabricator.wikimedia.org/T287084) (owner: 10Joal) [17:55:35] (03Merged) 10jenkins-bot: testwikis wikis to 1.38.0-wmf.2 refs T281166 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/724482 (owner: 10Jeena Huneidi) [17:55:41] !log jhuneidi@deploy1002 Started scap: testwikis wikis to 1.38.0-wmf.2 refs T281166 [17:55:41] (03PS1) 10Cmjohnson: Adding new servers an-db1001-2 to site.pp, dhcpd and netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/724483 (https://phabricator.wikimedia.org/T289632) [17:55:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:55:49] T281166: 1.38.0-wmf.2 deployment blockers - https://phabricator.wikimedia.org/T281166 [17:56:06] (03PS2) 10Cmjohnson: Adding new servers an-db1001-2 to site.pp, dhcpd and netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/724483 (https://phabricator.wikimedia.org/T289632) [17:56:40] (03PS1) 10Ottomata: EventBus - Enable x_client_ip_forwarding_enabled for analytics purposes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/724380 (https://phabricator.wikimedia.org/T288853) [17:57:17] 
!log otto@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'canary' . [17:57:17] !log otto@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'production' . [17:57:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:57:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:57:28] (03CR) 10Cmjohnson: [C: 03+2] Adding new servers an-db1001-2 to site.pp, dhcpd and netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/724483 (https://phabricator.wikimedia.org/T289632) (owner: 10Cmjohnson) [17:58:40] (03PS2) 10Ottomata: EventBus - Enable x_client_ip_forwarding_enabled for analytics purposes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/724380 (https://phabricator.wikimedia.org/T288853) [17:59:31] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations: Q1:(Need By: TBD) rack/setup/install puppetmaster100[45].eqiad.wmnet - https://phabricator.wikimedia.org/T289732 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` puppetmaster1004.eqiad.... [18:00:05] Deploy window Pre MediaWiki train break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210928T1800) [18:00:15] 10SRE, 10Analytics, 10Analytics-Kanban, 10Data-Engineering, and 6 others: Migrated Server-side EventLogging events recording http.client_ip as 127.0.0.1 - https://phabricator.wikimedia.org/T288853 (10Ottomata) Scheduled for a backport window tomorrow. [18:00:45] !log otto@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'production' . 
[18:00:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:01:39] (03CR) 10Ottomata: Add analytics purge for Gobblin old files (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/724413 (https://phabricator.wikimedia.org/T287084) (owner: 10Joal) [18:01:50] !log pt1979@cumin1001 START - Cookbook sre.experimental.reimage for host thumbor2005.codfw.wmnet [18:01:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:01:57] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops, 10Patch-For-Review: Q1:(Need By: TBD) rack/setup/install thumbor200[56].codfw.wmnet - https://phabricator.wikimedia.org/T290190 (10ops-monitoring-bot) Cookbook cookbooks.sre.experimental.reimage was started by pt1979@cumin1001 for host thumbor2005.codfw.wmnet [18:02:10] !log otto@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'canary' . [18:02:10] !log otto@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'production' . [18:02:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:02:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:03:13] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations: Q1:(Need By: TBD) rack/setup/install puppetmaster100[45].eqiad.wmnet - https://phabricator.wikimedia.org/T289732 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` puppetmaster1005.eqiad.... 
[18:04:17] PROBLEM - Check systemd state on ms-be2036 is CRITICAL: CRITICAL - degraded: The following units failed: swift-drive-audit.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:04:32] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:(Need By: TBD) rack/setup/install mw241[2-9].codfw.wmnet - https://phabricator.wikimedia.org/T290192 (10Papaul) [18:05:38] 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops, and 2 others: Q1:(Need By: ASAP) rack/setup/install an-db100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T289632 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` an-db1001.eqiad.wmnet `... [18:06:57] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [18:08:21] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [18:10:33] 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops, and 2 others: Q1:(Need By: ASAP) rack/setup/install an-db100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T289632 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-db1001.eqiad.wmnet'] ` Of which those **FAILED**: ` ['an-db1001.eqiad.... 
[18:12:19] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on puppetmaster1004.eqiad.wmnet with reason: REIMAGE [18:12:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:12:27] 10SRE-swift-storage, 10ops-codfw: swift - ms-be2036 - device sdi:6 unavailable - https://phabricator.wikimedia.org/T291988 (10Dzahn) [18:12:42] 10SRE-swift-storage, 10ops-codfw: swift - ms-be2036 - device sdg:4 unavailable - https://phabricator.wikimedia.org/T291988 (10Dzahn) [18:13:51] ACKNOWLEDGEMENT - Device not healthy -SMART- on ms-be2035 is CRITICAL: cluster=swift device=None instance=ms-be2035 job=node site=codfw daniel_zahn https://phabricator.wikimedia.org/T291988 https://wikitech.wikimedia.org/wiki/SMART%23Alerts https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ms-be2035&var-datasource=codfw+prometheus/ops [18:13:51] ACKNOWLEDGEMENT - Check systemd state on ms-be2036 is CRITICAL: CRITICAL - degraded: The following units failed: swift-drive-audit.service daniel_zahn https://phabricator.wikimedia.org/T291988 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:14:08] 10SRE, 10SRE-swift-storage, 10ops-codfw: swift - ms-be2035 - device sdi:6 unavailable - https://phabricator.wikimedia.org/T291896 (10Dzahn) No problem, here is another one on ms-be2036: T291988 [18:14:15] 10SRE-swift-storage, 10ops-codfw: swift - ms-be2036 - device sdg:4 unavailable - https://phabricator.wikimedia.org/T291988 (10Dzahn) also SMART alert on that one: https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=ms-be2035&service=Device+not+healthy+-SMART- [18:14:32] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on puppetmaster1004.eqiad.wmnet with reason: REIMAGE [18:14:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:16:14] ACKNOWLEDGEMENT - SSH on contint2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
daniel_zahn https://phabricator.wikimedia.org/T283582 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:16:14] ACKNOWLEDGEMENT - SSH on db2078.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds daniel_zahn https://phabricator.wikimedia.org/T283582 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:16:14] ACKNOWLEDGEMENT - SSH on gerrit2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds daniel_zahn https://phabricator.wikimedia.org/T283582 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:16:14] ACKNOWLEDGEMENT - SSH on ms-fe2006.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds daniel_zahn https://phabricator.wikimedia.org/T283582 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:16:15] ACKNOWLEDGEMENT - SSH on mw2253.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds daniel_zahn https://phabricator.wikimedia.org/T283582 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:16:48] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on puppetmaster1005.eqiad.wmnet with reason: REIMAGE [18:16:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:18:03] !log jgiannelos@deploy1002 Started deploy [kartotherian/deploy@0a38bc5]: tegola: use eqiad discovery endpoin [18:18:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:18:11] !log jgiannelos@deploy1002 Finished deploy [kartotherian/deploy@0a38bc5]: tegola: use eqiad discovery endpoin (duration: 00m 08s) [18:18:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:18:59] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on puppetmaster1005.eqiad.wmnet with reason: REIMAGE [18:19:05] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:20:54] (03PS2) 10Brennen Bearnes: gitlab-runner: restrict allowed images and services [puppet] - 10https://gerrit.wikimedia.org/r/724472 (https://phabricator.wikimedia.org/T291978) [18:21:53] !log pt1979@cumin1001 END (FAIL) - Cookbook sre.experimental.reimage (exit_code=99) for host thumbor2005.codfw.wmnet [18:21:57] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:(Need By: TBD) rack/setup/install thumbor200[56].codfw.wmnet - https://phabricator.wikimedia.org/T290190 (10ops-monitoring-bot) Cookbook cookbooks.sre.experimental.reimage executed with errors: - thumbor2005 (**FAIL**) - Forced PXE for next reboot - Host r... [18:21:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:23:19] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations: Q1:(Need By: TBD) rack/setup/install puppetmaster100[45].eqiad.wmnet - https://phabricator.wikimedia.org/T289732 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['puppetmaster1004.eqiad.wmnet'] ` and were **ALL** successful. 
[18:24:27] (03CR) 10Ssingh: [C: 03+1] haproxy: Allow configuring timeouts [puppet] - 10https://gerrit.wikimedia.org/r/719479 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [18:25:17] (03PS3) 10Ryan Kemper: Add dsh targets for the new wcqs cluster [puppet] - 10https://gerrit.wikimedia.org/r/721600 (https://phabricator.wikimedia.org/T280001) (owner: 10Ebernhardson) [18:25:27] (03CR) 10Ppchelko: [C: 03+1] Guard against undefined index notice when setting x-client-ip [extensions/EventBus] (wmf/1.38.0-wmf.2) - 10https://gerrit.wikimedia.org/r/724480 (https://phabricator.wikimedia.org/T288853) (owner: 10Ottomata) [18:26:01] (03CR) 10jerkins-bot: [V: 04-1] Add dsh targets for the new wcqs cluster [puppet] - 10https://gerrit.wikimedia.org/r/721600 (https://phabricator.wikimedia.org/T280001) (owner: 10Ebernhardson) [18:27:32] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations: Q1:(Need By: TBD) rack/setup/install puppetmaster100[45].eqiad.wmnet - https://phabricator.wikimedia.org/T289732 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['puppetmaster1005.eqiad.wmnet'] ` and were **ALL** successful. 
[18:29:27] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations: Q1:(Need By: TBD) rack/setup/install puppetmaster100[45].eqiad.wmnet - https://phabricator.wikimedia.org/T289732 (10Cmjohnson) [18:29:32] (03PS1) 10Ottomata: Add docs about template, label, and conary conventions [deployment-charts] - 10https://gerrit.wikimedia.org/r/724489 (https://phabricator.wikimedia.org/T291848) [18:29:34] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations: Q1:(Need By: TBD) rack/setup/install puppetmaster100[45].eqiad.wmnet - https://phabricator.wikimedia.org/T289732 (10Cmjohnson) 05Open→03Resolved [18:29:58] (03CR) 10jerkins-bot: [V: 04-1] Add docs about template, label, and conary conventions [deployment-charts] - 10https://gerrit.wikimedia.org/r/724489 (https://phabricator.wikimedia.org/T291848) (owner: 10Ottomata) [18:30:01] (03PS4) 10Ryan Kemper: Add dsh targets for the new wcqs cluster [puppet] - 10https://gerrit.wikimedia.org/r/721600 (https://phabricator.wikimedia.org/T280001) (owner: 10Ebernhardson) [18:31:36] (03PS2) 10Ottomata: Add docs about template, label, and canary conventions [deployment-charts] - 10https://gerrit.wikimedia.org/r/724489 (https://phabricator.wikimedia.org/T291848) [18:34:29] (03CR) 10Ssingh: [C: 03+1] haproxy: Add H2 performance tuning settings [puppet] - 10https://gerrit.wikimedia.org/r/719974 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [18:35:09] (03Restored) 10Legoktm: Fix erroneous en-gb translations [core] (wmf/1.38.0-wmf.1) - 10https://gerrit.wikimedia.org/r/723729 (https://phabricator.wikimedia.org/T291717) (owner: 10Zabe) [18:37:03] 10SRE, 10PoolCounter, 10observability, 10Sustainability (Incident Followup): Fix monitoring of poolcounter service - https://phabricator.wikimedia.org/T83729 (10Krinkle) [18:41:21] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/721600 (https://phabricator.wikimedia.org/T280001) (owner: 10Ebernhardson) [18:42:18] (03CR) 10Ppchelko: [C: 
04-1] Add docs about template, label, and canary conventions (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/724489 (https://phabricator.wikimedia.org/T291848) (owner: 10Ottomata) [18:42:56] (03CR) 10Ppchelko: [C: 03+1] Guard against undefined index notice when setting x-client-ip [extensions/EventBus] (wmf/1.38.0-wmf.1) - 10https://gerrit.wikimedia.org/r/724481 (https://phabricator.wikimedia.org/T288853) (owner: 10Ottomata) [18:45:07] !log jhuneidi@deploy1002 Finished scap: testwikis wikis to 1.38.0-wmf.2 refs T281166 (duration: 49m 27s) [18:45:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:45:14] T281166: 1.38.0-wmf.2 deployment blockers - https://phabricator.wikimedia.org/T281166 [18:48:23] (03PS3) 10Ottomata: Add docs about template, label, and canary conventions [deployment-charts] - 10https://gerrit.wikimedia.org/r/724489 (https://phabricator.wikimedia.org/T291848) [18:48:39] (03CR) 10Brennen Bearnes: gitlab-runner: restrict allowed images and services (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/724472 (https://phabricator.wikimedia.org/T291978) (owner: 10Brennen Bearnes) [18:50:01] (03PS3) 10Brennen Bearnes: gitlab-runner: restrict allowed images and services [puppet] - 10https://gerrit.wikimedia.org/r/724472 (https://phabricator.wikimedia.org/T291978) [18:52:30] (03CR) 10Ryan Kemper: [C: 03+2] Add dsh targets for the new wcqs cluster [puppet] - 10https://gerrit.wikimedia.org/r/721600 (https://phabricator.wikimedia.org/T280001) (owner: 10Ebernhardson) [18:54:19] !log T280001 Merged https://gerrit.wikimedia.org/r/c/operations/puppet/+/721600 (add wcqs scap dsh groups), running puppet on scap::dsh hosts: `ryankemper@cumin1001:~$ sudo cumin 'P:scap::dsh' 'sudo run-puppet-agent'` [18:54:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:54:25] T280001: Set up puppet configuration for new WCQS cluster - https://phabricator.wikimedia.org/T280001 [18:57:29] 
(03CR) 10Legoktm: [C: 03+2] P::toolforge: Use composer package on buster [puppet] - 10https://gerrit.wikimedia.org/r/723760 (https://phabricator.wikimedia.org/T287900) (owner: 10Majavah) [19:00:05] jeena and dduvall: Your horoscope predicts another unfortunate MediaWiki train - American Version deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210928T1900). [19:00:12] !log jgiannelos@deploy1002 Started deploy [kartotherian/deploy@0a38bc5]: tegola: use eqiad discovery endpoin [19:00:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:00:33] !log jgiannelos@deploy1002 Finished deploy [kartotherian/deploy@0a38bc5]: tegola: use eqiad discovery endpoin (duration: 00m 20s) [19:00:35] (03PS2) 10Legoktm: Add toolhub to discovery [dns] - 10https://gerrit.wikimedia.org/r/711727 (https://phabricator.wikimedia.org/T280881) [19:00:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:01:06] (03PS4) 10Ryan Kemper: trafficserver: Create routing for commons-query.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/720078 (https://phabricator.wikimedia.org/T280247) (owner: 10Ebernhardson) [19:01:38] (03PS5) 10Ryan Kemper: trafficserver: Create routing for commons-query.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/720078 (https://phabricator.wikimedia.org/T280247) (owner: 10Ebernhardson) [19:01:56] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/720078 (https://phabricator.wikimedia.org/T280247) (owner: 10Ebernhardson) [19:03:20] (03CR) 10Legoktm: Add toolhub to discovery (032 comments) [dns] - 10https://gerrit.wikimedia.org/r/711727 (https://phabricator.wikimedia.org/T280881) (owner: 10Legoktm) [19:03:24] (03CR) 10Legoktm: [C: 03+2] Add toolhub to discovery [dns] - 10https://gerrit.wikimedia.org/r/711727 (https://phabricator.wikimedia.org/T280881) (owner: 10Legoktm) [19:04:54] !log adding toolhub to discovery DNS 
(T280881) [19:05:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:05:01] T280881: New Service Request Toolhub - https://phabricator.wikimedia.org/T280881 [19:05:15] (03PS4) 10Legoktm: Add Toolhub public DNS name [dns] - 10https://gerrit.wikimedia.org/r/711637 (https://phabricator.wikimedia.org/T280881) (owner: 10Majavah) [19:05:24] (03PS5) 10Legoktm: Add toolhub to cache backends [puppet] - 10https://gerrit.wikimedia.org/r/711648 (https://phabricator.wikimedia.org/T280881) (owner: 10Majavah) [19:05:27] (03PS1) 10Jeena Huneidi: group0 wikis to 1.38.0-wmf.2 refs T281166 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/724492 [19:05:29] (03CR) 10Jeena Huneidi: [C: 03+2] group0 wikis to 1.38.0-wmf.2 refs T281166 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/724492 (owner: 10Jeena Huneidi) [19:05:55] !log legoktm@cumin1001 conftool action : set/pooled=true; selector: name=eqiad,dnsdisc=toolhub [19:05:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:06:15] {"codfw": {"pooled": false, "references": [], "ttl": 300}, "tags": "dnsdisc=toolhub"} [19:06:16] {"eqiad": {"pooled": true, "references": [], "ttl": 300}, "tags": "dnsdisc=toolhub"} [19:06:42] (03Merged) 10jenkins-bot: group0 wikis to 1.38.0-wmf.2 refs T281166 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/724492 (owner: 10Jeena Huneidi) [19:07:10] $ curl https://toolhub.discovery.wmnet:4011/healthz [19:07:10] {"status": "OK"} [19:07:14] bd808: ^^ [19:07:37] such progress! much wow! 
:) [19:08:21] !log jhuneidi@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.38.0-wmf.2 refs T281166 [19:08:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:08:27] T281166: 1.38.0-wmf.2 deployment blockers - https://phabricator.wikimedia.org/T281166 [19:14:24] 10SRE, 10Wikimedia-General-or-Unknown, 10Sustainability (Incident Followup): Better monitoring and error reporting of Errors and Exceptions - https://phabricator.wikimedia.org/T51757 (10Krinkle) [19:18:21] (03CR) 10Krinkle: [C: 03+1] gitlab-runner: restrict allowed images and services [puppet] - 10https://gerrit.wikimedia.org/r/724472 (https://phabricator.wikimedia.org/T291978) (owner: 10Brennen Bearnes) [19:20:54] (03CR) 10Joal: Add analytics purge for Gobblin old files (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/724413 (https://phabricator.wikimedia.org/T287084) (owner: 10Joal) [19:21:18] (03PS2) 10Joal: Add analytics purge for Gobblin old files [puppet] - 10https://gerrit.wikimedia.org/r/724413 (https://phabricator.wikimedia.org/T287084) [19:23:58] Rolling back the train due to a spike in errors [19:26:54] fwiw, the risky Timeline change I mentioned did break some font stuff, I'll get that fixed (but it shouldn't hold the train on its own) [19:27:19] !log jhuneidi@deploy1002 rebuilt and synchronized wikiversions files: Revert "group0 wikis to 1.38.0-wmf.1" [19:27:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:27:41] thanks legoktm! 
I don't think it's related [19:27:44] filed https://phabricator.wikimedia.org/T292010 [19:28:03] oh, privatesettings is still using the IP class [19:28:06] not sure who to tag yet [19:28:46] it should just need to use Wikimedia\IPUtils instead [19:28:48] (03PS1) 10Jeena Huneidi: Revert "group0 wikis to 1.38.0-wmf.2 refs T281166" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/724496 [19:28:49] (03CR) 10Jeena Huneidi: [C: 03+2] Revert "group0 wikis to 1.38.0-wmf.2 refs T281166" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/724496 (owner: 10Jeena Huneidi) [19:28:50] ah, so related to T291008 it seems [19:28:51] T291008: Remove deprecated IP class - https://phabricator.wikimedia.org/T291008 [19:29:46] (03Merged) 10jenkins-bot: Revert "group0 wikis to 1.38.0-wmf.2 refs T281166" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/724496 (owner: 10Jeena Huneidi) [19:30:31] I committed d1df4753484bc334253a383855a2da1eb26a665f to /srv/mw-staging/private if someone wants to review that [19:30:46] Probably known, but datapoints in case useful: Trying to save edits at officewiki was giving me a "Fatal exception of type "Error"" (let me know if the long code is needed). And the list of images uploaded there were all displaying blank. (i.e. Special:ListFiles had only empty white rectangles) [19:31:31] quiddity: yeah, same issue [19:31:53] (03PS3) 10Ryan Kemper: Deploy query_service microsite for wcqs [puppet] - 10https://gerrit.wikimedia.org/r/717630 (https://phabricator.wikimedia.org/T280247) (owner: 10Ebernhardson) [19:33:19] jeena: ok if I (or you can as well) sync PrivateSettings.php? 
[19:33:27] legoktm: looks sane to me [19:34:00] yeah [19:35:43] !log legoktm@deploy1002 Synchronized private/PrivateSettings.php: Use IPUtils instead of removed IP class (T292010) (duration: 01m 09s) [19:35:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:35:49] T292010: Error: Class 'IP' not found - https://phabricator.wikimedia.org/T292010 [19:36:17] * bd808 came here to look for info on IP::isInRange and sees that legoktm is on it [19:36:29] I was trying to find the change...so you have to change it directly on the server? [19:36:58] /srv/mediawiki-staging/private is its own Git repo that just lives on deploy servers [19:37:06] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/717630 (https://phabricator.wikimedia.org/T280247) (owner: 10Ebernhardson) [19:37:06] gotcha [19:39:09] (03PS1) 10Bearloga: statistics::product_analytics: create and prepare [puppet] - 10https://gerrit.wikimedia.org/r/724497 (https://phabricator.wikimedia.org/T291957) [19:39:38] (03CR) 10jerkins-bot: [V: 04-1] statistics::product_analytics: create and prepare [puppet] - 10https://gerrit.wikimedia.org/r/724497 (https://phabricator.wikimedia.org/T291957) (owner: 10Bearloga) [19:40:16] I'll wait a few minutes before rolling to group0 again [19:41:06] (03PS2) 10Bearloga: statistics::product_analytics: create and prepare [puppet] - 10https://gerrit.wikimedia.org/r/724497 (https://phabricator.wikimedia.org/T291957) [19:42:40] (03CR) 10Ryan Kemper: [C: 03+2] Deploy query_service microsite for wcqs [puppet] - 10https://gerrit.wikimedia.org/r/717630 (https://phabricator.wikimedia.org/T280247) (owner: 10Ebernhardson) [19:45:25] Deploying to group0 [19:45:44] (03PS1) 10Jeena Huneidi: group0 wikis to 1.38.0-wmf.2 refs T281166 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/724499 [19:45:46] (03CR) 10Jeena Huneidi: [C: 03+2] group0 wikis to 1.38.0-wmf.2 refs T281166 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/724499 
(owner: 10Jeena Huneidi) [19:46:40] (03Merged) 10jenkins-bot: group0 wikis to 1.38.0-wmf.2 refs T281166 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/724499 (owner: 10Jeena Huneidi) [19:48:12] !log jhuneidi@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.38.0-wmf.2 refs T281166 [19:48:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:51:19] T281166: 1.38.0-wmf.2 deployment blockers - https://phabricator.wikimedia.org/T281166 [19:55:54] (03PS1) 10Andrew Bogott: manila: use manila-srv service user rather than novaadmin for auth [puppet] - 10https://gerrit.wikimedia.org/r/724500 (https://phabricator.wikimedia.org/T291257) [19:59:13] zabe: may I pm? [19:59:29] sure [20:00:39] (03CR) 10Andrew Bogott: "I think this is all we need to avoid novaadmin usage here; we may need to add roles for manila-srv in other projects but I suspect that ju" [puppet] - 10https://gerrit.wikimedia.org/r/724500 (https://phabricator.wikimedia.org/T291257) (owner: 10Andrew Bogott) [20:02:48] RECOVERY - Check systemd state on ms-be2036 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:03:18] (03CR) 10Hashar: "Thanks! CI images will be rebuild via https://gerrit.wikimedia.org/r/c/integration/config/+/724501" [puppet] - 10https://gerrit.wikimedia.org/r/720241 (owner: 10Hashar) [20:08:41] (03CR) 10Dzahn: [C: 03+1] "lgtm, just needs rebasing" [puppet] - 10https://gerrit.wikimedia.org/r/724053 (owner: 10Muehlenhoff) [20:13:22] (03CR) 10Bstorm: "Adding mdipietro since we are working together on this." 
[puppet] - 10https://gerrit.wikimedia.org/r/723808 (https://phabricator.wikimedia.org/T291806) (owner: 10Marostegui) [20:24:38] (03PS6) 10Ryan Kemper: query_service: Support proxying to microsite from backend [puppet] - 10https://gerrit.wikimedia.org/r/720801 (https://phabricator.wikimedia.org/T280247) (owner: 10Ebernhardson) [20:30:10] (03CR) 10Ryan Kemper: [C: 03+2] query_service: Support proxying to microsite from backend [puppet] - 10https://gerrit.wikimedia.org/r/720801 (https://phabricator.wikimedia.org/T280247) (owner: 10Ebernhardson) [20:31:04] PROBLEM - SSH on analytics1069.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:31:32] ryankemper: ready? [20:32:16] legoktm: ~10 mins, deploying an nginx patch https://gerrit.wikimedia.org/r/c/operations/puppet/+/720801 [20:32:23] ok :) [20:33:33] !log Adding IPv6 address to NaWas sub-interfaceon cr2-esams (AMS-IX) - T288505 [20:33:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:33:44] !log T280247 `ryankemper@cumin1001` -> `sudo cumin 'P{w*qs*}' 'sudo disable-puppet "Make query_service nginx proxy to GUI microsite - T280247"'` [20:33:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:33:52] T280247: Create WCQS UI microsite deployment - https://phabricator.wikimedia.org/T280247 [20:33:53] !log T280247 Running on single wcqs hosts: `ryankemper@wcqs1001:~$ sudo run-puppet-agent --force` [20:33:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:36:13] (03CR) 10Legoktm: [C: 03+2] Add Toolhub public DNS name [dns] - 10https://gerrit.wikimedia.org/r/711637 (https://phabricator.wikimedia.org/T280881) (owner: 10Majavah) [20:37:17] !log T280247 Ran on wdqs canary `wdqs1003`: `ryankemper@wdqs1003:~$ sudo run-puppet-agent --force` [20:37:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:37:24] !log 
T280247 Test queries on `wdqs1003` passed (tunneled into `wdqs1003`), proceeding to rest of fleet [20:37:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:38:57] !log T280247 `ryankemper@cumin1001:~$ sudo cumin -b 5 'P{w*qs*}' 'sudo run-puppet-agent --force'`; 25 hosts total so will take 5 iterations [20:39:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:39:04] T280247: Create WCQS UI microsite deployment - https://phabricator.wikimedia.org/T280247 [20:40:14] legoktm: okay so let's roll the toolhub change first. I assume, like my change, that yours is relying on the text cache as opposed to the upload cache? [20:40:18] yep [20:40:54] so 1) disable puppet on all A:cp-text, 2) merge puppet change, 3) enable puppet on 1 cp-text node, run it and verify, 4) re-enable puppet everywhere and batch puppet runs [20:41:16] legoktm: LGTM. I'll let you drive for this toolhub change ofc [20:41:41] !log disabling puppet on A:cp-text in preparation for adding toolhub [20:41:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:41:56] (03CR) 10Ottomata: [C: 03+1] Add analytics purge for Gobblin old files [puppet] - 10https://gerrit.wikimedia.org/r/724413 (https://phabricator.wikimedia.org/T287084) (owner: 10Joal) [20:44:25] (03PS6) 10Legoktm: Add toolhub to cache backends [puppet] - 10https://gerrit.wikimedia.org/r/711648 (https://phabricator.wikimedia.org/T280881) (owner: 10Majavah) [20:45:32] (03CR) 10Legoktm: [C: 03+2] Add toolhub to cache backends [puppet] - 10https://gerrit.wikimedia.org/r/711648 (https://phabricator.wikimedia.org/T280881) (owner: 10Majavah) [20:45:46] (03CR) 10Ryan Kemper: [C: 03+1] trafficserver: Create routing for commons-query.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/720078 (https://phabricator.wikimedia.org/T280247) (owner: 10Ebernhardson) [20:46:20] (03CR) 10Ottomata: "Commented on task" [puppet] - 10https://gerrit.wikimedia.org/r/724497 
(https://phabricator.wikimedia.org/T291957) (owner: 10Bearloga) [20:46:34] running puppet on cp1075 [20:47:50] legoktm@cp1075:~$ sudo grep toolhub /etc/trafficserver/remap.config [20:47:50] map http://toolhub.wikimedia.org https://toolhub.discovery.wmnet:4011 [20:48:22] yup, LGTM https://www.irccloud.com/pastebin/HdwO2Lcq/ [20:48:35] * ryankemper piped cat to grep, oops [20:48:42] :P [20:48:48] you win this round :P [20:48:50] ok, going to re-enable puppet everywhere now [20:49:11] !log otto@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'canary' . [20:49:11] !log otto@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'production' . [20:49:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:49:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:49:39] !log re-enabling and running puppet on A:cp-text: sudo cumin -b 5 A:cp-text 'enable-puppet --force && run-puppet-agent' [20:49:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:51:47] !log T280247 Puppet successfully ran on all `w*qs*` hosts; GUI working as before for WDQS, and WCQS seems fine as well. Deploy succeeded without any hitches [20:51:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:51:52] T280247: Create WCQS UI microsite deployment - https://phabricator.wikimedia.org/T280247 [20:52:06] !log otto@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics' for release 'production' . [20:52:06] !log otto@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics' for release 'canary' . 
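The rollout order discussed above (disable puppet on the whole group, run it on one host and verify, then re-enable everywhere in batches) can be sketched roughly as below. This is only an illustration: `plan_rollout`, the `cp-example-*` host names, and the host count are hypothetical, and the actual work in the log was done with cumin (`-b 5`), not with a script like this.

```python
from typing import List, Tuple


def plan_rollout(hosts: List[str], canary: str, batch_size: int) -> Tuple[str, List[List[str]]]:
    """Return the canary host plus the remaining hosts split into batches.

    Hypothetical helper: it only plans the order, it does not touch any host.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    if canary not in hosts:
        raise ValueError("canary must be one of the hosts")
    # Everything except the canary is rolled out afterwards, batch by batch.
    remaining = [h for h in hosts if h != canary]
    batches = [remaining[i:i + batch_size] for i in range(0, len(remaining), batch_size)]
    return canary, batches


# Hypothetical host list; only cp1075 appears in the log above.
hosts = ["cp1075"] + [f"cp-example-{n:02d}" for n in range(1, 11)]
canary, batches = plan_rollout(hosts, canary="cp1075", batch_size=5)

print("verify on canary first:", canary)
for number, batch in enumerate(batches, start=1):
    print(f"batch {number}:", ", ".join(batch))
```

Keeping the canary out of the batches makes the verification step explicit: nothing else is touched until the single host has been checked by hand.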
[20:52:09] (03CR) 10Ottomata: Add docs about template, label, and canary conventions (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/724489 (https://phabricator.wikimedia.org/T291848) (owner: 10Ottomata) [20:52:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:52:11] (03PS4) 10Ottomata: Add docs about template, label, and canary conventions [deployment-charts] - 10https://gerrit.wikimedia.org/r/724489 (https://phabricator.wikimedia.org/T291848) [20:52:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:54:21] (03CR) 10Juan90264: [C: 03+1] "Hello folks, I would appreciate it if you could review this change. And in the other linked to this one." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704166 (https://phabricator.wikimedia.org/T286405) (owner: 10Juan90264) [20:54:23] (03CR) 10Juan90264: [C: 03+1] "Hello folks, I would appreciate it if you could review this change. And in the other linked to this one." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704167 (https://phabricator.wikimedia.org/T286405) (owner: 10Juan90264) [20:57:20] (03PS6) 10Ryan Kemper: trafficserver: Create routing for commons-query.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/720078 (https://phabricator.wikimedia.org/T280247) (owner: 10Ebernhardson) [20:59:29] bd808: it works! 
https://toolhub.wikimedia.org/audit-logs [20:59:48] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: Q2) eqiad: Upgrades of Management Switches - https://phabricator.wikimedia.org/T259758 (10Jclark-ctr) [21:00:04] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: Q2) eqiad: Upgrades of Management Switches - https://phabricator.wikimedia.org/T259758 (10Jclark-ctr) Rack c6 was recently replaced with new switch T251616 [21:00:13] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: Q2) eqiad: Upgrades of Management Switches - https://phabricator.wikimedia.org/T259758 (10Jclark-ctr) racked all switches in row C not cabled or powered yet [21:00:25] majavah: sweet! All praise to legoktm and _joe_ for the bits they enabled there today! [21:00:32] puppet is still running :p [21:01:28] search seems to not work though, but that's a minor thing compared to everything else [21:01:46] (re puppet still running) That explains why I got a not found initially, yet it now succeeded :P traffic roulette [21:01:47] yeah, there is no data in the database and no search index yet. 
[21:02:00] !log legoktm@deploy1002:~$ echo "https://toolhub.wikimedia.org/" | mwscript purgeList.php [21:02:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:02:08] ryankemper: I think I'm all done now [21:02:39] legoktm: great, proceeding to my change [21:03:16] 10SRE, 10serviceops, 10Continuous-Integration-Config, 10Regression, 10Sustainability (Incident Followup): operations-apache-config-lint replacement doesn't check syntax - https://phabricator.wikimedia.org/T114801 (10Krinkle) [21:03:43] !log T280247 `ryankemper@cumin1001:~$ sudo cumin 'A:cp-text' 'sudo disable-puppet "Add trafficserver backend mapping for commons-query.wikimedia.org - T280247"'` [21:03:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:03:49] T280247: Create WCQS UI microsite deployment - https://phabricator.wikimedia.org/T280247 [21:03:56] (03CR) 10Ryan Kemper: [C: 03+2] trafficserver: Create routing for commons-query.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/720078 (https://phabricator.wikimedia.org/T280247) (owner: 10Ebernhardson) [21:04:48] 10SRE, 10Mobile-Content-Service, 10Product-Infrastructure-Team-Backlog, 10Sustainability (Incident Followup), 10User-Joe: Create functional cluster checks for all services (and have them page!) 
- https://phabricator.wikimedia.org/T134551 (10Krinkle) [21:05:28] !log T280247 Merged https://gerrit.wikimedia.org/r/c/operations/puppet/+/720078 [21:05:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:07:01] !log T280247 Running on single cp-text host: `ryankemper@cp1075:~$ sudo run-puppet-agent --force` [21:07:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:08:29] !log pt1979@cumin1001 START - Cookbook sre.experimental.reimage for host thumbor2005.codfw.wmnet [21:08:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:08:34] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:(Need By: TBD) rack/setup/install thumbor200[56].codfw.wmnet - https://phabricator.wikimedia.org/T290190 (10ops-monitoring-bot) Cookbook cookbooks.sre.experimental.reimage was started by pt1979@cumin1001 for host thumbor2005.codfw.wmnet [21:09:24] !log T280247 `ryankemper@cp1075:~$ sudo grep commons-query /etc/trafficserver/remap.config` shows `map http://commons-query.wikimedia.org https://wcqs.discovery.wmnet`; proceeding to rest of fleet in batches of 5 [21:09:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:09:29] T280247: Create WCQS UI microsite deployment - https://phabricator.wikimedia.org/T280247 [21:09:33] legoktm: Do 404 responses get cached at our edge? I'm getting consistent 404 responses for https://toolhub.wikimedia.org/static/js/chunk-vendors.js from cp4031. Maybe from m.ajavah hitting the new site url a bit early? 
[21:09:57] I can load the file when I route through another edge pop [21:10:16] !log T280247 `ryankemper@cumin1001:~$ sudo cumin -b 5 'A:cp-text' 'sudo run-puppet-agent --force'` [21:10:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:11:40] !log Configure cr2-esams for NaWas BGP peering to gateway-1 IPv4 (T288505) [21:11:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:12:29] bd808: in a meeting, I can look after [21:15:05] bd808: I wonder if the fix would be a one-off purge per https://wikitech.wikimedia.org/wiki/Varnish#One-off_purges_(bans)? Only thing is there's a lot of cautionary tape for the obvious performance reasons, but I'd think that a ban with `req.http.host == "toolhub.wikimedia.org"` would do the trick [21:17:24] !log Configure cr2-esams for NaWas BGP peering to gateway-1 IPv6 and gateway-2 (T288505) [21:17:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:17:40] ryankemper: seems right. I think I can actually do this via purgeList.php [21:18:37] 10SRE, 10Performance-Team, 10Epic, 10Sustainability (Incident Followup): During deployment old servers may populate new cache URIs - https://phabricator.wikimedia.org/T47877 (10Krinkle) [21:18:38] purgelist is less-impactful than varnishadm "ban", in cases where it can work [21:18:53] ack, I was thinking purgelist would only work on the exact url [21:18:56] * bd808 tries [21:18:58] does it actually purge anything that matches the substring? [21:19:02] it does only work on a single URL [21:19:10] i.e. don't we want to purge *everything* under toolhub.wikimedia.org, not just that one chunk-vendors.js? 
[21:19:22] (ack re single url) [21:19:23] or just let them expire [21:19:42] !log bd808@mwmaint1002 echo "https://toolhub.wikimedia.org/static/js/chunk-vendors.js" | mwscript purgeList.php [21:19:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:19:50] (03CR) 10Jdlrobson: "No need to have a separate patch for this. Please squash this patch into https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/70" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704167 (https://phabricator.wikimedia.org/T286405) (owner: 10Juan90264) [21:19:51] I believe the default expiry for 40x is 10min [21:20:02] 10 minutes, yeah [21:20:14] it's not even a default: even if the applayer tries to go higher, it gets capped at 10 minutes [21:20:21] nice, I would have guessed it'd be like 30 minutes or something [21:20:31] yeah in that case figuring out how to do the ban would take longer than the expiration window :P [21:20:32] (at least in varnish, I'd have to double-check ATS) [21:20:34] it either fell out of cache or the purgeList.php call knocked it out. Mischief managed legoktm [21:20:37] (03CR) 10Jdlrobson: [C: 03+1] "I've left a note in the follow up about doing this in a single patch but otherwise this looks good." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704166 (https://phabricator.wikimedia.org/T286405) (owner: 10Juan90264) [21:20:39] for other static responses like HTML and CSS/JS, I think the default is something like 24 hours, based on the bugs we keep having with doc.wikimedia.org where many doc pages are corrupt because one of the assets is stuck in the cache. 
[21:20:57] (confirmed, ATS also caps at 10m) [21:22:16] !log pt1979@cumin1001 END (PASS) - Cookbook sre.experimental.reimage (exit_code=0) for host thumbor2005.codfw.wmnet [21:22:18] another way to avoid this issue (not that it's usually a big deal) is to wait until after the cache/routing layers are configured before pushing the DNS-side patch for the new public name (so that no public requests can come in for that hostname and cache a 404 before it's ready) [21:22:20] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:(Need By: TBD) rack/setup/install thumbor200[56].codfw.wmnet - https://phabricator.wikimedia.org/T290190 (10ops-monitoring-bot) Cookbook cookbooks.sre.experimental.reimage completed: - thumbor2005 (**WARN**) - Downtimed on Icinga - //Unable to disable Pupp... [21:22:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:22:29] !log T280247 Puppet run complete on all of `cp-text`, trafficserver backend work is done [21:22:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:22:36] T280247: Create WCQS UI microsite deployment - https://phabricator.wikimedia.org/T280247 [21:25:11] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:(Need By: TBD) rack/setup/install thumbor200[56].codfw.wmnet - https://phabricator.wikimedia.org/T290190 (10Papaul) @Volans I was able to get thumbor2005 installed without adding the MAC address but the install failed also like mw2413 ` Run Puppet in NOOP mode... 
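The 404 caching behaviour discussed above (error responses capped at 10 minutes no matter what the application layer asks for) can be illustrated with a minimal Python sketch. The function name, the constant, and the choice to treat every 4xx status the same way are assumptions made for illustration; the real Varnish and ATS configuration is not shown in this log.

```python
# 10-minute cap mentioned in the discussion, expressed in seconds.
CLIENT_ERROR_TTL_CAP_SECONDS = 10 * 60


def effective_ttl(status: int, requested_ttl: int) -> int:
    """Return the TTL a cache would apply under the assumed capping rule.

    Assumption: all 4xx responses are capped; other statuses keep the TTL
    requested by the application layer.
    """
    if requested_ttl < 0:
        raise ValueError("requested_ttl must not be negative")
    if 400 <= status <= 499:
        return min(requested_ttl, CLIENT_ERROR_TTL_CAP_SECONDS)
    return requested_ttl


# A 404 asking for one hour is held for at most ten minutes.
print(effective_ttl(404, 3600))
# A 404 asking for less than the cap keeps its shorter TTL.
print(effective_ttl(404, 60))
# A 200 is not affected by the cap in this sketch.
print(effective_ttl(200, 86400))
```

Under this rule a 404 cached by an early request to a new hostname clears on its own within the cap, which is why waiting was judged cheaper than working out a ban.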
[21:25:36] 10SRE, 10observability, 10Release-Engineering-Team (Radar), 10Sustainability (Incident Followup), 10Tracking-Neverending: Tracking: Monitoring and alerts for "business" metrics - https://phabricator.wikimedia.org/T140942 (10Krinkle) [21:26:31] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:(Need By: TBD) rack/setup/install thumbor200[56].codfw.wmnet - https://phabricator.wikimedia.org/T290190 (10Papaul) [21:26:50] 10SRE, 10Contributors-Team, 10observability, 10Release-Engineering-Team (Radar), 10Sustainability (Incident Followup): High failure rate of account creation should trigger an alarm / page people - https://phabricator.wikimedia.org/T146090 (10Krinkle) [21:28:28] 10SRE, 10Security-Team, 10observability, 10Sustainability (Incident Followup): icinga notification if elevated writing to badpass.log - https://phabricator.wikimedia.org/T150300 (10Krinkle) [21:31:09] RECOVERY - SSH on analytics1069.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:33:25] (03CR) 10Juan90264: [C: 03+1] Add optimised square logo and wordmark for Wikimania on mobile (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704166 (https://phabricator.wikimedia.org/T286405) (owner: 10Juan90264) [21:35:01] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:(Need By: TBD) rack/setup/install mw241[2-9].codfw.wmnet - https://phabricator.wikimedia.org/T290192 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2002.codfw.wmnet for hosts: ` mw2414.codfw.wmnet ` The log can be found in `/var/... [21:40:38] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:(Need By: TBD) rack/setup/install mw241[2-9].codfw.wmnet - https://phabricator.wikimedia.org/T290192 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2002.codfw.wmnet for hosts: ` mw2415.codfw.wmnet ` The log can be found in `/var/... 
[21:42:27] bd808: great, I guess that's something to keep in mind for future deploys, that it needs to be purged or use unique URLs [21:43:11] (we could also set varnish to not cache anything and pass all requests, I don't think that's the right long-term solution though) [21:45:33] legoktm: yeah... unique urls would be ideal. I'll open a ticket to remind myself to figure that out. [21:46:11] 10SRE, 10Icinga, 10SRE Observability, 10observability, and 2 others: incident 20170323-wikibase did not trigger Icinga paging - https://phabricator.wikimedia.org/T161528 (10Krinkle) [21:46:35] 10SRE, 10Sustainability (Incident Followup): Fix the general problem of randomly-bad puppet agent cron timings within redundant clusters - https://phabricator.wikimedia.org/T161145 (10Krinkle) [21:48:32] 10SRE, 10Toolhub, 10serviceops, 10Patch-For-Review, 10Service-deployment-requests: New Service Request Toolhub - https://phabricator.wikimedia.org/T280881 (10Legoktm) [21:49:23] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2414.codfw.wmnet with reason: REIMAGE [21:49:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:50:25] (03CR) 10Jdlrobson: [C: 03+1] Add optimised square logo and wordmark for Wikimania on mobile (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704166 (https://phabricator.wikimedia.org/T286405) (owner: 10Juan90264) [21:51:23] bd808: is there anything left on the SRE side of things? Or are you still blocked on needing exec access? 
[21:52:07] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on mw2414.codfw.wmnet with reason: REIMAGE [21:52:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:53:18] (03PS5) 10Legoktm: Fix erroneous en-gb translations [core] (wmf/1.38.0-wmf.1) - 10https://gerrit.wikimedia.org/r/723729 (https://phabricator.wikimedia.org/T291717) (owner: 10Zabe) [21:53:31] (03CR) 10Legoktm: [C: 03+2] Fix erroneous en-gb translations [core] (wmf/1.38.0-wmf.1) - 10https://gerrit.wikimedia.org/r/723729 (https://phabricator.wikimedia.org/T291717) (owner: 10Zabe) [21:55:43] (03PS1) 10Jdlrobson: Enable sticky header on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/724514 (https://phabricator.wikimedia.org/T289721) [21:56:02] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2415.codfw.wmnet with reason: REIMAGE [21:56:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:56:35] (03PS12) 10Jdlrobson: Unset logo config rather than set to false [mediawiki-config] - 10https://gerrit.wikimedia.org/r/719619 [21:58:18] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on mw2415.codfw.wmnet with reason: REIMAGE [21:58:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:59:01] (03PS1) 10Thcipriani: fix: scap: remove confusing logstash dashboard link [puppet] - 10https://gerrit.wikimedia.org/r/724515 (https://phabricator.wikimedia.org/T291870) [21:59:38] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:(Need By: TBD) rack/setup/install mw241[2-9].codfw.wmnet - https://phabricator.wikimedia.org/T290192 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2414.codfw.wmnet'] ` and were **ALL** successful. 
[21:59:58] SRE, ops-codfw, DC-Ops, serviceops: Q1:(Need By: TBD) rack/setup/install mw241[2-9].codfw.wmnet - https://phabricator.wikimedia.org/T290192 (ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2002.codfw.wmnet for hosts: ` mw2416.codfw.wmnet ` The log can be found in `/var/...
[22:02:13] SRE, ops-codfw, DC-Ops, serviceops: Q1:(Need By: TBD) rack/setup/install mw241[2-9].codfw.wmnet - https://phabricator.wikimedia.org/T290192 (Papaul)
[22:05:36] legoktm: I have gotten far enough now to see errors from the crawler! This may sound like a bad thing, but it helps. Having exec access would help more, but I'll deal with what I have until that is possible.
[22:06:46] SRE, ops-codfw, DC-Ops, serviceops: Q1:(Need By: TBD) rack/setup/install mw241[2-9].codfw.wmnet - https://phabricator.wikimedia.org/T290192 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2415.codfw.wmnet'] ` and were **ALL** successful.
[22:08:02] SRE, ops-codfw, DC-Ops, serviceops: Q1:(Need By: TBD) rack/setup/install mw241[2-9].codfw.wmnet - https://phabricator.wikimedia.org/T290192 (ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2002.codfw.wmnet for hosts: ` mw2417.codfw.wmnet ` The log can be found in `/var/...
[22:15:21] !log legoktm@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=wcqs
[22:15:23] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2416.codfw.wmnet with reason: REIMAGE
[22:15:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:15:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:16:32] SRE, ops-eqiad, DC-Ops: (Need By: Q2) eqiad: Upgrades of Management Switches - https://phabricator.wikimedia.org/T259758 (Jclark-ctr) racked all switches in row D not cabled or powered yet
[22:17:34] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on mw2416.codfw.wmnet with reason: REIMAGE
[22:17:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:17:57] (Merged) jenkins-bot: Fix erroneous en-gb translations [core] (wmf/1.38.0-wmf.1) - https://gerrit.wikimedia.org/r/723729 (https://phabricator.wikimedia.org/T291717) (owner: Zabe)
[22:19:01] bearloga
[22:20:08] …sigh, I forgot IRCCloud client doesn't have a find text feature *eye roll*
[22:23:30] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2417.codfw.wmnet with reason: REIMAGE
[22:23:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:23:48] !log legoktm@deploy1002 Started scap: Fix erroneous en-gb translations in 1.38.0-wmf.1 (T291717)
[22:23:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:23:54] T291717: Erroneous en-gb translations landed in 1.38.0-wmf.1 - https://phabricator.wikimedia.org/T291717
[22:25:30] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on mw2417.codfw.wmnet with reason: REIMAGE
[22:25:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:27:11] SRE, ops-codfw, DC-Ops, serviceops: Q1:(Need By: TBD) rack/setup/install mw241[2-9].codfw.wmnet - https://phabricator.wikimedia.org/T290192 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2416.codfw.wmnet'] ` and were **ALL** successful.
[22:29:12] SRE, ops-codfw, DC-Ops, serviceops: Q1:(Need By: TBD) rack/setup/install mw241[2-9].codfw.wmnet - https://phabricator.wikimedia.org/T290192 (ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2002.codfw.wmnet for hosts: ` mw2418.codfw.wmnet ` The log can be found in `/var/...
[22:32:17] (PS1) Ebernhardson: wcqs: add discovery record [dns] - https://gerrit.wikimedia.org/r/724520
[22:33:26] (PS2) Ebernhardson: wcqs: add discovery record [dns] - https://gerrit.wikimedia.org/r/724520 (https://phabricator.wikimedia.org/T224324)
[22:34:15] SRE, ops-codfw, DC-Ops, serviceops: Q1:(Need By: TBD) rack/setup/install mw241[2-9].codfw.wmnet - https://phabricator.wikimedia.org/T290192 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2417.codfw.wmnet'] ` and were **ALL** successful.
[22:35:49] (CR) Legoktm: [C: +1] "LGTM" [dns] - https://gerrit.wikimedia.org/r/724520 (https://phabricator.wikimedia.org/T224324) (owner: Ebernhardson)
[22:37:33] (PS3) Ryan Kemper: wcqs: add discovery record [dns] - https://gerrit.wikimedia.org/r/724520 (https://phabricator.wikimedia.org/T282117) (owner: Ebernhardson)
[22:41:32] !log legoktm@deploy1002 Finished scap: Fix erroneous en-gb translations in 1.38.0-wmf.1 (T291717) (duration: 17m 43s)
[22:41:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:41:38] T291717: Erroneous en-gb translations landed in 1.38.0-wmf.1 - https://phabricator.wikimedia.org/T291717
[22:44:37] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2418.codfw.wmnet with reason: REIMAGE
[22:44:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:46:47] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on mw2418.codfw.wmnet with reason: REIMAGE
[22:46:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:48:42] SRE, MediaWiki-General, Sustainability (Incident Followup): Investigate spike in 500s during asw-c2-eqiad replacement - https://phabricator.wikimedia.org/T156475 (Krinkle)
[22:49:34] SRE, SRE-swift-storage, Sustainability (Incident Followup): Increase swift replication factor for accounts - https://phabricator.wikimedia.org/T156136 (Krinkle)
[22:49:37] SRE, Traffic-Icebox, Sustainability (Incident Followup): Investigate varnishd child crashes when multiple nodes get depooled/pooled concurrently - https://phabricator.wikimedia.org/T154801 (Krinkle)
[22:51:49] SRE, Cloud-Services, Sustainability (Incident Followup): Determine appropriate proxy_read_timeout setting for Tools Proxy - https://phabricator.wikimedia.org/T163393 (Krinkle)
[22:55:23] SRE, ops-codfw, DC-Ops, serviceops: Q1:(Need By: TBD) rack/setup/install mw241[2-9].codfw.wmnet - https://phabricator.wikimedia.org/T290192 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2418.codfw.wmnet'] ` and were **ALL** successful.
[22:55:44] SRE, ops-codfw, DC-Ops, serviceops: Q1:(Need By: TBD) rack/setup/install mw241[2-9].codfw.wmnet - https://phabricator.wikimedia.org/T290192 (ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2002.codfw.wmnet for hosts: ` mw2419.codfw.wmnet ` The log can be found in `/var/...
[23:00:04] RoanKattouw, Niharika, and Urbanecm: My dear minions, it's time we take the moon! Just kidding. Time for Evening backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210928T2300).
[23:00:05] No Gerrit patches in the queue for this window AFAICS.
[23:09:55] (PS4) Ryan Kemper: wcqs: add discovery record [dns] - https://gerrit.wikimedia.org/r/724520 (https://phabricator.wikimedia.org/T282117) (owner: Ebernhardson)
[23:11:09] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2419.codfw.wmnet with reason: REIMAGE
[23:11:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:11:38] (CR) Ryan Kemper: [C: +2] wcqs: add discovery record [dns] - https://gerrit.wikimedia.org/r/724520 (https://phabricator.wikimedia.org/T282117) (owner: Ebernhardson)
[23:13:20] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on mw2419.codfw.wmnet with reason: REIMAGE
[23:13:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:14:13] !log T282117 `ryankemper@authdns1001:~$ sudo -i authdns-update` following merge of https://gerrit.wikimedia.org/r/724520
[23:14:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:14:19] T282117: WCQS needs to be exposed through a wikimedia.org domain - https://phabricator.wikimedia.org/T282117
[23:14:56] !log !log T282117 `error: plugin_geoip: Invalid resource name 'disc-wcqs' detected from zonefile lookup` We must be missing a line, reverting change to fix
[23:15:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:15:11] PROBLEM - snapshot of s4 in codfw on alert1001 is CRITICAL: Last snapshot for s4 at codfw (db2139.codfw.wmnet:3314) taken on 2021-09-28 21:19:24 is 1531 GB, but previous one was 1803 GB, a change of 15.1% https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting
[23:15:24] (PS1) Ryan Kemper: Revert "wcqs: add discovery record" [dns] - https://gerrit.wikimedia.org/r/724546
[23:17:55] (CR) Ryan Kemper: [C: +2] Revert "wcqs: add discovery record" [dns] - https://gerrit.wikimedia.org/r/724546 (owner: Ryan Kemper)
[23:21:51] SRE, ops-codfw, DC-Ops, serviceops: Q1:(Need By: TBD) rack/setup/install mw241[2-9].codfw.wmnet - https://phabricator.wikimedia.org/T290192 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2419.codfw.wmnet'] ` and were **ALL** successful.
[23:24:04] SRE, ops-codfw, DC-Ops, serviceops: Q1:(Need By: TBD) rack/setup/install mw241[2-9].codfw.wmnet - https://phabricator.wikimedia.org/T290192 (Papaul)
[23:26:50] (PS1) Ryan Kemper: wcqs: state change: lvs_setup -> monitoring_setup [puppet] - https://gerrit.wikimedia.org/r/724533 (https://phabricator.wikimedia.org/T280001)
[23:27:27] (CR) Ryan Kemper: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/724533 (https://phabricator.wikimedia.org/T280001) (owner: Ryan Kemper)
[23:34:32] (PS2) Ryan Kemper: wcqs: state change: lvs_setup -> monitoring_setup [puppet] - https://gerrit.wikimedia.org/r/724533 (https://phabricator.wikimedia.org/T280001)
[23:34:40] (CR) Ryan Kemper: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/724533 (https://phabricator.wikimedia.org/T280001) (owner: Ryan Kemper)
[23:37:12] (CR) Ryan Kemper: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/724533 (https://phabricator.wikimedia.org/T280001) (owner: Ryan Kemper)
[23:41:25] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:42:59] (PS3) Ryan Kemper: wcqs: state change: lvs_setup -> monitoring_setup [puppet] - https://gerrit.wikimedia.org/r/724533 (https://phabricator.wikimedia.org/T280001)
[23:43:18] (CR) Ryan Kemper: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31355/console" [puppet] - https://gerrit.wikimedia.org/r/724533 (https://phabricator.wikimedia.org/T280001) (owner: Ryan Kemper)
[23:43:55] (CR) Ryan Kemper: [C: +2] wcqs: state change: lvs_setup -> monitoring_setup [puppet] - https://gerrit.wikimedia.org/r/724533 (https://phabricator.wikimedia.org/T280001) (owner: Ryan Kemper)
[23:45:15] !log T280001 Changing wcqs state from `lvs_setup` to `monitoring_setup`: `ryankemper@cumin1001:~$ sudo cumin 'A:icinga' 'run-puppet-agent'`
[23:45:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:45:22] T280001: Set up puppet configuration for new WCQS cluster - https://phabricator.wikimedia.org/T280001
[23:49:39] !log T280001New icinga alerts showing up as expected following wcqs state change to `monitoring_setup`: `LVS wcqs codfw port 443/tcp - Wikimedia Commons Query Service IPv4` and `LVS wcqs eqiad port 443/tcp - Wikimedia Commons Query Service IPv4`
[23:49:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:49:49] typo, deleting that line from SLA
[23:49:50] SAL*
[23:49:52] !log T280001 New icinga alerts showing up as expected following wcqs state change to `monitoring_setup`: `LVS wcqs codfw port 443/tcp - Wikimedia Commons Query Service IPv4` and `LVS wcqs eqiad port 443/tcp - Wikimedia Commons Query Service IPv4`
[23:49:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:52:44] SRE, Patch-For-Review, Sustainability (Incident Followup): More verbose messages from service-checker-swagger - https://phabricator.wikimedia.org/T150560 (Krinkle)
[23:53:05] !log T280001 New icinga checks are green, will proceed to next step of moving wcqs state from `monitoring_setup` -> `production_setup`
[23:53:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:53:12] T280001: Set up puppet configuration for new WCQS cluster - https://phabricator.wikimedia.org/T280001
[23:53:42] !log T280001 New icinga checks are green, will proceed to next step of moving wcqs state from `monitoring_setup` -> `production`
[23:53:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:54:09] SRE, Patch-For-Review, Sustainability (Incident Followup): More verbose messages from service-checker-swagger - https://phabricator.wikimedia.org/T150560 (Krinkle) a:Joe Can this be resolved, or is there more to be done? (Haven't looked, triaging a large number of tasks. asking you as the one las...
[23:54:19] SRE, Elasticsearch, SRE Observability, Wikimedia-Logstash, and 2 others: logs sent to logstash are lost when the elasticsearch cirrus cluster is unavailable - https://phabricator.wikimedia.org/T176335 (Krinkle)
[23:55:44] (PS1) Ryan Kemper: wcqs: state change: monitoring_setup -> production [puppet] - https://gerrit.wikimedia.org/r/724536 (https://phabricator.wikimedia.org/T280001)
[23:56:23] (CR) jerkins-bot: [V: -1] wcqs: state change: monitoring_setup -> production [puppet] - https://gerrit.wikimedia.org/r/724536 (https://phabricator.wikimedia.org/T280001) (owner: Ryan Kemper)
[23:57:03] (PS2) Ryan Kemper: wcqs: state change: monitoring_setup -> production [puppet] - https://gerrit.wikimedia.org/r/724536 (https://phabricator.wikimedia.org/T280001)
[23:59:23] (PS1) Ryan Kemper: wcqs: add discovery record [dns] - https://gerrit.wikimedia.org/r/724538 (https://phabricator.wikimedia.org/T282117)
[23:59:41] (CR) Ryan Kemper: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/724536 (https://phabricator.wikimedia.org/T280001) (owner: Ryan Kemper)