[00:00:05] twentyafterfour: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Phabricator update. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210923T0000). [00:01:44] (03PS1) 10Dzahn: mail::mx: fix typo in command name for mail-exim-aliases [puppet] - 10https://gerrit.wikimedia.org/r/723011 [00:01:55] PROBLEM - Check systemd state on mx2001 is CRITICAL: CRITICAL - degraded: The following units failed: mail-exim-aliases.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:02:09] is it possible to run a python app that uses threading on toolforge? I tried and I saw a message in the logs that said threading is disabled for performance reasons. [00:02:12] (03CR) 10Dzahn: [C: 03+2] mail::mx: fix typo in command name for mail-exim-aliases [puppet] - 10https://gerrit.wikimedia.org/r/723011 (owner: 10Dzahn) [00:02:19] grr wrong channel [00:02:27] (03CR) 10Catrope: [C: 03+2] "This change is ready for review." [extensions/MediaSearch] (wmf/1.38.0-wmf.1) - 10https://gerrit.wikimedia.org/r/722990 (https://phabricator.wikimedia.org/T291590) (owner: 10Catrope) [00:03:10] OK now we have to wait for CI, which might be a longer wait than for beta lol [00:04:17] I'll keep checking beta in the mean time [00:07:59] RECOVERY - Check systemd state on mx2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:16:07] PROBLEM - Elevated latency for icinga checks in codfw on alert1001 is CRITICAL: cluster=alerting instance=alert2001 job=icinga site=codfw https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/rsCfQfuZz/icinga [00:18:00] (03CR) 10Dzahn: "systemctl start mail-exim-aliases also works now, though there is a short delay from starting the service to actually receiving mail, unl" [puppet] - 10https://gerrit.wikimedia.org/r/723011 (owner: 10Dzahn) [00:18:54] So Special:Version for 
BetaCommons now has MediaSearch as being at https://gerrit.wikimedia.org/g/mediawiki/extensions/MediaSearch/+/9c0ef7516797a5e96ed914e1a19b3ede051e05d4 [00:19:31] but I'm still seeing the messed up text in the UI. I don't know if there is a caching thing (I've cleared everything locally) or if the change is not yet live on beta [00:21:27] looks like &action=purge did it [00:21:36] (03PS10) 10Jdlrobson: Unset logo config rather than set to false [mediawiki-config] - 10https://gerrit.wikimedia.org/r/719619 [00:21:46] (03PS1) 10Dzahn: mail::mx: switch recipient for alias file from dzahn to ITS [puppet] - 10https://gerrit.wikimedia.org/r/723013 (https://phabricator.wikimedia.org/T273673) [00:21:47] I can confirm that Beta commons now has correct UI messages for Special Mediasearch [00:21:55] (03Merged) 10jenkins-bot: Use text() instead of parse() for MediaSearch UI messages [extensions/MediaSearch] (wmf/1.38.0-wmf.1) - 10https://gerrit.wikimedia.org/r/722990 (https://phabricator.wikimedia.org/T291590) (owner: 10Catrope) [00:22:29] (03PS2) 10Jdlrobson: Hiding fallback button depends on HTML order [core] (wmf/1.38.0-wmf.1) - 10https://gerrit.wikimedia.org/r/722988 (https://phabricator.wikimedia.org/T291272) [00:23:21] https://commons.wikimedia.beta.wmflabs.org/w/index.php?search=cat&title=Special:MediaSearch&type=image [00:23:41] (03CR) 10Dzahn: [C: 03+2] mail::mx: switch recipient for alias file from dzahn to ITS [puppet] - 10https://gerrit.wikimedia.org/r/723013 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn) [00:27:28] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [00:27:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:29:27] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[00:29:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:31:53] @RoanKattouw looks like that patch is good to go, it restored the correct behavior on Beta
[00:32:10] Thanks for the ping
[00:32:20] CI took so long to merge that I got distracted with another task
[00:32:32] ha! I've been there
[00:34:08] I've confirmed the fix on the test server too, deploying
[00:34:11] RECOVERY - Elevated latency for icinga checks in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/rsCfQfuZz/icinga
[00:35:57] !log catrope@deploy1002 Synchronized php-1.38.0-wmf.1/extensions/MediaSearch/: Use text() instead of parse() for MediaSearch UI messages (T291590) (duration: 01m 08s)
[00:36:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:36:03] T291590: DiscussionTools markup appears in dropdowns on Special:MediaSearch - https://phabricator.wikimedia.org/T291590
[00:36:19] Deployed!
[00:51:32] \o/
[01:05:53] MatmaRex: is there a task for the RL msg() issue, or ok if I make one?
[01:06:39] Krinkle: there isn't
[01:07:09] i'm not sure if i actually understand myself what the issue is, exactly. so if you do, please file a task :D
[01:08:09] We call inLanguage() there so that we get the language from RL context instead of global wgUser/wgLang/RequestContext:main, which is disabled on static contexts like load.php, but otherwise it'd be the same as wfMessage() with implied user language elsewhere.
[01:08:15] so calling setInterfaceMessageFlag() would be fine for this case
[01:08:35] I don't yet know whether that's fine to just do always and get rid of the inLanguage() code unsetting it
[01:09:06] but at least here it'd be safe I believe and likely matches what we did previously when we didn't yet use this approach to setting the language for RL
[01:12:34] (CR) Dzahn: "https://phabricator.wikimedia.org/T291506" [puppet] - https://gerrit.wikimedia.org/r/714624 (https://phabricator.wikimedia.org/T280247) (owner: Ebernhardson)
[01:15:17] ah, that makes sense
[01:15:48] i'm off, see you all
[01:16:19] SRE, Traffic, Wikidata, Wikidata-Campsite, and 2 others: Fix broken https at https://query.commons.wikimedia.org/ - https://phabricator.wikimedia.org/T291542 (Dzahn) caused by https://gerrit.wikimedia.org/r/c/operations/puppet/+/714624 ?
[01:16:28] SRE, Traffic, Wikidata, Wikidata-Campsite, and 2 others: Fix broken https at https://query.commons.wikimedia.org/ - https://phabricator.wikimedia.org/T291542 (Dzahn) It definitely should not be 2 levels of subdomain. That won't be covered by the cert and would explain the error. That being said,...
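The T291542 comment above says a two-level subdomain would not be covered by the cert. A minimal sketch of why, assuming the certificate carries a wildcard name such as `*.wikimedia.org` (the real SAN list is not shown in this log, and `wildcard_matches` is a hypothetical helper, not part of any Wikimedia tooling). Under the usual TLS hostname rule, `*` stands for exactly one left-most DNS label:

```python
def wildcard_matches(pattern: str, hostname: str) -> bool:
    """Return True if hostname matches pattern, where a leading '*'
    may stand for exactly one left-most DNS label."""
    p_labels = pattern.lower().split(".")
    h_labels = hostname.lower().rstrip(".").split(".")
    # A wildcard never spans several labels, so label counts must agree.
    if len(p_labels) != len(h_labels):
        return False
    if p_labels[0] == "*":
        # The wildcard label must be non-empty; the rest must match exactly.
        return bool(h_labels[0]) and p_labels[1:] == h_labels[1:]
    return p_labels == h_labels


# Assumed wildcard name, for illustration only.
one_level = wildcard_matches("*.wikimedia.org", "commons.wikimedia.org")
two_levels = wildcard_matches("*.wikimedia.org", "query.commons.wikimedia.org")
```

With that assumed wildcard, `one_level` is true and `two_levels` is false, which is consistent with the HTTPS error described in the task.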
[01:22:17] PROBLEM - SSH on analytics1069.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [01:24:56] (03CR) 10Ppchelko: ratelimit: load environment variables file in entrypoint (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/722914 (https://phabricator.wikimedia.org/T254917) (owner: 10Hnowlan) [02:05:05] PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: generate_os_reports.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:06:30] !log Deployed patch for T291600 [02:06:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:23:15] RECOVERY - SSH on analytics1069.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [02:52:08] (03CR) 10Huji: Temporarily disable article editing by anonymous users on fawiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/721108 (https://phabricator.wikimedia.org/T291018) (owner: 10Huji) [02:52:11] (03PS7) 10Huji: Temporarily disable article editing by anonymous users on fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/721108 (https://phabricator.wikimedia.org/T291018) [03:21:35] (03CR) 10RLazarus: [C: 03+1] "Just a nit, LGTM" [deployment-charts] - 10https://gerrit.wikimedia.org/r/722631 (owner: 10Giuseppe Lavagetto) [03:37:41] (03CR) 10Albertoleoncio: Temporarily disable article editing by anonymous users on fawiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/721108 (https://phabricator.wikimedia.org/T291018) (owner: 10Huji) [05:02:52] 10SRE, 10Traffic, 10Wikidata, 10Wikidata-Campsite, and 2 others: Fix broken https at https://query.commons.wikimedia.org/ - https://phabricator.wikimedia.org/T291542 (10Marostegui) p:05Triage→03High [05:24:50] !log Optimize ruwiki.logging 
on codfw T286102 [05:24:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:24:56] T286102: Please optimize logging table in ruwiki - https://phabricator.wikimedia.org/T286102 [05:38:54] (03CR) 10Marostegui: [C: 03+1] "+1 as discussed via IRC, this will still get everyone but DBAs in RO mode for Orchestrator" [puppet] - 10https://gerrit.wikimedia.org/r/722884 (https://phabricator.wikimedia.org/T289779) (owner: 10Jbond) [06:10:47] (03CR) 10Elukey: [C: 03+1] Enable the kerberos auto-renew service for stat nodes [puppet] - 10https://gerrit.wikimedia.org/r/722352 (https://phabricator.wikimedia.org/T268985) (owner: 10Btullis) [06:30:49] (03PS2) 10Muehlenhoff: Disable the "long running screen/tmux session" check by default [puppet] - 10https://gerrit.wikimedia.org/r/712123 (https://phabricator.wikimedia.org/T288028) [06:37:22] (03PS1) 10Elukey: knative-serving: add wikimedia.org as default domain and improve secrets [deployment-charts] - 10https://gerrit.wikimedia.org/r/723035 [06:38:13] RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:39:56] (03PS2) 10Elukey: knative-serving: add wikimedia.org as default domain and improve secrets [deployment-charts] - 10https://gerrit.wikimedia.org/r/723035 [06:42:59] (03PS1) 10Muehlenhoff: os-reports: Make the report directory configurable via the .cfg file [puppet] - 10https://gerrit.wikimedia.org/r/723036 [06:48:34] (03PS3) 10Elukey: knative-serving: add wikimedia.org as default domain and improve secrets [deployment-charts] - 10https://gerrit.wikimedia.org/r/723035 [06:49:02] (03CR) 10Muehlenhoff: [C: 03+2] os-reports: Make the report directory configurable via the .cfg file [puppet] - 10https://gerrit.wikimedia.org/r/723036 (owner: 10Muehlenhoff) [06:51:33] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/722884 
(https://phabricator.wikimedia.org/T289779) (owner: 10Jbond) [06:53:12] !log Upgrade db2085, db2088 and db2092 [06:53:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:53:38] (03CR) 10Elukey: [C: 03+2] knative-serving: add wikimedia.org as default domain and improve secrets [deployment-charts] - 10https://gerrit.wikimedia.org/r/723035 (owner: 10Elukey) [06:55:36] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [06:55:37] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [06:55:40] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. [06:55:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:55:41] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. [06:55:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:55:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:55:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:56:16] !log Upgrade db2116 [06:56:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:57:25] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. [06:57:27] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. 
[06:57:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:57:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:57:35] (PS1) Marostegui: db2103: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/723038 (https://phabricator.wikimedia.org/T290865)
[06:58:08] (CR) Marostegui: [C: +2] db2103: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/723038 (https://phabricator.wikimedia.org/T290865) (owner: Marostegui)
[06:59:34] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'.
[06:59:37] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'.
[06:59:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:59:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:01:40] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'.
[07:01:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:01:54] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'.
[07:01:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:10:32] !log running `mwscript extensions/GrowthExperiments/maintenance/fixLinkRecommendationData.php --wiki=$WIKI --search-index --db-table --statsd` for growthexperiments.dblist wikis [07:10:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:11:36] (03PS1) 10Elukey: helmfile.d: fix revscoring-editquality secret config [deployment-charts] - 10https://gerrit.wikimedia.org/r/723039 [07:19:57] PROBLEM - Elevated latency for icinga checks in codfw on alert1001 is CRITICAL: cluster=alerting instance=alert2001 job=icinga site=codfw https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/rsCfQfuZz/icinga [07:20:18] (03CR) 10Elukey: [C: 03+2] helmfile.d: fix revscoring-editquality secret config [deployment-charts] - 10https://gerrit.wikimedia.org/r/723039 (owner: 10Elukey) [07:35:32] RECOVERY - Elevated latency for icinga checks in codfw on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/rsCfQfuZz/icinga [07:39:09] (03PS1) 10Legoktm: Rename $wmgUseGeSHi to $wmgUseSyntaxHighlight (1/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/723046 [07:39:12] (03PS1) 10Legoktm: Rename $wmgUseGeSHi to $wmgUseSyntaxHighlight (2/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/723047 [07:39:13] (03PS1) 10Legoktm: Rename $wmgUseGeSHi to $wmgUseSyntaxHighlight (3/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/723048 [07:39:15] (03PS1) 10Legoktm: Have SyntaxHighlight use Shellbox service on group0 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/723049 (https://phabricator.wikimedia.org/T289227) [07:39:18] (03PS1) 10Legoktm: Have PdfHandler use Shellbox service on group0 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/723050 (https://phabricator.wikimedia.org/T289228) [07:39:20] (03PS1) 10Legoktm: Only set tiff settings when $wmgUsePagedTiffHandler = true [mediawiki-config] - 10https://gerrit.wikimedia.org/r/723051 [07:39:22] (03PS1) 10Legoktm: Have PagedTiffHandler use Shellbox service on group0 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/723052 (https://phabricator.wikimedia.org/T289228) [07:47:09] (03CR) 10Legoktm: [C: 03+2] Rename $wmgUseGeSHi to $wmgUseSyntaxHighlight (1/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/723046 (owner: 10Legoktm) [07:47:55] (03Merged) 10jenkins-bot: Rename $wmgUseGeSHi to $wmgUseSyntaxHighlight (1/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/723046 (owner: 10Legoktm) [07:49:32] (03CR) 10David Caro: P:base: move production specific code to there own profile (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/714975 (https://phabricator.wikimedia.org/T289661) (owner: 10Jbond) [07:49:49] (03CR) 10Legoktm: [C: 03+2] Rename $wmgUseGeSHi to $wmgUseSyntaxHighlight (2/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/723047 (owner: 10Legoktm) 
[07:49:49] !log legoktm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Rename $wmgUseGeSHi to $wmgUseSyntaxHighlight (1/3) (duration: 01m 06s) [07:49:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:50:45] (03Merged) 10jenkins-bot: Rename $wmgUseGeSHi to $wmgUseSyntaxHighlight (2/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/723047 (owner: 10Legoktm) [07:51:55] (03CR) 10Legoktm: [C: 03+2] Rename $wmgUseGeSHi to $wmgUseSyntaxHighlight (3/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/723048 (owner: 10Legoktm) [07:52:40] (03Merged) 10jenkins-bot: Rename $wmgUseGeSHi to $wmgUseSyntaxHighlight (3/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/723048 (owner: 10Legoktm) [07:52:49] !log legoktm@deploy1002 Synchronized wmf-config/CommonSettings.php: Rename $wmgUseGeSHi to $wmgUseSyntaxHighlight (2/3) (duration: 01m 05s) [07:52:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:54:42] !log legoktm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Rename $wmgUseGeSHi to $wmgUseSyntaxHighlight (3/3) (duration: 01m 05s) [07:54:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:55:25] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "fix yaml formatting; Apart from that LGTM" [deployment-charts] - 10https://gerrit.wikimedia.org/r/723007 (https://phabricator.wikimedia.org/T291447) (owner: 10BryanDavis) [07:55:28] (03CR) 10Legoktm: [C: 03+2] Have SyntaxHighlight use Shellbox service on group0 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/723049 (https://phabricator.wikimedia.org/T289227) (owner: 10Legoktm) [07:56:15] (03Merged) 10jenkins-bot: Have SyntaxHighlight use Shellbox service on group0 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/723049 (https://phabricator.wikimedia.org/T289227) (owner: 10Legoktm) [07:57:22] (03CR) 10Giuseppe Lavagetto: [C: 03+1] toolhub: Add no_proxy envvar [deployment-charts] - 
10https://gerrit.wikimedia.org/r/723006 (https://phabricator.wikimedia.org/T291447) (owner: 10BryanDavis) [08:01:23] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [08:01:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:02:54] !log legoktm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Have SyntaxHighlight use Shellbox service on group0 wikis (1/2) (T289227) (duration: 01m 06s) [08:02:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:03:00] T289227: Convert SyntaxHighlight to use Shellbox - https://phabricator.wikimedia.org/T289227 [08:04:16] !log legoktm@deploy1002 Synchronized wmf-config/CommonSettings.php: Have SyntaxHighlight use Shellbox service on group0 wikis (2/2) (T289227) (duration: 01m 05s) [08:04:17] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [08:04:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:04:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:07:03] (03CR) 10Nikerabbit: Add support for SectionTranslationTargetLanguages (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/720982 (https://phabricator.wikimedia.org/T290302) (owner: 10KartikMistry) [08:10:19] (03CR) 10KartikMistry: Add support for SectionTranslationTargetLanguages (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/720982 (https://phabricator.wikimedia.org/T290302) (owner: 10KartikMistry) [08:11:06] (03CR) 10David Caro: P:base: move production specific code to there own profile (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/714975 (https://phabricator.wikimedia.org/T289661) (owner: 10Jbond) [08:12:36] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[08:12:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:12:59] (03PS6) 10KartikMistry: Add support for SectionTranslationTargetLanguages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/720982 (https://phabricator.wikimedia.org/T290302) [08:16:09] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [08:16:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:18:12] (03CR) 10Muehlenhoff: P:base: move production specific code to there own profile (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/714975 (https://phabricator.wikimedia.org/T289661) (owner: 10Jbond) [08:33:13] (03PS1) 10Elukey: helmfile.d: replace _ in release name for revscoring-editquality [deployment-charts] - 10https://gerrit.wikimedia.org/r/723062 [08:37:28] (03CR) 10Jbond: O:base::resolving: make nameservers mandatory (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/690529 (https://phabricator.wikimedia.org/T171498) (owner: 10Jbond) [08:41:12] (03PS24) 10Jbond: O:base::resolver: unify resolv.conf templates [puppet] - 10https://gerrit.wikimedia.org/r/690522 (https://phabricator.wikimedia.org/T171498) [08:41:37] (03CR) 10Giuseppe Lavagetto: [C: 03+1] Update eventgate helmfile.d for eventgate 0.5 chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/722935 (https://phabricator.wikimedia.org/T291504) (owner: 10Ppchelko) [08:41:44] (03CR) 10Jbond: O:base::resolver: unify resolv.conf templates (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/690522 (https://phabricator.wikimedia.org/T171498) (owner: 10Jbond) [08:41:46] (03CR) 10jerkins-bot: [V: 04-1] O:base::resolver: unify resolv.conf templates [puppet] - 10https://gerrit.wikimedia.org/r/690522 (https://phabricator.wikimedia.org/T171498) (owner: 10Jbond) [08:53:36] (03CR) 10Jbond: O:base::resolver: unify resolv.conf templates (031 comment) [puppet] - 
10https://gerrit.wikimedia.org/r/690522 (https://phabricator.wikimedia.org/T171498) (owner: 10Jbond) [08:59:03] (03PS1) 10Muehlenhoff: d-i: Stop using the udebs from sid for the bullseye config [puppet] - 10https://gerrit.wikimedia.org/r/723064 [08:59:09] (03CR) 10Jbond: resolvconf: create new class (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/691080 (https://phabricator.wikimedia.org/T171498) (owner: 10Jbond) [09:00:27] (03PS2) 10Elukey: helmfile.d: replace _ in release name for revscoring-editquality [deployment-charts] - 10https://gerrit.wikimedia.org/r/723062 [09:02:59] (03CR) 10Giuseppe Lavagetto: _tls_helpers: bump to envoy config v3 api (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/722631 (owner: 10Giuseppe Lavagetto) [09:03:01] (03CR) 10Muehlenhoff: [C: 03+2] d-i: Stop using the udebs from sid for the bullseye config [puppet] - 10https://gerrit.wikimedia.org/r/723064 (owner: 10Muehlenhoff) [09:03:13] (03PS3) 10Giuseppe Lavagetto: _tls_helpers: bump to envoy config v3 api [deployment-charts] - 10https://gerrit.wikimedia.org/r/722631 [09:08:50] (03CR) 10Elukey: [C: 03+2] helmfile.d: replace _ in release name for revscoring-editquality [deployment-charts] - 10https://gerrit.wikimedia.org/r/723062 (owner: 10Elukey) [09:14:45] (03PS17) 10Jbond: O:base::resolving: drop the domain keyword and use the domain fact [puppet] - 10https://gerrit.wikimedia.org/r/690515 (https://phabricator.wikimedia.org/T171498) [09:14:47] (03PS23) 10Jbond: O:base::resolving: make nameservers mandatory [puppet] - 10https://gerrit.wikimedia.org/r/690529 (https://phabricator.wikimedia.org/T171498) [09:14:49] (03PS25) 10Jbond: O:base::resolver: unify resolv.conf templates [puppet] - 10https://gerrit.wikimedia.org/r/690522 (https://phabricator.wikimedia.org/T171498) [09:14:51] (03PS16) 10Jbond: resolvconf: create new class [puppet] - 10https://gerrit.wikimedia.org/r/691080 (https://phabricator.wikimedia.org/T171498) [09:14:53] (03PS16) 10Jbond: 
base::resolving: convert base::resolving to a profile [puppet] - 10https://gerrit.wikimedia.org/r/717241 (https://phabricator.wikimedia.org/T289661) [09:15:38] (03CR) 10Giuseppe Lavagetto: [C: 03+2] _tls_helpers: bump to envoy config v3 api [deployment-charts] - 10https://gerrit.wikimedia.org/r/722631 (owner: 10Giuseppe Lavagetto) [09:22:12] (03Merged) 10jenkins-bot: _tls_helpers: bump to envoy config v3 api [deployment-charts] - 10https://gerrit.wikimedia.org/r/722631 (owner: 10Giuseppe Lavagetto) [09:23:30] (03CR) 10Hnowlan: ratelimit: load environment variables file in entrypoint (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/722914 (https://phabricator.wikimedia.org/T254917) (owner: 10Hnowlan) [09:25:46] (03PS26) 10Jbond: O:base::resolver: unify resolv.conf templates [puppet] - 10https://gerrit.wikimedia.org/r/690522 (https://phabricator.wikimedia.org/T171498) [09:25:50] (03CR) 10Jbond: O:base::resolver: unify resolv.conf templates (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/690522 (https://phabricator.wikimedia.org/T171498) (owner: 10Jbond) [09:26:03] (03PS17) 10Jbond: resolvconf: create new class [puppet] - 10https://gerrit.wikimedia.org/r/691080 (https://phabricator.wikimedia.org/T171498) [09:26:45] (03CR) 10jerkins-bot: [V: 04-1] O:base::resolver: unify resolv.conf templates [puppet] - 10https://gerrit.wikimedia.org/r/690522 (https://phabricator.wikimedia.org/T171498) (owner: 10Jbond) [09:27:15] (03CR) 10jerkins-bot: [V: 04-1] resolvconf: create new class [puppet] - 10https://gerrit.wikimedia.org/r/691080 (https://phabricator.wikimedia.org/T171498) (owner: 10Jbond) [09:29:37] !log jiji@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'tegola-vector-tiles' for release 'main' . [09:29:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:30:17] !log jiji@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'tegola-vector-tiles' for release 'main' . 
[09:30:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:31:40] !log jiji@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'tegola-vector-tiles' for release 'main' . [09:31:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:34:55] (03PS18) 10Jbond: O:base::resolving: drop the domain keyword and use the domain fact [puppet] - 10https://gerrit.wikimedia.org/r/690515 (https://phabricator.wikimedia.org/T171498) [09:34:57] (03PS24) 10Jbond: O:base::resolving: make nameservers mandatory [puppet] - 10https://gerrit.wikimedia.org/r/690529 (https://phabricator.wikimedia.org/T171498) [09:34:59] (03PS27) 10Jbond: O:base::resolver: unify resolv.conf templates [puppet] - 10https://gerrit.wikimedia.org/r/690522 (https://phabricator.wikimedia.org/T171498) [09:35:02] (03PS18) 10Jbond: resolvconf: create new class [puppet] - 10https://gerrit.wikimedia.org/r/691080 (https://phabricator.wikimedia.org/T171498) [09:35:03] (03PS17) 10Jbond: base::resolving: convert base::resolving to a profile [puppet] - 10https://gerrit.wikimedia.org/r/717241 (https://phabricator.wikimedia.org/T289661) [09:36:30] (03CR) 10jerkins-bot: [V: 04-1] O:base::resolver: unify resolv.conf templates [puppet] - 10https://gerrit.wikimedia.org/r/690522 (https://phabricator.wikimedia.org/T171498) (owner: 10Jbond) [09:37:00] (03CR) 10jerkins-bot: [V: 04-1] resolvconf: create new class [puppet] - 10https://gerrit.wikimedia.org/r/691080 (https://phabricator.wikimedia.org/T171498) (owner: 10Jbond) [09:37:09] (03PS1) 10Effie Mouzeli: tegola-vector-tiles: bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/723068 [09:40:57] !log reinstalling mx2002 (test server) to validate bullseye installs are fixed [09:41:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:43:14] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31225/console" [puppet] - 10https://gerrit.wikimedia.org/r/717241 (https://phabricator.wikimedia.org/T289661) (owner: 10Jbond) [09:44:16] (03CR) 10Effie Mouzeli: [C: 03+2] tegola-vector-tiles: bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/723068 (owner: 10Effie Mouzeli) [09:44:39] (03CR) 10Jelto: [C: 03+2] "lgtm. I tested it as well on gitlab-test.wmcloud.org and gitlab2001." [gitlab-ansible] - 10https://gerrit.wikimedia.org/r/714382 (https://phabricator.wikimedia.org/T288392) (owner: 10Jbond) [09:45:40] PROBLEM - Exim SMTP on mx2002 is CRITICAL: connect to address 208.80.153.72 and port 25: Connection refused https://wikitech.wikimedia.org/wiki/Mail%23Troubleshooting [09:46:22] (03PS28) 10Jbond: O:base::resolver: unify resolv.conf templates [puppet] - 10https://gerrit.wikimedia.org/r/690522 (https://phabricator.wikimedia.org/T171498) [09:47:46] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31226/console" [puppet] - 10https://gerrit.wikimedia.org/r/717241 (https://phabricator.wikimedia.org/T289661) (owner: 10Jbond) [09:48:38] (03Merged) 10jenkins-bot: tegola-vector-tiles: bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/723068 (owner: 10Effie Mouzeli) [09:50:08] ^ mx2002 is me, will recover in a bit [09:51:04] !log jiji@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'tegola-vector-tiles' for release 'main' . 
[09:51:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:52:17] !log jmm@cumin2002 START - Cookbook sre.puppet.renew-cert for mx2002.wikimedia.org: Renew puppet certificate - jmm@cumin2002
[09:52:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:52:32] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.renew-cert (exit_code=0) for mx2002.wikimedia.org: Renew puppet certificate - jmm@cumin2002
[09:52:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:52:55] (PS19) Jbond: resolvconf: create new class [puppet] - https://gerrit.wikimedia.org/r/691080 (https://phabricator.wikimedia.org/T171498)
[09:55:02] (PS20) Jbond: resolvconf: create new class [puppet] - https://gerrit.wikimedia.org/r/691080 (https://phabricator.wikimedia.org/T171498)
[09:55:04] (PS18) Jbond: base::resolving: convert base::resolving to a profile [puppet] - https://gerrit.wikimedia.org/r/717241 (https://phabricator.wikimedia.org/T289661)
[09:55:11] !log jiji@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'tegola-vector-tiles' for release 'main' .
[09:55:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:56:02] (CR) jerkins-bot: [V: -1] resolvconf: create new class [puppet] - https://gerrit.wikimedia.org/r/691080 (https://phabricator.wikimedia.org/T171498) (owner: Jbond)
[09:56:32] PROBLEM - Check systemd state on mx2002 is CRITICAL: CHECK_NRPE: Error - Could not connect to 208.80.153.72: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:56:44] (PS19) Jbond: base::resolving: convert base::resolving to a profile [puppet] - https://gerrit.wikimedia.org/r/717241 (https://phabricator.wikimedia.org/T289661)
[09:56:44] PROBLEM - spamassassin on mx2002 is CRITICAL: CHECK_NRPE: Error - Could not connect to 208.80.153.72: Connection reset by peer https://wikitech.wikimedia.org/wiki/Mail%23SpamAssassin
[09:57:10] (CR) Jbond: base::resolving: convert base::resolving to a profile (2 comments) [puppet] - https://gerrit.wikimedia.org/r/717241 (https://phabricator.wikimedia.org/T289661) (owner: Jbond)
[09:57:53] (PS1) Mvolz: Update zotero to "760d6cae" [deployment-charts] - https://gerrit.wikimedia.org/r/723070
[09:58:13] (PS2) Mvolz: Update zotero to 760d6cae [deployment-charts] - https://gerrit.wikimedia.org/r/723070
[09:59:18] !log jiji@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'tegola-vector-tiles' for release 'main' .
[09:59:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:00:05] mvolz: That opportune time is upon us again. Time for a Services – Citoid / Zotero deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210923T1000).
[10:02:00] (03PS20) 10Jbond: base::resolving: convert base::resolving to a profile [puppet] - 10https://gerrit.wikimedia.org/r/717241 (https://phabricator.wikimedia.org/T289661) [10:02:43] (03PS21) 10Jbond: resolvconf: create new class [puppet] - 10https://gerrit.wikimedia.org/r/691080 (https://phabricator.wikimedia.org/T171498) [10:04:17] (03PS29) 10Jbond: O:base::resolver: unify resolv.conf templates [puppet] - 10https://gerrit.wikimedia.org/r/690522 (https://phabricator.wikimedia.org/T171498) [10:06:10] (03CR) 10jerkins-bot: [V: 04-1] O:base::resolver: unify resolv.conf templates [puppet] - 10https://gerrit.wikimedia.org/r/690522 (https://phabricator.wikimedia.org/T171498) (owner: 10Jbond) [10:06:42] PROBLEM - Check systemd state on mx2002 is CRITICAL: CHECK_NRPE: Error - Could not connect to 208.80.153.72: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:06:58] RECOVERY - spamassassin on mx2002 is OK: PROCS OK: 3 processes with args spamd https://wikitech.wikimedia.org/wiki/Mail%23SpamAssassin [10:07:57] RECOVERY - Exim SMTP on mx2002 is OK: OK - Certificate mx1001.wikimedia.org will expire on Sun 14 Nov 2021 01:37:16 PM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mail%23Troubleshooting [10:08:21] (03PS30) 10Jbond: O:base::resolver: unify resolv.conf templates [puppet] - 10https://gerrit.wikimedia.org/r/690522 (https://phabricator.wikimedia.org/T171498) [10:14:09] (03PS1) 10Muehlenhoff: Temporarily filter port 25 on mx1001 for reimage [homer/public] - 10https://gerrit.wikimedia.org/r/723072 (https://phabricator.wikimedia.org/T286911) [10:18:21] (03PS1) 10Elukey: kubernetes: add the revscoring-editquality-deploy fake user/token [labs/private] - 10https://gerrit.wikimedia.org/r/723073 (https://phabricator.wikimedia.org/T286791) [10:18:37] (03PS22) 10David Caro: resolvconf: create new class [puppet] - 10https://gerrit.wikimedia.org/r/691080 (https://phabricator.wikimedia.org/T171498) (owner: 10Jbond) [10:18:39] (03PS21) 10David Caro: base::resolving: convert base::resolving to a profile [puppet] - 10https://gerrit.wikimedia.org/r/717241 (https://phabricator.wikimedia.org/T289661) (owner: 10Jbond) [10:19:14] (03PS2) 10Elukey: kubernetes: add the revscoring-editquality-deploy fake user/token [labs/private] - 10https://gerrit.wikimedia.org/r/723073 (https://phabricator.wikimedia.org/T286791) [10:19:28] (03CR) 10Elukey: [V: 03+2 C: 03+2] kubernetes: add the revscoring-editquality-deploy fake user/token [labs/private] - 10https://gerrit.wikimedia.org/r/723073 (https://phabricator.wikimedia.org/T286791) (owner: 10Elukey) [10:21:00] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31228/console" [puppet] - 10https://gerrit.wikimedia.org/r/717241 (https://phabricator.wikimedia.org/T289661) (owner: 10Jbond) [10:21:53] (03PS1) 10Giuseppe Lavagetto: mediawiki: bump version of common_templates [deployment-charts] - 10https://gerrit.wikimedia.org/r/723074 [10:27:08] (03PS1) 10Elukey: role::deployment_server: add revscoring-editquality-deploy k8s user [puppet] - 10https://gerrit.wikimedia.org/r/723077 
(https://phabricator.wikimedia.org/T286791) [10:27:55] !log volans@cumin1001 START - Cookbook sre.experimental.reimage for host sretest1002.eqiad.wmnet [10:27:58] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31231/console" [puppet] - 10https://gerrit.wikimedia.org/r/717241 (https://phabricator.wikimedia.org/T289661) (owner: 10Jbond) [10:28:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:29:04] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mediawiki: bump version of common_templates [deployment-charts] - 10https://gerrit.wikimedia.org/r/723074 (owner: 10Giuseppe Lavagetto) [10:29:23] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (NOOP 2 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31232/console" [puppet] - 10https://gerrit.wikimedia.org/r/723077 (https://phabricator.wikimedia.org/T286791) (owner: 10Elukey) [10:29:36] (03PS1) 10Volans: remote: refactor wait_reboot_since() [software/spicerack] - 10https://gerrit.wikimedia.org/r/723130 [10:31:16] RECOVERY - SSH on sretest1002 is OK: SSH OK - OpenSSH_8.4p1 Debian-5 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [10:33:07] (03Merged) 10jenkins-bot: mediawiki: bump version of common_templates [deployment-charts] - 10https://gerrit.wikimedia.org/r/723074 (owner: 10Giuseppe Lavagetto) [10:33:28] (03PS2) 10Daniel Kinzler: WIP: api-gateway: add script for generating beta config [deployment-charts] - 10https://gerrit.wikimedia.org/r/722411 (https://phabricator.wikimedia.org/T254917) [10:33:56] (03PS3) 10Daniel Kinzler: api-gateway: add script for generating beta config [deployment-charts] - 10https://gerrit.wikimedia.org/r/722411 (https://phabricator.wikimedia.org/T254917) [10:34:01] (03PS22) 10Jbond: base::resolving: convert base::resolving to a profile [puppet] - 10https://gerrit.wikimedia.org/r/717241 (https://phabricator.wikimedia.org/T289661) [10:34:58] (03CR) 
10jerkins-bot: [V: 04-1] remote: refactor wait_reboot_since() [software/spicerack] - 10https://gerrit.wikimedia.org/r/723130 (owner: 10Volans) [10:35:00] (03CR) 10Muehlenhoff: [C: 03+2] Temporarily filter port 25 on mx1001 for reimage [homer/public] - 10https://gerrit.wikimedia.org/r/723072 (https://phabricator.wikimedia.org/T286911) (owner: 10Muehlenhoff) [10:36:55] (03PS4) 10Daniel Kinzler: api-gateway: add script for generating beta config [deployment-charts] - 10https://gerrit.wikimedia.org/r/722411 (https://phabricator.wikimedia.org/T254917) [10:37:17] (03PS2) 10Daniel Kinzler: Create functional values-beta.yaml [deployment-charts] - 10https://gerrit.wikimedia.org/r/722649 [10:37:30] (03PS2) 10Daniel Kinzler: Generate a .env file for use by ratelimiter [deployment-charts] - 10https://gerrit.wikimedia.org/r/722956 [10:41:57] (03CR) 10Mvolz: [C: 03+2] Update zotero to 760d6cae [deployment-charts] - 10https://gerrit.wikimedia.org/r/723070 (owner: 10Mvolz) [10:42:37] (03PS3) 10Daniel Kinzler: Create functional values-beta.yaml [deployment-charts] - 10https://gerrit.wikimedia.org/r/722649 [10:43:41] (03PS1) 10Jbond: P:puppetmaster::puppetdb: dont filter partitions facts [puppet] - 10https://gerrit.wikimedia.org/r/723141 [10:46:37] (03Merged) 10jenkins-bot: Update zotero to 760d6cae [deployment-charts] - 10https://gerrit.wikimedia.org/r/723070 (owner: 10Mvolz) [10:46:45] (03CR) 10Elukey: [C: 03+1] P:puppetmaster::puppetdb: dont filter partitions facts [puppet] - 10https://gerrit.wikimedia.org/r/723141 (owner: 10Jbond) [10:46:53] (03CR) 10Jbond: [C: 03+2] P:puppetmaster::puppetdb: dont filter partitions facts [puppet] - 10https://gerrit.wikimedia.org/r/723141 (owner: 10Jbond) [10:47:48] !log mvolz@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'zotero' for release 'staging' . 
[10:47:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:50:37] !log Upgrade db2102 db2116 db2130 db2145 db2146 [10:50:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:51:26] !log volans@cumin1001 END (PASS) - Cookbook sre.experimental.reimage (exit_code=0) for host sretest1002.eqiad.wmnet [10:51:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:52:18] !log mvolz@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'zotero' for release 'production' . [10:52:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:53:03] !log mx1001 filterered on the routers for forthcoming reimage to bullseye T286911 [10:53:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:53:09] T286911: Upgrade MXes to Bullseye - https://phabricator.wikimedia.org/T286911 [10:54:02] (03PS3) 10Effie Mouzeli: conftool-data: add tegola-vector-tiles discovery 1 [puppet] - 10https://gerrit.wikimedia.org/r/704949 (https://phabricator.wikimedia.org/T283159) [10:55:10] !log mvolz@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'zotero' for release 'production' . [10:55:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:00:05] Amir1, Lucas_WMDE, and apergos: Time to snap out of that daydream and deploy EU Backport and Config training. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210923T1100). [11:00:05] kostajh: A patch you scheduled for EU Backport and Config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:14] o/ [11:00:28] any trainees on this fine day? 
[11:00:42] not here, our team has Code Jam this week, no trainees have signed up for training, there is only kostajh's patch (config change) which looks fine [11:03:29] \o I'm here [11:04:00] I can deploy the patch [11:04:08] ok, go ahead! [11:04:33] (03PS3) 10Jcrespo: dbbackups: Switch s1 backup generation from db2097 to db2141 [puppet] - 10https://gerrit.wikimedia.org/r/721285 (https://phabricator.wikimedia.org/T290865) [11:05:19] (03CR) 10Jbond: "Looks good to me, thanks for all the work, just missing a default for listen_address" [puppet] - 10https://gerrit.wikimedia.org/r/722370 (https://phabricator.wikimedia.org/T283076) (owner: 10Jelto) [11:05:21] (03CR) 10Kosta Harlan: [C: 03+2] GrowthExperiments: Place new dewiki accounts in control group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/722961 (https://phabricator.wikimedia.org/T288420) (owner: 10Kosta Harlan) [11:05:23] (03CR) 10Jbond: [C: 04-1] modules::gitlab add missing fields from ansible gitlab.rb template [puppet] - 10https://gerrit.wikimedia.org/r/722370 (https://phabricator.wikimedia.org/T283076) (owner: 10Jelto) [11:07:12] (03PS2) 10Kosta Harlan: GrowthExperiments: Place new dewiki accounts in control group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/722961 (https://phabricator.wikimedia.org/T288420) [11:08:25] (03CR) 10Jbond: [C: 03+1] "LGTM thanks" [software/ecs] - 10https://gerrit.wikimedia.org/r/722966 (owner: 10Cwhite) [11:09:25] (03CR) 10Jbond: [C: 03+2] O:idp: update access permissions for sre-admins [puppet] - 10https://gerrit.wikimedia.org/r/722884 (https://phabricator.wikimedia.org/T289779) (owner: 10Jbond) [11:10:15] (03CR) 10Kosta Harlan: [C: 03+2] GrowthExperiments: Place new dewiki accounts in control group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/722961 (https://phabricator.wikimedia.org/T288420) (owner: 10Kosta Harlan) [11:10:35] !log restart and upgrade db2141 T290865 [11:10:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[11:10:45] T290865: Upgrade s1 to Buster + MariaDB 10.4 - https://phabricator.wikimedia.org/T290865 [11:11:21] (03Merged) 10jenkins-bot: GrowthExperiments: Place new dewiki accounts in control group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/722961 (https://phabricator.wikimedia.org/T288420) (owner: 10Kosta Harlan) [11:11:42] (03PS2) 10Volans: remote: refactor wait_reboot_since() [software/spicerack] - 10https://gerrit.wikimedia.org/r/723130 [11:11:44] (03PS1) 10Volans: setup.py: limit elasticsearch max version [software/spicerack] - 10https://gerrit.wikimedia.org/r/723153 [11:15:50] !log kharlan@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:722961|GrowthExperiments: Place new dewiki accounts in control group (T288420)]] (duration: 01m 06s) [11:15:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:15:57] T288420: Deploy Growth features on German Wikipedia - https://phabricator.wikimedia.org/T288420 [11:16:42] !log UTC morning backport and config deploys done [11:16:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:19:53] !log Upgrade db2081 db2082 db2083 db2084 db2091 db2152 T290868 [11:19:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:19:59] T290868: Upgrade s8 to Buster + MariaDB 10.4 - https://phabricator.wikimedia.org/T290868 [11:21:15] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [11:21:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:24:24] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[11:24:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:29:45] PROBLEM - k8s API server requests latencies on kubemaster2001 is CRITICAL: instance=10.192.0.56 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [11:31:10] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31239/console" [puppet] - 10https://gerrit.wikimedia.org/r/717241 (https://phabricator.wikimedia.org/T289661) (owner: 10Jbond) [11:32:11] RECOVERY - k8s API server requests latencies on kubemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [11:32:39] (03PS23) 10Jbond: base::resolving: convert base::resolving to a profile [puppet] - 10https://gerrit.wikimedia.org/r/717241 (https://phabricator.wikimedia.org/T289661) [11:33:27] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31240/console" [puppet] - 10https://gerrit.wikimedia.org/r/717241 (https://phabricator.wikimedia.org/T289661) (owner: 10Jbond) [11:40:53] (03PS31) 10Jbond: O:base::resolver: unify resolv.conf templates [puppet] - 10https://gerrit.wikimedia.org/r/690522 (https://phabricator.wikimedia.org/T171498) [11:42:05] (03PS3) 10Daniel Kinzler: Generate a .env file for use by ratelimiter [deployment-charts] - 10https://gerrit.wikimedia.org/r/722956 [11:42:46] (03PS32) 10Jbond: O:base::resolver: unify resolv.conf templates [puppet] - 10https://gerrit.wikimedia.org/r/690522 (https://phabricator.wikimedia.org/T171498) [11:44:59] (03PS23) 10Jbond: resolvconf: create new class [puppet] - 10https://gerrit.wikimedia.org/r/691080 (https://phabricator.wikimedia.org/T171498) [11:45:01] (03PS24) 10Jbond: base::resolving: convert base::resolving to a profile [puppet] - 
10https://gerrit.wikimedia.org/r/717241 (https://phabricator.wikimedia.org/T289661) [11:45:37] (03PS19) 10Jbond: O:base::resolving: drop the domain keyword and use the domain fact [puppet] - 10https://gerrit.wikimedia.org/r/690515 (https://phabricator.wikimedia.org/T171498) [11:45:45] (03PS25) 10Jbond: O:base::resolving: make nameservers mandatory [puppet] - 10https://gerrit.wikimedia.org/r/690529 (https://phabricator.wikimedia.org/T171498) [11:45:54] (03PS33) 10Jbond: O:base::resolver: unify resolv.conf templates [puppet] - 10https://gerrit.wikimedia.org/r/690522 (https://phabricator.wikimedia.org/T171498) [11:46:01] (03PS24) 10Jbond: resolvconf: create new class [puppet] - 10https://gerrit.wikimedia.org/r/691080 (https://phabricator.wikimedia.org/T171498) [11:46:09] (03PS25) 10Jbond: base::resolving: convert base::resolving to a profile [puppet] - 10https://gerrit.wikimedia.org/r/717241 (https://phabricator.wikimedia.org/T289661) [11:50:46] (03CR) 10jerkins-bot: [V: 04-1] resolvconf: create new class [puppet] - 10https://gerrit.wikimedia.org/r/691080 (https://phabricator.wikimedia.org/T171498) (owner: 10Jbond) [11:51:23] (03CR) 10Jbond: [C: 03+1] setup.py: limit elasticsearch max version [software/spicerack] - 10https://gerrit.wikimedia.org/r/723153 (owner: 10Volans) [11:53:51] (03CR) 10Jbond: [C: 03+1] remote: refactor wait_reboot_since() [software/spicerack] - 10https://gerrit.wikimedia.org/r/723130 (owner: 10Volans) [11:55:53] 10SRE, 10Infrastructure-Foundations, 10netops: Adjust egress buffer allocations on ToR switches - https://phabricator.wikimedia.org/T284592 (10cmooney) Ok well we're about a week after DC switchover back to eqiad so we can make some conclusions on the results of the changes in eqiad. Overall there definitel... 
[11:57:06] (03PS7) 10Jelto: modules::gitlab add missing fields from ansible gitlab.rb template [puppet] - 10https://gerrit.wikimedia.org/r/722370 (https://phabricator.wikimedia.org/T283076) [11:58:57] (03CR) 10Jelto: modules::gitlab add missing fields from ansible gitlab.rb template (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/722370 (https://phabricator.wikimedia.org/T283076) (owner: 10Jelto) [11:59:23] (03CR) 10Jbond: [C: 03+1] "LGTM thanks <3" [puppet] - 10https://gerrit.wikimedia.org/r/722370 (https://phabricator.wikimedia.org/T283076) (owner: 10Jelto) [12:01:17] 10SRE, 10Data-Persistence-Backup, 10Infrastructure-Foundations, 10bacula, 10netops: Understand (and mitigate) the backup speed differences between backup1002->backup2002 and backup2002->backup1002 - https://phabricator.wikimedia.org/T274234 (10cmooney) @jcrespo thanks for the above comments. In terms of... [12:03:22] 10SRE, 10Infrastructure-Foundations, 10netops: Adjust egress buffer allocations on ToR switches - https://phabricator.wikimedia.org/T284592 (10cmooney) [12:08:54] (03PS1) 10Muehlenhoff: Allow miscweb hosts to pull OS reports via rsync [puppet] - 10https://gerrit.wikimedia.org/r/723172 [12:11:54] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/723172 (owner: 10Muehlenhoff) [12:19:56] (03CR) 10Jbond: [C: 03+1] "lgtm optional omment" [puppet] - 10https://gerrit.wikimedia.org/r/723172 (owner: 10Muehlenhoff) [12:26:10] (03CR) 10Muehlenhoff: Allow miscweb hosts to pull OS reports via rsync (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/723172 (owner: 10Muehlenhoff) [12:27:34] (03PS2) 10Muehlenhoff: Allow miscweb hosts to pull OS reports via rsync [puppet] - 10https://gerrit.wikimedia.org/r/723172 [12:30:25] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/723172 (owner: 10Muehlenhoff) [12:39:52] (03CR) 10Jbond: "PCC: https://puppet-compiler.wmflabs.org/compiler1003/31241/" [puppet] - 
10https://gerrit.wikimedia.org/r/717241 (https://phabricator.wikimedia.org/T289661) (owner: 10Jbond) [12:56:39] (03CR) 10Muehlenhoff: [C: 03+2] Allow miscweb hosts to pull OS reports via rsync [puppet] - 10https://gerrit.wikimedia.org/r/723172 (owner: 10Muehlenhoff) [12:58:52] (03CR) 10Gehel: [C: 03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/723153 (owner: 10Volans) [12:59:45] (03CR) 10Volans: [C: 03+2] setup.py: limit elasticsearch max version [software/spicerack] - 10https://gerrit.wikimedia.org/r/723153 (owner: 10Volans) [12:59:53] (03CR) 10Volans: [C: 03+2] remote: refactor wait_reboot_since() [software/spicerack] - 10https://gerrit.wikimedia.org/r/723130 (owner: 10Volans) [13:00:05] dduvall and hashar: #bothumor I � Unicode. All rise for MediaWiki train - American+European Version (secondary timeslot) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210923T1300). [13:01:38] (03PS1) 10Volans: sre.experimental.reimage: improve message logging [cookbooks] - 10https://gerrit.wikimedia.org/r/723178 [13:03:12] (03PS5) 10DCausse: search-platform: add flink alerts [alerts] - 10https://gerrit.wikimedia.org/r/720066 (https://phabricator.wikimedia.org/T276467) [13:03:14] (03PS5) 10DCausse: search-platform: Alert when blazegraph burns allocator too rapidly [alerts] - 10https://gerrit.wikimedia.org/r/720684 (https://phabricator.wikimedia.org/T284446) [13:06:08] (03Merged) 10jenkins-bot: setup.py: limit elasticsearch max version [software/spicerack] - 10https://gerrit.wikimedia.org/r/723153 (owner: 10Volans) [13:09:24] !log update pcc facts (after change in puppetdb's fact filter list, to allow partitions for analytics) [13:09:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:10:37] (03CR) 10David Caro: [C: 03+1] "For the cloud hosts, https://puppet-compiler.wmflabs.org/compiler1003/31242/ is looking good too" [puppet] - 10https://gerrit.wikimedia.org/r/717241 
(https://phabricator.wikimedia.org/T289661) (owner: 10Jbond) [13:10:51] (03CR) 10David Caro: [C: 03+1] O:base::resolving: drop the domain keyword and use the domain fact [puppet] - 10https://gerrit.wikimedia.org/r/690515 (https://phabricator.wikimedia.org/T171498) (owner: 10Jbond) [13:10:57] (03CR) 10David Caro: [C: 03+1] O:base::resolving: make nameservers mandatory [puppet] - 10https://gerrit.wikimedia.org/r/690529 (https://phabricator.wikimedia.org/T171498) (owner: 10Jbond) [13:11:04] (03CR) 10David Caro: [C: 03+1] O:base::resolver: unify resolv.conf templates [puppet] - 10https://gerrit.wikimedia.org/r/690522 (https://phabricator.wikimedia.org/T171498) (owner: 10Jbond) [13:11:25] !log Deploy schema change on s3 testwikidatawiki.wb_changes T291584 [13:11:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:31] T291584: Schema change for adding change_object_id index on wb_changes - https://phabricator.wikimedia.org/T291584 [13:12:54] (03Merged) 10jenkins-bot: remote: refactor wait_reboot_since() [software/spicerack] - 10https://gerrit.wikimedia.org/r/723130 (owner: 10Volans) [13:13:38] (03PS25) 10Jbond: resolvconf: create new class [puppet] - 10https://gerrit.wikimedia.org/r/691080 (https://phabricator.wikimedia.org/T171498) [13:14:24] !log Deploy schema change on s4 {commonswiki,testcommonswiki}.wb_changes T291584 [13:14:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:50] (03CR) 10jerkins-bot: [V: 04-1] resolvconf: create new class [puppet] - 10https://gerrit.wikimedia.org/r/691080 (https://phabricator.wikimedia.org/T171498) (owner: 10Jbond) [13:17:32] (03PS26) 10Jbond: resolvconf: create new class [puppet] - 10https://gerrit.wikimedia.org/r/691080 (https://phabricator.wikimedia.org/T171498) [13:17:51] 10SRE, 10serviceops, 10Kubernetes, 10Patch-For-Review: Migrate to helm v3 - https://phabricator.wikimedia.org/T251305 (10elukey) To keep archives happy between Phabricator/IRC - I tried to 
deploy the new ml `revscoring-editquality` service and got: ` "revscoring-editquality" cannot list resource "secrets"... [13:20:20] (03PS26) 10Jbond: base::resolving: convert base::resolving to a profile [puppet] - 10https://gerrit.wikimedia.org/r/717241 (https://phabricator.wikimedia.org/T289661) [13:20:42] (03CR) 10Jbond: [C: 03+2] base::resolving: convert base::resolving to a profile [puppet] - 10https://gerrit.wikimedia.org/r/717241 (https://phabricator.wikimedia.org/T289661) (owner: 10Jbond) [13:22:01] (03PS13) 10Elukey: helmfile: add the ability to inject labels to Namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/720997 (https://phabricator.wikimedia.org/T290476) [13:22:45] (03CR) 10David Caro: [C: 03+1] resolvconf: create new class [puppet] - 10https://gerrit.wikimedia.org/r/691080 (https://phabricator.wikimedia.org/T171498) (owner: 10Jbond) [13:22:53] (03PS9) 10Elukey: kubeflow-kfserving: move Namespace creation to helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/721268 (https://phabricator.wikimedia.org/T288829) [13:22:55] (03CR) 10David Caro: [C: 03+1] base::resolving: convert base::resolving to a profile [puppet] - 10https://gerrit.wikimedia.org/r/717241 (https://phabricator.wikimedia.org/T289661) (owner: 10Jbond) [13:23:03] !log merge refactor of resolv.conf puppet class - (gerrit 717241) [13:23:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:11] (03CR) 10Jbond: [C: 03+2] O:base::resolving: drop the domain keyword and use the domain fact [puppet] - 10https://gerrit.wikimedia.org/r/690515 (https://phabricator.wikimedia.org/T171498) (owner: 10Jbond) [13:23:15] (03CR) 10Jbond: [C: 03+2] O:base::resolving: make nameservers mandatory [puppet] - 10https://gerrit.wikimedia.org/r/690529 (https://phabricator.wikimedia.org/T171498) (owner: 10Jbond) [13:23:21] (03CR) 10Jbond: [C: 03+2] O:base::resolver: unify resolv.conf templates [puppet] - 10https://gerrit.wikimedia.org/r/690522 
(https://phabricator.wikimedia.org/T171498) (owner: 10Jbond) [13:23:27] (03CR) 10Jbond: [C: 03+2] resolvconf: create new class [puppet] - 10https://gerrit.wikimedia.org/r/691080 (https://phabricator.wikimedia.org/T171498) (owner: 10Jbond) [13:25:21] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on mx1001.wikimedia.org with reason: reimage [13:25:23] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on mx1001.wikimedia.org with reason: reimage [13:25:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:27:41] !log reimaging mx1001 to bullseye T286911 [13:27:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:27:46] T286911: Upgrade MXes to Bullseye - https://phabricator.wikimedia.org/T286911 [13:28:11] !log Deploy schema change on s8 codfw wikidatawiki.wb_changes T291584 [13:28:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:28:17] T291584: Schema change for adding change_object_id index on wb_changes - https://phabricator.wikimedia.org/T291584 [13:34:33] !log jmm@cumin2002 START - Cookbook sre.puppet.renew-cert for mx1001.wikimedia.org: Renew puppet certificate - jmm@cumin2002 [13:34:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:40] !log jmm@cumin2002 END (FAIL) - Cookbook sre.puppet.renew-cert (exit_code=99) for mx1001.wikimedia.org: Renew puppet certificate - jmm@cumin2002 [13:34:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:08] (03CR) 10Jbond: "lgtm but see nit" [cookbooks] - 10https://gerrit.wikimedia.org/r/723178 (owner: 10Volans) [13:36:34] !log jmm@cumin2002 START - Cookbook sre.puppet.renew-cert for mx1001.wikimedia.org: Renew puppet certificate - jmm@cumin2002 [13:36:39] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:51] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.renew-cert (exit_code=0) for mx1001.wikimedia.org: Renew puppet certificate - jmm@cumin2002 [13:36:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:37:55] 10SRE, 10Data-Persistence-Backup, 10Infrastructure-Foundations, 10bacula, 10netops: Understand (and mitigate) the backup speed differences between backup1002->backup2002 and backup2002->backup1002 - https://phabricator.wikimedia.org/T274234 (10jcrespo) @cmooney Please feel free to resolve this ticket and... [13:38:21] (03PS2) 10Volans: sre.experimental.reimage: improve message logging [cookbooks] - 10https://gerrit.wikimedia.org/r/723178 [13:38:33] (03CR) 10Volans: sre.experimental.reimage: improve message logging (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/723178 (owner: 10Volans) [13:40:25] (03CR) 10Jelto: [C: 03+2] modules::gitlab add missing fields from ansible gitlab.rb template [puppet] - 10https://gerrit.wikimedia.org/r/722370 (https://phabricator.wikimedia.org/T283076) (owner: 10Jelto) [13:41:48] (03CR) 10Hashar: "+ Bstorm who paired on it on the task 😊" [puppet] - 10https://gerrit.wikimedia.org/r/722476 (https://phabricator.wikimedia.org/T277078) (owner: 10Krinkle) [13:44:58] (03PS1) 10Muehlenhoff: Revert "Temporarily filter port 25 on mx1001 for reimage" [homer/public] - 10https://gerrit.wikimedia.org/r/723184 (https://phabricator.wikimedia.org/T28691) [13:46:46] (03CR) 10Hashar: ci: Add 'bullseye' to docker lsbdistcodename hack (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/717687 (https://phabricator.wikimedia.org/T284774) (owner: 10Krinkle) [13:48:10] (03PS8) 10Hashar: ci: Add 'bullseye' to docker lsbdistcodename hack [puppet] - 10https://gerrit.wikimedia.org/r/717687 (https://phabricator.wikimedia.org/T284774) (owner: 10Krinkle) [13:48:21] (03CR) 10Hashar: [C: 03+1] ci: Add 'bullseye' to docker lsbdistcodename hack [puppet] 
- 10https://gerrit.wikimedia.org/r/717687 (https://phabricator.wikimedia.org/T284774) (owner: 10Krinkle) [13:51:49] jouncebot: now [13:51:49] For the next 1 hour(s) and 8 minute(s): MediaWiki train - American+European Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210923T1300) [13:51:52] jouncebot: next [13:51:53] In 2 hour(s) and 8 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210923T1600) [13:53:19] !log upgrade php7.2 on codfw - T291052 [13:53:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:28] T291052: Deploy PHP patch for DOM replaceChild/removeChild performance - https://phabricator.wikimedia.org/T291052 [14:00:06] (03CR) 10Herron: [C: 03+1] Revert "Temporarily filter port 25 on mx1001 for reimage" [homer/public] - 10https://gerrit.wikimedia.org/r/723184 (https://phabricator.wikimedia.org/T28691) (owner: 10Muehlenhoff) [14:02:44] (03CR) 10Elukey: [C: 03+2] "Taking the liberty to merge this since it is a no op and I worked with it in the past days with Janis. 
If there is anything weird or that " [deployment-charts] - 10https://gerrit.wikimedia.org/r/720997 (https://phabricator.wikimedia.org/T290476) (owner: 10Elukey) [14:02:55] (03CR) 10Muehlenhoff: [C: 03+2] Revert "Temporarily filter port 25 on mx1001 for reimage" [homer/public] - 10https://gerrit.wikimedia.org/r/723184 (https://phabricator.wikimedia.org/T28691) (owner: 10Muehlenhoff) [14:03:45] (03CR) 10Ottomata: [C: 03+2] Eventgate: Symlink _helpers and _tls_helpers [deployment-charts] - 10https://gerrit.wikimedia.org/r/722654 (https://phabricator.wikimedia.org/T291504) (owner: 10Ppchelko) [14:04:42] (03CR) 10Ottomata: [V: 03+2 C: 03+2] Eventgate: Symlink _helpers and _tls_helpers [deployment-charts] - 10https://gerrit.wikimedia.org/r/722654 (https://phabricator.wikimedia.org/T291504) (owner: 10Ppchelko) [14:05:00] (03PS3) 10Ottomata: Update eventgate helmfile.d for eventgate 0.5 chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/722935 (https://phabricator.wikimedia.org/T291504) (owner: 10Ppchelko) [14:05:44] (03CR) 10Ottomata: [C: 03+2] Update eventgate helmfile.d for eventgate 0.5 chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/722935 (https://phabricator.wikimedia.org/T291504) (owner: 10Ppchelko) [14:06:42] (03CR) 10Herron: profile::logstash::gelf_relay: ingest GELF logs and output as JSON over UDP (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/721345 (https://phabricator.wikimedia.org/T288620) (owner: 10Herron) [14:11:21] (03CR) 10Elukey: [C: 03+2] kubeflow-kfserving: move Namespace creation to helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/721268 (https://phabricator.wikimedia.org/T288829) (owner: 10Elukey) [14:14:31] !log otto@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'production' . 
[14:14:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:30] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. [14:19:32] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. [14:19:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:45] !log removed routers filter for mx1001, reimage to bullseye complete T286911 [14:19:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:50] T286911: Upgrade MXes to Bullseye - https://phabricator.wikimedia.org/T286911 [14:28:09] (03PS1) 10Ssingh: durum: use the noscript tag in the body [puppet] - 10https://gerrit.wikimedia.org/r/723210 [14:37:11] 10SRE, 10Infrastructure-Foundations, 10Mail, 10Patch-For-Review: Upgrade MXes to Bullseye - https://phabricator.wikimedia.org/T286911 (10MoritzMuehlenhoff) Both mx1001 and mx2001 are now running Bullseye. There's a little cleanup/followup work, but the core of the work is completed. 
[14:41:59] (03CR) 10Ssingh: [C: 03+2] durum: use the noscript tag in the body [puppet] - 10https://gerrit.wikimedia.org/r/723210 (owner: 10Ssingh) [14:51:04] (03PS1) 10ZPapierski: Added spicerack.kafka with offset transfer function [software/spicerack] - 10https://gerrit.wikimedia.org/r/723214 (https://phabricator.wikimedia.org/T276469) [14:51:30] (03PS1) 10Elukey: helmfile.d: force quotation to namaspace label values [deployment-charts] - 10https://gerrit.wikimedia.org/r/723215 (https://phabricator.wikimedia.org/T290476) [14:52:09] (03PS2) 10ZPapierski: Added spicerack.kafka with offset transfer function [software/spicerack] - 10https://gerrit.wikimedia.org/r/723214 (https://phabricator.wikimedia.org/T276469) [14:52:54] (03CR) 10Ottomata: Added spicerack.kafka with offset transfer function (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/723214 (https://phabricator.wikimedia.org/T276469) (owner: 10ZPapierski) [14:56:06] (03PS1) 10Reedy: [beta] Update wgCdnServersNoPurge [mediawiki-config] - 10https://gerrit.wikimedia.org/r/723216 (https://phabricator.wikimedia.org/T291643) [14:56:11] jouncebot: now [14:56:11] For the next 0 hour(s) and 3 minute(s): MediaWiki train - American+European Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210923T1300) [14:56:34] (03CR) 10Reedy: [C: 03+2] [beta] Update wgCdnServersNoPurge [mediawiki-config] - 10https://gerrit.wikimedia.org/r/723216 (https://phabricator.wikimedia.org/T291643) (owner: 10Reedy) [14:57:19] (03Merged) 10jenkins-bot: [beta] Update wgCdnServersNoPurge [mediawiki-config] - 10https://gerrit.wikimedia.org/r/723216 (https://phabricator.wikimedia.org/T291643) (owner: 10Reedy) [14:58:33] (03PS2) 10Elukey: helmfile.d: force quotation to namaspace label values [deployment-charts] - 10https://gerrit.wikimedia.org/r/723215 (https://phabricator.wikimedia.org/T290476) [14:58:54] !log reedy@deploy1002 Synchronized wmf-config/reverse-proxy-staging.php: T291643 
(duration: 01m 05s) [14:58:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:59:00] T291643: Upload cache not invalidated after purge - https://phabricator.wikimedia.org/T291643 [15:01:45] (03CR) 10jerkins-bot: [V: 04-1] Added spicerack.kafka with offset transfer function [software/spicerack] - 10https://gerrit.wikimedia.org/r/723214 (https://phabricator.wikimedia.org/T276469) (owner: 10ZPapierski) [15:02:04] 10SRE, 10Traffic, 10Wikidata, 10Wikidata-Campsite, and 2 others: Fix broken https at https://query.commons.wikimedia.org/ - https://phabricator.wikimedia.org/T291542 (10Ladsgroup) The config I'm getting is query-commons.wikimedia.org https://gerrit.wikimedia.org/r/c/wikidata/query/gui-deploy/+/720072/2/sit... [15:04:17] (03CR) 10Elukey: [C: 03+2] helmfile.d: force quotation to namaspace label values [deployment-charts] - 10https://gerrit.wikimedia.org/r/723215 (https://phabricator.wikimedia.org/T290476) (owner: 10Elukey) [15:06:32] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. [15:06:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:06:40] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. [15:06:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:09:46] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. [15:09:51] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. [15:09:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:09:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:09:57] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[15:10:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:43] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [15:12:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:58] (03PS1) 10MVernon: alertmanager: route data-persistence team alerts [puppet] - 10https://gerrit.wikimedia.org/r/723220 (https://phabricator.wikimedia.org/T257056) [15:17:11] (03CR) 10MVernon: "Hi," [puppet] - 10https://gerrit.wikimedia.org/r/723220 (https://phabricator.wikimedia.org/T257056) (owner: 10MVernon) [15:32:53] 10SRE, 10Traffic, 10Wikidata, 10Wikidata-Campsite, and 2 others: Fix broken https at https://query.commons.wikimedia.org/ - https://phabricator.wikimedia.org/T291542 (10EBernhardson) Commons query has not been deployed yet. No public DNS has been assigned. Nothing is configured to route traffic from the pu... [15:41:13] PROBLEM - SSH on ms-fe2006.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:41:20] (03CR) 10Bstorm: ci: Apply profile::wmcs::lvm as needed for new integration instances (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/722476 (https://phabricator.wikimedia.org/T277078) (owner: 10Krinkle) [15:42:31] (03CR) 10Bstorm: [C: 03+1] "That should do it. If both get factored out into a role instead of being used directly in a profile, just make sure both are used." 
[puppet] - 10https://gerrit.wikimedia.org/r/722476 (https://phabricator.wikimedia.org/T277078) (owner: 10Krinkle) [15:48:31] (03CR) 10Brennen Bearnes: [V: 03+2] gitlab cas: uid instead of CN; add nickname_key [gitlab-ansible] - 10https://gerrit.wikimedia.org/r/714382 (https://phabricator.wikimedia.org/T288392) (owner: 10Jbond) [15:54:15] 10SRE, 10Wikimedia-Logstash, 10observability, 10SRE Observability (FY2021/2022-Q1): Integrate Event Platform and ECS logs - https://phabricator.wikimedia.org/T291645 (10Ottomata) [15:54:27] !log gitlab1001: brief downtime to apply [[gerrit:714382|gitlab cas: uid instead of CN; add nickname_key]] for T288392 [15:54:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:54:34] T288392: GitLab uses 'real name' as username (rather than 'shell name' or an user-specified name) - https://phabricator.wikimedia.org/T288392 [15:54:53] !log lucaswerkmeister-wmde@mwmaint1002:~$ echo 'https://query.wikidata.org/querybuilder/' | mwscript purgeList.php # T285761 [15:54:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:54:58] T285761: Add proper security headers to Query Builder - https://phabricator.wikimedia.org/T285761 [15:55:57] (03PS1) 10MVernon: data-protection: add alerting for prometheus-mysqld-exporter failing [alerts] - 10https://gerrit.wikimedia.org/r/723223 (https://phabricator.wikimedia.org/T257056) [15:57:47] 10SRE, 10Analytics, 10Event-Platform, 10Wikimedia-Logstash, and 2 others: Integrate Event Platform and ECS logs - https://phabricator.wikimedia.org/T291645 (10Ottomata) [15:58:24] (03CR) 10jerkins-bot: [V: 04-1] data-protection: add alerting for prometheus-mysqld-exporter failing [alerts] - 10https://gerrit.wikimedia.org/r/723223 (https://phabricator.wikimedia.org/T257056) (owner: 10MVernon) [16:00:04] jbond and rzl: #bothumor I � Unicode. All rise for Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210923T1600). 
[16:00:05] Lucas_WMDE: A patch you scheduled for Puppet request window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [16:00:12] o/ [16:00:24] my patch is just a proposed comment improvement, I thought I’d bring it to the window [16:00:32] looking! [16:00:51] (03CR) 10Ladsgroup: [C: 03+1] etcd::backup: convert backup cron to timer job [puppet] - 10https://gerrit.wikimedia.org/r/722950 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn) [16:00:56] ahhhh jeez I dunno, this seems like it might have some pretty serious downstream effects ;) [16:01:00] merging, thanks for the patch [16:01:07] :D thanks \o/ [16:01:11] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=sidekiq site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:01:22] (03CR) 10RLazarus: [C: 03+2] Clarify comment of restricted group [puppet] - 10https://gerrit.wikimedia.org/r/722331 (owner: 10Lucas Werkmeister (WMDE)) [16:01:24] I know, I made the file two bytes longer… [16:01:31] (03PS2) 10RLazarus: Clarify comment of restricted group [puppet] - 10https://gerrit.wikimedia.org/r/722331 (owner: 10Lucas Werkmeister (WMDE)) [16:03:17] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:03:19] (03CR) 10MVernon: "recheck" [alerts] - 10https://gerrit.wikimedia.org/r/723223 (https://phabricator.wikimedia.org/T257056) (owner: 10MVernon) [16:03:43] ✅ [16:04:49] (03CR) 10Volans: [C: 03+2] sre.experimental.reimage: improve message logging [cookbooks] - 10https://gerrit.wikimedia.org/r/723178 (owner: 10Volans) [16:07:52] (03Merged) 10jenkins-bot: sre.experimental.reimage: improve message logging [cookbooks] - 10https://gerrit.wikimedia.org/r/723178 (owner: 10Volans) [16:09:15] !log gitlab1001: reverting [[gerrit:714382|gitlab cas: uid instead of CN; add nickname_key]] for T288392, as existing user logins are broken. [16:09:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:09:22] T288392: GitLab uses 'real name' as username (rather than 'shell name' or an user-specified name) - https://phabricator.wikimedia.org/T288392 [16:10:37] PROBLEM - SSH on contint2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:10:38] 10SRE, 10Traffic, 10Wikidata, 10Wikidata-Campsite, and 2 others: Fix broken https at https://query.commons.wikimedia.org/ - https://phabricator.wikimedia.org/T291542 (10Dzahn) It seems that the reported "currently giving a broken https cert" is basically impossible with this not being in DNS. 
[16:13:21] (03PS2) 10Ryan Kemper: query_service: fix newly broken gc-log-cleanup [puppet] - 10https://gerrit.wikimedia.org/r/721646 [16:13:42] !log reboot an-worker1096 to see if megacli status for a new disk changes - T290805 [16:13:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:13:48] T290805: Degraded RAID on an-worker1096 - https://phabricator.wikimedia.org/T290805 [16:14:38] (03PS1) 10Brennen Bearnes: Revert "gitlab cas: uid instead of CN; add nickname_key" [gitlab-ansible] - 10https://gerrit.wikimedia.org/r/723230 (https://phabricator.wikimedia.org/T288392) [16:15:10] (03PS1) 10Volans: CHANGELOG: add changelogs for release v1.0.1 [software/spicerack] - 10https://gerrit.wikimedia.org/r/723231 [16:16:39] (03PS3) 10Ryan Kemper: query_service: fix newly broken gc-log-cleanup [puppet] - 10https://gerrit.wikimedia.org/r/721646 [16:17:08] (03CR) 10Ryan Kemper: [V: 03+2 C: 03+2] query_service: fix newly broken gc-log-cleanup [puppet] - 10https://gerrit.wikimedia.org/r/721646 (owner: 10Ryan Kemper) [16:24:34] 10SRE, 10Wikimedia-Logstash, 10observability, 10Patch-For-Review, 10SRE Observability (FY2021/2022-Q1): Standardize the logging format - https://phabricator.wikimedia.org/T234565 (10colewhite) [16:24:50] (03CR) 10Volans: [C: 03+2] CHANGELOG: add changelogs for release v1.0.1 [software/spicerack] - 10https://gerrit.wikimedia.org/r/723231 (owner: 10Volans) [16:25:07] (03PS2) 10Cwhite: add dynamic_templates template rendering [software/ecs] - 10https://gerrit.wikimedia.org/r/722966 (https://phabricator.wikimedia.org/T291647) [16:25:40] (03PS2) 10MVernon: data-protection: add alerting for prometheus-mysqld-exporter failing [alerts] - 10https://gerrit.wikimedia.org/r/723223 (https://phabricator.wikimedia.org/T257056) [16:29:57] (03CR) 10MVernon: "Hi," [alerts] - 10https://gerrit.wikimedia.org/r/723223 (https://phabricator.wikimedia.org/T257056) (owner: 10MVernon) [16:31:13] (03Merged) 10jenkins-bot: CHANGELOG: add changelogs for 
release v1.0.1 [software/spicerack] - 10https://gerrit.wikimedia.org/r/723231 (owner: 10Volans) [16:33:04] (03PS1) 10Volans: Upstream release v1.0.1 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/723245 [16:33:38] (03PS8) 10Ryan Kemper: blazegraph: LVS for WCQS step 1 [puppet] - 10https://gerrit.wikimedia.org/r/713959 (https://phabricator.wikimedia.org/T280001) (owner: 10Ebernhardson) [16:33:57] (03PS9) 10Ryan Kemper: blazegraph: LVS for WCQS step 1 [puppet] - 10https://gerrit.wikimedia.org/r/713959 (https://phabricator.wikimedia.org/T280001) (owner: 10Ebernhardson) [16:34:49] 10SRE, 10Infrastructure-Foundations, 10netops: Create an alert for output discards on network devices - https://phabricator.wikimedia.org/T284593 (10cmooney) [16:34:57] 10SRE, 10Infrastructure-Foundations, 10netops: Packet Drops on Eqiad ASW -> CR uplinks - https://phabricator.wikimedia.org/T291627 (10cmooney) [16:36:23] PROBLEM - SSH on analytics1069.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:37:04] (03CR) 10Ebernhardson: [C: 03+1] blazegraph: LVS for WCQS step 1 [puppet] - 10https://gerrit.wikimedia.org/r/713959 (https://phabricator.wikimedia.org/T280001) (owner: 10Ebernhardson) [16:37:19] (03PS10) 10Ryan Kemper: query_service: LVS for WCQS step 1 [puppet] - 10https://gerrit.wikimedia.org/r/713959 (https://phabricator.wikimedia.org/T280001) (owner: 10Ebernhardson) [16:37:33] 10SRE, 10ops-eqiad, 10Analytics: Degraded RAID on an-worker1096 - https://phabricator.wikimedia.org/T290805 (10elukey) New disk up and running, I added some more info to https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration#Swapping_broken_disk (in this case there was no unconfi... 
[16:37:41] (03CR) 10Ryan Kemper: [V: 03+2 C: 03+1] query_service: LVS for WCQS step 1 [puppet] - 10https://gerrit.wikimedia.org/r/713959 (https://phabricator.wikimedia.org/T280001) (owner: 10Ebernhardson) [16:37:43] (03CR) 10Ryan Kemper: [V: 03+2 C: 03+2] query_service: LVS for WCQS step 1 [puppet] - 10https://gerrit.wikimedia.org/r/713959 (https://phabricator.wikimedia.org/T280001) (owner: 10Ebernhardson) [16:38:46] !log T280001 Merged https://gerrit.wikimedia.org/r/c/operations/puppet/+/713959, running puppet on `*w*qs*` (i.e. wcqs and wdqs) [16:38:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:38:52] T280001: Set up puppet configuration for new WCQS cluster - https://phabricator.wikimedia.org/T280001 [16:39:09] 10SRE, 10Data-Persistence-Backup, 10Infrastructure-Foundations, 10bacula, 10netops: Understand (and mitigate) the backup speed differences between backup1002->backup2002 and backup2002->backup1002 - https://phabricator.wikimedia.org/T274234 (10cmooney) 05Open→03Resolved @jcrespo thanks. As you say i... 
[16:42:13] RECOVERY - SSH on ms-fe2006.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:44:55] (03PS1) 10Ryan Kemper: wcqs: go from service_setup to lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/723254 (https://phabricator.wikimedia.org/T280001) [16:45:02] (03CR) 10Volans: [C: 03+2] Upstream release v1.0.1 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/723245 (owner: 10Volans) [16:48:02] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/723254 (https://phabricator.wikimedia.org/T280001) (owner: 10Ryan Kemper) [16:51:36] (03CR) 10Ebernhardson: [C: 03+1] wcqs: go from service_setup to lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/723254 (https://phabricator.wikimedia.org/T280001) (owner: 10Ryan Kemper) [16:52:23] (03Merged) 10jenkins-bot: Upstream release v1.0.1 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/723245 (owner: 10Volans) [16:55:11] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [16:55:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:58:31] (03CR) 10DCausse: Added spicerack.kafka with offset transfer function (035 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/723214 (https://phabricator.wikimedia.org/T276469) (owner: 10ZPapierski) [16:59:18] !log uploaded spicerack_1.0.1 to apt.wikimedia.org buster-wikimedia,bullseye-wikimedia [16:59:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:00:04] chrisalbon and accraze: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Services – Graphoid / ORES deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210923T1700). 
[17:01:19] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:01:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:02:08] (03CR) 10DCausse: [C: 03+1] Add kafka clusters' brokers to spicerack config [puppet] - 10https://gerrit.wikimedia.org/r/721857 (https://phabricator.wikimedia.org/T276469) (owner: 10ZPapierski) [17:06:37] !log volans@cumin2002 START - Cookbook sre.experimental.reimage for host sretest1001.eqiad.wmnet [17:06:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:06:43] 10SRE-tools, 10Infrastructure-Foundations: Cookbooks: convert wmf-auto-reimage scripts to Cookbooks - https://phabricator.wikimedia.org/T205885 (10ops-monitoring-bot) Cookbook cookbooks.sre.experimental.reimage was started by volans@cumin2002 for host sretest1001.eqiad.wmnet [17:06:48] 10SRE, 10ops-eqiad, 10DC-Ops, 10Elasticsearch, 10Discovery-Search (Current work): Q4:(Need By: TBD) rack/setup/install elastic10[68-83].eqiad.wmnet - https://phabricator.wikimedia.org/T281989 (10Cmjohnson) [17:08:49] PROBLEM - Long running screen/tmux on releases1002 is CRITICAL: CRIT: Long running SCREEN process. (user: dancy PID: 30932, 1734300s 1728000s). 
https://wikitech.wikimedia.org/wiki/Monitoring/Long_running_screens [17:09:33] fiiiine [17:10:45] 10SRE, 10ops-eqiad, 10DC-Ops, 10Elasticsearch, 10Discovery-Search (Current work): Q4:(Need By: TBD) rack/setup/install elastic10[68-83].eqiad.wmnet - https://phabricator.wikimedia.org/T281989 (10Cmjohnson) dns and port descriptions updated [17:11:37] RECOVERY - SSH on contint2001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:13:26] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [17:13:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:14:48] 10SRE, 10ops-eqiad, 10DC-Ops, 10Machine-Learning-Team: Q1:(Need By: TBD) rack/setup/install ml-train100[1-4] - https://phabricator.wikimedia.org/T291579 (10wiki_willy) [17:18:47] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:18:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:19:34] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [17:19:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:21:21] PROBLEM - Elevated latency for icinga checks in codfw on alert1001 is CRITICAL: cluster=alerting instance=alert2001 job=icinga site=codfw https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/rsCfQfuZz/icinga [17:22:03] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:22:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:24:47] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/wcqs on puppetmaster1001 is CRITICAL: File not found: /srv/config-master/pybal/eqiad/wcqs https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [17:24:47] PROBLEM - Confd template for /srv/config-master/pybal/codfw/wcqs on puppetmaster2001 is CRITICAL: File not found: /srv/config-master/pybal/codfw/wcqs 
https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [17:28:07] 10SRE, 10Traffic, 10Wikidata, 10Wikidata-Campsite, and 2 others: Fix broken https at https://query.commons.wikimedia.org/ - https://phabricator.wikimedia.org/T291542 (10EBernhardson) 05Open→03Invalid Seems to be a miscommunication, the service is not yet publicly available. [17:28:18] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [17:28:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:31:00] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:31:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:31:36] !log volans@cumin2002 END (PASS) - Cookbook sre.experimental.reimage (exit_code=0) for host sretest1001.eqiad.wmnet [17:31:41] 10SRE-tools, 10Infrastructure-Foundations: Cookbooks: convert wmf-auto-reimage scripts to Cookbooks - https://phabricator.wikimedia.org/T205885 (10ops-monitoring-bot) Cookbook cookbooks.sre.experimental.reimage completed: - sretest1001 (**PASS**) - Downtimed on Icinga - Disabled Puppet - Removed from Pup... [17:31:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:32:52] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations: Q1:(Need By: TBD) rack/setup/install puppetmaster100[45].eqiad.wmnet - https://phabricator.wikimedia.org/T289732 (10Cmjohnson) [17:35:44] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations: Q1:(Need By: TBD) rack/setup/install puppetmaster100[45].eqiad.wmnet - https://phabricator.wikimedia.org/T289732 (10Cmjohnson) dns and nework information updated in netbox [17:36:41] RECOVERY - SSH on analytics1069.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:37:09] RECOVERY - Elevated latency for icinga checks in codfw on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/rsCfQfuZz/icinga [17:42:39] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/wcqs on puppetmaster2001 is CRITICAL: File not found: /srv/config-master/pybal/eqiad/wcqs https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [17:44:50] (03CR) 10BryanDavis: [C: 03+2] toolhub: Add no_proxy envvar [deployment-charts] - 10https://gerrit.wikimedia.org/r/723006 (https://phabricator.wikimedia.org/T291447) (owner: 10BryanDavis) [17:45:03] ^ Taking a look at `Confd template for /srv/config-master/pybal/eqiad/wcqs on puppetmaster1001 is CRITICAL: File not found: /srv/config-master/pybal/eqiad/wcqs`; this is related to the ongoing work to productionize wcqs. https://gerrit.wikimedia.org/r/c/operations/puppet/+/713959 is what would have introduced the error [17:48:58] ACKNOWLEDGEMENT - Confd template for /srv/config-master/pybal/eqiad/wcqs on puppetmaster1001 is CRITICAL: File not found: /srv/config-master/pybal/eqiad/wcqs Ryan Kemper https://phabricator.wikimedia.org/T280001 https://gerrit.wikimedia.org/r/c/operations/puppet/+/713959 https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [17:48:58] ACKNOWLEDGEMENT - Confd template for /srv/config-master/pybal/codfw/wcqs on puppetmaster2001 is CRITICAL: File not found: /srv/config-master/pybal/codfw/wcqs Ryan Kemper https://phabricator.wikimedia.org/T280001 https://gerrit.wikimedia.org/r/c/operations/puppet/+/713959 https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [17:48:58] ACKNOWLEDGEMENT - Confd template for /srv/config-master/pybal/eqiad/wcqs on puppetmaster2001 is CRITICAL: File not found: /srv/config-master/pybal/eqiad/wcqs Ryan Kemper https://phabricator.wikimedia.org/T280001 https://gerrit.wikimedia.org/r/c/operations/puppet/+/713959 https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [17:49:09] (03Merged) 10jenkins-bot: toolhub: Add no_proxy envvar [deployment-charts] - 
10https://gerrit.wikimedia.org/r/723006 (https://phabricator.wikimedia.org/T291447) (owner: 10BryanDavis) [17:49:43] (03PS2) 10BryanDavis: toolhub: text-lb egress + no_proxy [deployment-charts] - 10https://gerrit.wikimedia.org/r/723007 (https://phabricator.wikimedia.org/T291447) [17:56:48] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [17:56:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:57:13] (03PS1) 10Volans: sre.experimental.reimage: improve logging [cookbooks] - 10https://gerrit.wikimedia.org/r/723280 [17:57:49] (03PS1) 10Volans: remote, puppet: reduce logging verbosity [software/spicerack] - 10https://gerrit.wikimedia.org/r/723281 [18:00:05] RoanKattouw, Niharika, and Urbanecm: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Morning backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210923T1800). [18:00:05] No Gerrit patches in the queue for this window AFAICS. [18:01:36] PROBLEM - Confd template for /srv/config-master/pybal/codfw/wcqs on puppetmaster1001 is CRITICAL: File not found: /srv/config-master/pybal/codfw/wcqs https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [18:02:12] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:02:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:12:17] 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops, 10Data-Engineering: Q1:(Need By: ASAP) rack/setup/install an-db100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T289632 (10Cmjohnson) [18:12:34] 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops, 10Data-Engineering: Q1:(Need By: ASAP) rack/setup/install an-db100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T289632 (10Cmjohnson) updated dns and network [18:22:24] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve 
announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [18:24:32] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [18:37:46] (03CR) 10Krinkle: [C: 03+2] Remove $wmgLogstashServers (step 1/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/720385 (owner: 10Krinkle) [18:38:39] (03Merged) 10jenkins-bot: Remove $wmgLogstashServers (step 1/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/720385 (owner: 10Krinkle) [18:38:59] (03CR) 10Jbond: [C: 03+1] sre.experimental.reimage: improve logging [cookbooks] - 10https://gerrit.wikimedia.org/r/723280 (owner: 10Volans) [18:40:36] (03CR) 10Jbond: [C: 03+1] remote, puppet: reduce logging verbosity [software/spicerack] - 10https://gerrit.wikimedia.org/r/723281 (owner: 10Volans) [18:47:00] (03CR) 10Ottomata: Install Alluxio to the test cluster (036 comments) [puppet] - 10https://gerrit.wikimedia.org/r/712974 (https://phabricator.wikimedia.org/T266641) (owner: 10Btullis) [18:47:39] (03CR) 10Ottomata: "Some comments, looking great Ben!" [puppet] - 10https://gerrit.wikimedia.org/r/712974 (https://phabricator.wikimedia.org/T266641) (owner: 10Btullis) [18:48:18] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [18:48:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:50:21] (03CR) 10Herron: opensearch: fork elasticsearch module into opensearch module (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/721359 (https://phabricator.wikimedia.org/T288618) (owner: 10Cwhite) [18:51:25] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[18:51:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:52:26] (03CR) 10Gehel: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/721089 (owner: 10Ebernhardson) [18:52:32] (03PS7) 10Ryan Kemper: Declare wikimedia_cluster for wcqs [puppet] - 10https://gerrit.wikimedia.org/r/721089 (owner: 10Ebernhardson) [18:52:46] (03CR) 10Ryan Kemper: [C: 03+1] Declare wikimedia_cluster for wcqs [puppet] - 10https://gerrit.wikimedia.org/r/721089 (owner: 10Ebernhardson) [18:54:07] (03PS8) 10Ryan Kemper: Declare wikimedia_cluster for wcqs [puppet] - 10https://gerrit.wikimedia.org/r/721089 (https://phabricator.wikimedia.org/T280001) (owner: 10Ebernhardson) [18:54:30] mewoph and I might have a backport for wmf.1 coming in a few minutes; maybe we will leave it for the next window though [18:54:38] (03CR) 10Ryan Kemper: [V: 03+2 C: 03+1] Declare wikimedia_cluster for wcqs [puppet] - 10https://gerrit.wikimedia.org/r/721089 (https://phabricator.wikimedia.org/T280001) (owner: 10Ebernhardson) [18:54:41] (03CR) 10Ryan Kemper: [V: 03+2 C: 03+2] Declare wikimedia_cluster for wcqs [puppet] - 10https://gerrit.wikimedia.org/r/721089 (https://phabricator.wikimedia.org/T280001) (owner: 10Ebernhardson) [18:56:11] (03CR) 10Krinkle: [C: 03+2] Remove $wmgLogstashServers (step 2/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/720386 (owner: 10Krinkle) [18:56:26] kostajh: I'll be done in 2min [18:56:32] nothing else in the window afaik [18:56:43] and I'm only just hijacking it as well. 
[18:57:13] !log krinkle@deploy1002 Synchronized wmf-config/logging.php: I2cd81a5165ea14c (duration: 01m 05s) [18:57:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:58:01] (03PS3) 10Krinkle: Remove $wmgLogstashServers (step 2/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/720386 [18:58:06] (03CR) 10Krinkle: Remove $wmgLogstashServers (step 2/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/720386 (owner: 10Krinkle) [18:58:08] (03CR) 10Krinkle: [C: 03+2] Remove $wmgLogstashServers (step 2/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/720386 (owner: 10Krinkle) [18:58:37] !log T280001 Merged https://gerrit.wikimedia.org/r/c/operations/puppet/+/721089 to see if it resolves the `confd` error that popped up [18:58:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:58:42] T280001: Set up puppet configuration for new WCQS cluster - https://phabricator.wikimedia.org/T280001 [18:58:56] (03Merged) 10jenkins-bot: Remove $wmgLogstashServers (step 2/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/720386 (owner: 10Krinkle) [18:59:31] Krinkle: ok lmk when you're done, thanks [19:00:02] > WARNING logentry Failed to instantiate RC log entry performer [19:00:04] dduvall and hashar: Dear deployers, time to do the MediaWiki train - American+European Version deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210923T1900). 
[19:00:05] This seems to happen on every edit [19:00:10] I don't think that's been there in the past [19:00:14] (03PS1) 10Kosta Harlan: Suggested Edits: Update editor preference for tasks that shouldn't open the editor by default [extensions/GrowthExperiments] (wmf/1.38.0-wmf.1) - 10https://gerrit.wikimedia.org/r/723194 (https://phabricator.wikimedia.org/T291020) [19:00:23] probably muted for prod traffic, but visible when enabling verbose logging [19:00:57] tgr: Pchelolo: ^ might file a task later, but fyi in case that rings any bell [19:02:00] oh... needs a task certainly. [19:02:11] I can file one if you don't [19:02:31] Krinkle: is it OK if I +2 the patch I'm backporting, or should I wait? (still pretty new to the backporting process) [19:02:35] !log krinkle@deploy1002 Synchronized wmf-config/InitialiseSettings.php: I3323ce3d4446a2 (duration: 01m 07s) [19:02:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:03:55] kostajh: done [19:04:06] (03CR) 10Kosta Harlan: [C: 03+2] "backport" [extensions/GrowthExperiments] (wmf/1.38.0-wmf.1) - 10https://gerrit.wikimedia.org/r/723194 (https://phabricator.wikimedia.org/T291020) (owner: 10Kosta Harlan) [19:04:15] thx [19:04:53] dduvall / hashar OK for me to go ahead with this backport? it should be about 15-20 minutes [19:05:02] Pchelolo: https://logstash.wikimedia.org/app/discover#/doc/logstash-*/logstash-mediawiki-2021.09.23?id=VtgGFHwBHB1njYG3x5a1 [19:05:04] go ahead :) [19:05:05] kostajh: sure thing [19:05:17] cheers [19:05:42] Pchelolo: it's not seen by prod error triage because it's caught and logged as non-fatal warning in a custom channel, so that's hopefully mostly intentional, but anyway, have fun :) [19:06:13] yeah, it's intentional that it's not a prod error, but it still shouldn't be happening [19:06:15] thank you! [19:07:24] created a task [19:10:07] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[19:10:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:13:17] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [19:13:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:21:59] (03PS3) 10BryanDavis: toolhub: text-lb egress + no_proxy [deployment-charts] - 10https://gerrit.wikimedia.org/r/723007 (https://phabricator.wikimedia.org/T291447) [19:22:01] (03PS1) 10BryanDavis: toolhub: Do not force envvars to uppercase [deployment-charts] - 10https://gerrit.wikimedia.org/r/723297 (https://phabricator.wikimedia.org/T291447) [19:25:30] dduvall: sorry, the gate-and-submit process seems to be taking longer than usual. [19:25:47] (03Merged) 10jenkins-bot: Suggested Edits: Update editor preference for tasks that shouldn't open the editor by default [extensions/GrowthExperiments] (wmf/1.38.0-wmf.1) - 10https://gerrit.wikimedia.org/r/723194 (https://phabricator.wikimedia.org/T291020) (owner: 10Kosta Harlan) [19:26:36] (03CR) 10BryanDavis: [C: 03+2] toolhub: Do not force envvars to uppercase [deployment-charts] - 10https://gerrit.wikimedia.org/r/723297 (https://phabricator.wikimedia.org/T291447) (owner: 10BryanDavis) [19:29:50] kostajh: no problem. looks like it finished [19:30:09] dduvall: yes, just checking on mwdebug now [19:30:38] (03Merged) 10jenkins-bot: toolhub: Do not force envvars to uppercase [deployment-charts] - 10https://gerrit.wikimedia.org/r/723297 (https://phabricator.wikimedia.org/T291447) (owner: 10BryanDavis) [19:34:28] dduvall: so, the patch I'm backporting (of course) worked as intended locally, but doesn't have the hoped for impact in production AFAICT. Maybe we'll try to rework something in time for the next backport window. Is it preferable to merge this as is and maybe submit another backport later, or to abandon this patch and try again later? 
[19:35:23] PROBLEM - Check correctness of the icinga configuration on alert1001 is CRITICAL: Icinga configuration contains errors https://wikitech.wikimedia.org/wiki/Icinga [19:35:47] kostajh: imo it would be best to revert the backport so there isn't an inconsistent state between wmf/1.38.0-wmf.1 and what's synced [19:36:43] dduvall: I'm inclined to sync it -- it's a patch aiming to fix a couple different scenarios, and it's possible it improves the situation for some scenarios I didn't verify now. [19:36:55] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [19:37:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:37:28] i see. well, if you think it's safe to sync even though it didn't have the effect you wanted, that's ok with me. your call [19:38:12] dduvall: OK, I'll sync it [19:39:56] !log kharlan@deploy1002 Synchronized php-1.38.0-wmf.1/extensions/GrowthExperiments/includes/HomepageHooks.php: Backport: [[gerrit:723194|Suggested Edits: Update editor preference for tasks that shouldn't open the editor by default (T291020)]] (duration: 01m 05s) [19:40:00] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[19:40:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:40:06] T291020: Newcomer tasks: open VE by default - https://phabricator.wikimedia.org/T291020 [19:40:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:40:42] !log UTC morning backport window done [19:40:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:40:50] dduvall: all done, thank you [19:44:08] kostajh: you got it [19:44:21] hashar: o/ rolling :) [19:46:13] (03PS2) 10Ryan Kemper: wcqs: go from service_setup to lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/723254 (https://phabricator.wikimedia.org/T280001) [19:47:45] (03CR) 10BBlack: [C: 03+1] wcqs: go from service_setup to lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/723254 (https://phabricator.wikimedia.org/T280001) (owner: 10Ryan Kemper) [19:48:15] (03PS1) 10Dduvall: all wikis to 1.38.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/723298 [19:48:17] (03CR) 10Dduvall: [C: 03+2] all wikis to 1.38.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/723298 (owner: 10Dduvall) [19:49:00] (03Merged) 10jenkins-bot: all wikis to 1.38.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/723298 (owner: 10Dduvall) [19:49:41] (03PS5) 10Ebernhardson: query_service: Support proxying to microsite from backend [puppet] - 10https://gerrit.wikimedia.org/r/720801 (https://phabricator.wikimedia.org/T280247) [19:50:33] !log dduvall@deploy1002 rebuilt and synchronized wikiversions files: all wikis to 1.38.0-wmf.1 [19:50:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:58:53] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[19:58:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:02:06] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [20:02:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:04:17] !log 1.38.0-wmf.1 promoted to all wikis. no new errors or rising rates (T281165) [20:04:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:04:23] T281165: 1.38.0-wmf.1 deployment blockers - https://phabricator.wikimedia.org/T281165 [20:07:29] dduvall: epilogue -- the patch works now that we are reviewing it in group2. So I guess I did something wrong with the mwdebug setup, although I'm confused about that as I verified that the relevant file existed on mwdebug1002, definitely had the browser plugin on and set to use mwdebug1002 [20:14:56] (03PS1) 10Daniel Kinzler: Create generic config extract script [deployment-charts] - 10https://gerrit.wikimedia.org/r/723306 [20:16:05] (03CR) 10Ottomata: [C: 03+1] Stream config changes for android_daily_stats schema Bug: T286000 Change-Id: Icbc8465a97fe9713b8321314d407573f0967488f [mediawiki-config] - 10https://gerrit.wikimedia.org/r/722970 (https://phabricator.wikimedia.org/T286000) (owner: 10Sharvaniharan) [20:20:17] (03PS1) 10Ebernhardson: query_service: Add dummy credentials for query_service oauth [labs/private] - 10https://gerrit.wikimedia.org/r/723307 [20:31:01] ebernhardson: icinga is complaining about there being no hostgroup matching wcqs_codfw. 
I think it's missing an entry in `monitoring::groups` [20:34:34] (03PS3) 10Ryan Kemper: wcqs: go from service_setup to lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/723254 (https://phabricator.wikimedia.org/T280001) [20:35:05] (03CR) 10jenkins-bot: [V: 04-1] wcqs: go from service_setup to lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/723254 (https://phabricator.wikimedia.org/T280001) (owner: 10Ryan Kemper) [20:38:11] (03PS4) 10Ryan Kemper: wcqs: go from service_setup to lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/723254 (https://phabricator.wikimedia.org/T280001) [20:38:43] (03CR) 10BryanDavis: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/723007 (https://phabricator.wikimedia.org/T291447) (owner: 10BryanDavis) [20:40:04] PROBLEM - SSH on analytics1069.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:47:24] !log T280001 Merging https://gerrit.wikimedia.org/r/c/operations/puppet/+/723254 to proceed with `lvs_setup` state change; will be restarting low-traffic lvs hosts shortly [20:47:28] (03CR) 10Ryan Kemper: [C: 03+2] wcqs: go from service_setup to lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/723254 (https://phabricator.wikimedia.org/T280001) (owner: 10Ryan Kemper) [20:47:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:47:31] T280001: Set up puppet configuration for new WCQS cluster - https://phabricator.wikimedia.org/T280001 [20:49:49] (03PS4) 10BryanDavis: toolhub: text-lb egress + no_proxy [deployment-charts] - 10https://gerrit.wikimedia.org/r/723007 (https://phabricator.wikimedia.org/T291447) [20:49:51] (03PS1) 10BryanDavis: toolhub: bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/723309 [20:53:10] !log T280001 Ran puppet on all lvs hosts => `ryankemper@cumin1001:~$ sudo cumin 'O:lvs::balancer' 'sudo run-puppet-agent'` [20:53:15] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:53:17] T280001: Set up puppet configuration for new WCQS cluster - https://phabricator.wikimedia.org/T280001 [20:53:55] !log T280001 Restarting pybal on backup low-traffic hosts `lvs2010` and `lvs1016`... [20:53:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:54:37] !log T280001 Restarted pybal on backup low-traffic hosts: `ryankemper@cumin1001:~$ sudo cumin 'P{lvs2010*,lvs1016*}' 'sudo systemctl restart pybal'` [20:54:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:55:20] PROBLEM - PyBal IPVS diff check on lvs2009 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.2.67:443]) https://wikitech.wikimedia.org/wiki/PyBal [20:56:58] PROBLEM - PyBal IPVS diff check on lvs1016 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.1.67:443]) https://wikitech.wikimedia.org/wiki/PyBal [20:57:22] PROBLEM - PyBal IPVS diff check on lvs1015 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.1.67:443]) https://wikitech.wikimedia.org/wiki/PyBal [20:58:24] !log canceling backport training window for 2021-09-23 [20:58:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:58:32] PROBLEM - PyBal IPVS diff check on lvs2010 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.2.67:443]) https://wikitech.wikimedia.org/wiki/PyBal [20:58:50] ^ These diffchecks errors are an expected part of the process, will ack [20:58:56] PROBLEM - PyBal connections to etcd on lvs2009 is CRITICAL: CRITICAL: 63 connections established with conf2004.codfw.wmnet:4001 (min=64) https://wikitech.wikimedia.org/wiki/PyBal [20:59:52] PROBLEM - PyBal connections to etcd on lvs1015 is CRITICAL: CRITICAL: 71 connections established with conf1004.eqiad.wmnet:4001 (min=72) https://wikitech.wikimedia.org/wiki/PyBal [21:00:05] !log T280001 `TCP 10.2.1.67:443 wrr` shows up on 
`ryankemper@lvs1016:~$ sudo ipvsadm -L -n ` and `TCP 10.2.2.67:443 wrr` shows up on `ryankemper@lvs2010:~$ sudo ipvsadm -L -n` as expected [21:00:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:00:12] T280001: Set up puppet configuration for new WCQS cluster - https://phabricator.wikimedia.org/T280001 [21:00:15] !log T280001 Sanity check of `sudo ipvsadm -L -n` on backup `lvs2010` and `lvs1016` looks good, proceeding [21:00:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:00:54] small typo in prior log-line, correcting for OCD's sake [21:00:55] !log T280001 Sanity check of `sudo ipvsadm -L -n` on low-traffic backups `lvs2010` and `lvs1016` looks good, proceeding [21:01:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:02:42] ACKNOWLEDGEMENT - PyBal IPVS diff check on lvs1015 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.1.67:443]) Ryan Kemper phabricator.wikimedia.org/T280001 these alerts are expected per https://wikitech.wikimedia.org/wiki/LVS#Configure_the_load_balancers https://wikitech.wikimedia.org/wiki/PyBal [21:02:42] ACKNOWLEDGEMENT - PyBal connections to etcd on lvs1015 is CRITICAL: CRITICAL: 71 connections established with conf1004.eqiad.wmnet:4001 (min=72) Ryan Kemper phabricator.wikimedia.org/T280001 these alerts are expected per https://wikitech.wikimedia.org/wiki/LVS#Configure_the_load_balancers https://wikitech.wikimedia.org/wiki/PyBal [21:02:42] ACKNOWLEDGEMENT - PyBal IPVS diff check on lvs1016 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.1.67:443]) Ryan Kemper phabricator.wikimedia.org/T280001 these alerts are expected per https://wikitech.wikimedia.org/wiki/LVS#Configure_the_load_balancers https://wikitech.wikimedia.org/wiki/PyBal [21:02:42] ACKNOWLEDGEMENT - PyBal IPVS diff check on lvs2009 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.2.67:443]) Ryan Kemper 
phabricator.wikimedia.org/T280001 these alerts are expected per https://wikitech.wikimedia.org/wiki/LVS#Configure_the_load_balancers https://wikitech.wikimedia.org/wiki/PyBal [21:02:42] ACKNOWLEDGEMENT - PyBal connections to etcd on lvs2009 is CRITICAL: CRITICAL: 63 connections established with conf2004.codfw.wmnet:4001 (min=64) Ryan Kemper phabricator.wikimedia.org/T280001 these alerts are expected per https://wikitech.wikimedia.org/wiki/LVS#Configure_the_load_balancers https://wikitech.wikimedia.org/wiki/PyBal [21:02:43] ACKNOWLEDGEMENT - PyBal IPVS diff check on lvs2010 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.2.67:443]) Ryan Kemper phabricator.wikimedia.org/T280001 these alerts are expected per https://wikitech.wikimedia.org/wiki/LVS#Configure_the_load_balancers https://wikitech.wikimedia.org/wiki/PyBal [21:03:27] (03CR) 10BryanDavis: [C: 03+2] toolhub: bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/723309 (owner: 10BryanDavis) [21:03:36] (03PS1) 10Dzahn: puppetmaster::rsync: replace data sync crons with timers/jobs [puppet] - 10https://gerrit.wikimedia.org/r/723310 (https://phabricator.wikimedia.org/T273673) [21:03:38] (03PS1) 10Dzahn: puppetmaster::rsync: remove absented cron code [puppet] - 10https://gerrit.wikimedia.org/r/723311 (https://phabricator.wikimedia.org/T273673) [21:03:57] ACKNOWLEDGEMENT - Confd template for /srv/config-master/pybal/codfw/wcqs on puppetmaster1001 is CRITICAL: File not found: /srv/config-master/pybal/codfw/wcqs Ryan Kemper https://phabricator.wikimedia.org/T280001 https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [21:04:12] !log T280001 Waited 120s and checked https://icinga.wikimedia.org/alerts, proceeding to primary low-traffic hosts `lvs2009` and `lvs1015` [21:04:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:04:26] thanks for the ACKs, Ryan [21:04:40] !log T280001 Restarting pybal on low-traffic primaries `lvs2009` and 
`lvs1015`... [21:04:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:05:04] RECOVERY - PyBal connections to etcd on lvs2009 is OK: OK: 64 connections established with conf2004.codfw.wmnet:4001 (min=64) https://wikitech.wikimedia.org/wiki/PyBal [21:05:44] !log T280001 Restarted pybal on low-traffic primaries: `ryankemper@cumin1001:~$ sudo cumin 'P{lvs2009*,lvs1015*}' 'sudo systemctl restart pybal'` [21:05:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:05:50] T280001: Set up puppet configuration for new WCQS cluster - https://phabricator.wikimedia.org/T280001 [21:06:00] RECOVERY - PyBal connections to etcd on lvs1015 is OK: OK: 72 connections established with conf1004.eqiad.wmnet:4001 (min=72) https://wikitech.wikimedia.org/wiki/PyBal [21:07:18] (03PS2) 10Dzahn: puppetmaster::rsync: replace data sync crons with timers/jobs [puppet] - 10https://gerrit.wikimedia.org/r/723310 (https://phabricator.wikimedia.org/T273673) [21:07:43] (03Merged) 10jenkins-bot: toolhub: bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/723309 (owner: 10BryanDavis) [21:09:54] RECOVERY - Long running screen/tmux on releases1002 is OK: OK: No SCREEN or tmux processes detected. https://wikitech.wikimedia.org/wiki/Monitoring/Long_running_screens [21:16:39] (03PS1) 10Dzahn: puppetmaster: rename cron references to jobs/timers [puppet] - 10https://gerrit.wikimedia.org/r/723313 [21:18:09] (03PS3) 10Dzahn: puppetmaster::rsync: replace data sync crons with timers/jobs [puppet] - 10https://gerrit.wikimedia.org/r/723310 (https://phabricator.wikimedia.org/T273673) [21:19:47] !log The pybal side of the changes looks good, but I made a mistake with the assigning of IPs in netbox; `wcqs.svc.eqiad.wmnet` is routing to where codfw should go and vice versa. Fixing... 
[21:19:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:20:20] (03CR) 10BryanDavis: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/723007 (https://phabricator.wikimedia.org/T291447) (owner: 10BryanDavis) [21:23:28] !log T280001 Swapped IPs of https://netbox.wikimedia.org/ipam/ip-addresses/9062/ and https://netbox.wikimedia.org/ipam/ip-addresses/9063; this should fix the issue where eqiad and codfw were swapped in netbox (my error)...still need to run netbox cookbook and possibly a manual `sudo authdns-update` [21:23:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:23:36] T280001: Set up puppet configuration for new WCQS cluster - https://phabricator.wikimedia.org/T280001 [21:24:01] !log ryankemper@cumin1001 START - Cookbook sre.dns.netbox [21:24:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:26:48] cwhite: thanks for the ping, looking into it. [21:27:00] !log T280001 `ryankemper@cumin1001:~$ sudo -i cookbook sre.dns.netbox -t T280001 'Fix swapped wcqs.svc.[eqiad,codfw].wmnet'` in progress (note: no `sudo authdns-update` will be necessary because that's just for `operations/dns` repo changes; we only need to run the netbox cookbook) [21:27:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:33:05] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:33:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:34:51] (03PS1) 10Ebernhardson: query_service: Add monitoring::groups for wcqs [puppet] - 10https://gerrit.wikimedia.org/r/723314 [21:36:04] !log T280001 `sre.dns.netbox` run complete, netbox IP mixup *should* be resolved [21:36:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:36:10] T280001: Set up puppet configuration for new WCQS cluster - https://phabricator.wikimedia.org/T280001 [21:36:44] 10SRE, 10Patch-For-Review: stop 
using mod_php anywhere - https://phabricator.wikimedia.org/T208257 (10Dzahn) https://debmonitor.wikimedia.org/packages/libapache2-mod-php https://debmonitor.wikimedia.org/packages/libapache2-mod-php5.6 https://debmonitor.wikimedia.org/packages/libapache2-mod-php7.0 https://de... [21:37:11] (03CR) 10BryanDavis: [C: 03+2] "Let's find out if this is the right magic!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/723007 (https://phabricator.wikimedia.org/T291447) (owner: 10BryanDavis) [21:38:27] (03CR) 10Btullis: Install Alluxio to the test cluster (037 comments) [puppet] - 10https://gerrit.wikimedia.org/r/712974 (https://phabricator.wikimedia.org/T266641) (owner: 10Btullis) [21:41:04] RECOVERY - SSH on analytics1069.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:41:41] (03Merged) 10jenkins-bot: toolhub: text-lb egress + no_proxy [deployment-charts] - 10https://gerrit.wikimedia.org/r/723007 (https://phabricator.wikimedia.org/T291447) (owner: 10BryanDavis) [21:41:50] bblack: so I went through the pybal restart process etc, and during the post-checks noticed that I'd mixed up the netbox IPAM addresses for `wcqs.svc.{eqiad,codfw}.wmnet`. I swapped them to match how they should be, and re-ran the `sre.dns.netbox` cookbook [21:42:23] bblack: but from an arbitrary host (`cumin1001`) when I `ping wcqs.svc.codfw.wmnet` I still see it pointing to the wrong IP, and vice versa [21:42:55] is there a TTL that I just need to wait for, or is there another button I need to push somewhere? 
and my followup question is whether the IPAM mixup is the source of the `ipvs pybal diff check` alerts not resolving now that I'm done with the pybal restarts [21:43:03] !log altering some rows in the `securepoll_elections` table on metawiki [21:43:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:43:45] !log bd808@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'toolhub' for release 'main' . [21:43:48] and relatedly wondering if I need to roll the pybal restarts again, I haven't restarted them since fixing the netbox IPAM mixup [21:43:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:44:38] ryankemper: so you're saying they each had each others' IPs in netbox [21:44:42] ? [21:44:44] bblack: correct [21:45:34] just so I don't get confused, what's the correct way? [21:45:37] https://netbox.wikimedia.org/ipam/ip-addresses/9062/ and https://netbox.wikimedia.org/ipam/ip-addresses/9063/ are the addresses in question, and what you see in netbox is correct now [21:46:18] are you sure? [21:46:51] e.g. on https://netbox.wikimedia.org/ipam/ip-addresses/9063/ ... I see: [21:46:54] DNS Name wcqs.svc.codfw.wmnet [21:46:56] which are in: [21:46:58] Description Wikimedia Common Query Service - codfw [21:47:04] 10.2.2.0/24Active—Equinix Ashburn—LVS service IPsLVS low-traffic (internal) services [21:47:12] there's still some mismatch there [21:47:42] bblack: hmm right...I also see that the lvs docs say `codfw should be in the 10.2.1.0/24 range` and `eqiad should be in the 10.2.2.0/24 range` [21:48:03] So it's possible the IPAM was right and our `operations/dns` and `operations/puppet` changes were wrong, one sec... [21:48:20] so... netbox was right, gerrit patch had it backwards, and now netbox was corrected in the wrong direction, is where we're at, I think? 
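The name/range mismatch spotted above can be checked mechanically. A minimal sketch, assuming only the convention quoted from the LVS docs in this log (codfw service IPs in 10.2.1.0/24, eqiad in 10.2.2.0/24). The names `EXPECTED_RANGES`, `site_of` and `check_service_ip` are hypothetical and not part of netbox, puppet, or any other tooling mentioned here:

```python
import ipaddress

# Assumed convention, as quoted from the LVS docs in the log above.
EXPECTED_RANGES = {
    "codfw": ipaddress.ip_network("10.2.1.0/24"),
    "eqiad": ipaddress.ip_network("10.2.2.0/24"),
}


def site_of(dns_name):
    """Return the site label from a name shaped like '<service>.svc.<site>.wmnet'."""
    parts = dns_name.split(".")
    if len(parts) != 4 or parts[1] != "svc" or parts[3] != "wmnet":
        raise ValueError(f"unexpected service name: {dns_name}")
    return parts[2]


def check_service_ip(dns_name, ip):
    """True if ip lies in the range assumed for the site named in dns_name."""
    site = site_of(dns_name)
    return ipaddress.ip_address(ip) in EXPECTED_RANGES[site]


if __name__ == "__main__":
    # The first pair reproduces the mismatch discussed in the log.
    for name, ip in [
        ("wcqs.svc.codfw.wmnet", "10.2.2.67"),
        ("wcqs.svc.eqiad.wmnet", "10.2.2.67"),
    ]:
        status = "ok" if check_service_ip(name, ip) else "MISMATCH"
        print(f"{name} -> {ip}: {status}")
```

The same check could be run against each place an address is recorded (netbox, `operations/dns`, `hieradata/common/service.yaml`), since the log shows that any one of them can drift on its own.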
[21:49:02] bblack: that's my (rapidly changing) current understanding, yes [21:49:23] bblack: if so I think that means I need to swap IPAM again, run the cookbook again, then fix the `operations/dns` and do a `sudo authdns-update`? [21:50:55] that sounds about right [21:51:38] and then you'll also need to fix the puppet side in: hieradata/common/service.yaml [21:52:02] and then after all those, re-run puppet on pybals + restart those affected pybals again [21:52:25] and then we might need to do some manual cleanup on the pybal hosts, as it won't clear the old wrong addresses out of the kernel tables on its own. [21:52:36] bblack: it looks like `operations/dns` is correct actually, so I suppose the original problem was only in the puppet side [21:52:45] I'll fix the IPAM right now since we know that's wrong [21:53:18] ryankemper: yes, confirmed, ops/dns copy is correct [21:53:19] !log ryankemper@cumin1001 START - Cookbook sre.dns.netbox [21:53:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:55:25] well, I missed one step above [21:55:47] (03PS1) 10Ryan Kemper: wcqs: fix swapped codfw / eqiad ip defaults [puppet] - 10https://gerrit.wikimedia.org/r/723315 (https://phabricator.wikimedia.org/T280001) [21:55:49] commit/merge the puppet fix, re-run puppet on the affected LVS realservers first (to change their local lo:LVS destination IPs) [21:56:05] then re-run on lvses, then start the usual pybal restarts sequence [21:56:32] (03CR) 10BBlack: [C: 03+1] wcqs: fix swapped codfw / eqiad ip defaults [puppet] - 10https://gerrit.wikimedia.org/r/723315 (https://phabricator.wikimedia.org/T280001) (owner: 10Ryan Kemper) [21:56:34] bblack: okay, so puppet on `wcqs*`, then puppet on all lvs, then the low-traffic backup restarts then the low-traffic primary restarts [21:56:44] correct [21:56:50] (03CR) 10Ryan Kemper: [C: 03+2] wcqs: fix swapped codfw / eqiad ip defaults [puppet] - 10https://gerrit.wikimedia.org/r/723315 
(https://phabricator.wikimedia.org/T280001) (owner: 10Ryan Kemper) [21:56:55] and then remind me I need to go do some manual cleanup :) [21:57:43] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:57:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:58:01] we need to preserve this backscroll and drag it out every time someone talks about the value of having a single source of truth for address data! :) [21:58:17] :P you can say that again [21:58:21] (which netbox aims to be, but there's still a lot of manual/separate things, obviously!) [21:59:08] !log T280001 Swapped the netbox IPAM addresses back, after erroneously swapping them earlier. `sre.dns.netbox` cookbook run complete as well [21:59:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:59:18] T280001: Set up puppet configuration for new WCQS cluster - https://phabricator.wikimedia.org/T280001 [21:59:50] !log T280001 Merged https://gerrit.wikimedia.org/r/c/operations/puppet/+/723315, ran puppet agent on `wcqs*` to fix `local lo:LVS destination IPs` [21:59:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:00:52] !log T280001 Running puppet on all lvs hosts: `ryankemper@cumin1001:~$ sudo cumin 'O:lvs::balancer' 'sudo run-puppet-agent'`... [22:00:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:03:18] !log T280001 Ran puppet on all lvs hosts: `ryankemper@cumin1001:~$ sudo cumin 'O:lvs::balancer' 'sudo run-puppet-agent'` [22:03:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:03:51] !log T280001 Restarting pybal on low-traffic backups `lvs2010` and `lvs1016`... 
[22:03:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:04:09] !log T280001 Restarted pybal on low-traffic backups: `ryankemper@cumin1001:~$ sudo cumin 'P{lvs2010*,lvs1016*}' 'sudo systemctl restart pybal'` [22:04:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:04:51] (03PS1) 10Andrew Bogott: mediawiki-vagrant: added lxc config for Bullseye [puppet] - 10https://gerrit.wikimedia.org/r/723319 (https://phabricator.wikimedia.org/T291660) [22:05:32] !log T280001 [Sanity check] `TCP 10.2.2.67:443 wrr` shows up on `ryankemper@lvs1016:~$ sudo ipvsadm -L -n` and `TCP 10.2.1.67:443 wrr` shows up on `ryankemper@lvs2010:~$ sudo ipvsadm -L -n` as expected [22:05:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:05:38] T280001: Set up puppet configuration for new WCQS cluster - https://phabricator.wikimedia.org/T280001 [22:05:50] !log T280001 [Cleanup required] `TCP 10.2.1.67:443 wrr` shows up on `ryankemper@lvs1016:~$ sudo ipvsadm -L -n` and `TCP 10.2.2.67:443 wrr` shows up on `ryankemper@lvs2010:~$ sudo ipvsadm -L -n` (erroneous) [22:05:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:06:18] (03CR) 10Andrew Bogott: [C: 03+2] mediawiki-vagrant: added lxc config for Bullseye [puppet] - 10https://gerrit.wikimedia.org/r/723319 (https://phabricator.wikimedia.org/T291660) (owner: 10Andrew Bogott) [22:06:33] !log T280001 Waited 120s and checked https://icinga.wikimedia.org/alerts, proceeding to primary low-traffic hosts `lvs2009` and `lvs1015` [22:06:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:06:54] !log T280001 Restarted pybal on low-traffic primaries: `ryankemper@cumin1001:~$ sudo cumin 'P{lvs2009*,lvs1015*}' 'sudo systemctl restart pybal'` [22:06:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:08:27] bblack: Okay, ready for cleanup. 
I imagine we at least need to remove `10.2.1.67:443 wrr` from `lvs1016` as well as `TCP 10.2.2.67:443 wrr` from `lvs2010` [22:08:43] ryankemper: also, relatedly (perhaps you hadn't gotten to it yet!) - but the wcqsx00N hosts in conftool are still in the inactive state [22:09:04] ryankemper: yeah, and from the primaries [22:09:24] ryankemper: the command should be this, as root: [22:09:25] right [22:09:41] ipvsadm -Dt 10.2.1.67:443 [22:09:48] (with the correct IP to delete on each, the wrong ones) [22:10:05] bblack: okay, I can run those commands (and log here) if you're ready [22:10:12] yup, go for it [22:13:10] !log T280001 [eqiad] `root@lvs1016:/home/ryankemper# ipvsadm -Dt 10.2.1.67:443` and `root@lvs1015:/home/ryankemper# ipvsadm -Dt 10.2.1.67:443` [22:13:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:13:17] !log T280001 [codfw] `root@lvs2010:/home/ryankemper# ipvsadm -Dt 10.2.2.67:443` and `root@lvs2009:/home/ryankemper# ipvsadm -Dt 10.2.2.67:443` [22:13:17] T280001: Set up puppet configuration for new WCQS cluster - https://phabricator.wikimedia.org/T280001 [22:13:20] bblack: sanity check ^ [22:13:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:14:08] ryankemper: looks right to me! [22:15:00] bblack: now as for the `ipvs diff checks` not resolving, is that related to your earlier point about still being listed as `inactive`? [22:15:10] probably [22:15:11] bblack@cumin1001:~$ confctl select 'name=wcqs.*' get [22:15:11] {"wcqs2001.codfw.wmnet": {"weight": 0, "pooled": "inactive"}, "tags": "dc=codfw,cluster=wcqs,service=wcqs"} [22:15:15] ... 
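The manual cleanup above (deleting the leftover wrong virtual service on each LVS host) can be sketched as a comparison between what `ipvsadm -L -n` lists and what is expected. This is a rough illustration, not how PyBal or its IPVS diff check actually works. It assumes virtual-service lines look like the `TCP 10.2.1.67:443 wrr` excerpts quoted in the log, and the helper names `stale_services` and `cleanup_commands` are hypothetical. It only prints the `ipvsadm -Dt` commands and runs nothing:

```python
def stale_services(ipvsadm_output, expected):
    """Return TCP virtual services listed in the output but not in expected."""
    present = set()
    for line in ipvsadm_output.splitlines():
        fields = line.split()
        # Assumed shape of a virtual-service line: "TCP  10.2.1.67:443 wrr".
        # Real-server lines start with "->" and are skipped.
        if len(fields) >= 2 and fields[0] == "TCP":
            present.add(fields[1])
    return sorted(present - set(expected))


def cleanup_commands(ipvsadm_output, expected):
    """Build the delete commands for every stale virtual service."""
    return [f"ipvsadm -Dt {svc}" for svc in stale_services(ipvsadm_output, expected)]


if __name__ == "__main__":
    # Illustrative excerpt for an eqiad host after the fix: the new, correct
    # service plus the stale one left behind by the earlier swapped config.
    sample = """\
TCP  10.2.2.67:443 wrr
TCP  10.2.1.67:443 wrr
"""
    for cmd in cleanup_commands(sample, expected={"10.2.2.67:443"}):
        print(cmd)
```

Printing the commands rather than executing them keeps a human in the loop, which matches how the deletions were sanity-checked in the channel before and after being run.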
[22:15:26] (03PS1) 10Reedy: Add table and script for mcdc2021 election [extensions/SecurePoll] (wmf/1.38.0-wmf.1) - 10https://gerrit.wikimedia.org/r/723203 (https://phabricator.wikimedia.org/T291668) [22:15:27] I think once you pool some servers, it can have a valid set of backends in ipvs [22:15:31] (03CR) 10Reedy: [C: 03+2] Add table and script for mcdc2021 election [extensions/SecurePoll] (wmf/1.38.0-wmf.1) - 10https://gerrit.wikimedia.org/r/723203 (https://phabricator.wikimedia.org/T291668) (owner: 10Reedy) [22:15:58] (and give them non-zero weights, too, I think!) [22:17:09] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [22:17:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:17:53] !log ryankemper@puppetmaster1001 conftool action : set/pooled=yes:weight=10; selector: name=wcqs.* [22:17:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:18:21] !log T280001 `ryankemper@puppetmaster1001:/srv$ sudo confctl select 'name=wcqs.*' set/pooled=yes:weight=10` [22:18:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:18:28] T280001: Set up puppet configuration for new WCQS cluster - https://phabricator.wikimedia.org/T280001 [22:18:36] Going to force a recheck of the ipvs diff alerts [22:19:54] IS there intentionally no backport window today? 
https://wikitech.wikimedia.org/wiki/Deployments [22:20:03] (03Merged) 10jenkins-bot: Add table and script for mcdc2021 election [extensions/SecurePoll] (wmf/1.38.0-wmf.1) - 10https://gerrit.wikimedia.org/r/723203 (https://phabricator.wikimedia.org/T291668) (owner: 10Reedy) [22:20:20] RECOVERY - PyBal IPVS diff check on lvs1016 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [22:20:20] RECOVERY - PyBal IPVS diff check on lvs1015 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [22:20:20] RECOVERY - PyBal IPVS diff check on lvs2010 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [22:20:20] RECOVERY - PyBal IPVS diff check on lvs2009 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [22:20:26] \o/ [22:20:26] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [22:20:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:21:06] * bblack hands ryankemper an LVS Service Deployment Training Certificate of Completion and adds his name to the list of people to ping when someone wants to deploy a new LVS service :) [22:21:39] * ryankemper frantically tries to burn certificate, only to have it keep re-materializing into existence [22:21:44] bblack: I graciously accept :) [22:21:53] Jdlrobson: https://sal.toolforge.org/log/hXVzFHwB1jz_IcWuyDMZ [22:22:13] bblack: thanks for helping me clean up the mess! 
:D [22:23:46] Thanks ryankemper [22:23:53] takes notes who to ask for miscweb LVS deployment [22:27:01] !log T280001 The pooling of the `wcqs*` hosts has gotten `/srv/config-master/pybal/${DC}/wcqs` to render, but we need to clear away the stale error files to get rid of the associated warnings `Stale template error files present for '/srv/config-master/pybal/${DC}/wcqs'` => `sudo rm -fv /var/run/confd-template/.wcqs*` [22:27:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:27:06] T280001: Set up puppet configuration for new WCQS cluster - https://phabricator.wikimedia.org/T280001 [22:27:50] !log T280001 `ryankemper@cumin1001:~$ sudo cumin 'P{puppetmaster*}' 'sudo rm -fv /var/run/confd-template/.wcqs*'` complete, forcing recheck [22:27:54] RECOVERY - Confd template for /srv/config-master/pybal/codfw/wcqs on puppetmaster1001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [22:27:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:28:04] RECOVERY - Confd template for /srv/config-master/pybal/eqiad/wcqs on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [22:28:16] Actually no need to force a recheck, it happens quick enough on its own apparently :) [22:29:12] RECOVERY - Confd template for /srv/config-master/pybal/eqiad/wcqs on puppetmaster1001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [22:29:36] RECOVERY - Confd template for /srv/config-master/pybal/codfw/wcqs on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [22:33:34] !log reedy@deploy1002 Synchronized php-1.38.0-wmf.1/extensions/SecurePoll/cli/wm-scripts/: T291668 (duration: 00m 57s) [22:33:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:33:41] T291668: Create SecurePoll election for MCDC 2021 - https://phabricator.wikimedia.org/T291668 [22:34:07] 
!log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [22:34:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:37:17] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [22:37:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:56:23] PROBLEM - Check systemd state on snapshot1008 is CRITICAL: CRITICAL - degraded: The following units failed: cirrussearch-dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:58:10] !log creating `mcdc2021_edits` table on each wiki for elections voterlist https://phabricator.wikimedia.org/T291668 [22:58:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:58:50] !log bd808@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'toolhub' for release 'main' . [22:58:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:36:34] (03PS1) 10BryanDavis: production-m5.sql.erb: Update toolhub grants [puppet] - 10https://gerrit.wikimedia.org/r/723329 (https://phabricator.wikimedia.org/T271480) [23:38:16] !log running wm-scripts/mcdc2021/populateEditCount.php on each wiki (s1 thru s8 simultaneously) https://phabricator.wikimedia.org/T291668 [23:38:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:40:22] (03PS2) 10BryanDavis: production-m5.sql.erb: Update toolhub grants [puppet] - 10https://gerrit.wikimedia.org/r/723329 (https://phabricator.wikimedia.org/T271480) [23:40:44] (03PS3) 10Cwhite: opensearch: fork elasticsearch module into opensearch module [puppet] - 10https://gerrit.wikimedia.org/r/721359 (https://phabricator.wikimedia.org/T288618) [23:46:51] (03CR) 10BryanDavis: "I'm not sure if it is nicer to have a grant for each /24 in the /21 or if you would like a more compact representation, but I think this w" [puppet] - 
10https://gerrit.wikimedia.org/r/723329 (https://phabricator.wikimedia.org/T271480) (owner: 10BryanDavis) [23:46:58] 10SRE, 10Anti-Harassment, 10IP Info, 10serviceops, 10Patch-For-Review: Update MaxMind GeoIP2 license key and product IDs for application servers - https://phabricator.wikimedia.org/T288844 (10Dzahn) fwiw: While looking at this I found we have the email alias maxmind@wikimedia and it forwards to fr-tech@w... [23:53:52] (03PS3) 10Cwhite: opensearch_dashboards: fork kibana module into opensearch_dashboards module [puppet] - 10https://gerrit.wikimedia.org/r/721385 (https://phabricator.wikimedia.org/T288618) [23:53:54] (03PS3) 10Cwhite: icinga: fork icinga::monitor::elasticsearch::base_checks [puppet] - 10https://gerrit.wikimedia.org/r/721386 (https://phabricator.wikimedia.org/T288618) [23:53:56] (03PS2) 10Cwhite: profile: fork elasticsearch profile into opensearch::server [puppet] - 10https://gerrit.wikimedia.org/r/721388 (https://phabricator.wikimedia.org/T288618) [23:53:58] (03PS3) 10Cwhite: profile: fork elasticsearch base_checks for opensearch [puppet] - 10https://gerrit.wikimedia.org/r/721389 (https://phabricator.wikimedia.org/T288618) [23:54:00] (03PS2) 10Cwhite: profile: fork kibana profile into opensearch::dashboards [puppet] - 10https://gerrit.wikimedia.org/r/721391 (https://phabricator.wikimedia.org/T288618) [23:54:02] (03PS3) 10Cwhite: profile: fork elasticsearch::logstash into opensearch::logstash [puppet] - 10https://gerrit.wikimedia.org/r/721395 (https://phabricator.wikimedia.org/T288618) [23:54:04] (03PS2) 10Cwhite: role: add logging::opensearch::collector role [puppet] - 10https://gerrit.wikimedia.org/r/721397 (https://phabricator.wikimedia.org/T288618) [23:54:06] (03PS3) 10Cwhite: role: add logging::opensearch::data role [puppet] - 10https://gerrit.wikimedia.org/r/721400 (https://phabricator.wikimedia.org/T288618)