[00:01:10] !log tgr@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:725132|Add WN as an alias to project namespace in Polish Wikinews (T291344)]] (duration: 01m 04s) [00:01:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:01:16] T291344: Request for WN namespace alias on Polish Wikinews - https://phabricator.wikimedia.org/T291344 [00:01:22] RECOVERY - Check systemd state on ms-be1028 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:01:29] legoktm: done [00:01:32] (03CR) 10Bstorm: [C: 03+2] toolforge postgres: drop database tuning [puppet] - 10https://gerrit.wikimedia.org/r/726723 (https://phabricator.wikimedia.org/T267616) (owner: 10Bstorm) [00:02:15] ty! [00:02:41] (03CR) 10Arlolra: [C: 03+2] Add a separate config for content.media.less [core] (wmf/1.38.0-wmf.3) - 10https://gerrit.wikimedia.org/r/726709 (https://phabricator.wikimedia.org/T292498) (owner: 10Arlolra) [00:02:43] (03CR) 10Arlolra: [C: 03+2] Add a separate config for content.media.less [core] (wmf/1.38.0-wmf.2) - 10https://gerrit.wikimedia.org/r/726707 (https://phabricator.wikimedia.org/T292498) (owner: 10Arlolra) [00:03:19] (03CR) 10Dzahn: "You were right, it was a little simpler that way, though not by much :) I solved it with a couple separate merges you can see in git log. 
" [puppet] - 10https://gerrit.wikimedia.org/r/725390 (owner: 10Dzahn) [00:03:47] (03Abandoned) 10Dzahn: puppetmaster/geoip: do not duplicate pulling of maxmind on all servers [puppet] - 10https://gerrit.wikimedia.org/r/725390 (owner: 10Dzahn) [00:04:50] (03PS4) 10Dzahn: mediawiki/geoip: add option to also pull new MaxMind databases from master [puppet] - 10https://gerrit.wikimedia.org/r/726094 (https://phabricator.wikimedia.org/T288844) [00:05:12] PROBLEM - Check systemd state on puppetmaster1002 is CRITICAL: CRITICAL - degraded: The following units failed: logrotate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:05:14] PROBLEM - Check systemd state on puppetmaster2002 is CRITICAL: CRITICAL - degraded: The following units failed: logrotate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:05:32] PROBLEM - Check systemd state on puppetmaster2003 is CRITICAL: CRITICAL - degraded: The following units failed: logrotate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:05:36] ah [00:05:50] that would be me. 
duplicate config in logrotate [00:06:06] PROBLEM - Check systemd state on puppetmaster1003 is CRITICAL: CRITICAL - degraded: The following units failed: logrotate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:06:16] PROBLEM - Check systemd state on puppetmaster2001 is CRITICAL: CRITICAL - degraded: The following units failed: logrotate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:06:34] PROBLEM - Check systemd state on puppetmaster1001 is CRITICAL: CRITICAL - degraded: The following units failed: logrotate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:08:51] ACKNOWLEDGEMENT - Check systemd state on puppetmaster1001 is CRITICAL: CRITICAL - degraded: The following units failed: logrotate.service daniel_zahn duplicate entry in config, fix coming https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:08:51] ACKNOWLEDGEMENT - Check systemd state on puppetmaster1002 is CRITICAL: CRITICAL - degraded: The following units failed: logrotate.service daniel_zahn duplicate entry in config, fix coming https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:08:51] ACKNOWLEDGEMENT - Check systemd state on puppetmaster1003 is CRITICAL: CRITICAL - degraded: The following units failed: logrotate.service daniel_zahn duplicate entry in config, fix coming https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:08:51] ACKNOWLEDGEMENT - Check systemd state on puppetmaster2001 is CRITICAL: CRITICAL - degraded: The following units failed: logrotate.service daniel_zahn duplicate entry in config, fix coming https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:08:51] ACKNOWLEDGEMENT - Check systemd state on puppetmaster2002 is CRITICAL: CRITICAL - degraded: The following units failed: logrotate.service daniel_zahn duplicate entry in config, fix coming https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:08:51] 
ACKNOWLEDGEMENT - Check systemd state on puppetmaster2003 is CRITICAL: CRITICAL - degraded: The following units failed: logrotate.service daniel_zahn duplicate entry in config, fix coming https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:14:54] !log puppetmaster2002 - rm /etc/logrotate.d/geoipupdate (not managed by puppet anymore but not removed, caused duplicate logrotate config, made logrotate service fail), start logrotate [00:14:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:15:14] RECOVERY - Check systemd state on puppetmaster2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:16:34] RECOVERY - Check systemd state on puppetmaster1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:16:56] !log puppetmasters: rm /etc/logrotate.d/geoipupdate && systemctl start logrotate && puppet agent -tv [00:17:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:19:08] RECOVERY - Check systemd state on puppetmaster1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:19:30] RECOVERY - Check systemd state on puppetmaster2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:20:02] RECOVERY - Check systemd state on puppetmaster1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:20:12] RECOVERY - Check systemd state on puppetmaster2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:22:21] (03CR) 10Dzahn: "I do still need this https://gerrit.wikimedia.org/r/c/operations/puppet/+/726094" [puppet] - 10https://gerrit.wikimedia.org/r/725390 (owner: 
10Dzahn) [00:22:45] (03Merged) 10jenkins-bot: Add a separate config for content.media.less [core] (wmf/1.38.0-wmf.3) - 10https://gerrit.wikimedia.org/r/726709 (https://phabricator.wikimedia.org/T292498) (owner: 10Arlolra) [00:22:54] (03CR) 10Dzahn: [V: 03+1] "https://puppet-compiler.wmflabs.org/compiler1001/31500/" [puppet] - 10https://gerrit.wikimedia.org/r/726094 (https://phabricator.wikimedia.org/T288844) (owner: 10Dzahn) [00:27:08] (03Merged) 10jenkins-bot: Add a separate config for content.media.less [core] (wmf/1.38.0-wmf.2) - 10https://gerrit.wikimedia.org/r/726707 (https://phabricator.wikimedia.org/T292498) (owner: 10Arlolra) [00:29:39] !log arlolra@deploy1002 Synchronized php-1.38.0-wmf.3/includes/DefaultSettings.php: Add a separate config for content.media.less 1/2 (duration: 01m 04s) [00:29:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:29:49] (03PS1) 10Dzahn: mcrouter: require_package -> ensure_packages [puppet] - 10https://gerrit.wikimedia.org/r/726726 (https://phabricator.wikimedia.org/T266479) [00:30:24] (03CR) 10jerkins-bot: [V: 04-1] mcrouter: require_package -> ensure_packages [puppet] - 10https://gerrit.wikimedia.org/r/726726 (https://phabricator.wikimedia.org/T266479) (owner: 10Dzahn) [00:32:33] !log arlolra@deploy1002 Synchronized php-1.38.0-wmf.3/includes/resourceloader/ResourceLoaderSkinModule.php: Add a separate config for content.media.less 2/2 (duration: 01m 03s) [00:32:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:33:26] (03PS2) 10Dzahn: mcrouter: require_package -> ensure_packages [puppet] - 10https://gerrit.wikimedia.org/r/726726 (https://phabricator.wikimedia.org/T266479) [00:34:36] PROBLEM - Check systemd state on ms-be1043 is CRITICAL: CRITICAL - degraded: The following units failed: session-207264.scope https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:35:40] !log arlolra@deploy1002 Synchronized 
php-1.38.0-wmf.2/includes/DefaultSettings.php: Add a separate config for content.media.less 1/2 (duration: 01m 03s) [00:35:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:36:32] (03PS1) 10Dzahn: statistics::web: require_package -> ensure_packages [puppet] - 10https://gerrit.wikimedia.org/r/726728 (https://phabricator.wikimedia.org/T266479) [00:37:08] !log arlolra@deploy1002 Synchronized php-1.38.0-wmf.2/includes/resourceloader/ResourceLoaderSkinModule.php: Add a separate config for content.media.less 2/2 (duration: 01m 03s) [00:37:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:38:18] arlolra: https://integration.wikimedia.org/ci/job/operations-mw-config-php72-composer-diffConfig-docker/8450/consoleFull [00:38:25] (03CR) 10Legoktm: [C: 03+1] Enable legacy media dom on metawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/726643 (https://phabricator.wikimedia.org/T292498) (owner: 10Arlolra) [00:40:27] (03PS4) 10Arlolra: Enable legacy media dom on metawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/726643 (https://phabricator.wikimedia.org/T292498) [00:41:22] (03PS5) 10Arlolra: Enable legacy media dom on metawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/726643 (https://phabricator.wikimedia.org/T292498) [00:41:42] (03PS1) 10Dzahn: dynamicproxy: replace proxydb-bak cron with systemd timer job [puppet] - 10https://gerrit.wikimedia.org/r/726729 (https://phabricator.wikimedia.org/T273673) [00:41:46] (03PS1) 10Dzahn: dynamicproxy: remove absented cron code [puppet] - 10https://gerrit.wikimedia.org/r/726730 (https://phabricator.wikimedia.org/T273673) [00:41:52] (03CR) 10Arlolra: [C: 03+2] Enable legacy media dom on metawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/726643 (https://phabricator.wikimedia.org/T292498) (owner: 10Arlolra) [00:43:09] (03Merged) 10jenkins-bot: Enable legacy media dom on metawiki [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/726643 (https://phabricator.wikimedia.org/T292498) (owner: 10Arlolra) [00:43:28] (03CR) 10Dzahn: "assuming you don't care that it runs at a specific time, as long as it is daily (24 hours after whenever the last run was)" [puppet] - 10https://gerrit.wikimedia.org/r/726729 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn) [00:44:31] (03CR) 10Dzahn: "also of course you could now, if you wanted to, have monitoring/logging/mail about failures" [puppet] - 10https://gerrit.wikimedia.org/r/726729 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn) [00:55:23] arlolra: https://gerrit.wikimedia.org/r/726731 [00:59:56] !log arlolra@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Enable legacy media dom on metawiki (duration: 01m 05s) [01:00:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:00:42] (03PS1) 10Legoktm: Bump GlobalUserPage::PARSED_CACHE_VERSION for media DOM changes [extensions/GlobalUserPage] (wmf/1.38.0-wmf.3) - 10https://gerrit.wikimedia.org/r/726600 (https://phabricator.wikimedia.org/T292498) [01:00:47] (03CR) 10Legoktm: [C: 03+2] Bump GlobalUserPage::PARSED_CACHE_VERSION for media DOM changes [extensions/GlobalUserPage] (wmf/1.38.0-wmf.3) - 10https://gerrit.wikimedia.org/r/726600 (https://phabricator.wikimedia.org/T292498) (owner: 10Legoktm) [01:00:55] (03PS1) 10Legoktm: Bump GlobalUserPage::PARSED_CACHE_VERSION for media DOM changes [extensions/GlobalUserPage] (wmf/1.38.0-wmf.2) - 10https://gerrit.wikimedia.org/r/726601 (https://phabricator.wikimedia.org/T292498) [01:00:59] (03CR) 10Legoktm: [C: 03+2] Bump GlobalUserPage::PARSED_CACHE_VERSION for media DOM changes [extensions/GlobalUserPage] (wmf/1.38.0-wmf.2) - 10https://gerrit.wikimedia.org/r/726601 (https://phabricator.wikimedia.org/T292498) (owner: 10Legoktm) [01:04:40] (03Merged) 10jenkins-bot: Bump GlobalUserPage::PARSED_CACHE_VERSION for media DOM changes [extensions/GlobalUserPage] (wmf/1.38.0-wmf.3) - 
10https://gerrit.wikimedia.org/r/726600 (https://phabricator.wikimedia.org/T292498) (owner: 10Legoktm) [01:04:48] (03Merged) 10jenkins-bot: Bump GlobalUserPage::PARSED_CACHE_VERSION for media DOM changes [extensions/GlobalUserPage] (wmf/1.38.0-wmf.2) - 10https://gerrit.wikimedia.org/r/726601 (https://phabricator.wikimedia.org/T292498) (owner: 10Legoktm) [01:12:32] !log arlolra@deploy1002 Synchronized php-1.38.0-wmf.3/extensions/GlobalUserPage/includes/GlobalUserPage.php: Bump GlobalUserPage::PARSED_CACHE_VERSION for media DOM changes (duration: 01m 17s) [01:12:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:15:12] RECOVERY - Check systemd state on ms-be1043 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:17:25] !log arlolra@deploy1002 Synchronized php-1.38.0-wmf.2/extensions/GlobalUserPage/includes/GlobalUserPage.php: Bump GlobalUserPage::PARSED_CACHE_VERSION for media DOM changes (duration: 01m 03s) [01:17:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:20:00] Is it normal for an already merged change to take a while to work? [01:20:09] Juan_90264: it depends, what's the change? 
[01:20:40] just in general, we're all done now with deploys [01:21:19] This one: https://phabricator.wikimedia.org/T292109 | https://gerrit.wikimedia.org/r/725413 [01:23:13] (IcingaOverload) firing: Checks are taking long to execute - https://grafana.wikimedia.org/d/rsCfQfuZz/icinga - https://alerts.wikimedia.org [01:23:49] (WdqsStreamingUpdaterFlinkProcessingLatencyIsHigh) firing: Processing latency of WDQS_Streaming_Updater in eqiad (k8s) is above 5 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://alerts.wikimedia.org [01:23:55] Juan_90264: I think it might take logos up to 24h to show up [01:25:27] Ok [01:26:21] It's finally working now [01:33:49] (WdqsStreamingUpdaterFlinkProcessingLatencyIsHigh) resolved: Processing latency of WDQS_Streaming_Updater in eqiad (k8s) is above 5 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://alerts.wikimedia.org [01:38:13] (IcingaOverload) resolved: Checks are taking long to execute - https://grafana.wikimedia.org/d/rsCfQfuZz/icinga - https://alerts.wikimedia.org [01:38:30] PROBLEM - Disk space on aqs1004 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra-a 112169 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=aqs1004&var-datasource=eqiad+prometheus/ops [01:39:28] great [01:39:36] I purged the cache manually for it [01:39:43] !log legoktm@mwmaint1002:~$ echo "https://en.wikiversity.org/static/images/mobile/copyright/wikiversity.svg" |mwscript purgeList.php [01:39:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:40:27] Oh so that's what helped it work, thanks [02:04:26] PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: generate_os_reports.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:11:16] (03PS1) 10Tim Starling: 
RowCommentIterator: Cast data coming out of the DB [core] (wmf/1.38.0-wmf.3) - 10https://gerrit.wikimedia.org/r/726602 (https://phabricator.wikimedia.org/T292590) [02:24:23] Hello, i returned [02:26:26] I noticed that several Phabricator users have a "User-USERNAME" project, can any administrators there create one for me? My username there is Juan90264 [02:29:49] Someone online? [02:45:24] (03CR) 10Tim Starling: [C: 03+2] RowCommentIterator: Cast data coming out of the DB [core] (wmf/1.38.0-wmf.3) - 10https://gerrit.wikimedia.org/r/726602 (https://phabricator.wikimedia.org/T292590) (owner: 10Tim Starling) [02:50:57] Juan_90264: see https://www.mediawiki.org/wiki/Phabricator/Creating_and_renaming_projects#Creating_new_projects [02:55:42] Okay [03:07:14] (03Merged) 10jenkins-bot: RowCommentIterator: Cast data coming out of the DB [core] (wmf/1.38.0-wmf.3) - 10https://gerrit.wikimedia.org/r/726602 (https://phabricator.wikimedia.org/T292590) (owner: 10Tim Starling) [03:11:53] !log tstarling@deploy1002 Synchronized php-1.38.0-wmf.3/includes/CommentFormatter/RowCommentIterator.php: fix UBN T292590 (duration: 01m 04s) [03:12:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:12:01] T292590: TypeError: Argument 1 passed to MediaWiki\CommentFormatter\CommentItem::__construct() must be of the type string, null given - https://phabricator.wikimedia.org/T292590 [03:15:38] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 103 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [03:20:20] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs1012.eqiad.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs1012.eqiad.wmnet are marked down but pooled: wdqs_80: Servers wdqs1012.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [03:20:48] PROBLEM - PyBal backends health check on lvs1015 is CRITICAL: PYBAL CRITICAL - CRITICAL - 
wdqs-heavy-queries_8888: Servers wdqs1012.eqiad.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs1012.eqiad.wmnet are marked down but pooled: wdqs_80: Servers wdqs1012.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [03:26:52] PROBLEM - HTTPS-toolserver on www.toolserver.org is CRITICAL: SSL CRITICAL - Certificate toolserver.org valid until 2021-10-09 03:00:15 +0000 (expires in 2 days) https://phabricator.wikimedia.org/tag/toolforge/ [03:28:58] RECOVERY - HTTPS-toolserver on www.toolserver.org is OK: SSL OK - Certificate toolserver.org valid until 2021-12-08 03:00:13 +0000 (expires in 62 days) https://phabricator.wikimedia.org/tag/toolforge/ [03:38:48] PROBLEM - ATS TLS has reduced HTTP availability #page on alert1001 is CRITICAL: cluster=cache_text layer=tls https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=13&fullscreen&refresh=1m&orgId=1 [03:39:54] looking [03:39:59] hi [03:40:19] seems like it's eqsin text? [03:40:43] but before that it was esams [03:40:47] looks like not just eqsin, but the others have mostly recovered https://grafana.wikimedia.org/d/000000479/frontend-traffic?orgId=1&refresh=1m&viewPanel=13&from=now-1h&to=now [03:40:48] yeah [03:40:58] PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is CRITICAL: 42.85 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [03:40:58] oh and one of those is "global" [03:41:02] so just esams eqsin yeah [03:41:33] user report in -tech also [03:43:05] RECOVERY - ATS TLS has reduced HTTP availability #page on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=13&fullscreen&refresh=1m&orgId=1 [03:43:45] that's good at least [03:45:12] RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is OK: (C)60 le (W)70 le 84.63 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [03:45:14] yeah, graphs look recovered but no idea what happened yet [03:46:40] see _security [03:46:47] 👍 [04:03:14] 10SRE, 10Observability-Metrics, 10User-fgiunchedi: Programmatic generation of grafana dashboards - https://phabricator.wikimedia.org/T171482 (10lmata) [04:06:40] 10SRE, 10Analytics, 10Event-Platform, 10Observability-Logging, and 3 others: Integrate Event Platform and ECS logs - https://phabricator.wikimedia.org/T291645 (10lmata) [04:12:22] (03PS3) 10Juan90264: Enable NewUserMessage for ptwikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/726603 (https://phabricator.wikimedia.org/T290820) [04:13:18] PROBLEM - LVS wdqs eqiad port 80/tcp - Wikidata Query Service IPv4 #page on wdqs.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [04:15:11] RECOVERY - LVS wdqs eqiad port 80/tcp - Wikidata Query Service IPv4 #page on wdqs.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.016 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [04:25:41] !log [WDQS] Repooling eqiad hosts following the brief outage from earlier: `wdqs1004`, `wdqs1006`, `wdqs1007` [04:25:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:25:47] (done) [04:27:38] !log [WDQS] `ryankemper@wdqs1012:~$ sudo systemctl restart wdqs-blazegraph` (attempting to bring downed `wdqs1012` back into health) [04:27:42] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:28:13] (IcingaOverload) firing: Checks are taking long to execute - https://grafana.wikimedia.org/d/rsCfQfuZz/icinga - https://alerts.wikimedia.org [04:29:25] !log [WDQS] `wdqs1012` is back up after restarting blazegraph (blazegraph was locked up) [04:29:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:29:48] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [04:30:14] RECOVERY - PyBal backends health check on lvs1015 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [04:37:34] (03CR) 10Ladsgroup: [C: 03+1] "Seems straightforward enough." [puppet] - 10https://gerrit.wikimedia.org/r/726729 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn) [04:38:13] (IcingaOverload) resolved: Checks are taking long to execute - https://grafana.wikimedia.org/d/rsCfQfuZz/icinga - https://alerts.wikimedia.org [04:49:22] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [04:49:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:50:08] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:51:23] (03PS1) 10Gergő Tisza: Delete gettingstarted-with-category-suggestions dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/726743 (https://phabricator.wikimedia.org/T235752) [04:51:48] (03CR) 10jerkins-bot: [V: 04-1] Delete gettingstarted-with-category-suggestions dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/726743 (https://phabricator.wikimedia.org/T235752) (owner: 10Gergő Tisza) [04:54:55] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[04:54:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:59:31] 10SRE, 10observability: Grafana share button drops duplicate URL params - https://phabricator.wikimedia.org/T292606 (10RLazarus) [05:04:51] 10SRE, 10observability: Grafana share button drops duplicate URL params - https://phabricator.wikimedia.org/T292606 (10RLazarus) [05:06:46] PROBLEM - HTTPS-toolserver on www.toolserver.org is CRITICAL: SSL CRITICAL - Certificate toolserver.org valid until 2021-10-09 03:00:15 +0000 (expires in 2 days) https://phabricator.wikimedia.org/tag/toolforge/ [05:08:52] RECOVERY - HTTPS-toolserver on www.toolserver.org is OK: SSL OK - Certificate toolserver.org valid until 2021-12-08 03:00:13 +0000 (expires in 62 days) https://phabricator.wikimedia.org/tag/toolforge/ [05:11:04] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:20:49] * kart_ is updating cxserver.. 
[05:21:06] (03PS4) 10KartikMistry: Update cxserver to use nodejs12 [deployment-charts] - 10https://gerrit.wikimedia.org/r/725866 (https://phabricator.wikimedia.org/T290754) [05:22:11] (03PS1) 10Majavah: wikimediacloud.org: Add CAA records [dns] - 10https://gerrit.wikimedia.org/r/726745 [05:25:33] (03CR) 10KartikMistry: [C: 03+2] Update cxserver to use nodejs12 [deployment-charts] - 10https://gerrit.wikimedia.org/r/725866 (https://phabricator.wikimedia.org/T290754) (owner: 10KartikMistry) [05:27:24] (03PS1) 10Ladsgroup: mediawiki: Absent wikibase_repo_prune2 systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/726746 (https://phabricator.wikimedia.org/T292604) [05:29:42] (03Merged) 10jenkins-bot: Update cxserver to use nodejs12 [deployment-charts] - 10https://gerrit.wikimedia.org/r/725866 (https://phabricator.wikimedia.org/T290754) (owner: 10KartikMistry) [05:31:07] !log kartik@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'cxserver' for release 'staging' . [05:31:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:34:24] (03CR) 10Vgutierrez: [C: 03+1] "It looks good assuming that you're not currently issuing certs with any other CA for wikimediacloud.org" [dns] - 10https://gerrit.wikimedia.org/r/726745 (owner: 10Majavah) [05:36:19] 10SRE, 10DNS, 10Traffic: Additional DNS entries for Wikilearn project (Community Development) - https://phabricator.wikimedia.org/T292537 (10Vgutierrez) 05Stalled→03In progress [05:36:25] !log start of mwscript extensions/Wikibase/repo/maintenance/pruneChanges.php --wiki wikidatawiki --number-of-days=2 [05:36:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:37:34] (03PS1) 10Vgutierrez: learn.wiki: forum related records [dns] - 10https://gerrit.wikimedia.org/r/726747 (https://phabricator.wikimedia.org/T292537) [05:39:12] !log kartik@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'cxserver' for release 'production' . 
[05:39:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:39:20] (03CR) 10Vgutierrez: [C: 03+2] learn.wiki: forum related records [dns] - 10https://gerrit.wikimedia.org/r/726747 (https://phabricator.wikimedia.org/T292537) (owner: 10Vgutierrez) [05:39:55] _joe_: hi, can I bother you with https://gerrit.wikimedia.org/r/c/operations/puppet/+/726746 ? [05:40:04] It can be safely removed now [05:40:48] <_joe_> Amir1: in a few [05:40:54] Thanks [05:43:49] 10SRE, 10DNS, 10Traffic, 10Patch-For-Review: Additional DNS entries for Wikilearn project (Community Development) - https://phabricator.wikimedia.org/T292537 (10Vgutierrez) 05In progress→03Resolved a:03Vgutierrez ` vgutierrez@carrot:~/wikimedia.org/operations/dns$ host _fbf735f01a612e98f20b40a80776ee... [05:46:46] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:47:25] !log kartik@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'cxserver' for release 'production' . 
[05:47:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:53:23] !log Updated cxserver to use nodejs12 (T290754) [05:53:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:53:29] T290754: Migrate cxserver production service to node12 - https://phabricator.wikimedia.org/T290754 [05:54:56] 10SRE, 10serviceops: Migrate node-based services in production to node12 - https://phabricator.wikimedia.org/T290750 (10KartikMistry) [06:05:24] (03PS7) 10Vgutierrez: cache::haproxy: Manage request/response headers [puppet] - 10https://gerrit.wikimedia.org/r/720274 (https://phabricator.wikimedia.org/T290005) [06:05:43] (03CR) 10Vgutierrez: cache::haproxy: Manage request/response headers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/720274 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [06:15:36] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "LGTM, I am not sure if we might want to make all the http2 protocol options mandatory once you define one though." [puppet] - 10https://gerrit.wikimedia.org/r/714381 (https://phabricator.wikimedia.org/T271421) (owner: 10Vgutierrez) [06:15:38] PROBLEM - HTTPS-toolserver on www.toolserver.org is CRITICAL: SSL CRITICAL - Certificate toolserver.org valid until 2021-10-09 03:00:15 +0000 (expires in 2 days) https://phabricator.wikimedia.org/tag/toolforge/ [06:17:44] RECOVERY - HTTPS-toolserver on www.toolserver.org is OK: SSL OK - Certificate toolserver.org valid until 2021-12-08 03:00:13 +0000 (expires in 62 days) https://phabricator.wikimedia.org/tag/toolforge/ [06:18:36] (03CR) 10Giuseppe Lavagetto: [C: 03+1] envoyproxy: Support PreserveCase HeaderKeyFormat [puppet] - 10https://gerrit.wikimedia.org/r/713460 (https://phabricator.wikimedia.org/T271421) (owner: 10Vgutierrez) [06:19:29] <_joe_> this thing with toolserver is happening repeatedly [06:19:39] <_joe_> it looks like one server has an old cert? 
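(editor's note) The HTTPS-toolserver alerts in this log flip between two different "valid until" timestamps, which is what prompts the "old cert" guess above. As a rough sketch of the arithmetic behind the "expires in N days" figures, the snippet below subtracts the check time from the expiry time and keeps whole days. The calendar date 2021-10-06 is an assumption inferred from the deploy calendar item ID (`20211006T1100`) that jouncebot prints later in the log, the function name `days_until_expiry` is hypothetical, and truncating to whole days is a guess at how the check rounds, not its confirmed behaviour.

```python
from datetime import datetime

TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S %z"


def days_until_expiry(not_after: str, checked_at: str) -> int:
    """Return the whole days between a check time and a certificate expiry.

    Both arguments use the timestamp layout seen in the alert text,
    for example "2021-10-09 03:00:15 +0000".
    """
    expiry = datetime.strptime(not_after, TIMESTAMP_FORMAT)
    checked = datetime.strptime(checked_at, TIMESTAMP_FORMAT)
    # timedelta.days drops the partial day, so 2 days 20 hours becomes 2.
    return (expiry - checked).days


# Expiry values copied from the 06:15:38 PROBLEM and 06:17:44 RECOVERY alerts.
# The date part of the check times (2021-10-06) is assumed, not logged.
old_cert_days = days_until_expiry(
    "2021-10-09 03:00:15 +0000", "2021-10-06 06:15:38 +0000"
)
new_cert_days = days_until_expiry(
    "2021-12-08 03:00:13 +0000", "2021-10-06 06:17:44 +0000"
)

print(old_cert_days, new_cert_days)
```

Under that date assumption the two results are 2 and 62, the same figures the alerts report, so the flapping is consistent with two distinct certificates being served alternately rather than with a clock or rounding problem.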
[06:21:03] (03CR) 10Filippo Giunchedi: [C: 03+2] graphite: add Bullseye version of graphite auth/index [puppet] - 10https://gerrit.wikimedia.org/r/726613 (https://phabricator.wikimedia.org/T247963) (owner: 10Filippo Giunchedi) [06:26:08] PROBLEM - HTTPS-toolserver on www.toolserver.org is CRITICAL: SSL CRITICAL - Certificate toolserver.org valid until 2021-10-09 03:00:15 +0000 (expires in 2 days) https://phabricator.wikimedia.org/tag/toolforge/ [06:27:33] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "the code lgtm, but I am unsure if we want to throw an error if someone tries to set a value on a v2 configuration." [puppet] - 10https://gerrit.wikimedia.org/r/714039 (https://phabricator.wikimedia.org/T271421) (owner: 10Vgutierrez) [06:28:04] RECOVERY - HTTPS-toolserver on www.toolserver.org is OK: SSL OK - Certificate toolserver.org valid until 2021-12-08 03:00:13 +0000 (expires in 62 days) https://phabricator.wikimedia.org/tag/toolforge/ [06:28:24] (03PS2) 10Gergő Tisza: Delete gettingstarted-with-category-suggestions dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/726743 (https://phabricator.wikimedia.org/T235752) [06:31:25] (03CR) 10Giuseppe Lavagetto: [C: 03+1] envoyproxy: Allow setting per_connection_buffer_limit_bytes [puppet] - 10https://gerrit.wikimedia.org/r/714379 (https://phabricator.wikimedia.org/T271421) (owner: 10Vgutierrez) [06:32:35] (03CR) 10Giuseppe Lavagetto: [C: 03+1] envoyproxy: Add downstream idle_timeout config option [puppet] - 10https://gerrit.wikimedia.org/r/714380 (https://phabricator.wikimedia.org/T271421) (owner: 10Vgutierrez) [06:34:02] PROBLEM - HTTPS-toolserver on www.toolserver.org is CRITICAL: SSL CRITICAL - Certificate toolserver.org valid until 2021-10-09 03:00:15 +0000 (expires in 2 days) https://phabricator.wikimedia.org/tag/toolforge/ [06:36:00] RECOVERY - HTTPS-toolserver on www.toolserver.org is OK: SSL OK - Certificate toolserver.org valid until 2021-12-08 03:00:13 +0000 (expires in 62 days) 
https://phabricator.wikimedia.org/tag/toolforge/ [06:41:40] (03PS3) 10Filippo Giunchedi: pontoon: auto generate service certificates [puppet] - 10https://gerrit.wikimedia.org/r/725838 [06:41:42] (03PS1) 10Filippo Giunchedi: pontoon: use graphite-04 in o11y stack [puppet] - 10https://gerrit.wikimedia.org/r/726750 (https://phabricator.wikimedia.org/T247963) [06:43:01] (03PS2) 10Filippo Giunchedi: pontoon: use graphite-04 in o11y stack [puppet] - 10https://gerrit.wikimedia.org/r/726750 (https://phabricator.wikimedia.org/T247963) [06:43:42] (03CR) 10Filippo Giunchedi: [C: 03+2] pontoon: use graphite-04 in o11y stack [puppet] - 10https://gerrit.wikimedia.org/r/726750 (https://phabricator.wikimedia.org/T247963) (owner: 10Filippo Giunchedi) [06:43:51] _joe_: afaik toolserver.org is only backed by one server, so not sure what's going on [06:54:46] I restarted apache2 on that vm, let's see if it alerts again [07:14:34] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10serviceops-radar: SVC DNS zonefiles and source of truth - https://phabricator.wikimedia.org/T270071 (10ayounsi) About the Ganeti, as I understand it, the issue is that they have .svc. in their fqdn, while not an SVC IP. I agree that changing the IP to so... 
[07:17:55] (CR) Muehlenhoff: [C: +2] owners.yaml: Update a few entries [puppet] - https://gerrit.wikimedia.org/r/726657 (owner: Muehlenhoff) [07:19:29] SRE, LDAP-Access-Requests: Grant Access to (some Superset dashboards) for - https://phabricator.wikimedia.org/T292575 (Aklapper) [07:24:26] (PS1) Elukey: prometheus-amd-rocm-stats.py: support ROCm 4.2.0's smi output [puppet] - https://gerrit.wikimedia.org/r/726759 (https://phabricator.wikimedia.org/T287267) [07:26:09] (CR) Elukey: [C: +2] prometheus-amd-rocm-stats.py: support ROCm 4.2.0's smi output [puppet] - https://gerrit.wikimedia.org/r/726759 (https://phabricator.wikimedia.org/T287267) (owner: Elukey) [07:31:42] (PS1) JMeybohm: Enable NamespaceDefaultLabelName for main clusters [deployment-charts] - https://gerrit.wikimedia.org/r/726846 (https://phabricator.wikimedia.org/T290476) [07:33:07] jayme: \o/ [07:33:33] elukey: I see you don't have that enabled in ML. Do you plan to? [07:33:59] jayme: yes I wanted to wait for your input before proceeding, I can follow up anytime :) [07:35:10] elukey: okay.
I'll let you know how it went [07:37:27] <_joe_> majavah: yeah looks like instance-toolserver-proxy-01.tools.wmflabs.org [07:40:07] <_joe_> and indeed, the cert expires in december [07:46:25] (PS2) Majavah: wikimediacloud.org: Add CAA records [dns] - https://gerrit.wikimedia.org/r/726745 [07:50:29] jouncebot: nowandnext [07:50:30] No deployments scheduled for the next 3 hour(s) and 9 minute(s) [07:50:30] In 3 hour(s) and 9 minute(s): European mid-day backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211006T1100) [07:51:16] I'll do the fix I described at https://phabricator.wikimedia.org/T291344#7404686 now [07:51:31] !log Staging at mwdebug1001 for T291344 [07:51:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:51:39] T291344: Request for WN namespace alias on Polish Wikinews - https://phabricator.wikimedia.org/T291344 [07:53:41] SRE, Infrastructure-Foundations, netbox: Evaluate Nautobot fork of Netbox and decide whether to use. - https://phabricator.wikimedia.org/T288515 (Majavah) [07:53:43] (CR) Hashar: "Adding Reedy for the CSP configuration tuning."
[gitlab-ansible] - https://gerrit.wikimedia.org/r/725900 (https://phabricator.wikimedia.org/T285363) (owner: Hashar) [07:53:49] (PS4) Hashar: Enable Content-Security-Policy reporting [gitlab-ansible] - https://gerrit.wikimedia.org/r/725900 (https://phabricator.wikimedia.org/T285363) [07:55:21] !log mwdebug1001: scap pull (T291344 fix done) [07:55:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:56:06] !log [urbanecm@mwmaint1002 ~]$ mwscript namespaceDupes.php plwikinews # T291344 [07:56:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:56:14] !log [urbanecm@mwmaint1002 ~]$ mwscript namespaceDupes.php plwikinews --fix # T291344 [07:56:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:57:00] (CR) Volans: "Code LGTM, couple of nits for the tests and one for the documentation, sorry for not having mentioned it earlier." [software/spicerack] - https://gerrit.wikimedia.org/r/723214 (https://phabricator.wikimedia.org/T276469) (owner: ZPapierski) [07:58:05] (CR) JMeybohm: Rename main cluster to services (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/725003 (owner: Alexandros Kosiaris) [07:59:38] (CR) JMeybohm: [C: +1] hier::common::deployment_server add environment helmfile-defaults [puppet] - https://gerrit.wikimedia.org/r/721373 (https://phabricator.wikimedia.org/T251305) (owner: Jelto) [08:01:02] (PS2) JMeybohm: hiera:kubernetes:deployment_server add deploy users for helm3 [puppet] - https://gerrit.wikimedia.org/r/725014 (https://phabricator.wikimedia.org/T251305) (owner: Jelto) [08:03:19] (CR) JMeybohm: [V: +1] "PCC SUCCESS (NOOP 2 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31501/console" [puppet] - https://gerrit.wikimedia.org/r/725014 (https://phabricator.wikimedia.org/T251305) (owner: Jelto) [08:04:38] !log volans@cumin2002 START - Cookbook
sre.experimental.reimage for host sretest1001.eqiad.wmnet [08:04:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:12:01] (CR) JMeybohm: [V: +1 C: -1] "This requires the users to be added to profile::kubernetes::master::infrastructure_users in labs/private and private." [puppet] - https://gerrit.wikimedia.org/r/725014 (https://phabricator.wikimedia.org/T251305) (owner: Jelto) [08:21:39] !log add ROAs for 185.15.58.0/24 and 185.15.59.0/24 - T288505 - T283050 [08:21:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:21:46] T283050: drmrs: network configuration - https://phabricator.wikimedia.org/T283050 [08:29:35] !log volans@cumin2002 END (PASS) - Cookbook sre.experimental.reimage (exit_code=0) for host sretest1001.eqiad.wmnet [08:29:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:32:50] (Traffic bill over quota) firing: (2) Traffic bill over quota - https://alerts.wikimedia.org [08:37:27] (PS1) Muehlenhoff: Don't include rsync::server for absented rsync modules [puppet] - https://gerrit.wikimedia.org/r/726851 [08:39:05] (CR) jerkins-bot: [V: -1] Don't include rsync::server for absented rsync modules [puppet] - https://gerrit.wikimedia.org/r/726851 (owner: Muehlenhoff) [08:43:11] (CR) Jelto: [C: +2] hier::common::deployment_server add environment helmfile-defaults [puppet] - https://gerrit.wikimedia.org/r/721373 (https://phabricator.wikimedia.org/T251305) (owner: Jelto) [08:43:36] (PS3) Jelto: hier::common::deployment_server add environment helmfile-defaults [puppet] - https://gerrit.wikimedia.org/r/721373 (https://phabricator.wikimedia.org/T251305) [08:43:46] (PS1) Phedenskog: Eventlogging: Remove unused RUM Speed Index.
[puppet] - https://gerrit.wikimedia.org/r/726852 (https://phabricator.wikimedia.org/T286700) [08:45:25] jouncebot: nowandnext [08:45:25] No deployments scheduled for the next 2 hour(s) and 14 minute(s) [08:45:25] In 2 hour(s) and 14 minute(s): European mid-day backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211006T1100) [08:45:34] (CR) Ladsgroup: [C: +2] Don't fail job if subscribed wiki is unknown [extensions/Wikibase] (wmf/1.38.0-wmf.2) - https://gerrit.wikimedia.org/r/725923 (https://phabricator.wikimedia.org/T292446) (owner: Ladsgroup) [08:51:37] (PS2) Muehlenhoff: Don't include rsync::server for absented rsync modules [puppet] - https://gerrit.wikimedia.org/r/726851 [08:52:44] (PS1) Phedenskog: Remove unused RUM-SpeedIndex. [mediawiki-config] - https://gerrit.wikimedia.org/r/726854 (https://phabricator.wikimedia.org/T286700) [08:52:49] (Traffic bill over quota) resolved: (2) Traffic bill over quota - https://alerts.wikimedia.org [08:53:29] (CR) Ladsgroup: [C: -1] "blocked on T292609. We possibly can make it once a day instead." [puppet] - https://gerrit.wikimedia.org/r/726746 (https://phabricator.wikimedia.org/T292604) (owner: Ladsgroup) [08:59:53] (PS1) Jcrespo: mariadb: Add easy-to-use wrapper for pt-kill [puppet] - https://gerrit.wikimedia.org/r/726857 [09:00:17] SRE, SRE-tools, Infrastructure-Foundations, serviceops-radar: SVC DNS zonefiles and source of truth - https://phabricator.wikimedia.org/T270071 (akosiaris) >>! In T270071#7404620, @ayounsi wrote: > About the Ganeti, as I understand it, the issue is that they have .svc. in their fqdn, while not an... [09:00:23] (PS5) Hashar: Enable Content-Security-Policy reporting [gitlab-ansible] - https://gerrit.wikimedia.org/r/725900 (https://phabricator.wikimedia.org/T285363) [09:00:45] (CR) Hashar: "I have adjusted the Lucene query from "source" to "source.keyword"."
[gitlab-ansible] - https://gerrit.wikimedia.org/r/725900 (https://phabricator.wikimedia.org/T285363) (owner: Hashar) [09:00:54] (CR) Jbond: [C: +1] admin: update lmata ssh key [puppet] - https://gerrit.wikimedia.org/r/726682 (https://phabricator.wikimedia.org/T292583) (owner: Herron) [09:02:01] (PS2) Jcrespo: mariadb: Add easy-to-use wrapper for pt-kill [puppet] - https://gerrit.wikimedia.org/r/726857 [09:03:34] (CR) jerkins-bot: [V: -1] mariadb: Add easy-to-use wrapper for pt-kill [puppet] - https://gerrit.wikimedia.org/r/726857 (owner: Jcrespo) [09:04:24] (CR) Jbond: [C: +1] mediawiki/geoip: add option to also pull new MaxMind databases from master (1 comment) [puppet] - https://gerrit.wikimedia.org/r/726094 (https://phabricator.wikimedia.org/T288844) (owner: Dzahn) [09:05:23] (CR) Jcrespo: "This will need DBA review and coordination (they are out today), but I gave it a first shot based on the outage feedback given,- simplifyi" [puppet] - https://gerrit.wikimedia.org/r/726857 (owner: Jcrespo) [09:06:06] (Merged) jenkins-bot: Don't fail job if subscribed wiki is unknown [extensions/Wikibase] (wmf/1.38.0-wmf.2) - https://gerrit.wikimedia.org/r/725923 (https://phabricator.wikimedia.org/T292446) (owner: Ladsgroup) [09:06:28] (PS3) Jcrespo: mariadb: Add easy-to-use wrapper for pt-kill [puppet] - https://gerrit.wikimedia.org/r/726857 [09:09:29] (PS1) Muehlenhoff: Add grafana import hook to pick the latest Grafana 7.x version [puppet] - https://gerrit.wikimedia.org/r/726858 (https://phabricator.wikimedia.org/T282863) [09:09:47] (CR) Vgutierrez: envoyproxy: Allow setting http2 protocol options (1 comment) [puppet] - https://gerrit.wikimedia.org/r/714381 (https://phabricator.wikimedia.org/T271421) (owner: Vgutierrez) [09:10:15] (PS3) Muehlenhoff: Don't include rsync::server for absented rsync modules [puppet] - https://gerrit.wikimedia.org/r/726851 [09:10:27] (PS2) Muehlenhoff: Add
grafana import hook to pick the latest Grafana 7.x version [puppet] - https://gerrit.wikimedia.org/r/726858 (https://phabricator.wikimedia.org/T282863) [09:13:10] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [09:13:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:13:37] !log ladsgroup@deploy1002 Synchronized php-1.38.0-wmf.2/extensions/Wikibase/repo/includes/ChangeModification/DispatchChangesJob.php: Backport: [[gerrit:725923|Don't fail job if subscribed wiki is unknown (T292446 T292440)]] (duration: 01m 15s) [09:13:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:13:45] T292446: Change Dispatching should not stumble over mowiki redirecting to rowiki - https://phabricator.wikimedia.org/T292446 [09:13:45] T292440: huwikinews is closed, but still subscribed to some items - https://phabricator.wikimedia.org/T292440 [09:13:52] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/726851 (owner: Muehlenhoff) [09:13:54] (CR) Jbond: [C: +2] facter: interface_primary consider tokenize slaac addresses [puppet] - https://gerrit.wikimedia.org/r/726625 (owner: Jbond) [09:14:03] (CR) David Caro: [C: +1] Add grafana import hook to pick the latest Grafana 7.x version [puppet] - https://gerrit.wikimedia.org/r/726858 (https://phabricator.wikimedia.org/T282863) (owner: Muehlenhoff) [09:15:05] (CR) Filippo Giunchedi: [C: +1] Add grafana import hook to pick the latest Grafana 7.x version [puppet] - https://gerrit.wikimedia.org/r/726858 (https://phabricator.wikimedia.org/T282863) (owner: Muehlenhoff) [09:15:39] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[09:15:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:15:59] (CR) Alexandros Kosiaris: Rename main cluster to services (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/725003 (owner: Alexandros Kosiaris) [09:16:09] (CR) Muehlenhoff: [C: +2] Add grafana import hook to pick the latest Grafana 7.x version [puppet] - https://gerrit.wikimedia.org/r/726858 (https://phabricator.wikimedia.org/T282863) (owner: Muehlenhoff) [09:16:58] jbond: shall I puppet-merge your slaac patch along? [09:17:07] (CR) Alexandros Kosiaris: Rename main cluster to services (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/725003 (owner: Alexandros Kosiaris) [09:17:23] (CR) Filippo Giunchedi: [C: +1] Add logstash-output-opensearch plugin [software/logstash/plugins] - https://gerrit.wikimedia.org/r/713713 (owner: Cwhite) [09:17:46] (CR) Filippo Giunchedi: [C: +1] logstash: test moving the k8s parsing to earlier in the pipeline [puppet] - https://gerrit.wikimedia.org/r/726671 (https://phabricator.wikimedia.org/T292099) (owner: Cwhite) [09:19:05] !log update ipaddress6 fact - https://gerrit.wikimedia.org/r/c/operations/puppet/+/726625 [09:19:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:19:21] moritzm: happy for me to merge yours [09:19:29] ack, sure go ahead [09:19:44] merged [09:20:37] ack, thx [09:23:59] SRE, observability: Grafana share button drops duplicate URL params - https://phabricator.wikimedia.org/T292606 (fgiunchedi) Thanks for the report -- I'm assuming this is a new bug post-upgrade of Grafana yesterday (cc @colewhite). I'll try with latest upstream too in a test instance and see if I can rep...
[09:24:08] jouncebot: next [09:24:08] In 1 hour(s) and 35 minute(s): European mid-day backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211006T1100) [09:25:56] PROBLEM - Check systemd state on search-loader2001 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_mjolnir-kafka-bulk-daemon.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:32:34] SRE, observability: Grafana share button drops duplicate URL params - https://phabricator.wikimedia.org/T292606 (fgiunchedi) I can't reproduce the bug on 8.1.6 on https://grafana.monitoring.wmflabs.org/d/myRmf1Pik/varnish-aggregate-client-status-codes?orgId=1&var-site=&var-cache_type=&var-status_type=1&v... [09:33:58] (PS1) Muehlenhoff: Pass --exact-match to the grafana import hook [puppet] - https://gerrit.wikimedia.org/r/726860 [09:34:09] (PS2) Muehlenhoff: Pass --exact-match to the grafana import hook [puppet] - https://gerrit.wikimedia.org/r/726860 [09:35:08] (CR) Filippo Giunchedi: [C: +1] Pass --exact-match to the grafana import hook [puppet] - https://gerrit.wikimedia.org/r/726860 (owner: Muehlenhoff) [09:35:48] (PS1) Ayounsi: esams/knams, advertise 185.15.59.0/24 instead of 185.15.58.0/23 [homer/public] - https://gerrit.wikimedia.org/r/726861 (https://phabricator.wikimedia.org/T288505) [09:36:18] (CR) Muehlenhoff: [C: +2] Pass --exact-match to the grafana import hook [puppet] - https://gerrit.wikimedia.org/r/726860 (owner: Muehlenhoff) [09:36:35] (CR) jerkins-bot: [V: -1] esams/knams, advertise 185.15.59.0/24 instead of 185.15.58.0/23 [homer/public] - https://gerrit.wikimedia.org/r/726861 (https://phabricator.wikimedia.org/T288505) (owner: Ayounsi) [09:40:39] (CR) Ayounsi: "Example diff on cr3-esams:" [homer/public] - https://gerrit.wikimedia.org/r/726861 (https://phabricator.wikimedia.org/T288505) (owner: Ayounsi) [09:43:28] (CR) Ayounsi: "The Jenkins error seems
like a jsonschema bug." [homer/public] - https://gerrit.wikimedia.org/r/726861 (https://phabricator.wikimedia.org/T288505) (owner: Ayounsi) [09:46:25] (PS1) Jelto: hiera::role::common::kubernetes add helm3 deploy users [labs/private] - https://gerrit.wikimedia.org/r/726862 (https://phabricator.wikimedia.org/T251305) [09:54:26] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host copernicium.wikimedia.org [09:54:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:55:03] (CR) MVernon: "Hi," [puppet] - https://gerrit.wikimedia.org/r/726857 (owner: Jcrespo) [09:57:46] jouncebot: nowandnext [09:57:46] No deployments scheduled for the next 1 hour(s) and 2 minute(s) [09:57:46] In 1 hour(s) and 2 minute(s): European mid-day backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211006T1100) [09:59:13] (CR) Urbanecm: [C: +2] Delete gettingstarted-with-category-suggestions dblist [mediawiki-config] - https://gerrit.wikimedia.org/r/726743 (https://phabricator.wikimedia.org/T235752) (owner: Gergő Tisza) [09:59:38] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host copernicium.wikimedia.org [09:59:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:59:55] (Merged) jenkins-bot: Delete gettingstarted-with-category-suggestions dblist [mediawiki-config] - https://gerrit.wikimedia.org/r/726743 (https://phabricator.wikimedia.org/T235752) (owner: Gergő Tisza) [10:01:51] !log urbanecm@deploy1002 Synchronized dblists/: 01633739462f3bf09ae4e50b955454921ea4fbf9: Delete gettingstarted-with-category-suggestions dblist (T235752; 1/2) (duration: 01m 04s) [10:01:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:01:57] T235752: Undeploy the GettingStarted extension - https://phabricator.wikimedia.org/T235752 [10:04:21] !log urbanecm@deploy1002 Synchronized wmf-config/:
01633739462f3bf09ae4e50b955454921ea4fbf9: Delete gettingstarted-with-category-suggestions dblist (T235752; 2/2) (duration: 01m 05s) [10:04:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:04:27] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [10:04:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:06:57] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [10:07:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:13:08] (PS9) Rishabhbhat: Add $wgSitename and $wgMetaNamespace for kswiki and kswiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/720320 (https://phabricator.wikimedia.org/T289752) [10:15:28] (PS1) Elukey: Upgrade stat100[5,8] to ROCm 4.2 [puppet] - https://gerrit.wikimedia.org/r/726864 (https://phabricator.wikimedia.org/T287267) [10:16:27] (CR) Elukey: [C: +2] Upgrade stat100[5,8] to ROCm 4.2 [puppet] - https://gerrit.wikimedia.org/r/726864 (https://phabricator.wikimedia.org/T287267) (owner: Elukey) [10:18:45] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ldap-replica1003.wikimedia.org [10:18:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:20:59] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-replica1003.wikimedia.org [10:21:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:23:24] PROBLEM - Check systemd state on stat1008 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_amd_rocm_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:24:09] (PS1) Volans: dhcp: use IP address instead of DNS name [software/spicerack] - https://gerrit.wikimedia.org/r/726867 [10:25:24] RECOVERY - Check systemd state on
stat1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:29:53] (CR) Ayounsi: [C: +1] dhcp: use IP address instead of DNS name [software/spicerack] - https://gerrit.wikimedia.org/r/726867 (owner: Volans) [10:30:18] (CR) JMeybohm: [C: +1] hiera::role::common::kubernetes add helm3 deploy users [labs/private] - https://gerrit.wikimedia.org/r/726862 (https://phabricator.wikimedia.org/T251305) (owner: Jelto) [10:31:04] (CR) Cathal Mooney: [C: +1] "LGTM. Let me know if you need another +1 when the CI issue is fixed." [homer/public] - https://gerrit.wikimedia.org/r/726861 (https://phabricator.wikimedia.org/T288505) (owner: Ayounsi) [10:34:48] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:50:05] !log disable puppet on gitlab1001 to test puppetized code on GitLab replica - T283076 [10:50:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:50:12] T283076: Puppetise gitlab-ansible playbook - https://phabricator.wikimedia.org/T283076 [10:51:41] (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/726851 (owner: Muehlenhoff) [10:52:39] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2025.codfw.wmnet [10:52:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:54:16] (CR) Jelto: [V: +1 C: +2] profile::gitlab start using gitlab module [puppet] - https://gerrit.wikimedia.org/r/724430 (https://phabricator.wikimedia.org/T283076) (owner: Jelto) [10:54:23] (CR) Jbond: [C: +1] dhcp: use IP address instead of DNS name [software/spicerack] - https://gerrit.wikimedia.org/r/726867 (owner: Volans) [10:58:44] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host
ganeti2025.codfw.wmnet [10:58:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:00:05] Amir1, Lucas_WMDE, awight, and Urbanecm: How many deployers does it take to do European mid-day backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211006T1100). [11:00:05] Juan_90264: A patch you scheduled for European mid-day backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:11] o/ [11:00:37] jouncebot: Answer: oof ouch my thumb hurts [11:00:40] * urbanecm doesn't see Juan [11:02:20] appears to have on-wiki config [11:02:32] (CR) Volans: "Just minor/nits inline" [cookbooks] - https://gerrit.wikimedia.org/r/657102 (https://phabricator.wikimedia.org/T284079) (owner: Jbond) [11:02:46] (CR) Urbanecm: [C: +1] "LGTM, wasn't deployed due to scheduling developer not present" [mediawiki-config] - https://gerrit.wikimedia.org/r/726603 (https://phabricator.wikimedia.org/T290820) (owner: Juan90264) [11:02:51] it’s a bit unusual that the discussion for enabling it happened on a user talk page [11:02:57] but it looks like the wiki only has two main active users [11:03:01] so I’d say it’s okay [11:03:08] yeah, likely [11:03:17] I'd still like to have Juan though, as a principle [11:03:21] yeah [11:03:32] (CR) Lucas Werkmeister (WMDE): [C: +1] "LGTM too" [mediawiki-config] - https://gerrit.wikimedia.org/r/726603 (https://phabricator.wikimedia.org/T290820) (owner: Juan90264) [11:03:55] did leave 'em a message the other day [11:04:09] and they successfully requested deployment in the evening B&C [11:06:03] (CR) Volans: [C: +2] dhcp: use IP address instead of DNS name [software/spicerack] - https://gerrit.wikimedia.org/r/726867 (owner: Volans) [11:10:06] (PS3) Majavah: acme_chief: add openstack certs [puppet] - https://gerrit.wikimedia.org/r/726585
(https://phabricator.wikimedia.org/T267194) [11:13:19] (Merged) jenkins-bot: dhcp: use IP address instead of DNS name [software/spicerack] - https://gerrit.wikimedia.org/r/726867 (owner: Volans) [11:17:58] (PS1) Jbond: stdlib: update stdlib from version 7.0.1 to 8.1.0 [puppet] - https://gerrit.wikimedia.org/r/726872 (https://phabricator.wikimedia.org/T264276) [11:32:53] (PS1) Jgiannelos: tegola-vector-tiles: Bump image to latest version [deployment-charts] - https://gerrit.wikimedia.org/r/726875 [11:36:39] (CR) Arturo Borrero Gonzalez: [C: +2] wikimediacloud.org: Add CAA records [dns] - https://gerrit.wikimedia.org/r/726745 (owner: Majavah) [11:38:20] (PS2) Jgiannelos: tegola-vector-tiles: Bump image to latest version [deployment-charts] - https://gerrit.wikimedia.org/r/726875 [11:39:00] (CR) Arturo Borrero Gonzalez: "The DNS patch was merged. If @valentin +1 this I can take care of merging this." [puppet] - https://gerrit.wikimedia.org/r/726585 (https://phabricator.wikimedia.org/T267194) (owner: Majavah) [11:39:24] (CR) Jgiannelos: [C: +2] tegola-vector-tiles: Bump image to latest version [deployment-charts] - https://gerrit.wikimedia.org/r/726875 (owner: Jgiannelos) [11:42:13] (PS31) Jbond: sre: convert the generic reboot functions to the cookbook class API [cookbooks] - https://gerrit.wikimedia.org/r/657102 (https://phabricator.wikimedia.org/T284079) [11:42:15] (PS1) Volans: CHANGELOG: add changelogs for release v1.0.4 [software/spicerack] - https://gerrit.wikimedia.org/r/726876 [11:42:17] (CR) Jbond: "updated thanks" [cookbooks] - https://gerrit.wikimedia.org/r/657102 (https://phabricator.wikimedia.org/T284079) (owner: Jbond) [11:43:32] (CR) Volans: [C: +2] CHANGELOG: add changelogs for release v1.0.4 [software/spicerack] - https://gerrit.wikimedia.org/r/726876 (owner: Volans) [11:43:39] (Merged) jenkins-bot: tegola-vector-tiles: Bump image to latest version
[deployment-charts] - https://gerrit.wikimedia.org/r/726875 (owner: Jgiannelos) [11:46:42] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts testvm2003.codfw.wmnet [11:46:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:47:42] (PS2) Jbond: stdlib: update stdlib from version 7.0.1 to 8.1.0 [puppet] - https://gerrit.wikimedia.org/r/726872 (https://phabricator.wikimedia.org/T264276) [11:47:44] (PS1) Jbond: sslcert: switch to using ensure_packages [puppet] - https://gerrit.wikimedia.org/r/726878 (https://phabricator.wikimedia.org/T264276) [11:48:42] (CR) Volans: [C: +1] "Almost there, really, last nits but LGTM to start testing it. No need for another pass." [cookbooks] - https://gerrit.wikimedia.org/r/657102 (https://phabricator.wikimedia.org/T284079) (owner: Jbond) [11:48:46] (CR) Jbond: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31503/console" [puppet] - https://gerrit.wikimedia.org/r/726878 (https://phabricator.wikimedia.org/T264276) (owner: Jbond) [11:49:21] SRE, LDAP-Access-Requests: Grant Access to (some Superset dashboards) for - https://phabricator.wikimedia.org/T292575 (LSobanski) p:Triage→Medium This is likely a case for https://wikitech.wikimedia.org/wiki/Analytics/Data_access#Dashboards_in_Superset_/_Hive_interfaces_(like...
[11:49:51] (CR) Jgiannelos: [C: +2] tegola-vector-tiles: Enable pregeneration cronjob to all envs [deployment-charts] - https://gerrit.wikimedia.org/r/726562 (owner: Jgiannelos) [11:49:59] (PS2) Jgiannelos: tegola-vector-tiles: Enable pregeneration cronjob to all envs [deployment-charts] - https://gerrit.wikimedia.org/r/726562 [11:53:28] (PS32) Jbond: sre: convert the generic reboot functions to the cookbook class API [cookbooks] - https://gerrit.wikimedia.org/r/657102 (https://phabricator.wikimedia.org/T284079) [11:53:42] (CR) Jbond: "thanks, fixed" [cookbooks] - https://gerrit.wikimedia.org/r/657102 (https://phabricator.wikimedia.org/T284079) (owner: Jbond) [11:54:23] (CR) Jbond: [V: +1 C: +2] sslcert: switch to using ensure_packages [puppet] - https://gerrit.wikimedia.org/r/726878 (https://phabricator.wikimedia.org/T264276) (owner: Jbond) [11:55:32] !log esams - Advertise 185.15.59.0/24 instead of 185.15.58.0/23 - T288505 - T283050 [11:55:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:55:40] T283050: drmrs: network configuration - https://phabricator.wikimedia.org/T283050 [11:56:06] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts testvm2003.codfw.wmnet [11:56:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:56:13] SRE, Infrastructure-Foundations, Patch-For-Review: Create Ganeti test cluster - https://phabricator.wikimedia.org/T286206 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `testvm2003.codfw.wmnet` - testvm2003.codfw.wmnet (**PASS**) - Downtimed host on Icing...
[11:56:21] (CR) Jbond: "this will generate a large diff due to https://github.com/puppetlabs/puppetlabs-stdlib/pull/1196" [puppet] - https://gerrit.wikimedia.org/r/726872 (https://phabricator.wikimedia.org/T264276) (owner: Jbond) [11:56:28] (CR) Ema: [C: +2] ATS: make backend ram cache size configurable [puppet] - https://gerrit.wikimedia.org/r/726607 (https://phabricator.wikimedia.org/T286502) (owner: Ema) [11:57:45] (CR) Ayounsi: [V: +2 C: +2] esams/knams, advertise 185.15.59.0/24 instead of 185.15.58.0/23 [homer/public] - https://gerrit.wikimedia.org/r/726861 (https://phabricator.wikimedia.org/T288505) (owner: Ayounsi) [11:59:08] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [11:59:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:59:31] (PS3) Jbond: stdlib: update stdlib from version 7.0.1 to 8.1.0 [puppet] - https://gerrit.wikimedia.org/r/726872 (https://phabricator.wikimedia.org/T264276) [11:59:33] (PS1) Jbond: r_lang: fix use of require_packages [puppet] - https://gerrit.wikimedia.org/r/726882 [12:02:41] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:02:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:06:49] (PS1) Urbanecm: Deploy Growth mentor dashboard to pilot wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/726884 (https://phabricator.wikimedia.org/T278920) [12:07:10] (CR) Urbanecm: [C: -2] "DNM (for now)" [mediawiki-config] - https://gerrit.wikimedia.org/r/726884 (https://phabricator.wikimedia.org/T278920) (owner: Urbanecm) [12:09:10] (PS1) Urbanecm: viwiki: Disable mentor dashboard backend [mediawiki-config] - https://gerrit.wikimedia.org/r/726885 (https://phabricator.wikimedia.org/T278920) [12:09:15] jouncebot: nowandnext [12:09:15] No deployments scheduled for the next 5 hour(s) and 50 minute(s) [12:09:16] In 5 hour(s) and 50 minute(s): Train log triage with CPT
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211006T1800) [12:09:16] In 5 hour(s) and 50 minute(s): Morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211006T1800) [12:09:31] (CR) Urbanecm: [C: +2] viwiki: Disable mentor dashboard backend [mediawiki-config] - https://gerrit.wikimedia.org/r/726885 (https://phabricator.wikimedia.org/T278920) (owner: Urbanecm) [12:10:09] (PS1) Ema: beta: lower ATS ram cache size to 128M [puppet] - https://gerrit.wikimedia.org/r/726886 (https://phabricator.wikimedia.org/T286502) [12:10:16] (Merged) jenkins-bot: viwiki: Disable mentor dashboard backend [mediawiki-config] - https://gerrit.wikimedia.org/r/726885 (https://phabricator.wikimedia.org/T278920) (owner: Urbanecm) [12:12:32] SRE, LDAP-Access-Requests: Grant Access to (some Superset dashboards) for - https://phabricator.wikimedia.org/T292575 (Ottomata) @LSobanski correct! I'm not sure if we need another official ticket to promote this access to include analytics-privatedata-users though. As long as @E...
[12:13:03] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 1aa67d4846f39f59127a835cb7a8ed2974506025: viwiki: Disable mentor dashboard backend (T278920) (duration: 01m 06s) [12:13:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:13:09] T278920: Mentor dashboard: V1 desktop - https://phabricator.wikimedia.org/T278920 [12:16:47] !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host testvm2004.codfw.wmnet [12:16:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:17:42] !log wikiadmin@10.64.0.164(viwiki)> delete from growthexperiments_mentee_data; # cleanup after disabling mentor dashboard backend [12:17:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:18:10] (PS1) Jelto: profiles/hiera::gitlab fix ssl configuration [puppet] - https://gerrit.wikimedia.org/r/726888 (https://phabricator.wikimedia.org/T283076) [12:18:24] !log pool mw1455 mw1422 [12:18:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:19:21] (CR) Ema: [C: +2] beta: lower ATS ram cache size to 128M [puppet] - https://gerrit.wikimedia.org/r/726886 (https://phabricator.wikimedia.org/T286502) (owner: Ema) [12:20:51] SRE, LDAP-Access-Requests: Grant Access to (some Superset dashboards) for - https://phabricator.wikimedia.org/T292575 (LSobanski) @Jrbranaa can you please approve?
[12:20:54] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31505/console" [puppet] - 10https://gerrit.wikimedia.org/r/726888 (https://phabricator.wikimedia.org/T283076) (owner: 10Jelto) [12:21:03] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:23:51] !log jgiannelos@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'tegola-vector-tiles' for release 'main' . [12:23:55] (03PS2) 10Jelto: profiles/hiera::gitlab fix ssl configuration [puppet] - 10https://gerrit.wikimedia.org/r/726888 (https://phabricator.wikimedia.org/T283076) [12:23:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:24:05] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:24:35] (03CR) 10Muehlenhoff: [C: 03+2] acmechief: Remove mx2002 [puppet] - 10https://gerrit.wikimedia.org/r/723422 (https://phabricator.wikimedia.org/T286911) (owner: 10Muehlenhoff) [12:24:54] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31506/console" [puppet] - 10https://gerrit.wikimedia.org/r/726888 (https://phabricator.wikimedia.org/T283076) (owner: 10Jelto) [12:25:39] (03CR) 10Volans: [C: 03+1] "LGTM for start real life testing!" [cookbooks] - 10https://gerrit.wikimedia.org/r/657102 (https://phabricator.wikimedia.org/T284079) (owner: 10Jbond) [12:26:07] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:26:10] (03CR) 10Jcrespo: "Thanks for the review MVernon- I will send an amend with the changes soon. 
Not too worried if those are minor, easy to fix changes, I have" [puppet] - 10https://gerrit.wikimedia.org/r/726857 (owner: 10Jcrespo) [12:26:30] (03PS1) 10Volans: Upstream release v1.0.4 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/726890 [12:26:55] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host testvm2004.codfw.wmnet [12:26:59] (03CR) 10Volans: [C: 03+2] Upstream release v1.0.4 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/726890 (owner: 10Volans) [12:26:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:27:05] (03CR) 10Jcrespo: "Maybe also some load testing on a passive DC, with help of service ops?" [puppet] - 10https://gerrit.wikimedia.org/r/726857 (owner: 10Jcrespo) [12:28:30] (03CR) 10Hashar: "I have fixed the issue and synced this Puppet change with the Ansible playbook at Patchset 5 ( https://gerrit.wikimedia.org/r/c/operations" [puppet] - 10https://gerrit.wikimedia.org/r/725012 (https://phabricator.wikimedia.org/T285363) (owner: 10Hashar) [12:28:41] (03PS5) 10Hashar: gitlab: enable Content-Security-Policy reporting [puppet] - 10https://gerrit.wikimedia.org/r/725012 (https://phabricator.wikimedia.org/T285363) [12:29:33] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:30:53] 10SRE, 10Beta-Cluster-Infrastructure, 10Traffic, 10Patch-For-Review: Figure out why deployment-cache-text06 keeps crashing - https://phabricator.wikimedia.org/T286502 (10ema) 05Open→03Resolved a:03ema After lowering the amount of memory used for the ATS backend ram cache, there's now some more availa... 
[12:31:57] (03CR) 10Muehlenhoff: Remove Parsoid jessie debs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/725670 (owner: 10Muehlenhoff) [12:33:00] (03Merged) 10jenkins-bot: Upstream release v1.0.4 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/726890 (owner: 10Volans) [12:36:09] 10SRE-swift-storage, 10MW-on-K8s, 10Shellbox, 10serviceops: Support large files in Shellbox - https://phabricator.wikimedia.org/T292322 (10fgiunchedi) + sre-swift-storage for awareness [12:36:13] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:39:04] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10serviceops-radar: SVC DNS zonefiles and source of truth - https://phabricator.wikimedia.org/T270071 (10fgiunchedi) As far as Prometheus goes the `svc` CNAME in PoPs is there for symmetry reasons with eqiad/codfw (where prometheus is LVS'd), so prometheus... [12:39:24] (03PS1) 10Jgiannelos: tegola-vector-tiles: Use envoy for cronjob pods [deployment-charts] - 10https://gerrit.wikimedia.org/r/726891 [12:40:55] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:44:15] (03CR) 10Jgiannelos: "Some context around this issue. 
In our previous tests on staging we didn't use envoy for DB connections proxying as we had only direct con" [deployment-charts] - 10https://gerrit.wikimedia.org/r/726891 (owner: 10Jgiannelos) [12:44:56] 10SRE, 10Inuka-Team, 10KaiOS-Wikipedia-app, 10Traffic: Many KaiOS devices can't access WMF websites and can't use Wikipedia app - https://phabricator.wikimedia.org/T292632 (10SBisson) [12:45:24] (03CR) 10Hashar: "The discussion followed on the Phabricator task: T291870#7388329" [puppet] - 10https://gerrit.wikimedia.org/r/724695 (https://phabricator.wikimedia.org/T291870) (owner: 10Hashar) [12:45:43] (03PS2) 10Jgiannelos: tegola-vector-tiles: Use envoy for cronjob pods [deployment-charts] - 10https://gerrit.wikimedia.org/r/726891 (https://phabricator.wikimedia.org/T283159) [12:48:49] !log uploaded spicerack_1.0.4 to apt.wikimedia.org buster-wikimedia,bullseye-wikimedia [12:48:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:50:42] !log oblivian@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[12:50:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:52:11] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:53:28] (03CR) 10Muehlenhoff: ganeti: Run a monthly cluster rebalancing (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/725779 (owner: 10Alexandros Kosiaris) [12:55:49] (03PS1) 10David Caro: toolforge: new add_grid_webgrid_generic_node recipe [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/726894 (https://phabricator.wikimedia.org/T292465) [12:58:39] (03CR) 10jerkins-bot: [V: 04-1] toolforge: new add_grid_webgrid_generic_node recipe [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/726894 (https://phabricator.wikimedia.org/T292465) (owner: 10David Caro) [13:01:42] (03CR) 10Jbond: [C: 03+2] sre: convert the generic reboot functions to the cookbook class API [cookbooks] - 10https://gerrit.wikimedia.org/r/657102 (https://phabricator.wikimedia.org/T284079) (owner: 10Jbond) [13:05:14] (03CR) 10jerkins-bot: [V: 04-1] sre: convert the generic reboot functions to the cookbook class API [cookbooks] - 10https://gerrit.wikimedia.org/r/657102 (https://phabricator.wikimedia.org/T284079) (owner: 10Jbond) [13:05:44] (03PS1) 10Jbond: require_packages: update to use installed vs present [puppet] - 10https://gerrit.wikimedia.org/r/726895 [13:05:46] (03PS1) 10Jbond: require_packages: update all uses of require_packages [puppet] - 10https://gerrit.wikimedia.org/r/726896 (https://phabricator.wikimedia.org/T266479) [13:06:33] (03Abandoned) 10Jbond: require_packages: update to use installed vs present [puppet] - 10https://gerrit.wikimedia.org/r/726895 (owner: 10Jbond) [13:06:45] (03CR) 10jerkins-bot: [V: 04-1] require_packages: update all uses of require_packages [puppet] - 10https://gerrit.wikimedia.org/r/726896 (https://phabricator.wikimedia.org/T266479) (owner: 10Jbond) [13:06:51] (03PS2) 10Jbond: 
require_packages: update all uses of require_packages [puppet] - 10https://gerrit.wikimedia.org/r/726896 (https://phabricator.wikimedia.org/T266479) [13:07:32] (03CR) 10jerkins-bot: [V: 04-1] require_packages: update all uses of require_packages [puppet] - 10https://gerrit.wikimedia.org/r/726896 (https://phabricator.wikimedia.org/T266479) (owner: 10Jbond) [13:09:07] (03PS2) 10David Caro: toolforge: new add_grid_webgrid_generic_node recipe [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/726894 (https://phabricator.wikimedia.org/T292465) [13:09:34] (03CR) 10David Caro: [C: 04-1] "This is still not ready" [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/726894 (https://phabricator.wikimedia.org/T292465) (owner: 10David Caro) [13:11:46] (03CR) 10jerkins-bot: [V: 04-1] toolforge: new add_grid_webgrid_generic_node recipe [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/726894 (https://phabricator.wikimedia.org/T292465) (owner: 10David Caro) [13:11:53] (03PS1) 10Volans: schemas: adapt to JSON Schema 2020-12 [homer/public] - 10https://gerrit.wikimedia.org/r/726897 [13:12:21] XioNoX, topranks: ^^^ this should fix homer's CI [13:12:31] volans: link? :) [13:12:44] ah, got it [13:13:34] volans: "type": "object", and "additionalProperties" : false, are gone? [13:14:17] it's a ref, I'm not sure why they were there [13:14:30] we are referring to the other schema file [13:14:46] cool thanks for sorting that out :) [13:14:51] (03CR) 10Volans: "See also https://json-schema.org/draft/2020-12/release-notes.html" [homer/public] - 10https://gerrit.wikimedia.org/r/726897 (owner: 10Volans) [13:15:22] XioNoX: but please double check if that makes sense, and doesn't loosen the schema check [13:15:50] volans: makes sense, thx!
[13:16:30] (03CR) 10Ayounsi: [C: 03+1] schemas: adapt to JSON Schema 2020-12 [homer/public] - 10https://gerrit.wikimedia.org/r/726897 (owner: 10Volans) [13:16:33] (03PS3) 10Jbond: require_packages: update all uses of require_packages [puppet] - 10https://gerrit.wikimedia.org/r/726896 (https://phabricator.wikimedia.org/T266479) [13:16:37] Lucas_WMDE: Could 726603 in Gerrit be deployed even outside backport hours? [13:16:55] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/726888 (https://phabricator.wikimedia.org/T283076) (owner: 10Jelto) [13:17:02] Juan_90264: usually not, please add it to a backport window where you can be present [13:17:24] (03CR) 10jerkins-bot: [V: 04-1] require_packages: update all uses of require_packages [puppet] - 10https://gerrit.wikimedia.org/r/726896 (https://phabricator.wikimedia.org/T266479) (owner: 10Jbond) [13:18:34] (03CR) 10Jelto: [V: 03+1 C: 03+2] profiles/hiera::gitlab fix ssl configuration [puppet] - 10https://gerrit.wikimedia.org/r/726888 (https://phabricator.wikimedia.org/T283076) (owner: 10Jelto) [13:23:04] Ok... I can only attend the "Evening backport window", and I'm not able to make time for the "European mid-day backport window"; I can only access 1 hour or 2 hours. [13:24:02] I could even make the "Morning backport window", but I already have an appointment at that time [13:26:38] Lucas_WMDE: ⇈⇈⇈⇈ [13:27:52] 10SRE, 10Inuka-Team, 10KaiOS-Wikipedia-app, 10Traffic: Many KaiOS devices can't access WMF websites and can't use Wikipedia app - https://phabricator.wikimedia.org/T292632 (10Vgutierrez) p:05Triage→03Low ack, thanks for the heads up @SBisson. I'm not familiar with KaiOS so forgive me if it's a stupid...
[13:28:13] (IcingaOverload) firing: Checks are taking long to execute - https://grafana.wikimedia.org/d/rsCfQfuZz/icinga - https://alerts.wikimedia.org [13:28:18] Juan_90264: I don’t particularly mind which window you choose, the important thing is that you’re around [13:28:35] if one day doesn’t work well for you then it can also be postponed to another day [13:32:11] (03PS4) 10Jbond: require_packages: update all uses of require_packages [puppet] - 10https://gerrit.wikimedia.org/r/726896 (https://phabricator.wikimedia.org/T266479) [13:32:52] (03CR) 10jerkins-bot: [V: 04-1] require_packages: update all uses of require_packages [puppet] - 10https://gerrit.wikimedia.org/r/726896 (https://phabricator.wikimedia.org/T266479) (owner: 10Jbond) [13:33:11] RECOVERY - Disk space on aqs1004 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=aqs1004&var-datasource=eqiad+prometheus/ops [13:33:30] (03PS4) 10Filippo Giunchedi: pontoon: auto generate service certificates [puppet] - 10https://gerrit.wikimedia.org/r/725838 [13:33:37] (03PS1) 10Elukey: Add audio packages to support DSE hackathon on stat100x [puppet] - 10https://gerrit.wikimedia.org/r/726903 (https://phabricator.wikimedia.org/T292306) [13:34:27] (03PS2) 10Elukey: Add audio packages to support DSE hackathon on stat100x [puppet] - 10https://gerrit.wikimedia.org/r/726903 (https://phabricator.wikimedia.org/T292306) [13:35:43] (03PS1) 10Volans: sre.experimental.reimage: support latest Spicerack [cookbooks] - 10https://gerrit.wikimedia.org/r/726904 [13:36:20] (03CR) 10Volans: [C: 03+2] schemas: adapt to JSON Schema 2020-12 [homer/public] - 10https://gerrit.wikimedia.org/r/726897 (owner: 10Volans) [13:36:59] (03Merged) 10jenkins-bot: schemas: adapt to JSON Schema 2020-12 [homer/public] - 10https://gerrit.wikimedia.org/r/726897 (owner: 10Volans) [13:37:55] (03CR) 10Gehel: [C: 03+1] "LGTM" [puppet] - 
10https://gerrit.wikimedia.org/r/726903 (https://phabricator.wikimedia.org/T292306) (owner: 10Elukey) [13:38:13] (IcingaOverload) resolved: Checks are taking long to execute - https://grafana.wikimedia.org/d/rsCfQfuZz/icinga - https://alerts.wikimedia.org [13:40:25] 10SRE, 10Analytics, 10Analytics-Kanban, 10Data-Engineering, and 6 others: Migrated Server-side EventLogging events recording http.client_ip as 127.0.0.1 - https://phabricator.wikimedia.org/T288853 (10DAbad) @Ottomata can we close this ticket out now? Or is there work left? [13:42:35] 10SRE, 10Analytics, 10Analytics-Kanban, 10Data-Engineering, and 6 others: Migrated Server-side EventLogging events recording http.client_ip as 127.0.0.1 - https://phabricator.wikimedia.org/T288853 (10Ottomata) There is still work, I haven't deployed the config change. Sorry about that. I got caught up in t... [13:44:25] (03CR) 10Muehlenhoff: require_packages: update all uses of require_packages (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/726896 (https://phabricator.wikimedia.org/T266479) (owner: 10Jbond) [13:45:12] (03CR) 10Elukey: Add audio packages to support DSE hackathon on stat100x (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/726903 (https://phabricator.wikimedia.org/T292306) (owner: 10Elukey) [13:45:47] 10SRE, 10Language-Team (Language-2021-October-December): Remove Matxin Key from Production - https://phabricator.wikimedia.org/T292635 (10KartikMistry) [13:46:57] (03PS1) 10Filippo Giunchedi: o11y: tune IcingaOverload alert [alerts] - 10https://gerrit.wikimedia.org/r/726909 [13:47:15] (03CR) 10Ssingh: [C: 03+1] cache::haproxy: Manage request/response headers [puppet] - 10https://gerrit.wikimedia.org/r/720274 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [13:49:03] (03CR) 10Filippo Giunchedi: [C: 03+2] "Thank you for the reviews!" 
[puppet] - 10https://gerrit.wikimedia.org/r/725838 (owner: 10Filippo Giunchedi) [13:49:52] (03CR) 10Elukey: [C: 03+2] Add audio packages to support DSE hackathon on stat100x [puppet] - 10https://gerrit.wikimedia.org/r/726903 (https://phabricator.wikimedia.org/T292306) (owner: 10Elukey) [13:50:00] (03CR) 10Herron: [C: 03+2] admin: update lmata ssh key [puppet] - 10https://gerrit.wikimedia.org/r/726682 (https://phabricator.wikimedia.org/T292583) (owner: 10Herron) [13:50:02] (03CR) 10Muehlenhoff: "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/726903 (https://phabricator.wikimedia.org/T292306) (owner: 10Elukey) [13:51:55] 10Puppet, 10Infrastructure-Foundations, 10GitLab (Infrastructure), 10Patch-For-Review, and 3 others: Puppetise gitlab-ansible playbook - https://phabricator.wikimedia.org/T283076 (10Jelto) I rolled out the puppetised changes to `gitlab2001` (gitlab-replica). Apart from a [minor ssl fix](https://gerrit.wiki... [13:52:31] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Update ssh key for lmata - https://phabricator.wikimedia.org/T292583 (10herron) 05Open→03Resolved a:03herron The new key has been merged and will propagate via puppet over the next 30m, resolving! [13:55:28] (03PS7) 10Bearloga: statistics::product_analytics: create and prepare [puppet] - 10https://gerrit.wikimedia.org/r/724497 (https://phabricator.wikimedia.org/T291957) [13:56:42] (03CR) 10Bearloga: "@Ottomata: do you have any concerns with this as it is?" 
[puppet] - 10https://gerrit.wikimedia.org/r/724497 (https://phabricator.wikimedia.org/T291957) (owner: 10Bearloga) [13:57:50] (03PS5) 10Jbond: require_packages: update all uses of require_packages [puppet] - 10https://gerrit.wikimedia.org/r/726896 (https://phabricator.wikimedia.org/T266479) [13:58:28] (03CR) 10jerkins-bot: [V: 04-1] require_packages: update all uses of require_packages [puppet] - 10https://gerrit.wikimedia.org/r/726896 (https://phabricator.wikimedia.org/T266479) (owner: 10Jbond) [13:58:34] (03CR) 10Jbond: [C: 03+1] sre.experimental.reimage: support latest Spicerack [cookbooks] - 10https://gerrit.wikimedia.org/r/726904 (owner: 10Volans) [13:59:37] (03CR) 10Ottomata: "I don't think so! I just am not paying attention because we are doing a Hackathon! Sorry about that! I can merge but I won't have a lot" [puppet] - 10https://gerrit.wikimedia.org/r/724497 (https://phabricator.wikimedia.org/T291957) (owner: 10Bearloga) [14:02:53] (03PS6) 10Jbond: require_packages: update all uses of require_packages [puppet] - 10https://gerrit.wikimedia.org/r/726896 (https://phabricator.wikimedia.org/T266479) [14:03:29] (03CR) 10jerkins-bot: [V: 04-1] require_packages: update all uses of require_packages [puppet] - 10https://gerrit.wikimedia.org/r/726896 (https://phabricator.wikimedia.org/T266479) (owner: 10Jbond) [14:04:16] (03PS1) 10Ema: cache: exclude single backend experiment from pooled ATS backends [puppet] - 10https://gerrit.wikimedia.org/r/726912 (https://phabricator.wikimedia.org/T288106) [14:05:00] (03CR) 10Volans: [C: 03+2] sre.experimental.reimage: support latest Spicerack [cookbooks] - 10https://gerrit.wikimedia.org/r/726904 (owner: 10Volans) [14:06:55] PROBLEM - Check systemd state on stat1005 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_amd_rocm_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:06:56] (03CR) 10Ema: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/726912 
(https://phabricator.wikimedia.org/T288106) (owner: 10Ema) [14:07:46] (03Merged) 10jenkins-bot: sre.experimental.reimage: support latest Spicerack [cookbooks] - 10https://gerrit.wikimedia.org/r/726904 (owner: 10Volans) [14:08:59] RECOVERY - Check systemd state on stat1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:10:51] (03PS33) 10Jbond: sre: convert the generic reboot functions to the cookbook class API [cookbooks] - 10https://gerrit.wikimedia.org/r/657102 (https://phabricator.wikimedia.org/T284079) [14:11:40] (03PS7) 10Jbond: require_packages: update all uses of require_packages [puppet] - 10https://gerrit.wikimedia.org/r/726896 (https://phabricator.wikimedia.org/T266479) [14:12:16] (03CR) 10jerkins-bot: [V: 04-1] require_packages: update all uses of require_packages [puppet] - 10https://gerrit.wikimedia.org/r/726896 (https://phabricator.wikimedia.org/T266479) (owner: 10Jbond) [14:13:29] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31512/console" [puppet] - 10https://gerrit.wikimedia.org/r/726896 (https://phabricator.wikimedia.org/T266479) (owner: 10Jbond) [14:13:50] (03PS8) 10Jbond: require_packages: update all uses of require_packages [puppet] - 10https://gerrit.wikimedia.org/r/726896 (https://phabricator.wikimedia.org/T266479) [14:14:06] (03PS8) 10Bearloga: statistics::product_analytics: create and prepare [puppet] - 10https://gerrit.wikimedia.org/r/724497 (https://phabricator.wikimedia.org/T291957) [14:14:50] (03CR) 10Jbond: require_packages: update all uses of require_packages (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/726896 (https://phabricator.wikimedia.org/T266479) (owner: 10Jbond) [14:16:11] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/data/i18n/pcs (Get i18n strings for the Page Content Service) timed out before a response was received 
https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [14:16:52] (03CR) 10Jbond: "PCC: https://puppet-compiler.wmflabs.org/compiler1002/31510/" [puppet] - 10https://gerrit.wikimedia.org/r/726896 (https://phabricator.wikimedia.org/T266479) (owner: 10Jbond) [14:18:03] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [14:18:07] (03CR) 10Jbond: [C: 03+2] sre: convert the generic reboot functions to the cookbook class API [cookbooks] - 10https://gerrit.wikimedia.org/r/657102 (https://phabricator.wikimedia.org/T284079) (owner: 10Jbond) [14:20:02] (03CR) 10Jbond: [C: 03+2] require_packages: update all uses of require_packages [puppet] - 10https://gerrit.wikimedia.org/r/726896 (https://phabricator.wikimedia.org/T266479) (owner: 10Jbond) [14:20:05] (03PS2) 10Ema: cache: exclude single backend experiment from pooled ATS backends [puppet] - 10https://gerrit.wikimedia.org/r/726912 (https://phabricator.wikimedia.org/T288106) [14:20:15] 10SRE, 10Infrastructure-Foundations, 10netops: Eqiad Expansion - LVS Connectivity Options - https://phabricator.wikimedia.org/T292630 (10cmooney) @wiki_willy One option that Dell do seem to have are the Mellanox ConnextX-3 and ConnectX-4 QSFP+ based cards. With the right module we can in theory do 4x10G off... 
[14:22:30] (03CR) 10Ema: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/726912 (https://phabricator.wikimedia.org/T288106) (owner: 10Ema) [14:26:08] (03PS1) 10Giuseppe Lavagetto: kubernetes: add unified data structure for user tokens [labs/private] - 10https://gerrit.wikimedia.org/r/726913 [14:26:14] (03PS1) 10Giuseppe Lavagetto: kubernetes: remove redundant data [labs/private] - 10https://gerrit.wikimedia.org/r/726914 [14:26:18] (03PS3) 10Ema: cache: exclude single backend experiment from pooled ATS backends [puppet] - 10https://gerrit.wikimedia.org/r/726912 (https://phabricator.wikimedia.org/T288106) [14:26:26] (03PS1) 10Giuseppe Lavagetto: kubernetes: do not repeat user tokens. [puppet] - 10https://gerrit.wikimedia.org/r/726915 [14:26:34] (03CR) 10MSantos: [C: 03+1] "I think it's ready to be merged" [puppet] - 10https://gerrit.wikimedia.org/r/722825 (https://phabricator.wikimedia.org/T270175) (owner: 10Jgiannelos) [14:27:05] (03CR) 10MSantos: [C: 03+1] tegola-vector-tiles: Use envoy for cronjob pods [deployment-charts] - 10https://gerrit.wikimedia.org/r/726891 (https://phabricator.wikimedia.org/T283159) (owner: 10Jgiannelos) [14:28:56] (03CR) 10Ema: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/726912 (https://phabricator.wikimedia.org/T288106) (owner: 10Ema) [14:36:32] 10SRE, 10Traffic, 10Patch-For-Review: Experiment with single backend CDN nodes - https://phabricator.wikimedia.org/T288106 (10ema) >>! In T288106#7405557, @gerritbot wrote: > Change 726912 had a related patch set uploaded (by Ema; author: Ema): > %%%[operations/puppet@production] cache: exclude single backen... [14:38:47] 10SRE-swift-storage, 10TimedMediaHandler-Transcode: Intermittent transcode failure 'An unknown error occurred in storage backend "local-swift-codfw".' - https://phabricator.wikimedia.org/T201090 (10Yann) Again upload failed for 714 MB PDF file from IA via upload-by-url. Error message: Request from 2a01:cb15:80... 
[14:39:05] (03PS4) 10Jbond: stdlib: update stdlib from version 7.0.1 to 8.1.0 [puppet] - 10https://gerrit.wikimedia.org/r/726872 (https://phabricator.wikimedia.org/T264276) [14:39:14] (03CR) 10Alexandros Kosiaris: ganeti: Run a monthly cluster rebalancing (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/725779 (owner: 10Alexandros Kosiaris) [14:39:31] (03PS3) 10Alexandros Kosiaris: ganeti: Run a monthly cluster rebalancing [puppet] - 10https://gerrit.wikimedia.org/r/725779 [14:41:46] (03CR) 10Alexandros Kosiaris: [C: 03+1] "+1 on premise, I guess we need a fleet wide PCC and we are good to go." [puppet] - 10https://gerrit.wikimedia.org/r/726872 (https://phabricator.wikimedia.org/T264276) (owner: 10Jbond) [14:42:43] 10SRE-swift-storage, 10TimedMediaHandler-Transcode: Intermittent transcode failure 'An unknown error occurred in storage backend "local-swift-codfw".' - https://phabricator.wikimedia.org/T201090 (10Yann) >>! In T201090#7403715, @TheDJ wrote: > @Yann that seems like a different error that should be filed as a s... 
[14:45:50] (03CR) 10Vgutierrez: [C: 03+1] "$ host -t caa wikimediacloud.org" [puppet] - 10https://gerrit.wikimedia.org/r/726585 (https://phabricator.wikimedia.org/T267194) (owner: 10Majavah) [14:56:19] (03PS1) 10Jbond: rsyslog: switch to ensure_packages [puppet] - 10https://gerrit.wikimedia.org/r/726919 [15:01:02] (03CR) 10Jbond: [C: 03+2] rsyslog: switch to ensure_packages [puppet] - 10https://gerrit.wikimedia.org/r/726919 (owner: 10Jbond) [15:05:49] (03CR) 10Herron: [C: 03+1] icinga: remove alertmanager::alerts [puppet] - 10https://gerrit.wikimedia.org/r/724771 (https://phabricator.wikimedia.org/T288726) (owner: 10Filippo Giunchedi) [15:05:55] PROBLEM - Check systemd state on ms-be1029 is CRITICAL: CRITICAL - degraded: The following units failed: swift-drive-audit.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:06:43] (03CR) 10Herron: [C: 03+1] o11y: tune IcingaOverload alert [alerts] - 10https://gerrit.wikimedia.org/r/726909 (owner: 10Filippo Giunchedi) [15:08:34] (03PS1) 10Jbond: mailman3::web: dont try to install the packagae twice [puppet] - 10https://gerrit.wikimedia.org/r/726923 [15:10:07] (03CR) 10Jelto: [C: 03+1] "lgtm" [deployment-charts] - 10https://gerrit.wikimedia.org/r/726891 (https://phabricator.wikimedia.org/T283159) (owner: 10Jgiannelos) [15:12:34] (03CR) 10Herron: o11y: port alertmanager alerts (033 comments) [alerts] - 10https://gerrit.wikimedia.org/r/724761 (https://phabricator.wikimedia.org/T288726) (owner: 10Filippo Giunchedi) [15:12:47] (03CR) 10Jbond: [C: 03+2] mailman3::web: dont try to install the packagae twice [puppet] - 10https://gerrit.wikimedia.org/r/726923 (owner: 10Jbond) [15:20:48] (03CR) 10Cwhite: [V: 03+2 C: 03+2] Add logstash-output-opensearch plugin [software/logstash/plugins] - 10https://gerrit.wikimedia.org/r/713713 (owner: 10Cwhite) [15:27:51] 10SRE, 10Infrastructure-Foundations, 10netops: Eqiad Expansion - LVS Connectivity Options - https://phabricator.wikimedia.org/T292630 (10BBlack) I'm 
pretty sure you're right about the 2x PCIe limitation on these servers, unfortunately. What I'm not so sure about, is whether the (currently BIOS-disabled) "on... [15:35:59] !log installer spicerack 1.0.4 on cumin2002 [15:36:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:36:34] 10SRE, 10Infrastructure-Foundations, 10netops: Eqiad Expansion - LVS Connectivity Options - https://phabricator.wikimedia.org/T292630 (10BBlack) Also note that lvs1013-16 are ~4 years old now. I don't think they were scheduled for refresh this year, but they probably would be next year (and by then we would... [15:37:21] !log volans@cumin2002 START - Cookbook sre.experimental.reimage for host sretest1002.eqiad.wmnet [15:37:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:38:33] (03PS1) 10Jbond: ferm: switch to ensure_packages [puppet] - 10https://gerrit.wikimedia.org/r/726928 [15:39:54] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/726928 (owner: 10Jbond) [15:41:08] (03CR) 10Jbond: [C: 03+2] ferm: switch to ensure_packages [puppet] - 10https://gerrit.wikimedia.org/r/726928 (owner: 10Jbond) [15:44:35] jouncebot next [15:44:35] In 2 hour(s) and 15 minute(s): Train log triage with CPT (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211006T1800) [15:44:36] In 2 hour(s) and 15 minute(s): Morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211006T1800) [15:45:20] (03PS9) 10Jbond: require_packages: update all uses of require_packages [puppet] - 10https://gerrit.wikimedia.org/r/726896 (https://phabricator.wikimedia.org/T266479) [15:45:34] (03PS5) 10Jbond: stdlib: update stdlib from version 7.0.1 to 8.1.0 [puppet] - 10https://gerrit.wikimedia.org/r/726872 (https://phabricator.wikimedia.org/T264276) [15:45:47] !log 1.38.0-wmf.3 train (T281167): proceeding to deploy backports for T292589 [15:45:54] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:45:56] T281167: 1.38.0-wmf.3 deployment blockers - https://phabricator.wikimedia.org/T281167 [15:45:56] T292589: CI failure on Scribunto: Still uses removed ParserOptions::getUser() - https://phabricator.wikimedia.org/T292589 [15:46:26] (03CR) 10Brennen Bearnes: [C: 03+2] Replace deprecated ParserOptions::getUser with ::getUserIdentity [extensions/Scribunto] (wmf/1.38.0-wmf.2) - 10https://gerrit.wikimedia.org/r/726597 (https://phabricator.wikimedia.org/T292589) (owner: 10Jforrester) [15:46:33] (03CR) 10Brennen Bearnes: [C: 03+2] Replace deprecated ParserOptions::getUser with ::getUserIdentity [extensions/Scribunto] (wmf/1.38.0-wmf.3) - 10https://gerrit.wikimedia.org/r/726596 (https://phabricator.wikimedia.org/T292589) (owner: 10Jforrester) [15:51:59] 10SRE-swift-storage, 10TimedMediaHandler-Transcode: Intermittent transcode failure 'An unknown error occurred in storage backend "local-swift-codfw".' - https://phabricator.wikimedia.org/T201090 (10Yann) https://archive.org/details/da-capo-press-camera-notes-v-3-4-1899-1901 also failed via chunked-upload proto... [15:52:16] 10SRE, 10Release-Engineering-Team: Reduce latency of new Scap releases - https://phabricator.wikimedia.org/T292646 (10dancy) [15:59:26] 10SRE-swift-storage, 10TimedMediaHandler-Transcode: Intermittent transcode failure 'An unknown error occurred in storage backend "local-swift-codfw".' - https://phabricator.wikimedia.org/T201090 (10Yann) Also from https://commons.wikimedia.org/wiki/Special:UploadStash , if I click on (view thumbnail), I get `... 
[16:01:57] !log volans@cumin2002 END (PASS) - Cookbook sre.experimental.reimage (exit_code=0) for host sretest1002.eqiad.wmnet [16:02:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:04:03] (03Merged) 10jenkins-bot: Replace deprecated ParserOptions::getUser with ::getUserIdentity [extensions/Scribunto] (wmf/1.38.0-wmf.2) - 10https://gerrit.wikimedia.org/r/726597 (https://phabricator.wikimedia.org/T292589) (owner: 10Jforrester) [16:06:21] (03Merged) 10jenkins-bot: Replace deprecated ParserOptions::getUser with ::getUserIdentity [extensions/Scribunto] (wmf/1.38.0-wmf.3) - 10https://gerrit.wikimedia.org/r/726596 (https://phabricator.wikimedia.org/T292589) (owner: 10Jforrester) [16:06:38] RECOVERY - Check systemd state on ms-be1029 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:07:16] 10SRE, 10ops-eqiad, 10DBA: Bad ram on db1127 - https://phabricator.wikimedia.org/T292366 (10Cmjohnson) @Kormat db1127 DIMM is on-site, I need to take the server offline to replace [16:08:36] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [16:08:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:10:28] 10SRE, 10ops-eqiad, 10DBA: Bad ram on db1127 - https://phabricator.wikimedia.org/T292366 (10jcrespo) @Cmjohnson Kormat is away today, if you give me enough time I can put if offline for you. :-) [16:10:59] (03PS1) 10Majavah: apple_search: New chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/726933 [16:11:11] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[16:11:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:12:32] (03PS1) 10BryanDavis: toolhub: Bump container version to 2021-10-06-000718-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/726934 [16:13:22] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence-Backup: Degraded RAID on backup1002 - https://phabricator.wikimedia.org/T292329 (10jcrespo) > they didn't have HDD as a pre-selected option to replace :-( ///me crosses fingers.// With the RAID 6 we can lose any other disk, unlike the RAID10, so we have so... [16:15:48] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: / (spec from root) is CRITICAL: Test spec from root returned the unexpected status 503 (expecting: 200): /_info (retrieve service info) is CRITICAL: Test retrieve service info returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Citoid [16:17:36] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [16:18:29] DannyS712, Pchelolo: deploying that ParserOptions::getUser() patch - is there mwdebug testing that can/should be done here? [16:18:47] no, I don't think so. I donno how to reproduce [16:18:56] k, going ahead. [16:19:08] 10SRE, 10ops-eqiad: Degraded RAID on db1126 - https://phabricator.wikimedia.org/T292325 (10Cmjohnson) 05Open→03Resolved a:03Cmjohnson The disk has been replaced and is rebuilding. Resolving this task, re-open if the issue comes back Firmware state: Rebuild [16:19:34] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [16:19:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:22:04] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[16:22:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:23:40] PROBLEM - Host clouddb1020.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:26:44] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [16:28:49] !log brennen@deploy1002 Synchronized php-1.38.0-wmf.2/extensions/Scribunto/includes/engines/LuaCommon/LanguageLibrary.php: Backport: [[gerrit:726597|Replace deprecated ParserOptions::getUser with ::getUserIdentity (T292589)]] (duration: 01m 10s) [16:28:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:28:56] T292589: CI failure on Scribunto: Still uses removed ParserOptions::getUser() - https://phabricator.wikimedia.org/T292589 [16:30:22] RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [16:31:26] !log jynus@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1127.eqiad.wmnet with reason: hw maintenance [16:31:29] !log jynus@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1127.eqiad.wmnet with reason: hw maintenance [16:31:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:31:35] 10SRE, 10ops-eqiad, 10DBA: Bad ram on db1127 - https://phabricator.wikimedia.org/T292366 (10ops-monitoring-bot) Icinga downtime set by jynus@cumin1001 for 1 day, 0:00:00 1 host(s) and their services with reason: hw maintenance ` db1127.eqiad.wmnet ` [16:31:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:34:42] (03CR) 10BryanDavis: [C: 03+2] toolhub: Bump container version to 2021-10-06-000718-production [deployment-charts] - 
10https://gerrit.wikimedia.org/r/726934 (owner: 10BryanDavis) [16:35:21] !log stopping db1127 for hw maintenance T292366 [16:35:23] !log brennen@deploy1002 Synchronized php-1.38.0-wmf.3/extensions/Scribunto/includes/engines/LuaCommon/LanguageLibrary.php: Backport: [[gerrit:726596|Replace deprecated ParserOptions::getUser with ::getUserIdentity (T292589)]] (duration: 01m 04s) [16:35:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:35:28] T292366: Bad ram on db1127 - https://phabricator.wikimedia.org/T292366 [16:35:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:35:34] T292589: CI failure on Scribunto: Still uses removed ParserOptions::getUser() - https://phabricator.wikimedia.org/T292589 [16:36:22] (03PS1) 10Aklapper: phabricator weekly changes email: Fix query for dashboard panel changes [puppet] - 10https://gerrit.wikimedia.org/r/726936 (https://phabricator.wikimedia.org/T292062) [16:39:05] (03Merged) 10jenkins-bot: toolhub: Bump container version to 2021-10-06-000718-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/726934 (owner: 10BryanDavis) [16:40:16] (03PS4) 10Arturo Borrero Gonzalez: acme_chief: add openstack certs [puppet] - 10https://gerrit.wikimedia.org/r/726585 (https://phabricator.wikimedia.org/T267194) (owner: 10Majavah) [16:40:29] 10SRE, 10SRE-swift-storage, 10ops-eqiad: swift - ms-be1059 - device sdi:3 unavailable - https://phabricator.wikimedia.org/T292486 (10Cmjohnson) I replaced what I think was /dev/sdi. The server did not show any amber led to let me know which disk was failed. [16:40:44] RECOVERY - Host clouddb1020.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.23 ms [16:41:17] 10SRE, 10SRE-swift-storage, 10ops-eqiad: swift - ms-be1059 - device sdi:3 unavailable - https://phabricator.wikimedia.org/T292486 (10Cmjohnson) It will need to be added back to the array. 
It was slot 8 that was replaced [16:41:57] !log bd808@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'toolhub' for release 'main' . [16:42:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:42:52] RECOVERY - Host clouddb1020 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [16:42:58] (03CR) 10Majavah: acme_chief: add openstack certs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/726585 (https://phabricator.wikimedia.org/T267194) (owner: 10Majavah) [16:43:07] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Services, 10cloud-services-team (Hardware): hw troubleshooting: crash (with thermal event) for clouddb1020.eqiad.wmnet - https://phabricator.wikimedia.org/T291963 (10Cmjohnson) 05Open→03Resolved a:03Cmjohnson CPU1 replaced, the bios updated during the reboot, no... [16:43:08] !log 1.38.0-wmf.3 train (T281167): unblocked, rolling to group0 [16:43:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:43:16] T281167: 1.38.0-wmf.3 deployment blockers - https://phabricator.wikimedia.org/T281167 [16:44:45] 10SRE, 10ops-eqiad, 10DBA: Bad ram on db1127 - https://phabricator.wikimedia.org/T292366 (10jcrespo) @Cmjohnson you can proceed- the host is poweredoff, according to racadm, but I didn't power it off- either it crashed or something happened before I could stop mysql cleanly. I will know more when it comes up... 
[16:45:06] (03PS1) 10Brennen Bearnes: group0 wikis to 1.38.0-wmf.3 refs T281167 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/726937 [16:45:08] (03CR) 10Brennen Bearnes: [C: 03+2] group0 wikis to 1.38.0-wmf.3 refs T281167 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/726937 (owner: 10Brennen Bearnes) [16:45:22] 10SRE, 10ops-eqiad, 10DBA: Bad ram on db1127 - https://phabricator.wikimedia.org/T292366 (10jcrespo) a:03Cmjohnson [16:45:51] (03Merged) 10jenkins-bot: group0 wikis to 1.38.0-wmf.3 refs T281167 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/726937 (owner: 10Brennen Bearnes) [16:46:52] PROBLEM - Host cloudcephosd1022.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:47:09] !log bd808@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'toolhub' for release 'main' . [16:47:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:47:19] !log brennen@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.38.0-wmf.3 refs T281167 [16:47:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:51:43] 10SRE-swift-storage, 10MW-on-K8s, 10Shellbox, 10serviceops: Support large files in Shellbox - https://phabricator.wikimedia.org/T292322 (10Legoktm) >>>! In T292322#7403338, @Legoktm wrote: >> @fgiunchedi I'd appreciate your input on how this would potentially interact with swift, specifically: >> * Can we... [16:52:42] PROBLEM - Check systemd state on ms-be1054 is CRITICAL: CRITICAL - degraded: The following units failed: session-207563.scope https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:53:48] !log bd808@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'toolhub' for release 'main' . 
[16:53:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:54:08] (03CR) 10Legoktm: apple_search: New chart (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/726933 (owner: 10Majavah) [16:54:45] 10SRE, 10Inuka-Team, 10KaiOS-Wikipedia-app, 10Traffic: Many KaiOS devices can't access WMF websites and can't use Wikipedia app - https://phabricator.wikimedia.org/T292632 (10SBisson) >>! In T292632#7405463, @Vgutierrez wrote: > ... does that mean that WiFi-only devices are unsupported and missing security... [16:58:07] (03PS2) 10Majavah: apple_search: New chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/726933 [16:58:25] (03CR) 10Majavah: apple_search: New chart (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/726933 (owner: 10Majavah) [17:01:42] PROBLEM - Stale file for node-exporter textfile in eqiad on alert1001 is CRITICAL: cluster=mysql file=device_smart.prom instance=clouddb1020 job=node site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Stale_file_for_node-exporter_textfile https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile [17:03:51] 10SRE, 10ops-eqiad, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install cloudcephosd102[1-4].eqiad.wmnet - https://phabricator.wikimedia.org/T284471 (10Cmjohnson) Attempting to swap the raid controller today, took the controller from 1022 and put it in 1021. Fingers crossed this is the issue so we can go to Dell... [17:05:16] 10SRE, 10ops-eqiad, 10DBA: Bad ram on db1127 - https://phabricator.wikimedia.org/T292366 (10Cmjohnson) 05Open→03Resolved DIMM replaced, cleared the error logs, everything looks good from my end. @jynus I am resolving the task to remove from our queue. 
[17:06:40] PROBLEM - Check systemd state on ms-be1059 is CRITICAL: CRITICAL - degraded: The following units failed: swift-drive-audit.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:07:42] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence-Backup: Degraded RAID on backup1002 - https://phabricator.wikimedia.org/T292329 (10Cmjohnson) our ticket was declined, I opened a ticket for backup1001 and the error is on a disk shelf. [17:10:51] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [17:10:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:13:10] PROBLEM - SSH on bast3005 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [17:15:12] RECOVERY - SSH on bast3005 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [17:16:06] (03CR) 10Jbond: "PCC: https://puppet-compiler.wmflabs.org/compiler1003/31515/" [puppet] - 10https://gerrit.wikimedia.org/r/726872 (https://phabricator.wikimedia.org/T264276) (owner: 10Jbond) [17:16:18] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [17:16:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:18:00] 10SRE, 10Infrastructure-Foundations, 10netops: Eqiad Expansion - LVS Connectivity Options - https://phabricator.wikimedia.org/T292630 (10cmooney) Thanks for the detailed response @BBlack > And yes, the tunnel option is also risky/complex/painful, but we'll have to weigh that against all the above. Without... [17:22:42] RECOVERY - Stale file for node-exporter textfile in eqiad on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Stale_file_for_node-exporter_textfile https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile [17:23:40] RECOVERY - Host cloudcephosd1022.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.10 ms [17:24:40] RECOVERY - Check systemd state on ms-be1054 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:26:12] 10SRE, 10ops-eqiad, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install cloudcephosd102[1-4].eqiad.wmnet - https://phabricator.wikimedia.org/T284471 (10Cmjohnson) Good news, swapped the raid controller and 1021 went through the full installation without an issue, I put the raid controller from 1021 into 1022 and... [17:33:57] 10SRE, 10Wikimedia-Mailing-lists: Upgrade lists.wikimedia.org to next Mailman/hyperkitty/postorius versions - https://phabricator.wikimedia.org/T286217 (10Legoktm) Here's the dependency diff for Mailman Core 3.3.5 from 3.3.4: `lang=diff @@ -111,16 +111,16 @@ case second 'm'. Any other spelling is incorrect.""... [17:51:50] (03CR) 10Herron: [C: 03+1] logstash: test moving the k8s parsing to earlier in the pipeline [puppet] - 10https://gerrit.wikimedia.org/r/726671 (https://phabricator.wikimedia.org/T292099) (owner: 10Cwhite) [17:55:29] 10SRE, 10MediaWiki-General, 10Platform Engineering Code Jam, 10Platform Engineering Roadmap Decision Making, 10Performance-Team (Radar): Allow easier ICU transitions in MediaWiki (change how sortkey collation is managed in the categorylinks table) - https://phabricator.wikimedia.org/T263437 (10Pchelolo)... 
[17:57:35] jouncebot: now [17:57:35] No deployments scheduled for the next 0 hour(s) and 2 minute(s) [17:57:40] jouncebot: next [17:57:40] In 0 hour(s) and 2 minute(s): Train log triage with CPT (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211006T1800) [17:57:40] In 0 hour(s) and 2 minute(s): Morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211006T1800) [17:58:47] I'm adding a quick patch to the Deployments calendar, please hold [18:00:05] brennen and jeena: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Train log triage with CPT deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211006T1800). [18:00:05] RoanKattouw, Niharika, and Urbanecm: How many deployers does it take to do Morning backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211006T1800). [18:00:05] No Gerrit patches in the queue for this window AFAICS. [18:01:11] jouncebot: refresh [18:01:12] I refreshed my knowledge about deployments. 
[18:01:19] 2 patches for B&C [18:01:34] RECOVERY - MegaRAID on db1126 is OK: OK: optimal, 1 logical, 6 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [18:01:46] RECOVERY - Check systemd state on ms-be1059 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:02:00] ping for mbsantos re disabling statics maps for eswiki - scheduled for deployment on this window [18:06:42] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Christina Macholan - https://phabricator.wikimedia.org/T292515 (10Sbodington) approved [18:12:10] (03PS3) 10Majavah: apple-search: New chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/726933 (https://phabricator.wikimedia.org/T289224) [18:13:39] (03CR) 10Legoktm: [C: 03+2] [viwikibooks] Set $wgRestrictDisplayTitle to false [mediawiki-config] - 10https://gerrit.wikimedia.org/r/721839 (https://phabricator.wikimedia.org/T289837) (owner: 10MarcoAurelio) [18:14:26] (03CR) 10Legoktm: [C: 03+2] [eswiki] Disable static mapframes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/723689 (https://phabricator.wikimedia.org/T291736) (owner: 10MarcoAurelio) [18:15:01] oh, I probably shouldn't have +2'd both [18:15:13] (03PS4) 10Legoktm: [viwikibooks] Set $wgRestrictDisplayTitle to false [mediawiki-config] - 10https://gerrit.wikimedia.org/r/721839 (https://phabricator.wikimedia.org/T289837) (owner: 10MarcoAurelio) [18:15:16] (03CR) 10Legoktm: [C: 03+2] "..." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/721839 (https://phabricator.wikimedia.org/T289837) (owner: 10MarcoAurelio) [18:16:04] what size of t-shirt do you wear? 
It's for sending you the 'I broke the wiki' one :) [18:16:07] (03Merged) 10jenkins-bot: [viwikibooks] Set $wgRestrictDisplayTitle to false [mediawiki-config] - 10https://gerrit.wikimedia.org/r/721839 (https://phabricator.wikimedia.org/T289837) (owner: 10MarcoAurelio) [18:16:39] I already have one :p [18:17:15] * majavah thought you only got a sticker [18:17:20] hauskatze: live on mwdebug1001 for testing [18:17:32] checking viwikibooks [18:17:43] (03PS4) 10Legoktm: [eswiki] Disable static mapframes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/723689 (https://phabricator.wikimedia.org/T291736) (owner: 10MarcoAurelio) [18:19:22] legoktm, majavah : hmm, {{DISPLAYTITLE}} seems not to work well... I'm still checking [18:19:44] oh, nevermind, caching issue [18:19:52] https://vi.wikibooks.org/wiki/Th%C3%A0nh_vi%C3%AAn:MarcoAurelio/Sandbox [18:20:14] {{DISPLAYTITLE:Test}} makes the page title 'Test' as expected [18:20:31] works for me too [18:20:51] great [18:21:15] (03CR) 10Legoktm: [C: 03+2] [eswiki] Disable static mapframes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/723689 (https://phabricator.wikimedia.org/T291736) (owner: 10MarcoAurelio) [18:21:19] syncing [18:22:00] (03Merged) 10jenkins-bot: [eswiki] Disable static mapframes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/723689 (https://phabricator.wikimedia.org/T291736) (owner: 10MarcoAurelio) [18:22:24] !log legoktm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: viwikibooks: Set $wgRestrictDisplayTitle to false (T289837) (duration: 01m 21s) [18:22:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:22:32] T289837: Set $wgRestrictDisplayTitle to false on Vietnamese Wikibooks - https://phabricator.wikimedia.org/T289837 [18:22:58] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[18:23:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:23:10] hauskatze: static mapframes patch live on mwdebug1001 now [18:23:17] checking [18:25:30] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [18:25:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:25:47] Not seeing anything change [18:25:48] hmm [18:26:33] let me log-in and out [18:27:40] hauskatze: I think you need to purge the page [18:28:00] yup, I'm trying with [[:es:Madrid]] [18:28:04] (03CR) 10Dzahn: "I was about to say, thanks, lgtm, I had oticed this on cumin . but I see in compiler now: " Could not find resource 'Exec[compile fragmen" [puppet] - 10https://gerrit.wikimedia.org/r/726851 (owner: 10Muehlenhoff) [18:28:39] legoktm: yay, purge work [18:28:53] maps can now be zoomed in- and out- without clicking on them [18:29:10] (03CR) 10Dzahn: Don't include rsync::server for absented rsync modules (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/726851 (owner: 10Muehlenhoff) [18:29:47] nice [18:30:08] so it won't take effect immediately, it'll require pages to be edited or the parser cache to expire [18:30:51] just please make sure people don't go around purging every page... 
[18:31:15] (03CR) 10Dzahn: Don't include rsync::server for absented rsync modules (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/726851 (owner: 10Muehlenhoff) [18:31:38] (03CR) 10Dzahn: [C: 04-1] Don't include rsync::server for absented rsync modules [puppet] - 10https://gerrit.wikimedia.org/r/726851 (owner: 10Muehlenhoff) [18:31:57] !log legoktm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: eswiki: Disable static mapframes (T291736) (duration: 01m 17s) [18:32:02] hauskatze: ^^ all set [18:32:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:32:04] T291736: Disable static maps and thumbnails for es.wikipedia - https://phabricator.wikimedia.org/T291736 [18:32:07] legoktm: I intended to tell no one for now [18:32:38] legoktm: thanks so much :) [18:32:53] :) yw [18:33:11] (03CR) 10Dzahn: "thanks! yea, it does. It's more about deploying this and wanting to let wmcs know." [puppet] - 10https://gerrit.wikimedia.org/r/726729 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn) [18:34:01] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [18:34:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:34:21] eh, wait, disabling static thumbnails is a feature? [18:34:39] Does this mean it's loading that whole OOUI and map backend interaction and canvas rendering on page load? [18:35:12] https://es.wikipedia.org/wiki/Madrid has an example in the infobox [18:35:24] yeah.. [18:35:37] ok, I guess we'll just not monitor eswiki for performance. [18:35:55] I don't think it loaded all of OOUI [18:36:03] >>> mw.loader.getState('oojs-ui'); [18:36:03] "registered" [18:36:35] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[18:36:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:36:53] mw.loader.findReady('oojs-ui-widgets') [18:36:53] 0: Array [ "ext.kartographer.dialog" ] [18:36:57] oojs-ui-core as well [18:37:17] and -styles, and -windows [18:37:22] that's half a meg or so [18:37:44] ohhh [18:38:11] findReady is my custom script. mw.inspeect() is another way to see it more easily [18:38:28] the top 10 largest modules on that page are mapbox and ooui modules [18:39:22] anyway, perhaps kartographer could be improved to be more usable from its static version. e.g. HiDPI and feeling more interactable. I'm assuming the community vote came from it not being clear that the feautres are available, it feels too indirect. [18:39:29] good reason for product to think about :) [18:39:58] (03CR) 10Dzahn: [V: 03+1] "https://puppet-compiler.wmflabs.org/compiler1001/31517/an-web1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/726728 (https://phabricator.wikimedia.org/T266479) (owner: 10Dzahn) [18:40:11] hauskatze: I don't think this is acceptable from a performance perspective :/ [18:41:18] legoktm: No idea, if it is not okay, please revert. I adviced them that performance could be a blocker. 
MSantos said on patch that impact was probably not significant, that's why I scheduled it [18:41:55] (03PS1) 10Legoktm: Revert "[eswiki] Disable static mapframes" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/726950 [18:42:06] (03CR) 10Legoktm: [C: 03+2] Revert "[eswiki] Disable static mapframes" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/726950 (owner: 10Legoktm) [18:42:50] (03Merged) 10jenkins-bot: Revert "[eswiki] Disable static mapframes" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/726950 (owner: 10Legoktm) [18:44:33] !log legoktm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Revert disabling static mapframes on eswiki (duration: 01m 14s) [18:44:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:46:10] 10SRE, 10SRE-swift-storage, 10ops-eqiad: swift - ms-be1059 - device sdi:3 unavailable - https://phabricator.wikimedia.org/T292486 (10Dzahn) Thanks @Cmjohnson ! @Papaul @fgiunchedi looking at the linked tickets that had similar cases it seemed as if just replacing the disk auto-fixed things, or was there a... [18:47:52] legoktm: I'm wondering why is it (loading large data on every pageview) acceptable for ie. commons, where wgKartographerStaticMapframe is false. Sorry if I'm missing something! 
[18:49:10] I was about to say something similar, because people mentioned that 'if commons can have it why can't we' and I know the people that proposed the voting is going to ask me [18:49:16] (03CR) 10Dzahn: [C: 03+1] profiles/hiera::gitlab fix ssl configuration [puppet] - 10https://gerrit.wikimedia.org/r/726888 (https://phabricator.wikimedia.org/T283076) (owner: 10Jelto) [18:49:18] (03CR) 10Michael DiPietro: [C: 03+2] maintain-views.yaml: Remove afl_filter from the view [puppet] - 10https://gerrit.wikimedia.org/r/723808 (https://phabricator.wikimedia.org/T291806) (owner: 10Marostegui) [18:50:01] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [18:50:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:50:52] (03CR) 10Dzahn: "https://puppet-compiler.wmflabs.org/compiler1002/31518/" [puppet] - 10https://gerrit.wikimedia.org/r/726726 (https://phabricator.wikimedia.org/T266479) (owner: 10Dzahn) [18:51:02] hauskatze: you can always say "go to the sysadmins" :-). It's not you who made the decision, so you shouldn't be blamed for it. [18:51:16] broadly, not all wikis are equal in terms of audience, page views/traffic and goals. [18:51:31] urbanecm: I told them already :D [18:51:36] good :) [18:51:58] "changes can be declined for any reason or no reason at all" [18:52:00] my impression is that Commons is more editor focused and so the perf penalty is somewhat acceptable. I don't think that makes sense for someone who wants to learn about Madrid and most of the client-side bandwith is used to load a map in the infobox [18:52:36] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[18:52:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:52:41] I suspect (with no evidence) the real reason is that Commons has had these interactive maps via a gadget, and switching it off would be a regression [18:54:55] legoktm: I suspect this isn't just something regarding [[:es:Madrid]] but the whole pages using ? [18:55:07] my guess (also with no evidence) is that it was done because of the Data namespace [18:56:01] hauskatze: what do you mean? [18:56:30] legoktm: the performance issues would be there for all pages using , not just that single page right? [18:56:39] yes [18:56:41] that page is particularly long, with lots of images, etc [18:56:44] ah, right [18:57:00] well, I guess this is a question for the performance team [18:57:16] I am already logged out, but if someone could tag them on the task for their review I'd appreciate it [18:57:48] (03CR) 10Jforrester: "Thanks!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/726743 (https://phabricator.wikimedia.org/T235752) (owner: 10Gergő Tisza) [18:58:45] kitchen time, bbl [18:58:50] o/ [19:00:04] brennen and jeena: (Dis)respected human, time to deploy MediaWiki train - American Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211006T1900). Please do the needful. 
[19:00:14] weeeeee [19:00:32] i mean o/ [19:01:11] !log 1.38.0-wmf.3 train (T281167): still unblocked after triage meeting, rolling to group1 [19:01:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:01:18] T281167: 1.38.0-wmf.3 deployment blockers - https://phabricator.wikimedia.org/T281167 [19:01:45] (03PS1) 10Brennen Bearnes: group1 wikis to 1.38.0-wmf.3 refs T281167 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/726977 [19:01:47] (03CR) 10Brennen Bearnes: [C: 03+2] group1 wikis to 1.38.0-wmf.3 refs T281167 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/726977 (owner: 10Brennen Bearnes) [19:02:30] (03Merged) 10jenkins-bot: group1 wikis to 1.38.0-wmf.3 refs T281167 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/726977 (owner: 10Brennen Bearnes) [19:03:54] PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v2/translate/{from}/{to}{/provider} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX [19:04:20] !log brennen@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.38.0-wmf.3 refs T281167 [19:04:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:04:45] that alert seems unrelated given its in codfw [19:04:50] ack [19:05:23] !log brennen@deploy1002 Synchronized php: group1 wikis to 1.38.0-wmf.3 refs T281167 (duration: 01m 03s) [19:05:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:05:46] RECOVERY - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [19:06:05] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[19:06:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:08:36] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [19:08:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:09:54] (03CR) 10Jforrester: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/726977 (owner: 10Brennen Bearnes) [19:10:55] (03CR) 10Legoktm: "Thanks, I don't really know we (probably me) added this." [puppet] - 10https://gerrit.wikimedia.org/r/726923 (owner: 10Jbond) [19:12:37] (03CR) 10Legoktm: "Currently icinga is alerting on every check with "NRPE: Unable to read output"." [puppet] - 10https://gerrit.wikimedia.org/r/726649 (owner: 10Volans) [19:14:50] (03PS1) 10Legoktm: Revert "uwsgi: restore unicode output for NRPE check" [puppet] - 10https://gerrit.wikimedia.org/r/726951 [19:14:52] legoktm: not all [19:14:53] https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=debmonitor1002&service=debmonitor+uWSGI+web+app [19:14:58] which one? [19:15:10] ores, graphite-web, puppetboard, coal [19:15:21] just looking at icinga/alerts [19:15:31] https://icinga.wikimedia.org/alerts I mean [19:16:16] legoktm: it's the strech hosts [19:16:21] where it's /bin/systemctl [19:16:26] :| [19:16:28] we can either remove the absolute path [19:16:33] or revert to use service [19:16:35] meh [19:16:39] either of those [19:16:45] but it's not the revert of that change [19:16:47] the solution [19:16:53] * legoktm nods [19:16:59] legoktm: here dinner is just ready [19:17:08] shouldn't /bin/systemctl work on buster+bullseye too? [19:17:24] ah... 
right [19:17:27] that should work [19:17:32] (03Abandoned) 10Legoktm: Revert "uwsgi: restore unicode output for NRPE check" [puppet] - 10https://gerrit.wikimedia.org/r/726951 (owner: 10Legoktm) [19:17:36] give me a minute [19:17:41] sure, thanks [19:18:43] (03PS1) 10Legoktm: uwsgi: Use /bin/systemctl for non merged-/usr hosts [puppet] - 10https://gerrit.wikimedia.org/r/726978 [19:19:07] how's that? [19:19:09] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/726978 (owner: 10Legoktm) [19:19:17] sorry for not have spotted that mishap earlier [19:20:15] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:(Need By: TBD) rack/setup/install kubernetes10[18-21] - https://phabricator.wikimedia.org/T290202 (10Jclark-ctr) kubernetes1018 A6 U28 Port26 Cableid# 1949 kubernetes1019 B3 U29 Port 25 Cableid# 1925 kubernetes1020 C3 U11 Port9 Cableid# 2865 kubernetes1021 D3... [19:20:50] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:(Need By: TBD) rack/setup/install kubernetes10[18-21] - https://phabricator.wikimedia.org/T290202 (10Jclark-ctr) a:05Jclark-ctr→03Cmjohnson [19:21:13] it happens :) [19:21:17] (03CR) 10Legoktm: [C: 03+2] uwsgi: Use /bin/systemctl for non merged-/usr hosts [puppet] - 10https://gerrit.wikimedia.org/r/726978 (owner: 10Legoktm) [19:21:27] volans: enjoy your dinner! 
[19:21:53] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:(Need By: TBD) rack/setup/install cloudmetrics100[34].eqiad.wmnet - https://phabricator.wikimedia.org/T289888 (10Jclark-ctr) cloudmetrics1003 A6 U29 Port29 Cableid#1952 cloudmetrics1004 C5 U29 Port34 Cableid#3315 [19:22:11] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:(Need By: TBD) rack/setup/install cloudmetrics100[34].eqiad.wmnet - https://phabricator.wikimedia.org/T289888 (10Jclark-ctr) [19:24:07] (03PS1) 10Michael DiPietro: remove table from maintain-views [puppet] - 10https://gerrit.wikimedia.org/r/726980 (https://phabricator.wikimedia.org/T292043) [19:24:19] (03CR) 10Jbond: mailman3::web: dont try to install the packagae twice (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/726923 (owner: 10Jbond) [19:24:38] (03CR) 10Legoktm: "The problem was not actually this, volans on IRC pointed out the issue was the path of /usr/bin/systemctl, which does not work on stretch " [puppet] - 10https://gerrit.wikimedia.org/r/726649 (owner: 10Volans) [19:26:48] (03CR) 10Krinkle: [C: 03+1] Eventlogging: Remove unused RUM Speed Index. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/726852 (https://phabricator.wikimedia.org/T286700) (owner: 10Phedenskog) [19:30:31] (03CR) 10Bstorm: [C: 03+1] "Looks good." [puppet] - 10https://gerrit.wikimedia.org/r/726980 (https://phabricator.wikimedia.org/T292043) (owner: 10Michael DiPietro) [19:31:39] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:(Need By: TBD) rack/setup/install cloudmetrics100[34].eqiad.wmnet - https://phabricator.wikimedia.org/T289888 (10Jclark-ctr) a:05Jclark-ctr→03Cmjohnson @wiki_willy can you help clarify racking requirements for these host? @cmjohnson1 T... 
[19:34:35] (03CR) 10Michael DiPietro: [C: 03+2] remove table from maintain-views [puppet] - 10https://gerrit.wikimedia.org/r/726980 (https://phabricator.wikimedia.org/T292043) (owner: 10Michael DiPietro) [19:34:40] as far as staticMapframe not being enabled on non-Wikipedias it seems to be a case of "mapframe was enabled before static mapframe existed, nothing burned down, so why change it" [19:35:35] I can't find any justification for why it was only enabled on Wikipedias, considering T148070 said "all mapframes" [19:35:36] T148070: Use maps snapshot service until user interacts (click/mouseover?) - https://phabricator.wikimedia.org/T148070 [19:39:59] 10SRE, 10Infrastructure-Foundations, 10netops: Eqiad Expansion - LVS Connectivity Options - https://phabricator.wikimedia.org/T292630 (10cmooney) > I know in the R720s, R730s etc the "on-board" network ports are on a daughterboard which is swappable. From everything I've seen online this isn't the case with... [19:40:02] all the uwsgi alerts have cleared [19:44:35] legoktm: great, thanks for the fix [19:44:36] ! [20:00:04] brennen and jeena: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for MediaWiki train - American Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211006T1900). [20:00:04] chrisalbon and accraze: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Services – Graphoid / ORES . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211006T2000). 
[20:15:51] (03CR) 10Dzahn: [V: 03+1] mediawiki/geoip: add option to also pull new MaxMind databases from master (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/726094 (https://phabricator.wikimedia.org/T288844) (owner: 10Dzahn) [20:20:29] (03CR) 10Ottomata: "That is only used by the old eventlogging-processor to ensure that we don't try to process events that have been migrated to Event Platfor" [puppet] - 10https://gerrit.wikimedia.org/r/726852 (https://phabricator.wikimedia.org/T286700) (owner: 10Phedenskog) [20:22:43] (03PS5) 10Dzahn: mediawiki/geoip: add option to also pull new MaxMind databases from master [puppet] - 10https://gerrit.wikimedia.org/r/726094 (https://phabricator.wikimedia.org/T288844) [20:23:00] (03CR) 10Dzahn: mediawiki/geoip: add option to also pull new MaxMind databases from master (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/726094 (https://phabricator.wikimedia.org/T288844) (owner: 10Dzahn) [20:23:15] (03CR) 10jerkins-bot: [V: 04-1] mediawiki/geoip: add option to also pull new MaxMind databases from master [puppet] - 10https://gerrit.wikimedia.org/r/726094 (https://phabricator.wikimedia.org/T288844) (owner: 10Dzahn) [20:27:43] (03PS6) 10Dzahn: mediawiki/geoip: add option to also pull new MaxMind databases from master [puppet] - 10https://gerrit.wikimedia.org/r/726094 (https://phabricator.wikimedia.org/T288844) [20:33:04] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/31521/" [puppet] - 10https://gerrit.wikimedia.org/r/726094 (https://phabricator.wikimedia.org/T288844) (owner: 10Dzahn) [20:34:51] (03CR) 10Krinkle: [C: 03+1] Remove unused RUM-SpeedIndex. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/726854 (https://phabricator.wikimedia.org/T286700) (owner: 10Phedenskog) [20:36:04] (03PS1) 10Dzahn: Revert "mediawiki/geoip: add option to also pull new MaxMind databases from master" [puppet] - 10https://gerrit.wikimedia.org/r/726952 [20:36:16] (03CR) 10Dzahn: [C: 03+2] Revert "mediawiki/geoip: add option to also pull new MaxMind databases from master" [puppet] - 10https://gerrit.wikimedia.org/r/726952 (owner: 10Dzahn) [20:40:36] PROBLEM - Widespread puppet agent failures on alert1001 is CRITICAL: 0.04602 ge 0.01 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [20:40:45] ^ grrr..puppet issue you could not see in compiler (wrapped exception) and it was touching mediawiki::common [20:40:51] but I reverted [20:43:29] !log [cumin1001:~] $ sudo cumin -b 5 -p 95 'mw2*' 'run-puppet-agent -q --failed-only' [20:43:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:46:09] sigh, that's all of wtp and parse and more because mediawiki::common is so .. common.. but it's going to recover. also don't want to overload the master [20:50:22] !log global puppet failure - revert is merged, puppet run will recover on next run everywhere. partially forcing with cumin, partially letting it recover naturally [20:50:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:52:05] (03CR) 10Dzahn: "this wasn't visible in compiler (wrapped exception) but caused global puppet run issues https://phabricator.wikimedia.org/P17430" [puppet] - 10https://gerrit.wikimedia.org/r/726952 (owner: 10Dzahn) [21:06:03] 10ops-codfw, 10Continuous-Integration-Infrastructure, 10DC-Ops, 10netops: Flapping codfw management alarm ( contint2001.mgmt/SSH is CRITICAL ) - https://phabricator.wikimedia.org/T283582 (10Papaul) We are seeing this issue because all those hosts are running an old firmware version for the IDRAC. Upgrading... 
[21:07:34] RECOVERY - Widespread puppet agent failures on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.003977 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [21:08:59] ^ it's over [21:11:09] (03PS1) 10Dzahn: Revert "Revert "mediawiki/geoip: add option to also pull new MaxMind databases from master"" [puppet] - 10https://gerrit.wikimedia.org/r/726954 [21:16:29] (03PS2) 10Dzahn: Revert "Revert "mediawiki/geoip: add option to also pull new MaxMind databases from master"" [puppet] - 10https://gerrit.wikimedia.org/r/726954 [22:23:16] !log temp. disabling puppet on an-worker*, mw* [22:23:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:24:13] (IcingaOverload) firing: Checks are taking long to execute - https://grafana.wikimedia.org/d/rsCfQfuZz/icinga - https://alerts.wikimedia.org [22:24:44] PROBLEM - Check systemd state on ms-be1031 is CRITICAL: CRITICAL - degraded: The following units failed: session-207937.scope https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:25:16] (03CR) 10Dzahn: [C: 03+2] Revert "Revert "mediawiki/geoip: add option to also pull new MaxMind databases from master"" [puppet] - 10https://gerrit.wikimedia.org/r/726954 (owner: 10Dzahn) [22:30:47] !log re-enabling puppet on mw*, an-worker* after deploying gerrit:726954. no issue this time [22:30:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:33:04] (03PS1) 10Volans: sre.experimental.reimage: update Netbox data [cookbooks] - 10https://gerrit.wikimedia.org/r/726990 [22:33:13] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:(Need By: TBD) rack/setup/install cloudmetrics100[34].eqiad.wmnet - https://phabricator.wikimedia.org/T289888 (10wiki_willy) Hi @nskaggs & @aborrero - let us know what you decide, so we can make sure these servers are properly placed. Than... 
[22:34:05] 10SRE, 10observability: Grafana share button drops duplicate URL params - https://phabricator.wikimedia.org/T292606 (10RLazarus) >>! In T292606#7404873, @fgiunchedi wrote: > Thanks for the report -- I'm assuming this is a new bug post-upgrade of Grafana yesterday I'd definitely believe this turns out to be tr... [22:35:33] (03PS1) 10Dzahn: mediawiki: roll out maxmind dbs for ipinfo on canary appservers [puppet] - 10https://gerrit.wikimedia.org/r/726991 (https://phabricator.wikimedia.org/T288844) [22:39:13] (IcingaOverload) resolved: Checks are taking long to execute - https://grafana.wikimedia.org/d/rsCfQfuZz/icinga - https://alerts.wikimedia.org [22:40:02] (03PS3) 10Juan90264: Adding and use wordmark in ckbwikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/726955 (https://phabricator.wikimedia.org/T288368) [22:43:24] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/31523/" [puppet] - 10https://gerrit.wikimedia.org/r/726991 (https://phabricator.wikimedia.org/T288844) (owner: 10Dzahn) [22:46:57] (03CR) 10Cwhite: [C: 03+2] logstash: test moving the k8s parsing to earlier in the pipeline [puppet] - 10https://gerrit.wikimedia.org/r/726671 (https://phabricator.wikimedia.org/T292099) (owner: 10Cwhite) [22:50:43] (03PS4) 10Juan90264: Adding and use wordmark in ckbwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/726955 (https://phabricator.wikimedia.org/T288368) [22:54:12] 10SRE, 10Anti-Harassment, 10IP Info, 10serviceops, 10Patch-For-Review: Update MaxMind GeoIP2 license key and product IDs for application servers - https://phabricator.wikimedia.org/T288844 (10Dzahn) @phuedx The 2 new databases GeoIP2-Anonymous-IP.mmdb and GeoIP2-Enterprise.mmdb we got with the new licens... 
[22:58:01] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:(Need By: TBD) rack/setup/install cloudmetrics100[34].eqiad.wmnet - https://phabricator.wikimedia.org/T289888 (10Bstorm) In our team meeting today, we figured that a straight refresh of cloudmetrics1001/2 as the systems were provisioned pre... [22:59:13] I arrived! Two patches awaiting deployment [22:59:50] urbanc [23:00:05] RoanKattouw, Niharika, and Urbanecm: #bothumor I � Unicode. All rise for Evening backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211006T2300). [23:00:05] Juan_90264: A patch you scheduled for Evening backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [23:00:29] urbanecm: I'm online and live here, don't forget to deploy my patches [23:02:19] ... [23:03:26] Any online deployers? [23:04:29] I can do it if no-one else is around. [23:05:42] James_F: Thank you for your help, I believe Urbanecm is not online right now [23:05:45] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:(Need By: TBD) rack/setup/install cloudmetrics100[34].eqiad.wmnet - https://phabricator.wikimedia.org/T289888 (10wiki_willy) Cool, thanks for confirming @Bstorm ...we'll definitely miss working with ya, and wish you all the best! >>! In T2... 
[23:06:12] (03CR) 10Jforrester: [C: 03+2] Enable NewUserMessage for ptwikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/726603 (https://phabricator.wikimedia.org/T290820) (owner: 10Juan90264) [23:08:22] RECOVERY - Check systemd state on ms-be1031 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:11:46] (03PS1) 10Eric Gardner: Add a new "all assessments" option to MediaSearch assessments dropdown [mediawiki-config] - 10https://gerrit.wikimedia.org/r/726993 (https://phabricator.wikimedia.org/T285349) [23:12:29] (03PS4) 10Jforrester: Enable NewUserMessage for ptwikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/726603 (https://phabricator.wikimedia.org/T290820) (owner: 10Juan90264) [23:12:33] (03CR) 10Jforrester: [C: 03+2] "…" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/726603 (https://phabricator.wikimedia.org/T290820) (owner: 10Juan90264) [23:13:32] Thanks for rebased [23:14:56] (03Merged) 10jenkins-bot: Enable NewUserMessage for ptwikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/726603 (https://phabricator.wikimedia.org/T290820) (owner: 10Juan90264) [23:16:04] (03PS5) 10Jforrester: Adding and use wordmark in ckbwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/726955 (https://phabricator.wikimedia.org/T288368) (owner: 10Juan90264) [23:16:56] !log jforrester@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:726603|Enable NewUserMessage for ptwikivoyage (T290820)]] (duration: 01m 05s) [23:17:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:17:01] T290820: add NewUserMessage to ptwikivoyage - https://phabricator.wikimedia.org/T290820 [23:17:05] (03CR) 10Jforrester: [C: 03+2] Adding and use wordmark in ckbwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/726955 (https://phabricator.wikimedia.org/T288368) (owner: 10Juan90264) [23:17:59] (03Merged) 10jenkins-bot: 
Adding and use wordmark in ckbwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/726955 (https://phabricator.wikimedia.org/T288368) (owner: 10Juan90264) [23:18:54] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [23:18:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:20:06] !log jforrester@deploy1002 Synchronized static/images/mobile/copyright/wikipedia-wordmark-ckb.svg: Config: [[gerrit:726955|Adding and use wordmark in ckbwiki (T288368)]] (duration: 01m 04s) [23:20:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:20:11] T288368: Add Central Kurdish Wikipedia Wordmark on Mobile view - https://phabricator.wikimedia.org/T288368 [23:21:12] !log jforrester@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:726955|Adding and use wordmark in ckbwiki (T288368)]] (duration: 01m 04s) [23:21:16] Juan_90264: OK, should be done now. Thanks! [23:21:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:21:24] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [23:21:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:25:54] James_F: Resolved, and thanks for deploying! [23:26:13] Happy to help. [23:29:54] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [23:29:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:32:32] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[23:32:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:32:46] James_F: "Archive manually (bot broke?)" MarcoAurelio replaced the file that the bot had already archived [23:33:32] * put again [23:44:51] (03CR) 10Dzahn: [C: 03+2] Remove Parsoid jessie debs [puppet] - 10https://gerrit.wikimedia.org/r/725670 (owner: 10Muehlenhoff) [23:46:46] (03CR) 10Dzahn: "affects releases servers, _not_ parsoid servers" [puppet] - 10https://gerrit.wikimedia.org/r/725670 (owner: 10Muehlenhoff) [23:47:53] (03CR) 10Dzahn: "@Muehlenhoff Should have absented apt::distribution first to make puppet actually remove it or we do it manually?" [puppet] - 10https://gerrit.wikimedia.org/r/725670 (owner: 10Muehlenhoff) [23:50:22] 10SRE, 10SRE-swift-storage, 10ops-eqiad: swift - ms-be1059 - device sdi:3 unavailable - https://phabricator.wikimedia.org/T292486 (10wiki_willy) a:03Cmjohnson [23:50:44] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence-Backup: Degraded RAID on backup1002 - https://phabricator.wikimedia.org/T292329 (10wiki_willy) a:03Cmjohnson [23:57:02] !log releases2002 - rm /srv/org/wikimedia/reprepro/conf/distributions - contains only jessie-mediawiki - see 725670 and EOL of MediaWiki 1.31 [23:57:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log