[00:00:04] RoanKattouw and Urbanecm: Your horoscope predicts another unfortunate UTC late backport window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220105T0000). [00:00:04] ebernhardson, subbu, and cjming: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [00:00:20] here ... [00:00:51] dancy: note, there was an alert for one mw host timeout: PROBLEM - SSH on mw2258.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:01:09] hello! [00:01:26] dancy: twentyafterfour: hey, just in case, is B&C okay to go? [00:01:42] urbanecm: please see -security [00:03:25] i assume answer is no [00:03:29] based on transcript [00:03:47] (03CR) 10Jdlrobson: [C: 03+1] Deploy sticky header to pilot wikis, launch A/B test. (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/747981 (https://phabricator.wikimedia.org/T295976) (owner: 10Clare Ming) [00:04:08] (03CR) 10Jdlrobson: [C: 03+1] Deploy sticky header to pilot wikis, launch A/B test. (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/747981 (https://phabricator.wikimedia.org/T295976) (owner: 10Clare Ming) [00:05:13] since i am not in -security .. i assume i should just reschedule this for tomorrow? [00:06:23] subbu: depends on how fast folks will fix it :) [00:07:28] (03CR) 10Nray: Deploy sticky header to pilot wikis, launch A/B test. 
(031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/747981 (https://phabricator.wikimedia.org/T295976) (owner: 10Clare Ming) [00:07:59] (03PS7) 10Juan90264: Update bnwikivoyage wordmark logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/749626 (https://phabricator.wikimedia.org/T298033) (owner: 10MdsShakil) [00:09:20] twentyafterfour: FYI, looking through a few of my old transcripts, I get samples of 39,40, 33, and 36 minutes for `Finished scap: testwikis wikis to ` [00:12:09] Hello? [00:12:20] yeah the docs actually say 'Note: this step may take on the order of 70-80 minutes.' [00:12:21] being ever the optimist, I added a sixth patch to the window [00:12:40] hello Juan_90264 [00:12:55] we're currently waiting on an issue being fixed that blocks all deployments [00:13:15] Urbanecm: Ok [00:16:16] Urbanecm: Does this problem have any forecast to finish being fixed? [00:16:36] please be patient :) [00:17:16] Okay [00:22:28] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_eventlogging_analytics.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:23:14] RECOVERY - very high load average likely xfs on ms-be2065 is OK: OK - load average: 73.20, 75.58, 79.42 https://wikitech.wikimedia.org/wiki/Swift [00:37:14] !log twentyafterfour@deploy1002 Synchronized php-1.38.0-wmf.16/extensions/VisualEditor/: fix patch application failure (duration: 01m 09s) [00:37:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:40:15] !log twentyafterfour@deploy1002 Synchronized php-1.38.0-wmf.16/includes/content/ContentModelChange.php: fix patch application failure (duration: 01m 07s) [00:40:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:42:03] okay, let's go now :) [00:42:08] thanks for the patience everyone [00:42:25] subbu: cjming: Juan_90264: still around? [00:42:36] yes. 
[00:42:39] great [00:42:43] yes! [00:43:09] (03CR) 10Urbanecm: [C: 03+2] Enable slow-parsoid logs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/749302 (owner: 10Subramanya Sastry) [00:43:36] TIL we're publishing slow-parse [00:44:25] (03Merged) 10jenkins-bot: Enable slow-parsoid logs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/749302 (owner: 10Subramanya Sastry) [00:44:46] subbu: can you test it? [00:45:09] i'm trying to see where to look for them in logstash. [00:46:26] with mwdebug? the mwdebug servers screen [00:47:46] or i can probably just sync, it's safe enough [00:47:53] and then you'll see them whenever they normally end up [00:47:57] yes, it is safe. [00:47:58] subbu: let me know what you prefer [00:48:06] so sync? [00:48:10] yes, please. [00:48:23] doing [00:48:23] if they don't show up anywhere, i'll debug tomorrow. [00:48:28] (03PS4) 10Urbanecm: Fix wordmark svgs for strategywiki, viwikibooks. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/748214 (https://phabricator.wikimedia.org/T290091) (owner: 10Clare Ming) [00:48:36] (03CR) 10Urbanecm: [C: 03+2] Fix wordmark svgs for strategywiki, viwikibooks. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/748214 (https://phabricator.wikimedia.org/T290091) (owner: 10Clare Ming) [00:48:38] sounds good [00:49:20] (03Merged) 10jenkins-bot: Fix wordmark svgs for strategywiki, viwikibooks. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/748214 (https://phabricator.wikimedia.org/T290091) (owner: 10Clare Ming) [00:49:31] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 6c220f0bb86b0d77714ee23d662ea836897e0207: Enable slow-parsoid logs (duration: 01m 08s) [00:49:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:49:44] subbu: it's live [00:49:48] thanks. 
[00:49:51] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [00:49:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:49:59] cjming: your patch is at mwdebug1001 [00:50:02] can you have a look? [00:50:07] yup [00:50:12] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=sidekiq site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [00:50:32] urbanecm, yes, i see the logs now. :) [00:50:37] great, syncing [00:50:55] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [00:50:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:50:56] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [00:50:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:51:42] urbanecm: looking good [00:51:58] syncing [00:52:02] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [00:52:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:52:22] i accidentally started earlier, because the message looked like "log_o_s" now [00:52:29] sorry [00:52:33] !log urbanecm@deploy1002 Synchronized static/images/mobile/copyright/: 7aff17f42eb2ecad94a76c5d93ce467bd6bff39e: Fix wordmark svgs for strategywiki, viwikibooks (T290091; 1/2) (duration: 01m 07s) [00:52:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:52:36] T290091: Create logos and prepare new set of pilot wikis for deployment - https://phabricator.wikimedia.org/T290091 [00:53:52] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 7aff17f42eb2ecad94a76c5d93ce467bd6bff39e: Fix wordmark svgs for strategywiki, viwikibooks (T290091; 2/2) (duration: 01m 
07s) [00:53:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:53:58] cjming: it's live :) [00:54:28] so, Juan_90264 did not reply to the ping, so...tgr are you around? [00:54:29] thank you \o/ [00:54:39] urbanecm: o/ [00:54:42] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [00:54:48] (03PS4) 10Urbanecm: GrowthExperiments: Add campaign pattern for JOSA [mediawiki-config] - 10https://gerrit.wikimedia.org/r/749598 (https://phabricator.wikimedia.org/T298057) (owner: 10Gergő Tisza) [00:54:52] (03CR) 10Urbanecm: [C: 03+2] GrowthExperiments: Add campaign pattern for JOSA [mediawiki-config] - 10https://gerrit.wikimedia.org/r/749598 (https://phabricator.wikimedia.org/T298057) (owner: 10Gergő Tisza) [00:56:16] (03Merged) 10jenkins-bot: GrowthExperiments: Add campaign pattern for JOSA [mediawiki-config] - 10https://gerrit.wikimedia.org/r/749598 (https://phabricator.wikimedia.org/T298057) (owner: 10Gergő Tisza) [00:56:40] tgr: pulled to mwdebug1001 [00:56:43] can you have a look? 
[00:57:05] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [00:57:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:58:14] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [00:58:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:58:15] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [00:58:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:59:20] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [00:59:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:59:22] urbanecm: works [00:59:28] syncing [00:59:38] (not quite as expected but that's not an issue with the config patch) [01:01:07] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 34bf91ec2ba1408594bb77745deb6fa7d36ddf8d: GrowthExperiments: Add campaign pattern for JOSA (T298057) (duration: 01m 08s) [01:01:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:01:10] T298057: Donors to newcomers: JOSA landing page - https://phabricator.wikimedia.org/T298057 [01:01:18] tgr: live [01:01:21] and, we're done [01:01:58] awesome, i'll ship mine then as well [01:02:25] thanks ebernhardson [01:02:52] (03CR) 10Ebernhardson: [C: 03+2] Move CirrusSearch more_like traffic to eqiad [mediawiki-config] - 10https://gerrit.wikimedia.org/r/751485 (owner: 10Ebernhardson) [01:04:24] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [01:04:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:05:22] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [01:05:23] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:05:23] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [01:05:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:06:23] (03PS2) 10Ebernhardson: Move CirrusSearch more_like traffic to eqiad [mediawiki-config] - 10https://gerrit.wikimedia.org/r/751485 [01:06:33] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [01:06:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:07:17] (03CR) 10Ebernhardson: [C: 03+2] Move CirrusSearch more_like traffic to eqiad [mediawiki-config] - 10https://gerrit.wikimedia.org/r/751485 (owner: 10Ebernhardson) [01:08:01] (03Merged) 10jenkins-bot: Move CirrusSearch more_like traffic to eqiad [mediawiki-config] - 10https://gerrit.wikimedia.org/r/751485 (owner: 10Ebernhardson) [01:11:36] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [01:11:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:12:02] !log ebernhardson@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:751485|Move CirrusSearch more_like traffic to eqiad]] (duration: 01m 07s) [01:12:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:15:15] is B&C finished? [01:16:36] tgr: And i? [01:17:07] Juan_90264: i just finished mine, can likely deploy yours. 
Lemme look at it a sec [01:18:10] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [01:18:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:18:11] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [01:18:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:20:49] @ebernhardson: Ok [01:22:32] Juan_90264: i just had a look through and it seems like there are still pages in the namespaces, although they might all be redirects. expected? [01:22:40] re: tematica and the related talk [01:23:02] (i'm not particularly familiar with deleting namespaces, but i was expecting all pages would be moved out first so nothing is inaccessible) [01:24:10] (03PS8) 10Ebernhardson: Update bnwikivoyage wordmark logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/749626 (https://phabricator.wikimedia.org/T298033) (owner: 10MdsShakil) [01:24:20] (03CR) 10Ebernhardson: [C: 03+2] Update bnwikivoyage wordmark logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/749626 (https://phabricator.wikimedia.org/T298033) (owner: 10MdsShakil) [01:24:27] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [01:24:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:25:03] (03Merged) 10jenkins-bot: Update bnwikivoyage wordmark logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/749626 (https://phabricator.wikimedia.org/T298033) (owner: 10MdsShakil) [01:29:23] ebernhardson: If the namespace is deleted the pages will end up being redirected to the main domain. But wait? The user who requested the task in Phabricator had said that he had removed all pages from that namespace, are you sure you have any in that namespace? 
[01:29:35] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [01:29:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:30:29] Great merged! [01:30:45] Juan_90264: i was looking at https://quarry.wmcloud.org/query/61257 which says 226 pages in the 104 namespace [01:32:48] ebernhardson https://quarry.wmcloud.org/query/61258 [01:33:01] looks like you're good [01:34:27] Ok that sounds reasonable [01:34:43] (03PS1) 10Ladsgroup: maintenance: Add support for oldimage table metadata refresh [core] (wmf/1.38.0-wmf.16) - 10https://gerrit.wikimedia.org/r/751526 (https://phabricator.wikimedia.org/T298417) [01:34:47] (03PS4) 10Ebernhardson: Delete Tematica namespace (NS:104) in Italian Wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/750814 (https://phabricator.wikimedia.org/T298315) (owner: 10Juan90264) [01:34:56] (03PS1) 10Ladsgroup: maintenance: Add support for oldimage table metadata refresh [core] (wmf/1.38.0-wmf.13) - 10https://gerrit.wikimedia.org/r/751527 (https://phabricator.wikimedia.org/T298417) [01:35:01] Perryprog: Thanks for actually checking for pages [01:35:48] Juan_90264: bnwiki change is pulled to mwdebug1002 if you can test [01:36:19] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [01:36:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:36:21] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [01:36:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:36:40] Ok, i will test [01:36:51] yup, just happened to be lurking and was curious :-) [01:40:28] ebernhardson: I tested and approve [01:40:42] Juan_90264: great, syncing [01:40:48] Ok perry [01:40:53] (03CR) 10Ebernhardson: [C: 03+2] Delete Tematica namespace (NS:104) in Italian Wikivoyage [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/750814 (https://phabricator.wikimedia.org/T298315) (owner: 10Juan90264) [01:41:40] (03Merged) 10jenkins-bot: Delete Tematica namespace (NS:104) in Italian Wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/750814 (https://phabricator.wikimedia.org/T298315) (owner: 10Juan90264) [01:41:43] !log ebernhardson@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:749626|Update bnwikivoyage wordmark logo (T298033)]] (duration: 01m 07s) [01:41:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:41:47] T298033: Update bnwikivoyage wordmark logo - https://phabricator.wikimedia.org/T298033 [01:42:39] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [01:42:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:43:05] !log ebernhardson@deploy1002 Synchronized static/images/mobile/copyright/wikivoyage-wordmark-bn.svg: Config: [[gerrit:749626|Update bnwikivoyage wordmark logo (T298033)]] (duration: 01m 07s) [01:43:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:44:00] Juan_90264: ok now we have dropping namespaces on mwdebug1002, not 100% sure what to test [01:47:28] (03CR) 10Ladsgroup: [C: 03+2] maintenance: Add support for oldimage table metadata refresh [core] (wmf/1.38.0-wmf.13) - 10https://gerrit.wikimedia.org/r/751527 (https://phabricator.wikimedia.org/T298417) (owner: 10Ladsgroup) [01:47:34] (03CR) 10Ladsgroup: [C: 03+2] maintenance: Add support for oldimage table metadata refresh [core] (wmf/1.38.0-wmf.16) - 10https://gerrit.wikimedia.org/r/751526 (https://phabricator.wikimedia.org/T298417) (owner: 10Ladsgroup) [01:47:41] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [01:47:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:48:18] Juan_90264: seems reasonable to my 
testing, shipping [01:49:11] !log ebernhardson@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:750814|Delete Tematica namespace (NS:104) in Italian Wikivoyage (T298315)]] (duration: 01m 07s) [01:49:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:49:14] T298315: Deleting Ns:104 in it:voy - https://phabricator.wikimedia.org/T298315 [01:49:28] ok, with that the backports should be complete i think [01:51:14] PROBLEM - SSH on mw2252.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [01:52:17] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [01:52:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:52:18] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [01:52:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:53:49] (03CR) 10jenkins-bot: [V: 04-1] maintenance: Add support for oldimage table metadata refresh [core] (wmf/1.38.0-wmf.16) - 10https://gerrit.wikimedia.org/r/751526 (https://phabricator.wikimedia.org/T298417) (owner: 10Ladsgroup) [01:56:08] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [01:56:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:05:47] (03Merged) 10jenkins-bot: maintenance: Add support for oldimage table metadata refresh [core] (wmf/1.38.0-wmf.13) - 10https://gerrit.wikimedia.org/r/751527 (https://phabricator.wikimedia.org/T298417) (owner: 10Ladsgroup) [02:09:08] !log ladsgroup@deploy1002 Synchronized php-1.38.0-wmf.13/maintenance/refreshImageMetadata.php: Backport: [[gerrit:751527|maintenance: Add support for oldimage table metadata refresh (T298417)]] (duration: 01m 08s) [02:09:11] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:09:12] T298417: Undeleted djvu files show incorrect metadata: 0x0 size, no page number info - https://phabricator.wikimedia.org/T298417 [02:09:25] (03Merged) 10jenkins-bot: maintenance: Add support for oldimage table metadata refresh [core] (wmf/1.38.0-wmf.16) - 10https://gerrit.wikimedia.org/r/751526 (https://phabricator.wikimedia.org/T298417) (owner: 10Ladsgroup) [02:11:06] ebernhardson: Sorry I'm missing, the changes seem to be working [02:11:15] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [02:11:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:11:19] !log ladsgroup@deploy1002 Synchronized php-1.38.0-wmf.16/maintenance/refreshImageMetadata.php: Backport: [[gerrit:751526|maintenance: Add support for oldimage table metadata refresh (T298417)]] (duration: 01m 07s) [02:11:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:11:23] Thanks ebernhardson! 
[02:12:26] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [02:12:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:12:27] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [02:12:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:13:23] !log running foreachwikiindblist all maintenance/refreshImageMetadata.php --force --verbose --mediatype=OFFICE --oldimage (T298417) [02:13:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:13:33] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [02:13:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:18:40] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [02:18:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:19:32] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [02:19:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:19:34] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [02:19:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:20:39] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [02:20:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:40:56] PROBLEM - SSH on contint1001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [02:52:18] RECOVERY - SSH on mw2252.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [03:07:36] (03CR) 
10Andrew Bogott: [C: 03+2] apt sources.list templates: add some comments [puppet] - 10https://gerrit.wikimedia.org/r/751497 (https://phabricator.wikimedia.org/T264311) (owner: 10Andrew Bogott) [03:13:37] (03CR) 10Andrew Bogott: [C: 03+2] cloud-vps: puppetize /etc/apt/sources.list [puppet] - 10https://gerrit.wikimedia.org/r/751498 (https://phabricator.wikimedia.org/T264311) (owner: 10Andrew Bogott) [03:42:06] RECOVERY - SSH on contint1001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [03:45:49] (WdqsStreamingUpdaterFlinkProcessingLatencyIsHigh) firing: Processing latency of WDQS_Streaming_Updater in codfw (k8s) is above 5 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://alerts.wikimedia.org [03:47:01] 10SRE, 10ops-eqiad: Degraded RAID on dumpsdata1004 - https://phabricator.wikimedia.org/T298582 (10ops-monitoring-bot) [03:50:49] (WdqsStreamingUpdaterFlinkProcessingLatencyIsHigh) resolved: Processing latency of WDQS_Streaming_Updater in codfw (k8s) is above 5 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://alerts.wikimedia.org [05:28:24] (03PS1) 10KartikMistry: Deploy Flores MT [deployment-charts] - 10https://gerrit.wikimedia.org/r/751547 (https://phabricator.wikimedia.org/T298584) [05:46:15] (03PS1) 10Marostegui: Revert "db2094: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/751529 [05:48:35] (03CR) 10Marostegui: [C: 03+2] Revert "db2094: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/751529 (owner: 10Marostegui) [05:50:57] (03CR) 10Gergő Tisza: [C: 03+1] snapshot: Dump information about Growth mentorship [puppet] - 10https://gerrit.wikimedia.org/r/740371 (https://phabricator.wikimedia.org/T291966) (owner: 10Urbanecm) [05:56:04] PROBLEM - SSH on mw2252.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [06:01:08] (03CR) 10MdsShakil: [C: 04-1] "they only discussed for adding ak Wikipedia" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/749890 (https://phabricator.wikimedia.org/T298296) (owner: 10Amire80) [06:10:37] (03CR) 10Amire80: Add akwiki as an import source for twwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/749890 (https://phabricator.wikimedia.org/T298296) (owner: 10Amire80) [06:18:32] (03CR) 10MdsShakil: [C: 04-1] "That's seems be ok but you didn't mentioned the ticket number." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/749890 (https://phabricator.wikimedia.org/T298296) (owner: 10Amire80) [06:25:10] (03CR) 10Amire80: Add akwiki as an import source for twwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/749890 (https://phabricator.wikimedia.org/T298296) (owner: 10Amire80) [06:28:44] (03CR) 10MdsShakil: [C: 04-1] Add akwiki as an import source for twwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/749890 (https://phabricator.wikimedia.org/T298296) (owner: 10Amire80) [06:45:13] PROBLEM - Check systemd state on ms-be1061 is CRITICAL: CRITICAL - degraded: The following units failed: session-259999.scope https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:57:13] RECOVERY - SSH on mw2252.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [07:04:11] RECOVERY - Check systemd state on ms-be1061 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:15:19] (03PS2) 10Amire80: Add akwiki as an import source for twwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/749890 (https://phabricator.wikimedia.org/T298296) [07:15:52] (03PS3) 10Amire80: Add akwiki as an import source for twwiki [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/749890 (https://phabricator.wikimedia.org/T298296) [07:29:31] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1096.eqiad.wmnet with reason: Maintenance [07:29:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:29:33] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1096.eqiad.wmnet with reason: Maintenance [07:29:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:29:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1096:3316 (T297191)', diff saved to https://phabricator.wikimedia.org/P18394 and previous config saved to /var/cache/conftool/dbconfig/20220105-072937-marostegui.json [07:29:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:29:40] T297191: Schema change for dropping page_restrictions.pr_user field on wmf sites - https://phabricator.wikimedia.org/T297191 [07:30:06] (03CR) 10MdsShakil: [C: 03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/749890 (https://phabricator.wikimedia.org/T298296) (owner: 10Amire80) [07:30:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316 (T297191)', diff saved to https://phabricator.wikimedia.org/P18395 and previous config saved to /var/cache/conftool/dbconfig/20220105-073046-marostegui.json [07:30:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:34:59] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 237, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:41:13] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 238, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:45:51] 
!log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316', diff saved to https://phabricator.wikimedia.org/P18396 and previous config saved to /var/cache/conftool/dbconfig/20220105-074551-marostegui.json [07:45:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:00:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316', diff saved to https://phabricator.wikimedia.org/P18397 and previous config saved to /var/cache/conftool/dbconfig/20220105-080055-marostegui.json [08:00:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:12:11] 10SRE-swift-storage, 10MW-on-K8s, 10Shellbox, 10serviceops: Support large files in Shellbox - https://phabricator.wikimedia.org/T292322 (10Joe) Updating myself: looks like the error comes from the fact mwmaint is *not* using remote shellbox to execute `scripts/retrieveMetaData.sh` - see https://logstash.wi... [08:16:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316 (T297191)', diff saved to https://phabricator.wikimedia.org/P18398 and previous config saved to /var/cache/conftool/dbconfig/20220105-081600-marostegui.json [08:16:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:16:04] T297191: Schema change for dropping page_restrictions.pr_user field on wmf sites - https://phabricator.wikimedia.org/T297191 [08:25:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2087:3316, db2087:3317 for Bullseye reimage T295965', diff saved to https://phabricator.wikimedia.org/P18399 and previous config saved to /var/cache/conftool/dbconfig/20220105-082529-marostegui.json [08:25:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:25:34] T295965: Test MariaDB 10.4 with Bullseye - https://phabricator.wikimedia.org/T295965 [08:26:20] (03PS1) 10Marostegui: db2087: Disable notifications [puppet] - 
10https://gerrit.wikimedia.org/r/751679 (https://phabricator.wikimedia.org/T295965) [08:27:04] (03CR) 10Marostegui: [C: 03+2] db2087: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/751679 (https://phabricator.wikimedia.org/T295965) (owner: 10Marostegui) [08:28:46] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host db2087.codfw.wmnet with OS bullseye [08:28:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:30:42] (03CR) 10DCausse: [C: 03+1] Blazegraph: further relax free allocators check [alerts] - 10https://gerrit.wikimedia.org/r/751513 (https://phabricator.wikimedia.org/T298525) (owner: 10Bking) [08:31:23] (03PS3) 10Juan90264: Change the Traditional Chinese and Simplified Chinese logo for zhwikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/751530 (https://phabricator.wikimedia.org/T298550) [08:44:40] (03PS1) 10Majavah: P:graphite: move forward_clusters to hiera [puppet] - 10https://gerrit.wikimedia.org/r/751681 (https://phabricator.wikimedia.org/T241285) [08:44:53] (03CR) 10Majavah: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/751681 (https://phabricator.wikimedia.org/T241285) (owner: 10Majavah) [08:52:18] PROBLEM - MariaDB Replica SQL: s7 on clouddb1018 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1141, Errmsg: Error There is no such grant defined for user wikiadmin on host 10.% on query. Default database: . [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [08:52:46] PROBLEM - MariaDB Replica SQL: s7 on clouddb1021 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1141, Errmsg: Error There is no such grant defined for user wikiadmin on host 10.% on query. Default database: . 
[Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [08:52:50] PROBLEM - MariaDB Replica SQL: s7 on clouddb1014 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1141, Errmsg: Error There is no such grant defined for user wikiadmin on host 10.% on query. Default database: . [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [08:54:05] Amir1: ^ [08:54:16] on it [08:54:42] ran something with replication ugh [08:57:00] RECOVERY - MariaDB Replica SQL: s7 on clouddb1014 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [08:57:07] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2087.codfw.wmnet with OS bullseye [08:57:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:58:10] RECOVERY - MariaDB Replica SQL: s7 on clouddb1021 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [08:58:22] should be all fixed now [08:58:40] RECOVERY - MariaDB Replica SQL: s7 on clouddb1018 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [09:09:36] (03CR) 10Filippo Giunchedi: [C: 03+1] wmflib: add service::get_services_for function [puppet] - 10https://gerrit.wikimedia.org/r/746801 (owner: 10Giuseppe Lavagetto) [09:16:04] (03CR) 10Filippo Giunchedi: "I think we should be able to keep using idp/cas in deployment-prep but point it to idp.wmcloud.org, or said otherwise is deployment-prep g" [puppet] - 10https://gerrit.wikimedia.org/r/751477 (https://phabricator.wikimedia.org/T241285) (owner: 10Majavah) [09:17:43] (03CR) 10Filippo Giunchedi: [C: 03+1] P:prometheus::ops: add prometheus job and ferm rules for gitlab_runner metrics [puppet] - 
10https://gerrit.wikimedia.org/r/751452 (https://phabricator.wikimedia.org/T295481) (owner: 10Jelto) [09:18:11] (03CR) 10Majavah: P:graphite: support not using CAS (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/751477 (https://phabricator.wikimedia.org/T241285) (owner: 10Majavah) [09:18:18] (03CR) 10Filippo Giunchedi: [C: 04-1] "Fails to PCC:" [puppet] - 10https://gerrit.wikimedia.org/r/751470 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [09:24:34] !log depool cp5005 to be reimaged as cache::upload_envoy - T271421 [09:24:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:24:37] T271421: Test envoyproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T271421 [09:25:35] (03PS1) 10Filippo Giunchedi: graphite: bump FETCH_TIMEOUT [puppet] - 10https://gerrit.wikimedia.org/r/751686 (https://phabricator.wikimedia.org/T298521) [09:26:48] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33128/console" [puppet] - 10https://gerrit.wikimedia.org/r/751686 (https://phabricator.wikimedia.org/T298521) (owner: 10Filippo Giunchedi) [09:26:51] (03CR) 10Vgutierrez: [C: 03+2] site: Reimage cp5005 as cache::upload_envoy [puppet] - 10https://gerrit.wikimedia.org/r/751413 (https://phabricator.wikimedia.org/T271421) (owner: 10Vgutierrez) [09:26:57] (03PS3) 10Vgutierrez: site: Reimage cp5005 as cache::upload_envoy [puppet] - 10https://gerrit.wikimedia.org/r/751413 (https://phabricator.wikimedia.org/T271421) [09:29:05] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reimage for host cp5005.eqsin.wmnet with OS buster [09:29:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:29:14] 10SRE, 10Traffic, 10Patch-For-Review: Test envoyproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T271421 (10ops-monitoring-bot) Cookbook 
cookbooks.sre.hosts.reimage was started by vgutierrez@cumin1001 for host cp5005.eqsin.wmnet with OS buster [09:29:33] jouncebot: nowandnext [09:29:33] No deployments scheduled for the next 2 hour(s) and 30 minute(s) [09:29:34] In 2 hour(s) and 30 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220105T1200) [09:29:56] (03CR) 10Urbanecm: [C: 03+2] Add Zscaler to list of trusted hosts for XFF [extensions/TrustedXFF] (wmf/1.38.0-wmf.13) - 10https://gerrit.wikimedia.org/r/751196 (https://phabricator.wikimedia.org/T298241) (owner: 10Urbanecm) [09:29:58] (03CR) 10Urbanecm: [C: 03+2] Add Zscaler to list of trusted hosts for XFF [extensions/TrustedXFF] (wmf/1.38.0-wmf.16) - 10https://gerrit.wikimedia.org/r/751195 (https://phabricator.wikimedia.org/T298241) (owner: 10Urbanecm) [09:30:13] (03CR) 10Urbanecm: [C: 03+2] MentorFilterHooks: Include only primary mentors [extensions/GrowthExperiments] (wmf/1.38.0-wmf.13) - 10https://gerrit.wikimedia.org/r/750807 (https://phabricator.wikimedia.org/T298031) (owner: 10Urbanecm) [09:32:58] (03Merged) 10jenkins-bot: Add Zscaler to list of trusted hosts for XFF [extensions/TrustedXFF] (wmf/1.38.0-wmf.13) - 10https://gerrit.wikimedia.org/r/751196 (https://phabricator.wikimedia.org/T298241) (owner: 10Urbanecm) [09:33:00] (03Merged) 10jenkins-bot: Add Zscaler to list of trusted hosts for XFF [extensions/TrustedXFF] (wmf/1.38.0-wmf.16) - 10https://gerrit.wikimedia.org/r/751195 (https://phabricator.wikimedia.org/T298241) (owner: 10Urbanecm) [09:33:07] that was quick [09:33:19] !log aokoth@cumin1001 START - Cookbook sre.hosts.decommission for hosts kubestage1001.eqiad.wmnet [09:33:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:35:43] !log urbanecm@deploy1002 Synchronized php-1.38.0-wmf.13/extensions/TrustedXFF/trusted-hosts.php: 010d96b9297825079b3ac84f247c0f80353d42a8: Add Zscaler to list of trusted hosts for XFF (T298241) (duration: 01m 09s) 
[09:35:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:35:46] T298241: Add Zscaler to list of trusted hosts for XFF - https://phabricator.wikimedia.org/T298241 [09:36:54] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/751465 (https://phabricator.wikimedia.org/T272559) (owner: 10David Caro) [09:37:05] !log urbanecm@deploy1002 Synchronized php-1.38.0-wmf.16/extensions/TrustedXFF/trusted-hosts.php: ab8fe9884e3e4d1fa3bdaa1c8a9cab143b4ac565: Add Zscaler to list of trusted hosts for XFF (T298241) (duration: 01m 08s) [09:37:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:37:23] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/751466 (https://phabricator.wikimedia.org/T272559) (owner: 10David Caro) [09:38:31] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/751469 (https://phabricator.wikimedia.org/T272559) (owner: 10David Caro) [09:40:27] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [09:40:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:41:25] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [09:41:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:41:26] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [09:41:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:42:26] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [09:42:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:44:27] (KubernetesCalicoDown) firing: kubestage1001.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org [09:45:02] (03CR) 
10David Caro: [C: 03+2] logstash:input:syslog: remove unused module [puppet] - 10https://gerrit.wikimedia.org/r/751127 (https://phabricator.wikimedia.org/T272559) (owner: 10David Caro) [09:45:19] taavi: I've read T241285 and tbh I don't think it is a good use of time/resources since graphite is deprecated / life support mode only, could we keep using cloudmetrics instead as it is now? [09:45:20] T241285: Deployment-prep should host its own statsd/graphite server - https://phabricator.wikimedia.org/T241285 [09:45:39] (03CR) 10David Caro: [C: 03+2] systemtap::runtime: remove unused module [puppet] - 10https://gerrit.wikimedia.org/r/751469 (https://phabricator.wikimedia.org/T272559) (owner: 10David Caro) [09:46:15] (03CR) 10David Caro: [C: 03+2] udp2log:rsyncd: remove unused module [puppet] - 10https://gerrit.wikimedia.org/r/751466 (https://phabricator.wikimedia.org/T272559) (owner: 10David Caro) [09:46:55] (03CR) 10David Caro: [C: 03+2] varnish: remove empty init class and unused module [puppet] - 10https://gerrit.wikimedia.org/r/751465 (https://phabricator.wikimedia.org/T272559) (owner: 10David Caro) [09:47:37] (03CR) 10David Caro: [C: 03+2] role::wmcs::prometheus: remove unused role [puppet] - 10https://gerrit.wikimedia.org/r/751454 (https://phabricator.wikimedia.org/T238096) (owner: 10David Caro) [09:48:30] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: extend blackbox probes options [puppet] - 10https://gerrit.wikimedia.org/r/747835 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi) [09:48:33] !log aokoth@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts kubestage1001.eqiad.wmnet [09:48:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:48:41] (03PS4) 10Filippo Giunchedi: prometheus: extend blackbox probes options [puppet] - 10https://gerrit.wikimedia.org/r/747835 (https://phabricator.wikimedia.org/T291946) [09:49:32] godog: yeah, that might be the best option here :/ I decided 
to see how complicated that would be since we were refreshing the cloudmetrics hardware already, and just hit a major roadblock with as grafana-labs/cloudmetrics* can't access the graphite install on cloud vps [09:50:57] (03CR) 10DCausse: [C: 03+2] rdf-streaming-updater: increase capacity for commons [deployment-charts] - 10https://gerrit.wikimedia.org/r/751171 (https://phabricator.wikimedia.org/T262265) (owner: 10DCausse) [09:51:09] taavi: agreed :| thanks for taking a look though -- appreciate your work on deployment-prep [09:51:32] (03Merged) 10jenkins-bot: MentorFilterHooks: Include only primary mentors [extensions/GrowthExperiments] (wmf/1.38.0-wmf.13) - 10https://gerrit.wikimedia.org/r/750807 (https://phabricator.wikimedia.org/T298031) (owner: 10Urbanecm) [09:52:10] (03CR) 10Jbond: "lgtm minor nit see comments" [puppet] - 10https://gerrit.wikimedia.org/r/682259 (https://phabricator.wikimedia.org/T127717) (owner: 10Southparkfan) [09:53:01] (03PS2) 10Urbanecm: pwnwiki: Enable Growth features in dark mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/749264 (https://phabricator.wikimedia.org/T298115) [09:53:05] (03CR) 10Urbanecm: [C: 03+2] pwnwiki: Enable Growth features in dark mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/749264 (https://phabricator.wikimedia.org/T298115) (owner: 10Urbanecm) [09:53:51] (03Merged) 10jenkins-bot: pwnwiki: Enable Growth features in dark mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/749264 (https://phabricator.wikimedia.org/T298115) (owner: 10Urbanecm) [09:53:54] !log urbanecm@deploy1002 Synchronized php-1.38.0-wmf.13/extensions/GrowthExperiments/includes/Mentorship/Hooks/MentorFilterHooks.php: 24e15e1fd5c7feb2377974ee666c61aef8f82da5: MentorFilterHooks: Include only primary mentors (T298031) (duration: 01m 07s) [09:53:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:53:57] T298031: Mentees assigned to others mentors are visible in my recent changes list - 
https://phabricator.wikimedia.org/T298031 [09:54:17] (03Merged) 10jenkins-bot: rdf-streaming-updater: increase capacity for commons [deployment-charts] - 10https://gerrit.wikimedia.org/r/751171 (https://phabricator.wikimedia.org/T262265) (owner: 10DCausse) [09:57:41] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [09:57:42] (03CR) 10David Caro: profile::parsoid::diffserver: remove unused profile (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/751446 (https://phabricator.wikimedia.org/T272559) (owner: 10David Caro) [09:57:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:58:24] !log urbanecm@deploy1002 Synchronized dblists/growthexperiments.dblist: 8137ffc33d9de0f0a835223936a93e87504a7358: pwnwiki: Enable Growth features in dark mode (T298115; 1/3) (duration: 01m 07s) [09:58:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:58:26] T298115: Deploy Growth features at pwn.wikipedia.org - https://phabricator.wikimedia.org/T298115 [09:58:45] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [09:58:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:58:46] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [09:58:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:59:13] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Unused puppet resources audit, 2021 - https://phabricator.wikimedia.org/T272559 (10dcaro) [09:59:36] !log dcausse@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'rdf-streaming-updater' for release 'main' . 
[09:59:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:59:44] !log urbanecm@deploy1002 Synchronized wmf-config/config/pwnwiki.yaml: 8137ffc33d9de0f0a835223936a93e87504a7358: pwnwiki: Enable Growth features in dark mode (T298115; 2/3) (duration: 01m 07s) [09:59:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:00:04] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [10:00:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:00:57] !log dcausse@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'rdf-streaming-updater' for release 'main' . [10:00:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:01:10] (03Abandoned) 10David Caro: profile::parsoid::diffserver: remove unused profile [puppet] - 10https://gerrit.wikimedia.org/r/751446 (https://phabricator.wikimedia.org/T272559) (owner: 10David Caro) [10:01:34] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 8137ffc33d9de0f0a835223936a93e87504a7358: pwnwiki: Enable Growth features in dark mode (T298115; 3/3) (duration: 01m 07s) [10:01:37] (03CR) 10David Caro: [C: 03+2] docker: remove unused modules/role/profiles [puppet] - 10https://gerrit.wikimedia.org/r/751445 (https://phabricator.wikimedia.org/T272559) (owner: 10David Caro) [10:01:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:02:32] * urbanecm done [10:02:51] !log dcausse@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'rdf-streaming-updater' for release 'main' . 
[10:02:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:03:33] (03CR) 10David Caro: [C: 03+2] p:db:development,r:beta_dashboards: remove unused classes [puppet] - 10https://gerrit.wikimedia.org/r/751430 (https://phabricator.wikimedia.org/T272559) (owner: 10David Caro) [10:04:09] (03CR) 10David Caro: [C: 03+2] bastionhost::migration: remove unused profile [puppet] - 10https://gerrit.wikimedia.org/r/751401 (https://phabricator.wikimedia.org/T272559) (owner: 10David Caro) [10:04:27] (KubernetesCalicoDown) resolved: kubestage1001.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org [10:07:53] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Unused puppet resources audit, 2021 - https://phabricator.wikimedia.org/T272559 (10dcaro) [10:11:28] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={cache_envoy,envoy} site=eqsin https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:17:04] (03PS2) 10David Caro: r:wmcs::openstack::codfw1dev::virt: delete unused role [puppet] - 10https://gerrit.wikimedia.org/r/737437 [10:17:45] (03CR) 10David Caro: parsoid: remove unused module (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/751163 (https://phabricator.wikimedia.org/T272559) (owner: 10David Caro) [10:21:21] (03PS1) 10David Caro: {p,r}:dumps:generation:sever:alldumps: remove usused role/profile [puppet] - 10https://gerrit.wikimedia.org/r/751693 (https://phabricator.wikimedia.org/T272559) [10:22:15] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Unused puppet resources audit, 2021 - https://phabricator.wikimedia.org/T272559 (10dcaro) [10:22:39] (03CR) 10Jgiannelos: "nit: In the commit message its `name_` not `_name`. Other than that it looks ok." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/751490 (https://phabricator.wikimedia.org/T288728) (owner: 10MSantos) [10:22:50] (03CR) 10Jgiannelos: [C: 03+1] tegola: place_label i18n fix [deployment-charts] - 10https://gerrit.wikimedia.org/r/751490 (https://phabricator.wikimedia.org/T288728) (owner: 10MSantos) [10:23:08] (03PS1) 10Vgutierrez: cache::envoy: Allow envoyproxy unit to write on /var/cache/ocsp [puppet] - 10https://gerrit.wikimedia.org/r/751694 (https://phabricator.wikimedia.org/T271421) [10:24:10] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33129/console" [puppet] - 10https://gerrit.wikimedia.org/r/751694 (https://phabricator.wikimedia.org/T271421) (owner: 10Vgutierrez) [10:25:46] (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] cache::envoy: Allow envoyproxy unit to write on /var/cache/ocsp [puppet] - 10https://gerrit.wikimedia.org/r/751694 (https://phabricator.wikimedia.org/T271421) (owner: 10Vgutierrez) [10:28:16] (03PS1) 10David Caro: {p,r}:gerrit:migration/migration_base: remove unused role/profile [puppet] - 10https://gerrit.wikimedia.org/r/751696 (https://phabricator.wikimedia.org/T272559) [10:29:09] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Unused puppet resources audit, 2021 - https://phabricator.wikimedia.org/T272559 (10dcaro) [10:32:22] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Unused puppet resources audit, 2021 - https://phabricator.wikimedia.org/T272559 (10dcaro) [10:37:40] !log aokoth@cumin1001 START - Cookbook sre.dns.netbox [10:37:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:39:58] !log aokoth@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:40:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:43:15] (03PS1) 10Vgutierrez: cache::envoy: Allow envoy to read sslcert managed TLS material [puppet] - 
10https://gerrit.wikimedia.org/r/751698 (https://phabricator.wikimedia.org/T271421) [10:44:33] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33130/console" [puppet] - 10https://gerrit.wikimedia.org/r/751698 (https://phabricator.wikimedia.org/T271421) (owner: 10Vgutierrez) [10:45:00] (03PS1) 10David Caro: logstash::puppetreports: remove unused role/profile [puppet] - 10https://gerrit.wikimedia.org/r/751699 (https://phabricator.wikimedia.org/T272559) [10:45:25] (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] cache::envoy: Allow envoy to read sslcert managed TLS material [puppet] - 10https://gerrit.wikimedia.org/r/751698 (https://phabricator.wikimedia.org/T271421) (owner: 10Vgutierrez) [10:47:07] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Unused puppet resources audit, 2021 - https://phabricator.wikimedia.org/T272559 (10dcaro) [10:48:33] !log CI: switching MediaWiki selenium from php built-in server to Apache # https://gerrit.wikimedia.org/r/751697 [10:48:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:49:04] (03PS1) 10Marostegui: Revert "dbproxy2004: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/751532 [10:49:43] (03CR) 10Marostegui: [C: 03+2] Revert "dbproxy2004: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/751532 (owner: 10Marostegui) [10:51:19] (03PS1) 10David Caro: profile::nutcracker: remove unused profile [puppet] - 10https://gerrit.wikimedia.org/r/751701 (https://phabricator.wikimedia.org/T272559) [10:51:25] (03PS1) 10Marostegui: dbproxy2003: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/751702 (https://phabricator.wikimedia.org/T298586) [10:52:02] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Unused puppet resources audit, 2021 - https://phabricator.wikimedia.org/T272559 (10dcaro) [10:52:25] (03CR) 10Marostegui: [C: 03+2] dbproxy2003: Disable 
notifications [puppet] - 10https://gerrit.wikimedia.org/r/751702 (https://phabricator.wikimedia.org/T298586) (owner: 10Marostegui) [10:52:29] (03CR) 10JMeybohm: [C: 03+1] charts: update charts to api v2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/751070 (https://phabricator.wikimedia.org/T295750) (owner: 10Jelto) [10:52:57] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host dbproxy2003.codfw.wmnet with OS bullseye [10:52:58] (03CR) 10JMeybohm: [C: 03+1] services: cleanup helmfiles, update SAL logging [deployment-charts] - 10https://gerrit.wikimedia.org/r/737034 (https://phabricator.wikimedia.org/T251305) (owner: 10Jelto) [10:52:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:54:00] !log upload cfssl 1.6.1 [10:54:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:59:07] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Unused puppet resources audit, 2021 - https://phabricator.wikimedia.org/T272559 (10dcaro) [11:00:53] (03CR) 10Jbond: [C: 03+1] {p,r}:dumps:generation:sever:alldumps: remove usused role/profile [puppet] - 10https://gerrit.wikimedia.org/r/751693 (https://phabricator.wikimedia.org/T272559) (owner: 10David Caro) [11:01:42] (03CR) 10Jbond: "lgtm but added hasher, how may know if its still useful" [puppet] - 10https://gerrit.wikimedia.org/r/751696 (https://phabricator.wikimedia.org/T272559) (owner: 10David Caro) [11:02:25] (03CR) 10Jbond: [C: 03+1] logstash::puppetreports: remove unused role/profile [puppet] - 10https://gerrit.wikimedia.org/r/751699 (https://phabricator.wikimedia.org/T272559) (owner: 10David Caro) [11:02:44] (03CR) 10Jbond: [C: 03+1] profile::nutcracker: remove unused profile [puppet] - 10https://gerrit.wikimedia.org/r/751701 (https://phabricator.wikimedia.org/T272559) (owner: 10David Caro) [11:05:09] (03PS1) 10David Caro: osm: remove unused profile/role [puppet] - 10https://gerrit.wikimedia.org/r/751703 
(https://phabricator.wikimedia.org/T272559) [11:06:46] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Unused puppet resources audit, 2021 - https://phabricator.wikimedia.org/T272559 (10dcaro) [11:17:35] (03PS1) 10David Caro: product_analytics: remove unused profiles/roles [puppet] - 10https://gerrit.wikimedia.org/r/751704 (https://phabricator.wikimedia.org/T272559) [11:18:40] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Unused puppet resources audit, 2021 - https://phabricator.wikimedia.org/T272559 (10dcaro) [11:19:18] (03PS1) 10Marostegui: install_server: Allow reiage dbproxy2003 [puppet] - 10https://gerrit.wikimedia.org/r/751707 (https://phabricator.wikimedia.org/T298586) [11:20:20] (03CR) 10Marostegui: [C: 03+2] install_server: Allow reiage dbproxy2003 [puppet] - 10https://gerrit.wikimedia.org/r/751707 (https://phabricator.wikimedia.org/T298586) (owner: 10Marostegui) [11:20:48] !log marostegui@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dbproxy2003.codfw.wmnet with OS bullseye [11:20:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:24:03] !log aokoth@cumin1001 START - Cookbook sre.hosts.decommission for hosts kubestage1002.eqiad.wmnet [11:24:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:24:21] !log updating hive packages in reprepro for log4j update [11:24:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:24:49] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host dbproxy2003.codfw.wmnet with OS bullseye [11:24:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:25:53] (03PS1) 10David Caro: pybal::testing: remove unused role/profile [puppet] - 10https://gerrit.wikimedia.org/r/751709 (https://phabricator.wikimedia.org/T272559) [11:27:06] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Unused puppet resources audit, 2021 - 
https://phabricator.wikimedia.org/T272559 (10dcaro) [11:30:26] (KubernetesCalicoDown) firing: kubestage1002.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org [11:30:45] (03PS2) 10David Caro: product_analytics: remove unused profiles/roles [puppet] - 10https://gerrit.wikimedia.org/r/751704 (https://phabricator.wikimedia.org/T272559) [11:30:47] (03PS1) 10David Caro: r_lang::bioc: remove unused module [puppet] - 10https://gerrit.wikimedia.org/r/751710 (https://phabricator.wikimedia.org/T272559) [11:31:12] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Unused puppet resources audit, 2021 - https://phabricator.wikimedia.org/T272559 (10dcaro) [11:31:13] (03PS1) 10Vgutierrez: sslcert::ocsp: Allow configuring group ownership on /var/cache/ocsp [puppet] - 10https://gerrit.wikimedia.org/r/751711 (https://phabricator.wikimedia.org/T271421) [11:33:41] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (DIFF 5): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33132/console" [puppet] - 10https://gerrit.wikimedia.org/r/751711 (https://phabricator.wikimedia.org/T271421) (owner: 10Vgutierrez) [11:34:01] !log aokoth@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts kubestage1002.eqiad.wmnet [11:34:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:35:41] (KubernetesCalicoDown) resolved: kubestage1002.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org [11:37:51] (03PS2) 10Vgutierrez: sslcert::ocsp: Allow configuring group ownership on /var/cache/ocsp [puppet] - 10https://gerrit.wikimedia.org/r/751711 (https://phabricator.wikimedia.org/T271421) [11:39:18] (03CR) 10Ema: [C: 03+1] sslcert::ocsp: Allow configuring group ownership on /var/cache/ocsp [puppet] - 10https://gerrit.wikimedia.org/r/751711 
(https://phabricator.wikimedia.org/T271421) (owner: 10Vgutierrez) [11:40:26] (KubernetesCalicoDown) firing: kubestage1002.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org [11:41:57] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Unused puppet resources audit, 2021 - https://phabricator.wikimedia.org/T272559 (10dcaro) [11:43:27] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (NOOP 5 DIFF 5): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33133/console" [puppet] - 10https://gerrit.wikimedia.org/r/751711 (https://phabricator.wikimedia.org/T271421) (owner: 10Vgutierrez) [11:44:07] (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] sslcert::ocsp: Allow configuring group ownership on /var/cache/ocsp [puppet] - 10https://gerrit.wikimedia.org/r/751711 (https://phabricator.wikimedia.org/T271421) (owner: 10Vgutierrez) [11:47:11] (KubernetesCalicoDown) resolved: kubestage1002.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org [11:48:24] (03PS1) 10David Caro: r:analytics_test_cluster::{turnilo,webserver}: remove unused roles [puppet] - 10https://gerrit.wikimedia.org/r/751714 (https://phabricator.wikimedia.org/T272559) [11:48:27] (03CR) 10Jelto: [C: 03+2] services: cleanup helmfiles, update SAL logging [deployment-charts] - 10https://gerrit.wikimedia.org/r/737034 (https://phabricator.wikimedia.org/T251305) (owner: 10Jelto) [11:49:11] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Unused puppet resources audit, 2021 - https://phabricator.wikimedia.org/T272559 (10dcaro) [11:49:11] (KubernetesCalicoDown) firing: kubestage1002.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org [11:49:37] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within 
thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [11:49:39] (03PS2) 10Majavah: [wikitech] Drop the 'cloudadmin' user group, no longer used and empty [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575390 (https://phabricator.wikimedia.org/T237890) (owner: 10Jforrester) [11:50:26] (KubernetesCalicoDown) resolved: kubestage1002.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org [11:50:32] (03CR) 10Majavah: "> Patch Set 1: Code-Review-1" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575390 (https://phabricator.wikimedia.org/T237890) (owner: 10Jforrester) [11:51:36] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Unused puppet resources audit, 2021 - https://phabricator.wikimedia.org/T272559 (10dcaro) [11:52:19] (03Merged) 10jenkins-bot: services: cleanup helmfiles, update SAL logging [deployment-charts] - 10https://gerrit.wikimedia.org/r/737034 (https://phabricator.wikimedia.org/T251305) (owner: 10Jelto) [11:53:47] (03PS1) 10David Caro: r:cloud_analytics: remove unused roles [puppet] - 10https://gerrit.wikimedia.org/r/751716 (https://phabricator.wikimedia.org/T272559) [11:54:19] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Unused puppet resources audit, 2021 - https://phabricator.wikimedia.org/T272559 (10dcaro) [11:54:21] (03PS1) 10Giuseppe Lavagetto: envoy: make the choice of api version explicit [puppet] - 10https://gerrit.wikimedia.org/r/751717 [11:54:25] (03PS1) 10Giuseppe Lavagetto: services_proxy::envoy: add support for v3 configuration [puppet] - 10https://gerrit.wikimedia.org/r/751718 [11:54:29] (03CR) 10Jbond: [C: 03+1] osm: remove unused profile/role [puppet] - 10https://gerrit.wikimedia.org/r/751703 (https://phabricator.wikimedia.org/T272559) (owner: 10David Caro) [11:54:42] (03PS1) 10Vgutierrez: cache::envoy: Add 
sslcert::ocsp::hook [puppet] - 10https://gerrit.wikimedia.org/r/751719 (https://phabricator.wikimedia.org/T271421) [11:54:52] (03CR) 10Jbond: [C: 03+1] pybal::testing: remove unused role/profile [puppet] - 10https://gerrit.wikimedia.org/r/751709 (https://phabricator.wikimedia.org/T272559) (owner: 10David Caro) [11:55:04] !log jelto@deploy1002 helmfile [staging] START helmfile.d/services/blubberoid: apply on staging [11:55:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:55:07] !log jelto@deploy1002 helmfile [staging] DONE helmfile.d/services/blubberoid: apply on production [11:55:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:55:31] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp5005.eqsin.wmnet with OS buster [11:55:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:55:39] 10SRE, 10Traffic, 10Patch-For-Review: Test envoyproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T271421 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vgutierrez@cumin1001 for host cp5005.eqsin.wmnet with OS buster completed: - cp5005 (**WARN*... 
[11:55:42] (03CR) 10Jbond: [C: 03+1] product_analytics: remove unused profiles/roles [puppet] - 10https://gerrit.wikimedia.org/r/751704 (https://phabricator.wikimedia.org/T272559) (owner: 10David Caro) [11:55:48] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33134/console" [puppet] - 10https://gerrit.wikimedia.org/r/751719 (https://phabricator.wikimedia.org/T271421) (owner: 10Vgutierrez) [11:55:53] !log jelto@deploy1002 helmfile [staging] DONE helmfile.d/services/blubberoid: sync on staging [11:55:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:56:21] (03CR) 10Jbond: [C: 03+1] r_lang::bioc: remove unused module [puppet] - 10https://gerrit.wikimedia.org/r/751710 (https://phabricator.wikimedia.org/T272559) (owner: 10David Caro) [11:56:30] (03PS1) 10Marostegui: Revert "install_server: Allow reiage dbproxy2003" [puppet] - 10https://gerrit.wikimedia.org/r/751533 [11:56:40] !log rollout cfssl 1.6.1 [11:56:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:56:50] jayme: fyi ^^ [11:56:53] (03CR) 10jerkins-bot: [V: 04-1] services_proxy::envoy: add support for v3 configuration [puppet] - 10https://gerrit.wikimedia.org/r/751718 (owner: 10Giuseppe Lavagetto) [11:57:17] (03CR) 10Marostegui: [C: 03+2] Revert "install_server: Allow reiage dbproxy2003" [puppet] - 10https://gerrit.wikimedia.org/r/751533 (owner: 10Marostegui) [11:57:19] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33135/console" [puppet] - 10https://gerrit.wikimedia.org/r/751717 (owner: 10Giuseppe Lavagetto) [11:57:22] (03CR) 10Jbond: [C: 03+1] r:analytics_test_cluster::{turnilo,webserver}: remove unused roles [puppet] - 10https://gerrit.wikimedia.org/r/751714 (https://phabricator.wikimedia.org/T272559) (owner: 10David Caro) [11:57:30] (03PS1) 10David Caro: 
role::graphite::base: remove unused role [puppet] - 10https://gerrit.wikimedia.org/r/751720 (https://phabricator.wikimedia.org/T272559) [11:57:33] (03PS1) 104nn1l2: Add artsobservasjoner.no to the wgCopyUploadsDomains allowlist of Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/751721 (https://phabricator.wikimedia.org/T298449) [11:57:45] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dbproxy2003.codfw.wmnet with OS bullseye [11:57:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:57:48] (03CR) 10Jbond: [C: 03+1] r:cloud_analytics: remove unused roles [puppet] - 10https://gerrit.wikimedia.org/r/751716 (https://phabricator.wikimedia.org/T272559) (owner: 10David Caro) [11:58:18] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Unused puppet resources audit, 2021 - https://phabricator.wikimedia.org/T272559 (10dcaro) [11:59:17] (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] cache::envoy: Add sslcert::ocsp::hook [puppet] - 10https://gerrit.wikimedia.org/r/751719 (https://phabricator.wikimedia.org/T271421) (owner: 10Vgutierrez) [12:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: (Dis)respected human, time to deploy UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220105T1200). Please do the needful. [12:00:04] aharoni and Juan_90264: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[12:00:32] here, if any patch authors will appear here [12:01:42] Hi [12:01:54] * urbanecm also waves [12:02:03] (03PS1) 10David Caro: r:kafka::simple::mirror: remove unused role [puppet] - 10https://gerrit.wikimedia.org/r/751723 (https://phabricator.wikimedia.org/T272559) [12:02:07] I'm around, but i prefer if taavi does the deployment [12:02:43] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Unused puppet resources audit, 2021 - https://phabricator.wikimedia.org/T272559 (10dcaro) [12:02:54] (03CR) 10Majavah: [C: 03+2] Add artsobservasjoner.no to the wgCopyUploadsDomains allowlist of Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/751721 (https://phabricator.wikimedia.org/T298449) (owner: 104nn1l2) [12:03:37] (03Merged) 10jenkins-bot: Add artsobservasjoner.no to the wgCopyUploadsDomains allowlist of Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/751721 (https://phabricator.wikimedia.org/T298449) (owner: 104nn1l2) [12:04:13] nn1l2: your patch is on mwdebug1001, please test [12:04:22] ok [12:06:10] (03PS1) 10David Caro: role::logstash::elasticsearch: remove unused role [puppet] - 10https://gerrit.wikimedia.org/r/751724 (https://phabricator.wikimedia.org/T272559) [12:06:26] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [12:06:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:06:38] taavi is the backport deployment here or on #wikimedia-releng ? 
[12:06:52] there's some problem [12:07:02] aharoni: backporting and other deployment stuff happens here [12:07:06] I can't upload https://www.artsobservasjoner.no/MediaLibrary/2020/12/47180ffa-9617-4787-95ca-acc067a27529_image.jpg [12:07:24] !log pool cp5005 running envoyproxy as TLS terminator - T271421 [12:07:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:07:27] T271421: Test envoyproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T271421 [12:07:34] nn1l2: that's www.artsobservasjoner.no, not just artsobservasjoner.no, which is what the patch added [12:07:42] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Unused puppet resources audit, 2021 - https://phabricator.wikimedia.org/T272559 (10dcaro) [12:08:06] so the patch is wrong, and I should fix it [12:08:22] yes please [12:08:36] Can we postpone this to the end of B&C? [12:08:46] 10SRE, 10Traffic, 10Patch-For-Review: Test envoyproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T271421 (10Vgutierrez) [12:09:02] taavi my patch https://gerrit.wikimedia.org/r/c/749890/ is in the list. Should be quite simple to test. [12:09:14] aharoni: one patch at a time please [12:09:23] no problem :) [12:09:49] nn1l2: I'll need to revert that patch if we don't sync it immediately (which is fine too) [12:09:53] (03PS1) 10David Caro: role::mariadb: remove unused role [puppet] - 10https://gerrit.wikimedia.org/r/751725 (https://phabricator.wikimedia.org/T272559) [12:10:01] so let's revert that and you can make a fixed patch to deploy later?
[12:10:22] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [12:10:24] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [12:10:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:10:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:10:29] So please revert, sorry! [12:10:30] (03CR) 10Jbond: [C: 03+1] role::graphite::base: remove unused role [puppet] - 10https://gerrit.wikimedia.org/r/751720 (https://phabricator.wikimedia.org/T272559) (owner: 10David Caro) [12:10:36] sure, no worries [12:10:47] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Unused puppet resources audit, 2021 - https://phabricator.wikimedia.org/T272559 (10dcaro) [12:10:51] (03CR) 10Jbond: [C: 03+1] r:kafka::simple::mirror: remove unused role [puppet] - 10https://gerrit.wikimedia.org/r/751723 (https://phabricator.wikimedia.org/T272559) (owner: 10David Caro) [12:11:03] (03PS1) 10Majavah: Revert "Add artsobservasjoner.no to the wgCopyUploadsDomains allowlist of Commons" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/751534 [12:11:08] (03CR) 10Majavah: [C: 03+2] Revert "Add artsobservasjoner.no to the wgCopyUploadsDomains allowlist of Commons" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/751534 (owner: 10Majavah) [12:11:11] (03CR) 10Jbond: [C: 03+1] role::logstash::elasticsearch: remove unused role [puppet] - 10https://gerrit.wikimedia.org/r/751724 (https://phabricator.wikimedia.org/T272559) (owner: 10David Caro) [12:11:43] aharoni: your patch is up next! 
[12:11:54] (03Merged) 10jenkins-bot: Revert "Add artsobservasjoner.no to the wgCopyUploadsDomains allowlist of Commons" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/751534 (owner: 10Majavah) [12:12:00] (03PS4) 10Majavah: Add akwiki as an import source for twwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/749890 (https://phabricator.wikimedia.org/T298296) (owner: 10Amire80) [12:12:18] (03CR) 10Majavah: [C: 03+2] "diffConfig looks good" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/749890 (https://phabricator.wikimedia.org/T298296) (owner: 10Amire80) [12:12:31] (03CR) 10Jbond: [C: 03+1] role::mariadb: remove unused role [puppet] - 10https://gerrit.wikimedia.org/r/751725 (https://phabricator.wikimedia.org/T272559) (owner: 10David Caro) [12:13:02] (03Merged) 10jenkins-bot: Add akwiki as an import source for twwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/749890 (https://phabricator.wikimedia.org/T298296) (owner: 10Amire80) [12:13:17] (03PS1) 10David Caro: role::mariadb::proxy: remove unused role [puppet] - 10https://gerrit.wikimedia.org/r/751726 (https://phabricator.wikimedia.org/T272559) [12:13:32] aharoni: your patch is on mwdebug1001, can you test please? 
[12:14:14] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [12:14:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:14:45] looking [12:14:54] tested, looks good [12:15:02] thanks, syncing then [12:16:08] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Unused puppet resources audit, 2021 - https://phabricator.wikimedia.org/T272559 (10dcaro) [12:16:18] !log taavi@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:749890|Add akwiki as an import source for twwiki (T298296)]] (duration: 01m 09s) [12:16:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:16:21] T298296: Add Akan (ak) Wikipedia as import source for the Twi (tw) Wikipedia - https://phabricator.wikimedia.org/T298296 [12:16:24] that should be live now [12:16:29] anyone have anything else to deploy? [12:17:51] urbanecm: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/575390/ technically still has a -1, so I don't think we can deploy it yet? [12:18:08] (03PS1) 10David Caro: role::memcached: remove unused role [puppet] - 10https://gerrit.wikimedia.org/r/751727 (https://phabricator.wikimedia.org/T272559) [12:18:39] taavi: we can wait for James to comment, yes.
[12:18:51] (03PS1) 10Marostegui: install_server: Allow dbproxy2* reimage [puppet] - 10https://gerrit.wikimedia.org/r/751728 (https://phabricator.wikimedia.org/T298586) [12:18:52] yeah, in that case [12:18:57] !log UTC morning deploys done [12:18:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:19:17] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [12:19:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:19:34] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Unused puppet resources audit, 2021 - https://phabricator.wikimedia.org/T272559 (10dcaro) [12:20:31] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [12:20:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:20:33] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [12:20:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:21:27] (03CR) 10Marostegui: [C: 03+2] install_server: Allow dbproxy2* reimage [puppet] - 10https://gerrit.wikimedia.org/r/751728 (https://phabricator.wikimedia.org/T298586) (owner: 10Marostegui) [12:21:32] (03CR) 10Jelto: [V: 03+1 C: 03+2] deployment_server: remove obsolete value helmBinary [puppet] - 10https://gerrit.wikimedia.org/r/751067 (https://phabricator.wikimedia.org/T251305) (owner: 10Jelto) [12:21:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org [12:23:43] hi, i got some "upstream connect error or disconnect/reset before headers …" never seen before, just for record [12:23:54] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Unused puppet resources 
audit, 2021 - https://phabricator.wikimedia.org/T272559 (10dcaro) [12:24:10] PROBLEM - Varnish HTTP text-frontend - port 80 on cp3064 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [12:24:26] PROBLEM - Varnish HTTP text-frontend - port 80 on cp1089 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [12:24:26] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [12:24:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:24:48] (03PS2) 10Jelto: deployment_server: remove obsolete value helmBinary [puppet] - 10https://gerrit.wikimedia.org/r/751067 (https://phabricator.wikimedia.org/T251305) [12:25:51] (03PS1) 10David Caro: role::openldap::labtest: remove unused role [puppet] - 10https://gerrit.wikimedia.org/r/751730 (https://phabricator.wikimedia.org/T272559) [12:26:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org [12:26:59] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33136/console" [puppet] - 10https://gerrit.wikimedia.org/r/751067 (https://phabricator.wikimedia.org/T251305) (owner: 10Jelto) [12:28:38] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Unused puppet resources audit, 2021 - https://phabricator.wikimedia.org/T272559 (10dcaro) [12:31:16] (03PS1) 10David Caro: role::prometheus::labs_project: remove unused role [puppet] - 10https://gerrit.wikimedia.org/r/751731 (https://phabricator.wikimedia.org/T272559) [12:31:56] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Unused puppet resources audit, 2021 - https://phabricator.wikimedia.org/T272559 (10dcaro) 
[12:34:15] (03PS1) 10David Caro: service::deploy::scap: remove unused define [puppet] - 10https://gerrit.wikimedia.org/r/751732 (https://phabricator.wikimedia.org/T272559) [12:34:59] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Unused puppet resources audit, 2021 - https://phabricator.wikimedia.org/T272559 (10dcaro) [12:36:31] (03PS1) 10David Caro: service::packages: remove unused define [puppet] - 10https://gerrit.wikimedia.org/r/751734 (https://phabricator.wikimedia.org/T272559) [12:37:06] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Unused puppet resources audit, 2021 - https://phabricator.wikimedia.org/T272559 (10dcaro) [12:37:21] (03PS1) 10Marostegui: dbproxy200[1,2]: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/751735 (https://phabricator.wikimedia.org/T298586) [12:38:09] (03CR) 10Marostegui: [C: 03+2] dbproxy200[1,2]: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/751735 (https://phabricator.wikimedia.org/T298586) (owner: 10Marostegui) [12:38:21] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host dbproxy2002.codfw.wmnet with OS bullseye [12:38:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:39:50] PROBLEM - SSH on restbase2010.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:42:31] (03PS1) 10David Caro: ssh: remove unused class [puppet] - 10https://gerrit.wikimedia.org/r/751736 (https://phabricator.wikimedia.org/T272559) [12:43:05] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Unused puppet resources audit, 2021 - https://phabricator.wikimedia.org/T272559 (10dcaro) [12:45:06] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Unused puppet resources audit, 2021 - https://phabricator.wikimedia.org/T272559 (10dcaro) [12:45:38] RECOVERY - Varnish HTTP text-frontend - port 80 on cp1089 is OK: 
HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Varnish [12:46:41] Is B&C still going on? [12:47:55] I can't push my fixed patch to https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/751721/ probably because it is merged. It says it's closed. How can I open it? [12:49:14] RECOVERY - Varnish HTTP text-frontend - port 80 on cp3064 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.163 second response time https://wikitech.wikimedia.org/wiki/Varnish [12:49:15] you can't amend a merged patch. You need to upload a new one. [12:50:51] But it was wrong and reverted. Do I still need to upload a new patch? [12:52:18] (03PS1) 10David Caro: statsd: remove unused module [puppet] - 10https://gerrit.wikimedia.org/r/751737 (https://phabricator.wikimedia.org/T272559) [12:52:57] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Unused puppet resources audit, 2021 - https://phabricator.wikimedia.org/T272559 (10dcaro) [12:53:05] (03CR) 10Jbond: [C: 03+1] role::memcached: remove unused role [puppet] - 10https://gerrit.wikimedia.org/r/751727 (https://phabricator.wikimedia.org/T272559) (owner: 10David Caro) [12:53:11] yes. it got merged the way it was and then reverted. If you want the patch applied with the issue fixed, then yes, you need to upload it as a new patch.
[12:53:18] (03CR) 10Jbond: [C: 03+1] role::openldap::labtest: remove unused role [puppet] - 10https://gerrit.wikimedia.org/r/751730 (https://phabricator.wikimedia.org/T272559) (owner: 10David Caro) [12:53:43] (03CR) 10Jbond: [C: 03+1] role::prometheus::labs_project: remove unused role [puppet] - 10https://gerrit.wikimedia.org/r/751731 (https://phabricator.wikimedia.org/T272559) (owner: 10David Caro) [12:54:00] (03CR) 10Jbond: [C: 03+1] service::deploy::scap: remove unused define [puppet] - 10https://gerrit.wikimedia.org/r/751732 (https://phabricator.wikimedia.org/T272559) (owner: 10David Caro) [12:54:38] (03CR) 10Jbond: [C: 03+1] service::packages: remove unused define [puppet] - 10https://gerrit.wikimedia.org/r/751734 (https://phabricator.wikimedia.org/T272559) (owner: 10David Caro) [12:55:11] Thanks zabe [12:55:41] (03CR) 10Jbond: [C: 03+1] ssh: remove unused class [puppet] - 10https://gerrit.wikimedia.org/r/751736 (https://phabricator.wikimedia.org/T272559) (owner: 10David Caro) [12:57:33] (03PS1) 10David Caro: stunnel: remove unused module [puppet] - 10https://gerrit.wikimedia.org/r/751738 (https://phabricator.wikimedia.org/T272559) [12:58:17] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Unused puppet resources audit, 2021 - https://phabricator.wikimedia.org/T272559 (10dcaro) [13:01:45] (03PS1) 10David Caro: systemd::preset: remove unused define [puppet] - 10https://gerrit.wikimedia.org/r/751739 (https://phabricator.wikimedia.org/T272559) [13:03:20] (03CR) 10David Caro: [C: 03+2] r:wmcs::openstack::codfw1dev::virt: delete unused role [puppet] - 10https://gerrit.wikimedia.org/r/737437 (owner: 10David Caro) [13:04:17] (03CR) 10David Caro: [C: 03+2] logstash::puppetreports: remove unused role/profile [puppet] - 10https://gerrit.wikimedia.org/r/751699 (https://phabricator.wikimedia.org/T272559) (owner: 10David Caro) [13:04:48] (03CR) 10David Caro: [C: 03+2] role::graphite::base: remove unused role [puppet] - 
10https://gerrit.wikimedia.org/r/751720 (https://phabricator.wikimedia.org/T272559) (owner: 10David Caro) [13:05:18] (03CR) 10David Caro: [C: 03+2] role::openldap::labtest: remove unused role [puppet] - 10https://gerrit.wikimedia.org/r/751730 (https://phabricator.wikimedia.org/T272559) (owner: 10David Caro) [13:05:44] (03CR) 10David Caro: [C: 03+2] ssh: remove unused class [puppet] - 10https://gerrit.wikimedia.org/r/751736 (https://phabricator.wikimedia.org/T272559) (owner: 10David Caro) [13:06:16] (03CR) 10JMeybohm: [C: 03+1] deployment_server: remove obsolete value helmBinary [puppet] - 10https://gerrit.wikimedia.org/r/751067 (https://phabricator.wikimedia.org/T251305) (owner: 10Jelto) [13:06:56] (03CR) 10David Caro: [C: 03+2] role::logstash::elasticsearch: remove unused role [puppet] - 10https://gerrit.wikimedia.org/r/751724 (https://phabricator.wikimedia.org/T272559) (owner: 10David Caro) [13:07:04] (03PS2) 10David Caro: role::logstash::elasticsearch: remove unused role [puppet] - 10https://gerrit.wikimedia.org/r/751724 (https://phabricator.wikimedia.org/T272559) [13:10:51] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dbproxy2002.codfw.wmnet with OS bullseye [13:10:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:19] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host dbproxy2001.codfw.wmnet with OS bullseye [13:11:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:13:07] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Unused puppet resources audit, 2021 - https://phabricator.wikimedia.org/T272559 (10dcaro) [13:15:53] (03CR) 10Jelto: [V: 03+1 C: 03+2] deployment_server: remove obsolete value helmBinary [puppet] - 10https://gerrit.wikimedia.org/r/751067 (https://phabricator.wikimedia.org/T251305) (owner: 10Jelto) [13:22:43] !log jelto@deploy1002 helmfile [staging] START helmfile.d/services/shellbox: apply on main 
[13:22:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:08] !log jelto@deploy1002 helmfile [staging] DONE helmfile.d/services/shellbox: sync on main [13:23:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:26] (03PS1) 10Kosta Harlan: CentralAuth: Remove config that was deprecated in 1.26 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/751740 [13:32:27] (03PS1) 10Urbanecm: Add more Zscaler ranges [extensions/TrustedXFF] (wmf/1.38.0-wmf.16) - 10https://gerrit.wikimedia.org/r/751535 (https://phabricator.wikimedia.org/T298241) [13:32:38] (03PS2) 10Urbanecm: Add more Zscaler ranges [extensions/TrustedXFF] (wmf/1.38.0-wmf.16) - 10https://gerrit.wikimedia.org/r/751535 (https://phabricator.wikimedia.org/T298241) [13:32:45] (03PS1) 10Urbanecm: Add more Zscaler ranges [extensions/TrustedXFF] (wmf/1.38.0-wmf.13) - 10https://gerrit.wikimedia.org/r/751536 (https://phabricator.wikimedia.org/T298241) [13:32:49] jouncebot: nowandnext [13:32:50] No deployments scheduled for the next 0 hour(s) and 27 minute(s) [13:32:50] In 0 hour(s) and 27 minute(s): MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220105T1400) [13:32:58] (03CR) 10Urbanecm: [C: 03+2] Add more Zscaler ranges [extensions/TrustedXFF] (wmf/1.38.0-wmf.13) - 10https://gerrit.wikimedia.org/r/751536 (https://phabricator.wikimedia.org/T298241) (owner: 10Urbanecm) [13:33:02] (03CR) 10Urbanecm: [C: 03+2] Add more Zscaler ranges [extensions/TrustedXFF] (wmf/1.38.0-wmf.16) - 10https://gerrit.wikimedia.org/r/751535 (https://phabricator.wikimedia.org/T298241) (owner: 10Urbanecm) [13:33:44] !log delete echo keys from objectcache in frwiki (T272512) [13:33:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:33:47] T272512: Apply outstanding schema changes for "objectcache" tables in production (exptime, flags, modtoken) - 
https://phabricator.wikimedia.org/T272512 [13:35:05] (03Merged) 10jenkins-bot: Add more Zscaler ranges [extensions/TrustedXFF] (wmf/1.38.0-wmf.13) - 10https://gerrit.wikimedia.org/r/751536 (https://phabricator.wikimedia.org/T298241) (owner: 10Urbanecm) [13:35:21] (03PS1) 10Majavah: Backport support for header authentication [debs/karma] - 10https://gerrit.wikimedia.org/r/751743 [13:35:39] (03Merged) 10jenkins-bot: Add more Zscaler ranges [extensions/TrustedXFF] (wmf/1.38.0-wmf.16) - 10https://gerrit.wikimedia.org/r/751535 (https://phabricator.wikimedia.org/T298241) (owner: 10Urbanecm) [13:37:07] 10SRE, 10serviceops, 10Kubernetes, 10Patch-For-Review: Migrate to helm v3 - https://phabricator.wikimedia.org/T251305 (10Jelto) [13:37:45] !log urbanecm@deploy1002 Synchronized php-1.38.0-wmf.13/extensions/TrustedXFF/: d35e36f4deb7a8e2a454769f4b2d72e45318fcc9: Add more Zscaler ranges (T298241) (duration: 01m 09s) [13:37:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:37:48] T298241: Add Zscaler to list of trusted hosts for XFF - https://phabricator.wikimedia.org/T298241 [13:38:55] !log urbanecm@deploy1002 Synchronized php-1.38.0-wmf.16/extensions/TrustedXFF/: ce7113b99712ac7ce4112cff720c669f618df6eb: Add more Zscaler ranges (T298241) (duration: 01m 09s) [13:38:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:05] * urbanecm done [13:40:09] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [13:40:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:40:11] RECOVERY - SSH on restbase2010.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:40:33] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dbproxy2001.codfw.wmnet with OS bullseye [13:40:34] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:23] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [13:41:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:25] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [13:41:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:42:38] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [13:42:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:22] (03PS1) 10Marostegui: Revert "install_server: Allow dbproxy2* reimage" [puppet] - 10https://gerrit.wikimedia.org/r/751537 [13:46:10] (03CR) 10Marostegui: [C: 03+2] Revert "install_server: Allow dbproxy2* reimage" [puppet] - 10https://gerrit.wikimedia.org/r/751537 (owner: 10Marostegui) [13:47:41] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [13:47:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:48:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db2087:3316, db2087:3317 after reimage T295965', diff saved to https://phabricator.wikimedia.org/P18402 and previous config saved to /var/cache/conftool/dbconfig/20220105-134827-marostegui.json [13:48:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:48:31] T295965: Test MariaDB 10.4 with Bullseye - https://phabricator.wikimedia.org/T295965 [13:48:47] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [13:48:48] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [13:48:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:48:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[13:49:56] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [13:49:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:10] (03CR) 10Filippo Giunchedi: [C: 03+1] role::prometheus::labs_project: remove unused role [puppet] - 10https://gerrit.wikimedia.org/r/751731 (https://phabricator.wikimedia.org/T272559) (owner: 10David Caro) [13:53:35] (03PS1) 104nn1l2: Add www.artsobservasjoner.no to the wgCopyUploadsDomains allowlist of Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/751746 (https://phabricator.wikimedia.org/T298449) [14:00:04] twentyafterfour and hashar: #bothumor My software never has bugs. It just develops random features. Rise for MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220105T1400). [14:03:40] (03PS2) 10MSantos: tegola: place_label i18n fix [deployment-charts] - 10https://gerrit.wikimedia.org/r/751490 (https://phabricator.wikimedia.org/T288728) [14:04:40] PROBLEM - SSH on contint1001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:11:00] (03CR) 10Bking: [C: 03+2] Blazegraph: further relax free allocators check [alerts] - 10https://gerrit.wikimedia.org/r/751513 (https://phabricator.wikimedia.org/T298525) (owner: 10Bking) [14:12:58] (03Merged) 10jenkins-bot: Blazegraph: further relax free allocators check [alerts] - 10https://gerrit.wikimedia.org/r/751513 (https://phabricator.wikimedia.org/T298525) (owner: 10Bking) [14:21:15] (03CR) 10Jelto: [V: 03+1 C: 04-1] "depends on Ic64e507215aa1a7e154b44d9a26ad9ac791741fb" [puppet] - 10https://gerrit.wikimedia.org/r/751452 (https://phabricator.wikimedia.org/T295481) (owner: 10Jelto) [14:22:34] (03CR) 10Jelto: [V: 03+1 C: 04-1] P:prometheus::ops: add prometheus job and ferm rules for gitlab_runner metrics (031 comment) [puppet] - 
10https://gerrit.wikimedia.org/r/751452 (https://phabricator.wikimedia.org/T295481) (owner: 10Jelto) [14:24:59] (03CR) 10Southparkfan: Add WMCS specific cloud role for syslog server (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/682259 (https://phabricator.wikimedia.org/T127717) (owner: 10Southparkfan) [14:29:52] PROBLEM - Check systemd state on maps2009 is CRITICAL: CRITICAL - degraded: The following units failed: imposm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:34:14] RECOVERY - Check systemd state on maps2009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:38:49] (03CR) 10David Caro: [C: 03+2] role::prometheus::labs_project: remove unused role [puppet] - 10https://gerrit.wikimedia.org/r/751731 (https://phabricator.wikimedia.org/T272559) (owner: 10David Caro) [14:39:31] (03CR) 10Vgutierrez: [C: 03+1] envoy: make the choice of api version explicit (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/751717 (owner: 10Giuseppe Lavagetto) [14:39:42] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Unused puppet resources audit, 2021 - https://phabricator.wikimedia.org/T272559 (10dcaro) [14:48:51] 10SRE, 10Wikimedia-Mailing-lists, 10User-Ladsgroup: Create stewards-elections@lists.wikimedia.org - https://phabricator.wikimedia.org/T298615 (10Ladsgroup) 05Open→03Resolved a:03Ladsgroup [14:48:52] (03CR) 104nn1l2: "This change is ready for review." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/751538 (https://phabricator.wikimedia.org/T298451) (owner: 104nn1l2) [14:50:58] !log aokoth@cumin1001 START - Cookbook sre.dns.netbox [14:50:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:14] jouncebot: next [14:51:14] In 2 hour(s) and 8 minute(s): Toolhub (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220105T1700) [14:51:17] (03CR) 10Filippo Giunchedi: [V: 03+1 C: 03+2] graphite: bump FETCH_TIMEOUT [puppet] - 10https://gerrit.wikimedia.org/r/751686 (https://phabricator.wikimedia.org/T298521) (owner: 10Filippo Giunchedi) [14:53:05] (03PS4) 104nn1l2: Add data.nhm.ac.uk to the wgCopyUploadsDomains allowlist of Wikimedia Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/751538 (https://phabricator.wikimedia.org/T298451) [14:54:34] !log aokoth@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:54:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:27] (03CR) 10Zabe: graphite: whisper_cleanup: migrate cron to systemd timer job (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/751470 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [15:05:02] (03PS1) 10AOkoth: kubernetes: remove kubestage1001 & kubestage1002 [puppet] - 10https://gerrit.wikimedia.org/r/751752 (https://phabricator.wikimedia.org/T293729) [15:09:45] 10SRE, 10Toolhub, 10serviceops, 10Patch-For-Review, 10Service-deployment-requests: New Service Request Toolhub - https://phabricator.wikimedia.org/T280881 (10akosiaris) >>! In T280881#7556653, @bd808 wrote: >>>! In T280881#7555760, @akosiaris wrote: >> @bd808, any news on those? > > It was not clear to... 
[15:13:36] (03CR) 10Jforrester: [wikitech] Drop the 'cloudadmin' user group, no longer used and empty (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/575390 (https://phabricator.wikimedia.org/T237890) (owner: 10Jforrester) [15:14:03] (03CR) 10Alexandros Kosiaris: [C: 03+1] wmflib: add service::get_services_for function [puppet] - 10https://gerrit.wikimedia.org/r/746801 (owner: 10Giuseppe Lavagetto) [15:14:10] 10SRE, 10SRE-OnFire, 10Infrastructure-Foundations: "User-reported connectivity errors" (NEL data) not being posted to statuspage since 1 Jan 00:00 UTC - https://phabricator.wikimedia.org/T298619 (10CDanis) [15:14:21] 10SRE, 10SRE-OnFire, 10Infrastructure-Foundations: "User-reported connectivity errors" (NEL data) not being posted to statuspage since 1 Jan 00:00 UTC - https://phabricator.wikimedia.org/T298619 (10CDanis) p:05Triage→03High [15:33:50] (03CR) 10Jbond: [C: 03+1] "i think this came about from a refactor of rsync[1] which ultimately got abandoned. the only other attempt to use it also got abandoned[2" [puppet] - 10https://gerrit.wikimedia.org/r/751738 (https://phabricator.wikimedia.org/T272559) (owner: 10David Caro) [15:35:16] (03CR) 10Jbond: Add WMCS specific cloud role for syslog server (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/682259 (https://phabricator.wikimedia.org/T127717) (owner: 10Southparkfan) [15:35:41] (03CR) 10Jbond: [C: 03+1] role::mariadb::proxy: remove unused role [puppet] - 10https://gerrit.wikimedia.org/r/751726 (https://phabricator.wikimedia.org/T272559) (owner: 10David Caro) [15:37:25] (03CR) 10Jbond: [C: 03+1] nginx::ssl: remove orphan template [puppet] - 10https://gerrit.wikimedia.org/r/751389 (https://phabricator.wikimedia.org/T272559) (owner: 10David Caro) [15:43:44] 10SRE, 10Discovery-Search (Current work): Consider filesystem/disk based improvements on WQDS servers - https://phabricator.wikimedia.org/T298570 (10Gehel) [15:45:56] 10SRE, 10SRE-OnFire, 10Infrastructure-Foundations: 
"User-reported connectivity errors" (NEL data) not being posted to statuspage since 1 Jan 00:00 UTC - https://phabricator.wikimedia.org/T298619 (10CDanis) Here's the PromQL query that statograph runs to scrape data: [[ https://gerrit.wikimedia.org/g/operatio... [15:50:28] (03CR) 10David Caro: systemd::preset: remove unused define (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/751739 (https://phabricator.wikimedia.org/T272559) (owner: 10David Caro) [15:51:17] 10SRE, 10SRE-OnFire, 10Infrastructure-Foundations: "User-reported connectivity errors" (NEL data) not being posted to statuspage since 1 Jan 00:00 UTC - https://phabricator.wikimedia.org/T298619 (10CDanis) There's also a separate issue here, which is that statograph is getting stuck on the interval where dat... [15:51:39] (03CR) 10David Caro: [C: 03+2] statsd: remove unused module [puppet] - 10https://gerrit.wikimedia.org/r/751737 (https://phabricator.wikimedia.org/T272559) (owner: 10David Caro) [15:52:09] (03CR) 10David Caro: [C: 03+2] nginx::ssl: remove orphan template [puppet] - 10https://gerrit.wikimedia.org/r/751389 (https://phabricator.wikimedia.org/T272559) (owner: 10David Caro) [15:53:22] (03CR) 10David Caro: stunnel: remove unused module (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/751738 (https://phabricator.wikimedia.org/T272559) (owner: 10David Caro) [15:53:25] (03CR) 10David Caro: [C: 03+2] stunnel: remove unused module [puppet] - 10https://gerrit.wikimedia.org/r/751738 (https://phabricator.wikimedia.org/T272559) (owner: 10David Caro) [15:54:26] (03CR) 10Jbond: [C: 03+1] systemd::preset: remove unused define (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/751739 (https://phabricator.wikimedia.org/T272559) (owner: 10David Caro) [15:55:17] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Unused puppet resources audit, 2021 - https://phabricator.wikimedia.org/T272559 (10dcaro) [15:55:44] (03CR) 10David Caro: [C: 03+2] systemd::preset: remove unused 
define [puppet] - 10https://gerrit.wikimedia.org/r/751739 (https://phabricator.wikimedia.org/T272559) (owner: 10David Caro) [15:56:38] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Unused puppet resources audit, 2021 - https://phabricator.wikimedia.org/T272559 (10dcaro) [15:56:58] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Unused puppet resources audit, 2021 - https://phabricator.wikimedia.org/T272559 (10dcaro) [16:02:08] (03CR) 10Jbond: [C: 03+2] realm.pp: this block still causes issues with pcc and puppet lookup [puppet] - 10https://gerrit.wikimedia.org/r/749746 (owner: 10Jbond) [16:02:55] (03PS5) 10Clare Ming: Deploy sticky header to pilot wikis, launch A/B test. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/747981 (https://phabricator.wikimedia.org/T295976) [16:03:43] (03PS1) 10AntiCompositeNumber: Add it namespace aliases in scn [extensions/Scribunto] (wmf/1.38.0-wmf.16) - 10https://gerrit.wikimedia.org/r/751540 (https://phabricator.wikimedia.org/T297844) [16:06:56] RECOVERY - SSH on contint1001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:07:02] (03PS1) 10JMeybohm: admin_ng: Make the cfssl-issuer return bundles [deployment-charts] - 10https://gerrit.wikimedia.org/r/751760 (https://phabricator.wikimedia.org/T294560) [16:09:50] (03PS1) 10Vgutierrez: cache::envoy: Increase upstream response timeout [puppet] - 10https://gerrit.wikimedia.org/r/751762 (https://phabricator.wikimedia.org/T271421) [16:10:45] (03CR) 10Vgutierrez: [C: 03+2] cache::envoy: Increase upstream response timeout [puppet] - 10https://gerrit.wikimedia.org/r/751762 (https://phabricator.wikimedia.org/T271421) (owner: 10Vgutierrez) [16:12:08] (03PS6) 10Southparkfan: Add WMCS specific cloud role for syslog server [puppet] - 10https://gerrit.wikimedia.org/r/682259 (https://phabricator.wikimedia.org/T127717) [16:12:55] (03CR) 10Andrew Bogott: [C: 03+1] 
profile::ceph::common: remove unused profile [puppet] - 10https://gerrit.wikimedia.org/r/751403 (https://phabricator.wikimedia.org/T272559) (owner: 10David Caro) [16:12:57] (03CR) 10Southparkfan: Add WMCS specific cloud role for syslog server (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/682259 (https://phabricator.wikimedia.org/T127717) (owner: 10Southparkfan) [16:13:03] (03CR) 10jerkins-bot: [V: 04-1] Add WMCS specific cloud role for syslog server [puppet] - 10https://gerrit.wikimedia.org/r/682259 (https://phabricator.wikimedia.org/T127717) (owner: 10Southparkfan) [16:13:08] 10SRE, 10SRE-OnFire, 10Infrastructure-Foundations: "User-reported connectivity errors" (NEL data) not being posted to statuspage since 1 Jan 00:00 UTC - https://phabricator.wikimedia.org/T298619 (10colewhite) Per the linked upstream issue, Logstash uses Joda which uses [[ https://www.joda.org/joda-time/apido... [16:13:28] (03CR) 10Andrew Bogott: [C: 03+2] hieradata: add drmrs to striker's trusted proxies [puppet] - 10https://gerrit.wikimedia.org/r/749854 (owner: 10Majavah) [16:15:35] (03CR) 10Andrew Bogott: "@ladsgroup, can you comment about whether you still want this? 
As far as I can tell no one objects but we're also not sure you want what y" [labs/private] - 10https://gerrit.wikimedia.org/r/748699 (owner: 10Ladsgroup) [16:17:15] (03CR) 10Filippo Giunchedi: [C: 04-1] graphite: whisper_cleanup: migrate cron to systemd timer job (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/751470 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [16:18:32] (03PS7) 10Southparkfan: Add WMCS specific cloud role for syslog server [puppet] - 10https://gerrit.wikimedia.org/r/682259 (https://phabricator.wikimedia.org/T127717) [16:19:13] (03CR) 10Andrew Bogott: [C: 03+1] labs_lvm:swap: remove unused module [puppet] - 10https://gerrit.wikimedia.org/r/751103 (https://phabricator.wikimedia.org/T272559) (owner: 10David Caro) [16:20:57] (03CR) 10JMeybohm: [C: 03+2] admin_ng: Make the cfssl-issuer return bundles [deployment-charts] - 10https://gerrit.wikimedia.org/r/751760 (https://phabricator.wikimedia.org/T294560) (owner: 10JMeybohm) [16:23:30] (03CR) 10Jbond: [C: 03+1] "lgtm thx" [puppet] - 10https://gerrit.wikimedia.org/r/682259 (https://phabricator.wikimedia.org/T127717) (owner: 10Southparkfan) [16:23:45] !log andrew@deploy1002 Started deploy [horizon/deploy@5e57e78]: sudo panel update [16:23:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:25:14] (03CR) 10Andrew Bogott: Add ownership annotations for WMCS services (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/732307 (https://phabricator.wikimedia.org/T216088) (owner: 10Muehlenhoff) [16:25:23] (03Merged) 10jenkins-bot: admin_ng: Make the cfssl-issuer return bundles [deployment-charts] - 10https://gerrit.wikimedia.org/r/751760 (https://phabricator.wikimedia.org/T294560) (owner: 10JMeybohm) [16:26:36] !log jayme@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. 
[16:26:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:26:51] !log jayme@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [16:26:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:27:38] !log andrew@deploy1002 Finished deploy [horizon/deploy@5e57e78]: sudo panel update (duration: 03m 53s) [16:27:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:35:19] (03PS1) 10Cwhite: logstash: update weekly indexes to use weekyear pattern syntax [puppet] - 10https://gerrit.wikimedia.org/r/751765 (https://phabricator.wikimedia.org/T298619) [16:35:21] (03PS1) 10Cwhite: prometheus: update affected es-exporter configs to use weekyear [puppet] - 10https://gerrit.wikimedia.org/r/751766 (https://phabricator.wikimedia.org/T298619) [16:36:07] !log andrew@deploy1002 Started deploy [horizon/deploy@5e57e78]: sudo panel update (codfw1dev) [16:36:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:38:15] !log andrew@deploy1002 Finished deploy [horizon/deploy@5e57e78]: sudo panel update (codfw1dev) (duration: 02m 08s) [16:38:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:39:58] (03CR) 10JMeybohm: [C: 03+2] admin_ng: Create Certificates for ingressgateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/737975 (https://phabricator.wikimedia.org/T295385) (owner: 10JMeybohm) [16:40:08] PROBLEM - Check systemd state on stat1004 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:43:41] (03Merged) 10jenkins-bot: admin_ng: Create Certificates for ingressgateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/737975 (https://phabricator.wikimedia.org/T295385) (owner: 10JMeybohm) [16:48:38] RECOVERY - Check systemd state on stat1004 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:51:32] !log jayme@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [16:51:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:51:56] (03CR) 10Jgiannelos: [C: 03+2] tegola: place_label i18n fix [deployment-charts] - 10https://gerrit.wikimedia.org/r/751490 (https://phabricator.wikimedia.org/T288728) (owner: 10MSantos) [16:51:58] !log jayme@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [16:51:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:55:22] (03Merged) 10jenkins-bot: tegola: place_label i18n fix [deployment-charts] - 10https://gerrit.wikimedia.org/r/751490 (https://phabricator.wikimedia.org/T288728) (owner: 10MSantos) [16:56:06] jouncebot now [16:56:06] No deployments scheduled for the next 0 hour(s) and 3 minute(s) [16:56:34] Hey all - I'd like to get a sec patch for T298581 deployed to wmf.13 and wmf.16 right now unless there are any objections. [16:56:56] ok w/ me. [16:59:14] PROBLEM - Check systemd state on stat1004 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:00:05] bd808: Dear deployers, time to do the Toolhub deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220105T1700).
[17:04:23] !log sbassett@deploy1002 Synchronized php-1.38.0-wmf.13/extensions/MobileFrontend/includes/specials/SpecialMobileContributions.php: Deploy security fix for T298581 (duration: 01m 08s) [17:04:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:05:38] RECOVERY - Check systemd state on stat1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:11:51] !log Deployed security fix for T298581 to wmf.16 [17:11:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:11:55] + [17:15:32] (03PS2) 10Dzahn: add parsoid-rt-tests.wikimedia.org to alternate_domains [puppet] - 10https://gerrit.wikimedia.org/r/749574 (https://phabricator.wikimedia.org/T266509) [17:15:53] (03PS1) 10Clare Ming: Don't use ts-ignore. It is hiding real errors [skins/Vector] (wmf/1.38.0-wmf.13) - 10https://gerrit.wikimedia.org/r/751543 (https://phabricator.wikimedia.org/T297119) [17:16:22] PROBLEM - Check systemd state on stat1004 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:16:50] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [17:16:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:17:03] (03CR) 10Nray: [C: 03+1] Don't use ts-ignore. 
It is hiding real errors [skins/Vector] (wmf/1.38.0-wmf.13) - 10https://gerrit.wikimedia.org/r/751543 (https://phabricator.wikimedia.org/T297119) (owner: 10Clare Ming) [17:17:48] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [17:17:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:17:49] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [17:17:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:18:55] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [17:18:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:19:18] (03CR) 10BryanDavis: [C: 03+2] toolhub: Bump container version to 2021-12-23-121200-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/749220 (https://phabricator.wikimedia.org/T271490) (owner: 10BryanDavis) [17:19:53] !log andrew@deploy1002 Started deploy [horizon/deploy@15efe04]: sudo panel update (codfw1dev) [17:19:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:20:50] RECOVERY - Check systemd state on stat1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:21:14] (03CR) 10Dzahn: "I would suggest to ask analytics before deleting this." 
[puppet] - 10https://gerrit.wikimedia.org/r/751737 (https://phabricator.wikimedia.org/T272559) (owner: 10David Caro) [17:21:47] !log andrew@deploy1002 Finished deploy [horizon/deploy@15efe04]: sudo panel update (codfw1dev) (duration: 01m 54s) [17:21:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:21:55] 10SRE, 10ops-eqiad: Rack msw2-eqiad in cab A8 for configuration - https://phabricator.wikimedia.org/T296271 (10Cmjohnson) @ayounsi connected mgmt port em0 to itself ge-0/0/0 [17:21:58] !log andrew@deploy1002 Started deploy [horizon/deploy@15efe04]: sudo panel update [17:21:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:23:03] (03CR) 10Dzahn: "I'd say let joe confirm this since he created the class" [puppet] - 10https://gerrit.wikimedia.org/r/751709 (https://phabricator.wikimedia.org/T272559) (owner: 10David Caro) [17:23:21] (03Merged) 10jenkins-bot: toolhub: Bump container version to 2021-12-23-121200-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/749220 (https://phabricator.wikimedia.org/T271490) (owner: 10BryanDavis) [17:25:25] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudbackup100[34] - https://phabricator.wikimedia.org/T293934 (10Cmjohnson) @Andrew these servers have 12 2TB disks, they failed during the raid setup, the first 2 disks are raid 1 and the remainder is ra...
[17:25:58] !log andrew@deploy1002 Finished deploy [horizon/deploy@15efe04]: sudo panel update (duration: 04m 00s) [17:26:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:26:11] (03CR) 10Ahmon Dancy: [C: 03+1] Define git::daemon class and use it in profile::mediawiki::deployment::server [puppet] - 10https://gerrit.wikimedia.org/r/751481 (https://phabricator.wikimedia.org/T298165) (owner: 10Ahmon Dancy) [17:27:54] (03CR) 10David Caro: "Adding someone from analytics" [puppet] - 10https://gerrit.wikimedia.org/r/751737 (https://phabricator.wikimedia.org/T272559) (owner: 10David Caro) [17:28:44] (03CR) 10Dzahn: [C: 03+2] "self-merging" [puppet] - 10https://gerrit.wikimedia.org/r/749574 (https://phabricator.wikimedia.org/T266509) (owner: 10Dzahn) [17:31:54] PROBLEM - Check systemd state on stat1004 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:31:57] (03CR) 10Herron: [C: 03+1] "LGTM pending hiera defaults outlined in jbond's comment" [puppet] - 10https://gerrit.wikimedia.org/r/748884 (https://phabricator.wikimedia.org/T298038) (owner: 10JHathaway) [17:34:11] 10SRE, 10ops-eqiad: Degraded RAID on dumpsdata1004 - https://phabricator.wikimedia.org/T298582 (10wiki_willy) a:03Cmjohnson [17:35:32] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q2:(Need By: TBD) rack/setup/install mc20[38-55] - https://phabricator.wikimedia.org/T294962 (10wiki_willy) a:03Papaul [17:36:22] RECOVERY - Check systemd state on stat1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:36:39] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Infrastructure-Foundations, 10netops: (Need By: TBD) rack/setup/install new mr1-ulsfo - https://phabricator.wikimedia.org/T294314 (10wiki_willy) a:03RobH [17:36:39] !log bd808@deploy1002 helmfile [staging] START helmfile.d/services/toolhub: apply on main [17:36:41] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:37:08] 10SRE, 10ops-ulsfo: Update PDUs name-server config - https://phabricator.wikimedia.org/T295668 (10wiki_willy) a:03RobH [17:39:12] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudbackup100[34] - https://phabricator.wikimedia.org/T293934 (10Andrew) @Cmjohnson is there not a hw raid controller? [17:42:13] !log btullis@deploy1002 Started deploy [analytics/superset/deploy@09094de]: Deployment for something important [17:42:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:42:49] !log btullis@deploy1002 Started deploy [analytics/superset/deploy@09094de]: Deployment of Superset 1.3.2 to staging [17:42:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:46:00] !log btullis@deploy1002 Finished deploy [analytics/superset/deploy@09094de]: Deployment of Superset 1.3.2 to staging (duration: 03m 11s) [17:46:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:46:09] !log bd808@deploy1002 helmfile [staging] DONE helmfile.d/services/toolhub: sync on main [17:46:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:47:44] !log btullis@deploy1002 Started deploy [analytics/superset/deploy@09094de]: Deployment of Superset 1.3.2 to staging [17:47:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:48:24] !log btullis@deploy1002 Finished deploy [analytics/superset/deploy@09094de]: Deployment of Superset 1.3.2 to staging (duration: 00m 39s) [17:48:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:50:03] !log bd808@deploy1002 helmfile [codfw] START helmfile.d/services/toolhub: apply on main [17:50:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:51:23] !log andrew@deploy1002 Started deploy [horizon/deploy@b300fa6]: minor code 
format update [17:51:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:52:56] (03CR) 10Andrew Bogott: [C: 03+1] lshell: remove unused module [puppet] - 10https://gerrit.wikimedia.org/r/751130 (https://phabricator.wikimedia.org/T272559) (owner: 10David Caro) [17:53:15] !log bd808@deploy1002 helmfile [codfw] DONE helmfile.d/services/toolhub: sync on main [17:53:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:53:23] (03CR) 10Andrew Bogott: [C: 03+1] sonofagridengine: cleanup unused classes [puppet] - 10https://gerrit.wikimedia.org/r/751456 (https://phabricator.wikimedia.org/T272559) (owner: 10David Caro) [17:53:42] (03CR) 10Andrew Bogott: [C: 03+1] r:wmcs:paws:k8s:etcd: remove unused role [puppet] - 10https://gerrit.wikimedia.org/r/751463 (https://phabricator.wikimedia.org/T188912) (owner: 10David Caro) [17:54:04] (03CR) 10Andrew Bogott: [C: 03+1] p:wmcs::nfs::misc/misc_backup/backup_keys: remove unused profiles [puppet] - 10https://gerrit.wikimedia.org/r/751460 (https://phabricator.wikimedia.org/T272559) (owner: 10David Caro) [17:55:33] !log andrew@deploy1002 Finished deploy [horizon/deploy@b300fa6]: minor code format update (duration: 04m 09s) [17:55:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:55:44] (03CR) 10Andrew Bogott: [C: 03+1] r:wmcs:openstack:eqiad1:cumin_controller: remove unused role [puppet] - 10https://gerrit.wikimedia.org/r/751461 (https://phabricator.wikimedia.org/T234462) (owner: 10David Caro) [17:56:55] (03PS1) 10Bartosz Dziewoński: Disable querying the 'wikieditor' change tag temporarily [core] (wmf/1.38.0-wmf.16) - 10https://gerrit.wikimedia.org/r/751544 (https://phabricator.wikimedia.org/T298225) [17:57:12] (03Abandoned) 10Esanders: Enable reply tool by default on wikispecies [mediawiki-config] - 10https://gerrit.wikimedia.org/r/748727 (https://phabricator.wikimedia.org/T297535) (owner: 10Esanders) [17:57:15] (03Abandoned) 10Esanders: Enable 
reply tool by default on metawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/748780 (https://phabricator.wikimedia.org/T297534) (owner: 10Esanders) [17:57:30] !log btullis@deploy1002 Started deploy [analytics/superset/deploy@09094de]: Deployment of Superset 1.3.2 to production [17:57:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:57:44] 10SRE, 10Parsoid-Tests, 10Traffic, 10serviceops, and 2 others: Make testreduce web UI publicly accessible on the internet - https://phabricator.wikimedia.org/T266509 (10Dzahn) @ssastry @Arlolra I self-merged that traffic change and https://parsoid-rt-tests.wikimedia.org/static/style.css is NOT 404 anymor... [17:57:59] !log btullis@deploy1002 Finished deploy [analytics/superset/deploy@09094de]: Deployment of Superset 1.3.2 to production (duration: 00m 29s) [17:58:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:00:52] PROBLEM - Check systemd state on stat1004 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:01:24] 10SRE, 10ops-eqiad: Degraded RAID on dumpsdata1004 - https://phabricator.wikimedia.org/T298582 (10ArielGlenn) I guess the service owner means me, so any time in the next 5 days is fine; if you want to do it later than that, just give a heads up. 
[18:02:02] !log jgiannelos@deploy1002 helmfile [staging] START helmfile.d/services/tegola-vector-tiles: apply on main [18:02:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:02:22] !log jgiannelos@deploy1002 helmfile [staging] START helmfile.d/services/tegola-vector-tiles: apply on main [18:02:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:02:34] 10SRE, 10ops-ulsfo: Update PDUs name-server config - https://phabricator.wikimedia.org/T295668 (10RobH) a:05RobH→03ayounsi So in the past I never set DNS nameservers on the PDU network settings. To that point, 22 has the old nameserver on it, but pdu 23 in ulsfo has the dns server entries blank. Do we ne... [18:03:19] !log jgiannelos@deploy1002 helmfile [staging] DONE helmfile.d/services/tegola-vector-tiles: sync on main [18:03:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:03:32] !log jgiannelos@deploy1002 helmfile [codfw] START helmfile.d/services/tegola-vector-tiles: apply on main [18:03:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:04:14] !log jgiannelos@deploy1002 helmfile [codfw] DONE helmfile.d/services/tegola-vector-tiles: sync on main [18:04:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:04:36] 10SRE, 10ops-ulsfo: Update PDUs name-server config - https://phabricator.wikimedia.org/T295668 (10RobH) [18:05:37] 10SRE, 10Parsoid-Tests, 10serviceops, 10Parsoid (Tracking), 10Patch-For-Review: Move testreduce away from scandium to a separate Buster Ganeti VM - https://phabricator.wikimedia.org/T257906 (10ssastry) [18:06:32] !log jgiannelos@deploy1002 helmfile [eqiad] START helmfile.d/services/tegola-vector-tiles: apply on main [18:06:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:06:40] 10SRE, 10Parsoid-Tests, 10Traffic, 10serviceops, and 2 others: Make testreduce web UI publicly accessible on the internet - 
https://phabricator.wikimedia.org/T266509 (10ssastry) 05Open→03Resolved Perfect! It works now. Thanks! [18:06:59] 10SRE, 10Parsoid-Tests, 10Traffic, 10serviceops, and 2 others: Make testreduce web UI publicly accessible on the internet - https://phabricator.wikimedia.org/T266509 (10Dzahn) Great! thanks for confirming. and sorry for the delay in between [18:07:04] !log jgiannelos@deploy1002 helmfile [eqiad] DONE helmfile.d/services/tegola-vector-tiles: sync on main [18:07:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:07:32] RECOVERY - Check systemd state on stat1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:16:02] !log bd808@deploy1002 helmfile [eqiad] START helmfile.d/services/toolhub: apply on main [18:16:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:16:13] (03CR) 10Dzahn: [V: 04-1] "https://puppet-compiler.wmflabs.org/pcc-worker1001/33137/phab1001.eqiad.wmnet/change.phab1001.eqiad.wmnet.err" [puppet] - 10https://gerrit.wikimedia.org/r/751510 (https://phabricator.wikimedia.org/T114209) (owner: 10Dzahn) [18:18:38] !log bd808@deploy1002 helmfile [eqiad] DONE helmfile.d/services/toolhub: sync on main [18:18:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:22:40] !log Toolhub: ran `poetry run ./manage.py migrate` against m5-master [18:22:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:23:33] (03PS4) 10Dzahn: phabricator: move vcs firewall rules to profile [puppet] - 10https://gerrit.wikimedia.org/r/751510 (https://phabricator.wikimedia.org/T114209) [18:24:42] (03PS5) 10Dzahn: phabricator: move vcs firewall rules to profile [puppet] - 10https://gerrit.wikimedia.org/r/751510 (https://phabricator.wikimedia.org/T114209) [18:27:35] (03CR) 10Dzahn: "just adding you kind of FYI because it is so similar to git-ssh on gitlab discussion in some ways. 
so this is for the history that came be" [puppet] - 10https://gerrit.wikimedia.org/r/751510 (https://phabricator.wikimedia.org/T114209) (owner: 10Dzahn) [18:29:44] PROBLEM - Check systemd state on stat1004 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:30:41] (03CR) 10Dzahn: "also happy if we follow-up with another change that drops the array part, soft shutdown would be to first reduce the number of IPs it list" [puppet] - 10https://gerrit.wikimedia.org/r/751510 (https://phabricator.wikimedia.org/T114209) (owner: 10Dzahn) [18:32:31] (03CR) 10Dzahn: [V: 04-1] "https://puppet-compiler.wmflabs.org/pcc-worker1001/33138/phab1001.eqiad.wmnet/change.phab1001.eqiad.wmnet.err" [puppet] - 10https://gerrit.wikimedia.org/r/751510 (https://phabricator.wikimedia.org/T114209) (owner: 10Dzahn) [18:34:07] (03PS6) 10Clare Ming: Deploy sticky header [mediawiki-config] - 10https://gerrit.wikimedia.org/r/747981 (https://phabricator.wikimedia.org/T295976) [18:34:11] (03CR) 10Dzahn: [V: 04-1 C: 04-1] "well.. that doesn't work . failed to parse template" [puppet] - 10https://gerrit.wikimedia.org/r/751510 (https://phabricator.wikimedia.org/T114209) (owner: 10Dzahn) [18:36:48] (03CR) 10Nray: [C: 03+1] Deploy sticky header [mediawiki-config] - 10https://gerrit.wikimedia.org/r/747981 (https://phabricator.wikimedia.org/T295976) (owner: 10Clare Ming) [18:37:55] (03PS1) 10Razzi: clouddb: depool clouddb1013 [puppet] - 10https://gerrit.wikimedia.org/r/751779 (https://phabricator.wikimedia.org/T298505) [18:37:57] (03CR) 10Jforrester: [C: 03+1] "Neat." 
[docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/747060 (https://phabricator.wikimedia.org/T297619) (owner: 10Hashar) [18:37:59] (03PS3) 10Jforrester: Be strict on undefined variables such as seed_image [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/747060 (https://phabricator.wikimedia.org/T297619) (owner: 10Hashar) [18:38:26] (03CR) 10jerkins-bot: [V: 04-1] Be strict on undefined variables such as seed_image [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/747060 (https://phabricator.wikimedia.org/T297619) (owner: 10Hashar) [18:38:36] RECOVERY - Check systemd state on stat1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:39:53] (03PS7) 10Clare Ming: Deploy sticky header [mediawiki-config] - 10https://gerrit.wikimedia.org/r/747981 (https://phabricator.wikimedia.org/T295976) [18:40:20] (03CR) 10Andrew Bogott: [C: 04-1] "I'm not 100% clear on this but I think what you want is to uncomment the clouddb1017 lines in dbproxy1019.yaml" [puppet] - 10https://gerrit.wikimedia.org/r/751779 (https://phabricator.wikimedia.org/T298505) (owner: 10Razzi) [18:40:27] (03CR) 10Nray: [C: 03+1] Deploy sticky header [mediawiki-config] - 10https://gerrit.wikimedia.org/r/747981 (https://phabricator.wikimedia.org/T295976) (owner: 10Clare Ming) [18:40:35] (03CR) 10jerkins-bot: [V: 04-1] Be strict on undefined variables such as seed_image [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/747060 (https://phabricator.wikimedia.org/T297619) (owner: 10Hashar) [18:40:47] (03CR) 10Andrew Bogott: [C: 03+2] P::quarry: add prometheus multiprocess support [puppet] - 10https://gerrit.wikimedia.org/r/750449 (owner: 10Majavah) [18:40:49] (03PS6) 10Dzahn: phabricator: move vcs firewall rules to profile [puppet] - 10https://gerrit.wikimedia.org/r/751510 (https://phabricator.wikimedia.org/T114209) [18:41:24] (03CR) 10jerkins-bot: [V: 04-1] phabricator: move vcs firewall rules 
to profile [puppet] - 10https://gerrit.wikimedia.org/r/751510 (https://phabricator.wikimedia.org/T114209) (owner: 10Dzahn) [18:42:59] (03CR) 10Paladox: phabricator: move vcs firewall rules to profile (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/751510 (https://phabricator.wikimedia.org/T114209) (owner: 10Dzahn) [18:43:40] (03PS7) 10Dzahn: phabricator: move vcs firewall rules to profile [puppet] - 10https://gerrit.wikimedia.org/r/751510 (https://phabricator.wikimedia.org/T114209) [18:43:51] (03CR) 10Dzahn: phabricator: move vcs firewall rules to profile (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/751510 (https://phabricator.wikimedia.org/T114209) (owner: 10Dzahn) [18:44:17] (03CR) 10jerkins-bot: [V: 04-1] phabricator: move vcs firewall rules to profile [puppet] - 10https://gerrit.wikimedia.org/r/751510 (https://phabricator.wikimedia.org/T114209) (owner: 10Dzahn) [18:44:25] (03CR) 10Dzahn: phabricator: move vcs firewall rules to profile (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/751510 (https://phabricator.wikimedia.org/T114209) (owner: 10Dzahn) [18:45:15] (03PS8) 10Dzahn: phabricator: move vcs firewall rules to profile [puppet] - 10https://gerrit.wikimedia.org/r/751510 (https://phabricator.wikimedia.org/T114209) [18:46:04] (03CR) 10Herron: [V: 03+2 C: 03+2] add initial logstash latency panels [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/737413 (owner: 10Herron) [18:46:59] (03PS9) 10Dzahn: phabricator: move vcs firewall rules to profile [puppet] - 10https://gerrit.wikimedia.org/r/751510 (https://phabricator.wikimedia.org/T114209) [18:51:30] (03PS10) 10Dzahn: phabricator: move vcs firewall rules to profile [puppet] - 10https://gerrit.wikimedia.org/r/751510 (https://phabricator.wikimedia.org/T114209) [18:54:44] (03PS1) 10Herron: show grizzly in dashboard name [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/751782 [18:56:40] (03CR) 10Herron: [V: 03+2 C: 03+2] show grizzly in dashboard name [grafana-grizzly] - 
10https://gerrit.wikimedia.org/r/751782 (owner: 10Herron) [19:00:04] RoanKattouw and Urbanecm: May I have your attention please! UTC evening backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220105T1900) [19:00:04] nn1l2, cjming, and AntiComposite: A patch you scheduled for UTC evening backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [19:00:04] twentyafterfour and hashar: That opportune time is upon us again. Time for a Train log triage with CPT deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220105T1900). [19:00:13] o/ [19:00:13] i can deploy today [19:00:15] hi [19:00:20] o/ [19:00:26] PROBLEM - Check systemd state on stat1004 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:00:39] (03CR) 10Urbanecm: [C: 03+2] Don't use ts-ignore. It is hiding real errors [skins/Vector] (wmf/1.38.0-wmf.13) - 10https://gerrit.wikimedia.org/r/751543 (https://phabricator.wikimedia.org/T297119) (owner: 10Clare Ming) [19:00:44] hey [19:00:49] and hey taavi [19:01:15] hello AntiComposite, may I ask why are we backporting https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Scribunto/+/751540? 
[19:01:33] (note a change of i18n requires full scap, which takes some time) [19:01:37] the patch that changed the namespace name was merged before the branch cut [19:01:39] (03PS1) 10Cwhite: beta-logs: use provided openjdk java 11 [puppet] - 10https://gerrit.wikimedia.org/r/751784 [19:01:41] (03PS1) 10Cwhite: opensearch: clean up unused hiera keys [puppet] - 10https://gerrit.wikimedia.org/r/751785 [19:01:44] (03CR) 10Urbanecm: [C: 03+2] Add www.artsobservasjoner.no to the wgCopyUploadsDomains allowlist of Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/751746 (https://phabricator.wikimedia.org/T298449) (owner: 104nn1l2) [19:01:55] (03CR) 10Dzahn: "I hesitate about this one. https://openstack-browser.toolforge.org/puppetclass/profile::maps::osm_master shows we still have an OSM master" [puppet] - 10https://gerrit.wikimedia.org/r/751703 (https://phabricator.wikimedia.org/T272559) (owner: 10David Caro) [19:01:58] the alias patch was not merged until after [19:02:43] (03Merged) 10jenkins-bot: Add www.artsobservasjoner.no to the wgCopyUploadsDomains allowlist of Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/751746 (https://phabricator.wikimedia.org/T298449) (owner: 104nn1l2) [19:02:43] AntiComposite: so the issue is onwiki links being broken? [19:02:46] yes [19:03:06] got it [19:03:13] (03CR) 10Urbanecm: [C: 03+2] Add it namespace aliases in scn [extensions/Scribunto] (wmf/1.38.0-wmf.16) - 10https://gerrit.wikimedia.org/r/751540 (https://phabricator.wikimedia.org/T297844) (owner: 10AntiCompositeNumber) [19:03:14] let's do it then [19:03:30] nn1l2: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/751746 is now at mwdebug1001 [19:03:31] can you test? 
[19:03:58] give a sec [19:04:22] (03PS1) 10Herron: update one more reference to template draft [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/751787 [19:04:25] (03PS2) 10Cwhite: opensearch: clean up disused hiera keys [puppet] - 10https://gerrit.wikimedia.org/r/751785 [19:04:59] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [19:05:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:05:47] (03CR) 10Dzahn: "Yea, I have been wondering this myself, as the one who created and later shut down the machines. But the message I got from the security t" [puppet] - 10https://gerrit.wikimedia.org/r/751165 (https://phabricator.wikimedia.org/T272559) (owner: 10David Caro) [19:06:01] nn1l2: and https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/751538 conflicts now -- please fix the conflict :)) [19:06:04] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [19:06:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:06:05] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [19:06:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:06:18] 10SRE: Fundraising-Tech engineers unable to ACK icinga alerts on fr-tech host groups - https://phabricator.wikimedia.org/T298649 (10jgleeson) [19:06:27] LGTM https://commons.wikimedia.org/wiki/File:Japansk_sj%C3%B8pungDidemnum_vexillum.jpg [19:06:31] great, syncing [19:06:40] 10SRE, 10Fundraising-Backlog: Fundraising-Tech engineers unable to ACK icinga alerts on fr-tech host groups - https://phabricator.wikimedia.org/T298649 (10jgleeson) [19:06:46] hi cjming, just in case, does the config patch depend on the backport in any way? 
[19:07:07] OK, let's postpone that to tomorrow [19:07:19] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [19:07:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:07:21] sure [19:07:26] not strictly but we need the backport to prevent blowing a gazillion errors [19:07:36] urbanecm: ^^ [19:07:44] okay, so we want "backport first then the config patch" cjming [19:07:47] sounds good to me [19:08:04] nn1l2: but do feel free to do the rebase now, we've time :)) [19:08:16] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: aff4ac32f37d21ac0b70c62adc54756eb1e2d2b0: Add www.artsobservasjoner.no to the wgCopyUploadsDomains allowlist of Commons (T298449) (duration: 01m 08s) [19:08:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:08:20] T298449: Add https://www.artsobservasjoner.no to the wgCopyUploadsDomains allowlist of Wikimedia Commons - https://phabricator.wikimedia.org/T298449 [19:08:24] nn1l2: and first patch live [19:08:33] thanks [19:08:46] np [19:09:02] RECOVERY - Check systemd state on stat1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:11:04] (03CR) 10Dzahn: Make fix-staging-perms also fix /srv/patches permissions (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/747187 (owner: 10Urbanecm) [19:11:18] (03CR) 1020after4: [C: 03+2] "Unblock the train." [core] (wmf/1.38.0-wmf.16) - 10https://gerrit.wikimedia.org/r/751544 (https://phabricator.wikimedia.org/T298225) (owner: 10Bartosz Dziewoński) [19:11:27] (03PS2) 10Razzi: clouddb: depool clouddb1014 [puppet] - 10https://gerrit.wikimedia.org/r/751779 (https://phabricator.wikimedia.org/T298505) [19:11:45] twentyafterfour: hey! 
It's a B&C window happening right now [19:12:22] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [19:12:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:13:28] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [19:13:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:13:29] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [19:13:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:14:39] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [19:14:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:14:59] (03CR) 10Urbanecm: Make fix-staging-perms also fix /srv/patches permissions (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/747187 (owner: 10Urbanecm) [19:15:38] 10SRE, 10Fundraising-Backlog: Fundraising-Tech engineers unable to ACK icinga alerts on fr-tech host groups - https://phabricator.wikimedia.org/T298649 (10Dzahn) @jgleeson I see in the screenshot you are logged in as "Jgleeson". Try (in a new browser session since there is no logout button) to login instead as... [19:17:46] twentyafterfour: can you please confirm you saw my message above? [19:19:51] (03Merged) 10jenkins-bot: Don't use ts-ignore. 
It is hiding real errors [skins/Vector] (wmf/1.38.0-wmf.13) - 10https://gerrit.wikimedia.org/r/751543 (https://phabricator.wikimedia.org/T297119) (owner: 10Clare Ming) [19:19:57] (03Merged) 10jenkins-bot: Add it namespace aliases in scn [extensions/Scribunto] (wmf/1.38.0-wmf.16) - 10https://gerrit.wikimedia.org/r/751540 (https://phabricator.wikimedia.org/T297844) (owner: 10AntiCompositeNumber) [19:20:13] (03CR) 10Andrew Bogott: clouddb: depool clouddb1014 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/751779 (https://phabricator.wikimedia.org/T298505) (owner: 10Razzi) [19:20:30] (03PS5) 104nn1l2: Add data.nhm.ac.uk to the wgCopyUploadsDomains allowlist of Wikimedia Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/751538 (https://phabricator.wikimedia.org/T298451) [19:20:40] (03CR) 10jerkins-bot: [V: 04-1] Add data.nhm.ac.uk to the wgCopyUploadsDomains allowlist of Wikimedia Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/751538 (https://phabricator.wikimedia.org/T298451) (owner: 104nn1l2) [19:21:34] (03CR) 10Urbanecm: [C: 04-2] "preventing conflict during an ongoing deployment, pinged at IRC, but no response" [core] (wmf/1.38.0-wmf.16) - 10https://gerrit.wikimedia.org/r/751544 (https://phabricator.wikimedia.org/T298225) (owner: 10Bartosz Dziewoński) [19:22:00] cjming: your backport is at mwdebug1001 -- can you have a look? [19:22:31] 10SRE, 10ops-eqiad, 10decommission-hardware, 10Patch-For-Review, 10cloud-services-team (Hardware): cloudmetrics1003 seizes up under load - https://phabricator.wikimedia.org/T297814 (10nskaggs) [19:22:39] (03PS3) 10Razzi: clouddb: depool clouddb1014 [puppet] - 10https://gerrit.wikimedia.org/r/751779 (https://phabricator.wikimedia.org/T298505) [19:22:42] urbanecm: yup - we're gtf [19:22:48] gtf? [19:22:48] *gtg [19:22:54] sorry [19:23:06] np [19:23:32] WTF? OMG! TMD TLA. ARG! 
[19:23:32] (03PS4) 10Razzi: clouddb: depool clouddb1014 [puppet] - 10https://gerrit.wikimedia.org/r/751779 (https://phabricator.wikimedia.org/T298505) [19:23:39] lol [19:23:45] AntiComposite: I will need a translator when talking to you :)) [19:23:54] 10SRE, 10ops-eqiad, 10decommission-hardware, 10Patch-For-Review, 10cloud-services-team (Kanban): cloudmetrics1003 seizes up under load - https://phabricator.wikimedia.org/T297814 (10nskaggs) [19:24:06] (03CR) 10Andrew Bogott: [C: 03+1] "lgtm!" [puppet] - 10https://gerrit.wikimedia.org/r/751779 (https://phabricator.wikimedia.org/T298505) (owner: 10Razzi) [19:24:16] urbanecm: not planning to deploy that until train window starts [19:24:22] urbanecm: I think they are complaining about too many three-letter acronyms [19:24:53] twentyafterfour: well, I'm git fetch'ing in that directory when doing backports :-). So if it merged, it'd inevitably be at the deployment host [19:25:18] https://en.wikipedia.org/wiki/Wikipedia:WTF%3F_OMG!_TMD_TLA._ARG! 
[19:25:33] urbanecm: I don't think it would be harmful if that happens [19:25:39] (03CR) 10Razzi: [C: 03+2] clouddb: depool clouddb1014 [puppet] - 10https://gerrit.wikimedia.org/r/751779 (https://phabricator.wikimedia.org/T298505) (owner: 10Razzi) [19:25:56] twentyafterfour: well, it'd mean I'll sync it w/o really wanting to [19:26:09] (I'll be starting a scap sync-world in the next few minutes) [19:26:30] 10SRE, 10Data-Services, 10cloud-services-team (Kanban): labstore1006 spontaneous reboot - https://phabricator.wikimedia.org/T217473 (10nskaggs) [19:26:47] 10SRE, 10DC-Ops, 10cloud-services-team (Hardware): labstore1005 A PCIe link training failure error on boot - https://phabricator.wikimedia.org/T169286 (10nskaggs) 05Stalled→03Resolved [19:27:25] urbanecm: since it's already merged in wmf.13 and wmf.16 is currently only on test wikis then it shouldn't be consequential [19:27:28] !log urbanecm@deploy1002 Synchronized php-1.38.0-wmf.13/skins/Vector/resources/skins.vector.es6/stickyHeader.js: f6424f32611bce8d9e95c369c28e2f787e2cdf75: Dont use ts-ignore. It is hiding real errors (T297119) (duration: 01m 08s) [19:27:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:27:35] T297119: Uncaught TypeError: observer.unobserve is not a function - https://phabricator.wikimedia.org/T297119 [19:28:03] (03CR) 10Herron: [V: 03+2 C: 03+2] update one more reference to template draft [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/751787 (owner: 10Herron) [19:28:21] twentyafterfour: i still prefer not deploying something that appears at the host without me doing anything, if that makes sense :) [19:28:50] sure no problem [19:29:29] AntiComposite: syncing yours. I have no way to trigger i18n update at a single server, so I'm just starting it [19:29:36] actually.. 
[19:29:44] let me do cjming's config first [19:29:48] (03PS8) 10Urbanecm: Deploy sticky header [mediawiki-config] - 10https://gerrit.wikimedia.org/r/747981 (https://phabricator.wikimedia.org/T295976) (owner: 10Clare Ming) [19:29:51] .16 isn't on any scn wiki yet, so I can't test it anyway [19:29:51] (03CR) 10Urbanecm: [C: 03+2] Deploy sticky header [mediawiki-config] - 10https://gerrit.wikimedia.org/r/747981 (https://phabricator.wikimedia.org/T295976) (owner: 10Clare Ming) [19:30:00] right [19:30:35] 10SRE, 10ops-eqiad, 10decommission-hardware, 10Patch-For-Review, 10cloud-services-team (Hardware): cloudmetrics1003 seizes up under load - https://phabricator.wikimedia.org/T297814 (10nskaggs) [19:30:38] PROBLEM - Check systemd state on stat1004 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:30:57] (03Merged) 10jenkins-bot: Deploy sticky header [mediawiki-config] - 10https://gerrit.wikimedia.org/r/747981 (https://phabricator.wikimedia.org/T295976) (owner: 10Clare Ming) [19:31:02] !log reload haproxy on dbproxy1018 for T298505 [19:31:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:31:05] T298505: Recreate views for globaluser table - https://phabricator.wikimedia.org/T298505 [19:31:12] twentyafterfour: fyi removed the -2, as both backports i have are at the deployment host [19:31:42] cjming: your config is at mwdebug1001, can you test please? 
[19:31:48] yessir [19:32:03] woohoo - looks great [19:32:09] that's great -- syncing [19:33:50] (03Merged) 10jenkins-bot: Disable querying the 'wikieditor' change tag temporarily [core] (wmf/1.38.0-wmf.16) - 10https://gerrit.wikimedia.org/r/751544 (https://phabricator.wikimedia.org/T298225) (owner: 10Bartosz Dziewoński) [19:34:24] (03PS1) 10Herron: remove draft tag from grizzly dashboards [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/751791 [19:34:31] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: f2da5befc75b4f93ca4a11393a533b7dc97316ef: Deploy sticky header (T295976) (duration: 01m 42s) [19:34:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:34:34] T295976: Deploy sticky header to pilot wikis and launch A/B test - https://phabricator.wikimedia.org/T295976 [19:34:41] cjming: and live :) [19:34:48] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [19:34:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:34:52] urbanecm: thanks so much! [19:34:55] np [19:35:25] !log urbanecm@deploy1002 Started scap: 485e72bada5243755daab981f5a9ecd35e5b134e: Add it namespace aliases in scn (T297844) [19:35:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:35:28] T297844: Namespace "module" name in Sicilian - https://phabricator.wikimedia.org/T297844 [19:35:54] AntiComposite: syncing now [19:35:58] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [19:35:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:35:59] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [19:36:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:36:14] thanks urbanecm [19:36:30] twentyafterfour: not syncing yours though. 
Will ping you once I'm done :) [19:37:07] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [19:37:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:37:32] (03CR) 1020after4: [C: 03+1] service::deploy::scap: remove unused define [puppet] - 10https://gerrit.wikimedia.org/r/751732 (https://phabricator.wikimedia.org/T272559) (owner: 10David Caro) [19:38:36] (03CR) 10SBassett: Make fix-staging-perms also fix /srv/patches permissions (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/747187 (owner: 10Urbanecm) [19:38:41] (03CR) 10SBassett: [C: 03+1] Make fix-staging-perms also fix /srv/patches permissions [puppet] - 10https://gerrit.wikimedia.org/r/747187 (owner: 10Urbanecm) [19:38:53] (03PS6) 104nn1l2: Add data.nhm.ac.uk to the wgCopyUploadsDomains allowlist of Wikimedia Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/751538 (https://phabricator.wikimedia.org/T298451) [19:39:02] (03CR) 10jerkins-bot: [V: 04-1] Add data.nhm.ac.uk to the wgCopyUploadsDomains allowlist of Wikimedia Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/751538 (https://phabricator.wikimedia.org/T298451) (owner: 104nn1l2) [19:41:12] !log reload haproxy on dbproxy1019 (previously incorrectly reloaded dbproxy1018) for T298505 [19:41:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:41:16] T298505: Recreate views for globaluser table - https://phabricator.wikimedia.org/T298505 [19:41:30] (03PS6) 10Ebernhardson: sre.wdqs: Integrate wcqs with wdqs cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/745629 (https://phabricator.wikimedia.org/T293638) [19:41:32] (03CR) 10Ebernhardson: sre.wdqs: Integrate wcqs with wdqs cookbooks (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/745629 (https://phabricator.wikimedia.org/T293638) (owner: 10Ebernhardson) [19:41:42] RECOVERY - Check systemd state on stat1004 is OK: OK - running: The system is 
fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:42:10] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [19:42:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:47:05] !log urbanecm@deploy1002 Finished scap: 485e72bada5243755daab981f5a9ecd35e5b134e: Add it namespace aliases in scn (T297844) (duration: 11m 40s) [19:47:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:47:08] T297844: Namespace "module" name in Sicilian - https://phabricator.wikimedia.org/T297844 [19:47:10] AntiComposite: and, done [19:47:13] thanks [19:47:16] np [19:47:40] nn1l2: do you plan to finish the rebase? or do we reschedule? [19:47:56] let's reschedule [19:47:59] okay [19:48:02] then we're done [19:48:07] twentyafterfour: over to you [19:48:38] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [19:48:39] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [19:48:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:48:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:50:19] jouncebot: now [19:50:19] For the next 0 hour(s) and 9 minute(s): UTC evening backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220105T1900) [19:50:19] For the next 0 hour(s) and 9 minute(s): Train log triage with CPT (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220105T1900) [19:52:54] PROBLEM - Check systemd state on stat1004 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:54:57] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [19:54:58] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:55:18] (03PS1) 10Clare Ming: Disable sticky header, A/B test for frwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/751793 (https://phabricator.wikimedia.org/T295976) [19:59:15] (03CR) 10Dzahn: [V: 03+1] "https://puppet-compiler.wmflabs.org/pcc-worker1003/33142/deploy1002.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/751481 (https://phabricator.wikimedia.org/T298165) (owner: 10Ahmon Dancy) [19:59:59] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [20:00:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:00:04] twentyafterfour and hashar: Your horoscope predicts another unfortunate MediaWiki train - Utc-7+Utc-0 Version deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220105T2000). [20:04:38] (03PS1) 10JHathaway: hieradata: fix incorrect yaml [puppet] - 10https://gerrit.wikimedia.org/r/751794 [20:05:17] is deployment time over? [20:05:35] as in "can I touch the deployment server", heh [20:05:42] twentyafterfour: ^ [20:06:05] (03CR) 10Ori: [C: 03+1] "Dead code is a liability, so if this is unused it is good to delete it with extreme prejudice. 
Everything is in source control and can alw" [puppet] - 10https://gerrit.wikimedia.org/r/751737 (https://phabricator.wikimedia.org/T272559) (owner: 10David Caro) [20:06:34] (03CR) 10JHathaway: exim: add the ability to silently drop senders (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/748884 (https://phabricator.wikimedia.org/T298038) (owner: 10JHathaway) [20:06:48] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [20:06:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:06:49] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [20:06:50] mutante: I was about to roll the train [20:06:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:07:12] mutante: I can hold off if you need to do something [20:07:41] twentyafterfour: nah nah,, you go ahead as normal [20:09:07] dancy: adding that git-daemon but not during deploy, will do it tomorrow or so, disabling puppet first and applying only codfw etc [20:09:24] Thx [20:09:47] Deployment shouldn't take too long. 
[20:09:56] (i.e, rolling the train forward, that is) [20:11:04] !log twentyafterfour@deploy1002 Synchronized php-1.38.0-wmf.16/includes/changetags/ChangeTags.php: unblock the train, refs T293957 (duration: 01m 09s) [20:11:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:11:07] T293957: 1.38.0-wmf.16 deployment blockers - https://phabricator.wikimedia.org/T293957 [20:11:11] (03CR) 10Herron: [V: 03+2 C: 03+2] remove draft tag from grizzly dashboards [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/751791 (owner: 10Herron) [20:12:02] (03PS1) 1020after4: group0 wikis to 1.38.0-wmf.16 refs T293957 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/751795 [20:12:04] (03CR) 1020after4: [C: 03+2] group0 wikis to 1.38.0-wmf.16 refs T293957 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/751795 (owner: 1020after4) [20:12:43] (03Merged) 10jenkins-bot: group0 wikis to 1.38.0-wmf.16 refs T293957 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/751795 (owner: 1020after4) [20:12:55] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [20:12:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:13:00] RECOVERY - Check systemd state on stat1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:14:23] !log twentyafterfour@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.38.0-wmf.16 refs T293957 [20:14:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:15:25] I'm going to let it sit on group0 for a bit and then roll out to group1 [20:16:45] (03CR) 10Jdlrobson: [C: 04-1] "I've fixed this on wiki: https://fr.wikipedia.org/w/index.php?title=MediaWiki%3AVector.css&type=revision&diff=189594256&oldid=184622499" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/751793 (https://phabricator.wikimedia.org/T295976) (owner: 10Clare 
Ming) [20:16:54] (03Abandoned) 10Clare Ming: Disable sticky header, A/B test for frwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/751793 (https://phabricator.wikimedia.org/T295976) (owner: 10Clare Ming) [20:18:03] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [20:18:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:19:15] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [20:19:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:19:16] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [20:19:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:20:07] ok everything seems to be in order. rolling out wmf.16 to group1 wikis [20:21:27] (03PS1) 1020after4: group1 wikis to 1.38.0-wmf.16 refs T293957 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/751796 [20:21:29] (03CR) 1020after4: [C: 03+2] group1 wikis to 1.38.0-wmf.16 refs T293957 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/751796 (owner: 1020after4) [20:22:07] (03Merged) 10jenkins-bot: group1 wikis to 1.38.0-wmf.16 refs T293957 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/751796 (owner: 1020after4) [20:23:02] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [20:23:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:23:53] !log twentyafterfour@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.38.0-wmf.16 refs T293957 [20:23:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:23:56] T293957: 1.38.0-wmf.16 deployment blockers - https://phabricator.wikimedia.org/T293957 [20:25:01] !log twentyafterfour@deploy1002 Synchronized php: group1 wikis to 1.38.0-wmf.16 refs T293957 (duration: 01m 07s) 
[20:25:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:27:48] 10SRE-Access-Requests: Requesting access to the data engineering team resources for Antoine Qu'hen - https://phabricator.wikimedia.org/T298657 (10Antoine_Quhen) [20:28:04] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [20:28:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:30:18] (03PS1) 10Razzi: Revert "clouddb: depool clouddb1014" [puppet] - 10https://gerrit.wikimedia.org/r/751798 (https://phabricator.wikimedia.org/T298505) [20:30:29] ok everything still looking good on group 1 [20:31:21] mutante: I'm done deploying for today [20:31:23] 10SRE-Access-Requests: Requesting access to the data engineering team resources for Antoine Qu'hen - https://phabricator.wikimedia.org/T298657 (10Antoine_Quhen) [20:31:26] (03CR) 10Razzi: "I'm adding you as a reviewer for posterity but since there's going to be 3 more patches like this and they're pretty simple and low risk (" [puppet] - 10https://gerrit.wikimedia.org/r/751798 (https://phabricator.wikimedia.org/T298505) (owner: 10Razzi) [20:32:00] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [20:32:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:32:01] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [20:32:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:33:08] PROBLEM - Check systemd state on stat1004 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:33:11] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [20:33:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:37:33] (03CR) 
10Razzi: [C: 03+2] Revert "clouddb: depool clouddb1014" [puppet] - 10https://gerrit.wikimedia.org/r/751798 (https://phabricator.wikimedia.org/T298505) (owner: 10Razzi) [20:38:14] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [20:38:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:39:35] !log reload haproxy on dbproxy1019 to repool clouddb1014 for T298505 [20:39:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:39:39] T298505: Recreate views for globaluser table - https://phabricator.wikimedia.org/T298505 [20:42:34] RECOVERY - Check systemd state on stat1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:44:42] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [20:44:43] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [20:44:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:44:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:49:47] (03PS1) 10Razzi: Re-depool dbproxy1019 to update localuser table which I forgot [puppet] - 10https://gerrit.wikimedia.org/r/751800 (https://phabricator.wikimedia.org/T298505) [20:50:28] (03CR) 10jerkins-bot: [V: 04-1] Re-depool dbproxy1019 to update localuser table which I forgot [puppet] - 10https://gerrit.wikimedia.org/r/751800 (https://phabricator.wikimedia.org/T298505) (owner: 10Razzi) [20:50:59] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [20:51:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:55:00] PROBLEM - Check systemd state on stat1004 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:56:10] (03CR) 10Dzahn: [V: 03+1 C: 03+2] Define git::daemon class and use it in profile::mediawiki::deployment::server [puppet] - 10https://gerrit.wikimedia.org/r/751481 (https://phabricator.wikimedia.org/T298165) (owner: 10Ahmon Dancy) [20:59:27] !Log deploy1002 - puppet disabled, deploy2002 deploying change to add git daemon for T298165 , firewall issue with ferm [20:59:28] T298165: mediawiki-multiversion image builder should also poll private and security patches git repositories - https://phabricator.wikimedia.org/T298165 [20:59:51] (03PS1) 10Majavah: P:mediawiki::deployment: fix ferm syntax [puppet] - 10https://gerrit.wikimedia.org/r/751805 [20:59:54] mutante: ^ [21:00:04] twentyafterfour and hashar: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for MediaWiki train - Utc-7+Utc-0 Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220105T2000). [21:00:04] chrisalbon and accraze: That opportune time is upon us again. Time for a Services – Graphoid / ORES deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220105T2100). 
[21:00:10] (03CR) 10Majavah: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/751805 (owner: 10Majavah) [21:00:34] (03CR) 10Dzahn: [C: 03+2] P:mediawiki::deployment: fix ferm syntax [puppet] - 10https://gerrit.wikimedia.org/r/751805 (owner: 10Majavah) [21:01:08] (03PS2) 10Dzahn: P:mediawiki::deployment: fix ferm syntax [puppet] - 10https://gerrit.wikimedia.org/r/751805 (https://phabricator.wikimedia.org/T298165) (owner: 10Majavah) [21:02:05] (03CR) 10Dzahn: "needed https://gerrit.wikimedia.org/r/c/operations/puppet/+/751805" [puppet] - 10https://gerrit.wikimedia.org/r/751481 (https://phabricator.wikimedia.org/T298165) (owner: 10Ahmon Dancy) [21:02:27] (03CR) 10Dzahn: [V: 03+2 C: 03+2] P:mediawiki::deployment: fix ferm syntax [puppet] - 10https://gerrit.wikimedia.org/r/751805 (https://phabricator.wikimedia.org/T298165) (owner: 10Majavah) [21:03:10] PROBLEM - Check systemd state on deploy2002 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:03:14] twentyafterfour: thanks! ack [21:03:17] taavi: thanks! ack :) [21:03:21] that was quick. 
on it [21:03:26] pushing the fix through [21:03:38] and active server not affected [21:03:54] because my gut told me to only apply in codfw first [21:05:02] (03CR) 10Dzahn: "thanks for this follow-up to https://gerrit.wikimedia.org/r/c/operations/puppet/+/751481" [puppet] - 10https://gerrit.wikimedia.org/r/751805 (https://phabricator.wikimedia.org/T298165) (owner: 10Majavah) [21:05:17] (03CR) 10Dzahn: "puppet and ferm working on deploy2002 now" [puppet] - 10https://gerrit.wikimedia.org/r/751805 (https://phabricator.wikimedia.org/T298165) (owner: 10Majavah) [21:05:26] RECOVERY - Check systemd state on deploy2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:06:32] (03CR) 10Dzahn: "root@deploy2002:/# iptables -L | grep git" [puppet] - 10https://gerrit.wikimedia.org/r/751481 (https://phabricator.wikimedia.org/T298165) (owner: 10Ahmon Dancy) [21:09:16] PROBLEM - SSH on mw2252.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:12:50] RECOVERY - Check systemd state on stat1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:13:06] PROBLEM - SSH on contint1001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:15:22] dancy: DONE https://phabricator.wikimedia.org/T298165#7600041 [21:26:16] PROBLEM - Check systemd state on stat1004 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:28:00] mutante: Thank you! I'll test it. [21:28:46] (03CR) 10Ahmon Dancy: "Thanks Majavah!" 
[puppet] - 10https://gerrit.wikimedia.org/r/751805 (https://phabricator.wikimedia.org/T298165) (owner: 10Majavah) [21:34:15] (03CR) 10Hashar: "That was used by Daniel Zahn when migrating Gerrit to a new server. It is really a one off usage so I guess we can indeed remove it ;)" [puppet] - 10https://gerrit.wikimedia.org/r/751696 (https://phabricator.wikimedia.org/T272559) (owner: 10David Caro) [21:37:34] Is there a datacenterless way to refer to deployment.{eqiad,codfw}.wmnet ? [21:38:20] (03CR) 10Dzahn: "given https://phabricator.wikimedia.org/T243027 I disagree that we will never migrate Gerrit again" [puppet] - 10https://gerrit.wikimedia.org/r/751696 (https://phabricator.wikimedia.org/T272559) (owner: 10David Caro) [21:39:10] (03CR) 10MarcoAurelio: "This change is ready for review." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/751545 (https://phabricator.wikimedia.org/T131924) (owner: 10MarcoAurelio) [21:40:00] dancy: At one point the eqiad alias would target codfw when that dc was primary, but I don't know if that dirty magic still exists or not. [21:40:12] PROBLEM - Host ping3002 is DOWN: PING CRITICAL - Packet loss = 100% [21:40:18] RECOVERY - Host ping3002 is UP: PING OK - Packet loss = 0%, RTA = 81.22 ms [21:40:52] meaning that `ssh deployment.eqiad.wmnet` would actually land you on the codfw server when that was appropriate [21:41:06] gotcha. That is pretty filthy [21:42:36] and also very convenient, so I'm sticking with `deployment.eqiad.wmnet` for the time being. 
[21:43:26] jouncebot now [21:43:26] For the next 0 hour(s) and 16 minute(s): MediaWiki train - Utc-7+Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220105T2000) [21:43:26] For the next 0 hour(s) and 16 minute(s): Services – Graphoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220105T2100) [21:44:12] RECOVERY - Check systemd state on stat1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:44:56] dancy: both service names point to deploy1002.eqiad.wmnet right now, so ... maybe that's still intended [21:45:06] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [21:45:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:45:13] oh perfect. Thanks for the info bd808! [21:47:24] dancy: but also.. neither has changed since https://gerrit.wikimedia.org/r/c/operations/dns/+/635113 10 months ago, so when codfw was the primary DC it looks like they both still pointed to eqiad [21:49:08] deployment is special (tm) [21:49:16] for releases you have releases.discovery.wmnet [21:49:29] for mwmaint you have mwmaint.discovery.wmnet [21:49:34] but you don't have that for deploy [21:49:42] :-( [21:49:53] that is because those hosts are hosting websites behind ATS [21:50:00] and for that we created the discovery names [21:50:20] and what bd808 said about "hack" during dc-switchover [21:50:46] depends how people doing the switch-over agree to do it with deployers etc [21:50:52] PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [21:51:22] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [21:51:23] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [21:51:23] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:51:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:51:26] we could add deploy.discovery though. it's just one more thing that needs to be updated in DNS [21:51:40] question is what is more confusion [21:51:44] bd808: we didn't switch over the deployment server last time [21:51:45] I guess [21:52:12] afaik deployment.eqiad is supposed to always point to the active deployment server, even in codfw [21:53:06] Alright. Thanks everyone. [21:53:14] basically for everything that also hosts websites behind caching layer.. you can use the discovery name you see being used in ./hieradata/common/profile/trafficserver/backend.yaml [21:55:26] PROBLEM - Check systemd state on stat1004 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:57:43] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [21:57:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:02:48] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [22:02:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:03:44] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [22:03:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:03:45] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [22:03:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:06:36] PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [22:07:00] PROBLEM - Host ping3002 is DOWN: PING CRITICAL - Packet loss = 
100% [22:07:25] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [22:07:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:09:14] RECOVERY - Host ping3002 is UP: PING OK - Packet loss = 0%, RTA = 81.13 ms [22:10:26] RECOVERY - SSH on mw2252.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:10:44] 10SRE, 10Gerrit, 10serviceops: replacement for gerrit2001 - https://phabricator.wikimedia.org/T243027 (10Dzahn) [22:14:54] 10SRE, 10Gerrit, 10serviceops: replacement for gerrit2001 - https://phabricator.wikimedia.org/T243027 (10Dzahn) @LSobanski @akosiaris This one is for consideration in a manager meeting with releng. [22:15:38] RECOVERY - Check systemd state on stat1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:18:23] (03CR) 10Dzahn: [C: 03+2] Make fix-staging-perms also fix /srv/patches permissions [puppet] - 10https://gerrit.wikimedia.org/r/747187 (owner: 10Urbanecm) [22:22:06] thanks mutante [22:22:41] (03CR) 10Dzahn: [C: 03+1] gitlab_runner: use config template for registering new runners [puppet] - 10https://gerrit.wikimedia.org/r/747539 (https://phabricator.wikimedia.org/T295481) (owner: 10Jelto) [22:24:03] (03CR) 10Dzahn: [C: 03+1] P:prometheus::ops: add prometheus job and ferm rules for gitlab_runner metrics [puppet] - 10https://gerrit.wikimedia.org/r/751452 (https://phabricator.wikimedia.org/T295481) (owner: 10Jelto) [22:24:36] PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [22:26:48] PROBLEM - Check systemd state on stat1004 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:29:06] PROBLEM - BGP 
status on cr2-esams is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [22:34:45] (03PS1) 10Ahmon Dancy: WIP: Refactor contint::zuul::git_daemon to use new git::daemon class [puppet] - 10https://gerrit.wikimedia.org/r/751816 [22:34:46] (03CR) 10jerkins-bot: [V: 04-1] WIP: Refactor contint::zuul::git_daemon to use new git::daemon class [puppet] - 10https://gerrit.wikimedia.org/r/751816 (owner: 10Ahmon Dancy) [22:36:13] (03PS2) 10Ahmon Dancy: WIP: Refactor contint::zuul::git_daemon to use new git::daemon class [puppet] - 10https://gerrit.wikimedia.org/r/751816 [22:36:47] (03CR) 10jerkins-bot: [V: 04-1] WIP: Refactor contint::zuul::git_daemon to use new git::daemon class [puppet] - 10https://gerrit.wikimedia.org/r/751816 (owner: 10Ahmon Dancy) [22:36:54] PROBLEM - Host ping3002 is DOWN: PING CRITICAL - Packet loss = 100% [22:37:30] RECOVERY - Host ping3002 is UP: PING OK - Packet loss = 0%, RTA = 81.16 ms [22:38:29] (03PS3) 10Ahmon Dancy: WIP: Refactor contint::zuul::git_daemon to use new git::daemon class [puppet] - 10https://gerrit.wikimedia.org/r/751816 [22:39:05] (03CR) 10jerkins-bot: [V: 04-1] WIP: Refactor contint::zuul::git_daemon to use new git::daemon class [puppet] - 10https://gerrit.wikimedia.org/r/751816 (owner: 10Ahmon Dancy) [22:43:43] (03PS4) 10Ahmon Dancy: WIP: Refactor contint::zuul::git_daemon to use new git::daemon class [puppet] - 10https://gerrit.wikimedia.org/r/751816 [22:45:01] (03CR) 10Kaganer: [C: 03+1] "Looks good to me" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/751545 (https://phabricator.wikimedia.org/T131924) (owner: 10MarcoAurelio) [22:45:30] (03CR) 10jerkins-bot: [V: 04-1] WIP: Refactor contint::zuul::git_daemon to use new git::daemon class [puppet] - 10https://gerrit.wikimedia.org/r/751816 (owner: 10Ahmon Dancy) [22:47:04] RECOVERY - Check systemd state on stat1004 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:49:24] PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [22:53:56] PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [22:56:27] (03PS5) 10Ahmon Dancy: WIP: Refactor contint::zuul::git_daemon to use new git::daemon class [puppet] - 10https://gerrit.wikimedia.org/r/751816 [22:58:20] PROBLEM - Check systemd state on stat1004 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:05:38] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_citoid_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:06:56] (03PS6) 10Ahmon Dancy: WIP: Refactor contint::zuul::git_daemon to use new git::daemon class [puppet] - 10https://gerrit.wikimedia.org/r/751816 [23:07:54] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:09:52] PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [23:15:09] (03PS2) 10Razzi: Re-depool dbproxy1019 to update localuser table which I forgot [puppet] - 10https://gerrit.wikimedia.org/r/751800 (https://phabricator.wikimedia.org/T298505) [23:15:28] RECOVERY - SSH on contint1001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:16:11] (03CR) 10Razzi: [C: 03+2] Re-depool dbproxy1019 to update localuser table which I forgot [puppet] - 10https://gerrit.wikimedia.org/r/751800 (https://phabricator.wikimedia.org/T298505) (owner: 10Razzi) [23:16:22] (03PS7) 10Ahmon Dancy: Refactor git-daemon use in profile::zuul::merger [puppet] - 10https://gerrit.wikimedia.org/r/751816 [23:16:32] PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [23:17:02] PROBLEM - Host dns3002 is DOWN: PING CRITICAL - Packet loss = 100% [23:17:43] Urbanecm: You will be available on the "UTC late backport window"? [23:18:02] Likely no. [23:18:38] RECOVERY - Check systemd state on stat1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:18:39] Someone to replace him? [23:18:46] RECOVERY - Host dns3002 is UP: PING OK - Packet loss = 0%, RTA = 81.10 ms [23:19:36] joucebot nowandnext [23:19:43] jouncebot nowandnext [23:19:43] No deployments scheduled for the next 0 hour(s) and 40 minute(s) [23:19:43] In 0 hour(s) and 40 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220106T0000) [23:20:10] hmm. I will be out by then. 
[23:20:34] I will now put the change to deployment [23:21:18] RoanKattouw: Available? [23:21:30] PROBLEM - Host lvs3007 is DOWN: PING CRITICAL - Packet loss = 100% [23:21:44] RECOVERY - Host lvs3007 is UP: PING OK - Packet loss = 0%, RTA = 82.19 ms [23:21:48] (03CR) 10Ahmon Dancy: [C: 03+1] "pcc results: https://puppet-compiler.wmflabs.org/pcc-worker1001/33147/" [puppet] - 10https://gerrit.wikimedia.org/r/751816 (owner: 10Ahmon Dancy) [23:24:07] 10SRE-swift-storage, 10MW-on-K8s, 10Shellbox, 10serviceops: Support large files in Shellbox - https://phabricator.wikimedia.org/T292322 (10tstarling) I was not able to reproduce that error. ` [2240][tstarling@mwmaint1002:/home/oblivian]$ mwscript eval.php --wiki=enwiki --ignore-errors > $command = MediaWi... [23:26:09] !log run sudo maintain-views --databases centralauth --debug --replace-all on clouddb1014 for T298505 [23:26:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:26:13] T298505: Recreate views for globaluser table - https://phabricator.wikimedia.org/T298505 [23:29:50] Seriously no one will be available in the "UTC late backport window"? [23:29:52] PROBLEM - Check systemd state on stat1004 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:30:12] Juan_90264: please be patient [23:30:35] Okay [23:30:56] Juan_90264: you haven't put a change on calendar [23:33:09] RhinosF1: Published edition [23:35:56] Reedy, twentyafterfour: any of you willing to? 
[23:39:18] PROBLEM - Host ping3002 is DOWN: PING CRITICAL - Packet loss = 100% [23:41:12] RECOVERY - Host ping3002 is UP: PING OK - Packet loss = 0%, RTA = 81.15 ms [23:42:06] (03PS1) 10Razzi: clouddb: repool clouddb1014.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/751823 (https://phabricator.wikimedia.org/T298505) [23:43:12] (03CR) 10Razzi: [C: 03+2] clouddb: repool clouddb1014.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/751823 (https://phabricator.wikimedia.org/T298505) (owner: 10Razzi) [23:46:24] (03PS1) 10Razzi: clouddb: depool clouddb1018 to update views [puppet] - 10https://gerrit.wikimedia.org/r/751824 (https://phabricator.wikimedia.org/T298505) [23:49:22] RECOVERY - Check systemd state on stat1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:50:34] !log sudo systemctl reload haproxy on dbproxy1019 to repool clouddb1014 for T298505 [23:50:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:50:38] T298505: Recreate views for globaluser table - https://phabricator.wikimedia.org/T298505 [23:54:00] jouncebot: now [23:54:00] No deployments scheduled for the next 0 hour(s) and 5 minute(s) [23:54:10] jouncebot: next [23:54:11] In 0 hour(s) and 5 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220106T0000) [23:57:24] (03CR) 10Cwhite: [C: 03+2] beta-logs: use provided openjdk java 11 [puppet] - 10https://gerrit.wikimedia.org/r/751784 (owner: 10Cwhite)