[00:00:05] twentyafterfour: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Phabricator update . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211007T0000). [00:03:04] (03CR) 10Dzahn: "rm /srv/org/wikimedia/reprepro/conf/distributions on releases2002 but not releases1002" [puppet] - 10https://gerrit.wikimedia.org/r/725670 (owner: 10Muehlenhoff) [00:04:36] PROBLEM - Check systemd state on grafana2001 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-var-lib-grafana.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:10:50] RECOVERY - Check systemd state on grafana2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:11:16] !log [grafana2001:~] $ sudo systemctl start rsync-var-lib-grafana because of "PROBLEM - Check systemd state on grafana2001 is CRITICAL: CRITICAL - degraded" because of some race condition where a file vanished during sync [00:11:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:11:50] afk [01:15:05] (03PS25) 10Juan90264: Adding and use wordmark in trwikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704170 (https://phabricator.wikimedia.org/T286133) [01:16:07] (03CR) 10jerkins-bot: [V: 04-1] Adding and use wordmark in trwikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704170 (https://phabricator.wikimedia.org/T286133) (owner: 10Juan90264) [01:16:37] 10SRE, 10observability: Develop tooling for quickly parsing 5xx and sampled-1000 logs - https://phabricator.wikimedia.org/T292682 (10Legoktm) [01:18:48] 704170: Adding and use wordmark in trwikiquote | https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/704170 [01:18:51] I don't understand why jenkinsbot is downvoting without having bugs in the code to be added. I've already done a clear code review and found no errors. 
Does anyone know what might be making the bot vote negative? [01:20:45] (03CR) 10Juan90264: Adding and use wordmark in trwikiquote (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704170 (https://phabricator.wikimedia.org/T286133) (owner: 10Juan90264) [01:22:18] (03CR) 10Legoktm: "This is fantastic, thank you for working on it!" [puppet] - 10https://gerrit.wikimedia.org/r/726857 (owner: 10Jcrespo) [01:23:55] Legoktm: Are you online here on IRC? [01:24:56] Juan_90264: yes [01:25:30] Legoktm: Great, could you help me with the problem I mentioned above? [01:25:42] if you look at the comment from jenkins-bot it says: [01:25:45] (03PS1) 10Tim Starling: Factor out $_SERVER['SERVERNAME'] references for performance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/727002 [01:25:47] (03PS1) 10Tim Starling: Enable Parsoid API everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/727003 [01:25:48] > https://integration.wikimedia.org/ci/job/operations-mw-config-php72-composer-test-docker/13882/console : FAILURE in 21s [01:26:02] scrolling down https://integration.wikimedia.org/ci/job/operations-mw-config-php72-composer-test-docker/13882/console you'll see [01:26:08] 18:15:38 1) StaticSettingsTest::testLogos [01:26:08] 18:15:38 trwikiquote has non-existent wordmark logo in wmgSiteLogoWordmark [01:26:08] 18:15:38 Failed asserting that file "/src/tests/multiversion/../../static/images/mobile/copyright/wikiquote-wordmark-tr.svg" exists. [01:27:21] (03PS2) 10Tim Starling: Enable Parsoid API everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/727003 [01:27:46] Legoktm: WOW, error that went unnoticed in my review. 
Thank you and I will see how to fix this error [01:28:58] ok, I think I figured it out [01:29:24] somehow you added a directory named static [01:29:25] let me fix it [01:29:54] (03PS26) 10Legoktm: Adding and use wordmark in trwikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704170 (https://phabricator.wikimedia.org/T286133) (owner: 10Juan90264) [01:31:23] I honestly had no idea you could do that [01:33:11] This one I would never be able to notice, again thank you [01:33:27] this is why we have jenkins :) [01:34:38] (03PS1) 10Jforrester: PageSlotDiffRendererTest::testGetDiff: Skip as new wikidiff2 breaks this test [extensions/ProofreadPage] (wmf/1.38.0-wmf.3) - 10https://gerrit.wikimedia.org/r/726959 (https://phabricator.wikimedia.org/T292676) [01:34:51] (03PS1) 10Jforrester: PageSlotDiffRendererTest::testGetDiff: Skip as new wikidiff2 breaks this test [extensions/ProofreadPage] (wmf/1.38.0-wmf.2) - 10https://gerrit.wikimedia.org/r/726960 (https://phabricator.wikimedia.org/T292676) [01:35:02] (03PS1) 10Jforrester: api-testing: Adjust DiffCompare expected outcome to cope with new wikidiff2 output [core] (wmf/1.38.0-wmf.3) - 10https://gerrit.wikimedia.org/r/726961 (https://phabricator.wikimedia.org/T292676) [01:35:19] (03PS1) 10Jforrester: api-testing: Adjust DiffCompare expected outcome to cope with new wikidiff2 output [core] (wmf/1.38.0-wmf.2) - 10https://gerrit.wikimedia.org/r/726962 (https://phabricator.wikimedia.org/T292676) [01:35:35] Jenkins is incredibly helpful, spotting mistakes that a human can fix [01:36:38] (03CR) 10Subramanya Sastry: [C: 03+1] Factor out $_SERVER['SERVERNAME'] references for performance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/727002 (owner: 10Tim Starling) [01:37:22] (03CR) 10jerkins-bot: [V: 04-1] PageSlotDiffRendererTest::testGetDiff: Skip as new wikidiff2 breaks this test [extensions/ProofreadPage] (wmf/1.38.0-wmf.2) - 10https://gerrit.wikimedia.org/r/726960 (https://phabricator.wikimedia.org/T292676) 
(owner: 10Jforrester) [01:38:21] (03CR) 10Subramanya Sastry: [C: 03+1] Enable Parsoid API everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/727003 (owner: 10Tim Starling) [01:45:58] (03CR) 10jerkins-bot: [V: 04-1] api-testing: Adjust DiffCompare expected outcome to cope with new wikidiff2 output [core] (wmf/1.38.0-wmf.2) - 10https://gerrit.wikimedia.org/r/726962 (https://phabricator.wikimedia.org/T292676) (owner: 10Jforrester) [02:23:50] (03CR) 10Tim Starling: [C: 03+2] Factor out $_SERVER['SERVERNAME'] references for performance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/727002 (owner: 10Tim Starling) [02:23:58] (03CR) 10Tim Starling: [C: 03+2] Enable Parsoid API everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/727003 (owner: 10Tim Starling) [02:24:40] (03Merged) 10jenkins-bot: Factor out $_SERVER['SERVERNAME'] references for performance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/727002 (owner: 10Tim Starling) [02:24:44] (03Merged) 10jenkins-bot: Enable Parsoid API everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/727003 (owner: 10Tim Starling) [02:27:35] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [02:27:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:28:23] !log tstarling@deploy1002 Synchronized wmf-config/CommonSettings.php: enable Parsoid API everywhere (duration: 01m 04s) [02:28:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:30:07] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[02:30:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:35:03] (03PS1) 10Jforrester: Allow composer-plugin-api ^2.0 for migration [vendor] (wmf/1.38.0-wmf.2) - 10https://gerrit.wikimedia.org/r/726963 (https://phabricator.wikimedia.org/T266421) [02:36:22] (03PS2) 10Jforrester: PageSlotDiffRendererTest::testGetDiff: Skip as new wikidiff2 breaks this test [extensions/ProofreadPage] (wmf/1.38.0-wmf.2) - 10https://gerrit.wikimedia.org/r/726960 (https://phabricator.wikimedia.org/T292676) [02:36:35] (03CR) 10Jforrester: "recheck" [core] (wmf/1.38.0-wmf.2) - 10https://gerrit.wikimedia.org/r/726962 (https://phabricator.wikimedia.org/T292676) (owner: 10Jforrester) [02:51:56] (03CR) 10jerkins-bot: [V: 04-1] Allow composer-plugin-api ^2.0 for migration [vendor] (wmf/1.38.0-wmf.2) - 10https://gerrit.wikimedia.org/r/726963 (https://phabricator.wikimedia.org/T266421) (owner: 10Jforrester) [02:53:49] (03CR) 10jerkins-bot: [V: 04-1] PageSlotDiffRendererTest::testGetDiff: Skip as new wikidiff2 breaks this test [extensions/ProofreadPage] (wmf/1.38.0-wmf.2) - 10https://gerrit.wikimedia.org/r/726960 (https://phabricator.wikimedia.org/T292676) (owner: 10Jforrester) [02:57:02] (03PS1) 10Jforrester: build: Suppress phan failure [extensions/ProofreadPage] (wmf/1.38.0-wmf.2) - 10https://gerrit.wikimedia.org/r/726964 [02:57:11] (03PS2) 10Jforrester: build: Suppress phan failure [extensions/ProofreadPage] (wmf/1.38.0-wmf.2) - 10https://gerrit.wikimedia.org/r/726964 [03:02:05] (03CR) 10Gergő Tisza: [C: 03+1] growthexperiments: Run updateMenteeData.php in parallel [puppet] - 10https://gerrit.wikimedia.org/r/725264 (https://phabricator.wikimedia.org/T290609) (owner: 10Urbanecm) [03:02:31] (03CR) 10Gergő Tisza: [C: 03+1] growthexperiments: Remove absented systemd job [puppet] - 10https://gerrit.wikimedia.org/r/725286 (https://phabricator.wikimedia.org/T290609) (owner: 10Urbanecm) [03:44:04] (03CR) 10Juan90264: [C: 03+1] Change Javanese Wiktionary logo 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/708065 (https://phabricator.wikimedia.org/T287425) (owner: 10Labdajiwa) [03:46:43] This patch should have been implemented using the "Deployment" page, but I think the developer forgot or doesn't know the page [04:13:53] (03CR) 10Cwhite: o11y: port alertmanager alerts (032 comments) [alerts] - 10https://gerrit.wikimedia.org/r/724761 (https://phabricator.wikimedia.org/T288726) (owner: 10Filippo Giunchedi) [04:14:24] (03CR) 10Cwhite: [C: 03+1] o11y: tune IcingaOverload alert [alerts] - 10https://gerrit.wikimedia.org/r/726909 (owner: 10Filippo Giunchedi) [04:14:49] (03CR) 10Cwhite: [C: 03+1] icinga: remove alertmanager::alerts [puppet] - 10https://gerrit.wikimedia.org/r/724771 (https://phabricator.wikimedia.org/T288726) (owner: 10Filippo Giunchedi) [04:18:21] 10SRE, 10MediaWiki-General, 10Platform Engineering Code Jam, 10Platform Engineering Roadmap Decision Making, 10Performance-Team (Radar): Allow easier ICU transitions in MediaWiki (change how sortkey collation is managed in the categorylinks table) - https://phabricator.wikimedia.org/T263437 (10Ladsgroup)... [04:33:36] PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin2002 is CRITICAL: CRITICAL: the following (5) node(s) change every puppet run: cumin2001, cumin1001, cumin2002, ms-be1059, clouddb1020 https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes [04:38:38] 10SRE, 10MediaWiki-General, 10Platform Engineering Code Jam, 10Platform Engineering Roadmap Decision Making, 10Performance-Team (Radar): Allow easier ICU transitions in MediaWiki (change how sortkey collation is managed in the categorylinks table) - https://phabricator.wikimedia.org/T263437 (10Pchelolo)... 
[05:23:13] (IcingaOverload) firing: Checks are taking long to execute - https://grafana.wikimedia.org/d/rsCfQfuZz/icinga - https://alerts.wikimedia.org [05:38:13] (IcingaOverload) resolved: Checks are taking long to execute - https://grafana.wikimedia.org/d/rsCfQfuZz/icinga - https://alerts.wikimedia.org [05:48:42] !log [Elastic] Performing rolling restarts of `relforge`. `relforge1003` is the master so I'll restart `relforge1004` first to minimize disruption [05:48:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:53:34] !log [Elastic] `ryankemper@relforge1004:~$ sudo systemctl restart elasticsearch_6@relforge-eqiad-small-alpha.service && sudo systemctl restart elasticsearch_6@relforge-eqiad.service` [05:53:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:00:51] (03CR) 10Alexandros Kosiaris: [C: 03+1] Enable NamespaceDefaultLabelName for main clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/726846 (https://phabricator.wikimedia.org/T290476) (owner: 10JMeybohm) [06:05:29] !log [Elastic] Cluster in green status, proceeding to next and final node => `ryankemper@relforge1003:~$ sudo systemctl restart elasticsearch_6@relforge-eqiad-small-alpha.service && sudo systemctl restart elasticsearch_6@relforge-eqiad.service` [06:05:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:13:01] (03CR) 10Ladsgroup: "ping 😄" [puppet] - 10https://gerrit.wikimedia.org/r/725435 (https://phabricator.wikimedia.org/T282303) (owner: 10Ladsgroup) [06:21:26] !log [Elastic] Restart of `relforge` complete [06:21:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:37:41] (03PS2) 10Giuseppe Lavagetto: Add rsyslog image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/725005 (https://phabricator.wikimedia.org/T288851) [06:39:11] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mailman3: Drop profile::mailman3 [puppet] - 
10https://gerrit.wikimedia.org/r/725435 (https://phabricator.wikimedia.org/T282303) (owner: 10Ladsgroup) [06:44:28] PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin1001 is CRITICAL: CRITICAL: the following (5) node(s) change every puppet run: ms-be1059, cumin1001, cumin2002, clouddb1020, cumin2001 https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes [06:50:14] 10SRE, 10MediaWiki-extensions-CentralNotice, 10MediaWiki-extensions-Translate, 10Wikimedia-Fundraising, and 8 others: DBPerformance warning "Query returned XXXX rows: query: SELECT * FROM `translate_metadata`" - https://phabricator.wikimedia.org/T204026 (10Nikerabbit) [06:50:24] https://phabricator.wikimedia.org/T292687 [06:50:52] T292687: Create project tag for me User-Juan90264 | https://phabricator.wikimedia.org/T292687 [06:51:33] 10SRE, 10MediaWiki-extensions-CentralNotice, 10MediaWiki-extensions-Translate, 10Wikimedia-Fundraising, and 8 others: DBPerformance warning "Query returned XXXX rows: query: SELECT * FROM `translate_metadata`" - https://phabricator.wikimedia.org/T204026 (10Nikerabbit) It looks like the number-of-rows-retur... 
[06:56:13] (03CR) 10Muehlenhoff: "Looks good, one comment inline" [puppet] - 10https://gerrit.wikimedia.org/r/725779 (owner: 10Alexandros Kosiaris) [07:01:13] (03CR) 10Filippo Giunchedi: [C: 03+2] o11y: tune IcingaOverload alert [alerts] - 10https://gerrit.wikimedia.org/r/726909 (owner: 10Filippo Giunchedi) [07:01:27] (03PS1) 10ZPapierski: [WIP] Add kafka position transfer to wdqs cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/727021 (https://phabricator.wikimedia.org/T276469) [07:03:55] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Add kafka position transfer to wdqs cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/727021 (https://phabricator.wikimedia.org/T276469) (owner: 10ZPapierski) [07:07:14] (03PS2) 10ZPapierski: [WIP] Add kafka position transfer to wdqs cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/727021 (https://phabricator.wikimedia.org/T276469) [07:09:43] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Add kafka position transfer to wdqs cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/727021 (https://phabricator.wikimedia.org/T276469) (owner: 10ZPapierski) [07:10:40] PROBLEM - Check systemd state on sodium is CRITICAL: CRITICAL - degraded: The following units failed: update-tails-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:15:31] (03CR) 10Filippo Giunchedi: "Thank you for the reviews!" [alerts] - 10https://gerrit.wikimedia.org/r/724761 (https://phabricator.wikimedia.org/T288726) (owner: 10Filippo Giunchedi) [07:21:44] (03CR) 10Elukey: kubernetes: do not repeat user tokens. 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/726915 (owner: 10Giuseppe Lavagetto) [07:22:57] (03CR) 10JMeybohm: [C: 03+2] Enable NamespaceDefaultLabelName for main clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/726846 (https://phabricator.wikimedia.org/T290476) (owner: 10JMeybohm) [07:26:54] (03PS3) 10Jgiannelos: tegola-vector-tiles: Use envoy for cronjob pods [deployment-charts] - 10https://gerrit.wikimedia.org/r/726891 (https://phabricator.wikimedia.org/T283159) [07:27:05] (03Merged) 10jenkins-bot: Enable NamespaceDefaultLabelName for main clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/726846 (https://phabricator.wikimedia.org/T290476) (owner: 10JMeybohm) [07:29:36] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] dynamicproxy: replace proxydb-bak cron with systemd timer job [puppet] - 10https://gerrit.wikimedia.org/r/726729 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn) [07:29:46] (03PS4) 10Jgiannelos: tegola-vector-tiles: Use envoy for cronjob pods [deployment-charts] - 10https://gerrit.wikimedia.org/r/726891 (https://phabricator.wikimedia.org/T283159) [07:31:19] !log jayme@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'sync'. [07:31:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:32:35] !log jayme@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'sync'. [07:32:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:33:48] !log jayme@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'sync'. [07:33:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:34:17] !log jayme@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'sync'. 
[07:34:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:34:52] PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin2001 is CRITICAL: CRITICAL: the following (5) node(s) change every puppet run: cumin2002, cumin2001, ms-be1059, cumin1001, clouddb1020 https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes [07:37:14] 10SRE, 10SRE-swift-storage, 10ops-eqiad: swift - ms-be1059 - device sdi:3 unavailable - https://phabricator.wikimedia.org/T292486 (10fgiunchedi) 05Open→03Resolved Thanks @Dzahn and @Cmjohnson ! I've done the procedure and documented it at https://wikitech.wikimedia.org/wiki/Swift/How_To#Replacing_a_disk_... [07:37:33] !log jayme@deploy1002 helmfile [codfw] START helmfile.d/admin 'sync'. [07:37:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:38:43] !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'sync'. [07:38:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:39:18] !log jayme@deploy1002 helmfile [eqiad] START helmfile.d/admin 'sync'. [07:39:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:39:45] !log jayme@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'sync'. [07:39:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:45:04] (03CR) 10Giuseppe Lavagetto: kubernetes: do not repeat user tokens. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/726915 (owner: 10Giuseppe Lavagetto) [07:46:09] 10SRE-swift-storage, 10MW-on-K8s, 10Shellbox, 10serviceops: Support large files in Shellbox - https://phabricator.wikimedia.org/T292322 (10fgiunchedi) >>! In T292322#7406310, @Legoktm wrote: >>>>! In T292322#7403338, @Legoktm wrote: >>> @fgiunchedi I'd appreciate your input on how this would potentially in... [07:54:27] (03CR) 10Elukey: kubernetes: do not repeat user tokens. 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/726915 (owner: 10Giuseppe Lavagetto) [07:57:14] !log re-enabling puppet on ms-be2045 after hw work T290881 [07:57:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:57:20] T290881: Spontaneous reboot of ms-be2045 - https://phabricator.wikimedia.org/T290881 [08:03:34] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] acme_chief: add openstack certs [puppet] - 10https://gerrit.wikimedia.org/r/726585 (https://phabricator.wikimedia.org/T267194) (owner: 10Majavah) [08:05:46] PROBLEM - Check systemd state on ms-be1029 is CRITICAL: CRITICAL - degraded: The following units failed: swift-drive-audit.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:06:56] RECOVERY - Check systemd state on sodium is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:10:59] (03PS2) 10Jelto: hiera::role::common::kubernetes add helm3 deploy users [labs/private] - 10https://gerrit.wikimedia.org/r/726862 (https://phabricator.wikimedia.org/T251305) [08:12:25] 10SRE, 10Language-Team (Language-2021-October-December): Remove Matxin Key from Production - https://phabricator.wikimedia.org/T292635 (10Nikerabbit) For context: * We found `Could not parse cxserver.conf.mt.Matxin` from cxserver logs * This is what code search finds: https://codesearch.wmcloud.org/search/?q=M... [08:14:48] (03CR) 10Giuseppe Lavagetto: "Please hold on with this, I'm refactoring these data structures right now." 
[labs/private] - 10https://gerrit.wikimedia.org/r/726862 (https://phabricator.wikimedia.org/T251305) (owner: 10Jelto) [08:14:54] (03CR) 10JMeybohm: [C: 03+1] hiera::role::common::kubernetes add helm3 deploy users [labs/private] - 10https://gerrit.wikimedia.org/r/726862 (https://phabricator.wikimedia.org/T251305) (owner: 10Jelto) [08:18:01] (03PS10) 10Jbond: require_packages: update all uses of require_packages [puppet] - 10https://gerrit.wikimedia.org/r/726896 (https://phabricator.wikimedia.org/T266479) [08:18:21] (03PS1) 10Jgiannelos: tegola-vector-tiles: Use envoy for cronjob pods [deployment-charts] - 10https://gerrit.wikimedia.org/r/727133 [08:20:04] (03CR) 10jerkins-bot: [V: 04-1] require_packages: update all uses of require_packages [puppet] - 10https://gerrit.wikimedia.org/r/726896 (https://phabricator.wikimedia.org/T266479) (owner: 10Jbond) [08:20:17] (03PS1) 10Jgiannelos: tegola-vector-tiles: Fix config key for envoy timeout [deployment-charts] - 10https://gerrit.wikimedia.org/r/727134 [08:20:23] (03CR) 10Giuseppe Lavagetto: kubernetes: do not repeat user tokens. 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/726915 (owner: 10Giuseppe Lavagetto) [08:20:54] (03PS2) 10Jgiannelos: tegola-vector-tiles: Fix config key for envoy timeout [deployment-charts] - 10https://gerrit.wikimedia.org/r/727134 (https://phabricator.wikimedia.org/T283159) [08:21:20] (03PS2) 10Giuseppe Lavagetto: kubernetes: add unified data structure for user tokens [labs/private] - 10https://gerrit.wikimedia.org/r/726913 [08:21:22] (03PS2) 10Giuseppe Lavagetto: kubernetes: remove redundant data [labs/private] - 10https://gerrit.wikimedia.org/r/726914 [08:21:43] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] kubernetes: add unified data structure for user tokens [labs/private] - 10https://gerrit.wikimedia.org/r/726913 (owner: 10Giuseppe Lavagetto) [08:22:02] (03PS2) 10Jgiannelos: tegola-vector-tiles: Use envoy for cronjob pods [deployment-charts] - 10https://gerrit.wikimedia.org/r/727133 (https://phabricator.wikimedia.org/T283159) [08:22:40] (03PS5) 10Jgiannelos: tegola-vector-tiles: Use envoy for cronjob pods [deployment-charts] - 10https://gerrit.wikimedia.org/r/726891 (https://phabricator.wikimedia.org/T283159) [08:24:27] (03CR) 10Jbond: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/726896 (https://phabricator.wikimedia.org/T266479) (owner: 10Jbond) [08:25:04] (03PS11) 10Jbond: require_packages: update all uses of require_packages [puppet] - 10https://gerrit.wikimedia.org/r/726896 (https://phabricator.wikimedia.org/T266479) [08:25:13] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (DIFF 4 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31524/console" [puppet] - 10https://gerrit.wikimedia.org/r/726915 (owner: 10Giuseppe Lavagetto) [08:26:50] (03CR) 10Jgiannelos: "I split the patch in 2 commits to keep it atomic." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/726891 (https://phabricator.wikimedia.org/T283159) (owner: 10Jgiannelos) [08:27:31] (03CR) 10Effie Mouzeli: [C: 03+2] Add docs about template, label, and canary conventions [deployment-charts] - 10https://gerrit.wikimedia.org/r/724489 (https://phabricator.wikimedia.org/T291848) (owner: 10Ottomata) [08:27:55] (03CR) 10Jbond: [C: 03+2] require_packages: update all uses of require_packages [puppet] - 10https://gerrit.wikimedia.org/r/726896 (https://phabricator.wikimedia.org/T266479) (owner: 10Jbond) [08:28:18] (03CR) 10Effie Mouzeli: [C: 03+1] tegola-vector-tiles: Use envoy for cronjob pods [deployment-charts] - 10https://gerrit.wikimedia.org/r/726891 (https://phabricator.wikimedia.org/T283159) (owner: 10Jgiannelos) [08:28:41] (03Abandoned) 10Jgiannelos: tegola-vector-tiles: Use envoy for cronjob pods [deployment-charts] - 10https://gerrit.wikimedia.org/r/727133 (https://phabricator.wikimedia.org/T283159) (owner: 10Jgiannelos) [08:29:43] (03CR) 10Effie Mouzeli: [C: 03+1] "one nit" [deployment-charts] - 10https://gerrit.wikimedia.org/r/727134 (https://phabricator.wikimedia.org/T283159) (owner: 10Jgiannelos) [08:31:55] (03Merged) 10jenkins-bot: Add docs about template, label, and canary conventions [deployment-charts] - 10https://gerrit.wikimedia.org/r/724489 (https://phabricator.wikimedia.org/T291848) (owner: 10Ottomata) [08:33:22] (03PS1) 10Jbond: wmflib: remove require_package [puppet] - 10https://gerrit.wikimedia.org/r/727137 (https://phabricator.wikimedia.org/T266479) [08:35:11] (03PS3) 10Jgiannelos: tegola-vector-tiles: Fix config keys for envoy [deployment-charts] - 10https://gerrit.wikimedia.org/r/727134 (https://phabricator.wikimedia.org/T283159) [08:35:54] (03CR) 10Jbond: [C: 03+2] wmflib: remove require_package [puppet] - 10https://gerrit.wikimedia.org/r/727137 (https://phabricator.wikimedia.org/T266479) (owner: 10Jbond) [08:36:29] !log imported jenkins 2.303.2 to thirdparty/ci component for 
buster-wikimedia [08:36:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:36:45] (03CR) 10Jgiannelos: tegola-vector-tiles: Fix config keys for envoy (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/727134 (https://phabricator.wikimedia.org/T283159) (owner: 10Jgiannelos) [08:37:43] (03PS6) 10Jbond: stdlib: update stdlib from version 7.0.1 to 8.1.0 [puppet] - 10https://gerrit.wikimedia.org/r/726872 (https://phabricator.wikimedia.org/T264276) [08:49:20] !log mvernon@cumin2002 START - Cookbook sre.experimental.reimage for host ms-be2045.codfw.wmnet [08:49:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:49:26] 10SRE, 10SRE-swift-storage, 10ops-codfw: Spontaneous reboot of ms-be2045 - https://phabricator.wikimedia.org/T290881 (10ops-monitoring-bot) Cookbook cookbooks.sre.experimental.reimage was started by mvernon@cumin2002 for host ms-be2045.codfw.wmnet [08:49:42] 10SRE, 10Maps, 10Product-Infrastructure-Team-Backlog, 10serviceops, 10Service-deployment-requests: New Service Request tegola-vector-tiles - https://phabricator.wikimedia.org/T274390 (10jijiki) [08:49:55] (03CR) 10Giuseppe Lavagetto: [V: 03+1 C: 03+2] kubernetes: do not repeat user tokens. [puppet] - 10https://gerrit.wikimedia.org/r/726915 (owner: 10Giuseppe Lavagetto) [08:51:41] (03CR) 10Jelto: [C: 03+1] "lgtm" [deployment-charts] - 10https://gerrit.wikimedia.org/r/727134 (https://phabricator.wikimedia.org/T283159) (owner: 10Jgiannelos) [08:54:23] (03PS30) 10Jbond: P:base: move production specific code to there own profile [puppet] - 10https://gerrit.wikimedia.org/r/714975 (https://phabricator.wikimedia.org/T289661) [08:54:56] (03PS6) 10Jgiannelos: tegola-vector-tiles: Use envoy for cronjob pods [deployment-charts] - 10https://gerrit.wikimedia.org/r/726891 (https://phabricator.wikimedia.org/T283159) [08:56:39] (03CR) 10Jgiannelos: "Added missing `tls.volume` for envoy to be able to mount the config." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/726891 (https://phabricator.wikimedia.org/T283159) (owner: 10Jgiannelos) [08:57:51] (03PS1) 10Volans: doc: refactor logging paragraph [software/spicerack] - 10https://gerrit.wikimedia.org/r/727164 [09:00:19] (03PS1) 10Jbond: P:contact::get_owners: sort array before casting [puppet] - 10https://gerrit.wikimedia.org/r/727165 [09:01:05] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31525/console" [puppet] - 10https://gerrit.wikimedia.org/r/727165 (owner: 10Jbond) [09:02:18] (03CR) 10Jbond: [V: 03+1 C: 03+2] P:contact::get_owners: sort array before casting [puppet] - 10https://gerrit.wikimedia.org/r/727165 (owner: 10Jbond) [09:02:37] RECOVERY - Check systemd state on ms-be1029 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:05:49] (03PS7) 10Jgiannelos: tegola-vector-tiles: Use envoy for cronjob pods [deployment-charts] - 10https://gerrit.wikimedia.org/r/726891 (https://phabricator.wikimedia.org/T283159) [09:07:26] (03Abandoned) 10Effie Mouzeli: common_templates: allow custon routed_via value [deployment-charts] - 10https://gerrit.wikimedia.org/r/722845 (owner: 10Effie Mouzeli) [09:08:27] !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host testvm2005.codfw.wmnet [09:08:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:13:49] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Christina Macholan - https://phabricator.wikimedia.org/T292515 (10Kormat) [09:14:54] (03PS1) 10Ladsgroup: jobqueue: Batch jobs that will end up in the default queue [core] (wmf/1.38.0-wmf.3) - 10https://gerrit.wikimedia.org/r/727186 (https://phabricator.wikimedia.org/T292048) [09:15:09] (03PS1) 10Ladsgroup: jobqueue: Batch jobs that will end up in the default queue [core] (wmf/1.38.0-wmf.2) - 
10https://gerrit.wikimedia.org/r/727187 (https://phabricator.wikimedia.org/T292048) [09:15:40] (03CR) 10Jgiannelos: "I figured out there are some missing parts for envoy:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/726891 (https://phabricator.wikimedia.org/T283159) (owner: 10Jgiannelos) [09:15:42] (03PS9) 10Effie Mouzeli: Bump namespace limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/726580 (https://phabricator.wikimedia.org/T280497) [09:17:25] (03PS10) 10Effie Mouzeli: Bump namespace limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/726580 (https://phabricator.wikimedia.org/T280497) [09:19:53] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts testvm2004.codfw.wmnet [09:19:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:20:38] RECOVERY - Check systemd state on search-loader2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:24:02] (03CR) 10jerkins-bot: [V: 04-1] jobqueue: Batch jobs that will end up in the default queue [core] (wmf/1.38.0-wmf.2) - 10https://gerrit.wikimedia.org/r/727187 (https://phabricator.wikimedia.org/T292048) (owner: 10Ladsgroup) [09:24:53] (03PS1) 10Muehlenhoff: Add MAC for testvm2005 [puppet] - 10https://gerrit.wikimedia.org/r/727172 [09:25:06] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host testvm2005.codfw.wmnet [09:25:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:25:33] (03PS1) 10Kormat: admin: Add cmacholan to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/727173 (https://phabricator.wikimedia.org/T292515) [09:25:52] (03CR) 10Muehlenhoff: [C: 03+2] Add MAC for testvm2005 [puppet] - 10https://gerrit.wikimedia.org/r/727172 (owner: 10Muehlenhoff) [09:26:32] !log mvernon@cumin2002 END (PASS) - Cookbook sre.experimental.reimage (exit_code=0) for host ms-be2045.codfw.wmnet 
[09:26:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:26:37] 10SRE, 10SRE-swift-storage, 10ops-codfw: Spontaneous reboot of ms-be2045 - https://phabricator.wikimedia.org/T290881 (10ops-monitoring-bot) Cookbook cookbooks.sre.experimental.reimage started by mvernon@cumin2002 for host ms-be2045.codfw.wmnet completed: - ms-be2045 (**WARN**) - Downtimed on Icinga - Dis... [09:27:11] (03PS2) 10Kormat: admin: Add cmacholan to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/727173 (https://phabricator.wikimedia.org/T292515) [09:27:13] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts testvm2004.codfw.wmnet [09:27:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:27:19] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Create Ganeti test cluster - https://phabricator.wikimedia.org/T286206 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `testvm2004.codfw.wmnet` - testvm2004.codfw.wmnet (**WARN**) - //Host not found on Ici... [09:28:28] 10SRE-swift-storage, 10MW-on-K8s, 10Shellbox, 10serviceops: Support large files in Shellbox - https://phabricator.wikimedia.org/T292322 (10tstarling) I did consider having swift access in Shellbox, but we didn't have a use case for it, and allowing network access and giving it a swift secret means there is... 
[09:28:38] (03CR) 10Kormat: [C: 03+2] admin: Add cmacholan to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/727173 (https://phabricator.wikimedia.org/T292515) (owner: 10Kormat) [09:30:02] (03PS4) 10Jcrespo: mariadb: Add easy-to-use wrapper for pt-kill [puppet] - 10https://gerrit.wikimedia.org/r/726857 [09:30:28] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users for Christina Macholan - https://phabricator.wikimedia.org/T292515 (10Kormat) [09:31:23] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [09:31:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:31:27] (03CR) 10jerkins-bot: [V: 04-1] mariadb: Add easy-to-use wrapper for pt-kill [puppet] - 10https://gerrit.wikimedia.org/r/726857 (owner: 10Jcrespo) [09:31:55] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users for Christina Macholan - https://phabricator.wikimedia.org/T292515 (10Kormat) 05Open→03Resolved a:03Kormat Hey @CMacholan, this is now done. Let us know if you run into any issues. [09:32:10] (03PS5) 10Jcrespo: mariadb: Add easy-to-use wrapper for pt-kill [puppet] - 10https://gerrit.wikimedia.org/r/726857 [09:33:08] (03PS6) 10Jcrespo: mariadb: Add easy-to-use wrapper for pt-kill [puppet] - 10https://gerrit.wikimedia.org/r/726857 [09:34:08] 10SRE, 10LDAP-Access-Requests: Add Deniz Erdogan to the ldap/wmde and ldap/nda group - https://phabricator.wikimedia.org/T292301 (10Kormat) [09:34:51] (03CR) 10Jcrespo: mariadb: Add easy-to-use wrapper for pt-kill (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/726857 (owner: 10Jcrespo) [09:35:48] 10SRE, 10LDAP-Access-Requests: Add Deniz Erdogan to the ldap/wmde and ldap/nda group - https://phabricator.wikimedia.org/T292301 (10Kormat) Hey @Lea_WMDE / @karapayneWMDE: can one of you approve this request, please? 
[09:36:17] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:36:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:38:55] (03CR) 10jerkins-bot: [V: 04-1] jobqueue: Batch jobs that will end up in the default queue [core] (wmf/1.38.0-wmf.3) - 10https://gerrit.wikimedia.org/r/727186 (https://phabricator.wikimedia.org/T292048) (owner: 10Ladsgroup) [09:40:43] 10SRE-swift-storage, 10serviceops: Allow maps2009/maps1009 (master nodes) access thanos-swift - https://phabricator.wikimedia.org/T292700 (10Jgiannelos) [09:41:54] (03PS1) 10Giuseppe Lavagetto: kubernetes: add jenkins-debug user, limited to the staging masters. [labs/private] - 10https://gerrit.wikimedia.org/r/727213 [09:42:09] (03PS1) 10Giuseppe Lavagetto: kubernetes: allow constraining users to specific subsets of masters [puppet] - 10https://gerrit.wikimedia.org/r/727214 [09:43:00] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] kubernetes: add jenkins-debug user, limited to the staging masters. [labs/private] - 10https://gerrit.wikimedia.org/r/727213 (owner: 10Giuseppe Lavagetto) [09:43:16] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] kubernetes: remove redundant data [labs/private] - 10https://gerrit.wikimedia.org/r/726914 (owner: 10Giuseppe Lavagetto) [09:46:02] 10SRE, 10MW-on-K8s, 10serviceops, 10Patch-For-Review, 10Performance-Team (Radar): Benchmark performance of MediaWiki on k8s - https://phabricator.wikimedia.org/T280497 (10jijiki) Running some tests (c=60, ~1.9m URLs) agains mwdebug services, we found 2 issues: 1) Our client was returning the following e... 
[09:46:32] (03PS31) 10Jbond: P:base: move production specific code to there own profile [puppet] - 10https://gerrit.wikimedia.org/r/714975 (https://phabricator.wikimedia.org/T289661) [09:56:43] (03CR) 10Giuseppe Lavagetto: [C: 03+2] kubernetes: allow constraining users to specific subsets of masters [puppet] - 10https://gerrit.wikimedia.org/r/727214 (owner: 10Giuseppe Lavagetto) [09:56:52] (03PS2) 10Giuseppe Lavagetto: kubernetes: allow constraining users to specific subsets of masters [puppet] - 10https://gerrit.wikimedia.org/r/727214 [09:56:54] 10SRE-swift-storage: Swift users and their usage - https://phabricator.wikimedia.org/T264291 (10fgiunchedi) [09:58:14] 10SRE, 10LDAP-Access-Requests: Add Deniz Erdogan to the ldap/wmde and ldap/nda group - https://phabricator.wikimedia.org/T292301 (10conny-kawohl_WMDE) @Kormat I am @Deniz_WMDE's manager and approve this request. [10:00:04] mvolz: My dear minions, it's time we take the moon! Just kidding. Time for Services – Citoid / Zotero deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211007T1000). [10:04:06] 10SRE, 10SRE-swift-storage, 10ops-codfw: Spontaneous reboot of ms-be2045 - https://phabricator.wikimedia.org/T290881 (10MatthewVernon) 05Resolved→03Open Hi @Papaul We reimaged this host today to try and bring it back into service. After about half an hour of uptime it dropped off the network, and from th... 
[10:10:27] (03PS1) 10Giuseppe Lavagetto: kubernetes::deployment_server: fix key lookup [puppet] - 10https://gerrit.wikimedia.org/r/727231 [10:10:46] (03PS2) 10Giuseppe Lavagetto: kubernetes::deployment_server: fix key lookup [puppet] - 10https://gerrit.wikimedia.org/r/727231 [10:10:55] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] kubernetes::deployment_server: fix key lookup [puppet] - 10https://gerrit.wikimedia.org/r/727231 (owner: 10Giuseppe Lavagetto) [10:12:46] (03PS1) 10Giuseppe Lavagetto: deployment_server: fix typo [puppet] - 10https://gerrit.wikimedia.org/r/727233 [10:13:04] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] deployment_server: fix typo [puppet] - 10https://gerrit.wikimedia.org/r/727233 (owner: 10Giuseppe Lavagetto) [10:14:19] (03CR) 10Jbond: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/726872 (https://phabricator.wikimedia.org/T264276) (owner: 10Jbond) [10:15:03] (03CR) 10David Caro: [C: 03+2] base::environment: add types to the parameter [puppet] - 10https://gerrit.wikimedia.org/r/725301 (owner: 10David Caro) [10:17:26] (03CR) 10Giuseppe Lavagetto: [C: 03+1] Bump namespace limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/726580 (https://phabricator.wikimedia.org/T280497) (owner: 10Effie Mouzeli) [10:19:53] (03Abandoned) 10Zabe: dynamicproxy: migrate cron of proxydb-bak to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/711230 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [10:20:54] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 2 others: Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (10jijiki) [10:21:00] 10SRE, 10MW-on-K8s, 10serviceops, 10MW-1.37-notes (1.37.0-wmf.20; 2021-08-23): Make HTTP calls work within mediawiki on kubernetes - https://phabricator.wikimedia.org/T288848 (10jijiki) [10:21:08] 10SRE, 10MW-on-K8s, 10serviceops, 10Patch-For-Review, 10Performance-Team (Radar): Benchmark performance of MediaWiki on k8s - 
https://phabricator.wikimedia.org/T280497 (10jijiki) [10:21:12] (03CR) 10Ayounsi: [C: 03+1] sre.experimental.reimage: update Netbox data [cookbooks] - 10https://gerrit.wikimedia.org/r/726990 (owner: 10Volans) [10:21:18] (03PS1) 10Zabe: Change PropertyId to NumericPropertyId [extensions/WikidataPageBanner] (wmf/1.38.0-wmf.3) - 10https://gerrit.wikimedia.org/r/727188 (https://phabricator.wikimedia.org/T289125) [10:34:49] (03CR) 10David Caro: [C: 03+1] stdlib: update stdlib from version 7.0.1 to 8.1.0 [puppet] - 10https://gerrit.wikimedia.org/r/726872 (https://phabricator.wikimedia.org/T264276) (owner: 10Jbond) [10:35:31] (03PS2) 10David Caro: base::environment: use only vars inside ::realm ifs [puppet] - 10https://gerrit.wikimedia.org/r/725302 [10:36:28] (03PS1) 10Jbond: base::auto_restarts: move to profile and make debdeploy support optional [puppet] - 10https://gerrit.wikimedia.org/r/727235 (https://phabricator.wikimedia.org/T289661) [10:37:30] (03CR) 10Jbond: debdeploy: move base::autorestart into debdeploy module (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/725317 (https://phabricator.wikimedia.org/T289661) (owner: 10Jbond) [10:37:43] (03Abandoned) 10Jbond: debdeploy: move base::autorestart into debdeploy module [puppet] - 10https://gerrit.wikimedia.org/r/725317 (https://phabricator.wikimedia.org/T289661) (owner: 10Jbond) [10:38:31] (03CR) 10jerkins-bot: [V: 04-1] base::auto_restarts: move to profile and make debdeploy support optional [puppet] - 10https://gerrit.wikimedia.org/r/727235 (https://phabricator.wikimedia.org/T289661) (owner: 10Jbond) [10:38:34] (03PS32) 10Jbond: P:base: move production specific code to there own profile [puppet] - 10https://gerrit.wikimedia.org/r/714975 (https://phabricator.wikimedia.org/T289661) [10:45:09] (03PS1) 10Jbond: P::base: create standard directories in base::standard_packages [puppet] - 10https://gerrit.wikimedia.org/r/727242 (https://phabricator.wikimedia.org/T289661) [10:45:17] (03CR) 10Giuseppe Lavagetto: 
P:base: move production specific code to there own profile (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/714975 (https://phabricator.wikimedia.org/T289661) (owner: 10Jbond) [10:50:39] (03PS3) 10David Caro: base::environment: use only vars inside ::realm ifs [puppet] - 10https://gerrit.wikimedia.org/r/725302 [10:50:41] (03CR) 10David Caro: base::environment: use only vars inside ::realm ifs (0310 comments) [puppet] - 10https://gerrit.wikimedia.org/r/725302 (owner: 10David Caro) [10:52:30] (03PS1) 10Jbond: P:rsyslog::kafka_shipping: drop unused parameter [puppet] - 10https://gerrit.wikimedia.org/r/727250 [10:53:25] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31528/console" [puppet] - 10https://gerrit.wikimedia.org/r/727250 (owner: 10Jbond) [10:57:00] (03CR) 10Jbond: "lgtm, should do a pcc run" [puppet] - 10https://gerrit.wikimedia.org/r/725302 (owner: 10David Caro) [10:57:10] (03CR) 10Jbond: [V: 03+1 C: 03+2] P:rsyslog::kafka_shipping: drop unused parameter [puppet] - 10https://gerrit.wikimedia.org/r/727250 (owner: 10Jbond) [10:59:16] (03PS33) 10Jbond: P:base: move production specific code to there own profile [puppet] - 10https://gerrit.wikimedia.org/r/714975 (https://phabricator.wikimedia.org/T289661) [10:59:48] 10SRE, 10serviceops: Migrate node-based services in production to node12 - https://phabricator.wikimedia.org/T290750 (10Pginer-WMF) [10:59:52] (03PS34) 10Jbond: P:base: move production specific code to there own profile [puppet] - 10https://gerrit.wikimedia.org/r/714975 (https://phabricator.wikimedia.org/T289661) [11:00:05] Amir1, Lucas_WMDE, and apergos: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for EU Backport and Config training . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211007T1100). 
[11:00:22] (03CR) 10Jbond: P:base: move production specific code to there own profile (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/714975 (https://phabricator.wikimedia.org/T289661) (owner: 10Jbond) [11:05:55] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 2 others: Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (10jijiki) [11:06:05] I'm a bit late for scheduling, but maybe someone could take a look at https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikidataPageBanner/+/727188? [11:06:13] (03PS1) 10Giuseppe Lavagetto: Remove the other redundant data structure [labs/private] - 10https://gerrit.wikimedia.org/r/727253 [11:06:39] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] Remove the other redundant data structure [labs/private] - 10https://gerrit.wikimedia.org/r/727253 (owner: 10Giuseppe Lavagetto) [11:09:48] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31529/console" [puppet] - 10https://gerrit.wikimedia.org/r/726872 (https://phabricator.wikimedia.org/T264276) (owner: 10Jbond) [11:10:11] !log update puppet stdlib gerrit:726872 [11:10:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:10:16] (03CR) 10Jbond: [V: 03+1 C: 03+2] stdlib: update stdlib from version 7.0.1 to 8.1.0 [puppet] - 10https://gerrit.wikimedia.org/r/726872 (https://phabricator.wikimedia.org/T264276) (owner: 10Jbond) [11:11:06] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 2 others: Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (10jijiki) [11:12:24] sorry, I missed the backport ping [11:12:36] zabe: should we deploy your backport? [11:13:30] I think so, especially since you know wikidata stuff a lot better than me [11:13:34] (also, it looks like the second jouncebot message got lost) [11:14:07] No, I was too slow.
I added my patch after the window began [11:14:08] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Change PropertyId to NumericPropertyId [extensions/WikidataPageBanner] (wmf/1.38.0-wmf.3) - 10https://gerrit.wikimedia.org/r/727188 (https://phabricator.wikimedia.org/T289125) (owner: 10Zabe) [11:14:24] doesn’t it usually say “no patches in this window as far as I can see” in that case? [11:14:27] but nevermind [11:14:29] jouncebot: now [11:14:29] For the next 0 hour(s) and 45 minute(s): EU Backport and Config training (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211007T1100) [11:14:33] it’s still alive at least [11:15:23] right, it should have said the no patches message [11:19:59] (03PS2) 10Jbond: statistics::web: require_package -> ensure_packages [puppet] - 10https://gerrit.wikimedia.org/r/726728 (https://phabricator.wikimedia.org/T266479) (owner: 10Dzahn) [11:20:29] (03CR) 10Jbond: "I have now removed require_packages everywhere." [puppet] - 10https://gerrit.wikimedia.org/r/726728 (https://phabricator.wikimedia.org/T266479) (owner: 10Dzahn) [11:20:51] I also forgot to add my patch :/ Any chance to do a config change in today's slot? [11:21:11] kart_: sure, add it to the calendar for now and I can take a look [11:21:12] (03PS2) 10KartikMistry: Enable Content and Section Translation to Kurdish WP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/725858 (https://phabricator.wikimedia.org/T290238) [11:22:21] (03CR) 10Alexandros Kosiaris: [C: 03+1] Bump namespace limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/726580 (https://phabricator.wikimedia.org/T280497) (owner: 10Effie Mouzeli) [11:23:05] Lucas_WMDE: added.
[11:23:23] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review, 10User-jbond: Puppet clean up Parent task - https://phabricator.wikimedia.org/T267395 (10jbond) [11:24:04] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Puppet Proposal to remove require_package - https://phabricator.wikimedia.org/T266479 (10jbond) 05Open→03Resolved a:03Dzahn I was bold and have now removed require_packages everywhere and dropped it from the repo. The reason i did th... [11:24:36] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] Enable Content and Section Translation to Kurdish WP (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/725858 (https://phabricator.wikimedia.org/T290238) (owner: 10KartikMistry) [11:25:10] (03PS1) 10Effie Mouzeli: mwdebug: tune memory, cpu and apcu size [deployment-charts] - 10https://gerrit.wikimedia.org/r/727279 (https://phabricator.wikimedia.org/T280497) [11:25:33] hm, that backport will probably take a while longer [11:25:45] but I don’t think I want to risk doing the config change first, in case CI is faster than expected [11:25:48] so let’s just wait a bit [11:25:55] no need to rush [11:26:38] (03PS2) 10Jbond: base::auto_restarts: move to profile and make debdeploy support optional [puppet] - 10https://gerrit.wikimedia.org/r/727235 (https://phabricator.wikimedia.org/T289661) [11:28:11] (03PS2) 10Effie Mouzeli: mwdebug: tune memory, cpu and apcu size [deployment-charts] - 10https://gerrit.wikimedia.org/r/727279 (https://phabricator.wikimedia.org/T280497) [11:28:20] (03CR) 10jerkins-bot: [V: 04-1] base::auto_restarts: move to profile and make debdeploy support optional [puppet] - 10https://gerrit.wikimedia.org/r/727235 (https://phabricator.wikimedia.org/T289661) (owner: 10Jbond) [11:30:28] (03PS2) 10Jbond: P::base: create standard directories in base::standard_packages [puppet] - 10https://gerrit.wikimedia.org/r/727242 (https://phabricator.wikimedia.org/T289661) [11:31:18] (03PS3) 10Effie Mouzeli: mwdebug: 
tune memory, cpu and apcu size [deployment-charts] - 10https://gerrit.wikimedia.org/r/727279 (https://phabricator.wikimedia.org/T280497) [11:32:00] (03CR) 10jerkins-bot: [V: 04-1] P::base: create standard directories in base::standard_packages [puppet] - 10https://gerrit.wikimedia.org/r/727242 (https://phabricator.wikimedia.org/T289661) (owner: 10Jbond) [11:32:05] (03PS3) 10Jbond: base::auto_restarts: move to profile and make debdeploy support optional [puppet] - 10https://gerrit.wikimedia.org/r/727235 (https://phabricator.wikimedia.org/T289661) [11:32:07] (03PS1) 10Jbond: spec tests: fix spec tests poststdlib update [puppet] - 10https://gerrit.wikimedia.org/r/727280 [11:32:40] (03CR) 10Muehlenhoff: "Sounds good, but for consistency let's also move /etc/wikimedia there? (currently defined in base::environments)" [puppet] - 10https://gerrit.wikimedia.org/r/727242 (https://phabricator.wikimedia.org/T289661) (owner: 10Jbond) [11:33:54] (03CR) 10Jbond: [C: 03+1] "lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/726990 (owner: 10Volans) [11:34:47] (03CR) 10Muehlenhoff: [C: 03+1] P::base: create standard directories in base::standard_packages (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/727242 (https://phabricator.wikimedia.org/T289661) (owner: 10Jbond) [11:34:59] (03Merged) 10jenkins-bot: Change PropertyId to NumericPropertyId [extensions/WikidataPageBanner] (wmf/1.38.0-wmf.3) - 10https://gerrit.wikimedia.org/r/727188 (https://phabricator.wikimedia.org/T289125) (owner: 10Zabe) [11:35:04] yay [11:35:07] Lucas_WMDE: OK. Waiting. [11:36:44] zabe: the fix is on mwdebug1001, can you test it? [11:38:40] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[11:38:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:39:32] I haven’t yet managed to reproduce the error on mwdebug, I think [11:39:43] (it should still happen on mwdebug1002, where I haven’t pulled the fix) [11:40:12] tbh I don't really know how to reproduce it [11:41:10] (03PS1) 10Filippo Giunchedi: statsite: switch to python3 on Bullseye [puppet] - 10https://gerrit.wikimedia.org/r/727293 (https://phabricator.wikimedia.org/T247963) [11:41:12] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [11:41:12] (03PS1) 10Filippo Giunchedi: graphite: set settings_module from uwsgi [puppet] - 10https://gerrit.wikimedia.org/r/727294 (https://phabricator.wikimedia.org/T247963) [11:41:14] (03PS1) 10Filippo Giunchedi: statsite: log instance identifier [puppet] - 10https://gerrit.wikimedia.org/r/727295 (https://phabricator.wikimedia.org/T247963) [11:41:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:41:25] ah, now I got it on mwdebug1002 [11:41:36] on /wiki/Bushehr [11:41:41] maybe it only affects certain pages [11:41:50] oh, right, I guess it’s a *page* banner innit [11:42:53] It does not break anything on mwdebug1001, but I can't see if the error goes away. [11:42:56] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Christina Macholan - https://phabricator.wikimedia.org/T292515 (10CMacholan) Looks like it's working -- thank you so much!
[11:42:59] ok, looks like it’s only happening on mwdebug1002 but not mwdebug1001 [11:43:03] good enough for me [11:43:42] syncing [11:44:44] !log lucaswerkmeister-wmde@deploy1002 Synchronized php-1.38.0-wmf.3/extensions/WikidataPageBanner/includes/WikidataPageBannerFunctions.php: Backport: [[gerrit:727188|Change PropertyId to NumericPropertyId (T289125, T292667)]] (duration: 01m 05s) [11:44:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:44:51] T292667: WikidataPageBanner: Cannot instantiate interface Wikibase\DataModel\Entity\PropertyId - https://phabricator.wikimedia.org/T292667 [11:44:51] T289125: Make PropertyId into an interface and introduce NumericPropertyId - https://phabricator.wikimedia.org/T289125 [11:45:11] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Enable Content and Section Translation to Kurdish WP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/725858 (https://phabricator.wikimedia.org/T290238) (owner: 10KartikMistry) [11:45:23] kart_: ^ [11:45:29] Thanks. [11:46:01] (03Merged) 10jenkins-bot: Enable Content and Section Translation to Kurdish WP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/725858 (https://phabricator.wikimedia.org/T290238) (owner: 10KartikMistry) [11:46:12] Lucas_WMDE: thanks for your help :) [11:46:17] np :) [11:46:39] kart_: change is on mwdebug1001, please test [11:47:57] (03CR) 10Volans: [C: 03+2] "Just doc-formatting, self merging" [software/spicerack] - 10https://gerrit.wikimedia.org/r/727164 (owner: 10Volans) [11:48:17] (03PS1) 10Jbond: P:sre::os_updates: onlymanage the rsync server on the active host [puppet] - 10https://gerrit.wikimedia.org/r/727305 [11:48:34] Lucas_WMDE: looks good. 
[11:48:38] ok thanks [11:48:58] (03CR) 10Jbond: [C: 03+1] Don't include rsync::server for absented rsync modules (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/726851 (owner: 10Muehlenhoff) [11:49:40] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [11:49:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:50:09] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:725858|Enable Content and Section Translation to Kurdish WP (T290238)]] (duration: 01m 04s) [11:50:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:50:15] T290238: Deploy Content and Section Translation tool to Kurdish Wikipedia - https://phabricator.wikimedia.org/T290238 [11:51:05] (03CR) 10Jbond: [C: 03+2] spec tests: fix spec tests poststdlib update [puppet] - 10https://gerrit.wikimedia.org/r/727280 (owner: 10Jbond) [11:51:10] Thanks again Lucas_WMDE ! [11:51:23] np [11:51:28] but where did the next week in the deployment calendar go? [11:51:33] I thought it was already there a few days ago [11:52:01] :shrug: [11:52:11] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [11:52:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:52:28] !log EU backport+config window (aka UTC morning) done [11:52:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:52:49] (03PS3) 10Jbond: P::base: create standard directories in base::standard_packages [puppet] - 10https://gerrit.wikimedia.org/r/727242 (https://phabricator.wikimedia.org/T289661) [11:54:18] (03PS2) 10Jbond: P:sre::os_updates: onlymanage the rsync server on the active host [puppet] - 10https://gerrit.wikimedia.org/r/727305 [11:54:36] Holiday for Lucas_WMDE ? 
;) [11:54:42] (03PS3) 10Jbond: P:sre::os_updates: only manage the rsync server on the active host [puppet] - 10https://gerrit.wikimedia.org/r/727305 [11:55:21] 10SRE, 10Language-Team (Language-2021-October-December): Remove Matxin Key from Production - https://phabricator.wikimedia.org/T292635 (10KartikMistry) `cxserver/deploy` repository is deprecated and can be deleted. I need to look what `deployment-prep` is doing here. [11:55:40] (03CR) 10Jbond: [C: 03+2] P::base: create standard directories in base::standard_packages [puppet] - 10https://gerrit.wikimedia.org/r/727242 (https://phabricator.wikimedia.org/T289661) (owner: 10Jbond) [11:55:42] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31531/console" [puppet] - 10https://gerrit.wikimedia.org/r/727305 (owner: 10Jbond) [11:56:29] (03Merged) 10jenkins-bot: doc: refactor logging paragraph [software/spicerack] - 10https://gerrit.wikimedia.org/r/727164 (owner: 10Volans) [11:59:29] !log installing openssl security updates for stretch (buster/bullseye already fixed) [11:59:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:06:00] 10SRE-swift-storage, 10serviceops: Allow maps2009/maps1009 (master nodes) access thanos-swift - https://phabricator.wikimedia.org/T292700 (10fgiunchedi) My two cents: I think what's needed here is get puppet to write the credentials on the filesystem (with adeguate ownership/permissions) in a format suitable f... 
[12:11:12] (03CR) 10Ladsgroup: "recheck" [core] (wmf/1.38.0-wmf.2) - 10https://gerrit.wikimedia.org/r/727187 (https://phabricator.wikimedia.org/T292048) (owner: 10Ladsgroup) [12:11:20] (03CR) 10Ladsgroup: "recheck" [core] (wmf/1.38.0-wmf.3) - 10https://gerrit.wikimedia.org/r/727186 (https://phabricator.wikimedia.org/T292048) (owner: 10Ladsgroup) [12:12:13] (03CR) 10Volans: [C: 03+2] sre.experimental.reimage: update Netbox data [cookbooks] - 10https://gerrit.wikimedia.org/r/726990 (owner: 10Volans) [12:15:18] (03Merged) 10jenkins-bot: sre.experimental.reimage: update Netbox data [cookbooks] - 10https://gerrit.wikimedia.org/r/726990 (owner: 10Volans) [12:16:51] !log installing testvm2005 [12:16:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:17:34] (03PS1) 10Muehlenhoff: Extend testvm globbing in site.pp [puppet] - 10https://gerrit.wikimedia.org/r/727321 [12:21:04] (03CR) 10Muehlenhoff: [C: 03+2] Extend testvm globbing in site.pp [puppet] - 10https://gerrit.wikimedia.org/r/727321 (owner: 10Muehlenhoff) [12:21:41] (03CR) 10Zabe: "Failure is unrelated. Needs https://gerrit.wikimedia.org/r/c/mediawiki/extensions/ProofreadPage/+/726959 and https://gerrit.wikimedia.org/" [core] (wmf/1.38.0-wmf.3) - 10https://gerrit.wikimedia.org/r/727186 (https://phabricator.wikimedia.org/T292048) (owner: 10Ladsgroup) [12:25:10] (03CR) 10Zabe: "https://gerrit.wikimedia.org/r/c/mediawiki/vendor/+/726963 should fix the failure." [core] (wmf/1.38.0-wmf.2) - 10https://gerrit.wikimedia.org/r/727187 (https://phabricator.wikimedia.org/T292048) (owner: 10Ladsgroup) [12:25:12] (03CR) 10BBlack: [C: 03+1] "LGTM, and I suspect this works correctly! 
The complexity is un-ideal, but the end result will either be a reversion of these single-backen" [puppet] - 10https://gerrit.wikimedia.org/r/726912 (https://phabricator.wikimedia.org/T288106) (owner: 10Ema) [12:34:30] jouncebot: nowandnext [12:34:30] No deployments scheduled for the next 3 hour(s) and 25 minute(s) [12:34:31] In 3 hour(s) and 25 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211007T1600) [12:34:37] oooh awesome [12:35:24] (03CR) 10Ladsgroup: [C: 03+2] api-testing: Adjust DiffCompare expected outcome to cope with new wikidiff2 output [core] (wmf/1.38.0-wmf.3) - 10https://gerrit.wikimedia.org/r/726961 (https://phabricator.wikimedia.org/T292676) (owner: 10Jforrester) [12:35:49] (03CR) 10Ladsgroup: [C: 03+2] PageSlotDiffRendererTest::testGetDiff: Skip as new wikidiff2 breaks this test [extensions/ProofreadPage] (wmf/1.38.0-wmf.3) - 10https://gerrit.wikimedia.org/r/726959 (https://phabricator.wikimedia.org/T292676) (owner: 10Jforrester) [12:39:15] zabe: let's fix wmf.3 first, then let's look at the mess of wmf.2 [12:40:37] !log volans@cumin2002 START - Cookbook sre.experimental.reimage for host sretest1001.eqiad.wmnet [12:40:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:41:40] sure [12:49:40] PROBLEM - Host ms-be2045 is DOWN: PING CRITICAL - Packet loss = 100% [12:50:21] godog, Emperor :/ %%% [12:50:22] ^^^ [12:50:31] apparently didn't last long [12:50:38] unless that was done on purpose [12:51:12] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [12:51:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:52:33] I think the mess on wmf.2 is even worse, all patches depend on each other. We probably can't get around some force-merging...
[12:55:15] (03PS5) 10Jbond: icinga: add recheck_failed_services function [software/spicerack] - 10https://gerrit.wikimedia.org/r/724759 [12:55:50] (03CR) 10Jbond: icinga: add recheck_failed_services function (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/724759 (owner: 10Jbond) [12:56:00] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:56:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:56:24] (03CR) 10jerkins-bot: [V: 04-1] PageSlotDiffRendererTest::testGetDiff: Skip as new wikidiff2 breaks this test [extensions/ProofreadPage] (wmf/1.38.0-wmf.3) - 10https://gerrit.wikimedia.org/r/726959 (https://phabricator.wikimedia.org/T292676) (owner: 10Jforrester) [12:56:31] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [12:56:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:57:43] (03PS3) 10Jelto: hiera::role::common::kubernetes add helm3 deploy users [labs/private] - 10https://gerrit.wikimedia.org/r/726862 (https://phabricator.wikimedia.org/T251305) [13:00:47] (03CR) 10Volans: icinga: add recheck_failed_services function (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/724759 (owner: 10Jbond) [13:00:54] (03CR) 10Ladsgroup: [C: 03+2] "you got to be kidding me" [extensions/ProofreadPage] (wmf/1.38.0-wmf.3) - 10https://gerrit.wikimedia.org/r/726959 (https://phabricator.wikimedia.org/T292676) (owner: 10Jforrester) [13:00:58] (03CR) 10Jelto: "@Joe could you take a look if rebase to new data structure is ok?" [labs/private] - 10https://gerrit.wikimedia.org/r/726862 (https://phabricator.wikimedia.org/T251305) (owner: 10Jelto) [13:01:25] volans: sadly not on purpose :( [13:02:54] (03CR) 10Effie Mouzeli: [C: 03+2] Bump namespace limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/726580 (https://phabricator.wikimedia.org/T280497) (owner: 10Effie Mouzeli) [13:03:20] godog: should we call it a lemon then? 
:) [13:04:11] 10SRE, 10Language-Team (Language-2021-October-December): Remove Matxin Key from Production - https://phabricator.wikimedia.org/T292635 (10akosiaris) >>! In T292635#7408218, @KartikMistry wrote: > `cxserver/deploy` repository is deprecated and can be deleted. I need to look what `deployment-prep` is doing here.... [13:04:36] 10SRE, 10Anti-Harassment, 10IP Info, 10serviceops, 10Patch-For-Review: Update MaxMind GeoIP2 license key and product IDs for application servers - https://phabricator.wikimedia.org/T288844 (10phuedx) [13:05:32] !log jiji@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [13:05:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:05:37] !log jiji@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [13:05:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:06:27] !log volans@cumin2002 END (PASS) - Cookbook sre.experimental.reimage (exit_code=0) for host sretest1001.eqiad.wmnet [13:06:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:06:36] volans: (sorry, was in meeting); yeah the h/w on that system is still sad, and I've reopened T290881 [13:06:36] T290881: Spontaneous reboot of ms-be2045 - https://phabricator.wikimedia.org/T290881 [13:06:54] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [13:06:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:07:18] I'll ack the alert [13:07:28] volans: heh, I hope we can get it repaired under warranty [13:07:36] (03PS6) 10Jbond: icinga: add recheck_failed_services function [software/spicerack] - 10https://gerrit.wikimedia.org/r/724759 [13:07:42] (03CR) 10Jbond: "update thanks" [software/spicerack] - 10https://gerrit.wikimedia.org/r/724759 (owner: 10Jbond) [13:08:05] (03CR) 10Zabe: jobqueue: Batch jobs that will end up in the default queue (031 comment) [core] (wmf/1.38.0-wmf.2) - 10https://gerrit.wikimedia.org/r/727187 
(https://phabricator.wikimedia.org/T292048) (owner: 10Ladsgroup) [13:08:17] (03Merged) 10jenkins-bot: Bump namespace limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/726580 (https://phabricator.wikimedia.org/T280497) (owner: 10Effie Mouzeli) [13:08:31] ACKNOWLEDGEMENT - SSH on ms-be2045 is CRITICAL: CRITICAL - Socket timeout after 10 seconds MVernon h/w still not right, back to DC folk for work - T290881 https://wikitech.wikimedia.org/wiki/SSH/monitoring [13:08:31] ACKNOWLEDGEMENT - Host ms-be2045 is DOWN: PING CRITICAL - Packet loss = 100% MVernon h/w still not right, back to DC folk for work - T290881 [13:09:14] !log jiji@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [13:09:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:10:21] !log jiji@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [13:10:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:10:26] (03PS1) 10Alexandros Kosiaris: Remove the old redundant passwords::cxserver class [labs/private] - 10https://gerrit.wikimedia.org/r/727337 (https://phabricator.wikimedia.org/T292635) [13:10:51] 10SRE, 10Language-Team (Language-2021-October-December), 10Patch-For-Review: Remove Matxin Key from Production - https://phabricator.wikimedia.org/T292635 (10akosiaris) >>! In T292635#7407541, @Nikerabbit wrote: > For context: > * We found `Could not parse cxserver.conf.mt.Matxin` from cxserver logs This pr... [13:11:22] (03PS7) 10Jbond: icinga: add recheck_failed_services function [software/spicerack] - 10https://gerrit.wikimedia.org/r/724759 [13:11:34] (03PS4) 10David Caro: base::environment: use only vars inside ::realm ifs [puppet] - 10https://gerrit.wikimedia.org/r/725302 [13:11:36] (03CR) 10David Caro: base::environment: use only vars inside ::realm ifs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/725302 (owner: 10David Caro) [13:12:02] (03CR) 10Volans: [C: 03+1] "LGTM! 
thanks for the patch" [software/spicerack] - 10https://gerrit.wikimedia.org/r/724759 (owner: 10Jbond) [13:12:38] (03CR) 10Herron: [C: 03+1] o11y: port alertmanager alerts (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/724761 (https://phabricator.wikimedia.org/T288726) (owner: 10Filippo Giunchedi) [13:13:51] !log jiji@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [13:13:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:23] !log jiji@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [13:14:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:31] !log Upgraded CI Jenkins on contint2001 [13:14:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:46] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:15:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:16:21] 10SRE, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: Q1:(Need By: TBD) rack/setup (4) fundraising hosts - https://phabricator.wikimedia.org/T289812 (10Cmjohnson) [13:19:37] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [13:19:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:04] (03CR) 10Ladsgroup: jobqueue: Batch jobs that will end up in the default queue (031 comment) [core] (wmf/1.38.0-wmf.2) - 10https://gerrit.wikimedia.org/r/727187 (https://phabricator.wikimedia.org/T292048) (owner: 10Ladsgroup) [13:20:39] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:(Need By: TBD) rack/setup/install kubernetes10[18-21] - https://phabricator.wikimedia.org/T290202 (10Cmjohnson) [13:21:12] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q2: (Need By: TBD) rack/setup/install kubestage100[34].eqiad.wmnet - https://phabricator.wikimedia.org/T290894 (10Cmjohnson) [13:22:00] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:(Need By: TBD) rack/setup/install 
cloudmetrics100[34].eqiad.wmnet - https://phabricator.wikimedia.org/T289888 (10Cmjohnson) [13:23:54] T292687: Create project tag for me User-Juan90264 | https://phabricator.wikimedia.org/T292687 [13:23:55] T292687: Create project tag for me User-Juan90264 - https://phabricator.wikimedia.org/T292687 [13:24:45] (03CR) 10Jbond: [C: 03+2] icinga: add recheck_failed_services function [software/spicerack] - 10https://gerrit.wikimedia.org/r/724759 (owner: 10Jbond) [13:26:29] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence-Backup: Degraded RAID on backup1002 - https://phabricator.wikimedia.org/T292329 (10Cmjohnson) Opened a new ticket with Dell for the disk shelf, You have successfully submitted request SR1072235267. [13:26:51] (03PS5) 10David Caro: base::environment: use only vars inside ::realm ifs [puppet] - 10https://gerrit.wikimedia.org/r/725302 [13:27:07] 10SRE, 10Language-Team (Language-2021-October-December), 10Patch-For-Review: Remove Matxin Key from Production - https://phabricator.wikimedia.org/T292635 (10KartikMistry) Thanks @akosiaris [13:28:06] (03CR) 10Herron: [C: 03+1] statsite: log instance identifier [puppet] - 10https://gerrit.wikimedia.org/r/727295 (https://phabricator.wikimedia.org/T247963) (owner: 10Filippo Giunchedi) [13:28:41] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] Remove the old redundant passwords::cxserver class [labs/private] - 10https://gerrit.wikimedia.org/r/727337 (https://phabricator.wikimedia.org/T292635) (owner: 10Alexandros Kosiaris) [13:29:07] (03CR) 10Herron: [C: 03+1] graphite: set settings_module from uwsgi [puppet] - 10https://gerrit.wikimedia.org/r/727294 (https://phabricator.wikimedia.org/T247963) (owner: 10Filippo Giunchedi) [13:29:20] !log restarting CI Jenkins for git plugin update [13:29:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:29:36] !log jiji@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'. 
[13:29:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:09] (03CR) 10Herron: [C: 03+1] statsite: switch to python3 on Bullseye [puppet] - 10https://gerrit.wikimedia.org/r/727293 (https://phabricator.wikimedia.org/T247963) (owner: 10Filippo Giunchedi) [13:30:38] !log jiji@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'. [13:30:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:07] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:39:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:37] !log jiji@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'. [13:46:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:57] 10SRE, 10Infrastructure-Foundations, 10Traffic-Icebox: Anycast: Add IPv6 support to bird and anycast-healthchecker (Puppet) - https://phabricator.wikimedia.org/T292737 (10ssingh) [13:51:02] (03PS1) 10Btullis: Increase the maximum renewable lifetime of a Kerberos ticket [puppet] - 10https://gerrit.wikimedia.org/r/727349 (https://phabricator.wikimedia.org/T268985) [13:51:14] 10SRE, 10Infrastructure-Foundations, 10Traffic: Anycast: Add IPv6 support to bird and anycast-healthchecker (Puppet) - https://phabricator.wikimedia.org/T292737 (10ssingh) [13:52:21] (03CR) 10Filippo Giunchedi: [C: 03+2] statsite: switch to python3 on Bullseye [puppet] - 10https://gerrit.wikimedia.org/r/727293 (https://phabricator.wikimedia.org/T247963) (owner: 10Filippo Giunchedi) [13:54:36] (03CR) 10Filippo Giunchedi: [C: 03+2] graphite: set settings_module from uwsgi [puppet] - 10https://gerrit.wikimedia.org/r/727294 (https://phabricator.wikimedia.org/T247963) (owner: 10Filippo Giunchedi) [13:56:57] !log jiji@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. 
[13:57:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:00] (03PS4) 10Alexandros Kosiaris: ganeti: Run a monthly cluster rebalancing [puppet] - 10https://gerrit.wikimedia.org/r/725779 [13:58:11] (03CR) 10Alexandros Kosiaris: ganeti: Run a monthly cluster rebalancing (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/725779 (owner: 10Alexandros Kosiaris) [14:01:45] 10SRE, 10Infrastructure-Foundations, 10Traffic: Anycast: Add IPv6 support to bird and anycast-healthchecker (Puppet) - https://phabricator.wikimedia.org/T292737 (10ssingh) [14:01:51] 10SRE, 10Traffic, 10Patch-For-Review: Deploy Wikidough: Experimental DNS-over-HTTPS (DoH) public resolver - https://phabricator.wikimedia.org/T252132 (10ssingh) [14:02:30] (03CR) 10jerkins-bot: [V: 04-1] PageSlotDiffRendererTest::testGetDiff: Skip as new wikidiff2 breaks this test [extensions/ProofreadPage] (wmf/1.38.0-wmf.3) - 10https://gerrit.wikimedia.org/r/726959 (https://phabricator.wikimedia.org/T292676) (owner: 10Jforrester) [14:04:15] (03CR) 10Ladsgroup: [C: 03+2] ".." 
[extensions/ProofreadPage] (wmf/1.38.0-wmf.3) - 10https://gerrit.wikimedia.org/r/726959 (https://phabricator.wikimedia.org/T292676) (owner: 10Jforrester) [14:04:29] (03CR) 10Effie Mouzeli: [C: 03+2] mwdebug: tune memory, cpu and apcu size [deployment-charts] - 10https://gerrit.wikimedia.org/r/727279 (https://phabricator.wikimedia.org/T280497) (owner: 10Effie Mouzeli) [14:04:46] (03PS1) 10Elukey: Add extra include search path to {CPP,C,CXX,FORTRAN}FLAGS [debs/anaconda-wmf] (debian) - 10https://gerrit.wikimedia.org/r/727352 (https://phabricator.wikimedia.org/T292699) [14:08:50] (03Merged) 10jenkins-bot: mwdebug: tune memory, cpu and apcu size [deployment-charts] - 10https://gerrit.wikimedia.org/r/727279 (https://phabricator.wikimedia.org/T280497) (owner: 10Effie Mouzeli) [14:09:55] 10SRE, 10LDAP-Access-Requests: Add Deniz Erdogan to the ldap/wmde and ldap/nda group - https://phabricator.wikimedia.org/T292301 (10Kormat) >>! In T292301#7407832, @conny-kawohl_WMDE wrote: > @Kormat I am @Deniz_WMDE's manager and approve this request. Perfect, thanks! [14:13:56] (03PS1) 10Ssingh: bird: add IPv6 support to bird and anycast-healthchecker [puppet] - 10https://gerrit.wikimedia.org/r/727355 (https://phabricator.wikimedia.org/T292737) [14:15:37] 10SRE, 10SRE-swift-storage, 10ops-codfw: Spontaneous reboot of ms-be2045 - https://phabricator.wikimedia.org/T290881 (10Papaul) @fgiunchedi I will re-open the case with Dell. Thanks [14:16:15] (03CR) 10jerkins-bot: [V: 04-1] bird: add IPv6 support to bird and anycast-healthchecker [puppet] - 10https://gerrit.wikimedia.org/r/727355 (https://phabricator.wikimedia.org/T292737) (owner: 10Ssingh) [14:16:59] jbond: ^ (once you are done with the meeting!) 
[14:19:47] (03PS1) 10Kormat: admin: Add deerbee to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/727356 (https://phabricator.wikimedia.org/T292301) [14:20:26] (03CR) 10Hashar: [C: 03+2] Gerrit v3.3.6 and rebuild plugins [software/gerrit] (deploy/wmf/stable-3.3) - 10https://gerrit.wikimedia.org/r/716317 (https://phabricator.wikimedia.org/T290236) (owner: 10Hashar) [14:20:44] (03Merged) 10jenkins-bot: Gerrit v3.3.6 and rebuild plugins [software/gerrit] (deploy/wmf/stable-3.3) - 10https://gerrit.wikimedia.org/r/716317 (https://phabricator.wikimedia.org/T290236) (owner: 10Hashar) [14:20:51] (03CR) 10Kormat: [C: 03+2] admin: Add deerbee to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/727356 (https://phabricator.wikimedia.org/T292301) (owner: 10Kormat) [14:24:20] (03Merged) 10jenkins-bot: PageSlotDiffRendererTest::testGetDiff: Skip as new wikidiff2 breaks this test [extensions/ProofreadPage] (wmf/1.38.0-wmf.3) - 10https://gerrit.wikimedia.org/r/726959 (https://phabricator.wikimedia.org/T292676) (owner: 10Jforrester) [14:25:18] (03CR) 10Ottomata: Add extra include search path to {CPP,C,CXX,FORTRAN}FLAGS (031 comment) [debs/anaconda-wmf] (debian) - 10https://gerrit.wikimedia.org/r/727352 (https://phabricator.wikimedia.org/T292699) (owner: 10Elukey) [14:25:44] Amir1: the gate-and-submit pipeline for https://gerrit.wikimedia.org/r/c/mediawiki/core/+/726961 did not get triggered [14:26:01] (03CR) 10Ladsgroup: [C: 03+2] "." 
[core] (wmf/1.38.0-wmf.3) - 10https://gerrit.wikimedia.org/r/726961 (https://phabricator.wikimedia.org/T292676) (owner: 10Jforrester) [14:26:12] tried again [14:26:22] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Add Deniz Erdogan to the ldap/wmde and ldap/nda group - https://phabricator.wikimedia.org/T292301 (10Kormat) [14:26:42] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] Add rsyslog image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/725005 (https://phabricator.wikimedia.org/T288851) (owner: 10Giuseppe Lavagetto) [14:28:15] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Add Deniz Erdogan to the ldap/wmde and ldap/nda group - https://phabricator.wikimedia.org/T292301 (10Kormat) 05Open→03Resolved a:03Kormat Hi @Deniz_WMDE, this is now done. You can confirm it by searching for `deerbee` on https://contact.toolforge.org/... [14:28:25] rebased the proofread patch. There is no need to deploy [14:31:53] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [14:31:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:09] https://phabricator.wikimedia.org/T292687 [14:33:27] (03PS1) 10Hashar: Revert "icinga: add qchris to contactgroup for gerrit" [puppet] - 10https://gerrit.wikimedia.org/r/727200 [14:33:58] (03PS2) 10Hashar: Revert "icinga: add qchris to contactgroup for gerrit" [puppet] - 10https://gerrit.wikimedia.org/r/727200 [14:34:25] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[14:34:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:24] (03PS1) 10Volans: sre.experimental.reimage: move Puppet logs [cookbooks] - 10https://gerrit.wikimedia.org/r/727357 [14:38:27] (03PS1) 10Hashar: gerrit: add gerrit as a contact group [puppet] - 10https://gerrit.wikimedia.org/r/727358 [14:38:59] (03CR) 10David Caro: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31538/console" [puppet] - 10https://gerrit.wikimedia.org/r/725302 (owner: 10David Caro) [14:39:50] (03CR) 10Zabe: "Could it be that the gate-and-submit job is not going to start, until all three cherry-picks at https://gerrit.wikimedia.org/r/q/If4bc60c3" [core] (wmf/1.38.0-wmf.3) - 10https://gerrit.wikimedia.org/r/726961 (https://phabricator.wikimedia.org/T292676) (owner: 10Jforrester) [14:40:38] (03CR) 10David Caro: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31539/console" [puppet] - 10https://gerrit.wikimedia.org/r/725302 (owner: 10David Caro) [14:40:58] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] "Force merging as there is a circular dependency between broken tests..." [extensions/ProofreadPage] (wmf/1.38.0-wmf.2) - 10https://gerrit.wikimedia.org/r/726960 (https://phabricator.wikimedia.org/T292676) (owner: 10Jforrester) [14:41:38] (03CR) 10David Caro: [V: 03+1] "All the changes in pcc are expected (adding ensure-> present, changing from 0755 to 0444, and some tabs in the command of the exec)." 
[puppet] - 10https://gerrit.wikimedia.org/r/725302 (owner: 10David Caro) [14:42:27] (03CR) 10Ladsgroup: [C: 03+2] api-testing: Adjust DiffCompare expected outcome to cope with new wikidiff2 output (031 comment) [core] (wmf/1.38.0-wmf.3) - 10https://gerrit.wikimedia.org/r/726961 (https://phabricator.wikimedia.org/T292676) (owner: 10Jforrester) [14:42:36] (03CR) 10jerkins-bot: [V: 04-1] sre.experimental.reimage: move Puppet logs [cookbooks] - 10https://gerrit.wikimedia.org/r/727357 (owner: 10Volans) [14:44:48] 10SRE, 10Release-Engineering-Team: Add Ahmon and Brennen to Icinga contact list - https://phabricator.wikimedia.org/T292753 (10hashar) [14:45:39] (03PS3) 10Giuseppe Lavagetto: mediawiki: Add rsyslog sidecar [deployment-charts] - 10https://gerrit.wikimedia.org/r/725892 (https://phabricator.wikimedia.org/T288851) [14:45:55] (03CR) 10Hashar: "My understanding is that we need our group to be added at the role level in order for us to be able to acknowledge alarms." [puppet] - 10https://gerrit.wikimedia.org/r/727358 (owner: 10Hashar) [14:48:01] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[14:48:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:54] !log Upgrading Gerrit replica to 3.3.6 # T290236 [14:48:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:59] T290236: Upgrade Gerrit from 3.3.5 to 3.3.6 - https://phabricator.wikimedia.org/T290236 [14:49:00] (03PS2) 10Volans: sre.experimental.reimage: move Puppet logs [cookbooks] - 10https://gerrit.wikimedia.org/r/727357 [14:49:28] !log hashar@deploy1002 Started deploy [gerrit/gerrit@13cef9f]: Gerrit to 3.3.6 on gerrit2001 [14:49:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:38] !log hashar@deploy1002 Finished deploy [gerrit/gerrit@13cef9f]: Gerrit to 3.3.6 on gerrit2001 (duration: 00m 10s) [14:49:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:41] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [14:50:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:13] (03PS2) 10Herron: warn on idle mtail instances [alerts] - 10https://gerrit.wikimedia.org/r/724827 (https://phabricator.wikimedia.org/T292051) [14:55:23] (03CR) 10jerkins-bot: [V: 04-1] warn on idle mtail instances [alerts] - 10https://gerrit.wikimedia.org/r/724827 (https://phabricator.wikimedia.org/T292051) (owner: 10Herron) [14:57:24] (03PS1) 10Jbond: apt: fix spec test post stdlib 8.1.0 upgrade [puppet] - 10https://gerrit.wikimedia.org/r/727362 [15:03:23] (03Merged) 10jenkins-bot: api-testing: Adjust DiffCompare expected outcome to cope with new wikidiff2 output [core] (wmf/1.38.0-wmf.3) - 10https://gerrit.wikimedia.org/r/726961 (https://phabricator.wikimedia.org/T292676) (owner: 10Jforrester) [15:03:29] Amir1: Thanks for dealing with the circular dep. :-( [15:03:46] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good!" 
[puppet] - 10https://gerrit.wikimedia.org/r/725779 (owner: 10Alexandros Kosiaris) [15:03:50] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:03:53] Amir1: You'll want https://gerrit.wikimedia.org/r/c/mediawiki/vendor/+/726963 as well to get things back to verified. [15:04:04] RECOVERY - Host ms-be2045 is UP: PING OK - Packet loss = 0%, RTA = 31.58 ms [15:04:55] James_F: sure, on it. In a meeting atm [15:05:12] Ditto. [15:06:25] I am going to upgrade Gerrit primary, it will be down for a few minutes [15:06:34] (03CR) 10Ladsgroup: [C: 03+2] "let's give it a try" [vendor] (wmf/1.38.0-wmf.2) - 10https://gerrit.wikimedia.org/r/726963 (https://phabricator.wikimedia.org/T266421) (owner: 10Jforrester) [15:06:36] (03PS2) 10Jforrester: jobqueue: Batch jobs that will end up in the default queue [core] (wmf/1.38.0-wmf.3) - 10https://gerrit.wikimedia.org/r/727186 (https://phabricator.wikimedia.org/T292048) (owner: 10Ladsgroup) [15:07:43] !log hashar@deploy1002 Started deploy [gerrit/gerrit@13cef9f]: Gerrit to 3.3.6 on gerrit1001 [15:07:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:07:51] !log hashar@deploy1002 Finished deploy [gerrit/gerrit@13cef9f]: Gerrit to 3.3.6 on gerrit1001 (duration: 00m 08s) [15:07:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:11:29] (03CR) 10Zabe: "https://gerrit.wikimedia.org/r/c/mediawiki/core/+/726962 needs to be merged first." 
[vendor] (wmf/1.38.0-wmf.2) - 10https://gerrit.wikimedia.org/r/726963 (https://phabricator.wikimedia.org/T266421) (owner: 10Jforrester) [15:11:51] (03CR) 10jerkins-bot: [V: 04-1] jobqueue: Batch jobs that will end up in the default queue [core] (wmf/1.38.0-wmf.3) - 10https://gerrit.wikimedia.org/r/727186 (https://phabricator.wikimedia.org/T292048) (owner: 10Ladsgroup) [15:12:32] (03CR) 10Hashar: "recheck after Gerrit got restarted." [core] (wmf/1.38.0-wmf.3) - 10https://gerrit.wikimedia.org/r/727186 (https://phabricator.wikimedia.org/T292048) (owner: 10Ladsgroup) [15:13:40] Gerrit upgraded to 3.3.6! [15:13:44] \o/ [15:17:16] nice! [15:17:47] (03CR) 10Jbond: [C: 03+2] apt: fix spec test post stdlib 8.1.0 upgrade [puppet] - 10https://gerrit.wikimedia.org/r/727362 (owner: 10Jbond) [15:18:28] (03PS2) 10Jbond: bird: add IPv6 support to bird and anycast-healthchecker [puppet] - 10https://gerrit.wikimedia.org/r/727355 (https://phabricator.wikimedia.org/T292737) (owner: 10Ssingh) [15:19:39] (03CR) 10Ladsgroup: [C: 03+2] api-testing: Adjust DiffCompare expected outcome to cope with new wikidiff2 output [core] (wmf/1.38.0-wmf.2) - 10https://gerrit.wikimedia.org/r/726962 (https://phabricator.wikimedia.org/T292676) (owner: 10Jforrester) [15:21:08] jbond: thanks, that worked! 
[15:22:24] (03PS3) 10Ssingh: bird: add IPv6 support to bird and anycast-healthchecker [puppet] - 10https://gerrit.wikimedia.org/r/727355 (https://phabricator.wikimedia.org/T292737) [15:22:34] (03CR) 10Zabe: Allow composer-plugin-api ^2.0 for migration (031 comment) [vendor] (wmf/1.38.0-wmf.2) - 10https://gerrit.wikimedia.org/r/726963 (https://phabricator.wikimedia.org/T266421) (owner: 10Jforrester) [15:23:37] (03CR) 10Cwhite: [C: 03+1] o11y: port alertmanager alerts (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/724761 (https://phabricator.wikimedia.org/T288726) (owner: 10Filippo Giunchedi) [15:23:52] James_F: looks like there is also a circular dependency between https://gerrit.wikimedia.org/r/c/mediawiki/vendor/+/726963 and https://gerrit.wikimedia.org/r/c/mediawiki/core/+/726962 [15:26:00] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31541/console" [puppet] - 10https://gerrit.wikimedia.org/r/727355 (https://phabricator.wikimedia.org/T292737) (owner: 10Ssingh) [15:26:09] (03CR) 10jerkins-bot: [V: 04-1] api-testing: Adjust DiffCompare expected outcome to cope with new wikidiff2 output [core] (wmf/1.38.0-wmf.2) - 10https://gerrit.wikimedia.org/r/726962 (https://phabricator.wikimedia.org/T292676) (owner: 10Jforrester) [15:26:51] (03CR) 10jerkins-bot: [V: 04-1] Allow composer-plugin-api ^2.0 for migration [vendor] (wmf/1.38.0-wmf.2) - 10https://gerrit.wikimedia.org/r/726963 (https://phabricator.wikimedia.org/T266421) (owner: 10Jforrester) [15:27:14] :(((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((( [15:27:32] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] "Force merging" [vendor] (wmf/1.38.0-wmf.2) - 10https://gerrit.wikimedia.org/r/726963 (https://phabricator.wikimedia.org/T266421) (owner: 10Jforrester) [15:28:00] (03PS6) 10David Caro: base::environment: use only vars inside ::realm ifs [puppet] - 10https://gerrit.wikimedia.org/r/725302 
[15:28:02] (03PS1) 10David Caro: base::environment: add profile::environment and parametrize [puppet] - 10https://gerrit.wikimedia.org/r/727368 [15:28:04] (03PS1) 10David Caro: taskgen: suggest the correct path to put hiera values in [puppet] - 10https://gerrit.wikimedia.org/r/727369 [15:28:44] (03PS5) 10Alexandros Kosiaris: ganeti: Run a monthly cluster rebalancing [puppet] - 10https://gerrit.wikimedia.org/r/725779 [15:29:04] (03PS1) 10Effie Mouzeli: mediawiki::mcrouter_wancache: disable ssl listening on mcrouter [puppet] - 10https://gerrit.wikimedia.org/r/727370 [15:29:14] (03CR) 10Ladsgroup: [C: 03+2] "try again" [core] (wmf/1.38.0-wmf.2) - 10https://gerrit.wikimedia.org/r/726962 (https://phabricator.wikimedia.org/T292676) (owner: 10Jforrester) [15:29:57] (03PS2) 10David Caro: taskgen: suggest the correct path to put hiera values in [puppet] - 10https://gerrit.wikimedia.org/r/727369 [15:29:59] (03CR) 10Dzahn: statistics::web: require_package -> ensure_packages (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/726728 (https://phabricator.wikimedia.org/T266479) (owner: 10Dzahn) [15:30:09] (03Abandoned) 10Dzahn: statistics::web: require_package -> ensure_packages [puppet] - 10https://gerrit.wikimedia.org/r/726728 (https://phabricator.wikimedia.org/T266479) (owner: 10Dzahn) [15:30:40] (03CR) 10jerkins-bot: [V: 04-1] base::environment: add profile::environment and parametrize [puppet] - 10https://gerrit.wikimedia.org/r/727368 (owner: 10David Caro) [15:30:42] (03CR) 10jerkins-bot: [V: 04-1] mediawiki::mcrouter_wancache: disable ssl listening on mcrouter [puppet] - 10https://gerrit.wikimedia.org/r/727370 (owner: 10Effie Mouzeli) [15:30:49] (03CR) 10Alexandros Kosiaris: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31542/console" [puppet] - 10https://gerrit.wikimedia.org/r/725779 (owner: 10Alexandros Kosiaris) [15:32:27] (03PS2) 10Effie Mouzeli: mediawiki::mcrouter_wancache: disable ssl listening on 
mcrouter [puppet] - 10https://gerrit.wikimedia.org/r/727370 [15:33:03] (03CR) 10Jbond: [C: 04-1] "see comments inline" [puppet] - 10https://gerrit.wikimedia.org/r/727355 (https://phabricator.wikimedia.org/T292737) (owner: 10Ssingh) [15:34:23] (03CR) 10Elukey: Add extra include search path to {CPP,C,CXX,FORTRAN}FLAGS (031 comment) [debs/anaconda-wmf] (debian) - 10https://gerrit.wikimedia.org/r/727352 (https://phabricator.wikimedia.org/T292699) (owner: 10Elukey) [15:34:31] (03CR) 10Jbond: [C: 03+1] "lgtm thanks" [puppet] - 10https://gerrit.wikimedia.org/r/727369 (owner: 10David Caro) [15:35:26] 10SRE, 10serviceops: Migrate node-based services in production to node12 - https://phabricator.wikimedia.org/T290750 (10Jdforrester-WMF) [15:39:29] (03CR) 10Ssingh: [V: 03+1] "https://puppet-compiler.wmflabs.org/compiler1002/31543/centrallog1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/727355 (https://phabricator.wikimedia.org/T292737) (owner: 10Ssingh) [15:39:54] (03CR) 10Ssingh: [V: 03+1] bird: add IPv6 support to bird and anycast-healthchecker (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/727355 (https://phabricator.wikimedia.org/T292737) (owner: 10Ssingh) [15:42:24] (03CR) 10Btullis: [C: 03+2] "I'm happy to merge and deploy this today." 
[puppet] - 10https://gerrit.wikimedia.org/r/724497 (https://phabricator.wikimedia.org/T291957) (owner: 10Bearloga) [15:44:06] (03CR) 10David Caro: [C: 03+2] taskgen: suggest the correct path to put hiera values in [puppet] - 10https://gerrit.wikimedia.org/r/727369 (owner: 10David Caro) [15:44:47] (03CR) 10Jbond: [C: 04-1] "see inline" [puppet] - 10https://gerrit.wikimedia.org/r/727368 (owner: 10David Caro) [15:51:28] (03Merged) 10jenkins-bot: api-testing: Adjust DiffCompare expected outcome to cope with new wikidiff2 output [core] (wmf/1.38.0-wmf.2) - 10https://gerrit.wikimedia.org/r/726962 (https://phabricator.wikimedia.org/T292676) (owner: 10Jforrester) [15:51:39] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10serviceops-radar: SVC DNS zonefiles and source of truth - https://phabricator.wikimedia.org/T270071 (10akosiaris) >>! In T270071#7404837, @akosiaris wrote: >>>! In T270071#7404620, @ayounsi wrote: >> About the Ganeti, as I understand it, the issue is that... [15:56:01] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10serviceops-radar: SVC DNS zonefiles and source of truth - https://phabricator.wikimedia.org/T270071 (10Volans) @akosiaris my bad, I didn't understood you meant that there ;) Yeah i think that if we could just rename the ganeti DNS record to be ouside of t... [15:57:04] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10serviceops-radar: SVC DNS zonefiles and source of truth - https://phabricator.wikimedia.org/T270071 (10akosiaris) >>! In T270071#7409199, @Volans wrote: > @akosiaris my bad, I didn't understood you meant that there ;) Yeah i think that if we could just re... 
[15:58:47] (03PS4) 10Ssingh: bird: add IPv6 support to bird and anycast-healthchecker [puppet] - 10https://gerrit.wikimedia.org/r/727355 (https://phabricator.wikimedia.org/T292737) [15:59:36] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31544/console" [puppet] - 10https://gerrit.wikimedia.org/r/727355 (https://phabricator.wikimedia.org/T292737) (owner: 10Ssingh) [16:00:04] jbond and rzl: #bothumor I � Unicode. All rise for Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211007T1600). [16:00:04] No Gerrit patches in the queue for this window AFAICS. [16:00:49] (03CR) 10Ssingh: [V: 03+1] bird: add IPv6 support to bird and anycast-healthchecker (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/727355 (https://phabricator.wikimedia.org/T292737) (owner: 10Ssingh) [16:01:42] (03CR) 10Alexandros Kosiaris: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31545/console" [puppet] - 10https://gerrit.wikimedia.org/r/725779 (owner: 10Alexandros Kosiaris) [16:02:25] (03PS5) 10Ssingh: bird: add IPv6 support to bird and anycast-healthchecker [puppet] - 10https://gerrit.wikimedia.org/r/727355 (https://phabricator.wikimedia.org/T292737) [16:03:12] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31546/console" [puppet] - 10https://gerrit.wikimedia.org/r/727355 (https://phabricator.wikimedia.org/T292737) (owner: 10Ssingh) [16:07:50] (03CR) 10Alexandros Kosiaris: [V: 03+1 C: 03+2] "PCC seems happy, merging. 
Let's see how this works" [puppet] - 10https://gerrit.wikimedia.org/r/725779 (owner: 10Alexandros Kosiaris) [16:08:44] (03PS7) 10David Caro: base::environment: use only vars inside ::realm ifs [puppet] - 10https://gerrit.wikimedia.org/r/725302 [16:08:45] (03PS2) 10David Caro: base::environment: add profile::environment and parametrize [puppet] - 10https://gerrit.wikimedia.org/r/727368 [16:08:48] (03CR) 10David Caro: base::environment: add profile::environment and parametrize (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/727368 (owner: 10David Caro) [16:08:57] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/727235 (https://phabricator.wikimedia.org/T289661) (owner: 10Jbond) [16:11:10] zabe: Correct. [16:11:10] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:11:15] 10SRE, 10LDAP-Access-Requests: Grant Access to (some Superset dashboards) for - https://phabricator.wikimedia.org/T292575 (10Jrbranaa) @LSobanski Approved. [16:11:40] PROBLEM - Check systemd state on sodium is CRITICAL: CRITICAL - degraded: The following units failed: update-tails-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:11:48] PROBLEM - Host mr1-esams.oob is DOWN: PING CRITICAL - Packet loss = 100% [16:11:58] PROBLEM - OSPF status on cr3-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:12:50] PROBLEM - IPv4 ping to eqiad on ripe-atlas-eqiad is CRITICAL: CRITICAL - failed 446 probes of 707 (alerts on 35) - https://atlas.ripe.net/measurements/1790945/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [16:12:57] the Host DOWN looks kind of bad but the "oob" part makes it not that critical.. 
I guess [16:13:00] PROBLEM - Too high an incoming rate of browser-reported Network Error Logging events on alert1001 is CRITICAL: type=tcp.timed_out https://wikitech.wikimedia.org/wiki/Network_monitoring%23NEL_alerts https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 [16:13:07] XioNoX: am I right? [16:13:12] PROBLEM - Host mr1-ulsfo.oob is DOWN: PING CRITICAL - Packet loss = 100% [16:13:37] mutante: yep [16:13:42] ACK, thanks [16:13:45] weird that 2 fail at the same time though [16:13:59] I'm having trouble accessing Icinga, Gerrit, and SSH through bast1003. Anyone else? [16:14:31] can ssh to bast1003 from US west [16:14:41] gerrit wfm [16:15:12] it's possible that one of our transits temporarily failed [16:15:28] that would explain 2 going down on either end, right [16:15:32] OK, thanks. I'm in UK and definitely can't get to bast1003. [16:15:36] My traceroute to host fails. [16:16:24] PROBLEM - Host mr1-codfw.oob is DOWN: PING CRITICAL - Packet loss = 100% [16:16:44] can you share a traceroute? [16:17:15] Sure. [16:17:28] btullis: try bast3005, esams. should be closer to UK anyways? [16:17:54] XioNoX: https://phabricator.wikimedia.org/P17436 [16:18:13] !log ayounsi@cumin1001 START - Cookbook sre.network.cf [16:18:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:21] !log ayounsi@cumin1001 END (FAIL) - Cookbook sre.network.cf (exit_code=99) [16:18:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:36] mutante: Thanks. Can get into bast3005. 
[16:18:41] volans, cdanis ^ failed [16:18:44] !log ayounsi@cumin1001 START - Cookbook sre.network.cf [16:18:44] !log ayounsi@cumin1001 END (FAIL) - Cookbook sre.network.cf (exit_code=99) [16:18:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:19:01] !log ayounsi@cumin2002 START - Cookbook sre.network.cf [16:19:02] !log ayounsi@cumin2002 END (PASS) - Cookbook sre.network.cf (exit_code=0) [16:19:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:19:08] worked from codfw [16:19:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:19:36] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:19:37] James_F: can you try in ~5-10min? the time the change above propagates? [16:20:20] btullis: also see map on https://wikitech.wikimedia.org/wiki/Bastion you can use any [16:20:46] (current issue is known yes?) [16:21:11] yes [16:21:41] btullis: https://people.wikimedia.org/~dzahn/bastion.sh.txt [16:22:13] (03PS1) 10Alexandros Kosiaris: ganeti_rebalance: Normalize calendar entry [puppet] - 10https://gerrit.wikimedia.org/r/727376 [16:22:45] XioNoX: Will do. [16:23:19] PROBLEM - Host mr1-eqsin.oob is DOWN: PING CRITICAL - Packet loss = 100% [16:23:52] I also cannot access bast1003 right now and none of our sites are loading for me. good luck, folks! [16:24:21] (03CR) 10Alexandros Kosiaris: [C: 03+2] ganeti_rebalance: Normalize calendar entry [puppet] - 10https://gerrit.wikimedia.org/r/727376 (owner: 10Alexandros Kosiaris) [16:24:29] XioNoX: Now working again, thanks! [16:24:34] great! 
[16:24:35] RECOVERY - Host mr1-ulsfo.oob is UP: PING OK - Packet loss = 0%, RTA = 72.37 ms [16:24:52] * urbanecm cannot access gerrit [16:24:56] ERR_CONNECTION_TIMED_OUT [16:25:01] `works for me too [16:25:09] (03PS2) 10Jforrester: jobqueue: Batch jobs that will end up in the default queue [core] (wmf/1.38.0-wmf.2) - 10https://gerrit.wikimedia.org/r/727187 (https://phabricator.wikimedia.org/T292048) (owner: 10Ladsgroup) [16:25:32] gerrit is broken for me too [16:25:38] wikis and prod ssh via bast3005 works fine [16:25:48] I can access wikis just fie [16:26:01] Ha, fix one and another breaks. Poor networking team. [16:26:41] looking at twitter, seems like folks across the world are having issues accessing wikipedia [16:27:36] wikis are definitely unreachable from my network [16:27:45] (03CR) 10Jbond: "lgtm couple more nits" [puppet] - 10https://gerrit.wikimedia.org/r/727355 (https://phabricator.wikimedia.org/T292737) (owner: 10Ssingh) [16:28:01] ...were unreachable [16:28:07] wikis fine on US west coast (comcast) [16:28:21] all up US-east comcast [16:28:28] wikis are fine from Czechia, gerrit not [16:28:29] seems to be fixed now though [16:28:31] RECOVERY - Host mr1-esams.oob is UP: PING OK - Packet loss = 0%, RTA = 87.16 ms [16:28:31] (03PS1) 10Volans: dhcp: remove all physical hosts hardcoded config [puppet] - 10https://gerrit.wikimedia.org/r/727387 (https://phabricator.wikimedia.org/T269855) [16:28:35] RECOVERY - Host mr1-eqsin.oob is UP: PING OK - Packet loss = 0%, RTA = 218.65 ms [16:28:51] that looks promising [16:29:06] oh famous last words >.> [16:29:30] (03CR) 10Volans: [C: 04-2] "Not to be merged before next Monday (2021-10-11)" [puppet] - 10https://gerrit.wikimedia.org/r/727387 (https://phabricator.wikimedia.org/T269855) (owner: 10Volans) [16:31:11] (03PS1) 10Btullis: Mark Christina Macholan's account as Kerberos enabled [puppet] - 10https://gerrit.wikimedia.org/r/727388 (https://phabricator.wikimedia.org/T292532) [16:31:33] (03CR) 10jerkins-bot: [V: 
04-1] jobqueue: Batch jobs that will end up in the default queue [core] (wmf/1.38.0-wmf.2) - 10https://gerrit.wikimedia.org/r/727187 (https://phabricator.wikimedia.org/T292048) (owner: 10Ladsgroup) [16:32:19] Gitlab went down but back now [16:32:25] Gerrit / phab was fine [16:32:29] Never tried wikis [16:32:29] (03PS4) 10Jbond: base::auto_restarts: move to profile and make debdeploy support optional [puppet] - 10https://gerrit.wikimedia.org/r/727235 (https://phabricator.wikimedia.org/T289661) [16:32:37] (03CR) 10Ladsgroup: "recheck" [core] (wmf/1.38.0-wmf.2) - 10https://gerrit.wikimedia.org/r/727187 (https://phabricator.wikimedia.org/T292048) (owner: 10Ladsgroup) [16:32:54] !log roll restarting maps cassandra instances for java updates [16:32:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:33:45] gerrit is up now [16:33:54] wikis are also up [16:35:11] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:35:14] (03CR) 10Jbond: [C: 03+2] base::auto_restarts: move to profile and make debdeploy support optional [puppet] - 10https://gerrit.wikimedia.org/r/727235 (https://phabricator.wikimedia.org/T289661) (owner: 10Jbond) [16:38:57] Fyi get drop for eqiad fired [16:39:07] text@ [16:39:43] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:41:11] (03PS1) 10Jbond: P:debdeploy: include debdeploy client not debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/727406 [16:41:30] (03CR) 10Jbond: [V: 03+2 C: 03+2] "override to fix issue" [puppet] - 10https://gerrit.wikimedia.org/r/727406 (owner: 10Jbond) [16:44:05] RECOVERY - Host mr1-codfw.oob is UP: PING OK - Packet loss = 0%, RTA = 48.61 ms [16:44:44] 10SRE: High loading times on no.wikipedia - 
https://phabricator.wikimedia.org/T292762 (10jhsoby) [16:44:44] (03PS1) 10Jbond: debdeploy: add force to directory managment to ensure removal [puppet] - 10https://gerrit.wikimedia.org/r/727407 [16:44:45] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:46:12] mutante: Many thanks. Will update my ssh config. That script almost works. [16:46:41] (03CR) 10Jbond: [C: 03+2] debdeploy: add force to directory managment to ensure removal [puppet] - 10https://gerrit.wikimedia.org/r/727407 (owner: 10Jbond) [16:47:13] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on gitlab2001.wikimedia.org with reason: reimage [16:47:15] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gitlab2001.wikimedia.org with reason: reimage [16:47:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:47:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:49:16] (03CR) 10Ladsgroup: "recheck" [core] (wmf/1.38.0-wmf.2) - 10https://gerrit.wikimedia.org/r/727187 (https://phabricator.wikimedia.org/T292048) (owner: 10Ladsgroup) [16:51:12] (03PS1) 10Volans: sre.experimental.reimage: remove legacy code [cookbooks] - 10https://gerrit.wikimedia.org/r/727411 (https://phabricator.wikimedia.org/T269855) [16:51:14] (03PS1) 10Volans: sre.hosts.reimage: renamed from experimental [cookbooks] - 10https://gerrit.wikimedia.org/r/727412 (https://phabricator.wikimedia.org/T269855) [16:51:46] (03CR) 10Volans: [C: 04-2] "Do not merge before next Monday (2021-10-11)" [cookbooks] - 10https://gerrit.wikimedia.org/r/727411 (https://phabricator.wikimedia.org/T269855) (owner: 10Volans) [16:53:44] jbond: rzl: hello, could I get https://gerrit.wikimedia.org/r/c/operations/puppet/+/725264 puppet-deployed please? 
🙂 [16:54:11] (has only a +1 from another Growth member, not from a SRE; happy to get one if you're not comfortable with deployment without it) [16:54:25] (03CR) 10Jbond: [C: 03+1] "lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/727357 (owner: 10Volans) [16:54:32] urbanecm: looking [16:54:48] (03CR) 10Ssingh: [V: 03+1] bird: add IPv6 support to bird and anycast-healthchecker (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/727355 (https://phabricator.wikimedia.org/T292737) (owner: 10Ssingh) [16:55:25] (03PS6) 10Ssingh: bird: add IPv6 support to bird and anycast-healthchecker [puppet] - 10https://gerrit.wikimedia.org/r/727355 (https://phabricator.wikimedia.org/T292737) [16:56:02] !log down timing gitlab2001 for re-imaging (T283076) [16:56:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:56:14] T283076: Puppetise gitlab-ansible playbook - https://phabricator.wikimedia.org/T283076 [16:56:19] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31548/console" [puppet] - 10https://gerrit.wikimedia.org/r/727355 (https://phabricator.wikimedia.org/T292737) (owner: 10Ssingh) [16:56:22] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/725264 (https://phabricator.wikimedia.org/T290609) (owner: 10Urbanecm) [16:56:27] (03CR) 10Jbond: [C: 03+2] growthexperiments: Run updateMenteeData.php in parallel [puppet] - 10https://gerrit.wikimedia.org/r/725264 (https://phabricator.wikimedia.org/T290609) (owner: 10Urbanecm) [16:56:29] (03PS1) 10Volans: cumin: remove wmf-auto-reimage scripts [puppet] - 10https://gerrit.wikimedia.org/r/727415 (https://phabricator.wikimedia.org/T269855) [16:56:43] (03CR) 10Volans: [C: 03+2] sre.experimental.reimage: move Puppet logs [cookbooks] - 10https://gerrit.wikimedia.org/r/727357 (owner: 10Volans) [16:57:11] urbanecm: merged, fyi feel free to just add me as a review when you send the patch [16:57:17] (03CR) 
10Volans: [C: 04-2] "Do not merge before next Monday (2021-10-11)" [puppet] - 10https://gerrit.wikimedia.org/r/727415 (https://phabricator.wikimedia.org/T269855) (owner: 10Volans) [16:57:22] will do, thanks jbond [16:57:25] np [16:57:32] jbond: can you also run puppet at mwmaint1002 please? [16:57:45] yes one sec [16:58:05] RECOVERY - IPv4 ping to eqiad on ripe-atlas-eqiad is OK: OK - failed 35 probes of 707 (alerts on 35) - https://atlas.ripe.net/measurements/1790945/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [16:58:47] (03CR) 10Ssingh: [V: 03+1] "Trying with do_ipv6 = true." [puppet] - 10https://gerrit.wikimedia.org/r/727355 (https://phabricator.wikimedia.org/T292737) (owner: 10Ssingh) [16:59:16] (03PS7) 10Ssingh: bird: add IPv6 support to bird and anycast-healthchecker [puppet] - 10https://gerrit.wikimedia.org/r/727355 (https://phabricator.wikimedia.org/T292737) [16:59:36] urbanecm: do you want a copy of the output? (its quite big) [17:00:03] (03Merged) 10jenkins-bot: sre.experimental.reimage: move Puppet logs [cookbooks] - 10https://gerrit.wikimedia.org/r/727357 (owner: 10Volans) [17:00:05] chrisalbon and accraze: My dear minions, it's time we take the moon! Just kidding. Time for Services – Graphoid / ORES deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211007T1700). [17:00:13] jbond: not needed, thanks :). 
[17:00:34] ack well its run now :) [17:00:41] and all looked sane [17:00:45] thanks [17:00:49] np [17:00:51] 10SRE: High loading times on no.wikipedia - https://phabricator.wikimedia.org/T292762 (10Zabe) I read something similar at dewiki: https://de.wikipedia.org/wiki/Wikipedia:Fragen_zur_Wikipedia#Warum_wird_der_Zugriff_auf_WP_immer_langsamer%3F [17:00:56] (03PS3) 10Sharvaniharan: Stream config changes for android_daily_stats schema [mediawiki-config] - 10https://gerrit.wikimedia.org/r/722970 (https://phabricator.wikimedia.org/T286000) [17:01:03] (03PS8) 10Ssingh: bird: add IPv6 support to bird and anycast-healthchecker [puppet] - 10https://gerrit.wikimedia.org/r/727355 (https://phabricator.wikimedia.org/T292737) [17:03:15] (03PS9) 10Ssingh: bird: add IPv6 support to bird and anycast-healthchecker [puppet] - 10https://gerrit.wikimedia.org/r/727355 (https://phabricator.wikimedia.org/T292737) [17:03:28] 10SRE, 10Performance Issue: High loading times on no.wikipedia - https://phabricator.wikimedia.org/T292762 (10AntiCompositeNumber) Reproduced by loading diffs from Special:RecentChanges. Even with client-side caching disabled, only the first load was slow, subsequent reloads were fast. The slow page load appea... [17:05:26] (03CR) 10Jdlrobson: [C: 03+1] "Looks ready. 
Just needs to be scheduled for deploy via https://wikitech.wikimedia.org/wiki/Deployments" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704170 (https://phabricator.wikimedia.org/T286133) (owner: 10Juan90264) [17:06:21] RECOVERY - Check systemd state on sodium is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:07:39] (03CR) 10Volans: [C: 04-2] "Do not merge before next Monday (2021-10-11)" [cookbooks] - 10https://gerrit.wikimedia.org/r/727412 (https://phabricator.wikimedia.org/T269855) (owner: 10Volans) [17:08:52] 10SRE, 10Traffic, 10SRE Observability (FY2021/2022-Q2): Investigate cp5006 crash - https://phabricator.wikimedia.org/T292506 (10herron) FWIW I do see syslog entries had arrived to `centrallog1001:/srv/syslog/cp5006/syslog.log-2021100[34].gz` between Oct 2 17:36:22 and Oct 3 at 08:26:37 which is good news in... [17:09:03] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:09:16] (03PS10) 10Ssingh: bird: add IPv6 support to bird and anycast-healthchecker [puppet] - 10https://gerrit.wikimedia.org/r/727355 (https://phabricator.wikimedia.org/T292737) [17:10:07] RECOVERY - OSPF status on cr3-esams is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:13:31] 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: Q2:(Need By: TBD) rack/setup/install civi1002.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T292767 (10RobH) [17:14:11] 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: Q2:(Need By: TBD) rack/setup/install civi1002.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T292767 (10RobH) [17:14:42] (03CR) 10Ssingh: "[not ready for review]" [puppet] - 10https://gerrit.wikimedia.org/r/727355 (https://phabricator.wikimedia.org/T292737) (owner: 10Ssingh) [17:14:49] 10SRE, 10SRE Observability (FY2021/2022-Q2): rsyslog error: queue 
directory '/var/spool/rsyslog' and file name prefix 'output_kafka_json' already used - https://phabricator.wikimedia.org/T292180 (10herron) [17:15:20] 10SRE, 10SRE Observability (FY2021/2022-Q2): rsyslog errors about duplicate module includes - https://phabricator.wikimedia.org/T292175 (10herron) [17:16:21] RECOVERY - Too high an incoming rate of browser-reported Network Error Logging events on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Network_monitoring%23NEL_alerts https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 [17:23:47] 10SRE, 10Analytics, 10Event-Platform, 10Observability-Logging, and 2 others: Integrate Event Platform and ECS logs - https://phabricator.wikimedia.org/T291645 (10herron) [17:25:59] 10SRE, 10SRE Observability: Grafana share button drops duplicate URL params - https://phabricator.wikimedia.org/T292606 (10herron) [17:27:45] (03PS1) 10Urbanecm: shwiki: Deploy Growth features to newcomers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/727431 (https://phabricator.wikimedia.org/T278240) [17:27:47] (03PS1) 10Urbanecm: Deploy Growth features to test2wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/727432 [17:29:55] 10SRE, 10SRE Observability: Develop tooling for quickly parsing 5xx and sampled-1000 logs - https://phabricator.wikimedia.org/T292682 (10herron) p:05Triage→03Medium [17:30:04] !log rebooting gitlab2001.wikimedia.org [17:30:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:30:26] (03PS11) 10Ssingh: bird: add IPv6 support to bird and anycast-healthchecker [puppet] - 10https://gerrit.wikimedia.org/r/727355 (https://phabricator.wikimedia.org/T292737) [17:33:02] (03PS3) 10Urbanecm: growthexperiments: Remove absented systemd job [puppet] - 10https://gerrit.wikimedia.org/r/725286 (https://phabricator.wikimedia.org/T290609) [17:34:02] 10SRE, 10SRE-swift-storage, 10ops-codfw: Spontaneous reboot of ms-be2045 - 
https://phabricator.wikimedia.org/T290881 (10Papaul) The first crash was because of the PERC ( this is what appeared to crash). and it was showing in the IDRAC log. The second crash after the re-image is about the network card as @f... [17:34:45] PROBLEM - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is CRITICAL: 102 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [17:35:15] (03PS1) 10Btullis: Correct file resource to create directories [puppet] - 10https://gerrit.wikimedia.org/r/727434 (https://phabricator.wikimedia.org/T291957) [17:36:04] (03CR) 10Btullis: [C: 03+2] Correct file resource to create directories [puppet] - 10https://gerrit.wikimedia.org/r/727434 (https://phabricator.wikimedia.org/T291957) (owner: 10Btullis) [17:36:49] RECOVERY - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is OK: (C)100 gt (W)50 gt 7 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [17:38:23] (03CR) 10BryanDavis: "Something I discovered when doing a similar thing for Toolhub is that the sidecar container for envoy will prevent the job's pod from exit" [deployment-charts] - 10https://gerrit.wikimedia.org/r/726891 (https://phabricator.wikimedia.org/T283159) (owner: 10Jgiannelos) [17:40:17] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:46:31] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:51:25] 10SRE, 10Infrastructure-Foundations, 10netops: Eqiad Expansion - LVS Connectivity Options - 
https://phabricator.wikimedia.org/T292630 (10cmooney) @BBlack @mark @joanna_borun @ayounsi I've put together a doc outlining some options here. Purpose of the doc is to facilitate collaboration / discussion on the op... [17:54:35] (03PS12) 10Ssingh: bird: add IPv6 support to bird and anycast-healthchecker [puppet] - 10https://gerrit.wikimedia.org/r/727355 (https://phabricator.wikimedia.org/T292737) [18:00:04] RoanKattouw, Niharika, and Urbanecm: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211007T1800). [18:00:04] sharvani_ and Urbanecm: A patch you scheduled for Morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [18:00:09] I can deploy today [18:01:59] (03PS1) 10Michael DiPietro: depool clouddb1018 [puppet] - 10https://gerrit.wikimedia.org/r/727438 (https://phabricator.wikimedia.org/T292043) [18:03:06] (03CR) 10jerkins-bot: [V: 04-1] depool clouddb1018 [puppet] - 10https://gerrit.wikimedia.org/r/727438 (https://phabricator.wikimedia.org/T292043) (owner: 10Michael DiPietro) [18:05:05] (03PS2) 10Michael DiPietro: depool clouddb1018 [puppet] - 10https://gerrit.wikimedia.org/r/727438 (https://phabricator.wikimedia.org/T292043) [18:06:29] (helping sharvani with IRC via a gmeet) [18:06:32] (03CR) 10Joal: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/727388 (https://phabricator.wikimedia.org/T292532) (owner: 10Btullis) [18:07:10] !log gitlab2001 re-image complete (T283076) [18:07:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:07:17] T283076: Puppetise gitlab-ansible playbook - https://phabricator.wikimedia.org/T283076 [18:08:12] welcome sharvani_ [18:08:19] hi! 
thanks [18:09:17] (03CR) 10Urbanecm: [C: 03+2] Stream config changes for android_daily_stats schema [mediawiki-config] - 10https://gerrit.wikimedia.org/r/722970 (https://phabricator.wikimedia.org/T286000) (owner: 10Sharvaniharan) [18:10:17] (03Merged) 10jenkins-bot: Stream config changes for android_daily_stats schema [mediawiki-config] - 10https://gerrit.wikimedia.org/r/722970 (https://phabricator.wikimedia.org/T286000) (owner: 10Sharvaniharan) [18:12:10] sharvani_: pulled to mwdebug1001, can you test? [18:12:31] I tested it and can see my stream config now! thank you so much! :) [18:15:36] 10SRE, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: Q1:(Need By: TBD) rack/setup (4) fundraising hosts - https://phabricator.wikimedia.org/T289812 (10Cmjohnson) All the idracs are set up @jgreen please DM about these. Thanks [18:15:42] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:(Need By: TBD) rack/setup/install kubernetes10[18-21] - https://phabricator.wikimedia.org/T290202 (10Cmjohnson) [18:15:54] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 33526dfed148068585289f5ac501feda72068fd9: Stream config changes for android_daily_stats schema (T286000) (duration: 01m 06s) [18:15:59] sharvani_: it's live now. [18:16:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:16:01] T286000: Android Legacy to MEP Instrumentation - MobileWikiAppDailyStats - https://phabricator.wikimedia.org/T286000 [18:16:43] (03PS2) 10Urbanecm: shwiki: Deploy Growth features to newcomers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/727431 (https://phabricator.wikimedia.org/T278240) [18:16:56] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:(Need By: TBD) rack/setup/install kubernetes10[18-21] - https://phabricator.wikimedia.org/T290202 (10Cmjohnson) The BIOS and Idracs are set up, kubernetes1020 would not power on. @Jclark-ctr can you call Dell about 1020, I am on holiday next week. I will insta... 
[18:17:11] (03CR) 10Urbanecm: [C: 03+2] shwiki: Deploy Growth features to newcomers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/727431 (https://phabricator.wikimedia.org/T278240) (owner: 10Urbanecm) [18:17:14] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q2: (Need By: TBD) rack/setup/install kubestage100[34].eqiad.wmnet - https://phabricator.wikimedia.org/T290894 (10Cmjohnson) [18:17:18] (03PS2) 10Urbanecm: Deploy Growth features to test2wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/727432 [18:18:20] (03Merged) 10jenkins-bot: shwiki: Deploy Growth features to newcomers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/727431 (https://phabricator.wikimedia.org/T278240) (owner: 10Urbanecm) [18:18:34] (03CR) 10Urbanecm: [C: 03+2] Deploy Growth features to test2wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/727432 (owner: 10Urbanecm) [18:19:25] (03Merged) 10jenkins-bot: Deploy Growth features to test2wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/727432 (owner: 10Urbanecm) [18:19:38] 10Puppet, 10Infrastructure-Foundations, 10GitLab (Infrastructure), 10Patch-For-Review, and 3 others: Puppetise gitlab-ansible playbook - https://phabricator.wikimedia.org/T283076 (10Dzahn) @Jelto As agreed we reimaged gitlab2001 together in a call. We confirmed a few things: - Arnold can run wmf-auto-re... 
[18:20:02] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 31770f2b3660e7d7490c0a9ab66285c1f069732d: shwiki: Deploy Growth features to newcomers (T278240) (duration: 01m 04s) [18:20:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:20:07] T278240: Deploy Growth features on Serbo-Croatian Wikipedia - https://phabricator.wikimedia.org/T278240 [18:21:04] 10Puppet, 10Infrastructure-Foundations, 10GitLab (Infrastructure), 10Patch-For-Review, and 3 others: Puppetise gitlab-ansible playbook - https://phabricator.wikimedia.org/T283076 (10Dzahn) Also we spoke about the !log command and Arnold used that though we still need to solve privileges so that he can sche... [18:21:31] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 87e300137c14451949fac12c3ec89319305a423e: Deploy Growth features to test2wiki (duration: 01m 04s) [18:21:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:21:44] (03PS2) 10Urbanecm: Deploy Growth mentor dashboard to pilot wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/726884 (https://phabricator.wikimedia.org/T278920) [18:21:47] (03CR) 10Urbanecm: [C: 03+2] Deploy Growth mentor dashboard to pilot wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/726884 (https://phabricator.wikimedia.org/T278920) (owner: 10Urbanecm) [18:22:34] (03Merged) 10jenkins-bot: Deploy Growth mentor dashboard to pilot wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/726884 (https://phabricator.wikimedia.org/T278920) (owner: 10Urbanecm) [18:23:12] !log urbanecm@deploy1002 Synchronized dblists/growthexperiments.dblist: 87e300137c14451949fac12c3ec89319305a423e: Deploy Growth features to test2wiki (duration: 01m 03s) [18:23:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:28:31] (03PS1) 10Majavah: kubernetes: Use Ingress v1 API [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/727449 
(https://phabricator.wikimedia.org/T292706) [18:28:55] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 4a946c046ae17a520f8d3463a16b1435ceb4856c: Deploy Growth mentor dashboard to pilot wikis (T278920) (duration: 01m 04s) [18:29:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:29:01] T278920: Mentor dashboard: V1 desktop - https://phabricator.wikimedia.org/T278920 [18:29:04] !log Morning B&C window done [18:29:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:29:28] (03PS1) 10Majavah: fix python version check [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/727450 [18:29:43] (03PS13) 10Ssingh: bird: add IPv6 support to bird and anycast-healthchecker [puppet] - 10https://gerrit.wikimedia.org/r/727355 (https://phabricator.wikimedia.org/T292737) [18:30:01] (03CR) 10jerkins-bot: [V: 04-1] fix python version check [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/727450 (owner: 10Majavah) [18:31:26] (03PS2) 10Majavah: fix python version check [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/727450 [18:32:21] (03CR) 10Ssingh: "[reverting to do_ipv6 = false]" [puppet] - 10https://gerrit.wikimedia.org/r/727355 (https://phabricator.wikimedia.org/T292737) (owner: 10Ssingh) [18:32:59] (03PS14) 10Ssingh: bird: add IPv6 support to bird and anycast-healthchecker [puppet] - 10https://gerrit.wikimedia.org/r/727355 (https://phabricator.wikimedia.org/T292737) [18:34:17] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/31556/console" [puppet] - 10https://gerrit.wikimedia.org/r/727355 (https://phabricator.wikimedia.org/T292737) (owner: 10Ssingh) [18:36:26] (03CR) 10Bstorm: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/727438 (https://phabricator.wikimedia.org/T292043) (owner: 10Michael DiPietro) [18:36:57] (03CR) 10Ssingh: [V: 03+1] "This change is ready for 
review. A few important notes:" [puppet] - 10https://gerrit.wikimedia.org/r/727355 (https://phabricator.wikimedia.org/T292737) (owner: 10Ssingh) [18:38:55] 10SRE, 10DNS, 10Traffic: Additional DNS entries for Wikilearn project (Community Development) - https://phabricator.wikimedia.org/T292537 (10Ijon) 05Resolved→03Open Hi! The forum.learn.wiki CNAME is incorrect. We asked for: forum.learn.wiki wkm-prod-alb-30644061.us-east-1.elb.amazonaws.com but it l... [18:43:10] (03PS1) 10Ssingh: learn.wiki: update record for forum.learn.wiki [dns] - 10https://gerrit.wikimedia.org/r/727460 (https://phabricator.wikimedia.org/T292537) [18:45:30] (03CR) 10Ssingh: [C: 03+2] learn.wiki: update record for forum.learn.wiki [dns] - 10https://gerrit.wikimedia.org/r/727460 (https://phabricator.wikimedia.org/T292537) (owner: 10Ssingh) [18:45:47] (03CR) 10Vgutierrez: [C: 03+1] "thanks for handling this @Ssingh" [dns] - 10https://gerrit.wikimedia.org/r/727460 (https://phabricator.wikimedia.org/T292537) (owner: 10Ssingh) [18:46:14] vgutierrez: <3 you should be off though! [18:46:26] !log running authdns-update for T292537 [18:46:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:46:32] T292537: Additional DNS entries for Wikilearn project (Community Development) - https://phabricator.wikimedia.org/T292537 [18:50:14] !log [urbanecm@mwmaint1002 /srv/mediawiki/php]$ mwscript extensions/GrowthExperiments/maintenance/initWikiConfig.php --wiki=test2wiki [18:50:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:51:05] 10SRE, 10DNS, 10Traffic, 10Patch-For-Review: Additional DNS entries for Wikilearn project (Community Development) - https://phabricator.wikimedia.org/T292537 (10ssingh) 05Open→03Resolved >>! In T292537#7409931, @Ijon wrote: > Hi! The forum.learn.wiki CNAME is incorrect. We asked for: > > forum.learn... 
[18:53:40] (03CR) 10Michael DiPietro: [C: 03+2] depool clouddb1018 [puppet] - 10https://gerrit.wikimedia.org/r/727438 (https://phabricator.wikimedia.org/T292043) (owner: 10Michael DiPietro) [19:00:05] brennen and jeena: How many deployers does it take to do MediaWiki train - American Version deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211007T1900). [19:03:27] !log 1.38.0-wmf.3 train (T281167): unblocked, rolling to all wikis [19:03:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:03:33] T281167: 1.38.0-wmf.3 deployment blockers - https://phabricator.wikimedia.org/T281167 [19:04:42] (03PS1) 10Brennen Bearnes: all wikis to 1.38.0-wmf.3 refs T281167 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/727468 [19:04:44] (03CR) 10Brennen Bearnes: [C: 03+2] all wikis to 1.38.0-wmf.3 refs T281167 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/727468 (owner: 10Brennen Bearnes) [19:05:33] (03Merged) 10jenkins-bot: all wikis to 1.38.0-wmf.3 refs T281167 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/727468 (owner: 10Brennen Bearnes) [19:07:08] !log brennen@deploy1002 rebuilt and synchronized wikiversions files: all wikis to 1.38.0-wmf.3 refs T281167 [19:07:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:09:01] ...rolling back. [19:10:03] I'm interminnently being logged out and back in [19:10:35] PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is CRITICAL: 36.36 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [19:11:42] (at cswiki/commonswiki) [19:12:35] Unable to load https://ur.wikipedia.org/wiki/Special:Preferences [19:12:42] Fatal exception of type "MWException" [19:12:43] RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [19:12:57] confirmed [19:13:02] same for me [19:13:30] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Services, 10cloud-services-team (Hardware): hw troubleshooting: crash (with thermal event) for clouddb1020.eqiad.wmnet - https://phabricator.wikimedia.org/T291963 (10Bstorm) @Marostegui I think this host is ready to get moving again. Would you like to check it and try... [19:13:56] urbanecm: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CentralAuth/+/724077 is in this train, but any issues with it woulf probably have been caught earlier [19:14:04] what's going on? [19:14:09] !log brennen@deploy1002 rebuilt and synchronized wikiversions files: Revert group2 wikis to 1.38.0-wmf.2 [19:14:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:14:29] majavah: https://en.wikipedia.org/w/index.php?title=Wikipedia:ITSTHURSDAY is [19:14:47] want a traceback? :-) [19:15:17] brennen is filing a task [19:15:33] PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v2/suggest/sections/{title}/{from}/{to} (Suggest source sections to translate) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX [19:15:49] urbanecm: I want to know if I've broken something :P [19:15:59] urbanecm: (same, enwp) not sure why I'm always surprised y'all find the issues first.. 
:P [19:16:55] This is likely fallout from the HTMLForm / Preferences / Gadgets changes last week [19:17:18] The gadget that fails the request is Invalid name 'wpgadget-Special:WhatLinksHere_action_links' passed to HTMLFormField::__construct [19:17:31] RECOVERY - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [19:17:33] it doesn't seem to affect other wikis while we had the new branch out to group2 [19:17:38] so might be something specific about this gadget [19:17:41] e.g. the colon in its name [19:17:44] brennen: ^ [19:18:01] that wouldn't explain the logouts though [19:18:15] true, but it could be a side-effect [19:18:25] unable to load preferences data etc. [19:18:25] fair enough [19:18:38] but yeah, we should look for other things in the log meanwhile [19:18:55] https://phabricator.wikimedia.org/T292777 [19:19:35] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:(Need By: TBD) rack/setup/install kubernetes10[18-21] - https://phabricator.wikimedia.org/T290202 (10Jclark-ctr) kubernetes1020 is powered on might have been delayed [19:20:16] (03PS1) 10Brennen Bearnes: Revert "all wikis to 1.38.0-wmf.3 refs T281167" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/727470 (https://phabricator.wikimedia.org/T281167) [19:20:18] (03CR) 10Brennen Bearnes: [C: 03+2] Revert "all wikis to 1.38.0-wmf.3 refs T281167" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/727470 (https://phabricator.wikimedia.org/T281167) (owner: 10Brennen Bearnes) [19:20:54] `Wrong provider MediaWiki\Extension\CentralAuth\Session\CentralAuthSessionProvider !== CentralAuthSessionProvider` helpful? [19:20:57] (03Merged) 10jenkins-bot: Revert "all wikis to 1.38.0-wmf.3 refs T281167" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/727470 (https://phabricator.wikimedia.org/T281167) (owner: 10Brennen Bearnes) [19:21:11] majavah: that looks like your fault? 
[19:21:17] *code [19:21:27] yes, definitely [19:21:32] grepping config [19:21:36] do you have a full stack trace? [19:21:57] majavah: see ticket. [19:22:08] brennen: for tn's error message [19:22:21] it's a different error :) [19:22:23] ah, sorry [19:22:27] np [19:22:32] tooooo many windows open. [19:23:55] urbanecm: hm? Sorry, was that helpful? [19:24:16] tn: the error message? Yes. Looking for a stacktrace now [19:24:49] urbanecm: https://logstash.wikimedia.org/app/discover#/doc/logstash-*/logstash-mediawiki-2021.10.07?id=L_ksXHwB1RMACsoSPNR1 is that message for me [19:24:53] thanks [19:25:06] meh, a warning in session [19:25:09] that's why i can't find anything [19:25:09] urbanecm: can you paste it somewhere for me? [19:25:19] it doesn't have a stacktrace :/ [19:25:44] grep to the rescue, then [19:25:45] https://gerrit.wikimedia.org/g/mediawiki/core/+/58652c0e3d572bde6f7cb0631ef8664c24dc1626/includes/session/SessionManager.php#660 [19:25:51] yeah [19:26:00] i think it's a cache of some sort [19:26:15] something storing the old class [19:26:18] for context, in this train I namespaced CentralAuth's session providers, so the class names had "MediaWiki\Extension\CentralAuth\Session\" added in front [19:26:38] I'd revert that -- or add the old class names as temp aliases [19:26:55] I don't think aliases will help [19:27:08] hmm, maybe you're right [19:27:34] if I'm reading it correctly, it's the "load session from store" code, and it checks that the current provider matches exactly what was used when creating the session [19:27:44] it needs an alias mechanism of some sort [19:27:47] :( [19:27:50] but let's revert for now and figure it out later [19:27:54] yeah [19:28:18] urbanecm: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CentralAuth/+/727486 [19:28:34] +2'ed [19:28:38] fun question: why was it not caught in group0/1? [19:28:55] not enough traffic? 
[19:29:22] if it logs out pretty much everyone, I'd imagine someone would notice it [19:29:47] also did https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Echo/+/727487 [19:30:09] +2'd, thanks [19:30:12] very intermittent though, happened 3 times in quick succession, (2 were pre/post submit of edits) [19:30:19] how are we going to sync those out? [19:31:30] majavah: hmm, we can rollback everything to wmf.2 and do it easily [19:31:38] I don't think there's an easy way to sync it out safely [19:31:57] we can roll back to testwikis easily enough, if needed. [19:32:36] brennen: that's great. Can you do it please? I'll start CI for backports in the meantime. [19:32:43] (03PS1) 10Majavah: Revert "Namespace session providers" [extensions/CentralAuth] (wmf/1.38.0-wmf.3) - 10https://gerrit.wikimedia.org/r/727489 [19:32:50] (03PS1) 10Majavah: Revert "Use namespaced CentralAuthSessionProvider" [extensions/Echo] (wmf/1.38.0-wmf.3) - 10https://gerrit.wikimedia.org/r/727490 [19:33:00] (03CR) 10Urbanecm: [C: 03+2] Revert "Namespace session providers" [extensions/CentralAuth] (wmf/1.38.0-wmf.3) - 10https://gerrit.wikimedia.org/r/727489 (owner: 10Majavah) [19:33:05] thanks majavah [19:33:06] (03CR) 10Urbanecm: [C: 03+2] Revert "Use namespaced CentralAuthSessionProvider" [extensions/Echo] (wmf/1.38.0-wmf.3) - 10https://gerrit.wikimedia.org/r/727490 (owner: 10Majavah) [19:33:13] urbanecm: ack, doing. [19:33:19] thanks [19:33:45] Is there an equivalent in PHP for comparing two class names that accounts for aliasing? 
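The "alias mechanism of some sort" floated at 19:27:44 could look roughly like the sketch below. It is written in Python rather than PHP and is only an illustration under stated assumptions: the alias table and the helper names `canonical_provider` and `same_provider` are hypothetical and are not part of MediaWiki or CentralAuth. The idea is to map both the provider name stored with a session and the currently running provider name to one canonical name before comparing them, so that a session created under the old class name still matches after the class is namespaced.

```python
# Hypothetical sketch: none of these names come from MediaWiki itself.
# Maps a legacy provider class name to its current (namespaced) name.
PROVIDER_ALIASES = {
    "CentralAuthSessionProvider":
        "MediaWiki\\Extension\\CentralAuth\\Session\\CentralAuthSessionProvider",
}


def canonical_provider(name: str) -> str:
    """Return the current class name for a possibly-legacy provider name."""
    return PROVIDER_ALIASES.get(name, name)


def same_provider(stored: str, current: str) -> bool:
    """Compare the name saved with a session against the running provider,
    treating a legacy name and its namespaced replacement as equal."""
    return canonical_provider(stored) == canonical_provider(current)
```

Canonicalizing both sides, rather than only the stored name, also covers the rollback direction, where a session created under the namespaced name is read back by code that still uses the old one.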
[19:33:46] !log 1.38.0-wmf.3 train (T281167): variously blocked, rolling back to testwikis for safe deploy of backports [19:33:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:33:53] T281167: 1.38.0-wmf.3 deployment blockers - https://phabricator.wikimedia.org/T281167 [19:35:25] https://logstash.wikimedia.org/app/discover#/doc/logstash-*/logstash-mediawiki-2021.10.07?id=Qgc_XHwB1RMACsoSNoyg has a trace it seems fwiw [19:35:35] * majavah can't see that [19:36:11] copy&pasting [19:36:13] (oh different error too, though related) `Metadata merge failed: MediaWiki\Session\MetadataMergeException: Key "CentralAuthSource" changed in /srv/mediawiki/php-1.38.0-wmf.2/includes/session/SessionProvider.php:330` [19:36:25] https://www.irccloud.com/pastebin/56uHEL4f/ [19:36:33] it's a MediaWiki\Session\MetadataMergeException [19:36:36] CentralAuthSource sounds unrelated [19:36:48] or other metadata merges [19:37:02] (03Merged) 10jenkins-bot: Revert "Namespace session providers" [extensions/CentralAuth] (wmf/1.38.0-wmf.3) - 10https://gerrit.wikimedia.org/r/727489 (owner: 10Majavah) [19:37:05] 10SRE: Wikipedia shutdown in Russia - https://phabricator.wikimedia.org/T292776 (10Zabe) [19:37:36] !log brennen@deploy1002 rebuilt and synchronized wikiversions files: Revert all wikis to 1.38.0-wmf.2 (T281167) [19:37:41] T292779 is probably this, I think [19:37:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:37:44] T292779: Account problems - https://phabricator.wikimedia.org/T292779 [19:38:59] brennen: I see CI already merged, lmk once I'm clear for backporting [19:39:09] urbanecm: missing the Echo patch still [19:39:13] yeah [19:39:21] i'll have to do once sync per extension anyway [19:39:28] (03PS1) 10Brennen Bearnes: revert all wikis to 1.38.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/727506 [19:39:30] (03CR) 10Brennen Bearnes: [C: 03+2] revert all wikis to 1.38.0-wmf.2 [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/727506 (owner: 10Brennen Bearnes) [19:39:33] which is failing https://integration.wikimedia.org/ci/job/mwext-php72-phan-docker/141202/console [19:39:36] grr [19:39:52] missing dependency i think [19:39:55] i'll restart [19:40:34] or not [19:40:34] (03Merged) 10jenkins-bot: revert all wikis to 1.38.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/727506 (owner: 10Brennen Bearnes) [19:40:41] gate and submit is green so far [19:40:44] (03CR) 10Urbanecm: [C: 03+2] Revert "Use namespaced CentralAuthSessionProvider" [extensions/Echo] (wmf/1.38.0-wmf.3) - 10https://gerrit.wikimedia.org/r/727490 (owner: 10Majavah) [19:41:47] urbanecm: all wikis are reverted and that patch is merged to 1.38.0-wmf.3, so you are clear to deploy those. [19:41:52] 10SRE: Wikipedia shutdown in Russia - https://phabricator.wikimedia.org/T292776 (10RhinosF1) > Today on evening Could we get a more specific time (and timezone)? [19:41:53] thanks [19:44:26] !log Backporting https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CentralAuth/+/727489, https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Echo/+/727487 in an unsafe way -- exceptions at testwikis expected, wmf.3 is not deployed elsewhere, so this should be ok [19:44:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:45:36] testwikis are also wmf.2 [19:45:51] !log urbanecm@deploy1002 Synchronized php-1.38.0-wmf.3/extensions/CentralAuth/: c01c2e4983bad8582ddd62aeb35ac9be852d493b: Revert "Namespace session providers" (duration: 00m 57s) [19:45:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:46:15] ah, that's even better [19:46:28] yeah, i realized there was no reason not to do a wholesale rollback. [19:46:28] * tn watches https://logstash.wikimedia.org/goto/6dabb5b2fc3a780e73a634355ff40aef ... those warnings *seem* to be dropping off.. 
[19:46:59] waiting for CI on the echo patch [19:47:21] (03CR) 10jerkins-bot: [V: 04-1] Revert "Use namespaced CentralAuthSessionProvider" [extensions/Echo] (wmf/1.38.0-wmf.3) - 10https://gerrit.wikimedia.org/r/727490 (owner: 10Majavah) [19:47:38] V-1 manually removed [19:47:44] gate-and-submit should pass [19:47:46] tn: that's the "wrong provider" one? not the metadata merge one? [19:48:04] we should give majavah logstash access [19:48:06] 10SRE: Wikipedia shutdown in Russia - https://phabricator.wikimedia.org/T292776 (10DonSimon) @RhinosF1, from 19:00 (UTC+3) to 20:00 but in whole Russia exactly. By the way, look [[ https://downdetector.ru/ne-rabotaet/wikipedia/ | this ]]. [19:48:21] mhm, think me seeing `MetadataMergeException` was a red herring tbh [19:48:53] and +1 to logstash for majavah :-) [19:52:36] (03PS1) 10Krinkle: Revert "Make each gadget a separate preference, instead of one huge multiselect" [extensions/Gadgets] (wmf/1.38.0-wmf.3) - 10https://gerrit.wikimedia.org/r/727492 (https://phabricator.wikimedia.org/T126962) [19:52:44] (03CR) 10Krinkle: [C: 03+2] Revert "Make each gadget a separate preference, instead of one huge multiselect" [extensions/Gadgets] (wmf/1.38.0-wmf.3) - 10https://gerrit.wikimedia.org/r/727492 (https://phabricator.wikimedia.org/T126962) (owner: 10Krinkle) [19:56:10] (03CR) 10Jbond: [C: 03+1] "LGTM minor nit but feel free to fix in a later ps" [puppet] - 10https://gerrit.wikimedia.org/r/725302 (owner: 10David Caro) [19:57:50] urbanecm: cn=nda requires wmf manager approval and things like that, and I'm not in the mood of sorting them out :P [19:58:26] (03Merged) 10jenkins-bot: Revert "Use namespaced CentralAuthSessionProvider" [extensions/Echo] (wmf/1.38.0-wmf.3) - 10https://gerrit.wikimedia.org/r/727490 (owner: 10Majavah) [19:58:47] majavah: have a graph instead :P https://usercontent.irccloud-cdn.com/file/pXcPOr0B/image.png [19:58:54] 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Clare Ming - 
https://phabricator.wikimedia.org/T292782 (10cjming) [19:59:18] majavah: i think it only applies to users with no NDA at all (I was added within few days after getting deployment) [19:59:26] anyway, I'm sure brennen would be happy to help with approvals ;) [20:00:36] 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Clare Ming - https://phabricator.wikimedia.org/T292782 (10cjming) [20:00:37] i could at least make sure the right managers see it. :) [20:01:05] IIRC the process is "any WMF employee + their manager" [20:01:25] !log urbanecm@deploy1002 Synchronized php-1.38.0-wmf.3/extensions/Echo/: 8a7ff05ba28f302adb581bf430a868bb815b4ffd: Revert "Use namespaced CentralAuthSessionProvider" (duration: 00m 57s) [20:01:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:01:37] so, backports done [20:01:43] i see Krinkle +2'ed their own patch [20:01:46] I'll let them deploy [20:02:08] it's a revert yes, those are generally self-merged per our policy [20:02:14] ok, testing now [20:02:23] I'm not disputing the revert itself [20:02:28] I'm just saying "go ahead and deploy" [20:02:31] np :) [20:02:43] (note we're fully rolled back now) [20:03:14] ah group0 as well [20:03:14] https://wikitech.wikimedia.org/wiki/Volunteer_NDA#Privileged_LDAP_or_shell_access says #sre-access-requests, but there also is #ldap-access-requests :/ [20:03:21] ok, I'll check on beta first meanwhile and then just sync blindly [20:04:05] PROBLEM - SSH on thumbor1001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:04:19] majavah: use ldap-access-requests and I'll make sure it happens [20:04:38] thx mutante [20:04:44] https://integration.wikimedia.org/zuul/#q=727491,727492 [20:05:21] mutante: https://phabricator.wikimedia.org/T292783 [20:07:25] majavah: supported [20:07:43] majavah: got it! so the next step is to receive email from KFrancis. 
We can add your email to the ticket or I can tell her in private if you want to keep it private [20:08:06] 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Clare Ming - https://phabricator.wikimedia.org/T292782 (10SCherukuwada) Approved by manager. [20:08:42] mutante: I think I already signed the correct NDA when I got root on toolforge [20:08:49] was just saying that :) [20:09:06] (03CR) 10Jbond: "lgtm some minor comments/nits but nothing blocking" [puppet] - 10https://gerrit.wikimedia.org/r/727368 (owner: 10David Caro) [20:09:21] ah, ack, checking, brb [20:10:38] can't find in the puppet admin module nor the docs with NDAs [20:10:45] something went wrong in the past [20:11:06] actually I have never seen an "NDA for toolforge" ticket [20:11:30] toolforge root is granted outside of puppet :) [20:11:40] (and note they've security access, so...they should have a NDA at least :D) [20:11:57] https://phabricator.wikimedia.org/T278390#7063385 [20:11:57] 10Puppet, 10Infrastructure-Foundations, 10GitLab (Infrastructure), 10Patch-For-Review, and 3 others: Puppetise gitlab-ansible playbook - https://phabricator.wikimedia.org/T283076 (10brennen) Nice work, all! > if you have a user on gitlab-replica.wikimedia.org could you check some features which require lo... 
[20:12:00] the term NDA means 3 or more different things around here [20:12:10] yeah, unfortunately [20:12:16] * urbanecm has a dozen of them signed [20:12:49] (03Merged) 10jenkins-bot: Revert "Make each gadget a separate preference, instead of one huge multiselect" [extensions/Gadgets] (wmf/1.38.0-wmf.3) - 10https://gerrit.wikimedia.org/r/727492 (https://phabricator.wikimedia.org/T126962) (owner: 10Krinkle) [20:13:06] safer just not to talk ;) [20:14:08] 🙂 [20:14:42] please sign an NDA about signing NDAs [20:14:54] :D [20:14:55] second..checking stuff :p [20:15:43] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:(Need By: TBD) rack/setup/install kubernetes10[18-21] - https://phabricator.wikimedia.org/T290202 (10Cmjohnson) I’m sorry, I meant 1021 in D3, U33 [20:16:29] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:18:51] 10SRE, 10Traffic: Wikipedia shutdown in Russia - https://phabricator.wikimedia.org/T292776 (10RhinosF1) Hi, A public task with information will be coming from SRE (@cdanis) soon but this was known. 
[20:21:20] gadget issue confirmed fixed in beta [20:22:34] syncing to wmf.3 now [20:22:41] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:22:53] mutante: for admin/data.yaml, please use the email I have on ldap/wikitech and not the one that legal had, I'd like to keep it private [20:23:15] !log krinkle@deploy1002 Synchronized php-1.38.0-wmf.3/extensions/Gadgets/: I7c858b8c4bc (duration: 00m 56s) [20:23:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:23:25] majavah: pinged you about real name [20:23:35] i just chuckled at the email field in LDAP :) [20:26:50] (03PS1) 10Dzahn: admin: add taavi to ldap_only users (nda) [puppet] - 10https://gerrit.wikimedia.org/r/727518 (https://phabricator.wikimedia.org/T292783) [20:28:19] all right - seems like we should be clear to roll train forward again? [20:29:20] 10Puppet, 10Infrastructure-Foundations, 10GitLab (Infrastructure), 10Patch-For-Review, and 3 others: Puppetise gitlab-ansible playbook - https://phabricator.wikimedia.org/T283076 (10Dzahn) @brennen The VM has been reinstalled and puppet reinstalled gitlab but the actual gitlab data needs to be manually imp... [20:30:15] (03PS2) 10Dzahn: admin: add taavi to ldap_only users (nda) [puppet] - 10https://gerrit.wikimedia.org/r/727518 (https://phabricator.wikimedia.org/T292783) [20:30:29] brennen: looks so. ping Krinkle and majavah just in case :) [20:31:30] well that was fun! :D [20:31:35] should be fine if you synced both the ca and echo patches [20:31:44] i did [20:32:59] kk, going ahead. 
[20:33:39] yep [20:33:55] (03PS1) 10Brennen Bearnes: all wikis to 1.38.0-wmf.2 refs T281167 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/727525 [20:33:57] (03CR) 10Brennen Bearnes: [C: 03+2] all wikis to 1.38.0-wmf.2 refs T281167 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/727525 (owner: 10Brennen Bearnes) [20:34:18] brennen: wmf.2? [20:34:37] (03Merged) 10jenkins-bot: all wikis to 1.38.0-wmf.2 refs T281167 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/727525 (owner: 10Brennen Bearnes) [20:34:58] hrm, deploy-promote clearly did not do what i intended there. [20:35:06] !log cmooney@cumin1001 START - Cookbook sre.network.cf [20:35:07] !log cmooney@cumin1001 END (PASS) - Cookbook sre.network.cf (exit_code=0) [20:35:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:35:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:36:11] (will instead revert revert commit shortly.) [20:37:49] !log brennen@deploy1002 rebuilt and synchronized wikiversions files: all wikis to 1.38.0-wmf.2 refs T281167 [20:37:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:37:56] T281167: 1.38.0-wmf.3 deployment blockers - https://phabricator.wikimedia.org/T281167 [20:41:25] 10SRE, 10Release-Engineering-Team: Add Ahmon and Brennen to Icinga contact list - https://phabricator.wikimedia.org/T292753 (10Dzahn) Hi, confirmed they don't have contacts yet. Will add them with standard email but without phone numbers for now. Note this means _no notifications_ from Icinga though about aler... [20:42:13] 10SRE, 10Release-Engineering-Team: Add Ahmon and Brennen to Icinga contact list - https://phabricator.wikimedia.org/T292753 (10Dzahn) I am not sure how helpful it is to have a contact group that does not actually get email when there are alerts but that is a slightly unrelated issue. 
[20:42:44] (03PS1) 10Brennen Bearnes: all wikis to 1.38.0-wmf.3 refs 727525 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/727536 [20:42:47] (03CR) 10Brennen Bearnes: [C: 03+2] all wikis to 1.38.0-wmf.3 refs 727525 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/727536 (owner: 10Brennen Bearnes) [20:43:27] !log brennen@deploy1002 rebuilt and synchronized wikiversions files: all wikis to 1.38.0-wmf.3 [20:43:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:43:42] (03Merged) 10jenkins-bot: all wikis to 1.38.0-wmf.3 refs 727525 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/727536 (owner: 10Brennen Bearnes) [20:49:58] 10SRE, 10Release-Engineering-Team: Add Ahmon and Brennen to Icinga contact list - https://phabricator.wikimedia.org/T292753 (10Dzahn) 05Open→03Resolved a:03Dzahn done. contacts added in private repo. the names are "brennen" and "dancy". Please note you will be able to login the web UI as both Brennen/br... [20:54:02] (03PS1) 10Dzahn: nagios_common: add brennen and dancy to gerrit contact group [puppet] - 10https://gerrit.wikimedia.org/r/727555 (https://phabricator.wikimedia.org/T292753) [20:56:46] 10SRE, 10Release-Engineering-Team, 10Patch-For-Review: Add Ahmon and Brennen to Icinga contact list - https://phabricator.wikimedia.org/T292753 (10Dzahn) @hashar That being said, this ticket should not have been needed because we recently already did a more global solution: T289746 specifically see the opt... [20:57:13] 10SRE, 10Gerrit, 10GitLab, 10Icinga, and 4 others: RelEng access to downtime alerts in Icinga for gitlab, gerrit, possibly other services? - https://phabricator.wikimedia.org/T289746 (10Dzahn) We also just did T292753 but that was kind of duplicate work then. 
[20:57:35] 10SRE, 10Traffic: Wikipedia not accessible in Russia on 2021-10-07 16:00-17:00UTC - https://phabricator.wikimedia.org/T292776 (10Aklapper) [20:58:11] (03CR) 10Dzahn: [C: 03+2] "kind of obsolete since https://phabricator.wikimedia.org/T289746 but doing it anyways to make it more obvious who is related to gerrit, th" [puppet] - 10https://gerrit.wikimedia.org/r/727555 (https://phabricator.wikimedia.org/T292753) (owner: 10Dzahn) [20:59:32] train seems stable. i'm stepping afk very briefly. [21:01:02] (03PS3) 10Dzahn: Revert "icinga: add qchris to contactgroup for gerrit" [puppet] - 10https://gerrit.wikimedia.org/r/727200 (owner: 10Hashar) [21:01:08] 10SRE: 2021-10-07 network provider issues causing all Wikimedia sites to be unreachable for many users - https://phabricator.wikimedia.org/T292792 (10CDanis) [21:01:23] (03CR) 10Dzahn: [C: 03+2] Revert "icinga: add qchris to contactgroup for gerrit" [puppet] - 10https://gerrit.wikimedia.org/r/727200 (owner: 10Hashar) [21:01:51] 10SRE: 2021-10-07 network provider issues causing all Wikimedia sites to be unreachable for many users - https://phabricator.wikimedia.org/T292792 (10CDanis) [21:02:18] 10SRE, 10Traffic: Wikipedia not accessible in Russia on 2021-10-07 16:00-17:00UTC - https://phabricator.wikimedia.org/T292776 (10CDanis) [21:05:25] (03CR) 10Dzahn: "since https://phabricator.wikimedia.org/T289746 you should all be able to run commands on _any_ service in Icinga regardless of contact gr" [puppet] - 10https://gerrit.wikimedia.org/r/727358 (owner: 10Hashar) [21:09:32] (03PS2) 10Dzahn: gerrit: add gerrit as a contact group [puppet] - 10https://gerrit.wikimedia.org/r/727358 (owner: 10Hashar) [21:10:37] (03CR) 10Dzahn: "Pretty sure admins is always added by default and we end up with "admins,admins" but that also doesn't hurt Icinga. 
confirming it" [puppet] - 10https://gerrit.wikimedia.org/r/727358 (owner: 10Hashar) [21:13:30] (03CR) 10Dzahn: [C: 03+2] "Not expecting this is needed to run any commands, you should have that from global privs already, and not expecting it means mail is sent " [puppet] - 10https://gerrit.wikimedia.org/r/727358 (owner: 10Hashar) [21:14:52] (03Abandoned) 10Dzahn: mcrouter: require_package -> ensure_packages [puppet] - 10https://gerrit.wikimedia.org/r/726726 (https://phabricator.wikimedia.org/T266479) (owner: 10Dzahn) [21:16:21] (03CR) 10Dzahn: "confirmed on alert1001 a bunch of changes like this for gerrit-related checks:" [puppet] - 10https://gerrit.wikimedia.org/r/727358 (owner: 10Hashar) [21:20:42] (03PS1) 10CDanis: NEL alert is empirically high-signal & should page SRE [puppet] - 10https://gerrit.wikimedia.org/r/727594 (https://phabricator.wikimedia.org/T292792) [21:21:40] PROBLEM - Check systemd state on sretest1001 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_rasdaemon.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:25:09] Huh?! Is everyone leaving? [21:25:11] irccloud centralization strikes again [21:25:26] Juan_90264: it's that they are all using the same service [21:25:34] and sometimes it gets kicked from the IRCd [21:25:52] Juan_90264: usually they will come back in a moment [21:26:50] mutante: Ok, I'm glad I'm using another service [21:27:10] Juan_90264: yea, it's good if at least some are on something else :) [21:28:35] a bit ironic since they always say how ancient IRC is but it has the decentralization more than the modern alternatives [21:29:44] mutante: True, and also the amazing thing is that almost everyone is using the same service. [21:30:25] Juan_90264: yea, it's irccloud.com [21:31:02] kind of removes the multi-server thing [21:32:15] (03CR) 10Dzahn: [C: 03+2] "thanks! looks good and compiler output confirmed." 
[puppet] - 10https://gerrit.wikimedia.org/r/727305 (owner: 10Jbond) [21:32:25] (03PS1) 10Legoktm: Add cookbook to build and upload Scap releases to apt.wm.o [cookbooks] - 10https://gerrit.wikimedia.org/r/727605 [21:32:47] I think irccloud.com is back to normal. [21:33:42] maybe one user was flooding and they got automatically blocked [21:34:29] (03PS2) 10Legoktm: Add cookbook to build and upload Scap releases to apt.wm.o [cookbooks] - 10https://gerrit.wikimedia.org/r/727605 [21:34:46] nah, just a network issue [21:34:49] yeah [21:34:57] (and they assign unique IPv6 to users) [21:35:21] !log Password reset for SUL User:LA2-bot (T292793) [21:35:24] (03PS3) 10Legoktm: Add cookbook to build and upload Scap releases to apt.wm.o [cookbooks] - 10https://gerrit.wikimedia.org/r/727605 [21:35:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:35:28] T292793: User:LA2-bot needs to have password reset via CLI - https://phabricator.wikimedia.org/T292793 [21:35:30] but they only use one upstream at a time, so if anything flaps it all goes down [21:35:46] (per network at least) [21:36:21] (03CR) 10Dzahn: "confirmed this stopped puppet from trying to start rsyncd on cumin hosts that are _not_ cumin2002 and noop on cumin2002 where rsync is run" [puppet] - 10https://gerrit.wikimedia.org/r/727305 (owner: 10Jbond) [21:37:22] Urbanecm: Good morning, afternoon or evening Urbanecm! [21:37:32] hey Juan_90264, what's up? 
[21:37:38] (03CR) 10Dzahn: [C: 04-1] "merged https://gerrit.wikimedia.org/r/c/operations/puppet/+/727305 and confirmed it fixed the issue on the cumin hosts that are not the os" [puppet] - 10https://gerrit.wikimedia.org/r/726851 (owner: 10Muehlenhoff) [21:39:07] (03CR) 10jerkins-bot: [V: 04-1] Add cookbook to build and upload Scap releases to apt.wm.o [cookbooks] - 10https://gerrit.wikimedia.org/r/727605 (owner: 10Legoktm) [21:39:10] (03CR) 10Legoktm: Add cookbook to build and upload Scap releases to apt.wm.o (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/727605 (owner: 10Legoktm) [21:41:21] Today I created a task to create a personal project in Phabricator, if anyone wants to take a look: https://phabricator.wikimedia.org/T292687 [21:45:51] 10SRE, 10serviceops, 10wikidiff2, 10Community-Tech (CommTech-Sprint-10), 10Platform Team Workboards (Platform Engineering Reliability): Deploy wikidiff2 1.13.0 - https://phabricator.wikimedia.org/T285857 (10Legoktm) Has this been rolled out further since? {T292762} is reporting generating diffs is slower. [22:02:25] 10SRE, 10serviceops, 10wikidiff2, 10Community-Tech (CommTech-Sprint-10), 10Platform Team Workboards (Platform Engineering Reliability): Deploy wikidiff2 1.13.0 - https://phabricator.wikimedia.org/T285857 (10Daimona) >>! In T285857#7410752, @Legoktm wrote: > Has this been rolled out further since? {T29276... [22:04:49] RECOVERY - SSH on thumbor1001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:10:46] (03CR) 10Jforrester: "Is this worth deploying in case we revert to wmf.2 again?" 
[core] (wmf/1.38.0-wmf.3) - 10https://gerrit.wikimedia.org/r/727186 (https://phabricator.wikimedia.org/T292048) (owner: 10Ladsgroup) [22:14:59] (03PS6) 10Cwhite: opensearch: fork elasticsearch module into opensearch module [puppet] - 10https://gerrit.wikimedia.org/r/721359 (https://phabricator.wikimedia.org/T288618) [22:15:01] (03PS5) 10Cwhite: opensearch_dashboards: fork kibana module into opensearch_dashboards module [puppet] - 10https://gerrit.wikimedia.org/r/721385 (https://phabricator.wikimedia.org/T288618) [22:15:03] (03PS5) 10Cwhite: icinga: fork icinga::monitor::elasticsearch::base_checks [puppet] - 10https://gerrit.wikimedia.org/r/721386 (https://phabricator.wikimedia.org/T288618) [22:15:05] (03PS4) 10Cwhite: profile: fork elasticsearch profile into opensearch::server [puppet] - 10https://gerrit.wikimedia.org/r/721388 (https://phabricator.wikimedia.org/T288618) [22:15:07] (03PS5) 10Cwhite: profile: fork elasticsearch base_checks for opensearch [puppet] - 10https://gerrit.wikimedia.org/r/721389 (https://phabricator.wikimedia.org/T288618) [22:15:09] (03PS4) 10Cwhite: profile: fork kibana profile into opensearch::dashboards [puppet] - 10https://gerrit.wikimedia.org/r/721391 (https://phabricator.wikimedia.org/T288618) [22:15:11] (03PS5) 10Cwhite: profile: fork elasticsearch::logstash into opensearch::logstash [puppet] - 10https://gerrit.wikimedia.org/r/721395 (https://phabricator.wikimedia.org/T288618) [22:15:13] (03PS1) 10Cwhite: logstash: add opensearch output config definition [puppet] - 10https://gerrit.wikimedia.org/r/727624 (https://phabricator.wikimedia.org/T288618) [22:15:15] (03PS1) 10Cwhite: logstash: kafka input: add manage_truststore parameter [puppet] - 10https://gerrit.wikimedia.org/r/727625 (https://phabricator.wikimedia.org/T288618) [22:15:17] (03PS1) 10Cwhite: profile: add logstash common profile [puppet] - 10https://gerrit.wikimedia.org/r/727626 (https://phabricator.wikimedia.org/T288618) [22:15:19] (03PS1) 10Cwhite: profile: add beta 
logstash profile [puppet] - 10https://gerrit.wikimedia.org/r/727627 (https://phabricator.wikimedia.org/T288618) [22:15:51] (03CR) 10jerkins-bot: [V: 04-1] opensearch: fork elasticsearch module into opensearch module [puppet] - 10https://gerrit.wikimedia.org/r/721359 (https://phabricator.wikimedia.org/T288618) (owner: 10Cwhite) [22:17:45] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:19:05] (03PS7) 10Cwhite: opensearch: fork elasticsearch module into opensearch module [puppet] - 10https://gerrit.wikimedia.org/r/721359 (https://phabricator.wikimedia.org/T288618) [22:19:07] (03PS6) 10Cwhite: opensearch_dashboards: fork kibana module into opensearch_dashboards module [puppet] - 10https://gerrit.wikimedia.org/r/721385 (https://phabricator.wikimedia.org/T288618) [22:19:09] (03PS6) 10Cwhite: icinga: fork icinga::monitor::elasticsearch::base_checks [puppet] - 10https://gerrit.wikimedia.org/r/721386 (https://phabricator.wikimedia.org/T288618) [22:19:11] (03PS5) 10Cwhite: profile: fork elasticsearch profile into opensearch::server [puppet] - 10https://gerrit.wikimedia.org/r/721388 (https://phabricator.wikimedia.org/T288618) [22:19:13] (03PS6) 10Cwhite: profile: fork elasticsearch base_checks for opensearch [puppet] - 10https://gerrit.wikimedia.org/r/721389 (https://phabricator.wikimedia.org/T288618) [22:19:15] (03PS5) 10Cwhite: profile: fork kibana profile into opensearch::dashboards [puppet] - 10https://gerrit.wikimedia.org/r/721391 (https://phabricator.wikimedia.org/T288618) [22:19:17] (03PS6) 10Cwhite: profile: fork elasticsearch::logstash into opensearch::logstash [puppet] - 10https://gerrit.wikimedia.org/r/721395 (https://phabricator.wikimedia.org/T288618) [22:19:19] (03PS2) 10Cwhite: logstash: add opensearch output config definition [puppet] - 10https://gerrit.wikimedia.org/r/727624 (https://phabricator.wikimedia.org/T288618) [22:19:21] (03PS2) 
10Cwhite: logstash: kafka input: add manage_truststore parameter [puppet] - 10https://gerrit.wikimedia.org/r/727625 (https://phabricator.wikimedia.org/T288618) [22:19:23] (03PS2) 10Cwhite: profile: add logstash common profile [puppet] - 10https://gerrit.wikimedia.org/r/727626 (https://phabricator.wikimedia.org/T288618) [22:19:25] (03PS2) 10Cwhite: profile: add beta logstash profile [puppet] - 10https://gerrit.wikimedia.org/r/727627 (https://phabricator.wikimedia.org/T288618) [22:25:23] (03CR) 10Cwhite: [C: 03+1] statsite: log instance identifier [puppet] - 10https://gerrit.wikimedia.org/r/727295 (https://phabricator.wikimedia.org/T247963) (owner: 10Filippo Giunchedi) [22:27:34] PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [400.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [22:29:08] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:32:53] (03CR) 10jerkins-bot: [V: 04-1] opensearch: fork elasticsearch module into opensearch module [puppet] - 10https://gerrit.wikimedia.org/r/721359 (https://phabricator.wikimedia.org/T288618) (owner: 10Cwhite) [22:37:21] (03CR) 10Cwhite: schemas - metrics: Add puppet keys to the metrics name space (034 comments) [software/ecs] - 10https://gerrit.wikimedia.org/r/722873 (https://phabricator.wikimedia.org/T222826) (owner: 10Jbond) [22:43:02] PROBLEM - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is CRITICAL: 113 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [22:44:43] (03PS8) 10Cwhite: opensearch: fork elasticsearch module into 
opensearch module [puppet] - 10https://gerrit.wikimedia.org/r/721359 (https://phabricator.wikimedia.org/T288618) [22:44:45] (03PS7) 10Cwhite: opensearch_dashboards: fork kibana module into opensearch_dashboards module [puppet] - 10https://gerrit.wikimedia.org/r/721385 (https://phabricator.wikimedia.org/T288618) [22:44:47] (03PS7) 10Cwhite: icinga: fork icinga::monitor::elasticsearch::base_checks [puppet] - 10https://gerrit.wikimedia.org/r/721386 (https://phabricator.wikimedia.org/T288618) [22:44:49] (03PS6) 10Cwhite: profile: fork elasticsearch profile into opensearch::server [puppet] - 10https://gerrit.wikimedia.org/r/721388 (https://phabricator.wikimedia.org/T288618) [22:44:51] (03PS7) 10Cwhite: profile: fork elasticsearch base_checks for opensearch [puppet] - 10https://gerrit.wikimedia.org/r/721389 (https://phabricator.wikimedia.org/T288618) [22:44:52] RECOVERY - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [22:44:53] (03PS7) 10Cwhite: profile: fork elasticsearch::logstash into opensearch::logstash [puppet] - 10https://gerrit.wikimedia.org/r/721395 (https://phabricator.wikimedia.org/T288618) [22:44:55] (03PS3) 10Cwhite: logstash: add opensearch output config definition [puppet] - 10https://gerrit.wikimedia.org/r/727624 (https://phabricator.wikimedia.org/T288618) [22:44:57] (03PS3) 10Cwhite: logstash: kafka input: add manage_truststore parameter [puppet] - 10https://gerrit.wikimedia.org/r/727625 (https://phabricator.wikimedia.org/T288618) [23:00:04] brennen: I, the Bot under the Fountain, call upon thee, The Deployer, to do US Backport and Config training deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211007T2300). 
[23:00:04] Juan_90264: A patch you scheduled for US Backport and Config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [23:00:39] (03CR) 10Juan90264: "This change is ready for review." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/727497 (owner: 10Juan90264) [23:00:55] (03PS3) 10Juan90264: Change logo in astwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/727497 (https://phabricator.wikimedia.org/T292742) [23:02:18] I'll have a patch too once Jenkins stops being a slug [23:02:46] * thcipriani waves [23:02:55] I returned [23:03:29] Juan_90264: great! ready to deploy some config changes? [23:03:29] brennen: Online? [23:03:51] here with thcipriani for backport & config training; we're awaiting a trainee. [23:03:53] yes [23:04:02] perfect, we're merging now [23:04:52] (03CR) 10Juan90264: [C: 03+1] Change logo in astwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/727497 (https://phabricator.wikimedia.org/T292742) (owner: 10Juan90264) [23:05:49] (03PS27) 10Thcipriani: Adding and use wordmark in trwikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704170 (https://phabricator.wikimedia.org/T286133) (owner: 10Juan90264) [23:06:13] (03CR) 10Thcipriani: [C: 03+2] "CONFIG DEPLOY" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704170 (https://phabricator.wikimedia.org/T286133) (owner: 10Juan90264) [23:07:35] I also added https://gerrit.wikimedia.org/r/c/727497/ [23:08:01] got it, thanks [23:08:16] (03PS4) 10Juan90264: Change logo in astwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/727497 (https://phabricator.wikimedia.org/T292742) [23:08:21] (03Merged) 10jenkins-bot: Adding and use wordmark in trwikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/704170 (https://phabricator.wikimedia.org/T286133) (owner: 10Juan90264) [23:08:31] (03PS5) 10Juan90264: Change logo in astwiki [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/727497 (https://phabricator.wikimedia.org/T292742) [23:10:04] (03PS1) 10Gergő Tisza: Mentee overview: Make UncachedMenteeOverviewDataProvider::getBlocksForUsers faster [extensions/GrowthExperiments] (wmf/1.38.0-wmf.3) - 10https://gerrit.wikimedia.org/r/727498 (https://phabricator.wikimedia.org/T290609) [23:11:04] Juan_90264: 704170 live on mwdebug1002, check please [23:11:50] Okay [23:14:04] (03PS4) 10Thcipriani: Change Javanese Wiktionary logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708065 (https://phabricator.wikimedia.org/T287425) (owner: 10Labdajiwa) [23:15:16] let me know when it's safe to sync [23:15:56] added https://gerrit.wikimedia.org/r/c/mediawiki/extensions/GrowthExperiments/+/727498 to the deploy calendar. Can deploy myself afterwards if preferred. [23:17:02] tgr we're both pretty fried, that would be lovely. we'll ping when finished with these. [23:18:06] 704170 Where's Stashbot? [23:19:04] 727497 I believe it is now safe to implement [23:19:45] have you check 704170 on mwdebug1002? Can I deploy it? [23:20:16] I'll merge 727497 after I finish deploying 704170 [23:21:25] Juan_90264: in case you don't know how to check mwdebug1002, use: https://wikitech.wikimedia.org/wiki/WikimediaDebug [23:22:05] Okay [23:25:58] Juan_90264: I see it on https://tr.m.wikiquote.org/wiki/Anasayfa using mwdebug1002 [23:26:12] ^ does that look OK to deploy to you? [23:26:48] (03CR) 10Gergő Tisza: [C: 03+2] Mentee overview: Make UncachedMenteeOverviewDataProvider::getBlocksForUsers faster [extensions/GrowthExperiments] (wmf/1.38.0-wmf.3) - 10https://gerrit.wikimedia.org/r/727498 (https://phabricator.wikimedia.org/T290609) (owner: 10Gergő Tisza) [23:27:24] Yes [23:27:32] OK, going live! 
[23:28:38] !log thcipriani@deploy1002 Synchronized static/images/mobile/copyright/wikiquote-wordmark-tr.svg: Config: [[gerrit:704170|Adding and use wordmark in trwikiquote (T286133)]] Part 1/2 (duration: 00m 57s) [23:28:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:28:45] T286133: Add wordmark to trwikiquote - https://phabricator.wikimedia.org/T286133 [23:28:45] I wasn't used to using WikimediaDebug properly :) [23:29:25] no worries :) [23:30:04] !log thcipriani@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:704170|Adding and use wordmark in trwikiquote (T286133)]] Part 2/2 (duration: 00m 56s) [23:30:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:30:17] ^ Juan_90264 should be live now [23:30:57] not yet [23:30:59] alrght on to the next patch [23:31:17] Now yes [23:31:46] perfect :) [23:32:00] (03CR) 10Thcipriani: [C: 03+2] "CONFIG DEPLOY" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708065 (https://phabricator.wikimedia.org/T287425) (owner: 10Labdajiwa) [23:32:54] (03Merged) 10jenkins-bot: Change Javanese Wiktionary logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/708065 (https://phabricator.wikimedia.org/T287425) (owner: 10Labdajiwa) [23:35:31] I arrived at WikimediaDebug and I also approve the implementation [23:35:50] * deployment [23:36:24] Juan_90264: 708065 is live on mwdebug1002, check please [23:36:36] looks good on mwdebut1002 [23:36:41] s/but/bug/ [23:36:58] (but it's where patches debut too and I love that) [23:37:10] haha [23:37:28] checked and approved [23:37:32] perfect! [23:37:34] going live [23:40:09] !log thcipriani@deploy1002 Synchronized static/images/project-logos: Config: [[gerrit:708065|Change Javanese Wiktionary logo (T287425)]] part 1/3 (duration: 00m 56s) [23:40:10] brennen: I hope you are doing well in training. 
[23:40:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:40:16] T287425: Change Javanese Wiktionary logo - https://phabricator.wikimedia.org/T287425 [23:41:02] thanks. :) [23:41:56] !log thcipriani@deploy1002 Synchronized logos/config.yaml: Config: [[gerrit:708065|Change Javanese Wiktionary logo (T287425)]] part 2/3 (duration: 00m 55s) [23:42:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:43:00] (03PS6) 10Thcipriani: Change logo in astwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/727497 (https://phabricator.wikimedia.org/T292742) (owner: 10Juan90264) [23:43:04] !log thcipriani@deploy1002 Synchronized wmf-config/logos.php: Config: [[gerrit:708065|Change Javanese Wiktionary logo (T287425)]] part 3/3 (duration: 00m 55s) [23:43:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:43:20] ^ Juan_90264 javanese wiktionary logo should be live now [23:44:40] Not yet [23:45:33] if it's the same logo, you probably need to purge the varnish cache for it [23:45:38] er, same filename [23:45:48] echo "$url" | mwscript purgeList.php [23:46:09] e.g. https://sal.toolforge.org/log/fR1BU3wB8Fs0LHO5pCvh [23:47:03] Purging the cache from this will be useful [23:48:51] Juan_90264: looking at https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/727497 could you compress those pngs? they are double the size of the images they're replacing. [23:49:26] (03Merged) 10jenkins-bot: Mentee overview: Make UncachedMenteeOverviewDataProvider::getBlocksForUsers faster [extensions/GrowthExperiments] (wmf/1.38.0-wmf.3) - 10https://gerrit.wikimedia.org/r/727498 (https://phabricator.wikimedia.org/T290609) (owner: 10Gergő Tisza) [23:50:02] thcipriani: Yes I can but another change will be needed [23:51:06] Juan_90264: we're out of time to deploy changes in this window, could you optimize the images and add them to the next backport window? 
[23:51:59] I can [23:52:12] (03CR) 10Bstorm: [C: 03+1] "This looks right for the new requirements and v1 was legit in 1.19." [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/727449 (https://phabricator.wikimedia.org/T292706) (owner: 10Majavah) [23:52:49] This Monday I will be available for 18:00 UTC Backport [23:52:55] perfect [23:53:00] that was a poor time to lock myself out of my IRC client for letsencrypt cert issues. [23:53:16] tgr_: I was wondering what happened to you [23:53:23] the server is yours [23:53:31] thx [23:54:10] Let's go to the Ast Wikipedia change? [23:58:20] legoktm: Can purge cache from https://gerrit.wikimedia.org/r/708065? [23:59:25] This one has already been merged, but the logo looks like it needs to be purged to appear permanently, as it appears on WikimediaDebug