[00:00:04] RoanKattouw and Urbanecm: Dear deployers, time to do the UTC late backport window deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211214T0000). [00:00:04] Jdlrobson: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [00:09:44] present [00:09:50] Is there a deployer available? [00:10:26] I know that Roan is not available at the moment [00:10:37] Thanks EricGardner [00:10:46] (03PS3) 10Jdlrobson: MinervaDonateLink is enabled in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/745940 (https://phabricator.wikimedia.org/T191743) [00:10:56] I too am trying to get some things deployed in the current window but I'm still putting cherry picks together [00:11:11] thcipriani: are you around? [00:13:10] (03PS4) 10Jdlrobson: Default commons search experience is MediaSearch [mediawiki-config] - 10https://gerrit.wikimedia.org/r/745935 (https://phabricator.wikimedia.org/T297484) [00:13:25] (03PS6) 10Jdlrobson: Clean up readers web team config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/743051 [00:13:34] (03PS3) 10Jdlrobson: Remove broken wikipedia-wordmark-en.png symlink [mediawiki-config] - 10https://gerrit.wikimedia.org/r/745573 (https://phabricator.wikimedia.org/T278193) [00:24:17] (03CR) 10Eric Gardner: "This change is ready for review." [extensions/MediaSearch] (wmf/1.38.0-wmf.12) - 10https://gerrit.wikimedia.org/r/746915 (https://phabricator.wikimedia.org/T297529) (owner: 10Eric Gardner) [00:24:35] 10SRE, 10serviceops, 10Wikimedia-production-error: wtp* hosts: Out of memory (allocated 39845888) (tried to allocate 131072 bytes) in OutputHandler.php - https://phabricator.wikimedia.org/T297517 (10tstarling) >>! In T297517#7566856, @brennen wrote: > We're currently on 1.38.0-wmf.9, and this remains a block... 
[00:25:25] Jdlrobson: did you find someone? [00:26:06] tgr: nope [00:26:47] I'll deploy then [00:26:54] tgr: thank you <3 [00:27:18] I'm taking myself off the list, my patches can ride the train it turns out [00:27:22] 10SRE, 10serviceops, 10Wikimedia-production-error: wtp* hosts: Out of memory (allocated 39845888) (tried to allocate 131072 bytes) in OutputHandler.php - https://phabricator.wikimedia.org/T297517 (10tstarling) The only thing unique to this report as compared to T296098 and T296063 is the failure mode, i.e. m... [00:28:01] (03Abandoned) 10Eric Gardner: Vue: Unbreak after Vue 3 migration [extensions/MediaSearch] (wmf/1.38.0-wmf.12) - 10https://gerrit.wikimedia.org/r/746915 (https://phabricator.wikimedia.org/T297529) (owner: 10Eric Gardner) [00:29:35] (03CR) 10Gergő Tisza: [C: 03+2] Default commons search experience is MediaSearch [mediawiki-config] - 10https://gerrit.wikimedia.org/r/745935 (https://phabricator.wikimedia.org/T297484) (owner: 10Jdlrobson) [00:30:15] (03Merged) 10jenkins-bot: Default commons search experience is MediaSearch [mediawiki-config] - 10https://gerrit.wikimedia.org/r/745935 (https://phabricator.wikimedia.org/T297484) (owner: 10Jdlrobson) [00:31:23] Jdlrobson: first patch is on mwdebug1001 [00:31:28] testing [00:34:11] LGTM. [00:34:15] Haven't checked the logs yet [00:34:35] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [00:34:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:34:44] (03PS4) 10Gergő Tisza: Remove broken wikipedia-wordmark-en.png symlink [mediawiki-config] - 10https://gerrit.wikimedia.org/r/745573 (https://phabricator.wikimedia.org/T278193) (owner: 10Jdlrobson) [00:35:07] tgr: i think we're good to sync that one. Not seeing anything new on logstash/mwdebug channel [00:35:46] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[00:35:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:36:10] (03CR) 10Gergő Tisza: [C: 03+2] Remove broken wikipedia-wordmark-en.png symlink [mediawiki-config] - 10https://gerrit.wikimedia.org/r/745573 (https://phabricator.wikimedia.org/T278193) (owner: 10Jdlrobson) [00:36:22] !log tgr@deploy1002 Synchronized wmf-config/CommonSettings.php: Config: [[gerrit:745935|Default commons search experience is MediaSearch (T297484)]] (duration: 00m 56s) [00:36:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:36:27] T297484: Update how destination of top-right search form is set - https://phabricator.wikimedia.org/T297484 [00:36:52] (03Merged) 10jenkins-bot: Remove broken wikipedia-wordmark-en.png symlink [mediawiki-config] - 10https://gerrit.wikimedia.org/r/745573 (https://phabricator.wikimedia.org/T278193) (owner: 10Jdlrobson) [00:37:34] Jdlrobson: second patch is on mwdebug1001 [00:38:34] (03PS4) 10Gergő Tisza: MinervaDonateLink is enabled in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/745940 (https://phabricator.wikimedia.org/T191743) (owner: 10Jdlrobson) [00:39:20] tgr: that's good to go too. [00:41:20] !log tgr@deploy1002 Synchronized images/mobile/: Config: [[gerrit:745573|Remove broken wikipedia-wordmark-en.png symlink (T278193)]] (duration: 00m 56s) [00:41:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:41:25] T278193: [php-fpm] Symbolic link not allowed or link target not accessible: wikipedia-wordmark-en.png - https://phabricator.wikimedia.org/T278193 [00:42:01] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [00:42:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:43:10] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[00:43:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:44:22] Jdlrobson: can you check production too? (if there's anything to check, not sure how that works with symlinks) Files can be tricky due to the edge cache needing purges. [00:44:54] tgr: yep checking [00:45:05] (03CR) 10Gergő Tisza: [C: 03+2] MinervaDonateLink is enabled in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/745940 (https://phabricator.wikimedia.org/T191743) (owner: 10Jdlrobson) [00:45:59] (03Merged) 10jenkins-bot: MinervaDonateLink is enabled in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/745940 (https://phabricator.wikimedia.org/T191743) (owner: 10Jdlrobson) [00:47:00] php-fpm I'm getting a 404 on https://en.wikipedia.org/images/mobile/wikipedia-wordmark-en.png so that's promising [00:47:05] Need to monitor the logs a bit more though [00:48:16] thanks! meanwhile the third patch is on mwdebug [00:48:29] tgr: testing that one now.. [00:48:55] Original exception: [f282c392-f8a3-47e6-8e19-f2bc9e3b5475] 2021-12-14 00:48:32: Fatal exception of type "TypeError" doesn't seem good [00:49:23] ahh yeh that one's not good. [00:49:24] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [00:49:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:49:37] It looks like I misread the format. Please revert that one. I'll redo it [00:49:54] (03PS1) 10Jdlrobson: Revert "MinervaDonateLink is enabled in production" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/746916 [00:50:33] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[00:50:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:51:41] (03PS1) 10Gergő Tisza: Revert "MinervaDonateLink is enabled in production" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/746917 [00:51:49] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=sidekiq site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [00:51:55] (03PS1) 10Jdlrobson: [Attempt 2] MinervaDonateLink is enabled in production"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/746977 [00:52:03] tgr: the above one is the correct one^ [00:52:16] not sure if it makes sense to revert then try again or just squash these into 2 [00:52:57] squashing is nicer if you can do it [00:53:01] can [00:53:30] (03Abandoned) 10Gergő Tisza: Revert "MinervaDonateLink is enabled in production" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/746917 (owner: 10Gergő Tisza) [00:53:33] (03PS2) 10Jdlrobson: [Attempt 2] MinervaDonateLink is enabled in production"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/746977 [00:53:37] there you go [00:53:53] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [00:54:49] (03CR) 10Gergő Tisza: [C: 03+2] [Attempt 2] MinervaDonateLink is enabled in production"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/746977 (owner: 10Jdlrobson) [00:55:17] 10SRE, 10serviceops, 10Wikimedia-production-error: wtp* hosts: Out of memory (allocated 39845888) (tried to allocate 131072 bytes) in OutputHandler.php - https://phabricator.wikimedia.org/T297517 (10brennen) > Is tuning the kernel the thing that you want unbroken now? Again, it has probably been broken for y... 
[00:55:28] 10SRE, 10serviceops, 10Wikimedia-production-error: wtp* hosts: Out of memory (allocated 39845888) (tried to allocate 131072 bytes) in OutputHandler.php - https://phabricator.wikimedia.org/T297517 (10brennen) [00:55:34] (03Merged) 10jenkins-bot: [Attempt 2] MinervaDonateLink is enabled in production"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/746977 (owner: 10Jdlrobson) [00:56:48] Jdlrobson: it's on mwdebug1001 [00:57:34] tgr: testing [00:57:45] tgr: LGTM [00:57:49] donate link still there :) [00:58:13] (03PS7) 10Gergő Tisza: Clean up readers web team config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/743051 (owner: 10Jdlrobson) [00:58:37] 10SRE, 10serviceops, 10Wikimedia-production-error: wtp* hosts: Out of memory (allocated 39845888) (tried to allocate 131072 bytes) in OutputHandler.php - https://phabricator.wikimedia.org/T297517 (10ssastry) Since the train was rolled forward from wmf.9 -> wmf.12 today, [[ https://grafana.wikimedia.org/d/000... [00:58:59] !log tgr@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:746977|[Attempt 2] MinervaDonateLink is enabled in production""]] (duration: 00m 57s) [00:59:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:59:24] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [00:59:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:00:30] hm, not sure what the sync order is for dblist changes these days. dblist -> yaml -> generator -> IS.php? [01:00:41] 10SRE, 10SRE-Access-Requests: Requesting access to 'restricted' for komla - https://phabricator.wikimedia.org/T297621 (10komla) [01:00:57] tgr: I'm not sure either. 
Can delay this one until tomorrow if you are not comfortable doing it [01:01:00] it's not urgent at all [01:01:04] just an opportunity to clean up some cruft [01:03:11] it has to be safe as long as PHP is left to the end, as far as I can see [01:03:30] (03CR) 10Gergő Tisza: [C: 03+2] Clean up readers web team config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/743051 (owner: 10Jdlrobson) [01:03:51] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [01:03:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:04:44] (03Merged) 10jenkins-bot: Clean up readers web team config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/743051 (owner: 10Jdlrobson) [01:05:54] Jdlrobson: it's on mwdebug1001 [01:06:01] tgr: testing [01:08:13] tgr: good to sync [01:10:04] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [01:10:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:10:20] !log tgr@deploy1002 Synchronized wmf-config/config/: Config: [[gerrit:743051|Clean up readers web team config]] (duration: 00m 56s) [01:10:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:11:15] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[01:11:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:11:49] !log tgr@deploy1002 Synchronized dblists/mobile-anon-talk.dblist: Config: [[gerrit:743051|Clean up readers web team config]] (duration: 00m 55s) [01:11:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:15:02] tgr: and testing in production [01:15:19] it's not really deployed yet [01:15:30] ah ok ping me when i should check [01:15:57] I'm trying to figure out whether https://gerrit.wikimedia.org/r/plugins/gitiles/operations/mediawiki-config/+/6e17f55d2badd6efcf30dd856a0bcc1da35217cd/multiversion/MWConfigCacheGenerator.php#16 is for deployers or code authors [01:18:53] I guess that's outdated? https://noc.wikimedia.org/conf/ seems to include the new dblist without any manual action [01:19:43] There's https://wikitech.wikimedia.org/w/index.php?search=%22createTxtFileSymlinks.sh%22&title=Special:Search&profile=advanced&fulltext=1&ns0=1&ns12=1&ns116=1&ns498=1 so probably outdated [01:19:47] tgr: running it locally seems to do nothing. [01:20:39] !log tgr@deploy1002 Synchronized multiversion/MWConfigCacheGenerator.php: Config: [[gerrit:743051|Clean up readers web team config]] (duration: 00m 55s) [01:20:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:22:23] seems like the last time someone ran it was in 2019 [01:22:48] oh well, can't hurt. [01:24:35] !log tgr@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:743051|Clean up readers web team config]] (duration: 00m 55s) [01:24:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:25:10] Jdlrobson: now deployed for reals [01:26:20] tgr: yay! 
thanks a bunch [01:26:25] running through some last tests [01:27:29] tgr: and all looks good to me [01:28:08] (03CR) 10Gergő Tisza: [C: 03+2] wgEventStreams: Add WelcomeSurvey Interaction schema [mediawiki-config] - 10https://gerrit.wikimedia.org/r/745833 (https://phabricator.wikimedia.org/T267273) (owner: 10Kosta Harlan) [01:30:19] (03PS4) 10Gergő Tisza: wgEventStreams: Add WelcomeSurvey Interaction schema [mediawiki-config] - 10https://gerrit.wikimedia.org/r/745833 (https://phabricator.wikimedia.org/T267273) (owner: 10Kosta Harlan) [01:32:50] (03CR) 10Gergő Tisza: [C: 03+2] wgEventStreams: Add WelcomeSurvey Interaction schema [mediawiki-config] - 10https://gerrit.wikimedia.org/r/745833 (https://phabricator.wikimedia.org/T267273) (owner: 10Kosta Harlan) [01:33:40] (03Merged) 10jenkins-bot: wgEventStreams: Add WelcomeSurvey Interaction schema [mediawiki-config] - 10https://gerrit.wikimedia.org/r/745833 (https://phabricator.wikimedia.org/T267273) (owner: 10Kosta Harlan) [01:36:29] !log tgr@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:745833|wgEventStreams: Add WelcomeSurvey Interaction schema (T267273)]] (duration: 00m 56s) [01:36:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:36:35] T267273: [arwiki] Submitting a POST on a form redirected to immediately after account creation sometimes logs user out - https://phabricator.wikimedia.org/T267273 [01:37:47] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [01:37:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:39:00] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[01:39:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:41:03] !log UTC late deploys done [01:41:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:41:42] !log ryankemper@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation restart without plugin upgrade (3 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad rolling restart - ryankemper@cumin1001 - T297468 [01:41:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:42:16] !log T297468 `sudo cookbook sre.elasticsearch.rolling-operation search_eqiad "eqiad rolling restart" --nodes-per-run 3 --start-datetime 2021-12-14T01:27:58 --task-id T297468` on `ryankemper@cumin1001` tmux `elastic_restarts` [01:42:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:43:15] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 98 probes of 638 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [01:49:23] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 47 probes of 638 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [02:05:28] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [02:05:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:06:42] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[02:06:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:07:02] (03PS1) 10TrainBranchBot: Branch commit for wmf/1.38.0-wmf.13 [core] (wmf/1.38.0-wmf.13) - 10https://gerrit.wikimedia.org/r/746984 [02:07:04] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/1.38.0-wmf.13 [core] (wmf/1.38.0-wmf.13) - 10https://gerrit.wikimedia.org/r/746984 (owner: 10TrainBranchBot) [02:13:56] (03Abandoned) 10Gergő Tisza: Revert "MinervaDonateLink is enabled in production" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/746916 (owner: 10Jdlrobson) [02:24:25] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 89 probes of 638 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [02:27:00] 10SRE, 10serviceops, 10Wikimedia-production-error: wtp* hosts: Out of memory (allocated 39845888) (tried to allocate 131072 bytes) in OutputHandler.php - https://phabricator.wikimedia.org/T297517 (10tstarling) I filed T297667 for the PHP bug which I'm working on. 
[02:27:07] (03Merged) 10jenkins-bot: Branch commit for wmf/1.38.0-wmf.13 [core] (wmf/1.38.0-wmf.13) - 10https://gerrit.wikimedia.org/r/746984 (owner: 10TrainBranchBot) [02:29:23] PROBLEM - cassandra-a service on aqs1014 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [02:29:31] PROBLEM - Check systemd state on aqs1014 is CRITICAL: CRITICAL - degraded: The following units failed: cassandra-a.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:30:15] PROBLEM - cassandra-a CQL 10.64.48.65:9042 on aqs1014 is CRITICAL: connect to address 10.64.48.65 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [02:30:23] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 56 probes of 638 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [02:33:01] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [02:33:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:34:06] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[02:34:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:38:15] RECOVERY - cassandra-a service on aqs1014 is OK: OK - cassandra-a is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [02:38:23] RECOVERY - Check systemd state on aqs1014 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:41:19] RECOVERY - cassandra-a CQL 10.64.48.65:9042 on aqs1014 is OK: TCP OK - 0.000 second response time on 10.64.48.65 port 9042 https://phabricator.wikimedia.org/T93886 [03:00:05] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211214T0300) [03:23:57] PROBLEM - SSH on rdb1006.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [04:13:03] (03CR) 10Juan90264: "This change is ready for review." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/746919 (https://phabricator.wikimedia.org/T297580) (owner: 10Juan90264) [04:13:40] (03PS4) 10Juan90264: Fix wordmark to outreachwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/746919 (https://phabricator.wikimedia.org/T297580) [04:22:51] PROBLEM - SSH on db2083.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [04:22:57] 10SRE, 10serviceops, 10Wikimedia-production-error: wtp* hosts: Out of memory (allocated 39845888) (tried to allocate 131072 bytes) in OutputHandler.php - https://phabricator.wikimedia.org/T297517 (10Ladsgroup) It indeed looks like wmf.12 has increased db traffic: https://grafana.wikimedia.org/d/000000278/mys... 
[04:23:18] (03PS1) 10Ladsgroup: Cache page properties in memory to avoid extra queries [extensions/DiscussionTools] (wmf/1.38.0-wmf.12) - 10https://gerrit.wikimedia.org/r/746920 (https://phabricator.wikimedia.org/T297132) [04:25:01] RECOVERY - SSH on rdb1006.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [04:28:38] (03PS2) 10Ladsgroup: Cache page properties in memory to avoid extra queries [extensions/DiscussionTools] (wmf/1.38.0-wmf.12) - 10https://gerrit.wikimedia.org/r/746920 (https://phabricator.wikimedia.org/T297132) [04:30:19] 10SRE, 10serviceops, 10Wikimedia-production-error: wtp* hosts: Out of memory (allocated 39845888) (tried to allocate 131072 bytes) in OutputHandler.php - https://phabricator.wikimedia.org/T297517 (10Ladsgroup) Created {T297669} for the database issue. [04:42:10] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) restart without plugin upgrade (3 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad rolling restart - ryankemper@cumin1001 - T297468 [04:42:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:02:40] (03CR) 10Ladsgroup: [C: 03+2] Cache page properties in memory to avoid extra queries [extensions/DiscussionTools] (wmf/1.38.0-wmf.12) - 10https://gerrit.wikimedia.org/r/746920 (https://phabricator.wikimedia.org/T297132) (owner: 10Ladsgroup) [05:07:14] (03Merged) 10jenkins-bot: Cache page properties in memory to avoid extra queries [extensions/DiscussionTools] (wmf/1.38.0-wmf.12) - 10https://gerrit.wikimedia.org/r/746920 (https://phabricator.wikimedia.org/T297132) (owner: 10Ladsgroup) [05:09:05] !log ladsgroup@deploy1002 Synchronized php-1.38.0-wmf.12/extensions/DiscussionTools/includes/Hooks/HookUtils.php: Backport: [[gerrit:746920|Cache page properties in memory to avoid extra queries (T297132 T297669)]] (duration: 00m 57s) [05:09:10] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:09:11] T297669: Noticeable increase in db load after wmf.12 roll out - https://phabricator.wikimedia.org/T297669 [05:09:12] T297132: DiscussionTools is making duplicate DB requests back to back - https://phabricator.wikimedia.org/T297132 [05:11:50] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [05:11:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:12:49] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [05:12:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:23:55] RECOVERY - SSH on db2083.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [05:25:35] (03PS1) 10Ladsgroup: blameStartupRegistry: Fix clash in $startupBytes variable name [extensions/WikimediaMaintenance] (wmf/1.38.0-wmf.13) - 10https://gerrit.wikimedia.org/r/746921 (https://phabricator.wikimedia.org/T295413) [05:25:55] (03CR) 10Ladsgroup: [C: 03+2] "Catch the train, doesn't seem to need syncing" [extensions/WikimediaMaintenance] (wmf/1.38.0-wmf.13) - 10https://gerrit.wikimedia.org/r/746921 (https://phabricator.wikimedia.org/T295413) (owner: 10Ladsgroup) [05:28:01] (03Merged) 10jenkins-bot: blameStartupRegistry: Fix clash in $startupBytes variable name [extensions/WikimediaMaintenance] (wmf/1.38.0-wmf.13) - 10https://gerrit.wikimedia.org/r/746921 (https://phabricator.wikimedia.org/T295413) (owner: 10Ladsgroup) [05:34:01] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [05:34:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:35:00] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[05:35:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:38:25] PROBLEM - cassandra-b CQL 10.64.0.120:9042 on aqs1010 is CRITICAL: connect to address 10.64.0.120 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [05:38:41] PROBLEM - cassandra-b service on aqs1010 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [05:38:55] PROBLEM - Check systemd state on aqs1010 is CRITICAL: CRITICAL - degraded: The following units failed: cassandra-b.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:49:23] (03CR) 10Kevin Bazira: [C: 03+1] "LGTM" [deployment-charts] - 10https://gerrit.wikimedia.org/r/746949 (https://phabricator.wikimedia.org/T293331) (owner: 10Accraze) [05:59:33] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on 15 hosts with reason: Maintenance [05:59:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:59:44] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on 15 hosts with reason: Maintenance [05:59:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:00:31] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [06:00:33] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [06:00:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:00:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:01:19] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on db1184.eqiad.wmnet with reason: Maintenance [06:01:21] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1184.eqiad.wmnet with 
reason: Maintenance [06:01:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:01:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:01:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1184 (T277354)', diff saved to https://phabricator.wikimedia.org/P18180 and previous config saved to /var/cache/conftool/dbconfig/20211214-060125-marostegui.json [06:01:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:01:30] T277354: "chemical" major mime type was never added to production database - https://phabricator.wikimedia.org/T277354 [06:03:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184 (T277354)', diff saved to https://phabricator.wikimedia.org/P18181 and previous config saved to /var/cache/conftool/dbconfig/20211214-060311-marostegui.json [06:03:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:05:17] RECOVERY - cassandra-b service on aqs1010 is OK: OK - cassandra-b is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [06:05:33] RECOVERY - Check systemd state on aqs1010 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:07:11] RECOVERY - cassandra-b CQL 10.64.0.120:9042 on aqs1010 is OK: TCP OK - 0.000 second response time on 10.64.0.120 port 9042 https://phabricator.wikimedia.org/T93886 [06:18:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184', diff saved to https://phabricator.wikimedia.org/P18182 and previous config saved to /var/cache/conftool/dbconfig/20211214-061816-marostegui.json [06:18:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:33:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184', diff saved to https://phabricator.wikimedia.org/P18183 and previous config saved to 
/var/cache/conftool/dbconfig/20211214-063321-marostegui.json [06:33:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:35:55] 10SRE, 10DBA, 10observability, 10Patch-For-Review, 10User-Ladsgroup: Send metrics of db errors of mediawiki to prometheus - https://phabricator.wikimedia.org/T297435 (10Marostegui) [06:48:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184 (T277354)', diff saved to https://phabricator.wikimedia.org/P18184 and previous config saved to /var/cache/conftool/dbconfig/20211214-064825-marostegui.json [06:48:27] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on db1169.eqiad.wmnet with reason: Maintenance [06:48:29] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1169.eqiad.wmnet with reason: Maintenance [06:48:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:48:31] T277354: "chemical" major mime type was never added to production database - https://phabricator.wikimedia.org/T277354 [06:48:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1169 (T277354)', diff saved to https://phabricator.wikimedia.org/P18185 and previous config saved to /var/cache/conftool/dbconfig/20211214-064833-marostegui.json [06:48:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:48:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:48:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:49:17] 10SRE, 10Infrastructure-Foundations, 10CAS-SSO: CAS should link to account creation tutorial - https://phabricator.wikimedia.org/T297524 (10Majavah) 05Open→03Resolved a:03jbond thanks! 
[06:50:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T277354)', diff saved to https://phabricator.wikimedia.org/P18186 and previous config saved to /var/cache/conftool/dbconfig/20211214-065019-marostegui.json [06:50:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:05:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P18187 and previous config saved to /var/cache/conftool/dbconfig/20211214-070524-marostegui.json [07:05:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:14:04] SRE, serviceops, Wikimedia-production-error: wtp* hosts: Out of memory (allocated 39845888) (tried to allocate 131072 bytes) in OutputHandler.php - https://phabricator.wikimedia.org/T297517 (Joe) >>! In T297517#7568203, @tstarling wrote: >>>! In T297517#7566856, @brennen wrote: >> We're currently on... [07:16:52] SRE, serviceops, Wikimedia-production-error: wtp* hosts: Out of memory (allocated 39845888) (tried to allocate 131072 bytes) in OutputHandler.php - https://phabricator.wikimedia.org/T297517 (Joe) >>! In T297517#7568208, @tstarling wrote: > The only thing unique to this report as compared to T296098 a... 
[07:20:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P18188 and previous config saved to /var/cache/conftool/dbconfig/20211214-072029-marostegui.json [07:20:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:20:53] (03CR) 10Elukey: [C: 03+2] ml-services: update revscoring-articlequality img [deployment-charts] - 10https://gerrit.wikimedia.org/r/746949 (https://phabricator.wikimedia.org/T293331) (owner: 10Accraze) [07:24:08] !log ryankemper@cumin2001 START - Cookbook sre.elasticsearch.rolling-operation restart without plugin upgrade (3 nodes at a time) for ElasticSearch cluster search_codfw: codfw rolling restart - ryankemper@cumin2001 - T297468 [07:24:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:24:57] !log T297468 `sudo cookbook sre.elasticsearch.rolling-operation search_codfw "codfw rolling restart" --nodes-per-run 3 --start-datetime 2021-12-14T01:27:58 --task-id T297468` on `ryankemper@cumin2001` tmux `elastic_restarts` [07:25:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:34:17] (03PS1) 10Giuseppe Lavagetto: mwdebug: switch to socket proxying [deployment-charts] - 10https://gerrit.wikimedia.org/r/747008 [07:35:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T277354)', diff saved to https://phabricator.wikimedia.org/P18189 and previous config saved to /var/cache/conftool/dbconfig/20211214-073534-marostegui.json [07:35:35] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on db1164.eqiad.wmnet with reason: Maintenance [07:35:37] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1164.eqiad.wmnet with reason: Maintenance [07:35:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:35:40] T277354: "chemical" major mime type was 
never added to production database - https://phabricator.wikimedia.org/T277354 [07:35:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1164 (T277354)', diff saved to https://phabricator.wikimedia.org/P18190 and previous config saved to /var/cache/conftool/dbconfig/20211214-073541-marostegui.json [07:35:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:35:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:35:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:35:55] PROBLEM - Check systemd state on elastic2046 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:37:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1164 (T277354)', diff saved to https://phabricator.wikimedia.org/P18191 and previous config saved to /var/cache/conftool/dbconfig/20211214-073727-marostegui.json [07:37:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:41:41] (03PS1) 10Marostegui: generate_dsns_table.sh: Remove [software] - 10https://gerrit.wikimedia.org/r/747009 [07:42:17] (03CR) 10Marostegui: [C: 03+2] generate_dsns_table.sh: Remove [software] - 10https://gerrit.wikimedia.org/r/747009 (owner: 10Marostegui) [07:42:47] (03Merged) 10jenkins-bot: generate_dsns_table.sh: Remove [software] - 10https://gerrit.wikimedia.org/r/747009 (owner: 10Marostegui) [07:45:00] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mwdebug: switch to socket proxying [deployment-charts] - 10https://gerrit.wikimedia.org/r/747008 (owner: 10Giuseppe Lavagetto) [07:48:19] (03Merged) 10jenkins-bot: mwdebug: switch to socket proxying [deployment-charts] - 10https://gerrit.wikimedia.org/r/747008 (owner: 10Giuseppe Lavagetto) [07:52:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1164', diff saved to 
https://phabricator.wikimedia.org/P18192 and previous config saved to /var/cache/conftool/dbconfig/20211214-075232-marostegui.json [07:52:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:55:47] PROBLEM - SSH on contint1001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [08:02:29] RECOVERY - Check systemd state on elastic2046 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:03:58] !log oblivian@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [08:04:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:07:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1164', diff saved to https://phabricator.wikimedia.org/P18193 and previous config saved to /var/cache/conftool/dbconfig/20211214-080736-marostegui.json [08:07:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:09:29] !log oblivian@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [08:09:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:13:12] SRE, Infrastructure-Foundations, Mail, Znuny, serviceops: OTRS/mail: investigate why "T=remote_smtp_signed: all hosts for 'ticket.wikimedia.org' have been failing for a long time" - https://phabricator.wikimedia.org/T297160 (akosiaris) Had a quick look at that. It is true that we never have r... 
[08:17:42] 10SRE, 10Epic: Migrate all of production metal and VMs to Buster or later - https://phabricator.wikimedia.org/T247045 (10Marostegui) [08:21:51] (03PS1) 10Ayounsi: Update netflow collector for codfw/eqdfw to netflow2002 [homer/public] - 10https://gerrit.wikimedia.org/r/747047 (https://phabricator.wikimedia.org/T297595) [08:22:17] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti2008.codfw.wmnet with OS buster [08:22:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:22:22] 10SRE, 10Infrastructure-Foundations: Migrate codfw Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296622 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti2008.codfw.wmnet with OS buster [08:22:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1164 (T277354)', diff saved to https://phabricator.wikimedia.org/P18194 and previous config saved to /var/cache/conftool/dbconfig/20211214-082241-marostegui.json [08:22:43] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on db1163.eqiad.wmnet with reason: Maintenance [08:22:45] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1163.eqiad.wmnet with reason: Maintenance [08:22:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:22:46] T277354: "chemical" major mime type was never added to production database - https://phabricator.wikimedia.org/T277354 [08:22:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1163 (T277354)', diff saved to https://phabricator.wikimedia.org/P18195 and previous config saved to /var/cache/conftool/dbconfig/20211214-082249-marostegui.json [08:22:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:22:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:22:56] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:23:20] (03CR) 10Ayounsi: [C: 03+2] Update netflow collector for codfw/eqdfw to netflow2002 [homer/public] - 10https://gerrit.wikimedia.org/r/747047 (https://phabricator.wikimedia.org/T297595) (owner: 10Ayounsi) [08:24:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1163 (T277354)', diff saved to https://phabricator.wikimedia.org/P18196 and previous config saved to /var/cache/conftool/dbconfig/20211214-082433-marostegui.json [08:24:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:27:12] (IcingaOverload) firing: Checks are taking long to execute on alert2001:9245 - https://grafana.wikimedia.org/d/rsCfQfuZz/icinga - https://alerts.wikimedia.org [08:29:16] !log ayounsi@cumin1001 START - Cookbook sre.ganeti.makevm for new host netflow3002.esams.wmnet [08:29:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:30:07] !log ayounsi@cumin1001 START - Cookbook sre.ganeti.makevm for new host netflow4002.ulsfo.wmnet [08:30:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:30:42] !log ayounsi@cumin1001 START - Cookbook sre.ganeti.makevm for new host netflow5002.eqsin.wmnet [08:30:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:31:00] (03PS1) 10Muehlenhoff: Update import hook to import logstash 6.8.21 [puppet] - 10https://gerrit.wikimedia.org/r/747048 [08:31:38] (03PS2) 10Muehlenhoff: Update import hook to import logstash 6.8.21 [puppet] - 10https://gerrit.wikimedia.org/r/747048 [08:32:12] (IcingaOverload) resolved: Checks are taking long to execute on alert2001:9245 - https://grafana.wikimedia.org/d/rsCfQfuZz/icinga - https://alerts.wikimedia.org [08:33:47] !log ayounsi@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host netflow4002.ulsfo.wmnet [08:33:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[08:33:58] !log ayounsi@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host netflow5002.eqsin.wmnet [08:33:59] !log restart blazegraph on wdqs1013 (jvm stuck for 5h) [08:34:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:34:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:35:09] !log ayounsi@cumin1001 START - Cookbook sre.ganeti.makevm for new host netflow4002.ulsfo.wmnet [08:35:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:37:37] PROBLEM - SSH on kubernetes1002.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [08:39:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1163', diff saved to https://phabricator.wikimedia.org/P18197 and previous config saved to /var/cache/conftool/dbconfig/20211214-083938-marostegui.json [08:39:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:43:02] !log ayounsi@cumin1001 START - Cookbook sre.ganeti.makevm for new host netflow5002.eqsin.wmnet [08:43:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:43:47] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host netflow3002.esams.wmnet [08:43:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:44:53] (03PS1) 10Kosta Harlan: WelcomeSurvey: Instrument interactions with form [extensions/GrowthExperiments] (wmf/1.38.0-wmf.13) - 10https://gerrit.wikimedia.org/r/746925 (https://phabricator.wikimedia.org/T267273) [08:45:12] 10SRE, 10SRE-Access-Requests: Requesting access to 'restricted' for komla - https://phabricator.wikimedia.org/T297621 (10komla) >>! 
In T297621#7567433, @Aklapper wrote: > Adding @komla as some data needs to be filled in above (user account registered on wikitech.wikimedia.org; separate SSH key; etc). This has... [08:48:07] (PS1) Ayounsi: Add new netflow hosts to Kafka jumbo ACL [puppet] - https://gerrit.wikimedia.org/r/747050 (https://phabricator.wikimedia.org/T297595) [08:49:10] (CR) Elukey: [C: +1] Add new netflow hosts to Kafka jumbo ACL [puppet] - https://gerrit.wikimedia.org/r/747050 (https://phabricator.wikimedia.org/T297595) (owner: Ayounsi) [08:49:13] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host netflow4002.ulsfo.wmnet [08:49:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:49:21] !log ayounsi@cumin1001 START - Cookbook sre.ganeti.makevm for new host netflow1002.eqiad.wmnet [08:49:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:49:52] (CR) Kosta Harlan: [C: +2] "backport" [extensions/GrowthExperiments] (wmf/1.38.0-wmf.13) - https://gerrit.wikimedia.org/r/746925 (https://phabricator.wikimedia.org/T267273) (owner: Kosta Harlan) [08:50:39] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti2008.codfw.wmnet with OS buster [08:50:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:50:44] SRE, Infrastructure-Foundations: Migrate codfw Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296622 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti2008.codfw.wmnet with OS buster completed: - ganeti2008 (**PASS**) - Downtimed on Icinga... 
[08:54:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1163', diff saved to https://phabricator.wikimedia.org/P18198 and previous config saved to /var/cache/conftool/dbconfig/20211214-085443-marostegui.json [08:54:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:54:47] !log failover Ganeti master to ganeti2016 T296622 [08:54:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:54:52] T296622: Migrate codfw Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296622 [08:55:38] (03CR) 10Ayounsi: [C: 03+2] "All are in DNS." [puppet] - 10https://gerrit.wikimedia.org/r/747050 (https://phabricator.wikimedia.org/T297595) (owner: 10Ayounsi) [08:56:08] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host netflow5002.eqsin.wmnet [08:56:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:57:17] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti2017.codfw.wmnet with OS buster [08:57:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:57:21] 10SRE, 10Infrastructure-Foundations: Migrate codfw Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296622 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti2017.codfw.wmnet with OS buster [08:57:25] PROBLEM - ganeti-wconfd running on ganeti2019 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 114 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti [08:57:45] ^ that's expected, icinga fallout of the master failover [09:00:57] PROBLEM - Check systemd state on elastic2037 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:04:00] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm 
(exit_code=0) for new host netflow1002.eqiad.wmnet [09:04:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:05:26] (03PS1) 10Ayounsi: Add DHCP for new netflow VMs [puppet] - 10https://gerrit.wikimedia.org/r/747052 (https://phabricator.wikimedia.org/T297595) [09:07:00] (03CR) 10Ayounsi: [C: 03+2] Add DHCP for new netflow VMs [puppet] - 10https://gerrit.wikimedia.org/r/747052 (https://phabricator.wikimedia.org/T297595) (owner: 10Ayounsi) [09:09:15] PROBLEM - HTTPS Ganeti RAPI codfw on ganeti2019 is CRITICAL: connect to address ganeti01.svc.codfw.wmnet and port 5080: No route to host https://www.mediawiki.org/wiki/Ganeti%23RAPI_daemon [09:09:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1163 (T277354)', diff saved to https://phabricator.wikimedia.org/P18199 and previous config saved to /var/cache/conftool/dbconfig/20211214-090948-marostegui.json [09:09:49] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on db1140.eqiad.wmnet with reason: Maintenance [09:09:51] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1140.eqiad.wmnet with reason: Maintenance [09:09:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:09:53] T277354: "chemical" major mime type was never added to production database - https://phabricator.wikimedia.org/T277354 [09:09:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:09:57] PROBLEM - Check unit status of netbox_ganeti_codfw_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [09:09:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:10:37] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on db1139.eqiad.wmnet with reason: Maintenance [09:10:39] !log marostegui@cumin1001 
END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1139.eqiad.wmnet with reason: Maintenance [09:10:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:10:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:10:49] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_codfw_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:11:24] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on db1135.eqiad.wmnet with reason: Maintenance [09:11:24] (03CR) 10jerkins-bot: [V: 04-1] WelcomeSurvey: Instrument interactions with form [extensions/GrowthExperiments] (wmf/1.38.0-wmf.13) - 10https://gerrit.wikimedia.org/r/746925 (https://phabricator.wikimedia.org/T267273) (owner: 10Kosta Harlan) [09:11:26] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1135.eqiad.wmnet with reason: Maintenance [09:11:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:11:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:11:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1135 (T277354)', diff saved to https://phabricator.wikimedia.org/P18200 and previous config saved to /var/cache/conftool/dbconfig/20211214-091130-marostegui.json [09:11:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:11:48] (03CR) 10Kosta Harlan: [C: 03+2] WelcomeSurvey: Instrument interactions with form [extensions/GrowthExperiments] (wmf/1.38.0-wmf.13) - 10https://gerrit.wikimedia.org/r/746925 (https://phabricator.wikimedia.org/T267273) (owner: 10Kosta Harlan) [09:13:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135 (T277354)', diff saved to https://phabricator.wikimedia.org/P18201 and previous config saved to 
/var/cache/conftool/dbconfig/20211214-091315-marostegui.json [09:13:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:14:27] (03CR) 10Filippo Giunchedi: [C: 03+1] logstash: use logstash-oss for gelf_relay [puppet] - 10https://gerrit.wikimedia.org/r/746971 (https://phabricator.wikimedia.org/T297468) (owner: 10Cwhite) [09:15:32] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32982/console" [puppet] - 10https://gerrit.wikimedia.org/r/746890 (https://phabricator.wikimedia.org/T271421) (owner: 10Vgutierrez) [09:15:34] !log ryankemper@cumin2001 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) restart without plugin upgrade (3 nodes at a time) for ElasticSearch cluster search_codfw: codfw rolling restart - ryankemper@cumin2001 - T297468 [09:15:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:17:52] (03CR) 10Filippo Giunchedi: [C: 03+1] maps: add stub values for tegola swift credentials [labs/private] - 10https://gerrit.wikimedia.org/r/746895 (https://phabricator.wikimedia.org/T292700) (owner: 10Hnowlan) [09:20:16] (03CR) 10Filippo Giunchedi: maps: write tegola credentials out to file (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/746897 (https://phabricator.wikimedia.org/T292700) (owner: 10Hnowlan) [09:21:03] (03CR) 10Vgutierrez: [C: 03+2] cache: Provide a Envoy upload role [puppet] - 10https://gerrit.wikimedia.org/r/745772 (https://phabricator.wikimedia.org/T271421) (owner: 10Vgutierrez) [09:25:16] (03CR) 10Filippo Giunchedi: [V: 03+1 C: 03+2] prometheus: pin discovery probes to their site [puppet] - 10https://gerrit.wikimedia.org/r/746881 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi) [09:27:18] RECOVERY - Check systemd state on elastic2037 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state 
[09:28:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135', diff saved to https://phabricator.wikimedia.org/P18202 and previous config saved to /var/cache/conftool/dbconfig/20211214-092820-marostegui.json [09:28:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:31:13] (03PS1) 10Kormat: tox.ini: Fix py3{7,8}-format [software/wmfdb] - 10https://gerrit.wikimedia.org/r/747054 (https://phabricator.wikimedia.org/T297616) [09:32:47] (03CR) 10Kormat: [C: 03+2] tox.ini: Fix py3{7,8}-format [software/wmfdb] - 10https://gerrit.wikimedia.org/r/747054 (https://phabricator.wikimedia.org/T297616) (owner: 10Kormat) [09:33:16] (03PS3) 10Kormat: wmfdb/section: Add class for handling of sections. [software/wmfdb] - 10https://gerrit.wikimedia.org/r/745249 [09:34:05] (03Merged) 10jenkins-bot: WelcomeSurvey: Instrument interactions with form [extensions/GrowthExperiments] (wmf/1.38.0-wmf.13) - 10https://gerrit.wikimedia.org/r/746925 (https://phabricator.wikimedia.org/T267273) (owner: 10Kosta Harlan) [13:17:56] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32990/console" [puppet] - 10https://gerrit.wikimedia.org/r/747108 (owner: 10Jbond) [13:18:43] (03PS2) 10Hashar: build: add mypy types [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/747104 [13:19:17] (03CR) 10Jbond: [V: 03+1 C: 03+2] P:pki::multirootca: Add addtional port to configuration [puppet] - 10https://gerrit.wikimedia.org/r/747108 (owner: 10Jbond) [13:19:31] (03CR) 10Hashar: "I have added types-requests and types-PyYAML as suggested by Volans :)" [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/747104 (owner: 10Hashar) [13:19:44] (03CR) 10Kormat: [C: 03+2] wmfdb/addr: Add addr.py to handle addresses. 
[software/wmfdb] - 10https://gerrit.wikimedia.org/r/745852 (owner: 10Kormat) [13:20:55] (03Merged) 10jenkins-bot: wmfdb/addr: Add addr.py to handle addresses. [software/wmfdb] - 10https://gerrit.wikimedia.org/r/745852 (owner: 10Kormat) [13:21:04] (03PS1) 10Muehlenhoff: Add library hint for libsamplerate [puppet] - 10https://gerrit.wikimedia.org/r/747110 [13:21:57] (03PS8) 10Kormat: wmfdb/cli_admin: Add db_mysql [software/wmfdb] - 10https://gerrit.wikimedia.org/r/745857 (https://phabricator.wikimedia.org/T297618) [13:24:18] (03PS1) 10Jgiannelos: kartographer: Enable tegola on jawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/747111 (https://phabricator.wikimedia.org/T280767) [13:25:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311', diff saved to https://phabricator.wikimedia.org/P18227 and previous config saved to /var/cache/conftool/dbconfig/20211214-132551-marostegui.json [13:25:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:12] (03PS1) 10Ladsgroup: Reuse the query result in addCategoryLinks instead of relying on cache [core] (wmf/1.38.0-wmf.12) - 10https://gerrit.wikimedia.org/r/747068 (https://phabricator.wikimedia.org/T297669) [13:31:31] (03PS1) 10Ladsgroup: Reuse the query result in addCategoryLinks instead of relying on cache [core] (wmf/1.38.0-wmf.13) - 10https://gerrit.wikimedia.org/r/747069 (https://phabricator.wikimedia.org/T297669) [13:32:12] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2025.codfw.wmnet [13:32:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:32:53] jouncebot: nowandnext [13:32:53] For the next 0 hour(s) and 27 minute(s): Pre MediaWiki train break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211214T1300) [13:32:54] In 0 hour(s) and 27 minute(s): MediaWiki train - Utc-0+Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211214T1400) 
[13:33:04] (03CR) 10Ladsgroup: [C: 03+2] Reuse the query result in addCategoryLinks instead of relying on cache [core] (wmf/1.38.0-wmf.12) - 10https://gerrit.wikimedia.org/r/747068 (https://phabricator.wikimedia.org/T297669) (owner: 10Ladsgroup) [13:33:08] (03CR) 10Ladsgroup: [C: 03+2] Reuse the query result in addCategoryLinks instead of relying on cache [core] (wmf/1.38.0-wmf.13) - 10https://gerrit.wikimedia.org/r/747069 (https://phabricator.wikimedia.org/T297669) (owner: 10Ladsgroup) [13:37:40] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2025.codfw.wmnet [13:37:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:38:13] (03CR) 10Jbond: [C: 03+1] sre.hosts.dhcp: add support for Ganeti hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/747099 (https://phabricator.wikimedia.org/T296832) (owner: 10Volans) [13:39:26] (03PS1) 10Jcrespo: mediabackup: Add an encryption key to store private file securely [puppet] - 10https://gerrit.wikimedia.org/r/747113 (https://phabricator.wikimedia.org/T262668) [13:39:50] (03CR) 10Jbond: [C: 03+1] "LGTM but unsure of the original issue" [puppet] - 10https://gerrit.wikimedia.org/r/747067 (https://phabricator.wikimedia.org/T263277) (owner: 10Ayounsi) [13:39:52] (03CR) 10Ayounsi: [C: 03+2] "Pmacct add sflow listener" try #2 [puppet] - 10https://gerrit.wikimedia.org/r/747067 (https://phabricator.wikimedia.org/T263277) (owner: 10Ayounsi) [13:40:40] (03CR) 10jerkins-bot: [V: 04-1] mediabackup: Add an encryption key to store private file securely [puppet] - 10https://gerrit.wikimedia.org/r/747113 (https://phabricator.wikimedia.org/T262668) (owner: 10Jcrespo) [13:40:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311', diff saved to https://phabricator.wikimedia.org/P18228 and previous config saved to /var/cache/conftool/dbconfig/20211214-134056-marostegui.json [13:41:00] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:55] (03PS2) 10Jcrespo: mediabackup: Add an encryption key to store private file securely [puppet] - 10https://gerrit.wikimedia.org/r/747113 (https://phabricator.wikimedia.org/T262668) [13:42:11] (03PS9) 10Kormat: wmfdb/cli_admin: Add db_mysql [software/wmfdb] - 10https://gerrit.wikimedia.org/r/745857 (https://phabricator.wikimedia.org/T297618) [13:42:30] (03CR) 10jerkins-bot: [V: 04-1] mediabackup: Add an encryption key to store private file securely [puppet] - 10https://gerrit.wikimedia.org/r/747113 (https://phabricator.wikimedia.org/T262668) (owner: 10Jcrespo) [13:43:03] PROBLEM - k8s API server requests latencies on ml-serve-ctrl1002 is CRITICAL: instance=10.64.48.64 verb=POST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [13:44:55] RECOVERY - k8s API server requests latencies on ml-serve-ctrl1002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [13:46:23] (03PS3) 10Jcrespo: mediabackup: Add an encryption key to store private file securely [puppet] - 10https://gerrit.wikimedia.org/r/747113 (https://phabricator.wikimedia.org/T262668) [13:47:32] jouncebot: nowandnext [13:47:32] For the next 0 hour(s) and 12 minute(s): Pre MediaWiki train break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211214T1300) [13:47:32] In 0 hour(s) and 12 minute(s): MediaWiki train - Utc-0+Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211214T1400) [13:47:40] aha [13:48:02] (03PS1) 10Jbond: WIP: add reposync [software/spicerack] - 10https://gerrit.wikimedia.org/r/747116 [13:51:05] !log depool cp4025 to be reimaged as cache::upload_envoy - T271421 [13:51:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:10] T271421: Test envoyproxy as a WMF's CDN TLS terminator with 
real traffic - https://phabricator.wikimedia.org/T271421 [13:51:44] (CR) Jcrespo: "I am thinking of installing age, but that is not available on buster only starting in bullseye: https://packages.debian.org/search?keyword" [puppet] - https://gerrit.wikimedia.org/r/747113 (https://phabricator.wikimedia.org/T262668) (owner: Jcrespo) [13:52:26] (CR) Volans: [C: +2] sre.hosts.dhcp: add support for Ganeti hosts [cookbooks] - https://gerrit.wikimedia.org/r/747099 (https://phabricator.wikimedia.org/T296832) (owner: Volans) [13:52:35] (PS3) Vgutierrez: site: Reimage cp4025 as cache::upload_envoy [puppet] - https://gerrit.wikimedia.org/r/746891 (https://phabricator.wikimedia.org/T271421) [13:53:43] (Merged) jenkins-bot: Reuse the query result in addCategoryLinks instead of relying on cache [core] (wmf/1.38.0-wmf.12) - https://gerrit.wikimedia.org/r/747068 (https://phabricator.wikimedia.org/T297669) (owner: Ladsgroup) [13:54:03] (PS4) Jcrespo: mediabackup: Add an encryption key to store private files securely [puppet] - https://gerrit.wikimedia.org/r/747113 (https://phabricator.wikimedia.org/T262668) [13:54:18] (CR) jerkins-bot: [V: -1] WIP: add reposync [software/spicerack] - https://gerrit.wikimedia.org/r/747116 (owner: Jbond) [13:54:53] SRE, Infrastructure-Foundations, Mail, Znuny, serviceops: OTRS/mail: investigate why "T=remote_smtp_signed: all hosts for 'ticket.wikimedia.org' have been failing for a long time" - https://phabricator.wikimedia.org/T297160 (akosiaris) p:Triage→Low Code found. https://github.com/znuny... 
[13:55:19] (03Merged) 10jenkins-bot: sre.hosts.dhcp: add support for Ganeti hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/747099 (https://phabricator.wikimedia.org/T296832) (owner: 10Volans) [13:55:37] !log Deployed patch for T297570 [13:55:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311 (T277354)', diff saved to https://phabricator.wikimedia.org/P18229 and previous config saved to /var/cache/conftool/dbconfig/20211214-135601-marostegui.json [13:56:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:07] T277354: "chemical" major mime type was never added to production database - https://phabricator.wikimedia.org/T277354 [13:56:17] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (Kanban): connect 2nd cloudcontrol200x-dev NIC to vlan 2105 - https://phabricator.wikimedia.org/T297588 (10Papaul) @aborrero are we doing trunk so i can assign this task to netops? [13:57:02] (03Merged) 10jenkins-bot: Reuse the query result in addCategoryLinks instead of relying on cache [core] (wmf/1.38.0-wmf.13) - 10https://gerrit.wikimedia.org/r/747069 (https://phabricator.wikimedia.org/T297669) (owner: 10Ladsgroup) [13:57:55] (I’m done) [13:58:08] (03PS10) 10Kormat: wmfdb/cli_admin: Add db_mysql [software/wmfdb] - 10https://gerrit.wikimedia.org/r/745857 (https://phabricator.wikimedia.org/T297618) [13:59:50] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [13:59:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:04] hashar and dancy: How many deployers does it take to do MediaWiki train - Utc-0+Utc-7 Version deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211214T1400). 
[14:00:32] (CR) Vgutierrez: [C: +2] site: Reimage cp4025 as cache::upload_envoy [puppet] - https://gerrit.wikimedia.org/r/746891 (https://phabricator.wikimedia.org/T271421) (owner: Vgutierrez) [14:01:01] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [14:01:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:37] SRE, ops-codfw, DC-Ops, Data-Persistence-Backup: (Need By: TBD) rack/setup/install backup2008 - https://phabricator.wikimedia.org/T294973 (Papaul) [14:02:48] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reimage for host cp4025.ulsfo.wmnet with OS buster [14:02:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:56] SRE, Traffic, Patch-For-Review: Test envoyproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T271421 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vgutierrez@cumin1001 for host cp4025.ulsfo.wmnet with OS buster [14:09:12] SRE, DBA, Sustainability (Incident Followup): Improve automatic query killer under high load - https://phabricator.wikimedia.org/T293532 (Marostegui) p:Triage→Medium [14:10:38] ops-eqiad, DC-Ops, Graphite: Upgrade firmware on graphite1004 if upgrade available. - https://phabricator.wikimedia.org/T297433 (Marostegui) [14:12:25] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [14:12:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:12:50] SRE, LDAP-Access-Requests: Grant Access to for - https://phabricator.wikimedia.org/T297652 (Marostegui) p:Triage→Medium [14:13:20] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[14:13:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:49] PROBLEM - Maps tiles generation on alert1001 is CRITICAL: CRITICAL: 90.28% of data under the critical threshold [5.0] https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1 [14:15:57] SRE, Data-Engineering, Infrastructure-Foundations, Traffic-Icebox, netops: Collect netflow data for internal traffic - https://phabricator.wikimedia.org/T263277 (ayounsi) [14:16:02] SRE, Infrastructure-Foundations, netops, Patch-For-Review: Upgrade netflow VMs to Bullseye - https://phabricator.wikimedia.org/T297595 (ayounsi) Open→Resolved a:ayounsi All done! [14:16:30] (CR) Kormat: [C: +2] wmfdb/cli_admin: Add db_mysql [software/wmfdb] - https://gerrit.wikimedia.org/r/745857 (https://phabricator.wikimedia.org/T297618) (owner: Kormat) [14:16:31] !log ladsgroup@deploy1002 Synchronized php-1.38.0-wmf.12/includes/OutputPage.php: Backport: [[gerrit:747068|Reuse the query result in addCategoryLinks instead of relying on cache (T297669)]] (duration: 00m 57s) [14:16:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:16:37] T297669: Noticeable increase in db load after wmf.12 roll out - https://phabricator.wikimedia.org/T297669 [14:17:11] (CR) Klausman: [C: +1] helmfile.d: add the istio pod security policy [deployment-charts] - https://gerrit.wikimedia.org/r/746880 (https://phabricator.wikimedia.org/T297612) (owner: Elukey) [14:17:46] (Merged) jenkins-bot: wmfdb/cli_admin: Add db_mysql [software/wmfdb] - https://gerrit.wikimedia.org/r/745857 (https://phabricator.wikimedia.org/T297618) (owner: Kormat) [14:19:07] SRE, Infrastructure-Foundations, Mail, observability, Sustainability (Incident Followup): large MX queues should page - https://phabricator.wikimedia.org/T297144 (Marostegui) p:Triage→Medium [14:19:37] 
SRE, Data-Engineering, Infrastructure-Foundations, Traffic-Icebox, netops: Collect netflow data for internal traffic - https://phabricator.wikimedia.org/T263277 (ayounsi) Tests are successful: I tested it by configuring sflow on the non-yet-prod asw1-b12-drmrs switch: `lang=diff [edit protoc... [14:20:12] SRE, ops-codfw, serviceops: Installation issues on PowerEdge R440 Kafka main codfw servers with buster / firmware update needed - https://phabricator.wikimedia.org/T297422 (Marostegui) p:Triage→Medium [14:20:43] RECOVERY - cassandra-b CQL 10.192.48.171:9042 on restbase2026 is OK: TCP OK - 0.033 second response time on 10.192.48.171 port 9042 https://phabricator.wikimedia.org/T93886 [14:22:13] RECOVERY - cassandra-c service on restbase2026 is OK: OK - cassandra-c is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:22:19] SRE, ops-codfw, DC-Ops, Data-Persistence-Backup: (Need By: TBD) rack/setup/install backup2008 - https://phabricator.wikimedia.org/T294973 (jcrespo) [14:22:21] RECOVERY - cassandra-c SSL 10.192.48.172:7001 on restbase2026 is OK: SSL OK - Certificate restbase2026-c valid until 2023-12-09 16:37:44 +0000 (expires in 725 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [14:24:20] (PS1) MVernon: admin: add elapps to ldap_only_users (T297652) [puppet] - https://gerrit.wikimedia.org/r/747120 [14:26:00] (PS1) Jcrespo: install_server: Add backup1008/backup2008 to partman [puppet] - https://gerrit.wikimedia.org/r/747123 (https://phabricator.wikimedia.org/T294973) [14:26:49] (PS2) Jcrespo: install_server: Add backup1008/backup2008 to partman [puppet] - https://gerrit.wikimedia.org/r/747123 (https://phabricator.wikimedia.org/T294973) [14:28:00] (CR) Jcrespo: [C: +2] install_server: Add backup1008/backup2008 to partman [puppet] - https://gerrit.wikimedia.org/r/747123 (https://phabricator.wikimedia.org/T294973) (owner: Jcrespo) 
[14:28:41] (PS1) JMeybohm: cert-manager: Allow ingress to webhook from k8s master and nodes [deployment-charts] - https://gerrit.wikimedia.org/r/747124 (https://phabricator.wikimedia.org/T294560) [14:29:36] (CR) Marostegui: "The change itself looks good. The user matches the ldap one." [puppet] - https://gerrit.wikimedia.org/r/747120 (owner: MVernon) [14:30:49] (PS2) MVernon: admin: add elapps to ldap_only_users [puppet] - https://gerrit.wikimedia.org/r/747120 (https://phabricator.wikimedia.org/T297652) [14:31:31] SRE, serviceops, Wikimedia-production-error: wtp* hosts: Out of memory (allocated 39845888) (tried to allocate 131072 bytes) in OutputHandler.php - https://phabricator.wikimedia.org/T297517 (hashar) The issue appeared with wmf.12 which is fully deployed now and it does not seem we will roll it back.... [14:32:14] RECOVERY - Check systemd state on aqs1011 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:33:08] (PS1) Vgutierrez: role::cache: Add missing upload_envoy.yaml [puppet] - https://gerrit.wikimedia.org/r/747127 (https://phabricator.wikimedia.org/T271421) [14:33:47] (CR) Vgutierrez: [C: +2] role::cache: Add missing upload_envoy.yaml [puppet] - https://gerrit.wikimedia.org/r/747127 (https://phabricator.wikimedia.org/T271421) (owner: Vgutierrez) [14:34:59] (CR) MVernon: admin: add elapps to ldap_only_users (1 comment) [puppet] - https://gerrit.wikimedia.org/r/747120 (https://phabricator.wikimedia.org/T297652) (owner: MVernon) [14:37:45] (CR) Marostegui: [C: +1] admin: add elapps to ldap_only_users [puppet] - https://gerrit.wikimedia.org/r/747120 (https://phabricator.wikimedia.org/T297652) (owner: MVernon) [14:38:23] (CR) MVernon: [C: +2] admin: add elapps to ldap_only_users [puppet] - https://gerrit.wikimedia.org/r/747120 (https://phabricator.wikimedia.org/T297652) (owner: MVernon) [14:38:40] RECOVERY - 
cassandra-b CQL 10.64.16.206:9042 on aqs1011 is OK: TCP OK - 0.000 second response time on 10.64.16.206 port 9042 https://phabricator.wikimedia.org/T93886 [14:40:00] SRE, serviceops, Wikimedia-production-error: wtp* hosts: Out of memory (allocated 39845888) (tried to allocate 131072 bytes) in OutputHandler.php - https://phabricator.wikimedia.org/T297517 (ssastry) >>! In T297517#7569567, @hashar wrote: > The issue appeared with wmf.12 which is fully deployed now a... [14:40:48] RECOVERY - cassandra-b service on aqs1011 is OK: OK - cassandra-b is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:42:09] SRE, LDAP-Access-Requests, Patch-For-Review: Grant Access to for - https://phabricator.wikimedia.org/T297652 (MatthewVernon) Open→Resolved a:MatthewVernon Hi, This is now done. Thanks, Matthew [14:43:12] (PS1) Herron: mx: make exim queue alert paging [puppet] - https://gerrit.wikimedia.org/r/747128 (https://phabricator.wikimedia.org/T297144) [14:47:28] !log bblack@cumin1001 START - Cookbook sre.hosts.reimage for host lvs1018.eqiad.wmnet with OS buster [14:47:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:33] SRE, ops-eqiad, DC-Ops, Traffic-Icebox: Q2:(Need By: ASAP) rack/setup/install lvs10[17-20] - https://phabricator.wikimedia.org/T295804 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bblack@cumin1001 for host lvs1018.eqiad.wmnet with OS buster [14:47:40] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=envoy site=ulsfo https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:49:54] !log bblack@cumin1001 START - Cookbook sre.hosts.reimage for host lvs1019.eqiad.wmnet with OS buster [14:49:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:58] SRE, ops-eqiad, DC-Ops, Traffic-Icebox: 
Q2:(Need By: ASAP) rack/setup/install lvs10[17-20] - https://phabricator.wikimedia.org/T295804 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bblack@cumin1001 for host lvs1019.eqiad.wmnet with OS buster [14:50:49] (CR) Hnowlan: [V: +2 C: +2] maps: add stub values for tegola swift credentials [labs/private] - https://gerrit.wikimedia.org/r/746895 (https://phabricator.wikimedia.org/T292700) (owner: Hnowlan) [14:50:55] I am going to sync mediawiki wmf.13 code to the cluster but without promoting any wikis to it [14:50:58] cause of some blockers [14:51:06] but at least the code will be around [14:52:10] !log hashar@deploy1002 Started scap: Push wmf.13 without promoting any wikis [14:52:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:54] SRE, serviceops, Wikimedia-production-error: wtp* hosts: Out of memory (allocated 39845888) (tried to allocate 131072 bytes) in OutputHandler.php - https://phabricator.wikimedia.org/T297517 (Ladsgroup) So I have been working on this on several fronts (with Daniel and Tim). The [[https://gerrit.wikime... [14:56:06] (PS1) Vgutierrez: cache::envoy: Fix ocsp systemd config file content [puppet] - https://gerrit.wikimedia.org/r/747130 (https://phabricator.wikimedia.org/T271421) [14:57:33] (CR) Vgutierrez: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32991/console" [puppet] - https://gerrit.wikimedia.org/r/747130 (https://phabricator.wikimedia.org/T271421) (owner: Vgutierrez) [14:58:13] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:59:01] (CR) Vgutierrez: [V: +1 C: +2] cache::envoy: Fix ocsp systemd config file content [puppet] - https://gerrit.wikimedia.org/r/747130 (https://phabricator.wikimedia.org/T271421) (owner: Vgutierrez) [14:59:54] hnowlan: you got a commit pending to be merged [15:01:42] vgutierrez: oops, on labs-private? that's safe to merge [15:01:49] indeed [15:01:55] sorry about that [15:02:03] merging [15:02:05] (done) [15:04:20] (PS1) Kormat: wmfdb/addr: If _dc_map doesn't find a dc ID, use DNS. [software/wmfdb] - https://gerrit.wikimedia.org/r/747132 [15:06:59] PROBLEM - aqs endpoints health on aqs1012 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) is CRITICAL: Test Get per article page views returned the unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/pageviews/top/{project}/{access}/{year}/{month}/{day} (Get top page views) is CRITICAL: Test Get top page views retur [15:06:59] unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/mediarequests/per-file/{referer}/{agent}/{file_path}/{granularity}/{start}/{end} (Get per file requests) is CRITICAL: Test Get per file requests returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [15:07:17] PROBLEM - aqs endpoints health on aqs1015 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) is CRITICAL: Test Get per article page views returned the unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/pageviews/top/{project}/{access}/{year}/{month}/{day} (Get top page views) is CRITICAL: Test Get top page views retur [15:07:18] unexpected status 500 (expecting: 200): 
/analytics.wikimedia.org/v1/mediarequests/per-file/{referer}/{agent}/{file_path}/{granularity}/{start}/{end} (Get per file requests) is CRITICAL: Test Get per file requests returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [15:07:23] (CR) Kormat: [C: +2] wmfdb/addr: If _dc_map doesn't find a dc ID, use DNS. [software/wmfdb] - https://gerrit.wikimedia.org/r/747132 (owner: Kormat) [15:08:17] PROBLEM - Apache HTTP on wtp1046 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [15:08:34] (Merged) jenkins-bot: wmfdb/addr: If _dc_map doesn't find a dc ID, use DNS. [software/wmfdb] - https://gerrit.wikimedia.org/r/747132 (owner: Kormat) [15:08:48] the aqs endpoints are the new cluster being currently worked on by Data Engineer (no user traffic) [15:09:05] RECOVERY - Ensure hosts are not performing a change on every puppet run on cumin2002 is OK: OK: all nodes running as expected https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes [15:09:27] PROBLEM - Apache HTTP on wtp1031 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [15:09:31] !log vgutierrez@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=1) for host cp4025.ulsfo.wmnet with OS buster [15:09:33] PROBLEM - Apache HTTP on wtp1032 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [15:09:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:09:39] SRE, Traffic, Patch-For-Review: Test envoyproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T271421 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vgutierrez@cumin1001 for host cp4025.ulsfo.wmnet with OS buster completed: - cp4025 (**FAIL*... 
[15:09:41] PROBLEM - aqs endpoints health on aqs1011 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) is CRITICAL: Test Get per article page views returned the unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/pageviews/top/{project}/{access}/{year}/{month}/{day} (Get top page views) is CRITICAL: Test Get top page views retur [15:09:41] unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/mediarequests/per-file/{referer}/{agent}/{file_path}/{granularity}/{start}/{end} (Get per file requests) is CRITICAL: Test Get per file requests returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [15:09:43] SRE, Traffic, Patch-For-Review: Test envoyproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T271421 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vgutierrez@cumin1001 for host cp4025.ulsfo.wmnet with OS buster executed with errors: - cp40... [15:09:56] These aqs alerts are to do with me. 
[15:10:23] RECOVERY - Apache HTTP on wtp1031 is OK: HTTP OK: HTTP/1.1 302 Found - 635 bytes in 0.046 second response time https://wikitech.wikimedia.org/wiki/Application_servers [15:10:25] RECOVERY - Apache HTTP on wtp1046 is OK: HTTP OK: HTTP/1.1 302 Found - 635 bytes in 0.039 second response time https://wikitech.wikimedia.org/wiki/Application_servers [15:10:31] RECOVERY - Apache HTTP on wtp1032 is OK: HTTP OK: HTTP/1.1 302 Found - 635 bytes in 0.055 second response time https://wikitech.wikimedia.org/wiki/Application_servers [15:10:41] RECOVERY - aqs endpoints health on aqs1015 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [15:10:49] RECOVERY - aqs endpoints health on aqs1011 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [15:11:33] RECOVERY - aqs endpoints health on aqs1012 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [15:12:07] (CR) MSantos: [C: +1] kartographer: Enable tegola on jawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/747111 (https://phabricator.wikimedia.org/T280767) (owner: Jgiannelos) [15:13:58] !log bblack@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host lvs1018.eqiad.wmnet with OS buster [15:14:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:14:03] SRE, ops-eqiad, DC-Ops, Traffic-Icebox: Q2:(Need By: ASAP) rack/setup/install lvs10[17-20] - https://phabricator.wikimedia.org/T295804 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bblack@cumin1001 for host lvs1018.eqiad.wmnet with OS buster completed: - lvs1018 (**PASS**)... 
[15:15:21] !log bblack@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host lvs1019.eqiad.wmnet with OS buster [15:15:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:26] SRE, ops-eqiad, DC-Ops, Traffic-Icebox: Q2:(Need By: ASAP) rack/setup/install lvs10[17-20] - https://phabricator.wikimedia.org/T295804 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bblack@cumin1001 for host lvs1019.eqiad.wmnet with OS buster completed: - lvs1019 (**PASS**)... [15:21:06] (PS1) Ladsgroup: cache: Add four fields to LinkCache::getSelectFields [core] (wmf/1.38.0-wmf.13) - https://gerrit.wikimedia.org/r/747072 (https://phabricator.wikimedia.org/T297669) [15:21:41] !log hashar@deploy1002 Finished scap: Push wmf.13 without promoting any wikis (duration: 29m 31s) [15:21:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:22:28] (PS1) Ladsgroup: cache: Add four fields to LinkCache::getSelectFields [core] (wmf/1.38.0-wmf.12) - https://gerrit.wikimedia.org/r/747073 (https://phabricator.wikimedia.org/T297669) [15:22:32] (CR) Ladsgroup: [C: +2] cache: Add four fields to LinkCache::getSelectFields [core] (wmf/1.38.0-wmf.13) - https://gerrit.wikimedia.org/r/747072 (https://phabricator.wikimedia.org/T297669) (owner: Ladsgroup) [15:22:35] (CR) Ladsgroup: [C: +2] cache: Add four fields to LinkCache::getSelectFields [core] (wmf/1.38.0-wmf.12) - https://gerrit.wikimedia.org/r/747073 (https://phabricator.wikimedia.org/T297669) (owner: Ladsgroup) [15:25:08] (PS2) Jelto: Rakefile: remove helm2 from Rakefile, bump scaffold to v2 api [deployment-charts] - https://gerrit.wikimedia.org/r/746864 (https://phabricator.wikimedia.org/T251305) [15:25:34] (CR) jerkins-bot: [V: -1] Rakefile: remove helm2 from Rakefile, bump scaffold to v2 api [deployment-charts] - https://gerrit.wikimedia.org/r/746864 (https://phabricator.wikimedia.org/T251305) (owner: 
Jelto) [15:28:45] (CR) Giuseppe Lavagetto: [C: +1] imagecatalog: Install and configure OCI image catalog on deploy hosts [puppet] - https://gerrit.wikimedia.org/r/742574 (https://phabricator.wikimedia.org/T287130) (owner: RLazarus) [15:30:38] (CR) Giuseppe Lavagetto: [C: +1] "Nice, but you should add the user to all clusters." [puppet] - https://gerrit.wikimedia.org/r/745202 (https://phabricator.wikimedia.org/T287130) (owner: JMeybohm) [15:31:33] SRE, Prod-Kubernetes, Kubernetes: Helm chart dependencies no longer in requirements.yaml - https://phabricator.wikimedia.org/T295750 (MatthewVernon) [15:31:34] (CR) Giuseppe Lavagetto: [C: +2] mediawiki: add the ability to inject php files for debugging [deployment-charts] - https://gerrit.wikimedia.org/r/747101 (https://phabricator.wikimedia.org/T297613) (owner: Giuseppe Lavagetto) [15:31:55] SRE-tools, Infrastructure-Foundations, netops, serviceops: Support services VIPs with not marked as VIP in Netbox - https://phabricator.wikimedia.org/T295793 (MatthewVernon) [15:35:13] (Merged) jenkins-bot: mediawiki: add the ability to inject php files for debugging [deployment-charts] - https://gerrit.wikimedia.org/r/747101 (https://phabricator.wikimedia.org/T297613) (owner: Giuseppe Lavagetto) [15:35:42] (PS1) Jgiannelos: tegola-vector-tiles: Disable pregeneration on eqiad [deployment-charts] - https://gerrit.wikimedia.org/r/747136 (https://phabricator.wikimedia.org/T280767) [15:39:35] (PS1) Vgutierrez: cache::envoy: Strip [] from X-Client-IP [puppet] - https://gerrit.wikimedia.org/r/747137 (https://phabricator.wikimedia.org/T271421) [15:41:46] Puppet, Infrastructure-Foundations: Role hieradata for non-existent roles - https://phabricator.wikimedia.org/T296533 (MatthewVernon) [15:42:12] !log oblivian@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[15:42:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:43:26] SRE, DBA, Platform Engineering, Sustainability (Incident Followup): Set max execution time for several expensive mediawiki actions - https://phabricator.wikimedia.org/T297708 (Ladsgroup) [15:43:32] (Merged) jenkins-bot: cache: Add four fields to LinkCache::getSelectFields [core] (wmf/1.38.0-wmf.13) - https://gerrit.wikimedia.org/r/747072 (https://phabricator.wikimedia.org/T297669) (owner: Ladsgroup) [15:44:04] (Merged) jenkins-bot: cache: Add four fields to LinkCache::getSelectFields [core] (wmf/1.38.0-wmf.12) - https://gerrit.wikimedia.org/r/747073 (https://phabricator.wikimedia.org/T297669) (owner: Ladsgroup) [15:44:18] (PS1) Filippo Giunchedi: prometheus: add mini-textfile-exporter [puppet] - https://gerrit.wikimedia.org/r/747139 (https://phabricator.wikimedia.org/T291946) [15:44:20] (PS1) Filippo Giunchedi: prometheus: export service catalog metrics [puppet] - https://gerrit.wikimedia.org/r/747140 (https://phabricator.wikimedia.org/T291946) [15:45:36] (CR) jerkins-bot: [V: -1] prometheus: export service catalog metrics [puppet] - https://gerrit.wikimedia.org/r/747140 (https://phabricator.wikimedia.org/T291946) (owner: Filippo Giunchedi) [15:46:09] (CR) jerkins-bot: [V: -1] prometheus: add mini-textfile-exporter [puppet] - https://gerrit.wikimedia.org/r/747139 (https://phabricator.wikimedia.org/T291946) (owner: Filippo Giunchedi) [15:46:16] (CR) JMeybohm: [C: +2] cert-manager: Allow ingress to webhook from k8s master and nodes [deployment-charts] - https://gerrit.wikimedia.org/r/747124 (https://phabricator.wikimedia.org/T294560) (owner: JMeybohm) [15:47:57] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on ganeti2018.codfw.wmnet with reason: Temporarily remove node from Ganeti for reimage [15:48:00] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) 
for 1 day, 0:00:00 on ganeti2018.codfw.wmnet with reason: Temporarily remove node from Ganeti for reimage [15:48:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:48:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:48:49] (PS6) Elukey: knative-serving: add support for istio egress gateway [deployment-charts] - https://gerrit.wikimedia.org/r/745555 [15:49:00] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host aqs1010.eqiad.wmnet [15:49:02] !log btullis@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host aqs1010.eqiad.wmnet [15:49:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:49:03] SRE, ops-codfw: Installation issues on PowerEdge R440 Ganeti servers with buster / firmware update needed - https://phabricator.wikimedia.org/T296856 (MoritzMuehlenhoff) [15:49:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:49:10] SRE, ops-codfw: Installation issues on PowerEdge R440 Ganeti servers with buster / firmware update needed - https://phabricator.wikimedia.org/T296856 (MoritzMuehlenhoff) One more; ganeti2018. Ready to be powered off any time. [15:49:43] (Merged) jenkins-bot: cert-manager: Allow ingress to webhook from k8s master and nodes [deployment-charts] - https://gerrit.wikimedia.org/r/747124 (https://phabricator.wikimedia.org/T294560) (owner: JMeybohm) [15:50:27] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[15:50:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:50:58] !log drain primary/secondary instances off ganeti2023 T296622 [15:51:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:51:03] T296622: Migrate codfw Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296622 [15:51:35] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [15:51:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:52:07] SRE, SRE-OnFire, Wikimedia-Incident: Incident: 2021-12-03 mx2001->Gmail delivery issues - https://phabricator.wikimedia.org/T297127 (herron) [15:52:40] (PS2) Filippo Giunchedi: prometheus: don't probe services not deployed in the current site [puppet] - https://gerrit.wikimedia.org/r/747055 (https://phabricator.wikimedia.org/T291946) [15:52:42] (PS2) Filippo Giunchedi: prometheus: add mini-textfile-exporter [puppet] - https://gerrit.wikimedia.org/r/747139 (https://phabricator.wikimedia.org/T291946) [15:52:44] (PS2) Filippo Giunchedi: prometheus: export service catalog metrics [puppet] - https://gerrit.wikimedia.org/r/747140 (https://phabricator.wikimedia.org/T291946) [15:53:01] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host aqs1010.eqiad.wmnet [15:53:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:54:00] !log ladsgroup@deploy1002 Synchronized php-1.38.0-wmf.12/includes/cache/LinkCache.php: Backport: [[gerrit:747073|cache: Add four fields to LinkCache::getSelectFields (T297669)]] (duration: 00m 57s) [15:54:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:54:05] T297669: Noticeable increase in db load after wmf.12 roll out - https://phabricator.wikimedia.org/T297669 [15:54:27] (CR) jerkins-bot: [V: -1] prometheus: export service catalog metrics [puppet] - 
https://gerrit.wikimedia.org/r/747140 (https://phabricator.wikimedia.org/T291946) (owner: Filippo Giunchedi) [15:56:55] (CR) Jbond: [C: +1] "LGTM some minor nits (and will also need to copy the package)" [puppet] - https://gerrit.wikimedia.org/r/747113 (https://phabricator.wikimedia.org/T262668) (owner: Jcrespo) [15:58:55] (CR) Filippo Giunchedi: "recheck" [puppet] - https://gerrit.wikimedia.org/r/747140 (https://phabricator.wikimedia.org/T291946) (owner: Filippo Giunchedi) [15:58:58] SRE, serviceops, Wikimedia-production-error: wtp* hosts: Out of memory (allocated 39845888) (tried to allocate 131072 bytes) in OutputHandler.php - https://phabricator.wikimedia.org/T297517 (Ladsgroup) I need to go to a meeting but after that, I'll run a rolling restart [15:59:04] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host aqs1010.eqiad.wmnet [15:59:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:59:14] (PS7) Elukey: knative-serving: add support for istio egress gateway [deployment-charts] - https://gerrit.wikimedia.org/r/745555 [16:00:02] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host aqs1011.eqiad.wmnet [16:00:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:00:06] (CR) Cwhite: [C: +1] "Thanks!" [puppet] - https://gerrit.wikimedia.org/r/747048 (owner: Muehlenhoff) [16:00:59] (PS8) Elukey: knative-serving: add support for istio egress gateway [deployment-charts] - https://gerrit.wikimedia.org/r/745555 [16:01:03] (CR) Filippo Giunchedi: [C: +2] prometheus: don't probe services not deployed in the current site [puppet] - https://gerrit.wikimedia.org/r/747055 (https://phabricator.wikimedia.org/T291946) (owner: Filippo Giunchedi) [16:02:35] (CR) Filippo Giunchedi: "The idea here is to be able to 'target' only production services (e.g. 
for paging purposes) with an expression like the following:" [puppet] - https://gerrit.wikimedia.org/r/747140 (https://phabricator.wikimedia.org/T291946) (owner: Filippo Giunchedi) [16:02:56] (CR) Jbond: [C: +1] mx: make exim queue alert paging [puppet] - https://gerrit.wikimedia.org/r/747128 (https://phabricator.wikimedia.org/T297144) (owner: Herron) [16:04:00] (CR) Dzahn: "It seems uncontroversial that we want it to page. Just the actual threshold was "yet to be determined" per the original comment. +0.5" [puppet] - https://gerrit.wikimedia.org/r/747128 (https://phabricator.wikimedia.org/T297144) (owner: Herron) [16:05:23] (PS1) Vgutierrez: envoyproxy: Allow disabling x-request-id generation [puppet] - https://gerrit.wikimedia.org/r/747150 (https://phabricator.wikimedia.org/T271421) [16:05:25] (PS1) Vgutierrez: cache::envoy: Disable x-request-id generation [puppet] - https://gerrit.wikimedia.org/r/747151 (https://phabricator.wikimedia.org/T271421) [16:08:04] (CR) Vgutierrez: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32992/console" [puppet] - https://gerrit.wikimedia.org/r/747150 (https://phabricator.wikimedia.org/T271421) (owner: Vgutierrez) [16:08:14] (CR) Vgutierrez: [C: +2] cache::envoy: Strip [] from X-Client-IP [puppet] - https://gerrit.wikimedia.org/r/747137 (https://phabricator.wikimedia.org/T271421) (owner: Vgutierrez) [16:08:20] (CR) Jcrespo: "Thank you!" 
[puppet] - https://gerrit.wikimedia.org/r/747113 (https://phabricator.wikimedia.org/T262668) (owner: Jcrespo) [16:09:41] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=PATCH https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [16:10:15] jouncebot now [16:10:15] No deployments scheduled for the next 0 hour(s) and 49 minute(s) [16:11:13] PROBLEM - Check systemd state on wtp1034 is CRITICAL: CRITICAL - degraded: The following units failed: cadvisor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:11:45] OOM errors hitting wtp* hosts again (T297517) [16:11:45] T297517: wtp* hosts: Out of memory (allocated 39845888) (tried to allocate 131072 bytes) in OutputHandler.php - https://phabricator.wikimedia.org/T297517 [16:11:51] RECOVERY - k8s API server requests latencies on kubestagemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [16:12:27] (PS1) Elukey: helmfile.d: Add Istio Egress config for ml-serve clusters [deployment-charts] - https://gerrit.wikimedia.org/r/747153 (https://phabricator.wikimedia.org/T294414) [16:15:10] (PS1) Volans: WIP [software/spicerack] - https://gerrit.wikimedia.org/r/747155 [16:15:47] (PS1) Elukey: helmfile.d: Configure all ml-services to use the Istio egress gw [deployment-charts] - https://gerrit.wikimedia.org/r/747156 (https://phabricator.wikimedia.org/T294414) [16:17:03] (CR) Giuseppe Lavagetto: [C: +1] "As i said, this is for fairness of tests rather than actually a desirable result - generating request IDs should happen somewhere at the e" [puppet] - https://gerrit.wikimedia.org/r/747150 (https://phabricator.wikimedia.org/T271421) (owner: Vgutierrez) [16:17:42] (PS2) Volans: remote: wait_reboot_since, intercept bad uptimes 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/747155 [16:19:33] (03PS1) 10Volans: spicerack.redfish: add support for Redfish API [software/spicerack] - 10https://gerrit.wikimedia.org/r/747157 (https://phabricator.wikimedia.org/T271583) [16:19:45] RECOVERY - Check systemd state on wtp1034 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:19:49] (03PS5) 10Jcrespo: mediabackup: Add an encryption key to store private files securely [puppet] - 10https://gerrit.wikimedia.org/r/747113 (https://phabricator.wikimedia.org/T262668) [16:19:58] (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] envoyproxy: Allow disabling x-request-id generation [puppet] - 10https://gerrit.wikimedia.org/r/747150 (https://phabricator.wikimedia.org/T271421) (owner: 10Vgutierrez) [16:20:03] (03CR) 10Jcrespo: "done" [puppet] - 10https://gerrit.wikimedia.org/r/747113 (https://phabricator.wikimedia.org/T262668) (owner: 10Jcrespo) [16:20:39] PROBLEM - cassandra-a CQL 10.64.16.204:9042 on aqs1011 is CRITICAL: connect to address 10.64.16.204 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [16:20:50] !log accraze@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . 
[16:20:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:21:05] PROBLEM - cassandra-b CQL 10.64.16.206:9042 on aqs1011 is CRITICAL: connect to address 10.64.16.206 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [16:21:10] !log jhathaway@cumin1001 START - Cookbook sre.hosts.reimage for host mirror1001.wikimedia.org with OS bullseye [16:21:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:21:17] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Setup new mirror server (copernicium.wikimedia.org) - https://phabricator.wikimedia.org/T286898 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhathaway@cumin1001 for host mirror1001.wikimedia.org with OS bullseye [16:21:18] !log jhathaway@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host mirror1001.wikimedia.org with OS bullseye [16:21:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:21:29] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Setup new mirror server (copernicium.wikimedia.org) - https://phabricator.wikimedia.org/T286898 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhathaway@cumin1001 for host mirror1001.wikimedia.org with OS bullseye executed with... 
[16:21:56] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32993/console" [puppet] - 10https://gerrit.wikimedia.org/r/747151 (https://phabricator.wikimedia.org/T271421) (owner: 10Vgutierrez) [16:22:01] (03CR) 10Muehlenhoff: [C: 03+2] Update import hook to import logstash 6.8.21 [puppet] - 10https://gerrit.wikimedia.org/r/747048 (owner: 10Muehlenhoff) [16:22:28] (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] cache::envoy: Disable x-request-id generation [puppet] - 10https://gerrit.wikimedia.org/r/747151 (https://phabricator.wikimedia.org/T271421) (owner: 10Vgutierrez) [16:23:47] (03CR) 10jerkins-bot: [V: 04-1] remote: wait_reboot_since, intercept bad uptimes [software/spicerack] - 10https://gerrit.wikimedia.org/r/747155 (owner: 10Volans) [16:24:33] (03CR) 10Jbond: [C: 03+1] mediabackup: Add an encryption key to store private files securely (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/747113 (https://phabricator.wikimedia.org/T262668) (owner: 10Jcrespo) [16:24:43] !log jhathaway@cumin1001 START - Cookbook sre.hosts.reimage for host mirror1001.wikimedia.org with OS bullseye [16:24:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:24:50] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Setup new mirror server (copernicium.wikimedia.org) - https://phabricator.wikimedia.org/T286898 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhathaway@cumin1001 for host mirror1001.wikimedia.org with OS bullseye [16:24:51] !log jhathaway@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host mirror1001.wikimedia.org with OS bullseye [16:24:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:25:01] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Setup new mirror server (copernicium.wikimedia.org) - https://phabricator.wikimedia.org/T286898 (10ops-monitoring-bot) Cookbook 
cookbooks.sre.hosts.reimage started by jhathaway@cumin1001 for host mirror1001.wikimedia.org with OS bullseye executed with... [16:25:23] (03CR) 10jerkins-bot: [V: 04-1] spicerack.redfish: add support for Redfish API [software/spicerack] - 10https://gerrit.wikimedia.org/r/747157 (https://phabricator.wikimedia.org/T271583) (owner: 10Volans) [16:28:14] (03PS1) 10Jcrespo: mediabackup: Add dummy age private key for mediabackups [labs/private] - 10https://gerrit.wikimedia.org/r/747160 (https://phabricator.wikimedia.org/T262668) [16:28:25] (03PS2) 10Jcrespo: mediabackup: Add dummy age private key for mediabackups [labs/private] - 10https://gerrit.wikimedia.org/r/747160 (https://phabricator.wikimedia.org/T262668) [16:28:38] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/747113 (https://phabricator.wikimedia.org/T262668) (owner: 10Jcrespo) [16:30:45] !log jhathaway@cumin1001 START - Cookbook sre.hosts.reimage for host mirror1001.wikimedia.org with OS bullseye [16:30:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:30:51] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Setup new mirror server (copernicium.wikimedia.org) - https://phabricator.wikimedia.org/T286898 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhathaway@cumin1001 for host mirror1001.wikimedia.org with OS bullseye [16:30:52] !log jhathaway@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host mirror1001.wikimedia.org with OS bullseye [16:30:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:31:01] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Setup new mirror server (copernicium.wikimedia.org) - https://phabricator.wikimedia.org/T286898 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhathaway@cumin1001 for host mirror1001.wikimedia.org with OS bullseye executed with... 
[16:32:28] PROBLEM - cassandra-a service on aqs1011 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [16:32:49] (03PS3) 10Jcrespo: mediabackup: Add dummy age private key for mediabackups [labs/private] - 10https://gerrit.wikimedia.org/r/747160 (https://phabricator.wikimedia.org/T262668) [16:33:08] !log jhathaway@cumin1001 START - Cookbook sre.hosts.reimage for host mirror1001.wikimedia.org with OS bullseye [16:33:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:33:14] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Setup new mirror server (copernicium.wikimedia.org) - https://phabricator.wikimedia.org/T286898 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhathaway@cumin1001 for host mirror1001.wikimedia.org with OS bullseye [16:33:23] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic-Icebox: Q2:(Need By: ASAP) rack/setup/install lvs10[17-20] - https://phabricator.wikimedia.org/T295804 (10Cmjohnson) [16:34:50] PROBLEM - cassandra-b service on aqs1011 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [16:36:34] 10SRE, 10ops-eqiad, 10DC-Ops: eqiad: Master Tracking Ticket for eqiad expansion cage - https://phabricator.wikimedia.org/T296966 (10Cmjohnson) [16:36:47] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic-Icebox: Q2:(Need By: ASAP) rack/setup/install lvs10[17-20] - https://phabricator.wikimedia.org/T295804 (10Cmjohnson) 05Open→03Resolved The servers are finished with rack and initial setup, cross row connections should be handled in a separate task. [16:37:52] (03CR) 10Jcrespo: "Will deploy https://gerrit.wikimedia.org/r/c/labs/private/+/747160 first to test compilation." 
[puppet] - 10https://gerrit.wikimedia.org/r/747113 (https://phabricator.wikimedia.org/T262668) (owner: 10Jcrespo) [16:40:06] (03CR) 10Jcrespo: [V: 03+2 C: 03+2] mediabackup: Add dummy age private key for mediabackups [labs/private] - 10https://gerrit.wikimedia.org/r/747160 (https://phabricator.wikimedia.org/T262668) (owner: 10Jcrespo) [16:41:07] (03CR) 10Accraze: [C: 03+1] helmfile.d: Configure all ml-services to use the Istio egress gw [deployment-charts] - 10https://gerrit.wikimedia.org/r/747156 (https://phabricator.wikimedia.org/T294414) (owner: 10Elukey) [16:41:51] 10SRE, 10ops-eqiad, 10DC-Ops, 10Kubernetes: Q2:(Need By: TBD) rack/setup/install kubernetes1022 - https://phabricator.wikimedia.org/T294301 (10Cmjohnson) [16:42:24] (03CR) 10Elukey: [C: 03+2] knative-serving: add support for istio egress gateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/745555 (owner: 10Elukey) [16:43:11] (03CR) 10JHathaway: [C: 03+1] "2000, seems pretty reasonable, since we have about 900 messages sitting in the queue on mx1001 at the moment." 
[puppet] - 10https://gerrit.wikimedia.org/r/747128 (https://phabricator.wikimedia.org/T297144) (owner: 10Herron) [16:43:46] (03CR) 10Jcrespo: "Surprisingly, seems to work as expected: https://puppet-compiler.wmflabs.org/pcc-worker1001/32994/" [puppet] - 10https://gerrit.wikimedia.org/r/747113 (https://phabricator.wikimedia.org/T262668) (owner: 10Jcrespo) [16:43:48] !log rolling restart of php-fpm on all mediawiki hosts (T297517 T297667) [16:43:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:43:54] T297667: mysqli/mysqlnd memory leak - https://phabricator.wikimedia.org/T297667 [16:43:54] T297517: wtp* hosts: Out of memory (allocated 39845888) (tried to allocate 131072 bytes) in OutputHandler.php - https://phabricator.wikimedia.org/T297517 [16:46:43] (03PS1) 10Giuseppe Lavagetto: mediawiki: actually mount the debug volume [deployment-charts] - 10https://gerrit.wikimedia.org/r/747163 [16:48:17] (03CR) 10jerkins-bot: [V: 04-1] mediawiki: actually mount the debug volume [deployment-charts] - 10https://gerrit.wikimedia.org/r/747163 (owner: 10Giuseppe Lavagetto) [16:48:40] (03CR) 10Elukey: [C: 03+2] helmfile.d: Add Istio Egress config for ml-serve clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/747153 (https://phabricator.wikimedia.org/T294414) (owner: 10Elukey) [16:51:26] PROBLEM - Check systemd state on wtp1045 is CRITICAL: CRITICAL - degraded: The following units failed: phpsessionclean.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:51:31] (03PS2) 10Giuseppe Lavagetto: mediawiki: actually mount the debug volume [deployment-charts] - 10https://gerrit.wikimedia.org/r/747163 [16:51:54] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [16:51:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:53:01] (03PS2) 10Elukey: helmfile.d: Configure all ml-services to use the Istio egress gw [deployment-charts] - 10https://gerrit.wikimedia.org/r/747156 
(https://phabricator.wikimedia.org/T294414) [16:53:21] jbond, I am about to run "reprepro -C main includedeb buster-wikimedia age_1.0.0~rc1-2+b3_amd64.deb" on apt1001- I double checked the sha256sum and tested it on a buster host (no new dependencies) [16:53:46] jynus: cool [16:55:14] PROBLEM - Check systemd state on wtp1041 is CRITICAL: CRITICAL - degraded: The following units failed: phpsessionclean.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:55:45] can it be because of the rolling restart? [16:55:49] jynus: i did a test install on sretest1001 and look ok to me [16:55:54] if so, it should recover [16:56:04] jbond, cool [16:56:20] (03PS3) 10Elukey: helmfile.d: Configure all ml-services to use the Istio egress gw [deployment-charts] - 10https://gerrit.wikimedia.org/r/747156 (https://phabricator.wikimedia.org/T294414) [16:56:22] (03PS1) 10Elukey: knative-serving: fix net_istio_egress template [deployment-charts] - 10https://gerrit.wikimedia.org/r/747164 [16:56:24] (03PS3) 10Volans: remote: wait_reboot_since, intercept bad uptimes [software/spicerack] - 10https://gerrit.wikimedia.org/r/747155 [16:56:26] (03PS2) 10Volans: spicerack.redfish: add support for Redfish API [software/spicerack] - 10https://gerrit.wikimedia.org/r/747157 (https://phabricator.wikimedia.org/T271583) [16:56:28] (03PS1) 10Volans: pylint: fix newly reported issues [software/spicerack] - 10https://gerrit.wikimedia.org/r/747165 [16:56:33] I will send a patch to install it on puppet masters if you are ok with that (I think it may be useful outside of mediabackups) [16:56:51] jynus: sgtm [16:57:22] !log jhathaway@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mirror1001.wikimedia.org with OS bullseye [16:57:25] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:57:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:57:29] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:57:30] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Setup new mirror server (copernicium.wikimedia.org) - https://phabricator.wikimedia.org/T286898 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhathaway@cumin1001 for host mirror1001.wikimedia.org with OS bullseye completed: - m... [16:59:06] RECOVERY - cassandra-c CQL 10.192.48.172:9042 on restbase2026 is OK: TCP OK - 0.033 second response time on 10.192.48.172 port 9042 https://phabricator.wikimedia.org/T93886 [17:00:04] jbond and rzl: How many deployers does it take to do Puppet request window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211214T1700). [17:00:04] No Gerrit patches in the queue for this window AFAICS. [17:00:12] ✅ [17:01:33] (03CR) 10Elukey: [C: 03+2] knative-serving: fix net_istio_egress template [deployment-charts] - 10https://gerrit.wikimedia.org/r/747164 (owner: 10Elukey) [17:03:23] (03PS1) 10BBlack: Add mediawiki redirects for WME typo domains [puppet] - 10https://gerrit.wikimedia.org/r/747167 (https://phabricator.wikimedia.org/T296445) [17:04:05] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mediawiki: actually mount the debug volume [deployment-charts] - 10https://gerrit.wikimedia.org/r/747163 (owner: 10Giuseppe Lavagetto) [17:04:27] (03PS1) 10BBlack: Define enterprise.(wm|wp).o for MW-level redirects [dns] - 10https://gerrit.wikimedia.org/r/747168 (https://phabricator.wikimedia.org/T296445) [17:04:55] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Grant Access to for - https://phabricator.wikimedia.org/T297652 (10elappen-WMF) Thank you so much @MatthewVernon! 
[17:05:37] (03PS2) 10BBlack: Add MW and ncredir redirects for WME typo domains [puppet] - 10https://gerrit.wikimedia.org/r/747167 (https://phabricator.wikimedia.org/T296445) [17:06:10] the rolling restart is done now [17:06:40] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudbackup100[34] - https://phabricator.wikimedia.org/T293934 (10Cmjohnson) [17:06:44] !log hnowlan@puppetmaster1001 conftool action : set/weight=10:pooled=yes; selector: name=restbase2026.codfw.wmnet [17:06:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:06:53] (03PS1) 10Volans: sre.hosts.provision: add new cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/747169 (https://phabricator.wikimedia.org/T271583) [17:07:14] Amir1: The mem usage chart I'm looking at for parsoid dropped down a lot. [17:07:25] (03CR) 10Volans: "Example usage in I38b4bccee29e3222654c078f8544dfba03a8ca16" [software/spicerack] - 10https://gerrit.wikimedia.org/r/747157 (https://phabricator.wikimedia.org/T271583) (owner: 10Volans) [17:07:30] (03Merged) 10jenkins-bot: mediawiki: actually mount the debug volume [deployment-charts] - 10https://gerrit.wikimedia.org/r/747163 (owner: 10Giuseppe Lavagetto) [17:07:34] yeah but that's sorta expected, it's fresh and without the leak [17:07:53] the leak is happening but hopefully with slower pace with lower number of db queries happening [17:08:09] nod. [17:08:12] Desired. 
[17:09:15] for a while we can run the rolling restart until the underlying issue gets fixed [17:09:30] (03CR) 10MSantos: [C: 03+2] tegola-vector-tiles: Disable pregeneration on eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/747136 (https://phabricator.wikimedia.org/T280767) (owner: 10Jgiannelos) [17:09:43] (03CR) 10jerkins-bot: [V: 04-1] sre.hosts.provision: add new cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/747169 (https://phabricator.wikimedia.org/T271583) (owner: 10Volans) [17:10:09] 10SRE, 10Abstract Wikipedia team (Phase λ – Launch), 10Service-deployment-requests: New Service Request: function-orchestrator and function-evaluator (for Wikifunctions launch) - https://phabricator.wikimedia.org/T297314 (10Jdforrester-WMF) [17:10:14] RECOVERY - cassandra-b service on aqs1011 is OK: OK - cassandra-b is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [17:10:20] RECOVERY - cassandra-b CQL 10.64.16.206:9042 on aqs1011 is OK: TCP OK - 0.000 second response time on 10.64.16.206 port 9042 https://phabricator.wikimedia.org/T93886 [17:10:24] RECOVERY - Check systemd state on wtp1041 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:10:33] (03PS1) 10Jcrespo: puppetmaster: Install 'age' on puppetmaster frontends [puppet] - 10https://gerrit.wikimedia.org/r/747170 (https://phabricator.wikimedia.org/T262668) [17:10:44] RECOVERY - Check systemd state on wtp1045 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:10:52] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [17:10:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:11:04] (03CR) 10Volans: "Example cookbook usage for Icc10491cf2c90d2bc51122c7ec3d2e168327afba . 
CI is expected to fail until the linked patch is merged and release" [cookbooks] - 10https://gerrit.wikimedia.org/r/747169 (https://phabricator.wikimedia.org/T271583) (owner: 10Volans) [17:11:08] (03CR) 10jerkins-bot: [V: 04-1] puppetmaster: Install 'age' on puppetmaster frontends [puppet] - 10https://gerrit.wikimedia.org/r/747170 (https://phabricator.wikimedia.org/T262668) (owner: 10Jcrespo) [17:12:20] RECOVERY - cassandra-a CQL 10.64.16.204:9042 on aqs1011 is OK: TCP OK - 0.000 second response time on 10.64.16.204 port 9042 https://phabricator.wikimedia.org/T93886 [17:12:20] RECOVERY - cassandra-a service on aqs1011 is OK: OK - cassandra-a is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [17:12:30] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host aqs1011.eqiad.wmnet [17:12:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:14:06] (03PS2) 10Jcrespo: puppetmaster: Install 'age' on puppetmaster frontends [puppet] - 10https://gerrit.wikimedia.org/r/747170 (https://phabricator.wikimedia.org/T262668) [17:14:29] (03CR) 10Jbond: [C: 03+1] mediabackup: Add an encryption key to store private files securely (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/747113 (https://phabricator.wikimedia.org/T262668) (owner: 10Jcrespo) [17:14:56] (03CR) 10Jbond: [C: 03+1] remote: wait_reboot_since, intercept bad uptimes [software/spicerack] - 10https://gerrit.wikimedia.org/r/747155 (owner: 10Volans) [17:15:05] (03Merged) 10jenkins-bot: tegola-vector-tiles: Disable pregeneration on eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/747136 (https://phabricator.wikimedia.org/T280767) (owner: 10Jgiannelos) [17:15:13] !log oblivian@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[17:15:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:15:18] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:15:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:15:40] (03CR) 10Jbond: [C: 03+1] pylint: fix newly reported issues [software/spicerack] - 10https://gerrit.wikimedia.org/r/747165 (owner: 10Volans) [17:17:20] (03CR) 10Volans: [C: 03+2] pylint: fix newly reported issues [software/spicerack] - 10https://gerrit.wikimedia.org/r/747165 (owner: 10Volans) [17:18:50] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host aqs1012.eqiad.wmnet [17:18:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:19:07] 10SRE, 10ops-eqiad, 10DC-Ops, 10SRE Observability (FY2021/2022-Q2): (Need By: TBD) rack/setup/install prometheus100[56] - https://phabricator.wikimedia.org/T294967 (10Cmjohnson) [17:20:08] 10SRE, 10serviceops, 10Wikimedia-production-error: wtp* hosts: Out of memory (allocated 39845888) (tried to allocate 131072 bytes) in OutputHandler.php - https://phabricator.wikimedia.org/T297517 (10thcipriani) Documenting my understanding of this problem after reading this task (along with T297669 and T2976... [17:20:27] (03CR) 10Jcrespo: "I think this will be a useful tool no matter what, but if we use age for key generation, this is almost a requirement." 
[puppet] - 10https://gerrit.wikimedia.org/r/747170 (https://phabricator.wikimedia.org/T262668) (owner: 10Jcrespo) [17:21:27] !log icinga - re-enabling active monitoring checks on mx2001 (T297128) [17:21:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:21:32] T297128: Bringing mx2001 back into service - https://phabricator.wikimedia.org/T297128 [17:23:08] (03Merged) 10jenkins-bot: pylint: fix newly reported issues [software/spicerack] - 10https://gerrit.wikimedia.org/r/747165 (owner: 10Volans) [17:23:16] !log elastic1043 is down and alerting since > 6h [17:23:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:23:27] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [17:23:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:24:13] jhathaway: just added mirror1001 in puppet? [17:24:33] mutante: yes, just re-imaged it [17:24:38] 10SRE, 10Data-Engineering, 10Infrastructure-Foundations, 10Traffic-Icebox, 10netops: Collect netflow data for internal traffic - https://phabricator.wikimedia.org/T263277 (10JAllemandou) Am I right in assuming that this data has the same schema as the original `netflow`? [17:24:43] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host aqs1012.eqiad.wmnet [17:24:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:25:24] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host aqs1013.eqiad.wmnet [17:25:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:25:45] jhathaway: confirmed there are new monitoring checks for in Icinga in the state "pending". So soon it might start talking about these. Though the cookbook would first set a downtime for 2 hours or so. 
[17:26:04] this means the host is in puppetdb and it worked, basically [17:26:43] ok thanks [17:27:15] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:27:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:27:33] (the pending ones are actually just the mgmt interface, other checks already green but with disabled notifications ) [17:30:35] (03PS4) 10Volans: remote: wait_reboot_since, intercept bad uptimes [software/spicerack] - 10https://gerrit.wikimedia.org/r/747155 [17:30:37] (03PS3) 10Volans: spicerack.redfish: add support for Redfish API [software/spicerack] - 10https://gerrit.wikimedia.org/r/747157 (https://phabricator.wikimedia.org/T271583) [17:31:11] !log jgiannelos@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'tegola-vector-tiles' for release 'main' . [17:31:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:31:29] PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: An error occurred checking if Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [17:31:55] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host aqs1013.eqiad.wmnet [17:31:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:32:29] aphlict1001 ran out of disk, people2002 dpkg error, cr1 OSPF alerts, stat1007 broken product-analytics-movement-service, and 20 other alerts [17:34:45] 10SRE, 10Infrastructure-Foundations: Migrate codfw Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296622 (10MoritzMuehlenhoff) [17:35:01] !log bblack@cumin1001 START - Cookbook sre.dns.netbox [17:35:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:35:05] 10SRE, 10serviceops, 10Wikimedia-production-error: wtp* hosts: Out of memory (allocated 39845888) (tried to allocate 131072 bytes) in OutputHandler.php - 
https://phabricator.wikimedia.org/T297517 (10Ladsgroup) I don't have strong opinions but I think wmf.12 issues are "mitigated" (but not resolved) and wmf.1... [17:35:31] !log oblivian@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [17:35:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:35:35] (03CR) 10Jbond: [C: 03+1] remote: wait_reboot_since, intercept bad uptimes [software/spicerack] - 10https://gerrit.wikimedia.org/r/747155 (owner: 10Volans) [17:35:54] (03PS1) 10Vgutierrez: prometheus::ops: Gather full metrics for cache::envoy [puppet] - 10https://gerrit.wikimedia.org/r/747171 (https://phabricator.wikimedia.org/T271421) [17:35:56] (03PS1) 10Vgutierrez: prometheus::ops: Add varnish/ATS metrics for cache::upload_envoy role [puppet] - 10https://gerrit.wikimedia.org/r/747172 (https://phabricator.wikimedia.org/T271421) [17:36:30] (03CR) 10jerkins-bot: [V: 04-1] prometheus::ops: Gather full metrics for cache::envoy [puppet] - 10https://gerrit.wikimedia.org/r/747171 (https://phabricator.wikimedia.org/T271421) (owner: 10Vgutierrez) [17:36:51] (03CR) 10jerkins-bot: [V: 04-1] remote: wait_reboot_since, intercept bad uptimes [software/spicerack] - 10https://gerrit.wikimedia.org/r/747155 (owner: 10Volans) [17:36:54] (03CR) 10jerkins-bot: [V: 04-1] prometheus::ops: Add varnish/ATS metrics for cache::upload_envoy role [puppet] - 10https://gerrit.wikimedia.org/r/747172 (https://phabricator.wikimedia.org/T271421) (owner: 10Vgutierrez) [17:38:09] !log aphlict1001 - (Phabricator realtime notifications) - out of disk, attempting to gzip a large log [17:38:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:38:50] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations: Q2:(Need By: TBD) rack/setup/install ganeti102[5-8] - https://phabricator.wikimedia.org/T293909 (10Cmjohnson) [17:39:00] !log bblack@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) 
[17:39:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:39:13] (03PS2) 10Vgutierrez: prometheus::ops: Gather full metrics for cache::envoy [puppet] - 10https://gerrit.wikimedia.org/r/747171 (https://phabricator.wikimedia.org/T271421) [17:39:15] (03PS2) 10Vgutierrez: prometheus::ops: Add varnish/ATS metrics for cache::upload_envoy role [puppet] - 10https://gerrit.wikimedia.org/r/747172 (https://phabricator.wikimedia.org/T271421) [17:39:51] (03PS5) 10Volans: remote: wait_reboot_since, intercept bad uptimes [software/spicerack] - 10https://gerrit.wikimedia.org/r/747155 [17:39:53] (03PS4) 10Volans: spicerack.redfish: add support for Redfish API [software/spicerack] - 10https://gerrit.wikimedia.org/r/747157 (https://phabricator.wikimedia.org/T271583) [17:40:26] 10SRE, 10serviceops, 10Wikimedia-production-error: wtp* hosts: Out of memory (allocated 39845888) (tried to allocate 131072 bytes) in OutputHandler.php - https://phabricator.wikimedia.org/T297517 (10Joe) FWIW, I wholeheartedly agree with @thcipriani's opinions above. As for the remaining work: we need to ru... [17:41:36] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host aqs1014.eqiad.wmnet [17:41:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:41:39] !log Temporarily deactivated BGP peering to AS8932 at AMS-IX (cr2-esams) as peer is constantly tripping max-prefix configuration for a few days, and according to peeringdb they should be within limit. 
[17:41:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:42:43] RECOVERY - Uncommitted DNS changes in Netbox on netbox1001 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [17:42:48] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32996/console" [puppet] - 10https://gerrit.wikimedia.org/r/747171 (https://phabricator.wikimedia.org/T271421) (owner: 10Vgutierrez) [17:43:47] 10SRE, 10serviceops, 10Wikimedia-production-error: wtp* hosts: Out of memory (allocated 39845888) (tried to allocate 131072 bytes) in OutputHandler.php - https://phabricator.wikimedia.org/T297517 (10ssastry) >>! In T297517#7570257, @thcipriani wrote: > > - I would prefer we either (a) abandon wmf.12 and roll... [17:44:06] (03PS2) 10Andrew Bogott: cloudmetrics: make cloudmetrics1003 the primary, 1004 the secondary [puppet] - 10https://gerrit.wikimedia.org/r/745950 (https://phabricator.wikimedia.org/T289888) [17:44:08] (03PS1) 10BBlack: lvs1020: lvs role and iface/addr metadata [puppet] - 10https://gerrit.wikimedia.org/r/747173 (https://phabricator.wikimedia.org/T295804) [17:45:22] (03PS1) 10Andrew Bogott: Replace cloudmetrics1001 with cloudmetrics1003 [dns] - 10https://gerrit.wikimedia.org/r/747174 (https://phabricator.wikimedia.org/T297712) [17:46:14] (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] prometheus::ops: Gather full metrics for cache::envoy [puppet] - 10https://gerrit.wikimedia.org/r/747171 (https://phabricator.wikimedia.org/T271421) (owner: 10Vgutierrez) [17:46:29] (03PS1) 10BBlack: lvs1020: add to homer lvs_neighbors [homer/public] - 10https://gerrit.wikimedia.org/r/747175 (https://phabricator.wikimedia.org/T295804) [17:46:51] jynus: you got a pending commit on labs-private repo to be merged [17:47:27] RECOVERY - DPKG on people2002 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg 
[17:47:28] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host aqs1014.eqiad.wmnet [17:47:31] vgutierrez, oh [17:47:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:47:33] let me fix that [17:47:37] I always forget [17:48:00] sorry, done [17:48:01] RECOVERY - Disk space on aphlict1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=aphlict1001&var-datasource=eqiad+prometheus/ops [17:48:31] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32997/console" [puppet] - 10https://gerrit.wikimedia.org/r/747172 (https://phabricator.wikimedia.org/T271421) (owner: 10Vgutierrez) [17:48:55] !log people2002 - apt-get install --reinstall linux-image-5.10.0-9-amd64 to fix Icinga DPKG alert [17:48:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:49:25] (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] prometheus::ops: Add varnish/ATS metrics for cache::upload_envoy role [puppet] - 10https://gerrit.wikimedia.org/r/747172 (https://phabricator.wikimedia.org/T271421) (owner: 10Vgutierrez) [17:51:09] ACKNOWLEDGEMENT - SSH on db2086.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds daniel_zahn https://phabricator.wikimedia.org/T283582 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:51:09] ACKNOWLEDGEMENT - SSH on mw2252.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds daniel_zahn https://phabricator.wikimedia.org/T283582 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:51:09] ACKNOWLEDGEMENT - SSH on mw2257.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds daniel_zahn https://phabricator.wikimedia.org/T283582 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:51:09] 
ACKNOWLEDGEMENT - SSH on mw2258.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds daniel_zahn https://phabricator.wikimedia.org/T283582 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:51:28] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host aqs1015.eqiad.wmnet [17:51:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:52:57] i'm deploying something [17:53:18] hehe [17:54:35] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (Kanban): connect 2nd cloudcontrol200x-dev NIC to vlan 2105 - https://phabricator.wikimedia.org/T297588 (10aborrero) 05Open→03Stalled Yes, we will be doing trunk. Thanks @Papaul I think we're fine here from DCops side f... [17:54:58] 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (Kanban): cloud: decide on general idea for having cloud-dedicated hardware provide service in the cloud realm & the internet - https://phabricator.wikimedia.org/T296411 (10aborrero) [17:55:11] (03PS3) 10Hnowlan: maps: write tegola swift credentials out to file [puppet] - 10https://gerrit.wikimedia.org/r/746897 (https://phabricator.wikimedia.org/T292700) [17:56:00] 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (Kanban): cloud: decide on general idea for having cloud-dedicated hardware provide service in the cloud realm & the internet - https://phabricator.wikimedia.org/T296411 (10aborrero) 05Open→03Stalled We just re-shifted team priorities... 
[17:56:36] (03CR) 10Hnowlan: maps: write tegola swift credentials out to file (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/746897 (https://phabricator.wikimedia.org/T292700) (owner: 10Hnowlan) [17:57:58] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host aqs1015.eqiad.wmnet [17:58:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:58:19] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: Q1:(Need By: TBD) rack/setup/install cloudswift100[12] - https://phabricator.wikimedia.org/T289882 (10aborrero) 05Open→03Stalled FYI network details for these servers are blocked on {T296411}, which is in turn stalled, so marking... [18:00:04] chrisalbon and accraze: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Services – Graphoid / ORES deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211214T1800). [18:04:56] 10SRE, 10serviceops, 10Wikimedia-production-error: wtp* hosts: Out of memory (allocated 39845888) (tried to allocate 131072 bytes) in OutputHandler.php - https://phabricator.wikimedia.org/T297517 (10Zabe) >>! In T297517#7570358, @ssastry wrote: > [...] does that mean if wmf.13 had to be rolled back, it will... 
[18:10:50] (03CR) 10Muehlenhoff: logstash: use logstash-oss for gelf_relay (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/746971 (https://phabricator.wikimedia.org/T297468) (owner: 10Cwhite) [18:17:02] !log ryankemper@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation restart without plugin upgrade (3 nodes at a time) for ElasticSearch cluster cloudelastic: cloudelastic rolling restart - ryankemper@cumin1001 - T297468 [18:17:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:17:20] !log T297468 `sudo cookbook sre.elasticsearch.rolling-operation cloudelastic "cloudelastic rolling restart" --nodes-per-run 3 --start-datetime 2021-12-14T01:27:58 --task-id T297468` on `ryankemper@cumin1001` tmux `elastic_restarts` [18:17:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:21:33] !log T297468 [Elastic] Performing manual rolling restart of `relforge`. Starting with `ryankemper@relforge1004:~$ sudo systemctl restart elasticsearch_6@relforge-eqiad.service elasticsearch_6@relforge-eqiad-small-alpha.service logstash.service` (non-master node) [18:21:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:22:13] (03CR) 10Muehlenhoff: [C: 04-1] "Fine with this, but the puppet masters are on buster and age is only included as of bullseye." 
[puppet] - 10https://gerrit.wikimedia.org/r/747170 (https://phabricator.wikimedia.org/T262668) (owner: 10Jcrespo) [18:22:54] (03CR) 10Cwhite: logstash: use logstash-oss for gelf_relay (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/746971 (https://phabricator.wikimedia.org/T297468) (owner: 10Cwhite) [18:24:42] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/746971 (https://phabricator.wikimedia.org/T297468) (owner: 10Cwhite) [18:25:31] !log repooling eventgate-main discovery to include codfw - T296699 - confctl --object-type discovery select 'dnsdisc=eventgate-main,name=codfw' set/pooled=true [18:25:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:25:36] T296699: Pool eventgate-main in both datacenters (active/active) - https://phabricator.wikimedia.org/T296699 [18:25:39] !log otto@puppetmaster1001 conftool action : set/pooled=true; selector: dnsdisc=eventgate-main,name=codfw [18:25:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:26:31] 10SRE, 10Analytics, 10Event-Platform, 10Sustainability (Incident Followup): Pool eventgate-main in both datacenters (active/active) - https://phabricator.wikimedia.org/T296699 (10Ottomata) Ran ` root@puppetmaster1001:~# confctl --object-type discovery select 'dnsdisc=eventgate-main,name=codfw' set/pooled=... 
[18:28:32] !log T297468 [Elastic] `ryankemper@relforge1003:~$ sudo systemctl restart elasticsearch_6@relforge-eqiad.service elasticsearch_6@relforge-eqiad-small-alpha.service logstash.service` [18:28:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:28:57] 10SRE, 10serviceops, 10Patch-For-Review: parsoid servers are not matched by mw* cumin aliases - https://phabricator.wikimedia.org/T294802 (10Dzahn) 05Open→03Stalled stalled on https://gerrit.wikimedia.org/r/c/operations/puppet/+/736596/5 [18:30:17] (03CR) 10Dzahn: "While this is waiting for follow-up, can I simply add parsoid to "mw" alias to get the ticket resolved, while forgetting about the rest of" [puppet] - 10https://gerrit.wikimedia.org/r/736596 (https://phabricator.wikimedia.org/T294802) (owner: 10Dzahn) [18:31:49] (03CR) 10Cathal Mooney: [C: 03+1] "interfaces.yaml bits all look good, rest also makes sense but I'm not as familiar with that." [puppet] - 10https://gerrit.wikimedia.org/r/747173 (https://phabricator.wikimedia.org/T295804) (owner: 10BBlack) [18:32:32] (03PS1) 10Kosta Harlan: betalabs: Enable Watchlist Echo notifications feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/747186 (https://phabricator.wikimedia.org/T203941) [18:32:49] (03CR) 10Giuseppe Lavagetto: cumin: reorganize mediawiki aliases (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/736596 (https://phabricator.wikimedia.org/T294802) (owner: 10Dzahn) [18:32:57] (03PS1) 10Urbanecm: Make fix-staging-perms also fix /srv/patches permissions [puppet] - 10https://gerrit.wikimedia.org/r/747187 [18:33:13] (03CR) 10Dzahn: "still wanna get this done? 
I think Filippo's latest comment about a typo is still current" [puppet] - 10https://gerrit.wikimedia.org/r/715636 (owner: 10Legoktm) [18:34:13] (03CR) 10Dzahn: cumin: reorganize mediawiki aliases (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/736596 (https://phabricator.wikimedia.org/T294802) (owner: 10Dzahn) [18:34:36] !log deployed updated patch for T297322 [18:34:37] * majavah done [18:34:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:37:16] (03CR) 10Jcrespo: "0:-)" [puppet] - 10https://gerrit.wikimedia.org/r/747170 (https://phabricator.wikimedia.org/T262668) (owner: 10Jcrespo) [18:38:34] !log milimetric@deploy1002 Started deploy [analytics/refinery@92c63c9]: Regular analytics weekly train [analytics/refinery@92c63c9] [18:38:36] (03PS6) 10Dzahn: cumin: add parsoid servers to all-mw-* aliases [puppet] - 10https://gerrit.wikimedia.org/r/736596 (https://phabricator.wikimedia.org/T294802) [18:38:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:39:21] (03PS7) 10Dzahn: cumin: add parsoid servers to all-mw-* aliases [puppet] - 10https://gerrit.wikimedia.org/r/736596 (https://phabricator.wikimedia.org/T294802) [18:39:40] (03CR) 10Dzahn: "amended, rebased, recycled. ok to merge?" 
[puppet] - 10https://gerrit.wikimedia.org/r/736596 (https://phabricator.wikimedia.org/T294802) (owner: 10Dzahn) [18:39:58] !log lvs1016: downtimed for attempt at moving its role to lvs1020 (expect a few minor related alerts, such as BGP ones for eqiad routers) [18:40:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:40:44] 10SRE, 10SRE-swift-storage, 10Data-Persistence-Backup, 10media-backups, and 3 others: WMF media storage must be adequately backed up - https://phabricator.wikimedia.org/T262668 (10jcrespo) Proof it is working: {F34883926} {F34883925} [18:40:49] !log lvs1016: puppet agent disabled, pybal stopped [18:40:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:41:05] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:41:29] ^ expected from lvs1016/lvs1020 work [18:41:45] thanks! and ACK @ no touching pybal [18:42:19] ACKNOWLEDGEMENT - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal Brandon Black lvs1016/lvs1020 swap process https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:42:19] ACKNOWLEDGEMENT - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal Brandon Black lvs1016/lvs1020 swap process https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:43:52] (03PS1) 10Clare Ming: Prevent A/B test enrollment hook from firing for unsampled [skins/Vector] (wmf/1.38.0-wmf.12) - 10https://gerrit.wikimedia.org/r/747075 (https://phabricator.wikimedia.org/T297662) [18:50:56] is elastic ok? 
[18:52:10] ah, it is cloudelastic, not elastic [18:52:26] and seems to be getting better [18:52:28] (03CR) 10Herron: [C: 03+1] prometheus: export service catalog metrics [puppet] - 10https://gerrit.wikimedia.org/r/747140 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi) [18:52:30] (03PS1) 10Bartosz Dziewoński: VE on zh.wiki: Enable single-edit-tab mode, and other config like en.wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/747190 (https://phabricator.wikimedia.org/T296269) [18:55:15] (03CR) 10Jdlrobson: [C: 03+1] "LGTM" [skins/Vector] (wmf/1.38.0-wmf.12) - 10https://gerrit.wikimedia.org/r/747075 (https://phabricator.wikimedia.org/T297662) (owner: 10Clare Ming) [18:56:59] PROBLEM - etcd request latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [18:58:23] !log milimetric@deploy1002 Finished deploy [analytics/refinery@92c63c9]: Regular analytics weekly train [analytics/refinery@92c63c9] (duration: 19m 49s) [18:58:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:58:40] (03CR) 10BBlack: [C: 03+2] lvs1020: lvs role and iface/addr metadata [puppet] - 10https://gerrit.wikimedia.org/r/747173 (https://phabricator.wikimedia.org/T295804) (owner: 10BBlack) [18:59:13] RECOVERY - etcd request latencies on kubemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [18:59:29] !log lvs1020: running puppet agent with lvs role + config for first time [18:59:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:00:04] RoanKattouw and Urbanecm: (Dis)respected human, time to deploy UTC evening backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211214T1900). Please do the needful. 
[19:00:05] nn1l2, nemo-yiannis, and cjming: A patch you scheduled for UTC evening backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [19:00:12] hi [19:00:12] o/ [19:00:30] hey [19:00:43] Hey [19:01:07] !log milimetric@deploy1002 Started deploy [analytics/refinery@92c63c9] (thin): Regular analytics weekly train THIN [analytics/refinery@92c63c9] [19:01:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:01:14] !log milimetric@deploy1002 Finished deploy [analytics/refinery@92c63c9] (thin): Regular analytics weekly train THIN [analytics/refinery@92c63c9] (duration: 00m 07s) [19:01:17] !log milimetric@deploy1002 Started deploy [analytics/refinery@92c63c9] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@92c63c9] [19:01:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:01:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:01:27] cjming: want to deploy today? Or should i? [19:01:52] hello, i have a patch too if i'm not too late. https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/747190 [19:02:01] i'll add it to the table [19:02:15] MatmaRex: feel free to [19:02:21] urbanecm: do you mind doing it? 
i'm trying to get another patch out the door [19:02:46] cjming: not at all [19:02:56] ty 🙌 [19:03:10] (03PS1) 10BBlack: lvs1020: add interface_tweaks data [puppet] - 10https://gerrit.wikimedia.org/r/747192 (https://phabricator.wikimedia.org/T295804) [19:03:39] (03CR) 10Urbanecm: [C: 03+2] Prevent A/B test enrollment hook from firing for unsampled [skins/Vector] (wmf/1.38.0-wmf.12) - 10https://gerrit.wikimedia.org/r/747075 (https://phabricator.wikimedia.org/T297662) (owner: 10Clare Ming) [19:03:57] (03CR) 10Urbanecm: [C: 03+2] kartographer: Enable tegola on jawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/747111 (https://phabricator.wikimedia.org/T280767) (owner: 10Jgiannelos) [19:04:39] (03CR) 10BBlack: [C: 03+2] lvs1020: add interface_tweaks data [puppet] - 10https://gerrit.wikimedia.org/r/747192 (https://phabricator.wikimedia.org/T295804) (owner: 10BBlack) [19:05:16] (03Merged) 10jenkins-bot: kartographer: Enable tegola on jawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/747111 (https://phabricator.wikimedia.org/T280767) (owner: 10Jgiannelos) [19:06:51] oops, i got disconnected, hope i didn't miss anything [19:08:11] !log milimetric@deploy1002 Finished deploy [analytics/refinery@92c63c9] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@92c63c9] (duration: 06m 54s) [19:08:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:08:21] MatmaRex: nope [19:08:50] nemo-yiannis: your patch is at mwdebug1001 [19:08:53] can you have a look? [19:09:01] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [19:09:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:10:04] nn1l2: sorry for not being clear yesterday. I meant that the patch ideally should be scheduled with a +1 from someone. 
I tend to not have time to do enough review during B&C [19:10:08] i can try to do it at the end [19:10:08] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [19:10:09] but no guarantees [19:10:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:10:25] (03PS2) 10Urbanecm: VE on zh.wiki: Enable single-edit-tab mode, and other config like en.wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/747190 (https://phabricator.wikimedia.org/T296269) (owner: 10Bartosz Dziewoński) [19:10:40] understood [19:11:12] it's not because the patch is complicated or something. it's...quite large (compared to other config patches) [19:12:11] (03CR) 10Urbanecm: [C: 03+2] VE on zh.wiki: Enable single-edit-tab mode, and other config like en.wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/747190 (https://phabricator.wikimedia.org/T296269) (owner: 10Bartosz Dziewoński) [19:12:26] nemo-yiannis: hey, how is your test going? [19:12:41] diff looks ok, i am having some hard time navigating ja.wikipedia.org [19:12:53] nemo-yiannis: try to clear your cookies [19:12:59] (it's a...known issue) [19:13:26] the gadget that caused it was disabled, but...obviously we can't remove the cookies from visitors ourselves :( [19:13:34] (03Merged) 10jenkins-bot: VE on zh.wiki: Enable single-edit-tab mode, and other config like en.wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/747190 (https://phabricator.wikimedia.org/T296269) (owner: 10Bartosz Dziewoński) [19:15:02] to roots: can someone tell me what is `mwdebug1001:/srv/mediawiki/w/debug/vardump.php`? 
it's root-owned in a root-owned directory, and scap complains about it [19:15:31] (03CR) 10Cwhite: [C: 03+2] logstash: use logstash-oss for gelf_relay [puppet] - 10https://gerrit.wikimedia.org/r/746971 (https://phabricator.wikimedia.org/T297468) (owner: 10Cwhite) [19:15:33] nemo-yiannis: pulled the patch to mwdebug1002 as well -- I'm not sure if the error i missed when originally pulling broke scap pull or not [19:16:19] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [19:16:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:17:28] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [19:17:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:18:03] !log lvs1020 - rebooting on new config [19:18:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:19:01] i cant find a page on jawiki to reproduce the issue but patch should be fairly straightforward (we've already tried in many wikis the past few days) [19:19:13] its more of a matter of rollout [19:19:27] nemo-yiannis: so ok to deploy? [19:19:30] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) restart without plugin upgrade (3 nodes at a time) for ElasticSearch cluster cloudelastic: cloudelastic rolling restart - ryankemper@cumin1001 - T297468 [19:19:31] or do you want more time? 
[19:19:32] i think so yeah [19:19:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:19:39] okay [19:19:47] once i get clarification re `mwdebug1001:/srv/mediawiki/w/debug/vardump.php`, I'll push it [19:22:59] (03Merged) 10jenkins-bot: Prevent A/B test enrollment hook from firing for unsampled [skins/Vector] (wmf/1.38.0-wmf.12) - 10https://gerrit.wikimedia.org/r/747075 (https://phabricator.wikimedia.org/T297662) (owner: 10Clare Ming) [19:27:24] (03PS1) 10Jbond: P:age::store: Add profile and class to configure age secret store [puppet] - 10https://gerrit.wikimedia.org/r/747193 [19:27:26] (03PS1) 10Jbond: O:puppetmaster: Add age::store to puppetmasters [puppet] - 10https://gerrit.wikimedia.org/r/747194 [19:27:54] urbanecm: found an article from the logs, change should be ok [19:28:04] okay [19:28:16] I'm discussing the suspicious file with others in a different channel [19:28:17] stay turned [19:28:45] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [19:28:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:29:41] (03PS2) 10Jbond: O:puppetmaster: Add age::store to puppetmasters [puppet] - 10https://gerrit.wikimedia.org/r/747194 [19:29:50] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [19:29:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:31:22] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32999/console" [puppet] - 10https://gerrit.wikimedia.org/r/747194 (owner: 10Jbond) [19:32:18] (03CR) 10AntiCompositeNumber: [C: 04-1] "WV should not be removed from trwikivoyage. Everything else looks fine from here." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/745220 (https://phabricator.wikimedia.org/T296643) (owner: 104nn1l2) [19:33:53] (03CR) 10AntiCompositeNumber: [C: 03+1] "Correction: not an issue, WV is already a default alias on wikivoyages." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/745220 (https://phabricator.wikimedia.org/T296643) (owner: 104nn1l2) [19:34:38] definitely would have been better to split that into smaller patches [19:37:54] PROBLEM - PyBal BGP sessions are established on lvs1020 is CRITICAL: 0 le 0 https://wikitech.wikimedia.org/wiki/PyBal https://grafana.wikimedia.org/dashboard/db/pybal-bgp?var-datasource=eqiad+prometheus/ops [19:38:09] Thanks AntiComposite [19:39:10] bblack: not sure if the alert above is expected or not -- saw you rebooted lvs1020 recently [19:39:42] (03PS1) 10Jgiannelos: Deprecate unused maps event stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/747196 [19:40:32] urbanecm: thanks. yes, expected/ok [19:40:41] thanks for checking bblack [19:41:47] (03CR) 10Jgiannelos: [C: 04-1] "Block until next deployment window" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/747196 (owner: 10Jgiannelos) [19:42:10] (03PS2) 10Jgiannelos: Deprecate unused maps event stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/747196 (https://phabricator.wikimedia.org/T293366) [19:46:07] nemo-yiannis: going to sync your patch soon, rzl's dealing with the file [19:46:17] sounds good [19:46:42] scap at mwdebug1001 completes w/o errors [19:47:16] (03PS3) 10JHathaway: debian mirrors: add new mirror, mirror1001 in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/745612 (https://phabricator.wikimedia.org/T286898) [19:47:54] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 7f4ae4cc678aa64b0795be7bc4c9a6f1ba4c1929: kartographer: Enable tegola on jawiki (T280767) (duration: 00m 58s) [19:47:58] nemo-yiannis: and, live [19:47:58] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:47:59] T280767: Maps 2.0 roll-out plan - https://phabricator.wikimedia.org/T280767 [19:48:20] MatmaRex: sorry this takes so long. your patch is at mwdebug1001 [19:48:22] please test [19:48:31] looking [19:48:42] cc mbsantos ^ [19:48:43] (03CR) 10JHathaway: [C: 03+2] debian mirrors: add new mirror, mirror1001 in eqiad (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/745612 (https://phabricator.wikimedia.org/T286898) (owner: 10JHathaway) [19:49:13] cjming: your patch is at mwdebug1002, can you have a look? [19:49:52] urbanecm: gtg [19:49:55] urbanecm: seems good [19:50:02] cjming: that was quick, thanks [19:50:03] syncing both [19:50:07] hopefully this is the last time you hear about VE on zhwiki :) [19:50:36] and A/B test enrollment fixes 🤞 [19:50:41] hehe [19:51:27] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 40f0cff8da7c4484e1fe93b9d649fd03f462e434: VE on zh.wiki: Enable single-edit-tab mode, and other config like en.wiki (T296269) (duration: 00m 57s) [19:51:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:51:34] T296269: Enable VisualEditor for Chinese Wikipedia - https://phabricator.wikimedia.org/T296269 [19:51:38] MatmaRex: i wouldn't be _that_ sure about zhwiki and VE. I'm working for Growth, which was the team that kinda created the need for VE there :)) [19:52:01] hah [19:53:00] !log urbanecm@deploy1002 Synchronized php-1.38.0-wmf.12/skins/Vector/resources/skins.vector.es6/AB.js: 62e84e7467c1765986cd1f80b466b8cacc6d91f6: Prevent A/B test enrollment hook from firing for unsampled (T297662) (duration: 00m 56s) [19:53:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:53:06] T297662: mediawiki_web_ab_test_enrollment schema is logging users in the unsampled bucket - https://phabricator.wikimedia.org/T297662 [19:53:07] cjming: should be live [19:53:23] urbanecm: thanks! 
[19:53:27] with the exception of nn1l2's patch, we're done [19:53:43] I'm still around [19:53:44] (03CR) 10Ottomata: [C: 03+1] Deprecate unused maps event stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/747196 (https://phabricator.wikimedia.org/T293366) (owner: 10Jgiannelos) [19:53:59] if you want to deploy it :) [19:54:11] nn1l2: can we leave it for tomorrow? [19:54:19] Of course! [19:54:28] it's reviewed now, but i'm afraid there's not enough team for testing it [19:54:31] *time [19:54:41] thanks :) [19:54:44] No problem [19:54:46] (03PS5) 10AOkoth: gitlab: restore script keep_config options [puppet] - 10https://gerrit.wikimedia.org/r/741675 (https://phabricator.wikimedia.org/T274463) [19:54:47] 10SRE, 10serviceops, 10Wikimedia-production-error: wtp* hosts: Out of memory (allocated 39845888) (tried to allocate 131072 bytes) in OutputHandler.php - https://phabricator.wikimedia.org/T297517 (10hashar) We will promote testwikis to wmf.13 in a few minutes. Tomorrow evening we would had wmf.12 running on... [19:54:48] !log UTC evening B&C window done [19:54:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:55:32] (03CR) 10Jgiannelos: [C: 04-1] "Is there anything else other than this patch that we need to do to remove the deprecated stream?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/747196 (https://phabricator.wikimedia.org/T293366) (owner: 10Jgiannelos) [19:55:40] thanks urbanecm [19:55:43] np [19:55:55] thanks for the VE work :)) [19:56:13] (03PS2) 10Urbanecm: zhwiki: Promote Growth features out of dark mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/746831 (https://phabricator.wikimedia.org/T287884) [19:56:16] actually...let me also quickly push this [19:56:27] (03CR) 10Urbanecm: [C: 03+2] zhwiki: Promote Growth features out of dark mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/746831 (https://phabricator.wikimedia.org/T287884) (owner: 10Urbanecm) [19:57:09] (03Merged) 10jenkins-bot: zhwiki: Promote Growth features out of dark mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/746831 (https://phabricator.wikimedia.org/T287884) (owner: 10Urbanecm) [19:58:11] urbanecm: actually, a quick question [19:58:17] yes? [19:58:34] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: e127f4c6459cd9bc708b35a75c1f272b96fc3211: zhwiki: Promote Growth features out of dark mode (T287884) (duration: 00m 57s) [19:58:35] urbanecm: so zhwiki still has the wikitext editor as the default mode, is that okay for the Growth features? [19:58:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:58:39] T287884: Deploy Growth features on Chinese Wikipedia - https://phabricator.wikimedia.org/T287884 [19:58:40] * urbanecm now fully done with deployment [19:58:42] visual is available, but the user has to switch to it [19:58:57] very good question [19:59:00] let me check that [19:59:37] i am guessing that your code probably makes sure to open VE when it needs VE [19:59:47] but i haven't tested and i don't know if you've enabled it on wikis with this config before [20:00:05] hashar and dancy: It is that lovely time of the day again! You are hereby commanded to deploy MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot). 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211214T2000). [20:00:11] for non-structured edits, we're telling the newcomer to press "Edit" (and highlighting it with a blinking dot) [20:01:10] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [20:01:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:01:26] (03PS6) 10AOkoth: gitlab: restore script keep_config options [puppet] - 10https://gerrit.wikimedia.org/r/741675 (https://phabricator.wikimedia.org/T274463) [20:01:28] at my WMF acc, it works [20:01:42] PROBLEM - SSH on contint1001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:02:07] testing with a new one [20:02:23] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [20:02:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:02:29] good morning :) [20:02:46] Hey hashar [20:02:49] hello hashar [20:02:57] will do the testwikis promotion [20:03:04] since apparently the deployments are done aren't they? [20:03:06] hashar: can you wait for a second? [20:03:10] urbanecm: it's the same config as enwiki btw [20:03:13] MatmaRex pointed out a reason why i should revert [20:03:13] sure! [20:03:23] or...maybe not? [20:03:44] please take your time. There is no rush ;) [20:03:47] enwiki and eswiki, frwiktionary, hewiki [20:03:49] thanks [20:04:06] so if it's enabled on any of these as well, you're probably good [20:04:12] (wmgVisualEditorIsSecondaryEditor) [20:04:21] with zhwiki, we're at all Wikipedias except pwnwiki [20:04:55] i double checked it, and VE loads as expected [20:05:01] hashar: over to you :)) [20:05:17] okay great :D [20:05:21] launching! 
[20:05:32] and thanks MatmaRex for raising that up [20:06:49] actually I am doing group 0 not testwikis [20:08:02] Promote group0 from 1.38.0-wmf.12 to 1.38.0-wmf.12 refs T293954 [y/N] y [20:08:03] T293954: 1.38.0-wmf.13 deployment blockers - https://phabricator.wikimedia.org/T293954 [20:08:08] so hmm I broke the script :D [20:08:19] * urbanecm was just typing "fingers crossed" [20:08:21] too late i guess [20:08:22] it tries to promote from 12 to 12 [20:08:52] I have hit ^C before pressing enter [20:09:27] fun [20:09:54] so I am blocked until I figure out why it can't find out the new version [20:09:56] i thought the script has an argument of target version? [20:11:43] hmm maybe [20:11:47] but really it should just work [20:11:57] docs in https://github.com/wikimedia/mediawiki-tools-release/blob/master/bin/deploy-promote#L45 say "defaults to last version in wikiversions.json" [20:12:57] or the doc is outdated [20:13:42] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM!" [homer/public] - 10https://gerrit.wikimedia.org/r/747175 (https://phabricator.wikimedia.org/T295804) (owner: 10BBlack) [20:13:45] https://github.com/wikimedia/mediawiki-tools-release/blob/master/bin/deploy-promote#L339 looks to query scap wikiversions-inuse --staging, which outputs only 1.38.0-wmf.12 [20:13:50] https://wikitech.wikimedia.org/wiki/Heterogeneous_deployment/Train_deploys#Wait_for_deploy_window shows ~/release/bin/deploy-promote group0 [20:14:19] (03PS1) 10Hashar: group0 wikis to 1.38.0-wmf.13 refs T293954 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/747200 [20:14:22] (03CR) 10Hashar: [C: 03+2] group0 wikis to 1.38.0-wmf.13 refs T293954 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/747200 (owner: 10Hashar) [20:14:25] and wikiversions.json does not have wmf.13 in it [20:14:32] yeah [20:14:43] so i think that the "Sync to cluster and verify on testwiki" step was not done [20:15:09] (and the later step you quote assumes testwiki is already at the new version, so promoting to 
"newest version that's in use" works) [20:15:10] (03Merged) 10jenkins-bot: group0 wikis to 1.38.0-wmf.13 refs T293954 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/747200 (owner: 10Hashar) [20:15:14] my 2c on the bug :)) [20:15:26] I have amended the wiki doc [20:15:32] ohh [20:15:36] yeah testwikis that is it [20:15:37] bah [20:15:40] thank you urbanecm ! [20:15:46] any time! [20:16:19] 20:15:51 Check 'Logstash Error rate for mw1416.eqiad.wmnet' failed: ERROR: 92% OVER_THRESHOLD (Avg. Error rate: Before: 0.00, After: 14.00, Threshold: 1.00) [20:16:23] :( [20:16:31] that looks like a very short window to me [20:16:32] it is a single canary though [20:16:50] !log hashar@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.38.0-wmf.13 refs T293954 [20:16:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:16:55] T293954: 1.38.0-wmf.13 deployment blockers - https://phabricator.wikimedia.org/T293954 [20:17:59] (03CR) 10AOkoth: gitlab: restore script keep_config options (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/741675 (https://phabricator.wikimedia.org/T274463) (owner: 10AOkoth) [20:18:09] ouch, I can't even go onto https://test.wikipedia.org/ [20:18:20] (03PS2) 10BBlack: eqiad lvs_neighbors: swap lvs1020 for lvs1016 [homer/public] - 10https://gerrit.wikimedia.org/r/747175 (https://phabricator.wikimedia.org/T295804) [20:18:44] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[20:18:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:18:55] hashar: i think you did not build i18n [20:19:00] (ie full scap sync-world) [20:19:12] group0 is fully down [20:19:46] PROBLEM - MediaWiki exceptions and fatals per minute for appserver on alert1001 is CRITICAL: 4704 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:19:51] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [20:19:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:19:56] hashar: please revert [20:20:04] yeah [20:20:06] trying [20:20:16] FileNotFoundError: [Errno 2] ExtensionMessages not found in /srv/mediawiki-staging/wmf-config/ExtensionMessages-1.38.0-wmf.13.php: '/srv/mediawiki-staging/wmf-config/ExtensionMessages-1.38.0-wmf.13.php' [20:20:17] :( [20:20:29] yeah, missing scap sync-world in the proces [20:20:36] i18n not getting build [20:20:57] I did sync-world earlier [20:21:21] anyway I can't seem to be able to rollback due the above error bah [20:21:27] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM" [homer/public] - 10https://gerrit.wikimedia.org/r/747175 (https://phabricator.wikimedia.org/T295804) (owner: 10BBlack) [20:21:31] hashar: did you use --force? [20:21:32] --force? 
[20:21:52] ditto [20:22:03] or maybe I can sync-file wikiversions.json [20:22:15] (03CR) 10BBlack: [C: 03+2] eqiad lvs_neighbors: swap lvs1020 for lvs1016 [homer/public] - 10https://gerrit.wikimedia.org/r/747175 (https://phabricator.wikimedia.org/T295804) (owner: 10BBlack) [20:22:49] (03Merged) 10jenkins-bot: eqiad lvs_neighbors: swap lvs1020 for lvs1016 [homer/public] - 10https://gerrit.wikimedia.org/r/747175 (https://phabricator.wikimedia.org/T295804) (owner: 10BBlack) [20:23:05] hashar: that won't work [20:23:13] It has a compile step in it [20:23:19] yeah :-\ [20:23:35] Compile manually and syncing the php version should work [20:23:38] so I gotta trigger a rebuild of the l10n [20:23:56] cause of course we no more have the l10n update helper in scap :/ [20:23:56] That'll fail for same reason [20:24:06] PROBLEM - Check systemd state on cloudweb2001-dev is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:24:39] FileNotFoundError: [Errno 2] Directory not found: '/srv/mediawiki-staging/php-1.38.0-wmf.13/cache/l10n' [20:24:40] :-( [20:24:45] so clearly I am in trouble [20:24:54] Just group0, fortunately [20:25:00] I can try to fix it myself in a minite [20:25:17] I don't even understand why the l10n cache did not get build in the first place [20:26:35] dancy: ^ any chance that's related to the new scap version? [20:27:42] ACKNOWLEDGEMENT - Check systemd state on cloudweb2001-dev is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service andrew bogott I dont know what this is, but mediawiki behavior in codfw1dev barely matters -- its a rapidly deprecating test/dev site. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:27:50] mediawiki.org is still down BTW [20:27:57] yeah [20:27:59] opened my laptop [20:28:31] !log group0 wikis (eg mediawiki.org) are unavailable due to a deployment issue. 
We are working on it # T293954 [20:28:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:28:36] T293954: 1.38.0-wmf.13 deployment blockers - https://phabricator.wikimedia.org/T293954 [20:29:04] here if you need anything from SRE, staying hands-off otherwise [20:29:09] compiled php version [20:29:16] hashar: mind if i try to sync it? [20:29:20] here if there's anything i can help with. [20:29:24] please try yes [20:29:33] (03CR) 10Dzahn: [C: 03+1] debian mirrors: add new mirror, mirror1001 in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/745612 (https://phabricator.wikimedia.org/T286898) (owner: 10JHathaway) [20:29:37] cause I can't find a way to rebuild the localization cache from scratch [20:30:31] doing [20:30:34] the issue might be that I ran `scap sync-world` while wmf.13 was not listed in wikiversions.json [20:30:44] sounds plausible [20:30:48] which I guess might not have caused the generation of the l10n cache [20:31:04] yeah, if the testwikis step was missed, that would make sense. 
[20:31:07] so when I then run sync-wikiversion there is no l10n cache pushed anywere and the sites explode [20:31:12] !log urbanecm@deploy1002 Synchronized wikiversions.php: rollback group0 (duration: 00m 41s) [20:31:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:31:16] oh [20:31:18] wikis are up [20:31:37] I did the git revert directly on the deploy server [20:31:40] yup [20:31:41] will send it to gerrit [20:31:43] made use of that [20:31:48] RECOVERY - Check systemd state on cloudweb2001-dev is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:32:54] (03PS1) 10Hashar: Revert "group0 wikis to 1.38.0-wmf.13 refs T293954" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/747202 [20:33:17] FTR, i did `sudo -u mwdeploy cp /srv/mediawiki-staging/wikiversions.json /srv/mediawiki/wikiversions.json`, then `scap wikiversions-compile`, then `cp /srv/mediawiki/wikiversions.php /srv/mediawiki-staging/wikiversions.php` followed by `scap sync-file --force wikiversions.php 'rollback group0'` [20:33:29] (03CR) 10Hashar: [C: 03+2] "Deployed by Urbanecm" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/747202 (owner: 10Hashar) [20:33:48] that is clever urbanecm ! 
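The manual rollback quoted above is a fixed sequence of four commands, run after the revert had already been applied to the staging copy on the deploy server. As a reading aid, here is a minimal sketch that only assembles and prints that sequence without running anything. The commands and paths are copied from the log. The function name `rollback_plan` and the two path constants are hypothetical, and nothing here is part of scap itself.

```python
import shlex

# Paths as they appear in the logged commands.
STAGING = "/srv/mediawiki-staging"
LIVE = "/srv/mediawiki"


def rollback_plan(message):
    """Return the logged rollback steps as argv lists (nothing is executed)."""
    return [
        # 1. Copy the already-reverted wikiversions.json from staging to live.
        ["sudo", "-u", "mwdeploy", "cp",
         f"{STAGING}/wikiversions.json", f"{LIVE}/wikiversions.json"],
        # 2. Compile wikiversions.php from the JSON file.
        ["scap", "wikiversions-compile"],
        # 3. Copy the compiled PHP file back into staging.
        ["cp", f"{LIVE}/wikiversions.php", f"{STAGING}/wikiversions.php"],
        # 4. Sync only that file; --force was used because the normal path failed.
        ["scap", "sync-file", "--force", "wikiversions.php", message],
    ]


# Dry run: print each step as a shell-quoted command line.
for argv in rollback_plan("rollback group0"):
    print(shlex.join(argv))
```

Printing rather than executing is deliberate: the sequence bypasses the usual safety checks, so it is better reviewed by a human before being pasted on a deploy host.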
[20:34:08] (03Merged) 10jenkins-bot: Revert "group0 wikis to 1.38.0-wmf.13 refs T293954" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/747202 (owner: 10Hashar) [20:34:20] so now I guess I should bump testwiki to wmf.13 [20:34:25] and run scap sync-world [20:34:33] which would trigger the l10n cache build for wmf.13 [20:34:38] (03CR) 10Dzahn: [C: 03+1] "alright, yea, let's ship it" [puppet] - 10https://gerrit.wikimedia.org/r/747128 (https://phabricator.wikimedia.org/T297144) (owner: 10Herron) [20:34:51] !log Group 0 wikis are available again and still on 1.38.0-wmf.12 [20:34:52] !log Manually rollback group0 to wmf.12 by running `sudo -u mwdeploy cp /srv/mediawiki-staging/wikiversions.json /srv/mediawiki/wikiversions.json && scap wikiversions-compile && cp /srv/mediawiki/wikiversions.php /srv/mediawiki-staging/wikiversions.php && scap sync-file --force wikiversions.php 'rollback group0'` [20:34:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:34:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:35:09] log'ed the magic sequence, as it's an usual operation to perform [20:35:17] * urbanecm logs off from deployment infra [20:35:50] !log hashar@deploy1002 Started scap: testwiki to php-1.38.0-wmf.13 and rebuild l10n cache [20:35:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:36:00] hashar: what you suggested as next steps make sense to me [20:36:06] doing that [20:36:10] great [20:36:28] so my rookie mistake is that I did a sync world without any entries in wikiversion.json being at wmf.13 [20:36:34] thus syncing solely the code [20:36:40] PROBLEM - Check unit status of httpbb_hourly_appserver on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [20:36:42] sounds like it [20:37:04] 20:36:51 Updating ExtensionMessages-1.38.0-wmf.13.php [20:37:04] 
20:36:51 Updating LocalisationCache for 1.38.0-wmf.13 using 30 thread(s) [20:37:12] sounds about right [20:37:23] urbanecm: please order yourself a "I fixed the website" t-shirt :] [20:37:36] where do i do that? :)) [20:37:43] no idea haha [20:38:15] ages ago we had a t-shirt "I broke wikipedia and I fixed it" [20:38:17] Isn't it "I broke wikipedia and I fixed it" .. you each get half of it .. lol. [20:38:27] which was not to brag about breaking wikipedia, cause at the time it was super easy to do [20:38:40] but really that one managed to fix it using any leverage needed [20:38:59] the key point being to be totally transparent about what has happened including being honest with mistake [20:39:09] the second point is screaming for help as soon as possible :D [20:39:19] i think both things happened here :)) [20:39:26] i'm now imagining a little 2-piece friendship necklace like were popular in the 90s with half of the phrase on each side. [20:39:38] :wiki_love: :D [20:39:47] leaving it to hashar now [20:39:52] I would be very honored to share such a necklace with urbanecm [20:40:38] urbanecm: thank you so much. I am back on track! [20:40:41] (03PS1) 10BBlack: pybal: peer all eqiad lvses with eqiad routers [puppet] - 10https://gerrit.wikimedia.org/r/747203 (https://phabricator.wikimedia.org/T295804) [20:41:06] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[20:41:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:41:30] and the canary error was probably legit but since it only got detected on one canary that was not enough to abort [20:41:38] I guess cause group0 does not have that much traffic [20:42:35] (03CR) 10BBlack: [C: 03+2] pybal: peer all eqiad lvses with eqiad routers [puppet] - 10https://gerrit.wikimedia.org/r/747203 (https://phabricator.wikimedia.org/T295804) (owner: 10BBlack) [20:43:17] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/747203 (https://phabricator.wikimedia.org/T295804) (owner: 10BBlack) [20:43:21] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [20:43:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:43:35] brennen: dancy: thank you :) [20:44:28] RECOVERY - PyBal BGP sessions are established on lvs1020 is OK: NaN https://wikitech.wikimedia.org/wiki/PyBal https://grafana.wikimedia.org/dashboard/db/pybal-bgp?var-datasource=eqiad+prometheus/ops [20:45:00] NaN? [20:45:16] https://en.wikipedia.org/wiki/NaN [20:45:36] don't ask me why it reports the NaN or why that's ok, but the thing it's checking is actually functioning :) [20:45:59] the linked grafana dashboard ain't much help either :) [20:46:08] that's exactly what I was asking :-) [20:47:27] AntiComposite: it actually is, once you get the datasource and server parameters fixed [20:47:34] RECOVERY - Check unit status of httpbb_hourly_appserver on cumin1001 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [20:48:32] hashar: Sorry abut the trouble. Reading scrollback to see what happened. 
[20:49:12] dancy: I did the scap prep and then a scap sync-world but wmf.13 was not in wikiversions.json so the l10n cache did not get built [20:49:19] which I guess is working as expected [20:49:23] ooh, edge case [20:49:37] the issue is that I should have promoted "testwiki" to wmf.13 which would have caused the sync-world to build the cache [20:49:52] then an hour ago I did the scap wikiversions which, well, only synced that file [20:50:01] promoting all of group0 wikis to wmf.13 as expected [20:50:06] but without any l10n cache :-\ [20:50:40] RECOVERY - MediaWiki exceptions and fatals per minute for appserver on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:50:46] so yeah I should have followed the process to the letter :/ [20:50:50] nod. [20:50:52] hugs [20:51:05] trying to make sense of all the backlog, I've been off working on unrelated things: are we fully-ok now on whatever happened with train stuff? [20:51:16] yup [20:51:30] ok thanks [20:51:34] I have headed back to the start of the process and am now doing the testwiki update [20:51:49] then will promote group0 to wmf.13 [20:52:27] ok [20:52:42] I have a semi-risky test to execute on the lvs stuff, but I'll wait till after you're done just in case [20:53:06] I will let you know as soon as it has completed [20:53:36] sync is roughly 40% done [20:57:22] PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [20:59:34] majavah: Looks like I'm off the hook!
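The root cause described above is that group0 was promoted to a version that no wiki in wikiversions.json was using yet, so no l10n cache had been built for it. A minimal sketch of a pre-flight check for that condition follows. It assumes wikiversions.json maps wiki database names to strings such as `php-1.38.0-wmf.13`, which matches the log messages but is not verified here. The helper names and the sample data are hypothetical, not part of scap or deploy-promote.

```python
def versions_in_use(wikiversions):
    """Return the set of MediaWiki versions that at least one wiki points at."""
    return {value.removeprefix("php-") for value in wikiversions.values()}


def safe_to_promote(wikiversions, target_version):
    """True only if some wiki (for example testwiki) already runs target_version."""
    return target_version in versions_in_use(wikiversions)


# Hypothetical excerpts of wikiversions.json, before and after the testwiki step.
before_testwiki_step = {
    "testwiki": "php-1.38.0-wmf.12",
    "mediawikiwiki": "php-1.38.0-wmf.12",
    "enwiki": "php-1.38.0-wmf.12",
}
after_testwiki_step = dict(before_testwiki_step, testwiki="php-1.38.0-wmf.13")

print(safe_to_promote(before_testwiki_step, "1.38.0-wmf.13"))  # False
print(safe_to_promote(after_testwiki_step, "1.38.0-wmf.13"))   # True
```

A check like this would have flagged the "promote from 12 to 12" prompt as a symptom of the skipped testwiki step, rather than as a bug in the promote script.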
[21:02:48] RECOVERY - SSH on contint1001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:04:41] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (Kanban): connect 2nd cloudcontrol200x-dev NIC to vlan 2105 - https://phabricator.wikimedia.org/T297588 (10Papaul) a:05Papaul→03None [21:06:00] (03PS1) 10Eric Gardner: Don't attempt to scroll to a non-existing result [extensions/MediaSearch] (wmf/1.38.0-wmf.13) - 10https://gerrit.wikimedia.org/r/747078 [21:09:37] !log hashar@deploy1002 Finished scap: testwiki to php-1.38.0-wmf.13 and rebuild l10n cache (duration: 33m 47s) [21:09:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:09:46] now promoting group0 wikis [21:10:20] $ ~/release/bin/deploy-promote group0 [21:10:20] Promote group0 from 1.38.0-wmf.12 to 1.38.0-wmf.13 refs T293954 [21:10:20] T293954: 1.38.0-wmf.13 deployment blockers - https://phabricator.wikimedia.org/T293954 [21:10:28] this time deploy promote works as expected [21:10:34] (03PS1) 10Hashar: group0 wikis to 1.38.0-wmf.13 refs T293954 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/747204 [21:10:36] 👍🏾 [21:10:36] (03CR) 10Hashar: [C: 03+2] group0 wikis to 1.38.0-wmf.13 refs T293954 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/747204 (owner: 10Hashar) [21:10:45] I feel dumb really [21:10:54] Flying too close to the sun [21:11:06] I have been running that for ages and I still manage to screw up something whenever I try to outsmart the process [21:11:09] yeah [21:11:13] (03Merged) 10jenkins-bot: group0 wikis to 1.38.0-wmf.13 refs T293954 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/747204 (owner: 10Hashar) [21:11:41] funnily I was saying this week-end that holidays and week-end are usually super quiet [21:11:53] hehe [21:11:56] indicating that humans touching computers are the root cause of all issues and outages [21:12:13] and that 
really we should be replaced by cron jobs :D [21:12:17] nod. then squirrels. [21:13:36] logstash has 67k queries missing for mediawiki.org from 20:15 to 20:30 [21:15:20] a huge chunk of them being for /w/index.php?title=Special:HideBanners&duration=604800&category=fundraising&reason=close [21:15:50] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [21:15:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:18:11] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [21:18:12] !log hashar@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.38.0-wmf.13 refs T293954 [21:18:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:18:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:18:19] T293954: 1.38.0-wmf.13 deployment blockers - https://phabricator.wikimedia.org/T293954 [21:19:24] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [21:19:48] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [21:19:55] bblack: I have updated the group0 wikis to wmf.13 and there is no log error so it is probably a good one :] [21:19:58] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 131, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [21:20:02] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 44, down: 1, dormant: 0, 
excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [21:20:16] hashar: thanks! [21:20:41] I will go off for the night soonish [21:20:59] I'll keep watch [21:21:00] and dancy is the backup if something needs assistant on the mediawiki train end [21:21:50] (Traffic on tunnel link) firing: Traffic on tunnel link - https://alerts.wikimedia.org [21:22:34] how did we lose two links? [21:23:39] 10SRE, 10serviceops, 10Wikimedia-production-error: wtp* hosts: Out of memory (allocated 39845888) (tried to allocate 131072 bytes) in OutputHandler.php - https://phabricator.wikimedia.org/T297517 (10hashar) After some deployment issue, 1.38.0-wmf.13 has reached group 0 wikis. [21:25:16] PROBLEM - Time elapsed since the last kafka event processed by purged on cp5006 is CRITICAL: 2.483e+05 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5006 [21:25:16] PROBLEM - Time elapsed since the last kafka event processed by purged on cp5001 is CRITICAL: 2.7e+04 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5001 [21:25:18] PROBLEM - Time elapsed since the last kafka event processed by purged on cp5013 is CRITICAL: 2.283e+04 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5013 [21:25:42] PROBLEM - Time elapsed since the last kafka event processed by purged on cp5012 is CRITICAL: 2.298e+04 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5012 [21:25:48] PROBLEM - Time elapsed since the last kafka event processed by purged on cp5015 is CRITICAL: 2.604e+04 gt 5000 
https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5015 [21:26:00] PROBLEM - Time elapsed since the last kafka event processed by purged on cp5016 is CRITICAL: 2.243e+04 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5016 [21:26:00] PROBLEM - Time elapsed since the last kafka event processed by purged on cp5004 is CRITICAL: 2.382e+05 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5004 [21:26:06] PROBLEM - Time elapsed since the last kafka event processed by purged on cp5011 is CRITICAL: 9763 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5011 [21:26:06] PROBLEM - Time elapsed since the last kafka event processed by purged on cp5008 is CRITICAL: 2.288e+04 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5008 [21:26:06] PROBLEM - Time elapsed since the last kafka event processed by purged on cp5010 is CRITICAL: 2.418e+04 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5010 [21:26:18] PROBLEM - Time elapsed since the last kafka event processed by purged on cp5005 is CRITICAL: 2.412e+05 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5005 [21:26:30] PROBLEM - Time elapsed since the last kafka event processed by purged on cp5003 is CRITICAL: 2.814e+04 gt 5000 
https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5003 [21:26:30] PROBLEM - Time elapsed since the last kafka event processed by purged on cp5014 is CRITICAL: 2.492e+04 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5014 [21:26:32] PROBLEM - Time elapsed since the last kafka event processed by purged on cp5002 is CRITICAL: 5.544e+04 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5002 [21:27:10] PROBLEM - Time elapsed since the last kafka event processed by purged on cp5007 is CRITICAL: 1.804e+04 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5007 [21:27:18] PROBLEM - Time elapsed since the last kafka event processed by purged on cp5009 is CRITICAL: 3.278e+04 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5009 [21:28:20] RECOVERY - Uncommitted DNS changes in Netbox on netbox1001 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [21:29:39] (03PS1) 10Legoktm: Revert "Replace deprecated methods IContextSource::getWikiPage && IContextSource::canUseWikiPage" [core] (wmf/1.38.0-wmf.13) - 10https://gerrit.wikimedia.org/r/747079 (https://phabricator.wikimedia.org/T297744) [21:29:49] (03PS1) 10Legoktm: Revert "Replace deprecated methods IContextSource::getWikiPage && IContextSource::canUseWikiPage" [core] (wmf/1.38.0-wmf.12) - 10https://gerrit.wikimedia.org/r/747080 (https://phabricator.wikimedia.org/T297744) [21:30:23] kafka brokers and purge events, 
something funky is going on [21:30:31] anyone have an idea? [21:30:40] https://grafana.wikimedia.org/d/RvscY1CZk/purged?orgId=1&var-datasource=eqsin%20prometheus%2Fops&var-instance=cp5001 [21:30:52] seems like codfw datacenter is sending purges, was only eqiad before [21:31:02] and latency is up in kafka [21:31:03] i repooled codfw eventgate-main today [21:31:07] must be related [21:31:27] https://phabricator.wikimedia.org/T296699 [21:31:39] but, just pooling it shouldn't matter... [21:31:45] looks like the graph event might've started ~3h ago [21:31:50] but just started causing those alerts above [21:31:52] yeah that seems about right [21:31:59] it is supposed to be active/active [21:32:12] it just wasn't for a while due to a bug somewhere [21:32:13] the broker latency didn't start spiking until ~15 mins ago though [21:32:27] dancy: I am off to bed. Logstash seems happy nothing concerning since I have promoted group 0 :] [21:32:29] oh wow yeah [21:32:39] 👍🏾 [21:32:47] dancy: will do some triage tomorrow morning and file tasks as needed. But it seems to be quiet train. Have a good afternoon! [21:32:55] Have a good night! [21:33:22] (03CR) 10MSantos: [C: 03+1] maps: write tegola swift credentials out to file [puppet] - 10https://gerrit.wikimedia.org/r/746897 (https://phabricator.wikimedia.org/T292700) (owner: 10Hnowlan) [21:37:02] as far as i can tell it is all on the consumer side [21:37:25] its just eqsin? [21:37:41] bblack: do yo know how the eqsin consumers are configured? how do they know which main kafka cluster cluster to consume from? 
[21:41:18] RECOVERY - Time elapsed since the last kafka event processed by purged on cp5015 is OK: (C)5000 gt (W)3000 gt 0 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5015 [21:41:25] okay, found it in puppet [21:41:26] profile::cache::purge::kafka_cluster_name [21:41:50] (Traffic on tunnel link) resolved: Traffic on tunnel link - https://alerts.wikimedia.org [21:41:58] most use main-eqiad, codfw and ulsfo use main-codfw [21:42:09] bblack: what's up with this link stuff? [21:42:21] this is an eqsin consumer reading from eqiad [21:42:39] if there is link latency, that would cause increase in RTT and consumer latency, right? [21:43:21] 10SRE, 10MediaWiki-Revision-backend, 10Performance-Team (Radar): Compress data at external storage - https://phabricator.wikimedia.org/T106386 (10Krinkle) [21:45:46] RECOVERY - Time elapsed since the last kafka event processed by purged on cp5011 is OK: (C)5000 gt (W)3000 gt 0 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5011 [21:47:44] ottomata: yeah I guess so, but the link stuff shouldn't have impacted eqsin, I don't think [21:47:51] maybe I'm missing something there [21:48:17] the latency for ulsfo should've increased, but not eqsin [21:48:26] (and even ulsfo, shouldn't be by that much [21:48:27] ) [21:51:58] PROBLEM - Time elapsed since the last kafka event processed by purged on cp5015 is CRITICAL: 3.155e+05 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5015 [21:53:21] oh yeah, I guess the primary eqsin transport is via-ulsfo [21:53:24] so this impacts that as well [21:53:33] hmmmm [21:54:44] ah [21:56:36] PROBLEM - Time elapsed since the last kafka event processed by purged on cp5011 is CRITICAL: 
5.217e+05 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5011 [21:57:36] (03PS1) 10Jbond: populate_puppetdb: update tp use config class [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/747207 [22:13:01] (03CR) 10Dzahn: gitlab: restore script keep_config options (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/741675 (https://phabricator.wikimedia.org/T274463) (owner: 10AOkoth) [22:28:46] PROBLEM - puppet last run on wcqs1001 is CRITICAL: CRITICAL: Puppet has been disabled for 604921 seconds, message: Debugging nginx - jetty request handling - ebernhardson, last run 7 days ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [22:34:22] I'm going to deploy a security patch [22:36:58] 10SRE, 10ops-eqdfw: cr2-eqdfw: PEM 1 Input Voltage Out Of Range flapping - https://phabricator.wikimedia.org/T294009 (10Papaul) 05Open→03Resolved The was a breaker problem . This is now resolved [22:37:12] 10SRE, 10ops-codfw: Installation issues on PowerEdge R440 Ganeti servers with buster / firmware update needed - https://phabricator.wikimedia.org/T296856 (10Papaul) [22:42:46] (Traffic on tunnel link) firing: Traffic on tunnel link - https://alerts.wikimedia.org [22:47:28] 10SRE, 10MediaWiki-Revision-backend, 10Performance-Team: Compress data at external storage - https://phabricator.wikimedia.org/T106386 (10Krinkle) [22:57:56] ACKNOWLEDGEMENT - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP Cathal Mooney Telia IC-331929 to cr3-eqsin down. - The acknowledgement expires at: 2021-12-16 09:00:50. https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [22:59:11] ACKNOWLEDGEMENT - OSPF status on cr3-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP Cathal Mooney Telia IC-331929 to cr1-codfw down - The acknowledgement expires at: 2021-12-15 22:58:51. 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [23:02:46] (Traffic on tunnel link) resolved: Traffic on tunnel link - https://alerts.wikimedia.org [23:03:21] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:03:35] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 132, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:04:13] RECOVERY - Time elapsed since the last kafka event processed by purged on cp5016 is OK: (C)5000 gt (W)3000 gt 1683 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5016 [23:04:17] RECOVERY - Time elapsed since the last kafka event processed by purged on cp5010 is OK: (C)5000 gt (W)3000 gt 2038 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5010 [23:04:17] RECOVERY - Time elapsed since the last kafka event processed by purged on cp5008 is OK: (C)5000 gt (W)3000 gt 1168 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5008 [23:04:35] RECOVERY - Time elapsed since the last kafka event processed by purged on cp5005 is OK: (C)5000 gt (W)3000 gt 679 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5005 [23:04:51] RECOVERY - Time elapsed since the last kafka event processed by purged on cp5003 is OK: (C)5000 gt (W)3000 gt 445.7 https://wikitech.wikimedia.org/wiki/Purged%23Alerts 
https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5003 [23:04:53] RECOVERY - Time elapsed since the last kafka event processed by purged on cp5014 is OK: (C)5000 gt (W)3000 gt 451.3 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5014 [23:04:53] RECOVERY - Time elapsed since the last kafka event processed by purged on cp5002 is OK: (C)5000 gt (W)3000 gt 315.5 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5002 [23:05:09] RECOVERY - Time elapsed since the last kafka event processed by purged on cp5006 is OK: (C)5000 gt (W)3000 gt 383.2 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5006 [23:05:11] RECOVERY - Time elapsed since the last kafka event processed by purged on cp5015 is OK: (C)5000 gt (W)3000 gt 390.4 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5015 [23:05:11] RECOVERY - Time elapsed since the last kafka event processed by purged on cp5012 is OK: (C)5000 gt (W)3000 gt 364.2 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5012 [23:05:41] RECOVERY - Time elapsed since the last kafka event processed by purged on cp5004 is OK: (C)5000 gt (W)3000 gt 418.4 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5004 [23:05:47] RECOVERY - Time elapsed since the last kafka event processed by purged on cp5011 is OK: (C)5000 gt (W)3000 gt 589.8 https://wikitech.wikimedia.org/wiki/Purged%23Alerts 
https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5011 [23:08:49] RECOVERY - Time elapsed since the last kafka event processed by purged on cp5007 is OK: (C)5000 gt (W)3000 gt 354.2 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5007 [23:08:51] RECOVERY - Time elapsed since the last kafka event processed by purged on cp5013 is OK: (C)5000 gt (W)3000 gt 579.9 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5013 [23:10:03] !log deploying patch for T297416 [23:10:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:10:49] PROBLEM - Number of messages locally queued by purged for processing on cp5012 is CRITICAL: cluster=cache_text instance=cp5012 job=purged layer=frontend site=eqsin https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5012 [23:10:59] RECOVERY - Time elapsed since the last kafka event processed by purged on cp5001 is OK: (C)5000 gt (W)3000 gt 437.8 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5001 [23:12:33] RECOVERY - Number of messages locally queued by purged for processing on cp5012 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5012 [23:15:34] !log lvs1014 (upload) - disabling pybal, will fail over traffic to lvs1020 (to test lvs1020 sanity) [23:15:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:15:50] (expect a couple of pybal/bgp alerts here) [23:18:12] (PS1) Eric Gardner: Remove multiple instance of VUEX initialization [extensions/MediaSearch] (wmf/1.38.0-wmf.13) - https://gerrit.wikimedia.org/r/747081 (https://phabricator.wikimedia.org/T297690) [23:20:19] PROBLEM - PyBal backends health check on lvs1014 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal [23:20:46] ^ expected [23:21:13] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [23:21:19] PROBLEM - pybal on lvs1014 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [23:21:39] PROBLEM - PyBal connections to etcd on lvs1014 is CRITICAL: CRITICAL: 0 connections established with conf1004.eqiad.wmnet:4001 (min=36) https://wikitech.wikimedia.org/wiki/PyBal [23:22:09] RECOVERY - Time elapsed since the last kafka event processed by purged on cp5009 is OK: (C)5000 gt (W)3000 gt 380.4 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5009 [23:26:27] !log lvs1014 (upload) restart pybal, back to normal [23:26:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:27:49] RECOVERY - BGP status on cr1-eqiad is OK: BGP OK - up: 99, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [23:27:53] RECOVERY - PyBal
connections to etcd on lvs1014 is OK: OK: 36 connections established with conf1004.eqiad.wmnet:4001 (min=36) https://wikitech.wikimedia.org/wiki/PyBal [23:27:55] RECOVERY - pybal on lvs1014 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [23:28:39] !log lvs1013 (text) - disabling pybal, will fail over traffic to lvs1020 (to test lvs1020 sanity) [23:28:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:29:03] RECOVERY - PyBal backends health check on lvs1014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [23:33:39] PROBLEM - pybal on lvs1013 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [23:33:41] PROBLEM - PyBal backends health check on lvs1013 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal [23:34:05] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [23:34:23] PROBLEM - PyBal connections to etcd on lvs1013 is CRITICAL: CRITICAL: 0 connections established with conf1004.eqiad.wmnet:4001 (min=12) https://wikitech.wikimedia.org/wiki/PyBal [23:34:28] ^ again all expected [23:35:44] (CR) Dzahn: [C: +2] "merging this.
also checked in codesearch it's not used in other code" [puppet] - https://gerrit.wikimedia.org/r/736596 (https://phabricator.wikimedia.org/T294802) (owner: Dzahn) [23:38:14] (CR) Dzahn: "before:" [puppet] - https://gerrit.wikimedia.org/r/736596 (https://phabricator.wikimedia.org/T294802) (owner: Dzahn) [23:39:31] SRE, serviceops, Patch-For-Review: parsoid servers are not matched by mw* cumin aliases - https://phabricator.wikimedia.org/T294802 (Dzahn) the "all-mw-*" aliases now include parsoid servers: ` before: [cumin1001:~] $ sudo cumin A:all-mw-eqiad 'uptime' 157 hosts will be targeted: mw[1302-1456].eqi... [23:41:42] SRE, serviceops, Patch-For-Review: parsoid servers are not matched by mw* cumin aliases - https://phabricator.wikimedia.org/T294802 (Dzahn) Stalled→Resolved I did add them to "all-mw" while not touching core "mw". Based on Gerrit comments etc. Hope this still resolves it! [23:44:09] !log lvs1013 (text) restart pybal, back to normal [23:44:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:44:41] RECOVERY - pybal on lvs1013 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [23:44:45] RECOVERY - PyBal backends health check on lvs1013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [23:45:09] RECOVERY - BGP status on cr1-eqiad is OK: BGP OK - up: 99, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [23:46:49] RECOVERY - PyBal connections to etcd on lvs1013 is OK: OK: 12 connections established with conf1004.eqiad.wmnet:4001 (min=12) https://wikitech.wikimedia.org/wiki/PyBal [23:49:09] !log lvs1015 (internal services) - disabling pybal, will fail over traffic to lvs1020 (to test lvs1020 sanity) [23:49:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:51:35] (CR) Catrope: [C: +2] Remove multiple instance of VUEX
initialization [extensions/MediaSearch] (wmf/1.38.0-wmf.13) - https://gerrit.wikimedia.org/r/747081 (https://phabricator.wikimedia.org/T297690) (owner: Eric Gardner) [23:51:50] (CR) Catrope: [C: +2] Don't attempt to scroll to a non-existing result [extensions/MediaSearch] (wmf/1.38.0-wmf.13) - https://gerrit.wikimedia.org/r/747078 (owner: Eric Gardner) [23:51:59] (CR) Catrope: [C: +2] Revert "Replace deprecated methods IContextSource::getWikiPage && IContextSource::canUseWikiPage" [core] (wmf/1.38.0-wmf.13) - https://gerrit.wikimedia.org/r/747079 (https://phabricator.wikimedia.org/T297744) (owner: Legoktm) [23:52:04] (CR) Catrope: [C: +2] Revert "Replace deprecated methods IContextSource::getWikiPage && IContextSource::canUseWikiPage" [core] (wmf/1.38.0-wmf.12) - https://gerrit.wikimedia.org/r/747080 (https://phabricator.wikimedia.org/T297744) (owner: Legoktm) [23:52:42] --^ Early +2s for the upcoming backport window, because CI takes forever to merge them [23:53:09] SRE, Observability-Logging, Patch-For-Review: Move logstash api-feature-usage output away from v5 cluster - https://phabricator.wikimedia.org/T297239 (colewhite) I tested Logstash 7.10 writing api feature usage logs to an ES 6 instance in cloud. Somewhere in the pipeline, the api feature usage logs...
[23:53:42] (CR) Andrew Bogott: [C: +2] cloudmetrics: make cloudmetrics1003 the primary, 1004 the secondary [puppet] - https://gerrit.wikimedia.org/r/745950 (https://phabricator.wikimedia.org/T289888) (owner: Andrew Bogott) [23:53:47] Thanks Roan [23:54:01] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [23:54:11] (CR) Andrew Bogott: [C: +2] Replace cloudmetrics1001 with cloudmetrics1003 [dns] - https://gerrit.wikimedia.org/r/747174 (https://phabricator.wikimedia.org/T297712) (owner: Andrew Bogott) [23:54:41] PROBLEM - pybal on lvs1015 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [23:54:57] PROBLEM - PyBal backends health check on lvs1015 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal [23:55:35] PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [23:56:47] PROBLEM - PyBal connections to etcd on lvs1015 is CRITICAL: CRITICAL: 0 connections established with conf1004.eqiad.wmnet:4001 (min=77) https://wikitech.wikimedia.org/wiki/PyBal [23:57:33] does the restbase-dev1005 alert have any known cause? [23:57:47] RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [23:58:31] hopefully just a blip!