[00:00:04] <jouncebot>	 RoanKattouw and Urbanecm: Dear deployers, time to do the UTC late backport window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211214T0000).
[00:00:04] <jouncebot>	 Jdlrobson: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[00:09:44] <Jdlrobson>	 present
[00:09:50] <Jdlrobson>	 Is there a deployer available?
[00:10:26] <EricGardner>	 I know that Roan is not available at the moment
[00:10:37] <Jdlrobson>	 Thanks EricGardner 
[00:10:46] <wikibugs>	 (03PS3) 10Jdlrobson: MinervaDonateLink is enabled in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/745940 (https://phabricator.wikimedia.org/T191743)
[00:10:56] <EricGardner>	 I too am trying to get some things deployed in the current window but I'm still putting cherry picks together
[00:11:11] <Jdlrobson>	 thcipriani: are you around?
[00:13:10] <wikibugs>	 (03PS4) 10Jdlrobson: Default commons search experience is MediaSearch [mediawiki-config] - 10https://gerrit.wikimedia.org/r/745935 (https://phabricator.wikimedia.org/T297484)
[00:13:25] <wikibugs>	 (03PS6) 10Jdlrobson: Clean up readers web team config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/743051
[00:13:34] <wikibugs>	 (03PS3) 10Jdlrobson: Remove broken wikipedia-wordmark-en.png symlink [mediawiki-config] - 10https://gerrit.wikimedia.org/r/745573 (https://phabricator.wikimedia.org/T278193)
[00:24:17] <wikibugs>	 (03CR) 10Eric Gardner: "This change is ready for review." [extensions/MediaSearch] (wmf/1.38.0-wmf.12) - 10https://gerrit.wikimedia.org/r/746915 (https://phabricator.wikimedia.org/T297529) (owner: 10Eric Gardner)
[00:24:35] <wikibugs>	 10SRE, 10serviceops, 10Wikimedia-production-error: wtp* hosts: Out of memory (allocated 39845888) (tried to allocate 131072 bytes) in OutputHandler.php - https://phabricator.wikimedia.org/T297517 (10tstarling) >>! In T297517#7566856, @brennen wrote: > We're currently on 1.38.0-wmf.9, and this remains a block...
[00:25:25] <tgr>	 Jdlrobson: did you find someone?
[00:26:06] <Jdlrobson>	 tgr: nope
[00:26:47] <tgr>	 I'll deploy then
[00:26:54] <Jdlrobson>	 tgr: thank you <3
[00:27:18] <EricGardner>	 I'm taking myself off the list, my patches can ride the train it turns out
[00:27:22] <wikibugs>	 10SRE, 10serviceops, 10Wikimedia-production-error: wtp* hosts: Out of memory (allocated 39845888) (tried to allocate 131072 bytes) in OutputHandler.php - https://phabricator.wikimedia.org/T297517 (10tstarling) The only thing unique to this report as compared to T296098 and T296063 is the failure mode, i.e. m...
[00:28:01] <wikibugs>	 (03Abandoned) 10Eric Gardner: Vue: Unbreak after Vue 3 migration [extensions/MediaSearch] (wmf/1.38.0-wmf.12) - 10https://gerrit.wikimedia.org/r/746915 (https://phabricator.wikimedia.org/T297529) (owner: 10Eric Gardner)
[00:29:35] <wikibugs>	 (03CR) 10Gergő Tisza: [C: 03+2] Default commons search experience is MediaSearch [mediawiki-config] - 10https://gerrit.wikimedia.org/r/745935 (https://phabricator.wikimedia.org/T297484) (owner: 10Jdlrobson)
[00:30:15] <wikibugs>	 (03Merged) 10jenkins-bot: Default commons search experience is MediaSearch [mediawiki-config] - 10https://gerrit.wikimedia.org/r/745935 (https://phabricator.wikimedia.org/T297484) (owner: 10Jdlrobson)
[00:31:23] <tgr>	 Jdlrobson: first patch is on mwdebug1001
[00:31:28] <Jdlrobson>	 testing
[00:34:11] <Jdlrobson>	 LGTM.
[00:34:15] <Jdlrobson>	 Haven't checked the logs yet
[00:34:35] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[00:34:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:34:44] <wikibugs>	 (03PS4) 10Gergő Tisza: Remove broken wikipedia-wordmark-en.png symlink [mediawiki-config] - 10https://gerrit.wikimedia.org/r/745573 (https://phabricator.wikimedia.org/T278193) (owner: 10Jdlrobson)
[00:35:07] <Jdlrobson>	 tgr: i think we're good to sync that one. Not seeing anything new on logstash/mwdebug channel
[00:35:46] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[00:35:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:36:10] <wikibugs>	 (03CR) 10Gergő Tisza: [C: 03+2] Remove broken wikipedia-wordmark-en.png symlink [mediawiki-config] - 10https://gerrit.wikimedia.org/r/745573 (https://phabricator.wikimedia.org/T278193) (owner: 10Jdlrobson)
[00:36:22] <logmsgbot>	 !log tgr@deploy1002 Synchronized wmf-config/CommonSettings.php: Config: [[gerrit:745935|Default commons search experience is MediaSearch (T297484)]] (duration: 00m 56s)
[00:36:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:36:27] <stashbot>	 T297484: Update how destination of top-right search form is set - https://phabricator.wikimedia.org/T297484
[00:36:52] <wikibugs>	 (03Merged) 10jenkins-bot: Remove broken wikipedia-wordmark-en.png symlink [mediawiki-config] - 10https://gerrit.wikimedia.org/r/745573 (https://phabricator.wikimedia.org/T278193) (owner: 10Jdlrobson)
[00:37:34] <tgr>	 Jdlrobson: second patch is on mwdebug1001
[00:38:34] <wikibugs>	 (03PS4) 10Gergő Tisza: MinervaDonateLink is enabled in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/745940 (https://phabricator.wikimedia.org/T191743) (owner: 10Jdlrobson)
[00:39:20] <Jdlrobson>	 tgr: that's good to go too.
[00:41:20] <logmsgbot>	 !log tgr@deploy1002 Synchronized images/mobile/: Config: [[gerrit:745573|Remove broken wikipedia-wordmark-en.png symlink (T278193)]] (duration: 00m 56s)
[00:41:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:41:25] <stashbot>	 T278193: [php-fpm] Symbolic link not allowed or link target not accessible: wikipedia-wordmark-en.png - https://phabricator.wikimedia.org/T278193
[00:42:01] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[00:42:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:43:10] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[00:43:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:44:22] <tgr>	 Jdlrobson: can you check production too? (if there's anything to check, not sure how that works with symlinks) Files can be tricky due to the edge cache needing purges.
[00:44:54] <Jdlrobson>	 tgr: yep checking
[00:45:05] <wikibugs>	 (03CR) 10Gergő Tisza: [C: 03+2] MinervaDonateLink is enabled in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/745940 (https://phabricator.wikimedia.org/T191743) (owner: 10Jdlrobson)
[00:45:59] <wikibugs>	 (03Merged) 10jenkins-bot: MinervaDonateLink is enabled in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/745940 (https://phabricator.wikimedia.org/T191743) (owner: 10Jdlrobson)
[00:47:00] <Jdlrobson>	 php-fpm I'm getting a 404 on https://en.wikipedia.org/images/mobile/wikipedia-wordmark-en.png so that's promising
[00:47:05] <Jdlrobson>	 Need to monitor the logs a bit more though
[00:48:16] <tgr>	 thanks! meanwhile the third patch is on mwdebug
[00:48:29] <Jdlrobson>	 tgr: testing that one now..
[00:48:55] <Jdlrobson>	 Original exception: [f282c392-f8a3-47e6-8e19-f2bc9e3b5475] 2021-12-14 00:48:32: Fatal exception of type "TypeError" doesn't seem good
[00:49:23] <Jdlrobson>	 ahh yeh that one's not good.
[00:49:24] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[00:49:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:49:37] <Jdlrobson>	 It looks like I misread the format. Please revert that one. I'll redo it
[00:49:54] <wikibugs>	 (03PS1) 10Jdlrobson: Revert "MinervaDonateLink is enabled in production" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/746916
[00:50:33] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[00:50:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:51:41] <wikibugs>	 (03PS1) 10Gergő Tisza: Revert "MinervaDonateLink is enabled in production" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/746917
[00:51:49] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=sidekiq site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:51:55] <wikibugs>	 (03PS1) 10Jdlrobson: [Attempt 2] MinervaDonateLink is enabled in production"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/746977
[00:52:03] <Jdlrobson>	 tgr: the above one is the correct one^
[00:52:16] <Jdlrobson>	 not sure if it makes sense to revert then try again or just squash these into 2
[00:52:57] <tgr>	 squashing is nicer if you can do it
[00:53:01] <Jdlrobson>	 can
[00:53:30] <wikibugs>	 (03Abandoned) 10Gergő Tisza: Revert "MinervaDonateLink is enabled in production" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/746917 (owner: 10Gergő Tisza)
[00:53:33] <wikibugs>	 (03PS2) 10Jdlrobson: [Attempt 2] MinervaDonateLink is enabled in production"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/746977
[00:53:37] <Jdlrobson>	 there you go
[00:53:53] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:54:49] <wikibugs>	 (03CR) 10Gergő Tisza: [C: 03+2] [Attempt 2] MinervaDonateLink is enabled in production"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/746977 (owner: 10Jdlrobson)
[00:55:17] <wikibugs>	 10SRE, 10serviceops, 10Wikimedia-production-error: wtp* hosts: Out of memory (allocated 39845888) (tried to allocate 131072 bytes) in OutputHandler.php - https://phabricator.wikimedia.org/T297517 (10brennen) > Is tuning the kernel the thing that you want unbroken now? Again, it has probably been broken for y...
[00:55:28] <wikibugs>	 10SRE, 10serviceops, 10Wikimedia-production-error: wtp* hosts: Out of memory (allocated 39845888) (tried to allocate 131072 bytes) in OutputHandler.php - https://phabricator.wikimedia.org/T297517 (10brennen)
[00:55:34] <wikibugs>	 (03Merged) 10jenkins-bot: [Attempt 2] MinervaDonateLink is enabled in production"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/746977 (owner: 10Jdlrobson)
[00:56:48] <tgr>	 Jdlrobson: it's on mwdebug1001
[00:57:34] <Jdlrobson>	 tgr: testing
[00:57:45] <Jdlrobson>	 tgr: LGTM
[00:57:49] <Jdlrobson>	 donate link still there :)
[00:58:13] <wikibugs>	 (03PS7) 10Gergő Tisza: Clean up readers web team config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/743051 (owner: 10Jdlrobson)
[00:58:37] <wikibugs>	 10SRE, 10serviceops, 10Wikimedia-production-error: wtp* hosts: Out of memory (allocated 39845888) (tried to allocate 131072 bytes) in OutputHandler.php - https://phabricator.wikimedia.org/T297517 (10ssastry) Since the train was rolled forward from wmf.9 -> wmf.12 today, [[ https://grafana.wikimedia.org/d/000...
[00:58:59] <logmsgbot>	 !log tgr@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:746977|[Attempt 2] MinervaDonateLink is enabled in production""]] (duration: 00m 57s)
[00:59:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:59:24] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[00:59:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:00:30] <tgr>	 hm, not sure what the sync order is for dblist changes these days. dblist -> yaml -> generator -> IS.php?
[01:00:41] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to 'restricted' for komla - https://phabricator.wikimedia.org/T297621 (10komla)
[01:00:57] <Jdlrobson>	 tgr: I'm not sure either. Can delay this one until tomorrow if you are not comfortable doing it
[01:01:00] <Jdlrobson>	 it's not urgent at all
[01:01:04] <Jdlrobson>	 just an opportunity to clean up some cruft
[01:03:11] <tgr>	 it has to be safe as long as PHP is left to the end, as far as I can see
[01:03:30] <wikibugs>	 (03CR) 10Gergő Tisza: [C: 03+2] Clean up readers web team config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/743051 (owner: 10Jdlrobson)
[01:03:51] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[01:03:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:04:44] <wikibugs>	 (03Merged) 10jenkins-bot: Clean up readers web team config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/743051 (owner: 10Jdlrobson)
[01:05:54] <tgr>	 Jdlrobson: it's on mwdebug1001
[01:06:01] <Jdlrobson>	 tgr: testing
[01:08:13] <Jdlrobson>	 tgr: good to sync
[01:10:04] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[01:10:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:10:20] <logmsgbot>	 !log tgr@deploy1002 Synchronized wmf-config/config/: Config: [[gerrit:743051|Clean up readers web team config]] (duration: 00m 56s)
[01:10:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:11:15] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[01:11:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:11:49] <logmsgbot>	 !log tgr@deploy1002 Synchronized dblists/mobile-anon-talk.dblist: Config: [[gerrit:743051|Clean up readers web team config]] (duration: 00m 55s)
[01:11:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:15:02] <Jdlrobson>	 tgr: and testing in production
[01:15:19] <tgr>	 it's not really deployed yet
[01:15:30] <Jdlrobson>	 ah ok ping me when i should check
[01:15:57] <tgr>	 I'm trying to figure out whether https://gerrit.wikimedia.org/r/plugins/gitiles/operations/mediawiki-config/+/6e17f55d2badd6efcf30dd856a0bcc1da35217cd/multiversion/MWConfigCacheGenerator.php#16 is for deployers or code authors
[01:18:53] <tgr>	 I guess that's outdated? https://noc.wikimedia.org/conf/ seems to include the new dblist without any manual action
[01:19:43] <merryprog>	 There's https://wikitech.wikimedia.org/w/index.php?search=%22createTxtFileSymlinks.sh%22&title=Special:Search&profile=advanced&fulltext=1&ns0=1&ns12=1&ns116=1&ns498=1 so probably outdated
[01:19:47] <Jdlrobson>	 tgr: running it locally seems to do nothing. 
[01:20:39] <logmsgbot>	 !log tgr@deploy1002 Synchronized multiversion/MWConfigCacheGenerator.php: Config: [[gerrit:743051|Clean up readers web team config]] (duration: 00m 55s)
[01:20:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:22:23] <tgr>	 seems like the last time someone ran it was in 2019
[01:22:48] <tgr>	 oh well, can't hurt.
[01:24:35] <logmsgbot>	 !log tgr@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:743051|Clean up readers web team config]] (duration: 00m 55s)
[01:24:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:25:10] <tgr>	 Jdlrobson: now deployed for reals
[01:26:20] <Jdlrobson>	 tgr: yay! thanks a bunch
[01:26:25] <Jdlrobson>	 running through some last tests
[01:27:29] <Jdlrobson>	 tgr: and all looks good to me 
[01:28:08] <wikibugs>	 (03CR) 10Gergő Tisza: [C: 03+2] wgEventStreams: Add WelcomeSurvey Interaction schema [mediawiki-config] - 10https://gerrit.wikimedia.org/r/745833 (https://phabricator.wikimedia.org/T267273) (owner: 10Kosta Harlan)
[01:30:19] <wikibugs>	 (03PS4) 10Gergő Tisza: wgEventStreams: Add WelcomeSurvey Interaction schema [mediawiki-config] - 10https://gerrit.wikimedia.org/r/745833 (https://phabricator.wikimedia.org/T267273) (owner: 10Kosta Harlan)
[01:32:50] <wikibugs>	 (03CR) 10Gergő Tisza: [C: 03+2] wgEventStreams: Add WelcomeSurvey Interaction schema [mediawiki-config] - 10https://gerrit.wikimedia.org/r/745833 (https://phabricator.wikimedia.org/T267273) (owner: 10Kosta Harlan)
[01:33:40] <wikibugs>	 (03Merged) 10jenkins-bot: wgEventStreams: Add WelcomeSurvey Interaction schema [mediawiki-config] - 10https://gerrit.wikimedia.org/r/745833 (https://phabricator.wikimedia.org/T267273) (owner: 10Kosta Harlan)
[01:36:29] <logmsgbot>	 !log tgr@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:745833|wgEventStreams: Add WelcomeSurvey Interaction schema (T267273)]] (duration: 00m 56s)
[01:36:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:36:35] <stashbot>	 T267273: [arwiki] Submitting a POST on a form redirected to immediately after account creation sometimes logs user out - https://phabricator.wikimedia.org/T267273
[01:37:47] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[01:37:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:39:00] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[01:39:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:41:03] <tgr>	 !log UTC late deploys done
[01:41:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:41:42] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation restart without plugin upgrade (3 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad rolling restart - ryankemper@cumin1001 - T297468
[01:41:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:42:16] <ryankemper>	 !log T297468 `sudo cookbook sre.elasticsearch.rolling-operation search_eqiad "eqiad rolling restart" --nodes-per-run 3 --start-datetime 2021-12-14T01:27:58 --task-id T297468` on `ryankemper@cumin1001` tmux `elastic_restarts`
[01:42:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:43:15] <icinga-wm>	 PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 98 probes of 638 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[01:49:23] <icinga-wm>	 RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 47 probes of 638 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[02:05:28] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[02:05:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:06:42] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[02:06:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:07:02] <wikibugs>	 (03PS1) 10TrainBranchBot: Branch commit for wmf/1.38.0-wmf.13 [core] (wmf/1.38.0-wmf.13) - 10https://gerrit.wikimedia.org/r/746984
[02:07:04] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/1.38.0-wmf.13 [core] (wmf/1.38.0-wmf.13) - 10https://gerrit.wikimedia.org/r/746984 (owner: 10TrainBranchBot)
[02:13:56] <wikibugs>	 (03Abandoned) 10Gergő Tisza: Revert "MinervaDonateLink is enabled in production" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/746916 (owner: 10Jdlrobson)
[02:24:25] <icinga-wm>	 PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 89 probes of 638 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[02:27:00] <wikibugs>	 10SRE, 10serviceops, 10Wikimedia-production-error: wtp* hosts: Out of memory (allocated 39845888) (tried to allocate 131072 bytes) in OutputHandler.php - https://phabricator.wikimedia.org/T297517 (10tstarling) I filed T297667 for the PHP bug which I'm working on.
[02:27:07] <wikibugs>	 (03Merged) 10jenkins-bot: Branch commit for wmf/1.38.0-wmf.13 [core] (wmf/1.38.0-wmf.13) - 10https://gerrit.wikimedia.org/r/746984 (owner: 10TrainBranchBot)
[02:29:23] <icinga-wm>	 PROBLEM - cassandra-a service on aqs1014 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[02:29:31] <icinga-wm>	 PROBLEM - Check systemd state on aqs1014 is CRITICAL: CRITICAL - degraded: The following units failed: cassandra-a.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:30:15] <icinga-wm>	 PROBLEM - cassandra-a CQL 10.64.48.65:9042 on aqs1014 is CRITICAL: connect to address 10.64.48.65 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886
[02:30:23] <icinga-wm>	 RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 56 probes of 638 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[02:33:01] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[02:33:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:34:06] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[02:34:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:38:15] <icinga-wm>	 RECOVERY - cassandra-a service on aqs1014 is OK: OK - cassandra-a is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[02:38:23] <icinga-wm>	 RECOVERY - Check systemd state on aqs1014 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:41:19] <icinga-wm>	 RECOVERY - cassandra-a CQL 10.64.48.65:9042 on aqs1014 is OK: TCP OK - 0.000 second response time on 10.64.48.65 port 9042 https://phabricator.wikimedia.org/T93886
[03:00:05] <jouncebot>	 Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211214T0300)
[03:23:57] <icinga-wm>	 PROBLEM - SSH on rdb1006.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:13:03] <wikibugs>	 (03CR) 10Juan90264: "This change is ready for review." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/746919 (https://phabricator.wikimedia.org/T297580) (owner: 10Juan90264)
[04:13:40] <wikibugs>	 (03PS4) 10Juan90264: Fix wordmark to outreachwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/746919 (https://phabricator.wikimedia.org/T297580)
[04:22:51] <icinga-wm>	 PROBLEM - SSH on db2083.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:22:57] <wikibugs>	 10SRE, 10serviceops, 10Wikimedia-production-error: wtp* hosts: Out of memory (allocated 39845888) (tried to allocate 131072 bytes) in OutputHandler.php - https://phabricator.wikimedia.org/T297517 (10Ladsgroup) It indeed looks like wmf.12 has increased db traffic: https://grafana.wikimedia.org/d/000000278/mys...
[04:23:18] <wikibugs>	 (03PS1) 10Ladsgroup: Cache page properties in memory to avoid extra queries [extensions/DiscussionTools] (wmf/1.38.0-wmf.12) - 10https://gerrit.wikimedia.org/r/746920 (https://phabricator.wikimedia.org/T297132)
[04:25:01] <icinga-wm>	 RECOVERY - SSH on rdb1006.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:28:38] <wikibugs>	 (03PS2) 10Ladsgroup: Cache page properties in memory to avoid extra queries [extensions/DiscussionTools] (wmf/1.38.0-wmf.12) - 10https://gerrit.wikimedia.org/r/746920 (https://phabricator.wikimedia.org/T297132)
[04:30:19] <wikibugs>	 10SRE, 10serviceops, 10Wikimedia-production-error: wtp* hosts: Out of memory (allocated 39845888) (tried to allocate 131072 bytes) in OutputHandler.php - https://phabricator.wikimedia.org/T297517 (10Ladsgroup) Created {T297669} for the database issue.
[04:42:10] <logmsgbot>	 !log ryankemper@cumin1001 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) restart without plugin upgrade (3 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad rolling restart - ryankemper@cumin1001 - T297468
[04:42:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:02:40] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+2] Cache page properties in memory to avoid extra queries [extensions/DiscussionTools] (wmf/1.38.0-wmf.12) - 10https://gerrit.wikimedia.org/r/746920 (https://phabricator.wikimedia.org/T297132) (owner: 10Ladsgroup)
[05:07:14] <wikibugs>	 (03Merged) 10jenkins-bot: Cache page properties in memory to avoid extra queries [extensions/DiscussionTools] (wmf/1.38.0-wmf.12) - 10https://gerrit.wikimedia.org/r/746920 (https://phabricator.wikimedia.org/T297132) (owner: 10Ladsgroup)
[05:09:05] <logmsgbot>	 !log ladsgroup@deploy1002 Synchronized php-1.38.0-wmf.12/extensions/DiscussionTools/includes/Hooks/HookUtils.php: Backport: [[gerrit:746920|Cache page properties in memory to avoid extra queries (T297132 T297669)]] (duration: 00m 57s)
[05:09:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:09:11] <stashbot>	 T297669: Noticeable increase in db load after wmf.12 roll out - https://phabricator.wikimedia.org/T297669
[05:09:12] <stashbot>	 T297132: DiscussionTools is making duplicate DB requests back to back - https://phabricator.wikimedia.org/T297132
[05:11:50] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[05:11:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:12:49] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[05:12:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:23:55] <icinga-wm>	 RECOVERY - SSH on db2083.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:25:35] <wikibugs>	 (03PS1) 10Ladsgroup: blameStartupRegistry: Fix clash in $startupBytes variable name [extensions/WikimediaMaintenance] (wmf/1.38.0-wmf.13) - 10https://gerrit.wikimedia.org/r/746921 (https://phabricator.wikimedia.org/T295413)
[05:25:55] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+2] "Catch the train, doesn't seem to need syncing" [extensions/WikimediaMaintenance] (wmf/1.38.0-wmf.13) - 10https://gerrit.wikimedia.org/r/746921 (https://phabricator.wikimedia.org/T295413) (owner: 10Ladsgroup)
[05:28:01] <wikibugs>	 (03Merged) 10jenkins-bot: blameStartupRegistry: Fix clash in $startupBytes variable name [extensions/WikimediaMaintenance] (wmf/1.38.0-wmf.13) - 10https://gerrit.wikimedia.org/r/746921 (https://phabricator.wikimedia.org/T295413) (owner: 10Ladsgroup)
[05:34:01] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[05:34:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:35:00] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[05:35:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:38:25] <icinga-wm>	 PROBLEM - cassandra-b CQL 10.64.0.120:9042 on aqs1010 is CRITICAL: connect to address 10.64.0.120 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886
[05:38:41] <icinga-wm>	 PROBLEM - cassandra-b service on aqs1010 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[05:38:55] <icinga-wm>	 PROBLEM - Check systemd state on aqs1010 is CRITICAL: CRITICAL - degraded: The following units failed: cassandra-b.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:49:23] <wikibugs>	 (03CR) 10Kevin Bazira: [C: 03+1] "LGTM" [deployment-charts] - 10https://gerrit.wikimedia.org/r/746949 (https://phabricator.wikimedia.org/T293331) (owner: 10Accraze)
[05:59:33] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on 15 hosts with reason: Maintenance
[05:59:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:59:44] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on 15 hosts with reason: Maintenance
[05:59:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:00:31] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
[06:00:33] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
[06:00:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:00:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:01:19] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on db1184.eqiad.wmnet with reason: Maintenance
[06:01:21] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1184.eqiad.wmnet with reason: Maintenance
[06:01:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:01:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:01:25] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1184 (T277354)', diff saved to https://phabricator.wikimedia.org/P18180 and previous config saved to /var/cache/conftool/dbconfig/20211214-060125-marostegui.json
[06:01:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:01:30] <stashbot>	 T277354: "chemical" major mime type was never added to production database - https://phabricator.wikimedia.org/T277354
[06:03:11] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184 (T277354)', diff saved to https://phabricator.wikimedia.org/P18181 and previous config saved to /var/cache/conftool/dbconfig/20211214-060311-marostegui.json
[06:03:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:05:17] <icinga-wm>	 RECOVERY - cassandra-b service on aqs1010 is OK: OK - cassandra-b is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[06:05:33] <icinga-wm>	 RECOVERY - Check systemd state on aqs1010 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:07:11] <icinga-wm>	 RECOVERY - cassandra-b CQL 10.64.0.120:9042 on aqs1010 is OK: TCP OK - 0.000 second response time on 10.64.0.120 port 9042 https://phabricator.wikimedia.org/T93886
[06:18:16] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184', diff saved to https://phabricator.wikimedia.org/P18182 and previous config saved to /var/cache/conftool/dbconfig/20211214-061816-marostegui.json
[06:18:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:33:21] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184', diff saved to https://phabricator.wikimedia.org/P18183 and previous config saved to /var/cache/conftool/dbconfig/20211214-063321-marostegui.json
[06:33:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:35:55] <wikibugs>	 SRE, DBA, observability, Patch-For-Review, User-Ladsgroup: Send metrics of db errors of mediawiki to prometheus  - https://phabricator.wikimedia.org/T297435 (Marostegui)
[06:48:26] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184 (T277354)', diff saved to https://phabricator.wikimedia.org/P18184 and previous config saved to /var/cache/conftool/dbconfig/20211214-064825-marostegui.json
[06:48:27] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on db1169.eqiad.wmnet with reason: Maintenance
[06:48:29] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1169.eqiad.wmnet with reason: Maintenance
[06:48:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:48:31] <stashbot>	 T277354: "chemical" major mime type was never added to production database - https://phabricator.wikimedia.org/T277354
[06:48:33] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1169 (T277354)', diff saved to https://phabricator.wikimedia.org/P18185 and previous config saved to /var/cache/conftool/dbconfig/20211214-064833-marostegui.json
[06:48:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:48:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:48:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:49:17] <wikibugs>	 SRE, Infrastructure-Foundations, CAS-SSO: CAS should link to account creation tutorial - https://phabricator.wikimedia.org/T297524 (Majavah) Open→Resolved a:jbond thanks!
[06:50:20] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T277354)', diff saved to https://phabricator.wikimedia.org/P18186 and previous config saved to /var/cache/conftool/dbconfig/20211214-065019-marostegui.json
[06:50:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:05:25] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P18187 and previous config saved to /var/cache/conftool/dbconfig/20211214-070524-marostegui.json
[07:05:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:14:04] <wikibugs>	 SRE, serviceops, Wikimedia-production-error: wtp* hosts: Out of memory (allocated 39845888) (tried to allocate 131072 bytes) in OutputHandler.php - https://phabricator.wikimedia.org/T297517 (Joe) >>! In T297517#7568203, @tstarling wrote: >>>! In T297517#7566856, @brennen wrote: >> We're currently on...
[07:16:52] <wikibugs>	 SRE, serviceops, Wikimedia-production-error: wtp* hosts: Out of memory (allocated 39845888) (tried to allocate 131072 bytes) in OutputHandler.php - https://phabricator.wikimedia.org/T297517 (Joe) >>! In T297517#7568208, @tstarling wrote: > The only thing unique to this report as compared to T296098 a...
[07:20:29] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P18188 and previous config saved to /var/cache/conftool/dbconfig/20211214-072029-marostegui.json
[07:20:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:20:53] <wikibugs>	 (CR) Elukey: [C: +2] ml-services: update revscoring-articlequality img [deployment-charts] - https://gerrit.wikimedia.org/r/746949 (https://phabricator.wikimedia.org/T293331) (owner: Accraze)
[07:24:08] <logmsgbot>	 !log ryankemper@cumin2001 START - Cookbook sre.elasticsearch.rolling-operation restart without plugin upgrade (3 nodes at a time) for ElasticSearch cluster search_codfw: codfw rolling restart - ryankemper@cumin2001 - T297468
[07:24:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:24:57] <ryankemper>	 !log T297468 `sudo cookbook sre.elasticsearch.rolling-operation search_codfw "codfw rolling restart" --nodes-per-run 3 --start-datetime 2021-12-14T01:27:58 --task-id T297468` on `ryankemper@cumin2001` tmux `elastic_restarts`
[07:25:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:34:17] <wikibugs>	 (PS1) Giuseppe Lavagetto: mwdebug: switch to socket proxying [deployment-charts] - https://gerrit.wikimedia.org/r/747008
[07:35:34] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T277354)', diff saved to https://phabricator.wikimedia.org/P18189 and previous config saved to /var/cache/conftool/dbconfig/20211214-073534-marostegui.json
[07:35:35] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on db1164.eqiad.wmnet with reason: Maintenance
[07:35:37] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1164.eqiad.wmnet with reason: Maintenance
[07:35:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:35:40] <stashbot>	 T277354: "chemical" major mime type was never added to production database - https://phabricator.wikimedia.org/T277354
[07:35:42] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1164 (T277354)', diff saved to https://phabricator.wikimedia.org/P18190 and previous config saved to /var/cache/conftool/dbconfig/20211214-073541-marostegui.json
[07:35:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:35:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:35:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:35:55] <icinga-wm>	 PROBLEM - Check systemd state on elastic2046 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:37:27] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1164 (T277354)', diff saved to https://phabricator.wikimedia.org/P18191 and previous config saved to /var/cache/conftool/dbconfig/20211214-073727-marostegui.json
[07:37:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:41:41] <wikibugs>	 (PS1) Marostegui: generate_dsns_table.sh: Remove [software] - https://gerrit.wikimedia.org/r/747009
[07:42:17] <wikibugs>	 (CR) Marostegui: [C: +2] generate_dsns_table.sh: Remove [software] - https://gerrit.wikimedia.org/r/747009 (owner: Marostegui)
[07:42:47] <wikibugs>	 (Merged) jenkins-bot: generate_dsns_table.sh: Remove [software] - https://gerrit.wikimedia.org/r/747009 (owner: Marostegui)
[07:45:00] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +2] mwdebug: switch to socket proxying [deployment-charts] - https://gerrit.wikimedia.org/r/747008 (owner: Giuseppe Lavagetto)
[07:48:19] <wikibugs>	 (Merged) jenkins-bot: mwdebug: switch to socket proxying [deployment-charts] - https://gerrit.wikimedia.org/r/747008 (owner: Giuseppe Lavagetto)
[07:52:32] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1164', diff saved to https://phabricator.wikimedia.org/P18192 and previous config saved to /var/cache/conftool/dbconfig/20211214-075232-marostegui.json
[07:52:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:55:47] <icinga-wm>	 PROBLEM - SSH on contint1001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:02:29] <icinga-wm>	 RECOVERY - Check systemd state on elastic2046 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:03:58] <logmsgbot>	 !log oblivian@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[08:04:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:07:37] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1164', diff saved to https://phabricator.wikimedia.org/P18193 and previous config saved to /var/cache/conftool/dbconfig/20211214-080736-marostegui.json
[08:07:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:09:29] <logmsgbot>	 !log oblivian@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[08:09:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:13:12] <wikibugs>	 SRE, Infrastructure-Foundations, Mail, Znuny, serviceops: OTRS/mail: investigate why "T=remote_smtp_signed: all hosts for 'ticket.wikimedia.org' have been failing for a long time" - https://phabricator.wikimedia.org/T297160 (akosiaris) Had a quick look at that. It is true that we never have r...
[08:17:42] <wikibugs>	 SRE, Epic: Migrate all of production metal and VMs to Buster or later - https://phabricator.wikimedia.org/T247045 (Marostegui)
[08:21:51] <wikibugs>	 (PS1) Ayounsi: Update netflow collector for codfw/eqdfw to netflow2002 [homer/public] - https://gerrit.wikimedia.org/r/747047 (https://phabricator.wikimedia.org/T297595)
[08:22:17] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti2008.codfw.wmnet with OS buster
[08:22:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:22:22] <wikibugs>	 SRE, Infrastructure-Foundations: Migrate codfw Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296622 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti2008.codfw.wmnet with OS buster
[08:22:42] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1164 (T277354)', diff saved to https://phabricator.wikimedia.org/P18194 and previous config saved to /var/cache/conftool/dbconfig/20211214-082241-marostegui.json
[08:22:43] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on db1163.eqiad.wmnet with reason: Maintenance
[08:22:45] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1163.eqiad.wmnet with reason: Maintenance
[08:22:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:22:46] <stashbot>	 T277354: "chemical" major mime type was never added to production database - https://phabricator.wikimedia.org/T277354
[08:22:49] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1163 (T277354)', diff saved to https://phabricator.wikimedia.org/P18195 and previous config saved to /var/cache/conftool/dbconfig/20211214-082249-marostegui.json
[08:22:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:22:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:22:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:23:20] <wikibugs>	 (CR) Ayounsi: [C: +2] Update netflow collector for codfw/eqdfw to netflow2002 [homer/public] - https://gerrit.wikimedia.org/r/747047 (https://phabricator.wikimedia.org/T297595) (owner: Ayounsi)
[08:24:34] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1163 (T277354)', diff saved to https://phabricator.wikimedia.org/P18196 and previous config saved to /var/cache/conftool/dbconfig/20211214-082433-marostegui.json
[08:24:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:27:12] <jinxer-wm>	 (IcingaOverload) firing: Checks are taking long to execute on alert2001:9245  - https://grafana.wikimedia.org/d/rsCfQfuZz/icinga - https://alerts.wikimedia.org
[08:29:16] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.ganeti.makevm for new host netflow3002.esams.wmnet
[08:29:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:30:07] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.ganeti.makevm for new host netflow4002.ulsfo.wmnet
[08:30:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:30:42] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.ganeti.makevm for new host netflow5002.eqsin.wmnet
[08:30:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:31:00] <wikibugs>	 (PS1) Muehlenhoff: Update import hook to import logstash 6.8.21 [puppet] - https://gerrit.wikimedia.org/r/747048
[08:31:38] <wikibugs>	 (PS2) Muehlenhoff: Update import hook to import logstash 6.8.21 [puppet] - https://gerrit.wikimedia.org/r/747048
[08:32:12] <jinxer-wm>	 (IcingaOverload) resolved: Checks are taking long to execute on alert2001:9245  - https://grafana.wikimedia.org/d/rsCfQfuZz/icinga - https://alerts.wikimedia.org
[08:33:47] <logmsgbot>	 !log ayounsi@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host netflow4002.ulsfo.wmnet
[08:33:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:33:58] <logmsgbot>	 !log ayounsi@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host netflow5002.eqsin.wmnet
[08:33:59] <dcausse>	 !log restart blazegraph on wdqs1013 (jvm stuck for 5h)
[08:34:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:34:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:35:09] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.ganeti.makevm for new host netflow4002.ulsfo.wmnet
[08:35:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:37:37] <icinga-wm>	 PROBLEM - SSH on kubernetes1002.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:39:39] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1163', diff saved to https://phabricator.wikimedia.org/P18197 and previous config saved to /var/cache/conftool/dbconfig/20211214-083938-marostegui.json
[08:39:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:43:02] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.ganeti.makevm for new host netflow5002.eqsin.wmnet
[08:43:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:43:47] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host netflow3002.esams.wmnet
[08:43:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:44:53] <wikibugs>	 (PS1) Kosta Harlan: WelcomeSurvey: Instrument interactions with form [extensions/GrowthExperiments] (wmf/1.38.0-wmf.13) - https://gerrit.wikimedia.org/r/746925 (https://phabricator.wikimedia.org/T267273)
[08:45:12] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to 'restricted' for komla - https://phabricator.wikimedia.org/T297621 (komla) >>! In T297621#7567433, @Aklapper wrote: > Adding @komla as some data needs to be filled in above (user account registered on wikitech.wikimedia.org; separate SSH key; etc).  This has...
[08:48:07] <wikibugs>	 (PS1) Ayounsi: Add new netflow hosts to Kafka jumbo ACL [puppet] - https://gerrit.wikimedia.org/r/747050 (https://phabricator.wikimedia.org/T297595)
[08:49:10] <wikibugs>	 (CR) Elukey: [C: +1] Add new netflow hosts to Kafka jumbo ACL [puppet] - https://gerrit.wikimedia.org/r/747050 (https://phabricator.wikimedia.org/T297595) (owner: Ayounsi)
[08:49:13] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host netflow4002.ulsfo.wmnet
[08:49:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:49:21] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.ganeti.makevm for new host netflow1002.eqiad.wmnet
[08:49:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:49:52] <wikibugs>	 (CR) Kosta Harlan: [C: +2] "backport" [extensions/GrowthExperiments] (wmf/1.38.0-wmf.13) - https://gerrit.wikimedia.org/r/746925 (https://phabricator.wikimedia.org/T267273) (owner: Kosta Harlan)
[08:50:39] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti2008.codfw.wmnet with OS buster
[08:50:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:50:44] <wikibugs>	 SRE, Infrastructure-Foundations: Migrate codfw Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296622 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti2008.codfw.wmnet with OS buster completed: - ganeti2008 (**PASS**)   - Downtimed on Icinga...
[08:54:43] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1163', diff saved to https://phabricator.wikimedia.org/P18198 and previous config saved to /var/cache/conftool/dbconfig/20211214-085443-marostegui.json
[08:54:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:54:47] <moritzm>	 !log failover Ganeti master to ganeti2016 T296622
[08:54:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:54:52] <stashbot>	 T296622: Migrate codfw Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296622
[08:55:38] <wikibugs>	 (CR) Ayounsi: [C: +2] "All are in DNS." [puppet] - https://gerrit.wikimedia.org/r/747050 (https://phabricator.wikimedia.org/T297595) (owner: Ayounsi)
[08:56:08] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host netflow5002.eqsin.wmnet
[08:56:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:57:17] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti2017.codfw.wmnet with OS buster
[08:57:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:57:21] <wikibugs>	 SRE, Infrastructure-Foundations: Migrate codfw Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296622 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti2017.codfw.wmnet with OS buster
[08:57:25] <icinga-wm>	 PROBLEM - ganeti-wconfd running on ganeti2019 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 114 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti
[08:57:45] <moritzm>	 ^ that's expected, icinga fallout of the master failover
[09:00:57] <icinga-wm>	 PROBLEM - Check systemd state on elastic2037 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:04:00] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host netflow1002.eqiad.wmnet
[09:04:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:05:26] <wikibugs>	 (PS1) Ayounsi: Add DHCP for new netflow VMs [puppet] - https://gerrit.wikimedia.org/r/747052 (https://phabricator.wikimedia.org/T297595)
[09:07:00] <wikibugs>	 (CR) Ayounsi: [C: +2] Add DHCP for new netflow VMs [puppet] - https://gerrit.wikimedia.org/r/747052 (https://phabricator.wikimedia.org/T297595) (owner: Ayounsi)
[09:09:15] <icinga-wm>	 PROBLEM - HTTPS Ganeti RAPI codfw on ganeti2019 is CRITICAL: connect to address ganeti01.svc.codfw.wmnet and port 5080: No route to host https://www.mediawiki.org/wiki/Ganeti%23RAPI_daemon
[09:09:48] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1163 (T277354)', diff saved to https://phabricator.wikimedia.org/P18199 and previous config saved to /var/cache/conftool/dbconfig/20211214-090948-marostegui.json
[09:09:49] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on db1140.eqiad.wmnet with reason: Maintenance
[09:09:51] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1140.eqiad.wmnet with reason: Maintenance
[09:09:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:09:53] <stashbot>	 T277354: "chemical" major mime type was never added to production database - https://phabricator.wikimedia.org/T277354
[09:09:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:09:57] <icinga-wm>	 PROBLEM - Check unit status of netbox_ganeti_codfw_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[09:09:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:10:37] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on db1139.eqiad.wmnet with reason: Maintenance
[09:10:39] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1139.eqiad.wmnet with reason: Maintenance
[09:10:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:10:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:10:49] <icinga-wm>	 PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_codfw_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:11:24] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on db1135.eqiad.wmnet with reason: Maintenance
[09:11:24] <wikibugs>	 (CR) jerkins-bot: [V: -1] WelcomeSurvey: Instrument interactions with form [extensions/GrowthExperiments] (wmf/1.38.0-wmf.13) - https://gerrit.wikimedia.org/r/746925 (https://phabricator.wikimedia.org/T267273) (owner: Kosta Harlan)
[09:11:26] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1135.eqiad.wmnet with reason: Maintenance
[09:11:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:11:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:11:30] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1135 (T277354)', diff saved to https://phabricator.wikimedia.org/P18200 and previous config saved to /var/cache/conftool/dbconfig/20211214-091130-marostegui.json
[09:11:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:11:48] <wikibugs>	 (CR) Kosta Harlan: [C: +2] WelcomeSurvey: Instrument interactions with form [extensions/GrowthExperiments] (wmf/1.38.0-wmf.13) - https://gerrit.wikimedia.org/r/746925 (https://phabricator.wikimedia.org/T267273) (owner: Kosta Harlan)
[09:13:15] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135 (T277354)', diff saved to https://phabricator.wikimedia.org/P18201 and previous config saved to /var/cache/conftool/dbconfig/20211214-091315-marostegui.json
[09:13:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:14:27] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] logstash: use logstash-oss for gelf_relay [puppet] - https://gerrit.wikimedia.org/r/746971 (https://phabricator.wikimedia.org/T297468) (owner: Cwhite)
[09:15:32] <wikibugs>	 (CR) Vgutierrez: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32982/console" [puppet] - https://gerrit.wikimedia.org/r/746890 (https://phabricator.wikimedia.org/T271421) (owner: Vgutierrez)
[09:15:34] <logmsgbot>	 !log ryankemper@cumin2001 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) restart without plugin upgrade (3 nodes at a time) for ElasticSearch cluster search_codfw: codfw rolling restart - ryankemper@cumin2001 - T297468
[09:15:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:17:52] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] maps: add stub values for tegola swift credentials [labs/private] - https://gerrit.wikimedia.org/r/746895 (https://phabricator.wikimedia.org/T292700) (owner: Hnowlan)
[09:20:16] <wikibugs>	 (CR) Filippo Giunchedi: maps: write tegola credentials out to file (1 comment) [puppet] - https://gerrit.wikimedia.org/r/746897 (https://phabricator.wikimedia.org/T292700) (owner: Hnowlan)
[09:21:03] <wikibugs>	 (CR) Vgutierrez: [C: +2] cache: Provide a Envoy upload role [puppet] - https://gerrit.wikimedia.org/r/745772 (https://phabricator.wikimedia.org/T271421) (owner: Vgutierrez)
[09:25:16] <wikibugs>	 (CR) Filippo Giunchedi: [V: +1 C: +2] prometheus: pin discovery probes to their site [puppet] - https://gerrit.wikimedia.org/r/746881 (https://phabricator.wikimedia.org/T291946) (owner: Filippo Giunchedi)
[09:27:18] <icinga-wm>	 RECOVERY - Check systemd state on elastic2037 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:28:20] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135', diff saved to https://phabricator.wikimedia.org/P18202 and previous config saved to /var/cache/conftool/dbconfig/20211214-092820-marostegui.json
[09:28:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:31:13] <wikibugs>	 (PS1) Kormat: tox.ini: Fix py3{7,8}-format [software/wmfdb] - https://gerrit.wikimedia.org/r/747054 (https://phabricator.wikimedia.org/T297616)
[09:32:47] <wikibugs>	 (CR) Kormat: [C: +2] tox.ini: Fix py3{7,8}-format [software/wmfdb] - https://gerrit.wikimedia.org/r/747054 (https://phabricator.wikimedia.org/T297616) (owner: Kormat)
[09:33:16] <wikibugs>	 (PS3) Kormat: wmfdb/section: Add class for handling of sections. [software/wmfdb] - https://gerrit.wikimedia.org/r/745249
[09:34:05] <wikibugs>	 (Merged) jenkins-bot: WelcomeSurvey: Instrument interactions with form [extensions/GrowthExperiments] (wmf/1.38.0-wmf.13) - https://gerrit.wikimedia.org/r/746925 (https://phabricator.wikimedia.org/T267273) (owner: Kosta Harlan)
[13:17:56] <wikibugs>	 (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32990/console" [puppet] - https://gerrit.wikimedia.org/r/747108 (owner: Jbond)
[13:18:43] <wikibugs>	 (PS2) Hashar: build: add mypy types [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/747104
[13:19:17] <wikibugs>	 (CR) Jbond: [V: +1 C: +2] P:pki::multirootca: Add addtional port to configuration [puppet] - https://gerrit.wikimedia.org/r/747108 (owner: Jbond)
[13:19:31] <wikibugs>	 (CR) Hashar: "I have added types-requests and types-PyYAML as suggested by Volans :)" [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/747104 (owner: Hashar)
[13:19:44] <wikibugs>	 (CR) Kormat: [C: +2] wmfdb/addr: Add addr.py to handle addresses. [software/wmfdb] - https://gerrit.wikimedia.org/r/745852 (owner: Kormat)
[13:20:55] <wikibugs>	 (Merged) jenkins-bot: wmfdb/addr: Add addr.py to handle addresses. [software/wmfdb] - https://gerrit.wikimedia.org/r/745852 (owner: Kormat)
[13:21:04] <wikibugs>	 (PS1) Muehlenhoff: Add library hint for libsamplerate [puppet] - https://gerrit.wikimedia.org/r/747110
[13:21:57] <wikibugs>	 (PS8) Kormat: wmfdb/cli_admin: Add db_mysql [software/wmfdb] - https://gerrit.wikimedia.org/r/745857 (https://phabricator.wikimedia.org/T297618)
[13:24:18] <wikibugs>	 (PS1) Jgiannelos: kartographer: Enable tegola on jawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/747111 (https://phabricator.wikimedia.org/T280767)
[13:25:52] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311', diff saved to https://phabricator.wikimedia.org/P18227 and previous config saved to /var/cache/conftool/dbconfig/20211214-132551-marostegui.json
[13:25:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:31:12] <wikibugs>	 (PS1) Ladsgroup: Reuse the query result in addCategoryLinks instead of relying on cache [core] (wmf/1.38.0-wmf.12) - https://gerrit.wikimedia.org/r/747068 (https://phabricator.wikimedia.org/T297669)
[13:31:31] <wikibugs>	 (PS1) Ladsgroup: Reuse the query result in addCategoryLinks instead of relying on cache [core] (wmf/1.38.0-wmf.13) - https://gerrit.wikimedia.org/r/747069 (https://phabricator.wikimedia.org/T297669)
[13:32:12] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2025.codfw.wmnet
[13:32:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:32:53] <Amir1>	 jouncebot: nowandnext
[13:32:53] <jouncebot>	 For the next 0 hour(s) and 27 minute(s): Pre MediaWiki train break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211214T1300)
[13:32:54] <jouncebot>	 In 0 hour(s) and 27 minute(s): MediaWiki train - Utc-0+Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211214T1400)
[13:33:04] <wikibugs>	 (CR) Ladsgroup: [C: +2] Reuse the query result in addCategoryLinks instead of relying on cache [core] (wmf/1.38.0-wmf.12) - https://gerrit.wikimedia.org/r/747068 (https://phabricator.wikimedia.org/T297669) (owner: Ladsgroup)
[13:33:08] <wikibugs>	 (CR) Ladsgroup: [C: +2] Reuse the query result in addCategoryLinks instead of relying on cache [core] (wmf/1.38.0-wmf.13) - https://gerrit.wikimedia.org/r/747069 (https://phabricator.wikimedia.org/T297669) (owner: Ladsgroup)
[13:37:40] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2025.codfw.wmnet
[13:37:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:38:13] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] sre.hosts.dhcp: add support for Ganeti hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/747099 (https://phabricator.wikimedia.org/T296832) (owner: 10Volans)
[13:39:26] <wikibugs>	 (03PS1) 10Jcrespo: mediabackup: Add an encryption key to store private file securely [puppet] - 10https://gerrit.wikimedia.org/r/747113 (https://phabricator.wikimedia.org/T262668)
[13:39:50] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "LGTM but unsure of the original issue" [puppet] - 10https://gerrit.wikimedia.org/r/747067 (https://phabricator.wikimedia.org/T263277) (owner: 10Ayounsi)
[13:39:52] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+2] "Pmacct add sflow listener" try #2 [puppet] - 10https://gerrit.wikimedia.org/r/747067 (https://phabricator.wikimedia.org/T263277) (owner: 10Ayounsi)
[13:40:40] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] mediabackup: Add an encryption key to store private file securely [puppet] - 10https://gerrit.wikimedia.org/r/747113 (https://phabricator.wikimedia.org/T262668) (owner: 10Jcrespo)
[13:40:57] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311', diff saved to https://phabricator.wikimedia.org/P18228 and previous config saved to /var/cache/conftool/dbconfig/20211214-134056-marostegui.json
[13:41:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:41:55] <wikibugs>	 (03PS2) 10Jcrespo: mediabackup: Add an encryption key to store private file securely [puppet] - 10https://gerrit.wikimedia.org/r/747113 (https://phabricator.wikimedia.org/T262668)
[13:42:11] <wikibugs>	 (03PS9) 10Kormat: wmfdb/cli_admin: Add db_mysql [software/wmfdb] - 10https://gerrit.wikimedia.org/r/745857 (https://phabricator.wikimedia.org/T297618)
[13:42:30] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] mediabackup: Add an encryption key to store private file securely [puppet] - 10https://gerrit.wikimedia.org/r/747113 (https://phabricator.wikimedia.org/T262668) (owner: 10Jcrespo)
[13:43:03] <icinga-wm>	 PROBLEM - k8s API server requests latencies on ml-serve-ctrl1002 is CRITICAL: instance=10.64.48.64 verb=POST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[13:44:55] <icinga-wm>	 RECOVERY - k8s API server requests latencies on ml-serve-ctrl1002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[13:46:23] <wikibugs>	 (03PS3) 10Jcrespo: mediabackup: Add an encryption key to store private file securely [puppet] - 10https://gerrit.wikimedia.org/r/747113 (https://phabricator.wikimedia.org/T262668)
[13:47:32] <Lucas_WMDE>	 jouncebot: nowandnext
[13:47:32] <jouncebot>	 For the next 0 hour(s) and 12 minute(s): Pre MediaWiki train break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211214T1300)
[13:47:32] <jouncebot>	 In 0 hour(s) and 12 minute(s): MediaWiki train - Utc-0+Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211214T1400)
[13:47:40] <Lucas_WMDE>	 aha
[13:48:02] <wikibugs>	 (03PS1) 10Jbond: WIP: add reposync [software/spicerack] - 10https://gerrit.wikimedia.org/r/747116
[13:51:05] <vgutierrez>	 !log depool cp4025 to be reimaged as cache::upload_envoy - T271421
[13:51:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:51:10] <stashbot>	 T271421: Test envoyproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T271421
[13:51:44] <wikibugs>	 (03CR) 10Jcrespo: "I am thinking of installing age, but that is not available on buster only starting in bullseye: https://packages.debian.org/search?keyword" [puppet] - 10https://gerrit.wikimedia.org/r/747113 (https://phabricator.wikimedia.org/T262668) (owner: 10Jcrespo)
[13:52:26] <wikibugs>	 (03CR) 10Volans: [C: 03+2] sre.hosts.dhcp: add support for Ganeti hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/747099 (https://phabricator.wikimedia.org/T296832) (owner: 10Volans)
[13:52:35] <wikibugs>	 (03PS3) 10Vgutierrez: site: Reimage cp4025 as cache::upload_envoy [puppet] - 10https://gerrit.wikimedia.org/r/746891 (https://phabricator.wikimedia.org/T271421)
[13:53:43] <wikibugs>	 (03Merged) 10jenkins-bot: Reuse the query result in addCategoryLinks instead of relying on cache [core] (wmf/1.38.0-wmf.12) - 10https://gerrit.wikimedia.org/r/747068 (https://phabricator.wikimedia.org/T297669) (owner: 10Ladsgroup)
[13:54:03] <wikibugs>	 (03PS4) 10Jcrespo: mediabackup: Add an encryption key to store private files securely [puppet] - 10https://gerrit.wikimedia.org/r/747113 (https://phabricator.wikimedia.org/T262668)
[13:54:18] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] WIP: add reposync [software/spicerack] - 10https://gerrit.wikimedia.org/r/747116 (owner: 10Jbond)
[13:54:53] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Mail, 10Znuny, 10serviceops: OTRS/mail: investigate why "T=remote_smtp_signed: all hosts for 'ticket.wikimedia.org' have been failing for a long time" - https://phabricator.wikimedia.org/T297160 (10akosiaris) p:05Triage→03Low Code found. https://github.com/znuny...
[13:55:19] <wikibugs>	 (03Merged) 10jenkins-bot: sre.hosts.dhcp: add support for Ganeti hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/747099 (https://phabricator.wikimedia.org/T296832) (owner: 10Volans)
[13:55:37] <Lucas_WMDE>	 !log Deployed patch for T297570
[13:55:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:56:02] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311 (T277354)', diff saved to https://phabricator.wikimedia.org/P18229 and previous config saved to /var/cache/conftool/dbconfig/20211214-135601-marostegui.json
[13:56:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:56:07] <stashbot>	 T277354: "chemical" major mime type was never added to production database - https://phabricator.wikimedia.org/T277354
[13:56:17] <wikibugs>	 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (Kanban): connect 2nd cloudcontrol200x-dev NIC to vlan 2105 - https://phabricator.wikimedia.org/T297588 (10Papaul) @aborrero  are we doing trunk so i can assign this task to netops?
[13:57:02] <wikibugs>	 (03Merged) 10jenkins-bot: Reuse the query result in addCategoryLinks instead of relying on cache [core] (wmf/1.38.0-wmf.13) - 10https://gerrit.wikimedia.org/r/747069 (https://phabricator.wikimedia.org/T297669) (owner: 10Ladsgroup)
[13:57:55] <Lucas_WMDE>	 (I’m done)
[13:58:08] <wikibugs>	 (03PS10) 10Kormat: wmfdb/cli_admin: Add db_mysql [software/wmfdb] - 10https://gerrit.wikimedia.org/r/745857 (https://phabricator.wikimedia.org/T297618)
[13:59:50] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[13:59:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:00:04] <jouncebot>	 hashar and dancy: How many deployers does it take to do MediaWiki train - Utc-0+Utc-7 Version deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211214T1400).
[14:00:32] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+2] site: Reimage cp4025 as cache::upload_envoy [puppet] - 10https://gerrit.wikimedia.org/r/746891 (https://phabricator.wikimedia.org/T271421) (owner: 10Vgutierrez)
[14:01:01] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[14:01:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:02:37] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Persistence-Backup: (Need By: TBD) rack/setup/install backup2008 - https://phabricator.wikimedia.org/T294973 (10Papaul)
[14:02:48] <logmsgbot>	 !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reimage for host cp4025.ulsfo.wmnet with OS buster
[14:02:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:02:56] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Test envoyproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T271421 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vgutierrez@cumin1001 for host cp4025.ulsfo.wmnet with OS buster
[14:09:12] <wikibugs>	 10SRE, 10DBA, 10Sustainability (Incident Followup): Improve automatic query killer under high load - https://phabricator.wikimedia.org/T293532 (10Marostegui) p:05Triage→03Medium
[14:10:38] <wikibugs>	 10ops-eqiad, 10DC-Ops, 10Graphite: Upgrade firmware on graphite1004 if upgrade available. - https://phabricator.wikimedia.org/T297433 (10Marostegui)
[14:12:25] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[14:12:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:12:50] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Grant Access to <wmf group> for <Elena Lappen> - https://phabricator.wikimedia.org/T297652 (10Marostegui) p:05Triage→03Medium
[14:13:20] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[14:13:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:14:49] <icinga-wm>	 PROBLEM - Maps tiles generation on alert1001 is CRITICAL: CRITICAL: 90.28% of data under the critical threshold [5.0] https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1
[14:15:57] <wikibugs>	 10SRE, 10Data-Engineering, 10Infrastructure-Foundations, 10Traffic-Icebox, 10netops: Collect netflow data for internal traffic - https://phabricator.wikimedia.org/T263277 (10ayounsi)
[14:16:02] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Upgrade netflow VMs to Bullseye - https://phabricator.wikimedia.org/T297595 (10ayounsi) 05Open→03Resolved a:03ayounsi All done!
[14:16:30] <wikibugs>	 (03CR) 10Kormat: [C: 03+2] wmfdb/cli_admin: Add db_mysql [software/wmfdb] - 10https://gerrit.wikimedia.org/r/745857 (https://phabricator.wikimedia.org/T297618) (owner: 10Kormat)
[14:16:31] <logmsgbot>	 !log ladsgroup@deploy1002 Synchronized php-1.38.0-wmf.12/includes/OutputPage.php: Backport: [[gerrit:747068|Reuse the query result in addCategoryLinks instead of relying on cache (T297669)]] (duration: 00m 57s)
[14:16:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:16:37] <stashbot>	 T297669: Noticeable increase in db load after wmf.12 roll out - https://phabricator.wikimedia.org/T297669
[14:17:11] <wikibugs>	 (03CR) 10Klausman: [C: 03+1] helmfile.d: add the istio pod security policy [deployment-charts] - 10https://gerrit.wikimedia.org/r/746880 (https://phabricator.wikimedia.org/T297612) (owner: 10Elukey)
[14:17:46] <wikibugs>	 (03Merged) 10jenkins-bot: wmfdb/cli_admin: Add db_mysql [software/wmfdb] - 10https://gerrit.wikimedia.org/r/745857 (https://phabricator.wikimedia.org/T297618) (owner: 10Kormat)
[14:19:07] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Mail, 10observability, 10Sustainability (Incident Followup): large MX queues should page - https://phabricator.wikimedia.org/T297144 (10Marostegui) p:05Triage→03Medium
[14:19:37] <wikibugs>	 10SRE, 10Data-Engineering, 10Infrastructure-Foundations, 10Traffic-Icebox, 10netops: Collect netflow data for internal traffic - https://phabricator.wikimedia.org/T263277 (10ayounsi) Tests are successful:  I tested it by configuring sflow on the non-yet-prod asw1-b12-drmrs switch: `lang=diff [edit protoc...
[14:20:12] <wikibugs>	 10SRE, 10ops-codfw, 10serviceops: Installation issues on PowerEdge R440 Kafka main codfw servers with buster / firmware update needed - https://phabricator.wikimedia.org/T297422 (10Marostegui) p:05Triage→03Medium
[14:20:43] <icinga-wm>	 RECOVERY - cassandra-b CQL 10.192.48.171:9042 on restbase2026 is OK: TCP OK - 0.033 second response time on 10.192.48.171 port 9042 https://phabricator.wikimedia.org/T93886
[14:22:13] <icinga-wm>	 RECOVERY - cassandra-c service on restbase2026 is OK: OK - cassandra-c is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[14:22:19] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Persistence-Backup: (Need By: TBD) rack/setup/install backup2008 - https://phabricator.wikimedia.org/T294973 (10jcrespo)
[14:22:21] <icinga-wm>	 RECOVERY - cassandra-c SSL 10.192.48.172:7001 on restbase2026 is OK: SSL OK - Certificate restbase2026-c valid until 2023-12-09 16:37:44 +0000 (expires in 725 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[14:24:20] <wikibugs>	 (03PS1) 10MVernon: admin: add elapps to ldap_only_users (T297652) [puppet] - 10https://gerrit.wikimedia.org/r/747120
[14:26:00] <wikibugs>	 (03PS1) 10Jcrespo: install_server: Add backup1008/backup2008 to partman [puppet] - 10https://gerrit.wikimedia.org/r/747123 (https://phabricator.wikimedia.org/T294973)
[14:26:49] <wikibugs>	 (03PS2) 10Jcrespo: install_server: Add backup1008/backup2008 to partman [puppet] - 10https://gerrit.wikimedia.org/r/747123 (https://phabricator.wikimedia.org/T294973)
[14:28:00] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+2] install_server: Add backup1008/backup2008 to partman [puppet] - 10https://gerrit.wikimedia.org/r/747123 (https://phabricator.wikimedia.org/T294973) (owner: 10Jcrespo)
[14:28:41] <wikibugs>	 (03PS1) 10JMeybohm: cert-manager: Allow ingress to webhook from k8s master and nodes [deployment-charts] - 10https://gerrit.wikimedia.org/r/747124 (https://phabricator.wikimedia.org/T294560)
[14:29:36] <wikibugs>	 (03CR) 10Marostegui: "The change itself looks good. The user matches the ldap one." [puppet] - 10https://gerrit.wikimedia.org/r/747120 (owner: 10MVernon)
[14:30:49] <wikibugs>	 (03PS2) 10MVernon: admin: add elapps to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/747120 (https://phabricator.wikimedia.org/T297652)
[14:31:31] <wikibugs>	 10SRE, 10serviceops, 10Wikimedia-production-error: wtp* hosts: Out of memory (allocated 39845888) (tried to allocate 131072 bytes) in OutputHandler.php - https://phabricator.wikimedia.org/T297517 (10hashar) The issue appeared with wmf.12 which is fully deployed now and it does not seem we will roll it back....
[14:32:14] <icinga-wm>	 RECOVERY - Check systemd state on aqs1011 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:33:08] <wikibugs>	 (03PS1) 10Vgutierrez: role::cache: Add missing upload_envoy.yaml [puppet] - 10https://gerrit.wikimedia.org/r/747127 (https://phabricator.wikimedia.org/T271421)
[14:33:47] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+2] role::cache: Add missing upload_envoy.yaml [puppet] - 10https://gerrit.wikimedia.org/r/747127 (https://phabricator.wikimedia.org/T271421) (owner: 10Vgutierrez)
[14:34:59] <wikibugs>	 (03CR) 10MVernon: admin: add elapps to ldap_only_users (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/747120 (https://phabricator.wikimedia.org/T297652) (owner: 10MVernon)
[14:37:45] <wikibugs>	 (03CR) 10Marostegui: [C: 03+1] admin: add elapps to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/747120 (https://phabricator.wikimedia.org/T297652) (owner: 10MVernon)
[14:38:23] <wikibugs>	 (03CR) 10MVernon: [C: 03+2] admin: add elapps to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/747120 (https://phabricator.wikimedia.org/T297652) (owner: 10MVernon)
[14:38:40] <icinga-wm>	 RECOVERY - cassandra-b CQL 10.64.16.206:9042 on aqs1011 is OK: TCP OK - 0.000 second response time on 10.64.16.206 port 9042 https://phabricator.wikimedia.org/T93886
[14:40:00] <wikibugs>	 10SRE, 10serviceops, 10Wikimedia-production-error: wtp* hosts: Out of memory (allocated 39845888) (tried to allocate 131072 bytes) in OutputHandler.php - https://phabricator.wikimedia.org/T297517 (10ssastry) >>! In T297517#7569567, @hashar wrote: > The issue appeared with wmf.12 which is fully deployed now a...
[14:40:48] <icinga-wm>	 RECOVERY - cassandra-b service on aqs1011 is OK: OK - cassandra-b is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[14:42:09] <wikibugs>	 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Grant Access to <wmf group> for <Elena Lappen> - https://phabricator.wikimedia.org/T297652 (10MatthewVernon) 05Open→03Resolved a:03MatthewVernon Hi, This is now done. Thanks, Matthew
[14:43:12] <wikibugs>	 (03PS1) 10Herron: mx: make exim queue alert paging [puppet] - 10https://gerrit.wikimedia.org/r/747128 (https://phabricator.wikimedia.org/T297144)
[14:47:28] <logmsgbot>	 !log bblack@cumin1001 START - Cookbook sre.hosts.reimage for host lvs1018.eqiad.wmnet with OS buster
[14:47:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:47:33] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic-Icebox: Q2:(Need By: ASAP) rack/setup/install lvs10[17-20] - https://phabricator.wikimedia.org/T295804 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bblack@cumin1001 for host lvs1018.eqiad.wmnet with OS buster
[14:47:40] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=envoy site=ulsfo https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[14:49:54] <logmsgbot>	 !log bblack@cumin1001 START - Cookbook sre.hosts.reimage for host lvs1019.eqiad.wmnet with OS buster
[14:49:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:49:58] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic-Icebox: Q2:(Need By: ASAP) rack/setup/install lvs10[17-20] - https://phabricator.wikimedia.org/T295804 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bblack@cumin1001 for host lvs1019.eqiad.wmnet with OS buster
[14:50:49] <wikibugs>	 (03CR) 10Hnowlan: [V: 03+2 C: 03+2] maps: add stub values for tegola swift credentials [labs/private] - 10https://gerrit.wikimedia.org/r/746895 (https://phabricator.wikimedia.org/T292700) (owner: 10Hnowlan)
[14:50:55] <hashar>	 I am going to sync mediawiki wmf.13 code to the cluster but without promoting any wikis to it
[14:50:58] <hashar>	 cause of some blockers
[14:51:06] <hashar>	 but at least the code will be around
[14:52:10] <logmsgbot>	 !log hashar@deploy1002 Started scap: Push wmf.13 without promoting any wikis
[14:52:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:52:54] <wikibugs>	 10SRE, 10serviceops, 10Wikimedia-production-error: wtp* hosts: Out of memory (allocated 39845888) (tried to allocate 131072 bytes) in OutputHandler.php - https://phabricator.wikimedia.org/T297517 (10Ladsgroup) So I have been working on this on several fronts (with Daniel and Tim). The [[https://gerrit.wikime...
[14:56:06] <wikibugs>	 (03PS1) 10Vgutierrez: cache::envoy: Fix ocsp systemd config file content [puppet] - 10https://gerrit.wikimedia.org/r/747130 (https://phabricator.wikimedia.org/T271421)
[14:57:33] <wikibugs>	 (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32991/console" [puppet] - 10https://gerrit.wikimedia.org/r/747130 (https://phabricator.wikimedia.org/T271421) (owner: 10Vgutierrez)
[14:58:13] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[14:59:01] <wikibugs>	 (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] cache::envoy: Fix ocsp systemd config file content [puppet] - 10https://gerrit.wikimedia.org/r/747130 (https://phabricator.wikimedia.org/T271421) (owner: 10Vgutierrez)
[14:59:54] <vgutierrez>	 hnowlan: you got a commit pending to be merged 
[15:01:42] <hnowlan>	 vgutierrez: oops, on labs-private? that's safe to merge 
[15:01:49] <vgutierrez>	 indeed
[15:01:55] <hnowlan>	 sorry about that
[15:02:03] <vgutierrez>	 merging
[15:02:05] <vgutierrez>	 (done)
[15:04:20] <wikibugs>	 (03PS1) 10Kormat: wmfdb/addr: If _dc_map doesn't find a dc ID, use DNS. [software/wmfdb] - 10https://gerrit.wikimedia.org/r/747132
[15:06:59] <icinga-wm>	 PROBLEM - aqs endpoints health on aqs1012 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) is CRITICAL: Test Get per article page views returned the unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/pageviews/top/{project}/{access}/{year}/{month}/{day} (Get top page views) is CRITICAL: Test Get top page views retur
[15:06:59] <icinga-wm>	 unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/mediarequests/per-file/{referer}/{agent}/{file_path}/{granularity}/{start}/{end} (Get per file requests) is CRITICAL: Test Get per file requests returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[15:07:17] <icinga-wm>	 PROBLEM - aqs endpoints health on aqs1015 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) is CRITICAL: Test Get per article page views returned the unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/pageviews/top/{project}/{access}/{year}/{month}/{day} (Get top page views) is CRITICAL: Test Get top page views retur
[15:07:18] <icinga-wm>	 unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/mediarequests/per-file/{referer}/{agent}/{file_path}/{granularity}/{start}/{end} (Get per file requests) is CRITICAL: Test Get per file requests returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[15:07:23] <wikibugs>	 (03CR) 10Kormat: [C: 03+2] wmfdb/addr: If _dc_map doesn't find a dc ID, use DNS. [software/wmfdb] - 10https://gerrit.wikimedia.org/r/747132 (owner: 10Kormat)
[15:08:17] <icinga-wm>	 PROBLEM - Apache HTTP on wtp1046 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[15:08:34] <wikibugs>	 (03Merged) 10jenkins-bot: wmfdb/addr: If _dc_map doesn't find a dc ID, use DNS. [software/wmfdb] - 10https://gerrit.wikimedia.org/r/747132 (owner: 10Kormat)
[15:08:48] <elukey>	 the aqs endpoint alerts are from the new cluster currently being worked on by Data Engineering (no user traffic)
[15:09:05] <icinga-wm>	 RECOVERY - Ensure hosts are not performing a change on every puppet run on cumin2002 is OK: OK: all nodes running as expected https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes
[15:09:27] <icinga-wm>	 PROBLEM - Apache HTTP on wtp1031 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[15:09:31] <logmsgbot>	 !log vgutierrez@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=1) for host cp4025.ulsfo.wmnet with OS buster
[15:09:33] <icinga-wm>	 PROBLEM - Apache HTTP on wtp1032 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[15:09:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:09:39] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Test envoyproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T271421 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vgutierrez@cumin1001 for host cp4025.ulsfo.wmnet with OS buster completed: - cp4025 (**FAIL*...
[15:09:41] <icinga-wm>	 PROBLEM - aqs endpoints health on aqs1011 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) is CRITICAL: Test Get per article page views returned the unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/pageviews/top/{project}/{access}/{year}/{month}/{day} (Get top page views) is CRITICAL: Test Get top page views retur
[15:09:41] <icinga-wm>	 unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/mediarequests/per-file/{referer}/{agent}/{file_path}/{granularity}/{start}/{end} (Get per file requests) is CRITICAL: Test Get per file requests returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[15:09:43] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Test envoyproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T271421 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vgutierrez@cumin1001 for host cp4025.ulsfo.wmnet with OS buster executed with errors: - cp40...
[15:09:56] <btullis>	 These aqs alerts are to do with me. 
[15:10:23] <icinga-wm>	 RECOVERY - Apache HTTP on wtp1031 is OK: HTTP OK: HTTP/1.1 302 Found - 635 bytes in 0.046 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[15:10:25] <icinga-wm>	 RECOVERY - Apache HTTP on wtp1046 is OK: HTTP OK: HTTP/1.1 302 Found - 635 bytes in 0.039 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[15:10:31] <icinga-wm>	 RECOVERY - Apache HTTP on wtp1032 is OK: HTTP OK: HTTP/1.1 302 Found - 635 bytes in 0.055 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[15:10:41] <icinga-wm>	 RECOVERY - aqs endpoints health on aqs1015 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[15:10:49] <icinga-wm>	 RECOVERY - aqs endpoints health on aqs1011 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[15:11:33] <icinga-wm>	 RECOVERY - aqs endpoints health on aqs1012 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[15:12:07] <wikibugs>	 (03CR) 10MSantos: [C: 03+1] kartographer: Enable tegola on jawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/747111 (https://phabricator.wikimedia.org/T280767) (owner: 10Jgiannelos)
[15:13:58] <logmsgbot>	 !log bblack@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host lvs1018.eqiad.wmnet with OS buster
[15:14:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:14:03] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic-Icebox: Q2:(Need By: ASAP) rack/setup/install lvs10[17-20] - https://phabricator.wikimedia.org/T295804 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bblack@cumin1001 for host lvs1018.eqiad.wmnet with OS buster completed: - lvs1018 (**PASS**)...
[15:15:21] <logmsgbot>	 !log bblack@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host lvs1019.eqiad.wmnet with OS buster
[15:15:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:15:26] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic-Icebox: Q2:(Need By: ASAP) rack/setup/install lvs10[17-20] - https://phabricator.wikimedia.org/T295804 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bblack@cumin1001 for host lvs1019.eqiad.wmnet with OS buster completed: - lvs1019 (**PASS**)...
[15:21:06] <wikibugs>	 (03PS1) 10Ladsgroup: cache: Add four fields to LinkCache::getSelectFields [core] (wmf/1.38.0-wmf.13) - 10https://gerrit.wikimedia.org/r/747072 (https://phabricator.wikimedia.org/T297669)
[15:21:41] <logmsgbot>	 !log hashar@deploy1002 Finished scap: Push wmf.13 without promoting any wikis (duration: 29m 31s)
[15:21:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:22:28] <wikibugs>	 (03PS1) 10Ladsgroup: cache: Add four fields to LinkCache::getSelectFields [core] (wmf/1.38.0-wmf.12) - 10https://gerrit.wikimedia.org/r/747073 (https://phabricator.wikimedia.org/T297669)
[15:22:32] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+2] cache: Add four fields to LinkCache::getSelectFields [core] (wmf/1.38.0-wmf.13) - 10https://gerrit.wikimedia.org/r/747072 (https://phabricator.wikimedia.org/T297669) (owner: 10Ladsgroup)
[15:22:35] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+2] cache: Add four fields to LinkCache::getSelectFields [core] (wmf/1.38.0-wmf.12) - 10https://gerrit.wikimedia.org/r/747073 (https://phabricator.wikimedia.org/T297669) (owner: 10Ladsgroup)
[15:25:08] <wikibugs>	 (03PS2) 10Jelto: Rakefile: remove helm2 from Rakefile, bump scaffold to v2 api [deployment-charts] - 10https://gerrit.wikimedia.org/r/746864 (https://phabricator.wikimedia.org/T251305)
[15:25:34] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Rakefile: remove helm2 from Rakefile, bump scaffold to v2 api [deployment-charts] - 10https://gerrit.wikimedia.org/r/746864 (https://phabricator.wikimedia.org/T251305) (owner: 10Jelto)
[15:28:45] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+1] imagecatalog: Install and configure OCI image catalog on deploy hosts [puppet] - 10https://gerrit.wikimedia.org/r/742574 (https://phabricator.wikimedia.org/T287130) (owner: 10RLazarus)
[15:30:38] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+1] "Nice, but you should add the user to all clusters." [puppet] - 10https://gerrit.wikimedia.org/r/745202 (https://phabricator.wikimedia.org/T287130) (owner: 10JMeybohm)
[15:31:33] <wikibugs>	 10SRE, 10Prod-Kubernetes, 10Kubernetes: Helm chart dependencies no longer in requirements.yaml - https://phabricator.wikimedia.org/T295750 (10MatthewVernon)
[15:31:34] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+2] mediawiki: add the ability to inject php files for debugging [deployment-charts] - 10https://gerrit.wikimedia.org/r/747101 (https://phabricator.wikimedia.org/T297613) (owner: 10Giuseppe Lavagetto)
[15:31:55] <wikibugs>	 10SRE-tools, 10Infrastructure-Foundations, 10netops, 10serviceops: Support services VIPs with not marked as VIP in Netbox - https://phabricator.wikimedia.org/T295793 (10MatthewVernon)
[15:35:13] <wikibugs>	 (03Merged) 10jenkins-bot: mediawiki: add the ability to inject php files for debugging [deployment-charts] - 10https://gerrit.wikimedia.org/r/747101 (https://phabricator.wikimedia.org/T297613) (owner: 10Giuseppe Lavagetto)
[15:35:42] <wikibugs>	 (03PS1) 10Jgiannelos: tegola-vector-tiles: Disable pregeneration on eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/747136 (https://phabricator.wikimedia.org/T280767)
[15:39:35] <wikibugs>	 (03PS1) 10Vgutierrez: cache::envoy: Strip [] from X-Client-IP [puppet] - 10https://gerrit.wikimedia.org/r/747137 (https://phabricator.wikimedia.org/T271421)
[15:41:46] <wikibugs>	 10Puppet, 10Infrastructure-Foundations: Role hieradata for non-existent roles - https://phabricator.wikimedia.org/T296533 (10MatthewVernon)
[15:42:12] <logmsgbot>	 !log oblivian@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[15:42:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:43:26] <wikibugs>	 10SRE, 10DBA, 10Platform Engineering, 10Sustainability (Incident Followup): Set max execution time for several expensive mediawiki actions - https://phabricator.wikimedia.org/T297708 (10Ladsgroup)
[15:43:32] <wikibugs>	 (03Merged) 10jenkins-bot: cache: Add four fields to LinkCache::getSelectFields [core] (wmf/1.38.0-wmf.13) - 10https://gerrit.wikimedia.org/r/747072 (https://phabricator.wikimedia.org/T297669) (owner: 10Ladsgroup)
[15:44:04] <wikibugs>	 (03Merged) 10jenkins-bot: cache: Add four fields to LinkCache::getSelectFields [core] (wmf/1.38.0-wmf.12) - 10https://gerrit.wikimedia.org/r/747073 (https://phabricator.wikimedia.org/T297669) (owner: 10Ladsgroup)
[15:44:18] <wikibugs>	 (03PS1) 10Filippo Giunchedi: prometheus: add mini-textfile-exporter [puppet] - 10https://gerrit.wikimedia.org/r/747139 (https://phabricator.wikimedia.org/T291946)
[15:44:20] <wikibugs>	 (03PS1) 10Filippo Giunchedi: prometheus: export service catalog metrics [puppet] - 10https://gerrit.wikimedia.org/r/747140 (https://phabricator.wikimedia.org/T291946)
[15:45:36] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] prometheus: export service catalog metrics [puppet] - 10https://gerrit.wikimedia.org/r/747140 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi)
[15:46:09] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] prometheus: add mini-textfile-exporter [puppet] - 10https://gerrit.wikimedia.org/r/747139 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi)
[15:46:16] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+2] cert-manager: Allow ingress to webhook from k8s master and nodes [deployment-charts] - 10https://gerrit.wikimedia.org/r/747124 (https://phabricator.wikimedia.org/T294560) (owner: 10JMeybohm)
[15:47:57] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on ganeti2018.codfw.wmnet with reason: Temporarily remove node from Ganeti for reimage
[15:48:00] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on ganeti2018.codfw.wmnet with reason: Temporarily remove node from Ganeti for reimage
[15:48:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:48:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:48:49] <wikibugs>	 (03PS6) 10Elukey: knative-serving: add support for istio egress gateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/745555
[15:49:00] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host aqs1010.eqiad.wmnet
[15:49:02] <logmsgbot>	 !log btullis@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host aqs1010.eqiad.wmnet
[15:49:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:49:03] <wikibugs>	 10SRE, 10ops-codfw: Installation issues on PowerEdge R440 Ganeti servers with buster / firmware update needed - https://phabricator.wikimedia.org/T296856 (10MoritzMuehlenhoff)
[15:49:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:49:10] <wikibugs>	 10SRE, 10ops-codfw: Installation issues on PowerEdge R440 Ganeti servers with buster / firmware update needed - https://phabricator.wikimedia.org/T296856 (10MoritzMuehlenhoff) One more; ganeti2018. Ready to be powered off any time.
[15:49:43] <wikibugs>	 (03Merged) 10jenkins-bot: cert-manager: Allow ingress to webhook from k8s master and nodes [deployment-charts] - 10https://gerrit.wikimedia.org/r/747124 (https://phabricator.wikimedia.org/T294560) (owner: 10JMeybohm)
[15:50:27] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[15:50:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:50:58] <moritzm>	 !log drain primary/secondary instances off ganeti2023 T296622
[15:51:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:51:03] <stashbot>	 T296622: Migrate codfw Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296622
[15:51:35] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[15:51:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:52:07] <wikibugs>	 10SRE, 10SRE-OnFire, 10Wikimedia-Incident: Incident: 2021-12-03 mx2001->Gmail delivery issues - https://phabricator.wikimedia.org/T297127 (10herron)
[15:52:40] <wikibugs>	 (03PS2) 10Filippo Giunchedi: prometheus: don't probe services not deployed in the current site [puppet] - 10https://gerrit.wikimedia.org/r/747055 (https://phabricator.wikimedia.org/T291946)
[15:52:42] <wikibugs>	 (03PS2) 10Filippo Giunchedi: prometheus: add mini-textfile-exporter [puppet] - 10https://gerrit.wikimedia.org/r/747139 (https://phabricator.wikimedia.org/T291946)
[15:52:44] <wikibugs>	 (03PS2) 10Filippo Giunchedi: prometheus: export service catalog metrics [puppet] - 10https://gerrit.wikimedia.org/r/747140 (https://phabricator.wikimedia.org/T291946)
[15:53:01] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host aqs1010.eqiad.wmnet
[15:53:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:54:00] <logmsgbot>	 !log ladsgroup@deploy1002 Synchronized php-1.38.0-wmf.12/includes/cache/LinkCache.php: Backport: [[gerrit:747073|cache: Add four fields to LinkCache::getSelectFields (T297669)]] (duration: 00m 57s)
[15:54:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:54:05] <stashbot>	 T297669: Noticeable increase in db load after wmf.12 roll out - https://phabricator.wikimedia.org/T297669
[15:54:27] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] prometheus: export service catalog metrics [puppet] - 10https://gerrit.wikimedia.org/r/747140 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi)
[15:56:55] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "LGTM some minor nits (and will also need to copy the package)" [puppet] - 10https://gerrit.wikimedia.org/r/747113 (https://phabricator.wikimedia.org/T262668) (owner: 10Jcrespo)
[15:58:55] <wikibugs>	 (03CR) 10Filippo Giunchedi: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/747140 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi)
[15:58:58] <wikibugs>	 10SRE, 10serviceops, 10Wikimedia-production-error: wtp* hosts: Out of memory (allocated 39845888) (tried to allocate 131072 bytes) in OutputHandler.php - https://phabricator.wikimedia.org/T297517 (10Ladsgroup) I need to go to a meeting but after that, I'll run a rolling restart
[15:59:04] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host aqs1010.eqiad.wmnet
[15:59:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:59:14] <wikibugs>	 (03PS7) 10Elukey: knative-serving: add support for istio egress gateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/745555
[16:00:02] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host aqs1011.eqiad.wmnet
[16:00:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:00:06] <wikibugs>	 (03CR) 10Cwhite: [C: 03+1] "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/747048 (owner: 10Muehlenhoff)
[16:00:59] <wikibugs>	 (03PS8) 10Elukey: knative-serving: add support for istio egress gateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/745555
[16:01:03] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: don't probe services not deployed in the current site [puppet] - 10https://gerrit.wikimedia.org/r/747055 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi)
[16:02:35] <wikibugs>	 (03CR) 10Filippo Giunchedi: "The idea here is to be able to 'target' only production services (e.g. for paging purposes) with an expression like the following:" [puppet] - 10https://gerrit.wikimedia.org/r/747140 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi)
[16:02:56] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] mx: make exim queue alert paging [puppet] - 10https://gerrit.wikimedia.org/r/747128 (https://phabricator.wikimedia.org/T297144) (owner: 10Herron)
[16:04:00] <wikibugs>	 (03CR) 10Dzahn: "It seems uncontroversial that we want it to page. Just the actual threshold was "yet to be determined" per the original comment. +0.5" [puppet] - 10https://gerrit.wikimedia.org/r/747128 (https://phabricator.wikimedia.org/T297144) (owner: 10Herron)
[16:05:23] <wikibugs>	 (03PS1) 10Vgutierrez: envoyproxy: Allow disabling x-request-id generation [puppet] - 10https://gerrit.wikimedia.org/r/747150 (https://phabricator.wikimedia.org/T271421)
[16:05:25] <wikibugs>	 (03PS1) 10Vgutierrez: cache::envoy: Disable x-request-id generation [puppet] - 10https://gerrit.wikimedia.org/r/747151 (https://phabricator.wikimedia.org/T271421)
[16:08:04] <wikibugs>	 (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32992/console" [puppet] - 10https://gerrit.wikimedia.org/r/747150 (https://phabricator.wikimedia.org/T271421) (owner: 10Vgutierrez)
[16:08:14] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+2] cache::envoy: Strip [] from X-Client-IP [puppet] - 10https://gerrit.wikimedia.org/r/747137 (https://phabricator.wikimedia.org/T271421) (owner: 10Vgutierrez)
[16:08:20] <wikibugs>	 (03CR) 10Jcrespo: "Thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/747113 (https://phabricator.wikimedia.org/T262668) (owner: 10Jcrespo)
[16:09:41] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=PATCH https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[16:10:15] <dancy>	 jouncebot now
[16:10:15] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 49 minute(s)
[16:11:13] <icinga-wm>	 PROBLEM - Check systemd state on wtp1034 is CRITICAL: CRITICAL - degraded: The following units failed: cadvisor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:11:45] <dancy>	 OOM errors hitting wtp* hosts again (T297517)
[16:11:45] <stashbot>	 T297517: wtp* hosts: Out of memory (allocated 39845888) (tried to allocate 131072 bytes) in OutputHandler.php - https://phabricator.wikimedia.org/T297517
[16:11:51] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubestagemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[16:12:27] <wikibugs>	 (03PS1) 10Elukey: helmfile.d: Add Istio Egress config for ml-serve clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/747153 (https://phabricator.wikimedia.org/T294414)
[16:15:10] <wikibugs>	 (03PS1) 10Volans: WIP [software/spicerack] - 10https://gerrit.wikimedia.org/r/747155
[16:15:47] <wikibugs>	 (03PS1) 10Elukey: helmfile.d: Configure all ml-services to use the Istio egress gw [deployment-charts] - 10https://gerrit.wikimedia.org/r/747156 (https://phabricator.wikimedia.org/T294414)
[16:17:03] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+1] "As i said, this is for fairness of tests rather than actually a desirable result - generating request IDs should happen somewhere at the e" [puppet] - 10https://gerrit.wikimedia.org/r/747150 (https://phabricator.wikimedia.org/T271421) (owner: 10Vgutierrez)
[16:17:42] <wikibugs>	 (03PS2) 10Volans: remote: wait_reboot_since, intercept bad uptimes [software/spicerack] - 10https://gerrit.wikimedia.org/r/747155
[16:19:33] <wikibugs>	 (03PS1) 10Volans: spicerack.redfish: add support for Redfish API [software/spicerack] - 10https://gerrit.wikimedia.org/r/747157 (https://phabricator.wikimedia.org/T271583)
[16:19:45] <icinga-wm>	 RECOVERY - Check systemd state on wtp1034 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:19:49] <wikibugs>	 (03PS5) 10Jcrespo: mediabackup: Add an encryption key to store private files securely [puppet] - 10https://gerrit.wikimedia.org/r/747113 (https://phabricator.wikimedia.org/T262668)
[16:19:58] <wikibugs>	 (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] envoyproxy: Allow disabling x-request-id generation [puppet] - 10https://gerrit.wikimedia.org/r/747150 (https://phabricator.wikimedia.org/T271421) (owner: 10Vgutierrez)
[16:20:03] <wikibugs>	 (03CR) 10Jcrespo: "done" [puppet] - 10https://gerrit.wikimedia.org/r/747113 (https://phabricator.wikimedia.org/T262668) (owner: 10Jcrespo)
[16:20:39] <icinga-wm>	 PROBLEM - cassandra-a CQL 10.64.16.204:9042 on aqs1011 is CRITICAL: connect to address 10.64.16.204 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886
[16:20:50] <logmsgbot>	 !log accraze@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' .
[16:20:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:21:05] <icinga-wm>	 PROBLEM - cassandra-b CQL 10.64.16.206:9042 on aqs1011 is CRITICAL: connect to address 10.64.16.206 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886
[16:21:10] <logmsgbot>	 !log jhathaway@cumin1001 START - Cookbook sre.hosts.reimage for host mirror1001.wikimedia.org with OS bullseye
[16:21:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:21:17] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Setup new mirror server (copernicium.wikimedia.org) - https://phabricator.wikimedia.org/T286898 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhathaway@cumin1001 for host mirror1001.wikimedia.org with OS bullseye
[16:21:18] <logmsgbot>	 !log jhathaway@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host mirror1001.wikimedia.org with OS bullseye
[16:21:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:21:29] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Setup new mirror server (copernicium.wikimedia.org) - https://phabricator.wikimedia.org/T286898 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhathaway@cumin1001 for host mirror1001.wikimedia.org with OS bullseye executed with...
[16:21:56] <wikibugs>	 (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32993/console" [puppet] - 10https://gerrit.wikimedia.org/r/747151 (https://phabricator.wikimedia.org/T271421) (owner: 10Vgutierrez)
[16:22:01] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Update import hook to import logstash 6.8.21 [puppet] - 10https://gerrit.wikimedia.org/r/747048 (owner: 10Muehlenhoff)
[16:22:28] <wikibugs>	 (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] cache::envoy: Disable x-request-id generation [puppet] - 10https://gerrit.wikimedia.org/r/747151 (https://phabricator.wikimedia.org/T271421) (owner: 10Vgutierrez)
[16:23:47] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] remote: wait_reboot_since, intercept bad uptimes [software/spicerack] - 10https://gerrit.wikimedia.org/r/747155 (owner: 10Volans)
[16:24:33] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] mediabackup: Add an encryption key to store private files securely (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/747113 (https://phabricator.wikimedia.org/T262668) (owner: 10Jcrespo)
[16:24:43] <logmsgbot>	 !log jhathaway@cumin1001 START - Cookbook sre.hosts.reimage for host mirror1001.wikimedia.org with OS bullseye
[16:24:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:24:50] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Setup new mirror server (copernicium.wikimedia.org) - https://phabricator.wikimedia.org/T286898 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhathaway@cumin1001 for host mirror1001.wikimedia.org with OS bullseye
[16:24:51] <logmsgbot>	 !log jhathaway@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host mirror1001.wikimedia.org with OS bullseye
[16:24:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:25:01] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Setup new mirror server (copernicium.wikimedia.org) - https://phabricator.wikimedia.org/T286898 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhathaway@cumin1001 for host mirror1001.wikimedia.org with OS bullseye executed with...
[16:25:23] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] spicerack.redfish: add support for Redfish API [software/spicerack] - 10https://gerrit.wikimedia.org/r/747157 (https://phabricator.wikimedia.org/T271583) (owner: 10Volans)
[16:28:14] <wikibugs>	 (03PS1) 10Jcrespo: mediabackup: Add dummy age private key for mediabackups [labs/private] - 10https://gerrit.wikimedia.org/r/747160 (https://phabricator.wikimedia.org/T262668)
[16:28:25] <wikibugs>	 (03PS2) 10Jcrespo: mediabackup: Add dummy age private key for mediabackups [labs/private] - 10https://gerrit.wikimedia.org/r/747160 (https://phabricator.wikimedia.org/T262668)
[16:28:38] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/747113 (https://phabricator.wikimedia.org/T262668) (owner: 10Jcrespo)
[16:30:45] <logmsgbot>	 !log jhathaway@cumin1001 START - Cookbook sre.hosts.reimage for host mirror1001.wikimedia.org with OS bullseye
[16:30:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:30:51] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Setup new mirror server (copernicium.wikimedia.org) - https://phabricator.wikimedia.org/T286898 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhathaway@cumin1001 for host mirror1001.wikimedia.org with OS bullseye
[16:30:52] <logmsgbot>	 !log jhathaway@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host mirror1001.wikimedia.org with OS bullseye
[16:30:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:31:01] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Setup new mirror server (copernicium.wikimedia.org) - https://phabricator.wikimedia.org/T286898 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhathaway@cumin1001 for host mirror1001.wikimedia.org with OS bullseye executed with...
[16:32:28] <icinga-wm>	 PROBLEM - cassandra-a service on aqs1011 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[16:32:49] <wikibugs>	 (03PS3) 10Jcrespo: mediabackup: Add dummy age private key for mediabackups [labs/private] - 10https://gerrit.wikimedia.org/r/747160 (https://phabricator.wikimedia.org/T262668)
[16:33:08] <logmsgbot>	 !log jhathaway@cumin1001 START - Cookbook sre.hosts.reimage for host mirror1001.wikimedia.org with OS bullseye
[16:33:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:33:14] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Setup new mirror server (copernicium.wikimedia.org) - https://phabricator.wikimedia.org/T286898 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhathaway@cumin1001 for host mirror1001.wikimedia.org with OS bullseye
[16:33:23] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic-Icebox: Q2:(Need By: ASAP) rack/setup/install lvs10[17-20] - https://phabricator.wikimedia.org/T295804 (10Cmjohnson)
[16:34:50] <icinga-wm>	 PROBLEM - cassandra-b service on aqs1011 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[16:36:34] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops: eqiad: Master Tracking Ticket for eqiad expansion cage - https://phabricator.wikimedia.org/T296966 (10Cmjohnson)
[16:36:47] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic-Icebox: Q2:(Need By: ASAP) rack/setup/install lvs10[17-20] - https://phabricator.wikimedia.org/T295804 (10Cmjohnson) 05Open→03Resolved The servers are finished with rack and initial setup, cross row connections should be handled in a separate task.
[16:37:52] <wikibugs>	 (03CR) 10Jcrespo: "Will deploy https://gerrit.wikimedia.org/r/c/labs/private/+/747160 first to test compilation." [puppet] - 10https://gerrit.wikimedia.org/r/747113 (https://phabricator.wikimedia.org/T262668) (owner: 10Jcrespo)
[16:40:06] <wikibugs>	 (03CR) 10Jcrespo: [V: 03+2 C: 03+2] mediabackup: Add dummy age private key for mediabackups [labs/private] - 10https://gerrit.wikimedia.org/r/747160 (https://phabricator.wikimedia.org/T262668) (owner: 10Jcrespo)
[16:41:07] <wikibugs>	 (03CR) 10Accraze: [C: 03+1] helmfile.d: Configure all ml-services to use the Istio egress gw [deployment-charts] - 10https://gerrit.wikimedia.org/r/747156 (https://phabricator.wikimedia.org/T294414) (owner: 10Elukey)
[16:41:51] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Kubernetes: Q2:(Need By: TBD) rack/setup/install kubernetes1022 - https://phabricator.wikimedia.org/T294301 (10Cmjohnson)
[16:42:24] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] knative-serving: add support for istio egress gateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/745555 (owner: 10Elukey)
[16:43:11] <wikibugs>	 (03CR) 10JHathaway: [C: 03+1] "2000, seems pretty reasonable, since we have about 900 messages sitting in the queue on mx1001 at the moment." [puppet] - 10https://gerrit.wikimedia.org/r/747128 (https://phabricator.wikimedia.org/T297144) (owner: 10Herron)
[16:43:46] <wikibugs>	 (03CR) 10Jcrespo: "Surprisingly, seems to work as expected: https://puppet-compiler.wmflabs.org/pcc-worker1001/32994/" [puppet] - 10https://gerrit.wikimedia.org/r/747113 (https://phabricator.wikimedia.org/T262668) (owner: 10Jcrespo)
[16:43:48] <Amir1>	 !log rolling restart of php-fpm on all mediawiki hosts (T297517 T297667)
[16:43:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:43:54] <stashbot>	 T297667: mysqli/mysqlnd memory leak - https://phabricator.wikimedia.org/T297667
[16:43:54] <stashbot>	 T297517: wtp* hosts: Out of memory (allocated 39845888) (tried to allocate 131072 bytes) in OutputHandler.php - https://phabricator.wikimedia.org/T297517
[16:46:43] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: mediawiki: actually mount the debug volume [deployment-charts] - 10https://gerrit.wikimedia.org/r/747163
[16:48:17] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] mediawiki: actually mount the debug volume [deployment-charts] - 10https://gerrit.wikimedia.org/r/747163 (owner: 10Giuseppe Lavagetto)
[16:48:40] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] helmfile.d: Add Istio Egress config for ml-serve clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/747153 (https://phabricator.wikimedia.org/T294414) (owner: 10Elukey)
[16:51:26] <icinga-wm>	 PROBLEM - Check systemd state on wtp1045 is CRITICAL: CRITICAL - degraded: The following units failed: phpsessionclean.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:51:31] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: mediawiki: actually mount the debug volume [deployment-charts] - 10https://gerrit.wikimedia.org/r/747163
[16:51:54] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox
[16:51:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:53:01] <wikibugs>	 (03PS2) 10Elukey: helmfile.d: Configure all ml-services to use the Istio egress gw [deployment-charts] - 10https://gerrit.wikimedia.org/r/747156 (https://phabricator.wikimedia.org/T294414)
[16:53:21] <jynus>	 jbond, I am about to run "reprepro -C main includedeb buster-wikimedia age_1.0.0~rc1-2+b3_amd64.deb" on apt1001 - I double-checked the sha256sum and tested it on a buster host (no new dependencies)
[16:53:46] <jbond>	 jynus: cool 
[16:55:14] <icinga-wm>	 PROBLEM - Check systemd state on wtp1041 is CRITICAL: CRITICAL - degraded: The following units failed: phpsessionclean.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:55:45] <Amir1>	 can it be because of the rolling restart?
[16:55:49] <jbond>	 jynus: i did a test install on sretest1001 and it looks ok to me
[16:55:54] <Amir1>	 if so, it should recover
[16:56:04] <jynus>	 jbond, cool
[16:56:20] <wikibugs>	 (03PS3) 10Elukey: helmfile.d: Configure all ml-services to use the Istio egress gw [deployment-charts] - 10https://gerrit.wikimedia.org/r/747156 (https://phabricator.wikimedia.org/T294414)
[16:56:22] <wikibugs>	 (03PS1) 10Elukey: knative-serving: fix net_istio_egress template [deployment-charts] - 10https://gerrit.wikimedia.org/r/747164
[16:56:24] <wikibugs>	 (03PS3) 10Volans: remote: wait_reboot_since, intercept bad uptimes [software/spicerack] - 10https://gerrit.wikimedia.org/r/747155
[16:56:26] <wikibugs>	 (03PS2) 10Volans: spicerack.redfish: add support for Redfish API [software/spicerack] - 10https://gerrit.wikimedia.org/r/747157 (https://phabricator.wikimedia.org/T271583)
[16:56:28] <wikibugs>	 (03PS1) 10Volans: pylint: fix newly reported issues [software/spicerack] - 10https://gerrit.wikimedia.org/r/747165
[16:56:33] <jynus>	 I will send a patch to install it on puppet masters if you are ok with that (I think it may be useful outside of mediabackups)
[16:56:51] <jbond>	 jynus: sgtm
[16:57:22] <logmsgbot>	 !log jhathaway@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mirror1001.wikimedia.org with OS bullseye
[16:57:25] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[16:57:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:57:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:57:30] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Setup new mirror server (copernicium.wikimedia.org) - https://phabricator.wikimedia.org/T286898 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhathaway@cumin1001 for host mirror1001.wikimedia.org with OS bullseye completed: - m...
[16:59:06] <icinga-wm>	 RECOVERY - cassandra-c CQL 10.192.48.172:9042 on restbase2026 is OK: TCP OK - 0.033 second response time on 10.192.48.172 port 9042 https://phabricator.wikimedia.org/T93886
[17:00:04] <jouncebot>	 jbond and rzl: How many deployers does it take to do Puppet request window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211214T1700).
[17:00:04] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[17:00:12] <rzl>	 ✅
[17:01:33] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] knative-serving: fix net_istio_egress template [deployment-charts] - 10https://gerrit.wikimedia.org/r/747164 (owner: 10Elukey)
[17:03:23] <wikibugs>	 (03PS1) 10BBlack: Add mediawiki redirects for WME typo domains [puppet] - 10https://gerrit.wikimedia.org/r/747167 (https://phabricator.wikimedia.org/T296445)
[17:04:05] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+2] mediawiki: actually mount the debug volume [deployment-charts] - 10https://gerrit.wikimedia.org/r/747163 (owner: 10Giuseppe Lavagetto)
[17:04:27] <wikibugs>	 (03PS1) 10BBlack: Define enterprise.(wm|wp).o for MW-level redirects [dns] - 10https://gerrit.wikimedia.org/r/747168 (https://phabricator.wikimedia.org/T296445)
[17:04:55] <wikibugs>	 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Grant Access to <wmf group> for <Elena Lappen> - https://phabricator.wikimedia.org/T297652 (10elappen-WMF) Thank you so much @MatthewVernon!
[17:05:37] <wikibugs>	 (03PS2) 10BBlack: Add MW and ncredir redirects for WME typo domains [puppet] - 10https://gerrit.wikimedia.org/r/747167 (https://phabricator.wikimedia.org/T296445)
[17:06:10] <Amir1>	 the rolling restart is done now
[17:06:40] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudbackup100[34] - https://phabricator.wikimedia.org/T293934 (10Cmjohnson)
[17:06:44] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/weight=10:pooled=yes; selector: name=restbase2026.codfw.wmnet
[17:06:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:06:53] <wikibugs>	 (03PS1) 10Volans: sre.hosts.provision: add new cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/747169 (https://phabricator.wikimedia.org/T271583)
[17:07:14] <dancy>	 Amir1: The mem usage chart I'm looking at for parsoid dropped down a lot.
[17:07:25] <wikibugs>	 (03CR) 10Volans: "Example usage in I38b4bccee29e3222654c078f8544dfba03a8ca16" [software/spicerack] - 10https://gerrit.wikimedia.org/r/747157 (https://phabricator.wikimedia.org/T271583) (owner: 10Volans)
[17:07:30] <wikibugs>	 (03Merged) 10jenkins-bot: mediawiki: actually mount the debug volume [deployment-charts] - 10https://gerrit.wikimedia.org/r/747163 (owner: 10Giuseppe Lavagetto)
[17:07:34] <Amir1>	 yeah but that's sorta expected, it's fresh and without the leak
[17:07:53] <Amir1>	 the leak is happening but hopefully at a slower pace, with a lower number of db queries happening
[17:08:09] <dancy>	 nod.
[17:08:12] <dancy>	 Desired.
[17:09:15] <Amir1>	 for a while we can run the rolling restart until the underlying issue gets fixed
[17:09:30] <wikibugs>	 (03CR) 10MSantos: [C: 03+2] tegola-vector-tiles: Disable pregeneration on eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/747136 (https://phabricator.wikimedia.org/T280767) (owner: 10Jgiannelos)
[17:09:43] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] sre.hosts.provision: add new cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/747169 (https://phabricator.wikimedia.org/T271583) (owner: 10Volans)
[17:10:09] <wikibugs>	 10SRE, 10Abstract Wikipedia team (Phase λ – Launch), 10Service-deployment-requests: New Service Request: function-orchestrator and function-evaluator (for Wikifunctions launch) - https://phabricator.wikimedia.org/T297314 (10Jdforrester-WMF)
[17:10:14] <icinga-wm>	 RECOVERY - cassandra-b service on aqs1011 is OK: OK - cassandra-b is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[17:10:20] <icinga-wm>	 RECOVERY - cassandra-b CQL 10.64.16.206:9042 on aqs1011 is OK: TCP OK - 0.000 second response time on 10.64.16.206 port 9042 https://phabricator.wikimedia.org/T93886
[17:10:24] <icinga-wm>	 RECOVERY - Check systemd state on wtp1041 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:10:33] <wikibugs>	 (03PS1) 10Jcrespo: puppetmaster: Install 'age' on puppetmaster frontends [puppet] - 10https://gerrit.wikimedia.org/r/747170 (https://phabricator.wikimedia.org/T262668)
[17:10:44] <icinga-wm>	 RECOVERY - Check systemd state on wtp1045 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:10:52] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox
[17:10:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:11:04] <wikibugs>	 (03CR) 10Volans: "Example cookbook usage for Icc10491cf2c90d2bc51122c7ec3d2e168327afba . CI is expected to fail until the linked patch is merged and release" [cookbooks] - 10https://gerrit.wikimedia.org/r/747169 (https://phabricator.wikimedia.org/T271583) (owner: 10Volans)
[17:11:08] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] puppetmaster: Install 'age' on puppetmaster frontends [puppet] - 10https://gerrit.wikimedia.org/r/747170 (https://phabricator.wikimedia.org/T262668) (owner: 10Jcrespo)
[17:12:20] <icinga-wm>	 RECOVERY - cassandra-a CQL 10.64.16.204:9042 on aqs1011 is OK: TCP OK - 0.000 second response time on 10.64.16.204 port 9042 https://phabricator.wikimedia.org/T93886
[17:12:20] <icinga-wm>	 RECOVERY - cassandra-a service on aqs1011 is OK: OK - cassandra-a is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[17:12:30] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host aqs1011.eqiad.wmnet
[17:12:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:14:06] <wikibugs>	 (03PS2) 10Jcrespo: puppetmaster: Install 'age' on puppetmaster frontends [puppet] - 10https://gerrit.wikimedia.org/r/747170 (https://phabricator.wikimedia.org/T262668)
[17:14:29] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] mediabackup: Add an encryption key to store private files securely (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/747113 (https://phabricator.wikimedia.org/T262668) (owner: 10Jcrespo)
[17:14:56] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] remote: wait_reboot_since, intercept bad uptimes [software/spicerack] - 10https://gerrit.wikimedia.org/r/747155 (owner: 10Volans)
[17:15:05] <wikibugs>	 (03Merged) 10jenkins-bot: tegola-vector-tiles: Disable pregeneration on eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/747136 (https://phabricator.wikimedia.org/T280767) (owner: 10Jgiannelos)
[17:15:13] <logmsgbot>	 !log oblivian@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[17:15:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:15:18] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[17:15:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:15:40] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] pylint: fix newly reported issues [software/spicerack] - 10https://gerrit.wikimedia.org/r/747165 (owner: 10Volans)
[17:17:20] <wikibugs>	 (03CR) 10Volans: [C: 03+2] pylint: fix newly reported issues [software/spicerack] - 10https://gerrit.wikimedia.org/r/747165 (owner: 10Volans)
[17:18:50] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host aqs1012.eqiad.wmnet
[17:18:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:19:07] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10SRE Observability (FY2021/2022-Q2): (Need By: TBD) rack/setup/install prometheus100[56] - https://phabricator.wikimedia.org/T294967 (10Cmjohnson)
[17:20:08] <wikibugs>	 10SRE, 10serviceops, 10Wikimedia-production-error: wtp* hosts: Out of memory (allocated 39845888) (tried to allocate 131072 bytes) in OutputHandler.php - https://phabricator.wikimedia.org/T297517 (10thcipriani) Documenting my understanding of this problem after reading this task (along with T297669 and T2976...
[17:20:27] <wikibugs>	 (03CR) 10Jcrespo: "I think this will be a useful tool no matter what, but if we use age for key generation, this is almost a requirement." [puppet] - 10https://gerrit.wikimedia.org/r/747170 (https://phabricator.wikimedia.org/T262668) (owner: 10Jcrespo)
[17:21:27] <mutante>	 !log icinga - re-enabling active monitoring checks on mx2001 (T297128)
[17:21:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:21:32] <stashbot>	 T297128: Bringing mx2001 back into service - https://phabricator.wikimedia.org/T297128
[17:23:08] <wikibugs>	 (03Merged) 10jenkins-bot: pylint: fix newly reported issues [software/spicerack] - 10https://gerrit.wikimedia.org/r/747165 (owner: 10Volans)
[17:23:16] <mutante>	 !log elastic1043 is down and alerting since > 6h
[17:23:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:23:27] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox
[17:23:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:24:13] <mutante>	 jhathaway: just added mirror1001 in puppet?
[17:24:33] <jhathaway>	 mutante: yes, just re-imaged it
[17:24:38] <wikibugs>	 10SRE, 10Data-Engineering, 10Infrastructure-Foundations, 10Traffic-Icebox, 10netops: Collect netflow data for internal traffic - https://phabricator.wikimedia.org/T263277 (10JAllemandou) Am I right in assuming that this data has the same schema as the original `netflow`?
[17:24:43] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host aqs1012.eqiad.wmnet
[17:24:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:25:24] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host aqs1013.eqiad.wmnet
[17:25:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:25:45] <mutante>	 jhathaway: confirmed there are new monitoring checks for it in Icinga in the state "pending". So soon it might start talking about these. Though the cookbook would first set a downtime for 2 hours or so.
[17:26:04] <mutante>	 this means the host is in puppetdb and it worked, basically
[17:26:43] <jhathaway>	 ok thanks
[17:27:15] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[17:27:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:27:33] <mutante>	 (the pending ones are actually just the mgmt interface, other checks already green but with disabled notifications)
[17:30:35] <wikibugs>	 (03PS4) 10Volans: remote: wait_reboot_since, intercept bad uptimes [software/spicerack] - 10https://gerrit.wikimedia.org/r/747155
[17:30:37] <wikibugs>	 (03PS3) 10Volans: spicerack.redfish: add support for Redfish API [software/spicerack] - 10https://gerrit.wikimedia.org/r/747157 (https://phabricator.wikimedia.org/T271583)
[17:31:11] <logmsgbot>	 !log jgiannelos@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'tegola-vector-tiles' for release 'main' .
[17:31:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:31:29] <icinga-wm>	 PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: An error occurred checking if Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[17:31:55] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host aqs1013.eqiad.wmnet
[17:31:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:32:29] <mutante>	 aphlict1001 ran out of disk, people2002 dpkg error, cr1 OSPF alerts, stat1007 broken product-analytics-movement-service, and 20 other alerts
[17:34:45] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Migrate codfw Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296622 (10MoritzMuehlenhoff)
[17:35:01] <logmsgbot>	 !log bblack@cumin1001 START - Cookbook sre.dns.netbox
[17:35:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:35:05] <wikibugs>	 10SRE, 10serviceops, 10Wikimedia-production-error: wtp* hosts: Out of memory (allocated 39845888) (tried to allocate 131072 bytes) in OutputHandler.php - https://phabricator.wikimedia.org/T297517 (10Ladsgroup) I don't have strong opinions but I think wmf.12 issues are "mitigated" (but not resolved) and wmf.1...
[17:35:31] <logmsgbot>	 !log oblivian@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[17:35:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:35:35] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] remote: wait_reboot_since, intercept bad uptimes [software/spicerack] - 10https://gerrit.wikimedia.org/r/747155 (owner: 10Volans)
[17:35:54] <wikibugs>	 (03PS1) 10Vgutierrez: prometheus::ops: Gather full metrics for cache::envoy [puppet] - 10https://gerrit.wikimedia.org/r/747171 (https://phabricator.wikimedia.org/T271421)
[17:35:56] <wikibugs>	 (03PS1) 10Vgutierrez: prometheus::ops: Add varnish/ATS metrics for cache::upload_envoy role [puppet] - 10https://gerrit.wikimedia.org/r/747172 (https://phabricator.wikimedia.org/T271421)
[17:36:30] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] prometheus::ops: Gather full metrics for cache::envoy [puppet] - 10https://gerrit.wikimedia.org/r/747171 (https://phabricator.wikimedia.org/T271421) (owner: 10Vgutierrez)
[17:36:51] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] remote: wait_reboot_since, intercept bad uptimes [software/spicerack] - 10https://gerrit.wikimedia.org/r/747155 (owner: 10Volans)
[17:36:54] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] prometheus::ops: Add varnish/ATS metrics for cache::upload_envoy role [puppet] - 10https://gerrit.wikimedia.org/r/747172 (https://phabricator.wikimedia.org/T271421) (owner: 10Vgutierrez)
[17:38:09] <mutante>	 !log aphlict1001 - (Phabricator realtime notifications) - out of disk, attempting to gzip a large log
[17:38:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:38:50] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations: Q2:(Need By: TBD) rack/setup/install ganeti102[5-8] - https://phabricator.wikimedia.org/T293909 (10Cmjohnson)
[17:39:00] <logmsgbot>	 !log bblack@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[17:39:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:39:13] <wikibugs>	 (03PS2) 10Vgutierrez: prometheus::ops: Gather full metrics for cache::envoy [puppet] - 10https://gerrit.wikimedia.org/r/747171 (https://phabricator.wikimedia.org/T271421)
[17:39:15] <wikibugs>	 (03PS2) 10Vgutierrez: prometheus::ops: Add varnish/ATS metrics for cache::upload_envoy role [puppet] - 10https://gerrit.wikimedia.org/r/747172 (https://phabricator.wikimedia.org/T271421)
[17:39:51] <wikibugs>	 (03PS5) 10Volans: remote: wait_reboot_since, intercept bad uptimes [software/spicerack] - 10https://gerrit.wikimedia.org/r/747155
[17:39:53] <wikibugs>	 (03PS4) 10Volans: spicerack.redfish: add support for Redfish API [software/spicerack] - 10https://gerrit.wikimedia.org/r/747157 (https://phabricator.wikimedia.org/T271583)
[17:40:26] <wikibugs>	 10SRE, 10serviceops, 10Wikimedia-production-error: wtp* hosts: Out of memory (allocated 39845888) (tried to allocate 131072 bytes) in OutputHandler.php - https://phabricator.wikimedia.org/T297517 (10Joe) FWIW, I wholeheartedly agree with @thcipriani's opinions above.  As for the remaining work: we need to ru...
[17:41:36] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host aqs1014.eqiad.wmnet
[17:41:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:41:39] <topranks>	 !log Temporarily deactivated BGP peering to AS8932 at AMS-IX (cr2-esams) as peer has been constantly tripping max-prefix configuration for a few days, and according to peeringdb they should be within limit.
[17:41:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:42:43] <icinga-wm>	 RECOVERY - Uncommitted DNS changes in Netbox on netbox1001 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[17:42:48] <wikibugs>	 (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32996/console" [puppet] - 10https://gerrit.wikimedia.org/r/747171 (https://phabricator.wikimedia.org/T271421) (owner: 10Vgutierrez)
[17:43:47] <wikibugs>	 10SRE, 10serviceops, 10Wikimedia-production-error: wtp* hosts: Out of memory (allocated 39845888) (tried to allocate 131072 bytes) in OutputHandler.php - https://phabricator.wikimedia.org/T297517 (10ssastry) >>! In T297517#7570257, @thcipriani wrote: > > - I would prefer we either (a) abandon wmf.12 and roll...
[17:44:06] <wikibugs>	 (03PS2) 10Andrew Bogott: cloudmetrics: make cloudmetrics1003 the primary, 1004 the secondary [puppet] - 10https://gerrit.wikimedia.org/r/745950 (https://phabricator.wikimedia.org/T289888)
[17:44:08] <wikibugs>	 (03PS1) 10BBlack: lvs1020: lvs role and iface/addr metadata [puppet] - 10https://gerrit.wikimedia.org/r/747173 (https://phabricator.wikimedia.org/T295804)
[17:45:22] <wikibugs>	 (03PS1) 10Andrew Bogott: Replace cloudmetrics1001 with cloudmetrics1003 [dns] - 10https://gerrit.wikimedia.org/r/747174 (https://phabricator.wikimedia.org/T297712)
[17:46:14] <wikibugs>	 (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] prometheus::ops: Gather full metrics for cache::envoy [puppet] - 10https://gerrit.wikimedia.org/r/747171 (https://phabricator.wikimedia.org/T271421) (owner: 10Vgutierrez)
[17:46:29] <wikibugs>	 (03PS1) 10BBlack: lvs1020: add to homer lvs_neighbors [homer/public] - 10https://gerrit.wikimedia.org/r/747175 (https://phabricator.wikimedia.org/T295804)
[17:46:51] <vgutierrez>	 jynus: you got a pending commit on labs-private repo to be merged
[17:47:27] <icinga-wm>	 RECOVERY - DPKG on people2002 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[17:47:28] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host aqs1014.eqiad.wmnet
[17:47:31] <jynus>	 vgutierrez, oh
[17:47:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:47:33] <jynus>	 let me fix that
[17:47:37] <jynus>	 I always forget
[17:48:00] <jynus>	 sorry, done
[17:48:01] <icinga-wm>	 RECOVERY - Disk space on aphlict1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=aphlict1001&var-datasource=eqiad+prometheus/ops
[17:48:31] <wikibugs>	 (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32997/console" [puppet] - 10https://gerrit.wikimedia.org/r/747172 (https://phabricator.wikimedia.org/T271421) (owner: 10Vgutierrez)
[17:48:55] <mutante>	 !log people2002 - apt-get install --reinstall linux-image-5.10.0-9-amd64   to fix Icinga DPKG alert
[17:48:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:49:25] <wikibugs>	 (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] prometheus::ops: Add varnish/ATS metrics for cache::upload_envoy role [puppet] - 10https://gerrit.wikimedia.org/r/747172 (https://phabricator.wikimedia.org/T271421) (owner: 10Vgutierrez)
[17:51:09] <icinga-wm>	 ACKNOWLEDGEMENT - SSH on db2086.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds daniel_zahn https://phabricator.wikimedia.org/T283582 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[17:51:09] <icinga-wm>	 ACKNOWLEDGEMENT - SSH on mw2252.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds daniel_zahn https://phabricator.wikimedia.org/T283582 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[17:51:09] <icinga-wm>	 ACKNOWLEDGEMENT - SSH on mw2257.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds daniel_zahn https://phabricator.wikimedia.org/T283582 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[17:51:09] <icinga-wm>	 ACKNOWLEDGEMENT - SSH on mw2258.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds daniel_zahn https://phabricator.wikimedia.org/T283582 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[17:51:28] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host aqs1015.eqiad.wmnet
[17:51:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:52:57] <majavah>	 i'm deploying something
[17:53:18] <dancy>	 hehe
[17:54:35] <wikibugs>	 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (Kanban): connect 2nd cloudcontrol200x-dev NIC to vlan 2105 - https://phabricator.wikimedia.org/T297588 (10aborrero) 05Open→03Stalled Yes, we will be doing trunk. Thanks @Papaul I think we're fine here from DCops side f...
[17:54:58] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (Kanban): cloud: decide on general idea for having cloud-dedicated hardware provide service in the cloud realm & the internet - https://phabricator.wikimedia.org/T296411 (10aborrero)
[17:55:11] <wikibugs>	 (03PS3) 10Hnowlan: maps: write tegola swift credentials out to file [puppet] - 10https://gerrit.wikimedia.org/r/746897 (https://phabricator.wikimedia.org/T292700)
[17:56:00] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (Kanban): cloud: decide on general idea for having cloud-dedicated hardware provide service in the cloud realm & the internet - https://phabricator.wikimedia.org/T296411 (10aborrero) 05Open→03Stalled We just re-shifted team priorities...
[17:56:36] <wikibugs>	 (03CR) 10Hnowlan: maps: write tegola swift credentials out to file (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/746897 (https://phabricator.wikimedia.org/T292700) (owner: 10Hnowlan)
[17:57:58] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host aqs1015.eqiad.wmnet
[17:58:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:58:19] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: Q1:(Need By: TBD) rack/setup/install cloudswift100[12] - https://phabricator.wikimedia.org/T289882 (10aborrero) 05Open→03Stalled FYI network details for these servers are blocked on {T296411}, which is in turn stalled, so marking...
[18:00:04] <jouncebot>	 chrisalbon and accraze: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Services – Graphoid / ORES deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211214T1800).
[18:04:56] <wikibugs>	 10SRE, 10serviceops, 10Wikimedia-production-error: wtp* hosts: Out of memory (allocated 39845888) (tried to allocate 131072 bytes) in OutputHandler.php - https://phabricator.wikimedia.org/T297517 (10Zabe) >>! In T297517#7570358, @ssastry wrote: > [...] does that mean if wmf.13 had to be rolled back, it will...
[18:10:50] <wikibugs>	 (03CR) 10Muehlenhoff: logstash: use logstash-oss for gelf_relay (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/746971 (https://phabricator.wikimedia.org/T297468) (owner: 10Cwhite)
[18:17:02] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation restart without plugin upgrade (3 nodes at a time) for ElasticSearch cluster cloudelastic: cloudelastic rolling restart - ryankemper@cumin1001 - T297468
[18:17:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:17:20] <ryankemper>	 !log T297468 `sudo cookbook sre.elasticsearch.rolling-operation cloudelastic "cloudelastic rolling restart" --nodes-per-run 3 --start-datetime 2021-12-14T01:27:58 --task-id T297468`  on `ryankemper@cumin1001` tmux `elastic_restarts`
[18:17:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:21:33] <ryankemper>	 !log T297468 [Elastic] Performing manual rolling restart of `relforge`. Starting with `ryankemper@relforge1004:~$ sudo systemctl restart elasticsearch_6@relforge-eqiad.service elasticsearch_6@relforge-eqiad-small-alpha.service logstash.service` (non-master node)
[18:21:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:22:13] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 04-1] "Fine with this, but the puppet masters are on buster and age is only included as of bullseye." [puppet] - 10https://gerrit.wikimedia.org/r/747170 (https://phabricator.wikimedia.org/T262668) (owner: 10Jcrespo)
[18:22:54] <wikibugs>	 (03CR) 10Cwhite: logstash: use logstash-oss for gelf_relay (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/746971 (https://phabricator.wikimedia.org/T297468) (owner: 10Cwhite)
[18:24:42] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/746971 (https://phabricator.wikimedia.org/T297468) (owner: 10Cwhite)
[18:25:31] <ottomata>	 !log repooling eventgate-main discovery to include codfw - T296699 - confctl --object-type discovery select 'dnsdisc=eventgate-main,name=codfw' set/pooled=true
[18:25:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:25:36] <stashbot>	 T296699: Pool eventgate-main in both datacenters (active/active) - https://phabricator.wikimedia.org/T296699
[18:25:39] <logmsgbot>	 !log otto@puppetmaster1001 conftool action : set/pooled=true; selector: dnsdisc=eventgate-main,name=codfw
[18:25:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:26:31] <wikibugs>	 10SRE, 10Analytics, 10Event-Platform, 10Sustainability (Incident Followup): Pool eventgate-main in both datacenters (active/active) - https://phabricator.wikimedia.org/T296699 (10Ottomata) Ran  ` root@puppetmaster1001:~# confctl --object-type discovery select 'dnsdisc=eventgate-main,name=codfw' set/pooled=...
[18:28:32] <ryankemper>	 !log T297468 [Elastic] `ryankemper@relforge1003:~$ sudo systemctl restart elasticsearch_6@relforge-eqiad.service elasticsearch_6@relforge-eqiad-small-alpha.service logstash.service`
[18:28:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:28:57] <wikibugs>	 10SRE, 10serviceops, 10Patch-For-Review: parsoid servers are not matched by mw* cumin aliases - https://phabricator.wikimedia.org/T294802 (10Dzahn) 05Open→03Stalled stalled on https://gerrit.wikimedia.org/r/c/operations/puppet/+/736596/5
[18:30:17] <wikibugs>	 (03CR) 10Dzahn: "While this is waiting for follow-up, can I simply add parsoid to "mw" alias to get the ticket resolved, while forgetting about the rest of" [puppet] - 10https://gerrit.wikimedia.org/r/736596 (https://phabricator.wikimedia.org/T294802) (owner: 10Dzahn)
[18:31:49] <wikibugs>	 (03CR) 10Cathal Mooney: [C: 03+1] "interfaces.yaml bits all look good, rest also makes sense but I'm not as familiar with that." [puppet] - 10https://gerrit.wikimedia.org/r/747173 (https://phabricator.wikimedia.org/T295804) (owner: 10BBlack)
[18:32:32] <wikibugs>	 (03PS1) 10Kosta Harlan: betalabs: Enable Watchlist Echo notifications feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/747186 (https://phabricator.wikimedia.org/T203941)
[18:32:49] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: cumin: reorganize mediawiki aliases (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/736596 (https://phabricator.wikimedia.org/T294802) (owner: 10Dzahn)
[18:32:57] <wikibugs>	 (03PS1) 10Urbanecm: Make fix-staging-perms also fix /srv/patches permissions [puppet] - 10https://gerrit.wikimedia.org/r/747187
[18:33:13] <wikibugs>	 (03CR) 10Dzahn: "still wanna get this done? I think Filippo's latest comment about a typo is still current" [puppet] - 10https://gerrit.wikimedia.org/r/715636 (owner: 10Legoktm)
[18:34:13] <wikibugs>	 (03CR) 10Dzahn: cumin: reorganize mediawiki aliases (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/736596 (https://phabricator.wikimedia.org/T294802) (owner: 10Dzahn)
[18:34:36] <majavah>	 !log deployed updated patch for T297322
[18:34:37] * majavah done
[18:34:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:37:16] <wikibugs>	 (03CR) 10Jcrespo: "0:-)" [puppet] - 10https://gerrit.wikimedia.org/r/747170 (https://phabricator.wikimedia.org/T262668) (owner: 10Jcrespo)
[18:38:34] <logmsgbot>	 !log milimetric@deploy1002 Started deploy [analytics/refinery@92c63c9]: Regular analytics weekly train [analytics/refinery@92c63c9]
[18:38:36] <wikibugs>	 (03PS6) 10Dzahn: cumin: add parsoid servers to all-mw-* aliases [puppet] - 10https://gerrit.wikimedia.org/r/736596 (https://phabricator.wikimedia.org/T294802)
[18:38:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:39:21] <wikibugs>	 (03PS7) 10Dzahn: cumin: add parsoid servers to all-mw-* aliases [puppet] - 10https://gerrit.wikimedia.org/r/736596 (https://phabricator.wikimedia.org/T294802)
[18:39:40] <wikibugs>	 (03CR) 10Dzahn: "amended, rebased, recycled. ok to merge?" [puppet] - 10https://gerrit.wikimedia.org/r/736596 (https://phabricator.wikimedia.org/T294802) (owner: 10Dzahn)
[18:39:58] <bblack>	 !log lvs1016: downtimed for attempt at moving its role to lvs1020 (expect a few minor related alerts, such as BGP ones for eqiad routers)
[18:40:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:40:44] <wikibugs>	 10SRE, 10SRE-swift-storage, 10Data-Persistence-Backup, 10media-backups, and 3 others: WMF media storage must be adequately backed up - https://phabricator.wikimedia.org/T262668 (10jcrespo) Proof it is working:  {F34883926}  {F34883925}
[18:40:49] <bblack>	 !log lvs1016: puppet agent disabled, pybal stopped
[18:40:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:41:05] <icinga-wm>	 PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[18:41:29] <bblack>	 ^ expected from lvs1016/lvs1020 work
[18:41:45] <mutante>	 thanks! and ACK @ no touching pybal
[18:42:19] <icinga-wm>	 ACKNOWLEDGEMENT - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal Brandon Black lvs1016/lvs1020 swap process https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[18:42:19] <icinga-wm>	 ACKNOWLEDGEMENT - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal Brandon Black lvs1016/lvs1020 swap process https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[18:43:52] <wikibugs>	 (03PS1) 10Clare Ming: Prevent A/B test enrollment hook from firing for unsampled [skins/Vector] (wmf/1.38.0-wmf.12) - 10https://gerrit.wikimedia.org/r/747075 (https://phabricator.wikimedia.org/T297662)
[18:50:56] <jynus>	 is elastic ok?
[18:52:10] <jynus>	 ah, it is cloudelastic, not elastic
[18:52:26] <jynus>	 and seems to be getting better
[18:52:28] <wikibugs>	 (03CR) 10Herron: [C: 03+1] prometheus: export service catalog metrics [puppet] - 10https://gerrit.wikimedia.org/r/747140 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi)
[18:52:30] <wikibugs>	 (03PS1) 10Bartosz Dziewoński: VE on zh.wiki: Enable single-edit-tab mode, and other config like en.wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/747190 (https://phabricator.wikimedia.org/T296269)
[18:55:15] <wikibugs>	 (03CR) 10Jdlrobson: [C: 03+1] "LGTM" [skins/Vector] (wmf/1.38.0-wmf.12) - 10https://gerrit.wikimedia.org/r/747075 (https://phabricator.wikimedia.org/T297662) (owner: 10Clare Ming)
[18:56:59] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28
[18:58:23] <logmsgbot>	 !log milimetric@deploy1002 Finished deploy [analytics/refinery@92c63c9]: Regular analytics weekly train [analytics/refinery@92c63c9] (duration: 19m 49s)
[18:58:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:58:40] <wikibugs>	 (03CR) 10BBlack: [C: 03+2] lvs1020: lvs role and iface/addr metadata [puppet] - 10https://gerrit.wikimedia.org/r/747173 (https://phabricator.wikimedia.org/T295804) (owner: 10BBlack)
[18:59:13] <icinga-wm>	 RECOVERY - etcd request latencies on kubemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28
[18:59:29] <bblack>	 !log lvs1020: running puppet agent with lvs role + config for first time
[18:59:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:00:04] <jouncebot>	 RoanKattouw and Urbanecm: (Dis)respected human, time to deploy UTC evening backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211214T1900). Please do the needful.
[19:00:05] <jouncebot>	 nn1l2, nemo-yiannis, and cjming: A patch you scheduled for UTC evening backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[19:00:12] <nn1l2>	 hi
[19:00:12] <cjming>	 o/
[19:00:30] <nemo-yiannis>	 hey
[19:00:43] <urbanecm>	 Hey
[19:01:07] <logmsgbot>	 !log milimetric@deploy1002 Started deploy [analytics/refinery@92c63c9] (thin): Regular analytics weekly train THIN [analytics/refinery@92c63c9]
[19:01:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:01:14] <logmsgbot>	 !log milimetric@deploy1002 Finished deploy [analytics/refinery@92c63c9] (thin): Regular analytics weekly train THIN [analytics/refinery@92c63c9] (duration: 00m 07s)
[19:01:17] <logmsgbot>	 !log milimetric@deploy1002 Started deploy [analytics/refinery@92c63c9] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@92c63c9]
[19:01:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:01:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:01:27] <urbanecm>	 cjming: want to deploy today? Or should i?
[19:01:52] <MatmaRex>	 hello, i have a patch too if i'm not too late. https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/747190
[19:02:01] <MatmaRex>	 i'll add it to the table
[19:02:15] <urbanecm>	 MatmaRex: feel free to
[19:02:21] <cjming>	 urbanecm: do you mind doing it? i'm trying to get another patch out the door
[19:02:46] <urbanecm>	 cjming: not at all
[19:02:56] <cjming>	 ty 🙌
[19:03:10] <wikibugs>	 (03PS1) 10BBlack: lvs1020: add interface_tweaks data [puppet] - 10https://gerrit.wikimedia.org/r/747192 (https://phabricator.wikimedia.org/T295804)
[19:03:39] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Prevent A/B test enrollment hook from firing for unsampled [skins/Vector] (wmf/1.38.0-wmf.12) - 10https://gerrit.wikimedia.org/r/747075 (https://phabricator.wikimedia.org/T297662) (owner: 10Clare Ming)
[19:03:57] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] kartographer: Enable tegola on jawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/747111 (https://phabricator.wikimedia.org/T280767) (owner: 10Jgiannelos)
[19:04:39] <wikibugs>	 (03CR) 10BBlack: [C: 03+2] lvs1020: add interface_tweaks data [puppet] - 10https://gerrit.wikimedia.org/r/747192 (https://phabricator.wikimedia.org/T295804) (owner: 10BBlack)
[19:05:16] <wikibugs>	 (03Merged) 10jenkins-bot: kartographer: Enable tegola on jawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/747111 (https://phabricator.wikimedia.org/T280767) (owner: 10Jgiannelos)
[19:06:51] <MatmaRex>	 oops, i got disconnected, hope i didn't miss anything
[19:08:11] <logmsgbot>	 !log milimetric@deploy1002 Finished deploy [analytics/refinery@92c63c9] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@92c63c9] (duration: 06m 54s)
[19:08:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:08:21] <urbanecm>	 MatmaRex: nope
[19:08:50] <urbanecm>	 nemo-yiannis: your patch is at mwdebug1001
[19:08:53] <urbanecm>	 can you have a look?
[19:09:01] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[19:09:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:10:04] <urbanecm>	 nn1l2: sorry for not being clear yesterday. I meant that the patch ideally should be scheduled with a +1 from someone. I tend to not have time to do enough review during B&C
[19:10:08] <urbanecm>	 i can try to do it at the end
[19:10:08] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[19:10:09] <urbanecm>	 but no guarantees
[19:10:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:10:25] <wikibugs>	 (03PS2) 10Urbanecm: VE on zh.wiki: Enable single-edit-tab mode, and other config like en.wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/747190 (https://phabricator.wikimedia.org/T296269) (owner: 10Bartosz Dziewoński)
[19:10:40] <nn1l2>	 understood
[19:11:12] <urbanecm>	 it's not because the patch is complicated or something. it's...quite large (compared to other config patches)
[19:12:11] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] VE on zh.wiki: Enable single-edit-tab mode, and other config like en.wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/747190 (https://phabricator.wikimedia.org/T296269) (owner: 10Bartosz Dziewoński)
[19:12:26] <urbanecm>	 nemo-yiannis: hey, how is your test going?
[19:12:41] <nemo-yiannis>	 diff looks ok, i am having a hard time navigating ja.wikipedia.org 
[19:12:53] <urbanecm>	 nemo-yiannis: try to clear your cookies
[19:12:59] <urbanecm>	 (it's a...known issue)
[19:13:26] <urbanecm>	 the gadget that caused it was disabled, but...obviously we can't remove the cookies from visitors ourselves :(
[19:13:34] <wikibugs>	 (03Merged) 10jenkins-bot: VE on zh.wiki: Enable single-edit-tab mode, and other config like en.wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/747190 (https://phabricator.wikimedia.org/T296269) (owner: 10Bartosz Dziewoński)
[19:15:02] <urbanecm>	 to roots: can someone tell me what is `mwdebug1001:/srv/mediawiki/w/debug/vardump.php`? it's root-owned in a root-owned directory, and scap complains about it
[19:15:31] <wikibugs>	 (03CR) 10Cwhite: [C: 03+2] logstash: use logstash-oss for gelf_relay [puppet] - 10https://gerrit.wikimedia.org/r/746971 (https://phabricator.wikimedia.org/T297468) (owner: 10Cwhite)
[19:15:33] <urbanecm>	 nemo-yiannis: pulled the patch to mwdebug1002 as well -- I'm not sure if the error i missed when originally pulling broke scap pull or not
[19:16:19] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[19:16:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:17:28] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[19:17:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:18:03] <bblack>	 !log lvs1020 - rebooting on new config
[19:18:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:19:01] <nemo-yiannis>	 i cant find a page on jawiki to reproduce the issue but patch should be fairly straightforward (we've already tried in many wikis the past few days)
[19:19:13] <nemo-yiannis>	 its more of a matter of rollout
[19:19:27] <urbanecm>	 nemo-yiannis: so ok to deploy?
[19:19:30] <logmsgbot>	 !log ryankemper@cumin1001 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) restart without plugin upgrade (3 nodes at a time) for ElasticSearch cluster cloudelastic: cloudelastic rolling restart - ryankemper@cumin1001 - T297468
[19:19:31] <urbanecm>	 or do you want more time?
[19:19:32] <nemo-yiannis>	 i think so yeah
[19:19:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:19:39] <urbanecm>	 okay
[19:19:47] <urbanecm>	 once i get clarification re `mwdebug1001:/srv/mediawiki/w/debug/vardump.php`, I'll push it
[19:22:59] <wikibugs>	 (03Merged) 10jenkins-bot: Prevent A/B test enrollment hook from firing for unsampled [skins/Vector] (wmf/1.38.0-wmf.12) - 10https://gerrit.wikimedia.org/r/747075 (https://phabricator.wikimedia.org/T297662) (owner: 10Clare Ming)
[19:27:24] <wikibugs>	 (03PS1) 10Jbond: P:age::store: Add profile and class to configure age secret store [puppet] - 10https://gerrit.wikimedia.org/r/747193
[19:27:26] <wikibugs>	 (03PS1) 10Jbond: O:puppetmaster: Add age::store to puppetmasters [puppet] - 10https://gerrit.wikimedia.org/r/747194
[19:27:54] <nemo-yiannis>	 urbanecm: found an article from the logs, change should be ok
[19:28:04] <urbanecm>	 okay
[19:28:16] <urbanecm>	 I'm discussing the suspicious file with others in a different channel
[19:28:17] <urbanecm>	 stay tuned
[19:28:45] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[19:28:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:29:41] <wikibugs>	 (03PS2) 10Jbond: O:puppetmaster: Add age::store to puppetmasters [puppet] - 10https://gerrit.wikimedia.org/r/747194
[19:29:50] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[19:29:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:31:22] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32999/console" [puppet] - 10https://gerrit.wikimedia.org/r/747194 (owner: 10Jbond)
[19:32:18] <wikibugs>	 (03CR) 10AntiCompositeNumber: [C: 04-1] "WV should not be removed from trwikivoyage. Everything else looks fine from here." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/745220 (https://phabricator.wikimedia.org/T296643) (owner: 104nn1l2)
[19:33:53] <wikibugs>	 (03CR) 10AntiCompositeNumber: [C: 03+1] "Correction: not an issue, WV is already a default alias on wikivoyages." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/745220 (https://phabricator.wikimedia.org/T296643) (owner: 104nn1l2)
[19:34:38] <AntiComposite>	 definitely would have been better to split that into smaller patches
[19:37:54] <icinga-wm>	 PROBLEM - PyBal BGP sessions are established on lvs1020 is CRITICAL: 0 le 0 https://wikitech.wikimedia.org/wiki/PyBal https://grafana.wikimedia.org/dashboard/db/pybal-bgp?var-datasource=eqiad+prometheus/ops
[19:38:09] <nn1l2>	 Thanks AntiComposite
[19:39:10] <urbanecm>	 bblack: not sure if the alert above is expected or not -- saw you rebooted lvs1020 recently
[19:39:42] <wikibugs>	 (03PS1) 10Jgiannelos: Deprecate unused maps event stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/747196
[19:40:32] <bblack>	 urbanecm: thanks.  yes, expected/ok
[19:40:41] <urbanecm>	 thanks for checking bblack
[19:41:47] <wikibugs>	 (03CR) 10Jgiannelos: [C: 04-1] "Block until next deployment window" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/747196 (owner: 10Jgiannelos)
[19:42:10] <wikibugs>	 (03PS2) 10Jgiannelos: Deprecate unused maps event stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/747196 (https://phabricator.wikimedia.org/T293366)
[19:46:07] <urbanecm>	 nemo-yiannis: going to sync your patch soon, rzl's dealing with the file
[19:46:17] <nemo-yiannis>	 sounds good
[19:46:42] <urbanecm>	 scap at mwdebug1001 completes w/o errors
[19:47:16] <wikibugs>	 (03PS3) 10JHathaway: debian mirrors: add new mirror, mirror1001 in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/745612 (https://phabricator.wikimedia.org/T286898)
[19:47:54] <logmsgbot>	 !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 7f4ae4cc678aa64b0795be7bc4c9a6f1ba4c1929: kartographer: Enable tegola on jawiki (T280767) (duration: 00m 58s)
[19:47:58] <urbanecm>	 nemo-yiannis: and, live
[19:47:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:47:59] <stashbot>	 T280767: Maps 2.0 roll-out plan - https://phabricator.wikimedia.org/T280767
[19:48:20] <urbanecm>	 MatmaRex: sorry this takes so long. your patch is at mwdebug1001
[19:48:22] <urbanecm>	 please test
[19:48:31] <MatmaRex>	 looking
[19:48:42] <nemo-yiannis>	 cc mbsantos ^
[19:48:43] <wikibugs>	 (03CR) 10JHathaway: [C: 03+2] debian mirrors: add new mirror, mirror1001 in eqiad (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/745612 (https://phabricator.wikimedia.org/T286898) (owner: 10JHathaway)
[19:49:13] <urbanecm>	 cjming: your patch is at mwdebug1002, can you have a look?
[19:49:52] <cjming>	 urbanecm: gtg
[19:49:55] <MatmaRex>	 urbanecm: seems good
[19:50:02] <urbanecm>	 cjming: that was quick, thanks
[19:50:03] <urbanecm>	 syncing both
[19:50:07] <MatmaRex>	 hopefully this is the last time you hear about VE on zhwiki :)
[19:50:36] <cjming>	 and A/B test enrollment fixes 🤞
[19:50:41] <urbanecm>	 hehe
[19:51:27] <logmsgbot>	 !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 40f0cff8da7c4484e1fe93b9d649fd03f462e434: VE on zh.wiki: Enable single-edit-tab mode, and other config like en.wiki (T296269) (duration: 00m 57s)
[19:51:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:51:34] <stashbot>	 T296269: Enable VisualEditor for Chinese Wikipedia - https://phabricator.wikimedia.org/T296269
[19:51:38] <urbanecm>	 MatmaRex: i wouldn't be _that_ sure about zhwiki and VE. I'm working for Growth, which was the team that kinda created the need for VE there :))
[19:52:01] <MatmaRex>	 hah
[19:53:00] <logmsgbot>	 !log urbanecm@deploy1002 Synchronized php-1.38.0-wmf.12/skins/Vector/resources/skins.vector.es6/AB.js: 62e84e7467c1765986cd1f80b466b8cacc6d91f6: Prevent A/B test enrollment hook from firing for unsampled (T297662) (duration: 00m 56s)
[19:53:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:53:06] <stashbot>	 T297662: mediawiki_web_ab_test_enrollment schema is logging users in the unsampled bucket - https://phabricator.wikimedia.org/T297662
[19:53:07] <urbanecm>	 cjming: should be live
[19:53:23] <cjming>	 urbanecm: thanks!
[19:53:27] <urbanecm>	 with the exception of nn1l2's patch, we're done 
[19:53:43] <nn1l2>	 I'm still around
[19:53:44] <wikibugs>	 (03CR) 10Ottomata: [C: 03+1] Deprecate unused maps event stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/747196 (https://phabricator.wikimedia.org/T293366) (owner: 10Jgiannelos)
[19:53:59] <nn1l2>	 if you want to deploy it :)
[19:54:11] <urbanecm>	 nn1l2: can we leave it for tomorrow?
[19:54:19] <nn1l2>	 Of course!
[19:54:28] <urbanecm>	 it's reviewed now, but i'm afraid there's not enough team for testing it
[19:54:31] <urbanecm>	 *time
[19:54:41] <urbanecm>	 thanks :)
[19:54:44] <nn1l2>	 No problem
[19:54:46] <wikibugs>	 (03PS5) 10AOkoth: gitlab: restore script keep_config options [puppet] - 10https://gerrit.wikimedia.org/r/741675 (https://phabricator.wikimedia.org/T274463)
[19:54:47] <wikibugs>	 10SRE, 10serviceops, 10Wikimedia-production-error: wtp* hosts: Out of memory (allocated 39845888) (tried to allocate 131072 bytes) in OutputHandler.php - https://phabricator.wikimedia.org/T297517 (10hashar) We will promote testwikis to wmf.13 in a few minutes.  Tomorrow evening we would had wmf.12 running on...
[19:54:48] <urbanecm>	 !log UTC evening B&C window done
[19:54:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:55:32] <wikibugs>	 (03CR) 10Jgiannelos: [C: 04-1] "Is there anything else other than this patch that we need to do to remove the deprecated stream?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/747196 (https://phabricator.wikimedia.org/T293366) (owner: 10Jgiannelos)
[19:55:40] <MatmaRex>	 thanks urbanecm
[19:55:43] <urbanecm>	 np
[19:55:55] <urbanecm>	 thanks for the VE work :))
[19:56:13] <wikibugs>	 (03PS2) 10Urbanecm: zhwiki: Promote Growth features out of dark mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/746831 (https://phabricator.wikimedia.org/T287884)
[19:56:16] <urbanecm>	 actually...let me also quickly push this
[19:56:27] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] zhwiki: Promote Growth features out of dark mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/746831 (https://phabricator.wikimedia.org/T287884) (owner: 10Urbanecm)
[19:57:09] <wikibugs>	 (03Merged) 10jenkins-bot: zhwiki: Promote Growth features out of dark mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/746831 (https://phabricator.wikimedia.org/T287884) (owner: 10Urbanecm)
[19:58:11] <MatmaRex>	 urbanecm: actually, a quick question
[19:58:17] <urbanecm>	 yes?
[19:58:34] <logmsgbot>	 !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: e127f4c6459cd9bc708b35a75c1f272b96fc3211: zhwiki: Promote Growth features out of dark mode (T287884) (duration: 00m 57s)
[19:58:35] <MatmaRex>	 urbanecm: so zhwiki still has the wikitext editor as the default mode, is that okay for the Growth features?
[19:58:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:58:39] <stashbot>	 T287884: Deploy Growth features on Chinese Wikipedia - https://phabricator.wikimedia.org/T287884
[19:58:40] * urbanecm now fully done with deployment
[19:58:42] <MatmaRex>	 visual is available, but the user has to switch to it
[19:58:57] <urbanecm>	 very good question
[19:59:00] <urbanecm>	 let me check that
[19:59:37] <MatmaRex>	 i am guessing that your code probably makes sure to open VE when it needs VE
[19:59:47] <MatmaRex>	 but i haven't tested and i don't know if you've enabled it on wikis with this config before
[20:00:05] <jouncebot>	 hashar and dancy: It is that lovely time of the day again! You are hereby commanded to deploy MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211214T2000).
[20:00:11] <urbanecm>	 for non-structured edits, we're telling the newcomer to press "Edit" (and highlighting it with a blinking dot)
[20:01:10] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[20:01:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:01:26] <wikibugs>	 (03PS6) 10AOkoth: gitlab: restore script keep_config options [puppet] - 10https://gerrit.wikimedia.org/r/741675 (https://phabricator.wikimedia.org/T274463)
[20:01:28] <urbanecm>	 at my WMF acc, it works
[20:01:42] <icinga-wm>	 PROBLEM - SSH on contint1001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[20:02:07] <urbanecm>	 testing with a new one
[20:02:23] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[20:02:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:02:29] <hashar>	 good morning :)
[20:02:46] <RhinosF1>	 Hey hashar
[20:02:49] <urbanecm>	 hello hashar 
[20:02:57] <hashar>	 will do the testwikis promotion 
[20:03:04] <hashar>	 since apparently the deployments are done aren't they?
[20:03:06] <urbanecm>	 hashar: can you wait for a second?
[20:03:10] <MatmaRex>	 urbanecm: it's the same config as enwiki btw
[20:03:13] <urbanecm>	 MatmaRex pointed out a reason why i should revert
[20:03:13] <hashar>	 sure!
[20:03:23] <urbanecm>	 or...maybe not?
[20:03:44] <hashar>	 please take your time. There is no rush ;)
[20:03:47] <MatmaRex>	 enwiki and eswiki, frwiktionary, hewiki
[20:03:49] <urbanecm>	 thanks
[20:04:06] <MatmaRex>	 so if it's enabled on any of these as well, you're probably good
[20:04:12] <MatmaRex>	 (wmgVisualEditorIsSecondaryEditor)
[20:04:21] <urbanecm>	 with zhwiki, we're at all Wikipedias except pwnwiki
[20:04:55] <urbanecm>	 i double checked it, and VE loads as expected
[20:05:01] <urbanecm>	 hashar: over to you :))
[20:05:17] <MatmaRex>	 okay great :D
[20:05:21] <hashar>	 launching!
[20:05:32] <urbanecm>	 and thanks MatmaRex for raising that up
[20:06:49] <hashar>	 actually I am doing group 0 not testwikis
[20:08:02] <hashar>	 Promote group0 from 1.38.0-wmf.12 to 1.38.0-wmf.12 refs T293954 [y/N] y
[20:08:03] <stashbot>	 T293954: 1.38.0-wmf.13 deployment blockers - https://phabricator.wikimedia.org/T293954
[20:08:08] <hashar>	 so hmm I broke the script :D 
[20:08:19] * urbanecm was just typing "fingers crossed"
[20:08:21] <urbanecm>	 too late i guess
[20:08:22] <hashar>	 it tries to promote from 12 to 12
[20:08:52] <hashar>	 I have hit ^C before pressing enter
[20:09:27] <hashar>	 fun
[20:09:54] <hashar>	 so I am blocked until I figure out why it can't find out the new version
[20:09:56] <urbanecm>	 i thought the script has an argument of target version?
[20:11:43] <hashar>	 hmm maybe
[20:11:47] <hashar>	 but really it should just work
[20:11:57] <urbanecm>	 docs in https://github.com/wikimedia/mediawiki-tools-release/blob/master/bin/deploy-promote#L45 say "defaults to last version in wikiversions.json"
[20:12:57] <hashar>	 or the doc is outdated
[20:13:42] <wikibugs>	 (03CR) 10Cathal Mooney: [C: 03+1] "LGTM!" [homer/public] - 10https://gerrit.wikimedia.org/r/747175 (https://phabricator.wikimedia.org/T295804) (owner: 10BBlack)
[20:13:45] <urbanecm>	 https://github.com/wikimedia/mediawiki-tools-release/blob/master/bin/deploy-promote#L339 looks to query scap wikiversions-inuse --staging, which outputs only 1.38.0-wmf.12
[20:13:50] <hashar>	 https://wikitech.wikimedia.org/wiki/Heterogeneous_deployment/Train_deploys#Wait_for_deploy_window shows ~/release/bin/deploy-promote group0
[20:14:19] <wikibugs>	 (03PS1) 10Hashar: group0 wikis to 1.38.0-wmf.13  refs T293954 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/747200
[20:14:22] <wikibugs>	 (03CR) 10Hashar: [C: 03+2] group0 wikis to 1.38.0-wmf.13  refs T293954 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/747200 (owner: 10Hashar)
[20:14:25] <urbanecm>	 and wikiversions.json does not have wmf.13 in it
[20:14:32] <hashar>	 yeah
[20:14:43] <urbanecm>	 so i think that the "Sync to cluster and verify on testwiki" step was not done
[20:15:09] <urbanecm>	 (and the later step you quote assumes testwiki is already at the new version, so promoting to "newest version that's in use" works)
[20:15:10] <wikibugs>	 (03Merged) 10jenkins-bot: group0 wikis to 1.38.0-wmf.13  refs T293954 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/747200 (owner: 10Hashar)
[20:15:14] <urbanecm>	 my 2c on the bug :))
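The diagnosis above (the promote step defaulting to the newest version already in use, which was still wmf.12) can be sketched roughly as follows. This is a hypothetical Python illustration, not the actual deploy-promote code: the function names and the shape of the wikiversions mapping (wiki name to "php-<version>") are assumptions made for the example.

```python
import re


def version_key(version):
    """Sort key: 'php-1.38.0-wmf.12' -> (1, 38, 0, 12)."""
    return tuple(int(n) for n in re.findall(r"\d+", version))


def newest_version_in_use(wikiversions):
    """Newest version that any wiki in the mapping currently runs."""
    return max(set(wikiversions.values()), key=version_key)


def plan_promotion(wikiversions, group_wikis, target=None):
    """Return the version to promote group_wikis to, or raise if it is a no-op.

    With no explicit target, fall back to the newest version in use
    (hypothetical re-creation of the default discussed in the log).
    """
    target = target or newest_version_in_use(wikiversions)
    if all(wikiversions[w] == target for w in group_wikis):
        raise ValueError(
            f"group is already on {target}; is the new version live on testwiki yet?"
        )
    return target
```

With every wiki still on wmf.12, plan_promotion() raises instead of offering to promote "from 12 to 12"; once one wiki (for example testwiki) is on wmf.13, the same call returns the wmf.13 version.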
[20:15:26] <hashar>	 I have amended the wiki doc
[20:15:32] <hashar>	 ohh
[20:15:36] <hashar>	 yeah testwikis that is it
[20:15:37] <hashar>	 bah
[20:15:40] <hashar>	 thank you urbanecm !
[20:15:46] <urbanecm>	 any time!
[20:16:19] <hashar>	 20:15:51 Check 'Logstash Error rate for mw1416.eqiad.wmnet' failed: ERROR: 92% OVER_THRESHOLD (Avg. Error rate: Before: 0.00, After: 14.00, Threshold: 1.00)
[20:16:23] <urbanecm>	 :(
[20:16:31] <urbanecm>	 that looks like a very short window to me
[20:16:32] <hashar>	 it is a single canary though
[20:16:50] <logmsgbot>	 !log hashar@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.38.0-wmf.13  refs T293954
[20:16:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:16:55] <stashbot>	 T293954: 1.38.0-wmf.13 deployment blockers - https://phabricator.wikimedia.org/T293954
[20:17:59] <wikibugs>	 (03CR) 10AOkoth: gitlab: restore script keep_config options (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/741675 (https://phabricator.wikimedia.org/T274463) (owner: 10AOkoth)
[20:18:09] <zabe>	 ouch, I can't even go onto https://test.wikipedia.org/
[20:18:20] <wikibugs>	 (03PS2) 10BBlack: eqiad lvs_neighbors: swap lvs1020 for lvs1016 [homer/public] - 10https://gerrit.wikimedia.org/r/747175 (https://phabricator.wikimedia.org/T295804)
[20:18:44] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[20:18:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:18:55] <urbanecm>	 hashar: i think you did not build i18n
[20:19:00] <urbanecm>	 (ie full scap sync-world)
[20:19:12] <urbanecm>	 group0 is fully down
[20:19:46] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute for appserver on alert1001 is CRITICAL: 4704 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[20:19:51] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[20:19:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:19:56] <urbanecm>	 hashar: please revert
[20:20:04] <hashar>	 yeah 
[20:20:06] <hashar>	 trying
[20:20:16] <hashar>	 FileNotFoundError: [Errno 2] ExtensionMessages not found in /srv/mediawiki-staging/wmf-config/ExtensionMessages-1.38.0-wmf.13.php: '/srv/mediawiki-staging/wmf-config/ExtensionMessages-1.38.0-wmf.13.php'
[20:20:17] <hashar>	 :(
[20:20:29] <urbanecm>	 yeah, missing scap sync-world in the process
[20:20:36] <urbanecm>	 i18n not getting built
[20:20:57] <hashar>	 I did sync-world earlier
[20:21:21] <hashar>	 anyway I can't seem to be able to rollback due to the above error bah
[20:21:27] <wikibugs>	 (03CR) 10Cathal Mooney: [C: 03+1] "LGTM" [homer/public] - 10https://gerrit.wikimedia.org/r/747175 (https://phabricator.wikimedia.org/T295804) (owner: 10BBlack)
[20:21:31] <urbanecm>	 hashar: did you use --force?
[20:21:32] <majavah>	 --force?
[20:21:52] <hashar>	 ditto
[20:22:03] <hashar>	 or maybe I can sync-file wikiversions.json
[20:22:15] <wikibugs>	 (03CR) 10BBlack: [C: 03+2] eqiad lvs_neighbors: swap lvs1020 for lvs1016 [homer/public] - 10https://gerrit.wikimedia.org/r/747175 (https://phabricator.wikimedia.org/T295804) (owner: 10BBlack)
[20:22:49] <wikibugs>	 (03Merged) 10jenkins-bot: eqiad lvs_neighbors: swap lvs1020 for lvs1016 [homer/public] - 10https://gerrit.wikimedia.org/r/747175 (https://phabricator.wikimedia.org/T295804) (owner: 10BBlack)
[20:23:05] <urbanecm>	 hashar: that won't work
[20:23:13] <urbanecm>	 It has a compile step in it
[20:23:19] <hashar>	 yeah :-\
[20:23:35] <urbanecm>	 Compiling manually and syncing the php version should work
[20:23:38] <hashar>	 so I gotta trigger a rebuild of the l10n 
[20:23:56] <hashar>	 cause of course we no longer have the l10n update helper in scap :/
[20:23:56] <urbanecm>	 That'll fail for same reason
[20:24:06] <icinga-wm>	 PROBLEM - Check systemd state on cloudweb2001-dev is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:24:39] <hashar>	 FileNotFoundError: [Errno 2] Directory not found: '/srv/mediawiki-staging/php-1.38.0-wmf.13/cache/l10n'
[20:24:40] <hashar>	 :-(
[20:24:45] <hashar>	 so clearly I am in trouble
[20:24:54] <urbanecm>	 Just group0, fortunately
[20:25:00] <urbanecm>	 I can try to fix it myself in a minute
[20:25:17] <hashar>	 I don't even understand why the l10n cache did not get built in the first place
[20:26:35] <majavah>	 dancy: ^ any chance that's related to the new scap version?
[20:27:42] <icinga-wm>	 ACKNOWLEDGEMENT - Check systemd state on cloudweb2001-dev is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service andrew bogott I dont know what this is, but mediawiki behavior in codfw1dev barely matters -- its a rapidly deprecating test/dev site. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:27:50] <majavah>	 mediawiki.org is still down BTW
[20:27:57] <urbanecm>	 yeah
[20:27:59] <urbanecm>	 opened my laptop
[20:28:31] <hashar>	 !log group0 wikis (eg mediawiki.org) are unavailable due to a deployment issue. We are working on it # T293954
[20:28:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:28:36] <stashbot>	 T293954: 1.38.0-wmf.13 deployment blockers - https://phabricator.wikimedia.org/T293954
[20:29:04] <rzl>	 here if you need anything from SRE, staying hands-off otherwise
[20:29:09] <urbanecm>	 compiled php version
[20:29:16] <urbanecm>	 hashar: mind if i try to sync it?
[20:29:20] <brennen>	 here if there's anything i can help with.
[20:29:24] <hashar>	 please try yes
[20:29:33] <wikibugs>	 (03CR) 10Dzahn: [C: 03+1] debian mirrors: add new mirror, mirror1001 in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/745612 (https://phabricator.wikimedia.org/T286898) (owner: 10JHathaway)
[20:29:37] <hashar>	 cause I can't find a way to rebuild the localization cache from scratch
[20:30:31] <urbanecm>	 doing
[20:30:34] <hashar>	 the issue might be that I ran `scap sync-world` while wmf.13 was not listed in wikiversions.json
[20:30:44] <urbanecm>	 sounds plausible
[20:30:48] <hashar>	 which I guess might not have caused the generation of the l10n cache
[20:31:04] <brennen>	 yeah, if the testwikis step was missed, that would make sense.
[20:31:07] <hashar>	 so when I then run sync-wikiversions there is no l10n cache pushed anywhere and the sites explode
[20:31:12] <logmsgbot>	 !log urbanecm@deploy1002 Synchronized wikiversions.php: rollback group0 (duration: 00m 41s)
[20:31:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:31:16] <hashar>	 oh
[20:31:18] <urbanecm>	 wikis are up
[20:31:37] <hashar>	 I did the git revert directly on the deploy server
[20:31:40] <urbanecm>	 yup
[20:31:41] <hashar>	 will send it to gerrit
[20:31:43] <urbanecm>	 made use of that
[20:31:48] <icinga-wm>	 RECOVERY - Check systemd state on cloudweb2001-dev is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:32:54] <wikibugs>	 (03PS1) 10Hashar: Revert "group0 wikis to 1.38.0-wmf.13  refs T293954" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/747202
[20:33:17] <urbanecm>	 FTR, i did `sudo -u mwdeploy cp /srv/mediawiki-staging/wikiversions.json /srv/mediawiki/wikiversions.json`, then `scap wikiversions-compile`, then `cp /srv/mediawiki/wikiversions.php /srv/mediawiki-staging/wikiversions.php` followed by `scap sync-file --force wikiversions.php 'rollback group0'`
[20:33:29] <wikibugs>	 (03CR) 10Hashar: [C: 03+2] "Deployed by Urbanecm" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/747202 (owner: 10Hashar)
[20:33:48] <hashar>	 that is clever urbanecm !
[20:34:08] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "group0 wikis to 1.38.0-wmf.13  refs T293954" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/747202 (owner: 10Hashar)
[20:34:20] <hashar>	 so now I guess I should bump testwiki to wmf.13
[20:34:25] <hashar>	 and run scap sync-world
[20:34:33] <hashar>	 which would trigger the l10n cache build for wmf.13
[20:34:38] <wikibugs>	 (03CR) 10Dzahn: [C: 03+1] "alright, yea, let's ship it" [puppet] - 10https://gerrit.wikimedia.org/r/747128 (https://phabricator.wikimedia.org/T297144) (owner: 10Herron)
[20:34:51] <hashar>	 !log Group 0 wikis are available again and still on 1.38.0-wmf.12
[20:34:52] <urbanecm>	 !log Manually rollback group0 to wmf.12 by running `sudo -u mwdeploy cp /srv/mediawiki-staging/wikiversions.json /srv/mediawiki/wikiversions.json && scap wikiversions-compile && cp /srv/mediawiki/wikiversions.php /srv/mediawiki-staging/wikiversions.php && scap sync-file --force wikiversions.php 'rollback group0'`
[20:34:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:34:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:35:09] <urbanecm>	 log'ed the magic sequence, as it's an unusual operation to perform
[20:35:17] * urbanecm logs off from deployment infra
[20:35:50] <logmsgbot>	 !log hashar@deploy1002 Started scap: testwiki to php-1.38.0-wmf.13 and rebuild l10n cache
[20:35:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:36:00] <urbanecm>	 hashar: what you suggested as next steps make sense to me
[20:36:06] <hashar>	 doing that
[20:36:10] <urbanecm>	 great
[20:36:28] <hashar>	 so my rookie mistake is that I did a sync-world without any entries in wikiversions.json being at wmf.13
[20:36:34] <hashar>	 thus syncing solely the code
[20:36:40] <icinga-wm>	 PROBLEM - Check unit status of httpbb_hourly_appserver on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[20:36:42] <urbanecm>	 sounds like it
[20:37:04] <hashar>	 20:36:51 Updating ExtensionMessages-1.38.0-wmf.13.php
[20:37:04] <hashar>	 20:36:51 Updating LocalisationCache for 1.38.0-wmf.13 using 30 thread(s)
[20:37:12] <urbanecm>	 sounds about right
[20:37:23] <hashar>	 urbanecm: please order yourself a "I fixed the website" t-shirt :]
[20:37:36] <urbanecm>	 where do i do that? :))
[20:37:43] <hashar>	 no idea haha
[20:38:15] <hashar>	 ages ago we had a t-shirt "I broke wikipedia and I fixed it"
[20:38:17] <subbu>	 Isn't it "I broke wikipedia and I fixed it" .. you each get half of it .. lol.
[20:38:27] <hashar>	 which was not to brag about breaking wikipedia, cause at the time it was super easy to do
[20:38:40] <hashar>	 but really that one managed to fix it using any leverage needed
[20:38:59] <hashar>	 the key point being to be totally transparent about what has happened including being honest about mistakes
[20:39:09] <hashar>	 the second point is screaming for help as soon as possible :D
[20:39:19] <urbanecm>	 i think both things happened here :))
[20:39:26] <brennen>	 i'm now imagining a little 2-piece friendship necklace like were popular in the 90s with half of the phrase on each side.
[20:39:38] <hashar>	 :wiki_love: :D
[20:39:47] <urbanecm>	 leaving it to hashar now
[20:39:52] <hashar>	 I would be very honored to share such a necklace with urbanecm 
[20:40:38] <hashar>	 urbanecm: thank you so much. I am back on track!
[20:40:41] <wikibugs>	 (03PS1) 10BBlack: pybal: peer all eqiad lvses with eqiad routers [puppet] - 10https://gerrit.wikimedia.org/r/747203 (https://phabricator.wikimedia.org/T295804)
[20:41:06] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[20:41:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:41:30] <hashar>	 and the canary error was probably legit but since it only got detected on one canary that was not enough to abort
[20:41:38] <hashar>	 I guess cause group0 does not have that much traffic
[20:42:35] <wikibugs>	 (03CR) 10BBlack: [C: 03+2] pybal: peer all eqiad lvses with eqiad routers [puppet] - 10https://gerrit.wikimedia.org/r/747203 (https://phabricator.wikimedia.org/T295804) (owner: 10BBlack)
[20:43:17] <wikibugs>	 (03CR) 10Cathal Mooney: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/747203 (https://phabricator.wikimedia.org/T295804) (owner: 10BBlack)
[20:43:21] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[20:43:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:43:35] <hashar>	 brennen: dancy: thank you :)
[20:44:28] <icinga-wm>	 RECOVERY - PyBal BGP sessions are established on lvs1020 is OK: NaN https://wikitech.wikimedia.org/wiki/PyBal https://grafana.wikimedia.org/dashboard/db/pybal-bgp?var-datasource=eqiad+prometheus/ops
[20:45:00] <majavah>	 NaN?
[20:45:16] <bblack>	 https://en.wikipedia.org/wiki/NaN
[20:45:36] <bblack>	 don't ask me why it reports the NaN or why that's ok, but the thing it's checking is actually functioning :)
[20:45:59] <AntiComposite>	 the linked grafana dashboard ain't much help either :)
[20:46:08] <majavah>	 that's exactly what I was asking :-)
[20:47:27] <majavah>	 AntiComposite: it actually is, once you get the datasource and server parameters fixed
[20:47:34] <icinga-wm>	 RECOVERY - Check unit status of httpbb_hourly_appserver on cumin1001 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[20:48:32] <dancy>	 hashar: Sorry about the trouble.  Reading scrollback to see what happened.
[20:49:12] <hashar>	 dancy: I did the scap prep and then a scap sync-world but wmf.13 was not in wikiversions.json so the l10n cache did not get built
[20:49:19] <hashar>	 which I guess is working as expected
[20:49:23] <dancy>	 ooh, edge case
[20:49:37] <hashar>	 the issue is that I should have promoted "testwiki" to wmf.13 which would have caused the sync-world to build the cache
[20:49:52] <hashar>	 then an hour ago I did the scap wikiversions which well only synced that file
[20:50:01] <hashar>	 promoting all of group0 wikis to wmf.13 as expected
[20:50:06] <hashar>	 but without any l10n cache :-\
[20:50:40] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute for appserver on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[20:50:46] <hashar>	 so yeah I should have followed the process to the letter :/
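A minimal sketch of the failure mode hashar describes above, assuming wikiversions.json is a flat mapping from wiki database name to a version directory such as php-1.38.0-wmf.13. The function names and the sample data are hypothetical and are not part of scap. The only point illustrated is that a branch no wiki references yet gives the sync no reason to build its l10n cache.

```python
import json


def versions_in_use(wikiversions):
    """Return the set of version strings that at least one wiki points at."""
    return set(wikiversions.values())


def l10n_cache_expected(target, wikiversions):
    """True if some wiki already references `target`.

    Assumption: a sync only builds the l10n cache for versions that
    appear in the mapping, as described in the conversation above.
    """
    return target in versions_in_use(wikiversions)


# Hypothetical sample data, not the real production file.
SAMPLE = json.loads("""
{
  "testwiki": "php-1.38.0-wmf.12",
  "mediawikiwiki": "php-1.38.0-wmf.12"
}
""")

if __name__ == "__main__":
    target = "php-1.38.0-wmf.13"
    if not l10n_cache_expected(target, SAMPLE):
        print(f"{target} is not referenced by any wiki; promote testwiki first")
```

Promoting a single wiki (testwiki in the log) to the new branch before the full sync makes the check pass, which matches the order of the documented process.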
[20:50:50] <dancy>	 nod.
[20:50:52] <dancy>	 hugs
[20:51:05] <bblack>	 trying to make sense of all the backlog, I've been off working on unrelated things: are we fully-ok now on whatever happened with train stuff?
[20:51:16] <hashar>	 yup
[20:51:30] <bblack>	 ok thanks
[20:51:34] <hashar>	 I have headed back to the start of the process and am now doing the testwiki update
[20:51:49] <hashar>	 then will promote group0 to wmf.13
[20:52:27] <bblack>	 ok
[20:52:42] <bblack>	 I have a semi-risky test to execute on the lvs stuff, but I'll wait till after you're done just in case
[20:53:06] <hashar>	 I will let you know as soon as it has completed
[20:53:36] <hashar>	 sync is roughly 40% done
[20:57:22] <icinga-wm>	 PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[20:59:34] <dancy>	 majavah: Looks like I'm off the hook!
[21:02:48] <icinga-wm>	 RECOVERY - SSH on contint1001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[21:04:41] <wikibugs>	 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (Kanban): connect 2nd cloudcontrol200x-dev NIC to vlan 2105 - https://phabricator.wikimedia.org/T297588 (10Papaul) a:05Papaul→03None
[21:06:00] <wikibugs>	 (03PS1) 10Eric Gardner: Don't attempt to scroll to a non-existing result [extensions/MediaSearch] (wmf/1.38.0-wmf.13) - 10https://gerrit.wikimedia.org/r/747078
[21:09:37] <logmsgbot>	 !log hashar@deploy1002 Finished scap: testwiki to php-1.38.0-wmf.13 and rebuild l10n cache (duration: 33m 47s)
[21:09:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:09:46] <hashar>	 now promoting group0 wikis
[21:10:20] <hashar>	 $ ~/release/bin/deploy-promote group0
[21:10:20] <hashar>	 Promote group0 from 1.38.0-wmf.12 to 1.38.0-wmf.13 refs T293954
[21:10:20] <stashbot>	 T293954: 1.38.0-wmf.13 deployment blockers - https://phabricator.wikimedia.org/T293954
[21:10:28] <hashar>	 this time deploy promote works as expected
[21:10:34] <wikibugs>	 (03PS1) 10Hashar: group0 wikis to 1.38.0-wmf.13  refs T293954 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/747204
[21:10:36] <dancy>	 👍🏾
[21:10:36] <wikibugs>	 (03CR) 10Hashar: [C: 03+2] group0 wikis to 1.38.0-wmf.13  refs T293954 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/747204 (owner: 10Hashar)
[21:10:45] <hashar>	 I feel dumb really
[21:10:54] <dancy>	 Flying too close to the sun
[21:11:06] <hashar>	 I have been running that for ages and I still manage to screw up something whenever I try to outsmart the process
[21:11:09] <hashar>	 yeah
[21:11:13] <wikibugs>	 (03Merged) 10jenkins-bot: group0 wikis to 1.38.0-wmf.13  refs T293954 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/747204 (owner: 10Hashar)
[21:11:41] <hashar>	 funnily I was saying this week-end that holidays and week-end are usually super quiet
[21:11:53] <dancy>	 hehe
[21:11:56] <hashar>	 indicating that humans touching computers are the root cause of all issues and outages
[21:12:13] <hashar>	 and that really we should be replaced by cron jobs :D
[21:12:17] <dancy>	 nod. then squirrels.
[21:13:36] <hashar>	 logstash has 67k queries missing for mediawiki.org from 20:15 to 20:30
[21:15:20] <hashar>	 a huge chunk of them being for /w/index.php?title=Special:HideBanners&duration=604800&category=fundraising&reason=close
[21:15:50] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[21:15:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:18:11] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[21:18:12] <logmsgbot>	 !log hashar@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.38.0-wmf.13  refs T293954
[21:18:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:18:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:18:19] <stashbot>	 T293954: 1.38.0-wmf.13 deployment blockers - https://phabricator.wikimedia.org/T293954
[21:19:24] <icinga-wm>	 PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[21:19:48] <icinga-wm>	 PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[21:19:55] <hashar>	 bblack: I have updated the group0 wikis to wmf.13 and there is no log error so it is probably a good one :]
[21:19:58] <icinga-wm>	 PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 131, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[21:20:02] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 44, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[21:20:16] <bblack>	 hashar: thanks!
[21:20:41] <hashar>	 I will go off for the night soonish
[21:20:59] <dancy>	 I'll keep watch
[21:21:00] <hashar>	 and dancy is the backup if something needs assistance on the mediawiki train end
[21:21:50] <jinxer-wm>	 (Traffic on tunnel link) firing: Traffic on tunnel link   - https://alerts.wikimedia.org
[21:22:34] <bblack>	 how did we lose two links?
[21:23:39] <wikibugs>	 10SRE, 10serviceops, 10Wikimedia-production-error: wtp* hosts: Out of memory (allocated 39845888) (tried to allocate 131072 bytes) in OutputHandler.php - https://phabricator.wikimedia.org/T297517 (10hashar) After some deployment issue, 1.38.0-wmf.13 has reached group 0 wikis.
[21:25:16] <icinga-wm>	 PROBLEM - Time elapsed since the last kafka event processed by purged on cp5006 is CRITICAL: 2.483e+05 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5006
[21:25:16] <icinga-wm>	 PROBLEM - Time elapsed since the last kafka event processed by purged on cp5001 is CRITICAL: 2.7e+04 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5001
[21:25:18] <icinga-wm>	 PROBLEM - Time elapsed since the last kafka event processed by purged on cp5013 is CRITICAL: 2.283e+04 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5013
[21:25:42] <icinga-wm>	 PROBLEM - Time elapsed since the last kafka event processed by purged on cp5012 is CRITICAL: 2.298e+04 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5012
[21:25:48] <icinga-wm>	 PROBLEM - Time elapsed since the last kafka event processed by purged on cp5015 is CRITICAL: 2.604e+04 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5015
[21:26:00] <icinga-wm>	 PROBLEM - Time elapsed since the last kafka event processed by purged on cp5016 is CRITICAL: 2.243e+04 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5016
[21:26:00] <icinga-wm>	 PROBLEM - Time elapsed since the last kafka event processed by purged on cp5004 is CRITICAL: 2.382e+05 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5004
[21:26:06] <icinga-wm>	 PROBLEM - Time elapsed since the last kafka event processed by purged on cp5011 is CRITICAL: 9763 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5011
[21:26:06] <icinga-wm>	 PROBLEM - Time elapsed since the last kafka event processed by purged on cp5008 is CRITICAL: 2.288e+04 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5008
[21:26:06] <icinga-wm>	 PROBLEM - Time elapsed since the last kafka event processed by purged on cp5010 is CRITICAL: 2.418e+04 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5010
[21:26:18] <icinga-wm>	 PROBLEM - Time elapsed since the last kafka event processed by purged on cp5005 is CRITICAL: 2.412e+05 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5005
[21:26:30] <icinga-wm>	 PROBLEM - Time elapsed since the last kafka event processed by purged on cp5003 is CRITICAL: 2.814e+04 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5003
[21:26:30] <icinga-wm>	 PROBLEM - Time elapsed since the last kafka event processed by purged on cp5014 is CRITICAL: 2.492e+04 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5014
[21:26:32] <icinga-wm>	 PROBLEM - Time elapsed since the last kafka event processed by purged on cp5002 is CRITICAL: 5.544e+04 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5002
[21:27:10] <icinga-wm>	 PROBLEM - Time elapsed since the last kafka event processed by purged on cp5007 is CRITICAL: 1.804e+04 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5007
[21:27:18] <icinga-wm>	 PROBLEM - Time elapsed since the last kafka event processed by purged on cp5009 is CRITICAL: 3.278e+04 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5009
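The alerts above report a measured value against a fixed limit ("2.483e+05 gt 5000"), and the recovery messages elsewhere in this log list two limits, (C)5000 and (W)3000. A sketch of that comparison, assuming a strict greater-than and leaving the unit unspecified because the log does not state it. `classify` is a hypothetical helper, not the actual monitoring check.

```python
def classify(value, warn=3000.0, crit=5000.0):
    """Map a measured value to OK / WARNING / CRITICAL.

    The thresholds are the (W) and (C) figures quoted in the alert text;
    the strict '>' comparison is an assumption based on the word 'gt'.
    """
    if value > crit:
        return "CRITICAL"
    if value > warn:
        return "WARNING"
    return "OK"


if __name__ == "__main__":
    # Values copied from alert lines in this log.
    for raw in ("2.483e+05", "9763", "1683"):
        print(raw, classify(float(raw)))
```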
[21:28:20] <icinga-wm>	 RECOVERY - Uncommitted DNS changes in Netbox on netbox1001 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[21:29:39] <wikibugs>	 (03PS1) 10Legoktm: Revert "Replace deprecated methods IContextSource::getWikiPage && IContextSource::canUseWikiPage" [core] (wmf/1.38.0-wmf.13) - 10https://gerrit.wikimedia.org/r/747079 (https://phabricator.wikimedia.org/T297744)
[21:29:49] <wikibugs>	 (03PS1) 10Legoktm: Revert "Replace deprecated methods IContextSource::getWikiPage && IContextSource::canUseWikiPage" [core] (wmf/1.38.0-wmf.12) - 10https://gerrit.wikimedia.org/r/747080 (https://phabricator.wikimedia.org/T297744)
[21:30:23] <bblack>	 kafka brokers and purge events, something funky is going on
[21:30:31] <bblack>	 anyone have an idea?
[21:30:40] <bblack>	 https://grafana.wikimedia.org/d/RvscY1CZk/purged?orgId=1&var-datasource=eqsin%20prometheus%2Fops&var-instance=cp5001
[21:30:52] <bblack>	 seems like codfw datacenter is sending purges, was only eqiad before
[21:31:02] <bblack>	 and latency is up in kafka
[21:31:03] <ottomata>	 i repooled codfw eventgate-main today
[21:31:07] <ottomata>	 must be related
[21:31:27] <ottomata>	 https://phabricator.wikimedia.org/T296699
[21:31:39] <ottomata>	 but, just pooling it shouldn't matter...
[21:31:45] <bblack>	 looks like the graph event might've started ~3h ago
[21:31:50] <bblack>	 but just started causing those alerts above
[21:31:52] <ottomata>	 yeah that seems about right
[21:31:59] <ottomata>	 it is supposed to be active/active
[21:32:12] <ottomata>	 it just wasn't for  a while due to a bug somewhere
[21:32:13] <bblack>	 the broker latency didn't start spiking until ~15 mins ago though
[21:32:27] <hashar>	 dancy: I am off to bed. Logstash seems happy nothing concerning since I have promoted group 0 :]
[21:32:29] <ottomata>	 oh wow yeah
[21:32:39] <dancy>	 👍🏾
[21:32:47] <hashar>	 dancy: will do some triage tomorrow morning and file tasks as needed. But it seems to be a quiet train.  Have a good afternoon!
[21:32:55] <dancy>	 Have a good night!
[21:33:22] <wikibugs>	 (03CR) 10MSantos: [C: 03+1] maps: write tegola swift credentials out to file [puppet] - 10https://gerrit.wikimedia.org/r/746897 (https://phabricator.wikimedia.org/T292700) (owner: 10Hnowlan)
[21:37:02] <ottomata>	 as far as i can tell it is all on the consumer side
[21:37:25] <ottomata>	 its just eqsin?
[21:37:41] <ottomata>	 bblack:  do you know how the eqsin consumers are configured?  how do they know which main kafka cluster to consume from?
[21:41:18] <icinga-wm>	 RECOVERY - Time elapsed since the last kafka event processed by purged on cp5015 is OK: (C)5000 gt (W)3000 gt 0 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5015
[21:41:25] <ottomata>	 okay, found it in puppet
[21:41:26] <ottomata>	 profile::cache::purge::kafka_cluster_name
[21:41:50] <jinxer-wm>	 (Traffic on tunnel link) resolved: Traffic on tunnel link   - https://alerts.wikimedia.org
[21:41:58] <ottomata>	 most use main-eqiad, codfw and ulsfo use main-codfw
[21:42:09] <ottomata>	 bblack:  what's up with this link stuff?
[21:42:21] <ottomata>	 this is an eqsin  consumer reading from eqiad
[21:42:39] <ottomata>	 if there is link latency, that would cause increase in RTT and consumer latency, right?
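A small sketch of the mapping ottomata describes above: most sites consume purges from main-eqiad, while codfw and ulsfo use main-codfw. The dictionary and function names are hypothetical, the real setting lives in puppet under profile::cache::purge::kafka_cluster_name, and deriving the cluster's site from its name is an assumption.

```python
DEFAULT_CLUSTER = "main-eqiad"

# Sites that the conversation says consume from main-codfw instead.
CLUSTER_OVERRIDES = {
    "codfw": "main-codfw",
    "ulsfo": "main-codfw",
}


def purge_cluster_for(site):
    """Return the Kafka cluster a site's purge consumers would read from."""
    return CLUSTER_OVERRIDES.get(site, DEFAULT_CLUSTER)


def consumes_remotely(site):
    """True if the cluster sits in a different site than the consumer.

    Assumption: the part after 'main-' names the cluster's site.
    """
    cluster_site = purge_cluster_for(site).split("-", 1)[1]
    return cluster_site != site


if __name__ == "__main__":
    for site in ("eqiad", "codfw", "ulsfo", "eqsin"):
        print(site, purge_cluster_for(site), consumes_remotely(site))
```

Under these assumptions an eqsin consumer reads from eqiad over inter-site links, so added link latency would show up as consumer lag, which is the hypothesis raised in the conversation.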
[21:43:21] <wikibugs>	 10SRE, 10MediaWiki-Revision-backend, 10Performance-Team (Radar): Compress data at external storage - https://phabricator.wikimedia.org/T106386 (10Krinkle)
[21:45:46] <icinga-wm>	 RECOVERY - Time elapsed since the last kafka event processed by purged on cp5011 is OK: (C)5000 gt (W)3000 gt 0 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5011
[21:47:44] <bblack>	 ottomata: yeah I guess so, but the link stuff shouldn't have impacted eqsin, I don't think
[21:47:51] <bblack>	 maybe I'm missing something there
[21:48:17] <bblack>	 the latency for ulsfo should've increased, but not eqsin
[21:48:26] <bblack>	 (and even ulsfo, shouldn't be by that much
[21:48:27] <bblack>	 )
[21:51:58] <icinga-wm>	 PROBLEM - Time elapsed since the last kafka event processed by purged on cp5015 is CRITICAL: 3.155e+05 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5015
[21:53:21] <bblack>	 oh yeah, I guess the primary eqsin transport is via-ulsfo
[21:53:24] <bblack>	 so this impacts that as well
[21:53:33] <bblack>	 hmmmm
[21:54:44] <ottomata>	 ah
[21:56:36] <icinga-wm>	 PROBLEM - Time elapsed since the last kafka event processed by purged on cp5011 is CRITICAL: 5.217e+05 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5011
[21:57:36] <wikibugs>	 (03PS1) 10Jbond: populate_puppetdb: update tp use config class [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/747207
[22:13:01] <wikibugs>	 (03CR) 10Dzahn: gitlab: restore script keep_config options (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/741675 (https://phabricator.wikimedia.org/T274463) (owner: 10AOkoth)
[22:28:46] <icinga-wm>	 PROBLEM - puppet last run on wcqs1001 is CRITICAL: CRITICAL: Puppet has been disabled for 604921 seconds, message: Debugging nginx - jetty request handling - ebernhardson, last run 7 days ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[22:34:22] <legoktm>	 I'm going to deploy a security patch
[22:36:58] <wikibugs>	 10SRE, 10ops-eqdfw: cr2-eqdfw: PEM 1 Input Voltage Out Of Range flapping - https://phabricator.wikimedia.org/T294009 (10Papaul) 05Open→03Resolved There was a breaker problem. This is now resolved
[22:37:12] <wikibugs>	 10SRE, 10ops-codfw: Installation issues on PowerEdge R440 Ganeti servers with buster / firmware update needed - https://phabricator.wikimedia.org/T296856 (10Papaul)
[22:42:46] <jinxer-wm>	 (Traffic on tunnel link) firing: Traffic on tunnel link   - https://alerts.wikimedia.org
[22:47:28] <wikibugs>	 10SRE, 10MediaWiki-Revision-backend, 10Performance-Team: Compress data at external storage - https://phabricator.wikimedia.org/T106386 (10Krinkle)
[22:57:56] <icinga-wm>	 ACKNOWLEDGEMENT - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP Cathal Mooney Telia IC-331929 to cr3-eqsin down. - The acknowledgement expires at: 2021-12-16 09:00:50. https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[22:59:11] <icinga-wm>	 ACKNOWLEDGEMENT - OSPF status on cr3-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP Cathal Mooney Telia IC-331929 to cr1-codfw down - The acknowledgement expires at: 2021-12-15 22:58:51. https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[23:02:46] <jinxer-wm>	 (Traffic on tunnel link) resolved: Traffic on tunnel link   - https://alerts.wikimedia.org
[23:03:21] <icinga-wm>	 RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[23:03:35] <icinga-wm>	 RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 132, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[23:04:13] <icinga-wm>	 RECOVERY - Time elapsed since the last kafka event processed by purged on cp5016 is OK: (C)5000 gt (W)3000 gt 1683 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5016
[23:04:17] <icinga-wm>	 RECOVERY - Time elapsed since the last kafka event processed by purged on cp5010 is OK: (C)5000 gt (W)3000 gt 2038 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5010
[23:04:17] <icinga-wm>	 RECOVERY - Time elapsed since the last kafka event processed by purged on cp5008 is OK: (C)5000 gt (W)3000 gt 1168 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5008
[23:04:35] <icinga-wm>	 RECOVERY - Time elapsed since the last kafka event processed by purged on cp5005 is OK: (C)5000 gt (W)3000 gt 679 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5005
[23:04:51] <icinga-wm>	 RECOVERY - Time elapsed since the last kafka event processed by purged on cp5003 is OK: (C)5000 gt (W)3000 gt 445.7 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5003
[23:04:53] <icinga-wm>	 RECOVERY - Time elapsed since the last kafka event processed by purged on cp5014 is OK: (C)5000 gt (W)3000 gt 451.3 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5014
[23:04:53] <icinga-wm>	 RECOVERY - Time elapsed since the last kafka event processed by purged on cp5002 is OK: (C)5000 gt (W)3000 gt 315.5 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5002
[23:05:09] <icinga-wm>	 RECOVERY - Time elapsed since the last kafka event processed by purged on cp5006 is OK: (C)5000 gt (W)3000 gt 383.2 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5006
[23:05:11] <icinga-wm>	 RECOVERY - Time elapsed since the last kafka event processed by purged on cp5015 is OK: (C)5000 gt (W)3000 gt 390.4 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5015
[23:05:11] <icinga-wm>	 RECOVERY - Time elapsed since the last kafka event processed by purged on cp5012 is OK: (C)5000 gt (W)3000 gt 364.2 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5012
[23:05:41] <icinga-wm>	 RECOVERY - Time elapsed since the last kafka event processed by purged on cp5004 is OK: (C)5000 gt (W)3000 gt 418.4 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5004
[23:05:47] <icinga-wm>	 RECOVERY - Time elapsed since the last kafka event processed by purged on cp5011 is OK: (C)5000 gt (W)3000 gt 589.8 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5011
[23:08:49] <icinga-wm>	 RECOVERY - Time elapsed since the last kafka event processed by purged on cp5007 is OK: (C)5000 gt (W)3000 gt 354.2 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5007
[23:08:51] <icinga-wm>	 RECOVERY - Time elapsed since the last kafka event processed by purged on cp5013 is OK: (C)5000 gt (W)3000 gt 579.9 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5013
[23:10:03] <legoktm>	 !log deploying patch for T297416
[23:10:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:10:49] <icinga-wm>	 PROBLEM - Number of messages locally queued by purged for processing on cp5012 is CRITICAL: cluster=cache_text instance=cp5012 job=purged layer=frontend site=eqsin https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5012
[23:10:59] <icinga-wm>	 RECOVERY - Time elapsed since the last kafka event processed by purged on cp5001 is OK: (C)5000 gt (W)3000 gt 437.8 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5001
[23:12:33] <icinga-wm>	 RECOVERY - Number of messages locally queued by purged for processing on cp5012 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5012
[23:15:34] <bblack>	 !log lvs1014 (upload) - disabling pybal, will fail over traffic to lvs1020 (to test lvs1020 sanity)
[23:15:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:15:50] <bblack>	 (expect a couple of pybal/bgp alerts here)
[23:18:12] <wikibugs>	 (03PS1) 10Eric Gardner: Remove multiple instance of VUEX initialization [extensions/MediaSearch] (wmf/1.38.0-wmf.13) - 10https://gerrit.wikimedia.org/r/747081 (https://phabricator.wikimedia.org/T297690)
[23:20:19] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1014 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal
[23:20:46] <bblack>	 ^ expected
[23:21:13] <icinga-wm>	 PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[23:21:19] <icinga-wm>	 PROBLEM - pybal on lvs1014 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal
[23:21:39] <icinga-wm>	 PROBLEM - PyBal connections to etcd on lvs1014 is CRITICAL: CRITICAL: 0 connections established with conf1004.eqiad.wmnet:4001 (min=36) https://wikitech.wikimedia.org/wiki/PyBal
[23:22:09] <icinga-wm>	 RECOVERY - Time elapsed since the last kafka event processed by purged on cp5009 is OK: (C)5000 gt (W)3000 gt 380.4 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5009
[23:26:27] <bblack>	 !log lvs1014 (upload) restart pybal, back to normal
[23:26:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:27:49] <icinga-wm>	 RECOVERY - BGP status on cr1-eqiad is OK: BGP OK - up: 99, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[23:27:53] <icinga-wm>	 RECOVERY - PyBal connections to etcd on lvs1014 is OK: OK: 36 connections established with conf1004.eqiad.wmnet:4001 (min=36) https://wikitech.wikimedia.org/wiki/PyBal
[23:27:55] <icinga-wm>	 RECOVERY - pybal on lvs1014 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal
[23:28:39] <bblack>	 !log lvs1013 (text) - disabling pybal, will fail over traffic to lvs1020 (to test lvs1020 sanity)
[23:28:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:29:03] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[23:33:39] <icinga-wm>	 PROBLEM - pybal on lvs1013 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal
[23:33:41] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1013 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal
[23:34:05] <icinga-wm>	 PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[23:34:23] <icinga-wm>	 PROBLEM - PyBal connections to etcd on lvs1013 is CRITICAL: CRITICAL: 0 connections established with conf1004.eqiad.wmnet:4001 (min=12) https://wikitech.wikimedia.org/wiki/PyBal
[23:34:28] <bblack>	 ^ again all expected
[23:35:44] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "merging this. also checked in codesearch it's not used in other code" [puppet] - 10https://gerrit.wikimedia.org/r/736596 (https://phabricator.wikimedia.org/T294802) (owner: 10Dzahn)
[23:38:14] <wikibugs>	 (03CR) 10Dzahn: "before:" [puppet] - 10https://gerrit.wikimedia.org/r/736596 (https://phabricator.wikimedia.org/T294802) (owner: 10Dzahn)
[23:39:31] <wikibugs>	 10SRE, 10serviceops, 10Patch-For-Review: parsoid servers are not matched by mw* cumin aliases - https://phabricator.wikimedia.org/T294802 (10Dzahn) the "all-mw-*" aliases now include parsoid servers:   ` before:  [cumin1001:~] $ sudo cumin A:all-mw-eqiad 'uptime' 157 hosts will be targeted: mw[1302-1456].eqi...
[23:41:42] <wikibugs>	 10SRE, 10serviceops, 10Patch-For-Review: parsoid servers are not matched by mw* cumin aliases - https://phabricator.wikimedia.org/T294802 (10Dzahn) 05Stalled→03Resolved I did add them to "all-mw" while not touching core "mw". Based on Gerrit comments etc. Hope this still resolves it!
[23:44:09] <bblack>	 !log lvs1013 (text) restart pybal, back to normal
[23:44:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:44:41] <icinga-wm>	 RECOVERY - pybal on lvs1013 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal
[23:44:45] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[23:45:09] <icinga-wm>	 RECOVERY - BGP status on cr1-eqiad is OK: BGP OK - up: 99, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[23:46:49] <icinga-wm>	 RECOVERY - PyBal connections to etcd on lvs1013 is OK: OK: 12 connections established with conf1004.eqiad.wmnet:4001 (min=12) https://wikitech.wikimedia.org/wiki/PyBal
[23:49:09] <bblack>	 !log lvs1015 (internal services) - disabling pybal, will fail over traffic to lvs1020 (to test lvs1020 sanity)
[23:49:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:51:35] <wikibugs>	 (03CR) 10Catrope: [C: 03+2] Remove multiple instance of VUEX initialization [extensions/MediaSearch] (wmf/1.38.0-wmf.13) - 10https://gerrit.wikimedia.org/r/747081 (https://phabricator.wikimedia.org/T297690) (owner: 10Eric Gardner)
[23:51:50] <wikibugs>	 (03CR) 10Catrope: [C: 03+2] Don't attempt to scroll to a non-existing result [extensions/MediaSearch] (wmf/1.38.0-wmf.13) - 10https://gerrit.wikimedia.org/r/747078 (owner: 10Eric Gardner)
[23:51:59] <wikibugs>	 (03CR) 10Catrope: [C: 03+2] Revert "Replace deprecated methods IContextSource::getWikiPage && IContextSource::canUseWikiPage" [core] (wmf/1.38.0-wmf.13) - 10https://gerrit.wikimedia.org/r/747079 (https://phabricator.wikimedia.org/T297744) (owner: 10Legoktm)
[23:52:04] <wikibugs>	 (03CR) 10Catrope: [C: 03+2] Revert "Replace deprecated methods IContextSource::getWikiPage && IContextSource::canUseWikiPage" [core] (wmf/1.38.0-wmf.12) - 10https://gerrit.wikimedia.org/r/747080 (https://phabricator.wikimedia.org/T297744) (owner: 10Legoktm)
[23:52:42] <RoanKattouw>	 --^ Early +2s for the upcoming backport window, because CI takes forever to merge them
[23:53:09] <wikibugs>	 SRE, Observability-Logging, Patch-For-Review: Move logstash api-feature-usage output away from v5 cluster - https://phabricator.wikimedia.org/T297239 (colewhite) I tested Logstash 7.10 writing api feature usage logs to an ES 6 instance in cloud.  Somewhere in the pipeline, the api feature usage logs...
[23:53:42] <wikibugs>	 (CR) Andrew Bogott: [C: +2] cloudmetrics: make cloudmetrics1003 the primary, 1004 the secondary [puppet] - https://gerrit.wikimedia.org/r/745950 (https://phabricator.wikimedia.org/T289888) (owner: Andrew Bogott)
[23:53:47] <EricGardner>	 Thanks Roan
[23:54:01] <icinga-wm>	 PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[23:54:11] <wikibugs>	 (CR) Andrew Bogott: [C: +2] Replace cloudmetrics1001 with cloudmetrics1003 [dns] - https://gerrit.wikimedia.org/r/747174 (https://phabricator.wikimedia.org/T297712) (owner: Andrew Bogott)
[23:54:41] <icinga-wm>	 PROBLEM - pybal on lvs1015 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal
[23:54:57] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1015 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal
[23:55:35] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[23:56:47] <icinga-wm>	 PROBLEM - PyBal connections to etcd on lvs1015 is CRITICAL: CRITICAL: 0 connections established with conf1004.eqiad.wmnet:4001 (min=77) https://wikitech.wikimedia.org/wiki/PyBal
[23:57:33] <bblack>	 does the restbase-dev1005 alert have any known cause?
[23:57:47] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[23:58:31] <bblack>	 hopefully just a blip!