[00:00:30] (03CR) 10RLazarus: [V: 03+1 C: 03+2] imagecatalog: Pass cluster names along with config paths [puppet] - 10https://gerrit.wikimedia.org/r/748799 (https://phabricator.wikimedia.org/T287130) (owner: 10RLazarus) [00:03:00] 10SRE-swift-storage, 10Commons, 10affects-Kiwix-and-openZIM: JPEG image is reported with the wrong mime-type application/octet-stream - https://phabricator.wikimedia.org/T298011 (10BilalShirwani) [00:03:33] 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.2 point update - https://phabricator.wikimedia.org/T298021 (10BilalShirwani) [00:05:36] 10SRE, 10Domains, 10Phabricator, 10serviceops-radar: The phab.wiki domain redirect suddenly outputs "404, this domain is not configured" - https://phabricator.wikimedia.org/T298041 (10BilalShirwani) [00:07:25] 10SRE, 10Domains, 10Phabricator, 10serviceops-radar: The phab.wiki domain redirect suddenly outputs "404, this domain is not configured" - https://phabricator.wikimedia.org/T298041 (10JJMC89) 05duplicate→03Open [00:12:55] 10SRE: Allow Wikimedia Maps usage on wikijournal.org - https://phabricator.wikimedia.org/T297948 (10JJMC89) 05duplicate→03Stalled [00:13:44] 10SRE-swift-storage, 10Observability-Metrics, 10serviceops: thanos-be hosts filing up root filesystem with logs - https://phabricator.wikimedia.org/T297959 (10JJMC89) 05duplicate→03Open [00:14:28] 10SRE: Allow Wikimedia Maps usage on bbcrewind.co.uk - https://phabricator.wikimedia.org/T297968 (10JJMC89) 05duplicate→03Open [00:17:52] 10SRE-swift-storage, 10Commons, 10affects-Kiwix-and-openZIM: JPEG image is reported with the wrong mime-type application/octet-stream - https://phabricator.wikimedia.org/T298011 (10JJMC89) 05duplicate→03Open [00:18:41] 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.2 point update - https://phabricator.wikimedia.org/T298021 (10JJMC89) 05duplicate→03Open [00:28:48] 10SRE, 10LDAP-Access-Requests: Grant Access to Logstash for Zabe - https://phabricator.wikimedia.org/T297323 (10Dzahn) Alright, 
thanks for approving @thcipriani! @Zabe I am on clinic duty this week, handling access requests. Since you have been approved we'll now continue with the "volunteer NDA" process (htt... [00:30:23] 10SRE, 10SRE-Access-Requests, 10Product-Analytics: Requesting access to Superset for Spatel - https://phabricator.wikimedia.org/T297927 (10Dzahn) a:03Dzahn [00:30:32] 10SRE, 10SRE-Access-Requests, 10Product-Analytics: Requesting access to Superset for Spatel - https://phabricator.wikimedia.org/T297927 (10Dzahn) 05Open→03In progress [00:36:46] 10SRE, 10ops-codfw: ms-be2065 failed drive sdq - https://phabricator.wikimedia.org/T297933 (10Papaul) Create Dispatch: Success You have successfully submitted request SR1079308386. [00:42:37] (03PS1) 10Dzahn: admin: give Sneha Patel access to Superset/Hive UIs w/ private data [puppet] - 10https://gerrit.wikimedia.org/r/748854 (https://phabricator.wikimedia.org/T297927) [00:43:57] (03PS2) 10Dzahn: admin: give Sneha Patel access to Superset/Hive UIs w/ private data [puppet] - 10https://gerrit.wikimedia.org/r/748854 (https://phabricator.wikimedia.org/T297927) [00:49:29] 10SRE, 10Maps: Allow Wikimedia Maps usage on bbcrewind.co.uk - https://phabricator.wikimedia.org/T297968 (10Dzahn) [00:50:08] 10SRE, 10Maps: Allow Wikimedia Maps usage on bbcrewind.co.uk - https://phabricator.wikimedia.org/T297968 (10Dzahn) [00:50:20] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=sidekiq site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [00:51:18] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/mobile-html/{title} (Get page content HTML for test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [00:51:33] 10SRE, 10Maps: Allow Wikimedia Maps usage on wikijournal.org - https://phabricator.wikimedia.org/T297948 (10Dzahn) 
[00:52:26] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [00:52:53] 10SRE, 10Domains, 10Phabricator, 10serviceops-radar: The phab.wiki domain redirect suddenly outputs "404, this domain is not configured" - https://phabricator.wikimedia.org/T298041 (10Dzahn) 05Open→03In progress [00:53:18] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [00:53:49] 10SRE, 10Domains, 10Phabricator, 10serviceops-radar, 10User-revi: The phab.wiki domain redirect suddenly outputs "404, this domain is not configured" - https://phabricator.wikimedia.org/T298041 (10Dzahn) a:03revi [01:06:46] !log depooling mw1312 for benchmarking [01:06:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:07:22] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q2:(Need By: TBD) rack/setup/install mc20[38-55] - https://phabricator.wikimedia.org/T294962 (10wiki_willy) Thanks @Joe, I think that covers what we need. (cc @Papaul) >>! In T294962#7580115, @Joe wrote: > Yes sorry, I dropped the ball on this. > > We need th... 
[01:24:40] (03PS1) 10RLazarus: Return a set, not a list, from active_images() [docker-images/imagecatalog] - 10https://gerrit.wikimedia.org/r/748873 [01:28:35] (03PS1) 10RLazarus: imagecatalog: Add an hourly systemd timer to scan for what's currently running [puppet] - 10https://gerrit.wikimedia.org/r/748876 (https://phabricator.wikimedia.org/T287130) [01:29:14] (03CR) 10jerkins-bot: [V: 04-1] imagecatalog: Add an hourly systemd timer to scan for what's currently running [puppet] - 10https://gerrit.wikimedia.org/r/748876 (https://phabricator.wikimedia.org/T287130) (owner: 10RLazarus) [01:30:39] (03PS2) 10RLazarus: imagecatalog: Add an hourly systemd timer to scan for what's currently running [puppet] - 10https://gerrit.wikimedia.org/r/748876 (https://phabricator.wikimedia.org/T287130) [01:33:06] (03PS1) 10JHathaway: exim: add the ability to silently drop senders [puppet] - 10https://gerrit.wikimedia.org/r/748884 (https://phabricator.wikimedia.org/T298038) [01:34:51] (03CR) 10RLazarus: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33061/console" [puppet] - 10https://gerrit.wikimedia.org/r/748876 (https://phabricator.wikimedia.org/T287130) (owner: 10RLazarus) [01:37:15] (03CR) 10JHathaway: "Regarding, https://phabricator.wikimedia.org/T298038, here is an alternative take to the system_filter, both should work, but his one is a" [puppet] - 10https://gerrit.wikimedia.org/r/748884 (https://phabricator.wikimedia.org/T298038) (owner: 10JHathaway) [02:04:46] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[02:04:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:06:57] (03PS1) 10TrainBranchBot: Branch commit for wmf/1.38.0-wmf.14 [core] (wmf/1.38.0-wmf.14) - 10https://gerrit.wikimedia.org/r/748887 [02:07:01] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/1.38.0-wmf.14 [core] (wmf/1.38.0-wmf.14) - 10https://gerrit.wikimedia.org/r/748887 (owner: 10TrainBranchBot) [02:08:57] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [02:09:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:10:54] (03PS4) 10Legoktm: mediawiki: Redirect Special:CodeReview to static archives [puppet] - 10https://gerrit.wikimedia.org/r/724049 (https://phabricator.wikimedia.org/T205361) (owner: 10Majavah) [02:12:59] (03PS5) 10Legoktm: mediawiki: Redirect Special:CodeReview to static archives [puppet] - 10https://gerrit.wikimedia.org/r/724049 (https://phabricator.wikimedia.org/T205361) (owner: 10Majavah) [02:14:03] (03CR) 10Legoktm: "PS4: Address joe's comments. PS5: match against Special:Code/MediaWiki/r123 too (note the optional "r" prefix)." [puppet] - 10https://gerrit.wikimedia.org/r/724049 (https://phabricator.wikimedia.org/T205361) (owner: 10Majavah) [02:26:41] (03Merged) 10jenkins-bot: Branch commit for wmf/1.38.0-wmf.14 [core] (wmf/1.38.0-wmf.14) - 10https://gerrit.wikimedia.org/r/748887 (owner: 10TrainBranchBot) [02:27:32] !log repooling mw1312 [02:27:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:29:06] !log depooling mw1450 [02:29:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:32:59] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[02:33:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:33:52] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [02:33:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:30:14] PROBLEM - SSH on rdb1006.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [05:27:39] 10SRE, 10LDAP-Access-Requests: Grant Access to Logstash for Zabe - https://phabricator.wikimedia.org/T297323 (10KFrancis) Thanks @Dzahn, @Zabe Please send me the following info: Full legal name Mailing address Email address If you prefer, you can send it to kfrancis@wikimedia.org [05:29:34] (03PS1) 10Marostegui: Revert "install_server: Allow reimage dbproxy2004" [puppet] - 10https://gerrit.wikimedia.org/r/748282 [05:30:38] (03CR) 10Marostegui: [C: 03+2] Revert "install_server: Allow reimage dbproxy2004" [puppet] - 10https://gerrit.wikimedia.org/r/748282 (owner: 10Marostegui) [05:32:26] RECOVERY - SSH on rdb1006.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [05:37:38] (03CR) 10Marostegui: [C: 03+1] "just a minor thing, feel free to ignore" [software] - 10https://gerrit.wikimedia.org/r/748723 (https://phabricator.wikimedia.org/T288235) (owner: 10Ladsgroup) [05:38:14] (03CR) 10Marostegui: [C: 03+1] "ha, interesting way of detecting the active." [software] - 10https://gerrit.wikimedia.org/r/748726 (https://phabricator.wikimedia.org/T288235) (owner: 10Ladsgroup) [06:34:14] 10SRE, 10Domains, 10Phabricator, 10serviceops-radar, 10User-revi: The phab.wiki domain redirect suddenly outputs "404, this domain is not configured" - https://phabricator.wikimedia.org/T298041 (10revi) 05In progress→03Resolved OOOOOOOOHHHHHHHHHH Was doing some cleanups for my now-defunct server (wh... 
[06:36:00] <_< [06:38:00] 10SRE, 10Domains, 10Phabricator, 10serviceops-radar, 10User-revi: The phab.wiki domain redirect suddenly outputs "404, this domain is not configured" - https://phabricator.wikimedia.org/T298041 (10revi) (To be honest, I am very surprised that someone other than me was using the redirection, lol) [07:13:18] (03CR) 10Legoktm: "Why not https://en.wikipedia.org/w/api.php?action=query&meta=siteinfo&formatversion=2 and check "wmfMasterDatacenter"?" [software] - 10https://gerrit.wikimedia.org/r/748726 (https://phabricator.wikimedia.org/T288235) (owner: 10Ladsgroup) [07:44:20] (03CR) 10Giuseppe Lavagetto: "This code is python, so it should be really easy to use conftool libraries to actually fecth the information from the datastore directly. " [software] - 10https://gerrit.wikimedia.org/r/748726 (https://phabricator.wikimedia.org/T288235) (owner: 10Ladsgroup) [07:58:54] PROBLEM - SSH on bast5002 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:00:58] RECOVERY - SSH on bast5002 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:11:03] (03PS1) 10Ema: Revert "cache: enable single backend experiment on cp4021" [puppet] - 10https://gerrit.wikimedia.org/r/749131 (https://phabricator.wikimedia.org/T288106) [08:13:25] (03CR) 10Ema: [V: 03+1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33062/console" [puppet] - 10https://gerrit.wikimedia.org/r/749131 (https://phabricator.wikimedia.org/T288106) (owner: 10Ema) [08:14:35] !log cp4021: depool to revert single backend experiment T288106 [08:14:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:14:42] T288106: Experiment with single backend CDN nodes - https://phabricator.wikimedia.org/T288106 [08:15:05] (03CR) 10Ema: [V: 03+1 C: 03+2] Revert "cache: enable single backend experiment on cp4021" [puppet] - 
10https://gerrit.wikimedia.org/r/749131 (https://phabricator.wikimedia.org/T288106) (owner: 10Ema) [08:28:51] (03CR) 10Nikerabbit: [C: 03+1] Set ContentTranslationContentImportForSectionTranslation for SX [mediawiki-config] - 10https://gerrit.wikimedia.org/r/747794 (https://phabricator.wikimedia.org/T294642) (owner: 10KartikMistry) [08:29:12] !log cp4021: pool with single backend experiment reverted T288106 [08:29:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:29:17] T288106: Experiment with single backend CDN nodes - https://phabricator.wikimedia.org/T288106 [08:32:41] (03PS1) 10Ema: Revert "cache: enable single backend experiment on cp3051" [puppet] - 10https://gerrit.wikimedia.org/r/749132 (https://phabricator.wikimedia.org/T288106) [08:40:25] (03CR) 10JMeybohm: "Apart from a nit in commit message, I think we're good to go with this." [deployment-charts] - 10https://gerrit.wikimedia.org/r/742909 (owner: 10Varac) [08:41:58] (03CR) 10JMeybohm: Kubernetes 1.22 support, update chart version (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/742909 (owner: 10Varac) [08:43:02] (03CR) 10Ema: [V: 03+1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33063/console" [puppet] - 10https://gerrit.wikimedia.org/r/749132 (https://phabricator.wikimedia.org/T288106) (owner: 10Ema) [08:45:41] !log cp3051: depool to revert single backend experiment T288106 [08:45:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:45:46] T288106: Experiment with single backend CDN nodes - https://phabricator.wikimedia.org/T288106 [08:46:00] (03CR) 10Ema: [V: 03+1 C: 03+2] Revert "cache: enable single backend experiment on cp3051" [puppet] - 10https://gerrit.wikimedia.org/r/749132 (https://phabricator.wikimedia.org/T288106) (owner: 10Ema) [08:50:48] !log cp3051: pool with single backend experiment reverted T288106 [08:50:52] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:50:53] T288106: Experiment with single backend CDN nodes - https://phabricator.wikimedia.org/T288106 [09:01:05] (03CR) 10Varac: Kubernetes 1.22 support, update chart version (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/742909 (owner: 10Varac) [09:03:52] (03CR) 10Volans: [C: 03+2] "Self-merging to test it with some new hosts, @jbond I'd be happy to address any post-merge comment in a followup commit." [cookbooks] - 10https://gerrit.wikimedia.org/r/748761 (owner: 10Volans) [09:06:38] (03Merged) 10jenkins-bot: sre.hosts.provision: refactor to be more flexible [cookbooks] - 10https://gerrit.wikimedia.org/r/748761 (owner: 10Volans) [09:07:48] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "check" [deployment-charts] - 10https://gerrit.wikimedia.org/r/742909 (owner: 10Varac) [09:07:56] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti2007.codfw.wmnet with OS buster [09:08:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:08:01] 10SRE, 10Infrastructure-Foundations: Migrate codfw Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296622 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti2007.codfw.wmnet with OS buster [09:08:07] _joe_: recheck not check if you want to run jenkins [09:08:18] <_joe_> majavah: yeah sigh [09:08:28] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/742909 (owner: 10Varac) [09:08:41] <_joe_> I also forgot to change my vote to +1 [09:09:40] (03CR) 10Giuseppe Lavagetto: "I think you can abandon this patch. 
Sorry for the additional useless work 😞" [deployment-charts] - 10https://gerrit.wikimedia.org/r/748734 (owner: 10Varac) [09:14:36] (03CR) 10JMeybohm: Kubernetes 1.22 support, update chart version (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/742909 (owner: 10Varac) [09:17:50] (03CR) 10Giuseppe Lavagetto: [C: 03+2] "The diffs in https://integration.wikimedia.org/ci/job/helm-lint/6424/console LGTM, merging." [deployment-charts] - 10https://gerrit.wikimedia.org/r/742909 (owner: 10Varac) [09:17:58] _joe_: there is another thing about this change that is generally interesting [09:18:38] <_joe_> jayme: I have an idea, but please go on :) [09:18:52] apiVersion is a read-only field :-) [09:19:14] oh, you merged already...the commit msg is a bit misleading still [09:19:50] apiVersion being read-only means that chart upgrades will fail (at least I think so) [09:20:12] <_joe_> uhm why would they? [09:20:28] because you can't change the apiVersion of an existing object [09:20:31] <_joe_> doesn't helm detect that and replace the resource? [09:20:31] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:20:53] <_joe_> jayme: interesting anyways, can we test on staging I guess [09:20:54] helm2 did not IIRC, not sure about helm3 [09:21:07] <_joe_> yeah one hopes some lessons were learned [09:21:13] haha [09:21:19] <_joe_> but sure, we can try, and btw [09:21:32] <_joe_> I think charts.wikimedia.org needs a catalog [09:21:45] (03Merged) 10jenkins-bot: Kubernetes 1.22 support, update chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/742909 (owner: 10Varac) [09:21:49] charts.wikimedia.org does not even exist :-p [09:22:04] <_joe_> oh it's chartmuseum? so sad [09:22:28] <_joe_> or what else? 
[09:22:39] <_joe_> tell me we didn't go with softwarename.wikimedia.org [09:22:43] it's https://helm-charts.wikimedia.org/ - I'm just joking [09:23:01] IIRC we had the helm-charts name before chartmuseum [09:23:14] <_joe_> so, does chartmuseum have a web proxy in front? [09:23:27] it's behind our caches [09:23:35] <_joe_> if it does, we can generate a simple set of banner pages I guess [09:23:57] (03CR) 10Ayounsi: [C: 03+2] "Thanks!" [homer/public] - 10https://gerrit.wikimedia.org/r/748775 (https://phabricator.wikimedia.org/T282787) (owner: 10Majavah) [09:24:13] it just has envoy in front for tls termination [09:24:36] (03Merged) 10jenkins-bot: Add drmrs addresses [homer/public] - 10https://gerrit.wikimedia.org/r/748775 (https://phabricator.wikimedia.org/T282787) (owner: 10Majavah) [09:25:15] (03CR) 10Ayounsi: "We added capacity to eqiad since. I believe this patch is not needed anymore." [dns] - 10https://gerrit.wikimedia.org/r/574493 (owner: 10CDanis) [09:25:16] <_joe_> ok, and that would mean we'll have to implement an additional route which means modifying the standard tls terminator. Sometimes standardizing stuff sucks :D [09:26:33] aiui you want some kind of frontend, like we have for the docker-registry? [09:26:42] <_joe_> yes [09:27:11] <_joe_> https://github.com/chartmuseum/ui uhm [09:27:20] maybe there is an easy way to mirror to artifacthub [09:27:48] ah, yeah...there is this ui as well [09:27:51] <_joe_> I don't like how artifacthub generates vuln reports :P [09:27:56] hrhr [09:28:12] <_joe_> "oh look they have libc 2.28!" except it's patched [09:28:40] <_joe_> I fear they're using clair, if clair is that bad at detecting false positives it's basically useless for us :/ [09:30:04] publishers can disable security scanning. 
[09:30:29] <_joe_> yeah I see a lot of our charts there published by $people [09:30:56] I meant to have it more like a searchable catalog (for visibility) / UI rather than to use the other capabilities [09:31:07] so that we don't have to deal with the UI stuff.. [09:31:22] (03CR) 10Ayounsi: Ferm: allow dhcp request from infra IPs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/737020 (https://phabricator.wikimedia.org/T282787) (owner: 10Ayounsi) [09:31:46] <_joe_> https://artifacthub.io/packages/search?repo=wikimedia&sort=relevance&page=1 [09:32:06] hmmm [09:32:22] so, nothing to do - move along :D [09:33:09] looks like an automated thingy, no? [09:33:22] <_joe_> some user submitted our charts repo there [09:33:47] that user submits random other stuff as well...so might be a bot [09:34:10] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti2007.codfw.wmnet with OS buster [09:34:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:34:14] 10SRE, 10Infrastructure-Foundations: Migrate codfw Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296622 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti2007.codfw.wmnet with OS buster completed: - ganeti2007 (**PASS**) - Downtimed on Icinga... [09:34:56] <_joe_> jayme: I'll just try to deploy eventrouter on staging [09:35:19] yeah, feel free [09:35:53] <_joe_> I guess I need to be root to deploy eventrouter right? [09:37:29] !log oblivian@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [09:37:31] !log oblivian@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [09:37:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:37:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:37:46] _joe_: yes. 
It's part of admin_ng [09:37:52] <_joe_> oh and I see sudo is not enough [09:38:29] !log oblivian@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [09:38:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:38:44] !log oblivian@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [09:38:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:40:38] <_joe_> jayme: upgrade went without a hiccup [09:41:51] <_joe_> not doing everywhere else as it's basically a noop, or should I? [09:42:13] did you see a diff on apply? [09:43:04] PROBLEM - Debian mirror in sync with upstream on mirror1001 is CRITICAL: /srv/mirrors/debian is over 14 hours old. https://wikitech.wikimedia.org/wiki/Mirrors [09:43:06] "kubectl -n kube-system get clusterroles.rbac.authorization.k8s.io eventrouter -o yaml" shows the object unchanged for 292 days and already having apiVersion: rbac.authorization.k8s.io/v1 [09:43:24] <_joe_> jayme: on which cluster? [09:43:33] <_joe_> and yes, I did see the diff [09:43:41] <_joe_> and it included the change of the apiVersion [09:43:41] staging eqiad [09:44:04] ah, sorry. creation is 292d [09:44:10] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:44:39] <_joe_> but yes, it's interesting to see that the objects in codfw already have the correct apiVersion... [09:46:49] yeah. 
There might be "something" involved already during creation...but good for us I guess [09:48:17] afaik kubernetes stores it just as a ClusterRole and the api versions only get involved when creating or deleting them, meaning that you can write it as v1beta1 and then use the v1 api to get a v1 object out [09:48:33] only get involved when accessing them* [09:50:32] hm...but I remember having to actually migrate deployments at some point [09:51:00] and there must be some level of validation as well depending on the apiVersion [09:51:36] (03CR) 10Ayounsi: "What do you think of using `$facts['lldp']['parent']` everywhere instead of a duplicate `lldp_parent` ?" [puppet] - 10https://gerrit.wikimedia.org/r/715242 (https://phabricator.wikimedia.org/T289679) (owner: 10Jbond) [09:52:24] (03PS3) 10Ayounsi: lldp: update lldp_parent to use lldp['parent'] [puppet] - 10https://gerrit.wikimedia.org/r/715242 (https://phabricator.wikimedia.org/T289679) (owner: 10Jbond) [09:53:07] _joe_: regarding artifacthub: The docs say we can claim ownership of our repos if we want to... would be nice to know how Varac stumbled upon the eventrouter chart :) [09:54:09] (03Abandoned) 10Ayounsi: CHANGELOG: add changelogs for release v0.2.7 [software/homer] - 10https://gerrit.wikimedia.org/r/681350 (owner: 10Ayounsi) [09:57:48] (03PS1) 10PipelineBot: shellbox-constraints: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/749141 [09:59:41] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2007.codfw.wmnet [09:59:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:59:50] 10SRE-tools, 10Discovery, 10Infrastructure-Foundations, 10IPv6: Some Search Platform / Discovery clusters apparently do not support IPv6 - https://phabricator.wikimedia.org/T271143 (10Volans) Current status is that all newly provisioned hosts that are still in STAGED status in Netbox have the AAAA record... 
[10:01:56] (03PS1) 10PipelineBot: shellbox: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/749142 [10:02:52] ^^ legoktm: should I be doing anything to these auto chart promote commits? afaik there should not be any changes to the production image [10:03:08] no, I'll abandon them in a minute [10:03:34] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2007.codfw.wmnet [10:03:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:05:37] (03PS1) 10PipelineBot: shellbox-timeline: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/749143 [10:05:47] (03Abandoned) 10Ayounsi: ganeti-netbox-sync: Add post-sync PuppetDB import where necessary [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/645212 (https://phabricator.wikimedia.org/T263768) (owner: 10CRusnov) [10:06:45] (03CR) 10Jbond: [C: 03+1] CAS: Update to 6.4.4.2 [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/748722 (owner: 10Muehlenhoff) [10:13:36] !log disabling puppet on mx1001 T298038 [10:13:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:15:06] (03CR) 10Jcrespo: [C: 03+2] exim: update system filter [puppet] - 10https://gerrit.wikimedia.org/r/748820 (https://phabricator.wikimedia.org/T298038) (owner: 10Herron) [10:15:44] (03PS1) 10Jbond: mirrors.wikimedia.org: Add new mirror server to dmz_cidr [puppet] - 10https://gerrit.wikimedia.org/r/749146 (https://phabricator.wikimedia.org/T286898) [10:16:26] 10SRE-tools, 10Discovery, 10Discovery-Search, 10Infrastructure-Foundations, 10IPv6: Some Search Platform / Discovery clusters apparently do not support IPv6 - https://phabricator.wikimedia.org/T271143 (10Gehel) For the moment, elasticsearch is configured explicitly with the IPv4 address of the host. Int... 
[10:16:33] (03PS4) 10Jbond: mirrors.wikimedia.org: point to new mirror [dns] - 10https://gerrit.wikimedia.org/r/747933 (https://phabricator.wikimedia.org/T286898) (owner: 10JHathaway) [10:16:41] (03Abandoned) 10Legoktm: shellbox-timeline: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/749143 (owner: 10PipelineBot) [10:16:42] (03Abandoned) 10Legoktm: shellbox: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/749142 (owner: 10PipelineBot) [10:16:44] (03Abandoned) 10Legoktm: shellbox-constraints: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/749141 (owner: 10PipelineBot) [10:16:47] (03Abandoned) 10Legoktm: shellbox-timeline: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/748354 (owner: 10PipelineBot) [10:16:49] (03Abandoned) 10Legoktm: shellbox: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/748353 (owner: 10PipelineBot) [10:16:51] (03Abandoned) 10Legoktm: shellbox-constraints: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/748350 (owner: 10PipelineBot) [10:17:08] (03CR) 10Jbond: [C: 03+2] mirrors.wikimedia.org: Add new mirror server to dmz_cidr [puppet] - 10https://gerrit.wikimedia.org/r/749146 (https://phabricator.wikimedia.org/T286898) (owner: 10Jbond) [10:17:23] (03Abandoned) 10Legoktm: shellbox-timeline: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/740605 (owner: 10PipelineBot) [10:17:25] (03Abandoned) 10Legoktm: shellbox: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/740602 (owner: 10PipelineBot) [10:17:27] (03Abandoned) 10Legoktm: shellbox-constraints: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/740597 (owner: 10PipelineBot) [10:17:29] (03Abandoned) 10Legoktm: shellbox-timeline: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/736912 (owner: 10PipelineBot) [10:17:31] 
(03Abandoned) 10Legoktm: shellbox: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/736910 (owner: 10PipelineBot) [10:17:33] (03Abandoned) 10Legoktm: shellbox-constraints: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/736908 (owner: 10PipelineBot) [10:17:35] (03Abandoned) 10Legoktm: shellbox-timeline: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/736003 (owner: 10PipelineBot) [10:17:37] (03Abandoned) 10Legoktm: shellbox: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/736002 (owner: 10PipelineBot) [10:17:39] (03Abandoned) 10Legoktm: shellbox-constraints: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/736000 (owner: 10PipelineBot) [10:17:41] (03Abandoned) 10Legoktm: shellbox-timeline: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/734782 (owner: 10PipelineBot) [10:17:43] (03Abandoned) 10Legoktm: shellbox: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/734781 (owner: 10PipelineBot) [10:17:45] (03Abandoned) 10Legoktm: shellbox-constraints: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/734780 (owner: 10PipelineBot) [10:17:47] (03Abandoned) 10Legoktm: shellbox-timeline: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/734738 (owner: 10PipelineBot) [10:17:49] (03Abandoned) 10Legoktm: shellbox: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/734737 (owner: 10PipelineBot) [10:17:51] (03Abandoned) 10Legoktm: shellbox-constraints: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/734735 (owner: 10PipelineBot) [10:17:53] (03Abandoned) 10Legoktm: shellbox: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/732766 (owner: 10PipelineBot) [10:17:55] (03Abandoned) 10Legoktm: shellbox-constraints: pipeline bot promote [deployment-charts] - 
10https://gerrit.wikimedia.org/r/732765 (owner: 10PipelineBot) [10:18:04] legoktm: getting things done? :-p [10:18:26] (03CR) 10Majavah: "um, I think the plan here was to drop the exemption: https://phabricator.wikimedia.org/T298042" [puppet] - 10https://gerrit.wikimedia.org/r/749146 (https://phabricator.wikimedia.org/T286898) (owner: 10Jbond) [10:18:38] 10SRE, 10Maps: Allow Wikimedia Maps usage on bbcrewind.co.uk - https://phabricator.wikimedia.org/T297968 (10Jason.nlw) I can't speak to the technical implications of this task however in terms of mission alignment I would just like to add a little background to our Wikimedia work in Wales. In the last year we... [10:18:39] just cleared 20+ patches from my code review backlog! [10:18:53] (03Abandoned) 10Majavah: apple-search: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/748349 (owner: 10PipelineBot) [10:19:04] (03CR) 10Jbond: [C: 03+1] "+1 (assuming a confirmation of the inline Q)" [dns] - 10https://gerrit.wikimedia.org/r/747933 (https://phabricator.wikimedia.org/T286898) (owner: 10JHathaway) [10:19:20] 10SRE, 10Infrastructure-Foundations: Migrate codfw Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296622 (10MoritzMuehlenhoff) [10:19:48] (03PS2) 10Arturo Borrero Gonzalez: cloudgw: drop APT repositories NAT exception [puppet] - 10https://gerrit.wikimedia.org/r/748771 (https://phabricator.wikimedia.org/T298042) [10:22:14] (03CR) 10Majavah: mirrors.wikimedia.org: Add new mirror server to dmz_cidr (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/749146 (https://phabricator.wikimedia.org/T286898) (owner: 10Jbond) [10:23:32] !log volans@cumin1001 START - Cookbook sre.dns.netbox [10:23:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:23:38] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudgw: drop APT repositories NAT exception [puppet] - 10https://gerrit.wikimedia.org/r/748771 (https://phabricator.wikimedia.org/T298042) (owner: 
10Arturo Borrero Gonzalez) [10:24:37] (03CR) 10Jbond: [C: 03+1] admin: give Sneha Patel access to Superset/Hive UIs w/ private data [puppet] - 10https://gerrit.wikimedia.org/r/748854 (https://phabricator.wikimedia.org/T297927) (owner: 10Dzahn) [10:26:55] (03PS1) 10Giuseppe Lavagetto: Fix helmfile.d service examples [deployment-charts] - 10https://gerrit.wikimedia.org/r/749147 [10:26:57] (03PS1) 10Giuseppe Lavagetto: mwdebug: improve helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/749148 [10:28:25] (03CR) 10jerkins-bot: [V: 04-1] mwdebug: improve helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/749148 (owner: 10Giuseppe Lavagetto) [10:28:28] !log volans@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:28:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:28:39] (03PS2) 10Jbond: exim: add the ability to silently drop senders [puppet] - 10https://gerrit.wikimedia.org/r/748884 (https://phabricator.wikimedia.org/T298038) (owner: 10JHathaway) [10:29:42] !log volans@cumin1001 START - Cookbook sre.hosts.provision for host elastic1085.mgmt.eqiad.wmnet with reboot policy FORCED [10:29:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:30:06] (03Abandoned) 10Ayounsi: coherence: Alert on ACTIVE devices with names future- or spare. 
[software/netbox-reports] - 10https://gerrit.wikimedia.org/r/550051 (https://phabricator.wikimedia.org/T237464) (owner: 10CRusnov) [10:31:10] (03CR) 10Ayounsi: [C: 03+2] cloud: drop APT repositories NAT exception [homer/public] - 10https://gerrit.wikimedia.org/r/748774 (https://phabricator.wikimedia.org/T298042) (owner: 10Arturo Borrero Gonzalez) [10:31:16] (03CR) 10Jbond: "This looks good to me, however it seems like the initial issue may have been resolved in https://gerrit.wikimedia.org/r/c/operations/puppe" [puppet] - 10https://gerrit.wikimedia.org/r/748884 (https://phabricator.wikimedia.org/T298038) (owner: 10JHathaway) [10:31:44] (03Merged) 10jenkins-bot: cloud: drop APT repositories NAT exception [homer/public] - 10https://gerrit.wikimedia.org/r/748774 (https://phabricator.wikimedia.org/T298042) (owner: 10Arturo Borrero Gonzalez) [10:35:12] !log volans@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host elastic1085.mgmt.eqiad.wmnet with reboot policy FORCED [10:35:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:37:34] (03CR) 10Jbond: exim: add the ability to silently drop senders (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/748884 (https://phabricator.wikimedia.org/T298038) (owner: 10JHathaway) [10:41:28] (03CR) 10Jbond: sre.hosts.provision: refactor to be more flexible (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/748761 (owner: 10Volans) [10:43:55] (03PS1) 10Jbond: Revert "mirrors.wikimedia.org: Add new mirror server to dmz_cidr" [puppet] - 10https://gerrit.wikimedia.org/r/748284 [10:44:21] (03PS3) 10JMeybohm: Add support for returning bundles instead of certs from sign calls [software/cfssl-issuer] - 10https://gerrit.wikimedia.org/r/748143 (https://phabricator.wikimedia.org/T294560) [10:44:23] (03PS1) 10JMeybohm: Update simple-cfssl to wmf branch [software/cfssl-issuer] - 10https://gerrit.wikimedia.org/r/749151 (https://phabricator.wikimedia.org/T294560) [10:44:34] 
(03CR) 10jerkins-bot: [V: 04-1] Revert "mirrors.wikimedia.org: Add new mirror server to dmz_cidr" [puppet] - 10https://gerrit.wikimedia.org/r/748284 (owner: 10Jbond) [10:45:37] (03PS2) 10JMeybohm: Update simple-cfssl to wmf branch [software/cfssl-issuer] - 10https://gerrit.wikimedia.org/r/749151 (https://phabricator.wikimedia.org/T294560) [10:45:39] (03PS4) 10JMeybohm: Add support for returning bundles instead of certs from sign calls [software/cfssl-issuer] - 10https://gerrit.wikimedia.org/r/748143 (https://phabricator.wikimedia.org/T294560) [10:45:56] (03CR) 10JMeybohm: [V: 03+2 C: 03+2] Update simple-cfssl to wmf branch [software/cfssl-issuer] - 10https://gerrit.wikimedia.org/r/749151 (https://phabricator.wikimedia.org/T294560) (owner: 10JMeybohm) [10:47:51] (03CR) 10Jbond: [C: 03+1] mirrors.wikimedia.org: point to new mirror (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/747933 (https://phabricator.wikimedia.org/T286898) (owner: 10JHathaway) [10:51:02] (03CR) 10Jelto: "Change of SAL logging looks good to me." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/749147 (owner: 10Giuseppe Lavagetto) [10:53:46] !log volans@cumin1001 START - Cookbook sre.hosts.provision for host elastic1084.mgmt.eqiad.wmnet with reboot policy FORCED [10:53:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:59:00] (03PS1) 10Jbond: C:monitoring: uyse ['lldp']['parent'] instead of lldp_parent [puppet] - 10https://gerrit.wikimedia.org/r/749152 [10:59:02] (03PS1) 10Jbond: lldp: drop legacy facts [puppet] - 10https://gerrit.wikimedia.org/r/749153 (https://phabricator.wikimedia.org/T289679) [10:59:19] (03Abandoned) 10Jbond: lldp: update lldp_parent to use lldp['parent'] [puppet] - 10https://gerrit.wikimedia.org/r/715242 (https://phabricator.wikimedia.org/T289679) (owner: 10Jbond) [10:59:30] (03CR) 10Muehlenhoff: [V: 03+2 C: 03+2] CAS: Update to 6.4.4.2 [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/748722 (owner: 10Muehlenhoff) [10:59:51] !log volans@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host elastic1084.mgmt.eqiad.wmnet with reboot policy FORCED [10:59:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:00:33] !log reenabled puppet on mx1001 T298038 [11:00:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:02:07] !log volans@cumin1001 START - Cookbook sre.hosts.provision for host elastic1086.mgmt.eqiad.wmnet with reboot policy FORCED [11:02:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:06:17] !og imported cas 6.4.4.2 to apt.wikimedia.org/buster-wikimedia [11:08:08] moritzm: you missed the l on !log [11:10:03] !log volans@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host elastic1086.mgmt.eqiad.wmnet with reboot policy FORCED [11:10:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:10:16] (03CR) 10Varac: "Hi," [deployment-charts] - 
10https://gerrit.wikimedia.org/r/742909 (owner: 10Varac) [11:10:36] (03PS3) 10Arturo Borrero Gonzalez: wmcs: add print_output controls [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/748777 [11:17:02] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] wmcs: add print_output controls [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/748777 (owner: 10Arturo Borrero Gonzalez) [11:23:52] (03PS1) 10Volans: sre.hosts.provision: log the results of the import [cookbooks] - 10https://gerrit.wikimedia.org/r/749154 [11:25:10] (03CR) 10Volans: [C: 03+1] "LGTM if they are not used anymore both in prod and wmcs" [puppet] - 10https://gerrit.wikimedia.org/r/749153 (https://phabricator.wikimedia.org/T289679) (owner: 10Jbond) [11:26:00] !log imported cas 6.4.4.2 to apt.wikimedia.org/buster-wikimedia [11:26:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:26:05] oh indeed :-) [11:28:57] (03PS1) 10Majavah: aptrepo: Fix bullseye description [puppet] - 10https://gerrit.wikimedia.org/r/749156 [11:29:20] (03CR) 10Jbond: [C: 03+1] sre.hosts.provision: log the results of the import [cookbooks] - 10https://gerrit.wikimedia.org/r/749154 (owner: 10Volans) [11:29:30] (03CR) 10Volans: [C: 03+2] sre.hosts.provision: log the results of the import [cookbooks] - 10https://gerrit.wikimedia.org/r/749154 (owner: 10Volans) [11:31:38] (03PS1) 10Arturo Borrero Gonzalez: wmcs: toolforge: add_grid_webgrid_generic_node: refresh logger messages [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/749157 [11:32:02] (03Merged) 10jenkins-bot: sre.hosts.provision: log the results of the import [cookbooks] - 10https://gerrit.wikimedia.org/r/749154 (owner: 10Volans) [11:32:17] 10SRE, 10Maps: Allow Wikimedia Maps usage on bbcrewind.co.uk - https://phabricator.wikimedia.org/T297968 (10Aklapper) [11:34:36] (03CR) 10jerkins-bot: [V: 04-1] wmcs: toolforge: add_grid_webgrid_generic_node: refresh logger messages [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/749157 (owner: 
10Arturo Borrero Gonzalez) [11:40:26] (03CR) 10Jforrester: "This'd be awesome to land, but perhaps not during a general deployment freeze. :-) In the new year?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/723652 (owner: 10Legoktm) [11:42:08] (03PS2) 10Arturo Borrero Gonzalez: wmcs: toolforge: add_grid_webgrid_generic_node: refresh logger messages [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/749157 [11:44:02] !log volans@cumin1001 START - Cookbook sre.hosts.provision for host elastic1086.mgmt.eqiad.wmnet with reboot policy FORCED [11:44:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:45:05] !log volans@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host elastic1086.mgmt.eqiad.wmnet with reboot policy FORCED [11:45:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:45:40] !log volans@cumin1001 START - Cookbook sre.hosts.provision for host elastic1084.mgmt.eqiad.wmnet with reboot policy FORCED [11:45:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:46:00] (03PS2) 10Giuseppe Lavagetto: Fix helmfile.d service examples [deployment-charts] - 10https://gerrit.wikimedia.org/r/749147 [11:46:02] (03PS2) 10Giuseppe Lavagetto: mwdebug: improve helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/749148 [11:47:32] (03CR) 10jerkins-bot: [V: 04-1] mwdebug: improve helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/749148 (owner: 10Giuseppe Lavagetto) [11:48:43] (03CR) 10Ladsgroup: auto_schema: Automatic detection of active dc (031 comment) [software] - 10https://gerrit.wikimedia.org/r/748726 (https://phabricator.wikimedia.org/T288235) (owner: 10Ladsgroup) [11:49:48] (03CR) 10Volans: "Did a first pass, some question/comment inline" [puppet] - 10https://gerrit.wikimedia.org/r/747091 (owner: 10Jbond) [11:50:11] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] wmcs: toolforge: add_grid_webgrid_generic_node: refresh logger 
messages [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/749157 (owner: 10Arturo Borrero Gonzalez) [11:50:40] (03PS1) 10Arturo Borrero Gonzalez: wmcs: toolforge: add_grid_webgrid_generic_node: longer instance start timeout [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/749161 [11:50:46] !log volans@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host elastic1084.mgmt.eqiad.wmnet with reboot policy FORCED [11:50:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:52:44] !log volans@cumin1001 START - Cookbook sre.hosts.provision for host elastic1087.mgmt.eqiad.wmnet with reboot policy FORCED [11:52:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:54:27] (03CR) 10Jbond: [C: 03+1] aptrepo: Fix bullseye description [puppet] - 10https://gerrit.wikimedia.org/r/749156 (owner: 10Majavah) [11:55:02] (03CR) 10Muehlenhoff: [C: 03+2] "Thanks! Merging" [puppet] - 10https://gerrit.wikimedia.org/r/749156 (owner: 10Majavah) [11:56:11] (03PS10) 10Jbond: reposync: add initial repo sync class and profile [puppet] - 10https://gerrit.wikimedia.org/r/747091 [11:56:17] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] wmcs: toolforge: add_grid_webgrid_generic_node: longer instance start timeout [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/749161 (owner: 10Arturo Borrero Gonzalez) [11:56:51] (03CR) 10jerkins-bot: [V: 04-1] reposync: add initial repo sync class and profile [puppet] - 10https://gerrit.wikimedia.org/r/747091 (owner: 10Jbond) [11:58:31] !log volans@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host elastic1087.mgmt.eqiad.wmnet with reboot policy FORCED [11:58:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:01:11] (03PS3) 10Ladsgroup: auto_schema: Automatic detection of active dc [software] - 10https://gerrit.wikimedia.org/r/748726 (https://phabricator.wikimedia.org/T288235) [12:04:40] (03PS2) 10Ladsgroup: auto_schema: 
Refactor bash to make it a bit cleaner [software] - 10https://gerrit.wikimedia.org/r/748723 (https://phabricator.wikimedia.org/T288235) [12:04:59] (03CR) 10Ladsgroup: [C: 03+2] auto_schema: Refactor bash to make it a bit cleaner (031 comment) [software] - 10https://gerrit.wikimedia.org/r/748723 (https://phabricator.wikimedia.org/T288235) (owner: 10Ladsgroup) [12:05:34] (03Merged) 10jenkins-bot: auto_schema: Refactor bash to make it a bit cleaner [software] - 10https://gerrit.wikimedia.org/r/748723 (https://phabricator.wikimedia.org/T288235) (owner: 10Ladsgroup) [12:11:47] !log volans@cumin1001 START - Cookbook sre.hosts.provision for host elastic1087.mgmt.eqiad.wmnet with reboot policy FORCED [12:11:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:15:09] (03PS2) 10Muehlenhoff: Remove LDAP entry which is already present for shell access [puppet] - 10https://gerrit.wikimedia.org/r/747872 [12:19:54] !log volans@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host elastic1087.mgmt.eqiad.wmnet with reboot policy FORCED [12:19:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:21:22] (03CR) 10Muehlenhoff: [C: 03+2] Remove LDAP entry which is already present for shell access [puppet] - 10https://gerrit.wikimedia.org/r/747872 (owner: 10Muehlenhoff) [12:21:58] (03Abandoned) 10Muehlenhoff: druid: Pass -Dlog4j2.formatMsgNoLookups=true to JVM opts [puppet] - 10https://gerrit.wikimedia.org/r/746834 (owner: 10Muehlenhoff) [12:22:23] (03PS2) 10Muehlenhoff: sre.ganeti.addnode: Pass the Ganeti group to gnt-node add [cookbooks] - 10https://gerrit.wikimedia.org/r/743356 [12:29:44] 10SRE, 10DBA, 10observability, 10Patch-For-Review, 10User-Ladsgroup: Send metrics of db errors of mediawiki to prometheus - https://phabricator.wikimedia.org/T297435 (10Marostegui) [12:29:48] 10SRE, 10DBA, 10observability, 10Sustainability (Incident Followup): Monitor/dashboard number of queries killed by the automatic 
query killer - https://phabricator.wikimedia.org/T293531 (10Marostegui) [12:30:43] 10SRE, 10DBA, 10observability, 10Sustainability (Incident Followup): Monitor/dashboard number of queries killed by the automatic query killer - https://phabricator.wikimedia.org/T293531 (10Marostegui) I have merged this into T297435 after talking to Amir. This per se won't add much value as we already have... [12:34:10] (03PS3) 10Muehlenhoff: sre.ganeti.addnode: Pass the Ganeti group to gnt-node add [cookbooks] - 10https://gerrit.wikimedia.org/r/743356 [12:37:46] (03PS11) 10Jbond: reposync: add initial repo sync class and profile [puppet] - 10https://gerrit.wikimedia.org/r/747091 [12:38:17] (03CR) 10Jbond: reposync: add initial repo sync class and profile (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/747091 (owner: 10Jbond) [12:38:46] (03CR) 10jerkins-bot: [V: 04-1] reposync: add initial repo sync class and profile [puppet] - 10https://gerrit.wikimedia.org/r/747091 (owner: 10Jbond) [12:38:59] (03CR) 10Jbond: [C: 03+2] P:environment: Add a simple zshrc file to the home dir [puppet] - 10https://gerrit.wikimedia.org/r/747891 (owner: 10Jbond) [12:39:05] (03PS8) 10Jbond: P:environment: Add a simple zshrc file to the home dir [puppet] - 10https://gerrit.wikimedia.org/r/747891 [12:39:41] (03PS3) 10Jbond: O:cluster::management: Add reposync [puppet] - 10https://gerrit.wikimedia.org/r/747855 (https://phabricator.wikimedia.org/T229397) [12:50:47] (03PS4) 10Jbond: O:cluster::management: Add reposync [puppet] - 10https://gerrit.wikimedia.org/r/747855 (https://phabricator.wikimedia.org/T229397) [12:51:44] (03CR) 10jerkins-bot: [V: 04-1] O:cluster::management: Add reposync [puppet] - 10https://gerrit.wikimedia.org/r/747855 (https://phabricator.wikimedia.org/T229397) (owner: 10Jbond) [12:52:18] (03CR) 10JMeybohm: Kubernetes 1.22 support, update chart version (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/742909 (owner: 10Varac) [12:54:23] (03PS12) 10Jbond: 
reposync: add initial repo sync class [puppet] - 10https://gerrit.wikimedia.org/r/747091 [12:54:25] (03PS5) 10Jbond: O:cluster::management: Add reposync [puppet] - 10https://gerrit.wikimedia.org/r/747855 (https://phabricator.wikimedia.org/T229397) [12:55:11] (03CR) 10jerkins-bot: [V: 04-1] reposync: add initial repo sync class [puppet] - 10https://gerrit.wikimedia.org/r/747091 (owner: 10Jbond) [12:55:37] (03CR) 10jerkins-bot: [V: 04-1] O:cluster::management: Add reposync [puppet] - 10https://gerrit.wikimedia.org/r/747855 (https://phabricator.wikimedia.org/T229397) (owner: 10Jbond) [12:56:03] (03CR) 10Jelto: [C: 03+1] "looks good to me now. I will proceed with the cleanup of helmfiles soon." [deployment-charts] - 10https://gerrit.wikimedia.org/r/749147 (owner: 10Giuseppe Lavagetto) [12:57:08] (03PS13) 10Jbond: reposync: add initial repo sync class [puppet] - 10https://gerrit.wikimedia.org/r/747091 [12:57:15] (03PS6) 10Jbond: O:cluster::management: Add reposync [puppet] - 10https://gerrit.wikimedia.org/r/747855 (https://phabricator.wikimedia.org/T229397) [12:57:25] (03Abandoned) 10Cparle: Enable changes to mediasearch tab order [mediawiki-config] - 10https://gerrit.wikimedia.org/r/738293 (https://phabricator.wikimedia.org/T284208) (owner: 10Seddon) [12:59:09] (03CR) 10jerkins-bot: [V: 04-1] O:cluster::management: Add reposync [puppet] - 10https://gerrit.wikimedia.org/r/747855 (https://phabricator.wikimedia.org/T229397) (owner: 10Jbond) [12:59:25] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM!" [homer/public] - 10https://gerrit.wikimedia.org/r/747551 (owner: 10Ayounsi) [13:02:43] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM! Should be same effect as previously." 
[homer/public] - 10https://gerrit.wikimedia.org/r/748080 (https://phabricator.wikimedia.org/T273865) (owner: 10Ayounsi) [13:02:45] (03PS7) 10Jbond: O:cluster::management: Add reposync [puppet] - 10https://gerrit.wikimedia.org/r/747855 (https://phabricator.wikimedia.org/T229397) [13:03:07] (03CR) 10Jbond: "Ready for review, i have moved the profile stuff to the next PS" [puppet] - 10https://gerrit.wikimedia.org/r/747091 (owner: 10Jbond) [13:03:25] (03CR) 10Jbond: [C: 03+1] "ready for review" [puppet] - 10https://gerrit.wikimedia.org/r/747855 (https://phabricator.wikimedia.org/T229397) (owner: 10Jbond) [13:11:15] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/summary/{title} (Get summary for test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [13:11:30] (03PS1) 10Majavah: O:puppetdb: fix motd to match actual class name [puppet] - 10https://gerrit.wikimedia.org/r/749171 [13:13:21] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [13:15:28] (03CR) 10Jbond: [C: 03+2] P:environment: Add a simple zshrc file to the home dir (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/747891 (owner: 10Jbond) [13:15:48] (03CR) 10Cathal Mooney: [C: 03+1] "Looks good! Nice work." 
[homer/public] - 10https://gerrit.wikimedia.org/r/748098 (https://phabricator.wikimedia.org/T273865) (owner: 10Ayounsi) [13:17:07] (03CR) 10Jbond: [C: 03+2] "LGTM will merge thanks" [puppet] - 10https://gerrit.wikimedia.org/r/749171 (owner: 10Majavah) [13:19:02] !log volans@cumin1001 START - Cookbook sre.hosts.provision for host elastic1088.mgmt.eqiad.wmnet with reboot policy FORCED [13:19:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:54] !log volans@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host elastic1088.mgmt.eqiad.wmnet with reboot policy FORCED [13:24:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:18] 10SRE, 10Analytics-Radar, 10Infrastructure-Foundations, 10netops: Review the Analytics Firewall rules on cr1/cr2 - https://phabricator.wikimedia.org/T157806 (10elukey) [13:33:50] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM!" [homer/public] - 10https://gerrit.wikimedia.org/r/748111 (https://phabricator.wikimedia.org/T273865) (owner: 10Ayounsi) [13:38:41] (03PS1) 10Arturo Borrero Gonzalez: wmcs: better support for GridQueueInfo OK status [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/749174 [13:41:03] !log installing vim security updates on bullseye [13:41:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:40] (03CR) 10jerkins-bot: [V: 04-1] wmcs: better support for GridQueueInfo OK status [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/749174 (owner: 10Arturo Borrero Gonzalez) [13:48:30] (03PS2) 10Arturo Borrero Gonzalez: wmcs: better support for GridQueueInfo OK status [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/749174 [14:02:24] (03PS1) 10Ladsgroup: Add MySQL upgrade cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/749176 (https://phabricator.wikimedia.org/T239814) [14:05:04] (03CR) 10jerkins-bot: [V: 04-1] Add MySQL upgrade cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/749176 
(https://phabricator.wikimedia.org/T239814) (owner: 10Ladsgroup) [14:06:34] (03PS2) 10Ladsgroup: Add MySQL upgrade cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/749176 (https://phabricator.wikimedia.org/T239814) [14:13:42] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] wmcs: better support for GridQueueInfo OK status [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/749174 (owner: 10Arturo Borrero Gonzalez) [14:16:42] (03Merged) 10jenkins-bot: wmcs: better support for GridQueueInfo OK status [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/749174 (owner: 10Arturo Borrero Gonzalez) [14:18:16] (03PS1) 10Majavah: cloud cumin: add support for project puppetdb [puppet] - 10https://gerrit.wikimedia.org/r/749178 [14:18:33] (03CR) 10Majavah: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/749178 (owner: 10Majavah) [14:20:38] !log installing lldpd security updates on bullseye [14:20:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:29:40] (03PS2) 10Majavah: cloud cumin: add support for project puppetdb [puppet] - 10https://gerrit.wikimedia.org/r/749178 [14:29:50] (03CR) 10Majavah: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/749178 (owner: 10Majavah) [14:30:06] 10SRE, 10ops-codfw: Installation issues on PowerEdge R440 Ganeti servers with buster / firmware update needed - https://phabricator.wikimedia.org/T296856 (10Papaul) 05Open→03Resolved @MoritzMuehlenhoff no problem closing this task now [14:36:15] 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.2 point update - https://phabricator.wikimedia.org/T298021 (10MoritzMuehlenhoff) [14:37:28] 10SRE, 10ops-codfw, 10DC-Ops, 10Machine-Learning-Team: (Need By: TBD) rack/setup/install ml-serve200[5-8] - https://phabricator.wikimedia.org/T294945 (10Papaul) [14:40:21] PROBLEM - SSH on rdb1006.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook 
[14:45:31] (03PS1) 10Ottomata: Add airflow instance profile script, use it from scheduler and webserver [puppet] - 10https://gerrit.wikimedia.org/r/749180 (https://phabricator.wikimedia.org/T295201) [14:48:12] (03PS2) 10Ottomata: Add airflow instance profile script, use it from scheduler and webserver [puppet] - 10https://gerrit.wikimedia.org/r/749180 (https://phabricator.wikimedia.org/T295201) [14:48:43] !log installing squashfs-tools security updates [14:48:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:52] (03CR) 10Ottomata: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33065/console" [puppet] - 10https://gerrit.wikimedia.org/r/749180 (https://phabricator.wikimedia.org/T295201) (owner: 10Ottomata) [14:49:59] (03CR) 10Ottomata: [V: 03+1 C: 03+2] Add airflow instance profile script, use it from scheduler and webserver [puppet] - 10https://gerrit.wikimedia.org/r/749180 (https://phabricator.wikimedia.org/T295201) (owner: 10Ottomata) [14:50:22] (03PS2) 10Jbond: WIP: add reposync [software/spicerack] - 10https://gerrit.wikimedia.org/r/747116 [14:53:27] (03PS1) 10Ottomata: Remove already absented airflow from an-test-coord [puppet] - 10https://gerrit.wikimedia.org/r/749183 [14:55:22] (03CR) 10Ottomata: [C: 03+2] Remove already absented airflow from an-test-coord [puppet] - 10https://gerrit.wikimedia.org/r/749183 (owner: 10Ottomata) [14:55:47] (03PS1) 10Ottomata: Fix airflow CLI wrapper profile_file reference [puppet] - 10https://gerrit.wikimedia.org/r/749185 (https://phabricator.wikimedia.org/T295201) [14:56:09] (03CR) 10Ottomata: [V: 03+2 C: 03+2] Fix airflow CLI wrapper profile_file reference [puppet] - 10https://gerrit.wikimedia.org/r/749185 (https://phabricator.wikimedia.org/T295201) (owner: 10Ottomata) [14:56:36] (03CR) 10jerkins-bot: [V: 04-1] WIP: add reposync [software/spicerack] - 10https://gerrit.wikimedia.org/r/747116 (owner: 10Jbond) [14:57:30] <_joe_> !log 
upgrading php 7.2 everywhere, T297667 [14:57:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:35] T297667: mysqli/mysqlnd memory leak - https://phabricator.wikimedia.org/T297667 [15:00:41] (03PS1) 10Ottomata: Make airflow instance profile file readable by all [puppet] - 10https://gerrit.wikimedia.org/r/749186 (https://phabricator.wikimedia.org/T295201) [15:01:12] (03CR) 10Ottomata: [V: 03+2 C: 03+2] Make airflow instance profile file readable by all [puppet] - 10https://gerrit.wikimedia.org/r/749186 (https://phabricator.wikimedia.org/T295201) (owner: 10Ottomata) [15:02:33] (03PS1) 10Ottomata: Subscribe airflow services to instance profile.sh file [puppet] - 10https://gerrit.wikimedia.org/r/749187 (https://phabricator.wikimedia.org/T295201) [15:02:51] (03CR) 10Ottomata: [V: 03+2 C: 03+2] Subscribe airflow services to instance profile.sh file [puppet] - 10https://gerrit.wikimedia.org/r/749187 (https://phabricator.wikimedia.org/T295201) (owner: 10Ottomata) [15:06:38] (03PS1) 10Ottomata: Airflow kerberos renewer service should also use CLI wrapper [puppet] - 10https://gerrit.wikimedia.org/r/749189 (https://phabricator.wikimedia.org/T295201) [15:07:57] (03CR) 10Ottomata: [C: 03+2] Airflow kerberos renewer service should also use CLI wrapper [puppet] - 10https://gerrit.wikimedia.org/r/749189 (https://phabricator.wikimedia.org/T295201) (owner: 10Ottomata) [15:09:25] (03PS1) 10Volans: dhcp: fix file removal check in dry-run mode [software/spicerack] - 10https://gerrit.wikimedia.org/r/749190 [15:09:27] (03PS1) 10Volans: redfish: DellSCP tell if any change was made [software/spicerack] - 10https://gerrit.wikimedia.org/r/749191 [15:10:29] 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.2 point update - https://phabricator.wikimedia.org/T298021 (10MoritzMuehlenhoff) [15:12:26] <_joe_> !log pruning docker images on deneb [15:12:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:24:29] (03PS1) 
10Giuseppe Lavagetto: profile::docker::storage::loopback: switch to overlay2 [puppet] - 10https://gerrit.wikimedia.org/r/749192 [15:27:31] (03PS1) 10Volans: sre.hosts.provision: fix boot order and PXE [cookbooks] - 10https://gerrit.wikimedia.org/r/749194 [15:27:54] (03PS1) 10Jbond: Add MySQL upgrade cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/749195 (https://phabricator.wikimedia.org/T239814) [15:28:57] (03CR) 10Jbond: "See notes i think this could benefit from the sre batch base classes. I have created an equivalent CR here" [cookbooks] - 10https://gerrit.wikimedia.org/r/749176 (https://phabricator.wikimedia.org/T239814) (owner: 10Ladsgroup) [15:30:03] 10SRE-tools, 10Discovery, 10Discovery-Search, 10Infrastructure-Foundations, 10IPv6: Some Search Platform / Discovery clusters apparently do not support IPv6 - https://phabricator.wikimedia.org/T271143 (10Gehel) [15:32:10] (03CR) 10Jbond: Add MySQL upgrade cookbook (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/749195 (https://phabricator.wikimedia.org/T239814) (owner: 10Jbond) [15:32:42] 10SRE-tools, 10Discovery, 10Discovery-Search, 10Infrastructure-Foundations, 10IPv6: Some Search Platform / Discovery clusters apparently do not support IPv6 - https://phabricator.wikimedia.org/T271143 (10bking) a:03bking [15:36:13] !log running sudo perf record -ag -F 99 -- sleep 3600 on integration-agent-docker-1008 and 1009 (T225730) [15:36:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:36:18] T225730: Reduce runtime of MW shared gate Jenkins jobs to 5 min - https://phabricator.wikimedia.org/T225730 [15:38:08] (03CR) 10Jbond: [C: 03+1] redfish: DellSCP tell if any change was made [software/spicerack] - 10https://gerrit.wikimedia.org/r/749191 (owner: 10Volans) [15:38:38] (03CR) 10Jbond: [C: 03+1] dhcp: fix file removal check in dry-run mode [software/spicerack] - 10https://gerrit.wikimedia.org/r/749190 (owner: 10Volans) [15:38:51] (03CR) 10Jbond: [C: 03+1] "LGTM" 
[cookbooks] - 10https://gerrit.wikimedia.org/r/749194 (owner: 10Volans) [15:39:05] (03CR) 10Volans: [C: 03+2] sre.hosts.provision: fix boot order and PXE [cookbooks] - 10https://gerrit.wikimedia.org/r/749194 (owner: 10Volans) [15:39:54] 10SRE-tools, 10Discovery, 10Discovery-Search, 10Infrastructure-Foundations, 10IPv6: Some Search Platform / Discovery clusters apparently do not support IPv6 - https://phabricator.wikimedia.org/T271143 (10bking) a:05bking→03None [15:40:19] (03CR) 10Ladsgroup: Add MySQL upgrade cookbook (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/749176 (https://phabricator.wikimedia.org/T239814) (owner: 10Ladsgroup) [15:41:16] (03CR) 10Volans: [C: 03+2] dhcp: fix file removal check in dry-run mode [software/spicerack] - 10https://gerrit.wikimedia.org/r/749190 (owner: 10Volans) [15:41:21] (03CR) 10Volans: [C: 03+2] redfish: DellSCP tell if any change was made [software/spicerack] - 10https://gerrit.wikimedia.org/r/749191 (owner: 10Volans) [15:41:31] RECOVERY - SSH on rdb1006.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:42:03] 10SRE-tools, 10Discovery, 10Discovery-Search, 10Infrastructure-Foundations, 10IPv6: Some Search Platform / Discovery clusters apparently do not support IPv6 - https://phabricator.wikimedia.org/T271143 (10bking) [15:42:04] (03Merged) 10jenkins-bot: sre.hosts.provision: fix boot order and PXE [cookbooks] - 10https://gerrit.wikimedia.org/r/749194 (owner: 10Volans) [15:44:45] (03PS1) 10Ottomata: Use distinct deployment and config for analytics-test airflow instances [puppet] - 10https://gerrit.wikimedia.org/r/749201 (https://phabricator.wikimedia.org/T295380) [15:45:30] (03PS2) 10Ottomata: Use distinct deployment and config for analytics-test airflow instances [puppet] - 10https://gerrit.wikimedia.org/r/749201 (https://phabricator.wikimedia.org/T295380) [15:46:59] (03CR) 10Ottomata: [V: 03+1] "PCC SUCCESS (DIFF 2): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33066/console" [puppet] - 10https://gerrit.wikimedia.org/r/749201 (https://phabricator.wikimedia.org/T295380) (owner: 10Ottomata) [15:48:23] (03Merged) 10jenkins-bot: dhcp: fix file removal check in dry-run mode [software/spicerack] - 10https://gerrit.wikimedia.org/r/749190 (owner: 10Volans) [15:48:25] (03Merged) 10jenkins-bot: redfish: DellSCP tell if any change was made [software/spicerack] - 10https://gerrit.wikimedia.org/r/749191 (owner: 10Volans) [15:48:53] (03CR) 10Jbond: [C: 03+1] "LGTM minor optional nit" [puppet] - 10https://gerrit.wikimedia.org/r/749178 (owner: 10Majavah) [15:51:24] (03PS3) 10Majavah: cloud cumin: add support for project puppetdb [puppet] - 10https://gerrit.wikimedia.org/r/749178 [15:51:50] (03CR) 10Majavah: cloud cumin: add support for project puppetdb (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/749178 (owner: 10Majavah) [15:52:02] !log volans@cumin1001 START - Cookbook sre.hosts.provision for host elastic1084.mgmt.eqiad.wmnet with reboot policy FORCED [15:52:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:53:03] (03CR) 10Ottomata: [V: 03+1 C: 03+2] Use distinct deployment and config for analytics-test airflow instances [puppet] - 10https://gerrit.wikimedia.org/r/749201 (https://phabricator.wikimedia.org/T295380) (owner: 10Ottomata) [15:57:07] !log volans@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host elastic1084.mgmt.eqiad.wmnet with reboot policy FORCED [15:57:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:57:23] !log volans@cumin1001 START - Cookbook sre.hosts.provision for host elastic1085.mgmt.eqiad.wmnet with reboot policy FORCED [15:57:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:58:53] (03CR) 10Jbond: Add MySQL upgrade cookbook (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/749176 
(https://phabricator.wikimedia.org/T239814) (owner: 10Ladsgroup) [16:01:35] (03PS1) 10Ottomata: Remove now unneeded analytics-test airflow scap target override [puppet] - 10https://gerrit.wikimedia.org/r/749206 (https://phabricator.wikimedia.org/T295380) [16:02:38] (03CR) 10Ottomata: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33067/console" [puppet] - 10https://gerrit.wikimedia.org/r/749206 (https://phabricator.wikimedia.org/T295380) (owner: 10Ottomata) [16:03:23] (03CR) 10Ottomata: [V: 03+1 C: 03+2] Remove now unneeded analytics-test airflow scap target override [puppet] - 10https://gerrit.wikimedia.org/r/749206 (https://phabricator.wikimedia.org/T295380) (owner: 10Ottomata) [16:05:28] !log volans@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host elastic1085.mgmt.eqiad.wmnet with reboot policy FORCED [16:05:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:06:10] 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.2 point update - https://phabricator.wikimedia.org/T298021 (10MoritzMuehlenhoff) [16:06:18] !log otto@deploy1002 Started deploy [airflow-dags/analytics-test@fa11cb4]: (no justification provided) [16:06:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:06:25] !log otto@deploy1002 Finished deploy [airflow-dags/analytics-test@fa11cb4]: (no justification provided) (duration: 00m 07s) [16:06:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:07:00] (03CR) 10Jbond: [C: 03+2] "lgtm will merge" [puppet] - 10https://gerrit.wikimedia.org/r/749178 (owner: 10Majavah) [16:13:48] (03PS2) 10Jbond: Add MySQL upgrade cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/749195 (https://phabricator.wikimedia.org/T239814) [16:16:29] (03CR) 10jerkins-bot: [V: 04-1] Add MySQL upgrade cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/749195 
(https://phabricator.wikimedia.org/T239814) (owner: 10Jbond) [16:16:45] (03PS3) 10Jbond: Add MySQL upgrade cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/749195 (https://phabricator.wikimedia.org/T239814) [16:16:58] (03PS4) 10Jbond: Add MySQL upgrade cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/749195 (https://phabricator.wikimedia.org/T239814) [16:18:39] (03CR) 10Cwhite: [C: 03+2] profile: upgrade to ecs 1.11.0-2 [puppet] - 10https://gerrit.wikimedia.org/r/748230 (https://phabricator.wikimedia.org/T294581) (owner: 10Cwhite) [16:20:08] (03CR) 10jerkins-bot: [V: 04-1] Add MySQL upgrade cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/749195 (https://phabricator.wikimedia.org/T239814) (owner: 10Jbond) [16:22:28] (03PS1) 10Cwhite: profile: upgrade collector7 to ecs 1.11.0-2 [puppet] - 10https://gerrit.wikimedia.org/r/749208 (https://phabricator.wikimedia.org/T294581) [16:24:42] (03CR) 10Cwhite: [C: 03+2] profile: upgrade collector7 to ecs 1.11.0-2 [puppet] - 10https://gerrit.wikimedia.org/r/749208 (https://phabricator.wikimedia.org/T294581) (owner: 10Cwhite) [16:25:44] 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.2 point update - https://phabricator.wikimedia.org/T298021 (10MoritzMuehlenhoff) [16:29:56] (03PS2) 10Giuseppe Lavagetto: profile::docker::storage::loopback: remove from puppet [puppet] - 10https://gerrit.wikimedia.org/r/749192 [16:29:58] (03PS1) 10Giuseppe Lavagetto: role::builder: only rebuild images on deneb [puppet] - 10https://gerrit.wikimedia.org/r/749209 [16:31:10] (03CR) 10jerkins-bot: [V: 04-1] role::builder: only rebuild images on deneb [puppet] - 10https://gerrit.wikimedia.org/r/749209 (owner: 10Giuseppe Lavagetto) [16:32:35] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33068/console" [puppet] - 10https://gerrit.wikimedia.org/r/749192 (owner: 10Giuseppe Lavagetto) [16:33:21] (03PS1) 10Arturo Borrero Gonzalez: wmcs: 
vps: remove_instance: refresh code for better logging [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/749211 [16:37:53] (03CR) 10JMeybohm: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/749192 (owner: 10Giuseppe Lavagetto) [16:39:23] (03CR) 10Giuseppe Lavagetto: [V: 03+1 C: 03+2] profile::docker::storage::loopback: remove from puppet [puppet] - 10https://gerrit.wikimedia.org/r/749192 (owner: 10Giuseppe Lavagetto) [16:47:21] !log otto@deploy1002 Started deploy [airflow-dags/analytics@27a4f7a]: (no justification provided) [16:47:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:49:15] !log otto@deploy1002 Finished deploy [airflow-dags/analytics@27a4f7a]: (no justification provided) (duration: 01m 53s) [16:49:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:52:42] !log volans@cumin1001 START - Cookbook sre.hosts.provision for host elastic1086.mgmt.eqiad.wmnet with reboot policy FORCED [16:52:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:53:56] !log volans@cumin1001 START - Cookbook sre.hosts.provision for host elastic1087.mgmt.eqiad.wmnet with reboot policy FORCED [16:53:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:56:42] (03PS1) 10Kormat: wmfdb/mycnf: Add support for parsing my.cnf files. 
[software/wmfdb] - 10https://gerrit.wikimedia.org/r/749213 [16:57:46] !log volans@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host elastic1086.mgmt.eqiad.wmnet with reboot policy FORCED [16:57:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:58:41] PROBLEM - Check systemd state on deneb is CRITICAL: CRITICAL - degraded: The following units failed: docker.service,docker.socket https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:59:01] !log volans@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host elastic1087.mgmt.eqiad.wmnet with reboot policy FORCED [16:59:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:59:57] !log volans@cumin1001 START - Cookbook sre.hosts.provision for host elastic1088.mgmt.eqiad.wmnet with reboot policy FORCED [17:00:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:00:14] !log volans@cumin1001 START - Cookbook sre.hosts.provision for host elastic1085.mgmt.eqiad.wmnet with reboot policy FORCED [17:00:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:00:29] RECOVERY - Debian mirror in sync with upstream on mirror1001 is OK: /srv/mirrors/debian is over 0 hours old. 
https://wikitech.wikimedia.org/wiki/Mirrors [17:00:29] !log volans@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host elastic1088.mgmt.eqiad.wmnet with reboot policy FORCED [17:00:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:00:45] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] wmcs: vps: remove_instance: refresh code for better logging [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/749211 (owner: 10Arturo Borrero Gonzalez) [17:00:46] !log volans@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host elastic1085.mgmt.eqiad.wmnet with reboot policy FORCED [17:00:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:01:07] (03PS1) 10Giuseppe Lavagetto: role::builder: allow using overlayfs [puppet] - 10https://gerrit.wikimedia.org/r/749215 [17:02:57] (03CR) 10Giuseppe Lavagetto: [C: 03+2] role::builder: allow using overlayfs [puppet] - 10https://gerrit.wikimedia.org/r/749215 (owner: 10Giuseppe Lavagetto) [17:03:07] RECOVERY - Check systemd state on deneb is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:03:37] !log volans@cumin1001 START - Cookbook sre.hosts.provision for host elastic1084.mgmt.eqiad.wmnet with reboot policy FORCED [17:03:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:04:40] !log volans@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host elastic1084.mgmt.eqiad.wmnet with reboot policy FORCED [17:04:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:04:55] (03PS10) 10Giuseppe Lavagetto: deployment-prep: install php 7.4 on a mw appserver [puppet] - 10https://gerrit.wikimedia.org/r/738194 [17:05:33] !log volans@cumin1001 START - Cookbook sre.hosts.provision for host elastic1085.mgmt.eqiad.wmnet with reboot policy FORCED [17:05:37] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:06:23] (03PS1) 10Majavah: Add discovery service for puppetmaster frontend [puppet] - 10https://gerrit.wikimedia.org/r/749216 (https://phabricator.wikimedia.org/T291541) [17:06:36] !log volans@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host elastic1085.mgmt.eqiad.wmnet with reboot policy FORCED [17:06:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:06:46] !log volans@cumin1001 START - Cookbook sre.hosts.provision for host elastic1088.mgmt.eqiad.wmnet with reboot policy FORCED [17:06:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:07:23] (03CR) 10Giuseppe Lavagetto: [C: 03+2] deployment-prep: install php 7.4 on a mw appserver [puppet] - 10https://gerrit.wikimedia.org/r/738194 (owner: 10Giuseppe Lavagetto) [17:10:00] (03PS1) 10JHathaway: mirror: allow analytics to pull from the new mirror [homer/public] - 10https://gerrit.wikimedia.org/r/749217 [17:10:21] 10SRE, 10Domains, 10Phabricator, 10serviceops-radar, 10User-revi: The phab.wiki domain redirect suddenly outputs "404, this domain is not configured" - https://phabricator.wikimedia.org/T298041 (10Dzahn) @revi Thank you very much! > (To be honest, I am very surprised that someone other than me was using... 
[17:10:30] (03CR) 10Jbond: "this looks good to me, and i don't think it introduces any other issues however im adding a few more people to comment on the general idea" [puppet] - 10https://gerrit.wikimedia.org/r/749216 (https://phabricator.wikimedia.org/T291541) (owner: 10Majavah) [17:11:27] (03PS1) 10Majavah: add discovery record for puppet [dns] - 10https://gerrit.wikimedia.org/r/749218 (https://phabricator.wikimedia.org/T291541) [17:11:29] (03PS1) 10Majavah: point puppet.SITE to discovery record [dns] - 10https://gerrit.wikimedia.org/r/749219 [17:11:51] !log volans@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host elastic1088.mgmt.eqiad.wmnet with reboot policy FORCED [17:11:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:12:01] (03CR) 10Majavah: "related dns patches: https://gerrit.wikimedia.org/r/c/operations/dns/+/749218/ https://gerrit.wikimedia.org/r/c/operations/dns/+/749219/" [puppet] - 10https://gerrit.wikimedia.org/r/749216 (https://phabricator.wikimedia.org/T291541) (owner: 10Majavah) [17:12:19] (03CR) 10jerkins-bot: [V: 04-1] add discovery record for puppet [dns] - 10https://gerrit.wikimedia.org/r/749218 (https://phabricator.wikimedia.org/T291541) (owner: 10Majavah) [17:12:36] (03CR) 10jerkins-bot: [V: 04-1] point puppet.SITE to discovery record [dns] - 10https://gerrit.wikimedia.org/r/749219 (owner: 10Majavah) [17:13:25] (03PS2) 10Majavah: add discovery record for puppet [dns] - 10https://gerrit.wikimedia.org/r/749218 (https://phabricator.wikimedia.org/T291541) [17:13:28] (03PS2) 10Majavah: point puppet.SITE to discovery record [dns] - 10https://gerrit.wikimedia.org/r/749219 [17:16:58] (03PS2) 10Giuseppe Lavagetto: role::builder: only rebuild images on deneb [puppet] - 10https://gerrit.wikimedia.org/r/749209 [17:17:09] (03CR) 10Jbond: [C: 03+1] "LGTM" [homer/public] - 10https://gerrit.wikimedia.org/r/749217 (owner: 10JHathaway) [17:19:20] (03PS1) 10BryanDavis: toolhub: Bump container version 
to 2021-12-20-122341-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/749220 (https://phabricator.wikimedia.org/T271490) [17:20:01] (03CR) 10CDanis: "Overall I think this looks good! Just a couple nits." [puppet] - 10https://gerrit.wikimedia.org/r/740828 (https://phabricator.wikimedia.org/T224891) (owner: 10Jbond) [17:23:25] 10SRE, 10LDAP-Access-Requests: Grant Access to Logstash for Zabe - https://phabricator.wikimedia.org/T297323 (10Dzahn) a:03Zabe [17:23:41] 10SRE, 10LDAP-Access-Requests: Grant Access to Logstash for Zabe - https://phabricator.wikimedia.org/T297323 (10Dzahn) Thanks @KFrancis! Over to you @Zabe [17:24:27] 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search, 10Elasticsearch: Q2:(Need By: 2021-12-17) rack/setup/install elastic108[4-8] - https://phabricator.wikimedia.org/T294152 (10Volans) [17:24:40] (03CR) 10BryanDavis: [C: 04-2] "After talking with akosiaris on irc I'm going to pause my desire to get this done until after the end of year break so that nobody gets st" [deployment-charts] - 10https://gerrit.wikimedia.org/r/749220 (https://phabricator.wikimedia.org/T271490) (owner: 10BryanDavis) [17:24:54] (03CR) 10Ayounsi: mirror: allow analytics to pull from the new mirror (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/749217 (owner: 10JHathaway) [17:26:05] (03PS2) 10JHathaway: mirror: allow analytics to pull from the new mirror [homer/public] - 10https://gerrit.wikimedia.org/r/749217 [17:26:32] (03CR) 10JHathaway: mirror: allow analytics to pull from the new mirror (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/749217 (owner: 10JHathaway) [17:27:00] (03CR) 10Dzahn: [C: 03+2] admin: give Sneha Patel access to Superset/Hive UIs w/ private data [puppet] - 10https://gerrit.wikimedia.org/r/748854 (https://phabricator.wikimedia.org/T297927) (owner: 10Dzahn) [17:27:10] (03PS3) 10Dzahn: admin: give Sneha Patel access to Superset/Hive UIs w/ private data [puppet] - 10https://gerrit.wikimedia.org/r/748854 
(https://phabricator.wikimedia.org/T297927) [17:27:21] (03PS1) 10Arturo Borrero Gonzalez: wmcs: toolforge: grid: add cookbook to depool a node [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/749221 (https://phabricator.wikimedia.org/T277653) [17:27:22] 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search, 10Elasticsearch: Q2:(Need By: 2021-12-17) rack/setup/install elastic108[4-8] - https://phabricator.wikimedia.org/T294152 (10Volans) I've set up the above hosts with the exception of `elastic1085`. I'm running some additional tests on it. I've updated the t... [17:27:23] (03PS1) 10Arturo Borrero Gonzalez: wmcs: toolforge: grid: add remove instance cookbook [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/749222 (https://phabricator.wikimedia.org/T277653) [17:27:51] (03CR) 10JHathaway: [C: 03+2] mirror: allow analytics to pull from the new mirror [homer/public] - 10https://gerrit.wikimedia.org/r/749217 (owner: 10JHathaway) [17:28:33] (03Merged) 10jenkins-bot: mirror: allow analytics to pull from the new mirror [homer/public] - 10https://gerrit.wikimedia.org/r/749217 (owner: 10JHathaway) [17:30:21] (03CR) 10jerkins-bot: [V: 04-1] wmcs: toolforge: grid: add remove instance cookbook [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/749222 (https://phabricator.wikimedia.org/T277653) (owner: 10Arturo Borrero Gonzalez) [17:30:35] (03CR) 10jerkins-bot: [V: 04-1] wmcs: toolforge: grid: add cookbook to depool a node [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/749221 (https://phabricator.wikimedia.org/T277653) (owner: 10Arturo Borrero Gonzalez) [17:34:50] 10SRE, 10LDAP-Access-Requests: Grant Access to Logstash for Zabe - https://phabricator.wikimedia.org/T297323 (10Zabe) >>! In T297323#7582289, @KFrancis wrote: > Thanks @Dzahn, @Zabe Please send me the following info: > > Full legal name > Mailing address > Email address > > If you prefer, you can send it... 
[17:36:04] 10SRE, 10LDAP-Access-Requests: Grant Access to Logstash for Zabe - https://phabricator.wikimedia.org/T297323 (10Dzahn) a:05Zabe→03KFrancis [17:40:15] (03CR) 10Legoktm: Remove obsolete Timeline configuration and fonts submodule (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/723652 (owner: 10Legoktm) [17:40:39] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33069/console" [puppet] - 10https://gerrit.wikimedia.org/r/749209 (owner: 10Giuseppe Lavagetto) [17:40:55] 10SRE, 10SRE-Access-Requests, 10Product-Analytics, 10Patch-For-Review: Requesting access to Superset for Spatel - https://phabricator.wikimedia.org/T297927 (10Dzahn) [17:43:08] 10SRE, 10SRE-Access-Requests, 10Product-Analytics, 10Patch-For-Review: Requesting access to Superset for Spatel - https://phabricator.wikimedia.org/T297927 (10Dzahn) user created and added to analytics-privatedata-users. stat1004 is a role(statistics::explorer) ` [stat1004:~] $ id spatel uid=36549(spatel... [17:44:33] !log LDAP - added uid=spatel to wmf group (T297927) [17:44:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:44:39] T297927: Requesting access to Superset for Spatel - https://phabricator.wikimedia.org/T297927 [17:46:13] (03PS2) 10Arturo Borrero Gonzalez: wmcs: toolforge: grid: add cookbook to depool a node [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/749221 (https://phabricator.wikimedia.org/T277653) [17:46:15] (03PS2) 10Arturo Borrero Gonzalez: wmcs: toolforge: grid: add remove instance cookbook [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/749222 (https://phabricator.wikimedia.org/T277653) [17:48:15] 10SRE, 10SRE-Access-Requests, 10Product-Analytics, 10Patch-For-Review: Requesting access to Superset for Spatel - https://phabricator.wikimedia.org/T297927 (10Dzahn) 05In progress→03Resolved @cchen @Ottomata done! 
the 2 requirements from https://wikitech.wikimedia.org/wiki/Analytics/Data_access#Dash... [17:48:20] 10SRE, 10SRE-Access-Requests, 10Product-Analytics, 10Patch-For-Review: Requesting access to Superset for Spatel - https://phabricator.wikimedia.org/T297927 (10Dzahn) Oh, hi, there you are @Sneha :) Sorry, I missed that for a moment. See above, things should work now, feel free to try it out! [17:48:59] (03CR) 10jerkins-bot: [V: 04-1] wmcs: toolforge: grid: add remove instance cookbook [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/749222 (https://phabricator.wikimedia.org/T277653) (owner: 10Arturo Borrero Gonzalez) [17:49:03] (03CR) 10jerkins-bot: [V: 04-1] wmcs: toolforge: grid: add cookbook to depool a node [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/749221 (https://phabricator.wikimedia.org/T277653) (owner: 10Arturo Borrero Gonzalez) [17:52:34] (03PS3) 10Arturo Borrero Gonzalez: wmcs: toolforge: grid: add cookbook to depool a node [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/749221 (https://phabricator.wikimedia.org/T277653) [17:55:20] (03CR) 10jerkins-bot: [V: 04-1] wmcs: toolforge: grid: add cookbook to depool a node [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/749221 (https://phabricator.wikimedia.org/T277653) (owner: 10Arturo Borrero Gonzalez) [17:56:59] (03PS3) 10Arturo Borrero Gonzalez: wmcs: toolforge: grid: add remove instance cookbook [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/749222 (https://phabricator.wikimedia.org/T277653) [18:00:01] (03PS4) 10Arturo Borrero Gonzalez: wmcs: toolforge: grid: add cookbook to depool a node [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/749221 (https://phabricator.wikimedia.org/T277653) [18:00:03] (03PS4) 10Arturo Borrero Gonzalez: wmcs: toolforge: grid: add remove instance cookbook [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/749222 (https://phabricator.wikimedia.org/T277653) [18:00:08] 10SRE, 10SRE-OnFire, 10Wikimedia-Incident: Incident: 2021-12-03 
mx2001->Gmail delivery issues - https://phabricator.wikimedia.org/T297127 (10Dzahn) @Herron How do you see this as the task creator, should this stay open until all subtasks are resolved? That means even though the actual incident is long ove... [18:00:23] (03CR) 10jerkins-bot: [V: 04-1] wmcs: toolforge: grid: add remove instance cookbook [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/749222 (https://phabricator.wikimedia.org/T277653) (owner: 10Arturo Borrero Gonzalez) [18:02:49] (03CR) 10jerkins-bot: [V: 04-1] wmcs: toolforge: grid: add cookbook to depool a node [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/749221 (https://phabricator.wikimedia.org/T277653) (owner: 10Arturo Borrero Gonzalez) [18:02:50] (03CR) 10jerkins-bot: [V: 04-1] wmcs: toolforge: grid: add remove instance cookbook [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/749222 (https://phabricator.wikimedia.org/T277653) (owner: 10Arturo Borrero Gonzalez) [18:03:27] 10SRE, 10serviceops: Connecting to https://api.svc.codfw.wmnet/ does not work - https://phabricator.wikimedia.org/T285517 (10Dzahn) [18:03:52] 10SRE, 10ops-eqiad, 10decommission-hardware, 10Patch-For-Review, 10cloud-services-team (Kanban): cloudmetrics1003 seizes up under load - https://phabricator.wikimedia.org/T297814 (10Cmjohnson) rollback in progress [18:04:30] (03PS5) 10Arturo Borrero Gonzalez: wmcs: toolforge: grid: add cookbook to depool a node [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/749221 (https://phabricator.wikimedia.org/T277653) [18:04:32] (03PS5) 10Arturo Borrero Gonzalez: wmcs: toolforge: grid: add remove instance cookbook [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/749222 (https://phabricator.wikimedia.org/T277653) [18:05:43] 10SRE, 10serviceops: Connecting to https://api.svc.codfw.wmnet/ does not work - https://phabricator.wikimedia.org/T285517 (10Dzahn) it's still as originally reported: ` [cumin1001:~] $ curl https://api.svc.codfw.wmnet/ curl: (60) SSL: no alternative 
certificate subject name matches target host name 'api.svc... [18:07:13] !log mforns@deploy1002 Started deploy [airflow-dags/analytics-test@27a4f7a]: (no justification provided) [18:07:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:07:21] !log mforns@deploy1002 Finished deploy [airflow-dags/analytics-test@27a4f7a]: (no justification provided) (duration: 00m 07s) [18:07:21] (03CR) 10jerkins-bot: [V: 04-1] wmcs: toolforge: grid: add cookbook to depool a node [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/749221 (https://phabricator.wikimedia.org/T277653) (owner: 10Arturo Borrero Gonzalez) [18:07:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:07:28] (03CR) 10jerkins-bot: [V: 04-1] wmcs: toolforge: grid: add remove instance cookbook [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/749222 (https://phabricator.wikimedia.org/T277653) (owner: 10Arturo Borrero Gonzalez) [18:08:45] 10SRE, 10ops-eqiad, 10decommission-hardware, 10Patch-For-Review, 10cloud-services-team (Kanban): cloudmetrics1003 seizes up under load - https://phabricator.wikimedia.org/T297814 (10Cmjohnson) The idrac is now accessible again [18:25:59] (03CR) 10Ottomata: [C: 03+2] Airflow 2.1.4 with extra dependencies [debs/airflow] (debian) - 10https://gerrit.wikimedia.org/r/742813 (https://phabricator.wikimedia.org/T295380) (owner: 10Ottomata) [18:26:01] (03CR) 10Ottomata: [V: 03+2 C: 03+2] Airflow 2.1.4 with extra dependencies [debs/airflow] (debian) - 10https://gerrit.wikimedia.org/r/742813 (https://phabricator.wikimedia.org/T295380) (owner: 10Ottomata) [18:31:00] (03PS1) 10Ottomata: Regenerate with recent pip-constraints.txt [debs/airflow] (debian) - 10https://gerrit.wikimedia.org/r/749247 [18:31:53] (03CR) 10Ottomata: [C: 03+2] Regenerate with recent pip-constraints.txt [debs/airflow] (debian) - 10https://gerrit.wikimedia.org/r/749247 (owner: 10Ottomata) [18:31:55] (03CR) 10Ottomata: [V: 03+2 C: 03+2] Regenerate with recent 
pip-constraints.txt [debs/airflow] (debian) - 10https://gerrit.wikimedia.org/r/749247 (owner: 10Ottomata) [18:41:45] 10SRE, 10SRE-Access-Requests, 10Product-Analytics: Requesting access to Superset for Spatel - https://phabricator.wikimedia.org/T297927 (10Sneha) Thanks @Dzahn and approvers. I am able to use my wikitech a/c with no issues now. :) [18:42:14] (03PS1) 10Ottomata: Use updated miniconda installer version [debs/airflow] (debian) - 10https://gerrit.wikimedia.org/r/749248 [18:44:00] 10SRE, 10SRE-Access-Requests, 10Product-Analytics: Requesting access to Superset for Spatel - https://phabricator.wikimedia.org/T297927 (10Dzahn) Great, @Sneha ! Appreciate the confirmation :) [18:44:09] (03CR) 10Ottomata: [V: 03+2 C: 03+2] Use updated miniconda installer version [debs/airflow] (debian) - 10https://gerrit.wikimedia.org/r/749248 (owner: 10Ottomata) [19:05:28] 10ops-eqiad, 10DC-Ops, 10Graphite: Upgrade firmware on graphite1004 if upgrade available. - https://phabricator.wikimedia.org/T297433 (10wiki_willy) a:03Cmjohnson [19:05:45] duesen: joining? [19:06:03] mutante: we are waiting for you... someone is in the wrong channel [19:06:19] duesen: ? weird.. checking! [19:07:19] 10SRE, 10ops-eqiad: Rack msw2-eqiad in cab A8 for configuration - https://phabricator.wikimedia.org/T296271 (10wiki_willy) a:05Jclark-ctr→03Cmjohnson [19:14:59] 10SRE-Access-Requests: Requesting access to analytics-platform-eng-admins for lbowmaker - https://phabricator.wikimedia.org/T298124 (10lbowmaker) [19:16:45] 10SRE-Access-Requests: Requesting access to analytics-platform-eng-admins for lbowmaker - https://phabricator.wikimedia.org/T298124 (10Ottomata) Approved. 
[19:26:49] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 0:40:00 on graphite1004.eqiad.wmnet with reason: update firmware [19:26:50] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:40:00 on graphite1004.eqiad.wmnet with reason: update firmware [19:26:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:26:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:45:05] PROBLEM - SSH on kubernetes1002.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:53:21] !log mforns@deploy1002 Started deploy [airflow-dags/analytics@27a4f7a]: (no justification provided) [19:53:24] 10SRE, 10Graphite, 10Patch-For-Review, 10User-fgiunchedi, 10Wikimedia-Incident: graphite1004 freezing - https://phabricator.wikimedia.org/T297265 (10Cmjohnson) [19:53:24] !log mforns@deploy1002 Finished deploy [airflow-dags/analytics@27a4f7a]: (no justification provided) (duration: 00m 03s) [19:53:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:53:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:53:31] 10ops-eqiad, 10DC-Ops, 10Graphite: Upgrade firmware on graphite1004 if upgrade available. - https://phabricator.wikimedia.org/T297433 (10Cmjohnson) 05Open→03Resolved graphite1004 f/w has been updated except the idrac. its version is too old and the system will not allow an update to the current or previo... 
[19:55:09] PROBLEM - SSH on db2086.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:57:01] !log mforns@deploy1002 Started deploy [airflow-dags/analytics@27a4f7a]: (no justification provided) [19:57:04] !log mforns@deploy1002 Finished deploy [airflow-dags/analytics@27a4f7a]: (no justification provided) (duration: 00m 03s) [19:57:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:57:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:57:57] !log mforns@deploy1002 Started deploy [airflow-dags/analytics@e970bd0]: (no justification provided) [19:58:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:58:03] !log mforns@deploy1002 Finished deploy [airflow-dags/analytics@e970bd0]: (no justification provided) (duration: 00m 06s) [19:58:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:59:22] (03PS1) 10Cmjohnson: adding ganeti servers to netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/749256 (https://phabricator.wikimedia.org/T293909) [20:01:04] (03CR) 10Cmjohnson: [C: 03+2] adding ganeti servers to netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/749256 (https://phabricator.wikimedia.org/T293909) (owner: 10Cmjohnson) [20:08:13] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host ganeti1025.eqiad.wmnet with OS buster [20:08:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:08:19] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, 10Patch-For-Review: Q2:(Need By: TBD) rack/setup/install ganeti102[5-8] - https://phabricator.wikimedia.org/T293909 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host ganeti1025.eqiad.wmnet wi... 
[20:10:29] (03PS1) 10Ottomata: Define analytics-hive connection for airflow instances [puppet] - 10https://gerrit.wikimedia.org/r/749258 [20:11:58] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host ganeti1026.eqiad.wmnet with OS buster [20:12:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:12:03] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations: Q2:(Need By: TBD) rack/setup/install ganeti102[5-8] - https://phabricator.wikimedia.org/T293909 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host ganeti1026.eqiad.wmnet with OS buster [20:12:37] (03CR) 10Ottomata: [C: 03+2] Define analytics-hive connection for airflow instances [puppet] - 10https://gerrit.wikimedia.org/r/749258 (owner: 10Ottomata) [20:12:43] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host ganeti1027.eqiad.wmnet with OS buster [20:12:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:12:48] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations: Q2:(Need By: TBD) rack/setup/install ganeti102[5-8] - https://phabricator.wikimedia.org/T293909 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host ganeti1027.eqiad.wmnet with OS buster [20:13:17] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host ganeti1028.eqiad.wmnet with OS buster [20:13:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:13:21] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations: Q2:(Need By: TBD) rack/setup/install ganeti102[5-8] - https://phabricator.wikimedia.org/T293909 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host ganeti1028.eqiad.wmnet with OS buster [20:29:23] !log cmjohnson@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ganeti1027.eqiad.wmnet with OS buster [20:29:26] Logged 
the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:29:27] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations: Q2:(Need By: TBD) rack/setup/install ganeti102[5-8] - https://phabricator.wikimedia.org/T293909 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host ganeti1027.eqiad.wmnet with OS buster executed with... [20:29:38] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host ganeti1027.eqiad.wmnet with OS buster [20:29:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:29:42] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations: Q2:(Need By: TBD) rack/setup/install ganeti102[5-8] - https://phabricator.wikimedia.org/T293909 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host ganeti1027.eqiad.wmnet with OS buster [20:29:45] !log cmjohnson@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ganeti1025.eqiad.wmnet with OS buster [20:29:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:29:48] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations: Q2:(Need By: TBD) rack/setup/install ganeti102[5-8] - https://phabricator.wikimedia.org/T293909 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host ganeti1025.eqiad.wmnet with OS buster executed with... 
[20:29:56] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host ganeti1025.eqiad.wmnet with OS buster
[20:29:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:30:01] !log cmjohnson@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ganeti1026.eqiad.wmnet with OS buster
[20:30:01] SRE, ops-eqiad, DC-Ops, Infrastructure-Foundations: Q2:(Need By: TBD) rack/setup/install ganeti102[5-8] - https://phabricator.wikimedia.org/T293909 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host ganeti1025.eqiad.wmnet with OS buster
[20:30:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:30:06] SRE, ops-eqiad, DC-Ops, Infrastructure-Foundations: Q2:(Need By: TBD) rack/setup/install ganeti102[5-8] - https://phabricator.wikimedia.org/T293909 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host ganeti1026.eqiad.wmnet with OS buster executed with...
[20:30:17] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host ganeti1026.eqiad.wmnet with OS buster
[20:30:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:30:21] SRE, ops-eqiad, DC-Ops, Infrastructure-Foundations: Q2:(Need By: TBD) rack/setup/install ganeti102[5-8] - https://phabricator.wikimedia.org/T293909 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host ganeti1026.eqiad.wmnet with OS buster
[20:30:22] !log cmjohnson@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ganeti1028.eqiad.wmnet with OS buster
[20:30:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:30:26] SRE, ops-eqiad, DC-Ops, Infrastructure-Foundations: Q2:(Need By: TBD) rack/setup/install ganeti102[5-8] - https://phabricator.wikimedia.org/T293909 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host ganeti1028.eqiad.wmnet with OS buster executed with...
[20:30:34] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host ganeti1028.eqiad.wmnet with OS buster
[20:30:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:30:38] SRE, ops-eqiad, DC-Ops, Infrastructure-Foundations: Q2:(Need By: TBD) rack/setup/install ganeti102[5-8] - https://phabricator.wikimedia.org/T293909 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host ganeti1028.eqiad.wmnet with OS buster
[20:32:02] !log mforns@deploy1002 Started deploy [airflow-dags/analytics@053bfc0]: (no justification provided)
[20:32:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:32:09] !log mforns@deploy1002 Finished deploy [airflow-dags/analytics@053bfc0]: (no justification provided) (duration: 00m 06s)
[20:32:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:38:52] (CR) JHathaway: [C: +2] mirrors.wikimedia.org: point to new mirror (1 comment) [dns] - https://gerrit.wikimedia.org/r/747933 (https://phabricator.wikimedia.org/T286898) (owner: JHathaway)
[20:45:28] RECOVERY - SSH on kubernetes1002.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[20:54:40] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti1025.eqiad.wmnet with OS buster
[20:54:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:54:44] SRE, ops-eqiad, DC-Ops, Infrastructure-Foundations: Q2:(Need By: TBD) rack/setup/install ganeti102[5-8] - https://phabricator.wikimedia.org/T293909 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host ganeti1025.eqiad.wmnet with OS buster completed: -...
[20:55:20] (PS1) Urbanecm: pwnwiki: Enable Growth features in dark mode [mediawiki-config] - https://gerrit.wikimedia.org/r/749264 (https://phabricator.wikimedia.org/T298115)
[20:59:57] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ganeti1027.eqiad.wmnet with OS buster
[21:00:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:00:01] SRE, ops-eqiad, DC-Ops, Infrastructure-Foundations: Q2:(Need By: TBD) rack/setup/install ganeti102[5-8] - https://phabricator.wikimedia.org/T293909 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host ganeti1027.eqiad.wmnet with OS buster executed with...
[21:00:53] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ganeti1028.eqiad.wmnet with OS buster
[21:00:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:00:56] SRE, ops-eqiad, DC-Ops, Infrastructure-Foundations: Q2:(Need By: TBD) rack/setup/install ganeti102[5-8] - https://phabricator.wikimedia.org/T293909 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host ganeti1028.eqiad.wmnet with OS buster executed with...
[21:02:44] SRE, SRE-Access-Requests: Requesting access to analytics-platform-eng-admins for lbowmaker - https://phabricator.wikimedia.org/T298124 (Dzahn) Open→In progress a:Dzahn Hey Luke (@lbowmaker) I got this one since I happen to be on our rotating clinic duty this week. So i'll handle access reques...
[21:04:16] SRE, SRE-Access-Requests: Requesting access to analytics-platform-eng-admins for lbowmaker - https://phabricator.wikimedia.org/T298124 (Dzahn)
[21:10:27] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ganeti1026.eqiad.wmnet with OS buster
[21:10:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:10:31] SRE, ops-eqiad, DC-Ops, Infrastructure-Foundations: Q2:(Need By: TBD) rack/setup/install ganeti102[5-8] - https://phabricator.wikimedia.org/T293909 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host ganeti1026.eqiad.wmnet with OS buster executed with...
[21:13:19] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host ganeti1027.eqiad.wmnet with OS buster
[21:13:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:13:23] SRE, ops-eqiad, DC-Ops, Infrastructure-Foundations: Q2:(Need By: TBD) rack/setup/install ganeti102[5-8] - https://phabricator.wikimedia.org/T293909 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host ganeti1027.eqiad.wmnet with OS buster
[21:14:22] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host ganeti1026.eqiad.wmnet with OS buster
[21:14:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:14:27] SRE, ops-eqiad, DC-Ops, Infrastructure-Foundations: Q2:(Need By: TBD) rack/setup/install ganeti102[5-8] - https://phabricator.wikimedia.org/T293909 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host ganeti1026.eqiad.wmnet with OS buster
[21:15:17] SRE, SRE-Access-Requests: Requesting access to analytics-platform-eng-admins for lbowmaker - https://phabricator.wikimedia.org/T298124 (Dzahn) @DAbad Hi, do you approve this request as manager? cc: @ottomata The current setup is: requested group "analytics-platform-eng-admins contains members: [*p...
[21:17:47] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host ganeti1028.eqiad.wmnet with OS buster
[21:17:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:17:52] SRE, ops-eqiad, DC-Ops, Infrastructure-Foundations: Q2:(Need By: TBD) rack/setup/install ganeti102[5-8] - https://phabricator.wikimedia.org/T293909 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host ganeti1028.eqiad.wmnet with OS buster
[21:19:38] SRE, ops-eqiad, DC-Ops, Infrastructure-Foundations: Q2:(Need By: TBD) rack/setup/install ganeti102[5-8] - https://phabricator.wikimedia.org/T293909 (Cmjohnson) a:Volans→Cmjohnson
[21:21:42] (PS3) RLazarus: imagecatalog: Add an hourly systemd timer to scan for what's currently running [puppet] - https://gerrit.wikimedia.org/r/748876 (https://phabricator.wikimedia.org/T287130)
[21:23:34] (CR) RLazarus: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33070/console" [puppet] - https://gerrit.wikimedia.org/r/748876 (https://phabricator.wikimedia.org/T287130) (owner: RLazarus)
[21:30:49] SRE, SRE-Access-Requests: Requesting access to analytics-platform-eng-admins for lbowmaker - https://phabricator.wikimedia.org/T298124 (Ottomata) > Should Luke be added as a member to Platform Engineering just like any other existing member there? That would cover this request and possibly more. Most l...
[21:35:53] SRE, LDAP-Access-Requests: Grant Access to Logstash for Zabe - https://phabricator.wikimedia.org/T297323 (KFrancis) @Dzahn Thanks! I'll let you know when the agreement is complete.
[21:37:29] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti1027.eqiad.wmnet with OS buster
[21:37:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:37:34] SRE, ops-eqiad, DC-Ops, Infrastructure-Foundations: Q2:(Need By: TBD) rack/setup/install ganeti102[5-8] - https://phabricator.wikimedia.org/T293909 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host ganeti1027.eqiad.wmnet with OS buster completed: -...
[21:39:55] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti1026.eqiad.wmnet with OS buster
[21:39:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:40:00] SRE, ops-eqiad, DC-Ops, Infrastructure-Foundations: Q2:(Need By: TBD) rack/setup/install ganeti102[5-8] - https://phabricator.wikimedia.org/T293909 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host ganeti1026.eqiad.wmnet with OS buster completed: -...
[21:42:40] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti1028.eqiad.wmnet with OS buster
[21:42:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:42:44] SRE, ops-eqiad, DC-Ops, Infrastructure-Foundations: Q2:(Need By: TBD) rack/setup/install ganeti102[5-8] - https://phabricator.wikimedia.org/T293909 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host ganeti1028.eqiad.wmnet with OS buster completed: -...
[21:43:18] SRE, ops-eqiad, DC-Ops, Infrastructure-Foundations: Q2:(Need By: TBD) rack/setup/install ganeti102[5-8] - https://phabricator.wikimedia.org/T293909 (Cmjohnson)
[21:43:50] SRE, ops-eqiad, DC-Ops, Infrastructure-Foundations: Q2:(Need By: TBD) rack/setup/install ganeti102[5-8] - https://phabricator.wikimedia.org/T293909 (Cmjohnson) Open→Resolved DC-Ops work is finished.
[21:44:54] SRE, ops-eqiad: Rack msw2-eqiad in cab A8 for configuration - https://phabricator.wikimedia.org/T296271 (Cmjohnson) The switch is racked and connected to scs-a8 port 8 temporarily. I zeroized the switch and attempting to install the updated OS.
[21:50:11] (PS1) Dzahn: admin: add Luke Bowmaker to analytics-platform-eng-admins [puppet] - https://gerrit.wikimedia.org/r/749270 (https://phabricator.wikimedia.org/T298124)
[21:54:33] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to analytics-platform-eng-admins for lbowmaker - https://phabricator.wikimedia.org/T298124 (Dzahn) @Ottomata ACK, thanks. And.. actually.. despite my previous comments, I see we have other cases where we mix inclusion from another group w...
[21:55:21] RECOVERY - SSH on db2086.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[21:55:46] SRE, ops-eqiad, Analytics-Clusters, DC-Ops: (Need By: TBD) rack/setup/install an-test-coord1002 - https://phabricator.wikimedia.org/T293938 (Cmjohnson) @BTullis Have you had a chance to look and see if we're using the correct partman recipe?
[21:55:47] SRE, LDAP-Access-Requests: Grant Access to Logstash for Zabe - https://phabricator.wikimedia.org/T297323 (Dzahn) Great, thank you. Feel free to assign it back to me or just comment. Happy Holidays
[22:30:49] SRE, ops-codfw: ms-be2065 failed drive sdq - https://phabricator.wikimedia.org/T297933 (Papaul) Current Status: The Dell replacement part(s) for your POWEREDGE R740XD2 has been shipped by FedEX on tracking number 537569551940.
[22:39:03] (PS1) JHathaway: mirrors: revert to sodium mirror temporarily [dns] - https://gerrit.wikimedia.org/r/749274
[22:40:53] (CR) JHathaway: [C: +2] mirrors: revert to sodium mirror temporarily [dns] - https://gerrit.wikimedia.org/r/749274 (owner: JHathaway)
[23:01:25] PROBLEM - Restbase edge ulsfo on text-lb.ulsfo.wikimedia.org is CRITICAL: /api/rest_v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase
[23:03:27] RECOVERY - Restbase edge ulsfo on text-lb.ulsfo.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase