[00:00:05] brennen: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC late backport and config training deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211119T0000). [00:06:51] (03CR) 10Krinkle: gitlab-runner: restrict docker images and services (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/724472 (https://phabricator.wikimedia.org/T291978) (owner: 10Brennen Bearnes) [00:08:32] !log end of UTC late deployment training window [00:08:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:23:37] I am going to be rolling wmf.9 back to testwikis due to https://phabricator.wikimedia.org/T296044 breaking articles [00:24:50] well, errors showing up on the citations anyway [00:28:04] (03CR) 10Dzahn: [C: 03+2] miscweb: try enabling TLS after nodePort is removed and we deployed [deployment-charts] - 10https://gerrit.wikimedia.org/r/739945 (https://phabricator.wikimedia.org/T281538) (owner: 10Dzahn) [00:28:27] jeena: thanks for the rollback. Note it appears to actually save broken edits (which someone will need to semi-manually fix). [00:30:28] (03CR) 10Brennen Bearnes: gitlab-runner: restrict docker images and services (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/724472 (https://phabricator.wikimedia.org/T291978) (owner: 10Brennen Bearnes) [00:31:41] (03Merged) 10jenkins-bot: miscweb: try enabling TLS after nodePort is removed and we deployed [deployment-charts] - 10https://gerrit.wikimedia.org/r/739945 (https://phabricator.wikimedia.org/T281538) (owner: 10Dzahn) [00:33:44] !log dzahn@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'miscweb' for release 'main' . [00:33:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:33:51] question i keep having and not knowing the answer to: what's the quick way to cross reference groups with what extensions are enabled? 
[00:37:13] "quick way" [00:37:42] Not sure that there is one [00:37:56] ctrl+f for "wmgUse" in InitialiseSettings :| [00:38:08] wikiapiary kind of knows where each extension is used [00:38:21] example https://wikiapiary.com/wiki/Extension:Babel [00:38:37] jeena: hello. can we backport the revert patch instead? [00:38:43] at the bottom there is "Wikibooks (fa)" but amont all other mediawikis out there [00:38:46] among [00:38:57] yeah, cross-referencing CommonSettings.php and InitialiseSettings.php is the best I've got [00:39:43] MatmaRex: sure, I was in the middle fixing some conflicts in order to revert but I can abort and we can do that [00:40:05] wikiapiary links are from "check usage and version matrix" link in the infobox on an extension page like https://www.mediawiki.org/wiki/Extension:RSS [00:40:07] one of the "rethink MediaWiki/Wikimedia configuration" phab tasks may eventually result in a better solution [00:41:57] thx all. [00:42:13] * brennen puts a what-groups-is-extension-deployed-in utility on yak shaving list. [00:44:35] what did you mean about wikibooks though MatmaRex ? I didn't quite understand [00:45:23] jeena: that was mutante, i don't know what he meant :) [00:45:30] jeena: i want us to backport this patch: https://phabricator.wikimedia.org/T296044#7515599 [00:45:44] which is a small revert, and seems to fix the VE issue for me [00:46:19] oh whoops [00:46:27] I see [00:46:44] re: wikibooks, side convo about figuring out where an extension is deployed, safe to ignore. [00:46:58] That sounds good. 
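(Editor's note: the 'ctrl+f for "wmgUse" in InitialiseSettings' suggestion above can be roughly automated. The following is only a minimal sketch, not the what-groups-is-extension-deployed-in utility brennen mentions. It assumes the settings file uses quoted keys of the form 'wmgUseSomething'; the sample text, setting names, and wiki names are hypothetical and not copied from the real configuration, and the helper find_use_flags is made up for illustration.)

```python
import re

# Hypothetical excerpt in the style of a PHP settings array. The keys and
# wiki names are invented for illustration, not taken from production config.
SAMPLE = """
'wmgUseBabel' => [
    'default' => true,
],
'wmgUseRSSExtension' => [
    'default' => false,
    'examplewiki' => true,
],
'wgSitename' => [
    'default' => 'Example',
],
"""

def find_use_flags(text):
    """Return the sorted, de-duplicated names of quoted keys starting with wmgUse."""
    return sorted(set(re.findall(r"'(wmgUse\w+)'", text)))

flags = find_use_flags(SAMPLE)
print(flags)
```

(This only lists the flag names; working out which wikis or groups each flag applies to would still need the manual cross-referencing with CommonSettings.php described in the conversation.)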
[00:50:05] there is a list of wikis using a specific extension [00:50:13] one of them is wikibooks (fa) [00:50:14] PROBLEM - Maps tiles generation on alert1001 is CRITICAL: CRITICAL: 90.35% of data under the critical threshold [5.0] https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1 [00:50:30] this site lists all mediawikis it knows about that use a certain extension [00:50:41] it does not limit itself to wikimedia operated mediawikis [00:50:45] but it includes them [00:51:11] you can ask it about all the extensions and get a full list of mediawikis on the internet using it, incl ours [00:51:20] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=sidekiq site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [00:51:22] Thanks mutante . I accidentally attributed your message to conversation about doing a backport :P [00:51:42] and a version matrix to go with it, which probably lists ones that really should upgrade [00:52:28] jeena: ah, no, it was all to answer the question from brennen. how to find out which extension is used where [00:53:28] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [00:55:32] !log legoktm@cumin1001 conftool action : set/pooled=yes; selector: name=thumbor2005.codfw.wmnet [00:55:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:55:37] !log legoktm@cumin1001 conftool action : set/pooled=yes; selector: name=thumbor2006.codfw.wmnet [00:55:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:56:03] !log legoktm@cumin1001 conftool action : set/weight=5; selector: name=thumbor2005.codfw.wmnet [00:56:07] !log legoktm@cumin1001 conftool action : set/weight=5; selector: name=thumbor2006.codfw.wmnet [00:56:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:56:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:00:04] (03PS1) 10Dzahn: Revert "miscweb: remove nodePort and re-enable TLS" [deployment-charts] - 10https://gerrit.wikimedia.org/r/739835 [01:00:29] (03PS2) 10Dzahn: Revert "miscweb: remove nodePort and re-enable TLS" [deployment-charts] - 10https://gerrit.wikimedia.org/r/739835 [01:00:32] MatmaRex: fetching it to the deploy server now [01:02:36] jeena: i think we need to backport to wmf.9 in gerrit? 
[01:02:59] oh, no wonder something seemed weird [01:03:07] (03PS1) 10Bartosz Dziewoński: Revert "Use proper method for comparing linear data" [extensions/Cite] (wmf/1.38.0-wmf.9) - 10https://gerrit.wikimedia.org/r/739836 (https://phabricator.wikimedia.org/T296044) [01:03:10] I didn't realize that wasn't the backport, sorry [01:03:13] jeena: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Cite/+/739836 [01:03:19] :) [01:03:33] (i can't +2 on that branch) [01:03:38] I'll go ahead and +2 now [01:03:48] (03CR) 10Jeena Huneidi: [C: 03+2] Revert "Use proper method for comparing linear data" [extensions/Cite] (wmf/1.38.0-wmf.9) - 10https://gerrit.wikimedia.org/r/739836 (https://phabricator.wikimedia.org/T296044) (owner: 10Bartosz Dziewoński) [01:05:38] since it's a revert shall I just go ahead and sync or do you want to test on mwdebug? [01:05:41] !log legoktm@cumin1001 conftool action : set/weight=10; selector: name=thumbor2005.codfw.wmnet [01:05:45] !log legoktm@cumin1001 conftool action : set/weight=10; selector: name=thumbor2006.codfw.wmnet [01:05:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:05:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:06:07] (03CR) 10Dzahn: [C: 03+2] Revert "miscweb: remove nodePort and re-enable TLS" [deployment-charts] - 10https://gerrit.wikimedia.org/r/739835 (owner: 10Dzahn) [01:06:12] we really need to get MatmaRex deploy permissions [01:06:27] please don't [01:06:32] haha [01:06:33] jeena: i think you can just sync [01:06:40] okay, thanks! 
[01:09:10] !log legoktm@cumin1001 conftool action : set/pooled=no; selector: name=thumbor2001.codfw.wmnet [01:09:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:09:14] !log legoktm@cumin1001 conftool action : set/pooled=no; selector: name=thumbor2002.codfw.wmnet [01:09:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:09:57] (03Merged) 10jenkins-bot: Revert "miscweb: remove nodePort and re-enable TLS" [deployment-charts] - 10https://gerrit.wikimedia.org/r/739835 (owner: 10Dzahn) [01:18:47] !log dzahn@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'miscweb' for release 'main' . [01:18:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:19:20] !log dzahn@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'miscweb' for release 'main' . [01:19:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:24:52] (03Merged) 10jenkins-bot: Revert "Use proper method for comparing linear data" [extensions/Cite] (wmf/1.38.0-wmf.9) - 10https://gerrit.wikimedia.org/r/739836 (https://phabricator.wikimedia.org/T296044) (owner: 10Bartosz Dziewoński) [01:28:08] (03PS1) 10Legoktm: thumbor: Remove thumbor2001 and thumbor2002 from memcached [puppet] - 10https://gerrit.wikimedia.org/r/739955 (https://phabricator.wikimedia.org/T273141) [01:28:10] (03PS1) 10Legoktm: conftool: Remove thumbor2001 and thumbor2002 [puppet] - 10https://gerrit.wikimedia.org/r/739956 (https://phabricator.wikimedia.org/T273141) [01:28:12] (03PS1) 10Legoktm: Remove thumbor2001 and thumbor2002 from site.pp [puppet] - 10https://gerrit.wikimedia.org/r/739957 (https://phabricator.wikimedia.org/T273141) [01:30:10] !log legoktm@cumin1001 conftool action : set/pooled=inactive; selector: name=thumbor2001.codfw.wmnet [01:30:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:30:13] !log legoktm@cumin1001 conftool action : set/pooled=inactive; 
selector: name=thumbor2002.codfw.wmnet [01:30:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:30:45] (03CR) 10Legoktm: [C: 03+2] thumbor: Remove thumbor2001 and thumbor2002 from memcached [puppet] - 10https://gerrit.wikimedia.org/r/739955 (https://phabricator.wikimedia.org/T273141) (owner: 10Legoktm) [01:31:06] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [01:31:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:31:37] !log dzahn@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'miscweb' for release 'main' . [01:31:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:33:14] jeena: it merged finally btw, not sure if you synced it yet [01:33:24] I am syncing now [01:34:55] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [01:34:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:35:05] !log jhuneidi@deploy1002 Synchronized php-1.38.0-wmf.9/extensions/Cite/modules/ve-cite/ve.dm.MWReferenceNode.js: Backport for T296044 (duration: 00m 55s) [01:35:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:35:08] T296044: VisualEditor expands/duplicates named references - https://phabricator.wikimedia.org/T296044 [01:35:26] RECOVERY - PyBal backends health check on lvs2009 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [01:35:58] legoktm: ^ managed to revert and fix on codfw .. [01:36:11] just the alert part. doing eqiad. then out [01:36:15] :D nice [01:37:00] !log dzahn@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'miscweb' for release 'main' . 
[01:37:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:37:38] (03CR) 10Legoktm: [C: 03+2] conftool: Remove thumbor2001 and thumbor2002 [puppet] - 10https://gerrit.wikimedia.org/r/739956 (https://phabricator.wikimedia.org/T273141) (owner: 10Legoktm) [01:38:00] I'm not sure how to test it's fixed though [01:40:20] !log legoktm@cumin1001 START - Cookbook sre.hosts.decommission for hosts thumbor2001.codfw.wmnet [01:40:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:41:32] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [01:41:32] RECOVERY - PyBal backends health check on lvs1015 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [01:41:38] :) [01:41:52] there is more, on git-ssh instead of miscweb [01:42:14] that part is unrelated to k8s [01:42:40] jeena: thanks. i posted a minimal test case on the task, you could copy it to a production wiki and then try visual editing it [01:42:45] or just test on beta [01:42:54] okay, thanks for all your help! [01:42:58] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=phab2001-vcs.codfw.wmnet [01:43:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:43:15] i tested it locally though [01:45:36] yeah, I think it should be fine [01:45:52] !log I think git-ssh6_22 is down (see alerts lvs2008/2009) due to the v6 issue from ongoing lvs maintenance. 
depooled in conftool [01:45:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:46:12] 10SRE, 10ops-codfw, 10DC-Ops, 10SRE Observability (FY2021/2022-Q2): Q2:(Need By: TBD) rack/setup/install prometheus200[56] - https://phabricator.wikimedia.org/T294302 (10Papaul) [01:48:09] thanks jeena [01:48:19] i'm off for tonight, i hope nothing else exciting happens [01:48:26] !log dzahn@cumin1001 conftool action : set/pooled=inactive; selector: name=phab2001-vcs.codfw.wmnet [01:48:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:48:50] haha me too [01:48:59] jeena: all good? i'm about to call it for the day. [01:49:20] Yes, I think so [01:49:34] I also must leave the keyboard now [01:49:36] cool. thanks for assist MatmaRex. [01:51:10] np. see you [01:52:31] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=phab2001-vcs.codfw.wmnet [01:52:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:55:00] hmm, the change on asw-c-codfw.mgmt.wmnet in the decom script is removing thumbor2001 but also adding prometheus2006 [01:56:01] looks like https://phabricator.wikimedia.org/T294302#7515723 [01:56:18] papaul: ^ fyi, I'm going to accept the change that adds prometheus2006 [01:56:36] PROBLEM - Confd template for /srv/config-master/pybal/codfw/git-ssh on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/git-ssh is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [01:57:11] ^ still me, but at least I know how to fix that from earlier [01:57:16] arg :) [01:57:20] !log legoktm@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts thumbor2001.codfw.wmnet [01:57:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:58:36] * legoktm off [01:58:55] this happens when a host is down and temp. 
errors, for example during reboot without depooling it first [01:58:56] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/git-ssh on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/git-ssh is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [01:59:10] and then there are stale error files in the confd path [01:59:15] even after its long back [02:01:03] !log [puppetmaster2001:/var/run/confd-template] $ sudo rm .git-ssh*.err [02:01:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:01:18] RECOVERY - Confd template for /srv/config-master/pybal/codfw/git-ssh on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [02:01:22] 10ops-eqiad, 10DC-Ops, 10decommission-hardware: decommission thorium.eqiad.wmnet - https://phabricator.wikimedia.org/T292075 (10wiki_willy) ++ops-eqiad project tag [02:02:14] !log [puppetmaster1001:/var/run/confd-template] $ sudo rm .git-ssh*.err [02:02:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:02:44] RECOVERY - Confd template for /srv/config-master/pybal/eqiad/git-ssh on puppetmaster1001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [02:05:38] PROBLEM - PyBal IPVS diff check on lvs2008 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([208.80.153.250:22, 2620:0:860:ed1a::3:fa:22]) https://wikitech.wikimedia.org/wiki/PyBal [02:05:38] PROBLEM - PyBal IPVS diff check on lvs2010 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([208.80.153.250:22, 2620:0:860:ed1a::3:fa:22]) https://wikitech.wikimedia.org/wiki/PyBal [02:05:42] is someone from SRE still around? I need to do an emergency deploy. The patch is not risky (JS only, loaded on a single special page) but would be nice to have coverage just in case I mess up the deployment process somehow [02:06:32] tgr: do it.. 
unfortunately still here trying to fix those alerts there .. [02:07:04] thanks! will probably need half an hour or so, it still needs to get through CI [02:07:16] ouch.. ok [02:07:34] kind of have been trying to leave since 2 hours:) but emergency is emergency [02:09:26] I can wait until tomorrow if you are about to leave. It's a product feature, not that tragic. I was just hoping someone is around anyway. [02:10:05] ACKNOWLEDGEMENT - PyBal IPVS diff check on lvs2008 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([208.80.153.250:22, 2620:0:860:ed1a::3:fa:22]) daniel_zahn pybal needs restart after maintenance it looks https://wikitech.wikimedia.org/wiki/PyBal [02:10:05] ACKNOWLEDGEMENT - PyBal IPVS diff check on lvs2010 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([208.80.153.250:22, 2620:0:860:ed1a::3:fa:22]) daniel_zahn pybal needs restart after maintenance it looks https://wikitech.wikimedia.org/wiki/PyBal [02:10:37] (03PS1) 10Gergő Tisza: Lazy-load structured task JS files [extensions/GrowthExperiments] (wmf/1.38.0-wmf.9) - 10https://gerrit.wikimedia.org/r/739837 (https://phabricator.wikimedia.org/T296049) [02:14:33] (03CR) 10Gergő Tisza: [C: 03+2] "Emergency deployment. Patch is low-risk (JS only, loaded on Special:Homepage only)." [extensions/GrowthExperiments] (wmf/1.38.0-wmf.9) - 10https://gerrit.wikimedia.org/r/739837 (https://phabricator.wikimedia.org/T296049) (owner: 10Gergő Tisza) [02:18:36] tgr: I'm close enough to my laptop [02:18:36] (03CR) 10Gergő Tisza: [C: 04-2] "will do later." [extensions/GrowthExperiments] (wmf/1.38.0-wmf.9) - 10https://gerrit.wikimedia.org/r/739837 (https://phabricator.wikimedia.org/T296049) (owner: 10Gergő Tisza) [02:20:33] thanks legoktm[m]! I can do it, just wanted to have someone around in case of unlikely mishaps. [02:20:56] will go forward then. 
[02:21:00] I appreciate that, legoktm[m] [02:21:16] (03CR) 10Gergő Tisza: [C: 03+2] Lazy-load structured task JS files [extensions/GrowthExperiments] (wmf/1.38.0-wmf.9) - 10https://gerrit.wikimedia.org/r/739837 (https://phabricator.wikimedia.org/T296049) (owner: 10Gergő Tisza) [02:21:33] 👍 [02:21:53] (03CR) 10Gergő Tisza: [C: 03+2] "(little back-and-forth while trying to find SRE coverage)" [extensions/GrowthExperiments] (wmf/1.38.0-wmf.9) - 10https://gerrit.wikimedia.org/r/739837 (https://phabricator.wikimedia.org/T296049) (owner: 10Gergő Tisza) [02:26:56] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=phab2001-vcs.codfw.wmnet [02:26:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:27:22] RECOVERY - PyBal IPVS diff check on lvs2008 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [02:27:24] RECOVERY - PyBal IPVS diff check on lvs2010 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [02:29:38] that was me. ^ out now then [02:38:30] (03Merged) 10jenkins-bot: Lazy-load structured task JS files [extensions/GrowthExperiments] (wmf/1.38.0-wmf.9) - 10https://gerrit.wikimedia.org/r/739837 (https://phabricator.wikimedia.org/T296049) (owner: 10Gergő Tisza) [02:42:14] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) restart with plugin upgrade (3 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad plugin upgrade + restart - ryankemper@cumin1001 - T295705 [02:42:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:42:18] T295705: Cleanup missing Commons index on Elasticsearch eqiad - https://phabricator.wikimedia.org/T295705 [02:45:52] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[02:45:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:49:35] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [02:49:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:54:34] !log tgr@deploy1002 Synchronized php-1.38.0-wmf.9/extensions/GrowthExperiments/modules: Backport: [[gerrit:739837|Lazy-load structured task JS files (T296049)]] (duration: 00m 55s) [02:54:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:54:38] T296049: [wmf.9 - ruwiki] Add link doesn't load - https://phabricator.wikimedia.org/T296049 [03:00:13] legoktm[m]: done, thanks! [03:10:16] Awesome [03:42:56] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /_info (retrieve service info) is CRITICAL: Test retrieve service info returned the unexpected status 503 (expecting: 200): /api (bad URL) is CRITICAL: Test bad URL returned the unexpected status 503 (expecting: 404) https://wikitech.wikimedia.org/wiki/Citoid [03:44:50] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [04:24:24] RECOVERY - SSH on kubernetes1004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [04:24:32] PROBLEM - ElasticSearch shard size check - 9200 on logstash2002 is CRITICAL: CRITICAL - logstash-mediawiki-2021.11.17(354gb) https://wikitech.wikimedia.org/wiki/Search%23If_it_has_been_indexed [04:32:39] (03PS1) 10Ladsgroup: media: Store metadata of one-page documents correctly [core] (wmf/1.38.0-wmf.9) - 10https://gerrit.wikimedia.org/r/739838 (https://phabricator.wikimedia.org/T296001) [04:32:48] (03CR) 10Ladsgroup: [C: 03+2] media: Store metadata of one-page documents correctly [core] (wmf/1.38.0-wmf.9) - 10https://gerrit.wikimedia.org/r/739838 (https://phabricator.wikimedia.org/T296001) 
(owner: 10Ladsgroup) [04:35:56] PROBLEM - ElasticSearch shard size check - 9200 on logstash1035 is CRITICAL: CRITICAL - logstash-mediawiki-2021.11.17(371.3333333333333gb) https://wikitech.wikimedia.org/wiki/Search%23If_it_has_been_indexed [04:49:48] PROBLEM - Host logstash2028.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [04:52:29] (03Merged) 10jenkins-bot: media: Store metadata of one-page documents correctly [core] (wmf/1.38.0-wmf.9) - 10https://gerrit.wikimedia.org/r/739838 (https://phabricator.wikimedia.org/T296001) (owner: 10Ladsgroup) [04:55:08] !log ladsgroup@deploy1002 Synchronized php-1.38.0-wmf.9/includes/media/DjVuImage.php: Backport: [[gerrit:739838|media: Store metadata of one-page documents correctly (T296001)]] (duration: 00m 56s) [04:55:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:55:12] T296001: DjVuHandler: getDimensionInfoFromMetaTree: PHP Notice: Undefined index: pages - https://phabricator.wikimedia.org/T296001 [04:56:30] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [04:56:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:00:15] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [05:00:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:06:20] PROBLEM - k8s API server requests latencies on kubemaster2002 is CRITICAL: instance=10.192.16.48 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [05:11:10] RECOVERY - k8s API server requests latencies on kubemaster2002 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [05:11:30] RECOVERY - Host logstash2028.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.78 ms [06:13:48] (03PS1) 10Majavah: openstack: allow passwords for new service account [puppet] - 10https://gerrit.wikimedia.org/r/739965 [06:19:46] (03PS1) 10Marostegui: dbproxy1019: Depool clouddb1019 [puppet] - 10https://gerrit.wikimedia.org/r/739966 [06:20:39] (03PS2) 10Marostegui: dbproxy1018: Depool clouddb1019 [puppet] - 10https://gerrit.wikimedia.org/r/739966 [06:21:24] (03CR) 10Marostegui: [C: 03+2] dbproxy1018: Depool clouddb1019 [puppet] - 10https://gerrit.wikimedia.org/r/739966 (owner: 10Marostegui) [06:23:25] !log Upgrade clouddb1019 [06:23:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:23:42] (03PS1) 10Marostegui: Revert "dbproxy1018: Depool clouddb1019" [puppet] - 10https://gerrit.wikimedia.org/r/739839 [06:26:06] (03CR) 10Marostegui: [C: 03+2] Revert "dbproxy1018: Depool clouddb1019" [puppet] - 10https://gerrit.wikimedia.org/r/739839 (owner: 10Marostegui) [06:55:33] !log Reboot db1132 to pick up new kernel T288720 [06:55:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:55:37] T288720: Failover m5 master (db1128) to db1132 to upgrade its kernel - https://phabricator.wikimedia.org/T288720 [07:26:37] PROBLEM - SSH on kubernetes1004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [07:29:04] (03CR) 10Urbanecm: [C: 04-1] Bug:T291737 updated arywiki NSs and fixed tabulation issue (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/738872 (owner: 10Ideophagous) [07:29:32] (03CR) 10Urbanecm: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/738876 (https://phabricator.wikimedia.org/T291737) (owner: 10Ideophagous) [07:30:52] (03CR) 10jerkins-bot: [V: 04-1] arywiki 
NS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/738876 (https://phabricator.wikimedia.org/T291737) (owner: 10Ideophagous) [07:34:49] (03PS1) 10Elukey: kubernetes: expose internal CA bundle to helm [puppet] - 10https://gerrit.wikimedia.org/r/740083 (https://phabricator.wikimedia.org/T291905) [07:36:17] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32505/console" [puppet] - 10https://gerrit.wikimedia.org/r/740083 (https://phabricator.wikimedia.org/T291905) (owner: 10Elukey) [07:37:19] (03CR) 10Elukey: [V: 03+1 C: 03+2] Deploy the wmf_trusted_cas.jks bundle where Gobblin runs [puppet] - 10https://gerrit.wikimedia.org/r/739476 (https://phabricator.wikimedia.org/T291905) (owner: 10Elukey) [07:39:21] (03Abandoned) 10Ideophagous: Bug:T291737 updated arywiki NSs and fixed tabulation issue [mediawiki-config] - 10https://gerrit.wikimedia.org/r/738872 (owner: 10Ideophagous) [07:39:47] (03PS1) 10Ladsgroup: Revert "Title: use PageStore instead of LinkCache" [core] (wmf/1.38.0-wmf.9) - 10https://gerrit.wikimedia.org/r/739841 [07:40:20] (03Abandoned) 10Ideophagous: Bug:T291737 Squashed two commits into one, previous commit comments follow: Bug:T291737 Change-Id: Ib263a5419c6ace911a597d025b28d6ef13549c10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/738870 (https://phabricator.wikimedia.org/T291737) (owner: 10Ideophagous) [07:40:22] (03PS2) 10RhinosF1: Revert "Title: use PageStore instead of LinkCache" [core] (wmf/1.38.0-wmf.9) - 10https://gerrit.wikimedia.org/r/739841 (owner: 10Ladsgroup) [07:40:39] (03Abandoned) 10Ideophagous: Bug:T291737 Squashed two commits into one, previous commit comments follow: Bug:T291737 Change-Id: Ib263a5419c6ace911a597d025b28d6ef13549c10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/735713 (owner: 10Ideophagous) [07:59:53] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: 
gobblin-webrequest.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:00:03] (03PS1) 10Elukey: Deploy the WMF Internal CAs bundle truststore to Hadoop test workers [puppet] - 10https://gerrit.wikimedia.org/r/740086 (https://phabricator.wikimedia.org/T291905) [08:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211119T0800) [08:01:05] (03PS2) 10Elukey: Deploy the WMF Internal CAs bundle truststore to Hadoop test workers [puppet] - 10https://gerrit.wikimedia.org/r/740086 (https://phabricator.wikimedia.org/T291905) [08:03:30] (03PS3) 10Elukey: Deploy the WMF Internal CAs bundle truststore to Hadoop workers [puppet] - 10https://gerrit.wikimedia.org/r/740086 (https://phabricator.wikimedia.org/T291905) [08:04:26] (03CR) 10Elukey: [C: 03+2] Deploy the WMF Internal CAs bundle truststore to Hadoop workers [puppet] - 10https://gerrit.wikimedia.org/r/740086 (https://phabricator.wikimedia.org/T291905) (owner: 10Elukey) [08:05:23] (03CR) 10Ladsgroup: [C: 03+2] Revert "Title: use PageStore instead of LinkCache" [core] (wmf/1.38.0-wmf.9) - 10https://gerrit.wikimedia.org/r/739841 (owner: 10Ladsgroup) [08:06:21] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:10:28] (03PS2) 10Majavah: openstack: allow passwords for new service account [puppet] - 10https://gerrit.wikimedia.org/r/739965 [08:17:13] !log installing mariadb-10.5 security updates on bullseye (as packaged in Debian, not the wmf-internal packages) [08:17:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:18:28] 10SRE, 10Analytics-Radar, 10Data-Engineering, 10Event-Platform, 10Patch-For-Review: Allow kafka clients to verify brokers hostnames when using SSL - https://phabricator.wikimedia.org/T291905 (10elukey) At this point we need to 
migrate all Kafka client using TLS to the new bundle before proceeding further... [08:21:22] (03CR) 10JMeybohm: [C: 03+1] "LGTM" [homer/public] - 10https://gerrit.wikimedia.org/r/739879 (https://phabricator.wikimedia.org/T293729) (owner: 10AOkoth) [08:22:14] !log ayounsi@deploy1002 Started deploy [homer/deploy@dc007aa]: Homer CR738905 [08:22:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:23:40] !log ayounsi@deploy1002 Finished deploy [homer/deploy@dc007aa]: Homer CR738905 (duration: 01m 25s) [08:23:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:23:55] (03Merged) 10jenkins-bot: Revert "Title: use PageStore instead of LinkCache" [core] (wmf/1.38.0-wmf.9) - 10https://gerrit.wikimedia.org/r/739841 (owner: 10Ladsgroup) [08:26:36] !log ladsgroup@deploy1002 Synchronized php-1.38.0-wmf.9/includes: Backport: [[gerrit:739841|Revert "Title: use PageStore instead of LinkCache"]] (duration: 01m 03s) [08:26:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:26:50] Amir1: I see a ton of "[{reqId}] {exception_url} Error: Call to undefined method MediaWiki\Page\PageStoreRecord::getField()" in logstash [08:26:50] (03CR) 10Urbanecm: arywiki NS (033 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/738876 (https://phabricator.wikimedia.org/T291737) (owner: 10Ideophagous) [08:26:59] starting basically now [08:27:06] majavah: that's okay [08:27:18] it's because order of file arrivals is random [08:27:20] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [08:27:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:27:26] ahh [08:27:29] RECOVERY - SSH on kubernetes1004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [08:27:36] has it recovered? 
[08:27:51] seems to have stopped [08:27:51] ~12,3k total [08:28:06] (CR) Urbanecm: Enable mapframe on the Indonesian Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/738547 (https://phabricator.wikimedia.org/T295571) (owner: 4nn1l2) [08:28:18] (CR) Urbanecm: [C: +1] "LGTM" [mediawiki-config] - https://gerrit.wikimedia.org/r/738547 (https://phabricator.wikimedia.org/T295571) (owner: 4nn1l2) [08:28:22] marostegui: reads free falling https://grafana.wikimedia.org/d/000000273/mysql?viewPanel=16&orgId=1&from=now-1h&to=now&var-job=All&var-server=db1163&var-port=9104 [08:28:27] (CR) Urbanecm: [C: +1] Enable mapframe on the Indonesian Wikipedia (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/738547 (https://phabricator.wikimedia.org/T295571) (owner: 4nn1l2) [08:28:36] Amir1: Yeah, I am monitoring!! \o/ [08:29:20] (PS1) Muehlenhoff: Add library hint for mariadb-10.5 [puppet] - https://gerrit.wikimedia.org/r/740090 [08:29:40] !log ayounsi@cumin1001 START - Cookbook sre.deploy.python-code homer to cumin2002.codfw.wmnet with reason: update wmf-netbox - ayounsi@cumin1001 [08:29:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:30:29] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) homer to cumin2002.codfw.wmnet with reason: update wmf-netbox - ayounsi@cumin1001 [08:30:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:30:41] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:30:51] Amir1: The reads are pretty much back to previous values! [08:31:07] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[08:31:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:32:03] \o/ [08:32:12] PROBLEM - Check unit status of httpbb_hourly_appserver on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [08:35:39] (CR) Muehlenhoff: [C: +2] Add library hint for mariadb-10.5 [puppet] - https://gerrit.wikimedia.org/r/740090 (owner: Muehlenhoff) [08:36:15] (PS1) Elukey: Move kafka-test brokers to the PKI intermediate CA [puppet] - https://gerrit.wikimedia.org/r/740091 (https://phabricator.wikimedia.org/T291905) [08:37:25] (CR) Elukey: [C: +2] Move kafka-test brokers to the PKI intermediate CA [puppet] - https://gerrit.wikimedia.org/r/740091 (https://phabricator.wikimedia.org/T291905) (owner: Elukey) [08:46:07] !log elukey@cumin1001 START - Cookbook sre.kafka.roll-restart-brokers for Kafka A:kafka-test-eqiad cluster: Roll restart of jvm daemons for openjdk upgrade. [08:46:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:46:43] (PS1) Muehlenhoff: Fix syntax [puppet] - https://gerrit.wikimedia.org/r/740094 [08:47:01] (CR) jerkins-bot: [V: -1] Fix syntax [puppet] - https://gerrit.wikimedia.org/r/740094 (owner: Muehlenhoff) [08:47:54] (PS2) Muehlenhoff: Fix syntax [puppet] - https://gerrit.wikimedia.org/r/740094 [08:48:50] SRE, SRE Observability: Icinga check for ipv6 host reachability - https://phabricator.wikimedia.org/T163996 (ayounsi) p:Medium→High Raising the priority to bring attention to this task, feel free to re-triage accordingly. Yesterday's [[ https://docs.google.com/document/d/1s56_keYG8J58nZjH5tJLsiL... 
[08:49:39] (CR) Muehlenhoff: [C: +2] Fix syntax [puppet] - https://gerrit.wikimedia.org/r/740094 (owner: Muehlenhoff) [08:50:06] ACKNOWLEDGEMENT - PyBal backends health check on lvs2007 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) ayounsi https://phabricator.wikimedia.org/T295118 https://wikitech.wikimedia.org/wiki/PyBal [08:50:06] ACKNOWLEDGEMENT - PyBal connections to etcd on lvs2007 is CRITICAL: CRITICAL: 0 connections established with conf2004.codfw.wmnet:4001 (min=12) ayounsi https://phabricator.wikimedia.org/T295118 https://wikitech.wikimedia.org/wiki/PyBal [08:50:06] ACKNOWLEDGEMENT - pybal on lvs2007 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal ayounsi https://phabricator.wikimedia.org/T295118 https://wikitech.wikimedia.org/wiki/PyBal [08:50:13] (CR) Ideophagous: "In this case, maybe it's best to abandon this patch too and then do everything from the start. I'll update the NS and aliases for one patc" [mediawiki-config] - https://gerrit.wikimedia.org/r/738876 (https://phabricator.wikimedia.org/T291737) (owner: Ideophagous) [08:51:01] ACKNOWLEDGEMENT - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal ayounsi https://phabricator.wikimedia.org/T295118 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:51:01] ACKNOWLEDGEMENT - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal ayounsi https://phabricator.wikimedia.org/T295118 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:58:53] (CR) JMeybohm: [C: -1] "More a generic question:" [puppet] - https://gerrit.wikimedia.org/r/740083 (https://phabricator.wikimedia.org/T291905) (owner: Elukey) [09:00:18] (CR) Jobo: [V: +2] Update approver for os-installers [puppet] - https://gerrit.wikimedia.org/r/738837 (owner: Muehlenhoff) [09:03:27] SRE, ops-codfw, serviceops: decom 
mw2280 (was: mw2280 unresponsive to powercycle and hardreset) - https://phabricator.wikimedia.org/T290708 (MoritzMuehlenhoff) >>! In T290708#7515390, @Dzahn wrote: > Regardless of the outcome we would remove the existing broken one, I think. Yeah, independent of a... [09:06:44] !log elukey@cumin1001 END (PASS) - Cookbook sre.kafka.roll-restart-brokers (exit_code=0) for Kafka A:kafka-test-eqiad cluster: Roll restart of jvm daemons for openjdk upgrade. [09:06:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:09:35] RECOVERY - HTTPS-wmfusercontent on phab.wmfusercontent.org is OK: SSL OK - Certificate *.wikipedia.org valid until 2022-02-10 08:02:21 +0000 (expires in 82 days) https://phabricator.wikimedia.org/tag/phabricator/ [09:11:41] (CR) Elukey: [V: +1] kubernetes: expose internal CA bundle to helm (1 comment) [puppet] - https://gerrit.wikimedia.org/r/740083 (https://phabricator.wikimedia.org/T291905) (owner: Elukey) [09:16:03] PROBLEM - BGP status on cr3-eqsin is CRITICAL: BGP CRITICAL - AS6939/IPv4: Active - HE, AS6939/IPv6: Active - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:16:10] SRE, Analytics-Radar, Data-Engineering, Event-Platform, Patch-For-Review: Allow kafka clients to verify brokers hostnames when using SSL - https://phabricator.wikimedia.org/T291905 (elukey) ` elukey@kafka-test1006:~$ openssl s_client -CAfile /etc/ssl/localcerts/wmf_trusted_root_CAs.pem -verif... 
[09:20:00] (CR) Arturo Borrero Gonzalez: [C: +2] openstack: allow passwords for new service account [puppet] - https://gerrit.wikimedia.org/r/739965 (owner: Majavah) [09:20:11] PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - AS6939/IPv6: Connect - HE, AS6939/IPv4: Connect - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:20:31] PROBLEM - Host mr1-eqsin.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [09:20:40] ops-eqiad, serviceops-radar: mw1448.mgmt alert - https://phabricator.wikimedia.org/T296041 (Peachey88) [09:26:35] RECOVERY - Host mr1-eqsin.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 262.73 ms [09:27:23] RECOVERY - Check unit status of httpbb_hourly_appserver on cumin1001 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [09:27:37] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:28:25] RECOVERY - BGP status on cr2-eqsin is OK: BGP OK - up: 73, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:30:15] RECOVERY - BGP status on cr3-eqsin is OK: BGP OK - up: 331, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:30:35] !log re-enable cr2-codfw<->asw-b7-codfw link after disabling inet6 on cr2-codfw:ae2 - T295118 [09:30:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:30:39] T295118: Can't commit on asw-b-codfw - https://phabricator.wikimedia.org/T295118 [09:33:35] PROBLEM - Apache HTTP on wtp1043 is CRITICAL: HTTP CRITICAL - No data received from host https://wikitech.wikimedia.org/wiki/Application_servers [09:34:40] (CR) JMeybohm: [C: -1] kubernetes: expose internal CA bundle to helm (1 comment) [puppet] - https://gerrit.wikimedia.org/r/740083 
(https://phabricator.wikimedia.org/T291905) (owner: Elukey) [09:35:11] PROBLEM - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is CRITICAL: 120 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [09:35:37] RECOVERY - Apache HTTP on wtp1043 is OK: HTTP OK: HTTP/1.1 302 Found - 635 bytes in 0.050 second response time https://wikitech.wikimedia.org/wiki/Application_servers [09:37:12] RECOVERY - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is OK: (C)100 gt (W)50 gt 2 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [09:38:22] (CR) JMeybohm: [C: -1] kubernetes: expose internal CA bundle to helm (1 comment) [puppet] - https://gerrit.wikimedia.org/r/740083 (https://phabricator.wikimedia.org/T291905) (owner: Elukey) [09:46:21] (CR) Arturo Borrero Gonzalez: [C: +2] Keystone policy: add support for the keystonevalidate role [puppet] - https://gerrit.wikimedia.org/r/739902 (https://phabricator.wikimedia.org/T295234) (owner: Andrew Bogott) [09:47:05] (PS1) Muehlenhoff: New cookbook to reboot a VM on the Ganeti level [cookbooks] - https://gerrit.wikimedia.org/r/740104 (https://phabricator.wikimedia.org/T284811) [09:49:38] (CR) jerkins-bot: [V: -1] New cookbook to reboot a VM on the Ganeti level [cookbooks] - https://gerrit.wikimedia.org/r/740104 (https://phabricator.wikimedia.org/T284811) (owner: Muehlenhoff) [09:53:27] !log run `commit full` on asw-b-codfw - T295118 [09:53:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:53:32] T295118: Can't commit on asw-b-codfw - https://phabricator.wikimedia.org/T295118 [09:56:46] (PS2) Muehlenhoff: New cookbook to reboot a VM on the Ganeti level 
[cookbooks] - https://gerrit.wikimedia.org/r/740104 (https://phabricator.wikimedia.org/T284811) [10:00:02] (CR) jerkins-bot: [V: -1] New cookbook to reboot a VM on the Ganeti level [cookbooks] - https://gerrit.wikimedia.org/r/740104 (https://phabricator.wikimedia.org/T284811) (owner: Muehlenhoff) [10:06:44] (Abandoned) Elukey: kubernetes: expose internal CA bundle to helm [puppet] - https://gerrit.wikimedia.org/r/740083 (https://phabricator.wikimedia.org/T291905) (owner: Elukey) [10:09:43] (PS3) Muehlenhoff: New cookbook to reboot a VM on the Ganeti level [cookbooks] - https://gerrit.wikimedia.org/r/740104 (https://phabricator.wikimedia.org/T284811) [10:11:03] (CR) Volans: "Did a very quick pass" [cookbooks] - https://gerrit.wikimedia.org/r/740104 (https://phabricator.wikimedia.org/T284811) (owner: Muehlenhoff) [10:11:03] PROBLEM - memcached socket on wtp1039 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.32.231: Connection reset by peer https://wikitech.wikimedia.org/wiki/Memcached [10:12:30] (CR) jerkins-bot: [V: -1] New cookbook to reboot a VM on the Ganeti level [cookbooks] - https://gerrit.wikimedia.org/r/740104 (https://phabricator.wikimedia.org/T284811) (owner: Muehlenhoff) [10:13:03] (PS4) Muehlenhoff: New cookbook to reboot a VM on the Ganeti level [cookbooks] - https://gerrit.wikimedia.org/r/740104 (https://phabricator.wikimedia.org/T284811) [10:13:50] SRE, SRE-swift-storage, ops-codfw, Infrastructure-Foundations, netops: Can't commit on asw-b-codfw - https://phabricator.wikimedia.org/T295118 (ayounsi) Current status: * IPv6 is still broken on asw-b7-codfw (for traffic local and transiting through the switch) * `inet6` is disabled on cr2-co... 
[10:14:30] (PS1) Arturo Borrero Gonzalez: cloud: cinder-backups: refresh node references [puppet] - https://gerrit.wikimedia.org/r/740107 (https://phabricator.wikimedia.org/T295584) [10:15:11] w/in 11 [10:15:14] ufffff :) [10:15:55] I wonder who's at window 11 patiently waiting for elukey [10:16:09] sre chan, nothing special [10:16:11] (CR) jerkins-bot: [V: -1] cloud: cinder-backups: refresh node references [puppet] - https://gerrit.wikimedia.org/r/740107 (https://phabricator.wikimedia.org/T295584) (owner: Arturo Borrero Gonzalez) [10:16:20] (CR) jerkins-bot: [V: -1] New cookbook to reboot a VM on the Ganeti level [cookbooks] - https://gerrit.wikimedia.org/r/740104 (https://phabricator.wikimedia.org/T284811) (owner: Muehlenhoff) [10:19:49] PROBLEM - Check systemd state on wtp1043 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-node-exporter-apt.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:20:11] PROBLEM - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is CRITICAL: 155 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [10:20:53] (PS2) Arturo Borrero Gonzalez: cloud: cinder-backups: refresh node references [puppet] - https://gerrit.wikimedia.org/r/740107 (https://phabricator.wikimedia.org/T295584) [10:22:13] RECOVERY - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is OK: (C)100 gt (W)50 gt 40 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [10:22:40] (CR) jerkins-bot: [V: -1] cloud: cinder-backups: refresh node references [puppet] - https://gerrit.wikimedia.org/r/740107 (https://phabricator.wikimedia.org/T295584) (owner: Arturo Borrero Gonzalez) [10:25:26] 
(CR) David Caro: [C: +1] "There's a question about some behavior change (config), the other can be ignored." [software/puppet-compiler] - https://gerrit.wikimedia.org/r/738452 (owner: Jbond) [10:25:59] RECOVERY - Check systemd state on wtp1043 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:27:09] (CR) David Caro: [C: +2] wmcs: Introduce function run_one to run a command [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/739562 (owner: David Caro) [10:27:37] (CR) David Caro: [C: +2] wmcs: use argparse formatter and module docs [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/739563 (owner: David Caro) [10:29:48] (Merged) jenkins-bot: wmcs: Introduce function run_one to run a command [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/739562 (owner: David Caro) [10:30:32] (Merged) jenkins-bot: wmcs: use argparse formatter and module docs [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/739563 (owner: David Caro) [10:32:07] (PS3) Arturo Borrero Gonzalez: cloud: cinder-backups: refresh node references [puppet] - https://gerrit.wikimedia.org/r/740107 (https://phabricator.wikimedia.org/T295584) [10:35:32] (CR) Muehlenhoff: New cookbook to reboot a VM on the Ganeti level (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/740104 (https://phabricator.wikimedia.org/T284811) (owner: Muehlenhoff) [10:35:44] (CR) Arturo Borrero Gonzalez: [C: +2] cloud: cinder-backups: refresh node references [puppet] - https://gerrit.wikimedia.org/r/740107 (https://phabricator.wikimedia.org/T295584) (owner: Arturo Borrero Gonzalez) [10:40:33] SRE, SRE-swift-storage, ops-codfw, Infrastructure-Foundations, netops: Can't commit on asw-b-codfw - https://phabricator.wikimedia.org/T295118 (ayounsi) Hopefully we won't need to, but if asw1-b2-codfw needs to be rebooted, here are the impacted servers: ms-be2041 ms-be2046 ms-be2031 
ms-be203... [10:41:43] RECOVERY - memcached socket on wtp1039 is OK: TCP OK - 0.000 second response time on socket /run/memcached/memcached.sock https://wikitech.wikimedia.org/wiki/Memcached [10:54:37] PROBLEM - Widespread puppet agent failures- no resources reported on alert1001 is CRITICAL: 0.01016 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [10:55:46] (PS5) Muehlenhoff: New cookbook to reboot a VM on the Ganeti level [cookbooks] - https://gerrit.wikimedia.org/r/740104 (https://phabricator.wikimedia.org/T284811) [11:04:01] (CR) jerkins-bot: [V: -1] New cookbook to reboot a VM on the Ganeti level [cookbooks] - https://gerrit.wikimedia.org/r/740104 (https://phabricator.wikimedia.org/T284811) (owner: Muehlenhoff) [11:09:11] (PS1) Jbond: profile::ceph::cinder_backup_nodes: add default value [puppet] - https://gerrit.wikimedia.org/r/740116 [11:10:49] (CR) David Caro: [C: +1] profile::ceph::cinder_backup_nodes: add default value [puppet] - https://gerrit.wikimedia.org/r/740116 (owner: Jbond) [11:12:05] (CR) Jbond: [C: +2] profile::ceph::cinder_backup_nodes: add default value [puppet] - https://gerrit.wikimedia.org/r/740116 (owner: Jbond) [11:13:01] (PS1) Jbond: Revert "profile::ceph::cinder_backup_nodes: add default value" [puppet] - https://gerrit.wikimedia.org/r/739843 [11:13:44] (PS1) Jbond: profile::ceph::cinder_backup_nodes: add default value [puppet] - https://gerrit.wikimedia.org/r/740118 [11:14:55] (CR) Jbond: [C: +2] profile::ceph::cinder_backup_nodes: add default value [puppet] - https://gerrit.wikimedia.org/r/740118 (owner: Jbond) [11:15:28] (CR) Jbond: [C: +2] Revert "profile::ceph::cinder_backup_nodes: add default value" [puppet] - https://gerrit.wikimedia.org/r/739843 (owner: Jbond) [11:15:35] (PS1) JMeybohm: Ensure only certificates from this packag are bundled [debs/wmf-certificates] - 
https://gerrit.wikimedia.org/r/740119 [11:15:37] (PS1) JMeybohm: Bump debian/changelog [debs/wmf-certificates] - https://gerrit.wikimedia.org/r/740120 [11:17:31] (PS2) JMeybohm: Ensure only certificates from this package are bundled [debs/wmf-certificates] - https://gerrit.wikimedia.org/r/740119 [11:17:33] (PS2) JMeybohm: Bump debian/changelog [debs/wmf-certificates] - https://gerrit.wikimedia.org/r/740120 [11:23:51] (PS3) JMeybohm: Ensure only certificates from this package are bundled [debs/wmf-certificates] - https://gerrit.wikimedia.org/r/740119 [11:23:53] (PS3) JMeybohm: Bump debian/changelog [debs/wmf-certificates] - https://gerrit.wikimedia.org/r/740120 [11:24:12] first try! [11:25:14] (PS6) Muehlenhoff: New cookbook to reboot a VM on the Ganeti level [cookbooks] - https://gerrit.wikimedia.org/r/740104 (https://phabricator.wikimedia.org/T284811) [11:26:34] (CR) Elukey: [C: +1] Ensure only certificates from this package are bundled [debs/wmf-certificates] - https://gerrit.wikimedia.org/r/740119 (owner: JMeybohm) [11:27:01] (CR) Elukey: [C: +1] Bump debian/changelog [debs/wmf-certificates] - https://gerrit.wikimedia.org/r/740120 (owner: JMeybohm) [11:27:52] (CR) Jbond: [C: +2] Add Typing: And fix other minor lint issues [software/puppet-compiler] - https://gerrit.wikimedia.org/r/738452 (owner: Jbond) [11:28:24] (PS1) Jcrespo: mediabackups: Backup s2 wikis, starting with bgwiki [puppet] - https://gerrit.wikimedia.org/r/740124 (https://phabricator.wikimedia.org/T262668) [11:30:13] (CR) JMeybohm: [V: +2 C: +2] Ensure only certificates from this package are bundled [debs/wmf-certificates] - https://gerrit.wikimedia.org/r/740119 (owner: JMeybohm) [11:30:20] (CR) JMeybohm: [V: +2 C: +2] Bump debian/changelog [debs/wmf-certificates] - https://gerrit.wikimedia.org/r/740120 (owner: JMeybohm) [11:33:44] (CR) Volans: New cookbook to reboot a VM on the Ganeti level (1 
comment) [cookbooks] - https://gerrit.wikimedia.org/r/740104 (https://phabricator.wikimedia.org/T284811) (owner: Muehlenhoff) [11:34:15] (PS1) Jgiannelos: tegola-vector-tiles: Bump image version [deployment-charts] - https://gerrit.wikimedia.org/r/740127 [11:34:21] (CR) Jcrespo: [C: +2] mediabackups: Backup s2 wikis, starting with bgwiki [puppet] - https://gerrit.wikimedia.org/r/740124 (https://phabricator.wikimedia.org/T262668) (owner: Jcrespo) [11:34:27] PROBLEM - Check systemd state on wtp1039 is CRITICAL: CRITICAL - degraded: The following units failed: cadvisor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:36:25] !log imported wmf-certificates 0~20211119-1 to stretch-wikimedia,buster-wikimedia,bullseye-wikimedia [11:36:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:36:30] elukey: ^ [11:36:49] (PS7) Muehlenhoff: New cookbook to reboot a VM on the Ganeti level [cookbooks] - https://gerrit.wikimedia.org/r/740104 (https://phabricator.wikimedia.org/T284811) [11:37:44] nice :) [11:38:17] I'll follow up with John to see if we can default to use the wmf bundle provided in the package via profile::base::certificates [11:38:40] we have also to roll out the new package fleetwide :D [11:40:05] (PS1) Btullis: Update the way that the unavailable druid segment alert works [alerts] - https://gerrit.wikimedia.org/r/740128 (https://phabricator.wikimedia.org/T293399) [11:40:33] RECOVERY - Check systemd state on wtp1039 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:40:52] (CR) Btullis: "I think that this should address the recent druid alerts that we've been experiencing." 
[alerts] - https://gerrit.wikimedia.org/r/740128 (https://phabricator.wikimedia.org/T293399) (owner: Btullis) [11:41:51] RECOVERY - Widespread puppet agent failures- no resources reported on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.005082 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [11:42:55] (CR) Muehlenhoff: New cookbook to reboot a VM on the Ganeti level (3 comments) [cookbooks] - https://gerrit.wikimedia.org/r/740104 (https://phabricator.wikimedia.org/T284811) (owner: Muehlenhoff) [11:43:14] (CR) Muehlenhoff: "check" [cookbooks] - https://gerrit.wikimedia.org/r/740104 (https://phabricator.wikimedia.org/T284811) (owner: Muehlenhoff) [11:43:45] PROBLEM - Check for large files in client bucket on wtp1028 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.0.242: Connection reset by peer https://wikitech.wikimedia.org/wiki/Puppet%23check_client_bucket_large_file [11:45:56] (PS1) Arturo Borrero Gonzalez: cloud: cinder-backup: refresh hiera config for eqiad1 [puppet] - https://gerrit.wikimedia.org/r/740131 (https://phabricator.wikimedia.org/T295584) [11:48:31] (CR) Btullis: [C: +1] "LGTM. Thanks." 
[puppet] - https://gerrit.wikimedia.org/r/739477 (owner: Elukey) [11:49:03] (CR) jerkins-bot: [V: -1] New cookbook to reboot a VM on the Ganeti level [cookbooks] - https://gerrit.wikimedia.org/r/740104 (https://phabricator.wikimedia.org/T284811) (owner: Muehlenhoff) [11:49:30] (PS2) Arturo Borrero Gonzalez: cloud: cinder-backup: refresh hiera defaults for eqiad1 [puppet] - https://gerrit.wikimedia.org/r/740131 (https://phabricator.wikimedia.org/T295584) [11:50:50] (CR) David Caro: [C: -1] "There's an issue when loading the config, the type casting comment, otherwise LGTM" [software/puppet-compiler] - https://gerrit.wikimedia.org/r/738452 (owner: Jbond) [11:53:03] (CR) Jgiannelos: [C: +2] tegola-vector-tiles: Bump image version [deployment-charts] - https://gerrit.wikimedia.org/r/740127 (owner: Jgiannelos) [11:54:22] (PS10) Jbond: Add Typing: And fix other minor lint issues [software/puppet-compiler] - https://gerrit.wikimedia.org/r/738452 [11:54:27] !log roll-restarting cassandra on eqiad maps for java updates [11:54:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:54:41] (CR) Arturo Borrero Gonzalez: [V: +2 C: +2] "PCC as expected: https://puppet-compiler.wmflabs.org/compiler1003/32508/" [puppet] - https://gerrit.wikimedia.org/r/740131 (https://phabricator.wikimedia.org/T295584) (owner: Arturo Borrero Gonzalez) [11:54:43] (CR) Jbond: "updated" [software/puppet-compiler] - https://gerrit.wikimedia.org/r/738452 (owner: Jbond) [11:58:19] (Merged) jenkins-bot: tegola-vector-tiles: Bump image version [deployment-charts] - https://gerrit.wikimedia.org/r/740127 (owner: Jgiannelos) [11:59:38] elukey: I can do the roll out after lunch [12:06:49] !log jgiannelos@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'tegola-vector-tiles' for release 'main' . 
[12:06:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:07:40] (CR) Cathal Mooney: [C: +2] Changing glob pattern for partman receipe for rpki VMs [puppet] - https://gerrit.wikimedia.org/r/739611 (https://phabricator.wikimedia.org/T292503) (owner: Cathal Mooney) [12:10:43] RECOVERY - Check for large files in client bucket on wtp1028 is OK: OK: client bucket file ok https://wikitech.wikimedia.org/wiki/Puppet%23check_client_bucket_large_file [12:10:59] !log jgiannelos@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'tegola-vector-tiles' for release 'main' . [12:11:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:18:20] (CR) Btullis: "All looks good, but I had one suggestion about whether we might add a safety feature." [puppet] - https://gerrit.wikimedia.org/r/739872 (https://phabricator.wikimedia.org/T295897) (owner: Hnowlan) [12:20:17] (CR) Btullis: cassandra: load grants files upon change (1 comment) [puppet] - https://gerrit.wikimedia.org/r/739872 (https://phabricator.wikimedia.org/T295897) (owner: Hnowlan) [12:39:30] SRE, SRE-Access-Requests, Wikibase Release Strategy, Wikidata, wdwb-tech: Requesting access to releasers-wikibase for rosalie-WMDE - https://phabricator.wikimedia.org/T295765 (Jelto) p:Triage→Medium [12:41:07] SRE, SRE-Access-Requests, Wikibase Release Strategy, Wikidata, wdwb-tech: Requesting access to releasers-wikibase for rosalie-WMDE - https://phabricator.wikimedia.org/T295765 (Jelto) [12:41:35] SRE, SRE-Access-Requests, Wikibase Release Strategy, Wikidata, wdwb-tech: Requesting access to releasers-wikibase for rosalie-WMDE - https://phabricator.wikimedia.org/T295765 (Jelto) @Rosalie_WMDE could you take a look at the [L3 agreement](https://phabricator.wikimedia.org/L3) and sign it?... 
[12:42:13] (CR) Filippo Giunchedi: [C: +1] profile::rsyslog: move Kafka TLS CA settings to the new bundle [puppet] - https://gerrit.wikimedia.org/r/739463 (https://phabricator.wikimedia.org/T291905) (owner: Elukey) [12:42:28] (CR) Filippo Giunchedi: [C: +1] role::elasticsearch::relforge: ship ES logs via gelf_relay [puppet] - https://gerrit.wikimedia.org/r/739325 (https://phabricator.wikimedia.org/T288620) (owner: Herron) [12:42:47] (CR) Filippo Giunchedi: [C: +1] role::elasticsearch::cloudelastic: ship ES logs via gelf_relay [puppet] - https://gerrit.wikimedia.org/r/739324 (https://phabricator.wikimedia.org/T288620) (owner: Herron) [12:43:06] (CR) Filippo Giunchedi: [C: +1] profile::thanos::swift: add account for research datasets poc [puppet] - https://gerrit.wikimedia.org/r/737913 (https://phabricator.wikimedia.org/T294380) (owner: MVernon) [12:46:25] SRE, SRE-Access-Requests, Security-Team, SecTeam-Processed: Add Manfredi Martorana to deployment and analytics-privatedata-users groups - https://phabricator.wikimedia.org/T295790 (Jelto) [12:50:46] SRE, SRE-Access-Requests, Security-Team, SecTeam-Processed: Add Manfredi Martorana to deployment and analytics-privatedata-users groups - https://phabricator.wikimedia.org/T295790 (Jelto) Thanks for completing the access request. We need additional approval from @JBennett (as your manager) @Ot... 
[12:52:19] (PS1) Arturo Borrero Gonzalez: cloudcontrol: enable RW LDAP servers [puppet] - https://gerrit.wikimedia.org/r/740138 (https://phabricator.wikimedia.org/T296076) [12:53:09] (PS2) Arturo Borrero Gonzalez: cloudcontrol: enable RW LDAP servers [puppet] - https://gerrit.wikimedia.org/r/740138 (https://phabricator.wikimedia.org/T296076) [12:53:17] (CR) Majavah: [C: +1] cloudcontrol: enable RW LDAP servers [puppet] - https://gerrit.wikimedia.org/r/740138 (https://phabricator.wikimedia.org/T296076) (owner: Arturo Borrero Gonzalez) [12:56:28] (PS3) Arturo Borrero Gonzalez: cloudcontrol: enable RW LDAP servers [puppet] - https://gerrit.wikimedia.org/r/740138 (https://phabricator.wikimedia.org/T296076) [12:57:13] (CR) Arturo Borrero Gonzalez: "PS1 was a PCC NOOP: https://puppet-compiler.wmflabs.org/compiler1003/32509/" [puppet] - https://gerrit.wikimedia.org/r/740138 (https://phabricator.wikimedia.org/T296076) (owner: Arturo Borrero Gonzalez) [12:59:04] (CR) Arturo Borrero Gonzalez: [V: +2 C: +1] "PCC: https://puppet-compiler.wmflabs.org/compiler1003/32509/" [puppet] - https://gerrit.wikimedia.org/r/740138 (https://phabricator.wikimedia.org/T296076) (owner: Arturo Borrero Gonzalez) [13:00:44] (CR) Arturo Borrero Gonzalez: [V: +2 C: +1] cloudcontrol: enable RW LDAP servers (1 comment) [puppet] - https://gerrit.wikimedia.org/r/740138 (https://phabricator.wikimedia.org/T296076) (owner: Arturo Borrero Gonzalez) [13:02:29] !log jgiannelos@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'tegola-vector-tiles' for release 'main' . 
[13:02:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:08:07] (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/740138 (https://phabricator.wikimedia.org/T296076) (owner: Arturo Borrero Gonzalez) [13:12:01] (PS1) Jbond: WIP: test [puppet] - https://gerrit.wikimedia.org/r/740139 [13:13:01] (CR) Arturo Borrero Gonzalez: [V: +2 C: +2] cloudcontrol: enable RW LDAP servers [puppet] - https://gerrit.wikimedia.org/r/740138 (https://phabricator.wikimedia.org/T296076) (owner: Arturo Borrero Gonzalez) [13:13:10] (CR) Jbond: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32511/console" [puppet] - https://gerrit.wikimedia.org/r/740139 (owner: Jbond) [13:15:33] (PS2) Jbond: WIP: test [puppet] - https://gerrit.wikimedia.org/r/740139 [13:16:31] (PS3) Jbond: WIP: test [puppet] - https://gerrit.wikimedia.org/r/740139 [13:16:33] (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32512/console" [puppet] - https://gerrit.wikimedia.org/r/740139 (owner: Jbond) [13:17:20] (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32513/console" [puppet] - https://gerrit.wikimedia.org/r/740139 (owner: Jbond) [13:17:28] (PS1) Arturo Borrero Gonzalez: (DONT MERGE, PoC) puppetmaster: hiera: order site after role [puppet] - https://gerrit.wikimedia.org/r/740141 [13:17:57] (CR) Arturo Borrero Gonzalez: [C: -2] "don't merge, this patch is just a PCC." 
[puppet] - 10https://gerrit.wikimedia.org/r/740141 (owner: 10Arturo Borrero Gonzalez) [13:18:57] (03PS7) 10Mbch331: Add missing termbox codes from Wikibase [mediawiki-config] - 10https://gerrit.wikimedia.org/r/734722 (https://phabricator.wikimedia.org/T277836) [13:19:51] (03PS8) 10Mbch331: Add missing termbox codes from Wikibase [mediawiki-config] - 10https://gerrit.wikimedia.org/r/734722 (https://phabricator.wikimedia.org/T277836) [13:20:08] (03CR) 10Mbch331: Add missing termbox codes from Wikibase (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/734722 (https://phabricator.wikimedia.org/T277836) (owner: 10Mbch331) [13:21:52] (03PS2) 10Jbond: (DONT MERGE, PoC) puppetmaster: hiera: order site after role [puppet] - 10https://gerrit.wikimedia.org/r/740141 (owner: 10Arturo Borrero Gonzalez) [13:23:05] !log draining instances from ganeti-test2001 for reimage T284811 [13:23:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:09] T284811: Upgrade eqiad/codfw Ganeti clusters to Buster - https://phabricator.wikimedia.org/T284811 [13:25:10] 10SRE, 10SRE-Access-Requests, 10Wikibase Release Strategy, 10Wikidata, 10wdwb-tech: Requesting access to releasers-wikibase for rosalie-WMDE - https://phabricator.wikimedia.org/T295765 (10toan) >>! In T295765#7516506, @Jelto wrote: > @Rosalie_WMDE could you take a look at the [L3 agreement](https://phabr... 
[13:25:44] (03PS9) 10Mbch331: Add missing termbox codes from Wikibase [mediawiki-config] - 10https://gerrit.wikimedia.org/r/734722 (https://phabricator.wikimedia.org/T277836) [13:27:46] (03PS2) 10Jelto: admin: add wmde-fisch to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/739856 (https://phabricator.wikimedia.org/T295781) [13:56:42] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti-test2001.codfw.wmnet with OS buster [13:56:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:58] (03CR) 10JMeybohm: "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/739856 (https://phabricator.wikimedia.org/T295781) (owner: 10Jelto) [14:04:55] PROBLEM - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is CRITICAL: 135 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [14:09:17] RECOVERY - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is OK: (C)100 gt (W)50 gt 17 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [14:15:07] !log fleet wide updated wmf-certificates to 0~20211119-1 [14:15:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:31] elukey: ^ [14:19:18] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/739856 (https://phabricator.wikimedia.org/T295781) (owner: 10Jelto) [14:23:25] (03PS1) 10Vgutierrez: cache::haproxy: Bypass systemd's journal for logging [puppet] - 10https://gerrit.wikimedia.org/r/740172 (https://phabricator.wikimedia.org/T290005) [14:23:49] (03CR) 10Jelto: [C: 03+2] admin: add wmde-fisch to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/739856 
(https://phabricator.wikimedia.org/T295781) (owner: 10Jelto) [14:25:17] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32516/console" [puppet] - 10https://gerrit.wikimedia.org/r/740172 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [14:26:01] jayme: nice! [14:27:32] (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] cache::haproxy: Bypass systemd's journal for logging [puppet] - 10https://gerrit.wikimedia.org/r/740172 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [14:29:43] RECOVERY - DPKG on ganeti-test2001 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [14:30:00] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti-test2001.codfw.wmnet with OS buster [14:30:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:24] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users for Christoph Jauera - https://phabricator.wikimedia.org/T295781 (10Jelto) 05Open→03Resolved a:03Jelto @WMDE-Fisch you should have access soon to `analytics-privatedata-users` group. I'm closing this task... 
[14:32:20] 10SRE, 10SRE-Access-Requests, 10Wikibase Release Strategy, 10Wikidata, 10wdwb-tech: Requesting access to releasers-wikibase for rosalie-WMDE - https://phabricator.wikimedia.org/T295765 (10Jelto) [14:33:11] PROBLEM - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is CRITICAL: 312 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [14:35:39] PROBLEM - Check systemd state on wtp1046 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_intel_microcode.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:36:26] something is clearly going on there. NRPE fork() failing on multiple wtp hosts with out of memory [14:38:35] RECOVERY - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is OK: (C)100 gt (W)50 gt 24 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [14:38:37] memory usage in parsiod cluster is double from normal [14:39:35] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti-test2001.codfw.wmnet [14:39:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:41:29] PROBLEM - Check systemd state on wtp1046 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.48.164: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:44:24] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti-test2001.codfw.wmnet [14:44:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:46:50] 10SRE: Unify WMF internal CA certs bundle generation - https://phabricator.wikimedia.org/T296089 (10elukey) [14:46:54] jayme: --^ [14:47:46] we are having an incident [14:48:15] 
parsoid hosts have double the usual memory and various thing here and there timeout or get killed on that cluster [14:49:02] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Can't commit on asw-b-codfw - https://phabricator.wikimedia.org/T295118 (10Vgutierrez) @ayounsi both lvs2008 and lvs2009 are primary LVS, so lvs2010 would assume the load of both during asw1-b2-codfw reboot. Far from ideal b... [14:49:15] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti-test2001.codfw.wmnet to ganeti-test01.svc.codfw.wmnet [14:49:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:31] 10SRE, 10Wikimedia-Mailing-lists: Request to create new mailing lists for ZHAFC Project - https://phabricator.wikimedia.org/T294676 (10LClightcat) >>! 在T294676#7501524中,@LClightcat写道: >>>! 在T294676#7491438中,@Legoktm写道: >>>>! In T294676#7481421, @Jonathan5566 wrote: >>> To be clear, what kind of on-wiki dissoci... [14:51:11] RECOVERY - Host elastic2043 is UP: PING OK - Packet loss = 0%, RTA = 31.58 ms [14:51:36] (03PS1) 10Jelto: admin: add Essex Igyan Igyan to ldap_only section [puppet] - 10https://gerrit.wikimedia.org/r/740178 (https://phabricator.wikimedia.org/T295928) [14:52:10] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Grant Access to wmf for eigyan - https://phabricator.wikimedia.org/T295928 (10Jelto) p:05Triage→03Medium [14:52:33] RECOVERY - Host elastic2044 is UP: PING OK - Packet loss = 0%, RTA = 31.81 ms [14:52:40] (03CR) 10jerkins-bot: [V: 04-1] admin: add Essex Igyan Igyan to ldap_only section [puppet] - 10https://gerrit.wikimedia.org/r/740178 (https://phabricator.wikimedia.org/T295928) (owner: 10Jelto) [14:52:56] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti-test2001.codfw.wmnet to ganeti-test01.svc.codfw.wmnet [14:52:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:15] PROBLEM - Check whether ferm is 
active by checking the default input chain on elastic2044 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [14:58:35] (03PS1) 10Tks4Fish: kswiki: set 'wgTranslateNumerals' to false [mediawiki-config] - 10https://gerrit.wikimedia.org/r/740179 (https://phabricator.wikimedia.org/T296055) [15:02:27] PROBLEM - Check systemd state on wtp1046 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_intel_microcode.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:02:33] jayme: akosiaris: the timing suggests high parsoid memory usage may be related to this weeks mediawiki train, train was deployed to all wikis at ~20:40 yesterday and memory usage starts raising very shortly after that [15:04:00] 10SRE, 10Observability-Metrics, 10Traffic, 10Patch-For-Review, 10User-ema: varnishmtail metric loss due to mtail not reading from pipe fast enough - https://phabricator.wikimedia.org/T293879 (10ema) The situation has improved significantly, we are now processing [[ https://grafana.wikimedia.org/d/wiU3SdE... [15:04:08] majavah: the increase starts even earlier. you can tell by the cpu utilization heatmap, the network graph and cpu graph at https://grafana.wikimedia.org/d/000000607/cluster-overview?orgId=1&var-site=eqiad&var-cluster=parsoid&var-instance=All&var-datasource=thanos&from=now-2d&to=now [15:04:19] majavah: indeed. But we do also have an increased load already starting around 2021-11-18 15:00 [15:05:40] there is nothing though in that dataframe in SAL that would explain that [15:10:19] on 2021-11-18 13:30 we had ~500 process scheduled across the cluster. And by 16:30Z > 1000 [15:11:32] so, something was going on already and the train made it worse? [15:12:40] could it be that it only started to get really bad when deployed go group2? 
[15:12:43] *to [15:13:07] PROBLEM - Check systemd state on elastic2044 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:13:16] 10SRE, 10Platform Engineering: Technical advice on migrating content from Outreach-wiki to Meta-wiki - https://phabricator.wikimedia.org/T296091 (10CKoerner_WMF) [15:14:40] the charts don't really support that, though [15:15:45] I restart php-fpm on wtp1046 and memory usage hasn't spiked again [15:16:44] and from the slope's gradient, it looks like it's a small memory leak? [15:17:09] it takes a few hours to start causing issues [15:17:09] from https://grafana-rw.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?viewPanel=17&orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cluster=parsoid&var-method=GET&var-code=200&from=now-4d&to=now it looks like there also is an constantly increased GET rate [15:17:29] compared to the typical spiky nature [15:17:41] yes, definitely. At 13:50UTC [15:18:22] 10SRE, 10Platform Engineering: Technical advice on migrating content from Outreach-wiki to Meta-wiki - https://phabricator.wikimedia.org/T296091 (10RhinosF1) Non Sysadmin Volunteer but my only concern would be mediawiki timing out while processing it (either on the export or import). If this does happen, you c... [15:18:46] do we have a picture what's contained in the train changes? new wikidiff was recently rolled out, maybe the new deployment now enables some of that [15:19:01] akosiaris: how would one filter that to check if there maybe is a pattern or so? [15:19:04] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM!" 
[homer/public] - 10https://gerrit.wikimedia.org/r/739879 (https://phabricator.wikimedia.org/T293729) (owner: 10AOkoth) [15:19:38] so, 1 quick fix to alleviate the pressure and stop having immediate issues, is to depool 1 host, leave it be for debugging and then restart php-fpm on all other hosts [15:19:40] https://gerrit.wikimedia.org/r/c/mediawiki/vendor/+/738988 is the parsoid diff for this train [15:23:14] RECOVERY - Check whether ferm is active by checking the default input chain on elastic2044 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [15:25:26] jayme: majavah: moritzm: Objections on the plan outlined above? [15:26:08] sounds fine [15:26:11] akosiaris: no, sounds good [15:26:28] ok randomly choosing wtp1025 for our guinea pig [15:27:08] let's pick 2 if we can. We had situations in the past where we wanted to confirm something [15:27:16] ok [15:27:22] sounds good to me [15:27:25] wtp1041, wt1025 [15:27:36] +1 [15:28:14] !log akosiaris@cumin1001 conftool action : set/pooled=no; selector: cluster=parsoid,name=wtp1041.eqiad.wmnet [15:28:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:28:23] !log akosiaris@cumin1001 conftool action : set/pooled=no; selector: cluster=parsoid,name=wtp1025.eqiad.wmnet [15:28:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:28:49] 10SRE, 10Platform Engineering: Technical advice on migrating content from Outreach-wiki to Meta-wiki - https://phabricator.wikimedia.org/T296091 (10CKoerner_WMF) [15:29:30] !log. Depooling wtp1041, wtp1025 from traffic. The entire of the parsoid cluster is in a memory pressure situation, it looks like a rolling restart of php-fpm will alleviate the pressure and gives us some time to drill more on the problem before the pressure builds up again. [15:29:44] !log depooling wtp1041, wtp1025 from traffic. 
The entire of the parsoid cluster is in a memory pressure situation, it looks like a rolling restart of php-fpm will alleviate the pressure and gives us some time to drill more on the problem before the pressure builds up again. [15:29:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:30:02] akosiaris: I can kick of /usr/local/sbin/restart-php7.2-fpm if you're not already on it [15:30:13] I am [15:30:17] ack [15:31:29] !log roll restart wtp10* php7.2-fpm excluding wtp1025, wtp1041 [15:31:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:34:10] that freed ca. 0.5T [15:35:24] it just finished, so probably even more [15:35:25] 10SRE, 10Analytics-Radar, 10Data-Engineering, 10Event-Platform, 10Patch-For-Review: Allow kafka clients to verify brokers hostnames when using SSL - https://phabricator.wikimedia.org/T291905 (10elukey) Let's wait for T296089 before proceeding :) [15:35:38] (03PS1) 104nn1l2: Enable SandboxLink on lawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/740186 (https://phabricator.wikimedia.org/T296073) [15:35:46] but load ist still pretty increased [15:36:06] yeah, I am thinking we are looking at 2 different things [15:36:29] 1 is the increased request rate that leads to increased load from 2021-11-18 13:50UTC [15:36:46] and another is the slow memory increase that is pretty reminiscent of a memory leak [15:37:17] that one started around 20:40 as majavah pointed out, so probably a result of the train ? [15:37:39] the mem leak ofc also being potentially catalyzed by the increased request rate [15:37:48] yup [15:37:58] do we know what caused the request rate increase? 
[15:38:04] not yet [15:38:23] (03PS1) 10Muehlenhoff: Show cluster name in conformation dialogue, not the master's name [cookbooks] - 10https://gerrit.wikimedia.org/r/740187 [15:38:49] the train went out to just group1 though [15:39:15] it looks like it started before the train (req increase) but that could be a red herring as we do have recurring periods of increased req rate [15:39:22] if that is the result of the train, group2 can cause a significant outage [15:40:09] but this (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211118T2000) says group2 [15:40:15] the train went to group1 at 20:30 and to group2 at 20:30 [15:40:20] https://versions.toolforge.org/ as well [15:40:28] there were some blockers which caused group1 promotion to be delayed [15:41:56] majavah: ah 20:43 for all wikis, yeah you are right [15:42:02] I had missed that line [15:42:26] ok, at least we aren't waiting for 1 timebomb [15:44:26] https://grafana.wikimedia.org/d/000000607/cluster-overview?viewPanel=86&orgId=1&from=now-7d&to=now&var-site=eqiad&var-cluster=appserver&var-instance=All&var-datasource=thanos hmm, [15:44:36] hmmm, so the appservers see a memory increase too [15:45:17] and while the 30d graph does have a saw pattern, we haven't seen that high memory usage in 30 days [15:45:46] nor 60 for that matter [15:47:39] but it does not have the incresed cpu pattern. That supports the theory of two separate issues [15:48:02] RECOVERY - Check systemd state on wtp1046 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:52:21] do we have a task for this? I guess not yet? 
[15:52:34] correct [15:53:20] yeah, filling one [15:59:55] (03CR) 10Jcrespo: [C: 03+2] ci: upgrade lintian-junit-report [puppet] - 10https://gerrit.wikimedia.org/r/739141 (https://phabricator.wikimedia.org/T295719) (owner: 10Hashar) [16:01:27] 💌 [16:01:57] no change on contint1001 [16:02:18] (03CR) 10Hnowlan: [V: 03+1] cassandra: load grants files upon change (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/739872 (https://phabricator.wikimedia.org/T295897) (owner: 10Hnowlan) [16:03:58] ah, I see, it is labs [16:09:21] https://phabricator.wikimedia.org/T296098 [16:09:33] thanks akosiaris! [16:13:48] jeena: ^ [16:14:01] (03PS1) 10Herron: logstash::gelf::input: remove hardcoded tags [puppet] - 10https://gerrit.wikimedia.org/r/740191 (https://phabricator.wikimedia.org/T288620) [16:14:25] akosiaris: it's edit private but not read [16:14:40] (03PS2) 10Herron: logstash::input::gelf: remove hardcoded tags [puppet] - 10https://gerrit.wikimedia.org/r/740191 (https://phabricator.wikimedia.org/T288620) [16:14:49] Actually not fully [16:14:53] It's edit TC [16:16:06] (03PS15) 10Jhernandez: Set up beta test environment for QuickSurveys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737503 (https://phabricator.wikimedia.org/T293798) (owner: 10EllenR) [16:16:22] 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.1 point update - https://phabricator.wikimedia.org/T292844 (10MoritzMuehlenhoff) [16:16:44] (03CR) 10Herron: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/740191 (https://phabricator.wikimedia.org/T288620) (owner: 10Herron) [16:16:48] 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.1 point update - https://phabricator.wikimedia.org/T292844 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff This is complete [16:17:02] (03CR) 10jerkins-bot: [V: 04-1] Set up beta test environment for QuickSurveys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737503 (https://phabricator.wikimedia.org/T293798) (owner: 
10EllenR) [16:20:11] (03PS1) 10Andrew Bogott: wmfkeystonehooks: don't rely on ldap.conf for ldap settings [puppet] - 10https://gerrit.wikimedia.org/r/740194 (https://phabricator.wikimedia.org/T296076) [16:20:12] !log hnowlan@cumin1001 START - Cookbook sre.cassandra.roll-restart for nodes matching A:restbase-codfw: Restarting to pick up Java security updates - hnowlan@cumin1001 [16:20:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:21:57] (03PS16) 10Jhernandez: Set up beta test environment for QuickSurveys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737503 (https://phabricator.wikimedia.org/T293798) (owner: 10EllenR) [16:22:24] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] "LGTM. Mbch331: do you want to add it to a deployment window? https://wikitech.wikimedia.org/wiki/Backport_windows" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/734722 (https://phabricator.wikimedia.org/T277836) (owner: 10Mbch331) [16:22:47] (03PS2) 10Andrew Bogott: wmfkeystonehooks: don't rely on ldap.conf for ldap settings [puppet] - 10https://gerrit.wikimedia.org/r/740194 (https://phabricator.wikimedia.org/T296076) [16:23:12] (03CR) 10jerkins-bot: [V: 04-1] Set up beta test environment for QuickSurveys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737503 (https://phabricator.wikimedia.org/T293798) (owner: 10EllenR) [16:24:53] (03PS3) 10Andrew Bogott: wmfkeystonehooks: don't rely on ldap.conf for ldap settings [puppet] - 10https://gerrit.wikimedia.org/r/740194 (https://phabricator.wikimedia.org/T296076) [16:26:35] (03CR) 10Andrew Bogott: "pcc diff: https://puppet-compiler.wmflabs.org/compiler1002/32519/cloudcontrol1003.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/740194 (https://phabricator.wikimedia.org/T296076) (owner: 10Andrew Bogott) [16:28:00] (03CR) 10Majavah: [C: 03+1] "looks good, let's test it" [puppet] - 10https://gerrit.wikimedia.org/r/740194 (https://phabricator.wikimedia.org/T296076) (owner: 
10Andrew Bogott) [16:28:30] (03CR) 10Muehlenhoff: [C: 03+1] "Not familiar with keystone, but the puppet part looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/740194 (https://phabricator.wikimedia.org/T296076) (owner: 10Andrew Bogott) [16:28:41] (03PS17) 10Jhernandez: Set up beta test environment for QuickSurveys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737503 (https://phabricator.wikimedia.org/T293798) (owner: 10EllenR) [16:29:31] (03PS1) 10Andrew Bogott: Revert "cloudcontrol: enable RW LDAP servers" [puppet] - 10https://gerrit.wikimedia.org/r/740196 (https://phabricator.wikimedia.org/T296076) [16:31:40] (03CR) 10Herron: "https://puppet-compiler.wmflabs.org/compiler1003/32520/elastic1040.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/740191 (https://phabricator.wikimedia.org/T288620) (owner: 10Herron) [16:32:21] (03PS4) 10Andrew Bogott: wmfkeystonehooks: don't rely on ldap.conf for ldap settings [puppet] - 10https://gerrit.wikimedia.org/r/740194 (https://phabricator.wikimedia.org/T296076) [16:32:23] (03PS2) 10Andrew Bogott: Revert "cloudcontrol: enable RW LDAP servers" [puppet] - 10https://gerrit.wikimedia.org/r/740196 (https://phabricator.wikimedia.org/T296076) [16:34:28] (03CR) 10Andrew Bogott: [C: 03+2] wmfkeystonehooks: don't rely on ldap.conf for ldap settings [puppet] - 10https://gerrit.wikimedia.org/r/740194 (https://phabricator.wikimedia.org/T296076) (owner: 10Andrew Bogott) [16:35:02] !log rolling back to group0 for T296098 [16:35:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:35:06] T296098: 1.38.0-wmf.9 seems to have introduced a memory leak - https://phabricator.wikimedia.org/T296098 [16:35:46] akosiaris, jayme: ^ [16:37:45] thanks [16:38:27] Hopefully it settles now [16:38:54] (03PS1) 10Thcipriani: Revert "group1 wikis to 1.38.0-wmf.9 refs T293950" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/740198 [16:40:42] thcipriani: content looks right but it should be 
group1&2 in commit msg [16:41:25] RhinosF1: thanks, I'll fix it up after sync [16:41:33] (03PS18) 10Jhernandez: Set up beta test environment for QuickSurveys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737503 (https://phabricator.wikimedia.org/T293798) (owner: 10EllenR) [16:42:14] (03CR) 10Andrew Bogott: [C: 03+2] Revert "cloudcontrol: enable RW LDAP servers" [puppet] - 10https://gerrit.wikimedia.org/r/740196 (https://phabricator.wikimedia.org/T296076) (owner: 10Andrew Bogott) [16:42:41] !log thcipriani@deploy1002 rebuilt and synchronized wikiversions files: Revert "group1 wikis to 1.38.0-wmf.9 refs T293950 T296098" [16:42:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:42:46] T293950: 1.38.0-wmf.9 deployment blockers - https://phabricator.wikimedia.org/T293950 [16:42:46] T296098: 1.38.0-wmf.9 seems to have introduced a memory leak - https://phabricator.wikimedia.org/T296098 [16:43:33] (03PS2) 10Thcipriani: Revert "group1 and all wikis to 1.38.0-wmf.9 refs T293950" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/740198 [16:43:36] (03CR) 10Thcipriani: [C: 03+2] Revert "group1 and all wikis to 1.38.0-wmf.9 refs T293950" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/740198 (owner: 10Thcipriani) [16:44:30] (03Merged) 10jenkins-bot: Revert "group1 and all wikis to 1.38.0-wmf.9 refs T293950" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/740198 (owner: 10Thcipriani) [16:50:21] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] "LGTM now, though I’d like for another deployer to have a look at the array merging construction." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737503 (https://phabricator.wikimedia.org/T293798) (owner: 10EllenR) [16:50:58] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[16:50:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:51:14] (03CR) 10Krinkle: Set up beta test environment for QuickSurveys (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737503 (https://phabricator.wikimedia.org/T293798) (owner: 10EllenR) [16:53:58] (03PS1) 10Andrew Bogott: Add LDAP_RW_URI to horizon local_settings.py [puppet] - 10https://gerrit.wikimedia.org/r/740205 [16:54:50] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [16:54:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:56:24] (03CR) 10Majavah: Add LDAP_RW_URI to horizon local_settings.py (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/740205 (owner: 10Andrew Bogott) [16:58:04] (03PS2) 10Andrew Bogott: Add LDAP_RW_URI to horizon local_settings.py [puppet] - 10https://gerrit.wikimedia.org/r/740205 [16:59:24] (03CR) 10Andrew Bogott: "https://puppet-compiler.wmflabs.org/compiler1003/32524/labweb1001.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/740205 (owner: 10Andrew Bogott) [17:01:58] (03PS3) 10Andrew Bogott: Add LDAP_RW_URI to horizon local_settings.py [puppet] - 10https://gerrit.wikimedia.org/r/740205 [17:02:22] (03CR) 10Andrew Bogott: Add LDAP_RW_URI to horizon local_settings.py (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/740205 (owner: 10Andrew Bogott) [17:07:17] (03CR) 10Andrew Bogott: [C: 03+2] Add LDAP_RW_URI to horizon local_settings.py [puppet] - 10https://gerrit.wikimedia.org/r/740205 (owner: 10Andrew Bogott) [17:10:25] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [17:10:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:12:15] (03PS1) 10Andrew Bogott: Horizon local_settings.py: python needs quotes around strings! 
[puppet] - 10https://gerrit.wikimedia.org/r/740207 [17:13:47] (03PS1) 10Herron: thanos: add recording rules for varnish SLO [puppet] - 10https://gerrit.wikimedia.org/r/740209 (https://phabricator.wikimedia.org/T289615) [17:14:31] (03CR) 10Andrew Bogott: [C: 03+2] Horizon local_settings.py: python needs quotes around strings! [puppet] - 10https://gerrit.wikimedia.org/r/740207 (owner: 10Andrew Bogott) [17:17:06] (03PS1) 10Herron: thanos: add experimental varnish multiwindow recording rules [puppet] - 10https://gerrit.wikimedia.org/r/740211 [17:19:20] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [17:19:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:21:17] !log andrew@deploy1002 Started deploy [horizon/deploy@ee83e27]: fixing sudo rule editing [17:21:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:24:05] (03CR) 10EllenR: Set up beta test environment for QuickSurveys (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737503 (https://phabricator.wikimedia.org/T293798) (owner: 10EllenR) [17:25:27] !log andrew@deploy1002 Finished deploy [horizon/deploy@ee83e27]: fixing sudo rule editing (duration: 04m 10s) [17:25:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:34:49] (03CR) 10Dzahn: "hmm. 
06:52:34 FAIL:5623 no new line character at the end of file" [puppet] - 10https://gerrit.wikimedia.org/r/740178 (https://phabricator.wikimedia.org/T295928) (owner: 10Jelto) [17:36:03] (03PS2) 10Dzahn: admin: add Essex Igyan Igyan to ldap_only section [puppet] - 10https://gerrit.wikimedia.org/r/740178 (https://phabricator.wikimedia.org/T295928) (owner: 10Jelto) [17:37:03] (03CR) 10jerkins-bot: [V: 04-1] admin: add Essex Igyan Igyan to ldap_only section [puppet] - 10https://gerrit.wikimedia.org/r/740178 (https://phabricator.wikimedia.org/T295928) (owner: 10Jelto) [17:39:46] (03PS3) 10Dzahn: admin: add Essex Igyan Igyan to ldap_only section [puppet] - 10https://gerrit.wikimedia.org/r/740178 (https://phabricator.wikimedia.org/T295928) (owner: 10Jelto) [17:40:15] (03CR) 10Dzahn: "I think the duplicate name is that bug in Namely" [puppet] - 10https://gerrit.wikimedia.org/r/740178 (https://phabricator.wikimedia.org/T295928) (owner: 10Jelto) [17:40:35] (03CR) 10jerkins-bot: [V: 04-1] admin: add Essex Igyan Igyan to ldap_only section [puppet] - 10https://gerrit.wikimedia.org/r/740178 (https://phabricator.wikimedia.org/T295928) (owner: 10Jelto) [17:40:43] jerk-ins [17:41:26] (03CR) 10Dzahn: "FAIL:5624 too many blank lines (1 > 0), lol" [puppet] - 10https://gerrit.wikimedia.org/r/740178 (https://phabricator.wikimedia.org/T295928) (owner: 10Jelto) [17:41:35] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [17:41:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:42:24] (03PS4) 10Dzahn: admin: add Essex Igyan Igyan to ldap_only section [puppet] - 10https://gerrit.wikimedia.org/r/740178 (https://phabricator.wikimedia.org/T295928) (owner: 10Jelto) [17:45:59] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:46:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:48:28] (03CR) 10Dzahn: [C: 04-1] "made jenkins happy but I can't find a user "eigyan" in LDAP yet. 
I'll ask the user on ticket if they have actually created their Wikitech " [puppet] - 10https://gerrit.wikimedia.org/r/740178 (https://phabricator.wikimedia.org/T295928) (owner: 10Jelto) [17:49:48] (03CR) 10Dzahn: [C: 04-1] "ah, found. their uid is essexigyan, amending" [puppet] - 10https://gerrit.wikimedia.org/r/740178 (https://phabricator.wikimedia.org/T295928) (owner: 10Jelto) [17:50:03] (03CR) 10Krinkle: Set up beta test environment for QuickSurveys (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737503 (https://phabricator.wikimedia.org/T293798) (owner: 10EllenR) [17:50:19] (03PS19) 10Krinkle: Set up beta test environment for QuickSurveys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737503 (https://phabricator.wikimedia.org/T293798) (owner: 10EllenR) [17:52:06] (03PS5) 10Dzahn: admin: add Essex Igyan to ldap_only section [puppet] - 10https://gerrit.wikimedia.org/r/740178 (https://phabricator.wikimedia.org/T295928) (owner: 10Jelto) [17:53:41] (03PS3) 10Majavah: dynamicproxy: add tls to api [puppet] - 10https://gerrit.wikimedia.org/r/738220 [17:53:43] 10SRE, 10ops-eqiad, 10serviceops-radar: mw1448.mgmt alert - https://phabricator.wikimedia.org/T296041 (10Dzahn) p:05Triage→03Low Thanks! Not High prio but likely fixed quickly with a new cable or even just reseating it. [17:54:37] expired acknowledge? [17:54:54] yeah [17:55:04] should we just mark it as resolved in VO? 
[17:55:11] +1 [17:55:40] +1 [17:55:40] {{done}} [17:55:42] ty [17:55:51] (03CR) 10Andrew Bogott: [C: 03+2] dynamicproxy: add tls to api [puppet] - 10https://gerrit.wikimedia.org/r/738220 (owner: 10Majavah) [18:03:15] (03PS1) 10Majavah: dynamicproxy: add keystone authentication [puppet] - 10https://gerrit.wikimedia.org/r/740226 (https://phabricator.wikimedia.org/T295234) [18:03:35] (03PS2) 10Majavah: dynamicproxy: add keystone token verification [puppet] - 10https://gerrit.wikimedia.org/r/740226 (https://phabricator.wikimedia.org/T295234) [18:04:09] (03PS3) 10Majavah: dynamicproxy: add keystone token verification [puppet] - 10https://gerrit.wikimedia.org/r/740226 (https://phabricator.wikimedia.org/T295234) [18:06:26] !log andrew@deploy1002 Started deploy [horizon/deploy@ba16257]: moving the proxy endpoint behind keystone [18:06:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:09:27] (03PS1) 10Mforns: analytics:refinery:job:druid_load: Reduce shard size for netflow_sanitized [puppet] - 10https://gerrit.wikimedia.org/r/740233 [18:10:46] !log andrew@deploy1002 Finished deploy [horizon/deploy@ba16257]: moving the proxy endpoint behind keystone (duration: 04m 19s) [18:10:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:19:03] (03PS1) 10Nray: Fix banners to show CentralNotice [skins/MinervaNeue] (wmf/1.38.0-wmf.9) - 10https://gerrit.wikimedia.org/r/740246 (https://phabricator.wikimedia.org/T296077) [18:29:06] (03CR) 10Ryan Kemper: [C: 03+1] role::elasticsearch::cloudelastic: ship ES logs via gelf_relay [puppet] - 10https://gerrit.wikimedia.org/r/739324 (https://phabricator.wikimedia.org/T288620) (owner: 10Herron) [18:29:11] (03CR) 10Ryan Kemper: [C: 03+1] role::elasticsearch::relforge: ship ES logs via gelf_relay [puppet] - 10https://gerrit.wikimedia.org/r/739325 (https://phabricator.wikimedia.org/T288620) (owner: 10Herron) [18:31:30] (03CR) 10Herron: [C: 03+2] role::elasticsearch::cloudelastic: ship ES 
logs via gelf_relay [puppet] - 10https://gerrit.wikimedia.org/r/739324 (https://phabricator.wikimedia.org/T288620) (owner: 10Herron) [18:32:29] (03CR) 10Herron: [C: 03+2] role::elasticsearch::relforge: ship ES logs via gelf_relay [puppet] - 10https://gerrit.wikimedia.org/r/739325 (https://phabricator.wikimedia.org/T288620) (owner: 10Herron) [18:32:31] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "adjusted the name, that is indeed the bug in Namely. https://wikimedia.namely.com/people/8cbc4a31-bfc0-4142-b0b7-3ff33a1febe0/show/teams-a" [puppet] - 10https://gerrit.wikimedia.org/r/740178 (https://phabricator.wikimedia.org/T295928) (owner: 10Jelto) [18:33:22] (03PS1) 10MSantos: maps: script to send zoom level expiration events [puppet] - 10https://gerrit.wikimedia.org/r/740236 [18:34:00] (03CR) 10Dzahn: "P.S./nitpick: nice to have if we use the same topic branch for all the access requests" [puppet] - 10https://gerrit.wikimedia.org/r/740178 (https://phabricator.wikimedia.org/T295928) (owner: 10Jelto) [18:34:42] hi 👋 question about a UBN -- should https://gerrit.wikimedia.org/r/c/mediawiki/skins/MinervaNeue/+/740246 get backported today? [18:36:24] (03CR) 10Dzahn: "added to 'wmf' in LDAP" [puppet] - 10https://gerrit.wikimedia.org/r/740178 (https://phabricator.wikimedia.org/T295928) (owner: 10Jelto) [18:43:01] (03CR) 10jerkins-bot: [V: 04-1] Fix banners to show CentralNotice [skins/MinervaNeue] (wmf/1.38.0-wmf.9) - 10https://gerrit.wikimedia.org/r/740246 (https://phabricator.wikimedia.org/T296077) (owner: 10Nray) [18:43:27] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Grant Access to wmf for eigyan - https://phabricator.wikimedia.org/T295928 (10Dzahn) Hi @eigyan welcome to WMF :) At first I could not find your user name "Eigyan" in LDAP (that's the account you created on Wikitech wiki and the one we needed for the addi... 
[18:44:20] cjming: if it's an UBN, it generally can be [18:44:36] but since it's Friday, you need to get an approval first [18:46:00] to me (note it's not authoritative, only my opinion), the change looks small enough, but it also fails CI for some reason 🙂 [18:46:43] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Grant Access to wmf for eigyan - https://phabricator.wikimedia.org/T295928 (10Dzahn) 05Open→03Resolved a:03Dzahn P.S. re: "Do you currently have shell access (Yes/No)? Not sure". The answer is currently you do not have shell access in production. But y... [18:48:04] (03CR) 10Nray: "recheck" [skins/MinervaNeue] (wmf/1.38.0-wmf.9) - 10https://gerrit.wikimedia.org/r/740246 (https://phabricator.wikimedia.org/T296077) (owner: 10Nray) [18:48:30] (also it appears to be wmf.9 only -- and since it's unlikely train will roll forward before monday, I don't think it should be treated in the UBN mode) [18:48:32] hope this helps, cjming [18:49:57] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for Daimona - https://phabricator.wikimedia.org/T295993 (10Dzahn) Hi @Daimona If you are already in the "nda" group then I am not sure you are getting much extra out of the "wmf" group. All cases I recall have had code like "wmf or ops or nda" or similar. Is... [18:50:41] (03CR) 10Andrew Bogott: [C: 03+2] dynamicproxy: add keystone token verification [puppet] - 10https://gerrit.wikimedia.org/r/740226 (https://phabricator.wikimedia.org/T295234) (owner: 10Majavah) [18:54:54] (03CR) 10Cwhite: [C: 03+1] "Looks good, thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/740191 (https://phabricator.wikimedia.org/T288620) (owner: 10Herron) [18:54:57] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:restbase-codfw: Restarting to pick up Java security updates - hnowlan@cumin1001 [18:54:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:57:47] urbanecm: thanks for the reply -- so it doesn't need to go out today (assuming it eventually passes CI) and i should just schedule it for backport on Monday? [18:59:18] cjming: in my opinion, yes (as I don't think group0 breakage fits the definition of an emergency). [19:07:11] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for Daimona - https://phabricator.wikimedia.org/T295993 (10Dzahn) @Daimona What do you prefer? Would you like to have one volunteer user and one work user? Or just convert the existing user? Where convert basically just means replacing the email address with y... 
[19:21:36] PROBLEM - MariaDB Replica Lag: s1 on db2141 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1287.79 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [19:21:42] PROBLEM - MariaDB Replica Lag: s8 on db1171 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1283.35 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [19:22:36] (03CR) 10Herron: [C: 03+2] logstash::input::gelf: remove hardcoded tags [puppet] - 10https://gerrit.wikimedia.org/r/740191 (https://phabricator.wikimedia.org/T288620) (owner: 10Herron) [19:27:02] (03PS1) 10Dzahn: conftool-data: remove mw2280 [puppet] - 10https://gerrit.wikimedia.org/r/740240 (https://phabricator.wikimedia.org/T290708) [19:28:56] (03CR) 10Dzahn: "[cumin1001:~] $ sudo -i confctl select name=mw2280.codfw.wmnet get" [puppet] - 10https://gerrit.wikimedia.org/r/740240 (https://phabricator.wikimedia.org/T290708) (owner: 10Dzahn) [19:30:20] (03PS2) 10Dzahn: site/conftool-data: remove mw2280 [puppet] - 10https://gerrit.wikimedia.org/r/740240 (https://phabricator.wikimedia.org/T290708) [19:30:39] (03CR) 10Jgiannelos: [C: 03+1] "Overall it looks OK. I added a nit comment just avoid issues in the future. I don't think its a blocker though." 
[puppet] - 10https://gerrit.wikimedia.org/r/740236 (owner: 10MSantos) [19:33:38] (03PS2) 10Dzahn: cache::text: remove config for scholarships.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/739660 (https://phabricator.wikimedia.org/T243037) [19:34:10] (03CR) 10Dzahn: "already removed from ATS a while ago" [puppet] - 10https://gerrit.wikimedia.org/r/739660 (https://phabricator.wikimedia.org/T243037) (owner: 10Dzahn) [19:34:35] (03CR) 10Dzahn: "Host scholarships.wikimedia.org not found: 3(NXDOMAIN)" [puppet] - 10https://gerrit.wikimedia.org/r/739660 (https://phabricator.wikimedia.org/T243037) (owner: 10Dzahn) [19:35:03] (03PS1) 10Papaul: Add kubernetes2018 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/740241 (https://phabricator.wikimedia.org/T294299) [19:35:20] 10SRE, 10ops-codfw, 10DC-Ops, 10Kubernetes, 10Patch-For-Review: Q2:(Need By: TBD) rack/setup/install kubernetes2018 - https://phabricator.wikimedia.org/T294299 (10Papaul) [19:35:27] (03CR) 10Dzahn: [C: 03+2] gitlab-runners: move profile::gitlab::runner::docker_volume: true to repo [puppet] - 10https://gerrit.wikimedia.org/r/739366 (owner: 10Dzahn) [19:35:34] (03PS2) 10Dzahn: gitlab-runners: move profile::gitlab::runner::docker_volume: true to repo [puppet] - 10https://gerrit.wikimedia.org/r/739366 [19:36:13] (03CR) 10Papaul: [C: 03+2] Add kubernetes2018 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/740241 (https://phabricator.wikimedia.org/T294299) (owner: 10Papaul) [19:37:25] (03CR) 10Legoktm: [C: 03+1] site/conftool-data: remove mw2280 [puppet] - 10https://gerrit.wikimedia.org/r/740240 (https://phabricator.wikimedia.org/T290708) (owner: 10Dzahn) [19:40:01] (03PS3) 10Dzahn: site/conftool-data: remove mw2280 [puppet] - 10https://gerrit.wikimedia.org/r/740240 (https://phabricator.wikimedia.org/T290708) [19:41:01] (03CR) 10Dzahn: "thanks, I need to do the decom script first though, almost forgot. 
https://wikitech.wikimedia.org/wiki/Server_Lifecycle#Failed_-%3E_Decomm" [puppet] - 10https://gerrit.wikimedia.org/r/740240 (https://phabricator.wikimedia.org/T290708) (owner: 10Dzahn) [19:42:27] (03CR) 10Dzahn: "ah, it's actually already gone from Icinga..well..then :)" [puppet] - 10https://gerrit.wikimedia.org/r/740240 (https://phabricator.wikimedia.org/T290708) (owner: 10Dzahn) [19:45:18] !log dzahn@cumin1001 START - Cookbook sre.hosts.decommission for hosts mw2280.codfw.wmnet [19:45:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:49:13] 10SRE, 10ops-codfw, 10serviceops, 10Patch-For-Review: decom mw2280 (was: mw2280 unresponsive to powercycle and hardreset) - https://phabricator.wikimedia.org/T290708 (10Dzahn) server was in zombie state. somehow already removed from puppetdb and icinga but still "found physical host" when I ran the decom c... [19:50:12] 10SRE, 10ops-codfw, 10serviceops, 10Patch-For-Review: decom mw2280 (was: mw2280 unresponsive to powercycle and hardreset) - https://phabricator.wikimedia.org/T290708 (10Dzahn) ` Disable and reset vlan on asw-d3-codfw:ge-3/0/9 for local eno1 Delete IP 10.192.48.102/22 on eno1 Delete IP 2620:0:860:104:10:192... 
[19:51:22] !log shutting down undead server mw2280 - not icinga and puppetdb but in debmonitor and still has IP and puppet cert [19:51:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:52:27] undead :p [19:52:41] kind of dead but not really if you ask twice :) [19:53:11] not sure how but that was partial decom [19:53:35] still running the cookbook and it had _some_ things to do but did not find it in some other places [19:53:39] for the thumbor decom I split the removal from conftool and site.pp into two steps, did the conftool removal first, then ran the decom cookbook, it didn't complain about references to the server still being in puppet, and then afterward removed from site.pp [19:54:23] ah, *nod*, I guess for conftool both work equally, as long as it's already "inactive" [19:54:34] just cant remove it from site.pp _before_ running cookbook, yea [19:54:39] mhm [19:54:44] and that way you can't do it 100% right [19:54:56] because if you remove from site.pp it wont be found [19:55:05] and if you dont it warns you "omg, it's still here" [19:55:12] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host kubernetes2018.codfw.wmnet with OS stretch [19:55:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:55:19] 10SRE, 10ops-codfw, 10DC-Ops, 10Kubernetes, 10Patch-For-Review: Q2:(Need By: TBD) rack/setup/install kubernetes2018 - https://phabricator.wikimedia.org/T294299 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host kubernetes2018.codfw.wmnet with OS stretch [19:55:21] so you just have to tell it "i know" either way [19:55:41] imho it should ignore just site.pp when it does the check.. [19:56:34] now it's at the DNS removal step. 
but since it was "inactive" it's not in pybal anymore [19:57:49] and no more DHCP change needed is still new to me [20:00:22] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts mw2280.codfw.wmnet [20:00:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:00:29] 10SRE, 10ops-codfw, 10serviceops, 10Patch-For-Review: decom mw2280 (was: mw2280 unresponsive to powercycle and hardreset) - https://phabricator.wikimedia.org/T290708 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `mw2280.codfw.wmnet` - mw2280.codfw.wmnet (**F... [20:01:12] (03CR) 10Dzahn: [C: 03+2] "Host mw2280.codfw.wmnet not found: 3(NXDOMAIN)" [puppet] - 10https://gerrit.wikimedia.org/r/740240 (https://phabricator.wikimedia.org/T290708) (owner: 10Dzahn) [20:03:41] 10SRE, 10ops-codfw, 10serviceops, 10Patch-For-Review: decom mw2280 (was: mw2280 unresponsive to powercycle and hardreset) - https://phabricator.wikimedia.org/T290708 (10Dzahn) As you can see above I ran the decom cookbook. Some things were somehow already removed. Others were not. It removed it from debmon... [20:05:33] !log legoktm@cumin1001 START - Cookbook sre.hosts.decommission for hosts thumbor2002.codfw.wmnet [20:05:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:07:19] 10SRE, 10ops-codfw, 10serviceops, 10Patch-For-Review: decom mw2280 (was: mw2280 unresponsive to powercycle and hardreset) - https://phabricator.wikimedia.org/T290708 (10Dzahn) @Papaul This host can be recycled ;) [20:19:41] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for eigyan - https://phabricator.wikimedia.org/T295928 (10Aklapper) @Dzahn: Does that mean @eigyan should also be added as a member to #WMF-NDA per [instructions](https://wikitech.wikimedia.org/w/index.php?title=SRE%2FLDAP&type=revision&diff=1929377&oldid=1924... 
[20:20:25] !log legoktm@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts thumbor2002.codfw.wmnet [20:20:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:21:20] !log phabricator - adding eigyan to WMF-NDA (phab project 61 - https://phabricator.wikimedia.org/project/members/61/ ) - since that is now standard when adding people to the wmf LDAP group (T295928) [20:21:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:21:23] T295928: Grant Access to wmf for eigyan - https://phabricator.wikimedia.org/T295928 [20:21:42] 10ops-codfw, 10decommission-hardware, 10serviceops, 10Patch-For-Review: decommission thumbor200[12].codfw.wmnet - https://phabricator.wikimedia.org/T273141 (10Legoktm) [20:21:45] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for eigyan - https://phabricator.wikimedia.org/T295928 (10Dzahn) >>! In T295928#7517522, @Aklapper wrote: > @Dzahn: Does that mean @eigyan should also be added as a member to #WMF-NDA per [instructions](https://wikitech.wikimedia.org/w/index.php?title=SRE%2FLDAP&type=revision&diff=1929377&oldid=1924...
[20:23:05] (03PS2) 10Legoktm: Remove thumbor2001 and thumbor2002 from site.pp [puppet] - 10https://gerrit.wikimedia.org/r/739957 (https://phabricator.wikimedia.org/T273141) [20:24:16] (03CR) 10Legoktm: [C: 03+2] Remove thumbor2001 and thumbor2002 from site.pp [puppet] - 10https://gerrit.wikimedia.org/r/739957 (https://phabricator.wikimedia.org/T273141) (owner: 10Legoktm) [20:24:40] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes2018.codfw.wmnet with OS stretch [20:24:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:24:44] 10SRE, 10ops-codfw, 10DC-Ops, 10Kubernetes: Q2:(Need By: TBD) rack/setup/install kubernetes2018 - https://phabricator.wikimedia.org/T294299 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host kubernetes2018.codfw.wmnet with OS stretch completed: - kubernetes2018... [20:24:57] 10SRE, 10ops-codfw, 10DC-Ops, 10Kubernetes: Q2:(Need By: TBD) rack/setup/install kubernetes2018 - https://phabricator.wikimedia.org/T294299 (10Papaul) [20:25:38] 10SRE, 10ops-codfw, 10DC-Ops, 10Kubernetes: Q2:(Need By: TBD) rack/setup/install kubernetes2018 - https://phabricator.wikimedia.org/T294299 (10Papaul) 05Open→03Resolved complete [20:26:44] 10SRE, 10ops-codfw, 10DC-Ops, 10SRE Observability (FY2021/2022-Q2): Q2:(Need By: TBD) rack/setup/install prometheus200[56] - https://phabricator.wikimedia.org/T294302 (10Papaul) [20:26:48] (03PS2) 10Dzahn: gitlab-runners: move puppetmaster setting to repo [puppet] - 10https://gerrit.wikimedia.org/r/739367 [20:27:27] (KubernetesCalicoDown) firing: (2) kubestage1003.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org [20:33:07] PROBLEM - Maps tiles generation on alert1001 is CRITICAL: CRITICAL: 100.00% of data under the critical threshold [5.0] https://wikitech.wikimedia.org/wiki/Maps/Runbook 
https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1 [20:39:49] (03CR) 10Brennen Bearnes: [C: 03+1] gitlab-runners: move puppetmaster setting to repo [puppet] - 10https://gerrit.wikimedia.org/r/739367 (owner: 10Dzahn) [20:40:29] (03CR) 10Dzahn: [C: 03+2] gitlab-runners: move puppetmaster setting to repo [puppet] - 10https://gerrit.wikimedia.org/r/739367 (owner: 10Dzahn) [20:46:55] (03CR) 10MSantos: maps: script to send zoom level expiration events (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/740236 (owner: 10MSantos) [20:48:32] (03CR) 10Dzahn: "removed from all instances in Horizon Hiera. noop on runner-1019" [puppet] - 10https://gerrit.wikimedia.org/r/739366 (owner: 10Dzahn) [20:49:00] (03CR) 10Dzahn: "removed from Horizon Hiera on all instances, now in a single place, noop on runner-1019" [puppet] - 10https://gerrit.wikimedia.org/r/739367 (owner: 10Dzahn) [20:50:53] (03CR) 10Dzahn: "ah, right, all of this is even in Phab nowadays, one example: https://phabricator.wikimedia.org/rCLIP6aba1e19b9477eda00e59143cd3536a9dfa53" [puppet] - 10https://gerrit.wikimedia.org/r/739367 (owner: 10Dzahn) [21:05:26] RECOVERY - MariaDB Replica Lag: s1 on db2141 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [21:16:17] Daimona: hey:) [21:22:46] RECOVERY - MariaDB Replica Lag: s8 on db1171 is OK: OK slave_sql_lag Replication lag: 0.33 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [21:37:38] (03CR) 10Dzahn: "actually I don't know about the "dump user" part. so far I am familiar with the other part, the regular prod GRANTs. and what I can confir" [puppet] - 10https://gerrit.wikimedia.org/r/739667 (https://phabricator.wikimedia.org/T243037) (owner: 10Dzahn) [21:39:46] (03CR) 10Dzahn: "-60 lines in prod-m2 :) more than expected but I guess this is right. 
well, I am just removing everything that refers to scholarships" [puppet] - 10https://gerrit.wikimedia.org/r/739667 (https://phabricator.wikimedia.org/T243037) (owner: 10Dzahn) [21:42:35] (03CR) 10Dzahn: [V: 03+1] "@Valentin what do you think? can I deploy this next week?" [puppet] - 10https://gerrit.wikimedia.org/r/739353 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn) [21:43:36] (03CR) 10Dzahn: "Giuseppe you mentioned you wanted to upload a counter proposal here" [puppet] - 10https://gerrit.wikimedia.org/r/736596 (https://phabricator.wikimedia.org/T294802) (owner: 10Dzahn) [21:45:19] (03PS2) 10Dzahn: R:scap::target: drop notice message [puppet] - 10https://gerrit.wikimedia.org/r/734979 (owner: 10Jbond) [21:45:49] (03CR) 10Dzahn: "well.. based on previous comments, looks like consensus to remove" [puppet] - 10https://gerrit.wikimedia.org/r/734979 (owner: 10Jbond) [21:49:37] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1003/32527/" [puppet] - 10https://gerrit.wikimedia.org/r/734979 (owner: 10Jbond) [21:52:07] 10SRE, 10Beta-Cluster-Infrastructure, 10Traffic, 10Beta-Cluster-reproducible, 10HTTPS: The certificate for upload.wikimedia.beta.wmflabs.org expired on November 18, 2021. - https://phabricator.wikimedia.org/T296113 (10AlexisJazz) [21:52:38] 10SRE, 10Beta-Cluster-Infrastructure, 10Quality-and-Test-Engineering-Team (QTE), 10Traffic, and 2 others: [epic] The SSL certificate for Beta cluster domains fails to properly renew & deploy - https://phabricator.wikimedia.org/T293585 (10AlexisJazz) [21:52:46] 10SRE, 10Beta-Cluster-Infrastructure, 10Traffic, 10Beta-Cluster-reproducible, 10HTTPS: The certificate for upload.wikimedia.beta.wmflabs.org expired on November 18, 2021. - https://phabricator.wikimedia.org/T296113 (10AlexisJazz) [21:52:57] 10SRE, 10Beta-Cluster-Infrastructure, 10Traffic, 10Beta-Cluster-reproducible, 10HTTPS: The certificate for upload.wikimedia.beta.wmflabs.org expired on November 18, 2021. 
- https://phabricator.wikimedia.org/T296113 (10RhinosF1) [21:53:38] 10SRE, 10Beta-Cluster-Infrastructure, 10Traffic, 10Beta-Cluster-reproducible, 10HTTPS: The certificate for upload.wikimedia.beta.wmflabs.org expired on November 18, 2021. - https://phabricator.wikimedia.org/T296113 (10RhinosF1) [21:53:46] 10SRE, 10Beta-Cluster-Infrastructure, 10Quality-and-Test-Engineering-Team (QTE), 10Traffic, and 2 others: [epic] The SSL certificate for Beta cluster domains fails to properly renew & deploy - https://phabricator.wikimedia.org/T293585 (10RhinosF1) [21:55:10] 10SRE, 10Beta-Cluster-Infrastructure, 10Traffic, 10Beta-Cluster-reproducible, 10HTTPS: The certificate for upload.wikimedia.beta.wmflabs.org expired on November 18, 2021. - https://phabricator.wikimedia.org/T296113 (10RhinosF1) Closing as duplicate of linked task. I assume it's the standard restart of st... [21:55:23] 10SRE, 10Beta-Cluster-Infrastructure, 10Traffic, 10Beta-Cluster-reproducible, 10HTTPS: The certificate for upload.wikimedia.beta.wmflabs.org expired on November 18, 2021. - https://phabricator.wikimedia.org/T296113 (10AlexisJazz) @RhinosF1 : en.wikipedia.beta.wmflabs.org is already working again but uplo... [21:56:20] 10SRE, 10Beta-Cluster-Infrastructure, 10Traffic, 10Beta-Cluster-reproducible, 10HTTPS: The certificate for upload.wikimedia.beta.wmflabs.org expired on November 18, 2021. - https://phabricator.wikimedia.org/T296113 (10RhinosF1) >>! In T296113#7517656, @AlexisJazz wrote: > @RhinosF1 : en.wikipedia.beta.wm... 
[21:59:13] (03PS3) 10VolkerE: Add new icons, wordmarks & taglines for several wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/739370 (https://phabricator.wikimedia.org/T290091) (owner: 10Clare Ming) [22:03:08] 10SRE, 10Beta-Cluster-Infrastructure, 10Quality-and-Test-Engineering-Team (QTE), 10Traffic, and 2 others: [epic] The SSL certificate for Beta cluster domains fails to properly renew & deploy - https://phabricator.wikimedia.org/T293585 (10Urbanecm) [22:03:49] (03PS8) 10Dzahn: snapshot: replace the word cron everywhere [puppet] - 10https://gerrit.wikimedia.org/r/736074 [22:04:08] (03PS4) 10VolkerE: Add new icons, wordmarks & taglines for several wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/739370 (https://phabricator.wikimedia.org/T290091) (owner: 10Clare Ming) [22:04:18] (03CR) 10Dzahn: snapshot: replace the word cron everywhere (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/736074 (owner: 10Dzahn) [22:07:29] (03CR) 10Dzahn: "compiling on "C:profile::dumps::generation::worker::common" works for this. (C:snapshot does not) does it cover all? 
I hope so because it'" [puppet] - 10https://gerrit.wikimedia.org/r/736074 (owner: 10Dzahn) [22:08:23] (03CR) 10Dzahn: "https://puppet-compiler.wmflabs.org/compiler1001/32529/" [puppet] - 10https://gerrit.wikimedia.org/r/736074 (owner: 10Dzahn) [22:10:42] (03PS9) 10Dzahn: snapshot: replace the word cron everywhere [puppet] - 10https://gerrit.wikimedia.org/r/736074 [22:12:00] (03CR) 10Dzahn: snapshot: replace the word cron everywhere (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/736074 (owner: 10Dzahn) [22:18:00] (03PS10) 10Dzahn: snapshot: replace the word cron everywhere [puppet] - 10https://gerrit.wikimedia.org/r/736074 [22:21:46] (03CR) 10Dzahn: "after a lot more replacing with sed -i etc:) now: https://puppet-compiler.wmflabs.org/compiler1003/32531/snapshot1012.eqiad.wmnet/index.ht" [puppet] - 10https://gerrit.wikimedia.org/r/736074 (owner: 10Dzahn) [22:26:17] (03CR) 10Dzahn: [V: 03+1] icinga: use display_name for a HOST to add 'page' string where applicable (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/735695 (https://phabricator.wikimedia.org/T236379) (owner: 10Dzahn) [22:31:56] (03Abandoned) 10Dzahn: icinga: use display_name for a HOST to add 'page' string where applicable [puppet] - 10https://gerrit.wikimedia.org/r/735695 (https://phabricator.wikimedia.org/T236379) (owner: 10Dzahn) [22:50:10] (03PS1) 10Papaul: Add prometheus200[56] to site.pp and netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/740276 (https://phabricator.wikimedia.org/T294302) [22:52:01] (CirrusSearchJVMGCOldPoolFlatlined) firing: Elasticsearch instance elastic2044-production-search-psi-codfw is showing memory pressure in the old pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://alerts.wikimedia.org [22:53:06] (03CR) 10Papaul: [C: 03+2] Add prometheus200[56] to site.pp and netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/740276 (https://phabricator.wikimedia.org/T294302) (owner: 10Papaul) [22:59:04] !log pt1979@cumin2002 
START - Cookbook sre.hosts.reimage for host prometheus2005.codfw.wmnet with OS bullseye [22:59:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:59:11] 10SRE, 10ops-codfw, 10DC-Ops, 10Patch-For-Review, 10SRE Observability (FY2021/2022-Q2): Q2:(Need By: TBD) rack/setup/install prometheus200[56] - https://phabricator.wikimedia.org/T294302 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host prometheus2005.co... [23:03:31] 10SRE, 10ops-codfw, 10DC-Ops, 10Patch-For-Review, 10SRE Observability (FY2021/2022-Q2): Q2:(Need By: TBD) rack/setup/install prometheus200[56] - https://phabricator.wikimedia.org/T294302 (10Papaul) [23:08:19] 10SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests: Access request to superset for user KSiebert - https://phabricator.wikimedia.org/T295777 (10Dzahn) [23:14:18] (03PS1) 10Papaul: Fix partman for prometheus200[56] [puppet] - 10https://gerrit.wikimedia.org/r/740277 (https://phabricator.wikimedia.org/T294302) [23:14:27] (03PS1) 10Dzahn: admin: add mmartorana to ldap_only_admins [puppet] - 10https://gerrit.wikimedia.org/r/740278 (https://phabricator.wikimedia.org/T295789) [23:15:32] (03CR) 10Papaul: [C: 03+2] Fix partman for prometheus200[56] [puppet] - 10https://gerrit.wikimedia.org/r/740277 (https://phabricator.wikimedia.org/T294302) (owner: 10Papaul) [23:15:56] !log LDAP - added mmartorana to wmf (91354e9e-5706-4289-9a60-98e8a7632853) T295789 [23:15:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:16:00] T295789: Add Manfredi Martorana to wmf ldap group - https://phabricator.wikimedia.org/T295789 [23:16:28] PROBLEM - Host text-lb.esams.wikimedia.org_ipv6 is DOWN: CRITICAL - Destination Unreachable (2620:0:862:ed1a::1) [23:17:11] That... explains stuff [23:17:12] Err esams now? 
[23:17:25] I was just about to say I can't access phab/enwiki [23:17:40] (03CR) 10Dzahn: [C: 03+2] "https://wikimedia.namely.com/people/91354e9e-5706-4289-9a60-98e8a7632853/show/general/" [puppet] - 10https://gerrit.wikimedia.org/r/740278 (https://phabricator.wikimedia.org/T295789) (owner: 10Dzahn) [23:17:46] PROBLEM - Varnish traffic drop between 30min ago and now at esams on alert1001 is CRITICAL: 44.2 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [23:17:54] --> #mw_sec [23:18:28] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:19:18] PROBLEM - rsyslog TLS listener on port 6514 on centrallog1001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection timed out https://wikitech.wikimedia.org/wiki/Logs [23:20:11] RECOVERY - Host text-lb.esams.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 81.04 ms [23:20:36] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:20:56] RECOVERY - rsyslog TLS listener on port 6514 on centrallog1001 is OK: SSL OK - Certificate centrallog1001.eqiad.wmnet valid until 2024-06-25 15:42:33 +0000 (expires in 948 days) https://wikitech.wikimedia.org/wiki/Logs [23:23:23] (03PS1) 10CDanis: prepend_as_out for esams/knams [homer/public] - 10https://gerrit.wikimedia.org/r/740280 [23:24:12] RECOVERY - Varnish traffic drop between 30min ago and now at esams on alert1001 is OK: (C)60 le (W)70 le 89.34 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [23:24:51] !log pt1979@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host prometheus2005.codfw.wmnet with OS bullseye [23:24:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:24:58] 10SRE, 10ops-codfw, 10DC-Ops, 10Patch-For-Review, 10SRE Observability (FY2021/2022-Q2): Q2:(Need By: TBD) rack/setup/install prometheus200[56] - https://phabricator.wikimedia.org/T294302 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host prometheus2005.codfw.... [23:25:11] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host prometheus2005.codfw.wmnet with OS bullseye [23:25:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:25:18] 10SRE, 10ops-codfw, 10DC-Ops, 10Patch-For-Review, 10SRE Observability (FY2021/2022-Q2): Q2:(Need By: TBD) rack/setup/install prometheus200[56] - https://phabricator.wikimedia.org/T294302 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host prometheus2005.co... 
[23:37:46] (03CR) 10CDanis: [C: 03+2] prepend_as_out for esams/knams [homer/public] - 10https://gerrit.wikimedia.org/r/740280 (owner: 10CDanis) [23:41:26] PROBLEM - Varnish traffic drop between 30min ago and now at esams on alert1001 is CRITICAL: 47.25 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [23:41:40] PROBLEM - Restbase edge esams on text-lb.esams.wikimedia.org is CRITICAL: WARNING:urllib3.connectionpool:Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by ConnectTimeoutError(urllib3.connection.VerifiedHTTPSConnection object at 0x7f582db75780, Connection to text-lb.esams.wikimedia.org timed out. (connect timeout=15)): /api/rest_v1/?spec https://wikitech.wikimedia.org/wiki/RESTBase [23:41:59] interesting [23:42:08] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:42:45] enwiki is very very slow [23:43:12] PROBLEM - PyBal backends health check on lvs3005 is CRITICAL: PYBAL CRITICAL - CRITICAL - testlb_443: Servers cp3054.esams.wmnet, cp3058.esams.wmnet, cp3052.esams.wmnet, cp3064.esams.wmnet are marked down but pooled: textlb_443: Servers cp3060.esams.wmnet, cp3050.esams.wmnet, cp3054.esams.wmnet, cp3064.esams.wmnet, cp3058.esams.wmnet, cp3052.esams.wmnet are marked down but pooled: testlb6_443: Servers cp3054.esams.wmnet, cp3064.esams.wmne [23:43:12] 8.esams.wmnet, cp3052.esams.wmnet are marked down but pooled: ncredirlb_443: Servers ncredir3001.esams.wmnet are marked down but pooled: textlb6_443: Servers cp3054.esams.wmnet, cp3058.esams.wmnet, cp3062.esams.wmnet, cp3052.esams.wmnet, cp3056.esams.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal 
[23:43:35] Christian75: assuming you're near or in Europe, we're working on it [23:43:51] :-) [23:44:16] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:47:50] RECOVERY - Varnish traffic drop between 30min ago and now at esams on alert1001 is OK: (C)60 le (W)70 le 108.9 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [23:47:51] PROBLEM - Too high an incoming rate of browser-reported Network Error Logging events #page on alert1001 is CRITICAL: type=tcp.timed_out https://wikitech.wikimedia.org/wiki/Network_monitoring%23NEL_alerts https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 [23:47:56] :) [23:48:17] * legoktm acks [23:48:34] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:49:09] (PS1) CDanis: disable LG ipv4 in knams [homer/public] - https://gerrit.wikimedia.org/r/740284 [23:49:32] RECOVERY - Restbase edge esams on text-lb.esams.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [23:49:38] RECOVERY - PyBal backends health check on lvs3005 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [23:49:49] (CR) CDanis: [C: +2] disable LG ipv4 in knams [homer/public] - https://gerrit.wikimedia.org/r/740284 (owner: CDanis) [23:50:44] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:52:09] RECOVERY - Too high an incoming rate of browser-reported Network Error Logging events #page on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Network_monitoring%23NEL_alerts https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 [23:52:59] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host prometheus2005.codfw.wmnet with OS bullseye [23:53:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:53:06] SRE, ops-codfw, DC-Ops, Patch-For-Review, SRE Observability (FY2021/2022-Q2): Q2:(Need By: TBD) rack/setup/install prometheus200[56] - https://phabricator.wikimedia.org/T294302 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host prometheus2005.codfw.... [23:54:08] Christian75: are things working now? [23:54:22] urbanecm: how about you? [23:55:04] It is much better (Europe) [23:55:15] I am not any of them, but I can reach phab again. [23:55:46] PROBLEM - At least one CPU core of an LVS is saturated- packet drops are likely on lvs3005 is CRITICAL: cpu={1,11,13,15,3,5,7,9} https://bit.ly/wmf-lvscpu https://grafana.wikimedia.org/d/000000377/host-overview?var-server=lvs3005&var-datasource=esams+prometheus/ops [23:56:00] thanks Christian75 and zabe, good [23:57:12] SRE, ops-codfw, DC-Ops, Patch-For-Review, SRE Observability (FY2021/2022-Q2): Q2:(Need By: TBD) rack/setup/install prometheus200[56] - https://phabricator.wikimedia.org/T294302 (Papaul) prometheus2005 in row B `` ┌───────────┤ [!!] Download debconf preconfiguration file ├────────────┐...