[00:00:04] brennen: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC late backport and config training deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220128T0000).
[00:00:13] o/ sitting in for brennen
[00:01:04] (CR) Ryan Kemper: [C: +1] wdqs: add missing hiera var for internal [puppet] - https://gerrit.wikimedia.org/r/757772 (https://phabricator.wikimedia.org/T300310) (owner: Bking)
[00:01:09] (CR) Bking: [C: +2] wdqs: add missing hiera var for internal [puppet] - https://gerrit.wikimedia.org/r/757772 (https://phabricator.wikimedia.org/T300310) (owner: Bking)
[00:01:48] (CR) Clare Ming: [C: +1] Disable A/B test (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/757735 (https://phabricator.wikimedia.org/T297924) (owner: Jdlrobson)
[00:06:17] but doesn't look like I've got takers, claiming victory
[01:47:00] !log andrew@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcontrol2001-dev.wikimedia.org with OS bullseye
[01:47:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:26:38] is someone from SRE still around? I need to revert a config change that had unforeseen consequences
[03:27:19] (I can do it, just want coverage in case there's some deployment tooling problem. The config change itself is not risky.)
[04:01:10] (PS1) Gergő Tisza: GrowthExperiments: Disable mobile quality gate [mediawiki-config] - https://gerrit.wikimedia.org/r/757792 (https://phabricator.wikimedia.org/T298122)
[04:24:13] (IcingaOverload) firing: Checks are taking long to execute on alert2001:9245 - https://grafana.wikimedia.org/d/rsCfQfuZz/icinga - https://alerts.wikimedia.org
[04:33:16] !log ryankemper@cumin1001 START - Cookbook sre.hosts.reimage for host elastic1068.eqiad.wmnet with OS stretch
[04:33:16] !log ryankemper@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host elastic1068.eqiad.wmnet with OS stretch
[04:33:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:33:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:34:17] !log ryankemper@cumin1001 START - Cookbook sre.hosts.reimage for host elastic1068.eqiad.wmnet with OS stretch
[04:34:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:36:23] (PS2) Krinkle: Start writing to some wmg* constants [mediawiki-config] - https://gerrit.wikimedia.org/r/734572 (https://phabricator.wikimedia.org/T45956) (owner: Zabe)
[04:38:34] (PS3) Krinkle: Start writing to some wmg* constants [mediawiki-config] - https://gerrit.wikimedia.org/r/734572 (https://phabricator.wikimedia.org/T45956) (owner: Zabe)
[04:38:51] (CR) Krinkle: [C: +1] "LGTM. The next patch in the chain needs a rebase because db-*.php files changed a bit." [mediawiki-config] - https://gerrit.wikimedia.org/r/734572 (https://phabricator.wikimedia.org/T45956) (owner: Zabe)
[04:39:13] (IcingaOverload) resolved: Checks are taking long to execute on alert2001:9245 - https://grafana.wikimedia.org/d/rsCfQfuZz/icinga - https://alerts.wikimedia.org
[04:39:27] (PS4) Krinkle: multiversion: Factor dblist matching into separate method [mediawiki-config] - https://gerrit.wikimedia.org/r/737210 (owner: Awight)
[04:39:40] (CR) Krinkle: [C: +1] multiversion: Factor dblist matching into separate method [mediawiki-config] - https://gerrit.wikimedia.org/r/737210 (owner: Awight)
[04:58:17] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1068.eqiad.wmnet with OS stretch
[04:58:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:19:44] SRE, Cloud-VPS, cloud-services-team (Kanban): prometheus-rabbitmq-exporter for Debian Bullseye - https://phabricator.wikimedia.org/T300308 (Majavah) Current RabbitMQ server versions also include a built-in prometheus exporter: https://www.rabbitmq.com/prometheus.html
[05:20:44] SRE, Cloud-VPS, cloud-services-team (Kanban): prometheus-rabbitmq-exporter for Debian Bullseye - https://phabricator.wikimedia.org/T300308 (Majavah)
[05:20:54] SRE, Cloud-VPS, Patch-For-Review: package prometheus-rabbitmq-exporter for Debian jessie - https://phabricator.wikimedia.org/T188392 (Majavah)
[06:06:52] (PS1) Marostegui: change_pt_timestamp_T298558.py: New schema change [software/schema-changes] - https://gerrit.wikimedia.org/r/757796 (https://phabricator.wikimedia.org/T298558)
[06:08:32] (PS1) Marostegui: db2133: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/757797 (https://phabricator.wikimedia.org/T300243)
[06:09:42] (CR) Marostegui: [C: +2] db2133: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/757797 (https://phabricator.wikimedia.org/T300243) (owner: Marostegui)
[06:11:40] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host db2133.codfw.wmnet with OS bullseye
[06:11:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:13:09] PROBLEM - SSH on kubernetes1004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:17:55] PROBLEM - haproxy failover on dbproxy2002 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[06:19:21] ^ expected
[06:20:47] (PS2) Marostegui: change_pt_timestamp_T298558.py: New schema change [software/schema-changes] - https://gerrit.wikimedia.org/r/757796 (https://phabricator.wikimedia.org/T298558)
[06:24:13] (IcingaOverload) firing: Checks are taking long to execute on alert2001:9245 - https://grafana.wikimedia.org/d/rsCfQfuZz/icinga - https://alerts.wikimedia.org
[06:29:13] (IcingaOverload) resolved: Checks are taking long to execute on alert2001:9245 - https://grafana.wikimedia.org/d/rsCfQfuZz/icinga - https://alerts.wikimedia.org
[06:34:07] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /robots.txt (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Citoid
[06:35:13] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[06:42:15] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2133.codfw.wmnet with OS bullseye
[06:42:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:43:18] (PS1) Marostegui: Revert "db2133: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/757712
[06:45:20] (CR) Marostegui: [C: +2] Revert "db2133: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/757712 (owner: Marostegui)
[06:54:16] SRE, SRE-tools, DBA, Infrastructure-Foundations, and 2 others: Switchover m2 master (db1183 -> db1159) - https://phabricator.wikimedia.org/T300329 (Marostegui) @bd808 @Krinkle @MoritzMuehlenhoff @kostajh @hnowlan @dpifke I would like to failover m2 master on **Thursday 3rd Feb at 9:00AM UTC**. Du...
[06:55:44] RECOVERY - haproxy failover on dbproxy2002 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[06:56:01] (CirrusSearchJVMGCOldPoolFlatlined) firing: Elasticsearch instance elastic1049-production-search-psi-eqiad is showing memory pressure in the old pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://alerts.wikimedia.org
[06:57:21] SRE, SRE-tools, DBA, Infrastructure-Foundations, and 2 others: Switchover m2 master (db1183 -> db1159) - https://phabricator.wikimedia.org/T300329 (Marostegui) Adding also @akosiaris and @Volans
[07:01:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1096 (s5,s6) T299479', diff saved to https://phabricator.wikimedia.org/P19531 and previous config saved to /var/cache/conftool/dbconfig/20220128-070112-marostegui.json
[07:01:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:01:17] T299479: Upgrade s6 to Bullseye - https://phabricator.wikimedia.org/T299479
[07:02:48] (PS1) Marostegui: db1096: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/757799 (https://phabricator.wikimedia.org/T299479)
[07:03:40] (CR) Marostegui: [C: +2] db1096: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/757799 (https://phabricator.wikimedia.org/T299479) (owner: Marostegui)
[07:03:40] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host db1096.eqiad.wmnet with OS bullseye
[07:03:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:12:52] (PS1) Elukey: profile::kafka::broker: add pki_intermediate_name parameter [puppet] - https://gerrit.wikimedia.org/r/757800 (https://phabricator.wikimedia.org/T291905)
[07:13:07] (PS1) Majavah: rabbitmq: use built-in prometheus exporter in bullseye [puppet] - https://gerrit.wikimedia.org/r/757801 (https://phabricator.wikimedia.org/T300308)
[07:14:23] (CR) Elukey: [V: +1] "PCC SUCCESS (DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33485/console" [puppet] - https://gerrit.wikimedia.org/r/757800 (https://phabricator.wikimedia.org/T291905) (owner: Elukey)
[07:16:38] (CR) Majavah: "PCC: https://puppet-compiler.wmflabs.org/pcc-worker1002/33484 (I guess the OS upgrade isn't yet visible there)" [puppet] - https://gerrit.wikimedia.org/r/757801 (https://phabricator.wikimedia.org/T300308) (owner: Majavah)
[07:18:05] (PS2) Elukey: helmfile.d: add circuit breaking settings for ml-serve's egress [deployment-charts] - https://gerrit.wikimedia.org/r/757675 (https://phabricator.wikimedia.org/T294414)
[07:22:52] (CR) Ryan Kemper: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/757700 (owner: Ryan Kemper)
[07:26:37] (PS1) Marostegui: Revert "db1096: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/757713
[07:28:26] (CR) Marostegui: [C: +2] Revert "db1096: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/757713 (owner: Marostegui)
[07:28:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3315 (re)pooling @ 1%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19532 and previous config saved to /var/cache/conftool/dbconfig/20220128-072849-root.json
[07:28:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:28:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3316 (re)pooling @ 1%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19533 and previous config saved to /var/cache/conftool/dbconfig/20220128-072858-root.json
[07:29:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:29:03] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1096.eqiad.wmnet with OS bullseye
[07:29:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:37:57] (CR) Marostegui: "A dry run looks good." [software/schema-changes] - https://gerrit.wikimedia.org/r/757796 (https://phabricator.wikimedia.org/T298558) (owner: Marostegui)
[07:43:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3315 (re)pooling @ 5%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19534 and previous config saved to /var/cache/conftool/dbconfig/20220128-074353-root.json
[07:43:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:44:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3316 (re)pooling @ 5%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19535 and previous config saved to /var/cache/conftool/dbconfig/20220128-074401-root.json
[07:44:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:58:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3315 (re)pooling @ 10%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19536 and previous config saved to /var/cache/conftool/dbconfig/20220128-075856-root.json
[07:59:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:59:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3316 (re)pooling @ 10%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19537 and previous config saved to /var/cache/conftool/dbconfig/20220128-075905-root.json
[07:59:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220128T0800)
[08:14:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3315 (re)pooling @ 20%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19538 and previous config saved to /var/cache/conftool/dbconfig/20220128-081400-root.json
[08:14:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:14:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3316 (re)pooling @ 20%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19539 and previous config saved to /var/cache/conftool/dbconfig/20220128-081408-root.json
[08:14:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:16:19] SRE, ops-drmrs, DC-Ops, Infrastructure-Foundations, netops: Q3:(Need By: ASAP) rack/setup/install cr[12]-drmrs - https://phabricator.wikimedia.org/T300277 (ayounsi) a:ayounsi→RobH > Please note the above diagram has a mistake, showing both routers connecting to PP:15/16 when cr1:xe-0/...
[08:20:16] (PS1) Kevin Bazira: ml-services: adding editquality transformer [deployment-charts] - https://gerrit.wikimedia.org/r/757838 (https://phabricator.wikimedia.org/T298943)
[08:21:24] (PS2) Kosta Harlan: GrowthExperiments: Disable mobile quality gate [mediawiki-config] - https://gerrit.wikimedia.org/r/757792 (https://phabricator.wikimedia.org/T298122) (owner: Gergő Tisza)
[08:21:48] (PS3) Kosta Harlan: GrowthExperiments: Disable mobile quality gate [mediawiki-config] - https://gerrit.wikimedia.org/r/757792 (https://phabricator.wikimedia.org/T298122) (owner: Gergő Tisza)
[08:22:00] (CR) Kosta Harlan: [C: +1] GrowthExperiments: Disable mobile quality gate [mediawiki-config] - https://gerrit.wikimedia.org/r/757792 (https://phabricator.wikimedia.org/T298122) (owner: Gergő Tisza)
[08:23:30] (CR) JMeybohm: [C: +2] Deploy istio-ingressgateway as daemonset [deployment-charts] - https://gerrit.wikimedia.org/r/757696 (https://phabricator.wikimedia.org/T290966) (owner: JMeybohm)
[08:26:14] (CR) JMeybohm: [C: +2] Upgrade eqiad kubernetes masters to tainted full nodes [puppet] - https://gerrit.wikimedia.org/r/757615 (https://phabricator.wikimedia.org/T290967) (owner: JMeybohm)
[08:27:17] (Merged) jenkins-bot: Deploy istio-ingressgateway as daemonset [deployment-charts] - https://gerrit.wikimedia.org/r/757696 (https://phabricator.wikimedia.org/T290966) (owner: JMeybohm)
[08:28:00] (CR) Elukey: ml-services: adding editquality transformer (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/757838 (https://phabricator.wikimedia.org/T298943) (owner: Kevin Bazira)
[08:29:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3315 (re)pooling @ 25%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19540 and previous config saved to /var/cache/conftool/dbconfig/20220128-082904-root.json
[08:29:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:29:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3316 (re)pooling @ 25%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19541 and previous config saved to /var/cache/conftool/dbconfig/20220128-082912-root.json
[08:29:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:29:39] (CR) Gehel: [C: -1] "See comment inline comment about wildcard certs. I'm not familiar with cergen, so my review only covers intent." [labs/private] - https://gerrit.wikimedia.org/r/757699 (https://phabricator.wikimedia.org/T299797) (owner: Bking)
[08:30:20] (PS4) JMeybohm: kubernetes::master: Remove expose_puppet_certs parameter [puppet] - https://gerrit.wikimedia.org/r/757631 (https://phabricator.wikimedia.org/T290967)
[08:30:49] (PS1) Elukey: helmfile.d: remove unnecessary config from ml-services [deployment-charts] - https://gerrit.wikimedia.org/r/757839
[08:31:59] (CR) JMeybohm: [V: +1] "PCC SUCCESS (DIFF 10): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33486/console" [puppet] - https://gerrit.wikimedia.org/r/757631 (https://phabricator.wikimedia.org/T290967) (owner: JMeybohm)
[08:32:23] (CR) Kosta Harlan: [C: +1] GrowthExperiments: Disable mobile quality gate (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/757792 (https://phabricator.wikimedia.org/T298122) (owner: Gergő Tisza)
[08:32:27] !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-single for host kubemaster1002.eqiad.wmnet
[08:32:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:36:37] (PS2) Elukey: helmfile.d: remove unnecessary config from ml-services [deployment-charts] - https://gerrit.wikimedia.org/r/757839
[08:37:32] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kubemaster1002.eqiad.wmnet
[08:37:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:38:19] (CR) JMeybohm: [C: +2] Add k8s masters in eqiad eBGP config [homer/public] - https://gerrit.wikimedia.org/r/757438 (https://phabricator.wikimedia.org/T290967) (owner: JMeybohm)
[08:38:54] (Merged) jenkins-bot: Add k8s masters in eqiad eBGP config [homer/public] - https://gerrit.wikimedia.org/r/757438 (https://phabricator.wikimedia.org/T290967) (owner: JMeybohm)
[08:40:28] SRE, SRE-tools, DBA, Infrastructure-Foundations, and 2 others: Switchover m2 master (db1183 -> db1159) - https://phabricator.wikimedia.org/T300329 (kostajh) Sounds good to me (for mwaddlink).
[08:40:43] !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-single for host kubemaster1001.eqiad.wmnet
[08:40:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:44:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3315 (re)pooling @ 40%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19542 and previous config saved to /var/cache/conftool/dbconfig/20220128-084407-root.json
[08:44:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:44:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3316 (re)pooling @ 40%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19543 and previous config saved to /var/cache/conftool/dbconfig/20220128-084416-root.json
[08:44:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:44:27] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kubemaster1001.eqiad.wmnet
[08:44:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:47:44] (CR) JMeybohm: [V: +1 C: +2] kubernetes::master: Remove expose_puppet_certs parameter [puppet] - https://gerrit.wikimedia.org/r/757631 (https://phabricator.wikimedia.org/T290967) (owner: JMeybohm)
[08:51:15] (CR) Gehel: "See a few comments inline." [puppet] - https://gerrit.wikimedia.org/r/757700 (owner: Ryan Kemper)
[08:56:21] (CR) Kevin Bazira: [C: +1] "LGTM" [deployment-charts] - https://gerrit.wikimedia.org/r/757839 (owner: Elukey)
[08:56:32] (CR) Elukey: [C: +2] "This leads to a no-op as expected, merging!" [deployment-charts] - https://gerrit.wikimedia.org/r/757839 (owner: Elukey)
[08:56:46] (PS2) Kevin Bazira: ml-services: adding editquality transformer [deployment-charts] - https://gerrit.wikimedia.org/r/757838 (https://phabricator.wikimedia.org/T298943)
[08:58:12] (CR) Jcrespo: "The patch is valid, but waiting on your feedback to see if it matches the desired actionables." [puppet] - https://gerrit.wikimedia.org/r/757509 (https://phabricator.wikimedia.org/T300171) (owner: Dzahn)
[08:59:01] (CR) Kevin Bazira: ml-services: adding editquality transformer (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/757838 (https://phabricator.wikimedia.org/T298943) (owner: Kevin Bazira)
[08:59:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3315 (re)pooling @ 50%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19544 and previous config saved to /var/cache/conftool/dbconfig/20220128-085911-root.json
[08:59:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:59:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3316 (re)pooling @ 50%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19545 and previous config saved to /var/cache/conftool/dbconfig/20220128-085919-root.json
[08:59:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:06:41] (CR) Hashar: "The puppet compiler can be triggered by commenting "check experimental" in Gerrit which cause Zuul to trigger the compiler and it reports " [puppet] (refs/meta/config) - https://gerrit.wikimedia.org/r/756989 (owner: Majavah)
[09:11:45] (PS3) Kevin Bazira: ml-services: adding editquality transformer [deployment-charts] - https://gerrit.wikimedia.org/r/757838 (https://phabricator.wikimedia.org/T298943)
[09:12:39] kostajh: i am here for https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/757792 :D
[09:13:08] hashar: hi!
[09:13:57] (PS2) Filippo Giunchedi: hieradata: swap prometheus2003 with prometheus2005 [puppet] - https://gerrit.wikimedia.org/r/757623 (https://phabricator.wikimedia.org/T296199)
[09:14:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3315 (re)pooling @ 60%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19546 and previous config saved to /var/cache/conftool/dbconfig/20220128-091415-root.json
[09:14:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:14:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3316 (re)pooling @ 60%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19547 and previous config saved to /var/cache/conftool/dbconfig/20220128-091423-root.json
[09:14:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:14:45] (CR) Filippo Giunchedi: [C: +2] hieradata: swap prometheus2003 with prometheus2005 [puppet] - https://gerrit.wikimedia.org/r/757623 (https://phabricator.wikimedia.org/T296199) (owner: Filippo Giunchedi)
[09:14:56] kostajh: should I +2 it ?
[09:15:14] hashar: yes please
[09:15:23] (CR) Hashar: [C: +2] GrowthExperiments: Disable mobile quality gate [mediawiki-config] - https://gerrit.wikimedia.org/r/757792 (https://phabricator.wikimedia.org/T298122) (owner: Gergő Tisza)
[09:15:47] RECOVERY - SSH on kubernetes1004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:16:06] (Merged) jenkins-bot: GrowthExperiments: Disable mobile quality gate [mediawiki-config] - https://gerrit.wikimedia.org/r/757792 (https://phabricator.wikimedia.org/T298122) (owner: Gergő Tisza)
[09:16:50] kostajh: pulled on mwdebug1001
[09:17:13] (CR) Elukey: [C: +2] ml-services: adding editquality transformer [deployment-charts] - https://gerrit.wikimedia.org/r/757838 (https://phabricator.wikimedia.org/T298943) (owner: Kevin Bazira)
[09:17:16] hashar: thanks, checking (cc tgr_
[09:17:22] !log pool prometheus2005 and depool prometheus2003 - T296199
[09:17:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:17:26] T296199: Prometheus hardware refresh (+ Bullseye upgrade) - https://phabricator.wikimedia.org/T296199
[09:22:12] hashar: LGTM
[09:22:24] syncing :D
[09:22:31] merci
[09:22:45] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn
[09:22:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:23:37] !log hashar@deploy1002 Synchronized wmf-config/CommonSettings.php: GrowthExperiments: Disable mobile quality gate - T298122 T300336 (duration: 00m 50s)
[09:23:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:23:42] T298122: Add an image: experiment (desktop) - https://phabricator.wikimedia.org/T298122
[09:23:42] T300336: Add an image: Mobile-only quality gate shown to desktop users - https://phabricator.wikimedia.org/T300336
[09:24:05] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[09:24:06] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn
[09:24:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:24:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:25:28] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[09:25:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:25:32] (CR) Muehlenhoff: [C: +2] Update service entry in idp-test for Puppetboard [puppet] - https://gerrit.wikimedia.org/r/757630 (owner: Muehlenhoff)
[09:29:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3315 (re)pooling @ 75%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19548 and previous config saved to /var/cache/conftool/dbconfig/20220128-092918-root.json
[09:29:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:29:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3316 (re)pooling @ 75%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19549 and previous config saved to /var/cache/conftool/dbconfig/20220128-092927-root.json
[09:29:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:36:01] (PS1) Vgutierrez: Release 6.0.10-1wm1 [debs/varnish4] (debian-wmf) - https://gerrit.wikimedia.org/r/757866 (https://phabricator.wikimedia.org/T300264)
[09:36:25] (CR) jerkins-bot: [V: -1] Release 6.0.10-1wm1 [debs/varnish4] (debian-wmf) - https://gerrit.wikimedia.org/r/757866 (https://phabricator.wikimedia.org/T300264) (owner: Vgutierrez)
[09:38:01] (PS2) Vgutierrez: Release 6.0.10-1wm1 [debs/varnish4] (debian-wmf) - https://gerrit.wikimedia.org/r/757866 (https://phabricator.wikimedia.org/T300264)
[09:38:20] (CR) jerkins-bot: [V: -1] Release 6.0.10-1wm1 [debs/varnish4] (debian-wmf) - https://gerrit.wikimedia.org/r/757866 (https://phabricator.wikimedia.org/T300264) (owner: Vgutierrez)
[09:39:58] (PS1) Muehlenhoff: Addintional cache setting for puppetboard against idp-test [puppet] - https://gerrit.wikimedia.org/r/757868
[09:40:20] SRE, SRE-tools, DBA, Infrastructure-Foundations, and 2 others: Switchover m2 master (db1183 -> db1159) - https://phabricator.wikimedia.org/T300329 (Volans) Nothing to do for `debmonitor`, it will reconnect automatically to the new host, so anytime is good.
[09:44:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3315 (re)pooling @ 100%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19550 and previous config saved to /var/cache/conftool/dbconfig/20220128-094422-root.json
[09:44:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:44:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1096:3316 (re)pooling @ 100%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19551 and previous config saved to /var/cache/conftool/dbconfig/20220128-094430-root.json
[09:44:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:44:56] (PS1) Muehlenhoff: Trim service list for idp-test [puppet] - https://gerrit.wikimedia.org/r/757869
[09:46:06] !log installing brltty bugfix updates from bullseye point release
[09:46:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:46:28] (PS1) Marostegui: db1168: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/757870 (https://phabricator.wikimedia.org/T299479)
[09:46:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1168 T299479', diff saved to https://phabricator.wikimedia.org/P19552 and previous config saved to /var/cache/conftool/dbconfig/20220128-094636-marostegui.json
[09:46:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:46:41] T299479: Upgrade s6 to Bullseye - https://phabricator.wikimedia.org/T299479
[09:47:08] (CR) Marostegui: [C: +2] db1168: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/757870 (https://phabricator.wikimedia.org/T299479) (owner: Marostegui)
[09:50:05] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host db1168.eqiad.wmnet with OS bullseye
[09:50:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:01:58] (CR) Jbond: [C: +1] "LGTM and still a worth while change, however FYI i plan to create all the intermediates we have in production in cloud today" [puppet] - https://gerrit.wikimedia.org/r/757800 (https://phabricator.wikimedia.org/T291905) (owner: Elukey)
[10:04:21] (CR) Jbond: [C: +1] Review access change (1 comment) [puppet] (refs/meta/config) - https://gerrit.wikimedia.org/r/756989 (owner: Majavah)
[10:05:21] (CR) Jbond: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/757868 (owner: Muehlenhoff)
[10:05:53] (CR) Jbond: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/757869 (owner: Muehlenhoff)
[10:06:39] SRE, SRE-Access-Requests: Requesting Google Search Console Access for a Service Account - https://phabricator.wikimedia.org/T300004 (SCherukuwada) Thank you Jesse. This will be sufficient for now; I'll come back if I need more sites. I'm documenting some more information around how I'll likely end up us...
[10:08:38] (03CR) 10Muehlenhoff: [C: 03+2] Trim service list for idp-test [puppet] - 10https://gerrit.wikimedia.org/r/757869 (owner: 10Muehlenhoff) [10:11:48] (03Abandoned) 10Elukey: profile::kafka::broker: add pki_intermediate_name parameter [puppet] - 10https://gerrit.wikimedia.org/r/757800 (https://phabricator.wikimedia.org/T291905) (owner: 10Elukey) [10:13:20] (03PS1) 10Marostegui: Revert "db1168: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/757849 [10:16:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1168 (re)pooling @ 1%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19553 and previous config saved to /var/cache/conftool/dbconfig/20220128-101636-root.json [10:16:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:18:04] (03CR) 10Filippo Giunchedi: "Thanks Jessie and Riccardo ! LGTM overall" [alerts] - 10https://gerrit.wikimedia.org/r/757489 (https://phabricator.wikimedia.org/T294564) (owner: 10Volans) [10:18:36] (03CR) 10Hashar: [V: 03+2 C: 03+2] "Spoke with John about it, I got a bit confused earlier this morning but we have clarified the concerns. 
Note the Verified vote is restri" [puppet] (refs/meta/config) - https://gerrit.wikimedia.org/r/756989 (owner: Majavah)
[10:18:42] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1168.eqiad.wmnet with OS bullseye
[10:18:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:18:56] (CR) Marostegui: [C: +2] Revert "db1168: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/757849 (owner: Marostegui)
[10:21:03] (CR) Arturo Borrero Gonzalez: [C: +2] rabbitmq: use built-in prometheus exporter in bullseye [puppet] - https://gerrit.wikimedia.org/r/757801 (https://phabricator.wikimedia.org/T300308) (owner: Majavah)
[10:23:08] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 4 days, 0:00:00 on ganeti1019.eqiad.wmnet with reason: Remove from Ganeti cluster for reimage
[10:23:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:23:11] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4 days, 0:00:00 on ganeti1019.eqiad.wmnet with reason: Remove from Ganeti cluster for reimage
[10:23:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:25:09] (PS1) Elukey: custom_deploy.d: set externalTrafficPolicy for ml-serve's istio ingress [deployment-charts] - https://gerrit.wikimedia.org/r/757875
[10:25:57] !log draining ganeti1010 for eventual reimage
[10:25:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:26:59] SRE, ops-eqiad: Installation issues on PowerEdge R440 eqiad Ganeti servers with buster / firmware update needed - https://phabricator.wikimedia.org/T299527 (MoritzMuehlenhoff)
[10:27:22] SRE, ops-eqiad: Installation issues on PowerEdge R440 eqiad Ganeti servers with buster / firmware update needed - https://phabricator.wikimedia.org/T299527 (MoritzMuehlenhoff) One more server is ready and downtimed; ganeti1019
[10:28:16] SRE, Cloud-VPS, Patch-For-Review, cloud-services-team (Kanban): prometheus-rabbitmq-exporter for Debian Bullseye - https://phabricator.wikimedia.org/T300308 (aborrero) >>! In T300308#7658528, @Majavah wrote: > Current RabbitMQ server versions also include a built-in prometheus exporter: https://w...
[10:28:31] SRE, Cloud-VPS, Patch-For-Review, cloud-services-team (Kanban): prometheus-rabbitmq-exporter for Debian Bullseye - https://phabricator.wikimedia.org/T300308 (aborrero) Open→Resolved
[10:29:45] !log mdipietro@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2001-dev.wikimedia.org with OS bullseye
[10:29:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:31:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1168 (re)pooling @ 5%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19554 and previous config saved to /var/cache/conftool/dbconfig/20220128-103140-root.json
[10:31:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:35:36] (PS1) Jelto: charts: remove depricated helm test annotation, fix hook-delete-policy [deployment-charts] - https://gerrit.wikimedia.org/r/757877 (https://phabricator.wikimedia.org/T251305)
[10:40:00] (CR) Elukey: [C: +2] custom_deploy.d: set externalTrafficPolicy for ml-serve's istio ingress [deployment-charts] - https://gerrit.wikimedia.org/r/757875 (owner: Elukey)
[10:41:35] (CR) Arturo Borrero Gonzalez: "Sorry for the late review. As promised, here it is now.
I know the code has been merged already, so you'll likely need to do some changes " [puppet] - https://gerrit.wikimedia.org/r/757647 (https://phabricator.wikimedia.org/T300254) (owner: Michael DiPietro)
[10:43:37] (Merged) jenkins-bot: custom_deploy.d: set externalTrafficPolicy for ml-serve's istio ingress [deployment-charts] - https://gerrit.wikimedia.org/r/757875 (owner: Elukey)
[10:45:34] (CR) Filippo Giunchedi: centrallog: clean up old /srv/syslog/host directories after grace period (2 comments) [puppet] - https://gerrit.wikimedia.org/r/757498 (https://phabricator.wikimedia.org/T300056) (owner: Herron)
[10:46:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1168 (re)pooling @ 10%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19555 and previous config saved to /var/cache/conftool/dbconfig/20220128-104643-root.json
[10:46:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:47:01] SRE, SRE-Access-Requests, Research: Access to analytics-privatedata-users for Research intern AniketArs - https://phabricator.wikimedia.org/T299919 (Miriam) Thanks @Ottomata! @AniketArs will be with us until the end of the fiscal year, so the expiry_date should be 2020-06-30. And yes, could we please...
[10:51:06] SRE, Infrastructure-Foundations, netops: Configuration of New Switches Eqiad Rows E-F - https://phabricator.wikimedia.org/T299758 (cmooney)
[10:51:48] PROBLEM - LVS inference codfw port 30443/tcp - Inference ML service IPv4 on inference.svc.codfw.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[10:52:26] PROBLEM - PyBal backends health check on lvs1015 is CRITICAL: PYBAL CRITICAL - CRITICAL - inference_30443: Servers ml-serve1004.eqiad.wmnet, ml-serve1003.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[10:52:45] this is me --^
[10:53:35] it's how you say hello
[10:53:56] reverting my last change
[10:54:23] SRE, Infrastructure-Foundations, netops: Configuration of New Switches Eqiad Rows E-F - https://phabricator.wikimedia.org/T299758 (cmooney) Just to update we've had console access for most of this week and configuration / testing is under way. Will submit CRs when config is ready.
[10:54:38] SRE, Infrastructure-Foundations, netops: Validate EVPN/VXLAN configuration for Juniper QFX Platform - https://phabricator.wikimedia.org/T294115 (cmooney) Open→Resolved Closing this task. Have discussed with @ayounsi and we are broadly in agreement on next steps for the Eqiad expansion. Furt...
[10:55:26] mmm lovely it doesn't work
[10:56:16] (CirrusSearchJVMGCOldPoolFlatlined) firing: Elasticsearch instance elastic1049-production-search-psi-eqiad is showing memory pressure in the old pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://alerts.wikimedia.org
[11:00:54] RECOVERY - LVS inference codfw port 30443/tcp - Inference ML service IPv4 on inference.svc.codfw.wmnet is OK: TCP OK - 0.007 second response time on inference.discovery.wmnet port 30443 https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[11:01:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1168 (re)pooling @ 20%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19556 and previous config saved to /var/cache/conftool/dbconfig/20220128-110147-root.json
[11:01:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:01:59] (CR) JMeybohm: [C: -1] "I think you need to bump chart versions" [deployment-charts] - https://gerrit.wikimedia.org/r/757877 (https://phabricator.wikimedia.org/T251305) (owner: Jelto)
[11:03:18] RECOVERY - PyBal backends health check on lvs1015 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[11:11:27] (PS1) Majavah: rabbitmq: fix dependency cycle [puppet] - https://gerrit.wikimedia.org/r/757881 (https://phabricator.wikimedia.org/T300308)
[11:16:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1168 (re)pooling @ 25%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19557 and previous config saved to /var/cache/conftool/dbconfig/20220128-111650-root.json
[11:16:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:18:38] (Abandoned) Vgutierrez: Release 6.0.10-1wm1 [debs/varnish4] (debian-wmf) - https://gerrit.wikimedia.org/r/757866 (https://phabricator.wikimedia.org/T300264) (owner: Vgutierrez)
[11:21:47] (PS1) Vgutierrez: Release 6.0.10-1wm1 [debs/varnish4] (debian-wmf) - https://gerrit.wikimedia.org/r/757883 (https://phabricator.wikimedia.org/T300264)
[11:23:45] (CR) Arturo Borrero Gonzalez: [C: +2] rabbitmq: fix dependency cycle [puppet] - https://gerrit.wikimedia.org/r/757881 (https://phabricator.wikimedia.org/T300308) (owner: Majavah)
[11:26:46] SRE, Cloud-VPS, Patch-For-Review, cloud-services-team (Kanban): prometheus-rabbitmq-exporter for Debian Bullseye - https://phabricator.wikimedia.org/T300308 (Majavah) a:aborrero→Majavah ` Notice: /Stage[main]/Profile::Openstack::Base::Rabbitmq/Rabbitmq::Plugin[rabbitmq_prometheus]/Exec[ra...
[11:31:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1168 (re)pooling @ 40%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19558 and previous config saved to /var/cache/conftool/dbconfig/20220128-113154-root.json
[11:31:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:35:51] !log hnowlan@cumin1001 START - Cookbook sre.postgresql.postgres-init
[11:35:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:36:26] PROBLEM - Maps tiles generation on alert1001 is CRITICAL: CRITICAL: 100.00% of data under the critical threshold [5.0] https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1
[11:45:39] SRE, ops-eqiad, DC-Ops, Infrastructure-Foundations, netops: Q2:(Need By: TBD) Rows E/F network racking task - https://phabricator.wikimedia.org/T292095 (cmooney) Actually one thing that is outstanding I believe is to confirm the cable IDs? **Inter-Switch Links** I documented the inter-switc...
[11:46:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1168 (re)pooling @ 50%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19559 and previous config saved to /var/cache/conftool/dbconfig/20220128-114658-root.json
[11:47:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:54:05] (CR) jerkins-bot: [V: -1] Release 6.0.10-1wm1 [debs/varnish4] (debian-wmf) - https://gerrit.wikimedia.org/r/757883 (https://phabricator.wikimedia.org/T300264) (owner: Vgutierrez)
[11:54:10] FFS :)
[12:01:35] (PS1) Jbond: hieradata: cloud pki project [puppet] - https://gerrit.wikimedia.org/r/757887
[12:02:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1168 (re)pooling @ 60%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19560 and previous config saved to /var/cache/conftool/dbconfig/20220128-120201-root.json
[12:02:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:03:29] (CR) Jbond: [C: +2] hieradata: cloud pki project [puppet] - https://gerrit.wikimedia.org/r/757887 (owner: Jbond)
[12:05:08] (CR) Vgutierrez: [V: +2 C: +2] Release 6.0.10-1wm1 [debs/varnish4] (debian-wmf) - https://gerrit.wikimedia.org/r/757883 (https://phabricator.wikimedia.org/T300264) (owner: Vgutierrez)
[12:17:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1168 (re)pooling @ 75%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19561 and previous config saved to /var/cache/conftool/dbconfig/20220128-121706-root.json
[12:17:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:20:34] !log upload varnish 6.0.10-1wm1 to apt.wm.o (buster component/varnish6) - T300264
[12:20:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:28:25] !log hnowlan@cumin1001 START - Cookbook sre.hosts.reimage for host restbase1020.eqiad.wmnet with OS buster
[12:28:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:29:26] (PS1) Jbond: P:pki::multiroot: parameterises int CA cert location [puppet] - https://gerrit.wikimedia.org/r/757891
[12:30:08] (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33488/console" [puppet] - https://gerrit.wikimedia.org/r/757891 (owner: Jbond)
[12:30:12] !log installing libseccomp bugfix updates from bullseye point release
[12:30:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:32:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1168 (re)pooling @ 100%: repooling after reimage', diff saved to https://phabricator.wikimedia.org/P19562 and previous config saved to /var/cache/conftool/dbconfig/20220128-123210-root.json
[12:32:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:35:41] (PS1) Inductiveload: Wikisource: Increase PDF rendering resolution to 300 dpi [mediawiki-config] - https://gerrit.wikimedia.org/r/757892 (https://phabricator.wikimedia.org/T256959)
[12:39:34] (PS1) Jbond: hieradata: pki cloud add production intermediate ca's [puppet] - https://gerrit.wikimedia.org/r/757894
[12:41:51] (CR) Cathal Mooney: [C: +1] "LGTM!" [homer/public] - https://gerrit.wikimedia.org/r/757635 (owner: Ssingh)
[12:46:11] (CR) Jbond: [V: +1 C: +2] P:pki::multiroot: parameterises int CA cert location [puppet] - https://gerrit.wikimedia.org/r/757891 (owner: Jbond)
[12:46:39] (PS2) Jbond: hieradata: pki cloud add production intermediate ca's [puppet] - https://gerrit.wikimedia.org/r/757894
[12:49:01] (CR) Jbond: [C: +2] hieradata: pki cloud add production intermediate ca's [puppet] - https://gerrit.wikimedia.org/r/757894 (owner: Jbond)
[12:50:18] (PS2) Jelto: charts: remove depricated helm test annotation, fix hook-delete-policy [deployment-charts] - https://gerrit.wikimedia.org/r/757877 (https://phabricator.wikimedia.org/T251305)
[12:52:26] (CR) Jelto: charts: remove depricated helm test annotation, fix hook-delete-policy (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/757877 (https://phabricator.wikimedia.org/T251305) (owner: Jelto)
[13:00:14] RECOVERY - Check systemd state on maps1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:01:31] !log installing uriparser security updates
[13:01:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:03:54] (PS1) Jbond: hieradata: add ocsp policy for kafka [puppet] - https://gerrit.wikimedia.org/r/757895
[13:04:12] (CR) Jbond: [V: +2 C: +2] hieradata: add ocsp policy for kafka [puppet] - https://gerrit.wikimedia.org/r/757895 (owner: Jbond)
[13:05:57] (CR) JMeybohm: [C: +1] charts: remove depricated helm test annotation, fix hook-delete-policy [deployment-charts] - https://gerrit.wikimedia.org/r/757877 (https://phabricator.wikimedia.org/T251305) (owner: Jelto)
[13:10:10] (CR) Jelto: [C: +2] charts: remove depricated helm test annotation, fix hook-delete-policy [deployment-charts] - https://gerrit.wikimedia.org/r/757877 (https://phabricator.wikimedia.org/T251305) (owner: Jelto)
[13:10:48] (PS1) Jbond: P:pki::multirootca: merge default profiles with profiles [puppet] - https://gerrit.wikimedia.org/r/757896
[13:11:28] (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33489/console" [puppet] - https://gerrit.wikimedia.org/r/757896 (owner: Jbond)
[13:13:30] (PS2) Jbond: P:pki::multirootca: merge default profiles with profiles [puppet] - https://gerrit.wikimedia.org/r/757896
[13:14:07] (Merged) jenkins-bot: charts: remove depricated helm test annotation, fix hook-delete-policy [deployment-charts] - https://gerrit.wikimedia.org/r/757877 (https://phabricator.wikimedia.org/T251305) (owner: Jelto)
[13:14:11] (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33490/console" [puppet] - https://gerrit.wikimedia.org/r/757896 (owner: Jbond)
[13:15:35] (CR) Jbond: [V: +1 C: +2] P:pki::multirootca: merge default profiles with profiles [puppet] - https://gerrit.wikimedia.org/r/757896 (owner: Jbond)
[13:20:08] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host restbase1020.eqiad.wmnet with OS buster
[13:20:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:22:01] SRE, serviceops, Kubernetes, Patch-For-Review: Migrate to helm v3 - https://phabricator.wikimedia.org/T251305 (Jelto)
[13:22:33] (CR) Ladsgroup: [C: +1] change_pt_timestamp_T298558.py: New schema change [software/schema-changes] - https://gerrit.wikimedia.org/r/757796 (https://phabricator.wikimedia.org/T298558) (owner: Marostegui)
[13:22:47] (CR) Marostegui: [V: +2 C: +2] change_pt_timestamp_T298558.py: New schema change [software/schema-changes] - https://gerrit.wikimedia.org/r/757796 (https://phabricator.wikimedia.org/T298558) (owner: Marostegui)
[13:25:23] (PS1) JMeybohm: Allow deploy users to create ingress and certificate objects [deployment-charts] - https://gerrit.wikimedia.org/r/757898 (https://phabricator.wikimedia.org/T290966)
[13:27:48] (PS1) Majavah: admin: enforce-users-groups: Add new system user range [puppet] - https://gerrit.wikimedia.org/r/757899 (https://phabricator.wikimedia.org/T300254)
[13:29:01] (PS2) Majavah: admin: enforce-users-groups: Add new system user range [puppet] - https://gerrit.wikimedia.org/r/757899 (https://phabricator.wikimedia.org/T300254)
[13:30:24] SRE, serviceops, Kubernetes, Patch-For-Review: Migrate to helm v3 - https://phabricator.wikimedia.org/T251305 (Jelto) In progress→Resolved With the removal of deprecated `helm2` test annotations in https://gerrit.wikimedia.org/r/757877 all (known) cleanup steps are finished. There is some...
[13:32:00] (PS3) Majavah: admin: enforce-users-groups: Add new system user range [puppet] - https://gerrit.wikimedia.org/r/757899 (https://phabricator.wikimedia.org/T300254)
[14:13:01] SRE, SRE-Access-Requests: Requesting update to SSH key and Kerberos for Joseph Seddon - https://phabricator.wikimedia.org/T299988 (Seddon) Thank you. Can confirm it's all working.
[14:27:26] !log update varnish to version 6.0.10-1wm1 on cp4034 - T300264
[14:27:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:38:14] (CR) Jbond: [C: +1] "LGTM, I also ran the following to see what we currently have" [puppet] - https://gerrit.wikimedia.org/r/757899 (https://phabricator.wikimedia.org/T300254) (owner: Majavah)
[14:38:30] SRE, Infrastructure-Foundations: Integrate Bullseye 11.2 point update - https://phabricator.wikimedia.org/T298021 (MoritzMuehlenhoff)
[14:40:26] (PS2) Ssingh: site: add role for durum hosts in drmrs [puppet] - https://gerrit.wikimedia.org/r/757741 (https://phabricator.wikimedia.org/T300158)
[14:47:05] !log optimizing dewiki.flaggedtemplates in db2113
[14:47:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:47:43] !log update varnish to version 6.0.10-1wm1 on cp4036 - T300264
[14:47:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:50:43] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=restbase1020.eqiad.wmnet
[14:50:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:50:56] !log hnowlan@cumin1001 START - Cookbook sre.hosts.reimage for host restbase1021.eqiad.wmnet with OS buster
[14:50:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:55:42] SRE, SRE-tools, DBA, Infrastructure-Foundations, and 2 others: Switchover m2 master (db1183 -> db1159) - https://phabricator.wikimedia.org/T300329 (hnowlan) No concerns as regards `sockpuppet`, it is not currently receiving writes afaik.
[14:56:16] (CirrusSearchJVMGCOldPoolFlatlined) firing: Elasticsearch instance elastic1049-production-search-psi-eqiad is showing memory pressure in the old pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://alerts.wikimedia.org
[15:02:41] SRE, SRE-tools, DBA, Infrastructure-Foundations, and 2 others: Switchover m2 master (db1183 -> db1159) - https://phabricator.wikimedia.org/T300329 (akosiaris) Adding @Arnoldokoth as well for `vrts` (formerly named `otrs`). The software will automatically reconnect to the new host, but best to kee...
[15:07:58] SRE, SRE-tools, DBA, Infrastructure-Foundations, and 2 others: Switchover m2 master (db1183 -> db1159) - https://phabricator.wikimedia.org/T300329 (akosiaris) As far as `recommendationapi` and `mwaddlink` go, for the former we know it will not have issues, for the latter, I have no recollection o...
[15:14:22] !log start of cleaning lint errors caused by content model changes (T298343)
[15:14:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:14:27] T298343: Similar pages on it.wv and it.wp or en:wv do not have the same lint errors - https://phabricator.wikimedia.org/T298343
[15:21:46] PROBLEM - SSH on kubernetes1004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[15:29:55] (CR) Majavah: "I tried documenting the current reality here: https://wikitech.wikimedia.org/wiki/UID" [puppet] - https://gerrit.wikimedia.org/r/757899 (https://phabricator.wikimedia.org/T300254) (owner: Majavah)
[15:40:45] SRE, Platform Engineering, Product-Infrastructure-Team-Backlog, RESTBase, and 3 others: High numbers of HTTP 429 errors - https://phabricator.wikimedia.org/T262691 (Dbrant) Open→Resolved a:Dbrant I don't believe we've seen this recently, so closing.
[15:41:55] !log pool cp4031 using envoy as TLS termination layer - T271421
[15:41:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:42:00] T271421: Test envoyproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T271421
[15:43:17] (CR) Andrew Bogott: [C: +2] admin: enforce-users-groups: Add new system user range [puppet] - https://gerrit.wikimedia.org/r/757899 (https://phabricator.wikimedia.org/T300254) (owner: Majavah)
[15:44:17] SRE, Traffic, Patch-For-Review: Test envoyproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T271421 (Vgutierrez)
[15:45:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org
[15:47:11] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host restbase1021.eqiad.wmnet with OS buster
[15:47:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:47:34] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=restbase1021.eqiad.wmnet
[15:47:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:47:46] !log hnowlan@cumin1001 START - Cookbook sre.hosts.reimage for host restbase1022.eqiad.wmnet with OS buster
[15:47:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:50:52] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.postgresql.postgres-init (exit_code=0)
[15:50:53] !log hnowlan@cumin1001 START - Cookbook sre.postgresql.postgres-init
[15:50:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org
[15:50:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:50:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:11:42] (PS1) Vgutierrez: prometheus::ops: Gather varnish mtail metrics on text_envoy [puppet] - https://gerrit.wikimedia.org/r/757917 (https://phabricator.wikimedia.org/T271421)
[16:16:07] (PS2) Arturo Borrero Gonzalez: toolforge: automated-tests: add basic python webservice grid test [puppet] - https://gerrit.wikimedia.org/r/757697
[16:16:13] (CR) Vgutierrez: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33491/console" [puppet] - https://gerrit.wikimedia.org/r/757917 (https://phabricator.wikimedia.org/T271421) (owner: Vgutierrez)
[16:18:20] (PS1) BryanDavis: toolhub: Update README with helm-doc [deployment-charts] - https://gerrit.wikimedia.org/r/757921
[16:20:13] (PS3) Arturo Borrero Gonzalez: toolforge: automated-tests: add basic python webservice grid test [puppet] - https://gerrit.wikimedia.org/r/757697
[16:22:19] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host restbase1022.eqiad.wmnet with OS buster
[16:22:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:24:27] (PS4) Arturo Borrero Gonzalez: toolforge: automated-tests: add basic python webservice grid test [puppet] - https://gerrit.wikimedia.org/r/757697
[16:25:42] SRE-Access-Requests: Requesting access to Analytics Private Data Users for Tanja Andic - https://phabricator.wikimedia.org/T300383 (TAndic)
[16:27:27] (PS2) Bking: deployment-prep: add cergen config for elastic service [labs/private] - https://gerrit.wikimedia.org/r/757699 (https://phabricator.wikimedia.org/T299797)
[16:27:37] (CR) Vgutierrez: [V: +1 C: +2] prometheus::ops: Gather varnish mtail metrics on text_envoy [puppet] - https://gerrit.wikimedia.org/r/757917 (https://phabricator.wikimedia.org/T271421) (owner: Vgutierrez)
[16:31:03] (CR) Bking: "Updated code per feedback from Majavah and Gehel" [labs/private] - https://gerrit.wikimedia.org/r/757699 (https://phabricator.wikimedia.org/T299797) (owner: Bking)
[16:31:41] (CR) Gehel: [C: +1] "LGTM now. I'm still not familiar with cergen, but this seems good enough to merge." [labs/private] - https://gerrit.wikimedia.org/r/757699 (https://phabricator.wikimedia.org/T299797) (owner: Bking)
[16:37:11] SRE, SRE-Access-Requests, Research: Access to analytics-privatedata-users for Research intern AniketArs - https://phabricator.wikimedia.org/T299919 (jhathaway)
[16:37:29] (CR) BryanDavis: [C: +2] toolhub: Update README with helm-doc [deployment-charts] - https://gerrit.wikimedia.org/r/757921 (owner: BryanDavis)
[16:41:14] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=restbase1022.eqiad.wmnet
[16:41:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:41:32] !log hnowlan@cumin1001 START - Cookbook sre.hosts.reimage for host restbase1023.eqiad.wmnet with OS buster
[16:41:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:43:06] (Merged) jenkins-bot: toolhub: Update README with helm-doc [deployment-charts] - https://gerrit.wikimedia.org/r/757921 (owner: BryanDavis)
[16:45:07] SRE, SRE-Access-Requests, Research: Access to analytics-privatedata-users for Research intern AniketArs - https://phabricator.wikimedia.org/T299919 (jhathaway) @Miriam I assume you mean 2022-06-30 😉, though with covid still with us, who knows what year it is!
[16:46:37] RECOVERY - SSH on kubernetes1004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[16:49:05] (PS1) JHathaway: Add Aniket Bharti to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/757926 (https://phabricator.wikimedia.org/T299919)
[16:54:20] SRE, SRE-Access-Requests: Requesting update to SSH key and Kerberos for Joseph Seddon - https://phabricator.wikimedia.org/T299988 (jhathaway) Open→Resolved great!
[16:54:25] PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [400.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[16:59:59] SRE, SRE-Access-Requests: Requesting access to Analytics Private Data Users for Tanja Andic - https://phabricator.wikimedia.org/T300383 (jhathaway) a:jhathaway
[17:00:10] RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [200.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[17:00:23] SRE, SRE-tools, DBA, Infrastructure-Foundations, and 2 others: Switchover m2 master (db1183 -> db1159) - https://phabricator.wikimedia.org/T300329 (dpifke) No objections from `xhgui` owners, the service is low traffic and should automatically reconnect.
[17:00:39] I am checking Zuul
[17:01:04] merger:merge 360 3 3
[17:01:04] so bunch of merge requests are pending
[17:07:27] SRE, SRE-tools, DBA, Infrastructure-Foundations, and 2 others: Switchover m2 master (db1183 -> db1159) - https://phabricator.wikimedia.org/T300329 (bd808) >>! In T300329#7658612, @Marostegui wrote: > * iegreview PHP app without persistent connections, so it should recover fine.
It doesn't look l...
[17:07:51] ops-eqiad: New Cage Config/Testing Eqiad - https://phabricator.wikimedia.org/T300353 (cmooney)
[17:11:07] (CR) RLazarus: [C: +1] Add Aniket Bharti to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/757926 (https://phabricator.wikimedia.org/T299919) (owner: JHathaway)
[17:12:11] (CR) JHathaway: [C: +2] Add Aniket Bharti to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/757926 (https://phabricator.wikimedia.org/T299919) (owner: JHathaway)
[17:14:53] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host restbase1023.eqiad.wmnet with OS buster
[17:14:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:16:19] SRE, SRE-Access-Requests: Requesting access to Analytics Private Data Users for Tanja Andic - https://phabricator.wikimedia.org/T300383 (jhathaway) ssh key confirmed via gchat
[17:17:42] !log hnowlan@cumin1001 START - Cookbook sre.hosts.reimage for host restbase1024.eqiad.wmnet with OS buster
[17:17:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:17:52] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=restbase1023.eqiad.wmnet
[17:17:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:19:34] SRE, SRE-Access-Requests, Research, Patch-For-Review: Access to analytics-privatedata-users for Research intern AniketArs - https://phabricator.wikimedia.org/T299919 (jhathaway) @AniketArs change has been submitted, including your kerberos access, please give it a try!
[17:22:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org
[17:27:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org
[17:27:58] (PS2) JMeybohm: Allow deploy users to create ingress and certificate objects [deployment-charts] - https://gerrit.wikimedia.org/r/757898 (https://phabricator.wikimedia.org/T290966)
[17:28:00] (PS1) JMeybohm: _ingress_helpers: HTTPRoute does not require a destination [deployment-charts] - https://gerrit.wikimedia.org/r/757934 (https://phabricator.wikimedia.org/T290966)
[17:28:03] (PS1) JMeybohm: Add ingress support to miscweb chart [deployment-charts] - https://gerrit.wikimedia.org/r/757935 (https://phabricator.wikimedia.org/T290966)
[17:28:05] (PS1) JMeybohm: miscweb: Remove repeating settings and enable ingress [deployment-charts] - https://gerrit.wikimedia.org/r/757936 (https://phabricator.wikimedia.org/T290966)
[17:30:39] SRE, SRE-Access-Requests: Requesting access to Analytics Private Data Users for Tanja Andic - https://phabricator.wikimedia.org/T300383 (jhathaway)
[17:34:45] SRE, SRE-Access-Requests: Requesting access to Analytics Private Data Users for Tanja Andic - https://phabricator.wikimedia.org/T300383 (jhathaway) @JAnstee_WMF & @Ottomata please approve
[17:50:01] PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [400.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[17:52:14] RECOVERY - Check systemd state on apifeatureusage1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:52:58] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host restbase1024.eqiad.wmnet with OS buster
[17:53:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:53:09] (PS1) JHathaway: Add Tanja Anđić to analytics_privatedata_users [puppet] - https://gerrit.wikimedia.org/r/757941 (https://phabricator.wikimedia.org/T300383)
[17:55:38] PROBLEM - Check systemd state on apifeatureusage1001 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_apifeatureusage_codfw.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:56:40] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to Analytics Private Data Users for Tanja Andic - https://phabricator.wikimedia.org/T300383 (JAnstee_WMF) I approve as Tanja's manager
[17:58:27] (PS1) Hashar: zuul: stop keeping reflog on the mergers [puppet] - https://gerrit.wikimedia.org/r/757943
[17:58:29] (PS1) Hashar: zuul: prune heads and tags on each fetches [puppet] - https://gerrit.wikimedia.org/r/757944 (https://phabricator.wikimedia.org/T220606)
[17:58:49] (CR) Hashar: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/757943 (owner: Hashar)
[17:58:55] (CR) Hashar: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/757944 (https://phabricator.wikimedia.org/T220606) (owner: Hashar)
[17:59:06] (CR) Hashar: [V: -1] "I have to test it out :)" [puppet] - https://gerrit.wikimedia.org/r/757944 (https://phabricator.wikimedia.org/T220606) (owner: Hashar)
[18:00:01] (CR) jerkins-bot: [V: -1] zuul: prune heads and tags on each fetches [puppet] - https://gerrit.wikimedia.org/r/757944 (https://phabricator.wikimedia.org/T220606) (owner: Hashar)
[18:00:03] (CR) jerkins-bot: [V: -1] zuul: stop keeping reflog on the mergers [puppet] - https://gerrit.wikimedia.org/r/757943 (owner: Hashar)
[18:00:06] bah
[18:01:05] that will be for next week :-] they are not anywhere urgent :D
[18:21:29] SRE, ops-eqiad: New Cage Config/Testing Eqiad - https://phabricator.wikimedia.org/T300353 (wiki_willy) a:cmooney→Cmjohnson Assigning over to @Cmjohnson and cc'ing @Jclark-ctr, Chris - this one needs to be completed fairly quickly (maybe. within a couple business days), so that we can proceed for...
[18:25:04] RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [200.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[18:32:59] SRE, ops-drmrs, DC-Ops, Infrastructure-Foundations, netops: Q3:(Need By: ASAP) rack/setup/install cr[12]-drmrs - https://phabricator.wikimedia.org/T300277 (RobH) >>! In T300277#7658774, @ayounsi wrote: >> Please note the above diagram has a mistake, showing both routers connecting to PP:15/16...
[18:54:52] SRE, ops-eqiad, DC-Ops, Infrastructure-Foundations, netops: Q2:(Need By: TBD) Rows E/F network racking task - https://phabricator.wikimedia.org/T292095 (wiki_willy) Hi @cmooney - here's the doc that @Jclark-ctr put together when running the cables for the inter-switch links. Some of the cabl...
[18:56:16] (CirrusSearchJVMGCOldPoolFlatlined) firing: Elasticsearch instance elastic1049-production-search-psi-eqiad is showing memory pressure in the old pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://alerts.wikimedia.org
[19:08:29] (PS1) Cwhite: apifeatureusage: disable gc logging [puppet] - https://gerrit.wikimedia.org/r/757955 (https://phabricator.wikimedia.org/T297239)
[19:12:08] SRE, ops-eqiad, decommission-hardware: decommission pay-lvs1002.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T300168 (wiki_willy) a:Cmjohnson
[19:12:25] SRE, ops-eqiad, decommission-hardware: decommission pay-lvs1001.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T300165 (wiki_willy) a:Cmjohnson
[19:16:26] PROBLEM - SSH on mw2257.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[19:16:51] (PS1) Andrew Bogott: Keystone: don't use apache/mod_wsgi on bullseye [puppet] - https://gerrit.wikimedia.org/r/757956 (https://phabricator.wikimedia.org/T300254)
[19:20:29] (CR) Andrew Bogott: [C: +2] Keystone: don't use apache/mod_wsgi on bullseye [puppet] - https://gerrit.wikimedia.org/r/757956 (https://phabricator.wikimedia.org/T300254) (owner: Andrew Bogott)
[19:21:08] (CR) Majavah: [V: +1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33493/console" [puppet] - https://gerrit.wikimedia.org/r/757956 (https://phabricator.wikimedia.org/T300254) (owner: Andrew Bogott)
[19:24:56] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to Analytics Private Data Users for Tanja Andic - https://phabricator.wikimedia.org/T300383 (Ottomata) Approved!
[19:28:34] (PS1) Andrew Bogott: Exclude keystone::apache profile on Bullseye hosts [puppet] - https://gerrit.wikimedia.org/r/757959 (https://phabricator.wikimedia.org/T300254)
[19:32:01] (PS2) Andrew Bogott: Exclude keystone::apache profile on Bullseye hosts [puppet] - https://gerrit.wikimedia.org/r/757959 (https://phabricator.wikimedia.org/T300254)
[19:32:18] (CR) Majavah: [C: +1] Exclude keystone::apache profile on Bullseye hosts [puppet] - https://gerrit.wikimedia.org/r/757959 (https://phabricator.wikimedia.org/T300254) (owner: Andrew Bogott)
[19:33:52] (CR) Andrew Bogott: [C: +2] Exclude keystone::apache profile on Bullseye hosts [puppet] - https://gerrit.wikimedia.org/r/757959 (https://phabricator.wikimedia.org/T300254) (owner: Andrew Bogott)
[19:35:16] (CR) AntiCompositeNumber: [C: -1] "On Wikimedia wikis, PDF rendering is handled by Thumbor, not MediaWiki. This change would have no effect." [mediawiki-config] - https://gerrit.wikimedia.org/r/757892 (https://phabricator.wikimedia.org/T256959) (owner: Inductiveload)
[19:36:39] (PS1) Andrew Bogott: Another piece of removing Apache from cloudcontrol/bullseye [puppet] - https://gerrit.wikimedia.org/r/757962 (https://phabricator.wikimedia.org/T300254)
[19:38:04] (CR) Andrew Bogott: [C: +2] Another piece of removing Apache from cloudcontrol/bullseye [puppet] - https://gerrit.wikimedia.org/r/757962 (https://phabricator.wikimedia.org/T300254) (owner: Andrew Bogott)
[19:48:39] (PS1) Andrew Bogott: profile::openstack::eqiad1::keystone::wsgi_server: 'keystone' [puppet] - https://gerrit.wikimedia.org/r/757968 (https://phabricator.wikimedia.org/T300254)
[19:49:46] PROBLEM - SSH on kubernetes1004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[19:49:47] (CR) Ebernhardson: [C: +1] sre.wdqs.data-reload: few fixes and cleanups [cookbooks] - https://gerrit.wikimedia.org/r/753426 (owner: DCausse)
[19:50:56] (CR) Andrew Bogott: [C: +2] profile::openstack::eqiad1::keystone::wsgi_server: 'keystone' [puppet] - https://gerrit.wikimedia.org/r/757968 (https://phabricator.wikimedia.org/T300254) (owner: Andrew Bogott)
[19:53:13] (PS1) Andrew Bogott: typo fix: apache2/keystone [puppet] - https://gerrit.wikimedia.org/r/757969 (https://phabricator.wikimedia.org/T300254)
[19:54:19] (CR) Andrew Bogott: [C: +2] typo fix: apache2/keystone [puppet] - https://gerrit.wikimedia.org/r/757969 (https://phabricator.wikimedia.org/T300254) (owner: Andrew Bogott)
[19:56:50] SRE, Domains, Traffic, WMF-Communications, wikimediafoundation.org: Project Unseen campaign URL redirect - https://phabricator.wikimedia.org/T300398 (Varnent)
[19:56:59] (PS1) Andrew Bogott: Further attempts at profile::openstack::eqiad1::keystone::wsgi_server: 'keystone' [puppet] - https://gerrit.wikimedia.org/r/757971
[19:58:14] (CR) jerkins-bot: [V: -1] Further attempts at profile::openstack::eqiad1::keystone::wsgi_server: 'keystone' [puppet] - https://gerrit.wikimedia.org/r/757971 (owner: Andrew Bogott)
[20:00:16] (PS2) Andrew Bogott: Further attempts at keystone::wsgi_server: 'keystone' [puppet] - https://gerrit.wikimedia.org/r/757971
[20:03:22] (CR) Andrew Bogott: [C: +2] Further attempts at keystone::wsgi_server: 'keystone' [puppet] - https://gerrit.wikimedia.org/r/757971 (owner: Andrew Bogott)
[20:10:48] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.postgresql.postgres-init (exit_code=0)
[20:10:48] !log hnowlan@cumin1001 START - Cookbook sre.postgresql.postgres-init
[20:10:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:10:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:10:58] (CR) Inductiveload: Wikisource: Increase PDF rendering resolution to 300 dpi (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/757892 (https://phabricator.wikimedia.org/T256959) (owner: Inductiveload)
[20:11:22] RECOVERY - Postgres Replication Lag on maps1008 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 0 and 37 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[20:12:54] (CR) Bking: [C: +2] deployment-prep: add cergen config for elastic service [labs/private] - https://gerrit.wikimedia.org/r/757699 (https://phabricator.wikimedia.org/T299797) (owner: Bking)
[20:12:57] (CR) Bking: [V: +2 C: +2] deployment-prep: add cergen config for elastic service [labs/private] - https://gerrit.wikimedia.org/r/757699 (https://phabricator.wikimedia.org/T299797) (owner: Bking)
[20:14:16] !log hnowlan@cumin1001 END (FAIL) - Cookbook sre.postgresql.postgres-init (exit_code=99)
[20:14:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:20:24] (CR) RLazarus: [C: +1] Add Tanja Anđić to analytics_privatedata_users [puppet] - https://gerrit.wikimedia.org/r/757941 (https://phabricator.wikimedia.org/T300383) (owner: JHathaway)
[20:33:52] (CR) JHathaway: [C: +2] Add Tanja Anđić to analytics_privatedata_users [puppet] - https://gerrit.wikimedia.org/r/757941 (https://phabricator.wikimedia.org/T300383) (owner: JHathaway)
[20:49:15] (PS5) Giuseppe Lavagetto: tls_helpers: fail if a listener is non existent [deployment-charts] - https://gerrit.wikimedia.org/r/755527 (https://phabricator.wikimedia.org/T291959)
[20:49:17] (PS1) Giuseppe Lavagetto: Rakefile: rationalize task arguments treatment [deployment-charts] - https://gerrit.wikimedia.org/r/757976
[20:49:19] (PS1) Giuseppe Lavagetto: [DRAFT] Refactor Rakefile [deployment-charts] - https://gerrit.wikimedia.org/r/757977
[20:49:24] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to Analytics Private Data Users for Tanja Andic - https://phabricator.wikimedia.org/T300383 (jhathaway) @TAndic your account and kerberos credentials should be setup, please lookout for an email. Let me know if everything works!
[20:49:56] (CR) jerkins-bot: [V: -1] Rakefile: rationalize task arguments treatment [deployment-charts] - https://gerrit.wikimedia.org/r/757976 (owner: Giuseppe Lavagetto)
[20:50:01] (CR) jerkins-bot: [V: -1] [DRAFT] Refactor Rakefile [deployment-charts] - https://gerrit.wikimedia.org/r/757977 (owner: Giuseppe Lavagetto)
[20:54:02] (CR) AntiCompositeNumber: [C: -1] Wikisource: Increase PDF rendering resolution to 300 dpi (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/757892 (https://phabricator.wikimedia.org/T256959) (owner: Inductiveload)
[21:18:18] (PS1) Majavah: openstack: keystone: set bind port for uwsgi process [puppet] - https://gerrit.wikimedia.org/r/757982 (https://phabricator.wikimedia.org/T300254)
[21:18:58] RECOVERY - SSH on mw2257.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[21:20:52] (PS2) Majavah: openstack: keystone: set bind port for uwsgi process [puppet] - https://gerrit.wikimedia.org/r/757982 (https://phabricator.wikimedia.org/T300254)
[21:21:44] (CR) Majavah: [V: +1] "PCC SUCCESS (NOOP 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33495/console" [puppet] - https://gerrit.wikimedia.org/r/757982 (https://phabricator.wikimedia.org/T300254) (owner: Majavah)
[21:23:15] (PS3) Majavah: openstack: keystone: set bind port for uwsgi process [puppet] - https://gerrit.wikimedia.org/r/757982 (https://phabricator.wikimedia.org/T300254)
[21:33:58] (CR) Dereckson: "Indeed for wikisource, but it could be a sane default for other Proofreadpage installations." [mediawiki-config] - https://gerrit.wikimedia.org/r/757892 (https://phabricator.wikimedia.org/T256959) (owner: Inductiveload)
[21:38:50] (CR) Dzahn: "Yes, I did know most of this but not every single detail so I'm thankful you were so detailed in your response! (Also very useful for othe" [puppet] - https://gerrit.wikimedia.org/r/757509 (https://phabricator.wikimedia.org/T300171) (owner: Dzahn)
[21:47:21] !log purging font packages from remaining appservers in codfw mw23* ranges.. T294378
[21:47:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:47:26] T294378: Remove mediawiki::packages::fonts from non thumbor servers - https://phabricator.wikimedia.org/T294378
[21:49:16] PROBLEM - Disk space on centrallog1001 is CRITICAL: DISK CRITICAL - free space: /srv 34497 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=centrallog1001&var-datasource=eqiad+prometheus/ops
[21:51:37] ^ commented at https://phabricator.wikimedia.org/T285371#7660747
[21:52:12] RECOVERY - SSH on kubernetes1004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[21:52:45] !log purging font packages from mwdebug* and scandium* T294378
[21:52:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:52:49] T294378: Remove mediawiki::packages::fonts from non thumbor servers - https://phabricator.wikimedia.org/T294378
[22:00:39] (PS4) Ryan Kemper: elastic: install elasticsearch-oss from component [puppet] - https://gerrit.wikimedia.org/r/757700
[22:00:53] (CR) Ryan Kemper: elastic: install elasticsearch-oss from component (4 comments) [puppet] - https://gerrit.wikimedia.org/r/757700 (owner: Ryan Kemper)
[22:01:15] (CR) jerkins-bot: [V: -1] elastic: install elasticsearch-oss from component [puppet] - https://gerrit.wikimedia.org/r/757700 (owner: Ryan Kemper)
[22:02:30] (PS5) Ryan Kemper: elastic: install elasticsearch-oss from component [puppet] - https://gerrit.wikimedia.org/r/757700
[22:02:42] (CR) Ryan Kemper: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/757700 (owner: Ryan Kemper)
[22:04:16] RECOVERY - BFD status on cr2-eqiad is OK: OK: UP: 16 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[22:04:20] RECOVERY - BFD status on cr2-codfw is OK: OK: UP: 16 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[22:04:51] (CR) jerkins-bot: [V: -1] elastic: install elasticsearch-oss from component [puppet] - https://gerrit.wikimedia.org/r/757700 (owner: Ryan Kemper)
[22:06:42] SRE, serviceops, Patch-For-Review: Remove mediawiki::packages::fonts from non thumbor servers - https://phabricator.wikimedia.org/T294378 (Dzahn) @Joe All the font* and xfont* packages are gone. from mw1, mw2, wtp, parse, labweb, scandium etc For example https://debmonitor.wikimedia.org/packages/fon...
[22:08:17] SRE, serviceops, Patch-For-Review: Remove mediawiki::packages::fonts from non thumbor servers - https://phabricator.wikimedia.org/T294378 (Dzahn) I did not see a change in MW exceptions over the last couple days either.
[22:15:20] (PS6) Ryan Kemper: elastic: install elasticsearch-oss from component [puppet] - https://gerrit.wikimedia.org/r/757700
[22:16:03] (PS12) MarcoAurelio: mediawiki::maintenance: Run recountCategories.php monthly on all wikis [puppet] - https://gerrit.wikimedia.org/r/756069 (https://phabricator.wikimedia.org/T299823)
[22:17:40] (CR) jerkins-bot: [V: -1] elastic: install elasticsearch-oss from component [puppet] - https://gerrit.wikimedia.org/r/757700 (owner: Ryan Kemper)
[22:17:44] (PS13) MarcoAurelio: mediawiki::maintenance: Run recountCategories.php monthly on all wikis [puppet] - https://gerrit.wikimedia.org/r/756069 (https://phabricator.wikimedia.org/T299823)
[22:19:14] (CR) MarcoAurelio: "Script ran on all Beta Cluster wikis. Output at sounds good." [puppet] - https://gerrit.wikimedia.org/r/756069 (https://phabricator.wikimedia.org/T299823) (owner: MarcoAurelio)
[22:20:16] (CR) MarcoAurelio: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/756069 (https://phabricator.wikimedia.org/T299823) (owner: MarcoAurelio)
[22:20:34] (CR) jerkins-bot: [V: -1] mediawiki::maintenance: Run recountCategories.php monthly on all wikis [puppet] - https://gerrit.wikimedia.org/r/756069 (https://phabricator.wikimedia.org/T299823) (owner: MarcoAurelio)
[22:22:06] (PS14) MarcoAurelio: mediawiki::maintenance: Run recountCategories.php monthly on all wikis [puppet] - https://gerrit.wikimedia.org/r/756069 (https://phabricator.wikimedia.org/T299823)
[22:24:35] (CR) Gergő Tisza: GrowthExperiments: Disable mobile quality gate (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/757792 (https://phabricator.wikimedia.org/T298122) (owner: Gergő Tisza)
[22:24:43] (CR) MarcoAurelio: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/756069 (https://phabricator.wikimedia.org/T299823) (owner: MarcoAurelio)
[22:31:43] (PS7) Ryan Kemper: elastic: install elasticsearch-oss from component [puppet] - https://gerrit.wikimedia.org/r/757700
[22:33:38] (CR) jerkins-bot: [V: -1] elastic: install elasticsearch-oss from component [puppet] - https://gerrit.wikimedia.org/r/757700 (owner: Ryan Kemper)
[22:34:26] (PS8) Ryan Kemper: elastic: install elasticsearch-oss from component [puppet] - https://gerrit.wikimedia.org/r/757700
[22:36:21] (CR) jerkins-bot: [V: -1] elastic: install elasticsearch-oss from component [puppet] - https://gerrit.wikimedia.org/r/757700 (owner: Ryan Kemper)
[22:43:32] (PS9) Ryan Kemper: elastic: install elasticsearch-oss from component [puppet] - https://gerrit.wikimedia.org/r/757700
[22:45:40] (CR) jerkins-bot: [V: -1] elastic: install elasticsearch-oss from component [puppet] - https://gerrit.wikimedia.org/r/757700 (owner: Ryan Kemper)
[22:47:20] SRE, ops-drmrs, DC-Ops, Infrastructure-Foundations, netops: Q3:(Need By: ASAP) rack/setup/install cr[12]-drmrs - https://phabricator.wikimedia.org/T300277 (RobH) Submitted the revised document (using numbered steps) to Interxion via ticket CS0433959. I listed @wiki_willy, @ayounsi, & @cmoone...
[22:54:12] (PS10) Ryan Kemper: elastic: install elasticsearch-oss from component [puppet] - https://gerrit.wikimedia.org/r/757700
[22:56:06] (CR) jerkins-bot: [V: -1] elastic: install elasticsearch-oss from component [puppet] - https://gerrit.wikimedia.org/r/757700 (owner: Ryan Kemper)
[22:56:16] (CirrusSearchJVMGCOldPoolFlatlined) firing: Elasticsearch instance elastic1049-production-search-psi-eqiad is showing memory pressure in the old pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://alerts.wikimedia.org
[22:59:31] (PS11) Ryan Kemper: elastic: install elasticsearch-oss from component [puppet] - https://gerrit.wikimedia.org/r/757700
[23:00:54] (PS12) Ryan Kemper: elastic: install elasticsearch-oss from component [puppet] - https://gerrit.wikimedia.org/r/757700
[23:02:47] (CR) jerkins-bot: [V: -1] elastic: install elasticsearch-oss from component [puppet] - https://gerrit.wikimedia.org/r/757700 (owner: Ryan Kemper)
[23:03:10] (PS13) Ryan Kemper: elastic: install elasticsearch-oss from component [puppet] - https://gerrit.wikimedia.org/r/757700
[23:05:09] (CR) jerkins-bot: [V: -1] elastic: install elasticsearch-oss from component [puppet] - https://gerrit.wikimedia.org/r/757700 (owner: Ryan Kemper)
[23:08:20] (PS14) Ryan Kemper: elastic: install elasticsearch-oss from component [puppet] - https://gerrit.wikimedia.org/r/757700
[23:09:44] (PS15) Ryan Kemper: elastic: install elasticsearch-oss from component [puppet] - https://gerrit.wikimedia.org/r/757700
[23:11:49] (CR) jerkins-bot: [V: -1] elastic: install elasticsearch-oss from component [puppet] - https://gerrit.wikimedia.org/r/757700 (owner: Ryan Kemper)
[23:14:17] (CR) Ryan Kemper: "Latest build "failure" has the tests passing so it might just be a CI issue? Will check pcc" [puppet] - https://gerrit.wikimedia.org/r/757700 (owner: Ryan Kemper)
[23:14:23] (CR) Ryan Kemper: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/757700 (owner: Ryan Kemper)