[00:00:04] brennen: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC late backport and config training deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220114T0000).
[00:02:11] (Merged) jenkins-bot: In WikitextContentHandler always use getFreshParser() [core] (wmf/1.38.0-wmf.17) - https://gerrit.wikimedia.org/r/753828 (https://phabricator.wikimedia.org/T299149) (owner: Dduvall)
[00:04:19] alright, pulling the patch in to php-1.38.0-wmf.17
[00:05:18] TimStarling: is there a way to test ^ on mwdebug or should i just sync it?
[00:05:36] PROBLEM - Check systemd state on grafana1002 is CRITICAL: CRITICAL - degraded: The following units failed: grafana-ldap-users-sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:05:40] SRE, ops-codfw, Continuous-Integration-Infrastructure, DC-Ops, netops: DRAC firmware upgrades codfw (was: Flapping codfw management alarm ( contint2001.mgmt/SSH is CRITICAL ))) - https://phabricator.wikimedia.org/T283582 (Papaul) @hashar since Monday is a Holiday, let is do this on the 18th a...
[00:06:18] PROBLEM - Check systemd state on lists1001 is CRITICAL: CRITICAL - degraded: The following units failed: discard_held_messages.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:07:03] dduvall: just sync it, the logs will provide confirmation
[00:07:10] ack
[00:08:59] !log dduvall@deploy1002 Synchronized php-1.38.0-wmf.17/includes/content/WikitextContentHandler.php: Backport: [[gerrit:753828|In WikitextContentHandler always use getFreshParser() (T299149)]] (duration: 01m 07s)
[00:09:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:09:03] T299149: MWException: Parser state cleared while parsing. Did you call Parser::parse recursively? - https://phabricator.wikimedia.org/T299149
[00:09:22] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn
[00:09:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:09:30] i don't recall seeing much before group1 promotion so i will go ahead with that as well
[00:10:17] (PS1) Dduvall: group1 wikis to 1.38.0-wmf.17 refs T293958 [mediawiki-config] - https://gerrit.wikimedia.org/r/753860
[00:10:19] (CR) Dduvall: [C: +2] group1 wikis to 1.38.0-wmf.17 refs T293958 [mediawiki-config] - https://gerrit.wikimedia.org/r/753860 (owner: Dduvall)
[00:12:08] (Merged) jenkins-bot: group1 wikis to 1.38.0-wmf.17 refs T293958 [mediawiki-config] - https://gerrit.wikimedia.org/r/753860 (owner: Dduvall)
[00:13:57] !log dduvall@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.38.0-wmf.17 refs T293958
[00:14:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:14:01] T293958: 1.38.0-wmf.17 deployment blockers - https://phabricator.wikimedia.org/T293958
[00:14:18] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[00:14:19] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn
[00:14:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:14:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:15:04] !log dduvall@deploy1002 Synchronized php: group1 wikis to 1.38.0-wmf.17 refs T293958 (duration: 01m 06s)
[00:15:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:15:29] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[00:15:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:18:48] looks good. rolling to all wikis
[00:19:58] (PS1) Dduvall: all wikis to 1.38.0-wmf.17 refs T293958 [mediawiki-config] - https://gerrit.wikimedia.org/r/753862
[00:20:00] (CR) Dduvall: [C: +2] all wikis to 1.38.0-wmf.17 refs T293958 [mediawiki-config] - https://gerrit.wikimedia.org/r/753862 (owner: Dduvall)
[00:20:36] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn
[00:20:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:21:28] (Merged) jenkins-bot: all wikis to 1.38.0-wmf.17 refs T293958 [mediawiki-config] - https://gerrit.wikimedia.org/r/753862 (owner: Dduvall)
[00:23:08] !log dduvall@deploy1002 rebuilt and synchronized wikiversions files: all wikis to 1.38.0-wmf.17 refs T293958
[00:23:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:23:12] T293958: 1.38.0-wmf.17 deployment blockers - https://phabricator.wikimedia.org/T293958
[00:23:44] Fingers crossed.
[00:24:38] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[00:24:40] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn
[00:24:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:24:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:25:49] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[00:25:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:26:27] i am seeing some db replication lag errors
[00:27:21] and some slow queries on the page table
[00:27:37] maybe this is temporary though
[00:29:57] i think we're ok
[00:30:52] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn
[00:30:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:32:10] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[00:32:11] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn
[00:32:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:32:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:36:05] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[00:36:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:41:04] calling that a train i guess. thanks TimStarling, twentyafterfour, others
[00:50:36] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=sidekiq site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:55:14] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[01:10:40] SRE-Access-Requests: Requesting access to analytics-privatedata-users for Nick Ray - https://phabricator.wikimedia.org/T299186 (nray)
[01:20:38] PROBLEM - restbase endpoints health on restbase2023 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is WARNING: Test Get summary from storage responds with unexpected value at path = Missing keys: [thumbnail, originalimage] https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[01:28:46] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/summary/{title} (Get summary for test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[01:30:52] PROBLEM - restbase endpoints health on restbase2009 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is WARNING: Test Get summary from storage responds with unexpected value at path = Missing keys: [thumbnail, originalimage] https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[01:41:46] PROBLEM - restbase endpoints health on restbase2020 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is WARNING: Test Get summary from storage responds with unexpected value at path = Missing keys: [thumbnail, originalimage] https://wikitech.wikimedia.org/wiki/Services/Monitor
[01:41:46] base
[01:51:34] PROBLEM - restbase endpoints health on restbase2009 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is WARNING: Test Get summary from storage responds with unexpected value at path = Missing keys: [thumbnail, originalimage] https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[01:55:04] RECOVERY - SSH on kubernetes1004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[02:09:26] PROBLEM - restbase endpoints health on restbase2009 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is WARNING: Test Get summary from storage responds with unexpected value at path = Missing keys: [thumbnail, originalimage] https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[02:22:36] PROBLEM - restbase endpoints health on restbase2009 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is WARNING: Test Get summary from storage responds with unexpected value at path = Missing keys: [thumbnail, originalimage] https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[02:37:54] PROBLEM - restbase endpoints health on restbase2009 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is WARNING: Test Get summary from storage responds with unexpected value at path = Missing keys: [thumbnail, originalimage] https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[02:49:38] PROBLEM - SSH on mw2258.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[02:52:12] PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin1001 is CRITICAL: CRITICAL: the following (5) node(s) change every puppet run: elastic2051, labstore1007, miscweb1002, labstore1006, restbase2009 https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes
[04:09:18] PROBLEM - restbase endpoints health on restbase2009 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is WARNING: Test Get summary from storage responds with unexpected value at path = Missing keys: [thumbnail, originalimage] https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[04:26:06] PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin2001 is CRITICAL: CRITICAL: the following (5) node(s) change every puppet run: labstore1006, elastic2051, miscweb1002, labstore1007, restbase2009 https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes
[04:52:10] RECOVERY - SSH on mw2258.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:09:46] (PS9) Andrew Bogott: Added nfs/migrate_service.py [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/753612 (https://phabricator.wikimedia.org/T293800)
[05:12:04] RECOVERY - Check systemd state on lists1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:12:42] (CR) jerkins-bot: [V: -1] Added nfs/migrate_service.py [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/753612 (https://phabricator.wikimedia.org/T293800) (owner: Andrew Bogott)
[05:15:21] (PS10) Andrew Bogott: Added nfs/migrate_service.py [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/753612 (https://phabricator.wikimedia.org/T293800)
[05:16:28] !log manually restarted discard_held_messages service on lists1001, failed with a spurious sqlalchemy issue about packets being out of order
[05:16:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:18:50] (CR) Andrew Bogott: [C: +2] Added nfs/migrate_service.py [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/753612 (https://phabricator.wikimedia.org/T293800) (owner: Andrew Bogott)
[05:19:01] (PS11) Andrew Bogott: Added nfs/migrate_service.py [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/753612 (https://phabricator.wikimedia.org/T293800)
[05:57:46] PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin2002 is CRITICAL: CRITICAL: the following (5) node(s) change every puppet run: miscweb1002, labstore1006, restbase2009, labstore1007, elastic2051 https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes
[06:11:50] (PS1) Marostegui: wmnet: Failover m5-master to dbproxy1021 [dns] - https://gerrit.wikimedia.org/r/753870 (https://phabricator.wikimedia.org/T298586)
[06:14:10] (PS1) Marostegui: Revert "dbproxy1021: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/753835
[06:14:52] (CR) Marostegui: [C: +2] Revert "dbproxy1021: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/753835 (owner: Marostegui)
[06:15:47] !log Failover m5 proxy from dbproxy1017 to dbproxy1021 T298586
[06:15:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:15:50] PROBLEM - restbase endpoints health on restbase2009 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is WARNING: Test Get summary from storage responds with unexpected value at path = Missing keys: [thumbnail, originalimage] https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[06:15:51] T298586: Upgrade all dbproxy hosts to Bullseye - https://phabricator.wikimedia.org/T298586
[06:15:54] (CR) Marostegui: [C: +2] wmnet: Failover m5-master to dbproxy1021 [dns] - https://gerrit.wikimedia.org/r/753870 (https://phabricator.wikimedia.org/T298586) (owner: Marostegui)
[06:35:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove logpager group from s3 eqiad T263127', diff saved to https://phabricator.wikimedia.org/P18735 and previous config saved to /var/cache/conftool/dbconfig/20220114-063554-marostegui.json
[06:35:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:35:59] T263127: Remove groups from db configs - https://phabricator.wikimedia.org/T263127
[06:37:04] SRE, Language-Team (Language-2022-January-March): Deploy Flores MT secrets in Production for ContentTranslation - https://phabricator.wikimedia.org/T299023 (KartikMistry) >>! In T299023#7620868, @Dzahn wrote: > Hi @KartikMistry re: the question how to get the key to us: you can make a new file in your ho...
[07:00:40] (PS1) Marostegui: pc2012: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/753874 (https://phabricator.wikimedia.org/T299046)
[07:02:48] (CR) Marostegui: [C: +2] pc2012: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/753874 (https://phabricator.wikimedia.org/T299046) (owner: Marostegui)
[07:05:32] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host pc2012.codfw.wmnet with OS bullseye
[07:05:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:20:09] PROBLEM - restbase endpoints health on restbase2014 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is WARNING: Test Get summary from storage responds with unexpected value at path = Missing keys: [thumbnail, originalimage] https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[07:21:13] PROBLEM - restbase endpoints health on restbase2012 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is WARNING: Test Get summary from storage responds with unexpected value at path = Missing keys: [thumbnail, originalimage] https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[07:33:11] (PS2) Gehel: elasticsearch: fix package dependency error [puppet] - https://gerrit.wikimedia.org/r/753857 (https://phabricator.wikimedia.org/T299177) (owner: Bking)
[07:33:49] (CR) jerkins-bot: [V: -1] elasticsearch: fix package dependency error [puppet] - https://gerrit.wikimedia.org/r/753857 (https://phabricator.wikimedia.org/T299177) (owner: Bking)
[07:36:30] (PS3) Gehel: elasticsearch: fix package dependency error [puppet] - https://gerrit.wikimedia.org/r/753857 (https://phabricator.wikimedia.org/T299177) (owner: Bking)
[07:37:05] (CR) jerkins-bot: [V: -1] elasticsearch: fix package dependency error [puppet] - https://gerrit.wikimedia.org/r/753857 (https://phabricator.wikimedia.org/T299177) (owner: Bking)
[07:37:08] (PS3) KartikMistry: Deploy Flores MT [deployment-charts] - https://gerrit.wikimedia.org/r/751547 (https://phabricator.wikimedia.org/T298584)
[07:37:59] (Abandoned) Gehel: elasticsearch: fix package dependency error [puppet] - https://gerrit.wikimedia.org/r/753851 (https://phabricator.wikimedia.org/T299177) (owner: Bking)
[07:39:28] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host pc2012.codfw.wmnet with OS bullseye
[07:39:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:40:44] (PS2) Giuseppe Lavagetto: shellbox: remove useless files/stanzas [deployment-charts] - https://gerrit.wikimedia.org/r/753062
[07:41:07] (CR) jerkins-bot: [V: -1] shellbox: remove useless files/stanzas [deployment-charts] - https://gerrit.wikimedia.org/r/753062 (owner: Giuseppe Lavagetto)
[07:41:17] (CR) Giuseppe Lavagetto: [C: +2] safe-service-restart: make the default grace period 3 seconds [puppet] - https://gerrit.wikimedia.org/r/752600 (https://phabricator.wikimedia.org/T266055) (owner: Giuseppe Lavagetto)
[07:41:48] (PS4) Gehel: elasticsearch: fix package dependency error [puppet] - https://gerrit.wikimedia.org/r/753857 (https://phabricator.wikimedia.org/T299177) (owner: Bking)
[07:44:13] (PS1) Marostegui: Revert "pc2012: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/753836
[07:45:59] (CR) Ideophagous: arywiki NS (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/747973 (https://phabricator.wikimedia.org/T291737) (owner: Ideophagous)
[07:48:04] PROBLEM - restbase endpoints health on restbase2009 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is WARNING: Test Get summary from storage responds with unexpected value at path = Missing keys: [thumbnail, originalimage] https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[07:48:22] (CR) Gehel: [C: -1] "See minor dependency issue inline" [puppet] - https://gerrit.wikimedia.org/r/753857 (https://phabricator.wikimedia.org/T299177) (owner: Bking)
[07:53:49] (CR) Gehel: [C: -1] "Re-reading that patch (and looking at PCC failures): profile::java is already required from profile::elasticsearch (which makes sense). So" [puppet] - https://gerrit.wikimedia.org/r/753857 (https://phabricator.wikimedia.org/T299177) (owner: Bking)
[07:55:14] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb={CREATE,UPDATE} https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[08:00:02] RECOVERY - k8s API server requests latencies on kubestagemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[08:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220114T0800)
[08:00:42] PROBLEM - SSH on kubernetes1004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:03:06] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/definition/{title} (retrieve en-wiktionary definitions for cat) timed out before a response was received: /{domain}/v1/page/summary/{title} (Get summary for test page) is WARNING: Test Get summary for test page responds with unexpected value at path = Missing keys: [thumbnail, originalimage] https://wikitech.wikimedia.org/wiki/Mobileapps_%28service
[08:09:49] (CR) Ryan Kemper: [C: +1] icinga: add multiple case for Gehel in Icinga authorization [puppet] - https://gerrit.wikimedia.org/r/752130 (owner: Gehel)
[08:10:21] (PS2) Gehel: icinga: add multiple case for Gehel in Icinga authorization [puppet] - https://gerrit.wikimedia.org/r/752130
[08:11:37] (CR) Gehel: [C: +2] icinga: add multiple case for Gehel in Icinga authorization [puppet] - https://gerrit.wikimedia.org/r/752130 (owner: Gehel)
[08:13:08] PROBLEM - restbase endpoints health on restbase2012 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is WARNING: Test Get summary from storage responds with unexpected value at path = Missing keys: [thumbnail, originalimage] https://wikitech.wikimedia.org/wiki/Services/Monitor
[08:13:08] base
[08:13:56] PROBLEM - restbase endpoints health on restbase2017 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is WARNING: Test Get summary from storage responds with unexpected value at path = Missing keys: [originalimage, thumbnail] https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[08:19:44] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for Nick Ray - https://phabricator.wikimedia.org/T299186 (SCherukuwada) Manager approves.
[08:21:51] (CR) Marostegui: [C: +2] Revert "pc2012: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/753836 (owner: Marostegui)
[08:25:32] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[08:27:51] SRE, Infrastructure-Foundations, Mail: mx1001.wikimedia.org mail delivery timeouts - https://phabricator.wikimedia.org/T299107 (MoritzMuehlenhoff) >>! In T299107#7620550, @Platonides wrote: > @MoritzMuehlenhoff, did you see https://www.spinics.net/lists/stable/msg509296.html ? > Apparently upstream i...
[08:32:17] (PS1) Marostegui: pc2013: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/753912 (https://phabricator.wikimedia.org/T299046)
[08:32:54] (CR) Marostegui: [C: +2] pc2013: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/753912 (https://phabricator.wikimedia.org/T299046) (owner: Marostegui)
[08:33:41] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host pc2013.codfw.wmnet with OS bullseye
[08:33:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:34:45] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM cuminunpriv1001.eqiad.wmnet
[08:34:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:36:58] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM cuminunpriv1001.eqiad.wmnet
[08:37:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:39:39] (CR) David Caro: wmcs: move grid-dedicated code to its own package (1 comment) [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/753769 (owner: Arturo Borrero Gonzalez)
[08:48:20] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={swagger_check_citoid_cluster_eqiad,trafficserver-upload,varnish-upload} site={drmrs,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:50:50] PROBLEM - restbase endpoints health on restbase2020 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is WARNING: Test Get summary from storage responds with unexpected value at path = Missing keys: [thumbnail, originalimage] https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[08:50:57] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM an-tool1005.eqiad.wmnet
[08:50:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:51:08] PROBLEM - Check systemd state on cp6002 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-varnish-exporter.service,varnishncsa.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:51:17] !log rebooting an-tool1007 (running turnilo.wikimedia.org)
[08:51:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:52:26] PROBLEM - restbase endpoints health on restbase2009 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is WARNING: Test Get summary from storage responds with unexpected value at path = Missing keys: [thumbnail, originalimage] https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[08:53:10] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM an-tool1005.eqiad.wmnet
[08:53:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:54:22] PROBLEM - restbase endpoints health on restbase2024 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is WARNING: Test Get summary from storage responds with unexpected value at path = Missing keys: [thumbnail, originalimage] https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[08:55:03] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM an-tool1007.eqiad.wmnet
[08:55:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:57:22] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM an-tool1007.eqiad.wmnet
[08:57:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:58:32] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM an-tool1008.eqiad.wmnet
[08:58:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:58:51] !log rebooting an-tool1008 (running yarn.wikimedia.org)
[08:58:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:59:28] PROBLEM - Check systemd state on an-tool1005 is CRITICAL: CRITICAL - degraded: The following units failed: ifup@ens5.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:00:44] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM an-tool1008.eqiad.wmnet
[09:00:44] RECOVERY - Check systemd state on an-tool1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:00:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:00:57] !log systemctl reset-failed ifup@ens5.service on an-tool1005 T273026
[09:00:59] (PS5) Elukey: kafka: add check to test the Broker's TLS port [puppet] - https://gerrit.wikimedia.org/r/753738
[09:00:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:01:00] T273026: Errors for ifup@ens5.service after rebooting Ganeti VMs - https://phabricator.wikimedia.org/T273026
[09:01:15] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM an-tool1009.eqiad.wmnet
[09:01:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:01:26] !log rebooting an-tool1009 (running hue.wikimedia.org)
[09:01:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:03:31] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM an-tool1009.eqiad.wmnet
[09:03:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:04:06] PROBLEM - Host cp6002 is DOWN: PING CRITICAL - Packet loss = 100%
[09:05:06] RECOVERY - Host cp6002 is UP: PING OK - Packet loss = 0%, RTA = 86.10 ms
[09:05:15] (PS6) Elukey: kafka: add check to test the Broker's TLS port [puppet] - https://gerrit.wikimedia.org/r/753738
[09:05:31] PROBLEM - restbase endpoints health on restbase2020 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is WARNING: Test Get summary from storage responds with unexpected value at path = Missing keys: [originalimage, thumbnail] https://wikitech.wikimedia.org/wiki/Services/Monitor
[09:05:31] base
[09:05:41] SRE, Infrastructure-Foundations: Migrate eqiad Ganeti cluster to KVM machine type pc-i440fx-2.8 - https://phabricator.wikimedia.org/T294120 (MoritzMuehlenhoff)
[09:05:52] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host pc2013.codfw.wmnet with OS bullseye
[09:05:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:05:58] PROBLEM - restbase endpoints health on restbase2015 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is WARNING: Test Get summary from storage responds with unexpected value at path = Missing keys: [thumbnail, originalimage] https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[09:06:00] RECOVERY - Check systemd state on cp6002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:06:30] PROBLEM - restbase endpoints health on restbase2009 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is WARNING: Test Get summary from storage responds with unexpected value at path = Missing keys: [thumbnail, originalimage] https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[09:06:36] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 232, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[09:09:10] PROBLEM - restbase endpoints health on restbase2015 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is WARNING: Test Get summary from storage responds with unexpected value at path = Missing keys: [thumbnail, originalimage] https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[09:09:17] (PS1) Marostegui: Revert "pc2013: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/753837
[09:09:26] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 44, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[09:10:36] (PS7) Elukey: kafka: add check to test the Broker's TLS port [puppet] - https://gerrit.wikimedia.org/r/753738
[09:11:17] !log Move pc1014 from pc1 to pc2 T299046
[09:11:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:11:21] T299046: Upgrade parsercache infra to Bullseye - https://phabricator.wikimedia.org/T299046
[09:11:35] (CR) jerkins-bot: [V: -1] kafka: add check to test the Broker's TLS port [puppet] - https://gerrit.wikimedia.org/r/753738 (owner: Elukey)
[09:19:05] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM an-test-client1001.eqiad.wmnet
[09:19:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:19:10] (CR) Marostegui: [C: +2] Revert "pc2013: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/753837 (owner: Marostegui)
[09:21:46] SRE, LDAP-Access-Requests: Grant Access to ldap/wmf for Marco_Fossati - https://phabricator.wikimedia.org/T298766 (mfossati) Thanks for your comments @Dzahn , very useful! > re: grafana.wikimedia.org - this should not actually need a login but when you click on "sign in" in the lower left corner, you sh...
[09:21:58] (PS1) DCausse: rdf-streaming-updater: add support for WCQS [alerts] - https://gerrit.wikimedia.org/r/753915
[09:22:09] PROBLEM - Check systemd state on cp6010 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-varnish-exporter.service,varnishncsa.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:22:50] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM an-test-client1001.eqiad.wmnet
[09:22:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:26:43] PROBLEM - Host cp6010 is DOWN: PING CRITICAL - Packet loss = 100%
[09:27:13] (PS8) Elukey: kafka: add check to test the Broker's TLS port [puppet] - https://gerrit.wikimedia.org/r/753738
[09:28:06] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM install1003.wikimedia.org
[09:28:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:29:56] RECOVERY - Check systemd state on cp6010 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:29:59] RECOVERY - Host cp6010 is UP: PING OK - Packet loss = 0%, RTA = 86.14 ms
[09:32:21] (PS9) Elukey: kafka: add check to test the Broker's TLS port [puppet] - https://gerrit.wikimedia.org/r/753738
[09:32:27] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM install1003.wikimedia.org
[09:32:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:35:09] SRE, Infrastructure-Foundations: Migrate eqiad Ganeti cluster to KVM machine type pc-i440fx-2.8 - https://phabricator.wikimedia.org/T294120 (MoritzMuehlenhoff)
[09:35:48] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM apt1001.wikimedia.org
[09:35:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:37:11] (PS10) Elukey: kafka: add check to test the Broker's TLS port [puppet] - https://gerrit.wikimedia.org/r/753738
[09:38:01] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM apt1001.wikimedia.org
[09:38:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:38:30] (PS11) Elukey: kafka: add check to test the Broker's TLS port [puppet] - https://gerrit.wikimedia.org/r/753738
[09:41:32] (PS1) Vgutierrez: envoyproxy: Add stream_idle/request/request_headers timeout [puppet] - https://gerrit.wikimedia.org/r/753919 (https://phabricator.wikimedia.org/T271421)
[09:41:38] (PS12) Elukey: kafka: add check to test the Broker's TLS port [puppet] - https://gerrit.wikimedia.org/r/753738
[09:42:56] (PS13) Elukey: kafka: add check to test the Broker's TLS port [puppet] - https://gerrit.wikimedia.org/r/753738
[09:45:09] (PS14) Elukey: kafka: add check to test the Broker's TLS port [puppet] - https://gerrit.wikimedia.org/r/753738
[09:45:25] PROBLEM - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is CRITICAL: /api/rest_v1/page/summary/{title} (Get summary from storage) is WARNING: Test Get summary from storage responds with unexpected value at path = Missing keys: [thumbnail, originalimage] https://wikitech.wikimedia.org/wiki/RESTBase
[09:45:51] (CR) Elukey: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33245/console" [puppet] - https://gerrit.wikimedia.org/r/753738 (owner: Elukey)
[09:46:46] (CR) Elukey: [V: +1] kafka: add check to test the Broker's TLS port (1 comment) [puppet] - https://gerrit.wikimedia.org/r/753738 (owner: Elukey)
[09:47:53] (CR) ZPapierski: [C: +1] rdf-streaming-updater: add support for WCQS [alerts] - https://gerrit.wikimedia.org/r/753915 (owner: DCausse)
[09:49:59] PROBLEM - restbase endpoints health on restbase2014 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is WARNING: Test Get summary from storage responds with unexpected value at path = Missing keys: [thumbnail, originalimage] https://wikitech.wikimedia.org/wiki/Services/Monitor
[09:49:59] base
[09:53:11] (PS1) Arturo Borrero Gonzalez: wmcs: toolforge: grid: depool_remove: fix internal hostname usage [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/753921
[09:54:05] (CR) Vgutierrez: [V: +1] "PCC SUCCESS (NOOP 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33246/console" [puppet] - https://gerrit.wikimedia.org/r/753919 (https://phabricator.wikimedia.org/T271421) (owner: Vgutierrez)
[09:55:44] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM an-test-druid1001.eqiad.wmnet
[09:55:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:57:24] PROBLEM - Host cp6003 is DOWN: PING CRITICAL - Packet loss = 100%
[09:58:50] RECOVERY - Host cp6003 is UP: PING OK - Packet loss = 0%, RTA = 86.15 ms
[09:59:30] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM an-test-druid1001.eqiad.wmnet
[09:59:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:01:02] SRE, Infrastructure-Foundations: Migrate eqiad Ganeti cluster to KVM machine type pc-i440fx-2.8 -
https://phabricator.wikimedia.org/T294120 (10MoritzMuehlenhoff) [10:04:58] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM matomo1002.eqiad.wmnet [10:05:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:05:13] !log rebooting matomo1002 (running piwik.wikimedia.org) [10:05:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:06:48] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:07:13] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM matomo1002.eqiad.wmnet [10:07:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:10:42] PROBLEM - restbase endpoints health on restbase2009 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is WARNING: Test Get summary from storage responds with unexpected value at path = Missing keys: [thumbnail, originalimage] https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:14:27] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] "Will talk later to Volans." 
[cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/753769 (owner: 10Arturo Borrero Gonzalez) [10:15:58] PROBLEM - SSH on restbase2011.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:17:21] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM an-test-presto1001.eqiad.wmnet [10:17:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:17:49] (03Merged) 10jenkins-bot: wmcs: move grid-dedicated code to its own package [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/753769 (owner: 10Arturo Borrero Gonzalez) [10:19:00] (03CR) 10DCausse: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/752737 (owner: 10Ebernhardson) [10:21:44] PROBLEM - Check systemd state on cp6011 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-varnish-exporter.service,varnishncsa.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:21:57] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM an-test-presto1001.eqiad.wmnet [10:21:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:26:18] (03PS3) 10Muehlenhoff: Make build2001 a build host [puppet] - 10https://gerrit.wikimedia.org/r/751146 [10:29:48] PROBLEM - Host cp6011 is DOWN: PING CRITICAL - Packet loss = 100% [10:30:38] RECOVERY - Host cp6011 is UP: PING OK - Packet loss = 0%, RTA = 86.09 ms [10:30:48] RECOVERY - Check systemd state on cp6011 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:31:56] PROBLEM - purged service on cp6011 is CRITICAL: CRITICAL - Expecting active but unit purged is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [10:33:26] RECOVERY - purged service on cp6011 is OK: OK - purged is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state 
[10:38:43] (03PS1) 10Jbond: heiradata - cloud: update email address [puppet] - 10https://gerrit.wikimedia.org/r/753925 [10:40:12] (03CR) 10Jbond: [C: 03+2] heiradata - cloud: update email address [puppet] - 10https://gerrit.wikimedia.org/r/753925 (owner: 10Jbond) [10:42:30] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM an-test-ui1001.eqiad.wmnet [10:42:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:42:35] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmf for Marco_Fossati - https://phabricator.wikimedia.org/T298766 (10cmooney) 05In progress→03Resolved Thanks @mfossati for the feedback, and indeed @Dzahn for the detailed info, appreciate it. [10:43:05] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/753919 (https://phabricator.wikimedia.org/T271421) (owner: 10Vgutierrez) [10:43:54] PROBLEM - SSH on mw2254.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:44:26] PROBLEM - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is CRITICAL: /api/rest_v1/page/summary/{title} (Get summary from storage) is WARNING: Test Get summary from storage responds with unexpected value at path = Missing keys: [thumbnail, originalimage] https://wikitech.wikimedia.org/wiki/RESTBase [10:47:27] (03PS2) 10DCausse: wcqs: Deploy streaming updater [puppet] - 10https://gerrit.wikimedia.org/r/752737 (owner: 10Ebernhardson) [10:50:15] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM an-test-ui1001.eqiad.wmnet [10:50:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:50:52] !log systemctl reset-failed ifup@ens5.service on an-test-ui1001 T273026 [10:50:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:50:55] T273026: Errors for ifup@ens5.service after rebooting Ganeti VMs - 
https://phabricator.wikimedia.org/T273026 [10:51:37] (03CR) 10DCausse: [C: 03+1] cirrussearch: Reenable saneitizer [puppet] - 10https://gerrit.wikimedia.org/r/752724 (https://phabricator.wikimedia.org/T295705) (owner: 10Ebernhardson) [10:51:55] (03PS11) 10Jbond: O:rpkivalidator: add bgpalerter to rpki servers [puppet] - 10https://gerrit.wikimedia.org/r/753450 (https://phabricator.wikimedia.org/T230600) [10:52:04] 10SRE, 10Infrastructure-Foundations: Migrate eqiad Ganeti cluster to KVM machine type pc-i440fx-2.8 - https://phabricator.wikimedia.org/T294120 (10MoritzMuehlenhoff) [10:53:27] 10SRE, 10SRE-Access-Requests: Requesting access to Superset for Margeigh Novotny - https://phabricator.wikimedia.org/T299072 (10cmooney) @MNovotny_WMF apologies for the delay processing this. Checking your existing access I believe you should already be able to log in to Superset, is that correct? There are... [10:54:43] PROBLEM - Host cp6004 is DOWN: PING CRITICAL - Packet loss = 100% [10:55:03] !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host elastic2051.codfw.wmnet with OS stretch [10:55:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:55:11] 10SRE, 10ops-codfw, 10Discovery-Search, 10Patch-For-Review: Degraded RAID on elastic2051 - https://phabricator.wikimedia.org/T298674 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host elastic2051.codfw.wmnet with OS stretch executed with errors: - elastic2051 (*... 
[10:55:39] RECOVERY - Host cp6004 is UP: PING OK - Packet loss = 0%, RTA = 86.22 ms [10:55:40] (03PS1) 10Jbond: ihieradata - bgpalerter: update email group [puppet] - 10https://gerrit.wikimedia.org/r/753927 [10:55:54] (03CR) 10Jbond: [V: 03+2 C: 03+2] ihieradata - bgpalerter: update email group [puppet] - 10https://gerrit.wikimedia.org/r/753927 (owner: 10Jbond) [10:56:02] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM archiva1002.wikimedia.org [10:56:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:56:17] !log rebooting archiva1002 (running archiva.wikimedia.org) [10:56:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:56:44] (03CR) 10DCausse: [C: 03+1] wcqs: Deploy streaming updater [puppet] - 10https://gerrit.wikimedia.org/r/752737 (owner: 10Ebernhardson) [10:57:38] (03CR) 10Kormat: [C: 03+1] "Good call re: linux-swap. LGTM." [puppet] - 10https://gerrit.wikimedia.org/r/753781 (https://phabricator.wikimedia.org/T295375) (owner: 10Hnowlan) [10:59:36] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Nick Ray - https://phabricator.wikimedia.org/T299186 (10cmooney) Have asked user to send me SSH key out of band to verify. 
[11:00:39] !log systemctl reset-failed ifup@ens5.service on archiva1002 T273026 [11:00:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:00:43] T273026: Errors for ifup@ens5.service after rebooting Ganeti VMs - https://phabricator.wikimedia.org/T273026 [11:01:17] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM archiva1002.wikimedia.org [11:01:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:03:54] (03CR) 10Giuseppe Lavagetto: [V: 03+1] envoy: make the choice of api version explicit (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/751717 (owner: 10Giuseppe Lavagetto) [11:04:03] PROBLEM - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is CRITICAL: /api/rest_v1/page/summary/{title} (Get summary from storage) is WARNING: Test Get summary from storage responds with unexpected value at path = Missing keys: [thumbnail, originalimage] https://wikitech.wikimedia.org/wiki/RESTBase [11:06:49] (03CR) 10Vgutierrez: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/751717 (owner: 10Giuseppe Lavagetto) [11:07:03] (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] envoyproxy: Add stream_idle/request/request_headers timeout [puppet] - 10https://gerrit.wikimedia.org/r/753919 (https://phabricator.wikimedia.org/T271421) (owner: 10Vgutierrez) [11:14:52] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Cleanup: remove the extract method, now unused. [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/699716 (owner: 10Giuseppe Lavagetto) [11:15:42] 10SRE, 10Infrastructure-Foundations: Migrate eqiad Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296721 (10MoritzMuehlenhoff) [11:16:45] (03Merged) 10jenkins-bot: Cleanup: remove the extract method, now unused. 
[docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/699716 (owner: 10Giuseppe Lavagetto) [11:16:57] RECOVERY - SSH on restbase2011.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:18:41] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti1023.eqiad.wmnet with OS buster [11:18:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:18:47] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install ganeti102[34] - https://phabricator.wikimedia.org/T283036 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti1023.eqiad.wmnet with OS buster [11:21:45] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Image module refactoring (step 1) [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/699717 (owner: 10Giuseppe Lavagetto) [11:23:49] PROBLEM - restbase endpoints health on restbase2014 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is WARNING: Test Get summary from storage responds with unexpected value at path = Missing keys: [originalimage, thumbnail] https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:27:08] (03PS1) 10Kormat: wmfdb/mycnf: Add CnfSelector [software/wmfdb] - 10https://gerrit.wikimedia.org/r/753931 [11:27:27] 10SRE, 10SRE-Access-Requests: Requesting access to Data Engineering team resources for Sandra Ebele Nwachukwu - https://phabricator.wikimedia.org/T298786 (10cmooney) 05Open→03Resolved Sandra confirmed access working over Slack. Please re-open if there are any problems, I will resolve this now. Thanks. 
[11:27:43] (03CR) 10Hnowlan: [C: 03+2] partman: remove reuse-test from restbase2009, use linux-swap [puppet] - 10https://gerrit.wikimedia.org/r/753781 (https://phabricator.wikimedia.org/T295375) (owner: 10Hnowlan) [11:29:11] PROBLEM - restbase endpoints health on restbase2019 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is WARNING: Test Get summary from storage responds with unexpected value at path = Missing keys: [thumbnail, originalimage] https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:29:21] PROBLEM - Restbase edge codfw on text-lb.codfw.wikimedia.org is CRITICAL: /api/rest_v1/page/summary/{title} (Get summary from storage) is WARNING: Test Get summary from storage responds with unexpected value at path = Missing keys: [thumbnail, originalimage] https://wikitech.wikimedia.org/wiki/RESTBase [11:32:29] (03PS2) 10Kormat: wmfdb/mycnf: Add CnfSelector [software/wmfdb] - 10https://gerrit.wikimedia.org/r/753931 [11:32:33] PROBLEM - Check systemd state on cp6012 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-varnish-exporter.service,varnishncsa.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:33:04] (03Merged) 10jenkins-bot: Image module refactoring (step 1) [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/699717 (owner: 10Giuseppe Lavagetto) [11:35:51] PROBLEM - restbase endpoints health on restbase2017 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is WARNING: Test Get summary from storage responds with unexpected value at path = Missing keys: [originalimage, thumbnail] https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:37:03] PROBLEM - Host cp6012 is DOWN: PING CRITICAL - Packet loss = 100% [11:37:31] RECOVERY - Check systemd state on cp6012 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:37:33] RECOVERY - Host 
cp6012 is UP: PING OK - Packet loss = 0%, RTA = 86.16 ms [11:38:13] PROBLEM - restbase endpoints health on restbase2012 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is WARNING: Test Get summary from storage responds with unexpected value at path = Missing keys: [originalimage, thumbnail] https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:40:31] PROBLEM - restbase endpoints health on restbase2012 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is WARNING: Test Get summary from storage responds with unexpected value at path = Missing keys: [originalimage, thumbnail] https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:42:07] (03PS3) 10Giuseppe Lavagetto: Slim down DockerDriver [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/699718 [11:42:41] PROBLEM - restbase endpoints health on restbase2017 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is WARNING: Test Get summary from storage responds with unexpected value at path = Missing keys: [thumbnail, originalimage] https://wikitech.wikimedia.org/wiki/Services/Monitor [11:42:41] base [11:45:21] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti1023.eqiad.wmnet with OS buster [11:45:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:45:25] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install ganeti102[34] - https://phabricator.wikimedia.org/T283036 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti1023.eqiad.wmnet with OS buster completed: - ganeti1023 (**PASS**) - Downtimed on Ici... 
[11:45:32] looking at restbase issues, seeing timeouts to some services (mathoid and parsoid so far) [11:46:25] PROBLEM - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is CRITICAL: /api/rest_v1/page/summary/{title} (Get summary from storage) is WARNING: Test Get summary from storage responds with unexpected value at path = Missing keys: [thumbnail, originalimage] https://wikitech.wikimedia.org/wiki/RESTBase [11:47:25] (03PS1) 10Arturo Borrero Gonzalez: wmcs: toolforge: grid: improve get_node_info() error reporting [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/753935 [11:47:27] (03PS1) 10Arturo Borrero Gonzalez: wmcs: toolforge: grid: add cookbook to query grid node information [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/753936 [11:48:05] (03PS1) 10Vgutierrez: cache::envoy: Set stream_idle/request/request_headers timeout [puppet] - 10https://gerrit.wikimedia.org/r/753937 (https://phabricator.wikimedia.org/T271421) [11:48:57] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on restbase2009.codfw.wmnet with reason: not in restbase cluster, used for testing [11:48:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:48:59] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on restbase2009.codfw.wmnet with reason: not in restbase cluster, used for testing [11:49:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:51:01] (03PS2) 10Arturo Borrero Gonzalez: wmcs: toolforge: grid: depool_remove: fix internal hostname usage [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/753921 [11:51:37] !log hnowlan@cumin1001 START - Cookbook sre.hosts.reimage for host restbase2009.codfw.wmnet with OS buster [11:51:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:53:03] (03PS3) 10Arturo Borrero Gonzalez: wmcs: toolforge: grid: depool_remove: fix internal hostname usage [cookbooks] (wmcs) - 
10https://gerrit.wikimedia.org/r/753921 [11:54:16] (03PS2) 10Vgutierrez: cache::envoy: Set stream_idle/request/request_headers timeout [puppet] - 10https://gerrit.wikimedia.org/r/753937 (https://phabricator.wikimedia.org/T271421) [11:54:48] (03PS4) 10Arturo Borrero Gonzalez: wmcs: toolforge: grid: depool_remove: fix internal hostname usage [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/753921 [11:55:04] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33250/console" [puppet] - 10https://gerrit.wikimedia.org/r/753937 (https://phabricator.wikimedia.org/T271421) (owner: 10Vgutierrez) [11:55:48] PROBLEM - Ensure traffic_exporter for the backend instance binds on port 9122 and responds to HTTP requests on cp6005 is CRITICAL: connect to address 10.136.0.8 and port 9122: Connection refused https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [11:56:36] mmandere: ^^ please make sure that cp6 hosts are properly downtimed on icinga :) [11:57:42] vgutierrez: got it [11:58:10] RECOVERY - Ensure traffic_exporter for the backend instance binds on port 9122 and responds to HTTP requests on cp6005 is OK: HTTP OK: HTTP/1.0 200 OK - 25331 bytes in 0.264 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [12:02:39] (03PS2) 10Arturo Borrero Gonzalez: wmcs: toolforge: grid: improve get_node_info() error reporting [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/753935 [12:02:41] (03PS2) 10Arturo Borrero Gonzalez: wmcs: toolforge: grid: add cookbook to query grid node information [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/753936 [12:02:43] (03PS5) 10Arturo Borrero Gonzalez: wmcs: toolforge: grid: depool_remove: fix internal hostname usage [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/753921 [12:04:28] PROBLEM - restbase endpoints health on restbase2017 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is WARNING: 
Test Get summary from storage responds with unexpected value at path = Missing keys: [thumbnail, originalimage] https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:09:07] (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] cache::envoy: Set stream_idle/request/request_headers timeout [puppet] - 10https://gerrit.wikimedia.org/r/753937 (https://phabricator.wikimedia.org/T271421) (owner: 10Vgutierrez) [12:10:38] PROBLEM - Host cp6005 is DOWN: PING CRITICAL - Packet loss = 100% [12:12:24] RECOVERY - Host cp6005 is UP: PING OK - Packet loss = 0%, RTA = 86.09 ms [12:16:53] (03PS6) 10Arturo Borrero Gonzalez: wmcs: toolforge: grid: depool_remove: fix internal hostname usage [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/753921 [12:18:53] !log hnowlan@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host restbase2009.codfw.wmnet with OS buster [12:18:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:20:08] !log hnowlan@cumin1001 START - Cookbook sre.hosts.reimage for host restbase2009.codfw.wmnet with OS buster [12:20:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:21:04] PROBLEM - restbase endpoints health on restbase2017 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is WARNING: Test Get summary from storage responds with unexpected value at path = Missing keys: [originalimage, thumbnail] https://wikitech.wikimedia.org/wiki/Services/Monitor [12:21:05] base [12:22:12] PROBLEM - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is CRITICAL: /api/rest_v1/page/summary/{title} (Get summary from storage) is WARNING: Test Get summary from storage responds with unexpected value at path = Missing keys: [thumbnail, originalimage] https://wikitech.wikimedia.org/wiki/RESTBase [12:22:55] !log jmm@cumin2002 START - Cookbook 
sre.hosts.reimage for host ganeti1024.eqiad.wmnet with OS buster [12:22:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:23:01] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install ganeti102[34] - https://phabricator.wikimedia.org/T283036 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti1024.eqiad.wmnet with OS buster [12:25:52] PROBLEM - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp6013 is CRITICAL: connect to address 10.136.0.12 and port 3128: Connection refused https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [12:25:52] PROBLEM - Varnish HTTP text-frontend - port 3120 on cp6013 is CRITICAL: connect to address 10.136.0.12 and port 3120: Connection refused https://wikitech.wikimedia.org/wiki/Varnish [12:25:52] PROBLEM - ats-tls HTTPS wikiworkshop.org ECDSA on cp6013 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS [12:28:08] RECOVERY - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp6013 is OK: HTTP OK: HTTP/1.1 200 Ok - 33593 bytes in 0.273 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [12:28:08] RECOVERY - ats-tls HTTPS wikiworkshop.org ECDSA on cp6013 is OK: SSL OK - OCSP staple validity for wikiworkshop.org has 189111 seconds left:Certificate wikiworkshop.org contains all required SANs:Certificate wikiworkshop.org (ECDSA) valid until 2022-04-04 16:55:58 +0000 (expires in 80 days) https://wikitech.wikimedia.org/wiki/HTTPS [12:29:08] RECOVERY - Varnish HTTP text-frontend - port 3120 on cp6013 is OK: HTTP OK: HTTP/1.1 200 OK - 470 bytes in 0.173 second response time https://wikitech.wikimedia.org/wiki/Varnish [12:34:07] (03PS1) 10Hnowlan: restbase: remove restbase2009 [puppet] - 10https://gerrit.wikimedia.org/r/753942 (https://phabricator.wikimedia.org/T295375) [12:37:50] (03CR) 10David Caro: [C: 
03+1] wmcs: toolforge: grid: improve get_node_info() error reporting (031 comment) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/753935 (owner: 10Arturo Borrero Gonzalez) [12:40:58] PROBLEM - Host cp6013 is DOWN: PING CRITICAL - Packet loss = 100% [12:41:32] RECOVERY - Host cp6013 is UP: PING OK - Packet loss = 0%, RTA = 86.12 ms [12:41:32] PROBLEM - SSH on restbase2010.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:43:10] PROBLEM - restbase endpoints health on restbase2019 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is WARNING: Test Get summary from storage responds with unexpected value at path = Missing keys: [originalimage, thumbnail] https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:44:26] RECOVERY - SSH on mw2254.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:49:14] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti1024.eqiad.wmnet with OS buster [12:49:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:49:19] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install ganeti102[34] - https://phabricator.wikimedia.org/T283036 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti1024.eqiad.wmnet with OS buster completed: - ganeti1024 (**PASS**) - Downtimed on Ici... 
[12:51:12] !log hnowlan@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host restbase2009.codfw.wmnet with OS buster [12:51:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:53:03] !log hnowlan@cumin1001 START - Cookbook sre.hosts.reimage for host restbase2009.codfw.wmnet with OS buster [12:53:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:58:42] (03PS1) 10Marostegui: pc2011: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/753943 (https://phabricator.wikimedia.org/T299046) [12:59:29] (03CR) 10Marostegui: [C: 03+2] pc2011: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/753943 (https://phabricator.wikimedia.org/T299046) (owner: 10Marostegui) [12:59:45] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host pc2011.codfw.wmnet with OS bullseye [12:59:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:06:06] RECOVERY - restbase endpoints health on restbase2012 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:06:18] RECOVERY - restbase endpoints health on restbase2017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:06:20] RECOVERY - restbase endpoints health on restbase2020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:06:42] RECOVERY - restbase endpoints health on restbase2015 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:06:48] RECOVERY - restbase endpoints health on restbase2024 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:06:56] RECOVERY - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [13:07:00] RECOVERY - restbase endpoints health on 
restbase2014 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:07:58] RECOVERY - restbase endpoints health on restbase2023 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:07:58] RECOVERY - Restbase edge codfw on text-lb.codfw.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [13:10:09] (03CR) 10Muehlenhoff: [C: 03+2] Make build2001 a build host [puppet] - 10https://gerrit.wikimedia.org/r/751146 (owner: 10Muehlenhoff) [13:18:08] RECOVERY - restbase endpoints health on restbase2019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:20:09] (03PS3) 10Gehel: wcqs: Deploy streaming updater [puppet] - 10https://gerrit.wikimedia.org/r/752737 (owner: 10Ebernhardson) [13:20:15] !log hnowlan@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host restbase2009.codfw.wmnet with OS buster [13:20:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:22:04] (03CR) 10Gehel: [C: 03+2] wcqs: Deploy streaming updater [puppet] - 10https://gerrit.wikimedia.org/r/752737 (owner: 10Ebernhardson) [13:25:20] PROBLEM - package builder rsync on build2001 is CRITICAL: connect to address 10.192.32.77 and port 873: Connection refused https://wikitech.wikimedia.org/wiki/Debian_Packaging%23Upload_to_Wikimedia_Repo [13:26:25] RECOVERY - package builder rsync on build2001 is OK: TCP OK - 0.033 second response time on 10.192.32.77 port 873 https://wikitech.wikimedia.org/wiki/Debian_Packaging%23Upload_to_Wikimedia_Repo [13:31:59] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host pc2011.codfw.wmnet with OS bullseye [13:32:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:37:45] (03PS1) 10Gehel: query_service: make journal configureable for streaming updater [puppet] - 
10https://gerrit.wikimedia.org/r/753945 [13:39:27] (03CR) 10jerkins-bot: [V: 04-1] query_service: make journal configureable for streaming updater [puppet] - 10https://gerrit.wikimedia.org/r/753945 (owner: 10Gehel) [13:40:42] <_joe_> sigh the pages acknowledgement expired [13:40:48] I re-acked [13:40:53] ack [13:40:57] (03PS2) 10Gehel: query_service: make journal configureable for streaming updater [puppet] - 10https://gerrit.wikimedia.org/r/753945 [13:40:58] And I'll re-ask, can we resolve these? [13:41:06] (03PS1) 10BBlack: varnish: Remove outdated cluster scale conditional [puppet] - 10https://gerrit.wikimedia.org/r/753966 [13:41:20] bblack: ^ [13:41:51] I guess I can try, worst case they'll fire again ;) [13:42:14] RECOVERY - SSH on restbase2010.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:42:35] can they not be fully-resolved? [13:43:00] I just did [13:43:01] the icinga alert that triggered the page is now downtimed for a month, and we don't expect it to succeed/recover before then [13:43:13] All good then [13:43:59] (03PS3) 10Gehel: query_service: make journal configureable for streaming updater [puppet] - 10https://gerrit.wikimedia.org/r/753945 [13:50:59] (03PS3) 10Kormat: wmfdb/mycnf: Add CnfSelector [software/wmfdb] - 10https://gerrit.wikimedia.org/r/753931 [13:51:01] (03PS1) 10Kormat: wmfdb/mycnf: Set unix_socket OR port, not both. 
[software/wmfdb] - 10https://gerrit.wikimedia.org/r/753967 [13:53:02] (03CR) 10DCausse: [C: 03+1] query_service: make journal configureable for streaming updater [puppet] - 10https://gerrit.wikimedia.org/r/753945 (owner: 10Gehel) [13:53:23] (03CR) 10Gehel: [C: 03+2] query_service: make journal configureable for streaming updater [puppet] - 10https://gerrit.wikimedia.org/r/753945 (owner: 10Gehel) [14:16:16] 10SRE-swift-storage, 10Commons, 10Internet-Archive: Error: 503, Backend fetch failed, while the file uploaded fine - https://phabricator.wikimedia.org/T299220 (10Aklapper) [14:16:19] (03PS1) 10Muehlenhoff: Actually switch build2001 to bullseye [puppet] - 10https://gerrit.wikimedia.org/r/753968 [14:18:36] (03CR) 10Muehlenhoff: [C: 03+2] Actually switch build2001 to bullseye [puppet] - 10https://gerrit.wikimedia.org/r/753968 (owner: 10Muehlenhoff) [14:36:54] 10SRE, 10Infrastructure-Foundations: Migrate eqiad Ganeti cluster to KVM machine type pc-i440fx-2.8 - https://phabricator.wikimedia.org/T294120 (10MoritzMuehlenhoff) >>! In T294120#7618187, @Platonides wrote: > Wouldn't setting kvm:machine_version=pc-i440fx-2.8 as a [global parameter](https://docs.ganeti.org/d... 
[14:37:33] 10SRE, 10Infrastructure-Foundations: Migrate eqiad Ganeti cluster to KVM machine type pc-i440fx-2.8 - https://phabricator.wikimedia.org/T294120 (10MoritzMuehlenhoff) [14:43:47] (03PS1) 104nn1l2: fawiki: Add flow-delete right to eliminators [mediawiki-config] - 10https://gerrit.wikimedia.org/r/753969 (https://phabricator.wikimedia.org/T299223) [14:44:50] 10SRE, 10Discovery: Ban elastic2035 from prod elastic clusters - https://phabricator.wikimedia.org/T299151 (10bking) 05Resolved→03In progress [14:53:51] (03PS1) 10Muehlenhoff: Add component/cassandradev for stretch and buster [puppet] - 10https://gerrit.wikimedia.org/r/753971 (https://phabricator.wikimedia.org/T298805) [14:55:43] (03CR) 10Klausman: [C: 03+1] "One very minor nit, other than that LGTM" [software/wmfdb] - 10https://gerrit.wikimedia.org/r/753931 (owner: 10Kormat) [14:57:43] (03CR) 10Klausman: [C: 03+1] "One clarification question, otherwise LGTM." [software/wmfdb] - 10https://gerrit.wikimedia.org/r/753967 (owner: 10Kormat) [14:58:28] (03PS12) 10David Caro: Added nfs/migrate_service.py [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/753612 (https://phabricator.wikimedia.org/T293800) (owner: 10Andrew Bogott) [14:59:33] (03CR) 10Muehlenhoff: [C: 03+2] Add component/cassandradev for stretch and buster [puppet] - 10https://gerrit.wikimedia.org/r/753971 (https://phabricator.wikimedia.org/T298805) (owner: 10Muehlenhoff) [15:00:17] !log silenced site=drmrs in alertmanager, I think [15:00:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:27] !log silenced site=drmrs in alertmanager for one month, I think [15:00:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:52] 10SRE, 10Data-Engineering, 10Generated Data Platform, 10Platform Engineering, 10Patch-For-Review: Import Debian package of Cassandra 3.11.11 as 'dev' version - https://phabricator.wikimedia.org/T298805 (10MoritzMuehlenhoff) I added component/cassandradev for buster 
and stretch. For the import we can eith... [15:03:44] RECOVERY - SSH on kubernetes1004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:06:28] (03CR) 10Kormat: wmfdb/mycnf: Set unix_socket OR port, not both. (031 comment) [software/wmfdb] - 10https://gerrit.wikimedia.org/r/753967 (owner: 10Kormat) [15:12:21] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Slim down DockerDriver [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/699718 (owner: 10Giuseppe Lavagetto) [15:14:02] (03Merged) 10jenkins-bot: Slim down DockerDriver [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/699718 (owner: 10Giuseppe Lavagetto) [15:14:18] (03CR) 10Klausman: [C: 03+1] wmfdb/mycnf: Set unix_socket OR port, not both. (031 comment) [software/wmfdb] - 10https://gerrit.wikimedia.org/r/753967 (owner: 10Kormat) [15:17:38] (03CR) 10Kormat: wmfdb/mycnf: Add CnfSelector (031 comment) [software/wmfdb] - 10https://gerrit.wikimedia.org/r/753931 (owner: 10Kormat) [15:18:41] (03PS1) 10DCausse: wcqs: set QUERY_SERVICE env name with wcqs/wdqs [puppet] - 10https://gerrit.wikimedia.org/r/753973 [15:23:19] we're shortly going to puppetize lvs6001 in drmrs [15:23:34] (03CR) 10ZPapierski: [C: 03+1] wcqs: set QUERY_SERVICE env name with wcqs/wdqs [puppet] - 10https://gerrit.wikimedia.org/r/753973 (owner: 10DCausse) [15:23:46] this will likely define/cause some kind of false alarms related to drmrs and/or lvs6001 pybal, etc until we find them and get them downtimed/silenced whatever [15:28:01] (CirrusSearchJVMGCOldPoolFlatlined) firing: (2) Elasticsearch instance elastic2051-production-search-codfw is showing memory pressure in the old pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://alerts.wikimedia.org [15:28:08] (03PS1) 10Hnowlan: partman: don't format swap volume [puppet] - 10https://gerrit.wikimedia.org/r/753975 (https://phabricator.wikimedia.org/T295375) [15:29:09] !log 
bblack@cumin1001 conftool action : set/pooled=yes; selector: dc=drmrs [15:29:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:32:22] (03PS3) 10Giuseppe Lavagetto: Introduce DriverInterface [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/699719 [15:33:01] (CirrusSearchJVMGCOldPoolFlatlined) firing: (2) Elasticsearch instance elastic2051-production-search-codfw is showing memory pressure in the old pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://alerts.wikimedia.org [15:33:02] RECOVERY - DPKG on elastic2051 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [15:33:49] (03CR) 10Klausman: [C: 03+1] wmfdb/mycnf: Add CnfSelector (031 comment) [software/wmfdb] - 10https://gerrit.wikimedia.org/r/753931 (owner: 10Kormat) [15:35:45] (03CR) 10David Caro: "There's a couple things, but mostly nits" [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/753612 (https://phabricator.wikimedia.org/T293800) (owner: 10Andrew Bogott) [15:39:09] !log lvs6001 + all services downtimed [15:39:09] (03CR) 10Kormat: [C: 03+1] partman: don't format swap volume [puppet] - 10https://gerrit.wikimedia.org/r/753975 (https://phabricator.wikimedia.org/T295375) (owner: 10Hnowlan) [15:39:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:39:39] (03CR) 10Hnowlan: [C: 03+2] partman: don't format swap volume [puppet] - 10https://gerrit.wikimedia.org/r/753975 (https://phabricator.wikimedia.org/T295375) (owner: 10Hnowlan) [15:40:39] !log hnowlan@cumin1001 START - Cookbook sre.hosts.reimage for host restbase2009.codfw.wmnet with OS buster [15:40:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:43:08] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=pybal site=drmrs https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:44:18] 
(03CR) 10Kormat: [C: 03+2] wmfdb/mycnf: Add CnfSelector [software/wmfdb] - 10https://gerrit.wikimedia.org/r/753931 (owner: 10Kormat) [15:44:24] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 45, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:44:25] (03CR) 10Kormat: [C: 03+2] wmfdb/mycnf: Set unix_socket OR port, not both. [software/wmfdb] - 10https://gerrit.wikimedia.org/r/753967 (owner: 10Kormat) [15:46:14] (03Merged) 10jenkins-bot: wmfdb/mycnf: Add CnfSelector [software/wmfdb] - 10https://gerrit.wikimedia.org/r/753931 (owner: 10Kormat) [15:46:16] (03Merged) 10jenkins-bot: wmfdb/mycnf: Set unix_socket OR port, not both. [software/wmfdb] - 10https://gerrit.wikimedia.org/r/753967 (owner: 10Kormat) [15:50:16] (03CR) 10Ema: "According to Chris' comment in 997e257d the whole block should have been deleted in October 2020, so I'd propose we just do that instead." [puppet] - 10https://gerrit.wikimedia.org/r/753966 (owner: 10BBlack) [16:04:06] !log hnowlan@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host restbase2009.codfw.wmnet with OS buster [16:04:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:04:44] (03CR) 10BBlack: varnish: Remove outdated cluster scale conditional (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/753966 (owner: 10BBlack) [16:06:04] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 233, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:07:35] 10SRE, 10Infrastructure-Foundations, 10netops: Cloud IPv6 subnets - https://phabricator.wikimedia.org/T187929 (10Tks4Fish) @faidon are there any updates on this? We've been discussing tooling to help with steward workflow, but they are highly dependent on IPv6. If that is already a problem locally, globally... 
[16:10:58] jouncebot now [16:10:58] For the next 15 hour(s) and 49 minute(s): No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220114T0800) [16:15:45] !log dancy@deploy1002 Synchronized README: Testing php-fpm restart (duration: 03m 18s) [16:15:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:20:26] (03PS1) 10Bking: elasticsearch: fix package dependency issue [puppet] - 10https://gerrit.wikimedia.org/r/753983 (https://phabricator.wikimedia.org/T299177) [16:21:02] (03CR) 10jerkins-bot: [V: 04-1] elasticsearch: fix package dependency issue [puppet] - 10https://gerrit.wikimedia.org/r/753983 (https://phabricator.wikimedia.org/T299177) (owner: 10Bking) [16:21:32] PROBLEM - Check systemd state on cp6007 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-varnish-exporter.service,varnishncsa.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:21:46] PROBLEM - Check systemd state on cp6008 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-varnish-exporter.service,varnishncsa.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:21:52] PROBLEM - Check systemd state on cp6015 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-varnish-exporter.service,varnishncsa.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:24:22] (03PS1) 10Bking: elasticsearch: fix package dependency issue [puppet] - 10https://gerrit.wikimedia.org/r/753985 (https://phabricator.wikimedia.org/T299177) [16:24:22] PROBLEM - Check systemd state on cp6014 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-varnish-exporter.service,varnishncsa.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:24:57] (03CR) 10jerkins-bot: [V: 04-1] elasticsearch: fix package dependency issue [puppet] - 
10https://gerrit.wikimedia.org/r/753985 (https://phabricator.wikimedia.org/T299177) (owner: 10Bking) [16:25:32] (03PS1) 10Hnowlan: partman: use reuse profiles on all restbase hosts [puppet] - 10https://gerrit.wikimedia.org/r/753986 (https://phabricator.wikimedia.org/T295375) [16:27:13] (03PS1) 10Bking: elasticsearch: fix package dependency issue [puppet] - 10https://gerrit.wikimedia.org/r/753988 (https://phabricator.wikimedia.org/T299177) [16:28:00] PROBLEM - traffic-pool service on cp6016 is CRITICAL: CRITICAL - Expecting active but unit traffic-pool is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [16:28:18] (03CR) 10jerkins-bot: [V: 04-1] elasticsearch: fix package dependency issue [puppet] - 10https://gerrit.wikimedia.org/r/753988 (https://phabricator.wikimedia.org/T299177) (owner: 10Bking) [16:29:34] all these cp6 alerts are ignorable (anything with site code 6 in the hostname is!) [16:30:21] !log rebooting cp60xx where x is 6, 7, 8, 14, 15, 16 (downtimed) [16:30:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:33:48] RECOVERY - Check systemd state on cp6014 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:33:59] (03CR) 10Muehlenhoff: elasticsearch: fix package dependency issue (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/753988 (https://phabricator.wikimedia.org/T299177) (owner: 10Bking) [16:34:16] RECOVERY - Check systemd state on cp6007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:34:44] RECOVERY - Check systemd state on cp6008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:34:54] RECOVERY - Check systemd state on cp6015 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:35:32] 
RECOVERY - traffic-pool service on cp6016 is OK: OK - traffic-pool is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [16:39:41] 10SRE, 10SRE-Access-Requests: Add bking as icinga user - https://phabricator.wikimedia.org/T298738 (10bking) Confirmed working, sorry for the delay. Feel free to close. [16:41:16] 10SRE, 10Data-Engineering, 10Generated Data Platform, 10Platform Engineering: Import Debian package of Cassandra 3.11.11 as 'dev' version - https://phabricator.wikimedia.org/T298805 (10Eevans) >>! In T298805#7622471, @MoritzMuehlenhoff wrote: > I added component/cassandradev for buster and stretch. For the... [16:42:35] 10SRE, 10SRE-Access-Requests: Add bking as icinga user - https://phabricator.wikimedia.org/T298738 (10cmooney) 05In progress→03Resolved [16:51:42] RECOVERY - IPMI Sensor Status on wdqs2003 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [16:55:03] !log reboot lvs6001 [16:55:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:56:48] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Introduce DriverInterface [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/699719 (owner: 10Giuseppe Lavagetto) [17:00:18] (03Merged) 10jenkins-bot: Introduce DriverInterface [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/699719 (owner: 10Giuseppe Lavagetto) [17:04:14] (03PS1) 10Elukey: knative-serving: add params to configure requests/limits of queue-proxy [deployment-charts] - 10https://gerrit.wikimedia.org/r/753996 [17:04:37] (03CR) 10jerkins-bot: [V: 04-1] knative-serving: add params to configure requests/limits of queue-proxy [deployment-charts] - 10https://gerrit.wikimedia.org/r/753996 (owner: 10Elukey) [17:05:54] (03CR) 10Elukey: "These values have been set manually via kubectl, there was heavy cpu throttling with the default values." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/753996 (owner: 10Elukey) [17:08:53] (03PS2) 10Elukey: knative-serving: add params to configure requests/limits of queue-proxy [deployment-charts] - 10https://gerrit.wikimedia.org/r/753996 [17:09:50] (03PS3) 10Elukey: knative-serving: add params to configure requests/limits of queue-proxy [deployment-charts] - 10https://gerrit.wikimedia.org/r/753996 (https://phabricator.wikimedia.org/T296173) [17:10:22] (03PS1) 10JMeybohm: Make use-service-account-credentials the default for controller-manager [puppet] - 10https://gerrit.wikimedia.org/r/753998 [17:11:00] (03CR) 10jerkins-bot: [V: 04-1] Make use-service-account-credentials the default for controller-manager [puppet] - 10https://gerrit.wikimedia.org/r/753998 (owner: 10JMeybohm) [17:15:59] (03PS2) 10JMeybohm: Make use-service-account-credentials the default for controller-manager [puppet] - 10https://gerrit.wikimedia.org/r/753998 (https://phabricator.wikimedia.org/T228967) [17:16:43] (03CR) 10jerkins-bot: [V: 04-1] Make use-service-account-credentials the default for controller-manager [puppet] - 10https://gerrit.wikimedia.org/r/753998 (https://phabricator.wikimedia.org/T228967) (owner: 10JMeybohm) [17:17:12] 10SRE, 10Infrastructure-Foundations, 10netops: Cloud IPv6 subnets - https://phabricator.wikimedia.org/T187929 (10cmooney) @Tks4Fish I don't think there is any reason to worry in terms of availability of IPv6 address space. Is there a specific proposal on the table requiring additional IPv6 address space for... 
[17:19:35] 10SRE, 10ops-codfw, 10DC-Ops, 10Discovery-Search (Current work): hw troubleshooting: IPMI Power Supply Failure (PS2) for wdqs2003.codfw.wmnet - https://phabricator.wikimedia.org/T299098 (10Papaul) 05Open→03Resolved PS2 replaced [17:20:05] (03CR) 10Andrew Bogott: [C: 03+2] Added nfs/migrate_service.py (0324 comments) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/753612 (https://phabricator.wikimedia.org/T293800) (owner: 10Andrew Bogott) [17:20:40] (03PS13) 10Andrew Bogott: Added nfs/migrate_service.py [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/753612 (https://phabricator.wikimedia.org/T293800) [17:20:56] (03PS14) 10Andrew Bogott: Added nfs/migrate_service.py [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/753612 (https://phabricator.wikimedia.org/T293800) [17:24:20] (03PS3) 10JMeybohm: Make use-service-account-credentials the default for controller-manager [puppet] - 10https://gerrit.wikimedia.org/r/753998 (https://phabricator.wikimedia.org/T228967) [17:24:44] (03CR) 10jerkins-bot: [V: 04-1] Added nfs/migrate_service.py [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/753612 (https://phabricator.wikimedia.org/T293800) (owner: 10Andrew Bogott) [17:26:19] (03CR) 10Accraze: [C: 03+1] knative-serving: add params to configure requests/limits of queue-proxy [deployment-charts] - 10https://gerrit.wikimedia.org/r/753996 (https://phabricator.wikimedia.org/T296173) (owner: 10Elukey) [17:26:25] !log reboot lvs600[23] [17:26:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:40:13] (03PS1) 10JMeybohm: Migrate kube-scheduler away from insecure API [puppet] - 10https://gerrit.wikimedia.org/r/754003 (https://phabricator.wikimedia.org/T290967) [17:41:29] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:42:43] (03PS2) 10JMeybohm: Migrate kube-scheduler away from insecure API [puppet] - 10https://gerrit.wikimedia.org/r/754003 (https://phabricator.wikimedia.org/T290967) [17:43:34] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33254/console" [puppet] - 10https://gerrit.wikimedia.org/r/754003 (https://phabricator.wikimedia.org/T290967) (owner: 10JMeybohm) [17:44:03] !log drmrs asw: removed native-vlan-id from config on secondary (x-rack) interfaces of lvses to debug network issue [17:44:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:45:37] PROBLEM - SSH on restbase2010.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:47:22] (03PS2) 10Bking: elasticsearch: fix package dependency issue [puppet] - 10https://gerrit.wikimedia.org/r/753988 (https://phabricator.wikimedia.org/T299177) [17:49:21] (03CR) 10jerkins-bot: [V: 04-1] elasticsearch: fix package dependency issue [puppet] - 10https://gerrit.wikimedia.org/r/753988 (https://phabricator.wikimedia.org/T299177) (owner: 10Bking) [17:50:58] (03PS15) 10Andrew Bogott: Added nfs/migrate_service.py [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/753612 (https://phabricator.wikimedia.org/T293800) [17:51:17] (03PS3) 10Bking: elasticsearch: fix package dependency issue [puppet] - 10https://gerrit.wikimedia.org/r/753988 (https://phabricator.wikimedia.org/T299177) [17:51:22] (03PS5) 10Herron: assign role::apifeatureusage::logstash to apifeatureusage[12]001 hosts [puppet] - 10https://gerrit.wikimedia.org/r/752211 (https://phabricator.wikimedia.org/T297239) [17:57:09] PROBLEM - cassandra-a CQL 10.192.48.54:9042 on restbase2009 is CRITICAL: connect to address 10.192.48.54 and port 
9042: Connection refused https://phabricator.wikimedia.org/T93886 [17:57:09] PROBLEM - Check systemd state on restbase2009 is CRITICAL: CRITICAL - degraded: The following units failed: smartd.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:57:19] PROBLEM - cassandra-b service on restbase2009 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [17:57:35] (03PS1) 10JMeybohm: controllermanager_token is defined in common [labs/private] - 10https://gerrit.wikimedia.org/r/754005 [17:57:38] (03PS1) 10JMeybohm: Add profile::kubernetes::master::scheduler_token to staging [labs/private] - 10https://gerrit.wikimedia.org/r/754006 (https://phabricator.wikimedia.org/T290967) [17:57:41] PROBLEM - cassandra-b SSL 10.192.48.55:7001 on restbase2009 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [17:57:47] PROBLEM - cassandra-c SSL 10.192.48.56:7001 on restbase2009 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [17:57:53] (03CR) 10JMeybohm: [V: 03+2 C: 03+2] controllermanager_token is defined in common [labs/private] - 10https://gerrit.wikimedia.org/r/754005 (owner: 10JMeybohm) [17:57:58] (03CR) 10JMeybohm: [V: 03+2 C: 03+2] Add profile::kubernetes::master::scheduler_token to staging [labs/private] - 10https://gerrit.wikimedia.org/r/754006 (https://phabricator.wikimedia.org/T290967) (owner: 10JMeybohm) [17:58:07] PROBLEM - cassandra-c CQL 10.192.48.56:9042 on restbase2009 is CRITICAL: connect to address 10.192.48.56 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [17:58:09] PROBLEM - cassandra-c service on restbase2009 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is inactive 
https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [17:58:41] PROBLEM - cassandra-a SSL 10.192.48.54:7001 on restbase2009 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [17:58:43] PROBLEM - cassandra-a service on restbase2009 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [17:58:45] PROBLEM - cassandra-b CQL 10.192.48.55:9042 on restbase2009 is CRITICAL: connect to address 10.192.48.55 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [17:59:51] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33255/console" [puppet] - 10https://gerrit.wikimedia.org/r/754003 (https://phabricator.wikimedia.org/T290967) (owner: 10JMeybohm) [18:06:16] (03CR) 10JMeybohm: [V: 03+1] "The first PCC it without (noop) and the second with scheduler_token in labs/private" [puppet] - 10https://gerrit.wikimedia.org/r/754003 (https://phabricator.wikimedia.org/T290967) (owner: 10JMeybohm) [18:06:18] (03PS1) 10Herron: profile::apifeatureusage::logstash: add placeholder secrets for PCC [labs/private] - 10https://gerrit.wikimedia.org/r/754007 (https://phabricator.wikimedia.org/T297239) [18:06:49] PROBLEM - SSH on kubernetes1004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:09:18] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 15 days, 0:00:00 on restbase2009.codfw.wmnet with reason: not in restbase cluster, used for testing [18:09:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:09:19] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 15 days, 0:00:00 on restbase2009.codfw.wmnet with reason: 
not in restbase cluster, used for testing [18:09:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:11:47] (03CR) 10Herron: [V: 03+2 C: 03+2] profile::apifeatureusage::logstash: add placeholder secrets for PCC [labs/private] - 10https://gerrit.wikimedia.org/r/754007 (https://phabricator.wikimedia.org/T297239) (owner: 10Herron) [18:16:01] (03PS2) 10Herron: logstash: add optional document_type parameter to es output config [puppet] - 10https://gerrit.wikimedia.org/r/747634 (https://phabricator.wikimedia.org/T297239) (owner: 10Cwhite) [18:16:12] (03PS6) 10Herron: assign role::apifeatureusage::logstash to apifeatureusage[12]001 hosts [puppet] - 10https://gerrit.wikimedia.org/r/752211 (https://phabricator.wikimedia.org/T297239) [18:22:46] 10SRE, 10Infrastructure-Foundations, 10netops: Cloud IPv6 subnets - https://phabricator.wikimedia.org/T187929 (10Tks4Fish) @cmooney Sorry, I think I ended up asking in the wrong place. My question comes from T37947, and after looking at the comments there, I got to this task, saw it as stalled and concluded... 
[18:23:54] (03PS3) 10Herron: logstash: add optional document_type parameter to es output config [puppet] - 10https://gerrit.wikimedia.org/r/747634 (https://phabricator.wikimedia.org/T297239) (owner: 10Cwhite) [18:28:02] (03PS7) 10Herron: assign role::apifeatureusage::logstash to apifeatureusage[12]001 hosts [puppet] - 10https://gerrit.wikimedia.org/r/752211 (https://phabricator.wikimedia.org/T297239) [18:31:35] (03CR) 10Ebernhardson: elasticsearch: fix package dependency issue (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/753988 (https://phabricator.wikimedia.org/T299177) (owner: 10Bking) [18:40:17] (03CR) 10Herron: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/752211 (https://phabricator.wikimedia.org/T297239) (owner: 10Herron) [18:41:33] (03PS1) 10Jcrespo: mediabackups: Backup s2 media files at codfw [puppet] - 10https://gerrit.wikimedia.org/r/754013 (https://phabricator.wikimedia.org/T262668) [18:41:37] PROBLEM - SSH on restbase2011.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:46:53] RECOVERY - SSH on restbase2010.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:49:45] (03PS4) 10Ryan Kemper: elasticsearch: fix package dependency issue [puppet] - 10https://gerrit.wikimedia.org/r/753988 (https://phabricator.wikimedia.org/T299177) (owner: 10Bking) [18:50:15] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/753988 (https://phabricator.wikimedia.org/T299177) (owner: 10Bking) [18:54:18] (03PS5) 10Ryan Kemper: elasticsearch: fix package dependency issue [puppet] - 10https://gerrit.wikimedia.org/r/753988 (https://phabricator.wikimedia.org/T299177) (owner: 10Bking) [18:54:55] (03CR) 10jerkins-bot: [V: 04-1] elasticsearch: fix package dependency issue [puppet] - 10https://gerrit.wikimedia.org/r/753988 
(https://phabricator.wikimedia.org/T299177) (owner: 10Bking) [18:55:56] (03PS1) 10Majavah: P:cyberbot::exec: support newer debian versions [puppet] - 10https://gerrit.wikimedia.org/r/754016 [18:57:52] (03CR) 10Cyberpower678: [C: 03+1] P:cyberbot::exec: support newer debian versions [puppet] - 10https://gerrit.wikimedia.org/r/754016 (owner: 10Majavah) [19:00:23] (03PS6) 10Ryan Kemper: elasticsearch: fix package dependency issue [puppet] - 10https://gerrit.wikimedia.org/r/753988 (https://phabricator.wikimedia.org/T299177) (owner: 10Bking) [19:01:37] (03PS7) 10Ryan Kemper: elasticsearch: fix package dependency issue [puppet] - 10https://gerrit.wikimedia.org/r/753988 (https://phabricator.wikimedia.org/T299177) (owner: 10Bking) [19:02:58] (03CR) 10jerkins-bot: [V: 04-1] elasticsearch: fix package dependency issue [puppet] - 10https://gerrit.wikimedia.org/r/753988 (https://phabricator.wikimedia.org/T299177) (owner: 10Bking) [19:07:05] (03PS8) 10Ryan Kemper: elasticsearch: fix package dependency issue [puppet] - 10https://gerrit.wikimedia.org/r/753988 (https://phabricator.wikimedia.org/T299177) (owner: 10Bking) [19:07:13] 10SRE, 10Observability-Logging, 10Patch-For-Review: Move logstash api-feature-usage output away from v5 cluster - https://phabricator.wikimedia.org/T297239 (10herron) Planning to move the apifeatureusage pipeline over to the new hosts next week with these switchover steps: * Add profile::apifeatureusage::... 
[19:08:04] (03CR) 10jerkins-bot: [V: 04-1] elasticsearch: fix package dependency issue [puppet] - 10https://gerrit.wikimedia.org/r/753988 (https://phabricator.wikimedia.org/T299177) (owner: 10Bking) [19:08:27] (03CR) 10Herron: "Planning to merge this next week as part of full plan outlined in https://phabricator.wikimedia.org/T297239#7622955" [puppet] - 10https://gerrit.wikimedia.org/r/752211 (https://phabricator.wikimedia.org/T297239) (owner: 10Herron) [19:09:29] (03CR) 10Jcrespo: [C: 03+2] mediabackups: Backup s2 media files at codfw [puppet] - 10https://gerrit.wikimedia.org/r/754013 (https://phabricator.wikimedia.org/T262668) (owner: 10Jcrespo) [19:10:23] RECOVERY - BGP status on cr2-eqord is OK: BGP OK - up: 155, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:24:00] (03PS1) 10Dzahn: OTRS: rename role class to VRTS [puppet] - 10https://gerrit.wikimedia.org/r/754020 (https://phabricator.wikimedia.org/T293942) [19:24:45] (03CR) 10jerkins-bot: [V: 04-1] OTRS: rename role class to VRTS [puppet] - 10https://gerrit.wikimedia.org/r/754020 (https://phabricator.wikimedia.org/T293942) (owner: 10Dzahn) [19:24:54] (03PS2) 10Dzahn: OTRS: rename role class to VRTS [puppet] - 10https://gerrit.wikimedia.org/r/754020 (https://phabricator.wikimedia.org/T293942) [19:25:44] (03CR) 10jerkins-bot: [V: 04-1] OTRS: rename role class to VRTS [puppet] - 10https://gerrit.wikimedia.org/r/754020 (https://phabricator.wikimedia.org/T293942) (owner: 10Dzahn) [19:27:05] (03CR) 10Majavah: [C: 04-1] OTRS: rename role class to VRTS (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/754020 (https://phabricator.wikimedia.org/T293942) (owner: 10Dzahn) [19:28:38] (03PS3) 10Dzahn: OTRS: rename role class to VRTS [puppet] - 10https://gerrit.wikimedia.org/r/754020 (https://phabricator.wikimedia.org/T293942) [19:30:33] (03PS4) 10Dzahn: OTRS: rename role class to VRTS [puppet] - 10https://gerrit.wikimedia.org/r/754020 
(https://phabricator.wikimedia.org/T293942) [19:33:16] (CirrusSearchJVMGCOldPoolFlatlined) firing: Elasticsearch instance elastic2051-production-search-codfw is showing memory pressure in the old pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://alerts.wikimedia.org [19:37:22] (03PS5) 10Dzahn: OTRS: rename role class to VRTS [puppet] - 10https://gerrit.wikimedia.org/r/754020 (https://phabricator.wikimedia.org/T293942) [19:39:57] (03CR) 10Dzahn: OTRS: rename role class to VRTS (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/754020 (https://phabricator.wikimedia.org/T293942) (owner: 10Dzahn) [19:41:12] (03PS6) 10Dzahn: OTRS: rename role class to VRTS [puppet] - 10https://gerrit.wikimedia.org/r/754020 (https://phabricator.wikimedia.org/T293942) [19:45:00] (03PS1) 10Jcrespo: mediabackups: Backup s3 media files at codfw [puppet] - 10https://gerrit.wikimedia.org/r/754022 (https://phabricator.wikimedia.org/T262668) [19:45:11] 10SRE, 10Znuny, 10serviceops, 10Patch-For-Review: refactor OTRS role/module/cumin aliases - https://phabricator.wikimedia.org/T293942 (10Arnoldokoth) [19:46:27] 10SRE, 10ops-eqiad, 10DC-Ops: Rack msw2-eqiad in new cage - https://phabricator.wikimedia.org/T298980 (10Jclark-ctr) moved msw2 back to old cage it is connected to port 41. msw2 is back online looks like we might have a missing link with opengear future-scs-f8-eqiad will need some assistance with this @a... 
[19:47:45] (03CR) 10Nikki Nikkhoui: [C: 03+1] image-suggestion-api: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/742271 (owner: 10PipelineBot) [19:48:21] (03CR) 10Nikki Nikkhoui: [C: 04-1] "close in favor of newer patch" [deployment-charts] - 10https://gerrit.wikimedia.org/r/734438 (owner: 10PipelineBot) [19:48:27] (03CR) 10Nikki Nikkhoui: [C: 04-1] "close in favor of newer patch" [deployment-charts] - 10https://gerrit.wikimedia.org/r/736565 (owner: 10PipelineBot) [19:49:33] (03CR) 10Nikki Nikkhoui: [C: 04-1] "close in favor of newer patch" [deployment-charts] - 10https://gerrit.wikimedia.org/r/739618 (owner: 10PipelineBot) [19:50:10] (03PS1) 10Jcrespo: mediabackups: Backup s5 media files at codfw [puppet] - 10https://gerrit.wikimedia.org/r/754023 (https://phabricator.wikimedia.org/T262668) [19:50:14] 10SRE, 10MediaWiki-extensions-CodeReview, 10Platform Engineering, 10serviceops-radar, 10Patch-For-Review: Make an HTML dump of the output of the CodeReview extension on MediaWiki.org - https://phabricator.wikimedia.org/T205361 (10Jdforrester-WMF) >>! In T205361#7574347, @Legoktm wrote: > OK, I copied ove... [19:52:40] (03CR) 10Dzahn: [C: 03+2] P:cyberbot::exec: support newer debian versions [puppet] - 10https://gerrit.wikimedia.org/r/754016 (owner: 10Majavah) [19:54:43] 10SRE, 10MediaWiki-extensions-CodeReview, 10Platform Engineering, 10serviceops-radar, 10Patch-For-Review: Make an HTML dump of the output of the CodeReview extension on MediaWiki.org - https://phabricator.wikimedia.org/T205361 (10Legoktm) No, the Apache config change is non trivial (disable puppet everyw... 
[19:58:37] (03PS2) 10Jcrespo: mediabackups: Backup s5 media files at codfw [puppet] - 10https://gerrit.wikimedia.org/r/754023 (https://phabricator.wikimedia.org/T262668) [19:58:39] (03PS1) 10Jcrespo: mediabackups: Backup s6 media files at codfw [puppet] - 10https://gerrit.wikimedia.org/r/754024 (https://phabricator.wikimedia.org/T262668) [19:58:41] (03PS1) 10Jcrespo: mediabackups: Backup s7 media files at codfw [puppet] - 10https://gerrit.wikimedia.org/r/754025 (https://phabricator.wikimedia.org/T262668) [19:58:43] (03PS1) 10Jcrespo: mediabackups: Backup s8 media files at codfw [puppet] - 10https://gerrit.wikimedia.org/r/754026 (https://phabricator.wikimedia.org/T262668) [20:01:19] 10SRE, 10Infrastructure-Foundations, 10netops: Cloud IPv6 subnets - https://phabricator.wikimedia.org/T187929 (10cmooney) @Tks4Fish no problem at all! And certainly no need to apologize. This task more relates to allocating blocks of IPv6 for Toolforge/Cloud. As per the above discussion there are some sma... [20:05:00] (03CR) 10Jforrester: [C: 03+1] MWMultiVersion.php: Reverse logic for wikiversions file selection [mediawiki-config] - 10https://gerrit.wikimedia.org/r/744836 (owner: 10Ahmon Dancy) [20:07:24] (03PS1) 10Herron: remove references to centrallog2001 [homer/public] - 10https://gerrit.wikimedia.org/r/754028 (https://phabricator.wikimedia.org/T298994) [20:08:16] 10SRE, 10Language-Team (Language-2022-January-March): Deploy Flores MT secrets in Production for ContentTranslation - https://phabricator.wikimedia.org/T299023 (10Dzahn) Of course using GPG is fine as well. I just did not suggest it because usually people consider it cumbersome and once we add the credentials... 
[20:09:23] RECOVERY - SSH on kubernetes1004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:09:43] (03CR) 10Jcrespo: [C: 03+2] mediabackups: Backup s3 media files at codfw [puppet] - 10https://gerrit.wikimedia.org/r/754022 (https://phabricator.wikimedia.org/T262668) (owner: 10Jcrespo) [20:11:50] (03CR) 10Dzahn: [C: 04-2] "there also needs to be renaming in the private repo or this will fail. we have scheduled a time to do this together, so disable puppet, me" [puppet] - 10https://gerrit.wikimedia.org/r/754020 (https://phabricator.wikimedia.org/T293942) (owner: 10Dzahn) [20:14:48] (03PS1) 10Herron: remove references to centrallog2001 [puppet] - 10https://gerrit.wikimedia.org/r/754029 (https://phabricator.wikimedia.org/T298994) [20:15:09] (03PS9) 10Ebernhardson: elasticsearch: fix package dependency issue [puppet] - 10https://gerrit.wikimedia.org/r/753988 (https://phabricator.wikimedia.org/T299177) (owner: 10Bking) [20:17:46] (03CR) 10Ebernhardson: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/753988 (https://phabricator.wikimedia.org/T299177) (owner: 10Bking) [20:19:38] (03PS10) 10Ebernhardson: elasticsearch: fix package dependency issue [puppet] - 10https://gerrit.wikimedia.org/r/753988 (https://phabricator.wikimedia.org/T299177) (owner: 10Bking) [20:27:33] (03CR) 10Ebernhardson: elasticsearch: fix package dependency issue (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/753988 (https://phabricator.wikimedia.org/T299177) (owner: 10Bking) [20:42:03] PROBLEM - SSH on contint1001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:47:51] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 44, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down 
[20:49:13] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 133, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:56:41] (03PS1) 10Ebernhardson: cirrus: Fetch java before elastic plugins [puppet] - 10https://gerrit.wikimedia.org/r/754034 [21:00:11] (03PS2) 10Ebernhardson: cirrus: Fetch java before elastic plugins [puppet] - 10https://gerrit.wikimedia.org/r/754034 (https://phabricator.wikimedia.org/T299177) [21:00:33] (03CR) 10Ebernhardson: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/754034 (https://phabricator.wikimedia.org/T299177) (owner: 10Ebernhardson) [21:05:37] (03CR) 10Bking: [C: 03+2] cirrus: Fetch java before elastic plugins [puppet] - 10https://gerrit.wikimedia.org/r/754034 (https://phabricator.wikimedia.org/T299177) (owner: 10Ebernhardson) [21:07:33] (03PS1) 10Cwhite: hiera: allow logstash1032 through kafka-jumbo firewall [puppet] - 10https://gerrit.wikimedia.org/r/754035 (https://phabricator.wikimedia.org/T288621) [21:09:48] (03CR) 10Herron: [C: 03+1] hiera: allow logstash1032 through kafka-jumbo firewall [puppet] - 10https://gerrit.wikimedia.org/r/754035 (https://phabricator.wikimedia.org/T288621) (owner: 10Cwhite) [21:12:30] (03CR) 10Cwhite: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/754029 (https://phabricator.wikimedia.org/T298994) (owner: 10Herron) [21:14:00] (03CR) 10Cwhite: [C: 03+2] hiera: allow logstash1032 through kafka-jumbo firewall [puppet] - 10https://gerrit.wikimedia.org/r/754035 (https://phabricator.wikimedia.org/T288621) (owner: 10Cwhite) [21:21:53] (03PS1) 10Ryan Kemper: Revert "cirrus: Fetch java before elastic plugins" [puppet] - 10https://gerrit.wikimedia.org/r/753953 [21:23:14] (03CR) 10Bking: [C: 03+2] Revert "cirrus: Fetch java before elastic plugins" [puppet] - 10https://gerrit.wikimedia.org/r/753953 (owner: 10Ryan Kemper) [21:23:33] (03CR) 10jerkins-bot: [V: 
04-1] Revert "cirrus: Fetch java before elastic plugins" [puppet] - 10https://gerrit.wikimedia.org/r/753953 (owner: 10Ryan Kemper) [21:24:11] PROBLEM - Widespread puppet agent failures- no resources reported on alert1001 is CRITICAL: 0.01193 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [21:25:21] (03PS2) 10Ryan Kemper: Revert "cirrus: Fetch java before elastic plugins" [puppet] - 10https://gerrit.wikimedia.org/r/753953 [21:25:34] (03PS3) 10Ryan Kemper: Revert "cirrus: Fetch java before elastic plugins" [puppet] - 10https://gerrit.wikimedia.org/r/753953 (https://phabricator.wikimedia.org/T299177) [21:39:55] RECOVERY - Widespread puppet agent failures- no resources reported on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.002169 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [21:40:03] (03PS1) 10Ryan Kemper: cirrus: Fetch java before elastic plugins [puppet] - 10https://gerrit.wikimedia.org/r/754042 (https://phabricator.wikimedia.org/T299177) [21:41:00] (03PS2) 10Ryan Kemper: cirrus: Fetch java before elastic plugins [puppet] - 10https://gerrit.wikimedia.org/r/754042 (https://phabricator.wikimedia.org/T299177) [21:41:18] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/754042 (https://phabricator.wikimedia.org/T299177) (owner: 10Ryan Kemper) [21:44:53] (03CR) 10Bking: [C: 03+1] cirrus: Fetch java before elastic plugins [puppet] - 10https://gerrit.wikimedia.org/r/754042 (https://phabricator.wikimedia.org/T299177) (owner: 10Ryan Kemper) [22:11:46] (03CR) 10Ryan Kemper: [C: 03+2] cirrus: Fetch java before elastic plugins [puppet] - 10https://gerrit.wikimedia.org/r/754042 (https://phabricator.wikimedia.org/T299177) (owner: 10Ryan Kemper) [22:26:42] !log ryankemper@cumin2002 START - Cookbook sre.hosts.reimage for host elastic2051.codfw.wmnet with OS stretch [22:26:44] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:26:50] 10SRE, 10ops-codfw, 10Discovery-Search, 10Patch-For-Review: Degraded RAID on elastic2051 - https://phabricator.wikimedia.org/T298674 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by ryankemper@cumin2002 for host elastic2051.codfw.wmnet with OS stretch [22:49:16] (03PS1) 10Andrew Bogott: cloud-vps nfsclient: switch to using the VM-hosted scratch NFS server [puppet] - 10https://gerrit.wikimedia.org/r/754043 (https://phabricator.wikimedia.org/T291405) [22:50:52] 10SRE, 10MediaWiki-Uploading, 10Traffic: ATS 502 on uploading non-small files - https://phabricator.wikimedia.org/T299160 (10Josve05a) [22:58:12] 10SRE, 10Discovery: Ban elastic2035 from prod elastic clusters - https://phabricator.wikimedia.org/T299151 (10bking) Banned elastic2035 and elastic2051 (which was already broken) via the following commands: `curl -H 'Content-Type: application/json' -XPUT \ "http://localhost:9200/_cluster/settings" -d \ '{"tra... [23:07:38] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2051.codfw.wmnet with OS stretch [23:07:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:07:46] 10SRE, 10ops-codfw, 10Discovery-Search, 10Patch-For-Review: Degraded RAID on elastic2051 - https://phabricator.wikimedia.org/T298674 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by ryankemper@cumin2002 for host elastic2051.codfw.wmnet with OS stretch completed: - elastic2051 (**WARN*... 
[23:12:12] 10SRE, 10ops-codfw, 10Discovery-Search (Current work), 10Patch-For-Review: Degraded RAID on elastic2051 - https://phabricator.wikimedia.org/T298674 (10RKemper) p:05Triage→03Medium [23:15:40] 10SRE, 10ops-codfw, 10Discovery-Search (Current work), 10Patch-For-Review: Degraded RAID on elastic2051 - https://phabricator.wikimedia.org/T298674 (10RKemper) 05In progress→03Resolved [23:15:48] 10SRE, 10ops-codfw, 10Discovery-Search (Current work), 10Patch-For-Review: Degraded RAID on elastic2051 - https://phabricator.wikimedia.org/T298674 (10RKemper) Re-image is complete [23:24:01] PROBLEM - Hadoop NodeManager on an-worker1138 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [23:25:13] RECOVERY - Check systemd state on elastic2051 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:38:19] PROBLEM - MariaDB Replica IO: x1 on db2101 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2026, Errmsg: error reconnecting to master repl@db2096.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: SSL connection error00000000:lib(0):func(0):reason(0) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [23:38:27] PROBLEM - MariaDB Replica IO: s5 on db2101 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2026, Errmsg: error reconnecting to master repl@db2123.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: SSL connection error00000000:lib(0):func(0):reason(0) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [23:38:39] (03PS1) 10Legoktm: Revert "LinksUpdate refactor" and follow-ups [core] (wmf/1.38.0-wmf.17) - 10https://gerrit.wikimedia.org/r/754046 (https://phabricator.wikimedia.org/T299244) [23:39:59] 
RECOVERY - MariaDB Replica IO: x1 on db2101 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [23:40:09] RECOVERY - MariaDB Replica IO: s5 on db2101 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [23:43:17] RECOVERY - Hadoop NodeManager on an-worker1138 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [23:47:24] legoktm: Well, it looks like it's passing CI. Should we JFDI and deploy (with no SRE or RelEng)? [23:48:04] uhmmm [23:48:13] legoktm: My thoughts too. :-( [23:48:47] how much longer are you going to be around for? [23:49:17] I can be around for a bit, but this isn't really my area of expertise. If it stampedes the DBs, for instance… [23:49:38] But it should just work as a reversion to the old status quo, hopefully? [23:49:41] yeah, just wondering how long we should wait to see if Tim comes around [23:49:48] rzl: are you still around? [23:50:06] It's only 18:50 here, I can be around for a couple of hours before turning into a pumpkin. [23:50:18] were any extensions updated based on these refactors? [23:50:43] good question, let's look through Tim's other patches [23:50:49] subbu[m]: Yes. [23:50:51] git #768464d0 - Identify lead images using a new parser hook instead of during LinksUpdate (task T176520) (task T296895) by Tim Starling [23:50:52] T176520: Pageimage property (and possibly other page properties) not updated reliably after reverts - https://phabricator.wikimedia.org/T176520 [23:50:52] T296895: LinksUpdate hook review - https://phabricator.wikimedia.org/T296895 [23:51:17] In PageImages. 
But just that, from grepping https://www.mediawiki.org/wiki/MediaWiki_1.38/wmf.17 for "links" [23:51:23] James_F, no, those are the other way around. [23:51:33] PageImages refactor blocked the LinksUpdated refactor. [23:51:42] Right, so it should be fine then? [23:51:42] https://gerrit.wikimedia.org/r/c/mediawiki/extensions/GeoData/+/744910 [23:51:58] That shipped in wmf.17. [23:52:08] Err. wmf.16. [23:52:35] ah [23:52:36] yeah [23:52:48] legoktm: Are you able to be around for a bit? [23:52:55] afecf46c237b39af43efd9cd2d664b83ed7c4f16 is Add LinksUpdate::getPageId() and that went out in wmf.13 [23:52:57] yes [23:53:18] Cool, that makes two no-longer-qualified people to do it, 0+0 = 1, right? [23:53:39] https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Translate/+/743283 is likely to be backwards-compatible [23:53:50] https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Wikibase/+/744908 another use of getPageId(), fine [23:53:57] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb={CREATE,UPDATE} https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [23:54:55] to me, the translate change looks b/c yes. [23:55:00] and the PageImages stuff landed earlier than wmf.17 [23:55:01] The risky-patch comment for this says "Revert plan: Revert the patch, if the issue cannot be clearly identified and fixed easily", so at least when writing that they didn't expect it to break things if rolled back: https://phabricator.wikimedia.org/T293958#7612230 [23:56:15] RECOVERY - k8s API server requests latencies on kubestagemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [23:56:28] legoktm, James_F: hi. sorry, i was out for a bit [23:56:30] * dduvall reads [23:56:34] I am going to say this .... 
just because CI passed the reverts is not a big source of confidence in and of itself that nothing will break, but the risky-patch comment is a better indicator. [23:56:55] Totally. [23:57:19] I merged an OOUI update yesterday that broke all edits, and our CI passed it anyway. I'm not feeling massively confident in rolling this back. [23:57:44] Monday is a no-deploy day too, I believe. [23:57:56] dduvall: we've identified the breaking change (LinksUpdate refactor), have a revert of the 4 patches that passes CI, now discussing how confident we feel about reverting it [23:58:01] er, deploying the revert*