[00:00:04] <jouncebot>	 brennen: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC late backport and config training deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220114T0000).
[00:02:11] <wikibugs>	 (Merged) jenkins-bot: In WikitextContentHandler always use getFreshParser() [core] (wmf/1.38.0-wmf.17) - https://gerrit.wikimedia.org/r/753828 (https://phabricator.wikimedia.org/T299149) (owner: Dduvall)
[00:04:19] <dduvall>	 alright, pulling the patch in to php-1.38.0-wmf.17
[00:05:18] <dduvall>	 TimStarling: is there a way to test ^ on mwdebug or should i just sync it?
[00:05:36] <icinga-wm>	 PROBLEM - Check systemd state on grafana1002 is CRITICAL: CRITICAL - degraded: The following units failed: grafana-ldap-users-sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:05:40] <wikibugs>	 SRE, ops-codfw, Continuous-Integration-Infrastructure, DC-Ops, netops: DRAC firmware upgrades codfw (was: Flapping codfw management alarm ( contint2001.mgmt/SSH is CRITICAL ))) - https://phabricator.wikimedia.org/T283582 (Papaul) @hashar since Monday is a Holiday, let's do this on the 18th a...
[00:06:18] <icinga-wm>	 PROBLEM - Check systemd state on lists1001 is CRITICAL: CRITICAL - degraded: The following units failed: discard_held_messages.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:07:03] <TimStarling>	 dduvall: just sync it, the logs will provide confirmation
[00:07:10] <dduvall>	 ack
[00:08:59] <logmsgbot>	 !log dduvall@deploy1002 Synchronized php-1.38.0-wmf.17/includes/content/WikitextContentHandler.php: Backport: [[gerrit:753828|In WikitextContentHandler always use getFreshParser() (T299149)]] (duration: 01m 07s)
[00:09:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:09:03] <stashbot>	 T299149: MWException: Parser state cleared while parsing. Did you call Parser::parse recursively? - https://phabricator.wikimedia.org/T299149
[00:09:22] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn
[00:09:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:09:30] <dduvall>	 i don't recall seeing much before group1 promotion so i will go ahead with that as well
[00:10:17] <wikibugs>	 (PS1) Dduvall: group1 wikis to 1.38.0-wmf.17  refs T293958 [mediawiki-config] - https://gerrit.wikimedia.org/r/753860
[00:10:19] <wikibugs>	 (CR) Dduvall: [C: +2] group1 wikis to 1.38.0-wmf.17  refs T293958 [mediawiki-config] - https://gerrit.wikimedia.org/r/753860 (owner: Dduvall)
[00:12:08] <wikibugs>	 (Merged) jenkins-bot: group1 wikis to 1.38.0-wmf.17  refs T293958 [mediawiki-config] - https://gerrit.wikimedia.org/r/753860 (owner: Dduvall)
[00:13:57] <logmsgbot>	 !log dduvall@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.38.0-wmf.17  refs T293958
[00:14:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:14:01] <stashbot>	 T293958: 1.38.0-wmf.17 deployment blockers - https://phabricator.wikimedia.org/T293958
[00:14:18] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[00:14:19] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn
[00:14:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:14:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:15:04] <logmsgbot>	 !log dduvall@deploy1002 Synchronized php: group1 wikis to 1.38.0-wmf.17  refs T293958 (duration: 01m 06s)
[00:15:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:15:29] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[00:15:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:18:48] <dduvall>	 looks good. rolling to all wikis
[00:19:58] <wikibugs>	 (PS1) Dduvall: all wikis to 1.38.0-wmf.17  refs T293958 [mediawiki-config] - https://gerrit.wikimedia.org/r/753862
[00:20:00] <wikibugs>	 (CR) Dduvall: [C: +2] all wikis to 1.38.0-wmf.17  refs T293958 [mediawiki-config] - https://gerrit.wikimedia.org/r/753862 (owner: Dduvall)
[00:20:36] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn
[00:20:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:21:28] <wikibugs>	 (Merged) jenkins-bot: all wikis to 1.38.0-wmf.17  refs T293958 [mediawiki-config] - https://gerrit.wikimedia.org/r/753862 (owner: Dduvall)
[00:23:08] <logmsgbot>	 !log dduvall@deploy1002 rebuilt and synchronized wikiversions files: all wikis to 1.38.0-wmf.17  refs T293958
[00:23:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:23:12] <stashbot>	 T293958: 1.38.0-wmf.17 deployment blockers - https://phabricator.wikimedia.org/T293958
[00:23:44] <James_F>	 Fingers crossed.
[00:24:38] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[00:24:40] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn
[00:24:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:24:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:25:49] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[00:25:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:26:27] <dduvall>	 i am seeing some db replication lag errors
[00:27:21] <dduvall>	 and some slow queries on the page table
[00:27:37] <dduvall>	 maybe this is temporary though
[00:29:57] <dduvall>	 i think we're ok
[00:30:52] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn
[00:30:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:32:10] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[00:32:11] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn
[00:32:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:32:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:36:05] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[00:36:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:41:04] <dduvall>	 calling that a train i guess. thanks TimStarling, twentyafterfour, others
[00:50:36] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=sidekiq site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:55:14] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[01:10:40] <wikibugs>	 SRE-Access-Requests: Requesting access to analytics-privatedata-users for Nick Ray - https://phabricator.wikimedia.org/T299186 (nray)
[01:20:38] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2023 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is WARNING: Test Get summary from storage responds with unexpected value at path = Missing keys: [thumbnail, originalimage] https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[01:28:46] <icinga-wm>	 PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/summary/{title} (Get summary for test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[01:30:52] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2009 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is WARNING: Test Get summary from storage responds with unexpected value at path = Missing keys: [thumbnail, originalimage] https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[01:41:46] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2020 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is WARNING: Test Get summary from storage responds with unexpected value at path = Missing keys: [thumbnail, originalimage] https://wikitech.wikimedia.org/wiki/Services/Monitor
[01:41:46] <icinga-wm>	 base
[01:51:34] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2009 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is WARNING: Test Get summary from storage responds with unexpected value at path = Missing keys: [thumbnail, originalimage] https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[01:55:04] <icinga-wm>	 RECOVERY - SSH on kubernetes1004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[02:09:26] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2009 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is WARNING: Test Get summary from storage responds with unexpected value at path = Missing keys: [thumbnail, originalimage] https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[02:22:36] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2009 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is WARNING: Test Get summary from storage responds with unexpected value at path = Missing keys: [thumbnail, originalimage] https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[02:37:54] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2009 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is WARNING: Test Get summary from storage responds with unexpected value at path = Missing keys: [thumbnail, originalimage] https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[02:49:38] <icinga-wm>	 PROBLEM - SSH on mw2258.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[02:52:12] <icinga-wm>	 PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin1001 is CRITICAL: CRITICAL: the following (5) node(s) change every puppet run: elastic2051, labstore1007, miscweb1002, labstore1006, restbase2009 https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes
[04:09:18] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2009 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is WARNING: Test Get summary from storage responds with unexpected value at path = Missing keys: [thumbnail, originalimage] https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[04:26:06] <icinga-wm>	 PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin2001 is CRITICAL: CRITICAL: the following (5) node(s) change every puppet run: labstore1006, elastic2051, miscweb1002, labstore1007, restbase2009 https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes
[04:52:10] <icinga-wm>	 RECOVERY - SSH on mw2258.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:09:46] <wikibugs>	 (PS9) Andrew Bogott: Added nfs/migrate_service.py [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/753612 (https://phabricator.wikimedia.org/T293800)
[05:12:04] <icinga-wm>	 RECOVERY - Check systemd state on lists1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:12:42] <wikibugs>	 (CR) jerkins-bot: [V: -1] Added nfs/migrate_service.py [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/753612 (https://phabricator.wikimedia.org/T293800) (owner: Andrew Bogott)
[05:15:21] <wikibugs>	 (PS10) Andrew Bogott: Added nfs/migrate_service.py [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/753612 (https://phabricator.wikimedia.org/T293800)
[05:16:28] <legoktm>	 !log manually restarted discard_held_messages service on lists1001, failed with a spurious sqlalchemy issue about packets being out of order
[05:16:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:18:50] <wikibugs>	 (CR) Andrew Bogott: [C: +2] Added nfs/migrate_service.py [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/753612 (https://phabricator.wikimedia.org/T293800) (owner: Andrew Bogott)
[05:19:01] <wikibugs>	 (PS11) Andrew Bogott: Added nfs/migrate_service.py [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/753612 (https://phabricator.wikimedia.org/T293800)
[05:57:46] <icinga-wm>	 PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin2002 is CRITICAL: CRITICAL: the following (5) node(s) change every puppet run: miscweb1002, labstore1006, restbase2009, labstore1007, elastic2051 https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes
[06:11:50] <wikibugs>	 (PS1) Marostegui: wmnet: Failover m5-master to dbproxy1021 [dns] - https://gerrit.wikimedia.org/r/753870 (https://phabricator.wikimedia.org/T298586)
[06:14:10] <wikibugs>	 (PS1) Marostegui: Revert "dbproxy1021: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/753835
[06:14:52] <wikibugs>	 (CR) Marostegui: [C: +2] Revert "dbproxy1021: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/753835 (owner: Marostegui)
[06:15:47] <marostegui>	 !log Failover m5 proxy from dbproxy1017 to dbproxy1021 T298586
[06:15:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:15:50] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2009 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is WARNING: Test Get summary from storage responds with unexpected value at path = Missing keys: [thumbnail, originalimage] https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[06:15:51] <stashbot>	 T298586: Upgrade all dbproxy hosts to Bullseye - https://phabricator.wikimedia.org/T298586
[06:15:54] <wikibugs>	 (CR) Marostegui: [C: +2] wmnet: Failover m5-master to dbproxy1021 [dns] - https://gerrit.wikimedia.org/r/753870 (https://phabricator.wikimedia.org/T298586) (owner: Marostegui)
[06:35:54] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove logpager group from s3 eqiad T263127', diff saved to https://phabricator.wikimedia.org/P18735 and previous config saved to /var/cache/conftool/dbconfig/20220114-063554-marostegui.json
[06:35:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:35:59] <stashbot>	 T263127: Remove groups from db configs - https://phabricator.wikimedia.org/T263127
[06:37:04] <wikibugs>	 SRE, Language-Team (Language-2022-January-March): Deploy Flores MT secrets in Production for ContentTranslation - https://phabricator.wikimedia.org/T299023 (KartikMistry) >>! In T299023#7620868, @Dzahn wrote: > Hi @KartikMistry re: the question how to get the key to us: you can make a new file in your ho...
[07:00:40] <wikibugs>	 (PS1) Marostegui: pc2012: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/753874 (https://phabricator.wikimedia.org/T299046)
[07:02:48] <wikibugs>	 (CR) Marostegui: [C: +2] pc2012: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/753874 (https://phabricator.wikimedia.org/T299046) (owner: Marostegui)
[07:05:32] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host pc2012.codfw.wmnet with OS bullseye
[07:05:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:20:09] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2014 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is WARNING: Test Get summary from storage responds with unexpected value at path = Missing keys: [thumbnail, originalimage] https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[07:21:13] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2012 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is WARNING: Test Get summary from storage responds with unexpected value at path = Missing keys: [thumbnail, originalimage] https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[07:33:11] <wikibugs>	 (PS2) Gehel: elasticsearch: fix package dependency error [puppet] - https://gerrit.wikimedia.org/r/753857 (https://phabricator.wikimedia.org/T299177) (owner: Bking)
[07:33:49] <wikibugs>	 (CR) jerkins-bot: [V: -1] elasticsearch: fix package dependency error [puppet] - https://gerrit.wikimedia.org/r/753857 (https://phabricator.wikimedia.org/T299177) (owner: Bking)
[07:36:30] <wikibugs>	 (PS3) Gehel: elasticsearch: fix package dependency error [puppet] - https://gerrit.wikimedia.org/r/753857 (https://phabricator.wikimedia.org/T299177) (owner: Bking)
[07:37:05] <wikibugs>	 (CR) jerkins-bot: [V: -1] elasticsearch: fix package dependency error [puppet] - https://gerrit.wikimedia.org/r/753857 (https://phabricator.wikimedia.org/T299177) (owner: Bking)
[07:37:08] <wikibugs>	 (PS3) KartikMistry: Deploy Flores MT [deployment-charts] - https://gerrit.wikimedia.org/r/751547 (https://phabricator.wikimedia.org/T298584)
[07:37:59] <wikibugs>	 (Abandoned) Gehel: elasticsearch: fix package dependency error [puppet] - https://gerrit.wikimedia.org/r/753851 (https://phabricator.wikimedia.org/T299177) (owner: Bking)
[07:39:28] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host pc2012.codfw.wmnet with OS bullseye
[07:39:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:40:44] <wikibugs>	 (PS2) Giuseppe Lavagetto: shellbox: remove useless files/stanzas [deployment-charts] - https://gerrit.wikimedia.org/r/753062
[07:41:07] <wikibugs>	 (CR) jerkins-bot: [V: -1] shellbox: remove useless files/stanzas [deployment-charts] - https://gerrit.wikimedia.org/r/753062 (owner: Giuseppe Lavagetto)
[07:41:17] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +2] safe-service-restart: make the default grace period 3 seconds [puppet] - https://gerrit.wikimedia.org/r/752600 (https://phabricator.wikimedia.org/T266055) (owner: Giuseppe Lavagetto)
[07:41:48] <wikibugs>	 (PS4) Gehel: elasticsearch: fix package dependency error [puppet] - https://gerrit.wikimedia.org/r/753857 (https://phabricator.wikimedia.org/T299177) (owner: Bking)
[07:44:13] <wikibugs>	 (PS1) Marostegui: Revert "pc2012: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/753836
[07:45:59] <wikibugs>	 (CR) Ideophagous: arywiki NS (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/747973 (https://phabricator.wikimedia.org/T291737) (owner: Ideophagous)
[07:48:04] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2009 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is WARNING: Test Get summary from storage responds with unexpected value at path = Missing keys: [thumbnail, originalimage] https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[07:48:22] <wikibugs>	 (CR) Gehel: [C: -1] "See minor dependency issue inline" [puppet] - https://gerrit.wikimedia.org/r/753857 (https://phabricator.wikimedia.org/T299177) (owner: Bking)
[07:53:49] <wikibugs>	 (CR) Gehel: [C: -1] "Re-reading that patch (and looking at PCC failures): profile::java is already required from profile::elasticsearch (which makes sense). So" [puppet] - https://gerrit.wikimedia.org/r/753857 (https://phabricator.wikimedia.org/T299177) (owner: Bking)
[07:55:14] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb={CREATE,UPDATE} https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[08:00:02] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubestagemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[08:00:04] <jouncebot>	 Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220114T0800)
[08:00:42] <icinga-wm>	 PROBLEM - SSH on kubernetes1004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:03:06] <icinga-wm>	 PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/definition/{title} (retrieve en-wiktionary definitions for cat) timed out before a response was received: /{domain}/v1/page/summary/{title} (Get summary for test page) is WARNING: Test Get summary for test page responds with unexpected value at path = Missing keys: [thumbnail, originalimage] https://wikitech.wikimedia.org/wiki/Mobileapps_%28service
[08:09:49] <wikibugs>	 (CR) Ryan Kemper: [C: +1] icinga: add multiple case for Gehel in Icinga authorization [puppet] - https://gerrit.wikimedia.org/r/752130 (owner: Gehel)
[08:10:21] <wikibugs>	 (PS2) Gehel: icinga: add multiple case for Gehel in Icinga authorization [puppet] - https://gerrit.wikimedia.org/r/752130
[08:11:37] <wikibugs>	 (CR) Gehel: [C: +2] icinga: add multiple case for Gehel in Icinga authorization [puppet] - https://gerrit.wikimedia.org/r/752130 (owner: Gehel)
[08:13:08] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2012 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is WARNING: Test Get summary from storage responds with unexpected value at path = Missing keys: [thumbnail, originalimage] https://wikitech.wikimedia.org/wiki/Services/Monitor
[08:13:08] <icinga-wm>	 base
[08:13:56] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2017 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is WARNING: Test Get summary from storage responds with unexpected value at path = Missing keys: [originalimage, thumbnail] https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[08:19:44] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for Nick Ray - https://phabricator.wikimedia.org/T299186 (SCherukuwada) Manager approves.
[08:21:51] <wikibugs>	 (CR) Marostegui: [C: +2] Revert "pc2012: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/753836 (owner: Marostegui)
[08:25:32] <icinga-wm>	 RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[08:27:51] <wikibugs>	 SRE, Infrastructure-Foundations, Mail: mx1001.wikimedia.org mail delivery timeouts - https://phabricator.wikimedia.org/T299107 (MoritzMuehlenhoff) >>! In T299107#7620550, @Platonides wrote: > @MoritzMuehlenhoff, did you see https://www.spinics.net/lists/stable/msg509296.html ? > Apparently upstream i...
[08:32:17] <wikibugs>	 (PS1) Marostegui: pc2013: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/753912 (https://phabricator.wikimedia.org/T299046)
[08:32:54] <wikibugs>	 (CR) Marostegui: [C: +2] pc2013: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/753912 (https://phabricator.wikimedia.org/T299046) (owner: Marostegui)
[08:33:41] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host pc2013.codfw.wmnet with OS bullseye
[08:33:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:34:45] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM cuminunpriv1001.eqiad.wmnet
[08:34:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:36:58] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM cuminunpriv1001.eqiad.wmnet
[08:37:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:39:39] <wikibugs>	 (CR) David Caro: wmcs: move grid-dedicated code to its own package (1 comment) [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/753769 (owner: Arturo Borrero Gonzalez)
[08:48:20] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={swagger_check_citoid_cluster_eqiad,trafficserver-upload,varnish-upload} site={drmrs,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:50:50] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2020 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is WARNING: Test Get summary from storage responds with unexpected value at path = Missing keys: [thumbnail, originalimage] https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[08:50:57] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM an-tool1005.eqiad.wmnet
[08:50:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:51:08] <icinga-wm>	 PROBLEM - Check systemd state on cp6002 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-varnish-exporter.service,varnishncsa.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:51:17] <moritzm>	 !log rebooting an-tool1007 (running turnilo.wikimedia.org)
[08:51:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:52:26] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2009 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is WARNING: Test Get summary from storage responds with unexpected value at path = Missing keys: [thumbnail, originalimage] https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[08:53:10] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM an-tool1005.eqiad.wmnet
[08:53:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:54:22] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2024 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is WARNING: Test Get summary from storage responds with unexpected value at path = Missing keys: [thumbnail, originalimage] https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[08:55:03] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM an-tool1007.eqiad.wmnet
[08:55:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:57:22] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM an-tool1007.eqiad.wmnet
[08:57:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:58:32] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM an-tool1008.eqiad.wmnet
[08:58:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:58:51] <moritzm>	 !log rebooting an-tool1008 (running yarn.wikimedia.org)
[08:58:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:59:28] <icinga-wm>	 PROBLEM - Check systemd state on an-tool1005 is CRITICAL: CRITICAL - degraded: The following units failed: ifup@ens5.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:00:44] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM an-tool1008.eqiad.wmnet
[09:00:44] <icinga-wm>	 RECOVERY - Check systemd state on an-tool1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:00:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:00:57] <moritzm>	 !log systemctl reset-failed ifup@ens5.service on an-tool1005 T273026
[09:00:59] <wikibugs>	 (PS5) Elukey: kafka: add check to test the Broker's TLS port [puppet] - https://gerrit.wikimedia.org/r/753738
[09:00:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:01:00] <stashbot>	 T273026: Errors for ifup@ens5.service after rebooting Ganeti VMs - https://phabricator.wikimedia.org/T273026
[09:01:15] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM an-tool1009.eqiad.wmnet
[09:01:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:01:26] <moritzm>	 !log rebooting an-tool1009 (running hue.wikimedia.org)
[09:01:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:03:31] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM an-tool1009.eqiad.wmnet
[09:03:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:04:06] <icinga-wm>	 PROBLEM - Host cp6002 is DOWN: PING CRITICAL - Packet loss = 100%
[09:05:06] <icinga-wm>	 RECOVERY - Host cp6002 is UP: PING OK - Packet loss = 0%, RTA = 86.10 ms
[09:05:15] <wikibugs>	 (PS6) Elukey: kafka: add check to test the Broker's TLS port [puppet] - https://gerrit.wikimedia.org/r/753738
[09:05:31] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2020 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is WARNING: Test Get summary from storage responds with unexpected value at path = Missing keys: [originalimage, thumbnail] https://wikitech.wikimedia.org/wiki/Services/Monitor
[09:05:31] <icinga-wm>	 base
[09:05:41] <wikibugs>	 SRE, Infrastructure-Foundations: Migrate eqiad Ganeti cluster to KVM machine type pc-i440fx-2.8 - https://phabricator.wikimedia.org/T294120 (MoritzMuehlenhoff)
[09:05:52] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host pc2013.codfw.wmnet with OS bullseye
[09:05:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:05:58] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2015 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is WARNING: Test Get summary from storage responds with unexpected value at path = Missing keys: [thumbnail, originalimage] https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[09:06:00] <icinga-wm>	 RECOVERY - Check systemd state on cp6002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:06:30] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2009 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is WARNING: Test Get summary from storage responds with unexpected value at path = Missing keys: [thumbnail, originalimage] https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[09:06:36] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 232, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[09:09:10] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2015 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is WARNING: Test Get summary from storage responds with unexpected value at path = Missing keys: [thumbnail, originalimage] https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[09:09:17] <wikibugs>	 (PS1) Marostegui: Revert "pc2013: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/753837
[09:09:26] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 44, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[09:10:36] <wikibugs>	 (PS7) Elukey: kafka: add check to test the Broker's TLS port [puppet] - https://gerrit.wikimedia.org/r/753738
[09:11:17] <marostegui>	 !log Move pc1014 from pc1 to pc2 T299046
[09:11:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:11:21] <stashbot>	 T299046: Upgrade parsercache infra to Bullseye - https://phabricator.wikimedia.org/T299046
[09:11:35] <wikibugs>	 (CR) jerkins-bot: [V: -1] kafka: add check to test the Broker's TLS port [puppet] - https://gerrit.wikimedia.org/r/753738 (owner: Elukey)
[09:19:05] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM an-test-client1001.eqiad.wmnet
[09:19:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:19:10] <wikibugs>	 (CR) Marostegui: [C: +2] Revert "pc2013: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/753837 (owner: Marostegui)
[09:21:46] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to ldap/wmf for Marco_Fossati - https://phabricator.wikimedia.org/T298766 (mfossati) Thanks for your comments @Dzahn , very useful!  > re: grafana.wikimedia.org - this should not actually need a login but when you click on "sign in" in the lower left corner, you sh...
[09:21:58] <wikibugs>	 (PS1) DCausse: rdf-streaming-updater: add support for WCQS [alerts] - https://gerrit.wikimedia.org/r/753915
[09:22:09] <icinga-wm>	 PROBLEM - Check systemd state on cp6010 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-varnish-exporter.service,varnishncsa.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:22:50] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM an-test-client1001.eqiad.wmnet
[09:22:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:26:43] <icinga-wm>	 PROBLEM - Host cp6010 is DOWN: PING CRITICAL - Packet loss = 100%
[09:27:13] <wikibugs>	 (PS8) Elukey: kafka: add check to test the Broker's TLS port [puppet] - https://gerrit.wikimedia.org/r/753738
[09:28:06] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM install1003.wikimedia.org
[09:28:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:29:56] <icinga-wm>	 RECOVERY - Check systemd state on cp6010 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:29:59] <icinga-wm>	 RECOVERY - Host cp6010 is UP: PING OK - Packet loss = 0%, RTA = 86.14 ms
[09:32:21] <wikibugs>	 (PS9) Elukey: kafka: add check to test the Broker's TLS port [puppet] - https://gerrit.wikimedia.org/r/753738
[09:32:27] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM install1003.wikimedia.org
[09:32:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:35:09] <wikibugs>	 SRE, Infrastructure-Foundations: Migrate eqiad Ganeti cluster to KVM machine type pc-i440fx-2.8 - https://phabricator.wikimedia.org/T294120 (MoritzMuehlenhoff)
[09:35:48] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM apt1001.wikimedia.org
[09:35:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:37:11] <wikibugs>	 (PS10) Elukey: kafka: add check to test the Broker's TLS port [puppet] - https://gerrit.wikimedia.org/r/753738
[09:38:01] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM apt1001.wikimedia.org
[09:38:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:38:30] <wikibugs>	 (PS11) Elukey: kafka: add check to test the Broker's TLS port [puppet] - https://gerrit.wikimedia.org/r/753738
[09:41:32] <wikibugs>	 (PS1) Vgutierrez: envoyproxy: Add stream_idle/request/request_headers timeout [puppet] - https://gerrit.wikimedia.org/r/753919 (https://phabricator.wikimedia.org/T271421)
[09:41:38] <wikibugs>	 (PS12) Elukey: kafka: add check to test the Broker's TLS port [puppet] - https://gerrit.wikimedia.org/r/753738
[09:42:56] <wikibugs>	 (PS13) Elukey: kafka: add check to test the Broker's TLS port [puppet] - https://gerrit.wikimedia.org/r/753738
[09:45:09] <wikibugs>	 (PS14) Elukey: kafka: add check to test the Broker's TLS port [puppet] - https://gerrit.wikimedia.org/r/753738
[09:45:25] <icinga-wm>	 PROBLEM - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is CRITICAL: /api/rest_v1/page/summary/{title} (Get summary from storage) is WARNING: Test Get summary from storage responds with unexpected value at path = Missing keys: [thumbnail, originalimage] https://wikitech.wikimedia.org/wiki/RESTBase
[09:45:51] <wikibugs>	 (CR) Elukey: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33245/console" [puppet] - https://gerrit.wikimedia.org/r/753738 (owner: Elukey)
[09:46:46] <wikibugs>	 (CR) Elukey: [V: +1] kafka: add check to test the Broker's TLS port (1 comment) [puppet] - https://gerrit.wikimedia.org/r/753738 (owner: Elukey)
[09:47:53] <wikibugs>	 (CR) ZPapierski: [C: +1] rdf-streaming-updater: add support for WCQS [alerts] - https://gerrit.wikimedia.org/r/753915 (owner: DCausse)
[09:49:59] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2014 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is WARNING: Test Get summary from storage responds with unexpected value at path = Missing keys: [thumbnail, originalimage] https://wikitech.wikimedia.org/wiki/Services/Monitor
[09:49:59] <icinga-wm>	 base
[09:53:11] <wikibugs>	 (PS1) Arturo Borrero Gonzalez: wmcs: toolforge: grid: depool_remove: fix internal hostname usage [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/753921
[09:54:05] <wikibugs>	 (CR) Vgutierrez: [V: +1] "PCC SUCCESS (NOOP 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33246/console" [puppet] - https://gerrit.wikimedia.org/r/753919 (https://phabricator.wikimedia.org/T271421) (owner: Vgutierrez)
[09:55:44] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM an-test-druid1001.eqiad.wmnet
[09:55:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:57:24] <icinga-wm>	 PROBLEM - Host cp6003 is DOWN: PING CRITICAL - Packet loss = 100%
[09:58:50] <icinga-wm>	 RECOVERY - Host cp6003 is UP: PING OK - Packet loss = 0%, RTA = 86.15 ms
[09:59:30] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM an-test-druid1001.eqiad.wmnet
[09:59:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:01:02] <wikibugs>	 SRE, Infrastructure-Foundations: Migrate eqiad Ganeti cluster to KVM machine type pc-i440fx-2.8 - https://phabricator.wikimedia.org/T294120 (MoritzMuehlenhoff)
[10:04:58] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM matomo1002.eqiad.wmnet
[10:05:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:05:13] <moritzm>	 !log rebooting matomo1002 (running piwik.wikimedia.org)
[10:05:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:06:48] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[10:07:13] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM matomo1002.eqiad.wmnet
[10:07:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:10:42] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2009 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is WARNING: Test Get summary from storage responds with unexpected value at path = Missing keys: [thumbnail, originalimage] https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[10:14:27] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +2] "Will talk later to Volans." [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/753769 (owner: Arturo Borrero Gonzalez)
[10:15:58] <icinga-wm>	 PROBLEM - SSH on restbase2011.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[10:17:21] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM an-test-presto1001.eqiad.wmnet
[10:17:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:17:49] <wikibugs>	 (Merged) jenkins-bot: wmcs: move grid-dedicated code to its own package [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/753769 (owner: Arturo Borrero Gonzalez)
[10:19:00] <wikibugs>	 (CR) DCausse: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/752737 (owner: Ebernhardson)
[10:21:44] <icinga-wm>	 PROBLEM - Check systemd state on cp6011 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-varnish-exporter.service,varnishncsa.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:21:57] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM an-test-presto1001.eqiad.wmnet
[10:21:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:26:18] <wikibugs>	 (PS3) Muehlenhoff: Make build2001 a build host [puppet] - https://gerrit.wikimedia.org/r/751146
[10:29:48] <icinga-wm>	 PROBLEM - Host cp6011 is DOWN: PING CRITICAL - Packet loss = 100%
[10:30:38] <icinga-wm>	 RECOVERY - Host cp6011 is UP: PING OK - Packet loss = 0%, RTA = 86.09 ms
[10:30:48] <icinga-wm>	 RECOVERY - Check systemd state on cp6011 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:31:56] <icinga-wm>	 PROBLEM - purged service on cp6011 is CRITICAL: CRITICAL - Expecting active but unit purged is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[10:33:26] <icinga-wm>	 RECOVERY - purged service on cp6011 is OK: OK - purged is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[10:38:43] <wikibugs>	 (PS1) Jbond: heiradata - cloud: update email address [puppet] - https://gerrit.wikimedia.org/r/753925
[10:40:12] <wikibugs>	 (CR) Jbond: [C: +2] heiradata - cloud: update email address [puppet] - https://gerrit.wikimedia.org/r/753925 (owner: Jbond)
[10:42:30] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM an-test-ui1001.eqiad.wmnet
[10:42:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:42:35] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to ldap/wmf for Marco_Fossati - https://phabricator.wikimedia.org/T298766 (cmooney) In progress→Resolved Thanks @mfossati for the feedback, and indeed @Dzahn for the detailed info, appreciate it.
[10:43:05] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/753919 (https://phabricator.wikimedia.org/T271421) (owner: Vgutierrez)
[10:43:54] <icinga-wm>	 PROBLEM - SSH on mw2254.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[10:44:26] <icinga-wm>	 PROBLEM - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is CRITICAL: /api/rest_v1/page/summary/{title} (Get summary from storage) is WARNING: Test Get summary from storage responds with unexpected value at path = Missing keys: [thumbnail, originalimage] https://wikitech.wikimedia.org/wiki/RESTBase
[10:47:27] <wikibugs>	 (PS2) DCausse: wcqs: Deploy streaming updater [puppet] - https://gerrit.wikimedia.org/r/752737 (owner: Ebernhardson)
[10:50:15] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM an-test-ui1001.eqiad.wmnet
[10:50:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:50:52] <moritzm>	 !log systemctl reset-failed ifup@ens5.service on an-test-ui1001 T273026
[10:50:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:50:55] <stashbot>	 T273026: Errors for ifup@ens5.service after rebooting Ganeti VMs - https://phabricator.wikimedia.org/T273026
[10:51:37] <wikibugs>	 (CR) DCausse: [C: +1] cirrussearch: Reenable saneitizer [puppet] - https://gerrit.wikimedia.org/r/752724 (https://phabricator.wikimedia.org/T295705) (owner: Ebernhardson)
[10:51:55] <wikibugs>	 (PS11) Jbond: O:rpkivalidator: add bgpalerter to rpki servers [puppet] - https://gerrit.wikimedia.org/r/753450 (https://phabricator.wikimedia.org/T230600)
[10:52:04] <wikibugs>	 SRE, Infrastructure-Foundations: Migrate eqiad Ganeti cluster to KVM machine type pc-i440fx-2.8 - https://phabricator.wikimedia.org/T294120 (MoritzMuehlenhoff)
[10:53:27] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to Superset for Margeigh Novotny - https://phabricator.wikimedia.org/T299072 (cmooney) @MNovotny_WMF apologies for the delay processing this.  Checking your existing access I believe you should already be able to log in to Superset, is that correct?  There are...
[10:54:43] <icinga-wm>	 PROBLEM - Host cp6004 is DOWN: PING CRITICAL - Packet loss = 100%
[10:55:03] <logmsgbot>	 !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host elastic2051.codfw.wmnet with OS stretch
[10:55:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:55:11] <wikibugs>	 SRE, ops-codfw, Discovery-Search, Patch-For-Review: Degraded RAID on elastic2051 - https://phabricator.wikimedia.org/T298674 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host elastic2051.codfw.wmnet with OS stretch executed with errors: - elastic2051 (*...
[10:55:39] <icinga-wm>	 RECOVERY - Host cp6004 is UP: PING OK - Packet loss = 0%, RTA = 86.22 ms
[10:55:40] <wikibugs>	 (PS1) Jbond: ihieradata - bgpalerter: update email group [puppet] - https://gerrit.wikimedia.org/r/753927
[10:55:54] <wikibugs>	 (CR) Jbond: [V: +2 C: +2] ihieradata - bgpalerter: update email group [puppet] - https://gerrit.wikimedia.org/r/753927 (owner: Jbond)
[10:56:02] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM archiva1002.wikimedia.org
[10:56:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:56:17] <moritzm>	 !log rebooting archiva1002 (running archiva.wikimedia.org)
[10:56:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:56:44] <wikibugs>	 (CR) DCausse: [C: +1] wcqs: Deploy streaming updater [puppet] - https://gerrit.wikimedia.org/r/752737 (owner: Ebernhardson)
[10:57:38] <wikibugs>	 (CR) Kormat: [C: +1] "Good call re: linux-swap. LGTM." [puppet] - https://gerrit.wikimedia.org/r/753781 (https://phabricator.wikimedia.org/T295375) (owner: Hnowlan)
[10:59:36] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for Nick Ray - https://phabricator.wikimedia.org/T299186 (cmooney) Have asked user to send me SSH key out of band to verify.
[11:00:39] <moritzm>	 !log systemctl reset-failed ifup@ens5.service on archiva1002 T273026
[11:00:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:00:43] <stashbot>	 T273026: Errors for ifup@ens5.service after rebooting Ganeti VMs - https://phabricator.wikimedia.org/T273026
[11:01:17] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM archiva1002.wikimedia.org
[11:01:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:03:54] <wikibugs>	 (CR) Giuseppe Lavagetto: [V: +1] envoy: make the choice of api version explicit (1 comment) [puppet] - https://gerrit.wikimedia.org/r/751717 (owner: Giuseppe Lavagetto)
[11:04:03] <icinga-wm>	 PROBLEM - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is CRITICAL: /api/rest_v1/page/summary/{title} (Get summary from storage) is WARNING: Test Get summary from storage responds with unexpected value at path = Missing keys: [thumbnail, originalimage] https://wikitech.wikimedia.org/wiki/RESTBase
[11:06:49] <wikibugs>	 (CR) Vgutierrez: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/751717 (owner: Giuseppe Lavagetto)
[11:07:03] <wikibugs>	 (CR) Vgutierrez: [V: +1 C: +2] envoyproxy: Add stream_idle/request/request_headers timeout [puppet] - https://gerrit.wikimedia.org/r/753919 (https://phabricator.wikimedia.org/T271421) (owner: Vgutierrez)
[11:14:52] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +2] Cleanup: remove the extract method, now unused. [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/699716 (owner: Giuseppe Lavagetto)
[11:15:42] <wikibugs>	 SRE, Infrastructure-Foundations: Migrate eqiad Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296721 (MoritzMuehlenhoff)
[11:16:45] <wikibugs>	 (Merged) jenkins-bot: Cleanup: remove the extract method, now unused. [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/699716 (owner: Giuseppe Lavagetto)
[11:16:57] <icinga-wm>	 RECOVERY - SSH on restbase2011.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[11:18:41] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti1023.eqiad.wmnet with OS buster
[11:18:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:18:47] <wikibugs>	 SRE, ops-eqiad, DC-Ops: (Need By: TBD) rack/setup/install ganeti102[34] - https://phabricator.wikimedia.org/T283036 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti1023.eqiad.wmnet with OS buster
[11:21:45] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +2] Image module refactoring (step 1) [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/699717 (owner: Giuseppe Lavagetto)
[11:23:49] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2014 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is WARNING: Test Get summary from storage responds with unexpected value at path = Missing keys: [originalimage, thumbnail] https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:27:08] <wikibugs>	 (PS1) Kormat: wmfdb/mycnf: Add CnfSelector [software/wmfdb] - https://gerrit.wikimedia.org/r/753931
[11:27:27] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to Data Engineering team resources for Sandra Ebele Nwachukwu - https://phabricator.wikimedia.org/T298786 (cmooney) Open→Resolved Sandra confirmed access working over Slack.  Please re-open if there are any problems, I will resolve this now.  Thanks.
[11:27:43] <wikibugs>	 (CR) Hnowlan: [C: +2] partman: remove reuse-test from restbase2009, use linux-swap [puppet] - https://gerrit.wikimedia.org/r/753781 (https://phabricator.wikimedia.org/T295375) (owner: Hnowlan)
[11:29:11] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2019 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is WARNING: Test Get summary from storage responds with unexpected value at path = Missing keys: [thumbnail, originalimage] https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:29:21] <icinga-wm>	 PROBLEM - Restbase edge codfw on text-lb.codfw.wikimedia.org is CRITICAL: /api/rest_v1/page/summary/{title} (Get summary from storage) is WARNING: Test Get summary from storage responds with unexpected value at path = Missing keys: [thumbnail, originalimage] https://wikitech.wikimedia.org/wiki/RESTBase
[11:32:29] <wikibugs>	 (PS2) Kormat: wmfdb/mycnf: Add CnfSelector [software/wmfdb] - https://gerrit.wikimedia.org/r/753931
[11:32:33] <icinga-wm>	 PROBLEM - Check systemd state on cp6012 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-varnish-exporter.service,varnishncsa.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:33:04] <wikibugs>	 (Merged) jenkins-bot: Image module refactoring (step 1) [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/699717 (owner: Giuseppe Lavagetto)
[11:35:51] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2017 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is WARNING: Test Get summary from storage responds with unexpected value at path = Missing keys: [originalimage, thumbnail] https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:37:03] <icinga-wm>	 PROBLEM - Host cp6012 is DOWN: PING CRITICAL - Packet loss = 100%
[11:37:31] <icinga-wm>	 RECOVERY - Check systemd state on cp6012 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:37:33] <icinga-wm>	 RECOVERY - Host cp6012 is UP: PING OK - Packet loss = 0%, RTA = 86.16 ms
[11:38:13] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2012 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is WARNING: Test Get summary from storage responds with unexpected value at path = Missing keys: [originalimage, thumbnail] https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:40:31] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2012 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is WARNING: Test Get summary from storage responds with unexpected value at path = Missing keys: [originalimage, thumbnail] https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:42:07] <wikibugs>	 (PS3) Giuseppe Lavagetto: Slim down DockerDriver [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/699718
[11:42:41] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2017 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is WARNING: Test Get summary from storage responds with unexpected value at path = Missing keys: [thumbnail, originalimage] https://wikitech.wikimedia.org/wiki/Services/Monitor
[11:42:41] <icinga-wm>	 base
[11:45:21] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti1023.eqiad.wmnet with OS buster
[11:45:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:45:25] <wikibugs>	 SRE, ops-eqiad, DC-Ops: (Need By: TBD) rack/setup/install ganeti102[34] - https://phabricator.wikimedia.org/T283036 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti1023.eqiad.wmnet with OS buster completed: - ganeti1023 (**PASS**)   - Downtimed on Ici...
[11:45:32] <hnowlan>	 looking at restbase issues, seeing timeouts to some services (mathoid and parsoid so far)
[11:46:25] <icinga-wm>	 PROBLEM - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is CRITICAL: /api/rest_v1/page/summary/{title} (Get summary from storage) is WARNING: Test Get summary from storage responds with unexpected value at path = Missing keys: [thumbnail, originalimage] https://wikitech.wikimedia.org/wiki/RESTBase
[11:47:25] <wikibugs>	 (PS1) Arturo Borrero Gonzalez: wmcs: toolforge: grid: improve get_node_info() error reporting [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/753935
[11:47:27] <wikibugs>	 (PS1) Arturo Borrero Gonzalez: wmcs: toolforge: grid: add cookbook to query grid node information [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/753936
[11:48:05] <wikibugs>	 (PS1) Vgutierrez: cache::envoy: Set stream_idle/request/request_headers timeout [puppet] - https://gerrit.wikimedia.org/r/753937 (https://phabricator.wikimedia.org/T271421)
[11:48:57] <logmsgbot>	 !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on restbase2009.codfw.wmnet with reason: not in restbase cluster, used for testing
[11:48:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:48:59] <logmsgbot>	 !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on restbase2009.codfw.wmnet with reason: not in restbase cluster, used for testing
[11:49:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:51:01] <wikibugs>	 (PS2) Arturo Borrero Gonzalez: wmcs: toolforge: grid: depool_remove: fix internal hostname usage [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/753921
[11:51:37] <logmsgbot>	 !log hnowlan@cumin1001 START - Cookbook sre.hosts.reimage for host restbase2009.codfw.wmnet with OS buster
[11:51:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:53:03] <wikibugs>	 (PS3) Arturo Borrero Gonzalez: wmcs: toolforge: grid: depool_remove: fix internal hostname usage [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/753921
[11:54:16] <wikibugs>	 (PS2) Vgutierrez: cache::envoy: Set stream_idle/request/request_headers timeout [puppet] - https://gerrit.wikimedia.org/r/753937 (https://phabricator.wikimedia.org/T271421)
[11:54:48] <wikibugs>	 (PS4) Arturo Borrero Gonzalez: wmcs: toolforge: grid: depool_remove: fix internal hostname usage [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/753921
[11:55:04] <wikibugs>	 (CR) Vgutierrez: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33250/console" [puppet] - https://gerrit.wikimedia.org/r/753937 (https://phabricator.wikimedia.org/T271421) (owner: Vgutierrez)
[11:55:48] <icinga-wm>	 PROBLEM - Ensure traffic_exporter for the backend instance binds on port 9122 and responds to HTTP requests on cp6005 is CRITICAL: connect to address 10.136.0.8 and port 9122: Connection refused https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[11:56:36] <vgutierrez>	 mmandere: ^^ please make sure that cp6 hosts are properly downtimed on icinga :)
[11:57:42] <mmandere>	 vgutierrez: got it
[11:58:10] <icinga-wm>	 RECOVERY - Ensure traffic_exporter for the backend instance binds on port 9122 and responds to HTTP requests on cp6005 is OK: HTTP OK: HTTP/1.0 200 OK - 25331 bytes in 0.264 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[12:02:39] <wikibugs>	 (PS2) Arturo Borrero Gonzalez: wmcs: toolforge: grid: improve get_node_info() error reporting [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/753935
[12:02:41] <wikibugs>	 (PS2) Arturo Borrero Gonzalez: wmcs: toolforge: grid: add cookbook to query grid node information [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/753936
[12:02:43] <wikibugs>	 (PS5) Arturo Borrero Gonzalez: wmcs: toolforge: grid: depool_remove: fix internal hostname usage [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/753921
[12:04:28] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2017 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is WARNING: Test Get summary from storage responds with unexpected value at path = Missing keys: [thumbnail, originalimage] https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:09:07] <wikibugs>	 (CR) Vgutierrez: [V: +1 C: +2] cache::envoy: Set stream_idle/request/request_headers timeout [puppet] - https://gerrit.wikimedia.org/r/753937 (https://phabricator.wikimedia.org/T271421) (owner: Vgutierrez)
[12:10:38] <icinga-wm>	 PROBLEM - Host cp6005 is DOWN: PING CRITICAL - Packet loss = 100%
[12:12:24] <icinga-wm>	 RECOVERY - Host cp6005 is UP: PING OK - Packet loss = 0%, RTA = 86.09 ms
[12:16:53] <wikibugs>	 (PS6) Arturo Borrero Gonzalez: wmcs: toolforge: grid: depool_remove: fix internal hostname usage [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/753921
[12:18:53] <logmsgbot>	 !log hnowlan@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host restbase2009.codfw.wmnet with OS buster
[12:18:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:20:08] <logmsgbot>	 !log hnowlan@cumin1001 START - Cookbook sre.hosts.reimage for host restbase2009.codfw.wmnet with OS buster
[12:20:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:21:04] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2017 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is WARNING: Test Get summary from storage responds with unexpected value at path = Missing keys: [originalimage, thumbnail] https://wikitech.wikimedia.org/wiki/Services/Monitor
[12:21:05] <icinga-wm>	 base
[12:22:12] <icinga-wm>	 PROBLEM - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is CRITICAL: /api/rest_v1/page/summary/{title} (Get summary from storage) is WARNING: Test Get summary from storage responds with unexpected value at path = Missing keys: [thumbnail, originalimage] https://wikitech.wikimedia.org/wiki/RESTBase
[12:22:55] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti1024.eqiad.wmnet with OS buster
[12:22:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:23:01] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install ganeti102[34] - https://phabricator.wikimedia.org/T283036 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti1024.eqiad.wmnet with OS buster
[12:25:52] <icinga-wm>	 PROBLEM - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp6013 is CRITICAL: connect to address 10.136.0.12 and port 3128: Connection refused https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[12:25:52] <icinga-wm>	 PROBLEM - Varnish HTTP text-frontend - port 3120 on cp6013 is CRITICAL: connect to address 10.136.0.12 and port 3120: Connection refused https://wikitech.wikimedia.org/wiki/Varnish
[12:25:52] <icinga-wm>	 PROBLEM - ats-tls HTTPS wikiworkshop.org ECDSA on cp6013 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS
[12:28:08] <icinga-wm>	 RECOVERY - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp6013 is OK: HTTP OK: HTTP/1.1 200 Ok - 33593 bytes in 0.273 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[12:28:08] <icinga-wm>	 RECOVERY - ats-tls HTTPS wikiworkshop.org ECDSA on cp6013 is OK: SSL OK - OCSP staple validity for wikiworkshop.org has 189111 seconds left:Certificate wikiworkshop.org contains all required SANs:Certificate wikiworkshop.org (ECDSA) valid until 2022-04-04 16:55:58 +0000 (expires in 80 days) https://wikitech.wikimedia.org/wiki/HTTPS
[12:29:08] <icinga-wm>	 RECOVERY - Varnish HTTP text-frontend - port 3120 on cp6013 is OK: HTTP OK: HTTP/1.1 200 OK - 470 bytes in 0.173 second response time https://wikitech.wikimedia.org/wiki/Varnish
[12:34:07] <wikibugs>	 (03PS1) 10Hnowlan: restbase: remove restbase2009 [puppet] - 10https://gerrit.wikimedia.org/r/753942 (https://phabricator.wikimedia.org/T295375)
[12:37:50] <wikibugs>	 (03CR) 10David Caro: [C: 03+1] wmcs: toolforge: grid: improve get_node_info() error reporting (031 comment) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/753935 (owner: 10Arturo Borrero Gonzalez)
[12:40:58] <icinga-wm>	 PROBLEM - Host cp6013 is DOWN: PING CRITICAL - Packet loss = 100%
[12:41:32] <icinga-wm>	 RECOVERY - Host cp6013 is UP: PING OK - Packet loss = 0%, RTA = 86.12 ms
[12:41:32] <icinga-wm>	 PROBLEM - SSH on restbase2010.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[12:43:10] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2019 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is WARNING: Test Get summary from storage responds with unexpected value at path = Missing keys: [originalimage, thumbnail] https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:44:26] <icinga-wm>	 RECOVERY - SSH on mw2254.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[12:49:14] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti1024.eqiad.wmnet with OS buster
[12:49:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:49:19] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install ganeti102[34] - https://phabricator.wikimedia.org/T283036 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti1024.eqiad.wmnet with OS buster completed: - ganeti1024 (**PASS**)   - Downtimed on Ici...
[12:51:12] <logmsgbot>	 !log hnowlan@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host restbase2009.codfw.wmnet with OS buster
[12:51:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:53:03] <logmsgbot>	 !log hnowlan@cumin1001 START - Cookbook sre.hosts.reimage for host restbase2009.codfw.wmnet with OS buster
[12:53:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:58:42] <wikibugs>	 (03PS1) 10Marostegui: pc2011: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/753943 (https://phabricator.wikimedia.org/T299046)
[12:59:29] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] pc2011: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/753943 (https://phabricator.wikimedia.org/T299046) (owner: 10Marostegui)
[12:59:45] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host pc2011.codfw.wmnet with OS bullseye
[12:59:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:06:06] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2012 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[13:06:18] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[13:06:20] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[13:06:42] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2015 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[13:06:48] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2024 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[13:06:56] <icinga-wm>	 RECOVERY - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase
[13:07:00] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2014 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[13:07:58] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2023 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[13:07:58] <icinga-wm>	 RECOVERY - Restbase edge codfw on text-lb.codfw.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase
[13:10:09] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Make build2001 a build host [puppet] - 10https://gerrit.wikimedia.org/r/751146 (owner: 10Muehlenhoff)
[13:18:08] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[13:20:09] <wikibugs>	 (03PS3) 10Gehel: wcqs: Deploy streaming updater [puppet] - 10https://gerrit.wikimedia.org/r/752737 (owner: 10Ebernhardson)
[13:20:15] <logmsgbot>	 !log hnowlan@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host restbase2009.codfw.wmnet with OS buster
[13:20:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:22:04] <wikibugs>	 (03CR) 10Gehel: [C: 03+2] wcqs: Deploy streaming updater [puppet] - 10https://gerrit.wikimedia.org/r/752737 (owner: 10Ebernhardson)
[13:25:20] <icinga-wm>	 PROBLEM - package builder rsync on build2001 is CRITICAL: connect to address 10.192.32.77 and port 873: Connection refused https://wikitech.wikimedia.org/wiki/Debian_Packaging%23Upload_to_Wikimedia_Repo
[13:26:25] <icinga-wm>	 RECOVERY - package builder rsync on build2001 is OK: TCP OK - 0.033 second response time on 10.192.32.77 port 873 https://wikitech.wikimedia.org/wiki/Debian_Packaging%23Upload_to_Wikimedia_Repo
[13:31:59] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host pc2011.codfw.wmnet with OS bullseye
[13:32:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:37:45] <wikibugs>	 (03PS1) 10Gehel: query_service: make journal configureable for streaming updater [puppet] - 10https://gerrit.wikimedia.org/r/753945
[13:39:27] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] query_service: make journal configureable for streaming updater [puppet] - 10https://gerrit.wikimedia.org/r/753945 (owner: 10Gehel)
[13:40:42] <_joe_>	 sigh, the pages acknowledgement expired
[13:40:48] <sobanski>	 I re-acked
[13:40:53] <moritzm>	 ack
[13:40:57] <wikibugs>	 (03PS2) 10Gehel: query_service: make journal configureable for streaming updater [puppet] - 10https://gerrit.wikimedia.org/r/753945
[13:40:58] <sobanski>	 And I'll re-ask, can we resolve these?
[13:41:06] <wikibugs>	 (03PS1) 10BBlack: varnish: Remove outdated cluster scale conditional [puppet] - 10https://gerrit.wikimedia.org/r/753966
[13:41:20] <question_mark>	 bblack: ^
[13:41:51] <sobanski>	 I guess I can try, worst case they'll fire again ;)
[13:42:14] <icinga-wm>	 RECOVERY - SSH on restbase2010.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[13:42:35] <bblack>	 can they not be fully-resolved?
[13:43:00] <sobanski>	 I just did
[13:43:01] <bblack>	 the icinga alert that triggered the page is now downtimed for a month, and we don't expect it to succeed/recover before then
[13:43:13] <sobanski>	 All good then
[13:43:59] <wikibugs>	 (03PS3) 10Gehel: query_service: make journal configureable for streaming updater [puppet] - 10https://gerrit.wikimedia.org/r/753945
[13:50:59] <wikibugs>	 (03PS3) 10Kormat: wmfdb/mycnf: Add CnfSelector [software/wmfdb] - 10https://gerrit.wikimedia.org/r/753931
[13:51:01] <wikibugs>	 (03PS1) 10Kormat: wmfdb/mycnf: Set unix_socket OR port, not both. [software/wmfdb] - 10https://gerrit.wikimedia.org/r/753967
[13:53:02] <wikibugs>	 (03CR) 10DCausse: [C: 03+1] query_service: make journal configureable for streaming updater [puppet] - 10https://gerrit.wikimedia.org/r/753945 (owner: 10Gehel)
[13:53:23] <wikibugs>	 (03CR) 10Gehel: [C: 03+2] query_service: make journal configureable for streaming updater [puppet] - 10https://gerrit.wikimedia.org/r/753945 (owner: 10Gehel)
[14:16:16] <wikibugs>	 10SRE-swift-storage, 10Commons, 10Internet-Archive: Error: 503, Backend fetch failed, while the file uploaded fine - https://phabricator.wikimedia.org/T299220 (10Aklapper)
[14:16:19] <wikibugs>	 (03PS1) 10Muehlenhoff: Actually switch build2001 to bullseye [puppet] - 10https://gerrit.wikimedia.org/r/753968
[14:18:36] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Actually switch build2001 to bullseye [puppet] - 10https://gerrit.wikimedia.org/r/753968 (owner: 10Muehlenhoff)
[14:36:54] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Migrate eqiad Ganeti cluster to KVM machine type pc-i440fx-2.8 - https://phabricator.wikimedia.org/T294120 (10MoritzMuehlenhoff) >>! In T294120#7618187, @Platonides wrote: > Wouldn't setting kvm:machine_version=pc-i440fx-2.8 as a [global parameter](https://docs.ganeti.org/d...
[14:37:33] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Migrate eqiad Ganeti cluster to KVM machine type pc-i440fx-2.8 - https://phabricator.wikimedia.org/T294120 (10MoritzMuehlenhoff)
[14:43:47] <wikibugs>	 (03PS1) 104nn1l2: fawiki: Add flow-delete right to eliminators [mediawiki-config] - 10https://gerrit.wikimedia.org/r/753969 (https://phabricator.wikimedia.org/T299223)
[14:44:50] <wikibugs>	 10SRE, 10Discovery: Ban elastic2035 from prod elastic clusters - https://phabricator.wikimedia.org/T299151 (10bking) 05Resolved→03In progress
[14:53:51] <wikibugs>	 (03PS1) 10Muehlenhoff: Add component/cassandradev for stretch and buster [puppet] - 10https://gerrit.wikimedia.org/r/753971 (https://phabricator.wikimedia.org/T298805)
[14:55:43] <wikibugs>	 (03CR) 10Klausman: [C: 03+1] "One very minor nit, other than that LGTM" [software/wmfdb] - 10https://gerrit.wikimedia.org/r/753931 (owner: 10Kormat)
[14:57:43] <wikibugs>	 (03CR) 10Klausman: [C: 03+1] "One clarification question, otherwise LGTM." [software/wmfdb] - 10https://gerrit.wikimedia.org/r/753967 (owner: 10Kormat)
[14:58:28] <wikibugs>	 (03PS12) 10David Caro: Added nfs/migrate_service.py [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/753612 (https://phabricator.wikimedia.org/T293800) (owner: 10Andrew Bogott)
[14:59:33] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Add component/cassandradev for stretch and buster [puppet] - 10https://gerrit.wikimedia.org/r/753971 (https://phabricator.wikimedia.org/T298805) (owner: 10Muehlenhoff)
[15:00:17] <bblack>	 !log silenced site=drmrs in alertmanager, I think
[15:00:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:00:27] <bblack>	 !log silenced site=drmrs in alertmanager for one month, I think
[15:00:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:02:52] <wikibugs>	 10SRE, 10Data-Engineering, 10Generated Data Platform, 10Platform Engineering, 10Patch-For-Review: Import Debian package of Cassandra 3.11.11 as 'dev' version - https://phabricator.wikimedia.org/T298805 (10MoritzMuehlenhoff) I added component/cassandradev for buster and stretch. For the import we can eith...
[15:03:44] <icinga-wm>	 RECOVERY - SSH on kubernetes1004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[15:06:28] <wikibugs>	 (03CR) 10Kormat: wmfdb/mycnf: Set unix_socket OR port, not both. (031 comment) [software/wmfdb] - 10https://gerrit.wikimedia.org/r/753967 (owner: 10Kormat)
[15:12:21] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+2] Slim down DockerDriver [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/699718 (owner: 10Giuseppe Lavagetto)
[15:14:02] <wikibugs>	 (03Merged) 10jenkins-bot: Slim down DockerDriver [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/699718 (owner: 10Giuseppe Lavagetto)
[15:14:18] <wikibugs>	 (03CR) 10Klausman: [C: 03+1] wmfdb/mycnf: Set unix_socket OR port, not both. (031 comment) [software/wmfdb] - 10https://gerrit.wikimedia.org/r/753967 (owner: 10Kormat)
[15:17:38] <wikibugs>	 (03CR) 10Kormat: wmfdb/mycnf: Add CnfSelector (031 comment) [software/wmfdb] - 10https://gerrit.wikimedia.org/r/753931 (owner: 10Kormat)
[15:18:41] <wikibugs>	 (03PS1) 10DCausse: wcqs: set QUERY_SERVICE env name with wcqs/wdqs [puppet] - 10https://gerrit.wikimedia.org/r/753973
[15:23:19] <bblack>	 we're shortly going to puppetize lvs6001 in drmrs
[15:23:34] <wikibugs>	 (03CR) 10ZPapierski: [C: 03+1] wcqs: set QUERY_SERVICE env name with wcqs/wdqs [puppet] - 10https://gerrit.wikimedia.org/r/753973 (owner: 10DCausse)
[15:23:46] <bblack>	 this will likely define/cause some false alarms related to drmrs and/or lvs6001 pybal, etc., until we find them and get them downtimed or silenced
[15:28:01] <jinxer-wm>	 (CirrusSearchJVMGCOldPoolFlatlined) firing: (2) Elasticsearch instance elastic2051-production-search-codfw is showing memory pressure in the old pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell  - https://alerts.wikimedia.org
[15:28:08] <wikibugs>	 (03PS1) 10Hnowlan: partman: don't format swap volume [puppet] - 10https://gerrit.wikimedia.org/r/753975 (https://phabricator.wikimedia.org/T295375)
[15:29:09] <logmsgbot>	 !log bblack@cumin1001 conftool action : set/pooled=yes; selector: dc=drmrs
[15:29:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:32:22] <wikibugs>	 (03PS3) 10Giuseppe Lavagetto: Introduce DriverInterface [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/699719
[15:33:01] <jinxer-wm>	 (CirrusSearchJVMGCOldPoolFlatlined) firing: (2) Elasticsearch instance elastic2051-production-search-codfw is showing memory pressure in the old pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell  - https://alerts.wikimedia.org
[15:33:02] <icinga-wm>	 RECOVERY - DPKG on elastic2051 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[15:33:49] <wikibugs>	 (03CR) 10Klausman: [C: 03+1] wmfdb/mycnf: Add CnfSelector (031 comment) [software/wmfdb] - 10https://gerrit.wikimedia.org/r/753931 (owner: 10Kormat)
[15:35:45] <wikibugs>	 (03CR) 10David Caro: "There's a couple things, but mostly nits" [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/753612 (https://phabricator.wikimedia.org/T293800) (owner: 10Andrew Bogott)
[15:39:09] <bblack>	 !log lvs6001 + all services downtimed
[15:39:09] <wikibugs>	 (03CR) 10Kormat: [C: 03+1] partman: don't format swap volume [puppet] - 10https://gerrit.wikimedia.org/r/753975 (https://phabricator.wikimedia.org/T295375) (owner: 10Hnowlan)
[15:39:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:39:39] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+2] partman: don't format swap volume [puppet] - 10https://gerrit.wikimedia.org/r/753975 (https://phabricator.wikimedia.org/T295375) (owner: 10Hnowlan)
[15:40:39] <logmsgbot>	 !log hnowlan@cumin1001 START - Cookbook sre.hosts.reimage for host restbase2009.codfw.wmnet with OS buster
[15:40:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:43:08] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=pybal site=drmrs https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[15:44:18] <wikibugs>	 (03CR) 10Kormat: [C: 03+2] wmfdb/mycnf: Add CnfSelector [software/wmfdb] - 10https://gerrit.wikimedia.org/r/753931 (owner: 10Kormat)
[15:44:24] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 45, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[15:44:25] <wikibugs>	 (03CR) 10Kormat: [C: 03+2] wmfdb/mycnf: Set unix_socket OR port, not both. [software/wmfdb] - 10https://gerrit.wikimedia.org/r/753967 (owner: 10Kormat)
[15:46:14] <wikibugs>	 (03Merged) 10jenkins-bot: wmfdb/mycnf: Add CnfSelector [software/wmfdb] - 10https://gerrit.wikimedia.org/r/753931 (owner: 10Kormat)
[15:46:16] <wikibugs>	 (03Merged) 10jenkins-bot: wmfdb/mycnf: Set unix_socket OR port, not both. [software/wmfdb] - 10https://gerrit.wikimedia.org/r/753967 (owner: 10Kormat)
[15:50:16] <wikibugs>	 (03CR) 10Ema: "According to Chris' comment in 997e257d the whole block should have been deleted in October 2020, so I'd propose we just do that instead." [puppet] - 10https://gerrit.wikimedia.org/r/753966 (owner: 10BBlack)
[16:04:06] <logmsgbot>	 !log hnowlan@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host restbase2009.codfw.wmnet with OS buster
[16:04:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:04:44] <wikibugs>	 (03CR) 10BBlack: varnish: Remove outdated cluster scale conditional (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/753966 (owner: 10BBlack)
[16:06:04] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 233, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[16:07:35] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops: Cloud IPv6 subnets - https://phabricator.wikimedia.org/T187929 (10Tks4Fish) @faidon are there any updates on this? We've been discussing tooling to help with steward workflow, but they are highly dependent on IPv6. If that is already a problem locally, globally...
[16:10:58] <dancy>	 jouncebot now
[16:10:58] <jouncebot>	 For the next 15 hour(s) and 49 minute(s): No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220114T0800)
[16:15:45] <logmsgbot>	 !log dancy@deploy1002 Synchronized README: Testing php-fpm restart (duration: 03m 18s)
[16:15:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:20:26] <wikibugs>	 (03PS1) 10Bking: elasticsearch: fix package dependency issue [puppet] - 10https://gerrit.wikimedia.org/r/753983 (https://phabricator.wikimedia.org/T299177)
[16:21:02] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] elasticsearch: fix package dependency issue [puppet] - 10https://gerrit.wikimedia.org/r/753983 (https://phabricator.wikimedia.org/T299177) (owner: 10Bking)
[16:21:32] <icinga-wm>	 PROBLEM - Check systemd state on cp6007 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-varnish-exporter.service,varnishncsa.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:21:46] <icinga-wm>	 PROBLEM - Check systemd state on cp6008 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-varnish-exporter.service,varnishncsa.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:21:52] <icinga-wm>	 PROBLEM - Check systemd state on cp6015 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-varnish-exporter.service,varnishncsa.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:24:22] <wikibugs>	 (03PS1) 10Bking: elasticsearch: fix package dependency issue [puppet] - 10https://gerrit.wikimedia.org/r/753985 (https://phabricator.wikimedia.org/T299177)
[16:24:22] <icinga-wm>	 PROBLEM - Check systemd state on cp6014 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-varnish-exporter.service,varnishncsa.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:24:57] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] elasticsearch: fix package dependency issue [puppet] - 10https://gerrit.wikimedia.org/r/753985 (https://phabricator.wikimedia.org/T299177) (owner: 10Bking)
[16:25:32] <wikibugs>	 (03PS1) 10Hnowlan: partman: use reuse profiles on all restbase hosts [puppet] - 10https://gerrit.wikimedia.org/r/753986 (https://phabricator.wikimedia.org/T295375)
[16:27:13] <wikibugs>	 (03PS1) 10Bking: elasticsearch: fix package dependency issue [puppet] - 10https://gerrit.wikimedia.org/r/753988 (https://phabricator.wikimedia.org/T299177)
[16:28:00] <icinga-wm>	 PROBLEM - traffic-pool service on cp6016 is CRITICAL: CRITICAL - Expecting active but unit traffic-pool is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[16:28:18] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] elasticsearch: fix package dependency issue [puppet] - 10https://gerrit.wikimedia.org/r/753988 (https://phabricator.wikimedia.org/T299177) (owner: 10Bking)
[16:29:34] <bblack>	 all these cp6 alerts are ignorable (anything with site code 6 in the hostname is!)
[16:30:21] <bblack>	 !log rebooting cp60xx where x is 6, 7, 8, 14, 15, 16 (downtimed)
[16:30:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:33:48] <icinga-wm>	 RECOVERY - Check systemd state on cp6014 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:33:59] <wikibugs>	 (03CR) 10Muehlenhoff: elasticsearch: fix package dependency issue (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/753988 (https://phabricator.wikimedia.org/T299177) (owner: 10Bking)
[16:34:16] <icinga-wm>	 RECOVERY - Check systemd state on cp6007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:34:44] <icinga-wm>	 RECOVERY - Check systemd state on cp6008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:34:54] <icinga-wm>	 RECOVERY - Check systemd state on cp6015 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:35:32] <icinga-wm>	 RECOVERY - traffic-pool service on cp6016 is OK: OK - traffic-pool is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[16:39:41] <wikibugs>	 10SRE, 10SRE-Access-Requests: Add bking as icinga user - https://phabricator.wikimedia.org/T298738 (10bking) Confirmed working, sorry for the delay. Feel free to close.
[16:41:16] <wikibugs>	 10SRE, 10Data-Engineering, 10Generated Data Platform, 10Platform Engineering: Import Debian package of Cassandra 3.11.11 as 'dev' version - https://phabricator.wikimedia.org/T298805 (10Eevans) >>! In T298805#7622471, @MoritzMuehlenhoff wrote: > I added component/cassandradev for buster and stretch. For the...
[16:42:35] <wikibugs>	 10SRE, 10SRE-Access-Requests: Add bking as icinga user - https://phabricator.wikimedia.org/T298738 (10cmooney) 05In progress→03Resolved
[16:51:42] <icinga-wm>	 RECOVERY - IPMI Sensor Status on wdqs2003 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[16:55:03] <bblack>	 !log reboot lvs6001
[16:55:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:56:48] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+2] Introduce DriverInterface [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/699719 (owner: 10Giuseppe Lavagetto)
[17:00:18] <wikibugs>	 (03Merged) 10jenkins-bot: Introduce DriverInterface [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/699719 (owner: 10Giuseppe Lavagetto)
[17:04:14] <wikibugs>	 (03PS1) 10Elukey: knative-serving: add params to configure requests/limits of queue-proxy [deployment-charts] - 10https://gerrit.wikimedia.org/r/753996
[17:04:37] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] knative-serving: add params to configure requests/limits of queue-proxy [deployment-charts] - 10https://gerrit.wikimedia.org/r/753996 (owner: 10Elukey)
[17:05:54] <wikibugs>	 (03CR) 10Elukey: "These values have been set manually via kubectl, there was heavy cpu throttling with the default values." [deployment-charts] - 10https://gerrit.wikimedia.org/r/753996 (owner: 10Elukey)
[17:08:53] <wikibugs>	 (03PS2) 10Elukey: knative-serving: add params to configure requests/limits of queue-proxy [deployment-charts] - 10https://gerrit.wikimedia.org/r/753996
[17:09:50] <wikibugs>	 (03PS3) 10Elukey: knative-serving: add params to configure requests/limits of queue-proxy [deployment-charts] - 10https://gerrit.wikimedia.org/r/753996 (https://phabricator.wikimedia.org/T296173)
[17:10:22] <wikibugs>	 (03PS1) 10JMeybohm: Make use-service-account-credentials the default for controller-manager [puppet] - 10https://gerrit.wikimedia.org/r/753998
[17:11:00] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Make use-service-account-credentials the default for controller-manager [puppet] - 10https://gerrit.wikimedia.org/r/753998 (owner: 10JMeybohm)
[17:15:59] <wikibugs>	 (03PS2) 10JMeybohm: Make use-service-account-credentials the default for controller-manager [puppet] - 10https://gerrit.wikimedia.org/r/753998 (https://phabricator.wikimedia.org/T228967)
[17:16:43] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Make use-service-account-credentials the default for controller-manager [puppet] - 10https://gerrit.wikimedia.org/r/753998 (https://phabricator.wikimedia.org/T228967) (owner: 10JMeybohm)
[17:17:12] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops: Cloud IPv6 subnets - https://phabricator.wikimedia.org/T187929 (10cmooney) @Tks4Fish I don't think there is any reason to worry in terms of availability of IPv6 address space.  Is there a specific proposal on the table requiring additional IPv6 address space for...
[17:19:35] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10Discovery-Search (Current work): hw troubleshooting: IPMI Power Supply Failure (PS2) for wdqs2003.codfw.wmnet - https://phabricator.wikimedia.org/T299098 (10Papaul) 05Open→03Resolved PS2 replaced
[17:20:05] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] Added nfs/migrate_service.py (0324 comments) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/753612 (https://phabricator.wikimedia.org/T293800) (owner: 10Andrew Bogott)
[17:20:40] <wikibugs>	 (03PS13) 10Andrew Bogott: Added nfs/migrate_service.py [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/753612 (https://phabricator.wikimedia.org/T293800)
[17:20:56] <wikibugs>	 (03PS14) 10Andrew Bogott: Added nfs/migrate_service.py [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/753612 (https://phabricator.wikimedia.org/T293800)
[17:24:20] <wikibugs>	 (03PS3) 10JMeybohm: Make use-service-account-credentials the default for controller-manager [puppet] - 10https://gerrit.wikimedia.org/r/753998 (https://phabricator.wikimedia.org/T228967)
[17:24:44] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Added nfs/migrate_service.py [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/753612 (https://phabricator.wikimedia.org/T293800) (owner: 10Andrew Bogott)
[17:26:19] <wikibugs>	 (03CR) 10Accraze: [C: 03+1] knative-serving: add params to configure requests/limits of queue-proxy [deployment-charts] - 10https://gerrit.wikimedia.org/r/753996 (https://phabricator.wikimedia.org/T296173) (owner: 10Elukey)
[17:26:25] <bblack>	 !log reboot lvs600[23]
[17:26:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:40:13] <wikibugs>	 (03PS1) 10JMeybohm: Migrate kube-scheduler away from insecure API [puppet] - 10https://gerrit.wikimedia.org/r/754003 (https://phabricator.wikimedia.org/T290967)
[17:41:29] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[17:42:43] <wikibugs>	 (03PS2) 10JMeybohm: Migrate kube-scheduler away from insecure API [puppet] - 10https://gerrit.wikimedia.org/r/754003 (https://phabricator.wikimedia.org/T290967)
[17:43:34] <wikibugs>	 (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33254/console" [puppet] - 10https://gerrit.wikimedia.org/r/754003 (https://phabricator.wikimedia.org/T290967) (owner: 10JMeybohm)
[17:44:03] <bblack>	 !log drmrs asw: removed native-vlan-id from config on secondary (x-rack) interfaces of lvses to debug network issue
[17:44:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:45:37] <icinga-wm>	 PROBLEM - SSH on restbase2010.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[17:47:22] <wikibugs>	 (03PS2) 10Bking: elasticsearch: fix package dependency issue [puppet] - 10https://gerrit.wikimedia.org/r/753988 (https://phabricator.wikimedia.org/T299177)
[17:49:21] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] elasticsearch: fix package dependency issue [puppet] - 10https://gerrit.wikimedia.org/r/753988 (https://phabricator.wikimedia.org/T299177) (owner: 10Bking)
[17:50:58] <wikibugs>	 (03PS15) 10Andrew Bogott: Added nfs/migrate_service.py [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/753612 (https://phabricator.wikimedia.org/T293800)
[17:51:17] <wikibugs>	 (03PS3) 10Bking: elasticsearch: fix package dependency issue [puppet] - 10https://gerrit.wikimedia.org/r/753988 (https://phabricator.wikimedia.org/T299177)
[17:51:22] <wikibugs>	 (03PS5) 10Herron: assign role::apifeatureusage::logstash to apifeatureusage[12]001 hosts [puppet] - 10https://gerrit.wikimedia.org/r/752211 (https://phabricator.wikimedia.org/T297239)
[17:57:09] <icinga-wm>	 PROBLEM - cassandra-a CQL 10.192.48.54:9042 on restbase2009 is CRITICAL: connect to address 10.192.48.54 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886
[17:57:09] <icinga-wm>	 PROBLEM - Check systemd state on restbase2009 is CRITICAL: CRITICAL - degraded: The following units failed: smartd.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:57:19] <icinga-wm>	 PROBLEM - cassandra-b service on restbase2009 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[17:57:35] <wikibugs>	 (03PS1) 10JMeybohm: controllermanager_token is defined in common [labs/private] - 10https://gerrit.wikimedia.org/r/754005
[17:57:38] <wikibugs>	 (03PS1) 10JMeybohm: Add profile::kubernetes::master::scheduler_token to staging [labs/private] - 10https://gerrit.wikimedia.org/r/754006 (https://phabricator.wikimedia.org/T290967)
[17:57:41] <icinga-wm>	 PROBLEM - cassandra-b SSL 10.192.48.55:7001 on restbase2009 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[17:57:47] <icinga-wm>	 PROBLEM - cassandra-c SSL 10.192.48.56:7001 on restbase2009 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[17:57:53] <wikibugs>	 (03CR) 10JMeybohm: [V: 03+2 C: 03+2] controllermanager_token is defined in common [labs/private] - 10https://gerrit.wikimedia.org/r/754005 (owner: 10JMeybohm)
[17:57:58] <wikibugs>	 (03CR) 10JMeybohm: [V: 03+2 C: 03+2] Add profile::kubernetes::master::scheduler_token to staging [labs/private] - 10https://gerrit.wikimedia.org/r/754006 (https://phabricator.wikimedia.org/T290967) (owner: 10JMeybohm)
[17:58:07] <icinga-wm>	 PROBLEM - cassandra-c CQL 10.192.48.56:9042 on restbase2009 is CRITICAL: connect to address 10.192.48.56 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886
[17:58:09] <icinga-wm>	 PROBLEM - cassandra-c service on restbase2009 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[17:58:41] <icinga-wm>	 PROBLEM - cassandra-a SSL 10.192.48.54:7001 on restbase2009 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[17:58:43] <icinga-wm>	 PROBLEM - cassandra-a service on restbase2009 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[17:58:45] <icinga-wm>	 PROBLEM - cassandra-b CQL 10.192.48.55:9042 on restbase2009 is CRITICAL: connect to address 10.192.48.55 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886
[17:59:51] <wikibugs>	 (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33255/console" [puppet] - 10https://gerrit.wikimedia.org/r/754003 (https://phabricator.wikimedia.org/T290967) (owner: 10JMeybohm)
[18:06:16] <wikibugs>	 (03CR) 10JMeybohm: [V: 03+1] "The first PCC is without (noop) and the second with scheduler_token in labs/private" [puppet] - 10https://gerrit.wikimedia.org/r/754003 (https://phabricator.wikimedia.org/T290967) (owner: 10JMeybohm)
[18:06:18] <wikibugs>	 (03PS1) 10Herron: profile::apifeatureusage::logstash: add placeholder secrets for PCC [labs/private] - 10https://gerrit.wikimedia.org/r/754007 (https://phabricator.wikimedia.org/T297239)
[18:06:49] <icinga-wm>	 PROBLEM - SSH on kubernetes1004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[18:09:18] <logmsgbot>	 !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 15 days, 0:00:00 on restbase2009.codfw.wmnet with reason: not in restbase cluster, used for testing
[18:09:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:09:19] <logmsgbot>	 !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 15 days, 0:00:00 on restbase2009.codfw.wmnet with reason: not in restbase cluster, used for testing
[18:09:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:11:47] <wikibugs>	 (03CR) 10Herron: [V: 03+2 C: 03+2] profile::apifeatureusage::logstash: add placeholder secrets for PCC [labs/private] - 10https://gerrit.wikimedia.org/r/754007 (https://phabricator.wikimedia.org/T297239) (owner: 10Herron)
[18:16:01] <wikibugs>	 (03PS2) 10Herron: logstash: add optional document_type parameter to es output config [puppet] - 10https://gerrit.wikimedia.org/r/747634 (https://phabricator.wikimedia.org/T297239) (owner: 10Cwhite)
[18:16:12] <wikibugs>	 (03PS6) 10Herron: assign role::apifeatureusage::logstash to apifeatureusage[12]001 hosts [puppet] - 10https://gerrit.wikimedia.org/r/752211 (https://phabricator.wikimedia.org/T297239)
[18:22:46] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops: Cloud IPv6 subnets - https://phabricator.wikimedia.org/T187929 (10Tks4Fish) @cmooney Sorry, I think I ended up asking in the wrong place.  My question comes from T37947, and after looking at the comments there, I got to this task, saw it as stalled and concluded...
[18:23:54] <wikibugs>	 (03PS3) 10Herron: logstash: add optional document_type parameter to es output config [puppet] - 10https://gerrit.wikimedia.org/r/747634 (https://phabricator.wikimedia.org/T297239) (owner: 10Cwhite)
[18:28:02] <wikibugs>	 (03PS7) 10Herron: assign role::apifeatureusage::logstash to apifeatureusage[12]001 hosts [puppet] - 10https://gerrit.wikimedia.org/r/752211 (https://phabricator.wikimedia.org/T297239)
[18:31:35] <wikibugs>	 (03CR) 10Ebernhardson: elasticsearch: fix package dependency issue (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/753988 (https://phabricator.wikimedia.org/T299177) (owner: 10Bking)
[18:40:17] <wikibugs>	 (03CR) 10Herron: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/752211 (https://phabricator.wikimedia.org/T297239) (owner: 10Herron)
[18:41:33] <wikibugs>	 (03PS1) 10Jcrespo: mediabackups: Backup s2 media files at codfw [puppet] - 10https://gerrit.wikimedia.org/r/754013 (https://phabricator.wikimedia.org/T262668)
[18:41:37] <icinga-wm>	 PROBLEM - SSH on restbase2011.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[18:46:53] <icinga-wm>	 RECOVERY - SSH on restbase2010.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[18:49:45] <wikibugs>	 (03PS4) 10Ryan Kemper: elasticsearch: fix package dependency issue [puppet] - 10https://gerrit.wikimedia.org/r/753988 (https://phabricator.wikimedia.org/T299177) (owner: 10Bking)
[18:50:15] <wikibugs>	 (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/753988 (https://phabricator.wikimedia.org/T299177) (owner: 10Bking)
[18:54:18] <wikibugs>	 (03PS5) 10Ryan Kemper: elasticsearch: fix package dependency issue [puppet] - 10https://gerrit.wikimedia.org/r/753988 (https://phabricator.wikimedia.org/T299177) (owner: 10Bking)
[18:54:55] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] elasticsearch: fix package dependency issue [puppet] - 10https://gerrit.wikimedia.org/r/753988 (https://phabricator.wikimedia.org/T299177) (owner: 10Bking)
[18:55:56] <wikibugs>	 (03PS1) 10Majavah: P:cyberbot::exec: support newer debian versions [puppet] - 10https://gerrit.wikimedia.org/r/754016
[18:57:52] <wikibugs>	 (03CR) 10Cyberpower678: [C: 03+1] P:cyberbot::exec: support newer debian versions [puppet] - 10https://gerrit.wikimedia.org/r/754016 (owner: 10Majavah)
[19:00:23] <wikibugs>	 (03PS6) 10Ryan Kemper: elasticsearch: fix package dependency issue [puppet] - 10https://gerrit.wikimedia.org/r/753988 (https://phabricator.wikimedia.org/T299177) (owner: 10Bking)
[19:01:37] <wikibugs>	 (03PS7) 10Ryan Kemper: elasticsearch: fix package dependency issue [puppet] - 10https://gerrit.wikimedia.org/r/753988 (https://phabricator.wikimedia.org/T299177) (owner: 10Bking)
[19:02:58] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] elasticsearch: fix package dependency issue [puppet] - 10https://gerrit.wikimedia.org/r/753988 (https://phabricator.wikimedia.org/T299177) (owner: 10Bking)
[19:07:05] <wikibugs>	 (03PS8) 10Ryan Kemper: elasticsearch: fix package dependency issue [puppet] - 10https://gerrit.wikimedia.org/r/753988 (https://phabricator.wikimedia.org/T299177) (owner: 10Bking)
[19:07:13] <wikibugs>	 10SRE, 10Observability-Logging, 10Patch-For-Review: Move logstash api-feature-usage output away from v5 cluster - https://phabricator.wikimedia.org/T297239 (10herron) Planning to move the apifeatureusage pipeline over to the new hosts next week with these switchover steps:    * Add profile::apifeatureusage::...
[19:08:04] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] elasticsearch: fix package dependency issue [puppet] - 10https://gerrit.wikimedia.org/r/753988 (https://phabricator.wikimedia.org/T299177) (owner: 10Bking)
[19:08:27] <wikibugs>	 (03CR) 10Herron: "Planning to merge this next week as part of full plan outlined in https://phabricator.wikimedia.org/T297239#7622955" [puppet] - 10https://gerrit.wikimedia.org/r/752211 (https://phabricator.wikimedia.org/T297239) (owner: 10Herron)
[19:09:29] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+2] mediabackups: Backup s2 media files at codfw [puppet] - 10https://gerrit.wikimedia.org/r/754013 (https://phabricator.wikimedia.org/T262668) (owner: 10Jcrespo)
[19:10:23] <icinga-wm>	 RECOVERY - BGP status on cr2-eqord is OK: BGP OK - up: 155, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[19:24:00] <wikibugs>	 (03PS1) 10Dzahn: OTRS: rename role class to VRTS [puppet] - 10https://gerrit.wikimedia.org/r/754020 (https://phabricator.wikimedia.org/T293942)
[19:24:45] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] OTRS: rename role class to VRTS [puppet] - 10https://gerrit.wikimedia.org/r/754020 (https://phabricator.wikimedia.org/T293942) (owner: 10Dzahn)
[19:24:54] <wikibugs>	 (03PS2) 10Dzahn: OTRS: rename role class to VRTS [puppet] - 10https://gerrit.wikimedia.org/r/754020 (https://phabricator.wikimedia.org/T293942)
[19:25:44] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] OTRS: rename role class to VRTS [puppet] - 10https://gerrit.wikimedia.org/r/754020 (https://phabricator.wikimedia.org/T293942) (owner: 10Dzahn)
[19:27:05] <wikibugs>	 (03CR) 10Majavah: [C: 04-1] OTRS: rename role class to VRTS (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/754020 (https://phabricator.wikimedia.org/T293942) (owner: 10Dzahn)
[19:28:38] <wikibugs>	 (03PS3) 10Dzahn: OTRS: rename role class to VRTS [puppet] - 10https://gerrit.wikimedia.org/r/754020 (https://phabricator.wikimedia.org/T293942)
[19:30:33] <wikibugs>	 (03PS4) 10Dzahn: OTRS: rename role class to VRTS [puppet] - 10https://gerrit.wikimedia.org/r/754020 (https://phabricator.wikimedia.org/T293942)
[19:33:16] <jinxer-wm>	 (CirrusSearchJVMGCOldPoolFlatlined) firing: Elasticsearch instance elastic2051-production-search-codfw is showing memory pressure in the old pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell  - https://alerts.wikimedia.org
[19:37:22] <wikibugs>	 (03PS5) 10Dzahn: OTRS: rename role class to VRTS [puppet] - 10https://gerrit.wikimedia.org/r/754020 (https://phabricator.wikimedia.org/T293942)
[19:39:57] <wikibugs>	 (03CR) 10Dzahn: OTRS: rename role class to VRTS (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/754020 (https://phabricator.wikimedia.org/T293942) (owner: 10Dzahn)
[19:41:12] <wikibugs>	 (03PS6) 10Dzahn: OTRS: rename role class to VRTS [puppet] - 10https://gerrit.wikimedia.org/r/754020 (https://phabricator.wikimedia.org/T293942)
[19:45:00] <wikibugs>	 (03PS1) 10Jcrespo: mediabackups: Backup s3 media files at codfw [puppet] - 10https://gerrit.wikimedia.org/r/754022 (https://phabricator.wikimedia.org/T262668)
[19:45:11] <wikibugs>	 10SRE, 10Znuny, 10serviceops, 10Patch-For-Review: refactor OTRS role/module/cumin aliases - https://phabricator.wikimedia.org/T293942 (10Arnoldokoth)
[19:46:27] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops: Rack msw2-eqiad in new cage - https://phabricator.wikimedia.org/T298980 (10Jclark-ctr) moved msw2 back to old cage  it is connected to port 41.   msw2 is back online looks like we might have a missing link with opengear  future-scs-f8-eqiad will need some assistance with this @a...
[19:47:45] <wikibugs>	 (03CR) 10Nikki Nikkhoui: [C: 03+1] image-suggestion-api: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/742271 (owner: 10PipelineBot)
[19:48:21] <wikibugs>	 (03CR) 10Nikki Nikkhoui: [C: 04-1] "close in favor of newer patch" [deployment-charts] - 10https://gerrit.wikimedia.org/r/734438 (owner: 10PipelineBot)
[19:48:27] <wikibugs>	 (03CR) 10Nikki Nikkhoui: [C: 04-1] "close in favor of newer patch" [deployment-charts] - 10https://gerrit.wikimedia.org/r/736565 (owner: 10PipelineBot)
[19:49:33] <wikibugs>	 (03CR) 10Nikki Nikkhoui: [C: 04-1] "close in favor of newer patch" [deployment-charts] - 10https://gerrit.wikimedia.org/r/739618 (owner: 10PipelineBot)
[19:50:10] <wikibugs>	 (03PS1) 10Jcrespo: mediabackups: Backup s5 media files at codfw [puppet] - 10https://gerrit.wikimedia.org/r/754023 (https://phabricator.wikimedia.org/T262668)
[19:50:14] <wikibugs>	 10SRE, 10MediaWiki-extensions-CodeReview, 10Platform Engineering, 10serviceops-radar, 10Patch-For-Review: Make an HTML dump of the output of the CodeReview extension on MediaWiki.org - https://phabricator.wikimedia.org/T205361 (10Jdforrester-WMF) >>! In T205361#7574347, @Legoktm wrote: > OK, I copied ove...
[19:52:40] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] P:cyberbot::exec: support newer debian versions [puppet] - 10https://gerrit.wikimedia.org/r/754016 (owner: 10Majavah)
[19:54:43] <wikibugs>	 10SRE, 10MediaWiki-extensions-CodeReview, 10Platform Engineering, 10serviceops-radar, 10Patch-For-Review: Make an HTML dump of the output of the CodeReview extension on MediaWiki.org - https://phabricator.wikimedia.org/T205361 (10Legoktm) No, the Apache config change is non trivial (disable puppet everyw...
[19:58:37] <wikibugs>	 (03PS2) 10Jcrespo: mediabackups: Backup s5 media files at codfw [puppet] - 10https://gerrit.wikimedia.org/r/754023 (https://phabricator.wikimedia.org/T262668)
[19:58:39] <wikibugs>	 (03PS1) 10Jcrespo: mediabackups: Backup s6 media files at codfw [puppet] - 10https://gerrit.wikimedia.org/r/754024 (https://phabricator.wikimedia.org/T262668)
[19:58:41] <wikibugs>	 (03PS1) 10Jcrespo: mediabackups: Backup s7 media files at codfw [puppet] - 10https://gerrit.wikimedia.org/r/754025 (https://phabricator.wikimedia.org/T262668)
[19:58:43] <wikibugs>	 (03PS1) 10Jcrespo: mediabackups: Backup s8 media files at codfw [puppet] - 10https://gerrit.wikimedia.org/r/754026 (https://phabricator.wikimedia.org/T262668)
[20:01:19] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops: Cloud IPv6 subnets - https://phabricator.wikimedia.org/T187929 (10cmooney) @Tks4Fish no problem at all!  And certainly no need to apologize.  This task more relates to allocating blocks of IPv6 for Toolforge/Cloud.  As per the above discussion there are some sma...
[20:05:00] <wikibugs>	 (03CR) 10Jforrester: [C: 03+1] MWMultiVersion.php: Reverse logic for wikiversions file selection [mediawiki-config] - 10https://gerrit.wikimedia.org/r/744836 (owner: 10Ahmon Dancy)
[20:07:24] <wikibugs>	 (03PS1) 10Herron: remove references to centrallog2001 [homer/public] - 10https://gerrit.wikimedia.org/r/754028 (https://phabricator.wikimedia.org/T298994)
[20:08:16] <wikibugs>	 10SRE, 10Language-Team (Language-2022-January-March): Deploy Flores MT secrets in Production for ContentTranslation - https://phabricator.wikimedia.org/T299023 (10Dzahn) Of course using GPG is fine as well. I just did not suggest it because usually people consider it cumbersome and once we add the credentials...
[20:09:23] <icinga-wm>	 RECOVERY - SSH on kubernetes1004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[20:09:43] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+2] mediabackups: Backup s3 media files at codfw [puppet] - 10https://gerrit.wikimedia.org/r/754022 (https://phabricator.wikimedia.org/T262668) (owner: 10Jcrespo)
[20:11:50] <wikibugs>	 (03CR) 10Dzahn: [C: 04-2] "there also needs to be renaming in the private repo or this will fail. we have scheduled a time to do this together, so disable puppet, me" [puppet] - 10https://gerrit.wikimedia.org/r/754020 (https://phabricator.wikimedia.org/T293942) (owner: 10Dzahn)
[20:14:48] <wikibugs>	 (03PS1) 10Herron: remove references to centrallog2001 [puppet] - 10https://gerrit.wikimedia.org/r/754029 (https://phabricator.wikimedia.org/T298994)
[20:15:09] <wikibugs>	 (03PS9) 10Ebernhardson: elasticsearch: fix package dependency issue [puppet] - 10https://gerrit.wikimedia.org/r/753988 (https://phabricator.wikimedia.org/T299177) (owner: 10Bking)
[20:17:46] <wikibugs>	 (03CR) 10Ebernhardson: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/753988 (https://phabricator.wikimedia.org/T299177) (owner: 10Bking)
[20:19:38] <wikibugs>	 (03PS10) 10Ebernhardson: elasticsearch: fix package dependency issue [puppet] - 10https://gerrit.wikimedia.org/r/753988 (https://phabricator.wikimedia.org/T299177) (owner: 10Bking)
[20:27:33] <wikibugs>	 (03CR) 10Ebernhardson: elasticsearch: fix package dependency issue (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/753988 (https://phabricator.wikimedia.org/T299177) (owner: 10Bking)
[20:42:03] <icinga-wm>	 PROBLEM - SSH on contint1001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[20:47:51] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 44, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[20:49:13] <icinga-wm>	 PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 133, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[20:56:41] <wikibugs>	 (03PS1) 10Ebernhardson: cirrus: Fetch java before elastic plugins [puppet] - 10https://gerrit.wikimedia.org/r/754034
[21:00:11] <wikibugs>	 (03PS2) 10Ebernhardson: cirrus: Fetch java before elastic plugins [puppet] - 10https://gerrit.wikimedia.org/r/754034 (https://phabricator.wikimedia.org/T299177)
[21:00:33] <wikibugs>	 (03CR) 10Ebernhardson: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/754034 (https://phabricator.wikimedia.org/T299177) (owner: 10Ebernhardson)
[21:05:37] <wikibugs>	 (03CR) 10Bking: [C: 03+2] cirrus: Fetch java before elastic plugins [puppet] - 10https://gerrit.wikimedia.org/r/754034 (https://phabricator.wikimedia.org/T299177) (owner: 10Ebernhardson)
[21:07:33] <wikibugs>	 (03PS1) 10Cwhite: hiera: allow logstash1032 through kafka-jumbo firewall [puppet] - 10https://gerrit.wikimedia.org/r/754035 (https://phabricator.wikimedia.org/T288621)
[21:09:48] <wikibugs>	 (03CR) 10Herron: [C: 03+1] hiera: allow logstash1032 through kafka-jumbo firewall [puppet] - 10https://gerrit.wikimedia.org/r/754035 (https://phabricator.wikimedia.org/T288621) (owner: 10Cwhite)
[21:12:30] <wikibugs>	 (03CR) 10Cwhite: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/754029 (https://phabricator.wikimedia.org/T298994) (owner: 10Herron)
[21:14:00] <wikibugs>	 (03CR) 10Cwhite: [C: 03+2] hiera: allow logstash1032 through kafka-jumbo firewall [puppet] - 10https://gerrit.wikimedia.org/r/754035 (https://phabricator.wikimedia.org/T288621) (owner: 10Cwhite)
[21:21:53] <wikibugs>	 (03PS1) 10Ryan Kemper: Revert "cirrus: Fetch java before elastic plugins" [puppet] - 10https://gerrit.wikimedia.org/r/753953
[21:23:14] <wikibugs>	 (03CR) 10Bking: [C: 03+2] Revert "cirrus: Fetch java before elastic plugins" [puppet] - 10https://gerrit.wikimedia.org/r/753953 (owner: 10Ryan Kemper)
[21:23:33] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Revert "cirrus: Fetch java before elastic plugins" [puppet] - 10https://gerrit.wikimedia.org/r/753953 (owner: 10Ryan Kemper)
[21:24:11] <icinga-wm>	 PROBLEM - Widespread puppet agent failures- no resources reported on alert1001 is CRITICAL: 0.01193 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet
[21:25:21] <wikibugs>	 (03PS2) 10Ryan Kemper: Revert "cirrus: Fetch java before elastic plugins" [puppet] - 10https://gerrit.wikimedia.org/r/753953
[21:25:34] <wikibugs>	 (03PS3) 10Ryan Kemper: Revert "cirrus: Fetch java before elastic plugins" [puppet] - 10https://gerrit.wikimedia.org/r/753953 (https://phabricator.wikimedia.org/T299177)
[21:39:55] <icinga-wm>	 RECOVERY - Widespread puppet agent failures- no resources reported on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.002169 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet
[21:40:03] <wikibugs>	 (03PS1) 10Ryan Kemper: cirrus: Fetch java before elastic plugins [puppet] - 10https://gerrit.wikimedia.org/r/754042 (https://phabricator.wikimedia.org/T299177)
[21:41:00] <wikibugs>	 (03PS2) 10Ryan Kemper: cirrus: Fetch java before elastic plugins [puppet] - 10https://gerrit.wikimedia.org/r/754042 (https://phabricator.wikimedia.org/T299177)
[21:41:18] <wikibugs>	 (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/754042 (https://phabricator.wikimedia.org/T299177) (owner: 10Ryan Kemper)
[21:44:53] <wikibugs>	 (03CR) 10Bking: [C: 03+1] cirrus: Fetch java before elastic plugins [puppet] - 10https://gerrit.wikimedia.org/r/754042 (https://phabricator.wikimedia.org/T299177) (owner: 10Ryan Kemper)
[22:11:46] <wikibugs>	 (03CR) 10Ryan Kemper: [C: 03+2] cirrus: Fetch java before elastic plugins [puppet] - 10https://gerrit.wikimedia.org/r/754042 (https://phabricator.wikimedia.org/T299177) (owner: 10Ryan Kemper)
[22:26:42] <logmsgbot>	 !log ryankemper@cumin2002 START - Cookbook sre.hosts.reimage for host elastic2051.codfw.wmnet with OS stretch
[22:26:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:26:50] <wikibugs>	 10SRE, 10ops-codfw, 10Discovery-Search, 10Patch-For-Review: Degraded RAID on elastic2051 - https://phabricator.wikimedia.org/T298674 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by ryankemper@cumin2002 for host elastic2051.codfw.wmnet with OS stretch
[22:49:16] <wikibugs>	 (03PS1) 10Andrew Bogott: cloud-vps nfsclient: switch to using the VM-hosted scratch NFS server [puppet] - 10https://gerrit.wikimedia.org/r/754043 (https://phabricator.wikimedia.org/T291405)
[22:50:52] <wikibugs>	 10SRE, 10MediaWiki-Uploading, 10Traffic: ATS 502 on uploading non-small files - https://phabricator.wikimedia.org/T299160 (10Josve05a)
[22:58:12] <wikibugs>	 10SRE, 10Discovery: Ban elastic2035 from prod elastic clusters - https://phabricator.wikimedia.org/T299151 (10bking) Banned elastic2035 and elastic2051 (which was already broken) via the following commands:  `curl -H 'Content-Type: application/json' -XPUT \ "http://localhost:9200/_cluster/settings" -d \ '{"tra...
[23:07:38] <logmsgbot>	 !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2051.codfw.wmnet with OS stretch
[23:07:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:07:46] <wikibugs>	 10SRE, 10ops-codfw, 10Discovery-Search, 10Patch-For-Review: Degraded RAID on elastic2051 - https://phabricator.wikimedia.org/T298674 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by ryankemper@cumin2002 for host elastic2051.codfw.wmnet with OS stretch completed: - elastic2051 (**WARN*...
[23:12:12] <wikibugs>	 10SRE, 10ops-codfw, 10Discovery-Search (Current work), 10Patch-For-Review: Degraded RAID on elastic2051 - https://phabricator.wikimedia.org/T298674 (10RKemper) p:05Triage→03Medium
[23:15:40] <wikibugs>	 10SRE, 10ops-codfw, 10Discovery-Search (Current work), 10Patch-For-Review: Degraded RAID on elastic2051 - https://phabricator.wikimedia.org/T298674 (10RKemper) 05In progress→03Resolved
[23:15:48] <wikibugs>	 10SRE, 10ops-codfw, 10Discovery-Search (Current work), 10Patch-For-Review: Degraded RAID on elastic2051 - https://phabricator.wikimedia.org/T298674 (10RKemper) Re-image is complete
[23:24:01] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1138 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[23:25:13] <icinga-wm>	 RECOVERY - Check systemd state on elastic2051 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:38:19] <icinga-wm>	 PROBLEM - MariaDB Replica IO: x1 on db2101 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2026, Errmsg: error reconnecting to master repl@db2096.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: SSL connection error00000000:lib(0):func(0):reason(0) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[23:38:27] <icinga-wm>	 PROBLEM - MariaDB Replica IO: s5 on db2101 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2026, Errmsg: error reconnecting to master repl@db2123.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: SSL connection error00000000:lib(0):func(0):reason(0) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[23:38:39] <wikibugs>	 (03PS1) 10Legoktm: Revert "LinksUpdate refactor" and follow-ups [core] (wmf/1.38.0-wmf.17) - 10https://gerrit.wikimedia.org/r/754046 (https://phabricator.wikimedia.org/T299244)
[23:39:59] <icinga-wm>	 RECOVERY - MariaDB Replica IO: x1 on db2101 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[23:40:09] <icinga-wm>	 RECOVERY - MariaDB Replica IO: s5 on db2101 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[23:43:17] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1138 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[23:47:24] <James_F>	 legoktm: Well, it looks like it's passing CI. Should we JFDI and deploy (with no SRE or RelEng)?
[23:48:04] <legoktm>	 uhmmm
[23:48:13] <James_F>	 legoktm: My thoughts too. :-(
[23:48:47] <legoktm>	 how much longer are you going to be around for?
[23:49:17] <James_F>	 I can be around for a bit, but this isn't really my area of expertise. If it stampedes the DBs, for instance…
[23:49:38] <James_F>	 But it should just work as a reversion to the old status quo, hopefully?
[23:49:41] <legoktm>	 yeah, just wondering how long we should wait to see if Tim comes around
[23:49:48] <legoktm>	 rzl: are you still around?
[23:50:06] <James_F>	 It's only 18:50 here, I can be around for a couple of hours before turning into a pumpkin.
[23:50:18] <subbu[m]>	 were any extensions updated based on these refactors?
[23:50:43] <legoktm>	 good question, let's look through Tim's other patches
[23:50:49] <James_F>	 subbu[m]: Yes.
[23:50:51] <James_F>	 git #768464d0 - Identify lead images using a new parser hook instead of during LinksUpdate (task T176520) (task T296895) by Tim Starling
[23:50:52] <stashbot>	 T176520: Pageimage property (and possibly other page properties) not updated reliably after reverts - https://phabricator.wikimedia.org/T176520
[23:50:52] <stashbot>	 T296895: LinksUpdate hook review - https://phabricator.wikimedia.org/T296895
[23:51:17] <James_F>	 In PageImages. But just that, from grepping https://www.mediawiki.org/wiki/MediaWiki_1.38/wmf.17 for "links"
[23:51:23] <subbu>	 James_F, no, those are the other way around.
[23:51:33] <subbu>	 PageImages refactor blocked the LinksUpdate refactor.
[23:51:42] <James_F>	 Right, so it should be fine then?
[23:51:42] <legoktm>	 https://gerrit.wikimedia.org/r/c/mediawiki/extensions/GeoData/+/744910
[23:51:58] <James_F>	 That shipped in wmf.17.
[23:52:08] <James_F>	 Err. wmf.16.
[23:52:35] <legoktm>	 ah
[23:52:36] <legoktm>	 yeah
[23:52:48] <James_F>	 legoktm: Are you able to be around for a bit?
[23:52:55] <legoktm>	 afecf46c237b39af43efd9cd2d664b83ed7c4f16 is Add LinksUpdate::getPageId() and that went out in wmf.13
[23:52:57] <legoktm>	 yes
[23:53:18] <James_F>	 Cool, that makes two no-longer-qualified people to do it, 0+0 = 1, right?
[23:53:39] <legoktm>	 https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Translate/+/743283 is likely to be backwards-compatible
[23:53:50] <legoktm>	 https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Wikibase/+/744908 another use of getPageId(), fine
[23:53:57] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb={CREATE,UPDATE} https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[23:54:55] <subbu>	 to me, the translate change looks b/c yes.
[23:55:00] <legoktm>	 and the PageImages stuff landed earlier than wmf.17
[23:55:01] <James_F>	 The risky-patch comment for this says "Revert plan: Revert the patch, if the issue cannot be clearly identified and fixed easily", so at least when writing that they didn't expect it to break things if rolled back: https://phabricator.wikimedia.org/T293958#7612230
[23:56:15] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubestagemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[23:56:28] <dduvall>	 legoktm, James_F: hi. sorry, i was out for a bit
[23:56:30] * dduvall reads
[23:56:34] <subbu>	 I am going to say this ... the fact that CI passed the reverts is not, in and of itself, a big source of confidence that nothing will break, but the risky-patch comment is a better indicator.
[23:56:55] <James_F>	 Totally.
[23:57:19] <James_F>	 I merged an OOUI update yesterday that broke all edits, and our CI passed it anyway. I'm not feeling massively confident in rolling this back.
[23:57:44] <James_F>	 Monday is a no-deploy day too, I believe.
[23:57:56] <legoktm>	 dduvall: we've identified the breaking change (LinksUpdate refactor), have a revert of the 4 patches that passes CI, now discussing how confident we feel about reverting it
[23:58:01] <legoktm>	 er, deploying the revert*