[00:00:09] <logmsgbot>	 !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on gerrit1001.wikimedia.org with reason: service restart
[00:00:50] <logmsgbot>	 !log dzahn@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on gerrit.wikimedia.org with reason: service restart
[00:01:05] <logmsgbot>	 !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on gerrit.wikimedia.org with reason: service restart
[00:01:18] <icinga-wm>	 RECOVERY - etcd request latencies on kubemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[00:02:47] <jinxer-wm>	 (Device rebooted) resolved: Device ps1-b7-codfw.mgmt.codfw.wmnet recovered from Device rebooted   - https://alerts.wikimedia.org/?q=alertname%3DDevice+rebooted
[00:03:14] <icinga-wm>	 RECOVERY - ps1-b7-codfw-infeed-load-tower-A-phase-X on ps1-b7-codfw is OK: SNMP OK - ps1-b7-codfw-infeed-load-tower-A-phase-X 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[00:03:28] <mutante>	 !log gerrit - service restart to deploy config change to add second replica T313250
[00:03:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:03:31] <stashbot>	 T313250: Bring up gerrit2002 - https://phabricator.wikimedia.org/T313250
[00:03:38] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[00:05:37] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312', diff saved to https://phabricator.wikimedia.org/P32271 and previous config saved to /var/cache/conftool/dbconfig/20220804-000536-marostegui.json
[00:06:25] <mutante>	 !log gerrit - [2022-08-04 00:05:33,173] Replication to gerrit2@gerrit2002.wikimedia.org:/srv/gerrit/git/analytics/geowiki.git started... [CONTEXT pushOneId="83ad5008" ]
[00:06:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:06:37] <mutante>	 !log gerrit - [2022-08-04 00:05:33,173] Replication to gerrit2@gerrit2002.wikimedia.org:/srv/gerrit/git/analytics/geowiki.git started.. T313250
[00:06:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:08:30] <wikibugs>	 SRE, ops-codfw, DBA, Patch-For-Review: (Need By:TBD) rack/setup/install row B new PDUs - https://phabricator.wikimedia.org/T310070 (Papaul)
[00:12:54] <jinxer-wm>	 (Device rebooted) firing: Alert for device ps1-c1-codfw.mgmt.codfw.wmnet - Device rebooted   - https://alerts.wikimedia.org/?q=alertname%3DDevice+rebooted
[00:14:52] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster2001 is CRITICAL: instance=10.192.0.56 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[00:15:53] <wikibugs>	 (PS1) Tim Starling: Explicitly declare replaceable settings [mediawiki-config] - https://gerrit.wikimedia.org/r/820247
[00:16:32] <wikibugs>	 (PS2) Tim Starling: Explicitly declare replaceable settings [mediawiki-config] - https://gerrit.wikimedia.org/r/820247
[00:17:48] <wikibugs>	 (CR) CI reject: [V: -1] Explicitly declare replaceable settings [mediawiki-config] - https://gerrit.wikimedia.org/r/820247 (owner: Tim Starling)
[00:17:54] <jinxer-wm>	 (Device rebooted) resolved: Device ps1-c1-codfw.mgmt.codfw.wmnet recovered from Device rebooted   - https://alerts.wikimedia.org/?q=alertname%3DDevice+rebooted
[00:18:04] <wikibugs>	 SRE, MediaWiki-General, Traffic-Icebox, Patch-For-Review: Investigate query parameter normalization for MW/services - https://phabricator.wikimedia.org/T138093 (Platonides) Changing the ordering (perhaps coupled with varnish redirecting all '?title=X&action=history' to the new '?action=history&ti...
[00:18:47] <wikibugs>	 SRE, Gerrit, serviceops, serviceops-collab, and 2 others: replacement for gerrit2001, decom gerrit2001 - https://phabricator.wikimedia.org/T243027 (Dzahn)
[00:20:44] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312', diff saved to https://phabricator.wikimedia.org/P32272 and previous config saved to /var/cache/conftool/dbconfig/20220804-002043-marostegui.json
[00:27:27] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to the Desktop Improvements project statistics for SGrabarczuk - https://phabricator.wikimedia.org/T313616 (sgrabarczuk) It's working. Thank you!
[00:31:40] <wikibugs>	 (PS3) Tim Starling: Explicitly declare replaceable settings [mediawiki-config] - https://gerrit.wikimedia.org/r/820247
[00:33:43] <wikibugs>	 (PS1) Dzahn: gerrit: decom gerrit2001 [puppet] - https://gerrit.wikimedia.org/r/820248 (https://phabricator.wikimedia.org/T243027)
[00:35:22] <wikibugs>	 (PS1) Dzahn: gerrit: remove hiera data for old replica [puppet] - https://gerrit.wikimedia.org/r/820249 (https://phabricator.wikimedia.org/T243027)
[00:35:50] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312 (T312972)', diff saved to https://phabricator.wikimedia.org/P32273 and previous config saved to /var/cache/conftool/dbconfig/20220804-003549-marostegui.json
[00:35:51] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1146.eqiad.wmnet with reason: Maintenance
[00:35:54] <stashbot>	 T312972: Rename index su_normalized on table spoofuser on wmf wikis - https://phabricator.wikimedia.org/T312972
[00:36:05] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1146.eqiad.wmnet with reason: Maintenance
[00:36:11] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1146:3312 (T312972)', diff saved to https://phabricator.wikimedia.org/P32274 and previous config saved to /var/cache/conftool/dbconfig/20220804-003611-marostegui.json
[00:36:31] <wikibugs>	 (PS1) Dzahn: site: remove gerrit2001, merge gerrit1001/2002 regex [puppet] - https://gerrit.wikimedia.org/r/820250 (https://phabricator.wikimedia.org/T243027)
[00:37:53] <wikibugs>	 (CR) Dzahn: "On gerrit2002 we merged the config change and a bit later I did the gerrit service restart and then it started replicating to gerrit2002! " [puppet] - https://gerrit.wikimedia.org/r/820249 (https://phabricator.wikimedia.org/T243027) (owner: Dzahn)
[00:38:16] <wikibugs>	 (CR) Dzahn: "(service restart and log on gerrit1001)" [puppet] - https://gerrit.wikimedia.org/r/820249 (https://phabricator.wikimedia.org/T243027) (owner: Dzahn)
[00:38:23] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312 (T312972)', diff saved to https://phabricator.wikimedia.org/P32275 and previous config saved to /var/cache/conftool/dbconfig/20220804-003822-marostegui.json
[00:39:30] <wikibugs>	 (CR) Dzahn: "All this stuff comes after we removed it from gerrit config but before we run the decom cookbook on the machine. will need puppet run on a" [puppet] - https://gerrit.wikimedia.org/r/820248 (https://phabricator.wikimedia.org/T243027) (owner: Dzahn)
[00:40:08] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster2001 is CRITICAL: instance=10.192.0.56 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[00:40:30] <wikibugs>	 (CR) Dzahn: "And then finally this will be the last merge after the decom cookbook ran. After this we should be able to call it a decom mission success" [puppet] - https://gerrit.wikimedia.org/r/820250 (https://phabricator.wikimedia.org/T243027) (owner: Dzahn)
[00:42:35] <wikibugs>	 (CR) Dzahn: [C: +2] "after deploying this also needed a gerrit service restart. after that was done it started..log lines like:" [puppet] - https://gerrit.wikimedia.org/r/815401 (https://phabricator.wikimedia.org/T313250) (owner: Dzahn)
[00:43:07] <wikibugs>	 (CR) Dzahn: [C: +2] "see /var/log/gerrit/replication_log" [puppet] - https://gerrit.wikimedia.org/r/815401 (https://phabricator.wikimedia.org/T313250) (owner: Dzahn)
[00:44:04] <icinga-wm>	 RECOVERY - Check systemd state on logstash2026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:45:08] <wikibugs>	 (PS2) Dzahn: gerrit: remove hiera data for old replica [puppet] - https://gerrit.wikimedia.org/r/820249 (https://phabricator.wikimedia.org/T243027)
[00:53:29] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312', diff saved to https://phabricator.wikimedia.org/P32276 and previous config saved to /var/cache/conftool/dbconfig/20220804-005328-marostegui.json
[00:56:12] <icinga-wm>	 RECOVERY - SSH on db1110.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[01:08:35] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312', diff saved to https://phabricator.wikimedia.org/P32277 and previous config saved to /var/cache/conftool/dbconfig/20220804-010834-marostegui.json
[01:18:18] <wikibugs>	 (CR) Dzahn: [C: +2] devtools: Allow for scap deployment of scap [puppet] - https://gerrit.wikimedia.org/r/820220 (https://phabricator.wikimedia.org/T314195) (owner: Dduvall)
[01:23:41] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312 (T312972)', diff saved to https://phabricator.wikimedia.org/P32278 and previous config saved to /var/cache/conftool/dbconfig/20220804-012341-marostegui.json
[01:23:46] <stashbot>	 T312972: Rename index su_normalized on table spoofuser on wmf wikis - https://phabricator.wikimedia.org/T312972
[01:25:02] <wikibugs>	 (CR) Dzahn: "yes, this is obviously not the real secret prod key but a "fake key" but it's not as fake as the SNAKEOIL string which meant things would " [labs/private] - https://gerrit.wikimedia.org/r/820221 (https://phabricator.wikimedia.org/T314195) (owner: Dduvall)
[01:26:00] <jinxer-wm>	 (JobUnavailable) firing: (4) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:31:06] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster2001 is CRITICAL: instance=10.192.0.56 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[01:35:56] <wikibugs>	 (CR) Dzahn: [C: +2] phabricator: Provide script for ensuring correct config file ownership [puppet] - https://gerrit.wikimedia.org/r/820174 (https://phabricator.wikimedia.org/T313950) (owner: Dduvall)
[01:36:22] <wikibugs>	 (CR) Dzahn: [C: +2] phabricator: Stop managing /srv/phab/repos [puppet] - https://gerrit.wikimedia.org/r/820213 (https://phabricator.wikimedia.org/T313953) (owner: Dduvall)
[01:38:41] <wikibugs>	 (CR) Dzahn: [C: +2] "noop on phab1001 and others" [puppet] - https://gerrit.wikimedia.org/r/820213 (https://phabricator.wikimedia.org/T313953) (owner: Dduvall)
[01:38:50] <wikibugs>	 (PS4) Dzahn: phabricator: Provide script for ensuring correct config file ownership [puppet] - https://gerrit.wikimedia.org/r/820174 (https://phabricator.wikimedia.org/T313950) (owner: Dduvall)
[01:40:45] <jinxer-wm>	 (JobUnavailable) firing: (8) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:45:51] <wikibugs>	 (CR) Dzahn: "/usr/local/sbin/phab_deploy_ensure_config_ownership has been created" [puppet] - https://gerrit.wikimedia.org/r/820174 (https://phabricator.wikimedia.org/T313950) (owner: Dduvall)
[01:46:07] <wikibugs>	 (PS1) Reedy: CommonSettings-labs: Fix usage of $wgSFSValidateIPListLocationMD5 [mediawiki-config] - https://gerrit.wikimedia.org/r/820254
[01:48:13] <wikibugs>	 (PS1) Reedy: wikitech: Remove old LDAP config vars [mediawiki-config] - https://gerrit.wikimedia.org/r/820255
[01:50:45] <jinxer-wm>	 (JobUnavailable) firing: (9) Reduced availability for job gitaly in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:58:05] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to the Desktop Improvements project statistics for SGrabarczuk - https://phabricator.wikimedia.org/T313616 (Dzahn) In progress→Resolved Thanks for confirming!:) Closing as resolved.
[01:58:24] <wikibugs>	 SRE, SRE-swift-storage, ops-codfw: Degraded RAID on ms-be2035 - https://phabricator.wikimedia.org/T314509 (Dzahn)
[02:20:45] <jinxer-wm>	 (JobUnavailable) firing: (9) Reduced availability for job gitaly in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:32:41] <wikibugs>	 (PS2) KartikMistry: Update cxserver to 2022-08-04-022612-production [deployment-charts] - https://gerrit.wikimedia.org/r/820075 (https://phabricator.wikimedia.org/T313296)
[02:56:13] <jinxer-wm>	 (KubernetesRsyslogDown) firing: rsyslog on kubestage2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[03:17:49] <wikibugs>	 SRE, SRE-swift-storage, Performance-Team, Traffic, and 2 others: Progressive Multi-DC roll out - https://phabricator.wikimedia.org/T279664 (tstarling) >>! In T279664#8122731, @Joe wrote: > Do we expect that to happen regularly on a high percentage of requests? If 17% of all requests need to make...
[03:19:15] <wikibugs>	 (PS1) KartikMistry: Enable SectionTranslation on 10 Wikipedias where ContentTranslation is default [mediawiki-config] - https://gerrit.wikimedia.org/r/820261 (https://phabricator.wikimedia.org/T308829)
[03:47:14] <wikibugs>	 SRE, SRE-swift-storage, Performance-Team, Traffic, and 2 others: Progressive Multi-DC roll out - https://phabricator.wikimedia.org/T279664 (tstarling) >>! In T279664#8123041, @MatthewVernon wrote: > Without that, I'm not sure what we can do to work around the fact that MW doesn't reliably write/d...
[04:09:02] <logmsgbot>	 !log krinkle@mwmaint1002 pull aborted:  (duration: 00m 05s)
[04:17:48] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster2001 is CRITICAL: instance=10.192.0.56 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[04:29:26] <TimStarling>	 !log on mw2377 fiddling with CPU frequency control and doing benchmarks
[04:29:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:31:42] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster2001 is CRITICAL: instance=10.192.0.56 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[04:40:38] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster2001 is CRITICAL: instance=10.192.0.56 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[04:51:16] <icinga-wm>	 PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:58:28] <wikibugs>	 (PS3) KartikMistry: Update cxserver to 2022-08-04-022612-production [deployment-charts] - https://gerrit.wikimedia.org/r/820075 (https://phabricator.wikimedia.org/T313296)
[05:10:46] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster2001 is CRITICAL: instance=10.192.0.56 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[05:16:56] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on ganeti2030.codfw.wmnet with reason: Remove node for eventual reimage, T311686
[05:17:02] <stashbot>	 T311686: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686
[05:17:12] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on ganeti2030.codfw.wmnet with reason: Remove node for eventual reimage, T311686
[05:22:20] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti2030.codfw.wmnet with OS bullseye
[05:22:24] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti2030.codfw.wmnet with OS bullseye
[05:23:57] <wikibugs>	 (PS1) Marostegui: Revert "mariadb: Disable notifications pdu C rows" [puppet] - https://gerrit.wikimedia.org/r/820266
[05:23:59] <wikibugs>	 (PS1) Marostegui: Revert "mariadb: Disable notifications on codfw racks" [puppet] - https://gerrit.wikimedia.org/r/820267
[05:26:16] <wikibugs>	 (CR) Marostegui: [C: +2] Revert "mariadb: Disable notifications pdu C rows" [puppet] - https://gerrit.wikimedia.org/r/820266 (owner: Marostegui)
[05:26:25] <wikibugs>	 (CR) Marostegui: [C: +2] Revert "mariadb: Disable notifications on codfw racks" [puppet] - https://gerrit.wikimedia.org/r/820267 (owner: Marostegui)
[05:26:42] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster2001 is CRITICAL: instance=10.192.0.56 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[05:30:13] <wikibugs>	 (PS1) Marostegui: production-m5.sql: Remove grants for labweb1001/labweb1002 [puppet] - https://gerrit.wikimedia.org/r/820286 (https://phabricator.wikimedia.org/T314528)
[05:32:43] * kart_ updating cxserver..
[05:32:49] <wikibugs>	 (CR) KartikMistry: [C: +2] Update cxserver to 2022-08-04-022612-production [deployment-charts] - https://gerrit.wikimedia.org/r/820075 (https://phabricator.wikimedia.org/T313296) (owner: KartikMistry)
[05:35:40] <wikibugs>	 (Abandoned) Muehlenhoff: librenms: Remove support for stretch [puppet] - https://gerrit.wikimedia.org/r/810323 (owner: Muehlenhoff)
[05:36:44] <wikibugs>	 (Merged) jenkins-bot: Update cxserver to 2022-08-04-022612-production [deployment-charts] - https://gerrit.wikimedia.org/r/820075 (https://phabricator.wikimedia.org/T313296) (owner: KartikMistry)
[05:36:59] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti2030.codfw.wmnet with reason: host reimage
[05:38:34] <logmsgbot>	 !log kartik@deploy1002 helmfile [staging] START helmfile.d/services/cxserver: apply
[05:39:04] <logmsgbot>	 !log kartik@deploy1002 helmfile [staging] DONE helmfile.d/services/cxserver: apply
[05:40:26] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti2030.codfw.wmnet with reason: host reimage
[05:41:41] <logmsgbot>	 !log kartik@deploy1002 helmfile [codfw] START helmfile.d/services/cxserver: apply
[05:42:35] <logmsgbot>	 !log kartik@deploy1002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply
[05:43:21] <logmsgbot>	 !log kartik@deploy1002 helmfile [eqiad] START helmfile.d/services/cxserver: apply
[05:43:22] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[05:43:42] <icinga-wm>	 RECOVERY - etcd request latencies on kubemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[05:44:14] <logmsgbot>	 !log kartik@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply
[05:49:40] <kart_>	 !log Updated cxserver to 2022-08-04-022612-production (T313296, T308248)
[05:49:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:49:44] <stashbot>	 T313296: Enable Content and Section translation on wikipedias with new MT support from Google - https://phabricator.wikimedia.org/T313296
[05:49:45] <stashbot>	 T308248: Newly supported languages in Google Translate - https://phabricator.wikimedia.org/T308248
[05:54:52] <icinga-wm>	 PROBLEM - k8s API server requests latencies on kubemaster2001 is CRITICAL: instance=10.192.0.56 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[05:57:22] <icinga-wm>	 RECOVERY - Router interfaces on cr1-drmrs is OK: OK: host 185.15.58.128, interfaces up: 58, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:57:56] <icinga-wm>	 RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[05:58:16] <icinga-wm>	 RECOVERY - BFD status on cr1-eqiad is OK: OK: UP: 24 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[05:59:29] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti2030.codfw.wmnet with OS bullseye
[05:59:34] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti2030.codfw.wmnet with OS bullseye completed: - ganeti2030 (**PASS**)   - Downtimed on...
[06:00:05] <jouncebot>	 kormat, marostegui, and Amir1: Time to snap out of that daydream and deploy Primary database switchover. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220804T0600).
[06:01:28] <wikibugs>	 Puppet, Infrastructure-Foundations, Patch-For-Review, Sustainability: Enable bracketed-paste-mode for production shells (e.g. deployment, mwmaint) - https://phabricator.wikimedia.org/T293614 (MoritzMuehlenhoff) >>! In T293614#8127446, @Lucas_Werkmeister_WMDE wrote: > That’s great, thanks! In that...
[06:02:28] <wikibugs>	 SRE, Infrastructure-Foundations: Integrate Bullseye 11.4 point update - https://phabricator.wikimedia.org/T312637 (MoritzMuehlenhoff)
[06:05:14] <wikibugs>	 (CR) DCausse: "I believe that the version of the corresponding Chart.yaml must be changed for this change to be deployed as a new chart." [deployment-charts] - https://gerrit.wikimedia.org/r/819752 (https://phabricator.wikimedia.org/T314426) (owner: Ebernhardson)
[06:06:06] <_joe_>	 !log restarted memcached on mc2038 to pick up the actual production configuration
[06:06:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:06:20] <icinga-wm>	 RECOVERY - Memcached on mc2038 is OK: TCP OK - 0.032 second response time on 10.192.0.191 port 11214 https://wikitech.wikimedia.org/wiki/Memcached
[06:21:00] <jinxer-wm>	 (JobUnavailable) firing: (4) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:25:49] <wikibugs>	 SRE, Cloud-VPS, cloud-services-team (Kanban): CPU scaling governor audit - https://phabricator.wikimedia.org/T225713 (tstarling) I was doing comparative benchmarks of eqiad and codfw. @ori suggested that I look at CPU scaling as a possible reason for the discrepancy. The performance impact of setting...
[06:31:05] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster2001 is CRITICAL: instance=10.192.0.56 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[06:35:58] <jinxer-wm>	 (KubernetesRsyslogDown) resolved: rsyslog on kubestage2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[06:38:44] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster2001 is CRITICAL: instance=10.192.0.56 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[06:48:56] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster2001 is CRITICAL: instance=10.192.0.56 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[06:54:30] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2030.codfw.wmnet
[06:58:20] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.dns.netbox
[06:58:27] <logmsgbot>	 !log ayounsi@cumin1001 END (ERROR) - Cookbook sre.dns.netbox (exit_code=97)
[06:58:42] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.dns.netbox
[07:00:05] <jouncebot>	 Amir1, apergos, and jnuche: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC morning backport and config training deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220804T0700).
[07:00:12] <apergos>	 morning! there is one trainee signed up today but there are no patches in the queue.
[07:00:42] <apergos>	 I am in the google meet in case our trainee turns up, in which case I'll let them know to reschedule
[07:02:10] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[07:02:25] <jnuche>	 👍
[07:02:36] <jnuche>	 (and good morning!)
[07:02:37] <wikibugs>	 (CR) Ayounsi: "recheck" [dns] - https://gerrit.wikimedia.org/r/820089 (https://phabricator.wikimedia.org/T307221) (owner: Ayounsi)
[07:03:44] <wikibugs>	 (CR) CI reject: [V: -1] add include for 2620:0:862:fe08::/64 PTR [dns] - https://gerrit.wikimedia.org/r/820089 (https://phabricator.wikimedia.org/T307221) (owner: Ayounsi)
[07:03:48] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2030.codfw.wmnet
[07:05:12] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.dns.netbox
[07:07:57] <wikibugs>	 (PS1) Marostegui: mariadb: Downtime D3 databases [puppet] - https://gerrit.wikimedia.org/r/820369 (https://phabricator.wikimedia.org/T310146)
[07:08:53] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on es[2023-2025].codfw.wmnet with reason: codfw pdu maintenance
[07:09:08] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es[2023-2025].codfw.wmnet with reason: codfw pdu maintenance
[07:09:27] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti2030.codfw.wmnet to cluster codfw and group A
[07:09:36] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti2030.codfw.wmnet to cluster codfw and group A
[07:09:42] <wikibugs>	 (CR) Marostegui: [C: +2] mariadb: Downtime D3 databases [puppet] - https://gerrit.wikimedia.org/r/820369 (https://phabricator.wikimedia.org/T310146) (owner: Marostegui)
[07:10:20] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2030.codfw.wmnet
[07:11:40] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[07:11:48] <wikibugs>	 (CR) Ayounsi: "recheck" [dns] - https://gerrit.wikimedia.org/r/820089 (https://phabricator.wikimedia.org/T307221) (owner: Ayounsi)
[07:12:04] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster2001 is CRITICAL: instance=10.192.0.56 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[07:12:41] <wikibugs>	 (PS1) Marostegui: mariadb: Disable notifications DBs in C5 [puppet] - https://gerrit.wikimedia.org/r/820370 (https://phabricator.wikimedia.org/T310145)
[07:13:41] <wikibugs>	 (CR) Marostegui: [C: +2] mariadb: Disable notifications DBs in C5 [puppet] - https://gerrit.wikimedia.org/r/820370 (https://phabricator.wikimedia.org/T310145) (owner: Marostegui)
[07:14:04] <icinga-wm>	 PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:14:18] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 44, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:16:05] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db[2135,2160].codfw.wmnet with reason: codfw pdu maintenance
[07:16:13] <wikibugs>	 (CR) Ayounsi: [C: +2] add include for 2620:0:862:fe08::/64 PTR [dns] - https://gerrit.wikimedia.org/r/820089 (https://phabricator.wikimedia.org/T307221) (owner: Ayounsi)
[07:16:19] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db[2135,2160].codfw.wmnet with reason: codfw pdu maintenance
[07:16:20] <wikibugs>	 (PS2) Ayounsi: add include for 2620:0:862:fe08::/64 PTR [dns] - https://gerrit.wikimedia.org/r/820089 (https://phabricator.wikimedia.org/T307221)
[07:17:26] <wikibugs>	 (PS1) Marostegui: mariadb: Disable notifications DBs in C6 [puppet] - https://gerrit.wikimedia.org/r/820371 (https://phabricator.wikimedia.org/T310145)
[07:18:12] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2030.codfw.wmnet
[07:19:19] <wikibugs>	 (PS2) Marostegui: mariadb: Disable notifications DBs in C6 [puppet] - https://gerrit.wikimedia.org/r/820371 (https://phabricator.wikimedia.org/T310145)
[07:20:19] <wikibugs>	 (CR) Marostegui: [C: +2] mariadb: Disable notifications DBs in C6 [puppet] - https://gerrit.wikimedia.org/r/820371 (https://phabricator.wikimedia.org/T310145) (owner: Marostegui)
[07:21:16] <icinga-wm>	 PROBLEM - etcd request latencies on kubemaster2001 is CRITICAL: instance=10.192.0.56 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[07:23:18] <icinga-wm>	 RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:23:30] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 45, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:25:33] <wikibugs>	 (PS1) Slyngshede: Initial check in [debs/python-wmf-ldap] - https://gerrit.wikimedia.org/r/820372
[07:28:15] <wikibugs>	 SRE, SRE-swift-storage, ops-codfw, DBA, Patch-For-Review: (Need By:TBD) rack/setup/install row D new PDUs - https://phabricator.wikimedia.org/T310146 (Marostegui) All db*, es* hosts powered off.
[07:29:09] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti2030.codfw.wmnet to cluster codfw and group A
[07:30:04] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti2030.codfw.wmnet to cluster codfw and group A
[07:30:57] <wikibugs>	 SRE, ops-codfw, DBA: (Need By:TBD) rack/setup/install row C new PDUs - https://phabricator.wikimedia.org/T310145 (Marostegui) All db* hosts powered off
[07:31:03] <wikibugs>	 (PS1) Ayounsi: Revert "geodns: Map out African countries by DC latency" [dns] - https://gerrit.wikimedia.org/r/820272
[07:31:13] <wikibugs>	 (PS2) Ayounsi: Revert "geodns: Map out African countries by DC latency" [dns] - https://gerrit.wikimedia.org/r/820272
[07:32:19] <wikibugs>	 (CR) Ayounsi: [C: +2] Revert "geodns: Map out African countries by DC latency" [dns] - https://gerrit.wikimedia.org/r/820272 (owner: Ayounsi)
[07:35:45] <jinxer-wm>	 (JobUnavailable) firing: (5) Reduced availability for job mysql-test in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[07:36:21] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET
[07:36:35] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy2004 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[07:40:18] <jinxer-wm>	 (ProbeDown) firing: Service api-https:443 has failed probes (http_api-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#api-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:40:45] <jinxer-wm>	 (JobUnavailable) firing: (6) Reduced availability for job mysql-labs in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[07:42:25] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686 (MoritzMuehlenhoff)
[07:42:39] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686 (MoritzMuehlenhoff)
[07:44:40] <wikibugs>	 (PS1) Marostegui: instances.yaml: Remove db2089 from dbctl [puppet] - https://gerrit.wikimedia.org/r/820374 (https://phabricator.wikimedia.org/T313799)
[07:45:18] <jinxer-wm>	 (ProbeDown) resolved: Service api-https:443 has failed probes (http_api-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#api-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:46:30] <wikibugs>	 SRE, ops-codfw: codfw: Master PDU rack/setup row A, row B, rowC and row D task - https://phabricator.wikimedia.org/T309956 (ayounsi) p:Medium→High There are currently 3 Icinga alerts for servers with a failed PSU: https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=kafka-main2002&...
[07:46:57] <godog>	 !log grow sda/sdb 3 by 100G on thanos-be1003 - T314275
[07:46:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:47:00] <stashbot>	 T314275: thanos-be2004 sdb3 fully used - https://phabricator.wikimedia.org/T314275
[07:47:05] <godog>	 !log grow sda/sdb 3 by 100G on thanos-be2002 - T314275
[07:47:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:48:46] <wikibugs>	 (CR) Marostegui: [C: +2] instances.yaml: Remove db2089 from dbctl [puppet] - https://gerrit.wikimedia.org/r/820374 (https://phabricator.wikimedia.org/T313799) (owner: Marostegui)
[07:49:57] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove db2089 from dbctl T313799', diff saved to https://phabricator.wikimedia.org/P32280 and previous config saved to /var/cache/conftool/dbconfig/20220804-074957-marostegui.json
[07:50:02] <stashbot>	 T313799: decommission db2089 - https://phabricator.wikimedia.org/T313799
[07:50:10] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] logstash: route k8s messages to k8s partition [puppet] - https://gerrit.wikimedia.org/r/820182 (https://phabricator.wikimedia.org/T314381) (owner: Cwhite)
[07:50:16] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] o11y: alert on Icinga max check latency [alerts] - https://gerrit.wikimedia.org/r/820072 (https://phabricator.wikimedia.org/T314353) (owner: Filippo Giunchedi)
[07:50:47] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET
[07:50:52] <wikibugs>	 (CR) Ladsgroup: [C: +1] production-m5.sql: Remove grants for labweb1001/labweb1002 [puppet] - https://gerrit.wikimedia.org/r/820286 (https://phabricator.wikimedia.org/T314528) (owner: Marostegui)
[07:52:34] <wikibugs>	 SRE, ops-codfw, DBA: (Need By:TBD) rack/setup/install row C new PDUs - https://phabricator.wikimedia.org/T310145 (Marostegui) @Papaul we have some doubts about whether C1 was done or not. Can you update the list of racks that were done yesterday? Thanks!
[07:54:28] <wikibugs>	 (CR) Marostegui: [C: +2] production-m5.sql: Remove grants for labweb1001/labweb1002 [puppet] - https://gerrit.wikimedia.org/r/820286 (https://phabricator.wikimedia.org/T314528) (owner: Marostegui)
[07:55:11] <marostegui>	 !log Remove grants for 208.80.154.160/208.80.155.109 T314528
[07:55:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:55:15] <stashbot>	 T314528: Revoke MariaDB grants for labweb1001/1002 - https://phabricator.wikimedia.org/T314528
[07:58:43] <wikibugs>	 (PS4) Ladsgroup: site.pp: Combine mariadb replicas and master in each section [puppet] - https://gerrit.wikimedia.org/r/820102
[07:59:32] <wikibugs>	 SRE: consider hybrid caching options for ssd+disk - https://phabricator.wikimedia.org/T88992 (fgiunchedi) a:fgiunchedi→None
[07:59:46] <wikibugs>	 SRE, SRE-swift-storage, Patch-For-Review: 'swift' user/group IDs should be consistent across the fleet - https://phabricator.wikimedia.org/T123918 (fgiunchedi) a:fgiunchedi→None
[08:00:02] <wikibugs>	 SRE, Security: Disable agent forwarding to important hosts - https://phabricator.wikimedia.org/T198138 (fgiunchedi) a:fgiunchedi→None
[08:00:09] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET
[08:00:18] <jinxer-wm>	 (ProbeDown) firing: Service api-https:443 has failed probes (http_api-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#api-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:01:15] <wikibugs>	 SRE, Observability-Alerting, Patch-For-Review, SRE Observability (FY2022/2023-Q1), User-fgiunchedi: Icinga downtimes not working - https://phabricator.wikimedia.org/T314353 (fgiunchedi)
[08:02:06] <wikibugs>	 (CR) Ladsgroup: [C: +2] site.pp: Combine mariadb replicas and master in each section [puppet] - https://gerrit.wikimedia.org/r/820102 (owner: Ladsgroup)
[08:03:06] <wikibugs>	 SRE, Observability-Alerting, User-fgiunchedi: Icinga downtimes not working - https://phabricator.wikimedia.org/T314353 (fgiunchedi) Open→Stalled We are now alerting on elevated max check latency, I'm going to stall the task and re-evaluate in a couple of months if we need to deploy auto-remed...
[08:04:11] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on kubestagetcd2002.codfw.wmnet with reason: Switch instance to plain disks, T311686
[08:04:16] <stashbot>	 T311686: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686
[08:04:26] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on kubestagetcd2002.codfw.wmnet with reason: Switch instance to plain disks, T311686
[08:04:55] <wikibugs>	 (PS1) Ladsgroup: Start reading from new templatelinks columns in commons [mediawiki-config] - https://gerrit.wikimedia.org/r/820376 (https://phabricator.wikimedia.org/T306673)
[08:05:18] <jinxer-wm>	 (ProbeDown) resolved: Service api-https:443 has failed probes (http_api-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#api-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:07:51] <icinga-wm>	 RECOVERY - BGP status on cr2-eqord is OK: BGP OK - up: 162, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:12:45] <wikibugs>	 (PS2) Giuseppe Lavagetto: scap: remove configuration for deploy* [puppet] - https://gerrit.wikimedia.org/r/819630
[08:12:47] <wikibugs>	 (Abandoned) Slyngshede: Initial check in [debs/python-wmf-ldap] - https://gerrit.wikimedia.org/r/820372 (owner: Slyngshede)
[08:12:58] <wikibugs>	 (CR) Giuseppe Lavagetto: scap: remove configuration for deploy* (1 comment) [puppet] - https://gerrit.wikimedia.org/r/819630 (owner: Giuseppe Lavagetto)
[08:13:01] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET
[08:16:24] <logmsgbot>	 !log jelto@cumin1001 START - Cookbook sre.hosts.downtime for 10:45:00 on mc[2047-2048].codfw.wmnet with reason: PDU swap
[08:16:38] <logmsgbot>	 !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:45:00 on mc[2047-2048].codfw.wmnet with reason: PDU swap
[08:16:43] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on appserver in codfw on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.2.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=appserver&var-method=GET
[08:18:14] <wikibugs>	 SRE, ops-eqiad, DBA, DC-Ops: Q3:(Need By: TBD) rack/setup/install db1185.eqiad.wmnet - db1195.eqiad.wmnet - https://phabricator.wikimedia.org/T306928 (Marostegui) Any rough ETA about when these hosts will be ready? We are also seeing some mgmt alerts for db1186, db1187 and db1188 regarding their...
[08:18:35] <icinga-wm>	 RECOVERY - High average GET latency for mw requests on appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=appserver&var-method=GET
[08:19:15] <jelto>	 !log power off mc2047 and mc2048
[08:19:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:19:44] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +2] scap: remove configuration for deploy* [puppet] - https://gerrit.wikimedia.org/r/819630 (owner: Giuseppe Lavagetto)
[08:19:48] <wikibugs>	 (CR) Ladsgroup: auto_schema: Start depooling codfw replicas (1 comment) [software] - https://gerrit.wikimedia.org/r/820142 (https://phabricator.wikimedia.org/T314486) (owner: Ladsgroup)
[08:19:58] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1132, db111, db1127, db1143', diff saved to https://phabricator.wikimedia.org/P32281 and previous config saved to /var/cache/conftool/dbconfig/20220804-081958-root.json
[08:21:51] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET
[08:22:03] <wikibugs>	 SRE-OnFire, DBA, Sustainability (Incident Followup): Investigate mariadb 10.6 performance regression during spikes/high load - https://phabricator.wikimedia.org/T311106 (Marostegui) I am off starting tomorrow and back online on the 17th. As we are not fully sure about the situation with 10.6 hosts, I...
[08:22:19] <logmsgbot>	 !log oblivian@mwmaint1002 pull aborted:  (duration: 00m 11s)
[08:23:53] <wikibugs>	 (CR) Ayounsi: "Thanks for the refactor!" [software/spicerack] - https://gerrit.wikimedia.org/r/816701 (owner: Ayounsi)
[08:24:41] <logmsgbot>	 !log jelto@cumin1001 START - Cookbook sre.hosts.downtime for 10:36:00 on mc[2049-2050].codfw.wmnet with reason: PDU swap
[08:24:47] <wikibugs>	 (CR) Marostegui: "Good, so after merging this, we need to start doing codfw master manually or listing it, right?" [software] - https://gerrit.wikimedia.org/r/820142 (https://phabricator.wikimedia.org/T314486) (owner: Ladsgroup)
[08:24:55] <logmsgbot>	 !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:36:00 on mc[2049-2050].codfw.wmnet with reason: PDU swap
[08:26:09] <wikibugs>	 (CR) Ayounsi: PeeringDB API: initial commit (2 comments) [software/spicerack] - https://gerrit.wikimedia.org/r/816701 (owner: Ayounsi)
[08:26:20] <jelto>	 !log power off mc2049 and mc2050
[08:26:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:26:33] <wikibugs>	 (CR) Ladsgroup: auto_schema: Start depooling codfw replicas (1 comment) [software] - https://gerrit.wikimedia.org/r/820142 (https://phabricator.wikimedia.org/T314486) (owner: Ladsgroup)
[08:27:41] <wikibugs>	 (CR) Marostegui: [C: +1] auto_schema: Start depooling codfw replicas [software] - https://gerrit.wikimedia.org/r/820142 (https://phabricator.wikimedia.org/T314486) (owner: Ladsgroup)
[08:28:58] <moritzm>	 !log imported gsasl 1.8.0-8+wmf1 to stretch-wikimedia
[08:29:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:31:38] <wikibugs>	 (PS1) Giuseppe Lavagetto: maintenance: restart php-fpm if needed [puppet] - https://gerrit.wikimedia.org/r/820378
[08:32:29] <jelto>	 !log kubectl cordon kubernetes2022.codfw.wmnet
[08:32:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:34:10] <wikibugs>	 (PS2) Giuseppe Lavagetto: maintenance: restart php-fpm if needed [puppet] - https://gerrit.wikimedia.org/r/820378
[08:35:03] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET
[08:35:56] <jelto>	 !log kubectl drain kubernetes2022.codfw.wmnet
[08:35:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:37:54] <logmsgbot>	 !log jelto@cumin1001 conftool action : set/pooled=no; selector: name=kubernetes2022.codfw.wmnet
[08:38:57] <logmsgbot>	 !log jelto@cumin1001 START - Cookbook sre.hosts.downtime for 10:22:00 on kubernetes2022.codfw.wmnet with reason: PDU swap
[08:39:10] <logmsgbot>	 !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:22:00 on kubernetes2022.codfw.wmnet with reason: PDU swap
[08:39:15] <icinga-wm>	 RECOVERY - etcd request latencies on kubemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[08:39:15] <wikibugs>	 (PS2) Giuseppe Lavagetto: scap: do not restart php on the mwmaint servers [puppet] - https://gerrit.wikimedia.org/r/819631
[08:39:25] <wikibugs>	 (CR) Giuseppe Lavagetto: scap: do not restart php on the mwmaint servers (1 comment) [puppet] - https://gerrit.wikimedia.org/r/819631 (owner: Giuseppe Lavagetto)
[08:39:31] <wikibugs>	 (PS3) Giuseppe Lavagetto: scap: do not restart php on the mwmaint servers [puppet] - https://gerrit.wikimedia.org/r/819631
[08:39:35] <icinga-wm>	 RECOVERY - k8s API server requests latencies on kubemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[08:39:45] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET
[08:41:18] <_joe_>	 I have no idea what the reason for ^^ is
[08:41:26] <_joe_>	 and I don't have time to investigate tbh
[08:43:44] <logmsgbot>	 !log oblivian@deploy1002 Synchronized README: testing new scap configuration (duration: 03m 18s)
[08:45:17] <jelto>	 !log power off kubernetes2022
[08:45:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:47:13] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +2] scap: do not restart php on the mwmaint servers [puppet] - https://gerrit.wikimedia.org/r/819631 (owner: Giuseppe Lavagetto)
[08:48:27] <moritzm>	 !log draining ganeti2017 T311686
[08:48:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:48:32] <stashbot>	 T311686: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686
[08:48:32] <wikibugs>	 (CR) Btullis: [C: +2] Add an option to use the PKI for etcd intra-cluster certificates [puppet] - https://gerrit.wikimedia.org/r/820090 (https://phabricator.wikimedia.org/T313129) (owner: Btullis)
[08:49:56] <_joe_>	 btullis: ahem
[08:50:13] <_joe_>	 I'd like to be consulted on patches that can change the behaviour of etcd
[08:50:37] <_joe_>	 so, before I merge it with mine, did you stop puppet on all etcd nodes before merging this patch?
[08:50:39] <btullis>	 _joe_: Sincere apologies. It's a noop.
[08:51:03] <_joe_>	 btullis: did you verify that for all etcd clusters in prod? if so it's ok
[08:51:04] <btullis>	 So no, I wasn't planning to touch any existing etcd clusters. 
[08:51:19] <btullis>	 Yes, I ran a PCC check for all etcd roles.
[08:51:27] <_joe_>	 ok then :)
[08:51:27] <btullis>	 Only a parameter change that defaults to false.
[08:51:50] <_joe_>	 ok, merging :)
[08:51:56] <btullis>	 I would have added you for review, but I didn't know who would like to have been consulted.
[08:51:59] <btullis>	 Thanks.
[08:52:31] <_joe_>	 btullis: no problems, that's why I was telling you
[08:52:49] <btullis>	 Gotcha, thanks.
[08:52:55] <wikibugs>	 (PS1) Ayounsi: Interface description: don't add the z_side when disabled [software/homer/deploy] - https://gerrit.wikimedia.org/r/820380
[08:53:51] <icinga-wm>	 PROBLEM - IPMI Sensor Status on es2021 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[08:53:53] <_joe_>	 usually "git blame" on the files you're changing is a good starting point to figure out who to ask for a review
[08:54:13] <_joe_>	 {{merged}} btw :)
[08:54:15] <marostegui>	 I will open a task for es2021
[08:55:07] <icinga-wm>	 PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:55:20] <btullis>	 Strangely, my PCC run has disappeared from the gerrit comments. 
[08:55:48] <wikibugs>	 ops-codfw, DBA: es2021 (B3) now power supply redudancy - https://phabricator.wikimedia.org/T314559 (Marostegui)
[08:55:58] <wikibugs>	 ops-codfw, DBA: es2021 (B3) now power supply redudancy - https://phabricator.wikimedia.org/T314559 (Marostegui) p:Triage→Medium
[08:56:11] <btullis>	 But here was one: https://puppet-compiler.wmflabs.org/pcc-worker1001/1384/
[08:56:18] <wikibugs>	 ops-codfw, DBA: es2021 (B3) lost power supply redundancy - https://phabricator.wikimedia.org/T314559 (Marostegui)
[08:56:25] <wikibugs>	 (PS1) Muehlenhoff: Add library hint for gsasl [puppet] - https://gerrit.wikimedia.org/r/820383
[08:56:34] <jayme>	 jelto: ^ bgp status needs an ack again
[08:56:54] <jayme>	 s/ack/downtime/
[08:56:58] <jinxer-wm>	 (KubernetesCalicoDown) firing: kubernetes2022.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[08:57:05] <icinga-wm>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:57:45] <jelto>	 I'll do that, same for CalicoDown (I thought I created the correct silence in alertmanager :/)
[08:57:52] <logmsgbot>	 !log oblivian@mwmaint1002 pull aborted:  (duration: 00m 18s)
[08:58:17] <moritzm>	 !log installing gsasl security updates
[08:58:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:59:13] <wikibugs>	 (PS2) Ayounsi: Interface description: don't add the z_side when disabled [software/homer/deploy] - https://gerrit.wikimedia.org/r/820380
[09:02:08] <wikibugs>	 (PS1) Urbanecm: SpecialEditGrowthConfigLogger: Update schema version [extensions/GrowthExperiments] (wmf/1.39.0-wmf.23) - https://gerrit.wikimedia.org/r/820276 (https://phabricator.wikimedia.org/T314173)
[09:03:09] <logmsgbot>	 !log oblivian@mwmaint1002 pull aborted:  (duration: 00m 06s)
[09:04:17] <icinga-wm>	 PROBLEM - SSH on db1110.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:05:12] <urbanecm>	 jouncebot: nowandnext
[09:05:12] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 54 minute(s)
[09:05:12] <jouncebot>	 In 0 hour(s) and 54 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220804T1000)
[09:05:19] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET
[09:05:29] <wikibugs>	 (CR) Urbanecm: [C: +2] SpecialEditGrowthConfigLogger: Update schema version [extensions/GrowthExperiments] (wmf/1.39.0-wmf.23) - https://gerrit.wikimedia.org/r/820276 (https://phabricator.wikimedia.org/T314173) (owner: Urbanecm)
[09:10:01] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET
[09:11:43] <icinga-wm>	 PROBLEM - BFD status on cr3-knams is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[09:11:57] <icinga-wm>	 PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[09:11:59] <icinga-wm>	 PROBLEM - BFD status on cr2-eqdfw is CRITICAL: CRIT: Down: 3 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[09:12:09] <icinga-wm>	 PROBLEM - OSPF status on cr2-drmrs is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[09:12:55] <logmsgbot>	 !log jelto@cumin1001 conftool action : set/pooled=inactive; selector: name=kubernetes2022.codfw.wmnet
[09:13:09] <icinga-wm>	 PROBLEM - BFD status on cr2-drmrs is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[09:13:23] <icinga-wm>	 PROBLEM - OSPF status on cr3-knams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[09:13:31] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 3/6 UP : OSPFv3: 3/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[09:13:47] <wikibugs>	 (PS1) Giuseppe Lavagetto: scap: do not use double quotes to define an empty value [puppet] - https://gerrit.wikimedia.org/r/820384
[09:14:39] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET
[09:14:55] <wikibugs>	 (PS1) Ayounsi: Don't add server name on disabled interfaces [cookbooks] - https://gerrit.wikimedia.org/r/820385
[09:15:29] <icinga-wm>	 RECOVERY - BFD status on cr2-drmrs is OK: OK: UP: 9 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[09:15:43] <wikibugs>	 (CR) Giuseppe Lavagetto: [V: +2 C: +2] scap: do not use double quotes to define an empty value [puppet] - https://gerrit.wikimedia.org/r/820384 (owner: Giuseppe Lavagetto)
[09:15:43] <icinga-wm>	 RECOVERY - OSPF status on cr3-knams is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[09:15:51] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[09:15:58] <wikibugs>	 (PS4) Urbanecm: [beta] Growth: Switch to structured mentor list at all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/808269 (https://phabricator.wikimedia.org/T310905)
[09:16:06] <wikibugs>	 (CR) Urbanecm: [C: +2] [beta] Growth: Switch to structured mentor list at all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/808269 (https://phabricator.wikimedia.org/T310905) (owner: Urbanecm)
[09:16:18] <wikibugs>	 (PS2) Urbanecm: testwiki: Growth: Switch to structured mentor list [mediawiki-config] - https://gerrit.wikimedia.org/r/819053 (https://phabricator.wikimedia.org/T310905)
[09:16:23] <icinga-wm>	 RECOVERY - BFD status on cr3-knams is OK: OK: UP: 8 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[09:16:26] <wikibugs>	 (CR) Urbanecm: [C: +2] testwiki: Growth: Switch to structured mentor list [mediawiki-config] - https://gerrit.wikimedia.org/r/819053 (https://phabricator.wikimedia.org/T310905) (owner: Urbanecm)
[09:16:35] <icinga-wm>	 RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[09:16:39] <icinga-wm>	 RECOVERY - BFD status on cr2-eqdfw is OK: OK: UP: 13 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[09:16:47] <icinga-wm>	 RECOVERY - OSPF status on cr2-drmrs is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[09:17:17] <wikibugs>	 (PS1) Marostegui: mariadb: Decommission db2089 [puppet] - https://gerrit.wikimedia.org/r/820386 (https://phabricator.wikimedia.org/T313799)
[09:17:24] <wikibugs>	 (Merged) jenkins-bot: [beta] Growth: Switch to structured mentor list at all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/808269 (https://phabricator.wikimedia.org/T310905) (owner: Urbanecm)
[09:17:44] <wikibugs>	 (Merged) jenkins-bot: testwiki: Growth: Switch to structured mentor list [mediawiki-config] - https://gerrit.wikimedia.org/r/819053 (https://phabricator.wikimedia.org/T310905) (owner: Urbanecm)
[09:18:25] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.decommission for hosts db2089.codfw.wmnet
[09:20:14] <wikibugs>	 (PS1) Phuedx: beta: Remove $wgMediaViewerNetworkPerformanceSamplingFactor [mediawiki-config] - https://gerrit.wikimedia.org/r/820389 (https://phabricator.wikimedia.org/T310890)
[09:21:03] <icinga-wm>	 PROBLEM - Widespread puppet agent failures- no resources reported on alert1001 is CRITICAL: 0.01132 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet
[09:21:13] <icinga-wm>	 RECOVERY - Check systemd state on search-loader2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:21:17] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[09:21:55] <wikibugs>	 (03CR) 10Cathal Mooney: [C: 03+1] "+1.  I can see how it'd be a nice to have but agree it's not worth adding steps for other SREs." [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/820380 (owner: 10Ayounsi)
[09:22:12] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[09:22:13] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[09:22:21] <wikibugs>	 (03CR) 10Ayounsi: [V: 03+2 C: 03+2] Interface description: don't add the z_side when disabled [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/820380 (owner: 10Ayounsi)
[09:23:03] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.dns.netbox
[09:23:09] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[09:23:59] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET
[09:25:29] <wikibugs>	 (03CR) 10Cathal Mooney: [C: 03+1] "LGTM!" [cookbooks] - 10https://gerrit.wikimedia.org/r/820385 (owner: 10Ayounsi)
[09:25:47] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.deploy.python-code homer to cumin2002.codfw.wmnet,cumin1001.eqiad.wmnet with reason: wmf-netbox.py update - ayounsi@cumin1001
[09:26:10] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] mariadb: Decommission db2089 [puppet] - 10https://gerrit.wikimedia.org/r/820386 (https://phabricator.wikimedia.org/T313799) (owner: 10Marostegui)
[09:26:18] <wikibugs>	 (03PS1) 10Jbond: P:etcd::v3: fix typo [puppet] - 10https://gerrit.wikimedia.org/r/820390
[09:26:35] <logmsgbot>	 !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 0614a39bf15252c95a96565dd7c986237f3d3323: testwiki: Growth: Switch to structured mentor list (T310905) (duration: 03m 38s)
[09:26:39] <jbond>	 btullis: FYI your change is causing puppet failures, I think the above fixes it, about to merge
[09:26:40] <stashbot>	 T310905: Deploy structured wikitext mentor list to Wikimedia wikis - https://phabricator.wikimedia.org/T310905
[09:26:41] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+2] Don't add server name on disabled interfaces [cookbooks] - 10https://gerrit.wikimedia.org/r/820385 (owner: 10Ayounsi)
[09:26:44] <wikibugs>	 10ops-codfw, 10decommission-hardware: decommission db2089 - https://phabricator.wikimedia.org/T313799 (10Marostegui) a:03Papaul
[09:26:45] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[09:26:45] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db2089.codfw.wmnet
[09:26:48] <wikibugs>	 (03Merged) 10jenkins-bot: SpecialEditGrowthConfigLogger: Update schema version [extensions/GrowthExperiments] (wmf/1.39.0-wmf.23) - 10https://gerrit.wikimedia.org/r/820276 (https://phabricator.wikimedia.org/T314173) (owner: 10Urbanecm)
[09:26:49] <wikibugs>	 10ops-codfw, 10decommission-hardware: decommission db2089 - https://phabricator.wikimedia.org/T313799 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by marostegui@cumin1001 for hosts: `db2089.codfw.wmnet` - db2089.codfw.wmnet (**PASS**)   - Downtimed host on Icinga/Alertmanager   - Found phys...
[09:26:56] <wikibugs>	 10ops-codfw, 10decommission-hardware: decommission db2089 - https://phabricator.wikimedia.org/T313799 (10Marostegui) @Papaul this is ready for you
[09:27:24] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) homer to cumin2002.codfw.wmnet,cumin1001.eqiad.wmnet with reason: wmf-netbox.py update - ayounsi@cumin1001
[09:28:05] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] P:etcd::v3: fix typo [puppet] - 10https://gerrit.wikimedia.org/r/820390 (owner: 10Jbond)
[09:28:19] <wikibugs>	 (03CR) 10Jbond: [V: 03+2 C: 03+2] P:etcd::v3: fix typo [puppet] - 10https://gerrit.wikimedia.org/r/820390 (owner: 10Jbond)
[09:28:19] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[09:29:15] <wikibugs>	 (03PS1) 10Marostegui: db2177: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/820391 (https://phabricator.wikimedia.org/T311494)
[09:29:19] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[09:29:20] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[09:29:35] <jbond>	 btullis: https://gerrit.wikimedia.org/r/c/operations/puppet/+/820390 has fixed the issue
[09:29:39] <wikibugs>	 (03PS1) 10Urbanecm: testwiki: Growth: Assign enrollasmentor to * [mediawiki-config] - 10https://gerrit.wikimedia.org/r/820392 (https://phabricator.wikimedia.org/T310905)
[09:29:41] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] testwiki: Growth: Assign enrollasmentor to * [mediawiki-config] - 10https://gerrit.wikimedia.org/r/820392 (https://phabricator.wikimedia.org/T310905) (owner: 10Urbanecm)
[09:30:21] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[09:30:34] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] db2177: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/820391 (https://phabricator.wikimedia.org/T311494) (owner: 10Marostegui)
[09:30:50] <wikibugs>	 (03Merged) 10jenkins-bot: Don't add server name on disabled interfaces [cookbooks] - 10https://gerrit.wikimedia.org/r/820385 (owner: 10Ayounsi)
[09:31:14] <wikibugs>	 (03Merged) 10jenkins-bot: testwiki: Growth: Assign enrollasmentor to * [mediawiki-config] - 10https://gerrit.wikimedia.org/r/820392 (https://phabricator.wikimedia.org/T310905) (owner: 10Urbanecm)
[09:31:34] <logmsgbot>	 !log jelto@cumin1001 START - Cookbook sre.hosts.downtime for 9:30:00 on 9 hosts with reason: PDU swap
[09:31:41] <logmsgbot>	 !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 9:30:00 on 9 hosts with reason: PDU swap
[09:32:29] <jelto>	 !log set/pooled=inactive mw22[71-79].codfw.wmnet
[09:32:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:33:00] <wikibugs>	 (03PS1) 10Marostegui: instances.yaml: Add db2177 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/820393 (https://phabricator.wikimedia.org/T311494)
[09:35:09] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] instances.yaml: Add db2177 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/820393 (https://phabricator.wikimedia.org/T311494) (owner: 10Marostegui)
[09:35:27] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[09:35:31] <logmsgbot>	 !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: ddcd333015bb58a98709a5005a5db7e8519dd0a5: testwiki: Growth: Assign enrollasmentor to * (T310905) (duration: 03m 41s)
[09:35:34] <stashbot>	 T310905: Deploy structured wikitext mentor list to Wikimedia wikis - https://phabricator.wikimedia.org/T310905
[09:36:28] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[09:36:29] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[09:37:05] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Add db2177 to s3 T311494', diff saved to https://phabricator.wikimedia.org/P32282 and previous config saved to /var/cache/conftool/dbconfig/20220804-093704-marostegui.json
[09:37:08] <stashbot>	 T311494: Productionize db2175.codfw.wmnet - db2182.codfw.wmnet - https://phabricator.wikimedia.org/T311494
[09:37:28] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[09:37:45] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+2] jwt_authorizer: Provide microservice for JSON Web Token authorization [puppet] - 10https://gerrit.wikimedia.org/r/816018 (https://phabricator.wikimedia.org/T308501) (owner: 10Dduvall)
[09:37:56] <wikibugs>	 (03PS1) 10Lucas Werkmeister (WMDE): Remove unused $wgWBCSEnableDispatchingQueryBuilder [mediawiki-config] - 10https://gerrit.wikimedia.org/r/820397
[09:37:58] <wikibugs>	 (03PS1) 10Lucas Werkmeister (WMDE): Remove unused SearchSettingsForSDC.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/820398
[09:38:18] <jinxer-wm>	 (ProbeDown) firing: Service api-https:443 has failed probes (http_api-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#api-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:38:38] <icinga-wm>	 RECOVERY - Widespread puppet agent failures- no resources reported on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.0004921 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet
[09:38:50] <logmsgbot>	 !log urbanecm@deploy1002 Synchronized php-1.39.0-wmf.23/extensions/GrowthExperiments/includes/EventLogging/SpecialEditGrowthConfigLogger.php: ba67dd940217e9f786f4349b4da0fe088475fde9: SpecialEditGrowthConfigLogger: Update schema version (T314173, T312148) (duration: 03m 18s)
[09:38:56] * urbanecm done
[09:38:56] <stashbot>	 T314173: editgrowthconfig schema: '' should NOT have additional properties, - https://phabricator.wikimedia.org/T314173
[09:38:56] <stashbot>	 T312148: Add instrumentation to Special:EditGrowthConfig - https://phabricator.wikimedia.org/T312148
[09:39:53] <jelto>	 !log power off mw22[71-79].codfw.wmnet
[09:39:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:40:14] <icinga-wm>	 PROBLEM - puppet last run on mwdebug2001 is CRITICAL: CRITICAL: Puppet has been disabled for 604998 seconds, message: joe messing with php-fpm - oblivian, last run 7 days ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[09:41:01] <wikibugs>	 (03PS7) 10Thiemo Kreuz (WMDE): Streamline/modernize code in MWConfigCacheGenerator [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737857
[09:41:39] <_joe_>	 sogj O
[09:42:01] <_joe_>	 err. off by one. Sigh I'll fix mwdebug2001
[09:42:35] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[09:43:18] <jinxer-wm>	 (ProbeDown) resolved: Service api-https:443 has failed probes (http_api-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#api-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:43:33] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[09:43:35] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[09:44:32] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[09:45:40] <icinga-wm>	 RECOVERY - puppet last run on mwdebug2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[09:47:39] <wikibugs>	 (03CR) 10Ayounsi: [V: 03+2 C: 03+2] Update requirements [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/820399 (owner: 10Ayounsi)
[09:47:59] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.deploy.python-code homer to cumin2002.codfw.wmnet,cumin1001.eqiad.wmnet with reason: update requirements - ayounsi@cumin1001
[09:49:28] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET
[09:49:35] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) homer to cumin2002.codfw.wmnet,cumin1001.eqiad.wmnet with reason: update requirements - ayounsi@cumin1001
[09:50:01] <wikibugs>	 10Puppet, 10Infrastructure-Foundations, 10Patch-For-Review, 10Sustainability: Enable bracketed-paste-mode for production shells (e.g. deployment, mwmaint) - https://phabricator.wikimedia.org/T293614 (10Lucas_Werkmeister_WMDE) 05Open→03Stalled
[09:54:44] <wikibugs>	 (03PS1) 10Lucas Werkmeister (WMDE): Remove unused $wgIncludejQueryMigrate [mediawiki-config] - 10https://gerrit.wikimedia.org/r/820404 (https://phabricator.wikimedia.org/T280944)
[09:55:42] <icinga-wm>	 RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:56:41] <Lucas_WMDE>	 yay, stashbot is back
[09:57:05] <Lucas_WMDE>	 should we manually repeat the logmsgbot !logs since 9:42 UTC?
[10:00:04] <jouncebot>	 mvolz: How many deployers does it take to do Services – Citoid / Zotero deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220804T1000).
[10:00:36] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.deploy.python-code homer to cumin2002.codfw.wmnet,cumin1001.eqiad.wmnet with reason: update requirements + wmf-netbox - ayounsi@cumin1001
[10:00:44] <jynus>	 !log stop db2099 T310145
[10:00:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:00:48] <stashbot>	 T310145: (Need By:TBD) rack/setup/install row C new PDUs - https://phabricator.wikimedia.org/T310145
[10:02:14] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) homer to cumin2002.codfw.wmnet,cumin1001.eqiad.wmnet with reason: update requirements + wmf-netbox - ayounsi@cumin1001
[10:03:17] <Lucas_WMDE>	 !log stashbot temporarily parted and lost several logs between 9:42 UTC and 9:49 UTC; mainly mwdebug helmfile start/done, also ayounsi sre.deploy.python-code cookbook to cumin1001, cumin2002; see IRC logs
[10:03:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:04:16] <icinga-wm>	 RECOVERY - SSH on db1110.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[10:05:00] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: trafficserver: allow x-wikimedia-debug to pick a php backend [puppet] - 10https://gerrit.wikimedia.org/r/819510 (https://phabricator.wikimedia.org/T312653)
[10:05:06] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: trafficserver: allow x-wikimedia-debug to pick a php backend (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/819510 (https://phabricator.wikimedia.org/T312653) (owner: 10Giuseppe Lavagetto)
[10:11:20] <icinga-wm>	 PROBLEM - Host parse2014 is DOWN: PING CRITICAL - Packet loss = 100%
[10:12:03] <jinxer-wm>	 (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:12:46] <icinga-wm>	 PROBLEM - Host parse2011 is DOWN: PING CRITICAL - Packet loss = 100%
[10:12:58] <icinga-wm>	 PROBLEM - Host parse2012 is DOWN: PING CRITICAL - Packet loss = 100%
[10:13:00] <icinga-wm>	 PROBLEM - Host parse2013 is DOWN: PING CRITICAL - Packet loss = 100%
[10:13:00] <icinga-wm>	 PROBLEM - Host parse2015 is DOWN: PING CRITICAL - Packet loss = 100%
[10:15:16] <wikibugs>	 (03PS14) 10Thiemo Kreuz (WMDE): Use more compact PHP7 syntax where possible [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737859
[10:16:46] <icinga-wm>	 PROBLEM - Host mw2352 is DOWN: PING CRITICAL - Packet loss = 100%
[10:17:03] <jinxer-wm>	 (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:17:22] <icinga-wm>	 PROBLEM - Host mw2350 is DOWN: PING CRITICAL - Packet loss = 100%
[10:17:22] <icinga-wm>	 PROBLEM - Host mw2351 is DOWN: PING CRITICAL - Packet loss = 100%
[10:17:22] <icinga-wm>	 PROBLEM - Host mw2353 is DOWN: PING CRITICAL - Packet loss = 100%
[10:17:22] <icinga-wm>	 PROBLEM - Host mw2354 is DOWN: PING CRITICAL - Packet loss = 100%
[10:17:22] <icinga-wm>	 PROBLEM - Host mw2355 is DOWN: PING CRITICAL - Packet loss = 100%
[10:17:23] <icinga-wm>	 PROBLEM - Host mw2356 is DOWN: PING CRITICAL - Packet loss = 100%
[10:17:23] <icinga-wm>	 PROBLEM - Host mw2357 is DOWN: PING CRITICAL - Packet loss = 100%
[10:17:24] <icinga-wm>	 PROBLEM - Host mw2358 is DOWN: PING CRITICAL - Packet loss = 100%
[10:17:24] <icinga-wm>	 PROBLEM - Host mw2360 is DOWN: PING CRITICAL - Packet loss = 100%
[10:17:26] <icinga-wm>	 PROBLEM - Host mw2359 is DOWN: PING CRITICAL - Packet loss = 100%
[10:18:18] <wikibugs>	 (03CR) 10Thiemo Kreuz (WMDE): Use more compact PHP7 syntax where possible (033 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737859 (owner: 10Thiemo Kreuz (WMDE))
[10:18:24] <icinga-wm>	 PROBLEM - Host mw2363 is DOWN: PING CRITICAL - Packet loss = 100%
[10:18:28] <icinga-wm>	 PROBLEM - Host mw2376 is DOWN: PING CRITICAL - Packet loss = 100%
[10:18:28] <icinga-wm>	 PROBLEM - Host mw2361 is DOWN: PING CRITICAL - Packet loss = 100%
[10:18:28] <icinga-wm>	 PROBLEM - Host mw2362 is DOWN: PING CRITICAL - Packet loss = 100%
[10:18:28] <icinga-wm>	 PROBLEM - Host mw2364 is DOWN: PING CRITICAL - Packet loss = 100%
[10:18:28] <icinga-wm>	 PROBLEM - Host mw2365 is DOWN: PING CRITICAL - Packet loss = 100%
[10:18:48] <icinga-wm>	 PROBLEM - Host mw2368 is DOWN: PING CRITICAL - Packet loss = 100%
[10:18:48] <icinga-wm>	 PROBLEM - Host mw2369 is DOWN: PING CRITICAL - Packet loss = 100%
[10:18:56] <icinga-wm>	 PROBLEM - Host mw2367 is DOWN: PING CRITICAL - Packet loss = 100%
[10:18:56] <icinga-wm>	 PROBLEM - Host mw2370 is DOWN: PING CRITICAL - Packet loss = 100%
[10:18:56] <icinga-wm>	 PROBLEM - Host mw2371 is DOWN: PING CRITICAL - Packet loss = 100%
[10:18:56] <icinga-wm>	 PROBLEM - Host mw2372 is DOWN: PING CRITICAL - Packet loss = 100%
[10:18:56] <icinga-wm>	 PROBLEM - Host mw2373 is DOWN: PING CRITICAL - Packet loss = 100%
[10:18:57] <icinga-wm>	 PROBLEM - Host mw2374 is DOWN: PING CRITICAL - Packet loss = 100%
[10:18:57] <icinga-wm>	 PROBLEM - Host mw2375 is DOWN: PING CRITICAL - Packet loss = 100%
[10:18:58] <icinga-wm>	 PROBLEM - Host mw2366 is DOWN: PING CRITICAL - Packet loss = 100%
[10:19:32] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.hosts.downtime for 9:00:00 on 32 hosts with reason: PDU swap
[10:19:54] <logmsgbot>	 !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 9:00:00 on 32 hosts with reason: PDU swap
[10:20:07] <wikibugs>	 (03PS3) 10Thiemo Kreuz (WMDE): Remove unused code from StaticSiteConfiguration class [mediawiki-config] - 10https://gerrit.wikimedia.org/r/737858
[10:20:18] <jinxer-wm>	 (ProbeDown) firing: Service api-https:443 has failed probes (http_api-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#api-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:22:54] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET
[10:25:11] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Add library hint for gsasl [puppet] - 10https://gerrit.wikimedia.org/r/820383 (owner: 10Muehlenhoff)
[10:25:18] <jinxer-wm>	 (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6)   - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:27:16] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on ganeti2017.codfw.wmnet with reason: Remove node for eventual reimage, T311686
[10:27:22] <stashbot>	 T311686: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686
[10:27:32] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on ganeti2017.codfw.wmnet with reason: Remove node for eventual reimage, T311686
[10:30:11] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti2017.codfw.wmnet with OS bullseye
[10:30:16] <wikibugs>	 10SRE, 10Ganeti, 10Infrastructure-Foundations: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti2017.codfw.wmnet with OS bullseye
[10:35:50] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET
[10:44:52] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET
[10:45:10] <icinga-wm>	 PROBLEM - Host backup2006 is DOWN: PING CRITICAL - Packet loss = 100%
[10:49:18] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti2017.codfw.wmnet with reason: host reimage
[10:49:22] <icinga-wm>	 PROBLEM - Host db2126 is DOWN: PING CRITICAL - Packet loss = 100%
[10:49:56] <jynus>	 em, that is not supposed to happen
[10:50:17] <wikibugs>	 10SRE, 10LDAP-Access-Requests: LDAP access for Simon Kock (WMDE) - https://phabricator.wikimedia.org/T314563 (10Aklapper) @Siko_WMDE: Please make any potential internal docs point to https://phabricator.wikimedia.org/tag/ldap-access-requests/ which has canonical instructions for such requests, for future refer...
[10:51:55] <wikibugs>	 10SRE, 10LDAP-Access-Requests: LDAP access to wmde and nda for Simon Kock (WMDE) - https://phabricator.wikimedia.org/T314563 (10Aklapper)
[10:51:55] <jynus>	 backup2006 seems down on mgmt interface too, so power, not network
[10:52:17] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti2017.codfw.wmnet with reason: host reimage
[10:53:43] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=restbase2022.codfw.wmnet
[10:53:58] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=restbase2015.codfw.wmnet
[10:55:09] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET
[10:55:39] <jynus>	 ah, I see now, the server was not put back up, and the alert just expired
[11:01:10] <wikibugs>	 (03PS1) 10Btullis: Bootstrap etcd on the dse_k8s_etcd cluster [puppet] - 10https://gerrit.wikimedia.org/r/820416 (https://phabricator.wikimedia.org/T313129)
[11:01:12] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netbox, 10netops: Netbox Juniper report - https://phabricator.wikimedia.org/T306238 (10ayounsi)
[11:01:19] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10CAS-SSO, 10Patch-For-Review: Enable OIDC in CAS - https://phabricator.wikimedia.org/T311999 (10ayounsi)
[11:01:49] <wikibugs>	 (03PS1) 10Marostegui: db2126: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/820417
[11:02:36] <wikibugs>	 (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/820416 (https://phabricator.wikimedia.org/T313129) (owner: 10Btullis)
[11:02:54] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] db2126: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/820417 (owner: 10Marostegui)
[11:04:54] <wikibugs>	 (03PS2) 10Btullis: Bootstrap etcd on the dse_k8s_etcd cluster [puppet] - 10https://gerrit.wikimedia.org/r/820416 (https://phabricator.wikimedia.org/T313129)
[11:05:11] <wikibugs>	 10Puppet, 10SRE, 10Infrastructure-Foundations: remove old puppet certificates fom puppet master - https://phabricator.wikimedia.org/T314564 (10taavi)
[11:08:06] <wikibugs>	 10Puppet, 10SRE, 10Infrastructure-Foundations: remove old puppet certificates fom puppet master - https://phabricator.wikimedia.org/T314564 (10jbond) from the scripot above we get the following list   db2051.codfw.wmnet db2057.codfw.wmnet db2063.codfw.wmnet kafka1001.eqiad.wmnet kafka1002.eqiad.wmnet kafka10...
[11:09:53] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti2017.codfw.wmnet with OS bullseye
[11:09:58] <wikibugs>	 10SRE, 10Ganeti, 10Infrastructure-Foundations: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti2017.codfw.wmnet with OS bullseye completed: - ganeti2017 (**PASS**)   - Downtimed on...
[11:10:35] <wikibugs>	 10Puppet, 10SRE, 10Infrastructure-Foundations: remove old puppet certificates fom puppet master - https://phabricator.wikimedia.org/T314564 (10jcrespo) > This is likely host that where decommissioned before the current decommissioning scripts which force a puppet clean.  Based on logs at T220002#5574262 that...
[11:12:56] <wikibugs>	 (03PS3) 10Btullis: Bootstrap etcd on the dse_k8s_etcd cluster [puppet] - 10https://gerrit.wikimedia.org/r/820416 (https://phabricator.wikimedia.org/T313129)
[11:14:42] <icinga-wm>	 PROBLEM - Check systemd state on logstash1026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:16:44] <wikibugs>	 10Puppet, 10SRE, 10Infrastructure-Foundations: remove old puppet certificates fom puppet master - https://phabricator.wikimedia.org/T314564 (10jbond) 05Open→03Resolved a:03jbond > the decom script "lied" to us and had a bug and didn't delete the certs, even if it told us that it did This i think is the...
[11:17:03] <wikibugs>	 10Puppet, 10SRE, 10Infrastructure-Foundations: remove old puppet certificates fom puppet master - https://phabricator.wikimedia.org/T314564 (10jbond)
[11:17:07] <wikibugs>	 10SRE: Puppet certificate discrepancies - https://phabricator.wikimedia.org/T250483 (10jbond)
[11:18:54] <icinga-wm>	 RECOVERY - Puppet CA expired certs on puppetmaster1001 is OK: OK: all puppet agent certs fine https://wikitech.wikimedia.org/wiki/Puppet%23Renew_agent_certificate
[11:24:52] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET
[11:26:06] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2017.codfw.wmnet
[11:34:22] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2017.codfw.wmnet
[11:35:15] <wikibugs>	 10Puppet, 10SRE, 10Infrastructure-Foundations, 10User-herron: Puppet hosts with signed certificate present on agent but not master - https://phabricator.wikimedia.org/T185239 (10jbond)
[11:36:03] <wikibugs>	 10Puppet, 10SRE, 10Infrastructure-Foundations, 10User-herron: Puppet hosts with signed certificate present on agent but not master - https://phabricator.wikimedia.org/T185239 (10jbond) 05Open→03Resolved a:03jbond Closing all servers listed have been decomissioned
[11:36:27] <wikibugs>	 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Puppet hosts with their cert revoked can still run puppet - https://phabricator.wikimedia.org/T184444 (10jbond)
[11:36:45] <wikibugs>	 (03PS4) 10Btullis: Bootstrap etcd on the dse_k8s_etcd cluster [puppet] - 10https://gerrit.wikimedia.org/r/820416 (https://phabricator.wikimedia.org/T313129)
[11:37:10] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti2017.codfw.wmnet to cluster codfw and group D
[11:40:34] <wikibugs>	 10Puppet, 10SRE, 10Infrastructure-Foundations: remove old puppet certificates from puppet master - https://phabricator.wikimedia.org/T314564 (10Aklapper)
[11:41:00] <jinxer-wm>	 (JobUnavailable) firing: (6) Reduced availability for job mysql-labs in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[11:41:12] <moritzm>	 !log installing libpgjava security updates
[11:41:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:41:23] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti2017.codfw.wmnet to cluster codfw and group D
[11:45:40] <wikibugs>	 (03PS1) 10Jbond: P:puppetmasters: Convert 004 puppetmasteres to canaries [puppet] - 10https://gerrit.wikimedia.org/r/820428 (https://phabricator.wikimedia.org/T314136)
[11:46:43] <moritzm>	 !log installing Linux 5.10.127-2 kernels on Bullseye hosts
[11:46:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:47:43] <wikibugs>	 (03PS2) 10Jbond: P:puppetmasters: Convert 004 puppetmasteres to canaries [puppet] - 10https://gerrit.wikimedia.org/r/820428 (https://phabricator.wikimedia.org/T314136)
[11:50:54] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 7 DIFF 8): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36620/console" [puppet] - 10https://gerrit.wikimedia.org/r/820428 (https://phabricator.wikimedia.org/T314136) (owner: 10Jbond)
[11:52:19] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET
[11:53:26] <wikibugs>	 (CR) Jbond: [V: +1 C: +2] P:puppetmasters: Convert 004 puppetmasteres to canaries [puppet] - https://gerrit.wikimedia.org/r/820428 (https://phabricator.wikimedia.org/T314136) (owner: Jbond)
[11:55:58] <jinxer-wm>	 (KubernetesRsyslogDown) firing: rsyslog on kubestage2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[12:01:16] <wikibugs>	 (PS1) Jbond: hieradata: migrate idp-test2002 to canaries [puppet] - https://gerrit.wikimedia.org/r/820431
[12:02:16] <wikibugs>	 (CR) Jbond: [C: +2] hieradata: migrate idp-test2002 to canaries [puppet] - https://gerrit.wikimedia.org/r/820431 (owner: Jbond)
[12:03:01] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET
[12:03:23] <jbond>	 !log send sretest100[12] and idp-test2001 to the new puppetmaster[12]004 servers to test
[12:03:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:09:47] <wikibugs>	 (PS5) Eigyan: [config]: Add click event logging for mobile and desktop [mediawiki-config] - https://gerrit.wikimedia.org/r/812391 (https://phabricator.wikimedia.org/T310852)
[12:10:58] <jinxer-wm>	 (KubernetesRsyslogDown) resolved: rsyslog on kubestage2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[12:19:11] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET
[12:26:05] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET
[12:31:50] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686 (MoritzMuehlenhoff)
[12:32:07] <wikibugs>	 SRE, Data-Engineering: Also intake Network Error Logging events into the Analytics Data Lake - https://phabricator.wikimedia.org/T304373 (CDanis) @BTullis are you still interested in this?
[12:36:09] <wikibugs>	 (CR) Andrew Bogott: [C: +2] hieradata: switch traffic to cloudrabbit1001-3 [puppet] - https://gerrit.wikimedia.org/r/816818 (owner: Majavah)
[12:43:18] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET
[12:45:17] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Kanban): Dedicated cloudrabbit nodes in eqiad1 - https://phabricator.wikimedia.org/T314522 (Andrew) https://gerrit.wikimedia.org/r/c/operations/puppet/+/816818 merged
[12:45:49] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Kanban): Dedicated cloudrabbit nodes in eqiad1 - https://phabricator.wikimedia.org/T314522 (Andrew) Before this is closed out I want to review the firewall rules and see if we need to limit port access to VM + prod networks
[12:46:38] <wikibugs>	 (PS4) Xcollazo: airflow - Configure new platform_eng instance and rename old one as legacy. [puppet] - https://gerrit.wikimedia.org/r/817774 (https://phabricator.wikimedia.org/T312858)
[12:47:52] <wikibugs>	 SRE, LDAP-Access-Requests: LDAP access to wmde and nda for Simon Kock (WMDE) - https://phabricator.wikimedia.org/T314563 (Dzahn) Hello @Siko_WMDE   please create a user on the Wikitech wiki ( https://wikitech.wikimedia.org/wiki/Special:CreateAccount) and let us know the user name you picked once done.  A...
[12:48:08] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET
[12:48:42] <moritzm>	 !log installing Linux 4.19.249 kernels on Buster hosts
[12:48:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:50:02] <icinga-wm>	 RECOVERY - Disk space on gitlab2002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=gitlab2002&var-datasource=codfw+prometheus/ops
[12:50:54] <icinga-wm>	 RECOVERY - Check systemd state on gitlab2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:51:19] <wikibugs>	 SRE, ops-eqiad, DBA, DC-Ops: Q3:(Need By: TBD) rack/setup/install db1185.eqiad.wmnet - db1195.eqiad.wmnet - https://phabricator.wikimedia.org/T306928 (Papaul) @ayounsi see @Marostegui comment above.  Thanks
[12:54:40] <wikibugs>	 (PS1) CDanis: Print VO API response when we do escalate [software/klaxon] - https://gerrit.wikimedia.org/r/820439 (https://phabricator.wikimedia.org/T313603)
[12:58:49] <wikibugs>	 (CR) Muehlenhoff: [C: +2] sysfs: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/811226 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff)
[13:00:05] <jouncebot>	 Deploy window Mobileapps/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220804T1300)
[13:00:05] <jouncebot>	 RoanKattouw, Lucas_WMDE, Urbanecm, and awight: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC afternoon backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220804T1300).
[13:00:05] <jouncebot>	 danisztls and Lucas_WMDE: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:45] <jinxer-wm>	 (JobUnavailable) firing: (6) Reduced availability for job mysql-labs in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[13:01:02] <taavi>	 o/ I have a patch of my own coming in a sec
[13:01:05] <Lucas_WMDE>	 o/
[13:01:14] <danisztls>	 o/
[13:01:23] <Lucas_WMDE>	 (I need 10 more minutes or so, if anyone else wants to start deploying first)
[13:02:38] <taavi>	 i'll start from danisztls's patch then
[13:02:54] <wikibugs>	 (PS3) Majavah: QuickSurveys: Deploy research incentive survey to Bengali wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/819175 (https://phabricator.wikimedia.org/T314333) (owner: DDesouza)
[13:03:14] <wikibugs>	 (CR) Majavah: [C: +2] QuickSurveys: Deploy research incentive survey to Bengali wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/819175 (https://phabricator.wikimedia.org/T314333) (owner: DDesouza)
[13:04:13] <wikibugs>	 (PS1) Majavah: Remove unused CA P3P config [mediawiki-config] - https://gerrit.wikimedia.org/r/820441
[13:04:44] <wikibugs>	 (Merged) jenkins-bot: QuickSurveys: Deploy research incentive survey to Bengali wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/819175 (https://phabricator.wikimedia.org/T314333) (owner: DDesouza)
[13:05:05] <Lucas_WMDE>	 ok
[13:05:24] <taavi>	 danisztls: can you test on mwdebug1001 please?
[13:05:29] <danisztls>	 taavi: yes
[13:06:18] <jinxer-wm>	 (ProbeDown) firing: Service api-https:443 has failed probes (http_api-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#api-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:06:54] <taavi>	 that alert seems to be for codfw, probably due to the dc maintenance, ignoring
[13:07:06] <moritzm>	 !log installing jetty9 security updates
[13:07:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:08:08] <danisztls>	 taavi: looks good
[13:08:23] <taavi>	 thanks, syncing
[13:09:29] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Kanban): Dedicated cloudrabbit nodes in eqiad1 - https://phabricator.wikimedia.org/T314522 (Andrew) >>! In T314522#8131078, @Andrew wrote: > Before this is closed out I want to review the firewall rules and see if we need to limit port access to VM + p...
[13:11:16] <wikibugs>	 (PS1) Jbond: O:puppetmaster: introduce new puppetmaster[12]004 backends [puppet] - https://gerrit.wikimedia.org/r/820442 (https://phabricator.wikimedia.org/T314136)
[13:11:18] <jinxer-wm>	 (ProbeDown) resolved: Service api-https:443 has failed probes (http_api-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#api-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:11:26] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET
[13:11:52] <logmsgbot>	 !log taavi@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:819175|QuickSurveys: Deploy research incentive survey to Bengali wiki (T314333)]] (duration: 03m 26s)
[13:11:56] <stashbot>	 T314333: Deploy Research Incentive Survey on Bengali Wikipedia - https://phabricator.wikimedia.org/T314333
[13:11:58] <taavi>	 danisztls: and it's live!
[13:12:07] <wikibugs>	 (PS2) Majavah: Remove unused CA P3P config [mediawiki-config] - https://gerrit.wikimedia.org/r/820441
[13:12:08] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[13:12:08] <danisztls>	 taavi: thanks
[13:12:15] <wikibugs>	 (CR) Majavah: [C: +2] "deploying" [mediawiki-config] - https://gerrit.wikimedia.org/r/820441 (owner: Majavah)
[13:13:11] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[13:13:12] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[13:13:36] <wikibugs>	 (Merged) jenkins-bot: Remove unused CA P3P config [mediawiki-config] - https://gerrit.wikimedia.org/r/820441 (owner: Majavah)
[13:14:08] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[13:14:16] <wikibugs>	 (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 2 NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36622/console" [puppet] - https://gerrit.wikimedia.org/r/820442 (https://phabricator.wikimedia.org/T314136) (owner: Jbond)
[13:14:17] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on elastic2065.codfw.wmnet with reason: T310145
[13:14:20] <stashbot>	 T310145: (Need By:TBD) rack/setup/install row C new PDUs - https://phabricator.wikimedia.org/T310145
[13:14:31] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on elastic2065.codfw.wmnet with reason: T310145
[13:14:33] <jbond>	 !log introduce new puppetmaster backends puppetmaster[12]004
[13:14:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:14:51] <wikibugs>	 (CR) Jbond: [V: +1 C: +2] O:puppetmaster: introduce new puppetmaster[12]004 backends [puppet] - https://gerrit.wikimedia.org/r/820442 (https://phabricator.wikimedia.org/T314136) (owner: Jbond)
[13:15:26] <wikibugs>	 SRE, ops-codfw, DBA: (Need By:TBD) rack/setup/install row C new PDUs - https://phabricator.wikimedia.org/T310145 (bking)
[13:15:48] <icinga-wm>	 PROBLEM - Check systemd state on mw2386 is CRITICAL: CRITICAL - degraded: The following units failed: php7.2-fpm_check_restart.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:17:09] <Lucas_WMDE>	 taavi: should I continue with my config changes?
[13:17:23] <taavi>	 Lucas_WMDE: still syncing mine, just a sec
[13:17:26] <Lucas_WMDE>	 ok
[13:17:55] <logmsgbot>	 !log taavi@deploy1002 Synchronized wmf-config/CommonSettings.php: Config: [[gerrit:820441|Remove unused CA P3P config]] (duration: 03m 09s)
[13:18:03] <taavi>	 Lucas_WMDE: all done
[13:18:28] <Lucas_WMDE>	 thanks
[13:19:12] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[13:20:19] <wikibugs>	 (PS2) Lucas Werkmeister (WMDE): Remove unused $wgWBCSEnableDispatchingQueryBuilder [mediawiki-config] - https://gerrit.wikimedia.org/r/820397
[13:20:33] <wikibugs>	 (CR) Lucas Werkmeister (WMDE): [C: +2] Remove unused $wgWBCSEnableDispatchingQueryBuilder [mediawiki-config] - https://gerrit.wikimedia.org/r/820397 (owner: Lucas Werkmeister (WMDE))
[13:21:23] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to Analytic Cluster for Muniza - https://phabricator.wikimedia.org/T292955 (diego) Resolved→Open
[13:21:35] <wikibugs>	 (Merged) jenkins-bot: Remove unused $wgWBCSEnableDispatchingQueryBuilder [mediawiki-config] - https://gerrit.wikimedia.org/r/820397 (owner: Lucas Werkmeister (WMDE))
[13:22:19] <Lucas_WMDE>	 pulled to mwdebug1001, testing a bit
[13:23:19] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[13:23:20] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[13:24:20] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[13:25:06] <Lucas_WMDE>	 scap only restarting php-fpm on ~260 instead of ~300 hosts, I assume due to the codfw stuff
[13:26:40] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/SearchSettingsForSDC.php: Config: [[gerrit:820397|Remove unused $wgWBCSEnableDispatchingQueryBuilder]] (duration: 03m 01s)
[13:26:59] <wikibugs>	 (CR) Lucas Werkmeister (WMDE): "CCing other people who edited this file… is it okay to remove, or do you want to keep it around for future convenience?" [mediawiki-config] - https://gerrit.wikimedia.org/r/820398 (owner: Lucas Werkmeister (WMDE))
[13:27:12] <Lucas_WMDE>	 I’ll skip ^ that change for now and do the other two removals first
[13:27:56] <wikibugs>	 (PS2) Lucas Werkmeister (WMDE): Remove unused $wgLegacyJavaScriptGlobals [mediawiki-config] - https://gerrit.wikimedia.org/r/820402 (https://phabricator.wikimedia.org/T72470)
[13:29:05] <wikibugs>	 (CR) Lucas Werkmeister (WMDE): [C: +2] Remove unused $wgLegacyJavaScriptGlobals [mediawiki-config] - https://gerrit.wikimedia.org/r/820402 (https://phabricator.wikimedia.org/T72470) (owner: Lucas Werkmeister (WMDE))
[13:29:24] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[13:30:22] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[13:30:23] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[13:30:26] <wikibugs>	 (Merged) jenkins-bot: Remove unused $wgLegacyJavaScriptGlobals [mediawiki-config] - https://gerrit.wikimedia.org/r/820402 (https://phabricator.wikimedia.org/T72470) (owner: Lucas Werkmeister (WMDE))
[13:31:24] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[13:31:38] <Lucas_WMDE>	 syncing
[13:34:19] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] "LGTM!" [software/klaxon] - https://gerrit.wikimedia.org/r/820439 (https://phabricator.wikimedia.org/T313603) (owner: CDanis)
[13:34:24] <wikibugs>	 (Abandoned) Jbond: P:adduser: apply adduser before any packages are installed [puppet] - https://gerrit.wikimedia.org/r/819541 (https://phabricator.wikimedia.org/T235067) (owner: Jbond)
[13:34:34] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:820402|Remove unused $wgLegacyJavaScriptGlobals (T72470)]] (1/2) (duration: 02m 58s)
[13:34:38] <stashbot>	 T72470: Remove legacy javascript globals - https://phabricator.wikimedia.org/T72470
[13:35:03] <jinxer-wm>	 (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:36:15] <wikibugs>	 (Abandoned) Jbond: P:sretest: import blackbox to sretest to check if its just genrally slow [puppet] - https://gerrit.wikimedia.org/r/817789 (owner: Jbond)
[13:36:26] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[13:37:27] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[13:37:28] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[13:37:54] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings-labs.php: Config: [[gerrit:820402|Remove unused $wgLegacyJavaScriptGlobals (T72470)]] (2/2) (duration: 02m 59s)
[13:38:12] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to Analytic Cluster for Muniza - https://phabricator.wikimedia.org/T292955 (diego) Hi! @MunizaA had a problem with her laptop, and she needs to add a new ssh public key to access the cluster.  The new key is here P32283  @CDanis could you help us with this plea...
[13:38:25] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[13:38:46] <wikibugs>	 (PS2) Lucas Werkmeister (WMDE): Remove unused $wgIncludejQueryMigrate [mediawiki-config] - https://gerrit.wikimedia.org/r/820404 (https://phabricator.wikimedia.org/T280944)
[13:39:48] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on elastic2066.codfw.wmnet with reason: T310145
[13:39:48] <wikibugs>	 (CR) Lucas Werkmeister (WMDE): [C: +2] Remove unused $wgIncludejQueryMigrate [mediawiki-config] - https://gerrit.wikimedia.org/r/820404 (https://phabricator.wikimedia.org/T280944) (owner: Lucas Werkmeister (WMDE))
[13:39:54] <stashbot>	 T310145: (Need By:TBD) rack/setup/install row C new PDUs - https://phabricator.wikimedia.org/T310145
[13:40:02] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on elastic2066.codfw.wmnet with reason: T310145
[13:40:03] <jinxer-wm>	 (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6)   - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:40:16] <wikibugs>	 (PS3) Jbond: cli: Add ability to override th amount of retries and backoffs [software/debmonitor] - https://gerrit.wikimedia.org/r/812556
[13:40:18] <wikibugs>	 (CR) Jbond: cli: Add ability to override th amount of retries and backoffs (3 comments) [software/debmonitor] - https://gerrit.wikimedia.org/r/812556 (owner: Jbond)
[13:40:41] <wikibugs>	 (Merged) jenkins-bot: Remove unused $wgIncludejQueryMigrate [mediawiki-config] - https://gerrit.wikimedia.org/r/820404 (https://phabricator.wikimedia.org/T280944) (owner: Lucas Werkmeister (WMDE))
[13:40:46] <wikibugs>	 (CR) Jbond: cli: Add ability to override th amount of retries and backoffs (1 comment) [software/debmonitor] - https://gerrit.wikimedia.org/r/812556 (owner: Jbond)
[13:41:06] <wikibugs>	 (PS4) Jbond: cli: Add ability to override the amount of retries and backoffs [software/debmonitor] - https://gerrit.wikimedia.org/r/812556
[13:43:27] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[13:44:26] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[13:44:27] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[13:44:42] <wikibugs>	 (PS1) Andrew Bogott: Trove: fix copy/paste user with trove_guest_rabbit_pass [puppet] - https://gerrit.wikimedia.org/r/820452
[13:45:03] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:820404|Remove unused $wgIncludejQueryMigrate (T280944)]] (1/2) (duration: 02m 58s)
[13:45:03] <jinxer-wm>	 (ProbeDown) resolved: Service api-https:443 has failed probes (http_api-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#api-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:45:06] <stashbot>	 T280944: Phase out jQuery Migrate v3 - https://phabricator.wikimedia.org/T280944
[13:45:23] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[13:45:34] <wikibugs>	 (PS2) Andrew Bogott: Trove: fix copy/paste error with trove_guest_rabbit_pass [puppet] - https://gerrit.wikimedia.org/r/820452
[13:47:48] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to Analytic Cluster for Muniza - https://phabricator.wikimedia.org/T292955 (RhinosF1) a:MunizaA→None Hi @Diego, the SRE on duty changes weekly. It is now @mutante. I'll make sure they see this.
[13:48:18] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings-labs.php: Config: [[gerrit:820404|Remove unused $wgIncludejQueryMigrate (T280944)]] (2/2) (duration: 03m 03s)
[13:48:42] <wikibugs>	 (CR) Andrew Bogott: [C: +2] Trove: fix copy/paste error with trove_guest_rabbit_pass [puppet] - https://gerrit.wikimedia.org/r/820452 (owner: Andrew Bogott)
[13:49:45] <wikibugs>	 SRE, ops-codfw, DBA: (Need By:TBD) rack/setup/install row C new PDUs - https://phabricator.wikimedia.org/T310145 (bking)
[13:49:55] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to Analytic Cluster for Muniza - https://phabricator.wikimedia.org/T292955 (diego) Thanks @RhinosF1 !
[13:49:56] <Lucas_WMDE>	 anything else to deploy?
[13:50:06] <Lucas_WMDE>	 otherwise I might do another one for MathUseRestBase
[13:50:31] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[13:51:08] <Reedy>	 Lucas_WMDE: I have a couple of mw-config patches from the unused config thing if you want :P
[13:51:18] <Lucas_WMDE>	 sure ^^
[13:51:29] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[13:51:30] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[13:51:52] <wikibugs>	 (PS1) Lucas Werkmeister (WMDE): Remove unused $wgMathUseRestBase [mediawiki-config] - https://gerrit.wikimedia.org/r/820454 (https://phabricator.wikimedia.org/T274436)
[13:52:21] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[13:52:41] <Lucas_WMDE>	 Reedy: like https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/820255 ?
[13:53:02] <Reedy>	 yeah, that one and https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/820254
[13:53:28] <Lucas_WMDE>	 alright, grepping for the names from the first one
[13:53:37] <wikibugs>	 (PS2) Lucas Werkmeister (WMDE): wikitech: Remove old LDAP config vars [mediawiki-config] - https://gerrit.wikimedia.org/r/820255 (owner: Reedy)
[13:53:57] <wikibugs>	 (CR) Lucas Werkmeister (WMDE): [C: +2] "No occurrences in all of deploy1002:/srv/mediawiki-staging outside of wikitech.php 👍" [mediawiki-config] - https://gerrit.wikimedia.org/r/820255 (owner: Reedy)
[13:54:11] <wikibugs>	 SRE, LDAP-Access-Requests: LDAP access to wmde and nda for Simon Kock (WMDE) - https://phabricator.wikimedia.org/T314563 (WMDE-leszek) I confirm @Siko_WMDE's identity, and approve the request. Thank you!
[13:54:55] <wikibugs>	 (Merged) jenkins-bot: wikitech: Remove old LDAP config vars [mediawiki-config] - https://gerrit.wikimedia.org/r/820255 (owner: Reedy)
[13:55:26] <wikibugs>	 (CR) Lucas Werkmeister (WMDE): [C: +1] "That is indeed the correct variable name in php-1.39.0-wmf.23/extensions/StopForumSpam/extension.json / php-1.39.0-wmf.23/extensions/StopF" [mediawiki-config] - https://gerrit.wikimedia.org/r/820254 (owner: Reedy)
[13:55:42] <Lucas_WMDE>	 Reedy: want to test them on mwdebug?
[13:55:43] <wikibugs>	 SRE, SRE-swift-storage, ops-codfw, DBA: (Need By:TBD) rack/setup/install row D new PDUs - https://phabricator.wikimedia.org/T310146 (MatthewVernon) All the ms-* nodes in `C4` & `C7` must be back and properly in service before we can start on `D2`, I'm afraid. I'll be on IRC, but please don't star...
[13:55:46] <Lucas_WMDE>	 otherwise I’m happy to sync them directly
[13:56:27] <wikibugs>	 SRE, ops-codfw, DBA: (Need By:TBD) rack/setup/install row C new PDUs - https://phabricator.wikimedia.org/T310145 (MatthewVernon) Having moved `C2` to today, it needs to wait until all the ms-* nodes in `D2` are fully back up before starting.
[13:56:36] <Reedy>	 I don't see much point testing them either :)
[13:56:38] <Reedy>	 Feel free to sync <3
[13:56:41] <Lucas_WMDE>	 sounds good :)
[13:56:43] <wikibugs>	 SRE, SRE-swift-storage, ops-codfw, DBA: (Need By:TBD) rack/setup/install row C new PDUs - https://phabricator.wikimedia.org/T310145 (MatthewVernon)
[13:56:50] <wikibugs>	 (PS2) Lucas Werkmeister (WMDE): CommonSettings-labs: Fix usage of $wgSFSValidateIPListLocationMD5 [mediawiki-config] - https://gerrit.wikimedia.org/r/820254 (owner: Reedy)
[13:56:53] <Lucas_WMDE>	 syncing
[13:56:54] <wikibugs>	 (CR) Elukey: "Ben, the change seems to fail for the new hosts:" [puppet] - https://gerrit.wikimedia.org/r/820416 (https://phabricator.wikimedia.org/T313129) (owner: Btullis)
[13:57:27] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[13:58:25] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[13:58:26] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[13:58:37] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on ms-be[2058,2064].codfw.wmnet with reason: PDU work
[13:58:52] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on ms-be[2058,2064].codfw.wmnet with reason: PDU work
[13:58:58] <wikibugs>	 SRE, SRE-swift-storage, ops-codfw, DBA: (Need By:TBD) rack/setup/install row C new PDUs - https://phabricator.wikimedia.org/T310145 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=ec46c9c7-d251-4875-87f9-040b391ea22a) set by mvernon@cumin1001 for 1 day, 0:00:00 on 2 host(s) and...
[13:59:19] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[13:59:36] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/wikitech.php: Config: [[gerrit:820255|wikitech: Remove old LDAP config vars]] (duration: 02m 54s)
[13:59:52] <wikibugs>	 (CR) Lucas Werkmeister (WMDE): [C: +2] CommonSettings-labs: Fix usage of $wgSFSValidateIPListLocationMD5 [mediawiki-config] - https://gerrit.wikimedia.org/r/820254 (owner: Reedy)
[14:01:13] <Lucas_WMDE>	 jouncebot: now
[14:01:14] <jouncebot>	 No deployments scheduled for the next 1 hour(s) and 58 minute(s)
[14:01:16] <Lucas_WMDE>	 ok
[14:01:50] <wikibugs>	 (Merged) jenkins-bot: CommonSettings-labs: Fix usage of $wgSFSValidateIPListLocationMD5 [mediawiki-config] - https://gerrit.wikimedia.org/r/820254 (owner: Reedy)
[14:03:19] <wikibugs>	 SRE, ops-eqiad, DBA, DC-Ops: Q3:(Need By: TBD) rack/setup/install db1185.eqiad.wmnet - db1195.eqiad.wmnet - https://phabricator.wikimedia.org/T306928 (ayounsi) That's something for #ops-eqiad, I think there was some confusion during the provisioning of those hosts:  For example, [[ https://netbox...
[14:03:48] <wikibugs>	 (CR) Jforrester: "Oops, thanks for spotting this!" [mediawiki-config] - https://gerrit.wikimedia.org/r/820402 (https://phabricator.wikimedia.org/T72470) (owner: Lucas Werkmeister (WMDE))
[14:03:52] <icinga-wm>	 RECOVERY - BGP status on cr4-ulsfo is OK: BGP OK - up: 103, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:04:21] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[14:04:26] <wikibugs>	 (CR) Lucas Werkmeister (WMDE): [C: +2] Remove unused $wgLegacyJavaScriptGlobals (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/820402 (https://phabricator.wikimedia.org/T72470) (owner: Lucas Werkmeister (WMDE))
[14:04:51] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on elastic2033.codfw.wmnet with reason: T310145
[14:04:54] <stashbot>	 T310145: (Need By:TBD) rack/setup/install row C new PDUs - https://phabricator.wikimedia.org/T310145
[14:05:05] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on elastic2033.codfw.wmnet with reason: T310145
[14:05:24] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[14:05:25] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[14:05:58] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/CommonSettings-labs.php: Config: [[gerrit:820254|CommonSettings-labs: Fix usage of $wgSFSValidateIPListLocationMD5]] (duration: 02m 51s)
[14:06:20] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[14:06:33] <Lucas_WMDE>	 phuedx: should we deploy https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/820389 too? (MediaViewer unused cleanup)
[14:06:39] <Lucas_WMDE>	 if you happen to be around
[14:07:00] <wikibugs>	 (CR) Btullis: Bootstrap etcd on the dse_k8s_etcd cluster (1 comment) [puppet] - https://gerrit.wikimedia.org/r/820416 (https://phabricator.wikimedia.org/T313129) (owner: Btullis)
[14:07:04] <wikibugs>	 (PS2) Lucas Werkmeister (WMDE): Remove unused $wgMathUseRestBase [mediawiki-config] - https://gerrit.wikimedia.org/r/820454 (https://phabricator.wikimedia.org/T274436)
[14:07:20] <Lucas_WMDE>	 in the meantime I’ll do the MathUseRestBase cleanup
[14:08:45] <wikibugs>	 (CR) Lucas Werkmeister (WMDE): [C: +2] Remove unused $wgMathUseRestBase [mediawiki-config] - https://gerrit.wikimedia.org/r/820454 (https://phabricator.wikimedia.org/T274436) (owner: Lucas Werkmeister (WMDE))
[14:09:57] <wikibugs>	 (CR) Lucas Werkmeister (WMDE): [C: +1] "LGTM, no references to this variable outside of IS-labs.php on deploy1002:/srv/mediawiki-staging." [mediawiki-config] - https://gerrit.wikimedia.org/r/820389 (https://phabricator.wikimedia.org/T310890) (owner: Phuedx)
[14:10:04] <wikibugs>	 (Merged) jenkins-bot: Remove unused $wgMathUseRestBase [mediawiki-config] - https://gerrit.wikimedia.org/r/820454 (https://phabricator.wikimedia.org/T274436) (owner: Lucas Werkmeister (WMDE))
[14:11:23] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[14:12:22] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[14:12:23] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[14:13:10] <wikibugs>	 (PS1) Jforrester: Wikifunctions: Drop two config items moved to docker [mediawiki-config] - https://gerrit.wikimedia.org/r/820459
[14:13:13] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[14:13:34] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/CommonSettings-labs.php: Config: [[gerrit:820454|Remove unused $wgMathUseRestBase (T274436)]] (duration: 03m 01s)
[14:13:36] <stashbot>	 T274436: Enable RESTbaseless validation in wikibase - https://phabricator.wikimedia.org/T274436
[14:14:23] <Lucas_WMDE>	 I think I’ll stop there for now :)
[14:14:40] <Lucas_WMDE>	 !log UTC afternoon backport+config window done
[14:14:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:15:12] <wikibugs>	 (PS1) Samtar: DefaultConfig: add a (humorous) deployment message [wikimedia/bots/jouncebot] - https://gerrit.wikimedia.org/r/820460
[14:16:01] <wikibugs>	 (CR) CI reject: [V: -1] DefaultConfig: add a (humorous) deployment message [wikimedia/bots/jouncebot] - https://gerrit.wikimedia.org/r/820460 (owner: Samtar)
[14:16:19] <Reedy>	 TheresNoTime: Computer says no.
[14:16:29] <TheresNoTime>	 noooooooo
[14:17:07] <TheresNoTime>	 There was a critical error during execution of Flake8: plugin code for `flake8-logging-format[logging-format]` does not match ^[A-Z]{1,3}[0-9]{0,3}$
[14:17:18] <TheresNoTime>	 unrelated CI error? :>
[14:17:58] <Lucas_WMDE>	 TheresNoTime: lmao
[14:18:16] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[14:18:34] <icinga-wm>	 PROBLEM - Check systemd state on cloudweb1004 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_uwsgi-striker.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:18:34] <Reedy>	 computer says NO
[14:18:54] <TheresNoTime>	 😭
[14:19:16] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[14:19:18] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[14:20:14] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[14:20:15] <Lucas_WMDE>	 looks like we need https://github.com/globality-corp/flake8-logging-format/commit/f3cdb24468241ebe85e41b0bd2e8958c76b4dec6
[14:20:30] <Lucas_WMDE>	 I guess flake8 got stricter about requirements for its plugins
[14:21:21] <Emperor>	 !log shutdown ms-be20[58,64].codfw.wmnet for PDU swap T310145
[14:21:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:21:23] <stashbot>	 T310145: (Need By:TBD) rack/setup/install row C new PDUs - https://phabricator.wikimedia.org/T310145
[14:21:26] <Lucas_WMDE>	 https://pypi.org/project/flake8-logging-format/#history doesn’t show a published version that could include this fix though :<
[14:22:41] <logmsgbot>	 !log root@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on logstash2035.codfw.wmnet with reason: pdu
[14:22:52] <godog>	 !log poweroff logstash2035 - T310145
[14:22:55] <logmsgbot>	 !log root@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on logstash2035.codfw.wmnet with reason: pdu
[14:22:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:22:57] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on elastic2032.codfw.wmnet with reason: T310145
[14:23:11] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on elastic2032.codfw.wmnet with reason: T310145
[14:24:28] <Lucas_WMDE>	 filed https://phabricator.wikimedia.org/T314576
[14:24:43] <logmsgbot>	 !log jelto@cumin1001 START - Cookbook sre.hosts.downtime for 4:30:00 on gitlab-runner2003.codfw.wmnet with reason: PDU swap
[14:24:47] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on wdqs2001.codfw.wmnet with reason: T310145
[14:25:00] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on wdqs2001.codfw.wmnet with reason: T310145
[14:25:07] <logmsgbot>	 !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:30:00 on gitlab-runner2003.codfw.wmnet with reason: PDU swap
[14:25:16] <jelto>	 !log power off gitlab-runner2003
[14:25:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:25:32] <icinga-wm>	 RECOVERY - Check systemd state on cloudweb1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:26:18] <jinxer-wm>	 (ProbeDown) firing: Service api-https:443 has failed probes (http_api-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#api-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:29:47] <wikibugs>	 SRE, SRE-swift-storage, ops-codfw, DBA: (Need By:TBD) rack/setup/install row C new PDUs - https://phabricator.wikimedia.org/T310145 (bking)
[14:30:56] <wikibugs>	 SRE, LDAP-Access-Requests: LDAP access to wmde and nda for Simon Kock (WMDE) - https://phabricator.wikimedia.org/T314563 (Siko_WMDE) Hi @Dzahn,  I already created a user on Wikitech wiki, the name is:   Siko_WMDE  Thank you and best regards, Simon
[14:31:18] <jinxer-wm>	 (ProbeDown) resolved: Service api-https:443 has failed probes (http_api-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#api-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:31:52] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on wdqs2011.codfw.wmnet with reason: T310145
[14:31:57] <stashbot>	 T310145: (Need By:TBD) rack/setup/install row C new PDUs - https://phabricator.wikimedia.org/T310145
[14:32:18] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on wdqs2011.codfw.wmnet with reason: T310145
[14:32:24] <wikibugs>	 (PS1) Samtar: requirements.txt: Pin flake8 to v4.0.1 [wikimedia/bots/jouncebot] - https://gerrit.wikimedia.org/r/820462 (https://phabricator.wikimedia.org/T314576)
[14:32:42] <Reedy>	 heh
[14:32:53] <Reedy>	 TheresNoTime: Might be worth filing a task about doing that more widely...
[14:33:41] <TheresNoTime>	 I think taavi mentioned the flake8 updates end of last month caused quite a few issues
[14:33:53] * TheresNoTime assumes there must already be a task..
[14:35:09] <wikibugs>	 (CR) CDanis: [C: +1] requestctl: Add a reminder to "requestctl commit" after enable/disable [software/conftool] - https://gerrit.wikimedia.org/r/817351 (https://phabricator.wikimedia.org/T305580) (owner: RLazarus)
[14:35:17] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: name=restbase2016.codfw.wmnet
[14:35:22] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: name=restbase2020.codfw.wmnet
[14:35:28] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: name=restbase2025.codfw.wmnet
[14:35:49] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: name=maps2007.codfw.wmnet
[14:36:26] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=maps2009.codfw.wmnet
[14:37:15] <wikibugs>	 (PS1) Ayounsi: Netbox: add hourly postgres backups [puppet] - https://gerrit.wikimedia.org/r/820463 (https://phabricator.wikimedia.org/T262677)
[14:37:18] <jinxer-wm>	 (ProbeDown) firing: Service api-https:443 has failed probes (http_api-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#api-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:38:38] <phuedx>	 Lucas_WMDE: I wasn't around :) I've scheduled that patch for deployment next week. I'd deploy it myself in the meantime but I haven't regenerated my keys yet
[14:38:45] <phuedx>	 Thanks for the ping though
[14:38:48] <Lucas_WMDE>	 phuedx: ok, sounds good!
[14:39:00] <Lucas_WMDE>	 I already gave it a +1 ^^
[14:40:32] <wikibugs>	 (PS6) Eigyan: [config]: Add click event logging for mobile and desktop [mediawiki-config] - https://gerrit.wikimedia.org/r/812391 (https://phabricator.wikimedia.org/T310852)
[14:40:40] <icinga-wm>	 PROBLEM - Host elastic2082.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[14:40:45] <TheresNoTime>	 logged T314577 ^^
[14:40:45] <stashbot>	 T314577: Flake8 5.0.0 release breaking CI jobs - https://phabricator.wikimedia.org/T314577
[14:41:12] <RhinosF1>	 TheresNoTime: is there not already a task?
[14:41:19] <RhinosF1>	 i thought 5.0.2 fixed it
[14:41:34] <icinga-wm>	 PROBLEM - Host elastic2081.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[14:41:36] <icinga-wm>	 PROBLEM - Host wdqs2011.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[14:41:48] <TheresNoTime>	 RhinosF1: not that I could find, and I guess I should rename that to `5.0.0+`
[14:42:18] <jinxer-wm>	 (ProbeDown) resolved: Service api-https:443 has failed probes (http_api-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#api-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:42:56] <RhinosF1>	 TheresNoTime: 5.0.0 created too many issues. I know a lot got fixed by one of .1 .2 or .3
[14:43:14] <wikibugs>	 (CR) Ayounsi: [V: +1] "https://puppet-compiler.wmflabs.org/pcc-worker1002/36624/" [puppet] - https://gerrit.wikimedia.org/r/820463 (https://phabricator.wikimedia.org/T262677) (owner: Ayounsi)
[14:43:34] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET
[14:43:56] <Lucas_WMDE>	 the failing CI build installed flake8==5.0.4 so I guess it’s still broken in that version
[14:43:58] <icinga-wm>	 PROBLEM - Host elastic2065.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[14:43:58] <icinga-wm>	 PROBLEM - Host elastic2066.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[14:44:38] <icinga-wm>	 PROBLEM - Host mc2047.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[14:44:38] <icinga-wm>	 PROBLEM - Host mc2048.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[14:44:47] <RhinosF1>	 Lucas_WMDE: we're on .4 now?
[14:44:58] <icinga-wm>	 PROBLEM - Host logstash2035.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[14:45:03] <Lucas_WMDE>	 apparently yes
[14:45:12] <icinga-wm>	 PROBLEM - Host ms-backup2001.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[14:45:15] <wikibugs>	 (PS2) Samtar: DefaultConfig: add a (humorous) deployment message [wikimedia/bots/jouncebot] - https://gerrit.wikimedia.org/r/820460
[14:45:42] <icinga-wm>	 PROBLEM - Host ms-be2058.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[14:45:50] <icinga-wm>	 PROBLEM - Host ms-be2064.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[14:46:40] <TheresNoTime>	 seeee it is a good change (if CI gets fixed) :D
[14:47:21] <Lucas_WMDE>	 :D
[14:47:53] <RhinosF1>	 TheresNoTime: ci went v+2, i can give you a +1 because i have a working mouse
[14:48:03] <wikibugs>	 (CR) RhinosF1: [C: +1] DefaultConfig: add a (humorous) deployment message [wikimedia/bots/jouncebot] - https://gerrit.wikimedia.org/r/820460 (owner: Samtar)
[14:48:10] <icinga-wm>	 PROBLEM - Host backup2003.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[14:48:12] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET
[14:48:27] <RhinosF1>	 Lucas_WMDE: i'm waiting to bother bumping my projects until it settles
[14:49:19] <TheresNoTime>	 getting that merged won't count for https://twitter.com/TheresNoTimeFor/status/1534271845641469965 though :(
[14:49:27] <wikibugs>	 SRE, conftool: Annotate X-Analytics header with any matching actions - https://phabricator.wikimedia.org/T305582 (CDanis)
[14:49:29] <Lucas_WMDE>	 I’m sure the flake8 maintainers have already copiously advised everyone to pin their dependencies and use pip-tools etc.
[14:49:37] <Lucas_WMDE>	 or is it only the Pallets folks that like to do that ^^
[14:49:54] <Lucas_WMDE>	 TheresNoTime: your fault for being so specific with “MediaWiki core”
[14:50:00] <wikibugs>	 (PS1) Andrew Bogott: Remove rabbitmq profile from cloudcontrol nodes [puppet] - https://gerrit.wikimedia.org/r/820465 (https://phabricator.wikimedia.org/T314522)
[14:52:14] <icinga-wm>	 RECOVERY - Host ms-be2064.mgmt is UP: PING WARNING - Packet loss = 71%, RTA = 42.71 ms
[14:54:09] <wikibugs>	 (CR) Andrew Bogott: "https://puppet-compiler.wmflabs.org/pcc-worker1003/36625/cloudcontrol1003.wikimedia.org/index.html" [puppet] - https://gerrit.wikimedia.org/r/820465 (https://phabricator.wikimedia.org/T314522) (owner: Andrew Bogott)
[14:54:32] <icinga-wm>	 RECOVERY - Host backup2003.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.65 ms
[14:55:18] <jinxer-wm>	 (ProbeDown) firing: Service api-https:443 has failed probes (http_api-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#api-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:56:07] <XioNoX>	 !log draining codfw-ulsfo link - T310310
[14:56:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:56:22] <icinga-wm>	 RECOVERY - Host ms-be2058.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.77 ms
[14:56:22] <icinga-wm>	 RECOVERY - Host ms-backup2001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.73 ms
[14:56:27] <logmsgbot>	 !log jelto@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on mc[2030-2031].codfw.wmnet with reason: PDU swap
[14:56:42] <logmsgbot>	 !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on mc[2030-2031].codfw.wmnet with reason: PDU swap
[14:56:48] <icinga-wm>	 RECOVERY - Host elastic2065.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.70 ms
[14:56:48] <icinga-wm>	 RECOVERY - Host elastic2066.mgmt is UP: PING OK - Packet loss = 0%, RTA = 34.60 ms
[14:57:28] <icinga-wm>	 RECOVERY - Host mc2047.mgmt is UP: PING OK - Packet loss = 0%, RTA = 87.51 ms
[14:57:29] <icinga-wm>	 RECOVERY - Host mc2048.mgmt is UP: PING OK - Packet loss = 0%, RTA = 82.31 ms
[14:57:48] <icinga-wm>	 RECOVERY - Host logstash2035.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.74 ms
[14:58:09] <jelto>	 !log power off mc20[30-31]
[14:58:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:59:46] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET
[15:00:18] <jinxer-wm>	 (ProbeDown) resolved: Service api-https:443 has failed probes (http_api-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#api-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:00:58] <icinga-wm>	 RECOVERY - Host elastic2082.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.75 ms
[15:00:58] <icinga-wm>	 RECOVERY - Host elastic2081.mgmt is UP: PING OK - Packet loss = 0%, RTA = 34.03 ms
[15:01:00] <icinga-wm>	 RECOVERY - Host wdqs2011.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.83 ms
[15:01:02] <wikibugs>	 (PS1) Milimetric: role::common::aqs: update mw history [puppet] - https://gerrit.wikimedia.org/r/820468
[15:01:25] <wikibugs>	 (CR) Andrew Bogott: [C: +2] Remove rabbitmq profile from cloudcontrol nodes [puppet] - https://gerrit.wikimedia.org/r/820465 (https://phabricator.wikimedia.org/T314522) (owner: Andrew Bogott)
[15:01:54] <wikibugs>	 (CR) Btullis: [C: +2] role::common::aqs: update mw history [puppet] - https://gerrit.wikimedia.org/r/820468 (owner: Milimetric)
[15:05:20] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.aqs.roll-restart for AQS aqs cluster: Roll restart of all AQS's nodejs daemons.
[15:05:24] <icinga-wm>	 PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [400.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10
[15:06:49] <logmsgbot>	 !log root@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on logstash2002.codfw.wmnet with reason: pdu
[15:07:03] <logmsgbot>	 !log root@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on logstash2002.codfw.wmnet with reason: pdu
[15:07:38] <_joe_>	 !log powering down mc203{0,1}
[15:07:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:08:52] <icinga-wm>	 PROBLEM - Aggregate IPsec Tunnel Status eqiad on alert1001 is CRITICAL: instance=mc1048 site=eqiad tunnel=mc2030_v4 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status
[15:09:22] <godog>	 !log poweroff logstash2002 - T310145
[15:09:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:09:25] <stashbot>	 T310145: (Need By:TBD) rack/setup/install row C new PDUs - https://phabricator.wikimedia.org/T310145
[15:10:11] <wikibugs>	 (CR) Ahmon Dancy: scap: do not restart php on the mwmaint servers (1 comment) [puppet] - https://gerrit.wikimedia.org/r/819631 (owner: Giuseppe Lavagetto)
[15:11:12] <icinga-wm>	 RECOVERY - IPMI Sensor Status on kafka-main2002 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[15:11:21] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depool hosts for PDU maint (T310145)', diff saved to https://phabricator.wikimedia.org/P32284 and previous config saved to /var/cache/conftool/dbconfig/20220804-151121-ladsgroup.json
[15:12:48] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.hosts.remove-downtime for ms-be[2058,2064].codfw.wmnet
[15:12:48] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for ms-be[2058,2064].codfw.wmnet
[15:13:24] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp203[12]\.codfw\.wmnet,service=ats-tls
[15:13:30] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp203[12]\.codfw\.wmnet,service=ats-be
[15:13:36] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp203[12]\.codfw\.wmnet,service=varnish-fe
[15:13:47] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db[2114,2126,2166].codfw.wmnet with reason: Maintenance (T310145)
[15:13:51] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db[2114,2126,2166].codfw.wmnet with reason: Maintenance (T310145)
[15:14:02] <icinga-wm>	 PROBLEM - Host db2126.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[15:15:10] <icinga-wm>	 PROBLEM - Host db2102.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[15:15:12] <icinga-wm>	 PROBLEM - Host db2114.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[15:15:22] <icinga-wm>	 PROBLEM - Host db2165.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[15:15:22] <icinga-wm>	 PROBLEM - Host db2166.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[15:16:12] <icinga-wm>	 PROBLEM - Host parse2012.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[15:16:12] <icinga-wm>	 PROBLEM - Host parse2013.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[15:16:16] <Amir1>	 I was too late to shut it down but it's fine, it's depooled and downtimed
[15:16:25] <logmsgbot>	 !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 3:00:00 on restbase[2016,2020,2025].codfw.wmnet with reason: PDU maintenance
[15:16:40] <logmsgbot>	 !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on restbase[2016,2020,2025].codfw.wmnet with reason: PDU maintenance
[15:16:46] <icinga-wm>	 PROBLEM - Host logstash2002.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[15:16:55] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.aqs.roll-restart (exit_code=0) for AQS aqs cluster: Roll restart of all AQS's nodejs daemons.
[15:19:32] <icinga-wm>	 PROBLEM - Host wdqs2001.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[15:19:59] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depool C6 for PDU maint (T310145)', diff saved to https://phabricator.wikimedia.org/P32285 and previous config saved to /var/cache/conftool/dbconfig/20220804-151958-ladsgroup.json
[15:20:02] <stashbot>	 T310145: (Need By:TBD) rack/setup/install row C new PDUs - https://phabricator.wikimedia.org/T310145
[15:20:48] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db[2116,2127,2167-2168].codfw.wmnet,es2022.codfw.wmnet with reason: Maintenance (T310145)
[15:21:04] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db[2116,2127,2167-2168].codfw.wmnet,es2022.codfw.wmnet with reason: Maintenance (T310145)
[15:21:35] <XioNoX>	 !log un-drain codfw-ulsfo link - T310310
[15:21:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:23:49] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp203[78]\.codfw\.wmnet,service=ats-tls
[15:23:56] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp203[78]\.codfw\.wmnet,service=ats-be
[15:24:04] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp203[78]\.codfw\.wmnet,service=varnish-fe
[15:24:30] <icinga-wm>	 PROBLEM - Host restbase2016.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[15:24:51] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 4:00:00 on cp[2037-2038].codfw.wmnet with reason: shutdown for PDU upgrade
[15:25:19] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on cp[2037-2038].codfw.wmnet with reason: shutdown for PDU upgrade
[15:25:20] <logmsgbot>	 !log jelto@cumin1001 START - Cookbook sre.hosts.downtime for 3:30:00 on phab2001.codfw.wmnet with reason: PDU swap
[15:25:28] <jelto>	 !log power off phab2001
[15:25:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:25:34] <logmsgbot>	 !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:30:00 on phab2001.codfw.wmnet with reason: PDU swap
[15:26:32] <icinga-wm>	 PROBLEM - Host mc2030.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[15:27:09] <sukhe>	 !log power off cp2037,cp2038: PDU upgrade
[15:27:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:27:54] <icinga-wm>	 PROBLEM - Host restbase2020.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[15:27:58] <icinga-wm>	 PROBLEM - Host elastic2033.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[15:27:59] <icinga-wm>	 PROBLEM - Host elastic2032.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[15:28:48] <icinga-wm>	 RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [200.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10
[15:28:54] <icinga-wm>	 PROBLEM - Host parse2011.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[15:29:04] <icinga-wm>	 PROBLEM - Host mc2031.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[15:29:20] <icinga-wm>	 PROBLEM - Host gitlab-runner2003.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[15:30:34] <icinga-wm>	 PROBLEM - Host restbase2025.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[15:32:06] <icinga-wm>	 PROBLEM - Host ores2006 is DOWN: PING CRITICAL - Packet loss = 100%
[15:32:12] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2008 is CRITICAL: PYBAL CRITICAL - CRITICAL - git-ssh4_22: Servers phab2001-vcs.codfw.wmnet are marked down but pooled: git-ssh6_22: Servers phab2001-vcs.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[15:32:36] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Machine-Learning-Team: Q1:rack/setup/install new machine learning hosts - https://phabricator.wikimedia.org/T314587 (RobH)
[15:32:56] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - git-ssh4_22: Servers phab2001-vcs.codfw.wmnet are marked down but pooled: git-ssh6_22: Servers phab2001-vcs.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[15:34:15] <wikibugs>	 (CR) Cwhite: [C: +2] service::docker: Add SyslogIdentifier to systemd unit [puppet] - https://gerrit.wikimedia.org/r/820237 (https://phabricator.wikimedia.org/T306469) (owner: BryanDavis)
[15:34:39] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Machine-Learning-Team: Q1:rack/setup/install new machine learning hosts - https://phabricator.wikimedia.org/T314587 (RobH)
[15:34:58] <icinga-wm>	 PROBLEM - Host ores2006.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[15:35:20] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Machine-Learning-Team: Q1:rack/setup/install new machine learning hosts - https://phabricator.wikimedia.org/T314587 (BTullis)
[15:35:44] <icinga-wm>	 PROBLEM - Host phab2001.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[15:36:22] <icinga-wm>	 PROBLEM - Host ml-serve2003.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[15:36:28] <icinga-wm>	 RECOVERY - IPMI Sensor Status on ml-serve2006 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[15:36:30] <icinga-wm>	 PROBLEM - Host ps1-c5-codfw is DOWN: PING CRITICAL - Packet loss = 100%
[15:37:03] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Machine-Learning-Team: Q1:rack/setup/install new machine learning hosts - https://phabricator.wikimedia.org/T314587 (RobH)
[15:37:06] <_joe_>	 !log uncordoning ml-serve200{1,6}
[15:37:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:37:45] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Machine-Learning-Team: Q1:rack/setup/install new machine learning hosts - https://phabricator.wikimedia.org/T314587 (RobH) a:Jclark-ctr
[15:38:58] <icinga-wm>	 PROBLEM - Juniper alarms on asw-c-codfw is CRITICAL: JNX_ALARMS CRITICAL - 1 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm
[15:40:19] <wikibugs>	 (CR) Cwhite: [C: +2] striker: route syslog output to ELK cluster via kafka [puppet] - https://gerrit.wikimedia.org/r/820238 (https://phabricator.wikimedia.org/T306469) (owner: BryanDavis)
[15:41:44] <icinga-wm>	 PROBLEM - Host ganeti2011.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[15:41:44] <icinga-wm>	 PROBLEM - Host ganeti2012.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[15:42:08] <icinga-wm>	 PROBLEM - IPMI Sensor Status on ganeti2012 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[15:44:33] <wikibugs>	 SRE, SRE-swift-storage, ops-codfw, DBA: (Need By:TBD) rack/setup/install row C new PDUs - https://phabricator.wikimedia.org/T310145 (ssingh)
[15:45:04] <wikibugs>	 (PS1) Ahmon Dancy: Update the known host key for gerrit2002.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/820474 (https://phabricator.wikimedia.org/T243027)
[15:46:00] <icinga-wm>	 RECOVERY - Juniper alarms on asw-c-codfw is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm
[15:47:24] <icinga-wm>	 RECOVERY - Host elastic2033.mgmt is UP: PING OK - Packet loss = 0%, RTA = 38.51 ms
[15:47:24] <icinga-wm>	 RECOVERY - Host elastic2032.mgmt is UP: PING OK - Packet loss = 0%, RTA = 44.62 ms
[15:48:06] <icinga-wm>	 RECOVERY - Host ganeti2011.mgmt is UP: PING OK - Packet loss = 0%, RTA = 34.65 ms
[15:48:06] <icinga-wm>	 RECOVERY - Host ganeti2012.mgmt is UP: PING OK - Packet loss = 0%, RTA = 34.50 ms
[15:48:20] <icinga-wm>	 RECOVERY - Host parse2011.mgmt is UP: PING OK - Packet loss = 0%, RTA = 44.89 ms
[15:48:38] <icinga-wm>	 RECOVERY - Host phab2001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.74 ms
[15:48:46] <icinga-wm>	 RECOVERY - Host gitlab-runner2003.mgmt is UP: PING OK - Packet loss = 0%, RTA = 47.35 ms
[15:48:48] <icinga-wm>	 RECOVERY - Host parse2011 is UP: PING OK - Packet loss = 0%, RTA = 33.17 ms
[15:49:12] <icinga-wm>	 PROBLEM - Host wdqs2008 is DOWN: PING CRITICAL - Packet loss = 100%
[15:49:18] <icinga-wm>	 RECOVERY - Host ml-serve2003.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.75 ms
[15:49:23] <wikibugs>	 (PS1) BCornwall: Revert "Revert "geodns: Map out African countries by DC latency"" [dns] - https://gerrit.wikimedia.org/r/820486
[15:49:58] <icinga-wm>	 RECOVERY - Host restbase2025.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.73 ms
[15:50:42] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on elastic2048.codfw.wmnet with reason: T310145
[15:50:46] <stashbot>	 T310145: (Need By:TBD) rack/setup/install row C new PDUs - https://phabricator.wikimedia.org/T310145
[15:50:56] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on elastic2048.codfw.wmnet with reason: T310145
[15:50:58] <icinga-wm>	 RECOVERY - Aggregate IPsec Tunnel Status eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status
[15:51:08] <icinga-wm>	 RECOVERY - Host db2165.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.80 ms
[15:51:59] <icinga-wm>	 RECOVERY - Host wdqs2001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.44 ms
[15:52:30] <wikibugs>	 (PS2) Ahmon Dancy: Update the known host key for gerrit2002.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/820474 (https://phabricator.wikimedia.org/T243027)
[15:52:32] <icinga-wm>	 RECOVERY - Host mc2030.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.42 ms
[15:52:36] <icinga-wm>	 RECOVERY - Host parse2012 is UP: PING OK - Packet loss = 0%, RTA = 33.18 ms
[15:52:42] <icinga-wm>	 RECOVERY - Host parse2013 is UP: PING OK - Packet loss = 0%, RTA = 33.17 ms
[15:53:52] <icinga-wm>	 RECOVERY - Host restbase2020.mgmt is UP: PING OK - Packet loss = 0%, RTA = 46.31 ms
[15:54:14] <icinga-wm>	 RECOVERY - Host db2102.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.47 ms
[15:54:16] <icinga-wm>	 RECOVERY - Host parse2013.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.81 ms
[15:54:20] <icinga-wm>	 RECOVERY - Host db2114.mgmt is UP: PING WARNING - Packet loss = 66%, RTA = 33.74 ms
[15:54:24] <icinga-wm>	 RECOVERY - Host ores2006.mgmt is UP: PING OK - Packet loss = 0%, RTA = 40.39 ms
[15:55:10] <icinga-wm>	 RECOVERY - Host parse2012.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.75 ms
[15:55:40] <icinga-wm>	 RECOVERY - Host logstash2002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 45.98 ms
[15:56:48] <icinga-wm>	 RECOVERY - Host restbase2016.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.72 ms
[15:57:19] <icinga-wm>	 PROBLEM - Host kubetcd2005 is DOWN: PING CRITICAL - Packet loss = 100%
[15:58:16] <icinga-wm>	 PROBLEM - Host ganeti2012 is DOWN: PING CRITICAL - Packet loss = 100%
[15:58:18] <icinga-wm>	 PROBLEM - Host ml-etcd2002 is DOWN: PING CRITICAL - Packet loss = 100%
[15:58:22] <icinga-wm>	 PROBLEM - Host build2001 is DOWN: PING CRITICAL - Packet loss = 100%
[15:59:32] <icinga-wm>	 RECOVERY - Host db2126.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.80 ms
[16:00:00] <icinga-wm>	 RECOVERY - Host db2166.mgmt is UP: PING OK - Packet loss = 0%, RTA = 35.49 ms
[16:00:04] <icinga-wm>	 RECOVERY - Host ganeti2012 is UP: PING OK - Packet loss = 0%, RTA = 33.26 ms
[16:00:05] <jouncebot>	 jbond and rzl: (Dis)respected human, time to deploy Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220804T1600). Please do the needful.
[16:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[16:00:22] <icinga-wm>	 RECOVERY - Host ores2006 is UP: PING OK - Packet loss = 0%, RTA = 33.20 ms
[16:00:28] <icinga-wm>	 RECOVERY - Host mc2031.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.43 ms
[16:00:30] <icinga-wm>	 RECOVERY - Host kubetcd2005 is UP: PING OK - Packet loss = 0%, RTA = 33.43 ms
[16:00:36] <icinga-wm>	 RECOVERY - Host ml-etcd2002 is UP: PING OK - Packet loss = 0%, RTA = 33.36 ms
[16:00:52] <dancy>	 jbond/rzl: Can you process https://gerrit.wikimedia.org/r/c/operations/puppet/+/820474 ?
[16:00:54] <icinga-wm>	 PROBLEM - Check systemd state on ganeti2012 is CRITICAL: CRITICAL - degraded: The following units failed: nic-saturation-exporter.service,prometheus-ganeti-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:01:02] <icinga-wm>	 RECOVERY - Host build2001 is UP: PING OK - Packet loss = 0%, RTA = 33.40 ms
[16:01:44] <icinga-wm>	 RECOVERY - Host db2126 is UP: PING OK - Packet loss = 0%, RTA = 33.20 ms
[16:02:02] <jinxer-wm>	 (WdqsStreamingUpdaterFlinkJobNotRunning) firing: WDQS_Streaming_Updater in codfw (k8s) is not running - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DWdqsStreamingUpdaterFlinkJobNotRunning
[16:02:07] <jinxer-wm>	 (WcqsStreamingUpdaterFlinkJobNotRunning) firing: WCQS_Streaming_Updater in codfw (k8s) is not running - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DWcqsStreamingUpdaterFlinkJobNotRunning
[16:02:54] <logmsgbot>	 !log ebysans@deploy1002 Started deploy [analytics/refinery@2553288]: Regular analytics weekly train [analytics/refinery@2553288]
[16:02:58] <icinga-wm>	 RECOVERY - Check systemd state on ganeti2012 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:02:58] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on ms-be[2036,2049,2054].codfw.wmnet,thanos-be2003.codfw.wmnet with reason: PDU work
[16:03:14] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on ms-be[2036,2049,2054].codfw.wmnet,thanos-be2003.codfw.wmnet with reason: PDU work
[16:03:21] <wikibugs>	 SRE, SRE-swift-storage, ops-codfw, DBA: (Need By:TBD) rack/setup/install row C new PDUs - https://phabricator.wikimedia.org/T310145 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=0f30d2ec-1037-4449-b903-79ae6c2ccede) set by mvernon@cumin1001 for 1 day, 0:00:00 on 4 host(s) and...
[16:03:30] <icinga-wm>	 PROBLEM - Host db2095.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[16:03:32] <icinga-wm>	 PROBLEM - Host db2115.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[16:04:10] <icinga-wm>	 PROBLEM - IPMI Sensor Status on ores2006 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[16:04:12] <icinga-wm>	 PROBLEM - Host es2022.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[16:04:28] <icinga-wm>	 PROBLEM - Cassandra instance data free space on restbase1016 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 7103 MB (20% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data
[16:05:01] <wikibugs>	 (PS5) Xcollazo: airflow - Configure new platform_eng instance and rename old one as legacy. [puppet] - https://gerrit.wikimedia.org/r/817774 (https://phabricator.wikimedia.org/T312858)
[16:05:21] <wikibugs>	 SRE, SRE-swift-storage, ops-codfw, DBA: (Need By:TBD) rack/setup/install row C new PDUs - https://phabricator.wikimedia.org/T310145 (bking)
[16:06:09] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (3) rsyslog on kubernetes2007:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[16:06:13] <jinxer-wm>	 (KubernetesRsyslogDown) firing: rsyslog on ml-serve2007:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[16:06:22] <icinga-wm>	 PROBLEM - Host db2127.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[16:06:29] <Emperor>	 !log shutdown ms-be20[39,49,54].codfw.wmnet,thanos-be2003 for PDU swap T310145
[16:06:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:06:32] <stashbot>	 T310145: (Need By:TBD) rack/setup/install row C new PDUs - https://phabricator.wikimedia.org/T310145
[16:06:32] <icinga-wm>	 PROBLEM - Host mw2361.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[16:06:32] <icinga-wm>	 PROBLEM - Host mw2360.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[16:06:32] <icinga-wm>	 PROBLEM - Host mw2362.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[16:06:32] <icinga-wm>	 PROBLEM - Host mw2356.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[16:06:32] <icinga-wm>	 PROBLEM - Host mw2359.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[16:06:33] <icinga-wm>	 PROBLEM - Host mw2363.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[16:06:49] <jinxer-wm>	 (WdqsStreamingUpdaterFlinkJobNotRunning) resolved: WDQS_Streaming_Updater in codfw (k8s) is not running - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DWdqsStreamingUpdaterFlinkJobNotRunning
[16:06:49] <jinxer-wm>	 (RdfStreamingUpdaterFlinkJobUnstable) firing: (2) WCQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable
[16:06:49] <jinxer-wm>	 (WcqsStreamingUpdaterFlinkJobNotRunning) resolved: WCQS_Streaming_Updater in codfw (k8s) is not running - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DWcqsStreamingUpdaterFlinkJobNotRunning
[16:06:58] <icinga-wm>	 PROBLEM - Host db2167.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[16:06:58] <icinga-wm>	 PROBLEM - Host mw2350.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[16:06:58] <icinga-wm>	 PROBLEM - Host mw2351.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[16:06:58] <icinga-wm>	 PROBLEM - Host mw2352.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[16:06:58] <icinga-wm>	 PROBLEM - Host mw2353.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[16:06:59] <icinga-wm>	 PROBLEM - Host mw2354.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[16:06:59] <icinga-wm>	 PROBLEM - Host mw2355.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[16:07:00] <icinga-wm>	 PROBLEM - Host mw2357.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[16:07:00] <icinga-wm>	 PROBLEM - Host mw2358.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[16:07:02] <icinga-wm>	 PROBLEM - Host ml-serve-ctrl2001 is DOWN: PING CRITICAL - Packet loss = 100%
[16:07:10] <icinga-wm>	 PROBLEM - Host webperf2003 is DOWN: PING CRITICAL - Packet loss = 100%
[16:07:16] <icinga-wm>	 PROBLEM - Host ganeti2014 is DOWN: PING CRITICAL - Packet loss = 100%
[16:07:18] <icinga-wm>	 PROBLEM - Host wdqs2008.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[16:07:26] <icinga-wm>	 PROBLEM - Host dragonfly-supernode2001 is DOWN: PING CRITICAL - Packet loss = 100%
[16:07:40] <icinga-wm>	 PROBLEM - Ensure NFS exports are maintained for new instances with NFS on labstore1004 is CRITICAL: CRITICAL - Expecting active but unit nfs-exportd is activating https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[16:07:46] <icinga-wm>	 PROBLEM - Host db2135.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[16:08:02] <icinga-wm>	 PROBLEM - Host dbproxy2003.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[16:08:12] <icinga-wm>	 RECOVERY - Host ganeti2014 is UP: PING OK - Packet loss = 0%, RTA = 33.23 ms
[16:08:14] <icinga-wm>	 PROBLEM - Host mw2364.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[16:08:14] <icinga-wm>	 PROBLEM - Host mw2365.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[16:08:33] <wikibugs>	 (CR) Jbond: [C: +2] Update the known host key for gerrit2002.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/820474 (https://phabricator.wikimedia.org/T243027) (owner: Ahmon Dancy)
[16:08:48] <icinga-wm>	 ACKNOWLEDGEMENT - Ensure NFS exports are maintained for new instances with NFS on labstore1004 is CRITICAL: CRITICAL - Expecting active but unit nfs-exportd is activating Andrew Bogott side-effect of rabbitmq work https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[16:08:56] <icinga-wm>	 PROBLEM - Host db2099.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[16:09:02] <icinga-wm>	 PROBLEM - Host db2116.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[16:09:16] <icinga-wm>	 PROBLEM - Host db2168.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[16:09:26] <icinga-wm>	 PROBLEM - Juniper alarms on asw-c-codfw is CRITICAL: JNX_ALARMS CRITICAL - 1 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm
[16:09:36] <icinga-wm>	 PROBLEM - Host db2179.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[16:09:36] <icinga-wm>	 PROBLEM - Host db2180.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[16:09:56] <icinga-wm>	 PROBLEM - Host ganeti2014.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[16:09:58] <icinga-wm>	 RECOVERY - k8s API server requests latencies on ml-serve-ctrl2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[16:10:02] <icinga-wm>	 PROBLEM - Host ganeti2013.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[16:10:04] <icinga-wm>	 PROBLEM - Host parse2014.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[16:10:06] <icinga-wm>	 PROBLEM - Host parse2015.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[16:10:38] <icinga-wm>	 RECOVERY - Host webperf2003 is UP: PING OK - Packet loss = 0%, RTA = 34.79 ms
[16:10:58] <icinga-wm>	 RECOVERY - Host ml-serve-ctrl2001 is UP: PING OK - Packet loss = 0%, RTA = 35.62 ms
[16:10:58] <jinxer-wm>	 (KubernetesCalicoDown) firing: ml-serve-ctrl2001.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[16:11:09] <wikibugs>	 (CR) BCornwall: [C: +2] Revert "Revert "geodns: Map out African countries by DC latency"" [dns] - https://gerrit.wikimedia.org/r/820486 (owner: BCornwall)
[16:11:36] <icinga-wm>	 RECOVERY - Ensure NFS exports are maintained for new instances with NFS on labstore1004 is OK: OK - nfs-exportd is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[16:11:49] <jinxer-wm>	 (RdfStreamingUpdaterFlinkJobUnstable) resolved: (2) WCQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable
[16:12:12] <cwhite>	 !log poweroff logstash2028 - T310145
[16:12:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:12:16] <stashbot>	 T310145: (Need By:TBD) rack/setup/install row C new PDUs - https://phabricator.wikimedia.org/T310145
[16:12:34] <icinga-wm>	 RECOVERY - IPMI Sensor Status on ganeti2012 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[16:12:40] <icinga-wm>	 PROBLEM - k8s API server requests latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[16:14:40] <icinga-wm>	 PROBLEM - Host logstash2028 is DOWN: PING CRITICAL - Packet loss = 100%
[16:15:25] <wikibugs>	 (CR) Ladsgroup: [C: +2] auto_schema: Start depooling codfw replicas [software] - https://gerrit.wikimedia.org/r/820142 (https://phabricator.wikimedia.org/T314486) (owner: Ladsgroup)
[16:15:45] <jinxer-wm>	 (JobUnavailable) firing: (5) Reduced availability for job dragonfly_supernode in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[16:15:58] <jinxer-wm>	 (KubernetesCalicoDown) resolved: ml-serve-ctrl2001.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[16:15:58] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (3) rsyslog on kubernetes2007:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[16:16:56] <wikibugs>	 (Merged) jenkins-bot: auto_schema: Start depooling codfw replicas [software] - https://gerrit.wikimedia.org/r/820142 (https://phabricator.wikimedia.org/T314486) (owner: Ladsgroup)
[16:17:55] <brett>	 !log deploying authdns - geodns: Map out African countries by DC latency (T311472)
[16:17:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:17:59] <stashbot>	 T311472: DRMRS: Geodns Configuration -- Phase 2  - https://phabricator.wikimedia.org/T311472
[16:18:44] <icinga-wm>	 RECOVERY - k8s requests count to the API on ml-serve-ctrl2001 is OK: (C)100 ge (W)50 ge 1.192 https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=1
[16:19:10] <icinga-wm>	 RECOVERY - Host db2135.mgmt is UP: PING OK - Packet loss = 0%, RTA = 363.55 ms
[16:19:28] <icinga-wm>	 RECOVERY - Host dbproxy2003.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.71 ms
[16:19:30] <icinga-wm>	 RECOVERY - Host wdqs2008.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.87 ms
[16:19:59] <dancy>	 Thanks jbond!
[16:20:10] <icinga-wm>	 RECOVERY - Host ganeti2014.mgmt is UP: PING OK - Packet loss = 0%, RTA = 34.67 ms
[16:20:24] <icinga-wm>	 RECOVERY - Host db2095.mgmt is UP: PING OK - Packet loss = 0%, RTA = 37.58 ms
[16:20:26] <icinga-wm>	 RECOVERY - Host db2115.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.76 ms
[16:20:26] <icinga-wm>	 RECOVERY - Host db2099.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.30 ms
[16:20:32] <icinga-wm>	 RECOVERY - Host db2116.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.72 ms
[16:20:34] <Amir1>	 jouncebot: nowandnext
[16:20:34] <jouncebot>	 For the next 0 hour(s) and 39 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220804T1600)
[16:20:34] <jouncebot>	 In 0 hour(s) and 39 minute(s): Technical Engagement weekly deploy (Toolhub, Developer portal, Striker) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220804T1700)
[16:20:58] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (2) rsyslog on ml-serve-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[16:21:06] <icinga-wm>	 RECOVERY - Host es2022.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.74 ms
[16:21:34] <icinga-wm>	 RECOVERY - Host ganeti2013.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.94 ms
[16:22:01] <wikibugs>	 (PS2) Ladsgroup: Start reading from new templatelinks columns in commons [mediawiki-config] - https://gerrit.wikimedia.org/r/820376 (https://phabricator.wikimedia.org/T306673)
[16:22:05] <wikibugs>	 (CR) Ladsgroup: [C: +2] Start reading from new templatelinks columns in commons [mediawiki-config] - https://gerrit.wikimedia.org/r/820376 (https://phabricator.wikimedia.org/T306673) (owner: Ladsgroup)
[16:22:06] <icinga-wm>	 RECOVERY - Juniper alarms on asw-c-codfw is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm
[16:22:59] <wikibugs>	 (Merged) jenkins-bot: Start reading from new templatelinks columns in commons [mediawiki-config] - https://gerrit.wikimedia.org/r/820376 (https://phabricator.wikimedia.org/T306673) (owner: Ladsgroup)
[16:23:44] <icinga-wm>	 RECOVERY - Host mw2350 is UP: PING OK - Packet loss = 0%, RTA = 33.12 ms
[16:23:46] <icinga-wm>	 RECOVERY - Host mw2353 is UP: PING OK - Packet loss = 0%, RTA = 33.16 ms
[16:23:48] <icinga-wm>	 RECOVERY - Host mw2354 is UP: PING OK - Packet loss = 0%, RTA = 33.20 ms
[16:23:52] <icinga-wm>	 RECOVERY - Host mw2351 is UP: PING OK - Packet loss = 0%, RTA = 33.65 ms
[16:24:00] <icinga-wm>	 RECOVERY - Host mw2352 is UP: PING OK - Packet loss = 0%, RTA = 33.25 ms
[16:24:00] <icinga-wm>	 RECOVERY - Host mw2355 is UP: PING OK - Packet loss = 0%, RTA = 33.17 ms
[16:24:04] <icinga-wm>	 RECOVERY - Host mw2356.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.80 ms
[16:24:22] <icinga-wm>	 RECOVERY - Host wdqs2008 is UP: PING OK - Packet loss = 0%, RTA = 33.47 ms
[16:24:30] <icinga-wm>	 RECOVERY - Host mw2351.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.70 ms
[16:24:30] <icinga-wm>	 RECOVERY - Host mw2350.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.69 ms
[16:24:30] <icinga-wm>	 RECOVERY - Host mw2352.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.68 ms
[16:24:30] <icinga-wm>	 RECOVERY - Host mw2353.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.62 ms
[16:24:30] <icinga-wm>	 RECOVERY - Host mw2355.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.64 ms
[16:24:30] <icinga-wm>	 RECOVERY - Host mw2354.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.70 ms
[16:24:38] <icinga-wm>	 PROBLEM - Query Service HTTP Port on wdqs2008 is CRITICAL: connect to address 127.0.0.1 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[16:25:42] <icinga-wm>	 PROBLEM - Blazegraph process -wdqs-blazegraph- on wdqs2008 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 499 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[16:25:58] <icinga-wm>	 PROBLEM - WDQS SPARQL on wdqs2008 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - string http://www.w3.org/2001/XML... not found on https://query.wikidata.org:443/bigdata/namespace/wdq/sparql?query=SELECT%20*%20WHERE%20%7Bwikibase%3ADump%20schema%3AdateModified%20%3Fy%7D%20LIMIT%201 - 244 bytes in 1.183 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[16:26:14] <icinga-wm>	 PROBLEM - Blazegraph process -wdqs-categories- on wdqs2008 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 499 (blazegraph), regex args ^java .* --port 9990 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[16:26:44] <icinga-wm>	 RECOVERY - Query Service HTTP Port on wdqs2008 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.135 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[16:26:49] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[16:27:44] <icinga-wm>	 RECOVERY - Blazegraph process -wdqs-blazegraph- on wdqs2008 is OK: PROCS OK: 1 process with UID = 499 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[16:27:46] <icinga-wm>	 RECOVERY - Host parse2014.mgmt is UP: PING OK - Packet loss = 0%, RTA = 45.22 ms
[16:27:49] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[16:27:51] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[16:28:00] <icinga-wm>	 RECOVERY - WDQS SPARQL on wdqs2008 is OK: HTTP OK: HTTP/1.1 200 OK - 690 bytes in 1.231 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[16:28:20] <icinga-wm>	 RECOVERY - Blazegraph process -wdqs-categories- on wdqs2008 is OK: PROCS OK: 1 process with UID = 499 (blazegraph), regex args ^java .* --port 9990 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[16:28:39] <logmsgbot>	 !log ladsgroup@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:820376|Start reading from new templatelinks columns in commons (T306673)]] (duration: 03m 00s)
[16:28:42] <stashbot>	 T306673: Turn on read new for templatelinks on beta and production - https://phabricator.wikimedia.org/T306673
[16:28:57] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[16:29:55] <icinga-wm>	 RECOVERY - Host mw2357 is UP: PING OK - Packet loss = 0%, RTA = 33.18 ms
[16:29:57] <icinga-wm>	 RECOVERY - Host mw2356 is UP: PING OK - Packet loss = 0%, RTA = 33.14 ms
[16:29:57] <icinga-wm>	 RECOVERY - Host mw2360 is UP: PING OK - Packet loss = 0%, RTA = 33.21 ms
[16:29:59] <icinga-wm>	 RECOVERY - Host mw2362 is UP: PING OK - Packet loss = 0%, RTA = 33.23 ms
[16:29:59] <icinga-wm>	 RECOVERY - Host mw2361 is UP: PING OK - Packet loss = 0%, RTA = 33.17 ms
[16:30:01] <icinga-wm>	 RECOVERY - Host mw2363 is UP: PING OK - Packet loss = 0%, RTA = 33.23 ms
[16:30:07] <icinga-wm>	 RECOVERY - Host mw2358 is UP: PING OK - Packet loss = 0%, RTA = 33.29 ms
[16:30:07] <icinga-wm>	 RECOVERY - Host mw2357.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.73 ms
[16:30:09] <icinga-wm>	 RECOVERY - Host mw2365 is UP: PING OK - Packet loss = 0%, RTA = 33.37 ms
[16:30:11] <icinga-wm>	 RECOVERY - Host mw2358.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.73 ms
[16:30:11] <icinga-wm>	 RECOVERY - Host mw2359 is UP: PING OK - Packet loss = 0%, RTA = 33.22 ms
[16:30:13] <icinga-wm>	 RECOVERY - Host mw2364 is UP: PING OK - Packet loss = 0%, RTA = 33.21 ms
[16:30:38] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depool D3 for PDU maint', diff saved to https://phabricator.wikimedia.org/P32286 and previous config saved to /var/cache/conftool/dbconfig/20220804-163037-ladsgroup.json
[16:30:39] <stashbot>	 D3: test - ignore - https://phabricator.wikimedia.org/D3
[16:30:45] <icinga-wm>	 RECOVERY - Host db2127.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.73 ms
[16:30:53] <icinga-wm>	 RECOVERY - Host mw2359.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.75 ms
[16:30:53] <icinga-wm>	 RECOVERY - Host mw2360.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.69 ms
[16:30:53] <icinga-wm>	 RECOVERY - Host mw2361.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.69 ms
[16:30:53] <icinga-wm>	 RECOVERY - Host mw2362.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.91 ms
[16:30:53] <icinga-wm>	 RECOVERY - Host mw2363.mgmt is UP: PING OK - Packet loss = 0%, RTA = 34.26 ms
[16:31:11] <icinga-wm>	 RECOVERY - Host mw2364.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.77 ms
[16:31:11] <icinga-wm>	 RECOVERY - Host mw2365.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.68 ms
[16:31:47] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET
[16:32:13] <icinga-wm>	 RECOVERY - Host db2168.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.74 ms
[16:32:25] <icinga-wm>	 RECOVERY - Host parse2015 is UP: PING OK - Packet loss = 0%, RTA = 33.22 ms
[16:32:25] <icinga-wm>	 RECOVERY - Host parse2014 is UP: PING OK - Packet loss = 0%, RTA = 33.22 ms
[16:32:33] <icinga-wm>	 RECOVERY - Host db2179.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.79 ms
[16:32:33] <icinga-wm>	 RECOVERY - Host db2180.mgmt is UP: PING OK - Packet loss = 0%, RTA = 34.33 ms
[16:32:54] <logmsgbot>	 !log ebysans@deploy1002 Finished deploy [analytics/refinery@2553288]: Regular analytics weekly train [analytics/refinery@2553288] (duration: 29m 59s)
[16:32:59] <icinga-wm>	 RECOVERY - Host parse2015.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.74 ms