[00:14:23] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata for Skye Berghel - https://phabricator.wikimedia.org/T301581 (Dzahn) >>! In T301581#7705703, @RhinosF1 wrote: > SRE will be able to check on their tracking sheets or confirm with legal. No need to worry but thanks for being super clear...
[00:44:11] RECOVERY - Check systemd state on logstash1026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:48:25] PROBLEM - Check systemd state on logstash2026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:56:05] (JobUnavailable) firing: (2) Reduced availability for job etherpad in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org
[01:01:17] (Abandoned) MSantos: maps: don't load maps services in maps master [puppet] - https://gerrit.wikimedia.org/r/762477 (owner: MSantos)
[01:40:30] (JobUnavailable) firing: (3) Reduced availability for job etherpad in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org
[02:00:04] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220215T0200)
[02:04:44] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn
[02:04:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:05:46] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[02:05:48] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn
[02:05:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:06:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:07:04] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[02:07:44] (PS1) TrainBranchBot: Branch commit for wmf/1.38.0-wmf.22 [core] (wmf/1.38.0-wmf.22) - https://gerrit.wikimedia.org/r/762569
[02:07:50] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/1.38.0-wmf.22 [core] (wmf/1.38.0-wmf.22) - https://gerrit.wikimedia.org/r/762569 (owner: TrainBranchBot)
[02:07:55] !log mbsantos@deploy1002 Started deploy [kartotherian/deploy@3dc404c] (eqiad): Merge "Update kartotherian-package to f239c6e"
[02:07:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:09:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:14:14] !log mbsantos@deploy1002 Finished deploy [kartotherian/deploy@3dc404c] (eqiad): Merge "Update kartotherian-package to f239c6e" (duration: 06m 19s)
[02:14:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:21:48] (CR) jerkins-bot: [V: -1] Branch commit for wmf/1.38.0-wmf.22 [core] (wmf/1.38.0-wmf.22) - https://gerrit.wikimedia.org/r/762569 (owner: TrainBranchBot)
[02:22:36] (Merged) jenkins-bot: Branch commit for wmf/1.38.0-wmf.22 [core] (wmf/1.38.0-wmf.22) - https://gerrit.wikimedia.org/r/762569 (owner: TrainBranchBot)
[02:27:19] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn
[02:27:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:28:11] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[02:28:12] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn
[02:28:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:28:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:29:19] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[02:29:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:35:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling db2136 (after maint)', diff saved to https://phabricator.wikimedia.org/P20746 and previous config saved to /var/cache/conftool/dbconfig/20220215-023518-ladsgroup.json
[02:35:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:37:04] SRE, ops-codfw, DBA: codfw: db2136: Correctable memory error rate exceeded for DIMM_B5 - https://phabricator.wikimedia.org/T301713 (Ladsgroup) Replicated caught up, repooled.
[05:08:13] PROBLEM - SSH on kubernetes1004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:20:07] (PS1) KartikMistry: Update cxserver to 2022-02-15-050044-production [deployment-charts] - https://gerrit.wikimedia.org/r/762575 (https://phabricator.wikimedia.org/T301443)
[05:35:00] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1145.eqiad.wmnet with reason: Maintenance
[05:35:02] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1145.eqiad.wmnet with reason: Maintenance
[05:35:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:35:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:41:05] (JobUnavailable) firing: (2) Reduced availability for job etherpad in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org
[05:46:08] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance
[05:46:10] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance
[05:46:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:46:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:50:24] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1140.eqiad.wmnet with reason: Maintenance
[05:50:25] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1140.eqiad.wmnet with reason: Maintenance
[05:50:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:50:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:54:35] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1131.eqiad.wmnet with reason: Maintenance
[05:54:37] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1131.eqiad.wmnet with reason: Maintenance
[05:54:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:54:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:54:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1131 (T300381)', diff saved to https://phabricator.wikimedia.org/P20747 and previous config saved to /var/cache/conftool/dbconfig/20220215-055441-marostegui.json
[05:54:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:54:46] T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381
[05:56:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131 (T300381)', diff saved to https://phabricator.wikimedia.org/P20748 and previous config saved to /var/cache/conftool/dbconfig/20220215-055655-marostegui.json
[05:57:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:58:49] !log Remove watchdog@10.% user from es1-es5 T301442
[05:58:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:58:54] T301442: Audit and remove watchdog user - https://phabricator.wikimedia.org/T301442
[05:59:55] !log Remove watchdog@10.% user from pc1-pc3 T301442
[05:59:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:12:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131', diff saved to https://phabricator.wikimedia.org/P20749 and previous config saved to /var/cache/conftool/dbconfig/20220215-061200-marostegui.json
[06:12:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:17:08] (CR) Kevin Bazira: [C: +1] "LGTM" [deployment-charts] - https://gerrit.wikimedia.org/r/762532 (https://phabricator.wikimedia.org/T301415) (owner: Accraze)
[06:27:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131', diff saved to https://phabricator.wikimedia.org/P20750 and previous config saved to /var/cache/conftool/dbconfig/20220215-062705-marostegui.json
[06:27:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:36:20] (CR) Kevin Bazira: [C: +1] "LGTM" [deployment-charts] - https://gerrit.wikimedia.org/r/762533 (https://phabricator.wikimedia.org/T301415) (owner: Accraze)
[06:36:25] (PS2) Marostegui: mariadb: Promote db1183 to m3 master [puppet] - https://gerrit.wikimedia.org/r/762146 (https://phabricator.wikimedia.org/T301219)
[06:37:17] (CR) Marostegui: [C: +2] mariadb: Promote db1183 to m3 master [puppet] - https://gerrit.wikimedia.org/r/762146 (https://phabricator.wikimedia.org/T301219) (owner: Marostegui)
[06:42:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131 (T300381)', diff saved to https://phabricator.wikimedia.org/P20751 and previous config saved to /var/cache/conftool/dbconfig/20220215-064209-marostegui.json
[06:42:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:42:14] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2129.codfw.wmnet with reason: Maintenance
[06:42:16] T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381
[06:42:16] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2129.codfw.wmnet with reason: Maintenance
[06:42:17] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 8 hosts with reason: Maintenance
[06:42:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:42:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:42:24] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 8 hosts with reason: Maintenance
[06:42:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:42:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:46:25] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1098.eqiad.wmnet with reason: Maintenance
[06:46:27] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1098.eqiad.wmnet with reason: Maintenance
[06:46:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:46:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1098:3316 (T300381)', diff saved to https://phabricator.wikimedia.org/P20752 and previous config saved to /var/cache/conftool/dbconfig/20220215-064631-marostegui.json
[06:46:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:46:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:51:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316 (T300381)', diff saved to https://phabricator.wikimedia.org/P20753 and previous config saved to /var/cache/conftool/dbconfig/20220215-065139-marostegui.json
[06:51:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:51:45] T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381
[07:06:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316', diff saved to https://phabricator.wikimedia.org/P20754 and previous config saved to /var/cache/conftool/dbconfig/20220215-070644-marostegui.json
[07:06:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:08:27] (PS1) Marostegui: db1107: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/762683 (https://phabricator.wikimedia.org/T301654)
[07:09:07] (CR) Marostegui: [C: +2] db1107: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/762683 (https://phabricator.wikimedia.org/T301654) (owner: Marostegui)
[07:09:35] (CR) Ayounsi: [C: +1] "Thanks! No strong preference here, as long as Rancid works I'm fine :)" [puppet] - https://gerrit.wikimedia.org/r/762536 (https://phabricator.wikimedia.org/T211459) (owner: Dzahn)
[07:21:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316', diff saved to https://phabricator.wikimedia.org/P20755 and previous config saved to /var/cache/conftool/dbconfig/20220215-072149-marostegui.json
[07:21:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:36:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316 (T300381)', diff saved to https://phabricator.wikimedia.org/P20756 and previous config saved to /var/cache/conftool/dbconfig/20220215-073653-marostegui.json
[07:36:55] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1180.eqiad.wmnet with reason: Maintenance
[07:36:57] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1180.eqiad.wmnet with reason: Maintenance
[07:36:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:37:00] T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381
[07:37:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1180 (T300381)', diff saved to https://phabricator.wikimedia.org/P20757 and previous config saved to /var/cache/conftool/dbconfig/20220215-073701-marostegui.json
[07:37:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:37:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:37:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:37:46] SRE, Data-Engineering, Infrastructure-Foundations, Product-Analytics, and 2 others: Maybe restrict domains accessible by webproxy - https://phabricator.wikimedia.org/T300977 (Joe) >> **First**: How difficult & how much overhead would it be to make the proxy redirect requests made to internal doma...
[07:40:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T300381)', diff saved to https://phabricator.wikimedia.org/P20758 and previous config saved to /var/cache/conftool/dbconfig/20220215-074005-marostegui.json
[07:40:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:42:11] (CR) Elukey: [C: +2] Add ml-serve200[7,8] to the k8s ml-serve-codfw cluster [puppet] - https://gerrit.wikimedia.org/r/762491 (https://phabricator.wikimedia.org/T300744) (owner: Elukey)
[07:42:18] (PS3) Elukey: Add ml-serve200[7,8] to the k8s ml-serve-codfw cluster [puppet] - https://gerrit.wikimedia.org/r/762491 (https://phabricator.wikimedia.org/T300744)
[07:55:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P20759 and previous config saved to /var/cache/conftool/dbconfig/20220215-075510-marostegui.json
[07:55:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:00:04] !log Failover m3 from db1107 to db1183 - T301219
[08:00:05] Amir1, awight, and Urbanecm: #bothumor My software never has bugs. It just develops random features. Rise for UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220215T0800).
[08:00:05] No Gerrit patches in the queue for this window AFAICS.
[08:00:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:00:08] T301219: Switchover m3 master (db1107 -> db1183) - https://phabricator.wikimedia.org/T301219
[08:00:10] o/
[08:00:27] good morning! Looks like there's nothing to deploy :-)
[08:00:58] all done
[08:01:35] I seem to be able to write to phab
[08:05:46] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=UPDATE https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[08:07:14] (PS1) Muehlenhoff: Update MOU date for danira [puppet] - https://gerrit.wikimedia.org/r/762738
[08:07:18] RECOVERY - k8s API server requests latencies on kubestagemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[08:09:46] (PS1) Marostegui: db2134: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/762739 (https://phabricator.wikimedia.org/T301654)
[08:10:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P20760 and previous config saved to /var/cache/conftool/dbconfig/20220215-081015-marostegui.json
[08:10:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:10:41] (CR) Marostegui: [C: +2] db2134: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/762739 (https://phabricator.wikimedia.org/T301654) (owner: Marostegui)
[08:11:31] (CR) Muehlenhoff: [C: +2] Update MOU date for danira [puppet] - https://gerrit.wikimedia.org/r/762738 (owner: Muehlenhoff)
[08:12:46] (PS1) Marostegui: db2134: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/762741 (https://phabricator.wikimedia.org/T301654)
[08:13:59] SRE-swift-storage, MW-on-K8s, Shellbox, serviceops, and 2 others: Support large files in Shellbox - https://phabricator.wikimedia.org/T292322 (Joe) After more digging: I have no idea why envoy would report the upstream time spent as 2 seconds, when it really is 20. Looks like a bug there. So: mos...
[08:14:03] (CR) Marostegui: [C: +2] db2134: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/762741 (https://phabricator.wikimedia.org/T301654) (owner: Marostegui)
[08:14:25] (CR) Marostegui: "This was supposed to be db2135" [puppet] - https://gerrit.wikimedia.org/r/762741 (https://phabricator.wikimedia.org/T301654) (owner: Marostegui)
[08:15:05] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host db2135.codfw.wmnet with OS bullseye
[08:15:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:18:52] PROBLEM - haproxy failover on dbproxy2004 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[08:20:27] (KubernetesRsyslogDown) firing: rsyslog on ml-serve2008:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org
[08:20:32] (PS1) Marostegui: Revert "db2134: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/762749
[08:21:23] 2008 is me
[08:21:27] (CR) Marostegui: [C: +2] Revert "db2134: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/762749 (owner: Marostegui)
[08:22:25] elukey: old skool
[08:25:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T300381)', diff saved to https://phabricator.wikimedia.org/P20761 and previous config saved to /var/cache/conftool/dbconfig/20220215-082519-marostegui.json
[08:25:22] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1165.eqiad.wmnet with reason: Maintenance
[08:25:24] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1165.eqiad.wmnet with reason: Maintenance
[08:25:25] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[08:25:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:25:27] T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381
[08:25:29] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[08:25:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:25:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:25:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1165 (T300381)', diff saved to https://phabricator.wikimedia.org/P20762 and previous config saved to /var/cache/conftool/dbconfig/20220215-082533-marostegui.json
[08:25:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:25:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:25:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:25:47] kormat: Filippo is my teacher (he can say "Thanos is me" though, way more powerful)
[08:26:40] ACKNOWLEDGEMENT - haproxy failover on dbproxy2004 is CRITICAL: CRITICAL check_failover servers up 1 down 1: Marostegui known https://wikitech.wikimedia.org/wiki/HAProxy
[08:30:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T300381)', diff saved to https://phabricator.wikimedia.org/P20763 and previous config saved to /var/cache/conftool/dbconfig/20220215-083039-marostegui.json
[08:30:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:30:44] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:30:48] T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381
[08:32:32] PROBLEM - Host ml-serve2007 is DOWN: PING CRITICAL - Packet loss = 100%
[08:32:35] !log installing apache security updates on thanos nodes
[08:32:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:32:45] I am rebooting ml-serve2007
[08:34:50] RECOVERY - Host ml-serve2007 is UP: PING OK - Packet loss = 0%, RTA = 33.26 ms
[08:40:25] SRE, Wikimedia-Etherpad, serviceops, vm-requests, Patch-For-Review: create bullseye VM for Etherpad upgrade (and upgrade it to 1.8.16) - https://phabricator.wikimedia.org/T300568 (Volans) >>! In T300568#7708409, @Dzahn wrote: > @Volans Yes, it has been fixed by making etherpad listen on "::"...
[08:44:29] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2135.codfw.wmnet with OS bullseye
[08:44:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:45:22] (PS1) Marostegui: Revert "db2134: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/762750
[08:45:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P20764 and previous config saved to /var/cache/conftool/dbconfig/20220215-084544-marostegui.json
[08:45:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:46:22] (CR) Marostegui: [C: +2] Revert "db2134: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/762750 (owner: Marostegui)
[08:47:35] I am going to start running the train deployment.
[08:51:25] (CR) Jelto: [C: +2] gitlab: rename test instance, use letsencrypt certs [puppet] - https://gerrit.wikimedia.org/r/762495 (https://phabricator.wikimedia.org/T297411) (owner: Jelto)
[08:51:31] (PS1) Marostegui: sanitarium_multiinstance.my.cnf: innodb_adaptive_hash_index=OFF [puppet] - https://gerrit.wikimedia.org/r/762744 (https://phabricator.wikimedia.org/T268869)
[08:53:41] (CR) Marostegui: [C: +2] sanitarium_multiinstance.my.cnf: innodb_adaptive_hash_index=OFF [puppet] - https://gerrit.wikimedia.org/r/762744 (https://phabricator.wikimedia.org/T268869) (owner: Marostegui)
[08:54:45] !log imported openjdk-8 8u322-b06-1~deb10u1 for buster-wikimedia (forward port of latest Java 8 security fixes)
[08:54:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:56:02] !log rolling out python3-wmflib 1.0.2-1 across the fleet
[08:56:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:58:22] (CR) Filippo Giunchedi: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/760615 (https://phabricator.wikimedia.org/T282787) (owner: BBlack)
[08:58:32] (CR) Elukey: [C: +2] Add ml-serve200[7,8] to the k8s ml-serve-codfw cluster [homer/public] - https://gerrit.wikimedia.org/r/762496 (https://phabricator.wikimedia.org/T300744) (owner: Elukey)
[09:00:05] hashar and jeena: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for MediaWiki train - Utc-0+Utc-7 Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220215T0900).
[09:00:51] (CR) Filippo Giunchedi: [C: +1] "LGMT (untested)" [puppet] - https://gerrit.wikimedia.org/r/762536 (https://phabricator.wikimedia.org/T211459) (owner: Dzahn)
[09:04:25] !log elukey@puppetmaster1001 conftool action : set/pooled=yes; selector: name=ml-serve2007.codfw.wmnet
[09:04:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:04:29] !log elukey@puppetmaster1001 conftool action : set/pooled=yes; selector: name=ml-serve2008.codfw.wmnet
[09:04:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:08:47] (PS1) Hashar: testwikis wikis to 1.38.0-wmf.22 refs T300198 [mediawiki-config] - https://gerrit.wikimedia.org/r/762745
[09:08:49] (CR) Hashar: [C: +2] testwikis wikis to 1.38.0-wmf.22 refs T300198 [mediawiki-config] - https://gerrit.wikimedia.org/r/762745 (owner: Hashar)
[09:09:16] (PS1) Filippo Giunchedi: hieradata: swap prometheus100[46] [puppet] - https://gerrit.wikimedia.org/r/762766 (https://phabricator.wikimedia.org/T296199)
[09:09:32] (Merged) jenkins-bot: testwikis wikis to 1.38.0-wmf.22 refs T300198 [mediawiki-config] - https://gerrit.wikimedia.org/r/762745 (owner: Hashar)
[09:09:36] !log hashar@deploy1002 Started scap: testwikis wikis to 1.38.0-wmf.22 refs T300198
[09:09:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:09:41] T300198: 1.38.0-wmf.22 deployment blockers - https://phabricator.wikimedia.org/T300198
[09:13:19] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn
[09:13:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:14:31] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[09:14:32] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn
[09:14:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:14:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:14:56] (PS1) Giuseppe Lavagetto: shellbox-media: use local disk for /tmp [deployment-charts] - https://gerrit.wikimedia.org/r/762768 (https://phabricator.wikimedia.org/T292322)
[09:15:28] dancy: `stage-train` is really a blessing :]
[09:15:38] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[09:15:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:15:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T300381)', diff saved to https://phabricator.wikimedia.org/P20766 and previous config saved to /var/cache/conftool/dbconfig/20220215-091554-marostegui.json
[09:15:56] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1096.eqiad.wmnet with reason: Maintenance
[09:15:58] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1096.eqiad.wmnet with reason: Maintenance
[09:15:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:15:59] T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381
[09:16:00] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1168.eqiad.wmnet with reason: Maintenance
[09:16:02] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1168.eqiad.wmnet with reason: Maintenance
[09:16:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:16:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:16:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1168 (T300381)', diff saved to https://phabricator.wikimedia.org/P20767 and previous config saved to /var/cache/conftool/dbconfig/20220215-091606-marostegui.json
[09:16:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:16:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:16:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:17:00] (CR) jerkins-bot: [V: -1] shellbox-media: use local disk for /tmp [deployment-charts] - https://gerrit.wikimedia.org/r/762768 (https://phabricator.wikimedia.org/T292322) (owner: Giuseppe Lavagetto)
[09:17:24] PROBLEM - Host mc2023 is DOWN: PING CRITICAL - Packet loss = 100%
[09:18:06] RECOVERY - Host mc2023 is UP: PING OK - Packet loss = 0%, RTA = 31.69 ms
[09:18:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T300381)', diff saved to https://phabricator.wikimedia.org/P20768 and previous config saved to /var/cache/conftool/dbconfig/20220215-091811-marostegui.json
[09:18:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:24:24] 09:24:05 Started sync-apaches
[09:33:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P20769 and previous config saved to /var/cache/conftool/dbconfig/20220215-093316-marostegui.json
[09:33:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:33:44] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2105.codfw.wmnet with reason: Maintenance
[09:33:46] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2105.codfw.wmnet with reason: Maintenance
[09:33:47] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on 6 hosts with reason: Maintenance
[09:33:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:33:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:33:52] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 6 hosts with reason: Maintenance
[09:33:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:33:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:38:25] (PS2) Ladsgroup: db-production: Stop writes to es5 [mediawiki-config] - https://gerrit.wikimedia.org/r/762557 (https://phabricator.wikimedia.org/T300976)
[09:41:05] (JobUnavailable) firing: (2) Reduced availability for job etherpad in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org
[09:41:47] (CR) Marostegui: [C: +1] db-production: Stop writes to es5 [mediawiki-config] - https://gerrit.wikimedia.org/r/762557 (https://phabricator.wikimedia.org/T300976) (owner: Ladsgroup)
[09:43:12] (PS1) Volans: netbox: inject also the device status [software/homer] - https://gerrit.wikimedia.org/r/762774
[09:43:29] hashar: when can I do a sync?
[09:43:54] 75% of sync-apaches done [09:43:58] so in like half an hour or so [09:45:31] awesome [09:48:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P20771 and previous config saved to /var/cache/conftool/dbconfig/20220215-094821-marostegui.json [09:48:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:49:31] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 6 hosts with reason: Primary switchover es5 T300006 [09:49:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:49:35] T300006: Upgrade es5 to Bullseye - https://phabricator.wikimedia.org/T300006 [09:49:36] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 6 hosts with reason: Primary switchover es5 T300006 [09:49:38] !log migrate instances off ganeti1022 [09:49:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:49:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:49:53] scap-cdb-rebuild in progress [09:52:01] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/758986 (https://phabricator.wikimedia.org/T300682) (owner: 10Dduvall) [09:55:30] !log hashar@deploy1002 Finished scap: testwikis wikis to 1.38.0-wmf.22 refs T300198 (duration: 45m 55s) [09:55:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:55:35] T300198: 1.38.0-wmf.22 deployment blockers - https://phabricator.wikimedia.org/T300198 [09:55:45] running scap clean for wmf.20 [09:56:16] (03CR) 10Ladsgroup: [C: 03+2] db-production: Stop writes to es5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/762557 (https://phabricator.wikimedia.org/T300976) (owner: 10Ladsgroup) [09:56:52] (03CR) 10Volans: "replies inline" [software/spicerack] - 10https://gerrit.wikimedia.org/r/747116 
(https://phabricator.wikimedia.org/T229397) (owner: 10Jbond) [09:56:59] (03Merged) 10jenkins-bot: db-production: Stop writes to es5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/762557 (https://phabricator.wikimedia.org/T300976) (owner: 10Ladsgroup) [09:58:46] !log hashar@deploy1002 Pruned MediaWiki: 1.38.0-wmf.20 (duration: 03m 08s) [09:58:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:59:53] ohhh, https://versions.toolforge.org/ has fancy new percentages below the group names [10:00:05] hashar and jeena: May I have your attention please! MediaWiki train - Utc-0+Utc-7 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220215T0900) [10:00:05] kormat, marostegui, and Amir1: That opportune time is upon us again. Time for a Database primary switchover for es5 deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220215T1000). [10:00:13] o/ [10:00:19] o/ [10:00:35] o/ [10:00:57] deploying the read-only for es5 now [10:01:04] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [10:01:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:01:16] (03PS2) 10Ladsgroup: mariadb: Promote es1023 to es5 master [puppet] - 10https://gerrit.wikimedia.org/r/762558 (https://phabricator.wikimedia.org/T300976) [10:01:25] !log ladsgroup@deploy1002 Synchronized wmf-config/db-production.php: Config: [[gerrit:762557|db-production: Stop writes to es5 (T300976)]] (duration: 00m 49s) [10:01:26] Amir1: you can sync your change [10:01:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:01:30] T300976: Switchover es5 master - https://phabricator.wikimedia.org/T300976 [10:01:33] I will promote group0 wikis after [10:01:41] hashar: thanks [10:01:50] (03CR) 10Muehlenhoff: [C: 04-1] aptrepo: add docker packages to thirdparty/ci for buster (031 comment) [puppet] - 
10https://gerrit.wikimedia.org/r/758986 (https://phabricator.wikimedia.org/T300682) (owner: 10Dduvall) [10:01:50] I might have screwed up with the hours of the windows :/ [10:02:19] hashar: I did actually, we postponed this window last week and I messed up when to put it [10:02:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Set es1023 with weight 0 T300006', diff saved to https://phabricator.wikimedia.org/P20772 and previous config saved to /var/cache/conftool/dbconfig/20220215-100253-ladsgroup.json [10:02:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:02:58] T300006: Upgrade es5 to Bullseye - https://phabricator.wikimedia.org/T300006 [10:03:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T300381)', diff saved to https://phabricator.wikimedia.org/P20773 and previous config saved to /var/cache/conftool/dbconfig/20220215-100325-marostegui.json [10:03:27] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1113.eqiad.wmnet with reason: Maintenance [10:03:29] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1113.eqiad.wmnet with reason: Maintenance [10:03:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:03:31] T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381 [10:03:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1113:3316 (T300381)', diff saved to https://phabricator.wikimedia.org/P20774 and previous config saved to /var/cache/conftool/dbconfig/20220215-100333-marostegui.json [10:03:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:03:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:03:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:05:19] (03CR) 10Elukey: [C: 03+2] ml-services: update editquality predictor 
image [deployment-charts] - 10https://gerrit.wikimedia.org/r/762532 (https://phabricator.wikimedia.org/T301415) (owner: 10Accraze) [10:06:57] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] mariadb: Promote es1023 to es5 master [puppet] - 10https://gerrit.wikimedia.org/r/762558 (https://phabricator.wikimedia.org/T300976) (owner: 10Ladsgroup) [10:07:15] (03CR) 10Elukey: ml-services: add arwiki & bnwiki editquality isvcs (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/762533 (https://phabricator.wikimedia.org/T301415) (owner: 10Accraze) [10:08:09] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [10:08:10] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [10:08:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:08:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:08:32] marostegui: can you check if there are still writes coming? 
I see the master log position moving [10:08:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316 (T300381)', diff saved to https://phabricator.wikimedia.org/P20775 and previous config saved to /var/cache/conftool/dbconfig/20220215-100840-marostegui.json [10:08:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:08:45] T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381 [10:08:57] Amir1: it is probably the heartbeat, but let me check [10:09:14] aah, okay I should check binlog [10:09:53] binlog is heartbeat only [10:10:00] Amir1: it is the heartbeat [10:10:12] cool [10:10:19] then starting the switchover now [10:10:21] !log Starting es5 eqiad failover from es1024 to es1023 - T300006 [10:10:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:10:26] T300006: Upgrade es5 to Bullseye - https://phabricator.wikimedia.org/T300006 [10:10:40] edit rate looking good [10:12:46] marostegui: > Please remember to run the following commands as root to update the events if they are Mediawiki databases: [10:12:47] (03CR) 10Filippo Giunchedi: [C: 03+2] hieradata: swap prometheus100[46] [puppet] - 10https://gerrit.wikimedia.org/r/762766 (https://phabricator.wikimedia.org/T296199) (owner: 10Filippo Giunchedi) [10:12:50] Should I run it? 
[10:13:07] Amir1: yes [10:13:28] also remember to give some weight to the new master if the old one will be depooled later [10:14:00] Amir1: let me know when I can promote group0 wikis :] [10:14:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Promote es1023 to es5 primary and set section read-write T300006', diff saved to https://phabricator.wikimedia.org/P20776 and previous config saved to /var/cache/conftool/dbconfig/20220215-101412-root.json [10:14:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:14:28] hashar: sure [10:14:30] marostegui: sure [10:14:48] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [10:14:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:15:02] hashar: please promote, nothing left to do (except clean up and giving weight to master) [10:15:11] Amir1: there's the RW enablement left [10:15:13] on MW [10:15:13] awesome [10:15:35] hashar: hang on [10:15:48] marostegui: that can happen later? 
[10:15:54] I thought it's okay [10:16:08] Amir1: sure, as long as you are on top of that yes [10:16:19] yeah, don't worry [10:16:25] ok [10:16:56] (03PS1) 10Ladsgroup: Revert "db-production: Stop writes to es5" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/762751 (https://phabricator.wikimedia.org/T300976) [10:17:18] holding holding :D [10:17:25] (03CR) 10Ayounsi: [C: 03+1] netbox: inject also the device status [software/homer] - 10https://gerrit.wikimedia.org/r/762774 (owner: 10Volans) [10:18:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Setting weight to es1023 T300006', diff saved to https://phabricator.wikimedia.org/P20777 and previous config saved to /var/cache/conftool/dbconfig/20220215-101817-root.json [10:18:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:18:23] T300006: Upgrade es5 to Bullseye - https://phabricator.wikimedia.org/T300006 [10:18:44] (03PS21) 10Jbond: reposync: add new class to manage syncing repositories [software/spicerack] - 10https://gerrit.wikimedia.org/r/747116 (https://phabricator.wikimedia.org/T229397) [10:19:08] (03CR) 10Jbond: reposync: add new class to manage syncing repositories (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/747116 (https://phabricator.wikimedia.org/T229397) (owner: 10Jbond) [10:19:16] marostegui: how can I check if a section is correctly on rw mode in dbctl? 
[10:19:16] (03PS22) 10Jbond: reposync: add new class to manage syncing repositories [software/spicerack] - 10https://gerrit.wikimedia.org/r/747116 (https://phabricator.wikimedia.org/T229397) [10:19:22] Amir1: checking [10:19:53] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [10:19:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:20:47] Amir1: it looks good [10:20:59] cool, now I'm turning on writes then [10:21:05] ok [10:21:06] (03CR) 10Ladsgroup: [C: 03+2] Revert "db-production: Stop writes to es5" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/762751 (https://phabricator.wikimedia.org/T300976) (owner: 10Ladsgroup) [10:21:59] (03Merged) 10jenkins-bot: Revert "db-production: Stop writes to es5" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/762751 (https://phabricator.wikimedia.org/T300976) (owner: 10Ladsgroup) [10:22:28] (03PS1) 10Kevin Bazira: ml-services: add bswiki & cawiki editquality isvcs [deployment-charts] - 10https://gerrit.wikimedia.org/r/762777 (https://phabricator.wikimedia.org/T301415) [10:22:56] (03PS1) 10Kormat: Add 2022/drop_fr_img_star_cols_T300774.py [software/schema-changes] - 10https://gerrit.wikimedia.org/r/762778 (https://phabricator.wikimedia.org/T300774) [10:23:26] !log ladsgroup@deploy1002 Synchronized wmf-config/db-production.php: Config: [[gerrit:762751|Revert "db-production: Stop writes to es5" (T300976)]] (duration: 00m 55s) [10:23:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:23:31] T300976: Switchover es5 master - https://phabricator.wikimedia.org/T300976 [10:23:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316', diff saved to https://phabricator.wikimedia.org/P20778 and previous config saved to /var/cache/conftool/dbconfig/20220215-102345-marostegui.json [10:23:46] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: 
sync on pinkunicorn [10:23:47] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [10:23:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:23:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:23:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:23:59] marostegui: I see writes coming but orchestrator is mad [10:24:12] Amir1: yes, that's one of the last steps in the switchover checklist [10:24:15] I assume it's the " Clean up heartbeat table(s)." [10:24:21] Amir1: yep [10:24:32] Amir1: can you update the task to see where we are at? I am a bit lost on which steps were done and not :) [10:24:52] Amir1: don't worry about the orchestrator lag at the moment, it is not affecting MW, we can deal with it with no rush [10:25:12] !log hnowlan@cumin1001 START - Cookbook sre.cassandra.roll-restart for nodes matching A:sessionstore: Restarting to pick up Java security updates - hnowlan@cumin1001 [10:25:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:25:30] (JobUnavailable) firing: (3) Reduced availability for job etherpad in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [10:26:07] Amir1: query killers were done too I believe right? [10:26:22] no, where should I even run them? 
[10:26:28] aaah [10:26:34] that's the curl thingy [10:26:43] yeeeep [10:26:43] yes, ran them [10:26:46] ok [10:27:20] so for the orchestrator part you simply need to go to the new master and do a select from heartbeat table, grab the OLD master server_id and run: delete from heartbeat where server_id=XXXXX [10:27:24] (with replication enabled) [10:27:37] that will remove the old entry from the old master and orchestrator will be happy about it and the lag will be gone [10:27:45] ok [10:27:46] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] toolforge cron runner: better support for secondary nodes [puppet] - 10https://gerrit.wikimedia.org/r/762470 (owner: 10Majavah) [10:28:07] (03CR) 10Volans: [C: 03+2] netbox: inject also the device status [software/homer] - 10https://gerrit.wikimedia.org/r/762774 (owner: 10Volans) [10:28:57] done [10:29:03] this really should be automated :D [10:29:06] looks good now [10:29:48] I do the DNS now, in the meantime hashar I think you're free to go, sorry for the mix-up [10:30:17] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [10:30:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:30:53] (03Merged) 10jenkins-bot: netbox: inject also the device status [software/homer] - 10https://gerrit.wikimedia.org/r/762774 (owner: 10Volans) [10:31:04] (03PS1) 10Ladsgroup: Update es5 master [dns] - 10https://gerrit.wikimedia.org/r/762779 (https://phabricator.wikimedia.org/T300976) [10:31:33] (03CR) 10Klausman: [C: 03+1] ml-services: update editquality predictor image [deployment-charts] - 10https://gerrit.wikimedia.org/r/762532 (https://phabricator.wikimedia.org/T301415) (owner: 10Accraze) [10:32:42] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] Update es5 master [dns] - 10https://gerrit.wikimedia.org/r/762779 (https://phabricator.wikimedia.org/T300976) (owner: 10Ladsgroup) [10:33:27] ran the update in authdns [10:35:26] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START 
helmfile.d/services/mwdebug: apply on pinkunicorn [10:35:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:36:42] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [10:36:43] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [10:36:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:36:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:37:22] zarcillo is already updated [10:37:23] | es5 | codfw | es2023 | [10:37:34] | es5 | eqiad | es1023 | [10:38:12] PROBLEM - Check systemd state on prometheus1006 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-node-exporter.service,wmf_auto_restart_prometheus-swagger-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:38:29] Amir1: great [10:38:44] hashar: we are done :) [10:38:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316', diff saved to https://phabricator.wikimedia.org/P20779 and previous config saved to /var/cache/conftool/dbconfig/20220215-103849-marostegui.json [10:38:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:40:37] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [10:40:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:42:03] (03CR) 10Ladsgroup: Add 2022/drop_fr_img_star_cols_T300774.py (031 comment) [software/schema-changes] - 10https://gerrit.wikimedia.org/r/762778 (https://phabricator.wikimedia.org/T300774) (owner: 10Kormat) [10:42:27] Amir1: cool thanks ! 
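Note on the heartbeat cleanup described at 10:27:20: the step amounts to finding the row that the old master left in the heartbeat table and deleting it on the new master, so that the delete replicates and the stale entry stops showing up as lag. The following is only a rough sketch of that idea, run against an in-memory SQLite table as a stand-in for the real database. The two-column table layout, the server IDs, and the timestamps are all made up for illustration; the log gives only the table name, the server_id column, and the shape of the delete statement.

```python
import sqlite3

# Hypothetical server IDs; in practice they are read from the heartbeat table.
OLD_MASTER_ID = 1001
NEW_MASTER_ID = 1002


def remove_stale_heartbeat(conn, old_server_id):
    """Delete the heartbeat row left by the old master; return rows removed."""
    cur = conn.execute(
        "DELETE FROM heartbeat WHERE server_id = ?", (old_server_id,)
    )
    conn.commit()
    return cur.rowcount


# Stand-in table: the real heartbeat table has more columns than this.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE heartbeat (server_id INTEGER PRIMARY KEY, ts TEXT)")
conn.executemany(
    "INSERT INTO heartbeat VALUES (?, ?)",
    [
        (OLD_MASTER_ID, "2022-02-15T10:10:00"),  # old master stopped updating
        (NEW_MASTER_ID, "2022-02-15T10:28:00"),  # new master still updating
    ],
)

# Step 1: look at which server_ids are present (the "select" in the log).
before = [row[0] for row in conn.execute(
    "SELECT server_id FROM heartbeat ORDER BY server_id")]

# Step 2: delete the old master's entry (the "delete" in the log).
removed = remove_stale_heartbeat(conn, OLD_MASTER_ID)

after = [row[0] for row in conn.execute(
    "SELECT server_id FROM heartbeat ORDER BY server_id")]
print(before, removed, after)
```

The delete is parameterized on server_id, so it only ever touches the old master's row; the new master's row is left in place and keeps being updated.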
[10:42:35] will promote group0 wikis in a few, I am finishing a meeting [10:45:30] (JobUnavailable) firing: (3) Reduced availability for job etherpad in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [10:45:31] (03PS1) 10MMandere: admin: Add saisuman production public key [puppet] - 10https://gerrit.wikimedia.org/r/762781 (https://phabricator.wikimedia.org/T300708) [10:49:18] (03CR) 10Cathal Mooney: "Few comments inline. Makes sense to keep this open for now though agreed. Drop me a line if you've any questions on the overall work Suk" [homer/public] - 10https://gerrit.wikimedia.org/r/761364 (https://phabricator.wikimedia.org/T301165) (owner: 10Ssingh) [10:51:21] (03PS1) 10Filippo Giunchedi: prometheus: override probe service address [puppet] - 10https://gerrit.wikimedia.org/r/762782 (https://phabricator.wikimedia.org/T291946) [10:53:24] (03PS1) 10Ladsgroup: es1024: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/762783 (https://phabricator.wikimedia.org/T300006) [10:53:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316 (T300381)', diff saved to https://phabricator.wikimedia.org/P20780 and previous config saved to /var/cache/conftool/dbconfig/20220215-105354-marostegui.json [10:53:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:54:01] T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381 [10:54:28] (ThanosRuleHighRuleEvaluationFailures) firing: (2) Thanos Rule is failing to evaluate rules. 
- https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/35da848f5f92b2dc612e0c3a0577b8a1/thanos-rule - https://alerts.wikimedia.org [10:55:43] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33796/console" [puppet] - 10https://gerrit.wikimedia.org/r/762782 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi) [10:57:14] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1150.eqiad.wmnet with reason: Maintenance [10:57:15] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1150.eqiad.wmnet with reason: Maintenance [10:57:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:57:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:57:25] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] es1024: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/762783 (https://phabricator.wikimedia.org/T300006) (owner: 10Ladsgroup) [10:59:28] (ThanosRuleHighRuleEvaluationFailures) resolved: (2) Thanos Rule is failing to evaluate rules. 
- https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/35da848f5f92b2dc612e0c3a0577b8a1/thanos-rule - https://alerts.wikimedia.org [11:00:03] ok running group0 now [11:00:13] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] P:openstack::galera: drop puppetmaster firewall rules [puppet] - 10https://gerrit.wikimedia.org/r/760643 (owner: 10Majavah) [11:00:54] (03PS1) 10Hashar: group0 wikis to 1.38.0-wmf.22 refs T300198 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/762787 [11:00:56] (03CR) 10Hashar: [C: 03+2] group0 wikis to 1.38.0-wmf.22 refs T300198 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/762787 (owner: 10Hashar) [11:01:01] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:sessionstore: Restarting to pick up Java security updates - hnowlan@cumin1001 [11:01:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:01:34] (03Merged) 10jenkins-bot: group0 wikis to 1.38.0-wmf.22 refs T300198 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/762787 (owner: 10Hashar) [11:01:49] (03CR) 10Ssingh: [C: 04-1] Add Wikidough's IPv6 anycast network in esams (032 comments) [homer/public] - 10https://gerrit.wikimedia.org/r/761364 (https://phabricator.wikimedia.org/T301165) (owner: 10Ssingh) [11:02:42] (03CR) 10Ssingh: [C: 04-1] "(The -1 is intentional since I plan to address the bgp6_out in another CR, as discussed above)" [homer/public] - 10https://gerrit.wikimedia.org/r/761364 (https://phabricator.wikimedia.org/T301165) (owner: 10Ssingh) [11:04:06] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on es1024.eqiad.wmnet with reason: Maintenance [11:04:07] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es1024.eqiad.wmnet with reason: Maintenance [11:04:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:04:12] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:04:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling es1024 (T300006)', diff saved to https://phabricator.wikimedia.org/P20781 and previous config saved to /var/cache/conftool/dbconfig/20220215-110420-ladsgroup.json [11:04:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:04:25] T300006: Upgrade es5 to Bullseye - https://phabricator.wikimedia.org/T300006 [11:05:21] RECOVERY - Check systemd state on prometheus1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:05:57] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [11:06:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:07:05] (03PS1) 10Ssingh: Add Wikidough's IPv6 anycast network in esams [homer/public] - 10https://gerrit.wikimedia.org/r/762788 (https://phabricator.wikimedia.org/T301165) [11:07:08] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [11:07:09] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [11:07:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:07:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:08:25] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [11:08:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:09:29] (03CR) 10Majavah: Add Wikidough's IPv6 anycast network in esams (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/761364 (https://phabricator.wikimedia.org/T301165) (owner: 10Ssingh) [11:10:36] somehow the php-fpm-restarts phase has been idling for 8 minutes :/ [11:10:41] !log hashar@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 
1.38.0-wmf.22 refs T300198 [11:10:45] ah [11:10:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:10:46] T300198: 1.38.0-wmf.22 deployment blockers - https://phabricator.wikimedia.org/T300198 [11:10:52] had to press ENTER... [11:14:17] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1145.eqiad.wmnet with reason: Maintenance [11:14:19] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1145.eqiad.wmnet with reason: Maintenance [11:14:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:14:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:14:27] RECOVERY - SSH on kubernetes1004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:14:46] (03PS1) 10Elukey: kserve-inference: add configuration for revscoring models [deployment-charts] - 10https://gerrit.wikimedia.org/r/762790 [11:14:48] (03PS1) 10Elukey: ml-services: move revscoring-editquality to the new config [deployment-charts] - 10https://gerrit.wikimedia.org/r/762791 [11:15:13] (03PS1) 10Marostegui: analytics_multiinstance.my.cnf: innodb_adaptive_hash_index=OFF [puppet] - 10https://gerrit.wikimedia.org/r/762792 (https://phabricator.wikimedia.org/T268869) [11:15:17] (03PS1) 10Volans: CHANGELOG: add changelogs for release v0.4.0 [software/homer] - 10https://gerrit.wikimedia.org/r/762793 [11:15:34] (03CR) 10Volans: [C: 03+2] CHANGELOG: add changelogs for release v0.4.0 [software/homer] - 10https://gerrit.wikimedia.org/r/762793 (owner: 10Volans) [11:17:21] (03CR) 10Marostegui: [C: 03+2] analytics_multiinstance.my.cnf: innodb_adaptive_hash_index=OFF [puppet] - 10https://gerrit.wikimedia.org/r/762792 (https://phabricator.wikimedia.org/T268869) (owner: 10Marostegui) [11:18:27] (03Merged) 10jenkins-bot: CHANGELOG: add changelogs for release v0.4.0 [software/homer] - 
10https://gerrit.wikimedia.org/r/762793 (owner: 10Volans) [11:19:43] (03PS2) 10Elukey: kserve-inference: add configuration for revscoring models [deployment-charts] - 10https://gerrit.wikimedia.org/r/762790 [11:19:45] (03PS2) 10Elukey: ml-services: move revscoring-editquality to the new config [deployment-charts] - 10https://gerrit.wikimedia.org/r/762791 [11:23:14] !log rolling out Java 8 security updates for buster [11:23:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:24:50] looks all quiet [11:31:10] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [11:31:11] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [11:31:13] (03PS1) 10Volans: CHANGELOG: add changelogs for release v2.0.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/762798 [11:31:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:31:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:31:34] (03CR) 10Volans: [C: 03+2] CHANGELOG: add changelogs for release v2.0.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/762798 (owner: 10Volans) [11:38:20] (03CR) 10Vgutierrez: [C: 04-1] "You can't use a CNAME record at zone apex." 
[dns] - 10https://gerrit.wikimedia.org/r/762075 (https://phabricator.wikimedia.org/T301592) (owner: 10Andrew Bogott) [11:40:37] (03CR) 10Ayounsi: [C: 03+1] Add Wikidough's IPv6 anycast network in esams [homer/public] - 10https://gerrit.wikimedia.org/r/762788 (https://phabricator.wikimedia.org/T301165) (owner: 10Ssingh) [11:40:56] (03Merged) 10jenkins-bot: CHANGELOG: add changelogs for release v2.0.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/762798 (owner: 10Volans) [11:42:48] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2104.codfw.wmnet with reason: Maintenance [11:42:49] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2104.codfw.wmnet with reason: Maintenance [11:42:51] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on 8 hosts with reason: Maintenance [11:42:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:42:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:42:57] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 8 hosts with reason: Maintenance [11:42:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:43:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:43:20] (03PS1) 10Volans: Upstream release v2.0.0 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/762800 [11:44:03] (03CR) 10Ssingh: [C: 03+2] Add Wikidough's IPv6 anycast network in esams [homer/public] - 10https://gerrit.wikimedia.org/r/762788 (https://phabricator.wikimedia.org/T301165) (owner: 10Ssingh) [11:44:35] (03Merged) 10jenkins-bot: Add Wikidough's IPv6 anycast network in esams [homer/public] - 10https://gerrit.wikimedia.org/r/762788 (https://phabricator.wikimedia.org/T301165) (owner: 10Ssingh) [11:45:45] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.reimage 
for host db2104.codfw.wmnet with OS bullseye [11:45:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:49:44] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1147.eqiad.wmnet with reason: Maintenance [11:49:45] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1147.eqiad.wmnet with reason: Maintenance [11:49:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:49:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:49:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1147 (T300381)', diff saved to https://phabricator.wikimedia.org/P20782 and previous config saved to /var/cache/conftool/dbconfig/20220215-114950-marostegui.json [11:49:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:49:55] T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381 [11:50:07] !log running homer for Gerrit 762788 and T301165 [11:50:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:50:12] T301165: Enable IPv6 for Wikidough - https://phabricator.wikimedia.org/T301165 [11:57:52] 10SRE, 10Infrastructure-Foundations, 10Traffic, 10netops, 10Patch-For-Review: Enable IPv6 for Wikidough - https://phabricator.wikimedia.org/T301165 (10cmooney) @ssingh I haven't had time to go through all of this and work it out, but some things seem clear enough. As per @Majavah's comments on the CR, i... 
[11:57:52] (03PS1) 10Volans: Upstream release v0.4.0 [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/762801 [11:58:23] (03CR) 10Volans: [C: 03+2] Upstream release v2.0.0 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/762800 (owner: 10Volans) [12:00:22] (03CR) 10Volans: [V: 03+2 C: 03+2] Upstream release v0.4.0 [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/762801 (owner: 10Volans) [12:01:36] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] P:wmcs::services::ntp: filter out self on peers list [puppet] - 10https://gerrit.wikimedia.org/r/761637 (owner: 10Majavah) [12:01:56] sukhe: are you done with homer? I'm about to deploy a new release and don't want to mess up your deploy ;) [12:02:20] volans: all done thank you :) [12:02:26] gl! [12:04:23] (03PS1) 10Vgutierrez: cache::envoy: Bound envoy to the same NUMA node as the main NIC [puppet] - 10https://gerrit.wikimedia.org/r/762802 (https://phabricator.wikimedia.org/T271421) [12:06:02] (03PS23) 10Jbond: reposync: add new class to manage syncing repositories [software/spicerack] - 10https://gerrit.wikimedia.org/r/747116 (https://phabricator.wikimedia.org/T229397) [12:07:42] (03PS1) 10Jelto: gitlab: add ferm rules and fix listen_addresses for test instance [puppet] - 10https://gerrit.wikimedia.org/r/762803 (https://phabricator.wikimedia.org/T297411) [12:07:51] (03CR) 10Jbond: "ready" [software/spicerack] - 10https://gerrit.wikimedia.org/r/747116 (https://phabricator.wikimedia.org/T229397) (owner: 10Jbond) [12:08:03] (03PS2) 10Vgutierrez: cache::envoy: Bound envoy to the same NUMA node as the main NIC [puppet] - 10https://gerrit.wikimedia.org/r/762802 (https://phabricator.wikimedia.org/T271421) [12:08:24] (03Merged) 10jenkins-bot: Upstream release v2.0.0 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/762800 (owner: 10Volans) [12:08:55] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (DIFF 1): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33798/console" [puppet] - 10https://gerrit.wikimedia.org/r/762802 (https://phabricator.wikimedia.org/T271421) (owner: 10Vgutierrez) [12:10:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147 (T300381)', diff saved to https://phabricator.wikimedia.org/P20783 and previous config saved to /var/cache/conftool/dbconfig/20220215-121028-marostegui.json [12:10:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:10:34] T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381 [12:10:45] (03CR) 10Jbond: [C: 03+2] "LGTM will merge thanks" [puppet] - 10https://gerrit.wikimedia.org/r/762518 (owner: 10Ebernhardson) [12:10:48] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33799/console" [puppet] - 10https://gerrit.wikimedia.org/r/762803 (https://phabricator.wikimedia.org/T297411) (owner: 10Jelto) [12:11:07] sukhe: thank you, sir! 
[12:11:17] (03CR) 10Jelto: gitlab: add ferm rules and fix listen_addresses for test instance [puppet] - 10https://gerrit.wikimedia.org/r/762803 (https://phabricator.wikimedia.org/T297411) (owner: 10Jelto) [12:12:00] (03PS3) 10Vgutierrez: cache::envoy: Bound envoy to the same NUMA node as the main NIC [puppet] - 10https://gerrit.wikimedia.org/r/762802 (https://phabricator.wikimedia.org/T271421) [12:12:03] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33800/console" [puppet] - 10https://gerrit.wikimedia.org/r/762803 (https://phabricator.wikimedia.org/T297411) (owner: 10Jelto) [12:15:04] (03PS2) 10Jelto: gitlab: add ferm rules and fix listen_addresses for test instance [puppet] - 10https://gerrit.wikimedia.org/r/762803 (https://phabricator.wikimedia.org/T297411) [12:15:38] (03CR) 10jerkins-bot: [V: 04-1] reposync: add new class to manage syncing repositories [software/spicerack] - 10https://gerrit.wikimedia.org/r/747116 (https://phabricator.wikimedia.org/T229397) (owner: 10Jbond) [12:16:40] 10SRE, 10Infrastructure-Foundations, 10Traffic, 10netops, 10Patch-For-Review: Enable IPv6 for Wikidough - https://phabricator.wikimedia.org/T301165 (10ayounsi) Puppet will automatically filter v4 from b6 neighbors and should do the right thing when v4 and v6 IPs are mixed in `neighbors_list` https://git... 
[12:17:45] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2104.codfw.wmnet with OS bullseye [12:17:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:18:54] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33801/console" [puppet] - 10https://gerrit.wikimedia.org/r/762803 (https://phabricator.wikimedia.org/T297411) (owner: 10Jelto) [12:25:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147', diff saved to https://phabricator.wikimedia.org/P20784 and previous config saved to /var/cache/conftool/dbconfig/20220215-122533-marostegui.json [12:25:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:32:20] (03PS1) 10Ladsgroup: db1170: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/762804 (https://phabricator.wikimedia.org/T300510) [12:32:38] !log Modifying anycast_import policy on cr1-eqiad to validate / prep for changes to support wikidough IPv6. 
[12:32:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:37:32] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] db1170: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/762804 (https://phabricator.wikimedia.org/T300510) (owner: 10Ladsgroup) [12:40:29] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1170.eqiad.wmnet with reason: Maintenance [12:40:30] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1170.eqiad.wmnet with reason: Maintenance [12:40:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:40:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1170:3312 (T300510)', diff saved to https://phabricator.wikimedia.org/P20785 and previous config saved to /var/cache/conftool/dbconfig/20220215-124035-ladsgroup.json [12:40:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:40:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:40:41] T300510: Upgrade s2 to Bullseye - https://phabricator.wikimedia.org/T300510 [12:40:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147', diff saved to https://phabricator.wikimedia.org/P20786 and previous config saved to /var/cache/conftool/dbconfig/20220215-124043-marostegui.json [12:40:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:42:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1170:3317 (T300510)', diff saved to https://phabricator.wikimedia.org/P20787 and previous config saved to /var/cache/conftool/dbconfig/20220215-124207-ladsgroup.json [12:42:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:42:24] (03PS1) 10Kormat: auto_schema: Use wmflib to detect screen/tmux/etc. 
[software] - 10https://gerrit.wikimedia.org/r/762805 [12:42:55] (03CR) 10jerkins-bot: [V: 04-1] auto_schema: Use wmflib to detect screen/tmux/etc. [software] - 10https://gerrit.wikimedia.org/r/762805 (owner: 10Kormat) [12:43:55] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.reimage for host db1170.eqiad.wmnet with OS bullseye [12:43:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:44:59] (03PS2) 10Kormat: auto_schema: Use wmflib to detect screen/tmux/etc. [software] - 10https://gerrit.wikimedia.org/r/762805 [12:46:11] (03CR) 10Ladsgroup: [C: 03+1] auto_schema: Use wmflib to detect screen/tmux/etc. [software] - 10https://gerrit.wikimedia.org/r/762805 (owner: 10Kormat) [12:46:57] !log marostegui@cumin1001 START - Cookbook sre.hosts.provision for host es1024.mgmt.eqiad.wmnet with reboot policy GRACEFUL [12:47:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:47:51] !log ryankemper@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts elastic2035.codfw.wmnet [12:47:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:49:05] (03CR) 10Kormat: [C: 03+2] auto_schema: Use wmflib to detect screen/tmux/etc. [software] - 10https://gerrit.wikimedia.org/r/762805 (owner: 10Kormat) [12:49:10] 10SRE, 10ops-codfw, 10Discovery-Search (Current work), 10Patch-For-Review: Degraded RAID on elastic2035 - https://phabricator.wikimedia.org/T298853 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by ryankemper@cumin1001 for hosts: `elastic2035.codfw.wmnet` - elastic2035.codfw.wmnet (**PASS... [12:49:38] (03Merged) 10jenkins-bot: auto_schema: Use wmflib to detect screen/tmux/etc. 
[software] - 10https://gerrit.wikimedia.org/r/762805 (owner: 10Kormat) [12:50:28] 10SRE, 10ops-codfw, 10Discovery-Search (Current work), 10Patch-For-Review: Degraded RAID on elastic2035 - https://phabricator.wikimedia.org/T298853 (10Volans) The `sre.hosts.decommission` cookbook was left hanging at the homer step waiting for confirmation since Feb. 4th. I noticed because I was deploying... [12:51:07] !log uploaded spicerack_2.0.0 to apt.wikimedia.org buster-wikimedia,bullseye-wikimedia [12:51:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:52:54] !log volans@deploy1002 Started deploy [homer/deploy@94bed87]: Release v0.4.0 [12:52:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:53:35] (03PS2) 10Kormat: Add 2022/drop_fr_img_star_cols_T300774.py [software/schema-changes] - 10https://gerrit.wikimedia.org/r/762778 (https://phabricator.wikimedia.org/T300774) [12:53:42] (03PS1) 10Elukey: base::kernel: explicitly load overlay when overlayfs is true [puppet] - 10https://gerrit.wikimedia.org/r/762806 (https://phabricator.wikimedia.org/T300744) [12:53:44] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host es1024.mgmt.eqiad.wmnet with reboot policy GRACEFUL [12:53:44] (03CR) 10Kormat: Add 2022/drop_fr_img_star_cols_T300774.py (031 comment) [software/schema-changes] - 10https://gerrit.wikimedia.org/r/762778 (https://phabricator.wikimedia.org/T300774) (owner: 10Kormat) [12:53:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:53:57] (03CR) 10jerkins-bot: [V: 04-1] Add 2022/drop_fr_img_star_cols_T300774.py [software/schema-changes] - 10https://gerrit.wikimedia.org/r/762778 (https://phabricator.wikimedia.org/T300774) (owner: 10Kormat) [12:54:15] (03PS3) 10Kormat: Add 2022/drop_fr_img_star_cols_T300774.py [software/schema-changes] - 10https://gerrit.wikimedia.org/r/762778 (https://phabricator.wikimedia.org/T300774) [12:54:22] !log 
volans@deploy1002 Finished deploy [homer/deploy@94bed87]: Release v0.4.0 (duration: 01m 28s) [12:54:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:55:31] (03PS1) 10Cathal Mooney: Adjust CR Internal Anycast BGP Templates [homer/public] - 10https://gerrit.wikimedia.org/r/762807 (https://phabricator.wikimedia.org/T301165) [12:55:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147 (T300381)', diff saved to https://phabricator.wikimedia.org/P20788 and previous config saved to /var/cache/conftool/dbconfig/20220215-125548-marostegui.json [12:55:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:55:54] T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381 [12:55:54] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2110.codfw.wmnet with reason: Maintenance [12:55:56] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2110.codfw.wmnet with reason: Maintenance [12:55:57] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 12 hosts with reason: Maintenance [12:55:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:56:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:56:01] (03CR) 10jerkins-bot: [V: 04-1] Adjust CR Internal Anycast BGP Templates [homer/public] - 10https://gerrit.wikimedia.org/r/762807 (https://phabricator.wikimedia.org/T301165) (owner: 10Cathal Mooney) [12:56:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:56:07] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 12 hosts with reason: Maintenance [12:56:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:56:28] (03PS1) 10Kormat: auto_schema: Remove debugging output file. 
[software] - 10https://gerrit.wikimedia.org/r/762808 [12:56:54] (03CR) 10jerkins-bot: [V: 04-1] auto_schema: Remove debugging output file. [software] - 10https://gerrit.wikimedia.org/r/762808 (owner: 10Kormat) [12:57:51] !log volans@cumin2002 START - Cookbook sre.deploy.python-code homer to cumin2002.codfw.wmnet with reason: Release v0.4.0 - volans@cumin2002 [12:57:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:58:01] (03PS2) 10Kormat: auto_schema: Remove debugging output file. [software] - 10https://gerrit.wikimedia.org/r/762808 [12:58:41] !log volans@cumin2002 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) homer to cumin2002.codfw.wmnet with reason: Release v0.4.0 - volans@cumin2002 [12:58:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:35] (03CR) 10Ladsgroup: [C: 03+2] auto_schema: Remove debugging output file. [software] - 10https://gerrit.wikimedia.org/r/762808 (owner: 10Kormat) [13:00:41] !log filippo@puppetmaster1001 conftool action : set/weight=10; selector: name=prometheus1006.eqiad.wmnet [13:00:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:50] (03CR) 10Ladsgroup: [C: 03+1] Add 2022/drop_fr_img_star_cols_T300774.py [software/schema-changes] - 10https://gerrit.wikimedia.org/r/762778 (https://phabricator.wikimedia.org/T300774) (owner: 10Kormat) [13:00:52] !log filippo@puppetmaster1001 conftool action : set/weight=10; selector: name=prometheus2006.codfw.wmnet [13:00:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:01:03] (03Merged) 10jenkins-bot: auto_schema: Remove debugging output file. 
[software] - 10https://gerrit.wikimedia.org/r/762808 (owner: 10Kormat) [13:01:24] !log filippo@puppetmaster1001 conftool action : set/pooled=yes; selector: name=prometheus2006.codfw.wmnet [13:01:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:01:29] !log filippo@puppetmaster1001 conftool action : set/pooled=yes; selector: name=prometheus1006.eqiad.wmnet [13:01:31] (03PS24) 10Jbond: reposync: add new class to manage syncing repositories [software/spicerack] - 10https://gerrit.wikimedia.org/r/747116 (https://phabricator.wikimedia.org/T229397) [13:01:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:03:02] (03CR) 10Filippo Giunchedi: [V: 03+1 C: 03+2] prometheus: override probe service address [puppet] - 10https://gerrit.wikimedia.org/r/762782 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi) [13:03:34] (03CR) 10Kormat: [C: 03+2] Add 2022/drop_fr_img_star_cols_T300774.py [software/schema-changes] - 10https://gerrit.wikimedia.org/r/762778 (https://phabricator.wikimedia.org/T300774) (owner: 10Kormat) [13:03:57] (03Merged) 10jenkins-bot: Add 2022/drop_fr_img_star_cols_T300774.py [software/schema-changes] - 10https://gerrit.wikimedia.org/r/762778 (https://phabricator.wikimedia.org/T300774) (owner: 10Kormat) [13:05:28] (03PS1) 10Volans: homer: daily check only active devices in Netbox [puppet] - 10https://gerrit.wikimedia.org/r/762810 [13:07:20] (03PS1) 10Arturo Borrero Gonzalez: toolforge: automated-tests: introduce some TJF checks [puppet] - 10https://gerrit.wikimedia.org/r/762811 (https://phabricator.wikimedia.org/T277653) [13:07:56] (03PS2) 10Arturo Borrero Gonzalez: toolforge: automated-tests: introduce some TJF checks [puppet] - 10https://gerrit.wikimedia.org/r/762811 (https://phabricator.wikimedia.org/T277653) [13:08:11] 10SRE, 10Infrastructure-Foundations, 10Traffic, 10netops, 10Patch-For-Review: Enable IPv6 for Wikidough - https://phabricator.wikimedia.org/T301165 
(10cmooney) Thanks @ayounsi makes sense. I wonder what will happen in this case, with the input being two IPv4 addresses? Will it interpret neighbors_list... [13:11:50] (03PS2) 10Cathal Mooney: Adjust CR Internal Anycast BGP Templates [homer/public] - 10https://gerrit.wikimedia.org/r/762807 (https://phabricator.wikimedia.org/T301165) [13:12:19] (03CR) 10jerkins-bot: [V: 04-1] Adjust CR Internal Anycast BGP Templates [homer/public] - 10https://gerrit.wikimedia.org/r/762807 (https://phabricator.wikimedia.org/T301165) (owner: 10Cathal Mooney) [13:14:13] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1170.eqiad.wmnet with OS bullseye [13:14:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:21] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1142.eqiad.wmnet with reason: Maintenance [13:14:23] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1142.eqiad.wmnet with reason: Maintenance [13:14:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1142 (T300381)', diff saved to https://phabricator.wikimedia.org/P20789 and previous config saved to /var/cache/conftool/dbconfig/20220215-131427-marostegui.json [13:14:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:32] T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381 [13:14:47] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2117.codfw.wmnet with reason: Maintenance [13:14:49] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2117.codfw.wmnet with reason: Maintenance [13:14:51] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:31] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2117.codfw.wmnet with reason: Maintenance [13:15:33] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2117.codfw.wmnet with reason: Maintenance [13:15:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:16:16] (03PS3) 10Cathal Mooney: Adjust CR Internal Anycast BGP Templates [homer/public] - 10https://gerrit.wikimedia.org/r/762807 (https://phabricator.wikimedia.org/T301165) [13:16:39] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] toolforge: automated-tests: introduce some TJF checks [puppet] - 10https://gerrit.wikimedia.org/r/762811 (https://phabricator.wikimedia.org/T277653) (owner: 10Arturo Borrero Gonzalez) [13:16:48] (03CR) 10jerkins-bot: [V: 04-1] Adjust CR Internal Anycast BGP Templates [homer/public] - 10https://gerrit.wikimedia.org/r/762807 (https://phabricator.wikimedia.org/T301165) (owner: 10Cathal Mooney) [13:16:56] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2117.codfw.wmnet with reason: Maintenance [13:16:57] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2117.codfw.wmnet with reason: Maintenance [13:16:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:17:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:17:44] (03PS1) 10Kormat: 2022/T300774: Fix logic in check() [software/schema-changes] - 10https://gerrit.wikimedia.org/r/762815 [13:18:09] (03PS4) 10Cathal Mooney: Adjust CR Internal Anycast BGP Templates [homer/public] - 10https://gerrit.wikimedia.org/r/762807 (https://phabricator.wikimedia.org/T301165) [13:18:31] (03PS2) 10Kormat: 
2022/T300774: Fix logic in check() [software/schema-changes] - 10https://gerrit.wikimedia.org/r/762815 (https://phabricator.wikimedia.org/T300774) [13:19:01] (03CR) 10jerkins-bot: [V: 04-1] Adjust CR Internal Anycast BGP Templates [homer/public] - 10https://gerrit.wikimedia.org/r/762807 (https://phabricator.wikimedia.org/T301165) (owner: 10Cathal Mooney) [13:19:10] (03CR) 10Kormat: [C: 03+2] 2022/T300774: Fix logic in check() [software/schema-changes] - 10https://gerrit.wikimedia.org/r/762815 (https://phabricator.wikimedia.org/T300774) (owner: 10Kormat) [13:19:32] (03Merged) 10jenkins-bot: 2022/T300774: Fix logic in check() [software/schema-changes] - 10https://gerrit.wikimedia.org/r/762815 (https://phabricator.wikimedia.org/T300774) (owner: 10Kormat) [13:20:25] (03CR) 10Muehlenhoff: "The approach looks fine, but let's rather use kmod::module?" [puppet] - 10https://gerrit.wikimedia.org/r/762806 (https://phabricator.wikimedia.org/T300744) (owner: 10Elukey) [13:21:15] (03PS5) 10Cathal Mooney: Adjust CR Internal Anycast BGP Templates [homer/public] - 10https://gerrit.wikimedia.org/r/762807 (https://phabricator.wikimedia.org/T301165) [13:21:41] (03PS2) 10Elukey: base::kernel: explicitly load overlay when overlayfs is true [puppet] - 10https://gerrit.wikimedia.org/r/762806 (https://phabricator.wikimedia.org/T300744) [13:22:03] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33802/console" [puppet] - 10https://gerrit.wikimedia.org/r/762802 (https://phabricator.wikimedia.org/T271421) (owner: 10Vgutierrez) [13:22:08] (03CR) 10Elukey: base::kernel: explicitly load overlay when overlayfs is true (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/762806 (https://phabricator.wikimedia.org/T300744) (owner: 10Elukey) [13:22:20] (03CR) 10jerkins-bot: [V: 04-1] Adjust CR Internal Anycast BGP Templates [homer/public] - 10https://gerrit.wikimedia.org/r/762807 
(https://phabricator.wikimedia.org/T301165) (owner: 10Cathal Mooney) [13:24:22] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/762806 (https://phabricator.wikimedia.org/T300744) (owner: 10Elukey) [13:24:49] (03PS6) 10Cathal Mooney: Adjust CR Internal Anycast BGP Templates [homer/public] - 10https://gerrit.wikimedia.org/r/762807 (https://phabricator.wikimedia.org/T301165) [13:25:00] !log disable puppet on cache::(text|upload)_envoy nodes [13:25:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:19] (03CR) 10jerkins-bot: [V: 04-1] Adjust CR Internal Anycast BGP Templates [homer/public] - 10https://gerrit.wikimedia.org/r/762807 (https://phabricator.wikimedia.org/T301165) (owner: 10Cathal Mooney) [13:26:55] (03PS7) 10Cathal Mooney: Adjust CR Internal Anycast BGP Templates [homer/public] - 10https://gerrit.wikimedia.org/r/762807 (https://phabricator.wikimedia.org/T301165) [13:28:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312 (T300510)', diff saved to https://phabricator.wikimedia.org/P20790 and previous config saved to /var/cache/conftool/dbconfig/20220215-132857-ladsgroup.json [13:28:58] (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] cache::envoy: Bound envoy to the same NUMA node as the main NIC [puppet] - 10https://gerrit.wikimedia.org/r/762802 (https://phabricator.wikimedia.org/T271421) (owner: 10Vgutierrez) [13:29:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:29:03] T300510: Upgrade s2 to Bullseye - https://phabricator.wikimedia.org/T300510 [13:31:14] !log installing lxml security updates [13:31:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:33:09] !log enable puppet on cache::(text|upload)_envoy nodes [13:33:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:33:54] !log rolling restart of envoy on cp nodes [13:33:55] !log marostegui@cumin1001 dbctl 
commit (dc=all): 'Repooling after maintenance db1142 (T300381)', diff saved to https://phabricator.wikimedia.org/P20791 and previous config saved to /var/cache/conftool/dbconfig/20220215-133354-marostegui.json [13:33:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:02] T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381 [13:35:48] (03PS1) 10Kormat: 2002/T300774: Make command idempotent. [software/schema-changes] - 10https://gerrit.wikimedia.org/r/762818 (https://phabricator.wikimedia.org/T300774) [13:40:43] (03CR) 10Ladsgroup: 2002/T300774: Make command idempotent. (031 comment) [software/schema-changes] - 10https://gerrit.wikimedia.org/r/762818 (https://phabricator.wikimedia.org/T300774) (owner: 10Kormat) [13:42:05] (03PS2) 10Kormat: 2022/T300774: Make command idempotent. [software/schema-changes] - 10https://gerrit.wikimedia.org/r/762818 (https://phabricator.wikimedia.org/T300774) [13:42:23] (03CR) 10Kormat: 2022/T300774: Make command idempotent. (031 comment) [software/schema-changes] - 10https://gerrit.wikimedia.org/r/762818 (https://phabricator.wikimedia.org/T300774) (owner: 10Kormat) [13:44:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312', diff saved to https://phabricator.wikimedia.org/P20792 and previous config saved to /var/cache/conftool/dbconfig/20220215-134402-ladsgroup.json [13:44:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:44:06] (03CR) 10Ladsgroup: [C: 03+2] 2022/T300774: Make command idempotent. [software/schema-changes] - 10https://gerrit.wikimedia.org/r/762818 (https://phabricator.wikimedia.org/T300774) (owner: 10Kormat) [13:44:30] (03Merged) 10jenkins-bot: 2022/T300774: Make command idempotent. 
[software/schema-changes] - 10https://gerrit.wikimedia.org/r/762818 (https://phabricator.wikimedia.org/T300774) (owner: 10Kormat) [13:48:42] 10SRE, 10Data-Engineering, 10Infrastructure-Foundations, 10Product-Analytics, and 2 others: Maybe restrict domains accessible by webproxy - https://phabricator.wikimedia.org/T300977 (10Ottomata) @Joe I think most of the usages of webproxy that folks here are concerned with aren't by production services. T... [13:49:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142', diff saved to https://phabricator.wikimedia.org/P20793 and previous config saved to /var/cache/conftool/dbconfig/20220215-134859-marostegui.json [13:49:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:59:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312', diff saved to https://phabricator.wikimedia.org/P20794 and previous config saved to /var/cache/conftool/dbconfig/20220215-135907-ladsgroup.json [13:59:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:59:17] (03CR) 10Cathal Mooney: "Comment inline. This will be required, but format needs to be changed to a dict structure as shown in my comment." [homer/public] - 10https://gerrit.wikimedia.org/r/761364 (https://phabricator.wikimedia.org/T301165) (owner: 10Ssingh) [14:00:04] Lucas_WMDE and Urbanecm: Dear deployers, time to do the UTC afternoon backport window deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220215T1400). [14:00:04] No Gerrit patches in the queue for this window AFAICS. [14:00:30] (03PS1) 104nn1l2: InitialiseSettings: General cleanup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/762819 (https://phabricator.wikimedia.org/T301647) [14:00:31] o/ [14:00:35] Indeed, nothing to do. [14:00:43] Hello Lucas_WMDE :-) [14:00:44] looks like nn1l2 is about to add a patch? 
[14:00:47] hi :) [14:00:53] 10SRE, 10ops-codfw, 10DC-Ops: Installation issues on PowerEdge R440 Restbase servers with buster / firmware update needed - https://phabricator.wikimedia.org/T299652 (10Papaul) 05Open→03Resolved [14:01:11] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2129.codfw.wmnet with reason: Maintenance [14:01:12] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2129.codfw.wmnet with reason: Maintenance [14:01:13] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 8 hosts with reason: Maintenance [14:01:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:01:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:01:20] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 8 hosts with reason: Maintenance [14:01:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:01:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:37] !log volans@cumin2002 START - Cookbook sre.hosts.test-cookbook testing new spicerack release [14:02:38] !log volans@cumin2002 START - Cookbook sre.hosts.downtime for 0:05:00 on cumin2002.codfw.wmnet with reason: testing new spicerack [14:02:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:41] !log volans@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:05:00 on cumin2002.codfw.wmnet with reason: testing new spicerack [14:02:41] !log volans@cumin2002 END (PASS) - Cookbook sre.hosts.test-cookbook (exit_code=0) testing new spicerack release [14:02:41] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 4 days, 0:00:00 on ganeti1022.eqiad.wmnet with reason: Remove from Ganeti cluster for reimage [14:02:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:44] !log jmm@cumin2002 END 
(PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4 days, 0:00:00 on ganeti1022.eqiad.wmnet with reason: Remove from Ganeti cluster for reimage [14:02:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:55] 10SRE, 10ops-codfw, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install contint2002, gerrit2002 - https://phabricator.wikimedia.org/T299575 (10Papaul) a:05Jclark-ctr→03Papaul [14:02:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:28] Please don't close B&C window [14:03:40] 10SRE, 10ops-eqiad: Installation issues on PowerEdge R440 eqiad Ganeti servers with buster / firmware update needed - https://phabricator.wikimedia.org/T299527 (10MoritzMuehlenhoff) [14:03:50] 10SRE, 10ops-eqiad: Installation issues on PowerEdge R440 eqiad Ganeti servers with buster / firmware update needed - https://phabricator.wikimedia.org/T299527 (10MoritzMuehlenhoff) One more server is ready and downtimed; ganeti1022 [14:04:02] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1123.eqiad.wmnet with reason: Maintenance [14:04:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142', diff saved to https://phabricator.wikimedia.org/P20795 and previous config saved to /var/cache/conftool/dbconfig/20220215-140404-marostegui.json [14:04:04] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1123.eqiad.wmnet with reason: Maintenance [14:04:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1123 (T300775)', diff saved to 
https://phabricator.wikimedia.org/P20796 and previous config saved to /var/cache/conftool/dbconfig/20220215-140408-marostegui.json [14:04:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:18] T300775: Add tl_target_id column to templatelinks - https://phabricator.wikimedia.org/T300775 [14:04:18] 10SRE, 10Data-Engineering, 10Infrastructure-Foundations, 10Product-Analytics, and 2 others: Maybe restrict domains accessible by webproxy - https://phabricator.wikimedia.org/T300977 (10akosiaris) >>! In T300977#7710861, @Ottomata wrote: > @Joe I think most of the usages of webproxy that folks here are conc... [14:05:34] I added one patch although a bit late: https://wikitech.wikimedia.org/w/index.php?title=Deployments&diff=1949266&oldid=1949049 [14:06:06] !log deployed spicerack v2.0.0 on cumin hosts [14:06:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:07:04] jouncebot: now [14:07:04] For the next 0 hour(s) and 52 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220215T1400) [14:07:20] !log removing java packages from maps2005 [14:07:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:07:51] Lucas_WMDE: you here? [14:07:59] yes [14:08:13] Did you close B&C? 
[14:08:29] PROBLEM - SSH on analytics1063.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:08:34] I submitted a patch although a bit late [14:08:41] it’s not closed [14:08:46] but idk if I’ll be able to review it in time [14:08:58] (03CR) 10Jelto: [V: 03+1 C: 03+2] gitlab: add ferm rules and fix listen_addresses for test instance [puppet] - 10https://gerrit.wikimedia.org/r/762803 (https://phabricator.wikimedia.org/T297411) (owner: 10Jelto) [14:09:10] okay, I'm around for the next one hour [14:09:26] If you could review it, please ping me. [14:10:11] ok, I’m taking a look now [14:10:38] 10SRE-swift-storage, 10MW-on-K8s, 10Shellbox, 10serviceops, and 2 others: Support large files in Shellbox - https://phabricator.wikimedia.org/T292322 (10akosiaris) >>! In T292322#7710140, @Joe wrote: > After more digging: I have no idea why envoy would report the upstream time spent as 2 seconds, when it r... 
[14:14:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312 (T300510)', diff saved to https://phabricator.wikimedia.org/P20797 and previous config saved to /var/cache/conftool/dbconfig/20220215-141411-ladsgroup.json [14:14:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:18] T300510: Upgrade s2 to Bullseye - https://phabricator.wikimedia.org/T300510 [14:18:49] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] "LGTM, diffConfig says no effective change" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/762819 (https://phabricator.wikimedia.org/T301647) (owner: 104nn1l2) [14:19:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142 (T300381)', diff saved to https://phabricator.wikimedia.org/P20798 and previous config saved to /var/cache/conftool/dbconfig/20220215-141908-marostegui.json [14:19:10] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1141.eqiad.wmnet with reason: Maintenance [14:19:12] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1141.eqiad.wmnet with reason: Maintenance [14:19:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:14] T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381 [14:19:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1141 (T300381)', diff saved to https://phabricator.wikimedia.org/P20799 and previous config saved to /var/cache/conftool/dbconfig/20220215-141916-marostegui.json [14:19:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:31] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] InitialiseSettings: General cleanup [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/762819 (https://phabricator.wikimedia.org/T301647) (owner: 104nn1l2) [14:20:55] (03Merged) 10jenkins-bot: InitialiseSettings: General cleanup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/762819 (https://phabricator.wikimedia.org/T301647) (owner: 104nn1l2) [14:21:15] (03PS1) 10Filippo Giunchedi: pontoon: set permissions on auto.yaml at bootstrap [puppet] - 10https://gerrit.wikimedia.org/r/762822 [14:21:47] nn1l2: the change is on mwdebug1001, let’s test it a bit [14:21:56] thanks [14:23:59] !log filippo@cumin1001 START - Cookbook sre.hosts.decommission for hosts prometheus1004.eqiad.wmnet [14:24:01] 10SRE, 10Discovery, 10Infrastructure-Foundations, 10netops: Speed up network connections for Elastic hosts - https://phabricator.wikimedia.org/T301577 (10bking) Hello Cathal, Ryan and I are in training this week, but I would love to meet with you (maybe next week) and talk specifics about network (or any... [14:24:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:24:11] !log filippo@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=99) for hosts prometheus1004.eqiad.wmnet [14:24:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:24:52] LGTM. 
nothing seems broken in Farsi (my native language) projects [14:25:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T300510)', diff saved to https://phabricator.wikimedia.org/P20800 and previous config saved to /var/cache/conftool/dbconfig/20220215-142511-ladsgroup.json [14:25:14] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [14:25:15] frwiki also looks fine [14:25:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:25:17] T300510: Upgrade s2 to Bullseye - https://phabricator.wikimedia.org/T300510 [14:25:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:26:22] syncing [14:28:33] !log installing clamav security updates on otrs1001 / ticket.wikimedia.org [14:28:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:28:58] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:762819|InitialiseSettings: General cleanup (T301647)]] (wgAddGroups F-I) (duration: 02m 41s) [14:29:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:29:03] T301647: Clean up InitialiseSettings - https://phabricator.wikimedia.org/T301647 [14:29:48] nn1l2: I added (wgAddGroups F-I) to the log message there – I think it would be a good idea to include something like that in the commit message subject line as well [14:30:01] !log UTC afternoon backport window done [14:30:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:10] thanks, okay [14:31:00] (03CR) 10Filippo Giunchedi: [C: 03+2] pontoon: set permissions on auto.yaml at bootstrap [puppet] - 10https://gerrit.wikimedia.org/r/762822 (owner: 10Filippo Giunchedi) [14:31:35] (03PS1) 10Jelto: hiera::role::common::idp add gitlab-replica to CAS-SSO [puppet] - 10https://gerrit.wikimedia.org/r/762823 (https://phabricator.wikimedia.org/T297411) 
[14:32:27] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [14:32:28] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [14:32:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:17] (03CR) 10Elukey: [C: 03+2] base::kernel: explicitly load overlay when overlayfs is true [puppet] - 10https://gerrit.wikimedia.org/r/762806 (https://phabricator.wikimedia.org/T300744) (owner: 10Elukey) [14:35:57] (03CR) 10Jelto: "We created a new, puppet-managed GitLab test instance. The goal is to keep it as close as possible to production configuration. So SSO log" [puppet] - 10https://gerrit.wikimedia.org/r/762823 (https://phabricator.wikimedia.org/T297411) (owner: 10Jelto) [14:37:17] !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host ml-serve2005.codfw.wmnet with OS bullseye [14:37:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:33] PROBLEM - Puppet CA expired certs on puppetmaster1001 is CRITICAL: CRITICAL: 1 puppet certs need to be renewed: https://wikitech.wikimedia.org/wiki/Puppet%23Renew_agent_certificate [14:38:28] 10SRE, 10Infrastructure-Foundations, 10Traffic, 10netops, 10Patch-For-Review: Enable IPv6 for Wikidough - https://phabricator.wikimedia.org/T301165 (10ssingh) >>! In T301165#7710616, @ayounsi wrote: > Puppet will automatically filter v4 from v6 neighbors and should do the right thing when v4 and v6 IPs a... 
[14:38:46] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [14:38:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:30] (03PS1) 10Volans: Import ArgparseFormatter from spicerack [cookbooks] - 10https://gerrit.wikimedia.org/r/762824 [14:39:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141 (T300381)', diff saved to https://phabricator.wikimedia.org/P20803 and previous config saved to /var/cache/conftool/dbconfig/20220215-143934-marostegui.json [14:39:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:39] T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381 [14:40:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P20804 and previous config saved to /var/cache/conftool/dbconfig/20220215-144016-ladsgroup.json [14:40:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:48] !log removing java packages from all maps hosts [14:40:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:41:39] (03CR) 10Ssingh: [C: 04-1] "I think we should abandon this commit since we have added code in other places; I5380e3 and I9988a." 
[homer/public] - 10https://gerrit.wikimedia.org/r/761364 (https://phabricator.wikimedia.org/T301165) (owner: 10Ssingh) [14:41:58] (03PS1) 10Filippo Giunchedi: Decom prometheus[12]00[34] [puppet] - 10https://gerrit.wikimedia.org/r/762825 (https://phabricator.wikimedia.org/T296199) [14:42:58] seeking kind souls to sanity check ^ [14:43:03] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv6: Active - kubernetes-ml-codfw, AS64607/IPv4: Active - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:44:27] (KubernetesCalicoDown) firing: ml-serve2005.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org [14:44:41] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv6: Active - kubernetes-ml-codfw, AS64607/IPv4: Active - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:46:05] (JobUnavailable) firing: (2) Reduced availability for job etherpad in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [14:46:41] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/762825 (https://phabricator.wikimedia.org/T296199) (owner: 10Filippo Giunchedi) [14:47:04] thank you moritzm, appreciate it [14:47:14] * godog scribbles down +1 beer [14:47:21] (03PS1) 10Volans: Import ArgparseFormatter from spicerack [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/762826 [14:47:31] (03CR) 10Filippo Giunchedi: [C: 03+2] Decom prometheus[12]00[34] [puppet] - 10https://gerrit.wikimedia.org/r/762825 (https://phabricator.wikimedia.org/T296199) (owner: 10Filippo Giunchedi) [14:49:07] 10SRE, 10Infrastructure-Foundations, 10Traffic, 10netops, 10Patch-For-Review: Enable IPv6 for Wikidough - https://phabricator.wikimedia.org/T301165 (10cmooney) 
> Do we anticipate other for uses of the anycast service? I don't think we would rule out introducing new services based on it, but right now I... [14:50:45] 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Michael.hay - https://phabricator.wikimedia.org/T301782 (10Michael.Hay) [14:50:55] !log filippo@cumin1001 START - Cookbook sre.hosts.decommission for hosts prometheus1004.eqiad.wmnet [14:50:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141', diff saved to https://phabricator.wikimedia.org/P20805 and previous config saved to /var/cache/conftool/dbconfig/20220215-145438-marostegui.json [14:54:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:20] (03PS1) 10Filippo Giunchedi: cr: remove prometheus[12]00[34] from ACLs [homer/public] - 10https://gerrit.wikimedia.org/r/762827 (https://phabricator.wikimedia.org/T296199) [14:55:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P20806 and previous config saved to /var/cache/conftool/dbconfig/20220215-145521-ladsgroup.json [14:55:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:15] !log hnowlan@cumin1001 START - Cookbook sre.cassandra.roll-restart for nodes matching A:restbase-codfw: Restarting to pick up Java security updates - hnowlan@cumin1001 [14:56:16] (03PS1) 10Volans: sre.hosts.reimage: convert call to downtime [cookbooks] - 10https://gerrit.wikimedia.org/r/762828 [14:56:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:59:27] (KubernetesCalicoDown) resolved: ml-serve2005.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org [14:59:57] (KubernetesCalicoDown) firing: 
ml-serve2005.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org [14:59:58] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/data/css/mobile/pagelib (Get CSS bundle from wikimedia-page-library) timed out before a response was received: /{domain}/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [15:00:54] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [15:04:57] (KubernetesCalicoDown) resolved: ml-serve2005.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org [15:05:11] (KubernetesCalicoDown) firing: ml-serve2005.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org [15:06:54] this is me reimaging, sorry for the noise [15:07:05] (03PS1) 10Majavah: remove clush modules, profiles and roles [puppet] - 10https://gerrit.wikimedia.org/r/762829 (https://phabricator.wikimedia.org/T298191) [15:07:17] (03CR) 10Ssingh: Adjust CR Internal Anycast BGP Templates (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/762807 (https://phabricator.wikimedia.org/T301165) (owner: 10Cathal Mooney) [15:07:56] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 88, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:09:32] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ml-serve2005.codfw.wmnet with OS bullseye [15:09:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:09:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141', 
diff saved to https://phabricator.wikimedia.org/P20807 and previous config saved to /var/cache/conftool/dbconfig/20220215-150943-marostegui.json [15:09:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:11] (KubernetesCalicoDown) resolved: ml-serve2005.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org [15:10:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T300510)', diff saved to https://phabricator.wikimedia.org/P20808 and previous config saved to /var/cache/conftool/dbconfig/20220215-151026-ladsgroup.json [15:10:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:32] T300510: Upgrade s2 to Bullseye - https://phabricator.wikimedia.org/T300510 [15:11:34] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts prometheus1004.eqiad.wmnet [15:11:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:11:46] PROBLEM - Check systemd state on kubernetes2009 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:13:59] (03CR) 10Jbond: [C: 04-1] "see inline for comments" [puppet] - 10https://gerrit.wikimedia.org/r/762823 (https://phabricator.wikimedia.org/T297411) (owner: 10Jelto) [15:16:51] (03CR) 10Jbond: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/762824 (owner: 10Volans) [15:17:12] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.reimage for host es1024.eqiad.wmnet with OS bullseye [15:17:13] (03CR) 10Majavah: hiera::role::common::idp add gitlab-replica to CAS-SSO (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/762823 (https://phabricator.wikimedia.org/T297411) (owner: 10Jelto) [15:17:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:36] (03CR) 10Jbond: 
[C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/762828 (owner: 10Volans) [15:18:00] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 121, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:18:03] (03CR) 10Jbond: [C: 03+1] Import ArgparseFormatter from spicerack [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/762826 (owner: 10Volans) [15:19:24] 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Michael.hay - https://phabricator.wikimedia.org/T301782 (10RhinosF1) I assume @JBennett is approving manager as with all Tumult Labs tickets in last few days. If so, please approve. [15:19:46] 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Michael.hay - https://phabricator.wikimedia.org/T301782 (10RhinosF1) Cc Ottomata as well given it's analytics access [15:20:12] (03PS3) 10Vgutierrez: cache::haproxy: Log X-Cache-Status [puppet] - 10https://gerrit.wikimedia.org/r/762411 (https://phabricator.wikimedia.org/T290005) [15:21:09] (03PS6) 10Vgutierrez: mtail::cache_haproxy: Split TTFB bucket by X-Cache-Status [puppet] - 10https://gerrit.wikimedia.org/r/762412 (https://phabricator.wikimedia.org/T290005) [15:22:35] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes2009 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [15:22:43] (03CR) 10Vgutierrez: [C: 03+2] cache::haproxy: Log X-Cache-Status [puppet] - 10https://gerrit.wikimedia.org/r/762411 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [15:23:22] (03CR) 10Elukey: [C: 03+1] Import ArgparseFormatter from spicerack [cookbooks] - 10https://gerrit.wikimedia.org/r/762824 (owner: 10Volans) [15:23:26] (03CR) 10Vgutierrez: [C: 03+2] mtail::cache_haproxy: Split TTFB bucket by X-Cache-Status [puppet] - 10https://gerrit.wikimedia.org/r/762412 
(https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [15:24:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141 (T300381)', diff saved to https://phabricator.wikimedia.org/P20809 and previous config saved to /var/cache/conftool/dbconfig/20220215-152448-marostegui.json [15:24:50] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1144.eqiad.wmnet with reason: Maintenance [15:24:51] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1144.eqiad.wmnet with reason: Maintenance [15:24:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:24:54] T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381 [15:24:55] 10SRE, 10ops-eqiad, 10DC-Ops, 10Machine-Learning-Team: Q2:(Need By: TBD) rack/setup/install ml-serve100[5-8] - https://phabricator.wikimedia.org/T294949 (10elukey) [15:24:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1144:3314 (T300381)', diff saved to https://phabricator.wikimedia.org/P20810 and previous config saved to /var/cache/conftool/dbconfig/20220215-152455-marostegui.json [15:24:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:25:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:25:02] 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Michael.hay - https://phabricator.wikimedia.org/T301782 (10Ottomata) Approved. This user account should have an account expiry date. [15:25:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:25:44] 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Michael.hay - https://phabricator.wikimedia.org/T301782 (10RhinosF1) Expiry dates for these is 13 September 2022 I believe [15:26:02] ottomata: that was super fast! 
[15:26:27] :) [15:32:03] (03CR) 10Volans: [C: 03+2] Import ArgparseFormatter from spicerack [cookbooks] - 10https://gerrit.wikimedia.org/r/762824 (owner: 10Volans) [15:32:12] (03CR) 10Volans: [C: 03+2] sre.hosts.reimage: convert call to downtime [cookbooks] - 10https://gerrit.wikimedia.org/r/762828 (owner: 10Volans) [15:32:41] 10SRE, 10Discovery, 10Infrastructure-Foundations, 10netops: Speed up network connections for Elastic hosts - https://phabricator.wikimedia.org/T301577 (10cmooney) Thanks Brian > Ryan and I are in training this week, but I would love to meet with you (maybe next week) and talk specifics about network (or a... [15:34:44] (03Merged) 10jenkins-bot: Import ArgparseFormatter from spicerack [cookbooks] - 10https://gerrit.wikimedia.org/r/762824 (owner: 10Volans) [15:34:46] (03Merged) 10jenkins-bot: sre.hosts.reimage: convert call to downtime [cookbooks] - 10https://gerrit.wikimedia.org/r/762828 (owner: 10Volans) [15:35:42] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/762781 (https://phabricator.wikimedia.org/T300708) (owner: 10MMandere) [15:36:52] RECOVERY - Check systemd state on kubernetes2009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:41:53] (03PS1) 10Muehlenhoff: netboot.cfg: Use globbing for prometheus hosts [puppet] - 10https://gerrit.wikimedia.org/r/762838 [15:44:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314 (T300381)', diff saved to https://phabricator.wikimedia.org/P20811 and previous config saved to /var/cache/conftool/dbconfig/20220215-154427-marostegui.json [15:44:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:44:34] T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381 [15:48:56] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host es1024.eqiad.wmnet with OS 
bullseye [15:49:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:52:38] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes2009 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [15:56:19] (03PS1) 10Ottomata: Add cyrus-sasl and pyhive [debs/anaconda-wmf] (debian) - 10https://gerrit.wikimedia.org/r/762843 [15:56:26] RECOVERY - haproxy failover on dbproxy2004 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [15:56:36] (03CR) 10Ottomata: [V: 03+2 C: 03+2] Fix env_vars - should be CXXFLAGS, not CXX_FLAGS [debs/anaconda-wmf] (debian) - 10https://gerrit.wikimedia.org/r/762522 (owner: 10Ottomata) [15:56:40] (03CR) 10Ottomata: [V: 03+2 C: 03+2] Add cyrus-sasl and pyhive [debs/anaconda-wmf] (debian) - 10https://gerrit.wikimedia.org/r/762843 (owner: 10Ottomata) [15:57:33] (03CR) 10MMandere: "Double checked the key with the user via Slack, and key matches." [puppet] - 10https://gerrit.wikimedia.org/r/762781 (https://phabricator.wikimedia.org/T300708) (owner: 10MMandere) [15:57:40] (03CR) 10MMandere: [C: 03+2] admin: Add saisuman production public key [puppet] - 10https://gerrit.wikimedia.org/r/762781 (https://phabricator.wikimedia.org/T300708) (owner: 10MMandere) [15:57:46] (03PS1) 10AOkoth: vrts: rename profile variables [puppet] - 10https://gerrit.wikimedia.org/r/762845 (https://phabricator.wikimedia.org/T293942) [15:59:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314', diff saved to https://phabricator.wikimedia.org/P20812 and previous config saved to /var/cache/conftool/dbconfig/20220215-155931-marostegui.json [15:59:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:00:04] cwhite: #bothumor My software never has bugs. It just develops random features. Rise for Logstash switchback to eqiad. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220215T1600). [16:00:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es1024 (T300006)', diff saved to https://phabricator.wikimedia.org/P20813 and previous config saved to /var/cache/conftool/dbconfig/20220215-160055-ladsgroup.json [16:00:58] o/ [16:00:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:01:00] !log filippo@cumin1001 START - Cookbook sre.hosts.decommission for hosts prometheus2004.codfw.wmnet [16:01:01] T300006: Upgrade es5 to Bullseye - https://phabricator.wikimedia.org/T300006 [16:01:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:02:05] (03PS1) 10Ottomata: Install anaconda-wmf-base on all workers, and anaconda-wmf only on stat boxes [puppet] - 10https://gerrit.wikimedia.org/r/762846 [16:02:24] (03CR) 10AOkoth: "https://puppet-compiler.wmflabs.org/pcc-worker1003/33803/" [puppet] - 10https://gerrit.wikimedia.org/r/762845 (https://phabricator.wikimedia.org/T293942) (owner: 10AOkoth) [16:03:25] (03PS2) 10Jelto: gitlab: move gitlab test instance to wmcloud.org [puppet] - 10https://gerrit.wikimedia.org/r/762823 (https://phabricator.wikimedia.org/T297411) [16:06:57] (03CR) 10Filippo Giunchedi: [C: 03+1] netboot.cfg: Use globbing for prometheus hosts [puppet] - 10https://gerrit.wikimedia.org/r/762838 (owner: 10Muehlenhoff) [16:06:59] (03CR) 10Jelto: gitlab: move gitlab test instance to wmcloud.org (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/762823 (https://phabricator.wikimedia.org/T297411) (owner: 10Jelto) [16:10:56] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] Import ArgparseFormatter from spicerack [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/762826 (owner: 10Volans) [16:11:20] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts prometheus2004.codfw.wmnet [16:11:23] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:14:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314', diff saved to https://phabricator.wikimedia.org/P20814 and previous config saved to /var/cache/conftool/dbconfig/20220215-161436-marostegui.json [16:14:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:15:22] (03PS1) 10Razzi: site: add datahubsearch1001 in insetup role [puppet] - 10https://gerrit.wikimedia.org/r/762850 (https://phabricator.wikimedia.org/T301383) [16:15:53] (03PS2) 10Cwhite: hiera: map logstash.wm.o to kibana7.eqiad [puppet] - 10https://gerrit.wikimedia.org/r/759783 (https://phabricator.wikimedia.org/T299168) [16:16:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es1024', diff saved to https://phabricator.wikimedia.org/P20815 and previous config saved to /var/cache/conftool/dbconfig/20220215-161601-ladsgroup.json [16:16:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:17:09] (03CR) 10Cwhite: [C: 03+2] hiera: map logstash.wm.o to kibana7.eqiad [puppet] - 10https://gerrit.wikimedia.org/r/759783 (https://phabricator.wikimedia.org/T299168) (owner: 10Cwhite) [16:17:51] (03CR) 10Razzi: [C: 03+2] "I'm proceeding with this; should be pretty low risk since this is a new host so I'll see any issues I set it up." 
[puppet] - 10https://gerrit.wikimedia.org/r/762850 (https://phabricator.wikimedia.org/T301383) (owner: 10Razzi) [16:18:06] (03PS1) 10BBlack: lvs1018 interface/role setup [puppet] - 10https://gerrit.wikimedia.org/r/762852 (https://phabricator.wikimedia.org/T301142) [16:18:08] (03CR) 10Btullis: site: add datahubsearch1001 in insetup role (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/762850 (https://phabricator.wikimedia.org/T301383) (owner: 10Razzi) [16:19:18] (03PS2) 10Giuseppe Lavagetto: shellbox-media: use local disk for /tmp [deployment-charts] - 10https://gerrit.wikimedia.org/r/762768 (https://phabricator.wikimedia.org/T292322) [16:19:20] (03PS1) 10Giuseppe Lavagetto: Rakefile: enhance output on broken yaml/new assets [deployment-charts] - 10https://gerrit.wikimedia.org/r/762853 [16:19:22] (03PS1) 10Giuseppe Lavagetto: Shellbox: several enhancements [deployment-charts] - 10https://gerrit.wikimedia.org/r/762854 [16:21:59] (03PS2) 10Andrew Bogott: nfs-mounts.yaml.erb: remove nfs mounts for project-proxy [puppet] - 10https://gerrit.wikimedia.org/r/762543 (https://phabricator.wikimedia.org/T301715) [16:22:01] (03PS1) 10Andrew Bogott: wmcs-cinder-backup-manager: add more nfs volume backups [puppet] - 10https://gerrit.wikimedia.org/r/762856 (https://phabricator.wikimedia.org/T301280) [16:22:17] (03PS1) 10Marostegui: zarcillo.sql.erb: Zarcillo grants [puppet] - 10https://gerrit.wikimedia.org/r/762857 [16:24:23] (03CR) 10Marostegui: [C: 03+2] zarcillo.sql.erb: Zarcillo grants [puppet] - 10https://gerrit.wikimedia.org/r/762857 (owner: 10Marostegui) [16:25:28] (03CR) 10Muehlenhoff: site: add datahubsearch1001 in insetup role (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/762850 (https://phabricator.wikimedia.org/T301383) (owner: 10Razzi) [16:26:07] !log lvs1014 - downtimed - stopping puppet+pybal to fail traffic over to lvs1020 - T301142 [16:26:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:26:28] T301142: Migrate 
lvs101[345] to lvs101[789] - https://phabricator.wikimedia.org/T301142 [16:26:50] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:27:46] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:28:57] (03CR) 10Ayounsi: [C: 03+1] Adjust CR Internal Anycast BGP Templates (032 comments) [homer/public] - 10https://gerrit.wikimedia.org/r/762807 (https://phabricator.wikimedia.org/T301165) (owner: 10Cathal Mooney) [16:29:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314 (T300381)', diff saved to https://phabricator.wikimedia.org/P20816 and previous config saved to /var/cache/conftool/dbconfig/20220215-162941-marostegui.json [16:29:43] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1143.eqiad.wmnet with reason: Maintenance [16:29:43] (bgp alerts triggered by lvs12014 works) [16:29:45] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1143.eqiad.wmnet with reason: Maintenance [16:29:46] *1014 [16:29:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:29:47] T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381 [16:29:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1143 (T300381)', diff saved to https://phabricator.wikimedia.org/P20817 and previous config saved to /var/cache/conftool/dbconfig/20220215-162949-marostegui.json [16:29:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:29:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:29:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:30:27] (03CR) 10Ayounsi: [C: 03+1] cr: remove 
prometheus[12]00[34] from ACLs [homer/public] - 10https://gerrit.wikimedia.org/r/762827 (https://phabricator.wikimedia.org/T296199) (owner: 10Filippo Giunchedi) [16:31:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es1024', diff saved to https://phabricator.wikimedia.org/P20818 and previous config saved to /var/cache/conftool/dbconfig/20220215-163106-ladsgroup.json [16:31:08] (03PS1) 10Volans: Remove ArgparseFormatter as it's now unused [cookbooks] - 10https://gerrit.wikimedia.org/r/762860 [16:31:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:31:15] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM. Nice one :)" [puppet] - 10https://gerrit.wikimedia.org/r/762810 (owner: 10Volans) [16:31:16] ACKNOWLEDGEMENT - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal Brandon Black T301142 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:31:16] ACKNOWLEDGEMENT - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal Brandon Black T301142 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:31:20] ACKNOWLEDGEMENT - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal Brandon Black T301142 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:31:20] ACKNOWLEDGEMENT - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal Brandon Black T301142 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:31:32] no double-click prevention on the icinga form heh [16:31:33] (03CR) 10Volans: [C: 03+2] homer: daily check only active devices in Netbox [puppet] - 10https://gerrit.wikimedia.org/r/762810 (owner: 10Volans) [16:33:43] (03CR) 10Andrew Bogott: [C: 03+2] nfs-mounts.yaml.erb: remove nfs mounts for project-proxy [puppet] - 10https://gerrit.wikimedia.org/r/762543 (https://phabricator.wikimedia.org/T301715) (owner: 10Andrew Bogott) [16:33:54] 
(03CR) 10Andrew Bogott: [C: 03+2] wmcs-cinder-backup-manager: add more nfs volume backups [puppet] - 10https://gerrit.wikimedia.org/r/762856 (https://phabricator.wikimedia.org/T301280) (owner: 10Andrew Bogott) [16:34:17] (03PS2) 10Giuseppe Lavagetto: Rakefile: enhance output on broken yaml/new assets [deployment-charts] - 10https://gerrit.wikimedia.org/r/762853 [16:34:20] (03PS2) 10Giuseppe Lavagetto: Shellbox: several enhancements [deployment-charts] - 10https://gerrit.wikimedia.org/r/762854 [16:34:24] (03PS3) 10Giuseppe Lavagetto: shellbox-media: use local disk for /tmp [deployment-charts] - 10https://gerrit.wikimedia.org/r/762768 (https://phabricator.wikimedia.org/T292322) [16:35:19] (03CR) 10BBlack: [C: 03+2] lvs1018 interface/role setup [puppet] - 10https://gerrit.wikimedia.org/r/762852 (https://phabricator.wikimedia.org/T301142) (owner: 10BBlack) [16:35:25] (03PS2) 10BBlack: lvs1018 interface/role setup [puppet] - 10https://gerrit.wikimedia.org/r/762852 (https://phabricator.wikimedia.org/T301142) [16:38:02] !log lvs1018 - puppeting into prod role for first time [16:38:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:38:20] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host ganeti1011.eqiad.wmnet with OS buster [16:38:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:38:25] 10SRE, 10ops-eqiad, 10DC-Ops: Broken disk on ganeti1011 - https://phabricator.wikimedia.org/T301240 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host ganeti1011.eqiad.wmnet with OS buster [16:39:32] !log logstash switchback to eqiad complete T299168 [16:39:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:39:37] T299168: Upgrade OpenSearch - https://phabricator.wikimedia.org/T299168 [16:41:12] (03PS3) 10Giuseppe Lavagetto: Rakefile: enhance output on broken yaml/new assets [deployment-charts] - 
10https://gerrit.wikimedia.org/r/762853 [16:41:14] (03PS3) 10Giuseppe Lavagetto: Shellbox: several enhancements [deployment-charts] - 10https://gerrit.wikimedia.org/r/762854 [16:41:16] (03PS4) 10Giuseppe Lavagetto: shellbox-media: use local disk for /tmp [deployment-charts] - 10https://gerrit.wikimedia.org/r/762768 (https://phabricator.wikimedia.org/T292322) [16:44:18] (03PS2) 10Accraze: ml-services: add arwiki & bnwiki editquality isvcs [deployment-charts] - 10https://gerrit.wikimedia.org/r/762533 (https://phabricator.wikimedia.org/T301415) [16:46:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es1024 (T300006)', diff saved to https://phabricator.wikimedia.org/P20819 and previous config saved to /var/cache/conftool/dbconfig/20220215-164611-ladsgroup.json [16:46:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:46:17] T300006: Upgrade es5 to Bullseye - https://phabricator.wikimedia.org/T300006 [16:47:12] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Rakefile: enhance output on broken yaml/new assets [deployment-charts] - 10https://gerrit.wikimedia.org/r/762853 (owner: 10Giuseppe Lavagetto) [16:47:48] (03CR) 10Accraze: [C: 03+1] kserve-inference: add configuration for revscoring models [deployment-charts] - 10https://gerrit.wikimedia.org/r/762790 (owner: 10Elukey) [16:48:05] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Michael.hay - https://phabricator.wikimedia.org/T301782 (10Htriedman) @RhinosF1 Correct assumptions on both counts - @JBennett is the approving manager, and the expiry date is 13 September 2022 [16:48:21] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata for Skye Berghel - https://phabricator.wikimedia.org/T301581 (10Htriedman) @Dzahn checking in with Legal now [16:49:17] (03CR) 10Elukey: [C: 03+2] kserve-inference: add configuration for revscoring models [deployment-charts] - 10https://gerrit.wikimedia.org/r/762790 (owner: 
10Elukey) [16:50:02] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti1011.eqiad.wmnet with reason: host reimage [16:50:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:50:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143 (T300381)', diff saved to https://phabricator.wikimedia.org/P20820 and previous config saved to /var/cache/conftool/dbconfig/20220215-165015-marostegui.json [16:50:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:50:20] T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381 [16:50:26] (03CR) 10Accraze: [C: 03+1] ml-services: move revscoring-editquality to the new config [deployment-charts] - 10https://gerrit.wikimedia.org/r/762791 (owner: 10Elukey) [16:50:37] (03CR) 10Elukey: [C: 03+2] ml-services: move revscoring-editquality to the new config [deployment-charts] - 10https://gerrit.wikimedia.org/r/762791 (owner: 10Elukey) [16:50:59] 10SRE, 10ops-eqiad, 10DC-Ops: Broken disk on ganeti1011 - https://phabricator.wikimedia.org/T301240 (10Cmjohnson) @MoritzMuehlenhoff No, I turn the internal USB off during setup. I did double-check, I think what happened is, I rearranged disks without powering off the server. I pulled the power and did a r... 
[16:51:01] (03Merged) 10jenkins-bot: Rakefile: enhance output on broken yaml/new assets [deployment-charts] - 10https://gerrit.wikimedia.org/r/762853 (owner: 10Giuseppe Lavagetto) [16:51:01] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [16:51:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:51:50] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Shellbox: several enhancements [deployment-charts] - 10https://gerrit.wikimedia.org/r/762854 (owner: 10Giuseppe Lavagetto) [16:51:58] !log lvs1018 - reboot [16:52:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:52:42] PROBLEM - Host lvs1018 is DOWN: PING CRITICAL - Packet loss = 100% [16:53:33] (03PS1) 10Razzi: dhcpd: Fix hostname for datahubsearch1001, use bullseye [puppet] - 10https://gerrit.wikimedia.org/r/762867 (https://phabricator.wikimedia.org/T301383) [16:54:17] (03CR) 10jerkins-bot: [V: 04-1] dhcpd: Fix hostname for datahubsearch1001, use bullseye [puppet] - 10https://gerrit.wikimedia.org/r/762867 (https://phabricator.wikimedia.org/T301383) (owner: 10Razzi) [16:54:19] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti1011.eqiad.wmnet with reason: host reimage [16:54:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:54:50] (03PS1) 10BBlack: Add lvs1018 to pybal neighbors [homer/public] - 10https://gerrit.wikimedia.org/r/762869 [16:54:57] (03CR) 10Razzi: site: add datahubsearch1001 in insetup role (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/762850 (https://phabricator.wikimedia.org/T301383) (owner: 10Razzi) [16:55:16] (03Merged) 10jenkins-bot: Shellbox: several enhancements [deployment-charts] - 10https://gerrit.wikimedia.org/r/762854 (owner: 10Giuseppe Lavagetto) [16:55:32] (03PS2) 10Razzi: dhcpd: Fix hostname for datahubsearch1001, use bullseye [puppet] - 10https://gerrit.wikimedia.org/r/762867 (https://phabricator.wikimedia.org/T301383) [16:55:35] 
!log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality' for release 'main' . [16:55:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:55:45] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:55:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:56:07] RECOVERY - Host lvs1018 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [16:56:10] !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality' for release 'main' . [16:56:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:56:19] 10SRE, 10SRE-Access-Requests: saisuman ssh production public keys reused for WMCS - https://phabricator.wikimedia.org/T300708 (10MMandere) @SCherukuwada, we now have your new public associated with your account. You should be able to access production servers. Please give it a try! [16:57:09] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host contint2002.mgmt.codfw.wmnet with reboot policy FORCED [16:57:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:58:52] (03CR) 10BBlack: [C: 03+2] Add lvs1018 to pybal neighbors [homer/public] - 10https://gerrit.wikimedia.org/r/762869 (owner: 10BBlack) [16:59:33] (03CR) 10Razzi: "Thanks for helping me troubleshoot that last patch, going to pause for a moment for reviews for this one :)" [puppet] - 10https://gerrit.wikimedia.org/r/762867 (https://phabricator.wikimedia.org/T301383) (owner: 10Razzi) [17:00:05] jbond and rzl: Dear deployers, time to do the Puppet request window deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220215T1700). [17:00:05] No Gerrit patches in the queue for this window AFAICS. 
[17:01:59] ✅ [17:02:29] (03PS2) 10Ottomata: Install anaconda-wmf-base on all workers, and anaconda-wmf only on client nodes [puppet] - 10https://gerrit.wikimedia.org/r/762846 [17:03:04] (03PS1) 10Majavah: dynamicproxy: manage dns in the api [puppet] - 10https://gerrit.wikimedia.org/r/762871 (https://phabricator.wikimedia.org/T295246) [17:05:05] (03CR) 10jerkins-bot: [V: 04-1] dynamicproxy: manage dns in the api [puppet] - 10https://gerrit.wikimedia.org/r/762871 (https://phabricator.wikimedia.org/T295246) (owner: 10Majavah) [17:05:09] PROBLEM - PyBal backends health check on lvs1018 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal [17:05:18] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti1011.eqiad.wmnet with OS buster [17:05:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143', diff saved to https://phabricator.wikimedia.org/P20821 and previous config saved to /var/cache/conftool/dbconfig/20220215-170520-marostegui.json [17:05:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:05:24] 10SRE, 10ops-eqiad, 10DC-Ops: Broken disk on ganeti1011 - https://phabricator.wikimedia.org/T301240 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host ganeti1011.eqiad.wmnet with OS buster completed: - ganeti1011 (**PASS**) - Removed from Puppet and PuppetDB... 
[17:05:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:06:37] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host contint2002.mgmt.codfw.wmnet with reboot policy FORCED [17:06:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:07:43] PROBLEM - PyBal connections to etcd on lvs1018 is CRITICAL: CRITICAL: 0 connections established with conf1004.eqiad.wmnet:4001 (min=36) https://wikitech.wikimedia.org/wiki/PyBal [17:07:49] (03PS2) 10Majavah: dynamicproxy: manage dns in the api [puppet] - 10https://gerrit.wikimedia.org/r/762871 (https://phabricator.wikimedia.org/T295246) [17:08:15] (03CR) 10Ayounsi: New function and changes to wmf-netbox plugin to support EVPN config. (032 comments) [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/760566 (https://phabricator.wikimedia.org/T299758) (owner: 10Cathal Mooney) [17:08:53] !log cr[12]-eqiad: manual edit static fallback route for high-traffic2 from lvs1014 to lvs1018 - T301142 [17:08:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:08:58] T301142: Migrate lvs101[345] to lvs101[789] - https://phabricator.wikimedia.org/T301142 [17:09:43] (03CR) 10jerkins-bot: [V: 04-1] dynamicproxy: manage dns in the api [puppet] - 10https://gerrit.wikimedia.org/r/762871 (https://phabricator.wikimedia.org/T295246) (owner: 10Majavah) [17:10:24] (03PS3) 10Majavah: dynamicproxy: manage dns in the api [puppet] - 10https://gerrit.wikimedia.org/r/762871 (https://phabricator.wikimedia.org/T295246) [17:10:25] RECOVERY - SSH on analytics1063.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:12:05] PROBLEM - pybal on lvs1018 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [17:13:56] (03PS5) 10Giuseppe Lavagetto: shellbox-media: use local disk for /tmp 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/762768 (https://phabricator.wikimedia.org/T292322) [17:14:21] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/762867 (https://phabricator.wikimedia.org/T301383) (owner: 10Razzi) [17:14:39] !log lvs1018 - bringing pybal online for production upload traffic [17:14:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:15:45] RECOVERY - pybal on lvs1018 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [17:15:47] RECOVERY - PyBal backends health check on lvs1018 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [17:16:30] (03CR) 10Dzahn: [V: 03+1 C: 03+1] "looks good to me! compiler shows the rename but no other changes: https://puppet-compiler.wmflabs.org/pcc-worker1001/33804/otrs1001.eqiad" [puppet] - 10https://gerrit.wikimedia.org/r/762845 (https://phabricator.wikimedia.org/T293942) (owner: 10AOkoth) [17:16:46] (03CR) 10Razzi: [C: 03+2] dhcpd: Fix hostname for datahubsearch1001, use bullseye [puppet] - 10https://gerrit.wikimedia.org/r/762867 (https://phabricator.wikimedia.org/T301383) (owner: 10Razzi) [17:17:21] RECOVERY - PyBal connections to etcd on lvs1018 is OK: OK: 36 connections established with conf1004.eqiad.wmnet:4001 (min=36) https://wikitech.wikimedia.org/wiki/PyBal [17:17:41] (03CR) 10Muehlenhoff: site: add datahubsearch1001 in insetup role (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/762850 (https://phabricator.wikimedia.org/T301383) (owner: 10Razzi) [17:18:22] (03CR) 10Giuseppe Lavagetto: [C: 03+2] "diffs look ok" [deployment-charts] - 10https://gerrit.wikimedia.org/r/762768 (https://phabricator.wikimedia.org/T292322) (owner: 10Giuseppe Lavagetto) [17:19:21] 10SRE, 10ops-eqiad, 10DC-Ops: Broken disk on ganeti1011 - https://phabricator.wikimedia.org/T301240 (10MoritzMuehlenhoff) 05Open→03Resolved Thanks! 
Buster is exactly what we need here, so closing. [17:20:07] (03PS1) 10BBlack: Remove lvs1014 from pybal neighbors [homer/public] - 10https://gerrit.wikimedia.org/r/762874 (https://phabricator.wikimedia.org/T301142) [17:20:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143', diff saved to https://phabricator.wikimedia.org/P20822 and previous config saved to /var/cache/conftool/dbconfig/20220215-172024-marostegui.json [17:20:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:20:40] (03PS1) 10Marostegui: db2074,db2094: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/762875 (https://phabricator.wikimedia.org/T301313) [17:20:46] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:restbase-codfw: Restarting to pick up Java security updates - hnowlan@cumin1001 [17:20:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:21:57] (03CR) 10Marostegui: [C: 03+2] db2074,db2094: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/762875 (https://phabricator.wikimedia.org/T301313) (owner: 10Marostegui) [17:22:25] (03Merged) 10jenkins-bot: shellbox-media: use local disk for /tmp [deployment-charts] - 10https://gerrit.wikimedia.org/r/762768 (https://phabricator.wikimedia.org/T292322) (owner: 10Giuseppe Lavagetto) [17:23:30] (03PS1) 10BBlack: lvs1014: unconfigure towards spare::system [puppet] - 10https://gerrit.wikimedia.org/r/762876 (https://phabricator.wikimedia.org/T301142) [17:25:00] (03CR) 10Dzahn: [V: 03+1 C: 03+2] rancid: rm .placeholder, ensure config dir exist, avoid puppet flap [puppet] - 10https://gerrit.wikimedia.org/r/762536 (https://phabricator.wikimedia.org/T211459) (owner: 10Dzahn) [17:26:46] !log oblivian@deploy1002 helmfile [staging] START helmfile.d/services/shellbox-media: apply on main [17:26:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[17:26:50] !log oblivian@deploy1002 helmfile [staging] DONE helmfile.d/services/shellbox-media: sync on main [17:26:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:27:30] (03CR) 10Dzahn: "noop on netmon1002 confirmed, I will check tomorrow if it stopped flapping" [puppet] - 10https://gerrit.wikimedia.org/r/762536 (https://phabricator.wikimedia.org/T211459) (owner: 10Dzahn) [17:28:31] (03PS1) 10Giuseppe Lavagetto: shellbox-media: use DNS-1123 names for volumes [deployment-charts] - 10https://gerrit.wikimedia.org/r/762878 [17:28:44] (03PS2) 10Giuseppe Lavagetto: shellbox-media: use DNS-1123 names for volumes [deployment-charts] - 10https://gerrit.wikimedia.org/r/762878 [17:28:46] PROBLEM - DPKG on an-coord1001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [17:28:51] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] shellbox-media: use DNS-1123 names for volumes [deployment-charts] - 10https://gerrit.wikimedia.org/r/762878 (owner: 10Giuseppe Lavagetto) [17:30:00] (03CR) 10Dzahn: [C: 03+1] "I saw the comments about using wmcloud.org and IDP in cloud. This changes makes sense as a response to those. ACK!" 
[puppet] - 10https://gerrit.wikimedia.org/r/762823 (https://phabricator.wikimedia.org/T297411) (owner: 10Jelto) [17:32:00] (03PS1) 10Eigyan: [wmf-config]: Deploy the fawiki test safety survey to production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/762881 (https://phabricator.wikimedia.org/T297629) [17:32:20] (03Merged) 10jenkins-bot: shellbox-media: use DNS-1123 names for volumes [deployment-charts] - 10https://gerrit.wikimedia.org/r/762878 (owner: 10Giuseppe Lavagetto) [17:32:31] !log oblivian@deploy1002 helmfile [staging] START helmfile.d/services/shellbox-media: apply on main [17:32:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:32:36] !log oblivian@deploy1002 helmfile [staging] DONE helmfile.d/services/shellbox-media: sync on main [17:32:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:33:44] (03CR) 10BBlack: [C: 03+2] Remove lvs1014 from pybal neighbors [homer/public] - 10https://gerrit.wikimedia.org/r/762874 (https://phabricator.wikimedia.org/T301142) (owner: 10BBlack) [17:33:58] !log oblivian@deploy1002 helmfile [staging] START helmfile.d/services/shellbox-media: apply on main [17:34:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:34:19] !log oblivian@deploy1002 helmfile [staging] DONE helmfile.d/services/shellbox-media: sync on main [17:34:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:35:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143 (T300381)', diff saved to https://phabricator.wikimedia.org/P20823 and previous config saved to /var/cache/conftool/dbconfig/20220215-173529-marostegui.json [17:35:31] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1146.eqiad.wmnet with reason: Maintenance [17:35:32] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1146.eqiad.wmnet with reason: Maintenance 
[17:35:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:35:36] T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381 [17:35:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1146:3314 (T300381)', diff saved to https://phabricator.wikimedia.org/P20824 and previous config saved to /var/cache/conftool/dbconfig/20220215-173536-marostegui.json [17:35:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:35:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:35:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:36:00] !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/shellbox-media: apply on main [17:36:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:36:30] (03CR) 10BBlack: [C: 03+2] lvs1014: unconfigure towards spare::system [puppet] - 10https://gerrit.wikimedia.org/r/762876 (https://phabricator.wikimedia.org/T301142) (owner: 10BBlack) [17:36:51] !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/shellbox-media: sync on main [17:36:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:37:27] 10SRE-swift-storage, 10MW-on-K8s, 10Shellbox, 10serviceops, and 2 others: Support large files in Shellbox - https://phabricator.wikimedia.org/T292322 (10Joe) >>! In T292322#7710927, @akosiaris wrote: >>>! In T292322#7710140, @Joe wrote: >> After more digging: I have no idea why envoy would report the upstr... 
[17:38:11] !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/shellbox-media: apply on main [17:38:14] !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/shellbox-media: apply on main [17:38:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:38:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:38:42] !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/shellbox-media: apply on main [17:38:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:38:47] (03CR) 10Dzahn: gitlab: add ferm rules and fix listen_addresses for test instance (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/762803 (https://phabricator.wikimedia.org/T297411) (owner: 10Jelto) [17:38:59] (03PS2) 10Muehlenhoff: netboot.cfg: Use globbing for prometheus hosts [puppet] - 10https://gerrit.wikimedia.org/r/762838 [17:39:16] !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/shellbox-media: sync on main [17:39:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:39:49] (03CR) 10Muehlenhoff: [C: 03+2] netboot.cfg: Use globbing for prometheus hosts [puppet] - 10https://gerrit.wikimedia.org/r/762838 (owner: 10Muehlenhoff) [17:39:54] 10SRE, 10Wikimedia-Mailing-lists, 10User-Ladsgroup: Mailing list for ptwiki's checkusers - https://phabricator.wikimedia.org/T301614 (10Tks4Fish) @Ladsgroup asked my fellow CUs, all good with "wikipedia-pt-checkusers" :) [17:40:01] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host contint2002.mgmt.codfw.wmnet with reboot policy FORCED [17:40:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:41:20] (03CR) 10Ayounsi: "Awesome! Lots to review, so more of a first pass." 
[homer/public] - 10https://gerrit.wikimedia.org/r/759709 (https://phabricator.wikimedia.org/T299758) (owner: 10Cathal Mooney) [17:42:47] !log bblack@cumin1001 START - Cookbook sre.dns.netbox [17:42:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:43:03] (03PS1) 10Volans: dhcp: fix lowercase serial matching [software/spicerack] - 10https://gerrit.wikimedia.org/r/762883 [17:45:41] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host restbase1031.eqiad.wmnet with OS buster [17:45:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:45:47] 10SRE, 10ops-eqiad, 10DC-Ops, 10RESTBase, 10Platform Team Workboards (Platform Engineering Reliability): Q2:(Need By: TBD) rack/setup/install restbase103[123].eqiad.wmnet - https://phabricator.wikimedia.org/T294372 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin... [17:46:05] (03CR) 10Ottomata: [C: 03+2] Install anaconda-wmf-base on all workers, and anaconda-wmf only on client nodes [puppet] - 10https://gerrit.wikimedia.org/r/762846 (owner: 10Ottomata) [17:47:03] !log bblack@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:47:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:47:15] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host contint2002.mgmt.codfw.wmnet with reboot policy FORCED [17:47:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:48:27] !log bblack@cumin1001 START - Cookbook sre.hosts.reimage for host lvs1014.eqiad.wmnet with OS buster [17:48:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:48:36] 10SRE, 10ops-eqiad, 10Traffic-Icebox, 10Patch-For-Review: Migrate lvs101[345] to lvs101[789] - https://phabricator.wikimedia.org/T301142 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bblack@cumin1001 for host lvs1014.eqiad.wmnet 
with OS buster [17:52:23] (03PS1) 10Dzahn: gitlab: avoid $realm check, simplify ferm rules [puppet] - 10https://gerrit.wikimedia.org/r/762886 [17:52:50] (03CR) 10jerkins-bot: [V: 04-1] dhcp: fix lowercase serial matching [software/spicerack] - 10https://gerrit.wikimedia.org/r/762883 (owner: 10Volans) [17:53:35] (03CR) 10Jbond: [C: 03+1] dhcp: fix lowercase serial matching [software/spicerack] - 10https://gerrit.wikimedia.org/r/762883 (owner: 10Volans) [17:54:01] (03CR) 10Dzahn: gitlab: add ferm rules and fix listen_addresses for test instance (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/762803 (https://phabricator.wikimedia.org/T297411) (owner: 10Jelto) [17:55:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314 (T300381)', diff saved to https://phabricator.wikimedia.org/P20826 and previous config saved to /var/cache/conftool/dbconfig/20220215-175508-marostegui.json [17:55:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:55:14] T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381 [17:58:41] (03CR) 10Dzahn: [C: 04-2] "https://puppet-compiler.wmflabs.org/pcc-worker1002/33805/ not working yet, WIP" [puppet] - 10https://gerrit.wikimedia.org/r/762886 (owner: 10Dzahn) [17:59:00] RECOVERY - DPKG on an-coord1001 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [17:59:16] !log bblack@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on lvs1014.eqiad.wmnet with reason: host reimage [17:59:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:00:05] chrisalbon and accraze: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Services – Graphoid / ORES . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220215T1800). 
[18:02:05] 10SRE-swift-storage, 10MW-on-K8s, 10Shellbox, 10serviceops, and 2 others: Support large files in Shellbox - https://phabricator.wikimedia.org/T292322 (10Joe) As I feared, no significant change is seen when using an host-mounted emptyDir in the container. I would assume the shellbox server spends most of it... [18:02:29] !log bblack@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs1014.eqiad.wmnet with reason: host reimage [18:02:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:02:38] (03PS1) 10BBlack: lvs1019 interface/role setup [puppet] - 10https://gerrit.wikimedia.org/r/762888 (https://phabricator.wikimedia.org/T301142) [18:03:45] (03PS2) 10Volans: dhcp: fix lowercase serial matching [software/spicerack] - 10https://gerrit.wikimedia.org/r/762883 [18:05:41] !log lvs1015 - stopping puppet+pybal to begin transition to lvs1019 - T301142 [18:05:45] bblack: Failed to log message to wiki. Somebody should check the error logs. 
[18:05:46] T301142: Migrate lvs101[345] to lvs101[789] - https://phabricator.wikimedia.org/T301142 [18:07:41] (03PS2) 10Dzahn: gitlab: avoid $realm check, simplify ferm rules [puppet] - 10https://gerrit.wikimedia.org/r/762886 [18:10:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314', diff saved to https://phabricator.wikimedia.org/P20827 and previous config saved to /var/cache/conftool/dbconfig/20220215-181012-marostegui.json [18:10:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:11:34] (03CR) 10BBlack: [C: 03+2] lvs1019 interface/role setup [puppet] - 10https://gerrit.wikimedia.org/r/762888 (https://phabricator.wikimedia.org/T301142) (owner: 10BBlack) [18:12:48] (03CR) 10Jbond: [C: 03+1] dhcp: fix lowercase serial matching [software/spicerack] - 10https://gerrit.wikimedia.org/r/762883 (owner: 10Volans) [18:12:58] !log bblack@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host lvs1014.eqiad.wmnet with OS buster [18:13:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:13:06] 10SRE, 10ops-eqiad, 10Traffic-Icebox, 10Patch-For-Review: Migrate lvs101[345] to lvs101[789] - https://phabricator.wikimedia.org/T301142 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bblack@cumin1001 for host lvs1014.eqiad.wmnet with OS buster completed: - lvs1014 (**PASS**) - Do... 
[18:15:11] (03CR) 10Volans: [C: 03+2] dhcp: fix lowercase serial matching [software/spicerack] - 10https://gerrit.wikimedia.org/r/762883 (owner: 10Volans) [18:16:03] (03CR) 10Cathal Mooney: "reply inline" [homer/public] - 10https://gerrit.wikimedia.org/r/762807 (https://phabricator.wikimedia.org/T301165) (owner: 10Cathal Mooney) [18:16:20] (03PS1) 10Ladsgroup: Revert "es1024: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/762765 [18:16:26] (03PS2) 10Ladsgroup: Revert "es1024: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/762765 [18:16:31] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] Revert "es1024: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/762765 (owner: 10Ladsgroup) [18:17:52] (03PS3) 10Dzahn: gitlab: avoid $realm check, simplify ferm rules [puppet] - 10https://gerrit.wikimedia.org/r/762886 [18:18:25] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host restbase1031.eqiad.wmnet with OS buster [18:18:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:18:30] 10SRE, 10ops-eqiad, 10DC-Ops, 10RESTBase, 10Platform Team Workboards (Platform Engineering Reliability): Q2:(Need By: TBD) rack/setup/install restbase103[123].eqiad.wmnet - https://phabricator.wikimedia.org/T294372 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001... [18:18:35] (03CR) 10jerkins-bot: [V: 04-1] gitlab: avoid $realm check, simplify ferm rules [puppet] - 10https://gerrit.wikimedia.org/r/762886 (owner: 10Dzahn) [18:20:00] (03PS4) 10Dzahn: gitlab: avoid $realm check, simplify ferm rules [puppet] - 10https://gerrit.wikimedia.org/r/762886 [18:22:39] (03CR) 10Dzahn: [C: 04-2] "yea.. 
actually not sure how to do it properly: https://puppet-compiler.wmflabs.org/pcc-worker1003/33807/gitlab1001.wikimedia.org/index.htm" [puppet] - 10https://gerrit.wikimedia.org/r/762886 (owner: 10Dzahn) [18:23:48] PROBLEM - PHP opcache health on mwdebug1001 is CRITICAL: CRITICAL: opcache free space is below 50 MB on php 7.2. https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [18:25:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314', diff saved to https://phabricator.wikimedia.org/P20828 and previous config saved to /var/cache/conftool/dbconfig/20220215-182519-marostegui.json [18:25:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:25:25] (03Merged) 10jenkins-bot: dhcp: fix lowercase serial matching [software/spicerack] - 10https://gerrit.wikimedia.org/r/762883 (owner: 10Volans) [18:27:47] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host contint2002.mgmt.codfw.wmnet with reboot policy FORCED [18:27:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:28:53] (03PS2) 10Eigyan: [wmf-config]: Deploy the fawiki test safety survey to production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/762881 (https://phabricator.wikimedia.org/T297629) [18:31:10] 10SRE, 10ops-codfw, 10DC-Ops, 10Platform Engineering, 10RESTBase: Q3:(Need By: TBD) rack/setup/install restbase-dev200[123].codfw.wmnet - https://phabricator.wikimedia.org/T299437 (10Papaul) [18:33:20] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host contint2002.mgmt.codfw.wmnet with reboot policy FORCED [18:33:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:35:07] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host gerrit2002.mgmt.codfw.wmnet with reboot policy FORCED [18:35:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:35:57] 10SRE, 
10ops-codfw, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install contint2002, gerrit2002 - https://phabricator.wikimedia.org/T299575 (10Papaul) [18:36:25] !log lvs1019 - first prod puppetization + pybal start [18:36:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:39:00] (03PS8) 10Cathal Mooney: Base config additions and updated templates to configure EVPN ASW [homer/public] - 10https://gerrit.wikimedia.org/r/759709 (https://phabricator.wikimedia.org/T299758) [18:39:40] !log beginning rolling restart of kafka-logging clusters for updates [18:39:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:40:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314 (T300381)', diff saved to https://phabricator.wikimedia.org/P20829 and previous config saved to /var/cache/conftool/dbconfig/20220215-184023-marostegui.json [18:40:26] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1121.eqiad.wmnet with reason: Maintenance [18:40:28] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1121.eqiad.wmnet with reason: Maintenance [18:40:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:40:29] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [18:40:29] T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381 [18:40:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:40:32] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [18:40:33] 10SRE, 10Gerrit, 10serviceops: replacement for gerrit2001 - https://phabricator.wikimedia.org/T243027 (10Papaul) [18:40:35] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:40:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1121 (T300381)', diff saved to https://phabricator.wikimedia.org/P20830 and previous config saved to /var/cache/conftool/dbconfig/20220215-184037-marostegui.json [18:40:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:40:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:40:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:41:33] (03PS1) 10Majavah: gitlab: avoid $realm check, simplify ferm rules [puppet] - 10https://gerrit.wikimedia.org/r/762897 [18:41:52] !log lvs1019 - disable puppet/pybal, reboot - T301142 [18:41:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:41:57] T301142: Migrate lvs101[345] to lvs101[789] - https://phabricator.wikimedia.org/T301142 [18:43:00] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33808/console" [puppet] - 10https://gerrit.wikimedia.org/r/762897 (owner: 10Majavah) [18:43:03] (03CR) 10Majavah: "this is a fixed version of https://gerrit.wikimedia.org/r/c/operations/puppet/+/762886/" [puppet] - 10https://gerrit.wikimedia.org/r/762897 (owner: 10Majavah) [18:43:10] mutante: ^ here's how you do that [18:43:49] (03PS1) 10BryanDavis: toolforge: redirect legacy ru_monuments to ru-monuments [puppet] - 10https://gerrit.wikimedia.org/r/762900 (https://phabricator.wikimedia.org/T301720) [18:44:26] (03PS1) 10Jdlrobson: Revert "Add fetch tests from WVUI" [skins/Vector] (wmf/1.38.0-wmf.22) - 10https://gerrit.wikimedia.org/r/762907 [18:45:46] (03PS1) 10Ebernhardson: search-platform: Port alerts from icinga [alerts] - 10https://gerrit.wikimedia.org/r/762902 (https://phabricator.wikimedia.org/T289077) [18:46:05] (JobUnavailable) firing: (2) Reduced availability for job etherpad in eqiad - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [18:47:14] (03PS1) 10BBlack: Add lvs1019 to pybal neighbors [homer/public] - 10https://gerrit.wikimedia.org/r/762903 (https://phabricator.wikimedia.org/T301142) [18:49:09] (03PS1) 10Jdlrobson: Remove MFUseDesktopContributionsPage config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/762904 (https://phabricator.wikimedia.org/T300583) [18:49:11] (03CR) 10BryanDavis: "PCC check https://puppet-compiler.wmflabs.org/pcc-worker1002/33809/" [puppet] - 10https://gerrit.wikimedia.org/r/762900 (https://phabricator.wikimedia.org/T301720) (owner: 10BryanDavis) [18:50:07] PROBLEM - Widespread puppet agent failures on alert1001 is CRITICAL: 0.03348 ge 0.01 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [18:50:18] !log cr[12]-eqiad - edit static fallback for low-traffic (lvs1015 -> lvs1019) [18:50:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:51:05] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata for Skye Berghel - https://phabricator.wikimedia.org/T301581 (10Htriedman) Update: they should all be in the NDA and MOU document now [18:51:39] (03CR) 10BBlack: [C: 03+2] Add lvs1019 to pybal neighbors [homer/public] - 10https://gerrit.wikimedia.org/r/762903 (https://phabricator.wikimedia.org/T301142) (owner: 10BBlack) [18:52:24] ottomata: the puppet failures seems related to your commit for anaconda-wmf-base [18:52:30] dpkg: error processing archive /var/cache/apt/archives/anaconda-wmf-base_2020.02~wmf7_amd64.deb (--unpack): trying to overwrite '/usr/lib/anaconda-wmf/LICENSE.txt', which is also in package anaconda-wmf 2020.02~wmf6 [18:52:46] (03CR) 10Ladsgroup: [C: 03+2] Revert "Add fetch tests from WVUI" [skins/Vector] (wmf/1.38.0-wmf.22) - 10https://gerrit.wikimedia.org/r/762907 (owner: 
10Jdlrobson) [18:53:06] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host gerrit2002.mgmt.codfw.wmnet with reboot policy FORCED [18:53:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:53:24] (03PS1) 10Jdlrobson: Apply max width setting to all Wikisource page namespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/762928 (https://phabricator.wikimedia.org/T300563) [18:56:32] 10SRE, 10Wikimedia-Mailing-lists, 10User-Ladsgroup: Mailing list for ptwiki's checkusers - https://phabricator.wikimedia.org/T301614 (10Ladsgroup) 05Open→03Resolved Done. https://lists.wikimedia.org/postorius/lists/wikipedia-pt-checkusers.lists.wikimedia.org/ [18:58:55] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host gerrit2002.mgmt.codfw.wmnet with reboot policy FORCED [18:58:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:00:05] hashar and jeena: (Dis)respected human, time to deploy MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220215T1900). Please do the needful. 
[19:01:15] PROBLEM - Juniper alarms on mr1-ulsfo is CRITICAL: JNX_ALARMS CRITICAL - No response from remote host 198.35.26.194 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [19:02:29] RECOVERY - Juniper alarms on mr1-ulsfo is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [19:03:21] PROBLEM - pybal on lvs1019 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [19:05:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1121 (T300381)', diff saved to https://phabricator.wikimedia.org/P20831 and previous config saved to /var/cache/conftool/dbconfig/20220215-190528-marostegui.json [19:05:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:05:34] T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381 [19:06:10] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host gerrit2002.mgmt.codfw.wmnet with reboot policy FORCED [19:06:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:07:55] (03PS1) 10Ladsgroup: Revert "db1170: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/762908 [19:09:04] (03PS2) 10Ladsgroup: Revert "db1170: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/762908 [19:09:07] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] Revert "db1170: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/762908 (owner: 10Ladsgroup) [19:09:09] !log lvs1019 - start pybal/puppet with real routing, taking over low-traffic from lvs1020 [19:09:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:09:51] (03Merged) 10jenkins-bot: Revert "Add fetch tests from WVUI" [skins/Vector] (wmf/1.38.0-wmf.22) - 10https://gerrit.wikimedia.org/r/762907 (owner: 10Jdlrobson) [19:10:05] 
RECOVERY - pybal on lvs1019 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [19:12:48] !log ladsgroup@deploy1002 Synchronized php-1.38.0-wmf.22/skins/Vector: Backport: [[gerrit:762907|Revert "Add fetch tests from WVUI"]] (duration: 01m 07s) [19:12:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:13:30] btw scap deploy had this warning 19:12:25 /usr/bin/sudo -u root -- /usr/local/sbin/check-and-restart-php php7.2-fpm 100 (ran as mwdeploy@mw1410.eqiad.wmnet) returned [2]: NOT restarting php7.2-fpm: free opcache 344 MB [19:13:46] someone take a look at mw1410, they are sad [19:13:55] (03PS1) 10BBlack: Remove lvs1015 from pybal neighbors [homer/public] - 10https://gerrit.wikimedia.org/r/762930 (https://phabricator.wikimedia.org/T301142) [19:14:07] Jdlrobson: the patch is merged and deployed [19:15:46] (03PS1) 10BBlack: lvs1015: unconfigure towards spare::system [puppet] - 10https://gerrit.wikimedia.org/r/762931 (https://phabricator.wikimedia.org/T301142) [19:16:19] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [19:16:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:16:23] (03CR) 10BBlack: [C: 03+2] Remove lvs1015 from pybal neighbors [homer/public] - 10https://gerrit.wikimedia.org/r/762930 (https://phabricator.wikimedia.org/T301142) (owner: 10BBlack) [19:16:29] RECOVERY - PHP opcache health on mwdebug1001 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [19:16:30] (03CR) 10AOkoth: [C: 03+2] vrts: rename profile variables [puppet] - 10https://gerrit.wikimedia.org/r/762845 (https://phabricator.wikimedia.org/T293942) (owner: 10AOkoth) [19:19:20] (03CR) 10BBlack: [C: 03+2] lvs1015: unconfigure towards spare::system [puppet] - 10https://gerrit.wikimedia.org/r/762931 (https://phabricator.wikimedia.org/T301142) (owner: 
10BBlack) [19:20:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1121', diff saved to https://phabricator.wikimedia.org/P20832 and previous config saved to /var/cache/conftool/dbconfig/20220215-192033-marostegui.json [19:20:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:21:20] 10SRE-OnFire: 2021-11-18 codfw ipv6 network - https://phabricator.wikimedia.org/T299968 (10lmata) a:03MMandere [19:21:34] 10SRE-OnFire: 2021-11-10 cirrussearch commonsfile outage - https://phabricator.wikimedia.org/T299967 (10lmata) a:03herron [19:21:59] 10SRE, 10SRE-OnFire, 10Sustainability (Incident Followup): 2021-10-07 network provider issues causing all Wikimedia sites to be unreachable for many users - https://phabricator.wikimedia.org/T292792 (10lmata) a:03CDanis [19:22:17] 10SRE-OnFire: 2021-10-22 eqiad return path timeouts - https://phabricator.wikimedia.org/T295152 (10lmata) a:03LSobanski [19:22:34] 10SRE-OnFire: 2021-11-25 eventgate-main outage - https://phabricator.wikimedia.org/T299970 (10lmata) a:03akosiaris [19:22:49] (03PS2) 10Ebernhardson: search-platform: Port alerts from icinga [alerts] - 10https://gerrit.wikimedia.org/r/762902 (https://phabricator.wikimedia.org/T289077) [19:22:51] (03CR) 10Ebernhardson: search-platform: Port alerts from icinga (034 comments) [alerts] - 10https://gerrit.wikimedia.org/r/762902 (https://phabricator.wikimedia.org/T289077) (owner: 10Ebernhardson) [19:22:58] 10SRE-OnFire: 2021-11-23 Core Network Routing - https://phabricator.wikimedia.org/T299969 (10lmata) a:03CDanis [19:23:16] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [19:23:17] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [19:23:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:23:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:23:25] 
10SRE-OnFire: 2021-11-05 TOC language converter - https://phabricator.wikimedia.org/T299966 (10lmata) a:03lmata [19:23:38] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [19:23:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:24:17] 10SRE-OnFire: 2021-11-02 Cloud VPS networking - https://phabricator.wikimedia.org/T299964 (10lmata) a:03jcrespo [19:24:41] 10SRE-OnFire (FY2021/2022-Q2): 2021-10-22 eqiad return path timeouts - https://phabricator.wikimedia.org/T295152 (10lmata) [19:24:54] 10SRE-OnFire (FY2021/2022-Q2): 2021-11-25 eventgate-main outage - https://phabricator.wikimedia.org/T299970 (10lmata) [19:24:56] 10SRE-OnFire (FY2021/2022-Q2): 2021-11-23 Core Network Routing - https://phabricator.wikimedia.org/T299969 (10lmata) [19:25:02] 10SRE-OnFire (FY2021/2022-Q2): 2021-11-10 cirrussearch commonsfile outage - https://phabricator.wikimedia.org/T299967 (10lmata) [19:25:07] 10SRE-OnFire (FY2021/2022-Q2): 2021-11-05 TOC language converter - https://phabricator.wikimedia.org/T299966 (10lmata) [19:25:09] 10SRE-OnFire (FY2021/2022-Q2): 2021-11-18 codfw ipv6 network - https://phabricator.wikimedia.org/T299968 (10lmata) [19:25:18] 10SRE-OnFire (FY2021/2022-Q2): 2021-11-02 Cloud VPS networking - https://phabricator.wikimedia.org/T299964 (10lmata) [19:25:43] !log bblack@cumin1001 START - Cookbook sre.dns.netbox [19:25:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:25:55] 10SRE, 10SRE-OnFire (FY2021/2022-Q2), 10Sustainability (Incident Followup): 2021-10-07 network provider issues causing all Wikimedia sites to be unreachable for many users - https://phabricator.wikimedia.org/T292792 (10lmata) [19:26:40] (03PS1) 10Volans: sre.hosts.provision: check password correctness [cookbooks] - 10https://gerrit.wikimedia.org/r/762934 [19:27:00] (03PS1) 10AOkoth: vrts: rename module class variables [puppet] - 10https://gerrit.wikimedia.org/r/762935 (https://phabricator.wikimedia.org/T293942) [19:27:27] (03PS2) 
10Volans: sre.hosts.provision: check password correctness [cookbooks] - 10https://gerrit.wikimedia.org/r/762934 [19:27:36] (03CR) 10jerkins-bot: [V: 04-1] vrts: rename module class variables [puppet] - 10https://gerrit.wikimedia.org/r/762935 (https://phabricator.wikimedia.org/T293942) (owner: 10AOkoth) [19:27:54] !log bblack@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:27:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:28:02] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:28:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:28:43] (03CR) 10AOkoth: "https://puppet-compiler.wmflabs.org/pcc-worker1002/33810/" [puppet] - 10https://gerrit.wikimedia.org/r/762935 (https://phabricator.wikimedia.org/T293942) (owner: 10AOkoth) [19:29:47] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [19:29:49] (03PS3) 10Volans: sre.hosts.provision: check password correctness [cookbooks] - 10https://gerrit.wikimedia.org/r/762934 [19:29:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:30:01] (03PS2) 10AOkoth: vrts: rename module class variables [puppet] - 10https://gerrit.wikimedia.org/r/762935 (https://phabricator.wikimedia.org/T293942) [19:30:17] !log bblack@cumin1001 START - Cookbook sre.hosts.reimage for host lvs1015.eqiad.wmnet with OS buster [19:30:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:30:25] 10SRE, 10ops-eqiad, 10Traffic-Icebox, 10Patch-For-Review: Migrate lvs101[345] to lvs101[789] - https://phabricator.wikimedia.org/T301142 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bblack@cumin1001 for host lvs1015.eqiad.wmnet with OS buster [19:30:34] !log cmooney@cumin1001 START - Cookbook sre.hosts.provision for host elastic1093.mgmt.eqiad.wmnet with reboot policy FORCED [19:30:37] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:32:02] 10SRE, 10ops-eqiad, 10Traffic-Icebox, 10Patch-For-Review: Migrate lvs101[345] to lvs101[789] - https://phabricator.wikimedia.org/T301142 (10BBlack) [19:32:43] taavi: Oh, thank you. went on a break and find that:) appreciated [19:34:02] (03CR) 10BBlack: [C: 03+2] Add netflow6001 to kafka custom ferm [puppet] - 10https://gerrit.wikimedia.org/r/760613 (https://phabricator.wikimedia.org/T282787) (owner: 10BBlack) [19:34:15] (03CR) 10BBlack: [C: 03+2] Add ops-drmrs to alertmanager config [puppet] - 10https://gerrit.wikimedia.org/r/760614 (https://phabricator.wikimedia.org/T282787) (owner: 10BBlack) [19:34:33] (03CR) 10BBlack: [C: 03+2] drmrs: add vk delivery error alerting [puppet] - 10https://gerrit.wikimedia.org/r/760615 (https://phabricator.wikimedia.org/T282787) (owner: 10BBlack) [19:35:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1121', diff saved to https://phabricator.wikimedia.org/P20833 and previous config saved to /var/cache/conftool/dbconfig/20220215-193537-marostegui.json [19:35:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:36:31] (03CR) 10Dzahn: vrts: rename module class variables (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/762935 (https://phabricator.wikimedia.org/T293942) (owner: 10AOkoth) [19:37:02] (03CR) 10Dzahn: "looks good, just nitpick: please also rename the parameters in the examples/comments at the beginning of the class" [puppet] - 10https://gerrit.wikimedia.org/r/762935 (https://phabricator.wikimedia.org/T293942) (owner: 10AOkoth) [19:37:27] (03PS2) 10BBlack: Add ops-drmrs to alertmanager config [puppet] - 10https://gerrit.wikimedia.org/r/760614 (https://phabricator.wikimedia.org/T282787) [19:37:29] (03PS2) 10BBlack: drmrs: add vk delivery error alerting [puppet] - 10https://gerrit.wikimedia.org/r/760615 (https://phabricator.wikimedia.org/T282787) [19:38:13] !log beginning rolling restart of 
kafka-main clusters for updates [19:38:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:38:57] taavi: how come I am the author if you uploaded PS1 as a new patch? [19:39:30] !log cmooney@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host elastic1093.mgmt.eqiad.wmnet with reboot policy FORCED [19:39:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:39:43] (03CR) 10BBlack: [C: 03+2] Add ops-drmrs to alertmanager config [puppet] - 10https://gerrit.wikimedia.org/r/760614 (https://phabricator.wikimedia.org/T282787) (owner: 10BBlack) [19:40:52] mutante: I took your commit and just reset the change-id so that it will show up as a separate change [19:40:57] !log bblack@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on lvs1015.eqiad.wmnet with reason: host reimage [19:41:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:42:11] taavi: oh, that is unusual to me. next time, feel free to just amend directly to my change [19:42:34] comparing them now.. 
I tried different ways of quoting in different PSes already [19:42:57] ok, sure [19:43:36] !log bblack@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs1015.eqiad.wmnet with reason: host reimage [19:43:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:44:18] RECOVERY - BFD status on cr1-eqiad is OK: OK: UP: 17 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:44:37] (03Abandoned) 10Dzahn: gitlab: avoid $realm check, simplify ferm rules [puppet] - 10https://gerrit.wikimedia.org/r/762886 (owner: 10Dzahn) [19:44:44] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:44:44] mutante: the only difference is in hieradata/role/common/gitlab.yaml, which now uses an explicit lookup() for the service IPs [19:45:15] ah, I see. ACK! [19:45:23] that's the part I was missing indeed [19:45:39] though https://puppet-compiler.wmflabs.org/pcc-worker1001/33811/gitlab-prod-1001.devtools.eqiad1.wikimedia.cloud/index.html [19:46:39] ? 
it works as far as I can tell [19:48:46] (03PS3) 10BBlack: Add drmrs to smokeping config [puppet] - 10https://gerrit.wikimedia.org/r/760616 (https://phabricator.wikimedia.org/T282788) [19:49:18] it removes the "::" it was listening on before [19:49:55] and that IP change (which is probably because that is the puppet compiler IP) [19:50:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1121 (T300381)', diff saved to https://phabricator.wikimedia.org/P20834 and previous config saved to /var/cache/conftool/dbconfig/20220215-195042-marostegui.json [19:50:45] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1160.eqiad.wmnet with reason: Maintenance [19:50:47] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1160.eqiad.wmnet with reason: Maintenance [19:50:52] that's the gitlab-prod-1001 ip, it's coming from hieradata/cloud/eqiad1/devtools/common.yaml and looks correct to me [19:50:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:50:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1160 (T300381)', diff saved to https://phabricator.wikimedia.org/P20835 and previous config saved to /var/cache/conftool/dbconfig/20220215-195051-marostegui.json [19:50:54] T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381 [19:50:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:51:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:51:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:51:42] the floating IP mapping happens much closer to our network edge, a VM will never see incoming traffic for the floating IP, it'll always be mapped to its private IP [19:51:45] well, I did not want to change anything about the rules with this change [19:52:00] !log bblack@cumin1001 END (PASS) - Cookbook 
sre.hosts.reimage (exit_code=0) for host lvs1015.eqiad.wmnet with OS buster [19:52:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:52:05] that is related to other changes by J.elto [19:52:06] 10SRE, 10ops-eqiad, 10Traffic-Icebox, 10Patch-For-Review: Migrate lvs101[345] to lvs101[789] - https://phabricator.wikimedia.org/T301142 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bblack@cumin1001 for host lvs1015.eqiad.wmnet with OS buster completed: - lvs1015 (**PASS**) - Do... [19:54:37] 10SRE, 10ops-eqiad, 10Traffic-Icebox, 10Patch-For-Review: Migrate lvs101[345] to lvs101[789] - https://phabricator.wikimedia.org/T301142 (10BBlack) 05Open→03Resolved [19:55:23] (03CR) 10BBlack: [C: 03+2] "Added hosts as suggested" [puppet] - 10https://gerrit.wikimedia.org/r/760616 (https://phabricator.wikimedia.org/T282788) (owner: 10BBlack) [19:55:45] (03CR) 10BBlack: [C: 03+2] smokeping: monitor eqsin switch [puppet] - 10https://gerrit.wikimedia.org/r/760617 (https://phabricator.wikimedia.org/T186650) (owner: 10BBlack) [19:56:44] (03PS2) 10BBlack: smokeping: monitor eqsin switch [puppet] - 10https://gerrit.wikimedia.org/r/760617 (https://phabricator.wikimedia.org/T186650) [19:57:15] will discuss it tomorrow in gitlab meeting [19:58:22] RECOVERY - Widespread puppet agent failures on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.005397 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [20:03:13] (03PS1) 10BBlack: Fix section label in smokeping config [puppet] - 10https://gerrit.wikimedia.org/r/762941 (https://phabricator.wikimedia.org/T301142) [20:03:55] (03CR) 10BBlack: [C: 03+2] Fix section label in smokeping config [puppet] - 10https://gerrit.wikimedia.org/r/762941 (https://phabricator.wikimedia.org/T301142) (owner: 10BBlack) [20:05:20] (03PS3) 10BBlack: smokeping: monitor eqsin switch [puppet] - 10https://gerrit.wikimedia.org/r/760617 
(https://phabricator.wikimedia.org/T186650) [20:10:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1160 (T300381)', diff saved to https://phabricator.wikimedia.org/P20836 and previous config saved to /var/cache/conftool/dbconfig/20220215-201025-marostegui.json [20:10:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:10:31] T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381 [20:18:12] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata for Skye Berghel - https://phabricator.wikimedia.org/T301581 (10Ottomata) BTW, the posix group needed is `analytics-privatedata-users`. [20:19:24] PROBLEM - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [20:25:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1160', diff saved to https://phabricator.wikimedia.org/P20837 and previous config saved to /var/cache/conftool/dbconfig/20220215-202530-marostegui.json [20:25:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:34:32] (03CR) 10Elukey: ml-services: add arwiki & bnwiki editquality isvcs (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/762533 (https://phabricator.wikimedia.org/T301415) (owner: 10Accraze) [20:39:19] 10SRE, 10Traffic, 10Upstream: Problem loading thumbnail images due to Envoy (HTTP/1.0 clients getting '426 Upgrade Required') - https://phabricator.wikimedia.org/T300366 (10TheDJ) Upstream made some changes, but it seems there are some post-merge concerns that came up in https://github.com/envoyproxy/envoy/p... 
[20:40:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1160', diff saved to https://phabricator.wikimedia.org/P20838 and previous config saved to /var/cache/conftool/dbconfig/20220215-204035-marostegui.json [20:40:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:43:41] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops, 10Kubernetes: (Need By: TBD) rack/setup/install kubernetes20[19|2(012)] - https://phabricator.wikimedia.org/T299470 (10Papaul) [20:45:44] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [20:46:08] RECOVERY - Check for large files in client bucket on mwmaint1002 is OK: OK: client bucket file ok https://wikitech.wikimedia.org/wiki/Puppet%23check_client_bucket_large_file [20:52:08] Beta cluster seems to be extremely slow at the moment -- anyone else seeing that? [20:54:06] Kemayo: yea, https://fa.wikipedia.beta.wmflabs.org/ is not really loading for me. it worked yesterday [20:54:36] It was *slow* earlier today, but now it's moved on to outright not-loading. 
[20:55:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1160 (T300381)', diff saved to https://phabricator.wikimedia.org/P20840 and previous config saved to /var/cache/conftool/dbconfig/20220215-205539-marostegui.json [20:55:41] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1149.eqiad.wmnet with reason: Maintenance [20:55:43] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1149.eqiad.wmnet with reason: Maintenance [20:55:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:55:46] T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381 [20:55:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1149 (T300381)', diff saved to https://phabricator.wikimedia.org/P20841 and previous config saved to /var/cache/conftool/dbconfig/20220215-205547-marostegui.json [20:55:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:55:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:55:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:59:18] 10SRE, 10ops-codfw, 10DC-Ops, 10Platform Engineering, 10RESTBase: Q3:(Need By: TBD) rack/setup/install restbase-dev200[123].codfw.wmnet - https://phabricator.wikimedia.org/T299437 (10Papaul) [21:00:05] Lucas_WMDE and Urbanecm: How many deployers does it take to do UTC late backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220215T2100). [21:00:05] jdlrobson: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [21:00:14] i can deploy today! [21:00:22] hello Jdlrobson, around? 
[21:01:21] urbanecm: hello [21:01:32] let's start then :) [21:01:51] Jdlrobson: do the config patches depend on the backport, please? [21:02:02] Ah looks like the backport already got deployed [21:02:06] so we can just do the config patches [21:02:10] both can be done at the same time [21:02:16] they are both unrelated. [21:02:28] ah [21:02:35] that makes it easier/faster :) [21:02:42] Kemayo: someone is looking and killing/restarting php/fpm on a beta host [21:02:51] (03CR) 10Urbanecm: [C: 03+2] Remove MFUseDesktopContributionsPage config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/762904 (https://phabricator.wikimedia.org/T300583) (owner: 10Jdlrobson) [21:02:57] Kemayo: also see channel -releng [21:03:05] mutante: Thanks! [21:04:02] (03Merged) 10jenkins-bot: Remove MFUseDesktopContributionsPage config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/762904 (https://phabricator.wikimedia.org/T300583) (owner: 10Jdlrobson) [21:04:42] (03CR) 10Urbanecm: [C: 04-1] Apply max width setting to all Wikisource page namespaces (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/762928 (https://phabricator.wikimedia.org/T300563) (owner: 10Jdlrobson) [21:04:55] Jdlrobson: pulled to mwdebug1001 for testing [21:04:57] can you have a look? [21:05:01] urbanecm: on it [21:05:32] urbanecm: don't understand your comment on the above though [21:05:38] the comment is: Virtual namespaces; don't appear in the page database [21:05:48] NS_WIKISOURCE_PAGES is a made up constant [21:05:56] but the namespace numbers do appear in the database [21:05:58] it's not a real namespace, so this seemed like the right place to put it [21:06:06] Oh I see, so new section? 
[21:06:16] You can sync the MFUseDesktopContributionsPage one [21:06:20] ack [21:07:30] ok fixed :) [21:07:37] (03PS2) 10Jdlrobson: Apply max width setting to all Wikisource page namespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/762928 (https://phabricator.wikimedia.org/T300563) [21:07:40] thanks for reading my code :) [21:08:40] Jdlrobson: actually, do you mind explaining in a bit more detail what the patch's intentions are? [21:08:47] (the wikisource one) [21:08:50] the other one i'm syncing [21:09:06] https://it.wikisource.org/wiki/Pagina:Rusconi_-_Teatro_completo_di_Shakspeare,_1858,_I-II.djvu/655?useskin=vector-2022 [21:09:18] The namespace of this page is different from the namespace of https://en.wikisource.org/wiki/Page:Popular_Science_Monthly_Volume_58.djvu/540 [21:09:27] https://en.wikisource.org/wiki/Page:Popular_Science_Monthly_Volume_58.djvu/540?useskin=vector-2022 [21:09:37] you'll notice the max width doesn't apply on the enwikisource one [21:09:43] but does on itwikisource [21:10:05] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: d97b43ea0428621c6fd9352af9840e0db4545c08: Remove MFUseDesktopContributionsPage config (T300583) (duration: 00m 52s) [21:10:08] In Italian the namespace is 108 [21:10:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:10:11] T300583: Remove MFUseDesktopContributionsPage config - https://phabricator.wikimedia.org/T300583 [21:10:22] 104 in Italian [21:10:37] I figured since the namespace IDs are unique across all wikis it would be easier to group them.
[21:10:47] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [21:10:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:11:06] If you see phab:T74525 the eventual goal is for all the namespace IDs to be the same [21:11:06] T74525: harmonize Wikisource namespaces used by the ProofreadPage extension - https://phabricator.wikimedia.org/T74525 [21:11:37] Jdlrobson: i see. So it's meant to represent IDs used at ProofreadPage at the various wikis? [21:11:45] https://gerrit.wikimedia.org/g/mediawiki/extensions/ProofreadPage/+/83574cb266e87be2a9f5d24dc8900cdb94126dcb/includes/ProofreadPageLuaLibrary.php#52 [21:11:47] *used by [21:11:48] yeh [21:11:56] this would be NS_PAGE if it had a consistent value [21:12:00] got it [21:12:22] Given the nature of the config, this seemed to be the cleanest way to do it [21:12:29] I'd rather not repeat the configuration for all those wikis. [21:12:58] thanks for the explanations Jdlrobson -- they were helpful for me to understand the patch. [21:13:06] however... [21:13:28] ...English Wikisource does have namespaces with IDs listed in your NS_WIKISOURCE_PAGES constant, that are _not_ ProofreadPage related [21:13:45] it does? o_o [21:13:47] which ones? [21:14:55] (I'm not seeing them) [21:15:08] oh Author = 102 [21:15:12] according to https://en.wikisource.org/w/api.php?action=query&meta=siteinfo&siprop=namespaces, it has 100 (Portal) and 102 (Author) [21:15:16] is that what you are referring to? 
[21:15:19] correct [21:15:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149 (T300381)', diff saved to https://phabricator.wikimedia.org/P20842 and previous config saved to /var/cache/conftool/dbconfig/20220215-211519-marostegui.json [21:15:21] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [21:15:22] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [21:15:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:15:25] T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381 [21:15:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:15:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:16:18] ok.. so that's annoying [21:16:28] sorry about that :) [21:16:34] Let me see if I can rework this patch without creating a bunch of unreadable code in the next 10 mins [21:16:36] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [21:16:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:17:13] Jdlrobson: the cleanest approach would probably be to re-use wgProofreadPageNamespaceIds somehow [21:17:20] yeh that's what i'm thinking [21:17:22] and CommonSettings [21:17:24] yeah [21:17:58] I can do 'wikisource' => $wmgWikiSourceMaxWidthOptions, right? [21:18:08] e.g. 
InitialiseSettings can access variables defined in CommonSettings [21:18:28] that's not possible :( [21:18:40] is.php is evaluated before (most of) CS.php is [21:19:11] ack [21:19:14] not a problem [21:24:22] (03PS3) 10Jdlrobson: Apply max width setting to all Wikisource page namespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/762928 (https://phabricator.wikimedia.org/T300563) [21:24:24] Okay urbanecm would this work?
^ [21:24:29] Jdlrobson: let me see [21:26:42] (03PS4) 10Jdlrobson: Apply max width setting to all Wikisource page namespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/762928 (https://phabricator.wikimedia.org/T300563) [21:26:44] Jdlrobson: I don't see wmgProofreadPageNamespaceIds defined anywhere. Did you mean wgProofreadPageNamespaceIds? [21:27:05] Sorry my bad it's wgProofreadPageNamespaceIds yes [21:27:20] (03PS5) 10Jdlrobson: Apply max width setting to all Wikisource page namespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/762928 (https://phabricator.wikimedia.org/T300563) [21:29:03] looks good to me now [21:29:06] let's see if it works 🙂 [21:29:10] urbanecm: fingers crossed [21:29:22] thanks for the sanity checking, you likely saved us a lot of time here. [21:29:26] (03PS1) 10PipelineBot: blubberoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/762948 [21:29:29] (03CR) 10Urbanecm: [C: 03+2] "LGTM. Let's see if this works! Deploying." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/762928 (https://phabricator.wikimedia.org/T300563) (owner: 10Jdlrobson) [21:30:06] happy to help :) [21:30:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149', diff saved to https://phabricator.wikimedia.org/P20843 and previous config saved to /var/cache/conftool/dbconfig/20220215-213024-marostegui.json [21:30:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:31:09] (03Merged) 10jenkins-bot: Apply max width setting to all Wikisource page namespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/762928 (https://phabricator.wikimedia.org/T300563) (owner: 10Jdlrobson) [21:31:43] Jdlrobson: pulled to mwdebug1001. Can you test please?
[21:31:48] urbanecm: on it [21:31:56] (03CR) 10Jeena Huneidi: [C: 03+2] blubberoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/762948 (owner: 10PipelineBot) [21:33:10] urbanecm: yep that's working for me [21:33:12] yay! [21:33:30] wonderful :) [21:33:58] syncing [21:35:35] (03Merged) 10jenkins-bot: blubberoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/762948 (owner: 10PipelineBot) [21:36:03] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: b3e8161445d4f778cab8cbabe709f9583ac62df2: Apply max width setting to all Wikisource page namespaces (T300563; 1/2) (duration: 00m 50s) [21:36:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:36:08] T300563: Wikisource projects use different namespaces for page, so max width applying where it shouldn't - https://phabricator.wikimedia.org/T300563 [21:36:46] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [21:36:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:36:52] !log urbanecm@deploy1002 Synchronized wmf-config/CommonSettings.php: b3e8161445d4f778cab8cbabe709f9583ac62df2: Apply max width setting to all Wikisource page namespaces (T300563; 2/2) (duration: 00m 49s) [21:36:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:37:07] Jdlrobson: should be live now. Anything else I can do for you today? [21:37:28] urbanecm: that's everything. Thanks a bunch for all your help today!
[21:37:32] you definitely went above and beyond :) [21:38:00] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [21:38:01] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [21:38:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:38:04] :) [21:38:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:38:44] (03PS2) 10Urbanecm: amiwiki: Deploy Growth features to newcomers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/758930 [21:38:47] (03CR) 10Urbanecm: [C: 03+2] amiwiki: Deploy Growth features to newcomers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/758930 (owner: 10Urbanecm) [21:39:51] (03Merged) 10jenkins-bot: amiwiki: Deploy Growth features to newcomers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/758930 (owner: 10Urbanecm) [21:41:01] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 2e0b51f6c314bfd685f79544c6cb2260feb380a0: amiwiki: Deploy Growth features to newcomers (duration: 00m 49s) [21:41:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:41:13] !log UTC late B&C window completed [21:41:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:41:58] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [21:42:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:45:05] (03PS3) 10Jdlrobson: [wmf-config]: Deploy the fawiki test safety survey to production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/762881 (https://phabricator.wikimedia.org/T297629) (owner: 10Eigyan) [21:45:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149', diff saved to https://phabricator.wikimedia.org/P20844 and previous config saved to 
/var/cache/conftool/dbconfig/20220215-214529-marostegui.json [21:45:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:45:47] (03PS1) 10Lucas Werkmeister: Fix parsing release in service.template [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/762949 [21:46:08] 10SRE, 10Traffic-Icebox, 10WMF-General-or-Unknown, 10User-DannyS712, 10affects-Kiwix-and-openZIM: Pages whose title ends with semicolon (;) are intermittently inaccessible (likely due to ATS) - https://phabricator.wikimedia.org/T238285 (10Aklapper) [21:46:27] (03CR) 10Lucas Werkmeister: "Tested in the wd-shex-infer tool by copying /usr/local/bin/webservice into the tool’s home directory and making the change there; it seems" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/762949 (owner: 10Lucas Werkmeister) [21:47:05] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [21:47:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:48:19] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [21:48:20] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [21:48:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:48:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:49:26] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [21:49:29] (03PS1) 10Andrew Bogott: profile::wmcs::nfs::standalone: use host-prefix for the service name [puppet] - 10https://gerrit.wikimedia.org/r/762951 (https://phabricator.wikimedia.org/T293800) [21:49:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:54:11] (03CR) 10Andrew Bogott: [C: 03+2] profile::wmcs::nfs::standalone: use host-prefix for the service name [puppet] - 
10https://gerrit.wikimedia.org/r/762951 (https://phabricator.wikimedia.org/T293800) (owner: 10Andrew Bogott) [21:55:54] (03CR) 10BryanDavis: [C: 03+2] Fix parsing release in service.template [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/762949 (owner: 10Lucas Werkmeister) [21:56:46] (03Merged) 10jenkins-bot: Fix parsing release in service.template [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/762949 (owner: 10Lucas Werkmeister) [21:58:21] (03PS4) 10Eigyan: [wmf-config]: Deploy the fawiki test safety survey to production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/762881 (https://phabricator.wikimedia.org/T297629) [21:58:34] (03CR) 10Jdlrobson: [C: 04-1] "When backporting, you should be able to test the survey is working on http://fa.wikipedia.org/?quicksurvey=internal-gdi-safety-survey" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/762881 (https://phabricator.wikimedia.org/T297629) (owner: 10Eigyan) [21:58:36] (03PS5) 10Eigyan: [wmf-config]: Deploy the fawiki test safety survey to production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/762881 (https://phabricator.wikimedia.org/T297629) [22:00:29] !log Updated the Wikidata property suggester with data from the 2022-02-07 JSON dump (with pre-applied T132839 workarounds) [22:00:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:00:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149 (T300381)', diff saved to https://phabricator.wikimedia.org/P20845 and previous config saved to /var/cache/conftool/dbconfig/20220215-220034-marostegui.json [22:00:35] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1148.eqiad.wmnet with reason: Maintenance [22:00:36] T132839: [RfC] Property suggester suggests human properties for non-human items - https://phabricator.wikimedia.org/T132839 [22:00:37] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) 
for 6:00:00 on db1148.eqiad.wmnet with reason: Maintenance [22:00:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:00:40] T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381 [22:00:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1148 (T300381)', diff saved to https://phabricator.wikimedia.org/P20846 and previous config saved to /var/cache/conftool/dbconfig/20220215-220041-marostegui.json [22:00:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:00:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:00:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:02:44] (03CR) 10Volans: "Some final touches/questions inline" [software/spicerack] - 10https://gerrit.wikimedia.org/r/747116 (https://phabricator.wikimedia.org/T229397) (owner: 10Jbond) [22:04:11] 10SRE, 10Traffic-Icebox, 10WMF-General-or-Unknown, 10User-DannyS712, 10affects-Kiwix-and-openZIM: Pages whose title ends with semicolon (;) are intermittently inaccessible (likely due to ATS) - https://phabricator.wikimedia.org/T238285 (10Pigsonthewing) >>! In T238285#7542940, @geraki wrote: > Note: Page... [22:04:53] ACKNOWLEDGEMENT - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP Cathal Mooney Due to Telxius transport to drmrs down. - The acknowledgement expires at: 2022-02-16 22:04:23. https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [22:05:38] ACKNOWLEDGEMENT - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 1 Cathal Mooney Due to Telxius transport to drmrs down. - The acknowledgement expires at: 2022-02-16 22:05:23. 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [22:10:15] (03CR) 10Urbanecm: [wmf-config]: Deploy the fawiki test safety survey to production (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/762881 (https://phabricator.wikimedia.org/T297629) (owner: 10Eigyan) [22:14:54] (03PS1) 10BryanDavis: d/changelog: Prepare for 0.81 release [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/762956 [22:16:06] (03Abandoned) 10BryanDavis: blubberoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/758973 (owner: 10PipelineBot) [22:19:22] (03CR) 10BryanDavis: [C: 03+2] d/changelog: Prepare for 0.81 release [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/762956 (owner: 10BryanDavis) [22:19:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148 (T300381)', diff saved to https://phabricator.wikimedia.org/P20847 and previous config saved to /var/cache/conftool/dbconfig/20220215-221940-marostegui.json [22:19:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:19:46] T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381 [22:20:41] (03Merged) 10jenkins-bot: d/changelog: Prepare for 0.81 release [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/762956 (owner: 10BryanDavis) [22:21:25] (03PS1) 10Razzi: analytics_cluster::datahub::opensearch: start of puppet role [puppet] - 10https://gerrit.wikimedia.org/r/762957 (https://phabricator.wikimedia.org/T301382) [22:21:30] !log jhuneidi@deploy1002 helmfile [staging] START helmfile.d/services/blubberoid: apply on staging [22:21:33] !log jhuneidi@deploy1002 helmfile [staging] DONE helmfile.d/services/blubberoid: apply on production [22:21:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:21:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:22:03] (03CR) 10jerkins-bot: 
[V: 04-1] analytics_cluster::datahub::opensearch: start of puppet role [puppet] - 10https://gerrit.wikimedia.org/r/762957 (https://phabricator.wikimedia.org/T301382) (owner: 10Razzi) [22:22:08] (03PS2) 10Razzi: analytics_cluster::datahub::opensearch: start of puppet role [puppet] - 10https://gerrit.wikimedia.org/r/762957 (https://phabricator.wikimedia.org/T301382) [22:22:52] (03PS3) 10Dduvall: aptrepo: add docker packages to thirdparty/ci for buster [puppet] - 10https://gerrit.wikimedia.org/r/758986 (https://phabricator.wikimedia.org/T300682) [22:22:54] (03PS6) 10Dduvall: contint: Install docker 20.10 from thirdparty/ci on buster [puppet] - 10https://gerrit.wikimedia.org/r/758987 (https://phabricator.wikimedia.org/T300682) [22:22:56] (03CR) 10Dduvall: aptrepo: add docker packages to thirdparty/ci for buster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/758986 (https://phabricator.wikimedia.org/T300682) (owner: 10Dduvall) [22:22:58] (03CR) 10jerkins-bot: [V: 04-1] analytics_cluster::datahub::opensearch: start of puppet role [puppet] - 10https://gerrit.wikimedia.org/r/762957 (https://phabricator.wikimedia.org/T301382) (owner: 10Razzi) [22:23:00] (03PS1) 10Andrew Bogott: nfs cookbooks: Better support for arbitrary prefixes and volume names [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/762958 (https://phabricator.wikimedia.org/T293800) [22:23:25] (03PS3) 10Razzi: analytics_cluster::datahub::opensearch: start of puppet role [puppet] - 10https://gerrit.wikimedia.org/r/762957 (https://phabricator.wikimedia.org/T301382) [22:23:31] !log jhuneidi@deploy1002 helmfile [staging] START helmfile.d/services/blubberoid: apply on staging [22:23:34] !log jhuneidi@deploy1002 helmfile [staging] DONE helmfile.d/services/blubberoid: apply on production [22:23:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:23:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:24:02] !log jhuneidi@deploy1002 helmfile 
[staging] DONE helmfile.d/services/blubberoid: sync on staging [22:24:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:25:56] !log jhuneidi@deploy1002 helmfile [codfw] START helmfile.d/services/blubberoid: apply on production [22:26:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:26:00] !log jhuneidi@deploy1002 helmfile [codfw] DONE helmfile.d/services/blubberoid: apply on staging [22:26:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:26:43] !log jhuneidi@deploy1002 helmfile [codfw] DONE helmfile.d/services/blubberoid: sync on production [22:26:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:27:30] !log jhuneidi@deploy1002 helmfile [eqiad] START helmfile.d/services/blubberoid: apply on production [22:27:33] !log jhuneidi@deploy1002 helmfile [eqiad] DONE helmfile.d/services/blubberoid: apply on staging [22:27:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:27:34] (03CR) 10Andrew Bogott: [C: 03+2] nfs cookbooks: Better support for arbitrary prefixes and volume names [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/762958 (https://phabricator.wikimedia.org/T293800) (owner: 10Andrew Bogott) [22:27:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:28:16] !log jhuneidi@deploy1002 helmfile [eqiad] DONE helmfile.d/services/blubberoid: sync on production [22:28:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:30:47] bd808: Is there a toolhub patchset we can run a test on? [22:31:01] sorry meant to use the releng channel [22:31:28] jeena: toolhub is all hacked around, but I triggered a recheck on https://gerrit.wikimedia.org/r/c/wikimedia/developer-portal/+/762555 [22:31:38] 👍 [22:32:06] and it worked!
[22:32:32] hooray [22:34:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148', diff saved to https://phabricator.wikimedia.org/P20848 and previous config saved to /var/cache/conftool/dbconfig/20220215-223445-marostegui.json [22:34:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:39:20] (03PS1) 10BryanDavis: toolforge: update default aptly-host for wmcs-package-build.py [puppet] - 10https://gerrit.wikimedia.org/r/762961 [22:48:00] (JobUnavailable) firing: (2) Reduced availability for job etherpad in eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org [22:49:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148', diff saved to https://phabricator.wikimedia.org/P20849 and previous config saved to /var/cache/conftool/dbconfig/20220215-224950-marostegui.json [22:49:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:55:05] 10SRE, 10Observability-Alerting: rancid causes puppet to flap on netmon1002 - https://phabricator.wikimedia.org/T211459 (10Dzahn) 05Open→03Resolved a:03Dzahn {F34952841} ^ it stopped changing :) Calling this resolved.
[22:55:52] !log Removing 5 files for legal compliance [22:55:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:04:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148 (T300381)', diff saved to https://phabricator.wikimedia.org/P20850 and previous config saved to /var/cache/conftool/dbconfig/20220215-230454-marostegui.json [23:04:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:05:01] T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381 [23:10:54] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [23:10:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:11:30] 10SRE, 10ops-codfw, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install contint2002, gerrit2002 - https://phabricator.wikimedia.org/T299575 (10Papaul) [23:13:01] 10SRE, 10ops-codfw, 10DC-Ops, 10Discovery-Search (Current work): Q3:(Need By: TBD) rack/setup/install elastic20[73-86] - https://phabricator.wikimedia.org/T299608 (10Papaul) [23:14:15] !log Removing one file for legal compliance [23:14:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:15:24] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [23:15:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:16:42] PROBLEM - SSH on analytics1063.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:19:09] (03PS2) 10Andrew Bogott: nfs cookbooks: Better support for arbitrary prefixes and volume names [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/762958 (https://phabricator.wikimedia.org/T293800) [23:22:02] 10SRE, 10ops-codfw, 10DC-Ops, 10Platform Engineering, 10RESTBase: Q3:(Need By: TBD) rack/setup/install restbase-dev200[123].codfw.wmnet - https://phabricator.wikimedia.org/T299437 (10Papaul) 
[23:22:55] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host restbase-dev2001.mgmt.codfw.wmnet with reboot policy FORCED [23:22:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:30:17] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host restbase-dev2001.mgmt.codfw.wmnet with reboot policy FORCED [23:30:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:30:38] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host restbase-dev2002.mgmt.codfw.wmnet with reboot policy FORCED [23:30:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:37:54] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host restbase-dev2002.mgmt.codfw.wmnet with reboot policy FORCED [23:37:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:40:21] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host restbase-dev2003.mgmt.codfw.wmnet with reboot policy FORCED [23:40:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:47:44] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host restbase-dev2003.mgmt.codfw.wmnet with reboot policy FORCED [23:47:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:48:49] 10SRE, 10ops-codfw, 10DC-Ops, 10Platform Engineering, 10RESTBase: Q3:(Need By: TBD) rack/setup/install restbase-dev200[123].codfw.wmnet - https://phabricator.wikimedia.org/T299437 (10Papaul)