[00:03:00] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1133.eqiad.wmnet with reason: Maintenance
[00:03:14] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1133.eqiad.wmnet with reason: Maintenance
[00:06:51] PROBLEM - Check systemd state on ms-be1035 is CRITICAL: CRITICAL - degraded: The following units failed: swift-drive-audit.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:07:03] PROBLEM - SSH on wtp1044.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[00:19:29] (PS1) Krinkle: mediawiki.base: Restore and document importScript "once" behaviour [core] (wmf/1.39.0-wmf.27) - https://gerrit.wikimedia.org/r/829772
[00:20:23] PROBLEM - Swift https frontend on ms-fe1012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.256 second response time https://wikitech.wikimedia.org/wiki/Swift
[00:25:15] RECOVERY - Swift https frontend on ms-fe1012 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.009 second response time https://wikitech.wikimedia.org/wiki/Swift
[00:32:39] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[00:47:10] (PS3) Tim Starling: Multi-DC stage 3: send 2% of traffic to appservers-ro [puppet] - https://gerrit.wikimedia.org/r/827616 (https://phabricator.wikimedia.org/T279664)
[00:51:35] (CR) Tim Starling: [C: +2] Multi-DC stage 3: send 2% of traffic to appservers-ro [puppet] - https://gerrit.wikimedia.org/r/827616 (https://phabricator.wikimedia.org/T279664) (owner: Tim Starling)
[01:01:09] RECOVERY - Check systemd state on ms-be1035 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:03:39] !log multi-DC stage 3: 2% of codfw/ulsfo/eqsin traffic going to codfw appservers, rolling out via puppet 00:54-01:24
[01:03:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:37:45] (JobUnavailable) firing: (7) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:42:45] (JobUnavailable) firing: (8) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:46:49] PROBLEM - Swift https frontend on ms-fe1012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.216 second response time https://wikitech.wikimedia.org/wiki/Swift
[01:47:08] SRE, MediaWiki-Core-HTTP-Cache, MediaWiki-Page-history, Performance-Team, and 2 others: History pages' caches not being invalidated after edits - https://phabricator.wikimedia.org/T317064 (Krinkle)
[01:47:45] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:49:09] RECOVERY - Swift https frontend on ms-fe1012 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.009 second response time https://wikitech.wikimedia.org/wiki/Swift
[01:51:23] PROBLEM - cassandra-c service on restbase1026 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[01:51:27] PROBLEM - cassandra-c SSL 10.64.48.182:7001 on restbase1026 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[01:52:45] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:52:53] PROBLEM - Check systemd state on restbase1026 is CRITICAL: CRITICAL - degraded: The following units failed: cassandra-c.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:52:55] PROBLEM - cassandra-c CQL 10.64.48.182:9042 on restbase1026 is CRITICAL: connect to address 10.64.48.182 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886
[01:58:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316 (T314041)', diff saved to https://phabricator.wikimedia.org/P33812 and previous config saved to /var/cache/conftool/dbconfig/20220906-015812-ladsgroup.json
[01:58:15] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041
[02:00:05] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220906T0200)
[02:07:38] (PS1) TrainBranchBot: Branch commit for wmf/1.39.0-wmf.28 [core] (wmf/1.39.0-wmf.28) - https://gerrit.wikimedia.org/r/829875 (https://phabricator.wikimedia.org/T314189)
[02:07:44] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/1.39.0-wmf.28 [core] (wmf/1.39.0-wmf.28) - https://gerrit.wikimedia.org/r/829875 (https://phabricator.wikimedia.org/T314189) (owner: TrainBranchBot)
[02:07:45] (JobUnavailable) resolved: (5) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:07:45] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[02:10:07] RECOVERY - Check systemd state on restbase1026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:10:25] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[02:10:26] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[02:11:05] RECOVERY - cassandra-c service on restbase1026 is OK: OK - cassandra-c is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[02:11:09] RECOVERY - cassandra-c SSL 10.64.48.182:7001 on restbase1026 is OK: SSL OK - Certificate restbase1026-c valid until 2023-04-14 11:21:30 +0000 (expires in 220 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[02:11:24] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[02:12:37] RECOVERY - cassandra-c CQL 10.64.48.182:9042 on restbase1026 is OK: TCP OK - 0.000 second response time on 10.64.48.182 port 9042 https://phabricator.wikimedia.org/T93886
[02:13:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316', diff saved to https://phabricator.wikimedia.org/P33813 and previous config saved to /var/cache/conftool/dbconfig/20220906-021318-ladsgroup.json
[02:14:11] SRE, SRE-swift-storage, Performance-Team, Traffic, and 3 others: Progressive Multi-DC roll out - https://phabricator.wikimedia.org/T279664 (tstarling) > Observe cross-DC database connection rate, analyse sources It's not necessary to use tcpdump since we can just look at SSL connection counts. I...
[02:23:17] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:26:08] (Merged) jenkins-bot: Branch commit for wmf/1.39.0-wmf.28 [core] (wmf/1.39.0-wmf.28) - https://gerrit.wikimedia.org/r/829875 (https://phabricator.wikimedia.org/T314189) (owner: TrainBranchBot)
[02:27:15] (PS1) Jforrester: ExtensionDistributor: Add REL1_39 [mediawiki-config] - https://gerrit.wikimedia.org/r/829877 (https://phabricator.wikimedia.org/T313925)
[02:28:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316', diff saved to https://phabricator.wikimedia.org/P33814 and previous config saved to /var/cache/conftool/dbconfig/20220906-022824-ladsgroup.json
[02:31:40] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[02:32:19] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[02:32:20] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[02:33:03] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[02:43:13] (CR) Andrew Bogott: Add clean-stale-puppet-certs script (1 comment) [puppet] - https://gerrit.wikimedia.org/r/829321 (owner: Andrew Bogott)
[02:43:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316 (T314041)', diff saved to https://phabricator.wikimedia.org/P33815 and previous config saved to /var/cache/conftool/dbconfig/20220906-024330-ladsgroup.json
[02:43:32] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1113.eqiad.wmnet with reason: Maintenance
[02:43:34] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041
[02:43:46] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1113.eqiad.wmnet with reason: Maintenance
[02:43:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1113:3316 (T314041)', diff saved to https://phabricator.wikimedia.org/P33816 and previous config saved to /var/cache/conftool/dbconfig/20220906-024351-ladsgroup.json
[02:45:29] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:58:56] (PS3) Andrew Bogott: Add clean-stale-puppet-certs script [puppet] - https://gerrit.wikimedia.org/r/829321
[03:00:05] Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220906T0300)
[03:01:15] (PS1) TrainBranchBot: testwikis wikis to 1.39.0-wmf.28 [mediawiki-config] - https://gerrit.wikimedia.org/r/829879 (https://phabricator.wikimedia.org/T314189)
[03:01:17] (CR) TrainBranchBot: [C: +2] testwikis wikis to 1.39.0-wmf.28 [mediawiki-config] - https://gerrit.wikimedia.org/r/829879 (https://phabricator.wikimedia.org/T314189) (owner: TrainBranchBot)
[03:02:08] (Merged) jenkins-bot: testwikis wikis to 1.39.0-wmf.28 [mediawiki-config] - https://gerrit.wikimedia.org/r/829879 (https://phabricator.wikimedia.org/T314189) (owner: TrainBranchBot)
[03:02:19] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:02:32] !log mwpresync@deploy1002 Started scap: testwikis wikis to 1.39.0-wmf.28 refs T314189
[03:02:35] T314189: 1.39.0-wmf.28 deployment blockers - https://phabricator.wikimedia.org/T314189
[03:03:31] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[03:06:10] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[03:06:11] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[03:08:40] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[03:13:44] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[03:14:39] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[03:14:40] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[03:15:37] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[03:15:56] SRE, SRE-swift-storage, Performance-Team, Traffic, and 3 others: Progressive Multi-DC roll out - https://phabricator.wikimedia.org/T279664 (tstarling) I made this [[https://grafana-rw.wikimedia.org/d/6fLyZKG4k/all-clusters-utilization|all clusters utilization]] dashboard so that I could easily se...
[03:17:47] SRE, SRE-swift-storage, Performance-Team, Traffic, and 3 others: Progressive Multi-DC roll out - https://phabricator.wikimedia.org/T279664 (tstarling)
[03:18:44] (PS3) Tim Starling: Multi-DC stage 4: send all traffic to appservers-ro [puppet] - https://gerrit.wikimedia.org/r/827617 (https://phabricator.wikimedia.org/T279664)
[03:23:57] (CR) Tim Starling: [C: +2] Multi-DC stage 4: send all traffic to appservers-ro [puppet] - https://gerrit.wikimedia.org/r/827617 (https://phabricator.wikimedia.org/T279664) (owner: Tim Starling)
[03:26:12] !log multi-DC stage 4: all traffic to appservers-ro, rolling out via puppet 03:24-03:54
[03:26:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:38:49] !log mwpresync@deploy1002 Finished scap: testwikis wikis to 1.39.0-wmf.28 refs T314189 (duration: 36m 17s)
[03:38:52] T314189: 1.39.0-wmf.28 deployment blockers - https://phabricator.wikimedia.org/T314189
[03:40:57] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[03:44:17] SRE, MediaWiki-Core-HTTP-Cache, MediaWiki-Page-history, Performance-Team, and 2 others: History pages' caches not being invalidated after edits - https://phabricator.wikimedia.org/T317064 (ori) > I suspect this is fallout from the URL query sorting change (cc @ori) not invalidating the cache of h...
[03:44:23] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: train-presync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:47:54] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[03:47:55] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[03:54:42] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[03:55:29] SRE, SRE-swift-storage, Performance-Team, Traffic, and 3 others: Progressive Multi-DC roll out - https://phabricator.wikimedia.org/T279664 (tstarling) MySQL cross-DC traffic is higher than expected, with 110 conns/s. Appserver CPU usage is fine. Mcrouter connection rates are fine.
[04:24:24] SRE, SRE-swift-storage, Performance-Team, Traffic, and 3 others: Progressive Multi-DC roll out - https://phabricator.wikimedia.org/T279664 (tstarling)
[04:29:45] SRE, SRE-swift-storage, Performance-Team, Traffic, and 3 others: Progressive Multi-DC roll out - https://phabricator.wikimedia.org/T279664 (tstarling) I captured cross-DC queries on the s3 master (db1157) using SHOW PROCESSLIST in a loop, once per second for 20 minutes. Out of 10 captured queries...
[04:32:24] SRE, SRE-swift-storage, Performance-Team, Traffic, and 3 others: Progressive Multi-DC roll out - https://phabricator.wikimedia.org/T279664 (tstarling) The serverIsReadOnly() cache key includes the DB hostname, so I should have done my calculation per section rather than globally.
[04:32:39] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[04:37:45] SRE, SRE-swift-storage, Performance-Team, Traffic, and 3 others: Progressive Multi-DC roll out - https://phabricator.wikimedia.org/T279664 (tstarling) | section | Cross-DC connection rate (req/s) | |--|--| | es4 | 0.00 | | es5 | 0.00 | | s1 | 9.21 | | s2 | 19.7 | | s3 | 53.2 | | s4 | 7.02 | | s5...
[04:40:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316 (T314041)', diff saved to https://phabricator.wikimedia.org/P33819 and previous config saved to /var/cache/conftool/dbconfig/20220906-044029-ladsgroup.json
[04:40:32] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041
[04:55:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316', diff saved to https://phabricator.wikimedia.org/P33820 and previous config saved to /var/cache/conftool/dbconfig/20220906-045535-ladsgroup.json
[05:00:29] RECOVERY - MegaRAID on an-worker1079 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[05:04:45] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[05:05:05] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on 36 hosts with reason: Primary switchover s1 T316623
[05:05:08] T316623: Switchover s1 master (db1118 -> db1163) - https://phabricator.wikimedia.org/T316623
[05:05:30] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 36 hosts with reason: Primary switchover s1 T316623
[05:06:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Set db1163 with weight 0 T316623', diff saved to https://phabricator.wikimedia.org/P33821 and previous config saved to /var/cache/conftool/dbconfig/20220906-050610-ladsgroup.json
[05:10:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316', diff saved to https://phabricator.wikimedia.org/P33822 and previous config saved to /var/cache/conftool/dbconfig/20220906-051041-ladsgroup.json
[05:11:22] SRE, ops-eqiad, DBA, DC-Ops: Q4:(Need By: TBD) rack/setup/install db1196.eqiad.wmnet - db1203.eqiad.wmnet - https://phabricator.wikimedia.org/T306848 (Marostegui) Thanks Papaul and John! I can reach all the hosts just fine
[05:12:34] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 10 hosts with reason: Primary switchover x1 T316745
[05:12:36] T316745: Switchover x1 master (db1103 -> db1120) - https://phabricator.wikimedia.org/T316745
[05:12:53] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 10 hosts with reason: Primary switchover x1 T316745
[05:13:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Set db1120 with weight 0 T316745', diff saved to https://phabricator.wikimedia.org/P33823 and previous config saved to /var/cache/conftool/dbconfig/20220906-051304-root.json
[05:13:49] (PS2) Marostegui: wmnet: Update x1-master alias [dns] - https://gerrit.wikimedia.org/r/828469 (https://phabricator.wikimedia.org/T316745) (owner: Gerrit maintenance bot)
[05:13:54] (PS2) Marostegui: mariadb: Promote db1120 to x1 master [puppet] - https://gerrit.wikimedia.org/r/828468 (https://phabricator.wikimedia.org/T316745) (owner: Gerrit maintenance bot)
[05:22:23] (CR) Marostegui: [C: +2] mariadb: Promote db1120 to x1 master [puppet] - https://gerrit.wikimedia.org/r/828468 (https://phabricator.wikimedia.org/T316745) (owner: Gerrit maintenance bot)
[05:25:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316 (T314041)', diff saved to https://phabricator.wikimedia.org/P33824 and previous config saved to /var/cache/conftool/dbconfig/20220906-052547-ladsgroup.json
[05:25:49] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1098.eqiad.wmnet with reason: Maintenance
[05:25:51] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041
[05:26:03] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1098.eqiad.wmnet with reason: Maintenance
[05:26:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1098:3316 (T314041)', diff saved to https://phabricator.wikimedia.org/P33825 and previous config saved to /var/cache/conftool/dbconfig/20220906-052609-ladsgroup.json
[05:30:18] (PS1) Marostegui: instances.yaml: Add db1107 to dbctl [puppet] - https://gerrit.wikimedia.org/r/830051 (https://phabricator.wikimedia.org/T316870)
[05:31:06] (CR) Marostegui: [C: +2] instances.yaml: Add db1107 to dbctl [puppet] - https://gerrit.wikimedia.org/r/830051 (https://phabricator.wikimedia.org/T316870) (owner: Marostegui)
[05:32:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Add db1107 to dbctl depooled T316870', diff saved to https://phabricator.wikimedia.org/P33826 and previous config saved to /var/cache/conftool/dbconfig/20220906-053238-marostegui.json
[05:32:43] T316870: Move db1107 to s1 - https://phabricator.wikimedia.org/T316870
[05:34:55] PROBLEM - MegaRAID on an-worker1079 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[05:40:56] (PS2) Ladsgroup: mariadb: Promote db1163 to s1 master [puppet] - https://gerrit.wikimedia.org/r/827862 (https://phabricator.wikimedia.org/T316623) (owner: Gerrit maintenance bot)
[05:41:17] PROBLEM - Swift https frontend on ms-fe1012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.155 second response time https://wikitech.wikimedia.org/wiki/Swift
[05:41:18] (PS1) Marostegui: x2 replicas: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/830052 (https://phabricator.wikimedia.org/T316847)
[05:41:30] (CR) Ladsgroup: [C: +2] mariadb: Promote db1163 to s1 master [puppet] - https://gerrit.wikimedia.org/r/827862 (https://phabricator.wikimedia.org/T316623) (owner: Gerrit maintenance bot)
[05:42:04] (CR) Marostegui: [C: +2] x2 replicas: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/830052 (https://phabricator.wikimedia.org/T316847) (owner: Marostegui)
[05:42:33] (CR) Ladsgroup: [V: +2 C: +2] mariadb: Promote db1163 to s1 master [puppet] - https://gerrit.wikimedia.org/r/827862 (https://phabricator.wikimedia.org/T316623) (owner: Gerrit maintenance bot)
[05:43:37] RECOVERY - Swift https frontend on ms-fe1012 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.007 second response time https://wikitech.wikimedia.org/wiki/Swift
[05:44:09] (CR) Ayounsi: [V: +2 C: +2] Add mock peeringdb token [labs/private] - https://gerrit.wikimedia.org/r/819568 (owner: Ayounsi)
[05:53:39] PROBLEM - Swift https frontend on ms-fe1012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.072 second response time https://wikitech.wikimedia.org/wiki/Swift
[05:56:01] RECOVERY - Swift https frontend on ms-fe1012 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.008 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:00:04] kormat, marostegui, and Amir1: My dear minions, it's time we take the moon! Just kidding. Time for Primary database switchover deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220906T0600).
[06:00:12] o/
[06:00:14] o/
[06:00:24] !log Starting s1 eqiad failover from db1118 to db1163 - T316623
[06:00:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:00:27] T316623: Switchover s1 master (db1118 -> db1163) - https://phabricator.wikimedia.org/T316623
[06:00:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Set s1 eqiad as read-only for maintenance - T316623', diff saved to https://phabricator.wikimedia.org/P33827 and previous config saved to /var/cache/conftool/dbconfig/20220906-060032-ladsgroup.json
[06:00:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Promote db1163 to s1 primary and set section read-write T316623', diff saved to https://phabricator.wikimedia.org/P33828 and previous config saved to /var/cache/conftool/dbconfig/20220906-060055-ladsgroup.json
[06:01:18] I see recentchanges
[06:02:51] PROBLEM - Swift https backend on ms-fe1012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.219 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:02:57] (PS2) Ladsgroup: wmnet: Update s1-master alias [dns] - https://gerrit.wikimedia.org/r/827863 (https://phabricator.wikimedia.org/T316623) (owner: Gerrit maintenance bot)
[06:03:03] (CR) Ladsgroup: [V: +2 C: +2] wmnet: Update s1-master alias [dns] - https://gerrit.wikimedia.org/r/827863 (https://phabricator.wikimedia.org/T316623) (owner: Gerrit maintenance bot)
[06:03:45] RECOVERY - Swift https backend on ms-fe1012 is OK: HTTP OK: HTTP/1.1 200 OK - 451 bytes in 0.133 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:04:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depool db1118 T316623', diff saved to https://phabricator.wikimedia.org/P33829 and previous config saved to /var/cache/conftool/dbconfig/20220906-060418-ladsgroup.json
[06:05:11] Amir1: can I go?
[06:05:15] sure
[06:05:19] the floor is yours
[06:05:23] !log Starting x1 eqiad failover from db1103 to db1120 - T316745
[06:05:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:05:25] T316745: Switchover x1 master (db1103 -> db1120) - https://phabricator.wikimedia.org/T316745
[06:06:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'Promote db1120 to x1 primary T316745', diff saved to https://phabricator.wikimedia.org/P33830 and previous config saved to /var/cache/conftool/dbconfig/20220906-060602-root.json
[06:06:28] Amir1: can you generate a write in codfw?
[06:06:51] I don't think it's possible to generate write there
[06:06:54] (CR) Marostegui: [C: +2] wmnet: Update x1-master alias [dns] - https://gerrit.wikimedia.org/r/828469 (https://phabricator.wikimedia.org/T316745) (owner: Gerrit maintenance bot)
[06:06:58] you mean x1 in eqiad?
[06:06:59] (PS3) Marostegui: wmnet: Update x1-master alias [dns] - https://gerrit.wikimedia.org/r/828469 (https://phabricator.wikimedia.org/T316745) (owner: Gerrit maintenance bot)
[06:07:04] Amir1: sorry yes
[06:07:09] (CR) Marostegui: [V: +2] wmnet: Update x1-master alias [dns] - https://gerrit.wikimedia.org/r/828469 (https://phabricator.wikimedia.org/T316745) (owner: Gerrit maintenance bot)
[06:07:12] yeah, create a short url
[06:07:28] https://w.wiki/5fiE
[06:07:31] yup, it's working
[06:07:34] \o/ thanks
[06:08:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1103 T316745', diff saved to https://phabricator.wikimedia.org/P33831 and previous config saved to /var/cache/conftool/dbconfig/20220906-060815-root.json
[06:08:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Give some weight to current x1 eqiad master', diff saved to https://phabricator.wikimedia.org/P33832 and previous config saved to /var/cache/conftool/dbconfig/20220906-060833-root.json
[06:10:40] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1118.eqiad.wmnet with reason: Maintenance
[06:10:43] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1118.eqiad.wmnet with reason: Maintenance
[06:10:47] PROBLEM - etcd request latencies on kubemaster2001 is CRITICAL: instance=10.192.0.56 operation=create https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[06:11:02] (CR) Elukey: [C: +1] "Checked the istio-cni config on the worker nodes, it looks good. Should be fine to merge and deploy!" [deployment-charts] - https://gerrit.wikimedia.org/r/829822 (https://phabricator.wikimedia.org/T310175) (owner: Btullis)
[06:11:18] (PS1) Marostegui: db1103: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/830054
[06:11:42] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1134.eqiad.wmnet with reason: Maintenance
[06:11:44] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1134.eqiad.wmnet with reason: Maintenance
[06:11:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1134 (T312863)', diff saved to https://phabricator.wikimedia.org/P33833 and previous config saved to /var/cache/conftool/dbconfig/20220906-061150-ladsgroup.json
[06:11:53] T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863
[06:12:19] RECOVERY - etcd request latencies on kubemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28
[06:12:40] (CR) Marostegui: [C: +2] db1103: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/830054 (owner: Marostegui)
[06:14:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 1%: Repooling again', diff saved to https://phabricator.wikimedia.org/P33835 and previous config saved to /var/cache/conftool/dbconfig/20220906-061419-root.json
[06:14:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1143 (re)pooling @ 1%: Repooling again', diff saved to https://phabricator.wikimedia.org/P33836 and previous config saved to /var/cache/conftool/dbconfig/20220906-061434-root.json
[06:14:49] RECOVERY - SSH on wtp1044.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:15:01] SRE, Infrastructure-Foundations, netops: Upgrade core routers to Junos 21+ - https://phabricator.wikimedia.org/T295690 (ayounsi)
[06:15:24] !log ayounsi@cumin1001 START - Cookbook sre.network.prepare-upgrade
[06:15:35] (PS1) Marostegui: db1132,db1143: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/830055 (https://phabricator.wikimedia.org/T311106)
[06:15:49] SRE-OnFire, DBA, Patch-For-Review, Sustainability (Incident Followup): Investigate mariadb 10.6 performance regression during spikes/high load - https://phabricator.wikimedia.org/T311106 (Marostegui) I am repooling db1132 and db1143.
[06:16:20] (03CR) 10Marostegui: [C: 03+2] db1132,db1143: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/830055 (https://phabricator.wikimedia.org/T311106) (owner: 10Marostegui) [06:29:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 2%: Repooling again', diff saved to https://phabricator.wikimedia.org/P33837 and previous config saved to /var/cache/conftool/dbconfig/20220906-062924-root.json [06:29:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1143 (re)pooling @ 2%: Repooling again', diff saved to https://phabricator.wikimedia.org/P33838 and previous config saved to /var/cache/conftool/dbconfig/20220906-062940-root.json [06:33:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1188 T316342', diff saved to https://phabricator.wikimedia.org/P33839 and previous config saved to /var/cache/conftool/dbconfig/20220906-063322-root.json [06:33:25] T316342: Productionize db1196-db1203 - https://phabricator.wikimedia.org/T316342 [06:37:08] (03PS1) 10Marostegui: mariadb: Productionize db1196 [puppet] - 10https://gerrit.wikimedia.org/r/830057 (https://phabricator.wikimedia.org/T316342) [06:38:01] (03CR) 10Marostegui: [C: 03+2] mariadb: Productionize db1196 [puppet] - 10https://gerrit.wikimedia.org/r/830057 (https://phabricator.wikimedia.org/T316342) (owner: 10Marostegui) [06:40:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1189 T316342', diff saved to https://phabricator.wikimedia.org/P33841 and previous config saved to /var/cache/conftool/dbconfig/20220906-064021-root.json [06:40:26] T316342: Productionize db1196-db1203 - https://phabricator.wikimedia.org/T316342 [06:41:00] (03CR) 10Slyngshede: [C: 03+2] Systemd timer: Cleanup a few dangling absent cronjob references. 
[puppet] - https://gerrit.wikimedia.org/r/829557 (https://phabricator.wikimedia.org/T273673) (owner: Slyngshede) [06:44:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 3%: Repooling again', diff saved to https://phabricator.wikimedia.org/P33843 and previous config saved to /var/cache/conftool/dbconfig/20220906-064429-root.json [06:44:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1143 (re)pooling @ 3%: Repooling again', diff saved to https://phabricator.wikimedia.org/P33844 and previous config saved to /var/cache/conftool/dbconfig/20220906-064445-root.json [06:52:28] (CR) Jcrespo: "I would sqash this and:" [software/pampinus] - https://gerrit.wikimedia.org/r/829858 (owner: Ladsgroup) [06:53:41] !log ayounsi@cumin1001 START - Cookbook sre.network.prepare-upgrade [06:59:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 4%: Repooling again', diff saved to https://phabricator.wikimedia.org/P33845 and previous config saved to /var/cache/conftool/dbconfig/20220906-065934-root.json [06:59:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1143 (re)pooling @ 4%: Repooling again', diff saved to https://phabricator.wikimedia.org/P33846 and previous config saved to /var/cache/conftool/dbconfig/20220906-065950-root.json [07:00:05] Amir1 and Urbanecm: That opportune time is upon us again. Time for a UTC morning backport window deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220906T0700). [07:00:05] _joe_: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:01:36] (PS5) Giuseppe Lavagetto: Move 1 of 6 users to php 7.4 [mediawiki-config] - https://gerrit.wikimedia.org/r/823679 (https://phabricator.wikimedia.org/T271736) [07:02:16] (CR) Jcrespo: "I'd say either going for #ac6600 or making the text black for that shade of yellow." [software/pampinus] - https://gerrit.wikimedia.org/r/829858 (owner: Ladsgroup) [07:04:37] (CR) Giuseppe Lavagetto: [C: +2] Move 1 of 6 users to php 7.4 [mediawiki-config] - https://gerrit.wikimedia.org/r/823679 (https://phabricator.wikimedia.org/T271736) (owner: Giuseppe Lavagetto) [07:05:24] (Merged) jenkins-bot: Move 1 of 6 users to php 7.4 [mediawiki-config] - https://gerrit.wikimedia.org/r/823679 (https://phabricator.wikimedia.org/T271736) (owner: Giuseppe Lavagetto) [07:06:11] (PS1) Marostegui: install_server: Do not reimage db1196 [puppet] - https://gerrit.wikimedia.org/r/830060 [07:07:09] (CR) Marostegui: [C: +2] install_server: Do not reimage db1196 [puppet] - https://gerrit.wikimedia.org/r/830060 (owner: Marostegui) [07:08:19] (CR) Filippo Giunchedi: "Good catch Majavah!
LGTM, and indeed it should work" [puppet] - https://gerrit.wikimedia.org/r/829756 (https://phabricator.wikimedia.org/T316982) (owner: David Caro) [07:10:44] (CR) David Caro: p::wmcs:prometheus: Add cloudvps federation job (1 comment) [puppet] - https://gerrit.wikimedia.org/r/829756 (https://phabricator.wikimedia.org/T316982) (owner: David Caro) [07:11:31] (CR) David Caro: [C: +2] p::metricsinfra:haproxy: move to epp template [puppet] - https://gerrit.wikimedia.org/r/829743 (owner: David Caro) [07:11:48] !log oblivian@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:823679|Move 1 of 6 users to php 7.4 (T271736)]] (duration: 04m 06s) [07:11:52] T271736: Migrate WMF production from PHP 7.2 to PHP 7.4 - https://phabricator.wikimedia.org/T271736 [07:12:08] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [07:13:09] PROBLEM - SSH on mw1326.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [07:14:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 5%: Repooling again', diff saved to https://phabricator.wikimedia.org/P33847 and previous config saved to /var/cache/conftool/dbconfig/20220906-071438-root.json [07:14:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1143 (re)pooling @ 5%: Repooling again', diff saved to https://phabricator.wikimedia.org/P33848 and previous config saved to /var/cache/conftool/dbconfig/20220906-071455-root.json [07:18:28] (PS4) Samtar: CommonSettings: Load Phonos extension [mediawiki-config] - https://gerrit.wikimedia.org/r/824294 (https://phabricator.wikimedia.org/T314294) [07:19:01] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [07:19:02] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [07:25:43] !log mwdebug-deploy@deploy1002 helmfile [codfw]
DONE helmfile.d/services/mwdebug: apply [07:26:11] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.prepare-upgrade (exit_code=0) [07:29:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 10%: Repooling again', diff saved to https://phabricator.wikimedia.org/P33849 and previous config saved to /var/cache/conftool/dbconfig/20220906-072943-root.json [07:30:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1143 (re)pooling @ 10%: Repooling again', diff saved to https://phabricator.wikimedia.org/P33850 and previous config saved to /var/cache/conftool/dbconfig/20220906-072959-root.json [07:34:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1130 T316342', diff saved to https://phabricator.wikimedia.org/P33851 and previous config saved to /var/cache/conftool/dbconfig/20220906-073434-root.json [07:34:37] T316342: Productionize db1196-db1203 - https://phabricator.wikimedia.org/T316342 [07:38:55] (PS1) Marostegui: mariadb: Productionize db1197 [puppet] - https://gerrit.wikimedia.org/r/830064 (https://phabricator.wikimedia.org/T316342) [07:39:41] (CR) Marostegui: [C: +2] mariadb: Productionize db1197 [puppet] - https://gerrit.wikimedia.org/r/830064 (https://phabricator.wikimedia.org/T316342) (owner: Marostegui) [07:39:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316 (T314041)', diff saved to https://phabricator.wikimedia.org/P33853 and previous config saved to /var/cache/conftool/dbconfig/20220906-073948-ladsgroup.json [07:39:52] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [07:40:29] (PS1) David Caro: git-sync-upstream: apply automated black and isort [puppet] - https://gerrit.wikimedia.org/r/830082 (https://phabricator.wikimedia.org/T317071) [07:40:31] (PS1) David Caro: git-sync-upstream: add a log for the repo being rebased [puppet] - https://gerrit.wikimedia.org/r/830083
(https://phabricator.wikimedia.org/T317071) [07:41:55] (CR) CI reject: [V: -1] git-sync-upstream: add a log for the repo being rebased [puppet] - https://gerrit.wikimedia.org/r/830083 (https://phabricator.wikimedia.org/T317071) (owner: David Caro) [07:44:40] (PS1) Ayounsi: Depool ulsfo for routers ugprades [dns] - https://gerrit.wikimedia.org/r/830085 (https://phabricator.wikimedia.org/T295690) [07:44:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 25%: Repooling again', diff saved to https://phabricator.wikimedia.org/P33854 and previous config saved to /var/cache/conftool/dbconfig/20220906-074448-root.json [07:45:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1143 (re)pooling @ 25%: Repooling again', diff saved to https://phabricator.wikimedia.org/P33855 and previous config saved to /var/cache/conftool/dbconfig/20220906-074504-root.json [07:47:01] (PS2) David Caro: git-sync-upstream: add a log for the repo being rebased [puppet] - https://gerrit.wikimedia.org/r/830083 (https://phabricator.wikimedia.org/T317071) [07:48:15] (CR) Filippo Giunchedi: [C: +1] Depool ulsfo for routers ugprades [dns] - https://gerrit.wikimedia.org/r/830085 (https://phabricator.wikimedia.org/T295690) (owner: Ayounsi) [07:49:24] (PS1) Marostegui: db1107: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/830087 (https://phabricator.wikimedia.org/T316870) [07:49:31] (CR) David Caro: [C: +2] build: use the standard path to get the docker binary [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/829811 (owner: David Caro) [07:49:33] (CR) David Caro: [C: +2] bullseye0: Improve the install-packages script [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/829218 (https://phabricator.wikimedia.org/T316854) (owner: David Caro) [07:49:36] (CR) David Caro: [C: +2] Remove buster0 buildpacks images (1 comment) [docker-images/toollabs-images] -
https://gerrit.wikimedia.org/r/829116 (owner: David Caro) [07:49:39] (CR) David Caro: [C: +2] bullseye0: Add bullseye buildpack build/run images [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/829031 (https://phabricator.wikimedia.org/T316854) (owner: David Caro) [07:49:59] SRE, MediaWiki-Core-HTTP-Cache, MediaWiki-Page-history, Performance-Team, and 2 others: History pages' caches not being invalidated after edits - https://phabricator.wikimedia.org/T317064 (Vgutierrez) yes, your assessment is right @ori, query parameters are sorted before triggering the purge [07:50:05] (CR) Marostegui: [C: +2] db1107: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/830087 (https://phabricator.wikimedia.org/T316870) (owner: Marostegui) [07:50:35] (Merged) jenkins-bot: bullseye0: Add bullseye buildpack build/run images [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/829031 (https://phabricator.wikimedia.org/T316854) (owner: David Caro) [07:50:37] (Merged) jenkins-bot: Remove buster0 buildpacks images [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/829116 (owner: David Caro) [07:50:39] (Merged) jenkins-bot: bullseye0: Improve the install-packages script [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/829218 (https://phabricator.wikimedia.org/T316854) (owner: David Caro) [07:51:00] (Merged) jenkins-bot: build: use the standard path to get the docker binary [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/829811 (owner: David Caro) [07:51:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1107 (re)pooling @ 1%: Pooling for the first time', diff saved to https://phabricator.wikimedia.org/P33856 and previous config saved to /var/cache/conftool/dbconfig/20220906-075113-root.json [07:52:02] (CR) Filippo Giunchedi: [C: +1] p::wmcs:prometheus: Add cloudvps federation job (1 comment) [puppet] -
https://gerrit.wikimedia.org/r/829756 (https://phabricator.wikimedia.org/T316982) (owner: David Caro) [07:52:07] (CR) Ayounsi: [C: +2] Depool ulsfo for routers ugprades [dns] - https://gerrit.wikimedia.org/r/830085 (https://phabricator.wikimedia.org/T295690) (owner: Ayounsi) [07:52:36] !log depool ulsfo for routers upgrade - T295690 [07:52:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:52:39] T295690: Upgrade core routers to Junos 21+ - https://phabricator.wikimedia.org/T295690 [07:54:39] (PS1) Marostegui: db1103: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/830088 [07:54:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316', diff saved to https://phabricator.wikimedia.org/P33857 and previous config saved to /var/cache/conftool/dbconfig/20220906-075455-ladsgroup.json [07:57:00] (CR) Marostegui: [C: +2] db1103: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/830088 (owner: Marostegui) [07:57:41] SRE, MediaWiki-Core-HTTP-Cache, MediaWiki-Page-history, Performance-Team, and 2 others: History pages' caches not being invalidated after edits - https://phabricator.wikimedia.org/T317064 (Vgutierrez) @ori, my current theory (and it needs to be tested) is that varnish frontend purges the history...
[07:58:06] SRE-tools, Infrastructure-Foundations: sre.hosts.downtime: add network devices support - https://phabricator.wikimedia.org/T317082 (ayounsi) [07:58:12] !log jnuche@deploy1002 Pruned MediaWiki: 1.39.0-wmf.24, 1.39.0-wmf.26 (duration: 02m 48s) [07:58:51] !log ayounsi@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cr3-ulsfo.wikimedia.org with reason: router upgrade [07:58:54] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cr3-ulsfo.wikimedia.org with reason: router upgrade [07:59:02] SRE, Infrastructure-Foundations, netops, Patch-For-Review: Upgrade core routers to Junos 21+ - https://phabricator.wikimedia.org/T295690 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=7eb8120c-f8b6-4c79-8deb-b18a305a2353) set by ayounsi@cumin1001 for 2:00:00 on 1 host(s) and th... [07:59:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 50%: Repooling again', diff saved to https://phabricator.wikimedia.org/P33858 and previous config saved to /var/cache/conftool/dbconfig/20220906-075953-root.json [08:00:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1143 (re)pooling @ 50%: Repooling again', diff saved to https://phabricator.wikimedia.org/P33859 and previous config saved to /var/cache/conftool/dbconfig/20220906-080009-root.json [08:00:24] (PS1) Marostegui: Revert "x1: Change binlog format to STATEMENT" [puppet] - https://gerrit.wikimedia.org/r/829781 [08:00:42] SRE-tools, Infrastructure-Foundations: sre.hosts.downtime: add network devices support - https://phabricator.wikimedia.org/T317082 (ayounsi) Nevermind, this does seem to work: ` cumin1001:~$ sudo cookbook sre.hosts.downtime -r 'router upgrade' -t T295690 -H 2 D{cr3-ulsfo.wikimedia.org} START - Cookbook s...
[08:01:02] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [08:01:45] (CR) CI reject: [V: -1] Revert "x1: Change binlog format to STATEMENT" [puppet] - https://gerrit.wikimedia.org/r/829781 (owner: Marostegui) [08:02:57] !log Set x1 back to binlog_format=ROW [08:02:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:03:22] (PS1) TrainBranchBot: group0 wikis to 1.39.0-wmf.28 [mediawiki-config] - https://gerrit.wikimedia.org/r/830091 (https://phabricator.wikimedia.org/T314189) [08:03:24] (CR) TrainBranchBot: [C: +2] group0 wikis to 1.39.0-wmf.28 [mediawiki-config] - https://gerrit.wikimedia.org/r/830091 (https://phabricator.wikimedia.org/T314189) (owner: TrainBranchBot) [08:03:30] SRE-tools, Infrastructure-Foundations: sre.hosts.downtime: add network devices support - https://phabricator.wikimedia.org/T317082 (Volans) @ayounsi you have to use `--force` (see `--help`) and can pass a `NodeSet`-accepted syntax of hostnames as they are in Icinga, like: ` cr[3-4]-ulsfo,cr[2-3]-ulsfo IP...
[08:04:18] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.prepare-upgrade (exit_code=0) [08:04:55] (PS1) Marostegui: x1: Change binlog format to ROW [puppet] - https://gerrit.wikimedia.org/r/830092 [08:05:00] (Merged) jenkins-bot: group0 wikis to 1.39.0-wmf.28 [mediawiki-config] - https://gerrit.wikimedia.org/r/830091 (https://phabricator.wikimedia.org/T314189) (owner: TrainBranchBot) [08:05:07] (Abandoned) Marostegui: Revert "x1: Change binlog format to STATEMENT" [puppet] - https://gerrit.wikimedia.org/r/829781 (owner: Marostegui) [08:05:42] (CR) Marostegui: [C: +2] x1: Change binlog format to ROW [puppet] - https://gerrit.wikimedia.org/r/830092 (owner: Marostegui) [08:06:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1103 (re)pooling @ 1%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P33860 and previous config saved to /var/cache/conftool/dbconfig/20220906-080609-root.json [08:06:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1107 (re)pooling @ 2%: Pooling for the first time', diff saved to https://phabricator.wikimedia.org/P33861 and previous config saved to /var/cache/conftool/dbconfig/20220906-080618-root.json [08:08:00] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [08:08:01] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [08:09:25] !log jnuche@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.39.0-wmf.28 refs T314189 [08:09:28] T314189: 1.39.0-wmf.28 deployment blockers - https://phabricator.wikimedia.org/T314189 [08:09:40] (CR) JMeybohm: [V: +2 C: +2] helm-state-metrics: Update to v0.1.2 [docker-images/production-images] - https://gerrit.wikimedia.org/r/829197 (owner: JMeybohm) [08:10:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316', diff saved to https://phabricator.wikimedia.org/P33862 and previous config
saved to /var/cache/conftool/dbconfig/20220906-081001-ladsgroup.json [08:10:08] SRE-tools, Infrastructure-Foundations: sre.hosts.downtime: add network devices support - https://phabricator.wikimedia.org/T317082 (Volans) From a dry-run test it seems that it should work despite the space. [08:10:14] (CR) JMeybohm: [C: +2] helm-state-metrics: Update to v0.1.2 [deployment-charts] - https://gerrit.wikimedia.org/r/829198 (owner: JMeybohm) [08:10:44] (PS8) JMeybohm: sre.k8s.pool-depool-cluster: Add new cookbook [cookbooks] - https://gerrit.wikimedia.org/r/824485 (https://phabricator.wikimedia.org/T260663) [08:11:40] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [08:13:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1119 (re)pooling @ 1%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P33863 and previous config saved to /var/cache/conftool/dbconfig/20220906-081336-root.json [08:13:50] (CR) Volans: [C: +2] "LGTM" [software/spicerack] - https://gerrit.wikimedia.org/r/820806 (https://phabricator.wikimedia.org/T310745) (owner: Ayounsi) [08:14:08] (Merged) jenkins-bot: helm-state-metrics: Update to v0.1.2 [deployment-charts] - https://gerrit.wikimedia.org/r/829198 (owner: JMeybohm) [08:14:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1174 (re)pooling @ 1%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P33864 and previous config saved to /var/cache/conftool/dbconfig/20220906-081434-root.json [08:14:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 75%: Repooling again', diff saved to https://phabricator.wikimedia.org/P33865 and previous config saved to /var/cache/conftool/dbconfig/20220906-081458-root.json [08:15:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1143 (re)pooling @ 75%: Repooling again', diff saved to https://phabricator.wikimedia.org/P33866 and previous config saved to
/var/cache/conftool/dbconfig/20220906-081514-root.json [08:16:43] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [08:17:10] (PS1) Marostegui: db2096: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/830094 [08:17:44] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [08:17:46] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [08:18:12] (PS5) David Caro: wmcs.openstack.quota_increase: allow all known quota types [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/825736 (https://phabricator.wikimedia.org/T315961) [08:18:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1188 (re)pooling @ 1%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P33867 and previous config saved to /var/cache/conftool/dbconfig/20220906-081819-root.json [08:18:21] (CR) Marostegui: [C: +2] db2096: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/830094 (owner: Marostegui) [08:18:41] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [08:20:48] (PS1) Marostegui: mariadb: Productionize db1200 [puppet] - https://gerrit.wikimedia.org/r/830095 (https://phabricator.wikimedia.org/T316342) [08:21:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1103 (re)pooling @ 2%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P33868 and previous config saved to /var/cache/conftool/dbconfig/20220906-082114-root.json [08:21:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1107 (re)pooling @ 3%: Pooling for the first time', diff saved to https://phabricator.wikimedia.org/P33869 and previous config saved to /var/cache/conftool/dbconfig/20220906-082122-root.json [08:21:46] (CR) Marostegui: [C: +2] mariadb: Productionize db1200 [puppet] - https://gerrit.wikimedia.org/r/830095 (https://phabricator.wikimedia.org/T316342) (owner:
Marostegui) [08:22:28] (Merged) jenkins-bot: Bump pynetbox to ~= 6.6 [software/spicerack] - https://gerrit.wikimedia.org/r/820806 (https://phabricator.wikimedia.org/T310745) (owner: Ayounsi) [08:22:32] (CR) Ladsgroup: "feel free to amend, squash, do whatever you like with it!" [software/pampinus] - https://gerrit.wikimedia.org/r/829858 (owner: Ladsgroup) [08:23:59] !log ayounsi@cumin1001 START - Cookbook sre.network.cf [08:24:00] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.cf (exit_code=0) [08:25:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316 (T314041)', diff saved to https://phabricator.wikimedia.org/P33870 and previous config saved to /var/cache/conftool/dbconfig/20220906-082507-ladsgroup.json [08:25:09] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance [08:25:10] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [08:25:12] (PS6) David Caro: wmcs.openstack.quota_increase: allow all known quota types [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/825736 (https://phabricator.wikimedia.org/T315961) [08:25:14] (PS1) David Caro: wmcs: use double slash for multilne command help [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/830096 [08:25:23] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance [08:26:48] (PS3) Volans: Enable pynetbox threading [software/spicerack] - https://gerrit.wikimedia.org/r/822339 (https://phabricator.wikimedia.org/T311486) (owner: Ayounsi) [08:26:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1130 (re)pooling @ 1%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P33871 and previous config saved to /var/cache/conftool/dbconfig/20220906-082653-root.json [08:27:21] (CR) David Caro:
[C: +2] wmcs.openstack.quota_increase: allow all known quota types [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/825736 (https://phabricator.wikimedia.org/T315961) (owner: David Caro) [08:28:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1119 (re)pooling @ 2%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P33872 and previous config saved to /var/cache/conftool/dbconfig/20220906-082841-root.json [08:29:05] (CR) CI reject: [V: -1] wmcs: use double slash for multilne command help [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/830096 (owner: David Caro) [08:29:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1174 (re)pooling @ 2%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P33873 and previous config saved to /var/cache/conftool/dbconfig/20220906-082939-root.json [08:29:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1138 T316342', diff saved to https://phabricator.wikimedia.org/P33874 and previous config saved to /var/cache/conftool/dbconfig/20220906-082954-root.json [08:29:56] T316342: Productionize db1196-db1203 - https://phabricator.wikimedia.org/T316342 [08:30:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1132 (re)pooling @ 100%: Repooling again', diff saved to https://phabricator.wikimedia.org/P33875 and previous config saved to /var/cache/conftool/dbconfig/20220906-083002-root.json [08:30:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1143 (re)pooling @ 100%: Repooling again', diff saved to https://phabricator.wikimedia.org/P33876 and previous config saved to /var/cache/conftool/dbconfig/20220906-083019-root.json [08:32:01] (Merged) jenkins-bot: wmcs.openstack.quota_increase: allow all known quota types [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/825736 (https://phabricator.wikimedia.org/T315961) (owner: David Caro) [08:32:04] (CR) Btullis: [C: +2] Label the eight dse-k8s-worker nodes (1 comment) [puppet]
- https://gerrit.wikimedia.org/r/828052 (https://phabricator.wikimedia.org/T310177) (owner: Btullis) [08:32:39] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [08:33:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1188 (re)pooling @ 2%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P33878 and previous config saved to /var/cache/conftool/dbconfig/20220906-083324-root.json [08:34:04] PROBLEM - Swift https backend on ms-fe1012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.136 second response time https://wikitech.wikimedia.org/wiki/Swift [08:34:18] (CR) Btullis: [C: +2] Add an istio custom deploy configuration for dse-k8s [deployment-charts] - https://gerrit.wikimedia.org/r/829822 (https://phabricator.wikimedia.org/T310175) (owner: Btullis) [08:34:33] (PS1) Marostegui: mariadb: Productionize db1198 [puppet] - https://gerrit.wikimedia.org/r/830097 (https://phabricator.wikimedia.org/T316342) [08:35:00] RECOVERY - Swift https backend on ms-fe1012 is OK: HTTP OK: HTTP/1.1 200 OK - 451 bytes in 0.089 second response time https://wikitech.wikimedia.org/wiki/Swift [08:35:29] (CR) Marostegui: [C: +2] mariadb: Productionize db1198 [puppet] - https://gerrit.wikimedia.org/r/830097 (https://phabricator.wikimedia.org/T316342) (owner: Marostegui) [08:36:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1103 (re)pooling @ 3%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P33879 and previous config saved to /var/cache/conftool/dbconfig/20220906-083619-root.json [08:36:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1107 (re)pooling @ 4%: Pooling for the first time', diff saved to https://phabricator.wikimedia.org/P33880 and previous config
saved to /var/cache/conftool/dbconfig/20220906-083627-root.json [08:36:32] (CR) Volans: "LGTM in general, although I might miss some specific context. It's nice to see the new service catalog module being used! Couple of very m" [cookbooks] - https://gerrit.wikimedia.org/r/824485 (https://phabricator.wikimedia.org/T260663) (owner: JMeybohm) [08:36:49] (CR) Volans: [C: +2] "LGTM" [software/spicerack] - https://gerrit.wikimedia.org/r/822339 (https://phabricator.wikimedia.org/T311486) (owner: Ayounsi) [08:37:50] (Merged) jenkins-bot: Add an istio custom deploy configuration for dse-k8s [deployment-charts] - https://gerrit.wikimedia.org/r/829822 (https://phabricator.wikimedia.org/T310175) (owner: Btullis) [08:40:58] (CR) Volans: "reply inline" [cookbooks] - https://gerrit.wikimedia.org/r/824485 (https://phabricator.wikimedia.org/T260663) (owner: JMeybohm) [08:41:33] (PS5) Samtar: CommonSettings-labs: Load Phonos extension [mediawiki-config] - https://gerrit.wikimedia.org/r/824294 (https://phabricator.wikimedia.org/T314294) [08:41:44] (CR) Volans: "Waiting for a test host from dcops before merging so that we can test it right away."
[cookbooks] - https://gerrit.wikimedia.org/r/812448 (owner: Volans) [08:41:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1130 (re)pooling @ 5%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P33881 and previous config saved to /var/cache/conftool/dbconfig/20220906-084158-root.json [08:42:10] !log restart cr3-ulsfo for software upgrade - T295690 [08:42:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:42:14] T295690: Upgrade core routers to Junos 21+ - https://phabricator.wikimedia.org/T295690 [08:42:54] (PS2) David Caro: wmcs: use 'r' for multilne command help [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/830096 [08:42:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1189 (re)pooling @ 1%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P33882 and previous config saved to /var/cache/conftool/dbconfig/20220906-084257-root.json [08:43:27] (Merged) jenkins-bot: Enable pynetbox threading [software/spicerack] - https://gerrit.wikimedia.org/r/822339 (https://phabricator.wikimedia.org/T311486) (owner: Ayounsi) [08:43:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1119 (re)pooling @ 3%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P33883 and previous config saved to /var/cache/conftool/dbconfig/20220906-084345-root.json [08:44:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1174 (re)pooling @ 3%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P33884 and previous config saved to /var/cache/conftool/dbconfig/20220906-084443-root.json [08:46:50] PROBLEM - OSPF status on mr1-ulsfo is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:47:18] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 44, down: 1, dormant: 0, excluded: 0, unused: 0:
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:47:28] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 61, down: 3, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:48:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1188 (re)pooling @ 3%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P33885 and previous config saved to /var/cache/conftool/dbconfig/20220906-084829-root.json [08:50:21] (CR) Samwilson: [C: +1] CommonSettings-labs: Load Phonos extension [mediawiki-config] - https://gerrit.wikimedia.org/r/824294 (https://phabricator.wikimedia.org/T314294) (owner: Samtar) [08:51:17] (CR) Volans: "question and comment inline" [cookbooks] - https://gerrit.wikimedia.org/r/826798 (owner: Muehlenhoff) [08:51:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1103 (re)pooling @ 4%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P33886 and previous config saved to /var/cache/conftool/dbconfig/20220906-085123-root.json [08:51:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1107 (re)pooling @ 5%: Pooling for the first time', diff saved to https://phabricator.wikimedia.org/P33887 and previous config saved to /var/cache/conftool/dbconfig/20220906-085132-root.json [08:56:34] SRE, SRE-Access-Requests, Fundraising Tech - Chaos Crew, Infrastructure-Foundations, and 2 others: DMarc Email Address for Wikimedia.org - https://phabricator.wikimedia.org/T316899 (MatthewVernon) @Jgreen It looks to me like this is no longer an SRE access request; are you OK with me removing tha...
[08:57:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1130 (re)pooling @ 10%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P33888 and previous config saved to /var/cache/conftool/dbconfig/20220906-085703-root.json [08:57:30] SRE, LDAP-Access-Requests: Grant Access to archiva-deployers for ebysans - https://phabricator.wikimedia.org/T317030 (MatthewVernon) @BTullis are we OK to close this ticket now, then? [I'm on Clinic Duty this week, and it's appearing in the dashboard...] [08:58:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1189 (re)pooling @ 2%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P33889 and previous config saved to /var/cache/conftool/dbconfig/20220906-085802-root.json [08:58:50] SRE, LDAP-Access-Requests: Grant Access to archiva-deployers for ebysans - https://phabricator.wikimedia.org/T317030 (BTullis) Open→Resolved Yes thanks. Resolving now. [08:58:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1119 (re)pooling @ 4%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P33890 and previous config saved to /var/cache/conftool/dbconfig/20220906-085850-root.json [08:59:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1174 (re)pooling @ 4%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P33891 and previous config saved to /var/cache/conftool/dbconfig/20220906-085948-root.json [09:00:39] (CR) Jbond: [C: +1] "lgtm" [dns] - https://gerrit.wikimedia.org/r/829856 (https://phabricator.wikimedia.org/T314489) (owner: Volans) [09:01:53] (CR) Jbond: Spicerack: add configuration file and API key for PeeringDB (2 comments) [puppet] - https://gerrit.wikimedia.org/r/819562 (owner: Ayounsi) [09:03:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1188 (re)pooling @ 4%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P33892 and previous config saved to
/var/cache/conftool/dbconfig/20220906-090333-root.json [09:06:18] (03CR) 10Jbond: sre.puppet.sync-netbox-hiera: Add return check mode (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/818111 (owner: 10Jbond) [09:06:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1103 (re)pooling @ 5%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P33893 and previous config saved to /var/cache/conftool/dbconfig/20220906-090628-root.json [09:06:35] (03CR) 10Volans: Spicerack: add configuration file and API key for PeeringDB (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/819562 (owner: 10Ayounsi) [09:06:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1107 (re)pooling @ 10%: Pooling for the first time', diff saved to https://phabricator.wikimedia.org/P33894 and previous config saved to /var/cache/conftool/dbconfig/20220906-090637-root.json [09:06:54] (03PS4) 10Jbond: sre.puppet.sync-netbox-hiera: Add return check mode [cookbooks] - 10https://gerrit.wikimedia.org/r/818111 [09:07:44] (03CR) 10Jbond: [C: 03+2] cli: Add ability to override the amount of retries and backoffs [software/debmonitor] - 10https://gerrit.wikimedia.org/r/812556 (owner: 10Jbond) [09:08:54] (03CR) 10David Caro: [C: 03+2] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/829288 (owner: 10Majavah) [09:10:49] 10SRE, 10LDAP-Access-Requests, 10WMF-NDA-Requests, 10Discovery-Search (Current work): Add pfischer to #wmf-nda on Phab and to #wmf on LDAP - https://phabricator.wikimedia.org/T316922 (10MatthewVernon) @pfischer I'm the clinic duty person this week. Can you confirm your wikimedia email address, please, and... 
[09:11:01] (03Merged) 10jenkins-bot: cli: Add ability to override the amount of retries and backoffs [software/debmonitor] - 10https://gerrit.wikimedia.org/r/812556 (owner: 10Jbond) [09:11:57] 10SRE, 10MediaWiki-Core-HTTP-Cache, 10MediaWiki-Page-history, 10Performance-Team, and 2 others: History pages' caches not being invalidated after edits - https://phabricator.wikimedia.org/T317064 (10Vgutierrez) I've just confirmed it in testwiki, first unauthenticated GET against `action=history` triggers... [09:12:05] (03CR) 10Volans: "post-merge FYI comment" [cookbooks] - 10https://gerrit.wikimedia.org/r/826640 (https://phabricator.wikimedia.org/T308676) (owner: 10Bking) [09:12:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1130 (re)pooling @ 25%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P33895 and previous config saved to /var/cache/conftool/dbconfig/20220906-091207-root.json [09:13:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1189 (re)pooling @ 3%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P33896 and previous config saved to /var/cache/conftool/dbconfig/20220906-091307-root.json [09:13:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1119 (re)pooling @ 5%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P33897 and previous config saved to /var/cache/conftool/dbconfig/20220906-091355-root.json [09:14:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1174 (re)pooling @ 5%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P33898 and previous config saved to /var/cache/conftool/dbconfig/20220906-091453-root.json [09:15:03] RECOVERY - SSH on mw1326.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:17:20] (03CR) 10Volans: [C: 04-1] "Sorry I had missed one error in the first pass. 
LGTM otherwise, no need to re-review" [cookbooks] - 10https://gerrit.wikimedia.org/r/818111 (owner: 10Jbond) [09:17:39] (03PS2) 10Volans: Simplify cumin query in comment for confd [dns] - 10https://gerrit.wikimedia.org/r/829856 (https://phabricator.wikimedia.org/T314489) [09:18:06] (03Abandoned) 10Hnowlan: api-gateway: move route_name metadata to route level [deployment-charts] - 10https://gerrit.wikimedia.org/r/767070 (https://phabricator.wikimedia.org/T295956) (owner: 10Hnowlan) [09:18:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1188 (re)pooling @ 5%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P33899 and previous config saved to /var/cache/conftool/dbconfig/20220906-091838-root.json [09:19:32] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/api-gateway: sync [09:19:43] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/api-gateway: sync [09:21:24] (03PS2) 10Jbond: prometheus-openstack-stale-puppet-certs.py: log original cert name [puppet] - 10https://gerrit.wikimedia.org/r/829320 (owner: 10Andrew Bogott) [09:21:26] (03PS4) 10Jbond: Add clean-stale-puppet-certs script [puppet] - 10https://gerrit.wikimedia.org/r/829321 (owner: 10Andrew Bogott) [09:21:28] (03PS1) 10Jbond: C:prometheus: openstack-stale-puppet-certs get ssldir from puppet [puppet] - 10https://gerrit.wikimedia.org/r/830113 [09:21:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1103 (re)pooling @ 10%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P33900 and previous config saved to /var/cache/conftool/dbconfig/20220906-092133-root.json [09:21:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1107 (re)pooling @ 25%: Pooling for the first time', diff saved to https://phabricator.wikimedia.org/P33901 and previous config saved to /var/cache/conftool/dbconfig/20220906-092141-root.json [09:22:11] !log installing istio configs to dse-k8s cluster [09:22:11] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:22:19] (03CR) 10David Caro: "LGTM, a question inline and also, merging this should not change anything right? (until we set the keepalived_vips hiera setting, probably" [puppet] - 10https://gerrit.wikimedia.org/r/829289 (https://phabricator.wikimedia.org/T316982) (owner: 10Majavah) [09:24:05] (03CR) 10David Caro: [C: 03+2] P:toolforge: remove linux kernel pinnings [puppet] - 10https://gerrit.wikimedia.org/r/790710 (https://phabricator.wikimedia.org/T290494) (owner: 10Majavah) [09:24:25] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [09:24:29] !log btullis@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'sync'. [09:24:30] !log btullis@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'sync'. [09:24:43] !log btullis@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'sync'. [09:25:04] !log btullis@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'sync'. [09:25:31] (03CR) 10Volans: [C: 03+2] Simplify cumin query in comment for confd [dns] - 10https://gerrit.wikimedia.org/r/829856 (https://phabricator.wikimedia.org/T314489) (owner: 10Volans) [09:25:31] !log btullis@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'sync'. [09:25:39] !log btullis@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'sync'. [09:26:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2119 (T314041)', diff saved to https://phabricator.wikimedia.org/P33902 and previous config saved to /var/cache/conftool/dbconfig/20220906-092604-ladsgroup.json [09:26:07] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2172.codfw.wmnet with reason: Maintenance [09:26:07] !log btullis@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'sync'. 
[09:26:07] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [09:26:08] !log btullis@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'sync'. [09:26:13] (03CR) 10David Caro: [C: 03+2] hieradata: set swift_clusters: {} on cloud [puppet] - 10https://gerrit.wikimedia.org/r/799861 (https://phabricator.wikimedia.org/T309281) (owner: 10Majavah) [09:26:20] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2172.codfw.wmnet with reason: Maintenance [09:26:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2172 (T314041)', diff saved to https://phabricator.wikimedia.org/P33903 and previous config saved to /var/cache/conftool/dbconfig/20220906-092626-ladsgroup.json [09:26:29] PROBLEM - Swift https frontend on ms-fe1012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.154 second response time https://wikitech.wikimedia.org/wiki/Swift [09:27:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1130 (re)pooling @ 50%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P33904 and previous config saved to /var/cache/conftool/dbconfig/20220906-092712-root.json [09:27:28] (03PS3) 10David Caro: hieradata: set swift_clusters: {} on cloud [puppet] - 10https://gerrit.wikimedia.org/r/799861 (https://phabricator.wikimedia.org/T309281) (owner: 10Majavah) [09:27:48] (03CR) 10David Caro: [C: 03+2] "Just rebased and fixed conflicts" [puppet] - 10https://gerrit.wikimedia.org/r/799861 (https://phabricator.wikimedia.org/T309281) (owner: 10Majavah) [09:28:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1189 (re)pooling @ 4%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P33905 and previous config saved to /var/cache/conftool/dbconfig/20220906-092812-root.json [09:28:20] (03PS5) 10David Caro: puppetmaster: remove 'allow_from' [puppet] - 10https://gerrit.wikimedia.org/r/799859 (owner: 10Majavah) 
[09:28:32] (03CR) 10David Caro: [C: 03+2] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/799859 (owner: 10Majavah) [09:28:35] RECOVERY - Swift https frontend on ms-fe1012 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.018 second response time https://wikitech.wikimedia.org/wiki/Swift [09:28:44] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [09:28:46] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [09:29:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1119 (re)pooling @ 10%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P33906 and previous config saved to /var/cache/conftool/dbconfig/20220906-092900-root.json [09:29:43] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [09:29:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1174 (re)pooling @ 10%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P33907 and previous config saved to /var/cache/conftool/dbconfig/20220906-092958-root.json [09:31:24] !log cgoubert@puppetmaster1001 conftool action : set/pooled=no:weight=10; selector: dc=eqiad,cluster=parsoid,name=parse1008.eqiad.wmnet [09:32:28] (03CR) 10David Caro: [C: 03+1] "LGTM, thanks!" 
[dns] - 10https://gerrit.wikimedia.org/r/826803 (https://phabricator.wikimedia.org/T315955) (owner: 10Cathal Mooney) [09:33:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1188 (re)pooling @ 10%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P33908 and previous config saved to /var/cache/conftool/dbconfig/20220906-093343-root.json [09:34:10] (03CR) 10David Caro: [C: 03+2] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/826346 (owner: 10Majavah) [09:34:15] (03PS2) 10David Caro: hieradata: remove unused labs_tld labs_site variables [puppet] - 10https://gerrit.wikimedia.org/r/826346 (owner: 10Majavah) [09:34:37] 10SRE, 10MediaWiki-Core-HTTP-Cache, 10MediaWiki-Page-history, 10Performance-Team, and 2 others: History pages' caches not being invalidated after edits - https://phabricator.wikimedia.org/T317064 (10Vgutierrez) ATS provides a similar feature to libvmod-querysort as part of the Cache Key manipulation plugin... [09:34:45] (03PS6) 10David Caro: puppetmaster: remove 'allow_from' [puppet] - 10https://gerrit.wikimedia.org/r/799859 (owner: 10Majavah) [09:35:31] (03CR) 10Jbond: [C: 03+1] git-sync-upstream: apply automated black and isort [puppet] - 10https://gerrit.wikimedia.org/r/830082 (https://phabricator.wikimedia.org/T317071) (owner: 10David Caro) [09:35:57] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/830083 (https://phabricator.wikimedia.org/T317071) (owner: 10David Caro) [09:36:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1103 (re)pooling @ 25%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P33909 and previous config saved to /var/cache/conftool/dbconfig/20220906-093638-root.json [09:36:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1107 (re)pooling @ 50%: Pooling for the first time', diff saved to https://phabricator.wikimedia.org/P33910 and previous config saved to /var/cache/conftool/dbconfig/20220906-093646-root.json [09:38:02] (03CR) 
10David Caro: [C: 03+2] Remove support for overriding LDAP client stack (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/826536 (owner: 10Majavah) [09:39:53] RECOVERY - mediawiki-installation DSH group on parse1008 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [09:40:21] !log cgoubert@cumin1001 START - Cookbook sre.hosts.remove-downtime for parse1008.eqiad.wmnet [09:40:21] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for parse1008.eqiad.wmnet [09:41:51] (03CR) 10David Caro: [C: 03+2] "I'll leave this for a bit to give a chance to address the nits if you want, or ping me to merge right away" [puppet] - 10https://gerrit.wikimedia.org/r/826536 (owner: 10Majavah) [09:41:57] (03CR) 10Aqu: Puppetize spark3 installation and configs using conda-analytics env V2 (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/821695 (https://phabricator.wikimedia.org/T312882) (owner: 10Aqu) [09:42:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1130 (re)pooling @ 75%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P33911 and previous config saved to /var/cache/conftool/dbconfig/20220906-094217-root.json [09:42:20] (03CR) 10David Caro: [C: 03+2] git-sync-upstream: apply automated black and isort [puppet] - 10https://gerrit.wikimedia.org/r/830082 (https://phabricator.wikimedia.org/T317071) (owner: 10David Caro) [09:42:27] (03CR) 10David Caro: [C: 03+2] git-sync-upstream: add a log for the repo being rebased [puppet] - 10https://gerrit.wikimedia.org/r/830083 (https://phabricator.wikimedia.org/T317071) (owner: 10David Caro) [09:43:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1189 (re)pooling @ 5%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P33912 and previous config saved to /var/cache/conftool/dbconfig/20220906-094316-root.json [09:43:21] PROBLEM - Swift https frontend on ms-fe1012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad 
Gateway - 309 bytes in 7.292 second response time https://wikitech.wikimedia.org/wiki/Swift [09:44:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1119 (re)pooling @ 25%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P33913 and previous config saved to /var/cache/conftool/dbconfig/20220906-094404-root.json [09:44:18] !log pooled parse1008.eqiad.wmnet (php 7.4 only) in parsoid cluster T307219 [09:44:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:44:21] T307219: Put parse parse10[01-24] in production - https://phabricator.wikimedia.org/T307219 [09:45:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1174 (re)pooling @ 25%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P33914 and previous config saved to /var/cache/conftool/dbconfig/20220906-094503-root.json [09:45:31] (03PS6) 10David Caro: P:toolforge: cleanup bastion grid integration [puppet] - 10https://gerrit.wikimedia.org/r/820856 (https://phabricator.wikimedia.org/T314665) (owner: 10Majavah) [09:45:35] !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/api-gateway: sync [09:45:43] RECOVERY - Swift https frontend on ms-fe1012 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.010 second response time https://wikitech.wikimedia.org/wiki/Swift [09:46:03] !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/api-gateway: sync [09:47:40] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/api-gateway: sync [09:48:08] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/api-gateway: sync [09:48:22] 10SRE, 10MediaWiki-Core-HTTP-Cache, 10MediaWiki-Page-history, 10Performance-Team, and 2 others: History pages' caches not being invalidated after edits - https://phabricator.wikimedia.org/T317064 (10Joe) Another option is to do the query sorting for purges, which are a special case, in either: # mediawiki... 
[09:48:31] (03PS1) 10Marostegui: mariadb: Productionize db1199 [puppet] - 10https://gerrit.wikimedia.org/r/830117 (https://phabricator.wikimedia.org/T316342) [09:48:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1188 (re)pooling @ 25%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P33915 and previous config saved to /var/cache/conftool/dbconfig/20220906-094848-root.json [09:49:32] (03CR) 10Marostegui: [C: 03+2] mariadb: Productionize db1199 [puppet] - 10https://gerrit.wikimedia.org/r/830117 (https://phabricator.wikimedia.org/T316342) (owner: 10Marostegui) [09:51:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1103 (re)pooling @ 50%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P33916 and previous config saved to /var/cache/conftool/dbconfig/20220906-095143-root.json [09:51:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1107 (re)pooling @ 75%: Pooling for the first time', diff saved to https://phabricator.wikimedia.org/P33917 and previous config saved to /var/cache/conftool/dbconfig/20220906-095151-root.json [09:51:59] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:53:33] (03CR) 10David Caro: [C: 03+2] P:toolforge: cleanup bastion grid integration [puppet] - 10https://gerrit.wikimedia.org/r/820856 (https://phabricator.wikimedia.org/T314665) (owner: 10Majavah) [09:55:12] (03CR) 10Btullis: [C: 03+2] Puppetize spark3 installation and configs using conda-analytics env V2 [puppet] - 10https://gerrit.wikimedia.org/r/821695 (https://phabricator.wikimedia.org/T312882) (owner: 10Aqu) [09:55:13] !log depooled wtp1041.eqiad.wmnet from parsoid cluster T307219 [09:55:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:55:16] T307219: Put parse parse10[01-24] in production - https://phabricator.wikimedia.org/T307219 [09:57:20] !log 
cgoubert@puppetmaster1001 conftool action : set/pooled=no:weight=10; selector: dc=eqiad,cluster=parsoid,name=parse1009.eqiad.wmnet [09:57:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1130 (re)pooling @ 100%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P33918 and previous config saved to /var/cache/conftool/dbconfig/20220906-095722-root.json [09:58:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1189 (re)pooling @ 10%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P33919 and previous config saved to /var/cache/conftool/dbconfig/20220906-095821-root.json [09:59:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1119 (re)pooling @ 50%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P33920 and previous config saved to /var/cache/conftool/dbconfig/20220906-095909-root.json [10:00:03] (03PS1) 10Volans: sre.hardware.upgrade-firmware: sort drivers (2) [cookbooks] - 10https://gerrit.wikimedia.org/r/830121 [10:00:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1174 (re)pooling @ 50%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P33921 and previous config saved to /var/cache/conftool/dbconfig/20220906-100008-root.json [10:02:22] (03PS1) 10Majavah: P:toolforge::apt_pinning: re-add required param values [puppet] - 10https://gerrit.wikimedia.org/r/830122 [10:03:14] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37122/console" [puppet] - 10https://gerrit.wikimedia.org/r/830122 (owner: 10Majavah) [10:03:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1188 (re)pooling @ 50%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P33923 and previous config saved to /var/cache/conftool/dbconfig/20220906-100353-root.json [10:04:21] (03CR) 10Volans: "I've sent what I think is a proper fix in I27187f712a3b3a8380cb9b01562d21250b810083. 
Sorry for the trouble." [cookbooks] - 10https://gerrit.wikimedia.org/r/829193 (owner: 10Papaul) [10:05:28] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/821688 (owner: 10Ayounsi) [10:06:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1103 (re)pooling @ 75%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P33924 and previous config saved to /var/cache/conftool/dbconfig/20220906-100647-root.json [10:06:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1107 (re)pooling @ 100%: Pooling for the first time', diff saved to https://phabricator.wikimedia.org/P33925 and previous config saved to /var/cache/conftool/dbconfig/20220906-100656-root.json [10:08:35] (03CR) 10Jbond: Spicerack: add configuration file and API key for PeeringDB (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/819562 (owner: 10Ayounsi) [10:09:21] 10SRE-OnFire, 10DBA, 10Sustainability (Incident Followup): Investigate mariadb 10.6 performance regression during spikes/high load - https://phabricator.wikimedia.org/T311106 (10Marostegui) I have installed the new package also on db1111 (wikidata) and will start repooling it tomorrow. 
[10:10:25] (03PS1) 10Marostegui: db1111: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/830123 [10:10:37] (03PS5) 10Jbond: sre.puppet.sync-netbox-hiera: Add return check mode [cookbooks] - 10https://gerrit.wikimedia.org/r/818111 [10:10:39] (03CR) 10Jbond: "done thanks" [cookbooks] - 10https://gerrit.wikimedia.org/r/818111 (owner: 10Jbond) [10:10:59] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2171.codfw.wmnet with reason: Maintenance [10:11:09] (03CR) 10Marostegui: [C: 03+2] db1111: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/830123 (owner: 10Marostegui) [10:11:23] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2171.codfw.wmnet with reason: Maintenance [10:11:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2171:3316 (T314041)', diff saved to https://phabricator.wikimedia.org/P33926 and previous config saved to /var/cache/conftool/dbconfig/20220906-101129-ladsgroup.json [10:11:32] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [10:12:25] (03CR) 10David Caro: [C: 03+2] P:toolforge::apt_pinning: re-add required param values [puppet] - 10https://gerrit.wikimedia.org/r/830122 (owner: 10Majavah) [10:12:43] (03CR) 10Jbond: [C: 03+1] Remove support for overriding LDAP client stack (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/826536 (owner: 10Majavah) [10:13:12] (03CR) 10David Caro: [C: 03+2] "This would have been caught by some simple unit test :/" [puppet] - 10https://gerrit.wikimedia.org/r/830122 (owner: 10Majavah) [10:13:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1189 (re)pooling @ 25%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P33927 and previous config saved to /var/cache/conftool/dbconfig/20220906-101326-root.json [10:13:50] 10Puppet, 10Cloud-VPS, 10Infrastructure-Foundations, 10cloud-services-team (Kanban): 
Remove prod-specific bits from cloud puppetmasters - https://phabricator.wikimedia.org/T309281 (10taavi) [10:14:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1119 (re)pooling @ 75%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P33928 and previous config saved to /var/cache/conftool/dbconfig/20220906-101414-root.json [10:15:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1174 (re)pooling @ 75%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P33929 and previous config saved to /var/cache/conftool/dbconfig/20220906-101513-root.json [10:15:19] (03PS1) 10Giuseppe Lavagetto: [WiP] sort query parameters in URLs [software/purged] - 10https://gerrit.wikimedia.org/r/830124 (https://phabricator.wikimedia.org/T317064) [10:15:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1138 (re)pooling @ 1%: Repooling after cloning another host', diff saved to https://phabricator.wikimedia.org/P33930 and previous config saved to /var/cache/conftool/dbconfig/20220906-101559-root.json [10:17:25] (03CR) 10Jbond: [C: 03+1] "lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/830121 (owner: 10Volans) [10:18:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1188 (re)pooling @ 75%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P33931 and previous config saved to /var/cache/conftool/dbconfig/20220906-101858-root.json [10:19:42] (03PS6) 10Jbond: sre.puppet.sync-netbox-hiera: Add return check mode [cookbooks] - 10https://gerrit.wikimedia.org/r/818111 [10:20:14] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 45, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:21:08] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:21:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1103 (re)pooling @ 100%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P33932 and previous config saved to /var/cache/conftool/dbconfig/20220906-102152-root.json [10:25:20] !log btullis@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'sync'. [10:25:21] !log btullis@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'sync'. [10:26:38] !log put cr3-ulsfo back in service - T295690 [10:26:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:26:41] T295690: Upgrade core routers to Junos 21+ - https://phabricator.wikimedia.org/T295690 [10:27:13] !log btullis@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'sync'. [10:27:14] !log btullis@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'sync'. [10:28:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1189 (re)pooling @ 50%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P33934 and previous config saved to /var/cache/conftool/dbconfig/20220906-102831-root.json [10:28:54] RECOVERY - OSPF status on mr1-ulsfo is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:29:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1119 (re)pooling @ 100%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P33935 and previous config saved to /var/cache/conftool/dbconfig/20220906-102919-root.json [10:30:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1174 (re)pooling @ 100%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P33936 and previous config saved to /var/cache/conftool/dbconfig/20220906-103017-root.json [10:31:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1138 (re)pooling @ 2%: Repooling after cloning another host', diff saved to 
https://phabricator.wikimedia.org/P33937 and previous config saved to /var/cache/conftool/dbconfig/20220906-103104-root.json [10:31:43] (03PS1) 10FNegri: Add cloudcephosd103[1-4] to the Ceph pool [puppet] - 10https://gerrit.wikimedia.org/r/830147 (https://phabricator.wikimedia.org/T314870) [10:32:24] (03CR) 10FNegri: "I think we can add the remaining 4 hosts in a single patch. They won't actually join the cluster until the cookbook is run on each host in" [puppet] - 10https://gerrit.wikimedia.org/r/830147 (https://phabricator.wikimedia.org/T314870) (owner: 10FNegri) [10:34:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1188 (re)pooling @ 100%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P33938 and previous config saved to /var/cache/conftool/dbconfig/20220906-103402-root.json [10:34:37] 10SRE, 10ops-eqiad: Cable/connection issue on ml-cache1001.eqiad.wmnet - https://phabricator.wikimedia.org/T317091 (10klausman) [10:35:35] (03CR) 10David Caro: [C: 03+1] Add cloudcephosd103[1-4] to the Ceph pool [puppet] - 10https://gerrit.wikimedia.org/r/830147 (https://phabricator.wikimedia.org/T314870) (owner: 10FNegri) [10:36:29] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/818111 (owner: 10Jbond) [10:39:43] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/804575 (owner: 10Jbond) [10:40:50] !log switched primary kube-controller-manager from kubemaster1001 to kubemaster1002 [10:40:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:40:54] 10SRE, 10Data Engineering Planning, 10Foundational Technology Requests: Add a webrequest sampled topic and ingest into druid/turnilo - https://phabricator.wikimedia.org/T314981 (10EChetty) [10:42:19] (03CR) 10Jelto: "Thanks for the review and comments! I switch to the class interface and SREBatchBase." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/827456 (https://phabricator.wikimedia.org/T295481) (owner: 10Jelto) [10:42:34] !log drain traffic from cr4-ulsfo - T295690 [10:42:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:42:41] T295690: Upgrade core routers to Junos 21+ - https://phabricator.wikimedia.org/T295690 [10:42:47] (03CR) 10Volans: [C: 03+2] Enable pynetbox threading [software/homer] - 10https://gerrit.wikimedia.org/r/828031 (https://phabricator.wikimedia.org/T311486) (owner: 10Ayounsi) [10:43:25] 10SRE, 10SRE-Access-Requests, 10Data Engineering Planning, 10Discovery-Search (Current work), 10Patch-For-Review: Production Shell access for Peter - https://phabricator.wikimedia.org/T316090 (10EChetty) [10:43:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1189 (re)pooling @ 75%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P33939 and previous config saved to /var/cache/conftool/dbconfig/20220906-104336-root.json [10:43:37] 10SRE, 10Data Engineering Planning, 10Traffic-Icebox: varnishkafka / ATSkafka should support setting the kafka message timestamp - https://phabricator.wikimedia.org/T277553 (10EChetty) [10:43:47] (03CR) 10FNegri: [C: 03+2] "Nice!" [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/830096 (owner: 10David Caro) [10:43:59] 10SRE, 10Data Engineering Planning, 10Traffic-Icebox: Mobile redirects drop provenance parameters - https://phabricator.wikimedia.org/T252227 (10EChetty) [10:44:00] !log btullis@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'sync'. [10:44:01] !log btullis@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'sync'. 
[10:46:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1138 (re)pooling @ 3%: Repooling after cloning another host', diff saved to https://phabricator.wikimedia.org/P33940 and previous config saved to /var/cache/conftool/dbconfig/20220906-104611-root.json [10:46:29] (03Merged) 10jenkins-bot: wmcs: use 'r' for multilne command help [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/830096 (owner: 10David Caro) [10:46:58] (03PS1) 10Jbond: Upstream release v0.3.2 [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/830149 [10:47:39] (03Merged) 10jenkins-bot: Enable pynetbox threading [software/homer] - 10https://gerrit.wikimedia.org/r/828031 (https://phabricator.wikimedia.org/T311486) (owner: 10Ayounsi) [10:47:52] (03CR) 10Jbond: [V: 03+2 C: 03+2] Upstream release v0.3.2 [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/830149 (owner: 10Jbond) [10:47:58] (03CR) 10FNegri: [C: 03+2] Add cloudcephosd103[1-4] to the Ceph pool [puppet] - 10https://gerrit.wikimedia.org/r/830147 (https://phabricator.wikimedia.org/T314870) (owner: 10FNegri) [10:50:53] (03PS1) 10Hnowlan: api-gatway: open access to inference on service port [deployment-charts] - 10https://gerrit.wikimedia.org/r/830150 [10:52:22] !log uploaded ghostscript 9.26a~dfsg-0+deb9u9+wmf1 to apt.wikimedia.org [10:52:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:56:03] (03CR) 10Jbond: sre.dns.netbox: add call to sre.puppet.sync-netbox-hiera (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/804575 (owner: 10Jbond) [10:57:59] !log cgoubert@cumin1001 START - Cookbook sre.hosts.remove-downtime for parse1009.eqiad.wmnet [10:57:59] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for parse1009.eqiad.wmnet [10:58:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1189 (re)pooling @ 100%: Repooling after upgrade', diff saved to https://phabricator.wikimedia.org/P33941 and previous config saved to 
/var/cache/conftool/dbconfig/20220906-105841-root.json [11:01:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1138 (re)pooling @ 4%: Repooling after cloning another host', diff saved to https://phabricator.wikimedia.org/P33942 and previous config saved to /var/cache/conftool/dbconfig/20220906-110116-root.json [11:03:36] (03PS1) 10Jbond: Revert "Upstream release v0.3.2" [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/830132 [11:04:05] (03CR) 10Jbond: [V: 03+2 C: 03+2] Revert "Upstream release v0.3.2" [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/830132 (owner: 10Jbond) [11:06:31] !log restart cr4-ulsfo for software upgrade - T295690 [11:06:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:06:35] T295690: Upgrade core routers to Junos 21+ - https://phabricator.wikimedia.org/T295690 [11:06:37] (03CR) 10Volans: "Much nicer! Thanks a lot for the refactor. LGTM mostly, just couple of suggestions inline." [cookbooks] - 10https://gerrit.wikimedia.org/r/827456 (https://phabricator.wikimedia.org/T295481) (owner: 10Jelto) [11:08:35] (03PS1) 10Jbond: cli: bump client version number [software/debmonitor] - 10https://gerrit.wikimedia.org/r/830152 [11:09:58] (03CR) 10Jbond: [C: 03+2] sre.puppet.sync-netbox-hiera: Add return check mode [cookbooks] - 10https://gerrit.wikimedia.org/r/818111 (owner: 10Jbond) [11:11:14] !log installing ghostscript updates on stretch [11:11:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:11:57] !log jayme@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [11:12:07] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 131, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:12:15] !log jayme@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. 
[11:12:15] PROBLEM - OSPF status on mr1-ulsfo is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:12:20] !log jayme@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [11:12:33] !log jayme@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [11:12:45] PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 79, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:13:15] PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:14:35] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 132, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:14:45] RECOVERY - OSPF status on mr1-ulsfo is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:14:54] (03PS1) 10Jbond: sre.puppet.sync-netbox-hiera: check flag should be bool [cookbooks] - 10https://gerrit.wikimedia.org/r/830153 [11:15:15] RECOVERY - Router interfaces on cr2-eqsin is OK: OK: host 103.102.166.130, interfaces up: 80, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:15:45] RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:16:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1138 (re)pooling @ 5%: Repooling after cloning another host', diff saved to https://phabricator.wikimedia.org/P33943 and previous config saved to /var/cache/conftool/dbconfig/20220906-111621-root.json [11:17:58] !log put cr4-ulsfo back in service - T295690 [11:18:01] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:18:02] T295690: Upgrade core routers to Junos 21+ - https://phabricator.wikimedia.org/T295690 [11:19:35] (03CR) 10Jbond: [C: 03+2] sre.puppet.sync-netbox-hiera: check flag should be bool [cookbooks] - 10https://gerrit.wikimedia.org/r/830153 (owner: 10Jbond) [11:25:53] !log jbond@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "sync data - jbond@cumin2002" [11:26:12] !log cgoubert@cumin1001 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on 12 hosts with reason: Downtime pending inclusion in production [11:26:22] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on 12 hosts with reason: Downtime pending inclusion in production [11:26:30] !log jbond@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "sync data - jbond@cumin2002" [11:27:29] !log pooled parse1009.eqiad.wmnet (php 7.4 only) in parsoid cluster T307219 [11:27:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:27:32] T307219: Put parse parse10[01-24] in production - https://phabricator.wikimedia.org/T307219 [11:31:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1138 (re)pooling @ 10%: Repooling after cloning another host', diff saved to https://phabricator.wikimedia.org/P33944 and previous config saved to /var/cache/conftool/dbconfig/20220906-113126-root.json [11:33:56] (03CR) 10Snwachukwu: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37123/console" [puppet] - 10https://gerrit.wikimedia.org/r/827526 (https://phabricator.wikimedia.org/T315580) (owner: 10Snwachukwu) [11:34:58] !log jayme@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'. [11:35:12] !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'. 
[11:46:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1138 (re)pooling @ 25%: Repooling after cloning another host', diff saved to https://phabricator.wikimedia.org/P33945 and previous config saved to /var/cache/conftool/dbconfig/20220906-114631-root.json [11:46:38] (03CR) 10Jbond: [C: 03+2] raid: use modern nrpe defines [puppet] - 10https://gerrit.wikimedia.org/r/825740 (owner: 10Majavah) [11:47:03] 10SRE, 10Infrastructure-Foundations, 10netops: Upgrade core routers to Junos 21+ - https://phabricator.wikimedia.org/T295690 (10ayounsi) This has been quite eventful. To keep in mind that those upgrade need the !!no-validate!! knob, more details in the [[ https://www.juniper.net/documentation/us/en/software... [11:50:39] 10SRE-tools, 10Infrastructure-Foundations: sre.hosts.downtime: add network devices support - https://phabricator.wikimedia.org/T317082 (10taavi) [12:00:35] (03CR) 10Jbond: "fly byy post comments, no action needed" [puppet] - 10https://gerrit.wikimedia.org/r/819577 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) [12:01:06] !log depooled wtp1042.eqiad.wmnet from parsoid cluster T307219 [12:01:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:01:10] T307219: Put parse parse10[01-24] in production - https://phabricator.wikimedia.org/T307219 [12:01:35] (03PS4) 10Snwachukwu: Update Puppet files for Airflow Upgrade to 2.3.2 [puppet] - 10https://gerrit.wikimedia.org/r/827526 (https://phabricator.wikimedia.org/T315580) [12:01:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1138 (re)pooling @ 50%: Repooling after cloning another host', diff saved to https://phabricator.wikimedia.org/P33946 and previous config saved to /var/cache/conftool/dbconfig/20220906-120135-root.json [12:02:19] PROBLEM - Swift https backend on ms-fe1012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.268 second response time https://wikitech.wikimedia.org/wiki/Swift [12:02:57] !log cgoubert@cumin1001 
START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on wtp[1039-1040].eqiad.wmnet with reason: Downtiming replaced wtp servers [12:03:12] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on wtp[1039-1040].eqiad.wmnet with reason: Downtiming replaced wtp servers [12:03:41] !log cgoubert@puppetmaster1001 conftool action : set/pooled=inactive; selector: dc=eqiad,cluster=parsoid,name=wtp1039.eqiad.wmnet [12:03:50] !log cgoubert@puppetmaster1001 conftool action : set/pooled=inactive; selector: dc=eqiad,cluster=parsoid,name=wtp1040.eqiad.wmnet [12:04:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3316 (T314041)', diff saved to https://phabricator.wikimedia.org/P33947 and previous config saved to /var/cache/conftool/dbconfig/20220906-120412-ladsgroup.json [12:04:14] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2180.codfw.wmnet with reason: Maintenance [12:04:15] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [12:04:22] (03CR) 10Muehlenhoff: [C: 03+1] c:raid::md move from crontab to systemd timer (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/819577 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) [12:04:28] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2180.codfw.wmnet with reason: Maintenance [12:04:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2180 (T314041)', diff saved to https://phabricator.wikimedia.org/P33948 and previous config saved to /var/cache/conftool/dbconfig/20220906-120433-ladsgroup.json [12:04:37] RECOVERY - Swift https backend on ms-fe1012 is OK: HTTP OK: HTTP/1.1 200 OK - 451 bytes in 0.023 second response time https://wikitech.wikimedia.org/wiki/Swift [12:04:46] (03PS1) 10Ayounsi: Add variable to disable VRRP auth [homer/public] - 
10https://gerrit.wikimedia.org/r/830156 (https://phabricator.wikimedia.org/T295690) [12:05:00] !log Set wtp10[38-40].eqiad.wmnet inactive pending decommission T317025 [12:05:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:05:02] T317025: Decommission wtp10[25-48].eqiad.wmnet - https://phabricator.wikimedia.org/T317025 [12:06:16] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM!" [homer/public] - 10https://gerrit.wikimedia.org/r/830156 (https://phabricator.wikimedia.org/T295690) (owner: 10Ayounsi) [12:06:28] (03CR) 10Snwachukwu: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37127/console" [puppet] - 10https://gerrit.wikimedia.org/r/827526 (https://phabricator.wikimedia.org/T315580) (owner: 10Snwachukwu) [12:10:49] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Upgrade core routers to Junos 21+ - https://phabricator.wikimedia.org/T295690 (10ayounsi) [12:11:06] (03CR) 10Ayounsi: [C: 03+2] Add variable to disable VRRP auth [homer/public] - 10https://gerrit.wikimedia.org/r/830156 (https://phabricator.wikimedia.org/T295690) (owner: 10Ayounsi) [12:14:41] !log cgoubert@puppetmaster1001 conftool action : set/pooled=no:weight=10; selector: dc=eqiad,cluster=parsoid,name=parse1010.eqiad.wmnet [12:15:15] (03PS1) 10Ayounsi: Revert "Depool ulsfo for routers ugprades" [dns] - 10https://gerrit.wikimedia.org/r/830134 [12:15:57] !log repool ulsfo - T295690 [12:15:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:15:59] T295690: Upgrade core routers to Junos 21+ - https://phabricator.wikimedia.org/T295690 [12:16:08] (03PS2) 10Ayounsi: Revert "Depool ulsfo for routers ugprades" [dns] - 10https://gerrit.wikimedia.org/r/830134 [12:16:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1138 (re)pooling @ 75%: Repooling after cloning another host', diff saved to https://phabricator.wikimedia.org/P33949 and previous config saved to 
/var/cache/conftool/dbconfig/20220906-121640-root.json [12:17:25] (03CR) 10Ayounsi: [C: 03+2] Revert "Depool ulsfo for routers ugprades" [dns] - 10https://gerrit.wikimedia.org/r/830134 (owner: 10Ayounsi) [12:17:59] (03CR) 10Volans: [C: 03+1] "LGTM" [software/debmonitor] - 10https://gerrit.wikimedia.org/r/830152 (owner: 10Jbond) [12:18:01] (03CR) 10Zabe: [C: 03+1] ExtensionDistributor: Add REL1_39 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/829877 (https://phabricator.wikimedia.org/T313925) (owner: 10Jforrester) [12:22:13] (03CR) 10MSantos: maps: remove tilerator logic from planet_sync (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/759894 (owner: 10MSantos) [12:29:17] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 0:15:00 on puppetdb2002.codfw.wmnet with reason: Temporarily stop puppetdb/postgres [12:29:33] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on puppetdb2002.codfw.wmnet with reason: Temporarily stop puppetdb/postgres [12:30:25] (03PS1) 10Ayounsi: network.prepare-upgrade: cleanup [cookbooks] - 10https://gerrit.wikimedia.org/r/830158 [12:31:32] (03PS2) 10Ayounsi: network.prepare-upgrade: cleanup [cookbooks] - 10https://gerrit.wikimedia.org/r/830158 [12:31:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1138 (re)pooling @ 100%: Repooling after cloning another host', diff saved to https://phabricator.wikimedia.org/P33950 and previous config saved to /var/cache/conftool/dbconfig/20220906-123145-root.json [12:32:36] (03PS3) 10Ayounsi: network.prepare-upgrade: cleanup [cookbooks] - 10https://gerrit.wikimedia.org/r/830158 [12:32:40] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [12:33:15] 10SRE, 10observability, 10Patch-For-Review: Move Kafka logging to the new 
intermediate PKI - https://phabricator.wikimedia.org/T300130 (10elukey) Ping again @colewhite to see if we can proceed or not during the next months :) [12:35:23] PROBLEM - uWSGI puppetboard -http via nrpe- on puppetboard2002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 BAD GATEWAY - 275 bytes in 0.066 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/puppetboard [12:35:45] (JobUnavailable) firing: Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:35:58] (03CR) 10CI reject: [V: 04-1] network.prepare-upgrade: cleanup [cookbooks] - 10https://gerrit.wikimedia.org/r/830158 (owner: 10Ayounsi) [12:36:28] (03PS12) 10Jaime Nuche: scap: introduce bootstrapping mechanism specific to deployment hosts [puppet] - 10https://gerrit.wikimedia.org/r/820749 [12:37:54] (03CR) 10Jbond: [C: 03+2] cli: bump client version number [software/debmonitor] - 10https://gerrit.wikimedia.org/r/830152 (owner: 10Jbond) [12:40:25] RECOVERY - uWSGI puppetboard -http via nrpe- on puppetboard2002 is OK: HTTP OK: HTTP/1.1 200 OK - 920705 bytes in 4.394 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/puppetboard [12:40:45] (JobUnavailable) resolved: Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:41:39] (03Merged) 10jenkins-bot: cli: bump client version number [software/debmonitor] - 10https://gerrit.wikimedia.org/r/830152 (owner: 10Jbond) [12:46:51] (03PS1) 10Jbond: Upstream release v0.3.2 [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/830159 [12:47:18] (03CR) 10Jbond: [C: 03+2] Upstream release v0.3.2 
[software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/830159 (owner: 10Jbond) [12:48:03] (03CR) 10Jbond: [V: 03+2 C: 03+2] Upstream release v0.3.2 [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/830159 (owner: 10Jbond) [12:51:44] (03CR) 10Jaime Nuche: "Latest patchset tested in beta. PCC also looks happy: https://puppet-compiler.wmflabs.org/pcc-worker1001/37128/" [puppet] - 10https://gerrit.wikimedia.org/r/820749 (owner: 10Jaime Nuche) [12:53:02] (03PS1) 10Jbond: Revert "Upstream release v0.3.2" [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/830135 [12:56:38] (03CR) 10Jbond: [V: 03+2 C: 03+2] Revert "Upstream release v0.3.2" [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/830135 (owner: 10Jbond) [13:00:04] RoanKattouw, Lucas_WMDE, Urbanecm, and awight: #bothumor I ♥ Unicode. All rise for UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220906T1300). [13:00:05] TheresNoTime: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[13:00:05] Deploy window Mobileapps/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220906T1300) [13:00:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134 (T312863)', diff saved to https://phabricator.wikimedia.org/P33951 and previous config saved to /var/cache/conftool/dbconfig/20220906-130004-ladsgroup.json [13:00:08] T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863 [13:00:13] * TheresNoTime is here [13:01:27] I can deploy, but will wait a few minutes for the scheduled deployers :) [13:03:45] Lucas_WMDE: urbanecm: I am going to self-deploy my patch, there is only one patch in the window [13:04:05] o/ [13:04:08] TheresNoTime: feel free to :) [13:04:14] I'm around if anything arises [13:04:23] go ahead :) [13:04:33] (I’m in a meeting) [13:04:43] (03PS6) 10Samtar: CommonSettings-labs: Load Phonos extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824294 (https://phabricator.wikimedia.org/T314294) [13:05:48] ack [13:06:24] (03CR) 10Samtar: [C: 03+2] "self-deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824294 (https://phabricator.wikimedia.org/T314294) (owner: 10Samtar) [13:07:12] (03CR) 10Klausman: [C: 03+1] api-gatway: open access to inference on service port (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/830150 (owner: 10Hnowlan) [13:07:42] (03Merged) 10jenkins-bot: CommonSettings-labs: Load Phonos extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/824294 (https://phabricator.wikimedia.org/T314294) (owner: 10Samtar) [13:08:42] now testing [13:08:57] (03CR) 10Hnowlan: [C: 03+2] api-gatway: open access to inference on service port [deployment-charts] - 10https://gerrit.wikimedia.org/r/830150 (owner: 10Hnowlan) [13:09:33] (03PS1) 10Jbond: Upstream release v0.3.2 [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/830161 [13:09:33] urbanecm: oh as I'm testing for beta, I 
need to wait for the jenkins job to run, correct? [13:09:46] (03CR) 10Jbond: [V: 03+2 C: 03+2] Upstream release v0.3.2 [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/830161 (owner: 10Jbond) [13:10:37] (and/or trigger a `beta-code-update-eqiad` :p) [13:10:40] TheresNoTime: correct [13:12:14] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:12:22] (03Merged) 10jenkins-bot: api-gatway: open access to inference on service port [deployment-charts] - 10https://gerrit.wikimedia.org/r/830150 (owner: 10Hnowlan) [13:15:00] Okay that worked (or, "worked", but extension issue vs. anything worth stopping deployment over) [13:15:03] (03CR) 10Jbond: c:raid::md move from crontab to systemd timer (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/819577 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) [13:15:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134', diff saved to https://phabricator.wikimedia.org/P33952 and previous config saved to /var/cache/conftool/dbconfig/20220906-131510-ladsgroup.json [13:15:43] Doing sync [13:15:47] PROBLEM - Check systemd state on ms-fe1009 is CRITICAL: CRITICAL - degraded: The following units failed: swift_ring_manager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:16:46] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:16:47] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:16:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2180 (T314041)', diff saved to https://phabricator.wikimedia.org/P33953 and previous config saved to /var/cache/conftool/dbconfig/20220906-131654-ladsgroup.json [13:16:56] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2169.codfw.wmnet with reason: Maintenance [13:16:56] T314041: Drop old templatelinks columns and 
indexes - https://phabricator.wikimedia.org/T314041 [13:17:09] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2169.codfw.wmnet with reason: Maintenance [13:17:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2169:3316 (T314041)', diff saved to https://phabricator.wikimedia.org/P33954 and previous config saved to /var/cache/conftool/dbconfig/20220906-131715-ladsgroup.json [13:17:46] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:18:36] 10SRE, 10Traffic: ATS isn't honoring the cache policy set in cache::alternate_domains on some cases - https://phabricator.wikimedia.org/T316545 (10Vgutierrez) 05Open→03Resolved [13:19:41] !log samtar@deploy1002 Synchronized wmf-config/CommonSettings-labs.php: Config: [[gerrit:824294|CommonSettings-labs: Load Phonos extension (T314294)]] (duration: 04m 05s) [13:19:45] T314294: Deploy Phonos to beta cluster - https://phabricator.wikimedia.org/T314294 [13:20:55] urbanecm: done, I have nothing further, did you want to call for patches or should I close the window? 
[13:21:21] PROBLEM - Swift https backend on ms-fe1012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.133 second response time https://wikitech.wikimedia.org/wiki/Swift [13:21:34] TheresNoTime: i think you can close it [13:21:43] cool :) [13:21:52] !log closing UTC afternoon backport window [13:21:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:22:18] 10SRE, 10Traffic, 10Patch-For-Review: ATS Read While Writer feature is wrongly configured - https://phabricator.wikimedia.org/T315911 (10Vgutierrez) p:05Triage→03Medium [13:25:00] 10SRE, 10MediaWiki-Core-HTTP-Cache, 10MediaWiki-Page-history, 10Performance-Team, and 3 others: History pages' caches not being invalidated after edits - https://phabricator.wikimedia.org/T317064 (10Vgutierrez) p:05Triage→03High [13:26:11] RECOVERY - Swift https backend on ms-fe1012 is OK: HTTP OK: HTTP/1.1 200 OK - 451 bytes in 0.021 second response time https://wikitech.wikimedia.org/wiki/Swift [13:26:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1180 T316342', diff saved to https://phabricator.wikimedia.org/P33956 and previous config saved to /var/cache/conftool/dbconfig/20220906-132627-root.json [13:26:30] T316342: Productionize db1196-db1203 - https://phabricator.wikimedia.org/T316342 [13:30:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134', diff saved to https://phabricator.wikimedia.org/P33958 and previous config saved to /var/cache/conftool/dbconfig/20220906-133017-ladsgroup.json [13:30:27] oh, is group0 already on this week’s train? 
[13:31:11] in that case https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/829020 would be ready for deployment, in principle [13:31:19] but I don’t have my yubikey with me today, so I’d need someone else’s help [13:31:26] or it can wait until tomorrow, that’s also totally fine [13:32:59] (03CR) 10BCornwall: varnish: Remove extraneous checks for Docker (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/826367 (owner: 10BCornwall) [13:33:16] !log cgoubert@cumin1001 START - Cookbook sre.hosts.remove-downtime for parse1010.eqiad.wmnet [13:33:17] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for parse1010.eqiad.wmnet [13:34:20] 10SRE, 10Traffic, 10affects-Kiwix-and-openZIM: HTTP 500 against api.php?action=parse API on tr.wikipedia.org - https://phabricator.wikimedia.org/T317011 (10Vgutierrez) p:05Triage→03Medium >>! In T317011#8212213, @Aklapper wrote: > Not sure which project tags to add when it comes to caching layers (?), as... [13:35:57] !log pooled parse1010.eqiad.wmnet (php 7.4 only) in parsoid cluster T307219 [13:35:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:00] T307219: Put parse parse10[01-24] in production - https://phabricator.wikimedia.org/T307219 [13:39:27] (03PS2) 10Giuseppe Lavagetto: [WiP] sort query parameters in URLs [software/purged] - 10https://gerrit.wikimedia.org/r/830124 (https://phabricator.wikimedia.org/T317064) [13:40:26] 10SRE, 10Traffic, 10Upstream: metric discrepancies between ATS 9.x and ATS 8.x - https://phabricator.wikimedia.org/T315064 (10Vgutierrez) p:05Triage→03Medium This could have been solved by T316938 [13:43:03] 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): hw troubleshooting: power supply alert for cloudcephosd1031.eqiad.wmnet - https://phabricator.wikimedia.org/T317127 (10fnegri) [13:43:27] 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): hw troubleshooting: power supply alert for cloudcephosd1031.eqiad.wmnet - 
https://phabricator.wikimedia.org/T317127 (10fnegri) [13:44:22] (03PS1) 10Btullis: Fix the spark3-env.sh resource [puppet] - 10https://gerrit.wikimedia.org/r/830170 (https://phabricator.wikimedia.org/T312882) [13:45:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134 (T312863)', diff saved to https://phabricator.wikimedia.org/P33959 and previous config saved to /var/cache/conftool/dbconfig/20220906-134523-ladsgroup.json [13:45:25] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1135.eqiad.wmnet with reason: Maintenance [13:45:27] T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863 [13:45:39] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1135.eqiad.wmnet with reason: Maintenance [13:45:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1135 (T312863)', diff saved to https://phabricator.wikimedia.org/P33960 and previous config saved to /var/cache/conftool/dbconfig/20220906-134545-ladsgroup.json [13:46:24] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37129/console" [puppet] - 10https://gerrit.wikimedia.org/r/830170 (https://phabricator.wikimedia.org/T312882) (owner: 10Btullis) [13:46:35] (03PS1) 10Marostegui: mariadb: Productionize db1201 [puppet] - 10https://gerrit.wikimedia.org/r/830172 (https://phabricator.wikimedia.org/T316342) [13:47:26] 10SRE, 10Infrastructure-Foundations, 10CAS-SSO, 10Patch-For-Review: Update CAS to 6.5 - https://phabricator.wikimedia.org/T311235 (10MoritzMuehlenhoff) CAS 6.6 has been released two days ago and features several changes related to webauthn and OIDC, so we'll move to 6.6 instead. Notable changes are: **Ope... 
[13:47:48] (03CR) 10Marostegui: [C: 03+2] mariadb: Productionize db1201 [puppet] - 10https://gerrit.wikimedia.org/r/830172 (https://phabricator.wikimedia.org/T316342) (owner: 10Marostegui) [13:48:14] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): hw troubleshooting: power supply alert for cloudcephosd1031.eqiad.wmnet - https://phabricator.wikimedia.org/T317127 (10fnegri) Please note that the instance is not currently in use, it is part of a new group of hosts that are being added to a Ce... [13:49:22] 10SRE, 10LDAP-Access-Requests, 10WMF-NDA-Requests, 10Discovery-Search (Current work): Add pfischer to #wmf-nda on Phab and to #wmf on LDAP - https://phabricator.wikimedia.org/T316922 (10MatthewVernon) p:05Triage→03Medium [13:49:59] RECOVERY - Host ml-cache1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.92 ms [13:50:32] 10SRE, 10ops-eqiad: Cable/connection issue on ml-cache1001.eqiad.wmnet - https://phabricator.wikimedia.org/T317091 (10Jclark-ctr) Replaced Cable and moved port on switch. able to connect now [13:50:40] 10SRE, 10ops-eqiad: Cable/connection issue on ml-cache1001.eqiad.wmnet - https://phabricator.wikimedia.org/T317091 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr [13:53:23] 10SRE, 10Data Engineering Planning, 10Foundational Technology Requests: Add a webrequest sampled topic and ingest into druid/turnilo - https://phabricator.wikimedia.org/T314981 (10MatthewVernon) My clinic duty hat would like to triage this task - is "Medium" priority OK @fgiunchedi / @Ottomata ? 
[13:55:21] (03PS2) 10Btullis: Fix the spark3-env.sh resource [puppet] - 10https://gerrit.wikimedia.org/r/830170 (https://phabricator.wikimedia.org/T312882) [13:56:16] !log depooled wtp1043.eqiad.wmnet from parsoid cluster T307219 [13:56:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:21] T307219: Put parse parse10[01-24] in production - https://phabricator.wikimedia.org/T307219 [13:56:49] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37130/console" [puppet] - 10https://gerrit.wikimedia.org/r/830170 (https://phabricator.wikimedia.org/T312882) (owner: 10Btullis) [13:57:23] (03CR) 10Aqu: [C: 03+1] "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/830170 (https://phabricator.wikimedia.org/T312882) (owner: 10Btullis) [13:57:55] (03PS3) 10Ottomata: Fix the spark3-env.sh resource [puppet] - 10https://gerrit.wikimedia.org/r/830170 (https://phabricator.wikimedia.org/T312882) (owner: 10Btullis) [13:58:01] (03CR) 10Ottomata: [C: 03+2] Fix the spark3-env.sh resource [puppet] - 10https://gerrit.wikimedia.org/r/830170 (https://phabricator.wikimedia.org/T312882) (owner: 10Btullis) [13:58:15] 10SRE, 10Observability-Alerting, 10Performance-Team, 10Patch-For-Review: Add monitoring for performance.wikimedia.org - https://phabricator.wikimedia.org/T277927 (10colewhite) [13:58:32] 10SRE, 10Infrastructure-Foundations, 10Observability-Alerting: icinga raid montioring inoperable for H750 controllers - https://phabricator.wikimedia.org/T315608 (10colewhite) [13:58:50] 10ops-eqiad: Cable/connection issue on ml-cache1001.eqiad.wmnet - https://phabricator.wikimedia.org/T317091 (10MatthewVernon) [13:59:01] 10SRE, 10ops-codfw, 10Observability-Logging: Degraded RAID on logstash2027 - https://phabricator.wikimedia.org/T316996 (10colewhite) [13:59:32] (03CR) 10Volans: [C: 03+1] "LGTM, small fix inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/830158 (owner: 10Ayounsi) 
[14:01:01] (03CR) 10Vgutierrez: [C: 03+1] [WiP] sort query parameters in URLs (031 comment) [software/purged] - 10https://gerrit.wikimedia.org/r/830124 (https://phabricator.wikimedia.org/T317064) (owner: 10Giuseppe Lavagetto) [14:01:04] (03CR) 10Hashar: "This change is ready for review." (033 comments) [software/gerrit/plugins/events-wikimedia] - 10https://gerrit.wikimedia.org/r/814807 (owner: 10Hashar) [14:01:15] (03PS9) 10Hashar: Send events to Wikimedia EventGate [software/gerrit/plugins/events-wikimedia] - 10https://gerrit.wikimedia.org/r/814807 [14:01:30] (03CR) 10CI reject: [V: 04-1] Send events to Wikimedia EventGate [software/gerrit/plugins/events-wikimedia] - 10https://gerrit.wikimedia.org/r/814807 (owner: 10Hashar) [14:02:31] 10SRE, 10Thumbor: Thumbor units failing / service general slowness - https://phabricator.wikimedia.org/T312722 (10MatthewVernon) @fgiunchedi you OK with this being "Medium" priority? I'm trying to reduce the vast pile of untriaged tasks on the clinic duty board... [14:03:07] (03CR) 10Vgutierrez: [C: 03+1] [WiP] sort query parameters in URLs (031 comment) [software/purged] - 10https://gerrit.wikimedia.org/r/830124 (https://phabricator.wikimedia.org/T317064) (owner: 10Giuseppe Lavagetto) [14:04:41] PROBLEM - Swift https frontend on ms-fe1012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.184 second response time https://wikitech.wikimedia.org/wiki/Swift [14:04:42] 10SRE, 10SRE-swift-storage, 10Data Engineering Planning, 10Wikidata, and 3 others: Clean up the rdf-streaming-updater-codfw container from thanos-swift. - https://phabricator.wikimedia.org/T316031 (10bking) Per @fgiunchedi 's comment above, I started the delete task on `cumin1001` again, this time targetin... 
[14:04:48] 10SRE, 10Data-Services: 2022-09-04 Scraping from AS714 (Apple) against dumps.wikimedia.org saturating network links - https://phabricator.wikimedia.org/T317001 (10MatthewVernon) @CDanis would you mind setting a priority for this task, please? I'm trying to get the clinic duty untriaged tasks backlog to a more... [14:04:59] 10SRE, 10Data-Persistence (Consultation), 10MediaWiki-extensions-Phonos, 10serviceops, 10Community-Tech (CommTech-Sprint-32): SRE/Data Persistence consultation — use of FSFileBackend for caching audio files - https://phabricator.wikimedia.org/T314789 (10JMeybohm) p:05Triage→03Medium [14:06:44] (03PS9) 10Jelto: sre.gitlab.reboot-runner: add cookbook to restart gitlab-runners [cookbooks] - 10https://gerrit.wikimedia.org/r/827456 (https://phabricator.wikimedia.org/T295481) [14:07:01] RECOVERY - Swift https frontend on ms-fe1012 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.008 second response time https://wikitech.wikimedia.org/wiki/Swift [14:08:15] !log cgoubert@puppetmaster1001 conftool action : set/pooled=no:weight=10; selector: dc=eqiad,cluster=parsoid,name=parse1011.eqiad.wmnet [14:10:01] wasn't that the proxy that failed last time? [14:10:05] RECOVERY - Disk space on ms-be1071 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=ms-be1071&var-datasource=eqiad+prometheus/ops [14:10:19] RECOVERY - Check systemd state on ms-fe1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:10:20] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Two failed disks in ms-be1071 - https://phabricator.wikimedia.org/T315437 (10Jclark-ctr) Submitted ticket in dell portal Confirmed: Service Request 150914190 was successfully submitted. [14:15:21] !log jayme@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'. 
[14:15:32] !log jayme@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [14:18:56] (03CR) 10Jelto: "thanks for the review! Answered in-line." [cookbooks] - 10https://gerrit.wikimedia.org/r/827456 (https://phabricator.wikimedia.org/T295481) (owner: 10Jelto) [14:19:04] (03CR) 10Hashar: "recheck due to CI failure" [software/gerrit/plugins/events-wikimedia] - 10https://gerrit.wikimedia.org/r/814807 (owner: 10Hashar) [14:27:32] !log cgoubert@cumin1001 START - Cookbook sre.hosts.remove-downtime for parse1011.eqiad.wmnet [14:27:32] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for parse1011.eqiad.wmnet [14:28:15] !log pooled parse1011.eqiad.wmnet (php 7.4 only) in parsoid cluster T307219 [14:28:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:28:17] T307219: Put parse parse10[01-24] in production - https://phabricator.wikimedia.org/T307219 [14:28:25] (03CR) 10Hashar: "I have added an integration test for the publisher and even managed to retrieve the posted Json to run an assertion against it. 
Guice and " [software/gerrit/plugins/events-wikimedia] - 10https://gerrit.wikimedia.org/r/814807 (owner: 10Hashar) [14:29:02] (03CR) 10SBassett: [C: 03+1] Remove peek-admins grup [puppet] - 10https://gerrit.wikimedia.org/r/829828 (owner: 10Muehlenhoff) [14:29:25] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/api-gateway: sync [14:29:29] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/api-gateway: sync [14:29:53] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/api-gateway: sync [14:29:56] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/api-gateway: sync [14:30:03] !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/api-gateway: sync [14:30:08] !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/api-gateway: sync [14:30:29] PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [14:33:37] RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [14:34:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1180 (re)pooling @ 5%: Repooling after cloning another host', diff saved to https://phabricator.wikimedia.org/P33961 and previous config saved to /var/cache/conftool/dbconfig/20220906-143435-root.json [14:36:00] (03PS1) 10Btullis: Deploy an updated datahub version [deployment-charts] - 10https://gerrit.wikimedia.org/r/830181 (https://phabricator.wikimedia.org/T317053) [14:36:13] !log pt1979@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host kafka-logging1004 [14:36:28] !log pt1979@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host kafka-logging1004 [14:37:33] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [14:39:33] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:39:46] !log depooled wtp1044.eqiad.wmnet from parsoid cluster T307219 [14:39:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:48] T307219: Put parse parse10[01-24] in production - https://phabricator.wikimedia.org/T307219 [14:40:27] (03PS3) 10Volans: sre.hosts.provision: ask to setup the RAID [cookbooks] - 10https://gerrit.wikimedia.org/r/812448 [14:40:44] (03CR) 10Volans: [C: 03+2] sre.hosts.provision: ask to setup the RAID [cookbooks] - 10https://gerrit.wikimedia.org/r/812448 (owner: 10Volans) [14:40:55] PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [14:41:18] (03CR) 10Volans: "I did a full pass. In general LGTM. I've left some minor/typo/nit comments inline." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/816730 (owner: 10Ayounsi) [14:41:27] 10SRE, 10Infrastructure-Foundations, 10vm-requests: Resize webperf1004/2004 VM for arc-lamp - https://phabricator.wikimedia.org/T316223 (10jbond) p:05Triage→03Medium [14:41:30] (03CR) 10Ottomata: [C: 03+1] "Nice! Looks good to me." [puppet] - 10https://gerrit.wikimedia.org/r/827526 (https://phabricator.wikimedia.org/T315580) (owner: 10Snwachukwu) [14:42:03] 10SRE, 10Cloud-VPS, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (Kanban): Use vlan trunking instead of multiple physical interfaces - https://phabricator.wikimedia.org/T316114 (10jbond) p:05Triage→03Medium [14:42:20] 10SRE, 10Infrastructure-Foundations: Identity Management System for Wikimedia developer accounts - https://phabricator.wikimedia.org/T315867 (10jbond) p:05Triage→03Medium [14:42:51] 10SRE, 10Infrastructure-Foundations, 10Observability-Alerting: icinga raid montioring inoperable for H750 controllers - https://phabricator.wikimedia.org/T315608 (10jbond) p:05Triage→03High [14:42:57] 10SRE, 10Infrastructure-Foundations, 10vm-requests: Resize webperf1004/2004 VM for arc-lamp - https://phabricator.wikimedia.org/T316223 (10MoritzMuehlenhoff) Retitling the task and dropping vm-requests [14:43:33] 10SRE, 10Phabricator, 10Traffic, 10Wikimedia-Incident: Phabricator was logging out users repeatedly (2022-08-26) - https://phabricator.wikimedia.org/T316337 (10Vgutierrez) p:05Triage→03Medium [14:43:39] 10SRE: Expand RAM on arclamp hosts and move them to baremetal - https://phabricator.wikimedia.org/T316223 (10MoritzMuehlenhoff) [14:44:07] RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [14:44:33] 10SRE, 10Data Engineering Planning, 10Data-Engineering-Operations, 10Mail: Add xcollazo@wikimedia.org to the analytics-alerts mailing list - https://phabricator.wikimedia.org/T315486 (10jbond) p:05Triage→03Medium [14:45:05] (03Merged) 10jenkins-bot: sre.hosts.provision: ask to setup the RAID [cookbooks] - 10https://gerrit.wikimedia.org/r/812448 (owner: 10Volans) [14:45:26] (03PS2) 10Volans: sre.hardware.upgrade-firmware: sort drivers (2) [cookbooks] - 10https://gerrit.wikimedia.org/r/830121 [14:45:29] PROBLEM - Swift https backend on ms-fe1012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.105 second response time https://wikitech.wikimedia.org/wiki/Swift [14:45:30] (03CR) 10Volans: [C: 03+2] sre.hardware.upgrade-firmware: sort drivers (2) [cookbooks] - 10https://gerrit.wikimedia.org/r/830121 (owner: 10Volans) [14:45:32] (03CR) 10Muehlenhoff: [C: 03+2] Remove peek-admins grup [puppet] - 10https://gerrit.wikimedia.org/r/829828 (owner: 10Muehlenhoff) [14:45:42] (03CR) 10Volans: [C: 03+1] "LGTM, ship it :)" [cookbooks] - 10https://gerrit.wikimedia.org/r/827456 (https://phabricator.wikimedia.org/T295481) (owner: 10Jelto) [14:46:27] !log cgoubert@puppetmaster1001 conftool action : set/pooled=no:weight=10; selector: dc=eqiad,cluster=parsoid,name=parse1012.eqiad.wmnet [14:48:57] RECOVERY - Swift https backend on ms-fe1012 is OK: HTTP OK: HTTP/1.1 200 OK - 451 bytes in 0.034 second response time https://wikitech.wikimedia.org/wiki/Swift [14:48:57] (03Merged) 10jenkins-bot: sre.hardware.upgrade-firmware: sort drivers (2) [cookbooks] - 10https://gerrit.wikimedia.org/r/830121 (owner: 10Volans) [14:49:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1180 (re)pooling @ 10%: Repooling after cloning another host', diff saved to https://phabricator.wikimedia.org/P33962 and previous config saved to 
/var/cache/conftool/dbconfig/20220906-144940-root.json [14:51:08] (03CR) 10Jelto: [C: 03+2] sre.gitlab.reboot-runner: add cookbook to restart gitlab-runners [cookbooks] - 10https://gerrit.wikimedia.org/r/827456 (https://phabricator.wikimedia.org/T295481) (owner: 10Jelto) [14:53:25] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host kafka-logging1004.mgmt.eqiad.wmnet with reboot policy FORCED [14:55:14] (03CR) 10Ottomata: sre: followup on Kafka partition replication alerts (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/826214 (https://phabricator.wikimedia.org/T309010) (owner: 10Filippo Giunchedi) [14:55:21] !log cgoubert@cumin1001 START - Cookbook sre.hosts.remove-downtime for parse1012.eqiad.wmnet [14:55:21] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for parse1012.eqiad.wmnet [14:55:24] (03PS1) 10Muehlenhoff: Cleanup some more stale references/comments to crons which are now systemd timers [puppet] - 10https://gerrit.wikimedia.org/r/830184 (https://phabricator.wikimedia.org/T273673) [14:55:52] 10SRE, 10Infrastructure-Foundations, 10netops: Detect IP address collisions - https://phabricator.wikimedia.org/T189522 (10ayounsi) p:05Triage→03Low [14:56:09] (03CR) 10Muehlenhoff: [C: 03+1] c:raid::md move from crontab to systemd timer (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/819577 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) [14:57:17] (03PS10) 10Jelto: sre.gitlab.reboot-runner: add cookbook to restart gitlab-runners [cookbooks] - 10https://gerrit.wikimedia.org/r/827456 (https://phabricator.wikimedia.org/T295481) [14:58:45] (03CR) 10CI reject: [V: 04-1] Cleanup some more stale references/comments to crons which are now systemd timers [puppet] - 10https://gerrit.wikimedia.org/r/830184 (https://phabricator.wikimedia.org/T273673) (owner: 10Muehlenhoff) [14:58:51] !log pooled parse1012.eqiad.wmnet (php 7.4 only) in parsoid cluster T307219 [14:58:53] Logged the message 
at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:58:54] T307219: Put parse parse10[01-24] in production - https://phabricator.wikimedia.org/T307219 [14:59:14] (03PS4) 10Ayounsi: network.prepare-upgrade: cleanup [cookbooks] - 10https://gerrit.wikimedia.org/r/830158 [15:02:13] (03CR) 10Ssingh: [V: 03+1 C: 03+2] trafficserver: send SIGUSR2 on log rotation [puppet] - 10https://gerrit.wikimedia.org/r/829034 (owner: 10Ssingh) [15:02:45] moritzm: ok to merge yours? [15:03:53] sukhe: please do, yes! [15:03:57] thanks [15:04:23] 10SRE, 10Infrastructure-Foundations, 10netops: Upgrade management routers and switches to Junos 21 - https://phabricator.wikimedia.org/T316529 (10Papaul) @ayounsi the mentioned : "All management routers are running Junos 20 except mr1-codfw and mr1-esams that are running 18." and "The current Junos recommen... [15:04:29] PROBLEM - Swift https backend on ms-fe1012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.089 second response time https://wikitech.wikimedia.org/wiki/Swift [15:04:45] (03PS2) 10Muehlenhoff: Cleanup some more stale references/comments to crons [puppet] - 10https://gerrit.wikimedia.org/r/830184 (https://phabricator.wikimedia.org/T273673) [15:04:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1180 (re)pooling @ 25%: Repooling after cloning another host', diff saved to https://phabricator.wikimedia.org/P33963 and previous config saved to /var/cache/conftool/dbconfig/20220906-150445-root.json [15:06:14] (03PS4) 10Ayounsi: Spicerack: add configuration file and API key for PeeringDB [puppet] - 10https://gerrit.wikimedia.org/r/819562 [15:06:19] (03CR) 10Ayounsi: Spicerack: add configuration file and API key for PeeringDB (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/819562 (owner: 10Ayounsi) [15:06:28] (03CR) 10Jelto: [C: 03+2] sre.gitlab.reboot-runner: add cookbook to restart gitlab-runners [cookbooks] - 10https://gerrit.wikimedia.org/r/827456 
(https://phabricator.wikimedia.org/T295481) (owner: 10Jelto) [15:08:37] 10SRE, 10ops-codfw, 10decommission-hardware, 10Patch-For-Review: decommission elastic2035.codfw.wmnet - https://phabricator.wikimedia.org/T316729 (10Papaul) 05Open→03Declined There was already a task for this @ https://phabricator.wikimedia.org/T300946 [15:08:41] 10SRE, 10Infrastructure-Foundations, 10netops: Upgrade management routers and switches to Junos 21 - https://phabricator.wikimedia.org/T316529 (10ayounsi) Only those 2 from 18 to 21. 20 is recent enough. [15:08:54] !log depooled wtp1045.eqiad.wmnet from parsoid cluster T307219 [15:08:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:08:59] T307219: Put parse parse10[01-24] in production - https://phabricator.wikimedia.org/T307219 [15:09:21] RECOVERY - Swift https backend on ms-fe1012 is OK: HTTP OK: HTTP/1.1 200 OK - 451 bytes in 0.021 second response time https://wikitech.wikimedia.org/wiki/Swift [15:09:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3316 (T314041)', diff saved to https://phabricator.wikimedia.org/P33964 and previous config saved to /var/cache/conftool/dbconfig/20220906-150928-ladsgroup.json [15:09:30] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2158.codfw.wmnet with reason: Maintenance [15:09:31] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [15:09:44] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2158.codfw.wmnet with reason: Maintenance [15:09:45] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2095.codfw.wmnet with reason: Maintenance [15:09:48] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2095.codfw.wmnet with reason: Maintenance [15:09:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling 
db2158 (T314041)', diff saved to https://phabricator.wikimedia.org/P33965 and previous config saved to /var/cache/conftool/dbconfig/20220906-150953-ladsgroup.json [15:10:47] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Degraded RAID on ms-be2035 - https://phabricator.wikimedia.org/T314509 (10Papaul) @MatthewVernon Did you decide on what you are going to do with this node? [15:12:04] (03Merged) 10jenkins-bot: sre.gitlab.reboot-runner: add cookbook to restart gitlab-runners [cookbooks] - 10https://gerrit.wikimedia.org/r/827456 (https://phabricator.wikimedia.org/T295481) (owner: 10Jelto) [15:12:06] 10SRE, 10Infrastructure-Foundations, 10netops: Upgrade management routers and switches to Junos 21 - https://phabricator.wikimedia.org/T316529 (10Papaul) Thanks [15:12:29] !log cgoubert@cumin1001 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on wtp[1041-1043].eqiad.wmnet with reason: Downtiming replaced wtp servers [15:12:45] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on wtp[1041-1043].eqiad.wmnet with reason: Downtiming replaced wtp servers [15:14:01] !log cgoubert@puppetmaster1001 conftool action : set/pooled=inactive; selector: dc=eqiad,cluster=parsoid,name=wtp1041.eqiad.wmnet [15:14:11] !log cgoubert@puppetmaster1001 conftool action : set/pooled=inactive; selector: dc=eqiad,cluster=parsoid,name=wtp1042.eqiad.wmnet [15:14:19] !log cgoubert@puppetmaster1001 conftool action : set/pooled=inactive; selector: dc=eqiad,cluster=parsoid,name=wtp1043.eqiad.wmnet [15:15:13] !log Set wtp10[41-43].eqiad.wmnet inactive pending decommission T317025 [15:15:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:16] T317025: Decommission wtp10[25-48].eqiad.wmnet - https://phabricator.wikimedia.org/T317025 [15:16:26] 10SRE-tools, 10Infrastructure-Foundations: sre.hosts.downtime: add network devices support - https://phabricator.wikimedia.org/T317082 (10ayounsi) Ok, thanks. 
How are the alertmanager silences managed? would the command below do everything needed: * all Icinga "hosts" * alertmanager (and LibreNMS by extension... [15:19:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1180 (re)pooling @ 50%: Repooling after cloning another host', diff saved to https://phabricator.wikimedia.org/P33966 and previous config saved to /var/cache/conftool/dbconfig/20220906-151950-root.json [15:20:28] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Degraded RAID on ms-be2035 - https://phabricator.wikimedia.org/T314509 (10MatthewVernon) It's still being drained - as per my note last month it's taking a while... [15:20:54] !log jelto@cumin1001 START - Cookbook sre.gitlab.reboot-runner rolling reboot on A:gitlab-runner [15:21:37] !log jelto@cumin1001 END (FAIL) - Cookbook sre.gitlab.reboot-runner (exit_code=1) rolling reboot on A:gitlab-runner [15:22:09] PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [15:22:31] (03CR) 10Giuseppe Lavagetto: [WiP] sort query parameters in URLs (031 comment) [software/purged] - 10https://gerrit.wikimedia.org/r/830124 (https://phabricator.wikimedia.org/T317064) (owner: 10Giuseppe Lavagetto) [15:23:08] (03PS3) 10Giuseppe Lavagetto: Sort query parameters in URLs [software/purged] - 10https://gerrit.wikimedia.org/r/830124 (https://phabricator.wikimedia.org/T317064) [15:23:55] (03CR) 10Giuseppe Lavagetto: Sort query parameters in URLs (031 comment) [software/purged] - 10https://gerrit.wikimedia.org/r/830124 (https://phabricator.wikimedia.org/T317064) (owner: 10Giuseppe Lavagetto) [15:24:28] (03CR) 10Ayounsi: [C: 03+2] junos_set_interface_config: fix logic error [cookbooks] - 10https://gerrit.wikimedia.org/r/821688 (owner: 10Ayounsi) [15:24:33] (03PS3) 10Ayounsi: junos_set_interface_config: fix logic error [cookbooks] - 
10https://gerrit.wikimedia.org/r/821688 [15:27:03] RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [15:27:42] (03CR) 10Btullis: [C: 03+2] Deploy an updated datahub version [deployment-charts] - 10https://gerrit.wikimedia.org/r/830181 (https://phabricator.wikimedia.org/T317053) (owner: 10Btullis) [15:31:23] (03Merged) 10jenkins-bot: Deploy an updated datahub version [deployment-charts] - 10https://gerrit.wikimedia.org/r/830181 (https://phabricator.wikimedia.org/T317053) (owner: 10Btullis) [15:31:37] 10SRE, 10Phabricator, 10Traffic, 10Wikimedia-Incident: Phabricator was logging out users repeatedly (2022-08-26) - https://phabricator.wikimedia.org/T316337 (10hashar) I have posted the very few actions I have done on the incident documentation. Given the root cause was immediately found (trafficserver) an... [15:34:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1180 (re)pooling @ 75%: Repooling after cloning another host', diff saved to https://phabricator.wikimedia.org/P33967 and previous config saved to /var/cache/conftool/dbconfig/20220906-153454-root.json [15:37:29] 10SRE: Expand RAM on arclamp hosts and move them to baremetal - https://phabricator.wikimedia.org/T316223 (10Krinkle) [15:39:35] (03CR) 10Ayounsi: [C: 03+2] network.prepare-upgrade: cleanup (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/830158 (owner: 10Ayounsi) [15:39:40] (03PS5) 10Ayounsi: network.prepare-upgrade: cleanup [cookbooks] - 10https://gerrit.wikimedia.org/r/830158 [15:40:34] (03PS1) 10Jelto: sre.gitlab.reboot-runner: fix pre_scripts call [cookbooks] - 10https://gerrit.wikimedia.org/r/830189 (https://phabricator.wikimedia.org/T295481) [15:43:33] !log root@cumin1001 START - Cookbook sre.network.prepare-upgrade [15:43:34] !log root@cumin1001 END (FAIL) - Cookbook sre.network.prepare-upgrade 
(exit_code=99) [15:43:48] !log root@cumin1001 START - Cookbook sre.network.prepare-upgrade [15:43:58] (03CR) 10CI reject: [V: 04-1] sre.gitlab.reboot-runner: fix pre_scripts call [cookbooks] - 10https://gerrit.wikimedia.org/r/830189 (https://phabricator.wikimedia.org/T295481) (owner: 10Jelto) [15:44:02] !log root@cumin1001 END (FAIL) - Cookbook sre.network.prepare-upgrade (exit_code=99) [15:45:21] (03PS2) 10Jelto: sre.gitlab.reboot-runner: fix pre_scripts call [cookbooks] - 10https://gerrit.wikimedia.org/r/830189 (https://phabricator.wikimedia.org/T295481) [15:45:26] 10SRE, 10Phabricator, 10Traffic, 10Wikimedia-Incident: Phabricator was logging out users repeatedly (2022-08-26) - https://phabricator.wikimedia.org/T316337 (10jcrespo) > on the incident documentation Where? There is no incident doc yet (or I couldn't find one on Wikitech) [15:48:00] (03CR) 10Ori: Sort query parameters in URLs (031 comment) [software/purged] - 10https://gerrit.wikimedia.org/r/830124 (https://phabricator.wikimedia.org/T317064) (owner: 10Giuseppe Lavagetto) [15:48:14] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Two failed disks in ms-be1071 - https://phabricator.wikimedia.org/T315437 (10Jclark-ctr) 05Open→03Resolved Imported Foreign drive. 
errors cleared [15:48:19] !log ayounsi@cumin1001 START - Cookbook sre.network.prepare-upgrade [15:48:51] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/830189 (https://phabricator.wikimedia.org/T295481) (owner: 10Jelto) [15:50:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1180 (re)pooling @ 100%: Repooling after cloning another host', diff saved to https://phabricator.wikimedia.org/P33968 and previous config saved to /var/cache/conftool/dbconfig/20220906-154959-root.json [15:50:40] (03PS3) 10Jelto: sre.gitlab.reboot-runner: fix pre_scripts call [cookbooks] - 10https://gerrit.wikimedia.org/r/830189 (https://phabricator.wikimedia.org/T295481) [15:52:29] (03PS1) 10Clément Goubert: scap/cumin: switch parsoid eqiad canaries [puppet] - 10https://gerrit.wikimedia.org/r/830193 (https://phabricator.wikimedia.org/T307219) [15:53:51] (03CR) 10Ori: Sort query parameters in URLs (031 comment) [software/purged] - 10https://gerrit.wikimedia.org/r/830124 (https://phabricator.wikimedia.org/T317064) (owner: 10Giuseppe Lavagetto) [15:54:21] (03CR) 10Clément Goubert: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37131/console" [puppet] - 10https://gerrit.wikimedia.org/r/830193 (https://phabricator.wikimedia.org/T307219) (owner: 10Clément Goubert) [15:55:03] 10SRE-tools, 10Infrastructure-Foundations: sre.hosts.downtime: add network devices support - https://phabricator.wikimedia.org/T317082 (10Volans) It would try to do the same thing on Alertmanager, yes, assuming there are alerts that match the given hostnames i the proper tag :) [15:56:03] (03CR) 10Jelto: [C: 03+2] sre.gitlab.reboot-runner: fix pre_scripts call [cookbooks] - 10https://gerrit.wikimedia.org/r/830189 (https://phabricator.wikimedia.org/T295481) (owner: 10Jelto) [15:57:33] (03PS2) 10Clément Goubert: scap/cumin: switch parsoid eqiad canaries [puppet] - 10https://gerrit.wikimedia.org/r/830193 
(https://phabricator.wikimedia.org/T307219) [15:58:56] (03CR) 10Clément Goubert: "Current canary hosts are in tomorrow's list of servers to cycle for new parse servers. This commit is in preparation of that switch." [puppet] - 10https://gerrit.wikimedia.org/r/830193 (https://phabricator.wikimedia.org/T307219) (owner: 10Clément Goubert) [16:00:00] (03Merged) 10jenkins-bot: sre.gitlab.reboot-runner: fix pre_scripts call [cookbooks] - 10https://gerrit.wikimedia.org/r/830189 (https://phabricator.wikimedia.org/T295481) (owner: 10Jelto) [16:00:04] jbond and rzl: #bothumor My software never has bugs. It just develops random features. Rise for Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220906T1600). [16:00:04] No Gerrit patches in the queue for this window AFAICS. [16:00:54] 10SRE-tools, 10Infrastructure-Foundations: sre.hosts.downtime: add network devices support - https://phabricator.wikimedia.org/T317082 (10ayounsi) 05Open→03Resolved a:03ayounsi Awesome, doc updated! https://wikitech.wikimedia.org/wiki/Juniper_router_upgrade [16:01:27] !log jelto@cumin1001 START - Cookbook sre.gitlab.reboot-runner rolling reboot on A:gitlab-runner [16:06:32] (03CR) 10Cwhite: [C: 03+1] logstash: output thanos-query syslogs to kafka and local file [puppet] - 10https://gerrit.wikimedia.org/r/828960 (https://phabricator.wikimedia.org/T316867) (owner: 10Herron) [16:07:28] 10SRE, 10Infrastructure-Foundations, 10netops: Upgrade management routers and switches to Junos 21 - https://phabricator.wikimedia.org/T316529 (10Papaul) [16:07:52] (03CR) 10Clément Goubert: "Regarding the operational procedure for this patch, I suppose I will have to :" [puppet] - 10https://gerrit.wikimedia.org/r/830193 (https://phabricator.wikimedia.org/T307219) (owner: 10Clément Goubert) [16:10:20] (03CR) 10Herron: [C: 03+2] "Thanks for the reviews!" 
[puppet] - 10https://gerrit.wikimedia.org/r/828960 (https://phabricator.wikimedia.org/T316867) (owner: 10Herron) [16:12:17] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kafka-logging1004.mgmt.eqiad.wmnet with reboot policy FORCED [16:12:35] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.prepare-upgrade (exit_code=0) [16:16:41] 10SRE, 10Observability-Logging, 10Patch-For-Review: Consider bringing thanos-query logs into logstash - https://phabricator.wikimedia.org/T316867 (10colewhite) +1 The logs appear very similar in format to prometheus-blackbox-exporter: that is logfmt. It'd be great to get them to follow the same processing... [16:18:04] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kafka-logging1004'] [16:19:06] (03PS1) 10Hnowlan: api-gateway: don't conditionally rewrite if asked, always do it [deployment-charts] - 10https://gerrit.wikimedia.org/r/830196 [16:19:31] (03PS1) 10Ebernhardson: Ensure namespace filters is passed as a list [extensions/CirrusSearch] (wmf/1.39.0-wmf.28) - 10https://gerrit.wikimedia.org/r/830214 [16:19:39] (03PS2) 10Hnowlan: api-gateway: don't conditionally rewrite if asked, always do it [deployment-charts] - 10https://gerrit.wikimedia.org/r/830196 [16:19:54] (03PS1) 10Sergio Gimeno: Enable the Vue version of the mentee overview in pilot wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/830197 (https://phabricator.wikimedia.org/T300532) [16:20:36] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main [16:21:21] PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [16:22:03] (03CR) 10Sergio Gimeno: [C: 04-1] "Don't merge until I31183464ea5240bfeb2be77637ba2b9cac91b8bc is backported to wmf.27" [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/830197 (https://phabricator.wikimedia.org/T300532) (owner: 10Sergio Gimeno) [16:22:14] !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main [16:22:17] 10SRE, 10LDAP-Access-Requests, 10WMF-NDA-Requests, 10Discovery-Search (Current work): Add pfischer to #wmf-nda on Phab and to #wmf on LDAP - https://phabricator.wikimedia.org/T316922 (10pfischer) @MatthewVernon, thank you, for looking into this. I already created a dev account, please have look at the atta... [16:22:39] !log btullis@deploy1002 helmfile [codfw] START helmfile.d/services/datahub: apply on main [16:23:26] !log btullis@deploy1002 helmfile [codfw] DONE helmfile.d/services/datahub: sync on main [16:23:49] RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [16:24:16] !log btullis@deploy1002 helmfile [eqiad] START helmfile.d/services/datahub: apply on main [16:25:21] !log btullis@deploy1002 helmfile [eqiad] DONE helmfile.d/services/datahub: sync on main [16:27:08] 10SRE, 10SRE-Access-Requests, 10Data Engineering Planning, 10Discovery-Search (Current work), 10Patch-For-Review: Production Shell access for Peter - https://phabricator.wikimedia.org/T316090 (10thcipriani) >>! In T316090#8207986, @Jelto wrote: > Welcome @pfischer! Thanks for the request and all the appr... 
[16:28:41] <_joe_> jouncebot: next [16:28:41] In 1 hour(s) and 31 minute(s): MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220906T1800) [16:29:26] (03PS4) 10BCornwall: prometheus: Parse ATS config to node exporter text [puppet] - 10https://gerrit.wikimedia.org/r/826362 (https://phabricator.wikimedia.org/T292815) [16:30:15] (03CR) 10CI reject: [V: 04-1] prometheus: Parse ATS config to node exporter text [puppet] - 10https://gerrit.wikimedia.org/r/826362 (https://phabricator.wikimedia.org/T292815) (owner: 10BCornwall) [16:31:02] (03PS1) 10Sergio Gimeno: Mentee overview(vue): prevent clicks on more recent edit buttons to submit the filters [extensions/GrowthExperiments] (wmf/1.39.0-wmf.27) - 10https://gerrit.wikimedia.org/r/830199 (https://phabricator.wikimedia.org/T316926) [16:32:39] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [16:35:28] (03PS5) 10BCornwall: prometheus: Parse ATS config to node exporter text [puppet] - 10https://gerrit.wikimedia.org/r/826362 (https://phabricator.wikimedia.org/T292815) [16:36:21] (03CR) 10BCornwall: [C: 03+2] varnish: Stop sending analytics cookies to API [puppet] - 10https://gerrit.wikimedia.org/r/824793 (https://phabricator.wikimedia.org/T260943) (owner: 10BCornwall) [16:36:41] !log pt1979@cumin2002 END (ERROR) - Cookbook sre.hardware.upgrade-firmware (exit_code=97) upgrade firmware for hosts ['kafka-logging1004'] [16:38:04] (03CR) 10Dzahn: vrts: install vrts script (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/828673 (https://phabricator.wikimedia.org/T317059) (owner: 10AOkoth) [16:38:12] (03CR) 10Giuseppe Lavagetto: [C: 03+2] docker_registry_ha: Authorize GitLab trusted runners using JWT [puppet] - 10https://gerrit.wikimedia.org/r/826674 
(https://phabricator.wikimedia.org/T308501) (owner: 10Dduvall) [16:38:50] _joe_: ok to merge yours? I am merging brett's change [16:38:59] <_joe_> sukhe: yes ofc [16:38:59] (03PS1) 10Andrew Bogott: Move dumps from labstore1006 to clouddumps1001 [dns] - 10https://gerrit.wikimedia.org/r/830200 (https://phabricator.wikimedia.org/T309346) [16:39:02] thanks! [16:39:21] brett: ^ done [16:39:39] gracias [16:41:36] * Krinkle testing on mwdebug1002 [16:42:26] 10SRE, 10Observability-Logging, 10Patch-For-Review, 10SRE Observability (FY2022/2023-Q1): Move Kafka logging to the new intermediate PKI - https://phabricator.wikimedia.org/T300130 (10colewhite) Followed up offline. @elukey and I are scheduling a time to complete this. [16:42:30] (03PS2) 10Andrew Bogott: Dumps: switch to using clouddumps hosts rather than the old labstores. [puppet] - 10https://gerrit.wikimedia.org/r/828102 (https://phabricator.wikimedia.org/T309346) [16:42:48] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kafka-logging1004'] [16:43:28] (03PS4) 10MdsShakil: Add localized wordmark for Bengali Wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/830213 (https://phabricator.wikimedia.org/T316953) [16:44:21] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['kafka-logging1004'] [16:44:33] (03CR) 10Dzahn: vrts: install vrts script (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/828673 (https://phabricator.wikimedia.org/T317059) (owner: 10AOkoth) [16:44:45] (03CR) 10Krinkle: [C: 03+2] Limit "CentralAuth" log channel to level=info and above [mediawiki-config] - 10https://gerrit.wikimedia.org/r/812478 (https://phabricator.wikimedia.org/T312704) (owner: 10Krinkle) [16:44:50] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kafka-logging1004'] [16:44:53] (03CR) 10Andrew Bogott: [C: 03+2] Dumps: switch to using clouddumps 
hosts rather than the old labstores. [puppet] - 10https://gerrit.wikimedia.org/r/828102 (https://phabricator.wikimedia.org/T309346) (owner: 10Andrew Bogott) [16:45:06] (03PS5) 10MdsShakil: Add localized wordmark for Bengali Wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/830213 (https://phabricator.wikimedia.org/T316953) [16:45:06] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['kafka-logging1004'] [16:45:53] PROBLEM - Docker registry HTTPS interface certificate expiry on registry1003 is CRITICAL: connect to address 10.64.0.93 and port 443: Connection refused https://wikitech.wikimedia.org/wiki/Docker [16:46:05] (03CR) 10Krinkle: [C: 03+2] Remove unused 'CentralAuthRename' log config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/812479 (https://phabricator.wikimedia.org/T312704) (owner: 10Krinkle) [16:46:20] (03CR) 10Krinkle: [C: 03+2] mediawiki.base: Restore and document importScript "once" behaviour [core] (wmf/1.39.0-wmf.27) - 10https://gerrit.wikimedia.org/r/829772 (owner: 10Krinkle) [16:46:26] 10SRE, 10LDAP-Access-Requests, 10WMF-NDA-Requests, 10Discovery-Search (Current work): Add pfischer to #wmf-nda on Phab and to #wmf on LDAP - https://phabricator.wikimedia.org/T316922 (10bking) Update: Unfortunately, even though @pfischer has created the Wikitech dev account, nothing has changed since @Mat... [16:46:37] (03Merged) 10jenkins-bot: Limit "CentralAuth" log channel to level=info and above [mediawiki-config] - 10https://gerrit.wikimedia.org/r/812478 (https://phabricator.wikimedia.org/T312704) (owner: 10Krinkle) [16:47:18] (03CR) 10Dzahn: "hey, so.. just like Alex' comments.. please also see my comments as what can be done in a follow-up patch. 
I don't mean to make perfect th" [puppet] - 10https://gerrit.wikimedia.org/r/828673 (https://phabricator.wikimedia.org/T317059) (owner: 10AOkoth) [16:47:23] (03Merged) 10jenkins-bot: Remove unused 'CentralAuthRename' log config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/812479 (https://phabricator.wikimedia.org/T312704) (owner: 10Krinkle) [16:47:41] (03CR) 10ArielGlenn: [C: 03+1] "Assuming that everything checks out on Iefa943b3d7892210d576772ef201b70b11b4e205 this looks good to go" [dns] - 10https://gerrit.wikimedia.org/r/830200 (https://phabricator.wikimedia.org/T309346) (owner: 10Andrew Bogott) [16:47:53] !log jelto@cumin1001 END (PASS) - Cookbook sre.gitlab.reboot-runner (exit_code=0) rolling reboot on A:gitlab-runner [16:48:21] RECOVERY - Docker registry HTTPS interface certificate expiry on registry1003 is OK: OK - Certificate docker-registry.discovery.wmnet will expire on Mon 26 Aug 2024 02:52:23 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Docker [16:50:00] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [16:50:53] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [16:50:54] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [16:51:28] (03CR) 10Dzahn: "yea, it's not really that I know why the cloud range is in there but I also can't imagine what would break if it's removed." 
[puppet] - 10https://gerrit.wikimedia.org/r/828025 (https://phabricator.wikimedia.org/T265864) (owner: 10Ayounsi) [16:51:50] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [16:54:36] (03PS1) 10Sergio Gimeno: Enable the topic match mode in all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/830202 (https://phabricator.wikimedia.org/T305408) [16:54:46] (03PS6) 10MdsShakil: Add localized wordmark for Bengali Wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/830213 (https://phabricator.wikimedia.org/T316953) [16:54:53] (03CR) 10Andrew Bogott: [C: 03+2] Move dumps from labstore1006 to clouddumps1001 [dns] - 10https://gerrit.wikimedia.org/r/830200 (https://phabricator.wikimedia.org/T309346) (owner: 10Andrew Bogott) [16:55:42] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kafka-logging1004'] [16:55:51] PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [16:56:52] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [16:57:05] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:57:51] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [16:57:52] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [16:58:03] (03CR) 10Dzahn: "What this does is remove cloud from trusted hosts in exim config and "acl_check_connect". 
mail from cloud systems would not be accepted an" [puppet] - 10https://gerrit.wikimedia.org/r/828025 (https://phabricator.wikimedia.org/T265864) (owner: 10Ayounsi) [16:58:21] RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [16:58:51] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [16:59:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2158 (T314041)', diff saved to https://phabricator.wikimedia.org/P33969 and previous config saved to /var/cache/conftool/dbconfig/20220906-165958-ladsgroup.json [17:00:00] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2141.codfw.wmnet with reason: Maintenance [17:00:01] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [17:00:14] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2141.codfw.wmnet with reason: Maintenance [17:02:55] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['kafka-logging1004'] [17:06:29] (03CR) 10CI reject: [V: 04-1] mediawiki.base: Restore and document importScript "once" behaviour [core] (wmf/1.39.0-wmf.27) - 10https://gerrit.wikimedia.org/r/829772 (owner: 10Krinkle) [17:06:48] !log krinkle@deploy1002 Synchronized wmf-config/: (no justification provided) (duration: 03m 50s) [17:08:35] (03PS1) 10Krinkle: Temporarily disable ReferenceListTest::testSerializationStability [extensions/Wikibase] (wmf/1.39.0-wmf.27) - 10https://gerrit.wikimedia.org/r/830217 [17:08:50] (03CR) 10Krinkle: [V: 03+2 C: 03+2] "Unbreak https://gerrit.wikimedia.org/r/c/mediawiki/core/+/829772/ tests" [extensions/Wikibase] (wmf/1.39.0-wmf.27) - 10https://gerrit.wikimedia.org/r/830217 (owner: 
10Krinkle) [17:09:14] (03CR) 10Krinkle: [V: 03+2 C: 03+2] "Test fixed by https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Wikibase/+/830217" [core] (wmf/1.39.0-wmf.27) - 10https://gerrit.wikimedia.org/r/829772 (owner: 10Krinkle) [17:10:07] * Krinkle testing on mwdebug1002 [17:10:59] 10SRE, 10ops-codfw: Recycling Pickup for CODFW - https://phabricator.wikimedia.org/T307694 (10wiki_willy) Certificates for recycling and hard drive shredding, along with the final settlement amount attached. The value of the hardware came in a lot lower than what had been provided to us in the initial estimat... [17:11:01] (03PS1) 10Andrew Bogott: Revert "Dumps: switch to using clouddumps hosts rather than the old labstores." [puppet] - 10https://gerrit.wikimedia.org/r/830218 [17:11:47] (03CR) 10CI reject: [V: 04-1] Revert "Dumps: switch to using clouddumps hosts rather than the old labstores." [puppet] - 10https://gerrit.wikimedia.org/r/830218 (owner: 10Andrew Bogott) [17:11:53] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kafka-logging1004'] [17:14:00] (03PS2) 10Andrew Bogott: Partially revert "Dumps: switch to using clouddumps hosts rather than the old labstores." [puppet] - 10https://gerrit.wikimedia.org/r/830218 [17:14:02] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [17:14:36] (03CR) 10CI reject: [V: 04-1] Partially revert "Dumps: switch to using clouddumps hosts rather than the old labstores." [puppet] - 10https://gerrit.wikimedia.org/r/830218 (owner: 10Andrew Bogott) [17:14:57] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [17:14:58] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [17:15:48] (03PS3) 10Andrew Bogott: Partially revert "Dumps: switch to using clouddumps hosts rather than..." 
[puppet] - 10https://gerrit.wikimedia.org/r/830218 [17:15:55] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [17:16:02] !log krinkle@deploy1002 Synchronized php-1.39.0-wmf.27/resources/src/: I0516527d5cc0 (duration: 03m 50s) [17:17:25] (03CR) 10Andrew Bogott: [C: 03+2] Partially revert "Dumps: switch to using clouddumps hosts rather than..." [puppet] - 10https://gerrit.wikimedia.org/r/830218 (owner: 10Andrew Bogott) [17:17:37] (03PS1) 10Giuseppe Lavagetto: docker-registry: fix nginx configuration [puppet] - 10https://gerrit.wikimedia.org/r/830203 [17:17:46] (03CR) 10ArielGlenn: [C: 03+1] "Please do." [puppet] - 10https://gerrit.wikimedia.org/r/830218 (owner: 10Andrew Bogott) [17:18:13] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['kafka-logging1004'] [17:18:16] (03CR) 10Herron: [C: 03+1] "LGTM overall, please see question inline" [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/829214 (https://phabricator.wikimedia.org/T316921) (owner: 10Vgutierrez) [17:18:55] (03CR) 10Dduvall: [C: 03+1] "Looks right to me based on our discussion and reported errors." 
[puppet] - 10https://gerrit.wikimedia.org/r/830203 (owner: 10Giuseppe Lavagetto) [17:19:11] (03CR) 10Giuseppe Lavagetto: [C: 03+2] docker-registry: fix nginx configuration [puppet] - 10https://gerrit.wikimedia.org/r/830203 (owner: 10Giuseppe Lavagetto) [17:20:23] 10SRE, 10Infrastructure-Foundations, 10cloud-services-team (Kanban): toolforge/paws k8s containers need to know about clouddumps100[12] - https://phabricator.wikimedia.org/T317144 (10Andrew) [17:22:13] PROBLEM - Check systemd state on registry1003 is CRITICAL: CRITICAL - degraded: The following units failed: nginx.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:22:25] PROBLEM - Docker registry HTTPS interface on registry1003 is CRITICAL: connect to address 10.64.0.93 and port 443: Connection refused https://wikitech.wikimedia.org/wiki/Docker [17:22:57] PROBLEM - Docker registry HTTPS interface certificate expiry on registry1003 is CRITICAL: connect to address 10.64.0.93 and port 443: Connection refused https://wikitech.wikimedia.org/wiki/Docker [17:22:59] PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [17:23:27] !log installing dpkg bugfix updates from bullseye point release [17:23:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:25:29] RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [17:26:14] (03PS1) 10BCornwall: admin: Update Brett Cornwall (bcornwall)'s SSH key [puppet] - 10https://gerrit.wikimedia.org/r/830205 [17:27:13] RECOVERY - Check systemd state on registry1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:27:25] RECOVERY - Docker registry HTTPS interface on registry1003 is OK: HTTP OK: HTTP/1.1 200 OK - 3755 bytes in 0.166 second response time https://wikitech.wikimedia.org/wiki/Docker [17:27:55] RECOVERY - Docker registry HTTPS interface certificate expiry on registry1003 is OK: OK - Certificate docker-registry.discovery.wmnet will expire on Mon 26 Aug 2024 02:52:23 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Docker [17:29:29] (03CR) 10Ssingh: [C: 03+1] "Confirmed with Brett on video that the request is genuine." [puppet] - 10https://gerrit.wikimedia.org/r/830205 (owner: 10BCornwall) [17:29:35] (03CR) 10BCornwall: [C: 03+2] admin: Update Brett Cornwall (bcornwall)'s SSH key [puppet] - 10https://gerrit.wikimedia.org/r/830205 (owner: 10BCornwall) [17:30:02] 10SRE, 10LDAP-Access-Requests, 10WMF-NDA-Requests, 10Discovery-Search (Current work): Add pfischer to #wmf-nda on Phab and to #wmf on LDAP - https://phabricator.wikimedia.org/T316922 (10bking) @pfischer Can you add your SSH public key to this ticket so I can add you to shell users? @Gehel Should Peter hav... 
[17:34:48] (03PS1) 10CDanis: klaxon: tox: support python 3.10 [software/klaxon] - 10https://gerrit.wikimedia.org/r/830227 [17:35:04] (03CR) 10CDanis: [C: 03+2] klaxon: tox: support python 3.10 [software/klaxon] - 10https://gerrit.wikimedia.org/r/830227 (owner: 10CDanis) [17:36:06] (03Merged) 10jenkins-bot: klaxon: tox: support python 3.10 [software/klaxon] - 10https://gerrit.wikimedia.org/r/830227 (owner: 10CDanis) [17:37:51] PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [17:39:05] (03PS1) 10JMeybohm: Alert on high lateny of kubelet operations [alerts] - 10https://gerrit.wikimedia.org/r/830228 (https://phabricator.wikimedia.org/T311251) [17:40:43] (03CR) 10Dzahn: [C: 03+1] "the iptables rules allow port 25 only from mx1001/2001 anyways" [puppet] - 10https://gerrit.wikimedia.org/r/828025 (https://phabricator.wikimedia.org/T265864) (owner: 10Ayounsi) [17:40:50] (03CR) 10Dzahn: [C: 03+2] Exclude cloud-eqiad prefix from VRT trusted networks [puppet] - 10https://gerrit.wikimedia.org/r/828025 (https://phabricator.wikimedia.org/T265864) (owner: 10Ayounsi) [17:42:35] (03CR) 10Dzahn: [C: 03+2] "it's also spamassassin config. exim4 was refreshed by puppet." [puppet] - 10https://gerrit.wikimedia.org/r/828025 (https://phabricator.wikimedia.org/T265864) (owner: 10Ayounsi) [17:44:09] 10SRE, 10LDAP-Access-Requests, 10WMF-NDA-Requests, 10Discovery-Search (Current work): Add pfischer to #wmf-nda on Phab and to #wmf on LDAP - https://phabricator.wikimedia.org/T316922 (10RKemper) >>! In T316922#8215118, @bking wrote: > Update: Unfortunately, even though @pfischer has created the Wikitech d... 
[17:45:18] 10SRE, 10LDAP-Access-Requests, 10WMF-NDA-Requests, 10Discovery-Search (Current work): Add pfischer to #wmf-nda on Phab and to #wmf on LDAP - https://phabricator.wikimedia.org/T316922 (10RKemper) @pfischer Per the above, you might need to [[ https://wikitech.wikimedia.org/w/index.php?title=Special:UserLogin... [17:45:41] 10SRE, 10LDAP-Access-Requests, 10WMF-NDA-Requests, 10Discovery-Search (Current work): Add pfischer to #wmf-nda on Phab and to #wmf on LDAP - https://phabricator.wikimedia.org/T316922 (10Dzahn) >>! In T316922#8209855, @Aklapper wrote: > @Dzahn: Not in the special case of `ldap/wmf` though, [per SRE instruct... [17:48:47] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review, 10cloud-services-team (Kanban): Remove 185.15.56.0/24 from network::external - https://phabricator.wikimedia.org/T265864 (10Dzahn) >>! In T265864#6995696, @Legoktm wrote: > This will remove Cloud VPS from `wikimedia_nets`, which gets some... [17:48:56] !log root@cumin1001 START - Cookbook sre.network.prepare-upgrade [17:50:01] 10SRE, 10SRE-Access-Requests, 10Data Engineering Planning, 10Discovery-Search (Current work), 10Patch-For-Review: Production Shell access for Peter - https://phabricator.wikimedia.org/T316090 (10RKemper) [17:50:29] !log root@cumin1001 START - Cookbook sre.network.prepare-upgrade [17:51:20] 10SRE, 10LDAP-Access-Requests, 10WMF-NDA-Requests, 10Discovery-Search (Current work): Add pfischer to #wmf-nda on Phab and to #wmf on LDAP - https://phabricator.wikimedia.org/T316922 (10RKemper) [17:53:23] (03CR) 10Dzahn: [C: 03+2] "also mentioned on VRT IRC channel" [puppet] - 10https://gerrit.wikimedia.org/r/828025 (https://phabricator.wikimedia.org/T265864) (owner: 10Ayounsi) [17:53:47] 10SRE, 10ops-eqiad, 10DC-Ops, 10SRE Observability, 10observability: Q1:rack/setup/install kafka-logging100[45] - https://phabricator.wikimedia.org/T313960 (10Papaul) [17:55:11] RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics 
within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [17:55:23] 10SRE, 10ops-eqiad, 10DC-Ops, 10SRE Observability, 10observability: Q1:rack/setup/install kafka-logging100[45] - https://phabricator.wikimedia.org/T313960 (10Papaul) HW raid setup on kafka-logging1004 [17:55:49] (03CR) 10Dzahn: [C: 03+1] gitlab: reduce backup_keep_time to 1d [puppet] - 10https://gerrit.wikimedia.org/r/829747 (https://phabricator.wikimedia.org/T274463) (owner: 10Jelto) [17:56:31] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Degraded RAID on ms-be2035 - https://phabricator.wikimedia.org/T314509 (10Papaul) Thanks for the update. [18:00:03] (03PS1) 10CDanis: refactor value of api_base_url to support reporting API [software/klaxon] - 10https://gerrit.wikimedia.org/r/830229 [18:00:04] 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10netops: eqiad: Move links to new MPC7E linecard - https://phabricator.wikimedia.org/T304712 (10Jclark-ctr) c2 <-- G2204190495000069 --> a1 c7 <-- G2204190495000136 --> a8 d2 <-- G2204190495000072 --> a1 d7 <-- G2204190495000097 --> a8 [18:00:05] jeena and dduvall: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for MediaWiki train - Utc-7 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220906T1800). 
[18:01:00] 10SRE, 10Observability-Alerting, 10User-fgiunchedi: Icinga downtimes not working - https://phabricator.wikimedia.org/T314353 (10lmata) p:05Triage→03Medium [18:01:34] (03CR) 10CI reject: [V: 04-1] refactor value of api_base_url to support reporting API [software/klaxon] - 10https://gerrit.wikimedia.org/r/830229 (owner: 10CDanis) [18:01:54] Train was deployed during the European window today [18:02:52] 10SRE, 10Observability-Alerting, 10Performance-Team, 10Patch-For-Review: Add monitoring for performance.wikimedia.org - https://phabricator.wikimedia.org/T277927 (10lmata) p:05Triage→03Medium [18:02:58] (03PS2) 10CDanis: refactor value of api_base_url to support reporting API [software/klaxon] - 10https://gerrit.wikimedia.org/r/830229 [18:11:30] 10SRE, 10LDAP-Access-Requests, 10WMF-NDA-Requests, 10Discovery-Search (Current work): Add pfischer to #wmf-nda on Phab and to #wmf on LDAP - https://phabricator.wikimedia.org/T316922 (10bking) [18:11:37] 10SRE, 10SRE-Access-Requests, 10Data Engineering Planning, 10Discovery-Search (Current work), 10Patch-For-Review: Production Shell access for Peter - https://phabricator.wikimedia.org/T316090 (10bking) [18:19:51] PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [18:25:31] 10SRE, 10MediaWiki-Core-HTTP-Cache, 10MediaWiki-Page-history, 10Traffic, and 3 others: History pages' caches not being invalidated after edits - https://phabricator.wikimedia.org/T317064 (10aaron) [18:25:47] !log reduce codfw replicas 2 to 1 for logstash-(webrequest|k8s) partitions. 
Make space for failed logstash2027 - T316996 [18:25:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:25:51] T316996: Degraded RAID on logstash2027 - https://phabricator.wikimedia.org/T316996 [18:27:17] PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [18:31:19] (03PS1) 10Dduvall: phabricator: Add missing line continuation to phab_deploy_promote [puppet] - 10https://gerrit.wikimedia.org/r/830234 (https://phabricator.wikimedia.org/T313953) [18:33:02] (03CR) 10Dzahn: [C: 03+2] phabricator: Add missing line continuation to phab_deploy_promote [puppet] - 10https://gerrit.wikimedia.org/r/830234 (https://phabricator.wikimedia.org/T313953) (owner: 10Dduvall) [18:33:48] mutante: thank you! [18:35:15] no problem, just happened to see it [18:36:05] (03CR) 10Dzahn: [V: 03+1 C: 03+1] "UID and key matches,lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/829148 (https://phabricator.wikimedia.org/T316090) (owner: 10Jelto) [18:37:56] dduvall: deployed on phab2002 [18:38:01] (03PS2) 10Gehel: admin: add production access for pfischer [puppet] - 10https://gerrit.wikimedia.org/r/829148 (https://phabricator.wikimedia.org/T316090) (owner: 10Jelto) [18:38:03] (03PS2) 10Gehel: admin: add pfischer to search*, analytics and deployment group [puppet] - 10https://gerrit.wikimedia.org/r/829150 (https://phabricator.wikimedia.org/T316090) (owner: 10Jelto) [18:38:23] mutante: excellent [18:39:28] (03CR) 10Gehel: [C: 03+2] admin: add production access for pfischer [puppet] - 10https://gerrit.wikimedia.org/r/829148 (https://phabricator.wikimedia.org/T316090) (owner: 10Jelto) [18:39:34] 10SRE, 10Observability-Alerting: Fix CirrusSearch monitoring - https://phabricator.wikimedia.org/T84163 (10lmata) [18:40:27] 10SRE, 10Observability-Alerting: Fix CirrusSearch monitoring - 
https://phabricator.wikimedia.org/T84163 (10lmata) 05Open→03Resolved a:03lmata Closing this task assumes that the flapping has subsided based on a cursory look at Icinga. Please reopen if this is still an issue. cc/ @gehel @RKemper [18:40:28] (03PS3) 10Gehel: admin: add pfischer to search*, analytics and deployment group [puppet] - 10https://gerrit.wikimedia.org/r/829150 (https://phabricator.wikimedia.org/T316090) (owner: 10Jelto) [18:41:14] (03CR) 10Dzahn: [C: 03+1] "all the approvals on ticket, lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/829150 (https://phabricator.wikimedia.org/T316090) (owner: 10Jelto) [18:41:52] (03CR) 10Gehel: [C: 03+2] admin: add pfischer to search*, analytics and deployment group [puppet] - 10https://gerrit.wikimedia.org/r/829150 (https://phabricator.wikimedia.org/T316090) (owner: 10Jelto) [18:43:54] (03PS1) 10Muehlenhoff: cas: Update to 6.6.0 [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/830236 (https://phabricator.wikimedia.org/T311235) [18:44:35] RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [18:44:56] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2124.codfw.wmnet with reason: Maintenance [18:45:10] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2124.codfw.wmnet with reason: Maintenance [18:45:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2124 (T314041)', diff saved to https://phabricator.wikimedia.org/P33972 and previous config saved to /var/cache/conftool/dbconfig/20220906-184515-ladsgroup.json [18:45:18] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [18:47:51] 10SRE, 10SRE-Access-Requests, 10Data Engineering Planning, 10Discovery-Search (Current work), 10Patch-For-Review: Production Shell access for Peter - https://phabricator.wikimedia.org/T316090 (10Gehel) [18:50:02] (03PS1) 10Gehel: admin: add pfischer to elasticsearch-roots [puppet] - 10https://gerrit.wikimedia.org/r/830238 (https://phabricator.wikimedia.org/T316090) [18:50:30] (03CR) 10Ryan Kemper: [C: 03+1] admin: add pfischer to elasticsearch-roots [puppet] - 10https://gerrit.wikimedia.org/r/830238 (https://phabricator.wikimedia.org/T316090) (owner: 10Gehel) [18:50:35] (03CR) 10Bking: [C: 03+1] admin: add pfischer to elasticsearch-roots [puppet] - 10https://gerrit.wikimedia.org/r/830238 (https://phabricator.wikimedia.org/T316090) (owner: 10Gehel) [18:51:18] (03CR) 10Gehel: [C: 03+2] admin: add pfischer to elasticsearch-roots [puppet] - 10https://gerrit.wikimedia.org/r/830238 (https://phabricator.wikimedia.org/T316090) (owner: 10Gehel) [18:53:58] 10SRE, 10SRE-Access-Requests, 10Data Engineering Planning, 10Discovery-Search (Current work), 10Patch-For-Review: Production Shell access for Peter - https://phabricator.wikimedia.org/T316090 (10Gehel) Patches are merged, account creation checked on 1 elasticsearch 
server and one wdqs server. @pfischer... [18:54:25] (03PS2) 10Muehlenhoff: cas: Update to 6.6.0 [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/830236 (https://phabricator.wikimedia.org/T311235) [18:58:21] (03PS1) 10Ebernhardson: Add alert for CirrusSearch reported memory issues [puppet] - 10https://gerrit.wikimedia.org/r/830240 (https://phabricator.wikimedia.org/T316712) [19:03:39] (03CR) 10AOkoth: C:spamassassin Allow debugging of why service fails. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/829108 (https://phabricator.wikimedia.org/T316903) (owner: 10Slyngshede) [19:04:04] 10SRE, 10LDAP-Access-Requests, 10WMF-NDA-Requests, 10Discovery-Search (Current work): Add pfischer to #wmf-nda on Phab and to #wmf on LDAP - https://phabricator.wikimedia.org/T316922 (10bking) We added `pfischer` to the WMF LDAP group [[ https://wikitech.wikimedia.org/wiki/SRE/LDAP#Add_a_user_to_a_group |... [19:05:21] (03PS1) 10Dduvall: phabricator: Include scap "done" rev in pre-finalize permissions reset [puppet] - 10https://gerrit.wikimedia.org/r/830241 (https://phabricator.wikimedia.org/T313953) [19:08:28] 10SRE, 10SRE-swift-storage, 10Commons, 10MediaWiki-File-management, and 3 others: Specific revisions of multiple files missing from Swift - 404 Not Found returned - https://phabricator.wikimedia.org/T124101 (10AlexisJazz) >>! In T124101#7314869, @AlexisJazz wrote: > @MarkTraceur @CBogen @Tgr can someone in... [19:09:31] 10SRE, 10Infrastructure-Foundations, 10Observability-Alerting, 10netops, 10SRE Observability (FY2022/2023-Q1): Ingest Cron and Root Alerts Into Logstash - https://phabricator.wikimedia.org/T274377 (10lmata) p:05Triage→03Medium [19:09:41] 10SRE, 10Observability-Alerting: improve cron spam visibility - https://phabricator.wikimedia.org/T84845 (10Dzahn) Since this ticket has been written most cron jobs have been converted to systemd timers. Maybe that made all of this obsolete? 
[19:09:45] 10SRE, 10Infrastructure-Foundations, 10Observability-Alerting, 10netops, 10SRE Observability (FY2022/2023-Q1): Ingest Cron and Root Alerts Into Logstash - https://phabricator.wikimedia.org/T274377 (10lmata) a:05herron→03andrea.denisse [19:11:08] (03PS2) 10Ryan Kemper: wdqs: add bking as contact for wdqs alerts [puppet] - 10https://gerrit.wikimedia.org/r/824553 (https://phabricator.wikimedia.org/T313095) (owner: 10Bking) [19:11:19] (03CR) 10Ryan Kemper: [C: 03+1] wdqs: add bking as contact for wdqs alerts [puppet] - 10https://gerrit.wikimedia.org/r/824553 (https://phabricator.wikimedia.org/T313095) (owner: 10Bking) [19:11:31] (03CR) 10Dduvall: [V: 03+1] "Verified to work in devtools." [puppet] - 10https://gerrit.wikimedia.org/r/830241 (https://phabricator.wikimedia.org/T313953) (owner: 10Dduvall) [19:12:05] (03CR) 10Muehlenhoff: C:spamassassin Allow debugging of why service fails. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/829108 (https://phabricator.wikimedia.org/T316903) (owner: 10Slyngshede) [19:12:56] (03PS3) 10Dduvall: Run all puppetized deploy scripts as checks [phabricator/deployment] (wmf/stable) - 10https://gerrit.wikimedia.org/r/828655 (https://phabricator.wikimedia.org/T313953) [19:13:11] (03CR) 10Dzahn: [C: 03+2] phabricator: Include scap "done" rev in pre-finalize permissions reset [puppet] - 10https://gerrit.wikimedia.org/r/830241 (https://phabricator.wikimedia.org/T313953) (owner: 10Dduvall) [19:13:35] (03CR) 10Dduvall: "This configuration appears to work reliably in devtools after some fixes to the puppet-side scripts." 
[phabricator/deployment] (wmf/stable) - 10https://gerrit.wikimedia.org/r/828655 (https://phabricator.wikimedia.org/T313953) (owner: 10Dduvall) [19:14:23] (03CR) 10Andrew Bogott: [C: 03+2] C:prometheus: openstack-stale-puppet-certs get ssldir from puppet [puppet] - 10https://gerrit.wikimedia.org/r/830113 (owner: 10Jbond) [19:15:05] 10SRE, 10Observability-Alerting: improve cron spam visibility - https://phabricator.wikimedia.org/T84845 (10Dzahn) [19:15:07] (03PS3) 10Andrew Bogott: prometheus-openstack-stale-puppet-certs.py: log original cert name [puppet] - 10https://gerrit.wikimedia.org/r/829320 [19:15:11] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review, 10User-jbond: replace all puppet crons with systemd timers - https://phabricator.wikimedia.org/T273673 (10Dzahn) [19:16:09] 10SRE, 10Observability-Alerting: improve cron spam visibility - https://phabricator.wikimedia.org/T84845 (10MoritzMuehlenhoff) cron spam is more of a generic term here, that the spam is now coming from jobs spawned by systemd timers doesn't really change the problem :-) [19:17:03] 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.4 point update - https://phabricator.wikimedia.org/T312637 (10MoritzMuehlenhoff) [19:20:29] PROBLEM - Swift https frontend on ms-fe1012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.210 second response time https://wikitech.wikimedia.org/wiki/Swift [19:24:30] !log milimetric@deploy1002 Started deploy [analytics/refinery@8a5ce13]: Regular analytics weekly train [analytics/refinery@8a5ce13] [19:24:49] (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) firing: Processing latency of WDQS_Streaming_Updater in eqiad (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh [19:25:17] RECOVERY - Swift https frontend on 
ms-fe1012 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.008 second response time https://wikitech.wikimedia.org/wiki/Swift [19:26:12] (03PS7) 10AOkoth: vrts: install vrts script [puppet] - 10https://gerrit.wikimedia.org/r/828673 (https://phabricator.wikimedia.org/T317059) [19:28:47] (03CR) 10AOkoth: "https://puppet-compiler.wmflabs.org/pcc-worker1003/37132/" [puppet] - 10https://gerrit.wikimedia.org/r/828673 (https://phabricator.wikimedia.org/T317059) (owner: 10AOkoth) [19:28:57] 10SRE, 10SRE-swift-storage, 10Commons, 10MediaWiki-File-management, and 3 others: Specific revisions of multiple files missing from Swift - 404 Not Found returned - https://phabricator.wikimedia.org/T124101 (10aaron) Listings look empty: ` >>> $be = MediaWiki\MediaWikiServices::getInstance()->getFileBacke... [19:29:49] (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) resolved: Processing latency of WDQS_Streaming_Updater in eqiad (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh [19:31:37] PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [19:32:13] 10SRE, 10LDAP-Access-Requests, 10WMF-NDA-Requests, 10Discovery-Search (Current work): Add pfischer to #wmf-nda on Phab and to #wmf on LDAP - https://phabricator.wikimedia.org/T316922 (10Gehel) a:03pfischer [19:32:44] (03CR) 10Gehel: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/824553 (https://phabricator.wikimedia.org/T313095) (owner: 10Bking) [19:34:05] RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [19:34:11] (03CR) 10AOkoth: vrts: install vrts script (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/828673 (https://phabricator.wikimedia.org/T317059) (owner: 10AOkoth) [19:39:09] (03PS1) 10Bking: elastic: Increase no of master-eligibles in codfw [puppet] - 10https://gerrit.wikimedia.org/r/830245 (https://phabricator.wikimedia.org/T313431) [19:41:15] (03PS4) 10Andrew Bogott: prometheus-openstack-stale-puppet-certs.py: log original cert name [puppet] - 10https://gerrit.wikimedia.org/r/829320 [19:41:17] (03PS5) 10Andrew Bogott: Add clean-stale-puppet-certs script [puppet] - 10https://gerrit.wikimedia.org/r/829321 [19:41:19] (03PS1) 10Andrew Bogott: openstack network tests: switch to check clouddumps1001 mounts [puppet] - 10https://gerrit.wikimedia.org/r/830246 (https://phabricator.wikimedia.org/T309346) [19:41:21] (03PS1) 10Andrew Bogott: prometheus-openstack-stale-puppet-certs.py: use ssl_dir() [puppet] - 10https://gerrit.wikimedia.org/r/830247 [19:41:24] (03CR) 10Dzahn: vrts: install vrts script (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/828673 (https://phabricator.wikimedia.org/T317059) (owner: 10AOkoth) [19:41:35] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/830245 (https://phabricator.wikimedia.org/T313431) (owner: 10Bking) [19:41:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135 (T312863)', diff saved to https://phabricator.wikimedia.org/P33973 and previous config saved to /var/cache/conftool/dbconfig/20220906-194135-ladsgroup.json [19:41:39] T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863 [19:43:01] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:43:14] (03CR) 10CI 
reject: [V: 04-1] Add clean-stale-puppet-certs script [puppet] - 10https://gerrit.wikimedia.org/r/829321 (owner: 10Andrew Bogott) [19:44:08] (03CR) 10Andrew Bogott: [C: 03+2] openstack network tests: switch to check clouddumps1001 mounts [puppet] - 10https://gerrit.wikimedia.org/r/830246 (https://phabricator.wikimedia.org/T309346) (owner: 10Andrew Bogott) [19:45:05] (03CR) 10Andrew Bogott: [C: 03+2] prometheus-openstack-stale-puppet-certs.py: use ssl_dir() [puppet] - 10https://gerrit.wikimedia.org/r/830247 (owner: 10Andrew Bogott) [19:46:32] (03PS2) 10Ryan Kemper: elastic: Increase # of master-eligibles in codfw [puppet] - 10https://gerrit.wikimedia.org/r/830245 (https://phabricator.wikimedia.org/T313431) (owner: 10Bking) [19:47:49] (03PS3) 10Ryan Kemper: elastic: Increase # of master-eligibles in codfw [puppet] - 10https://gerrit.wikimedia.org/r/830245 (https://phabricator.wikimedia.org/T313431) (owner: 10Bking) [19:48:40] (03PS6) 10Andrew Bogott: Add clean-stale-puppet-certs script [puppet] - 10https://gerrit.wikimedia.org/r/829321 [19:49:34] (03CR) 10Andrew Bogott: [C: 03+2] prometheus-openstack-stale-puppet-certs.py: log original cert name [puppet] - 10https://gerrit.wikimedia.org/r/829320 (owner: 10Andrew Bogott) [19:49:59] (03CR) 10Ryan Kemper: [V: 03+1] "PCC SUCCESS (NOOP 3 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37134/console" [puppet] - 10https://gerrit.wikimedia.org/r/830245 (https://phabricator.wikimedia.org/T313431) (owner: 10Bking) [19:50:16] 10SRE, 10Observability-Metrics: WMF's Grafana installation does not follow Wikimedia's visual identity guidelines - https://phabricator.wikimedia.org/T214762 (10lmata) 05Open→03Declined closing, please reach out if you need this. 
[19:50:20] (03PS1) 10Robertsky: wikimaniawiki: create 2023 namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/830248 (https://phabricator.wikimedia.org/T316928) [19:50:22] (03PS1) 10Robertsky: wikimaniawiki: update default searched namespace for Wikimania 2023 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/830249 (https://phabricator.wikimedia.org/T316928) [19:50:24] (03CR) 10Andrew Bogott: [C: 03+2] Add clean-stale-puppet-certs script (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/829321 (owner: 10Andrew Bogott) [19:50:26] (03PS1) 10Robertsky: wikimaniawiki: enable Visual Editor on 2023 namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/830250 (https://phabricator.wikimedia.org/T316928) [19:52:26] (03CR) 10Ryan Kemper: [C: 03+1] "PCC looks great: https://puppet-compiler.wmflabs.org/pcc-worker1003/37134/elastic2060.codfw.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/830245 (https://phabricator.wikimedia.org/T313431) (owner: 10Bking) [19:52:33] (03CR) 10Bking: [C: 03+2] elastic: Increase # of master-eligibles in codfw [puppet] - 10https://gerrit.wikimedia.org/r/830245 (https://phabricator.wikimedia.org/T313431) (owner: 10Bking) [19:54:18] (03PS1) 10Andrew Bogott: clean-stale-puppet-certs: remove surplus arg switch [puppet] - 10https://gerrit.wikimedia.org/r/830251 [19:56:36] (03CR) 10Andrew Bogott: [C: 03+2] clean-stale-puppet-certs: remove surplus arg switch [puppet] - 10https://gerrit.wikimedia.org/r/830251 (owner: 10Andrew Bogott) [19:56:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135', diff saved to https://phabricator.wikimedia.org/P33974 and previous config saved to /var/cache/conftool/dbconfig/20220906-195642-ladsgroup.json [19:57:04] (03PS1) 10BryanDavis: striker: Bump deployed version to 2022-09-04-055313-production [puppet] - 10https://gerrit.wikimedia.org/r/830252 (https://phabricator.wikimedia.org/T174444) [20:00:05] RoanKattouw, Urbanecm, cjming, and 
TheresNoTime: #bothumor My software never has bugs. It just develops random features. Rise for UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220906T2000). [20:00:05] ebernhardson and MdsShakil: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:51] o/ [20:00:55] Hey deployers, please try using `scap backport` for your deployments. I think you'll like it. The unexpected l10n rebuild issue has been resolved. [20:00:55] i can deploy today [20:01:43] dancy: thanks! will do [20:01:48] \o [20:02:06] hi ebernhardson: would you like to self-deploy? happy to do it if not [20:02:13] cjming: you can go ahead [20:02:16] alrighty [20:03:26] (03CR) 10TrainBranchBot: [C: 03+2] "Approved via scap backport" [extensions/CirrusSearch] (wmf/1.39.0-wmf.28) - 10https://gerrit.wikimedia.org/r/830214 (owner: 10Ebernhardson) [20:03:29] !log 'bking@cumin1001 disabling puppet on elastic codfw hosts T313431' [20:03:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:03:32] T313431: Increase Elastic master-eligible nodes from 3 to 5 - https://phabricator.wikimedia.org/T313431 [20:04:20] (03CR) 10BryanDavis: [V: 03+1 C: 03+1] "PCC changes: https://puppet-compiler.wmflabs.org/pcc-worker1001/37135/" [puppet] - 10https://gerrit.wikimedia.org/r/830252 (https://phabricator.wikimedia.org/T174444) (owner: 10BryanDavis) [20:08:19] PROBLEM - Disk space on an-launcher1002 is CRITICAL: DISK CRITICAL - free space: /srv 0 MB (0% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-launcher1002&var-datasource=eqiad+prometheus/ops [20:08:48] (03CR) 10Andrew Bogott: [C: 03+2] striker: Bump deployed version to 2022-09-04-055313-production [puppet] - 10https://gerrit.wikimedia.org/r/830252 
(https://phabricator.wikimedia.org/T174444) (owner: 10BryanDavis) [20:10:25] PROBLEM - ElasticSearch setting check - 9600 on elastic2027 is CRITICAL: CRITICAL - [elastic2025.codfw.wmnet:9300, elastic2031.codfw.wmnet:9300, elastic2042.codfw.wmnet:9300] does not match [elastic2025.codfw.wmnet:9300, elastic2031.codfw.wmnet:9300, elastic2042.codfw.wmnet:9300, elastic2074.codfw.wmnet:9300, elastic2081.codfw.wmnet:9300] for .(cluster https://wikitech.wikimedia.org/wiki/Search%23Administration [20:10:39] (03CR) 10Andrew Bogott: [C: 03+1] wikitech: drop webserver_hostname_aliases [puppet] - 10https://gerrit.wikimedia.org/r/829287 (owner: 10Majavah) [20:11:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135', diff saved to https://phabricator.wikimedia.org/P33975 and previous config saved to /var/cache/conftool/dbconfig/20220906-201148-ladsgroup.json [20:13:23] !log Running database migrations for Striker (T296893) [20:13:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:13:26] T296893: Replace Diffusion integration with Gitlab integration in Striker (toolsadmin) - https://phabricator.wikimedia.org/T296893 [20:13:40] cjming: deployment is happening today? [20:14:08] (03CR) 10Robertsky: "Hi Reedy," [mediawiki-config] - 10https://gerrit.wikimedia.org/r/830248 (https://phabricator.wikimedia.org/T316928) (owner: 10Robertsky) [20:14:09] hi MdsShakil: yes! 
i'm doing the 1st patch -- should be done in a few mins and then can do yours [20:16:27] !log Forcing puppet runs on cloudweb100[34] to deploy new version of Striker (T296893) [20:16:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:17:25] ACKNOWLEDGEMENT - ElasticSearch setting check - 9600 on elastic2027 is CRITICAL: CRITICAL - [elastic2025.codfw.wmnet:9300, elastic2031.codfw.wmnet:9300, elastic2042.codfw.wmnet:9300] does not match [elastic2025.codfw.wmnet:9300, elastic2031.codfw.wmnet:9300, elastic2042.codfw.wmnet:9300, elastic2074.codfw.wmnet:9300, elastic2081.codfw.wmnet:9300] for .(cluster Ryan Kemper https://phabricator.wikimedia.org/T313431 https://wikitech.wikimedia.org/wiki/Search%23Administration [20:18:04] Hello. Which team handles TimedMediaHandler? [20:18:23] It looks like there's some issues with transcoding files on commons [20:18:25] e.g. Unrecognized option 'fpsmax'. Error splitting the argument list: Option not found [20:19:17] hauskater: https://www.mediawiki.org/wiki/Developers/Maintainers says "unassigned" :/ [20:19:22] 10SRE, 10Infrastructure-Foundations, 10cloud-services-team (Kanban): toolforge/paws k8s containers need to know about clouddumps100[12] - https://phabricator.wikimedia.org/T317144 (10rook) PAWS containers should start mounting `/mnt/nfs/dumps-clouddumps100[12].wikimedia.org` alongside `/mnt/nfs/dumps-labstor... 
[20:19:33] PROBLEM - ElasticSearch setting check - 9400 on elastic2073 is CRITICAL: CRITICAL - [elastic2025.codfw.wmnet:9300, elastic2031.codfw.wmnet:9300, elastic2042.codfw.wmnet:9300] does not match [elastic2025.codfw.wmnet:9300, elastic2031.codfw.wmnet:9300, elastic2042.codfw.wmnet:9300, elastic2074.codfw.wmnet:9300, elastic2081.codfw.wmnet:9300] for .(cluster https://wikitech.wikimedia.org/wiki/Search%23Administration [20:19:46] bd808: that'd be too easy heh :) [20:20:19] I read "Readers Engineering" on https://www.mediawiki.org/wiki/Developers/Maintainers [20:20:48] bd808: looking for 'fpsmax' I've found https://phabricator.wikimedia.org/T317069 [20:20:51] cc zabe_ [20:20:53] PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:21:05] I guess I was looking at the catch-all "Media handlers" row. [20:21:23] but the video I uploaded is not 4k but max 720 fps [20:21:49] hauskater: DJ has pinged the folks I would have pinged there. 
[20:22:17] bd808: ack, hopefully they'll see it and take a look at it [20:22:32] specifically brion is a person who knows things about ffmpeg transcodes [20:22:33] (03Merged) 10jenkins-bot: Ensure namespace filters is passed as a list [extensions/CirrusSearch] (wmf/1.39.0-wmf.28) - 10https://gerrit.wikimedia.org/r/830214 (owner: 10Ebernhardson) [20:22:39] it looks like it's an argument no longer supported by the usr/bin/ffpmg script [20:22:57] *ffmpeg [20:23:04] !log cjming@deploy1002 Started scap: Backport for [[gerrit:830214|Ensure namespace filters is passed as a list]] [20:23:41] !log cjming@deploy1002 cjming and ebernhardson: Backport for [[gerrit:830214|Ensure namespace filters is passed as a list]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet [20:23:43] 10SRE, 10SRE-OnFire: klaxon CLI tool for seeding an oncall handoff - https://phabricator.wikimedia.org/T317159 (10CDanis) [20:23:46] ebernhardson: up on your pick of mwdebug - if it's testable [20:24:50] Only place where fpsmax is used is at https://gerrit.wikimedia.org/g/mediawiki/extensions/TimedMediaHandler/+/85872e13c0ec3e8449e77f11e284a0076dae711e/includes/WebVideoTranscode/WebVideoTranscodeJob.php [20:24:56] cjming: thanks, checking [20:25:04] (03PS1) 10CDanis: refactor out incident parsing for reuse [software/klaxon] - 10https://gerrit.wikimedia.org/r/830258 [20:25:06] (03PS1) 10CDanis: WIP: Basic seeding of an oncall handoff message [software/klaxon] - 10https://gerrit.wikimedia.org/r/830259 (https://phabricator.wikimedia.org/T317159) [20:25:23] cjming: works as expected [20:25:29] great - going live [20:26:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135 (T312863)', diff saved to https://phabricator.wikimedia.org/P33976 and previous config saved to /var/cache/conftool/dbconfig/20220906-202654-ladsgroup.json [20:26:56] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 
1 day, 0:00:00 on db1139.eqiad.wmnet with reason: Maintenance [20:26:58] T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863 [20:27:01] (03CR) 10Andrew Bogott: [V: 03+2 C: 03+2] Update legoktm's root SSH key [labs/private] - 10https://gerrit.wikimedia.org/r/829262 (owner: 10Legoktm) [20:27:10] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1139.eqiad.wmnet with reason: Maintenance [20:27:41] hauskater: -fpsmax was introduced in ffmpeg 4.4 via https://git.ffmpeg.org/gitweb/ffmpeg.git/commit/d99cc1782563672bcdb46fb5ec51135847db8c99, but the video scalers use 4.1.9 [20:28:18] !log milimetric@deploy1002 Finished deploy [analytics/refinery@8a5ce13]: Regular analytics weekly train [analytics/refinery@8a5ce13] (duration: 63m 48s) [20:28:29] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:29:13] PROBLEM - ElasticSearch setting check - 9600 on elastic2076 is CRITICAL: CRITICAL - [elastic2025.codfw.wmnet:9300, elastic2031.codfw.wmnet:9300, elastic2042.codfw.wmnet:9300] does not match [elastic2025.codfw.wmnet:9300, elastic2031.codfw.wmnet:9300, elastic2042.codfw.wmnet:9300, elastic2074.codfw.wmnet:9300, elastic2081.codfw.wmnet:9300] for .(cluster https://wikitech.wikimedia.org/wiki/Search%23Administration [20:29:17] moritzm: so this commit by brion is likely the cause of the break: https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/extensions/TimedMediaHandler/+/21a48ef249a0afcbff44aa44794cd226436d8304%5E%21/#F8 [20:29:20] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:29:21] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:29:40] !log cjming@deploy1002 Finished scap: Backport for [[gerrit:830214|Ensure namespace filters is passed as a list]] (duration: 06m 35s) [20:29:52] ebernhardson: should be live! 
[20:30:00] or the videoscalers could be upgraded [20:30:03] MdsShakil: still around? we can do your patch now [20:30:03] (03CR) 10Andrew Bogott: [C: 03+2] P:wmcs: toolsdb: extend binlog retention [puppet] - 10https://gerrit.wikimedia.org/r/809219 (owner: 10Majavah) [20:30:14] cjming: yes [20:30:26] (03CR) 10TrainBranchBot: [C: 03+2] "Approved via scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/830213 (https://phabricator.wikimedia.org/T316953) (owner: 10MdsShakil) [20:30:57] cjming: thanks! [20:31:04] np! [20:31:08] (03CR) 10CI reject: [V: 04-1] Add localized wordmark for Bengali Wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/830213 (https://phabricator.wikimedia.org/T316953) (owner: 10MdsShakil) [20:31:11] (03PS1) 10Bking: elastic: reduce master-eligibles for codfw back down to 2 [puppet] - 10https://gerrit.wikimedia.org/r/830261 (https://phabricator.wikimedia.org/T313431) [20:31:46] hauskater: that seems to be the breaking commit indeed. we cannot easily update ffmpeg, it's not just a one off update, since we need to follow ffmpeg security updates on an ongoing manner, i.e. use the versions shipped by Debian [20:31:54] MdsShakil: hmm - there seems to be a problem with your patch [20:32:05] moritzm: I'll comment on the task [20:32:10] (03CR) 10Andrew Bogott: [C: 03+2] wikitech: drop webserver_hostname_aliases [puppet] - 10https://gerrit.wikimedia.org/r/829287 (owner: 10Majavah) [20:32:14] cjming: where? 
[20:32:17] (03PS2) 10Andrew Bogott: wikitech: drop webserver_hostname_aliases [puppet] - 10https://gerrit.wikimedia.org/r/829287 (owner: 10Majavah) [20:32:29] MdsShakil: https://integration.wikimedia.org/ci/job/operations-mw-config-php72-composer-test-docker/20477/console [20:32:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2124 (T314041)', diff saved to https://phabricator.wikimedia.org/P33977 and previous config saved to /var/cache/conftool/dbconfig/20220906-203236-ladsgroup.json [20:32:38] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2117.codfw.wmnet with reason: Maintenance [20:32:40] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [20:32:40] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [20:32:41] if you can push up a quick fix, we can try again [20:32:49] MdsShakil: ^^ [20:32:52] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2117.codfw.wmnet with reason: Maintenance [20:32:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2117 (T314041)', diff saved to https://phabricator.wikimedia.org/P33978 and previous config saved to /var/cache/conftool/dbconfig/20220906-203258-ladsgroup.json [20:33:01] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:33:25] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/830261 (https://phabricator.wikimedia.org/T313431) (owner: 10Bking) [20:33:26] I left a comment on T312152 [20:33:27] T312152: Clean up video transcode config for speed/bitrate balance - https://phabricator.wikimedia.org/T312152 [20:35:31] !log milimetric@deploy1002 Started deploy 
[analytics/refinery@8a5ce13]: Regular analytics weekly train [analytics/refinery@8a5ce13] [20:36:22] moritzm: https://phabricator.wikimedia.org/T317069#8215953 [20:36:38] MdsShakil: i can take care of it - unless you're on it? [20:38:40] cjming: you are welcome 🙂 [20:38:46] !log milimetric@deploy1002 Finished deploy [analytics/refinery@8a5ce13]: Regular analytics weekly train [analytics/refinery@8a5ce13] (duration: 03m 15s) [20:38:56] MdsShakil: alrighty - pushing - it just needs a space to make CI happy [20:39:30] (03PS7) 10Clare Ming: Add localized wordmark for Bengali Wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/830213 (https://phabricator.wikimedia.org/T316953) (owner: 10MdsShakil) [20:40:43] (03CR) 10TrainBranchBot: [C: 03+2] "Approved via scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/830213 (https://phabricator.wikimedia.org/T316953) (owner: 10MdsShakil) [20:41:23] (03CR) 10Ebernhardson: [C: 03+1] elastic: reduce master-eligibles for codfw back down to 2 [puppet] - 10https://gerrit.wikimedia.org/r/830261 (https://phabricator.wikimedia.org/T313431) (owner: 10Bking) [20:41:27] (03Merged) 10jenkins-bot: Add localized wordmark for Bengali Wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/830213 (https://phabricator.wikimedia.org/T316953) (owner: 10MdsShakil) [20:41:35] (03CR) 10Bking: [C: 03+2] elastic: reduce master-eligibles for codfw back down to 2 [puppet] - 10https://gerrit.wikimedia.org/r/830261 (https://phabricator.wikimedia.org/T313431) (owner: 10Bking) [20:41:50] !log cjming@deploy1002 Started scap: Backport for [[gerrit:830213|Add localized wordmark for Bengali Wiktionary (T316953)]] [20:41:54] T316953: Add localized wordmark for Bengali Wiktionary - https://phabricator.wikimedia.org/T316953 [20:42:13] !log cjming@deploy1002 cjming and mdsshakil: Backport for [[gerrit:830213|Add localized wordmark for Bengali Wiktionary (T316953)]] synced to the testservers: mwdebug1001.eqiad.wmnet, 
mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet [20:42:22] MdsShakil: are you able to verify on one of the mwdebug servers? [20:42:54] cjming: looking good [20:43:02] cool - syncing [20:43:12] dancy: feature request: the gerrit +2 comment should include the deployers username since it's done via a shared bot account [20:43:55] taavi: Can I get you to file a phab ticket? You can add me a subscriber and I'll make sure it gets done. [20:44:07] !log milimetric@deploy1002 Started deploy [analytics/refinery@8a5ce13] (thin): Regular analytics weekly train THIN [analytics/refinery@8a5ce13] [20:44:15] !log milimetric@deploy1002 Finished deploy [analytics/refinery@8a5ce13] (thin): Regular analytics weekly train THIN [analytics/refinery@8a5ce13] (duration: 00m 08s) [20:44:35] !log milimetric@deploy1002 Started deploy [analytics/refinery@8a5ce13]: Regular analytics weekly train [analytics/refinery@8a5ce13] [20:44:36] !log milimetric@deploy1002 deploy aborted: Regular analytics weekly train [analytics/refinery@8a5ce13] (duration: 00m 00s) [20:44:57] !log milimetric@deploy1002 Started deploy [analytics/refinery@8a5ce13]: Regular analytics weekly train [analytics/refinery@8a5ce13] [20:45:14] !log milimetric@deploy1002 Finished deploy [analytics/refinery@8a5ce13]: Regular analytics weekly train [analytics/refinery@8a5ce13] (duration: 00m 16s) [20:45:23] sure! [20:46:02] Thanks! [20:46:19] RECOVERY - Disk space on an-launcher1002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-launcher1002&var-datasource=eqiad+prometheus/ops [20:47:14] !log cjming@deploy1002 Finished scap: Backport for [[gerrit:830213|Add localized wordmark for Bengali Wiktionary (T316953)]] (duration: 05m 24s) [20:47:18] T316953: Add localized wordmark for Bengali Wiktionary - https://phabricator.wikimedia.org/T316953 [20:47:21] MdsShakil: should be live! 
[20:48:04] !log end of UTC late backport window [20:48:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:48:15] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:48:18] cjming Yes, thank you for your help [20:48:23] np! [20:48:24] !log milimetric@deploy1002 Started deploy [analytics/refinery@8a5ce13] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@8a5ce13] [20:49:18] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:49:19] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:50:14] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:53:47] PROBLEM - ElasticSearch setting check - 9200 on elastic2031 is CRITICAL: CRITICAL - [elastic2027.codfw.wmnet:9700, elastic2029.codfw.wmnet:9700, elastic2054.codfw.wmnet:9700] does not match [elastic2054.codfw.wmnet:9700, elastic2076.codfw.wmnet:9700, elastic2080.codfw.wmnet:9700] for .(cluster https://wikitech.wikimedia.org/wiki/Search%23Administration [20:57:18] !log milimetric@deploy1002 Finished deploy [analytics/refinery@8a5ce13] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@8a5ce13] (duration: 08m 54s) [20:58:41] PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [21:10:06] ACKNOWLEDGEMENT - ElasticSearch setting check - 9200 on elastic2031 is CRITICAL: CRITICAL - [elastic2027.codfw.wmnet:9700, elastic2029.codfw.wmnet:9700, elastic2054.codfw.wmnet:9700] does not match [elastic2054.codfw.wmnet:9700, elastic2076.codfw.wmnet:9700, elastic2080.codfw.wmnet:9700] for .(cluster Brian_King Temporary, non-urgent condition until T313431 is resolved. - The acknowledgement expires at: 2022-09-14 21:09:29. https://wikitech.wikimedia.org/wiki/Search%23Administration 
[21:10:06] ACKNOWLEDGEMENT - ElasticSearch setting check - 9600 on elastic2076 is CRITICAL: CRITICAL - [elastic2025.codfw.wmnet:9300, elastic2031.codfw.wmnet:9300, elastic2042.codfw.wmnet:9300] does not match [elastic2025.codfw.wmnet:9300, elastic2031.codfw.wmnet:9300, elastic2042.codfw.wmnet:9300, elastic2074.codfw.wmnet:9300, elastic2081.codfw.wmnet:9300] for .(cluster Brian_King Temporary, non-urgent condition until T313431 is resolved. - The acknowledgement expires at: 2022-09-14 21:09:29. https://wikitech.wikimedia.org/wiki/Search%23Administration [21:13:24] !log milimetric@deploy1002 Started deploy [analytics/refinery@b14c9f4]: Hotfix for requestctl field [21:14:08] (03CR) 10Andrew Bogott: [C: 03+2] P:openstack::codfw1dev::db: cleanup firewall rules [puppet] - 10https://gerrit.wikimedia.org/r/824765 (owner: 10Majavah) [21:15:52] PROBLEM - ElasticSearch setting check - 9200 on elastic2025 is CRITICAL: CRITICAL - [elastic2027.codfw.wmnet:9700, elastic2029.codfw.wmnet:9700, elastic2054.codfw.wmnet:9700] does not match [elastic2054.codfw.wmnet:9700, elastic2076.codfw.wmnet:9700, elastic2080.codfw.wmnet:9700] for .(cluster https://wikitech.wikimedia.org/wiki/Search%23Administration [21:16:57] ACKNOWLEDGEMENT - ElasticSearch setting check - 9200 on elastic2025 is CRITICAL: CRITICAL - [elastic2027.codfw.wmnet:9700, elastic2029.codfw.wmnet:9700, elastic2054.codfw.wmnet:9700] does not match [elastic2054.codfw.wmnet:9700, elastic2076.codfw.wmnet:9700, elastic2080.codfw.wmnet:9700] for .(cluster Brian_King Temporary, non-urgent condition until T313431 is resolved - The acknowledgement expires at: 2022-09-14 21:16:36. https://wikitech.wikimedia.org/wiki/Search%23Administration 
[21:18:45] (03PS1) 10Eevans: cassandra: Add dummy password for aqs_testing roll [labs/private] - 10https://gerrit.wikimedia.org/r/830265 (https://phabricator.wikimedia.org/T317140) [21:18:54] (03PS4) 10Ori: Sort query parameters in URLs [software/purged] - 10https://gerrit.wikimedia.org/r/830124 (https://phabricator.wikimedia.org/T317064) (owner: 10Giuseppe Lavagetto) [21:19:32] RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [21:19:53] (03CR) 10Ori: "_joe_: I updated the patch to make the implementation match libvmod-querysort exactly, and copied over the exact set of testcases from the" [software/purged] - 10https://gerrit.wikimedia.org/r/830124 (https://phabricator.wikimedia.org/T317064) (owner: 10Giuseppe Lavagetto) [21:19:57] (03Abandoned) 10Eevans: cassandra: Add dummy password for aqs_testing roll [labs/private] - 10https://gerrit.wikimedia.org/r/830265 (https://phabricator.wikimedia.org/T317140) (owner: 10Eevans) [21:22:17] (03PS1) 10Eevans: cassandra: Add dummy password for aqs_testing roll [labs/private] - 10https://gerrit.wikimedia.org/r/830267 (https://phabricator.wikimedia.org/T317140) [21:24:22] (03PS1) 10Eevans: cassandra: Create new role for testing AQS bulk-loader changes [puppet] - 10https://gerrit.wikimedia.org/r/830268 (https://phabricator.wikimedia.org/T317140) [21:25:08] (03CR) 10CI reject: [V: 04-1] cassandra: Create new role for testing AQS bulk-loader changes [puppet] - 10https://gerrit.wikimedia.org/r/830268 (https://phabricator.wikimedia.org/T317140) (owner: 10Eevans) [21:26:05] 10SRE, 10Observability-Alerting: improve cron spam visibility - https://phabricator.wikimedia.org/T84845 (10Dzahn) But the difference is now it's not sent to root@ and instead to actual teams. 
I would hope that means they don't get ignored any longer which is the original problem statement in this ticket. That... [21:27:41] (03CR) 10Eevans: [C: 03+2] cassandra: Add dummy password for aqs_testing roll [labs/private] - 10https://gerrit.wikimedia.org/r/830267 (https://phabricator.wikimedia.org/T317140) (owner: 10Eevans) [21:28:13] (03CR) 10Eevans: [V: 03+2 C: 03+2] cassandra: Add dummy password for aqs_testing roll [labs/private] - 10https://gerrit.wikimedia.org/r/830267 (https://phabricator.wikimedia.org/T317140) (owner: 10Eevans) [21:30:51] !log root@cumin1001 END (ERROR) - Cookbook sre.network.prepare-upgrade (exit_code=97) [21:30:58] (03CR) 10Reedy: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/830248 (https://phabricator.wikimedia.org/T316928) (owner: 10Robertsky) [21:31:35] jouncebot: nowandnext [21:31:35] No deployments scheduled for the next 9 hour(s) and 28 minute(s) [21:31:35] In 9 hour(s) and 28 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220907T0700) [21:32:16] PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [21:32:22] (03CR) 10Bking: [C: 03+2] wdqs: add bking as contact for wdqs alerts [puppet] - 10https://gerrit.wikimedia.org/r/824553 (https://phabricator.wikimedia.org/T313095) (owner: 10Bking) [21:34:09] urandom I skipped over your patch, you want me to go ahead and puppet-merge? [21:34:37] inflatador: oh, yes please [21:34:42] RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [21:35:06] urandom ACK, merged [21:37:11] (PS2) Eevans: cassandra: Create new role for testing AQS bulk-loader changes [puppet] - https://gerrit.wikimedia.org/r/830268 (https://phabricator.wikimedia.org/T317140) [21:37:34] (CR) Dzahn: [C: +2] "https://puppet-compiler.wmflabs.org/pcc-worker1002/37139/" [puppet] - https://gerrit.wikimedia.org/r/829244 (https://phabricator.wikimedia.org/T315713) (owner: Dzahn) [21:38:00] (CR) CI reject: [V: -1] cassandra: Create new role for testing AQS bulk-loader changes [puppet] - https://gerrit.wikimedia.org/r/830268 (https://phabricator.wikimedia.org/T317140) (owner: Eevans) [21:39:52] !log phabricator - passive hosts in codfw switched to readonly DB access (m3-slave, not m3-master) T315713 [21:39:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:39:55] T315713: sort out mysql privileges for phab1004/phab2002 - https://phabricator.wikimedia.org/T315713 [21:41:38] !log root@cumin1001 END (PASS) - Cookbook sre.network.prepare-upgrade (exit_code=0) [21:42:32] (CR) Dzahn: [C: +2] "changed config on 2001 and 2002 to use m3-slave, noop on phab1001 production host" [puppet] - https://gerrit.wikimedia.org/r/829244 (https://phabricator.wikimedia.org/T315713) (owner: Dzahn) [21:45:33] !log milimetric@deploy1002 deploy aborted: Hotfix for requestctl field (duration: 32m 09s) [21:45:37] !log milimetric@deploy1002 Started deploy [analytics/refinery@b14c9f4]: Hotfix for requestctl field [21:45:39] (CR) Dzahn: "this is safer now since a new server won't talk to rw-mysql DB until it's explictily made the active phabricator server https://gerrit.wik" [puppet] - https://gerrit.wikimedia.org/r/824803 (https://phabricator.wikimedia.org/T280597) (owner: Dzahn) [21:49:32] !log milimetric@deploy1002 Finished deploy [analytics/refinery@b14c9f4]: Hotfix
for requestctl field (duration: 03m 55s) [21:49:34] !log milimetric@deploy1002 Started deploy [analytics/refinery@b14c9f4]: Hotfix for requestctl field [21:52:20] PROBLEM - Swift https backend on ms-fe1012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.107 second response time https://wikitech.wikimedia.org/wiki/Swift [21:53:02] (PS1) BryanDavis: striker: bump container version to 2022-09-06-213820-production [puppet] - https://gerrit.wikimedia.org/r/830275 (https://phabricator.wikimedia.org/T296893) [21:53:03] !log milimetric@deploy1002 Finished deploy [analytics/refinery@b14c9f4]: Hotfix for requestctl field (duration: 03m 28s) [21:53:35] !log milimetric@deploy1002 Started deploy [analytics/refinery@b14c9f4]: Hotfix for requestctl field [21:54:38] RECOVERY - Swift https backend on ms-fe1012 is OK: HTTP OK: HTTP/1.1 200 OK - 451 bytes in 0.023 second response time https://wikitech.wikimedia.org/wiki/Swift [21:56:03] !log milimetric@deploy1002 Finished deploy [analytics/refinery@b14c9f4]: Hotfix for requestctl field (duration: 02m 28s) [21:56:24] !log milimetric@deploy1002 Started deploy [analytics/refinery@b14c9f4] (thin): Hotfix for requestctl field [21:56:32] !log milimetric@deploy1002 Finished deploy [analytics/refinery@b14c9f4] (thin): Hotfix for requestctl field (duration: 00m 08s) [21:56:45] (PS4) Dzahn: site: add phabricator role on phab1004 [puppet] - https://gerrit.wikimedia.org/r/824803 (https://phabricator.wikimedia.org/T280597) [21:57:40] (CR) BryanDavis: [V: +1 C: +1] "PCC results: https://puppet-compiler.wmflabs.org/pcc-worker1003/37140/" [puppet] - https://gerrit.wikimedia.org/r/830275 (https://phabricator.wikimedia.org/T296893) (owner: BryanDavis) [21:59:59] (CR) Andrew Bogott: [C: +2] striker: bump container version to 2022-09-06-213820-production [puppet] - https://gerrit.wikimedia.org/r/830275 (https://phabricator.wikimedia.org/T296893) (owner: BryanDavis) [22:01:29] PROBLEM - k8s API
server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [22:04:47] RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [22:11:39] (CR) Dzahn: "I'll go ahead and self-merge then" [puppet] - https://gerrit.wikimedia.org/r/824803 (https://phabricator.wikimedia.org/T280597) (owner: Dzahn) [22:15:25] (CR) Dzahn: [V: +1 C: +2] "https://puppet-compiler.wmflabs.org/pcc-worker1001/37141/phab1004.eqiad.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/824803 (https://phabricator.wikimedia.org/T280597) (owner: Dzahn) [22:17:04] (CR) Dzahn: [V: +1 C: +2] "database_host: m3-slave.eqiad.wmnet (read-only) https://phabricator.wikimedia.org/T315713 https://puppet-compiler.wmflabs.org/pcc-worker1" [puppet] - https://gerrit.wikimedia.org/r/824803 (https://phabricator.wikimedia.org/T280597) (owner: Dzahn) [22:22:39] (CR) Dzahn: [V: +1 C: +2] "Error: /Stage[main]/Phabricator/Scap::Target[phabricator/deployment]/Package[phabricator/deployment]: Provider scap3 is not functional on " [puppet] - https://gerrit.wikimedia.org/r/824803 (https://phabricator.wikimedia.org/T280597) (owner: Dzahn) [22:24:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2117 (T314041)', diff saved to https://phabricator.wikimedia.org/P33979 and previous config saved to /var/cache/conftool/dbconfig/20220906-222418-ladsgroup.json [22:24:20] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2114.codfw.wmnet with reason: Maintenance [22:24:23] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [22:24:33] !log ladsgroup@cumin1001
END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2114.codfw.wmnet with reason: Maintenance [22:24:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2114 (T314041)', diff saved to https://phabricator.wikimedia.org/P33980 and previous config saved to /var/cache/conftool/dbconfig/20220906-222439-ladsgroup.json [22:33:23] PROBLEM - SSH on wtp1044.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:45:13] PROBLEM - Swift https backend on ms-fe1012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.204 second response time https://wikitech.wikimedia.org/wiki/Swift [22:46:47] RECOVERY - Swift https backend on ms-fe1012 is OK: HTTP OK: HTTP/1.1 200 OK - 451 bytes in 0.024 second response time https://wikitech.wikimedia.org/wiki/Swift [22:49:28] (PS1) Dzahn: phabricator: ensure home dir exists before sysuser is created [puppet] - https://gerrit.wikimedia.org/r/830284 (https://phabricator.wikimedia.org/T280597) [22:54:57] (PS1) Dzahn: phabricator::phd: actually use $phd_user variable and small improvements [puppet] - https://gerrit.wikimedia.org/r/830285 (https://phabricator.wikimedia.org/T280597) [22:55:59] (PS2) Dzahn: phabricator::phd: actually use $phd_user variable and small improvements [puppet] - https://gerrit.wikimedia.org/r/830285 (https://phabricator.wikimedia.org/T280597) [22:59:27] PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [23:00:33] PROBLEM - Check no envoy runtime configuration is left persistent on phab1004 is CRITICAL: connect to address 127.0.0.1 and port 9631: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [23:01:51] RECOVERY - k8s API server requests
latencies on kubemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [23:03:47] PROBLEM - Check that envoy is running on phab1004 is CRITICAL: CRITICAL - Expecting active but unit envoyproxy.service is inactive https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [23:05:45] phab1004 is me because I just applied a new role [23:05:52] I'll fix the alerting [23:06:58] !log dzahn@cumin2002 START - Cookbook sre.hosts.downtime for 6 days, 0:00:00 on phab1004.eqiad.wmnet with reason: new install [23:07:14] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6 days, 0:00:00 on phab1004.eqiad.wmnet with reason: new install [23:12:14] ACKNOWLEDGEMENT - Host an-presto1006.mgmt is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn https://phabricator.wikimedia.org/T304289 [23:12:14] ACKNOWLEDGEMENT - Host elastic1089.mgmt is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn https://phabricator.wikimedia.org/T304289 [23:12:14] ACKNOWLEDGEMENT - Host elastic1090.mgmt is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn https://phabricator.wikimedia.org/T304289 [23:12:14] ACKNOWLEDGEMENT - Host ores2009.mgmt is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn https://phabricator.wikimedia.org/T304289 [23:12:14] ACKNOWLEDGEMENT - Host thumbor2004.mgmt is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn https://phabricator.wikimedia.org/T304289 [23:14:39] ACKNOWLEDGEMENT - SSH on mw1310.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds daniel_zahn https://phabricator.wikimedia.org/T304289 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:14:39] ACKNOWLEDGEMENT - SSH on mw1327.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds daniel_zahn https://phabricator.wikimedia.org/T304289 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:14:39] ACKNOWLEDGEMENT - SSH on mw1338.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds daniel_zahn https://phabricator.wikimedia.org/T304289 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:14:39] ACKNOWLEDGEMENT - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds daniel_zahn https://phabricator.wikimedia.org/T304289 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:21:25] SRE, observability: en.wikibooks.org has changed legal footer - https://phabricator.wikimedia.org/T317169 (Dzahn) [23:32:57] PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [23:35:19] RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [23:38:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2114 (T314041)', diff saved to https://phabricator.wikimedia.org/P33981 and previous config saved to /var/cache/conftool/dbconfig/20220906-233809-ladsgroup.json [23:38:13] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [23:44:19] (PS13) Raymond Ndibe: wmcs: changes to api service to manage toolforge replica.my.cnf [puppet] - https://gerrit.wikimedia.org/r/810965 (https://phabricator.wikimedia.org/T304040) [23:50:07] (PS14) Raymond Ndibe: wmcs: changes to api service to manage toolforge replica.my.cnf [puppet] - https://gerrit.wikimedia.org/r/810965 (https://phabricator.wikimedia.org/T304040) [23:50:37] (CR) Raymond Ndibe: wmcs: changes to api service to manage toolforge replica.my.cnf (19 comments) [puppet] - https://gerrit.wikimedia.org/r/810965 (https://phabricator.wikimedia.org/T304040) (owner: Raymond Ndibe) [23:52:11] PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [23:53:39] SRE, observability: en.wikibooks.org has changed legal footer - https://phabricator.wikimedia.org/T317169 (Vahurzpu) Due to [[ https://en.wikibooks.org/w/index.php?title=MediaWiki:Wikimedia-copyright&diff=4101009&oldid=3696291 | this diff ]]; it's just a punctuation fix, not anything that modifies the me... [23:55:38] SRE, observability: en.wikibooks.org has changed legal footer - https://phabricator.wikimedia.org/T317169 (Dzahn) @Vahurzpu Thank you! ACK It's not the first time it happens due to small changes.
Part of this ticket was to point out that every time there is any minimal change this causes a false alert... [23:59:25] RECOVERY - k8s API server requests latencies on kubemaster1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27