[15:26:16] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Remove db1233 from groups', diff saved to https://phabricator.wikimedia.org/P66141 and previous config saved to /var/cache/conftool/dbconfig/20240710-152616-ladsgroup.json
[15:27:10] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on A:lvs-high-traffic1 (T368083)
[15:27:27] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P66142 and previous config saved to /var/cache/conftool/dbconfig/20240710-152727-arnaudb.json
[15:30:29] (CR) Vgutierrez: [C:+1] hiera: Switch cloudelastic to maglev [puppet] - https://gerrit.wikimedia.org/r/1053334 (https://phabricator.wikimedia.org/T368083) (owner: BCornwall)
[15:30:35] FIRING: ProbeDown: Service ml-cache1001-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#ml-cache1001-a:7000 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:30:43] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, July 11 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - https://gerrit.wikimedia.org/r/1053006 (https://phabricator.wikimedia.org/T361013) (owner: Daniel Kinzler)
[15:31:21] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, July 10 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - https://gerrit.wikimedia.org/r/1053006 (https://phabricator.wikimedia.org/T361013) (owner: Daniel Kinzler)
[15:33:44] (CR) Arnaudb: [C:+2] "I've started https://gerrit.wikimedia.org/r/c/operations/puppet/+/1053326" [puppet] - https://gerrit.wikimedia.org/r/1048006 (https://phabricator.wikimedia.org/T367278) (owner: Arnaudb)
[15:34:17] FIRING: [2x] ProbeDown: Service ml-cache1001-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:35:17] (CR) Vgutierrez: [C:+1] hiera: Switch ldap-ro to maglev [puppet] - https://gerrit.wikimedia.org/r/1053333 (https://phabricator.wikimedia.org/T368083) (owner: BCornwall)
[15:35:40] (CR) Slyngshede: [C:+1] "I'm not familiar with the differences between the two schedulers, but LDAP is a simple enough protocol that I can't see it breaking anythi" [puppet] - https://gerrit.wikimedia.org/r/1053333 (https://phabricator.wikimedia.org/T368083) (owner: BCornwall)
[15:36:42] !log btullis@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on an-presto1004.eqiad.wmnet with reason: Shutting down to investigate RAM issue
[15:36:56] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on an-presto1004.eqiad.wmnet with reason: Shutting down to investigate RAM issue
[15:37:02] (CR) BCornwall: [C:+2] hiera: Switch ldap-ro to maglev [puppet] - https://gerrit.wikimedia.org/r/1053333 (https://phabricator.wikimedia.org/T368083) (owner: BCornwall)
[15:37:12] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[15:37:51] PROBLEM - MariaDB Replica Lag: s4 on clouddb1019 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 309.22 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[15:38:09] (CR) BCornwall: [C:+2] hiera: Switch cloudelastic to maglev [puppet] - https://gerrit.wikimedia.org/r/1053334 (https://phabricator.wikimedia.org/T368083) (owner: BCornwall)
[15:38:49] PROBLEM - Host lsw1-e1-eqiad.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[15:39:03] FIRING: KafkaUnderReplicatedPartitions: Under replicated partitions for Kafka cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions
[15:39:10] that seems expected wdyt topranks ?
[15:39:40] arnaudb: it's fine yep, although should be downtimed
[15:39:51] RECOVERY - Host lsw1-e1-eqiad.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.56 ms
[15:40:32] downtime expired, I had to hold on on reboot for some nodes to be ready and had been a little too tight with the timing
[15:40:44] switch is just back online now, it's initializing itself
[15:40:50] (PS1) Santiago Faci: Metrics Platform Instrument Configuration: Deploy v0.1.0 to staging [deployment-charts] - https://gerrit.wikimedia.org/r/1053339 (https://phabricator.wikimedia.org/T369544)
[15:41:11] ops-codfw, SRE, cloud-services-team, DC-Ops: Test new hardware candidate for cloudbackup replacement - https://phabricator.wikimedia.org/T353746#9970063 (Jhancock.wm) a:Jhancock.wm→None
[15:41:15] PROBLEM - BGP status on lsw1-e1-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64810/IPv4: Active - evpn_switches_eqiad, AS64810/IPv4: Active - evpn_switches_eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:42:13] RECOVERY - BGP status on lsw1-e1-eqiad.mgmt is OK: BGP OK - up: 6, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:42:34] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1172 (T367781)', diff saved to https://phabricator.wikimedia.org/P66143 and previous config saved to /var/cache/conftool/dbconfig/20240710-154234-arnaudb.json
[15:42:36] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1177.eqiad.wmnet with reason: Maintenance
[15:42:38] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781
[15:42:49] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1177.eqiad.wmnet with reason: Maintenance
[15:42:56] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1177 (T367781)', diff saved to https://phabricator.wikimedia.org/P66144 and previous config saved to /var/cache/conftool/dbconfig/20240710-154256-arnaudb.json
[15:44:03] RESOLVED: KafkaUnderReplicatedPartitions: Under replicated partitions for Kafka cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions
[15:44:05] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1177 (T367781)', diff saved to https://phabricator.wikimedia.org/P66145 and previous config saved to /var/cache/conftool/dbconfig/20240710-154404-arnaudb.json
[15:44:17] RESOLVED: [2x] ProbeDown: Service ml-cache1001-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:45:30] (CR) Clare Ming: [C:+2] Metrics Platform Instrument Configuration: Deploy v0.1.0 to staging [deployment-charts] - https://gerrit.wikimedia.org/r/1053339 (https://phabricator.wikimedia.org/T369544) (owner: Santiago Faci)
[15:45:40] SRE, SRE-swift-storage, Data-Persistence, DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e1-eqiad - https://phabricator.wikimedia.org/T365993#9970099 (cmooney) Switch upgraded successfully and all hosts back online/pinging. Thanks everyone for the assistance!
[15:46:15] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1190 (re)pooling @ 5%: post T365993 repool', diff saved to https://phabricator.wikimedia.org/P66146 and previous config saved to /var/cache/conftool/dbconfig/20240710-154615-arnaudb.json
[15:46:22] T365993: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e1-eqiad - https://phabricator.wikimedia.org/T365993
[15:46:24] (Merged) jenkins-bot: Metrics Platform Instrument Configuration: Deploy v0.1.0 to staging [deployment-charts] - https://gerrit.wikimedia.org/r/1053339 (https://phabricator.wikimedia.org/T369544) (owner: Santiago Faci)
[15:46:55] SRE, SRE-swift-storage, Data-Persistence, DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e1-eqiad - https://phabricator.wikimedia.org/T365993#9970119 (ABran-WMF) db1190 repooling dbproxy reloaded everything looks OK
[15:48:17] !log sfaci@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply
[15:48:34] !log sfaci@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply
[15:48:51] !log brett@cumin2002 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on A:lvs-high-traffic1-eqsin (T368083)
[15:48:59] T368083: migrate all high-traffic1 and high-traffic2 services to maglev - https://phabricator.wikimedia.org/T368083
[15:49:25] !log brett@cumin2002 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on A:lvs-high-traffic1-eqsin (T368083)
[15:50:03] (PS1) David Caro: mysqld_exporter: pull from bpo in bullseye [puppet] - https://gerrit.wikimedia.org/r/1053341 (https://phabricator.wikimedia.org/T369722)
[15:51:02] repooling ms-fe1012.eqiad.wmnet — T365993
[15:52:19] SRE, SRE-swift-storage, Data-Persistence, DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e1-eqiad - https://phabricator.wikimedia.org/T365993#9970127 (Eevans) ms-fe1012 repooled, and everything looks good.
[15:53:11] !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Unbanning all hosts in search_eqiad
[15:53:14] !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Unbanning all hosts in search_eqiad
[15:53:59] (CR) Scott French: [C:+2] rest-gateway: route commons-impact-analytics via metrics/commons-analytics [deployment-charts] - https://gerrit.wikimedia.org/r/1053077 (https://phabricator.wikimedia.org/T361835) (owner: Scott French)
[15:54:58] !log brett@cumin2002 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on A:lvs-high-traffic2-eqsin (T368083)
[15:55:02] T368083: migrate all high-traffic1 and high-traffic2 services to maglev - https://phabricator.wikimedia.org/T368083
[15:55:03] (Merged) jenkins-bot: rest-gateway: route commons-impact-analytics via metrics/commons-analytics [deployment-charts] - https://gerrit.wikimedia.org/r/1053077 (https://phabricator.wikimedia.org/T361835) (owner: Scott French)
[15:55:31] !log brett@cumin2002 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on A:lvs-high-traffic2-eqsin (T368083)
[15:56:07] SRE, LDAP-Access-Requests: Update terms and timeline of access already granted for AndyRussG - https://phabricator.wikimedia.org/T367681#9970138 (KFrancis) The NDA is signed. Thanks!
[15:56:39] FIRING: [3x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance elastic1089-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[15:56:53] ^^ expected alert, should clear shortly
[15:57:04] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1162 (T352010)', diff saved to https://phabricator.wikimedia.org/P66147 and previous config saved to /var/cache/conftool/dbconfig/20240710-155703-ladsgroup.json
[15:57:13] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[15:58:49] !log swfrench@deploy1002 helmfile [staging] START helmfile.d/services/rest-gateway: apply
[15:59:07] !log swfrench@deploy1002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply
[15:59:12] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P66148 and previous config saved to /var/cache/conftool/dbconfig/20240710-155911-arnaudb.json
[15:59:49] (CR) Arnaudb: [C:+1] mysqld_exporter: pull from bpo in bullseye [puppet] - https://gerrit.wikimedia.org/r/1053341 (https://phabricator.wikimedia.org/T369722) (owner: David Caro)
[16:00:00] !log brett@cumin2002 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on A:lvs-high-traffic1 (T368083)
[16:00:16] T368083: migrate all high-traffic1 and high-traffic2 services to maglev - https://phabricator.wikimedia.org/T368083
[16:00:44] !log swfrench@deploy1002 helmfile [codfw] START helmfile.d/services/rest-gateway: apply
[16:01:03] !log swfrench@deploy1002 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply
[16:01:21] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1190 (re)pooling @ 10%: post T365993 repool', diff saved to https://phabricator.wikimedia.org/P66149 and previous config saved to /var/cache/conftool/dbconfig/20240710-160120-arnaudb.json
[16:01:25] (CR) Andrea Denisse: [C:+2] webperf: Enable the navtiming service at boot [puppet] - https://gerrit.wikimedia.org/r/1053093 (https://phabricator.wikimedia.org/T366571) (owner: Andrea Denisse)
[16:01:32] (CR) Andrea Denisse: [C:+2] webperf: Enable the statsv service at boot [puppet] - https://gerrit.wikimedia.org/r/1053094 (https://phabricator.wikimedia.org/T366571) (owner: Andrea Denisse)
[16:01:50] T365993: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e1-eqiad - https://phabricator.wikimedia.org/T365993
[16:02:58] !log brett@cumin2002 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on A:lvs-high-traffic1 (T368083)
[16:05:20] !log brett@cumin2002 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on A:lvs-high-traffic2 (T368083)
[16:05:39] T368083: migrate all high-traffic1 and high-traffic2 services to maglev - https://phabricator.wikimedia.org/T368083
[16:07:26] SRE-Access-Requests, Data-Engineering: Request for Kerb credentials for Ariel Glenn - https://phabricator.wikimedia.org/T368911#9970172 (CDanis)
[16:08:17] !log brett@cumin2002 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on A:lvs-high-traffic2 (T368083)
[16:11:39] FIRING: [3x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance elastic1089-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[16:11:48] !log brett@cumin2002 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on A:lvs-secondary (T368083)
[16:12:03] T368083: migrate all high-traffic1 and high-traffic2 services to maglev - https://phabricator.wikimedia.org/T368083
[16:12:11] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1162', diff saved to https://phabricator.wikimedia.org/P66150 and previous config saved to /var/cache/conftool/dbconfig/20240710-161211-ladsgroup.json
[16:14:19] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P66151 and previous config saved to /var/cache/conftool/dbconfig/20240710-161419-arnaudb.json
[16:14:42] (CR) David Caro: [C:+2] "Tested in tools-db-1:" [puppet] - https://gerrit.wikimedia.org/r/1053341 (https://phabricator.wikimedia.org/T369722) (owner: David Caro)
[16:14:43] !log brett@cumin2002 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on A:lvs-secondary (T368083)
[16:16:26] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1190 (re)pooling @ 25%: post T365993 repool', diff saved to https://phabricator.wikimedia.org/P66152 and previous config saved to /var/cache/conftool/dbconfig/20240710-161626-arnaudb.json
[16:16:42] T365993: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e1-eqiad - https://phabricator.wikimedia.org/T365993
[16:17:32] !log swfrench@deploy1002 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply
[16:17:46] !log swfrench@deploy1002 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply
[16:20:36] ops-eqiad, DC-Ops: hw troubleshooting: DIMM in slot B1 of an-presto1004 is no longer detected - https://phabricator.wikimedia.org/T369265#9970293 (VRiley-WMF) Open→Resolved Checked in with @BTullis to commence this ticket. However, found that a cold boot worked and it's now seeing all the mem...
[16:21:14] SRE, LDAP-Access-Requests: Update terms and timeline of access already granted for AndyRussG - https://phabricator.wikimedia.org/T367681#9970296 (Dzahn) Thanks Katie! @volans looks to me this can be resolved now.
[16:21:39] RESOLVED: [3x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance elastic1089-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[16:23:46] FIRING: Primary outbound port utilisation over 80% #page: Alert for device cr2-eqiad.wikimedia.org - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page
[16:24:00] woah
[16:24:05] here, looking
[16:24:14] I'm afk but I'll be there in five
[16:24:32] SRE, serviceops, Data Products (Data Products Sprint 16), Patch-For-Review, Service-deployment-requests: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production - https://phabricator.wikimedia.org/T361835#9970307 (Scott_French) Alright, good(er) news: the service is now live at `/...
[16:27:19] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1162', diff saved to https://phabricator.wikimedia.org/P66153 and previous config saved to /var/cache/conftool/dbconfig/20240710-162718-ladsgroup.json
[16:28:05] okay, here, looking
[16:28:16] (CR) Dzahn: [C:+1] "Thanks for taking this on so quickly!" [puppet] - https://gerrit.wikimedia.org/r/1053274 (https://phabricator.wikimedia.org/T367012) (owner: Clément Goubert)
[16:28:45] FIRING: Primary inbound port utilisation over 80% #page: Alert for device cr1-esams.wikimedia.org - Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page
[16:29:05] SRE, DNS, fundraising-tech-ops, serviceops, and 2 others: redirect benefactors.wikimedia.org (was: Cleanup unused DNS subdomains) - https://phabricator.wikimedia.org/T367012#9970327 (Dzahn) Thanks Clément! I was about to make a subtask and rename this again but you already took it on. It's apprec...
[16:29:15] acked
[16:29:31] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1177 (T367781)', diff saved to Unable to send diff to phaste and previous config saved to /var/cache/conftool/dbconfig/20240710-162926-arnaudb.json
[16:29:33] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1178.eqiad.wmnet with reason: Maintenance
[16:29:45] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781
[16:29:46] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1178.eqiad.wmnet with reason: Maintenance
[16:29:53] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1178 (T367781)', diff saved to https://phabricator.wikimedia.org/P66155 and previous config saved to /var/cache/conftool/dbconfig/20240710-162952-arnaudb.json
[16:31:01] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1178 (T367781)', diff saved to https://phabricator.wikimedia.org/P66156 and previous config saved to /var/cache/conftool/dbconfig/20240710-163100-arnaudb.json
[16:31:32] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1190 (re)pooling @ 50%: post T365993 repool', diff saved to https://phabricator.wikimedia.org/P66157 and previous config saved to /var/cache/conftool/dbconfig/20240710-163131-arnaudb.json
[16:31:45] T365993: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e1-eqiad - https://phabricator.wikimedia.org/T365993
[16:33:46] RESOLVED: Primary outbound port utilisation over 80% #page: Device cr2-eqiad.wikimedia.org recovered from Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page
[16:34:50] (PS1) Santiago Faci: Metrics Platform Instrument Configuration: Deploy v0.1.0 to production [deployment-charts] - https://gerrit.wikimedia.org/r/1053350 (https://phabricator.wikimedia.org/T369544)
[16:38:45] RESOLVED: Primary inbound port utilisation over 80% #page: Device cr1-esams.wikimedia.org recovered from Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page
[16:40:07] (PS1) Dzahn: admin: add Kwaku as approver for the dns-admins group [puppet] - https://gerrit.wikimedia.org/r/1053352 (https://phabricator.wikimedia.org/T276465)
[16:40:39] (CR) Ssingh: [C:+1] "Looks good from Traffic but pending Kwaku's approval." [puppet] - https://gerrit.wikimedia.org/r/1053352 (https://phabricator.wikimedia.org/T276465) (owner: Dzahn)
[16:42:26] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1162 (T352010)', diff saved to https://phabricator.wikimedia.org/P66158 and previous config saved to /var/cache/conftool/dbconfig/20240710-164225-ladsgroup.json
[16:42:29] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[16:43:45] FIRING: Primary outbound port utilisation over 80% #page: Alert for device cr2-eqiad.wikimedia.org - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page
[16:46:08] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P66159 and previous config saved to /var/cache/conftool/dbconfig/20240710-164608-arnaudb.json
[16:46:38] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1190 (re)pooling @ 75%: post T365993 repool', diff saved to https://phabricator.wikimedia.org/P66160 and previous config saved to /var/cache/conftool/dbconfig/20240710-164637-arnaudb.json
[16:46:41] T365993: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e1-eqiad - https://phabricator.wikimedia.org/T365993
[16:48:30] ops-eqiad, DC-Ops, serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743 (RobH) NEW
[16:48:45] FIRING: Primary inbound port utilisation over 80% #page: Alert for device cr1-esams.wikimedia.org - Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page
[16:51:19] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 225, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[16:51:35] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 138, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[16:53:25] (CR) Volans: [C:+1] "LGTM" [software/debmonitor] - https://gerrit.wikimedia.org/r/1051299 (https://phabricator.wikimedia.org/T368744) (owner: Volans)
[16:53:45] RESOLVED: Primary inbound port utilisation over 80% #page: Device cr1-esams.wikimedia.org recovered from Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page
[16:53:45] RESOLVED: Primary outbound port utilisation over 80% #page: Device cr2-eqiad.wikimedia.org recovered from Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page
[16:56:51] RECOVERY - MariaDB Replica Lag: s4 on clouddb1019 is OK: OK slave_sql_lag Replication lag: 0.21 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[16:59:02] (PS5) Andrew Bogott: Openstack cli: stamp out openstack auth via env settings [puppet] - https://gerrit.wikimedia.org/r/1053086 (https://phabricator.wikimedia.org/T337577)
[16:59:02] (PS1) Andrew Bogott: wmcs-cold-migrate.py: remove [puppet] - https://gerrit.wikimedia.org/r/1053354
[16:59:02] (PS1) Andrew Bogott: wmcs-pause-cloud: remove [puppet] - https://gerrit.wikimedia.org/r/1053355
[16:59:02] (PS1) Andrew Bogott: wmcs-makedomain: use clouds.yaml openstack auth [puppet] - https://gerrit.wikimedia.org/r/1053356 (https://phabricator.wikimedia.org/T337577)
[16:59:03] (PS1) Andrew Bogott: wmcs-drain-hypervisor: remove spurious args, use clouds.yaml [puppet] - https://gerrit.wikimedia.org/r/1053357 (https://phabricator.wikimedia.org/T337577)
[16:59:56] (PS1) Jdlrobson: Add beta tag & feedback link to Appearance menu [skins/Vector] (wmf/1.43.0-wmf.12) - https://gerrit.wikimedia.org/r/1053358 (https://phabricator.wikimedia.org/T367871)
[17:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240710T1700)
[17:00:12] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, July 10 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [skins/Vector] (wmf/1.43.0-wmf.12) - https://gerrit.wikimedia.org/r/1053358 (https://phabricator.wikimedia.org/T367871) (owner: Jdlrobson)
[17:01:15] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P66161 and previous config saved to /var/cache/conftool/dbconfig/20240710-170115-arnaudb.json
[17:01:43] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1190 (re)pooling @ 100%: post T365993 repool', diff saved to https://phabricator.wikimedia.org/P66162 and previous config saved to /var/cache/conftool/dbconfig/20240710-170143-arnaudb.json
[17:01:47] T365993: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e1-eqiad - https://phabricator.wikimedia.org/T365993
[17:02:39] (CR) Clare Ming: [C:+2] Metrics Platform Instrument Configuration: Deploy v0.1.0 to production [deployment-charts] - https://gerrit.wikimedia.org/r/1053350 (https://phabricator.wikimedia.org/T369544) (owner: Santiago Faci)
[17:03:33] (Merged) jenkins-bot: Metrics Platform Instrument Configuration: Deploy v0.1.0 to production [deployment-charts] - https://gerrit.wikimedia.org/r/1053350 (https://phabricator.wikimedia.org/T369544) (owner: Santiago Faci)
[17:04:25] FIRING: [2x] SystemdUnitFailed: envoyproxy.service on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:05:49] FIRING: PuppetConstantChange: Puppet performing a change on every puppet run on relforge1004:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[17:16:15] SRE, SRE-Access-Requests, Data-Engineering: Request for Kerb credentials for Ariel Glenn - https://phabricator.wikimedia.org/T368911#9970616 (Dzahn) Technically you should nudge the person listed as this week's clinic duty. (as of today that's godog). Ticket was just missing the tags for access reque...
[17:16:23] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1178 (T367781)', diff saved to https://phabricator.wikimedia.org/P66163 and previous config saved to /var/cache/conftool/dbconfig/20240710-171622-arnaudb.json
[17:16:24] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1192.eqiad.wmnet with reason: Maintenance
[17:16:35] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781
[17:16:37] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1192.eqiad.wmnet with reason: Maintenance
[17:16:45] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1192 (T367781)', diff saved to https://phabricator.wikimedia.org/P66164 and previous config saved to /var/cache/conftool/dbconfig/20240710-171644-arnaudb.json
[17:16:55] (PS13) Bking: relforge: test envoyproxy [puppet] - https://gerrit.wikimedia.org/r/1053041 (https://phabricator.wikimedia.org/T368950)
[17:18:34] (PS1) Dzahn: admin: add Ariel to analytics-privatedata-users, add krb: present [puppet] - https://gerrit.wikimedia.org/r/1053360 (https://phabricator.wikimedia.org/T368911)
[17:18:44] (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1053041 (https://phabricator.wikimedia.org/T368950) (owner: Bking)
[17:19:29] (CR) Dzahn: "for this to work also a principal has to be created: https://wikitech.wikimedia.org/wiki/Data_Platform/Systems/Kerberos/Administration#Cre" [puppet] - https://gerrit.wikimedia.org/r/1053360 (https://phabricator.wikimedia.org/T368911) (owner: Dzahn)
[17:22:21] SRE, SRE-Access-Requests, Data-Engineering, Patch-For-Review: Request for Kerb credentials for Ariel Glenn - https://phabricator.wikimedia.org/T368911#9970632 (Dzahn) Open→In progress p:Triage→High
[17:25:52] (PS14) Bking: relforge: test envoyproxy [puppet] - https://gerrit.wikimedia.org/r/1053041 (https://phabricator.wikimedia.org/T368950)
[17:26:55] SRE, Traffic: Regression: Reading spam blacklists of all projects suddenly returns status 429 on fifth consecutive read - https://phabricator.wikimedia.org/T369414#9970649 (Dzahn) a:Dzahn→None
[17:27:09] SRE, collaboration-services, LDAP-Access-Requests: Offboard Lea WMDE (Lea Voget) from the WMF systems - https://phabricator.wikimedia.org/T368139#9970650 (Aklapper)
[17:27:10] SRE, Traffic: Regression: Reading spam blacklists of all projects suddenly returns status 429 on fifth consecutive read - https://phabricator.wikimedia.org/T369414#9970646 (Dzahn) Open→Resolved a:Dzahn Being bold and closing the task because the task creator said so. Please feel free to...
[17:28:41] SRE, collaboration-services, Infrastructure-Foundations: Offboard Lea WMDE (Lea Voget) from the WMF systems - https://phabricator.wikimedia.org/T368139#9970661 (Dzahn)
[17:32:49] (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1053041 (https://phabricator.wikimedia.org/T368950) (owner: Bking)
[17:32:53] (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1053041 (https://phabricator.wikimedia.org/T368950) (owner: Bking)
[17:39:08] (CR) Dzahn: [C:+1] gitlab: switch gitlab-replica-b from iptables to nftables [puppet] - https://gerrit.wikimedia.org/r/1053306 (https://phabricator.wikimedia.org/T366882) (owner: Jelto)
[17:39:56] (CR) Dzahn: [C:+2] "ah, gotcha! thanks" [puppet] - https://gerrit.wikimedia.org/r/1052772 (owner: Ahmon Dancy)
[17:52:12] (PS1) CDanis: upload vcl: add a new scraper UA [puppet] - https://gerrit.wikimedia.org/r/1053363
[17:52:36] (CR) RLazarus: [C:+1] upload vcl: add a new scraper UA [puppet] - https://gerrit.wikimedia.org/r/1053363 (owner: CDanis)
[17:52:55] (CR) Ssingh: [C:+1] upload vcl: add a new scraper UA [puppet] - https://gerrit.wikimedia.org/r/1053363 (owner: CDanis)
[17:53:07] (CR) CDanis: [C:+2] upload vcl: add a new scraper UA [puppet] - https://gerrit.wikimedia.org/r/1053363 (owner: CDanis)
[17:59:54] SRE, LDAP-Access-Requests: Grant Access to nda/logstash for Sohom Datta - https://phabricator.wikimedia.org/T366032#9970756 (jsn.sherman) @Dzahn I'm happy to sponsor @Soda on this one; apologies for the delay!
[18:00:55] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:01:07] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:01:59] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8923 bytes in 2.752 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:02:45] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 52339 bytes in 0.204 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:03:52] 06SRE, 06collaboration-services, 10Stewards-Onboarding-Tool, 10Wikimedia-Mailing-lists, 13Patch-For-Review: stewards1001 / stewards2001: automatically subscribe stewards to mailman lists (was: Enable API access for Mailman3) - https://phabricator.wikimedia.org/T351202#9970773 (10Dzahn) >>! In T351202#995... 
[18:04:49] (03PS9) 10Clare Ming: Deploy MetricsPlatform to beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1046732 (https://phabricator.wikimedia.org/T366234) [18:05:31] (03CR) 10CI reject: [V:04-1] Deploy MetricsPlatform to beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1046732 (https://phabricator.wikimedia.org/T366234) (owner: 10Clare Ming) [18:05:41] 06SRE, 10LDAP-Access-Requests: Grant Access to nda/logstash for Sohom Datta - https://phabricator.wikimedia.org/T366032#9970788 (10Dzahn) 05Stalled→03In progress [18:09:13] (03PS1) 10Dzahn: admin: add Sohom Datta to ldap_only_admins (nda) [puppet] - 10https://gerrit.wikimedia.org/r/1053366 (https://phabricator.wikimedia.org/T366032) [18:10:09] 06SRE, 10LDAP-Access-Requests, 13Patch-For-Review: Grant Access to nda/logstash for Sohom Datta - https://phabricator.wikimedia.org/T366032#9970797 (10Dzahn) a:05Soda→03None [18:10:57] (03CR) 10Dzahn: [C:03+1] "uidNumber: 22439 | wikimediaGlobalAccountId: 60233279" [puppet] - 10https://gerrit.wikimedia.org/r/1053366 (https://phabricator.wikimedia.org/T366032) (owner: 10Dzahn) [18:10:59] (03PS2) 10Jdlrobson: [July 10th] Vector: enable dark mode for tier 1 wikis (logged in only) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1053036 (https://phabricator.wikimedia.org/T368795) [18:13:13] (03PS10) 10Clare Ming: Deploy MetricsPlatform to beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1046732 (https://phabricator.wikimedia.org/T366234) [18:17:01] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1192 (T367781)', diff saved to https://phabricator.wikimedia.org/P66166 and previous config saved to /var/cache/conftool/dbconfig/20240710-181700-arnaudb.json [18:17:04] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [18:20:05] (03PS11) 10Clare Ming: Deploy MetricsPlatform to beta cluster [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1046732 (https://phabricator.wikimedia.org/T366234) [18:21:34] (03CR) 10Clare Ming: [C:04-1] "Pending discussion with ServiceOps" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1046732 (https://phabricator.wikimedia.org/T366234) (owner: 10Clare Ming) [18:28:44] (03PS1) 10AOkoth: vrts: fix usage example [cookbooks] - 10https://gerrit.wikimedia.org/r/1053367 (https://phabricator.wikimedia.org/T366078) [18:30:52] (03PS2) 10AOkoth: vrts: fix usage example [cookbooks] - 10https://gerrit.wikimedia.org/r/1053367 (https://phabricator.wikimedia.org/T366078) [18:32:08] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P66168 and previous config saved to /var/cache/conftool/dbconfig/20240710-183207-arnaudb.json [18:35:21] !log ryankemper@cumin2002 START - Cookbook sre.dns.netbox [18:36:21] (03CR) 10AOkoth: "Really a cosmetic change because running with the --help flag shows the correct order of arguments but will just update this for consisten" [cookbooks] - 10https://gerrit.wikimedia.org/r/1053367 (https://phabricator.wikimedia.org/T366078) (owner: 10AOkoth) [18:36:26] (03CR) 10AOkoth: [C:03+2] vrts: fix usage example [cookbooks] - 10https://gerrit.wikimedia.org/r/1053367 (https://phabricator.wikimedia.org/T366078) (owner: 10AOkoth) [18:40:02] (03Merged) 10jenkins-bot: vrts: fix usage example [cookbooks] - 10https://gerrit.wikimedia.org/r/1053367 (https://phabricator.wikimedia.org/T366078) (owner: 10AOkoth) [18:43:33] !log ryankemper@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add 4 new IPs (2 eqiad, 2 codfw) for wdqs graph split - ryankemper@cumin2002" [18:43:43] (03PS12) 10Clare Ming: Deploy MetricsPlatform to beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1046732 (https://phabricator.wikimedia.org/T366234) [18:45:01] !log ryankemper@cumin2002 END (PASS) 
- Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add 4 new IPs (2 eqiad, 2 codfw) for wdqs graph split - ryankemper@cumin2002" [18:45:02] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:47:15] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P66169 and previous config saved to /var/cache/conftool/dbconfig/20240710-184714-arnaudb.json [18:48:38] (03Abandoned) 10Ryan Kemper: wdqs graph split: new PTR records [dns] - 10https://gerrit.wikimedia.org/r/1050454 (https://phabricator.wikimedia.org/T364364) (owner: 10Ryan Kemper) [18:49:17] FIRING: [2x] SystemdUnitFailed: geoip_update_ipinfo.service on puppetmaster1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:50:06] (03CR) 10Clare Ming: [C:04-1] "waiting for +1 from SRE" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1046732 (https://phabricator.wikimedia.org/T366234) (owner: 10Clare Ming) [18:52:50] (03PS1) 10Dbrant: Enable account vanishing in CentralAuth. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1053373 (https://phabricator.wikimedia.org/T369141) [18:55:31] (03PS5) 10Ryan Kemper: wdqs graph split: new A, PTR, and DYNA records [dns] - 10https://gerrit.wikimedia.org/r/1051446 (https://phabricator.wikimedia.org/T364364) [18:56:43] !log sfaci@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply [18:56:53] !log sfaci@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply [19:02:22] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1192 (T367781)', diff saved to https://phabricator.wikimedia.org/P66170 and previous config saved to /var/cache/conftool/dbconfig/20240710-190222-arnaudb.json [19:02:24] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1193.eqiad.wmnet with reason: Maintenance [19:02:37] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [19:02:37] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1193.eqiad.wmnet with reason: Maintenance [19:02:45] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1193 (T367781)', diff saved to https://phabricator.wikimedia.org/P66171 and previous config saved to /var/cache/conftool/dbconfig/20240710-190244-arnaudb.json [19:03:53] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1193 (T367781)', diff saved to https://phabricator.wikimedia.org/P66172 and previous config saved to /var/cache/conftool/dbconfig/20240710-190352-arnaudb.json [19:07:36] (03PS12) 10Ryan Kemper: wdqs: add main and scholarly role assignments [puppet] - 10https://gerrit.wikimedia.org/r/1046123 (https://phabricator.wikimedia.org/T364364) (owner: 10Stevemunene) [19:08:34] (03CR) 10Dwisehaupt: [C:03+1] "daniel or filippo: Could one of you +2 and deploy this when you get a chance? It codifies a civicrm config change we have tested. Thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1051851 (https://phabricator.wikimedia.org/T343486) (owner: 10Dwisehaupt) [19:19:00] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1193', diff saved to https://phabricator.wikimedia.org/P66173 and previous config saved to /var/cache/conftool/dbconfig/20240710-191859-arnaudb.json [19:21:04] (03PS1) 10CDanis: more restrictive UA policy rules for cp-upload [puppet] - 10https://gerrit.wikimedia.org/r/1053377 [19:21:58] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1046123 (https://phabricator.wikimedia.org/T364364) (owner: 10Stevemunene) [19:25:49] FIRING: [2x] PuppetConstantChange: Puppet performing a change on every puppet run on relforge1003:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange [19:26:26] (03CR) 10Ssingh: [C:03+1] more restrictive UA policy rules for cp-upload [puppet] - 10https://gerrit.wikimedia.org/r/1053377 (owner: 10CDanis) [19:28:23] (03CR) 10CDanis: [C:03+2] more restrictive UA policy rules for cp-upload [puppet] - 10https://gerrit.wikimedia.org/r/1053377 (owner: 10CDanis) [19:34:07] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1193', diff saved to https://phabricator.wikimedia.org/P66174 and previous config saved to /var/cache/conftool/dbconfig/20240710-193406-arnaudb.json [19:37:12] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [19:45:51] 06SRE, 10LDAP-Access-Requests, 13Patch-For-Review: Grant Access to wmf for Uniquemia - https://phabricator.wikimedia.org/T369500#9971182 (10EUwandu-WMF) Hello @fgiunchedi , I was just recently converted and I have changed 
euwandu-ctr@wikimedia.org to my new email euwandu@wikimedia.org. Please let me know if... [19:49:14] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1193 (T367781)', diff saved to https://phabricator.wikimedia.org/P66176 and previous config saved to /var/cache/conftool/dbconfig/20240710-194913-arnaudb.json [19:49:15] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1203.eqiad.wmnet with reason: Maintenance [19:49:17] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [19:49:28] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1203.eqiad.wmnet with reason: Maintenance [19:49:36] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1203 (T367781)', diff saved to https://phabricator.wikimedia.org/P66177 and previous config saved to /var/cache/conftool/dbconfig/20240710-194935-arnaudb.json [19:50:44] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1203 (T367781)', diff saved to https://phabricator.wikimedia.org/P66178 and previous config saved to /var/cache/conftool/dbconfig/20240710-195043-arnaudb.json [20:00:04] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: It is that lovely time of the day again! You are hereby commanded to deploy UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240710T2000). [20:00:04] jdlrobson: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[20:00:19] o/ [20:03:07] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:03:57] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.179 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:05:51] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1203', diff saved to https://phabricator.wikimedia.org/P66179 and previous config saved to /var/cache/conftool/dbconfig/20240710-200550-arnaudb.json [20:07:51] Jdlrobson: I can deploy the patch :) [20:08:03] awesome! They probably should go out together! [20:11:13] Jdlrobson: the Vector patch https://gerrit.wikimedia.org/r/c/mediawiki/skins/Vector/+/1053358 has a Depends-On patch in VE, has that one been deployed already? [20:11:46] (03PS2) 10Jdlrobson: Add beta tag & feedback link to Appearance menu [skins/Vector] (wmf/1.43.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1053358 (https://phabricator.wikimedia.org/T367871) [20:11:52] Removed that was only for CI [20:12:08] hopefully the CI issue doesn't occur here... [20:12:36] 10ops-eqiad, 06SRE, 10Cassandra, 06DC-Ops: Degraded RAID on aqs1013 - https://phabricator.wikimedia.org/T362033#9971253 (10Eevans) >>! In T362033#9947564, @wiki_willy wrote: > Hi @Eevans - since we've replaced all hardware parts on this host, and the error is still showing up, it doesn't seem like it's a h... 
[20:12:38] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jdrewniak@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1053036 (https://phabricator.wikimedia.org/T368795) (owner: 10Jdlrobson) [20:12:39] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jdrewniak@deploy1002 using scap backport" [skins/Vector] (wmf/1.43.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1053358 (https://phabricator.wikimedia.org/T367871) (owner: 10Jdlrobson) [20:12:44] ok we'll see :) [20:13:57] (03PS3) 10Jdlrobson: [July 10th] Vector: enable dark mode for tier 1 wikis (logged in only) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1053036 (https://phabricator.wikimedia.org/T368795) [20:14:09] (03CR) 10TrainBranchBot: "Approved by jdrewniak@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1053036 (https://phabricator.wikimedia.org/T368795) (owner: 10Jdlrobson) [20:14:09] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jdrewniak@deploy1002 using scap backport" [skins/Vector] (wmf/1.43.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1053358 (https://phabricator.wikimedia.org/T367871) (owner: 10Jdlrobson) [20:14:49] (03Merged) 10jenkins-bot: [July 10th] Vector: enable dark mode for tier 1 wikis (logged in only) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1053036 (https://phabricator.wikimedia.org/T368795) (owner: 10Jdlrobson) [20:17:51] PROBLEM - MariaDB Replica Lag: s4 on clouddb1019 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 312.14 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [20:20:58] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1203', diff saved to https://phabricator.wikimedia.org/P66180 and previous config saved to /var/cache/conftool/dbconfig/20240710-202057-arnaudb.json [20:36:05] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1203 (T367781)', diff saved to 
https://phabricator.wikimedia.org/P66181 and previous config saved to /var/cache/conftool/dbconfig/20240710-203605-arnaudb.json [20:36:07] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1211.eqiad.wmnet with reason: Maintenance [20:36:09] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [20:36:20] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1211.eqiad.wmnet with reason: Maintenance [20:36:28] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1211 (T367781)', diff saved to https://phabricator.wikimedia.org/P66182 and previous config saved to /var/cache/conftool/dbconfig/20240710-203627-arnaudb.json [20:36:40] (03Merged) 10jenkins-bot: Add beta tag & feedback link to Appearance menu [skins/Vector] (wmf/1.43.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1053358 (https://phabricator.wikimedia.org/T367871) (owner: 10Jdlrobson) [20:37:14] !log jdrewniak@deploy1002 Started scap sync-world: Backport for [[gerrit:1053036|[July 10th] Vector: enable dark mode for tier 1 wikis (logged in only) (T368795)]], [[gerrit:1053358|Add beta tag & feedback link to Appearance menu (T367871)]] [20:37:18] T368795: Deploy dark mode to all logged in users on Vector 2022 - https://phabricator.wikimedia.org/T368795 [20:37:19] T367871: Create link for reporting system for dark mode issues - https://phabricator.wikimedia.org/T367871 [20:37:36] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1211 (T367781)', diff saved to https://phabricator.wikimedia.org/P66183 and previous config saved to /var/cache/conftool/dbconfig/20240710-203735-arnaudb.json [20:44:51] RECOVERY - MariaDB Replica Lag: s4 on clouddb1019 is OK: OK slave_sql_lag Replication lag: 50.83 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [20:48:14] jan_drewniak: is this on debug? 
[20:49:23] I thought it was still going, but I'm looking at my console and the last update was 10 min ago :/ [20:49:32] oh! it's picking back up. [20:52:43] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1211', diff saved to https://phabricator.wikimedia.org/P66184 and previous config saved to /var/cache/conftool/dbconfig/20240710-205242-arnaudb.json [21:00:04] Deploy window Wikifunctions Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240710T2100) [21:03:53] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on elastic[1096-1097,1106].eqiad.wmnet with reason: T348977 [21:03:56] T348977: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - https://phabricator.wikimedia.org/T348977 [21:04:11] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on elastic[1096-1097,1106].eqiad.wmnet with reason: T348977 [21:04:25] FIRING: [2x] SystemdUnitFailed: envoyproxy.service on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:06:21] !log jdrewniak@deploy1002 jdrewniak, jdlrobson: Backport for [[gerrit:1053036|[July 10th] Vector: enable dark mode for tier 1 wikis (logged in only) (T368795)]], [[gerrit:1053358|Add beta tag & feedback link to Appearance menu (T367871)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:06:21] !log jdrewniak@deploy1002 Sync cancelled. 
[21:06:25] T368795: Deploy dark mode to all logged in users on Vector 2022 - https://phabricator.wikimedia.org/T368795 [21:06:25] T367871: Create link for reporting system for dark mode issues - https://phabricator.wikimedia.org/T367871 [21:07:50] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1211', diff saved to https://phabricator.wikimedia.org/P66185 and previous config saved to /var/cache/conftool/dbconfig/20240710-210750-arnaudb.json [21:09:33] !log jdrewniak@deploy1002 Started scap sync-world: Backport for [[gerrit:1053036|[July 10th] Vector: enable dark mode for tier 1 wikis (logged in only) (T368795)]], [[gerrit:1053358|Add beta tag & feedback link to Appearance menu (T367871)]] [21:09:41] * jan_drewniak Jdlrobson: I have to restart the deploy... [21:09:58] accidentally click N instead of y :/ [21:10:04] what does that mean? [21:10:12] just redo the sync, right? [21:10:26] !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Banning hosts: elastic1096*,elastic1097*,elastic1106* for T348977 - bking@cumin2002 [21:10:29] !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: elastic1096*,elastic1097*,elastic1106* for T348977 - bking@cumin2002 [21:10:29] T348977: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - https://phabricator.wikimedia.org/T348977 [21:10:47] jan_drewniak: https://i.redd.it/s2vxl4d6l5381.png [21:11:01] it's just the sync we need to redo, right? [21:11:27] yes. 
scap now asks if you want to continue with the deploy: y or N, and when I looked at my terminal, it looks like N was pressed [21:13:51] PROBLEM - MariaDB Replica Lag: s4 on clouddb1019 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 301.94 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [21:17:48] !log jdrewniak@deploy1002 jdlrobson, jdrewniak: Backport for [[gerrit:1053036|[July 10th] Vector: enable dark mode for tier 1 wikis (logged in only) (T368795)]], [[gerrit:1053358|Add beta tag & feedback link to Appearance menu (T367871)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:17:49] !log jdrewniak@deploy1002 Sync cancelled. [21:17:54] T368795: Deploy dark mode to all logged in users on Vector 2022 - https://phabricator.wikimedia.org/T368795 [21:17:54] T367871: Create link for reporting system for dark mode issues - https://phabricator.wikimedia.org/T367871 [21:18:27] !log jdrewniak@deploy1002 Started scap sync-world: Backport for [[gerrit:1053036|[July 10th] Vector: enable dark mode for tier 1 wikis (logged in only) (T368795)]], [[gerrit:1053358|Add beta tag & feedback link to Appearance menu (T367871)]] [21:22:57] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1211 (T367781)', diff saved to https://phabricator.wikimedia.org/P66186 and previous config saved to /var/cache/conftool/dbconfig/20240710-212257-arnaudb.json [21:22:59] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1214.eqiad.wmnet with reason: Maintenance [21:23:01] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [21:23:01] !log jdrewniak@deploy1002 jdlrobson, jdrewniak: Backport for [[gerrit:1053036|[July 10th] Vector: enable dark mode for tier 1 wikis (logged in only) (T368795)]], [[gerrit:1053358|Add beta tag & feedback link to Appearance menu (T367871)]] synced to the testservers 
(https://wikitech.wikimedia.org/wiki/Mwdebug) [21:23:06] T368795: Deploy dark mode to all logged in users on Vector 2022 - https://phabricator.wikimedia.org/T368795 [21:23:07] T367871: Create link for reporting system for dark mode issues - https://phabricator.wikimedia.org/T367871 [21:23:08] !log jdrewniak@deploy1002 jdlrobson, jdrewniak: Continuing with sync [21:23:12] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1214.eqiad.wmnet with reason: Maintenance [21:23:19] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1214 (T367781)', diff saved to https://phabricator.wikimedia.org/P66187 and previous config saved to /var/cache/conftool/dbconfig/20240710-212319-arnaudb.json [21:24:28] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1214 (T367781)', diff saved to https://phabricator.wikimedia.org/P66188 and previous config saved to /var/cache/conftool/dbconfig/20240710-212427-arnaudb.json [21:30:03] !log jdrewniak@deploy1002 Finished scap: Backport for [[gerrit:1053036|[July 10th] Vector: enable dark mode for tier 1 wikis (logged in only) (T368795)]], [[gerrit:1053358|Add beta tag & feedback link to Appearance menu (T367871)]] (duration: 11m 35s) [21:30:19] T368795: Deploy dark mode to all logged in users on Vector 2022 - https://phabricator.wikimedia.org/T368795 [21:30:19] T367871: Create link for reporting system for dark mode issues - https://phabricator.wikimedia.org/T367871 [21:31:01] 06SRE, 06Traffic, 13Patch-For-Review: Migrate DNS depooling of sites from operations/dns (git) to confctl - https://phabricator.wikimedia.org/T369366#9971465 (10Scott_French) Cool, it sounds like the conversation has evolved to using a dedicated schema, and we're on the same page that a multi-value `set` sho... 
[21:31:19] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 66498880 and 2 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [21:32:21] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 40912 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [21:35:19] (03PS15) 10Bking: relforge: test envoyproxy [puppet] - 10https://gerrit.wikimedia.org/r/1053041 (https://phabricator.wikimedia.org/T368950) [21:35:38] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1053041 (https://phabricator.wikimedia.org/T368950) (owner: 10Bking) [21:39:37] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P66190 and previous config saved to /var/cache/conftool/dbconfig/20240710-213935-arnaudb.json [21:48:08] (03PS1) 10Dwisehaupt: crm: Add lookup for civicrm branch to check out [puppet] - 10https://gerrit.wikimedia.org/r/1053384 (https://phabricator.wikimedia.org/T343486) [21:48:36] (03CR) 10CI reject: [V:04-1] crm: Add lookup for civicrm branch to check out [puppet] - 10https://gerrit.wikimedia.org/r/1053384 (https://phabricator.wikimedia.org/T343486) (owner: 10Dwisehaupt) [21:54:44] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P66191 and previous config saved to /var/cache/conftool/dbconfig/20240710-215444-arnaudb.json [22:03:07] (03CR) 10Scott French: "Thanks for considering the dedicated-schema option!" [puppet] - 10https://gerrit.wikimedia.org/r/1053323 (https://phabricator.wikimedia.org/T369366) (owner: 10Ssingh) [22:06:35] 06SRE, 06Infrastructure-Foundations, 06serviceops, 07ARM support: Adoption of aarch64 (aka arm64) in WMF production? 
(SRE Summit 2022 Session) - https://phabricator.wikimedia.org/T320811#9971586 (10bd808) >>! In T320811#8777225, @Ladsgroup wrote: > This might be interesting, specially in choosing a manufac... [22:09:03] (03CR) 10Cwhite: [C:03+1] "Parameterization LGTM. PCC NOOP" [puppet] - 10https://gerrit.wikimedia.org/r/1042917 (https://phabricator.wikimedia.org/T366710) (owner: 10Filippo Giunchedi) [22:09:19] !log dzahn@cumin1002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1003.wikimedia.org with reason: security release [22:09:52] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1214 (T367781)', diff saved to https://phabricator.wikimedia.org/P66192 and previous config saved to /var/cache/conftool/dbconfig/20240710-220951-arnaudb.json [22:09:54] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1216.eqiad.wmnet with reason: Maintenance [22:09:55] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [22:10:07] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1216.eqiad.wmnet with reason: Maintenance [22:10:09] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1226.eqiad.wmnet with reason: Maintenance [22:10:11] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1226.eqiad.wmnet with reason: Maintenance [22:10:13] !log gitlab-replica-b.wikimedia.org - version upgrade in progress [22:10:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:10:18] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1226 (T367781)', diff saved to https://phabricator.wikimedia.org/P66193 and previous config saved to /var/cache/conftool/dbconfig/20240710-221018-arnaudb.json [22:11:26] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1226 (T367781)', diff saved to https://phabricator.wikimedia.org/P66194 and 
previous config saved to /var/cache/conftool/dbconfig/20240710-221126-arnaudb.json [22:13:25] (03CR) 10Cwhite: [C:03+1] "Looks good! PCC looks ok. Logstash will restart once this is applied." [puppet] - 10https://gerrit.wikimedia.org/r/1042918 (https://phabricator.wikimedia.org/T366710) (owner: 10Filippo Giunchedi) [22:14:06] dzahn@cumin1002 dzahn: The backup on gitlab1003 is complete, ready to proceed with upgrade. [22:15:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2205 (T367856)', diff saved to https://phabricator.wikimedia.org/P66195 and previous config saved to /var/cache/conftool/dbconfig/20240710-221545-marostegui.json [22:15:52] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [22:17:51] (03PS1) 10Dzahn: puppetmaster/puppetserver: remove MaxMind db product GeoIP2-Enterprise [puppet] - 10https://gerrit.wikimedia.org/r/1053390 (https://phabricator.wikimedia.org/T366272) [22:19:44] !log dzahn@cumin1002 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1003.wikimedia.org with reason: security release [22:19:45] (03PS2) 10Dzahn: puppetmaster/puppetserver: remove MaxMind db product GeoIP2-Enterprise [puppet] - 10https://gerrit.wikimedia.org/r/1053390 (https://phabricator.wikimedia.org/T366272) [22:21:37] (03CR) 10Cwhite: [C:03+2] logstash: add curator delete job for ecs-k8s indices [puppet] - 10https://gerrit.wikimedia.org/r/1051427 (https://phabricator.wikimedia.org/T368186) (owner: 10Cwhite) [22:23:03] (03PS2) 10Dwisehaupt: crm: Add lookup for civicrm branch to check out [puppet] - 10https://gerrit.wikimedia.org/r/1053384 (https://phabricator.wikimedia.org/T343486) [22:24:28] (03PS3) 10Dwisehaupt: crm: Add lookup for civicrm branch to check out [puppet] - 10https://gerrit.wikimedia.org/r/1053384 (https://phabricator.wikimedia.org/T343486) [22:25:24] !log dzahn@cumin1002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1004.wikimedia.org with reason: 
security release [22:26:05] woah, logmsgbot talks to me here about the next step in cookbooks:) [22:26:34] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P66196 and previous config saved to /var/cache/conftool/dbconfig/20240710-222633-arnaudb.json [22:27:02] (03CR) 10CI reject: [V:04-1] crm: Add lookup for civicrm branch to check out [puppet] - 10https://gerrit.wikimedia.org/r/1053384 (https://phabricator.wikimedia.org/T343486) (owner: 10Dwisehaupt) [22:28:43] 06SRE, 10LDAP-Access-Requests, 13Patch-For-Review: Grant Access to nda/logstash for Sohom Datta - https://phabricator.wikimedia.org/T366032#9971668 (10Dzahn) @jsn.sherman Alright, thank you! I uploaded a patch to get this resolved soon. [22:29:46] dzahn@cumin1002 dzahn: The backup on gitlab1004 is complete, ready to proceed with upgrade. [22:30:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2205', diff saved to https://phabricator.wikimedia.org/P66197 and previous config saved to /var/cache/conftool/dbconfig/20240710-223052-marostegui.json [22:33:05] (03CR) 10Dzahn: "in profile::community_crm where community_crm class is used it also needs to pass the new parameter now" [puppet] - 10https://gerrit.wikimedia.org/r/1053384 (https://phabricator.wikimedia.org/T343486) (owner: 10Dwisehaupt) [22:33:15] (03CR) 10BCornwall: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3200/co" [puppet] - 10https://gerrit.wikimedia.org/r/1053390 (https://phabricator.wikimedia.org/T366272) (owner: 10Dzahn) [22:33:42] (03CR) 10BCornwall: [C:03+1] puppetmaster/puppetserver: remove MaxMind db product GeoIP2-Enterprise [puppet] - 10https://gerrit.wikimedia.org/r/1053390 (https://phabricator.wikimedia.org/T366272) (owner: 10Dzahn) [22:34:20] (03CR) 10Dzahn: "wait, ignore my comment.. you already did that. 
but it must be used somewhere else I guess" [puppet] - 10https://gerrit.wikimedia.org/r/1053384 (https://phabricator.wikimedia.org/T343486) (owner: 10Dwisehaupt) [22:35:21] !log dzahn@cumin1002 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1004.wikimedia.org with reason: security release [22:39:45] (03CR) 10Dzahn: [C:03+2] crm: Stop civicrm callouts to the internet for version checks [puppet] - 10https://gerrit.wikimedia.org/r/1051851 (https://phabricator.wikimedia.org/T343486) (owner: 10Dwisehaupt) [22:41:36] 06SRE, 10LDAP-Access-Requests: Update terms and timeline of access already granted for AndyRussG - https://phabricator.wikimedia.org/T367681#9971676 (10Volans) 05Open→03Resolved Indeed. [22:41:41] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P66198 and previous config saved to /var/cache/conftool/dbconfig/20240710-224140-arnaudb.json [22:43:12] (03CR) 10Dzahn: [C:03+2] "https://phabricator.wikimedia.org/T343486#9971679" [puppet] - 10https://gerrit.wikimedia.org/r/1051851 (https://phabricator.wikimedia.org/T343486) (owner: 10Dwisehaupt) [22:46:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2205', diff saved to https://phabricator.wikimedia.org/P66199 and previous config saved to /var/cache/conftool/dbconfig/20240710-224559-marostegui.json [22:46:02] (03CR) 10Dwisehaupt: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3202/co" [puppet] - 10https://gerrit.wikimedia.org/r/1053384 (https://phabricator.wikimedia.org/T343486) (owner: 10Dwisehaupt) [22:48:05] (03PS2) 10Zabe: Initial configuration for aewikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1026788 (https://phabricator.wikimedia.org/T362529) [22:48:27] (03PS3) 10Zabe: Initial configuration for aewikimedia [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1026788 (https://phabricator.wikimedia.org/T362529) [22:49:17] FIRING: [2x] SystemdUnitFailed: geoip_update_ipinfo.service on puppetmaster1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:49:20] (03PS4) 10Zabe: Initial configuration for aewikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1026788 (https://phabricator.wikimedia.org/T362529) [22:50:16] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2168 (T367856)', diff saved to https://phabricator.wikimedia.org/P66200 and previous config saved to /var/cache/conftool/dbconfig/20240710-225015-marostegui.json [22:50:19] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [22:52:40] (03CR) 10Dzahn: [V:03+1 C:03+2] "tested on puppetmaster1001 - removed the Enterprise product manually - started the systemd service manually.. fixes it" [puppet] - 10https://gerrit.wikimedia.org/r/1053390 (https://phabricator.wikimedia.org/T366272) (owner: 10Dzahn) [22:52:44] jouncebot: nowandnext [22:52:44] No deployments scheduled for the next 7 hour(s) and 7 minute(s) [22:52:44] In 7 hour(s) and 7 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240711T0600) [22:52:44] In 7 hour(s) and 7 minute(s): Primary database switchover (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240711T0600) [22:53:31] !log puppetmaster1001 - remove Enterprise product ID from MaxMind downloads. 
sudo systemctl start geoip_update_ipinfo - T366272 [22:53:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:53:34] T366272: Update puppet configuration to use GeoLite2 (free) instead of GeoIP2-Enterprise data - https://phabricator.wikimedia.org/T366272 [22:54:06] (03CR) 10Zabe: [C:03+2] Initial configuration for aewikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1026788 (https://phabricator.wikimedia.org/T362529) (owner: 10Zabe) [22:54:17] FIRING: [2x] SystemdUnitFailed: geoip_update_ipinfo.service on puppetmaster1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:54:25] FIRING: [2x] SystemdUnitFailed: envoyproxy.service on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:54:43] (03Merged) 10jenkins-bot: Initial configuration for aewikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1026788 (https://phabricator.wikimedia.org/T362529) (owner: 10Zabe) [22:56:48] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1226 (T367781)', diff saved to https://phabricator.wikimedia.org/P66201 and previous config saved to /var/cache/conftool/dbconfig/20240710-225647-arnaudb.json [22:56:50] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on dbstore1009.eqiad.wmnet with reason: Maintenance [22:56:51] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [22:57:03] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on dbstore1009.eqiad.wmnet with reason: Maintenance [22:57:05] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2152.codfw.wmnet with 
reason: Maintenance [22:57:18] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2152.codfw.wmnet with reason: Maintenance [22:57:25] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2152 (T367781)', diff saved to https://phabricator.wikimedia.org/P66202 and previous config saved to /var/cache/conftool/dbconfig/20240710-225725-arnaudb.json [22:58:35] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2152 (T367781)', diff saved to https://phabricator.wikimedia.org/P66203 and previous config saved to /var/cache/conftool/dbconfig/20240710-225835-arnaudb.json [22:58:54] zabe: let me know when you're all through? I'll deploy https://gerrit.wikimedia.org/r/c/operations/puppet/+/1053274 next [22:58:58] (03PS1) 10Zabe: Set wgLanguageCode for aewikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1053394 (https://phabricator.wikimedia.org/T362529) [22:59:00] (03CR) 10Zabe: [C:03+2] Set wgLanguageCode for aewikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1053394 (https://phabricator.wikimedia.org/T362529) (owner: 10Zabe) [22:59:22] will do [22:59:25] FIRING: [2x] SystemdUnitFailed: envoyproxy.service on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:59:39] (03Merged) 10jenkins-bot: Set wgLanguageCode for aewikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1053394 (https://phabricator.wikimedia.org/T362529) (owner: 10Zabe) [23:00:03] !log puppetserver1001 - fixing failed unit geoip_update_ipinfo.service [23:00:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:00:23] !log Create Wikimedians of United Arab Emirates User Group Wiki # T362529 [23:00:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:00:39] !log zabe@deploy1002 
Started scap sync-world: T362529 [23:00:44] (03PS4) 10Dwisehaupt: crm: Add lookup for civicrm branch to check out [puppet] - 10https://gerrit.wikimedia.org/r/1053384 (https://phabricator.wikimedia.org/T343486) [23:00:51] T362529: Create a Wikimedians of United Arab Emirates User Group Wiki - https://phabricator.wikimedia.org/T362529 [23:01:08] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2205 (T367856)', diff saved to https://phabricator.wikimedia.org/P66204 and previous config saved to /var/cache/conftool/dbconfig/20240710-230107-marostegui.json [23:01:10] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2209.codfw.wmnet with reason: Maintenance [23:01:24] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2209.codfw.wmnet with reason: Maintenance [23:01:29] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [23:01:31] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2209 (T367856)', diff saved to https://phabricator.wikimedia.org/P66205 and previous config saved to /var/cache/conftool/dbconfig/20240710-230130-marostegui.json [23:04:17] RESOLVED: [2x] SystemdUnitFailed: geoip_update_ipinfo.service on puppetmaster1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:05:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2168', diff saved to https://phabricator.wikimedia.org/P66206 and previous config saved to /var/cache/conftool/dbconfig/20240710-230522-marostegui.json [23:05:26] (03CR) 10Dzahn: [V:03+1 C:03+2] "deployed and manual "sudo systemctl start geoip_update_ipinfo" on puppetmaster1001 and puppetserver1001. no more failed units now. 
checked" [puppet] - 10https://gerrit.wikimedia.org/r/1053390 (https://phabricator.wikimedia.org/T366272) (owner: 10Dzahn) [23:08:24] !log zabe@deploy1002 Finished scap: T362529 (duration: 07m 44s) [23:08:27] T362529: Create a Wikimedians of United Arab Emirates User Group Wiki - https://phabricator.wikimedia.org/T362529 [23:08:29] (03CR) 10Dzahn: crm: Add lookup for civicrm branch to check out (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1053384 (https://phabricator.wikimedia.org/T343486) (owner: 10Dwisehaupt) [23:09:14] (03PS1) 10Zabe: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1053395 [23:09:14] (03CR) 10Zabe: [C:03+2] Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1053395 (owner: 10Zabe) [23:09:27] !log zabe@deploy1002 Started scap sync-world: update interwiki cache [23:09:53] (03Merged) 10jenkins-bot: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1053395 (owner: 10Zabe) [23:10:09] thanks for deploying that, rzl [23:10:17] this is so quick [23:10:32] and then we can close that whole cleanup task afterwards, yay [23:11:13] (03CR) 10Dzahn: [V:03+1 C:03+1] "https://puppet-compiler.wmflabs.org/output/1053384/3203/crm2001.codfw.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1053384 (https://phabricator.wikimedia.org/T343486) (owner: 10Dwisehaupt) [23:13:42] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P66207 and previous config saved to /var/cache/conftool/dbconfig/20240710-231342-arnaudb.json [23:14:24] (03PS5) 10Dwisehaupt: crm: Add lookup for civicrm branch to check out [puppet] - 10https://gerrit.wikimedia.org/r/1053384 (https://phabricator.wikimedia.org/T343486) [23:15:53] (03CR) 10Dwisehaupt: crm: Add lookup for civicrm branch to check out (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1053384 (https://phabricator.wikimedia.org/T343486) (owner: 
10Dwisehaupt) [23:16:49] (03CR) 10Dzahn: [C:03+2] crm: Add lookup for civicrm branch to check out [puppet] - 10https://gerrit.wikimedia.org/r/1053384 (https://phabricator.wikimedia.org/T343486) (owner: 10Dwisehaupt) [23:16:51] (03CR) 10Dzahn: [V:03+2 C:03+2] crm: Add lookup for civicrm branch to check out [puppet] - 10https://gerrit.wikimedia.org/r/1053384 (https://phabricator.wikimedia.org/T343486) (owner: 10Dwisehaupt) [23:16:59] !log zabe@deploy1002 Finished scap: update interwiki cache (duration: 07m 32s) [23:17:00] (03CR) 10Dzahn: [V:03+1 C:03+2] "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1053384 (https://phabricator.wikimedia.org/T343486) (owner: 10Dwisehaupt) [23:17:00] rzl: feel free to do your deployment now :) [23:17:06] zabe: thanks! [23:17:35] (03CR) 10RLazarus: [C:03+2] "Thanks for including the httpbb update <3" [puppet] - 10https://gerrit.wikimedia.org/r/1053274 (https://phabricator.wikimedia.org/T367012) (owner: 10Clément Goubert) [23:20:06] !log $ sudo cumin A:mw disable-puppet # T367012 - really just for the old mwdebug hosts [23:20:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:20:11] T367012: redirect benefactors.wikimedia.org (was: Cleanup unused DNS subdomains) - https://phabricator.wikimedia.org/T367012 [23:20:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2168', diff saved to https://phabricator.wikimedia.org/P66208 and previous config saved to /var/cache/conftool/dbconfig/20240710-232028-marostegui.json [23:23:02] (03CR) 10Dzahn: [V:03+1 C:03+2] "ran puppet on crm2001 - it's a noop - it doesn't change anything about the status of the git repo since it's already checked out. 
(unless " [puppet] - 10https://gerrit.wikimedia.org/r/1053384 (https://phabricator.wikimedia.org/T343486) (owner: 10Dwisehaupt) [23:23:27] (03CR) 10Dzahn: [V:03+1 C:03+2] "you can test on cloud now, though" [puppet] - 10https://gerrit.wikimedia.org/r/1053384 (https://phabricator.wikimedia.org/T343486) (owner: 10Dwisehaupt) [23:26:04] FIRING: [2x] PuppetConstantChange: Puppet performing a change on every puppet run on relforge1003:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange [23:27:36] !log rzl@deploy1002 Started scap sync-world: T367012 [23:27:46] T367012: redirect benefactors.wikimedia.org (was: Cleanup unused DNS subdomains) - https://phabricator.wikimedia.org/T367012 [23:28:50] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P66209 and previous config saved to /var/cache/conftool/dbconfig/20240710-232849-arnaudb.json [23:29:03] !log rzl@deploy1002 rzl: T367012 synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [23:29:47] (03PS1) 10Cwhite: opensearch: Enable configuration of watermark parameters [puppet] - 10https://gerrit.wikimedia.org/r/1048538 (https://phabricator.wikimedia.org/T368168) [23:29:47] (03CR) 10Cwhite: [C:03+2] "PCC OK https://puppet-compiler.wmflabs.org/output/1048538/3199/" [puppet] - 10https://gerrit.wikimedia.org/r/1048538 (https://phabricator.wikimedia.org/T368168) (owner: 10Cwhite) [23:30:54] !log rzl@deploy1002 rzl: Continuing with sync [23:34:46] !log rzl@deploy1002 Finished scap: T367012 (duration: 07m 45s) [23:34:50] T367012: redirect benefactors.wikimedia.org (was: Cleanup unused DNS subdomains) - https://phabricator.wikimedia.org/T367012 [23:35:20] !log $ sudo cumin A:all-mw enable-puppet T367012 [23:35:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:35:36] !log 
marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2168 (T367856)', diff saved to https://phabricator.wikimedia.org/P66210 and previous config saved to /var/cache/conftool/dbconfig/20240710-233535-marostegui.json [23:35:38] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2182.codfw.wmnet with reason: Maintenance [23:35:39] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [23:35:51] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2182.codfw.wmnet with reason: Maintenance [23:35:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2182 (T367856)', diff saved to https://phabricator.wikimedia.org/P66211 and previous config saved to /var/cache/conftool/dbconfig/20240710-233558-marostegui.json [23:37:12] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [23:38:35] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1053397 [23:38:35] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1053397 (owner: 10TrainBranchBot) [23:39:23] finished 👍 [23:41:06] rzl: works for me:) thank you! 
[23:42:21] 06SRE, 10DNS, 10fundraising-tech-ops, 06serviceops, 06Traffic: redirect benefactors.wikimedia.org (was: Cleanup unused DNS subdomains) - https://phabricator.wikimedia.org/T367012#9971801 (10Dzahn) [23:43:57] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2152 (T367781)', diff saved to https://phabricator.wikimedia.org/P66212 and previous config saved to /var/cache/conftool/dbconfig/20240710-234356-arnaudb.json [23:43:59] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2154.codfw.wmnet with reason: Maintenance [23:44:00] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [23:44:12] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2154.codfw.wmnet with reason: Maintenance [23:44:19] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2154 (T367781)', diff saved to https://phabricator.wikimedia.org/P66213 and previous config saved to /var/cache/conftool/dbconfig/20240710-234418-arnaudb.json [23:46:03] 06SRE, 10DNS, 10fundraising-tech-ops, 06serviceops, 06Traffic: redirect benefactors.wikimedia.org (was: Cleanup unused DNS subdomains) - https://phabricator.wikimedia.org/T367012#9971806 (10Dzahn) Thanks to Clement and Reuven for the redirect change and deploying it. benefactors redirects now. @Pppery... [23:46:29] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2154 (T367781)', diff saved to https://phabricator.wikimedia.org/P66214 and previous config saved to /var/cache/conftool/dbconfig/20240710-234629-arnaudb.json [23:46:44] (03CR) 10Ssingh: "Thanks for taking care of it @dzahn@wikimedia.org!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1053390 (https://phabricator.wikimedia.org/T366272) (owner: 10Dzahn) [23:49:12] 06SRE, 10DNS, 10fundraising-tech-ops, 06serviceops, 06Traffic: Cleanup DNS subdomains displaying wikimedia.org homepage when they shouldn't - https://phabricator.wikimedia.org/T367012#9971809 (10Pppery) 05Open→03Resolved