[00:03:44] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2125 (T352010)', diff saved to https://phabricator.wikimedia.org/P65032 and previous config saved to /var/cache/conftool/dbconfig/20240616-000343-ladsgroup.json [00:03:46] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1045124 (owner: 10TrainBranchBot) [00:03:47] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2126.codfw.wmnet with reason: Maintenance [00:03:49] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [00:04:00] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2126.codfw.wmnet with reason: Maintenance [00:04:01] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2187.codfw.wmnet with reason: Maintenance [00:04:15] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2187.codfw.wmnet with reason: Maintenance [00:04:22] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2126 (T352010)', diff saved to https://phabricator.wikimedia.org/P65033 and previous config saved to /var/cache/conftool/dbconfig/20240616-000421-ladsgroup.json [00:25:51] (03PS1) 10Urbanecm: admin: urbanecm's home: Update .gitconfig [puppet] - 10https://gerrit.wikimedia.org/r/1045211 [01:34:43] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [02:35:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1227 (T364069)', diff saved to https://phabricator.wikimedia.org/P65034 and previous config saved to /var/cache/conftool/dbconfig/20240616-023541-marostegui.json [02:35:46] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [02:38:46] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:50:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1227', diff saved to https://phabricator.wikimedia.org/P65035 and previous config saved to /var/cache/conftool/dbconfig/20240616-025048-marostegui.json [02:58:46] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:05:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1227', diff saved to https://phabricator.wikimedia.org/P65036 and previous config saved to /var/cache/conftool/dbconfig/20240616-030555-marostegui.json [03:21:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1227 (T364069)', diff saved to https://phabricator.wikimedia.org/P65037 and previous config saved to /var/cache/conftool/dbconfig/20240616-032102-marostegui.json [03:21:05] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance [03:21:08] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [03:21:18] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance [05:08:39] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2126 (T352010)', diff saved to https://phabricator.wikimedia.org/P65038 and previous config saved to /var/cache/conftool/dbconfig/20240616-050838-ladsgroup.json [05:08:44] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [05:23:46] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2126', diff saved to https://phabricator.wikimedia.org/P65039 and previous config saved to /var/cache/conftool/dbconfig/20240616-052345-ladsgroup.json [05:34:43] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [05:38:53] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2126', diff saved to https://phabricator.wikimedia.org/P65040 and previous config saved to /var/cache/conftool/dbconfig/20240616-053852-ladsgroup.json [05:54:00] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2126 (T352010)', diff saved to https://phabricator.wikimedia.org/P65041 and previous config saved to /var/cache/conftool/dbconfig/20240616-055359-ladsgroup.json [05:54:02] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2138.codfw.wmnet with reason: Maintenance [05:54:04] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2138.codfw.wmnet with reason: Maintenance [05:54:05] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [05:54:11] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2138 (T352010)', diff saved to https://phabricator.wikimedia.org/P65042 and previous config saved to /var/cache/conftool/dbconfig/20240616-055411-ladsgroup.json [06:04:21] FIRING: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [06:09:21] RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [07:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240616T0700) [08:10:14] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:15:44] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:34:43] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [12:02:38] RECOVERY - Host rdb1014 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms [12:29:18] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:32:33] FIRING: KubernetesCalicoDown: mw2321.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=mw2321.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [13:11:30] (03CR) 10Gergő Tisza: password: Document wmgPasswordSecretKey (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1034905 (https://phabricator.wikimedia.org/T150647) (owner: 10Krinkle) [13:16:31] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2138 (T352010)', diff saved to https://phabricator.wikimedia.org/P65043 and previous config saved to /var/cache/conftool/dbconfig/20240616-131630-ladsgroup.json [13:16:36] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [13:31:38] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2138', diff saved to https://phabricator.wikimedia.org/P65044 and previous config saved to /var/cache/conftool/dbconfig/20240616-133137-ladsgroup.json [13:34:43] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [13:46:45] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2138', diff saved to https://phabricator.wikimedia.org/P65045 and previous config saved to /var/cache/conftool/dbconfig/20240616-134645-ladsgroup.json [14:01:52] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2138 (T352010)', diff saved to https://phabricator.wikimedia.org/P65046 and previous config saved to /var/cache/conftool/dbconfig/20240616-140152-ladsgroup.json [14:01:54] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2148.codfw.wmnet with reason: Maintenance [14:01:58] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [14:02:07] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2148.codfw.wmnet with reason: Maintenance [14:02:15] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2148 (T352010)', diff saved to https://phabricator.wikimedia.org/P65047 and previous config saved to /var/cache/conftool/dbconfig/20240616-140214-ladsgroup.json [14:12:54] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:13:00] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:13:14] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:15:10] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Tue 13 Aug 2024 12:55:14 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:18:16] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:23:52] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8617 bytes in 7.609 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:23:52] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 52065 bytes in 0.088 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:24:08] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Tue 13 Aug 2024 12:55:14 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:38:46] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:58:46] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:55:05] (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1045990 [16:32:48] FIRING: KubernetesCalicoDown: mw2321.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=mw2321.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [17:34:43] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [18:57:34] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Moving 1G servers out of rack D4 in prep of switch migration - https://phabricator.wikimedia.org/T361856#9896324 (10Papaul) @Clement_Goubert hey we have a site visit at eqdfw on this Monday at 11:00am CT to check a power issue on our router that is there so we ca... [19:50:17] (03PS15) 10Gergő Tisza: [POC] Handle sso.wikimedia.org domain [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1036245 (https://phabricator.wikimedia.org/T365162) [19:50:28] (03CR) 10Gergő Tisza: [POC] Handle sso.wikimedia.org domain (033 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1036245 (https://phabricator.wikimedia.org/T365162) (owner: 10Gergő Tisza) [20:32:48] FIRING: KubernetesCalicoDown: mw2321.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=mw2321.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [21:25:28] (03PS1) 10Jon Harald Søby: Enable subpages for the main namespace in sourceswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1046116 (https://phabricator.wikimedia.org/T367674) [21:26:33] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, June 17 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1046116 (https://phabricator.wikimedia.org/T367674) (owner: 10Jon Harald Søby) [21:34:01] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2148 (T352010)', diff saved to https://phabricator.wikimedia.org/P65049 and previous config saved to /var/cache/conftool/dbconfig/20240616-213400-ladsgroup.json [21:34:06] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [21:34:43] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [21:49:08] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2148', diff saved to https://phabricator.wikimedia.org/P65050 and previous config saved to /var/cache/conftool/dbconfig/20240616-214907-ladsgroup.json [22:04:15] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2148', diff saved to https://phabricator.wikimedia.org/P65051 and previous config saved to /var/cache/conftool/dbconfig/20240616-220414-ladsgroup.json [22:19:22] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2148 (T352010)', diff saved to https://phabricator.wikimedia.org/P65052 and previous config saved to /var/cache/conftool/dbconfig/20240616-221921-ladsgroup.json [22:19:24] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2175.codfw.wmnet with reason: Maintenance [22:19:27] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [22:19:37] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2175.codfw.wmnet with reason: Maintenance [22:19:44] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2175 (T352010)', diff saved to https://phabricator.wikimedia.org/P65053 and previous config saved to /var/cache/conftool/dbconfig/20240616-221944-ladsgroup.json [22:47:50] RECOVERY - MariaDB Replica Lag: s1 on clouddb1017 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [23:38:09] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1046118 [23:38:09] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1046118 (owner: 10TrainBranchBot) [23:54:50] PROBLEM - MariaDB Replica Lag: s1 on clouddb1017 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 509.13 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica