[00:00:54] !log restarting haproxy on cp3068 and cp3072 [00:00:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:02:28] (CR) CI reject: [V:-1] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1048122 (owner: TrainBranchBot) [00:03:17] PROBLEM - SSH on puppetserver1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [00:03:43] (PS1) Arlolra: Remove unused Linter configs [mediawiki-config] - https://gerrit.wikimedia.org/r/1048138 (https://phabricator.wikimedia.org/T343292) [00:04:25] RECOVERY - Webrequests Varnishkafka log producer on cp3068 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [00:04:25] RECOVERY - Webrequests Varnishkafka log producer on cp3072 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [00:07:17] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1236 (T352010)', diff saved to https://phabricator.wikimedia.org/P65275 and previous config saved to /var/cache/conftool/dbconfig/20240621-000716-ladsgroup.json [00:07:21] RECOVERY - eventlogging Varnishkafka log producer on cp3071 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/eventlogging.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [00:07:22] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [00:07:27] RECOVERY - Webrequests Varnishkafka log producer on cp3071 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [00:07:27] RECOVERY - statsv Varnishkafka log producer on cp3071 is OK: 
PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/statsv.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [00:08:25] RECOVERY - eventlogging Varnishkafka log producer on cp3070 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/eventlogging.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [00:08:25] RECOVERY - statsv Varnishkafka log producer on cp3070 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/statsv.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [00:08:25] RECOVERY - Webrequests Varnishkafka log producer on cp3070 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [00:08:26] !log [cp3067:~] $ sudo systemctl start logrotate [00:08:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:08:35] !log [cp3072:~] $ sudo systemctl start varnishkafka-webrequest.service [00:08:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:09:09] RECOVERY - SSH on puppetserver1002 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [00:09:21] RECOVERY - Webrequests Varnishkafka log producer on cp3067 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/webrequest.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [00:10:21] RECOVERY - eventlogging Varnishkafka log producer on cp3068 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/eventlogging.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [00:10:21] RECOVERY - eventlogging Varnishkafka log producer on cp3072 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/eventlogging.conf 
https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [00:10:27] RECOVERY - statsv Varnishkafka log producer on cp3072 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/statsv.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [00:10:27] RECOVERY - statsv Varnishkafka log producer on cp3068 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/statsv.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [00:13:37] RECOVERY - eventlogging Varnishkafka log producer on cp3067 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/eventlogging.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [00:13:37] RECOVERY - statsv Varnishkafka log producer on cp3067 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/statsv.conf https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka [00:15:49] PROBLEM - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on alert1001 is CRITICAL: 1.019e+05 gt 1e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw [00:22:24] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1236', diff saved to https://phabricator.wikimedia.org/P65276 and previous config saved to /var/cache/conftool/dbconfig/20240621-002223-ladsgroup.json [00:37:31] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1236', diff saved to https://phabricator.wikimedia.org/P65277 and previous config saved to /var/cache/conftool/dbconfig/20240621-003730-ladsgroup.json [00:37:46] (PS1) Arlolra: Follow the defaults for Parsoid on MFE on officewiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1048144 
(https://phabricator.wikimedia.org/T363720) [00:52:39] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1236 (T352010)', diff saved to https://phabricator.wikimedia.org/P65278 and previous config saved to /var/cache/conftool/dbconfig/20240621-005237-ladsgroup.json [00:52:54] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [01:00:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2180 (T367856)', diff saved to https://phabricator.wikimedia.org/P65279 and previous config saved to /var/cache/conftool/dbconfig/20240621-010002-marostegui.json [01:00:14] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [01:15:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P65280 and previous config saved to /var/cache/conftool/dbconfig/20240621-011509-marostegui.json [01:30:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P65281 and previous config saved to /var/cache/conftool/dbconfig/20240621-013016-marostegui.json [01:30:25] FIRING: SystemdUnitFailed: wmf_auto_restart_parsoid-rt.service on testreduce1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:40:25] RESOLVED: SystemdUnitFailed: wmf_auto_restart_parsoid-rt.service on testreduce1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:45:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2180 (T367856)', diff saved to https://phabricator.wikimedia.org/P65282 and previous config saved 
to /var/cache/conftool/dbconfig/20240621-014523-marostegui.json [01:45:25] FIRING: SystemdUnitFailed: wmf_auto_restart_parsoid-rt.service on testreduce1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:45:26] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2193.codfw.wmnet with reason: Maintenance [01:45:31] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [01:45:39] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2193.codfw.wmnet with reason: Maintenance [01:45:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2193 (T367856)', diff saved to https://phabricator.wikimedia.org/P65283 and previous config saved to /var/cache/conftool/dbconfig/20240621-014545-marostegui.json [02:15:48] FIRING: SystemdUnitFailed: mail-aliases.service on mx-in2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:35:25] RESOLVED: SystemdUnitFailed: wmf_auto_restart_parsoid-rt.service on testreduce1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:38:47] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:43:25] FIRING: SystemdUnitFailed: wmf_auto_restart_parsoid-rt.service on testreduce1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:58:47] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:23:41] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [03:59:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2193 (T367856)', diff saved to https://phabricator.wikimedia.org/P65284 and previous config saved to /var/cache/conftool/dbconfig/20240621-035934-marostegui.json [03:59:40] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [04:05:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2161 (T364069)', diff saved to https://phabricator.wikimedia.org/P65285 and previous config saved to /var/cache/conftool/dbconfig/20240621-040523-marostegui.json [04:05:30] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [04:14:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P65286 and previous config saved to /var/cache/conftool/dbconfig/20240621-041441-marostegui.json [04:20:31] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2161', diff saved to https://phabricator.wikimedia.org/P65287 and previous config saved to /var/cache/conftool/dbconfig/20240621-042030-marostegui.json [04:29:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P65288 
and previous config saved to /var/cache/conftool/dbconfig/20240621-042948-marostegui.json [04:35:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2161', diff saved to https://phabricator.wikimedia.org/P65289 and previous config saved to /var/cache/conftool/dbconfig/20240621-043537-marostegui.json [04:38:05] (PS1) Marostegui: db1209: Remove package declaration [puppet] - https://gerrit.wikimedia.org/r/1048202 [04:38:38] (CR) Marostegui: [C:+2] db1209: Remove package declaration [puppet] - https://gerrit.wikimedia.org/r/1048202 (owner: Marostegui) [04:43:08] SRE-Sprint-Week-Sustainability-March2023, DBA, Sustainability (Incident Followup): Implement (or refactor) a script to move slaves when the master is not available - https://phabricator.wikimedia.org/T196366#9912484 (Marostegui) >>! In T196366#9911494, @Ladsgroup wrote: > @Marostegui To get the list... [04:44:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2193 (T367856)', diff saved to https://phabricator.wikimedia.org/P65290 and previous config saved to /var/cache/conftool/dbconfig/20240621-044455-marostegui.json [04:44:58] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2197.codfw.wmnet with reason: Maintenance [04:45:01] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [04:45:11] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2197.codfw.wmnet with reason: Maintenance [04:50:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2161 (T364069)', diff saved to https://phabricator.wikimedia.org/P65291 and previous config saved to /var/cache/conftool/dbconfig/20240621-045044-marostegui.json [04:50:47] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2162.codfw.wmnet with reason: Maintenance [04:50:53] T364069: Rebuild pagelinks 
tables - https://phabricator.wikimedia.org/T364069 [04:51:00] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2162.codfw.wmnet with reason: Maintenance [04:51:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2162 (T364069)', diff saved to https://phabricator.wikimedia.org/P65292 and previous config saved to /var/cache/conftool/dbconfig/20240621-045107-marostegui.json [05:27:13] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 70, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:27:57] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS6939/IPv6: Idle - HE, AS6939/IPv4: Idle - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [05:28:13] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:28:57] RECOVERY - BGP status on cr4-ulsfo is OK: BGP OK - up: 143, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [05:31:51] RECOVERY - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on alert1001 is OK: (C)1e+05 gt (W)1e+04 gt 64 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw [05:55:00] (PS2) Ayounsi: Netbox 4: getstats.py [software/netbox-extras] - https://gerrit.wikimedia.org/r/918359 (https://phabricator.wikimedia.org/T336275) [06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240621T0600) [06:01:14] (PS1) Ayounsi: Netbox 4: getstats.py 
[software/netbox-extras] (dev) - https://gerrit.wikimedia.org/r/1048260 (https://phabricator.wikimedia.org/T336275) [06:01:50] (Abandoned) Ayounsi: Netbox 4: getstats.py [software/netbox-extras] - https://gerrit.wikimedia.org/r/918359 (https://phabricator.wikimedia.org/T336275) (owner: Ayounsi) [06:04:21] FIRING: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [06:09:21] RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [06:15:48] FIRING: SystemdUnitFailed: mail-aliases.service on mx-in2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:25:30] (CR) Ayounsi: [C:+1] policies/cr-labs: remove obsolete ntp.anycast.wmnet [homer/public] - https://gerrit.wikimedia.org/r/1048066 (https://phabricator.wikimedia.org/T366360) (owner: Ssingh) [06:35:54] (CR) DCausse: "how are we going to handle duplication moving forward?" 
[alerts] - https://gerrit.wikimedia.org/r/1048074 (https://phabricator.wikimedia.org/T368107) (owner: Bking) [06:43:25] FIRING: SystemdUnitFailed: wmf_auto_restart_parsoid-rt.service on testreduce1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:52:30] (CR) DCausse: "should we change puppet://modules/alertmanager/templates/alertmanager.yml.erb instead and route search-platform alerts to the relevant cha" [alerts] - https://gerrit.wikimedia.org/r/1048074 (https://phabricator.wikimedia.org/T368107) (owner: Bking) [06:53:31] (CR) Ayounsi: Set eqdfw to use default aggregate policy, and modify eqord policy (1 comment) [homer/public] - https://gerrit.wikimedia.org/r/1043229 (https://phabricator.wikimedia.org/T367439) (owner: Cathal Mooney) [06:58:26] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [06:58:33] (CR) Ayounsi: Set eqdfw to use default aggregate policy, and modify eqord policy (1 comment) [homer/public] - https://gerrit.wikimedia.org/r/1043229 (https://phabricator.wikimedia.org/T367439) (owner: Cathal Mooney) [07:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240621T0700) [07:03:57] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:03:59] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1206 depool for debugging T368098', diff saved to https://phabricator.wikimedia.org/P65293 and previous config saved to /var/cache/conftool/dbconfig/20240621-070358-arnaudb.json [07:04:04] T368098: db1206 - replica lag - page - 20240620 - https://phabricator.wikimedia.org/T368098 [07:04:46] (CR) JMeybohm: [C:+1] kubernetes: split unavailable-replicas alert per team [alerts] - https://gerrit.wikimedia.org/r/1046781 (https://phabricator.wikimedia.org/T366932) (owner: Scott French) [07:06:15] PROBLEM - MariaDB Replica Lag: s1 on clouddb1017 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 307.23 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [07:08:26] RESOLVED: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [07:08:48] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1206 (re)pooling @ 25%: repool to fill up vslow/dump', diff saved to https://phabricator.wikimedia.org/P65294 and previous config saved to /var/cache/conftool/dbconfig/20240621-070847-arnaudb.json [07:10:09] SRE, collaboration-services, Wikimedia-Mailing-lists, Patch-For-Review: Migrate Mailman/lists to Bullseye/Bookworm - https://phabricator.wikimedia.org/T331706#9912591 (eoghan) The migration to the new host is done. The last remaining item before we can close this ticket is to decommission the old... 
[07:14:21] (PS10) DCausse: [WIP] wdqs: allow to configure internal federated endpoints [puppet] - https://gerrit.wikimedia.org/r/1048038 (https://phabricator.wikimedia.org/T361950) [07:14:49] (CR) CI reject: [V:-1] [WIP] wdqs: allow to configure internal federated endpoints [puppet] - https://gerrit.wikimedia.org/r/1048038 (https://phabricator.wikimedia.org/T361950) (owner: DCausse) [07:15:50] SRE, DBA: db1206 - replica lag - page - 20240620 - https://phabricator.wikimedia.org/T368098#9912598 (ABran-WMF) I depooled the host by reflex, its currently repooling right now [07:23:53] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1206 (re)pooling @ 50%: repool to fill up vslow/dump', diff saved to https://phabricator.wikimedia.org/P65295 and previous config saved to /var/cache/conftool/dbconfig/20240621-072353-arnaudb.json [07:27:15] RECOVERY - MariaDB Replica Lag: s1 on clouddb1017 is OK: OK slave_sql_lag Replication lag: 24.22 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [07:32:37] (PS11) DCausse: [WIP] wdqs: allow to configure internal federated endpoints [puppet] - https://gerrit.wikimedia.org/r/1048038 (https://phabricator.wikimedia.org/T361950) [07:33:11] (CR) CI reject: [V:-1] [WIP] wdqs: allow to configure internal federated endpoints [puppet] - https://gerrit.wikimedia.org/r/1048038 (https://phabricator.wikimedia.org/T361950) (owner: DCausse) [07:36:06] (CR) Clément Goubert: [C:+2] docker_registry_ha: Double nginx worker_rlimit_nofile, worker_connections [puppet] - https://gerrit.wikimedia.org/r/1047946 (https://phabricator.wikimedia.org/T366481) (owner: Clément Goubert) [07:37:27] (PS12) DCausse: [WIP] wdqs: allow to configure internal federated endpoints [puppet] - https://gerrit.wikimedia.org/r/1048038 (https://phabricator.wikimedia.org/T361950) [07:38:59] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1206 (re)pooling @ 75%: repool to fill up vslow/dump', 
diff saved to https://phabricator.wikimedia.org/P65296 and previous config saved to /var/cache/conftool/dbconfig/20240621-073858-arnaudb.json [07:39:06] (CR) Jelto: [V:+1 C:+2] gitlab: add custom nginx config to block manual Trusted Runners edits [puppet] - https://gerrit.wikimedia.org/r/1047470 (https://phabricator.wikimedia.org/T366786) (owner: Jelto) [07:41:12] (CR) CI reject: [V:-1] [WIP] wdqs: allow to configure internal federated endpoints [puppet] - https://gerrit.wikimedia.org/r/1048038 (https://phabricator.wikimedia.org/T361950) (owner: DCausse) [07:45:26] (CR) Alexandros Kosiaris: [C:+2] mobileapps: enable access to eventgate-main [deployment-charts] - https://gerrit.wikimedia.org/r/1048009 (https://phabricator.wikimedia.org/T368052) (owner: Alexandros Kosiaris) [07:46:21] (Merged) jenkins-bot: mobileapps: enable access to eventgate-main [deployment-charts] - https://gerrit.wikimedia.org/r/1048009 (https://phabricator.wikimedia.org/T368052) (owner: Alexandros Kosiaris) [07:49:57] FIRING: ProbeDown: Service thumbor:8800 has failed probes (http_thumbor_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:50:09] PROBLEM - Swift https backend on ms-fe1012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 260 bytes in 1.050 second response time https://wikitech.wikimedia.org/wiki/Swift [07:50:43] FIRING: VarnishUnavailable: varnish-upload has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable [07:50:44] FIRING: HaproxyUnavailable: HAProxy (cache_upload) has reduced HTTP availability #page - 
https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [07:50:48] FIRING: ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#swift-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:50:52] huh [07:50:53] here [07:50:55] uh oh [07:51:09] RECOVERY - Swift https backend on ms-fe1012 is OK: HTTP OK: HTTP/1.1 200 OK - 501 bytes in 0.091 second response time https://wikitech.wikimedia.org/wiki/Swift [07:51:20] here [07:51:22] oh wait, swift recovered? [07:51:24] !incidents [07:51:24] 4766 (ACKED) ProbeDown sre (10.2.2.24 ip4 thumbor:8800 probes/service http_thumbor_ip4 eqiad) [07:51:25] 4767 (ACKED) VarnishUnavailable global sre (varnish-upload thanos-rule) [07:51:25] 4768 (ACKED) HaproxyUnavailable cache_upload global sre (thanos-rule) [07:51:25] 4765 (RESOLVED) HaproxyUnavailable cache_text global sre (thanos-rule) [07:51:25] 4764 (RESOLVED) db1206 (paged)/MariaDB Replica Lag: s1 (paged) [07:51:33] the rest should recover soon then? 
[07:52:37] thumbor 5xx rate isn't dropping yet [07:53:25] ah there it goes [07:53:47] RESOLVED: [2x] ProbeDown: Service swift-https:443 has failed probes (http_swift-https_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:54:04] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1206 (re)pooling @ 100%: repool to fill up vslow/dump', diff saved to https://phabricator.wikimedia.org/P65297 and previous config saved to /var/cache/conftool/dbconfig/20240621-075404-arnaudb.json [07:54:57] RESOLVED: ProbeDown: Service thumbor:8800 has failed probes (http_thumbor_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:55:34] (PS13) DCausse: [WIP] wdqs: allow to configure internal federated endpoints [puppet] - https://gerrit.wikimedia.org/r/1048038 (https://phabricator.wikimedia.org/T361950) [07:55:43] RESOLVED: VarnishUnavailable: varnish-upload has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable [07:55:44] RESOLVED: HaproxyUnavailable: HAProxy (cache_upload) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [07:58:25] RESOLVED: SystemdUnitFailed: wmf_auto_restart_parsoid-rt.service on testreduce1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:59:19] (CR) CI reject: [V:-1] [WIP] wdqs: allow to configure internal federated endpoints [puppet] - https://gerrit.wikimedia.org/r/1048038 (https://phabricator.wikimedia.org/T361950) (owner: DCausse) [07:59:25] FIRING: SystemdUnitFailed: wmf_auto_restart_apache2.service on lists1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:00:37] !log akosiaris@deploy1002 helmfile [staging] START helmfile.d/services/mobileapps: apply [08:00:54] !log akosiaris@deploy1002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [08:02:52] (PS1) Giuseppe Lavagetto: modules: clean up old unused versions of mesh.configuration [deployment-charts] - https://gerrit.wikimedia.org/r/1048359 [08:03:18] (Abandoned) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1048122 (owner: TrainBranchBot) [08:03:27] !log akosiaris@deploy1002 helmfile [codfw] START helmfile.d/services/mobileapps: apply [08:03:57] !log akosiaris@deploy1002 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply [08:03:58] (PS14) DCausse: [WIP] wdqs: allow to configure internal federated endpoints [puppet] - https://gerrit.wikimedia.org/r/1048038 (https://phabricator.wikimedia.org/T361950) [08:04:24] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/services/mobileapps: apply [08:04:26] FIRING: RoutinatorRsyncErrors: Routinator rsync fetching issue in eqiad - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [08:04:46] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply [08:07:23] SRE, serviceops, Data Products (Data 
Products Sprint 15), Patch-For-Review, Service-deployment-requests: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production - https://phabricator.wikimedia.org/T361835#9912675 (SGupta-WMF) @Scott_French Here is the tag on main branch https://gi... [08:07:37] (CR) CI reject: [V:-1] [WIP] wdqs: allow to configure internal federated endpoints [puppet] - https://gerrit.wikimedia.org/r/1048038 (https://phabricator.wikimedia.org/T361950) (owner: DCausse) [08:09:26] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [08:12:04] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1048362 [08:12:04] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1048362 (owner: TrainBranchBot) [08:14:17] (PS15) DCausse: [WIP] wdqs: allow to configure internal federated endpoints [puppet] - https://gerrit.wikimedia.org/r/1048038 (https://phabricator.wikimedia.org/T361950) [08:14:42] !log restarting logrotate.service on cp[3068,3070-3071].esams.wmnet [08:14:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:18:00] (CR) CI reject: [V:-1] [WIP] wdqs: allow to configure internal federated endpoints [puppet] - https://gerrit.wikimedia.org/r/1048038 (https://phabricator.wikimedia.org/T361950) (owner: DCausse) [08:26:33] (PS6) AOkoth: vtrs: upgrade cookbook [cookbooks] - https://gerrit.wikimedia.org/r/1042179 (https://phabricator.wikimedia.org/T366078) [08:28:19] PROBLEM - MariaDB Replica Lag: s4 on clouddb1019 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 320.02 seconds 
https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [08:30:17] (03PS1) 10Ayounsi: Netbox 4: create new script directory and enable debug [puppet] - 10https://gerrit.wikimedia.org/r/1048368 (https://phabricator.wikimedia.org/T336275) [08:30:39] (03PS7) 10AOkoth: vtrs: upgrade cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1042179 (https://phabricator.wikimedia.org/T366078) [08:32:44] (03CR) 10CI reject: [V:04-1] Netbox 4: create new script directory and enable debug [puppet] - 10https://gerrit.wikimedia.org/r/1048368 (https://phabricator.wikimedia.org/T336275) (owner: 10Ayounsi) [08:33:16] (03PS16) 10DCausse: [WIP] wdqs: allow to configure internal federated endpoints [puppet] - 10https://gerrit.wikimedia.org/r/1048038 (https://phabricator.wikimedia.org/T361950) [08:33:33] (03PS2) 10Ayounsi: Netbox 4: create new script directory and enable debug [puppet] - 10https://gerrit.wikimedia.org/r/1048368 (https://phabricator.wikimedia.org/T336275) [08:34:33] (03CR) 10Cathal Mooney: [C:03+1] "Copied votes on follow-up patch sets have been updated:" [puppet] - 10https://gerrit.wikimedia.org/r/1048368 (https://phabricator.wikimedia.org/T336275) (owner: 10Ayounsi) [08:35:53] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [08:35:57] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [08:36:34] (03CR) 10Btullis: "That's exactly what I was wondering, too. I'll have a look and see if I can see any practicable way of doing it." 
[alerts] - 10https://gerrit.wikimedia.org/r/1048074 (https://phabricator.wikimedia.org/T368107) (owner: 10Bking) [08:36:49] (03CR) 10CI reject: [V:04-1] Netbox 4: create new script directory and enable debug [puppet] - 10https://gerrit.wikimedia.org/r/1048368 (https://phabricator.wikimedia.org/T336275) (owner: 10Ayounsi) [08:37:11] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [08:37:29] (03CR) 10CI reject: [V:04-1] [WIP] wdqs: allow to configure internal federated endpoints [puppet] - 10https://gerrit.wikimedia.org/r/1048038 (https://phabricator.wikimedia.org/T361950) (owner: 10DCausse) [08:37:37] (03CR) 10jenkins-bot: Netbox 4: create new script directory and enable debug [puppet] - 10https://gerrit.wikimedia.org/r/1048368 (https://phabricator.wikimedia.org/T336275) (owner: 10Ayounsi) [08:37:51] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1048362 (owner: 10TrainBranchBot) [08:38:55] (03PS3) 10Ayounsi: Netbox 4: create new script directory and enable debug [puppet] - 10https://gerrit.wikimedia.org/r/1048368 (https://phabricator.wikimedia.org/T336275) [08:38:57] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8923 bytes in 8.225 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [08:39:01] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Tue 13 Aug 2024 12:55:14 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [08:39:49] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 52198 bytes in 2.865 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [08:39:59] !log aborrero@cumin1002 START - Cookbook sre.hosts.reboot-single for host cloudvirt1053.eqiad.wmnet [08:40:39] (03PS17) 10DCausse: [WIP] wdqs: allow to configure internal federated endpoints [puppet] - 10https://gerrit.wikimedia.org/r/1048038 (https://phabricator.wikimedia.org/T361950) [08:41:01] (03CR) 10CI reject: [V:04-1] [WIP] wdqs: allow to configure internal federated endpoints [puppet] - 10https://gerrit.wikimedia.org/r/1048038 (https://phabricator.wikimedia.org/T361950) (owner: 10DCausse) [08:41:44] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1048368 (https://phabricator.wikimedia.org/T336275) (owner: 10Ayounsi) [08:41:54] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-ctrl2002.codfw.wmnet with OS bullseye [08:42:00] 10ops-codfw, 06SRE-OnFire, 06DC-Ops, 06serviceops, 10Sustainability (Incident Followup): codfw:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366205#9912742 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by kamila@cumin1002 for host wikikube-ctrl2002.co... 
[08:43:19] RECOVERY - MariaDB Replica Lag: s4 on clouddb1019 is OK: OK slave_sql_lag Replication lag: 22.08 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [08:43:29] (03PS4) 10Ayounsi: Netbox 4: create new script directory and enable debug [puppet] - 10https://gerrit.wikimedia.org/r/1048368 (https://phabricator.wikimedia.org/T336275) [08:43:36] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1048368 (https://phabricator.wikimedia.org/T336275) (owner: 10Ayounsi) [08:46:30] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1048368 (https://phabricator.wikimedia.org/T336275) (owner: 10Ayounsi) [08:47:22] !log aborrero@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudvirt1053.eqiad.wmnet [08:56:36] !log aborrero@cumin1002 START - Cookbook sre.hosts.reimage for host cloudvirt1053.eqiad.wmnet with OS bookworm [08:57:50] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-ctrl2002.codfw.wmnet with reason: host reimage [08:59:44] (03CR) 10Ayounsi: [C:03+2] "Tested locally, self merging to the dev branch." 
[software/netbox-extras] (dev) - 10https://gerrit.wikimedia.org/r/1048260 (https://phabricator.wikimedia.org/T336275) (owner: 10Ayounsi) [09:01:19] (03Merged) 10jenkins-bot: Netbox 4: getstats.py [software/netbox-extras] (dev) - 10https://gerrit.wikimedia.org/r/1048260 (https://phabricator.wikimedia.org/T336275) (owner: 10Ayounsi) [09:02:11] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-ctrl2002.codfw.wmnet with reason: host reimage [09:09:25] (03PS18) 10DCausse: [WIP] wdqs: allow to configure internal federated endpoints [puppet] - 10https://gerrit.wikimedia.org/r/1048038 (https://phabricator.wikimedia.org/T361950) [09:09:53] (03PS1) 10Clément Goubert: mediawiki: Reimage scap proxies as videoscalers [puppet] - 10https://gerrit.wikimedia.org/r/1048376 (https://phabricator.wikimedia.org/T368058) [09:10:05] (03PS1) 10Clément Goubert: scap_proxies: move all proxies to videoscalers [puppet] - 10https://gerrit.wikimedia.org/r/1048377 (https://phabricator.wikimedia.org/T368058) [09:10:35] (03CR) 10Elukey: "Asked some questions as follow up, these classes are still a bit new for me sorry :)" [puppet] - 10https://gerrit.wikimedia.org/r/1048368 (https://phabricator.wikimedia.org/T336275) (owner: 10Ayounsi) [09:13:13] (03CR) 10CI reject: [V:04-1] [WIP] wdqs: allow to configure internal federated endpoints [puppet] - 10https://gerrit.wikimedia.org/r/1048038 (https://phabricator.wikimedia.org/T361950) (owner: 10DCausse) [09:14:00] !log aborrero@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1053.eqiad.wmnet with reason: host reimage [09:16:22] (03CR) 10Ayounsi: "Thanks ! it's good to have fresh eyes on it !" 
[puppet] - 10https://gerrit.wikimedia.org/r/1048368 (https://phabricator.wikimedia.org/T336275) (owner: 10Ayounsi) [09:16:26] !log aborrero@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1053.eqiad.wmnet with reason: host reimage [09:17:25] (03CR) 10DCausse: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1048038 (https://phabricator.wikimedia.org/T361950) (owner: 10DCausse) [09:20:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2162 (T364069)', diff saved to https://phabricator.wikimedia.org/P65298 and previous config saved to /var/cache/conftool/dbconfig/20240621-092009-marostegui.json [09:20:15] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [09:21:55] (03CR) 10Kamila Součková: admin: Extend access for AndyRussG (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1047473 (https://phabricator.wikimedia.org/T367681) (owner: 10Kamila Součková) [09:23:20] PROBLEM - MariaDB Replica Lag: s4 on clouddb1019 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 340.30 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [09:23:23] (03PS1) 10Kamila Součková: Revert^4 "Add wikikube-ctrl2002 to server SRV record for etcd" [dns] - 10https://gerrit.wikimedia.org/r/1048383 [09:23:35] (03PS1) 10Slyngshede: Syncronize and update templates to support new version of Thymeleaf. 
[software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/1048384 [09:24:04] (03PS1) 10Btullis: Enable the MariaDB binlog on the analytics mariadb replicas [puppet] - 10https://gerrit.wikimedia.org/r/1048385 (https://phabricator.wikimedia.org/T258511) [09:24:16] (03CR) 10Kamila Součková: [C:04-1] "to be merged only once 2002 is actually up" [dns] - 10https://gerrit.wikimedia.org/r/1048383 (owner: 10Kamila Součková) [09:25:49] (03PS2) 10Btullis: Enable the MariaDB binlog on the analytics mariadb replicas [puppet] - 10https://gerrit.wikimedia.org/r/1048385 (https://phabricator.wikimedia.org/T258511) [09:26:49] (03CR) 10Elukey: [C:03+1] Netbox 4: create new script directory and enable debug (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1048368 (https://phabricator.wikimedia.org/T336275) (owner: 10Ayounsi) [09:27:21] (03PS1) 10Brouberol: karapace: Disable icinga monitoring for karapace hosts [puppet] - 10https://gerrit.wikimedia.org/r/1048386 (https://phabricator.wikimedia.org/T363461) [09:28:22] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS2914/IPv6: Idle - NTT, AS2914/IPv4: Idle - NTT https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:28:40] (03PS2) 10Brouberol: karapace: Disable icinga monitoring for karapace hosts [puppet] - 10https://gerrit.wikimedia.org/r/1048386 (https://phabricator.wikimedia.org/T363461) [09:28:46] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:28:46] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 68, down: 3, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:28:52] PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:29:21] (03CR) 10Brouberol: [V:03+1] "PCC 
SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3011/console" [puppet] - 10https://gerrit.wikimedia.org/r/1048386 (https://phabricator.wikimedia.org/T363461) (owner: 10Brouberol) [09:29:35] (03CR) 10Alexandros Kosiaris: [C:04-1] "The linked task is a closed and declined task. Maybe some other task should be linked instead?" [puppet] - 10https://gerrit.wikimedia.org/r/1048385 (https://phabricator.wikimedia.org/T258511) (owner: 10Btullis) [09:29:57] (03CR) 10Brouberol: [V:03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3012/console" [puppet] - 10https://gerrit.wikimedia.org/r/1048386 (https://phabricator.wikimedia.org/T363461) (owner: 10Brouberol) [09:30:16] (03PS3) 10Brouberol: karapace: Disable icinga monitoring for karapace hosts [puppet] - 10https://gerrit.wikimedia.org/r/1048386 (https://phabricator.wikimedia.org/T363461) [09:30:56] (03CR) 10Brouberol: [V:03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3013/console" [puppet] - 10https://gerrit.wikimedia.org/r/1048386 (https://phabricator.wikimedia.org/T363461) (owner: 10Brouberol) [09:31:17] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.wdqs.data-reload (exit_code=0) reloading wikidata_full on wdqs2023.codfw.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/full/20240603/ using stat1009.eqiad.wmnet) [09:31:52] dcausse, ryankemper: ^ \o/ !! [09:33:34] \o/ [09:34:04] (03CR) 10Btullis: [C:03+1] "Looks good, but you could also have done it at the role level. Maybe not worth changing, as we hope to decommission them soon." 
[puppet] - 10https://gerrit.wikimedia.org/r/1048386 (https://phabricator.wikimedia.org/T363461) (owner: 10Brouberol) [09:35:05] (03PS1) 10Hnowlan: service: set shellbox-video to production [puppet] - 10https://gerrit.wikimedia.org/r/1048387 (https://phabricator.wikimedia.org/T357309) [09:35:16] (03CR) 10CI reject: [V:04-1] service: set shellbox-video to production [puppet] - 10https://gerrit.wikimedia.org/r/1048387 (https://phabricator.wikimedia.org/T357309) (owner: 10Hnowlan) [09:35:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2162', diff saved to https://phabricator.wikimedia.org/P65299 and previous config saved to /var/cache/conftool/dbconfig/20240621-093517-marostegui.json [09:36:32] (03PS2) 10Hnowlan: service: set shellbox-video to production [puppet] - 10https://gerrit.wikimedia.org/r/1048387 (https://phabricator.wikimedia.org/T357309) [09:37:21] (03CR) 10Brouberol: [V:03+1 C:03+2] karapace: Disable icinga monitoring for karapace hosts [puppet] - 10https://gerrit.wikimedia.org/r/1048386 (https://phabricator.wikimedia.org/T363461) (owner: 10Brouberol) [09:37:54] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:38:54] RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:39:24] (03CR) 10Ayounsi: [C:03+2] Netbox 4: create new script directory and enable debug [puppet] - 10https://gerrit.wikimedia.org/r/1048368 (https://phabricator.wikimedia.org/T336275) (owner: 10Ayounsi) [09:39:54] (03PS3) 10Btullis: Enable the MariaDB binlog on the analytics mariadb replicas [puppet] - 10https://gerrit.wikimedia.org/r/1048385 (https://phabricator.wikimedia.org/T258511) [09:39:56] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:40:25] FIRING: SystemdUnitFailed: systemd-timedated.service on testreduce1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:41:11] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3014/co" [puppet] - 10https://gerrit.wikimedia.org/r/1048385 (https://phabricator.wikimedia.org/T258511) (owner: 10Btullis) [09:41:12] !log aborrero@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1053.eqiad.wmnet with OS bookworm [09:41:49] (03CR) 10LSobanski: [C:03+1] ugprade: bump to version 6.5.x [software/otrs] - 10https://gerrit.wikimedia.org/r/1047997 (https://phabricator.wikimedia.org/T364958) (owner: 10AOkoth) [09:42:49] (03PS19) 10DCausse: [WIP] wdqs: allow to configure internal federated endpoints [puppet] - 10https://gerrit.wikimedia.org/r/1048038 (https://phabricator.wikimedia.org/T361950) [09:43:11] (03CR) 10CI reject: [V:04-1] [WIP] wdqs: allow to configure internal federated endpoints [puppet] - 10https://gerrit.wikimedia.org/r/1048038 (https://phabricator.wikimedia.org/T361950) (owner: 10DCausse) [09:45:25] RESOLVED: SystemdUnitFailed: systemd-timedated.service on testreduce1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:45:40] !log brouberol@cumin2002 START - Cookbook sre.hosts.downtime for 12 days, 0:00:00 on karapace[1001-1002].eqiad.wmnet with reason: The hosts are soon to be decommissioned [09:45:57] !log brouberol@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12 days, 0:00:00 on karapace[1001-1002].eqiad.wmnet with reason: 
The hosts are soon to be decommissioned [09:46:31] (03PS1) 10Btullis: Configure check_private_data_report for an-redacteddb1001 [puppet] - 10https://gerrit.wikimedia.org/r/1048388 (https://phabricator.wikimedia.org/T365453) [09:48:59] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (NOOP 5): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3015/console" [puppet] - 10https://gerrit.wikimedia.org/r/1048388 (https://phabricator.wikimedia.org/T365453) (owner: 10Btullis) [09:50:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2162', diff saved to https://phabricator.wikimedia.org/P65300 and previous config saved to /var/cache/conftool/dbconfig/20240621-095024-marostegui.json [09:51:40] (03CR) 10Kamila Součková: [C:03+1] service: set shellbox-video to production [puppet] - 10https://gerrit.wikimedia.org/r/1048387 (https://phabricator.wikimedia.org/T357309) (owner: 10Hnowlan) [09:52:50] (03PS1) 10Btullis: [WIP] Remove references to clouddb1021 once the host has been decommissioned [puppet] - 10https://gerrit.wikimedia.org/r/1048390 (https://phabricator.wikimedia.org/T365453) [09:53:23] (03CR) 10Brouberol: [C:03+1] Configure check_private_data_report for an-redacteddb1001 [puppet] - 10https://gerrit.wikimedia.org/r/1048388 (https://phabricator.wikimedia.org/T365453) (owner: 10Btullis) [09:53:38] (03PS20) 10DCausse: [WIP] wdqs: allow to configure internal federated endpoints [puppet] - 10https://gerrit.wikimedia.org/r/1048038 (https://phabricator.wikimedia.org/T361950) [09:54:36] RECOVERY - BGP status on cr4-ulsfo is OK: BGP OK - up: 143, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:54:56] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:54:58] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:04:57] (03CR) 10Hnowlan: [C:03+2] service: set shellbox-video to production [puppet] - 10https://gerrit.wikimedia.org/r/1048387 (https://phabricator.wikimedia.org/T357309) (owner: 10Hnowlan) [10:05:32] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2162 (T364069)', diff saved to https://phabricator.wikimedia.org/P65301 and previous config saved to /var/cache/conftool/dbconfig/20240621-100531-marostegui.json [10:05:34] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2163.codfw.wmnet with reason: Maintenance [10:05:37] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [10:05:47] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2163.codfw.wmnet with reason: Maintenance [10:05:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2163 (T364069)', diff saved to https://phabricator.wikimedia.org/P65302 and previous config saved to /var/cache/conftool/dbconfig/20240621-100554-marostegui.json [10:08:17] (03CR) 10Stevemunene: [C:03+1] "lgtm!" [puppet] - 10https://gerrit.wikimedia.org/r/1048388 (https://phabricator.wikimedia.org/T365453) (owner: 10Btullis) [10:09:45] (03CR) 10Kamila Součková: Revert^4 "Add wikikube-ctrl2002 to server SRV record for etcd" [dns] - 10https://gerrit.wikimedia.org/r/1048383 (owner: 10Kamila Součková) [10:09:54] (03PS2) 10Kamila Součková: Revert^4 "Add wikikube-ctrl2002 to server SRV record for etcd" [dns] - 10https://gerrit.wikimedia.org/r/1048383 [10:11:05] (03CR) 10Kamila Součková: [C:03+2] Revert^4 "Add wikikube-ctrl2002 to server SRV record for etcd" [dns] - 10https://gerrit.wikimedia.org/r/1048383 (owner: 10Kamila Součková) [10:12:42] (03PS1) 10KCVelaga: Add Metrics Platform stream configuration and registration for MinT for Wikipedia Readers feature by Language and Product Localization team. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1048393 (https://phabricator.wikimedia.org/T368028) [10:13:36] (03PS2) 10Hnowlan: Add shellbox-video discovery [dns] - 10https://gerrit.wikimedia.org/r/1043817 (https://phabricator.wikimedia.org/T357309) [10:15:48] FIRING: SystemdUnitFailed: mail-aliases.service on mx-in2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:16:01] (03CR) 10Btullis: [V:03+1 C:03+2] Configure check_private_data_report for an-redacteddb1001 [puppet] - 10https://gerrit.wikimedia.org/r/1048388 (https://phabricator.wikimedia.org/T365453) (owner: 10Btullis) [10:25:58] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 479, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:26:21] 06SRE, 06cloud-services-team, 10Data-Services: [wikireplicas] Make sure there is no sensitive data in clouddb hosts - https://phabricator.wikimedia.org/T368136 (10fnegri) 03NEW [10:27:50] 06SRE, 06cloud-services-team, 10Data-Services: [wikireplicas] Make sure there is no sensitive data in clouddb hosts - https://phabricator.wikimedia.org/T368136#9912971 (10fnegri) [10:28:55] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-ctrl2002.codfw.wmnet with OS bullseye [10:29:06] 10ops-codfw, 06SRE-OnFire, 06DC-Ops, 06serviceops, 10Sustainability (Incident Followup): codfw:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366205#9912975 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by kamila@cumin1002 for host wikikube-ctrl2002.codfw.... 
[10:34:01] 10ops-codfw, 06SRE-OnFire, 06DC-Ops, 06serviceops, 10Sustainability (Incident Followup): codfw:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366205#9912990 (10kamila) 05Open→03Resolved Done, thanks a lot for the help @Papaul ! [10:34:48] (03PS1) 10Ayounsi: Netbox 4: create customscript parent directory as well [puppet] - 10https://gerrit.wikimedia.org/r/1048402 (https://phabricator.wikimedia.org/T336275) [10:35:10] (03CR) 10CI reject: [V:04-1] Netbox 4: create customscript parent directory as well [puppet] - 10https://gerrit.wikimedia.org/r/1048402 (https://phabricator.wikimedia.org/T336275) (owner: 10Ayounsi) [10:35:44] (03PS2) 10Ayounsi: Netbox 4: create customscript parent directory as well [puppet] - 10https://gerrit.wikimedia.org/r/1048402 (https://phabricator.wikimedia.org/T336275) [10:36:30] !log kamila@cumin1002 conftool action : set/pooled=yes; selector: name=wikikube-ctrl2001.codfw.wmnet [10:36:35] !log kamila@cumin1002 conftool action : set/pooled=yes; selector: name=wikikube-ctrl2002.codfw.wmnet [10:44:30] (03PS21) 10DCausse: [WIP] wdqs: allow to configure internal federated endpoints [puppet] - 10https://gerrit.wikimedia.org/r/1048038 (https://phabricator.wikimedia.org/T361950) [10:52:27] (03CR) 10DCausse: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1048038 (https://phabricator.wikimedia.org/T361950) (owner: 10DCausse) [10:53:45] (03CR) 10Kamila Součková: [C:03+1] Add shellbox-video discovery [dns] - 10https://gerrit.wikimedia.org/r/1043817 (https://phabricator.wikimedia.org/T357309) (owner: 10Hnowlan) [10:56:11] !log restart swift-proxy on ms-fe1010 T360913 [10:56:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:56:16] T360913: Swift proxy server misbehaviour (no longer calling `accept`?) 
- https://phabricator.wikimedia.org/T360913 [10:57:38] !log restart swift-proxy on ms-fe2011 ms-fe2012 T360913 [10:57:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:59:04] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:00:02] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240621T0700) [11:00:05] eoghan, jelto, arnoldokoth, and mutante: That opportune time for a GitLab version upgrades deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240621T1100). [11:02:56] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 52198 bytes in 1.760 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:02:58] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8923 bytes in 1.804 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:05:18] (03PS1) 10Superpes15: [ltwiki] Add a new 'rollbacker' usergroup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1048408 (https://phabricator.wikimedia.org/T367993) [11:05:53] (03CR) 10CI reject: [V:04-1] [ltwiki] Add a new 'rollbacker' usergroup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1048408 (https://phabricator.wikimedia.org/T367993) (owner: 10Superpes15) [11:06:18] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2214.codfw.wmnet with reason: Maintenance [11:06:31] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2214.codfw.wmnet with reason: Maintenance [11:06:38] !log 
marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2214 (T367856)', diff saved to https://phabricator.wikimedia.org/P65303 and previous config saved to /var/cache/conftool/dbconfig/20240621-110638-marostegui.json [11:06:43] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [11:07:40] (03CR) 10Hnowlan: [C:03+2] Add shellbox-video discovery [dns] - 10https://gerrit.wikimedia.org/r/1043817 (https://phabricator.wikimedia.org/T357309) (owner: 10Hnowlan) [11:10:46] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 10LDAP-Access-Requests: Grant Access to analytics-privatedata-users for cwylo - https://phabricator.wikimedia.org/T368027#9913034 (10kamila) [11:13:00] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 10LDAP-Access-Requests: Grant Access to analytics-privatedata-users for cwylo - https://phabricator.wikimedia.org/T368027#9913036 (10kamila) Needs approval from one of: - Olja Dimitrjevic - Dan Andreescu - Will Doran - Andreas Hoelzl... [11:20:11] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 10LDAP-Access-Requests: Grant Access to analytics-privatedata-users for cwylo - https://phabricator.wikimedia.org/T368027#9913044 (10kamila) [11:20:49] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 10LDAP-Access-Requests: Grant Access to analytics-privatedata-users for cwylo - https://phabricator.wikimedia.org/T368027#9913045 (10kamila) @cwylo Can you please confirm that you have read the [[ https://wikitech.wikimedia.org/wiki/Analytics/Data_access#Us... 
[11:21:55] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 10LDAP-Access-Requests: Grant Access to analytics-privatedata-users for cwylo - https://phabricator.wikimedia.org/T368027#9913049 (10kamila) [11:22:31] 06SRE, 10LDAP-Access-Requests: Offboard Lea WMDE from the WMF systems - https://phabricator.wikimedia.org/T368139 (10WMDE-leszek) 03NEW [11:25:37] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 10LDAP-Access-Requests: Grant Access to analytics-privatedata-users for cwylo - https://phabricator.wikimedia.org/T368027#9913062 (10kamila) @leila could you please confirm this access request? Thank you! [11:26:07] (03CR) 10Santiago Faci: [C:04-1] "Just an issue, the configured schema is wrong. The instrument is using the web/base one" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1048393 (https://phabricator.wikimedia.org/T368028) (owner: 10KCVelaga) [11:27:12] 06SRE, 10LDAP-Access-Requests: Grant Access to for - https://phabricator.wikimedia.org/T368140 (10DSmit-WMF) 03NEW [11:27:41] 06SRE, 10LDAP-Access-Requests: Grant Access to ? 
LDAP GROUP for daphnesmit - https://phabricator.wikimedia.org/T368140#9913073 (10DSmit-WMF) [11:28:57] 06SRE, 10LDAP-Access-Requests: Grant Access to I don't exactly know which LDAP GROUPs for daphnesmit - https://phabricator.wikimedia.org/T368140#9913075 (10DSmit-WMF) [11:37:51] !log hnowlan@cumin1002 START - Cookbook sre.dns.wipe-cache shellbox-video.discovery.wmnet on all recursors [11:37:55] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) shellbox-video.discovery.wmnet on all recursors [11:39:38] (03PS8) 10Hnowlan: Add shellbox-video vars/config, enable on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1043812 (https://phabricator.wikimedia.org/T356241) [11:43:45] (03CR) 10Stevemunene: [C:03+2] superset: add availability monitor (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1005540 (https://phabricator.wikimedia.org/T356484) (owner: 10Stevemunene) [11:45:14] (03PS1) 10Urbanecm: CommunityConfiguration: Log info and higher [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1048419 [11:54:39] 06SRE, 06DBA, 10Dumps-Generation: db1206 - replica lag - page - 20240620 - https://phabricator.wikimedia.org/T368098#9913146 (10Ladsgroup) This is clearly a bug in dumps. 
[11:54:53] (03PS1) 10Btullis: Enable monitoring for an-redacteddb1001 [puppet] - 10https://gerrit.wikimedia.org/r/1048422 (https://phabricator.wikimedia.org/T365453) [11:56:22] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3017/console" [puppet] - 10https://gerrit.wikimedia.org/r/1048422 (https://phabricator.wikimedia.org/T365453) (owner: 10Btullis) [11:59:25] FIRING: SystemdUnitFailed: wmf_auto_restart_apache2.service on lists1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:00:39] 06SRE, 10LDAP-Access-Requests: Grant Access to wmf for daphnesmit - https://phabricator.wikimedia.org/T368140#9913158 (10Peachey88) [12:09:26] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [12:09:32] (03PS1) 10Jcrespo: dbbackups: Set sql_mode as loose so that invalid enum values are ok [puppet] - 10https://gerrit.wikimedia.org/r/1048424 (https://phabricator.wikimedia.org/T367162) [12:14:35] (03CR) 10Brouberol: [C:03+1] Enable monitoring for an-redacteddb1001 [puppet] - 10https://gerrit.wikimedia.org/r/1048422 (https://phabricator.wikimedia.org/T365453) (owner: 10Btullis) [12:15:09] (03CR) 10Btullis: [V:03+1 C:03+2] Enable monitoring for an-redacteddb1001 [puppet] - 10https://gerrit.wikimedia.org/r/1048422 (https://phabricator.wikimedia.org/T365453) (owner: 10Btullis) [12:16:29] 06SRE, 06DBA, 10Dumps-Generation: db1206 - replica lag - page - 20240620 - https://phabricator.wikimedia.org/T368098#9913180 (10ABran-WMF) 05Open→03In progress p:05Triage→03Medium [12:16:31] (03CR) 10Jcrespo: "Example:" [puppet] - 
10https://gerrit.wikimedia.org/r/1048424 (https://phabricator.wikimedia.org/T367162) (owner: 10Jcrespo) [12:25:57] (03PS22) 10DCausse: [WIP] wdqs: allow to configure internal federated endpoints [puppet] - 10https://gerrit.wikimedia.org/r/1048038 (https://phabricator.wikimedia.org/T361950) [12:26:32] (03CR) 10CI reject: [V:04-1] [WIP] wdqs: allow to configure internal federated endpoints [puppet] - 10https://gerrit.wikimedia.org/r/1048038 (https://phabricator.wikimedia.org/T361950) (owner: 10DCausse) [12:29:03] (03PS23) 10DCausse: [WIP] wdqs: allow to configure internal federated endpoints [puppet] - 10https://gerrit.wikimedia.org/r/1048038 (https://phabricator.wikimedia.org/T361950) [12:31:53] (03CR) 10CI reject: [V:04-1] [WIP] wdqs: allow to configure internal federated endpoints [puppet] - 10https://gerrit.wikimedia.org/r/1048038 (https://phabricator.wikimedia.org/T361950) (owner: 10DCausse) [12:33:03] 06SRE, 10SRE-Access-Requests: Request to add mnz to analytics-research-admins - https://phabricator.wikimedia.org/T367757#9913238 (10kamila) [12:39:55] (03PS24) 10DCausse: [WIP] wdqs: allow to configure internal federated endpoints [puppet] - 10https://gerrit.wikimedia.org/r/1048038 (https://phabricator.wikimedia.org/T361950) [12:44:12] (03CR) 10DCausse: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1048038 (https://phabricator.wikimedia.org/T361950) (owner: 10DCausse) [12:47:49] (03CR) 10Ottomata: "@akosiaris@wikimedia.org" [puppet] - 10https://gerrit.wikimedia.org/r/1048385 (https://phabricator.wikimedia.org/T258511) (owner: 10Btullis) [12:52:56] (03CR) 10Bking: "I discussed this yesterday in #wikimedia-observability . 
Based on this discussion, it's my understanding that you have to create unique al" [alerts] - 10https://gerrit.wikimedia.org/r/1048074 (https://phabricator.wikimedia.org/T368107) (owner: 10Bking) [12:52:59] 06SRE, 10SRE-Access-Requests: Request to add mnz to analytics-research-admins - https://phabricator.wikimedia.org/T367757#9913302 (10kamila) [12:53:38] (03PS4) 10Btullis: Enable the MariaDB binlog on the analytics mariadb replicas [puppet] - 10https://gerrit.wikimedia.org/r/1048385 (https://phabricator.wikimedia.org/T258511) [12:54:05] 10ops-codfw, 06SRE-OnFire, 06DC-Ops, 06serviceops, 10Sustainability (Incident Followup): codfw:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366205#9913305 (10Papaul) @kamila anytime [12:55:01] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3018/co" [puppet] - 10https://gerrit.wikimedia.org/r/1048385 (https://phabricator.wikimedia.org/T258511) (owner: 10Btullis) [12:55:16] 06SRE, 10SRE-Access-Requests: Request to add mnz to analytics-research-admins - https://phabricator.wikimedia.org/T367757#9913307 (10kamila) @KFrancis can you please make sure @MunizaA's NDA is signed? Thank you! [13:00:59] (03CR) 10Klausman: [C:03+1] kubernetes: split unavailable-replicas alert per team [alerts] - 10https://gerrit.wikimedia.org/r/1046781 (https://phabricator.wikimedia.org/T366932) (owner: 10Scott French) [13:02:24] (03PS1) 10DDesouza: miscweb(wikiworkshop): bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1048436 (https://phabricator.wikimedia.org/T349774) [13:03:54] 06SRE, 10LDAP-Access-Requests: Grant Access to wmf for daphnesmit - https://phabricator.wikimedia.org/T368140#9913339 (10kamila) @DSmit-WMF Are you requesting SSH access too, or just tools? 
[13:04:12] (03CR) 10DDesouza: [C:03+2] miscweb(wikiworkshop): bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1048436 (https://phabricator.wikimedia.org/T349774) (owner: 10DDesouza) [13:05:17] (03Merged) 10jenkins-bot: miscweb(wikiworkshop): bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1048436 (https://phabricator.wikimedia.org/T349774) (owner: 10DDesouza) [13:06:43] !log dani@deploy1002 helmfile [staging] START helmfile.d/services/miscweb: apply [13:07:06] !log dani@deploy1002 helmfile [staging] DONE helmfile.d/services/miscweb: apply [13:07:07] !log dani@deploy1002 helmfile [eqiad] START helmfile.d/services/miscweb: apply [13:07:49] !log dani@deploy1002 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply [13:07:50] !log dani@deploy1002 helmfile [codfw] START helmfile.d/services/miscweb: apply [13:08:22] !log dani@deploy1002 helmfile [codfw] DONE helmfile.d/services/miscweb: apply [13:08:37] PROBLEM - SSH on an-presto1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [13:08:50] 06SRE, 10LDAP-Access-Requests: Grant Access to wmf for daphnesmit - https://phabricator.wikimedia.org/T368140#9913344 (10kamila) [13:08:55] (03PS2) 10Jon Harald Søby: Add new protection level (edituserprotected) for nowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1047441 (https://phabricator.wikimedia.org/T367943) [13:09:24] (03PS1) 10Giuseppe Lavagetto: modules: add new version of mesh.configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1048439 [13:09:25] (03PS1) 10Giuseppe Lavagetto: mesh.configuration: fix compliace with spec [deployment-charts] - 10https://gerrit.wikimedia.org/r/1048440 [13:09:29] RECOVERY - SSH on an-presto1004 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [13:10:48] (03CR) 10Jon Harald Søby: Add new protection level (edituserprotected) for nowiki (032 comments) 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1047441 (https://phabricator.wikimedia.org/T367943) (owner: 10Jon Harald Søby) [13:20:06] (03CR) 10DCausse: [C:03+1] team-search-platform: Add kafka topic alerts for new search pipeline [alerts] - 10https://gerrit.wikimedia.org/r/1043198 (https://phabricator.wikimedia.org/T349772) (owner: 10Bking) [13:20:28] (03PS1) 10KartikMistry: AX Language selector entrypoint: Fix AX URL [extensions/ContentTranslation] (wmf/1.43.0-wmf.10) - 10https://gerrit.wikimedia.org/r/1048443 (https://phabricator.wikimedia.org/T363183) [13:21:02] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, June 24 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [extensions/ContentTranslation] (wmf/1.43.0-wmf.10) - 10https://gerrit.wikimedia.org/r/1048443 (https://phabricator.wikimedia.org/T363183) (owner: 10KartikMistry) [13:21:15] !log btullis@deploy1002 Started deploy [performance/asoranking@febfb9f]: (no justification provided) [13:21:19] !log btullis@deploy1002 Finished deploy [performance/asoranking@febfb9f]: (no justification provided) (duration: 00m 04s) [13:21:30] FIRING: [2x] ProbeDown: Service wdqs2020:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2020:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:23:28] (03PS2) 10KCVelaga: Add Metrics Platform stream configuration and registration for MinT for Wikipedia Readers feature by Language and Product Localization team. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1048393 (https://phabricator.wikimedia.org/T368028) [13:25:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2214 (T367856)', diff saved to https://phabricator.wikimedia.org/P65306 and previous config saved to /var/cache/conftool/dbconfig/20240621-132506-marostegui.json [13:25:12] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [13:25:21] (03PS1) 10Slyngshede: C:apereo_cas Add CAS 7 properties [puppet] - 10https://gerrit.wikimedia.org/r/1048445 (https://phabricator.wikimedia.org/T367487) [13:25:25] (03PS1) 10Jgiannelos: push-notifications: Bump to latest image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1048446 [13:26:30] RESOLVED: [2x] ProbeDown: Service wdqs2020:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2020:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:27:00] (03PS2) 10Jgiannelos: push-notifications: Bump to latest image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1048446 [13:28:11] (03CR) 10Bking: [C:03+2] team-search-platform: Add kafka topic alerts for new search pipeline [alerts] - 10https://gerrit.wikimedia.org/r/1043198 (https://phabricator.wikimedia.org/T349772) (owner: 10Bking) [13:29:51] (03Merged) 10jenkins-bot: team-search-platform: Add kafka topic alerts for new search pipeline [alerts] - 10https://gerrit.wikimedia.org/r/1043198 (https://phabricator.wikimedia.org/T349772) (owner: 10Bking) [13:30:24] 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: No unicast IP ranges announced to peers from eqdfw - https://phabricator.wikimedia.org/T367439#9913434 (10cmooney) Gonna copy some of the discussion from the patch here as I think it's easier for discussion and a record of what we decide:... 
[13:30:47] (03CR) 10Cathal Mooney: Set eqdfw to use default aggregate policy, and modify eqord policy (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/1043229 (https://phabricator.wikimedia.org/T367439) (owner: 10Cathal Mooney) [13:33:23] (03PS1) 10Ayounsi: Netbox 4: replace `device_role` with `role` [software/netbox-extras] (dev) - 10https://gerrit.wikimedia.org/r/1048449 (https://phabricator.wikimedia.org/T336275) [13:35:42] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure: Move the private Puppet repository to puppetserver1001 - https://phabricator.wikimedia.org/T368023#9913462 (10elukey) [13:38:22] !incidents [13:38:22] 4768 (RESOLVED) HaproxyUnavailable cache_upload global sre (thanos-rule) [13:38:22] 4767 (RESOLVED) VarnishUnavailable global sre (varnish-upload thanos-rule) [13:38:22] 4766 (RESOLVED) ProbeDown sre (10.2.2.24 ip4 thumbor:8800 probes/service http_thumbor_ip4 eqiad) [13:38:23] 4765 (RESOLVED) HaproxyUnavailable cache_text global sre (thanos-rule) [13:38:23] 4764 (RESOLVED) db1206 (paged)/MariaDB Replica Lag: s1 (paged) [13:38:26] (03PS1) 10JMeybohm: admin_ng: Bind to privileged PSP if restricted PSP is disabled [deployment-charts] - 10https://gerrit.wikimedia.org/r/1048453 (https://phabricator.wikimedia.org/T273507) [13:40:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2214', diff saved to https://phabricator.wikimedia.org/P65309 and previous config saved to /var/cache/conftool/dbconfig/20240621-134013-marostegui.json [13:43:23] (03PS1) 10Ilias Sarantopoulos: ml-services: use force_http in articlequality [deployment-charts] - 10https://gerrit.wikimedia.org/r/1048455 (https://phabricator.wikimedia.org/T360455) [13:55:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2214', diff saved to https://phabricator.wikimedia.org/P65312 and previous config saved to /var/cache/conftool/dbconfig/20240621-135521-marostegui.json [14:00:52] (03CR) 10Dzahn: admin: 
Extend access for AndyRussG (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1047473 (https://phabricator.wikimedia.org/T367681) (owner: 10Kamila Součková) [14:02:39] (03PS3) 10Jon Harald Søby: Add new protection level (edituserprotected) for nowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1047441 (https://phabricator.wikimedia.org/T367943) [14:03:49] (03Abandoned) 10Ayounsi: Netbox 3.5: multiple cable terminations and endpoints [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/918518 (https://phabricator.wikimedia.org/T336275) (owner: 10Ayounsi) [14:10:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2214 (T367856)', diff saved to https://phabricator.wikimedia.org/P65313 and previous config saved to /var/cache/conftool/dbconfig/20240621-141028-marostegui.json [14:10:29] (03Abandoned) 10Hashar: tox: pin style dependencies to avoid CI failures [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/1043748 (owner: 10Hashar) [14:10:30] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2217.codfw.wmnet with reason: Maintenance [14:10:34] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [14:10:43] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2217.codfw.wmnet with reason: Maintenance [14:10:45] (03PS1) 10Bking: search-platform/data-platform: route alerts to data-platform-alerts IRC [puppet] - 10https://gerrit.wikimedia.org/r/1048467 (https://phabricator.wikimedia.org/T368107) [14:10:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2217 (T367856)', diff saved to https://phabricator.wikimedia.org/P65314 and previous config saved to /var/cache/conftool/dbconfig/20240621-141050-marostegui.json [14:11:01] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1048467 (https://phabricator.wikimedia.org/T368107) (owner: 10Bking) 
[14:15:48] FIRING: SystemdUnitFailed: mail-aliases.service on mx-in2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:16:10] (03CR) 10Alexandros Kosiaris: "Looking at the history of the task (and not the task itself), I am gonna suggest that re-opening may not have been the best path forward." [puppet] - 10https://gerrit.wikimedia.org/r/1048385 (https://phabricator.wikimedia.org/T258511) (owner: 10Btullis) [14:19:34] (03CR) 10Ssingh: "Looks good! I see that this patch doesn't have the etcd data for apus; is that some other patch?" [puppet] - 10https://gerrit.wikimedia.org/r/1048005 (https://phabricator.wikimedia.org/T279621) (owner: 10MVernon) [14:22:01] (03CR) 10Ssingh: "I meant in conftool-data/node/eqiad.yaml, codfw.yaml. https://wikitech.wikimedia.org/wiki/LVS#Add_data_in_etcd" [puppet] - 10https://gerrit.wikimedia.org/r/1048005 (https://phabricator.wikimedia.org/T279621) (owner: 10MVernon) [14:22:13] (03CR) 10DCausse: [C:03+1] "lgtm, adding Erik & Peter for visibility and see if they have concerns to stop sending alerts to #wikimedia-operations" [puppet] - 10https://gerrit.wikimedia.org/r/1048467 (https://phabricator.wikimedia.org/T368107) (owner: 10Bking) [14:24:23] (03CR) 10Bking: "I thought about this a bit more, and do think we can do this in Puppet and avoid the duplication. 
I've created https://gerrit.wikimedia.or" [alerts] - 10https://gerrit.wikimedia.org/r/1048074 (https://phabricator.wikimedia.org/T368107) (owner: 10Bking) [14:24:34] (03Abandoned) 10Bking: team-data-platform: Add all team-search-platform alerts to team-data-platform [alerts] - 10https://gerrit.wikimedia.org/r/1048074 (https://phabricator.wikimedia.org/T368107) (owner: 10Bking) [14:30:15] (03CR) 10MVernon: "Hi," [puppet] - 10https://gerrit.wikimedia.org/r/1048005 (https://phabricator.wikimedia.org/T279621) (owner: 10MVernon) [14:34:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2163 (T364069)', diff saved to https://phabricator.wikimedia.org/P65317 and previous config saved to /var/cache/conftool/dbconfig/20240621-143450-marostegui.json [14:34:56] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [14:36:04] (03CR) 10Ssingh: "Ah sorry! I missed that. Looking there." [puppet] - 10https://gerrit.wikimedia.org/r/1048005 (https://phabricator.wikimedia.org/T279621) (owner: 10MVernon) [14:38:47] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:42:31] FIRING: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:47:31] RESOLVED: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:49:58] !log 
marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P65318 and previous config saved to /var/cache/conftool/dbconfig/20240621-144957-marostegui.json [14:53:35] (03PS1) 10DCausse: [DNM] wdqs: enable throttling only for requests coming from varnish [puppet] - 10https://gerrit.wikimedia.org/r/1048485 (https://phabricator.wikimedia.org/T361950) [14:54:00] (03CR) 10CI reject: [V:04-1] [DNM] wdqs: enable throttling only for requests coming from varnish [puppet] - 10https://gerrit.wikimedia.org/r/1048485 (https://phabricator.wikimedia.org/T361950) (owner: 10DCausse) [14:55:48] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:56:51] (03CR) 10Ayounsi: [C:03+2] "Merging in the dev branch to unblock the overall upgrade prep-work, don't hesitate to still provide feedback." 
[software/netbox-extras] (dev) - 10https://gerrit.wikimedia.org/r/1048449 (https://phabricator.wikimedia.org/T336275) (owner: 10Ayounsi) [14:58:01] (03Merged) 10jenkins-bot: Netbox 4: replace `device_role` with `role` [software/netbox-extras] (dev) - 10https://gerrit.wikimedia.org/r/1048449 (https://phabricator.wikimedia.org/T336275) (owner: 10Ayounsi) [14:58:14] (03PS2) 10DCausse: [DNM] wdqs: enable throttling only for requests coming from varnish [puppet] - 10https://gerrit.wikimedia.org/r/1048485 (https://phabricator.wikimedia.org/T361950) [15:00:44] (03CR) 10CI reject: [V:04-1] [DNM] wdqs: enable throttling only for requests coming from varnish [puppet] - 10https://gerrit.wikimedia.org/r/1048485 (https://phabricator.wikimedia.org/T361950) (owner: 10DCausse) [15:01:25] (03CR) 10Ssingh: [C:03+1] Discovery setup for apus [puppet] - 10https://gerrit.wikimedia.org/r/1048005 (https://phabricator.wikimedia.org/T279621) (owner: 10MVernon) [15:01:46] (03PS3) 10DCausse: [DNM] wdqs: enable throttling only for requests coming from varnish [puppet] - 10https://gerrit.wikimedia.org/r/1048485 (https://phabricator.wikimedia.org/T361950) [15:04:21] (03CR) 10Btullis: [C:03+1] "Looks good to me, too. Assuming the Erik and Peter agree, let's deploy on Monday." 
[puppet] - 10https://gerrit.wikimedia.org/r/1048467 (https://phabricator.wikimedia.org/T368107) (owner: 10Bking) [15:04:37] (03CR) 10DCausse: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1048485 (https://phabricator.wikimedia.org/T361950) (owner: 10DCausse) [15:05:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P65319 and previous config saved to /var/cache/conftool/dbconfig/20240621-150504-marostegui.json [15:05:15] (03PS1) 10Elukey: move_server.py: fix function call name [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1048486 (https://phabricator.wikimedia.org/T368148) [15:05:25] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 125, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:05:33] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 559, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:10:02] (03CR) 10Ayounsi: [C:03+1] "nice! 
I'm surprised we didn't caught it sooner" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1048486 (https://phabricator.wikimedia.org/T368148) (owner: 10Elukey) [15:10:47] (03PS4) 10DCausse: [DNM] wdqs: enable throttling only for requests coming from varnish [puppet] - 10https://gerrit.wikimedia.org/r/1048485 (https://phabricator.wikimedia.org/T361950) [15:13:18] (03CR) 10DCausse: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1048485 (https://phabricator.wikimedia.org/T361950) (owner: 10DCausse) [15:16:26] (03CR) 10Ssingh: conftool-data: add apus entries in codfw & eqiad; lvs::realserver to rgws (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1047988 (https://phabricator.wikimedia.org/T279621) (owner: 10MVernon) [15:19:17] (03PS3) 10MVernon: conftool-data: add apus entries in codfw & eqiad; lvs::realserver to rgws [puppet] - 10https://gerrit.wikimedia.org/r/1047988 (https://phabricator.wikimedia.org/T279621) [15:20:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2163 (T364069)', diff saved to https://phabricator.wikimedia.org/P65321 and previous config saved to /var/cache/conftool/dbconfig/20240621-152011-marostegui.json [15:20:15] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2164.codfw.wmnet with reason: Maintenance [15:20:18] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [15:20:28] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2164.codfw.wmnet with reason: Maintenance [15:20:29] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance [15:20:31] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance [15:20:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2164 (T364069)', diff 
saved to https://phabricator.wikimedia.org/P65322 and previous config saved to /var/cache/conftool/dbconfig/20240621-152038-marostegui.json [15:20:52] (03CR) 10MVernon: "Good catch, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/1047988 (https://phabricator.wikimedia.org/T279621) (owner: 10MVernon) [15:34:05] (03CR) 10Elukey: [C:03+2] move_server.py: fix function call name [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1048486 (https://phabricator.wikimedia.org/T368148) (owner: 10Elukey) [15:35:19] (03Merged) 10jenkins-bot: move_server.py: fix function call name [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1048486 (https://phabricator.wikimedia.org/T368148) (owner: 10Elukey) [15:37:40] !log elukey@cumin1002 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox-canary [15:37:57] !log elukey@cumin1002 END (FAIL) - Cookbook sre.netbox.update-extras (exit_code=1) rolling restart_daemons on A:netbox-canary [15:39:23] !log elukey@cumin1002 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox-canary [15:40:14] (03CR) 10Krinkle: [POC] Handle sso.wikimedia.org domain (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1036245 (https://phabricator.wikimedia.org/T365162) (owner: 10Gergő Tisza) [15:40:51] !log elukey@cumin1002 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox-canary [15:41:01] !log elukey@cumin1002 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox [15:41:07] !log elukey@cumin1002 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox [15:46:32] (03CR) 10Krinkle: [C:04-1] [POC] Handle sso.wikimedia.org domain (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1036245 (https://phabricator.wikimedia.org/T365162) (owner: 10Gergő Tisza) [15:52:25] (03PS1) 10MVernon: conftool-data: updates for apus [puppet] - 
10https://gerrit.wikimedia.org/r/1048493 (https://phabricator.wikimedia.org/T279621) [15:52:29] (03PS1) 10MVernon: conftool-data: services entry for apus [puppet] - 10https://gerrit.wikimedia.org/r/1048494 (https://phabricator.wikimedia.org/T279621) [15:52:34] (03PS1) 10MVernon: apus: service catalogue entry and lvs::realserver setup [puppet] - 10https://gerrit.wikimedia.org/r/1048495 (https://phabricator.wikimedia.org/T279621) [15:53:38] (03CR) 10Ssingh: [C:03+1] conftool-data: services entry for apus [puppet] - 10https://gerrit.wikimedia.org/r/1048494 (https://phabricator.wikimedia.org/T279621) (owner: 10MVernon) [15:55:20] (03CR) 10Pmiazga: [C:03+1] [POC][beta] Add rewrite rule for sso.wikimedia.beta.wmflabs.org (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1036230 (https://phabricator.wikimedia.org/T365162) (owner: 10Gergő Tisza) [15:56:16] (03CR) 10Ssingh: apus: service catalogue entry and lvs::realserver setup (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1048495 (https://phabricator.wikimedia.org/T279621) (owner: 10MVernon) [15:57:30] (03PS1) 10Kamila Součková: opentelemetry: update k8s API IP addresses [deployment-charts] - 10https://gerrit.wikimedia.org/r/1048498 [15:59:17] !log cmooney@cumin1002 START - Cookbook sre.dns.netbox [15:59:40] FIRING: SystemdUnitFailed: wmf_auto_restart_apache2.service on lists1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:00:42] (03PS2) 10MVernon: conftool-data: updates for apus [puppet] - 10https://gerrit.wikimedia.org/r/1048493 (https://phabricator.wikimedia.org/T279621) [16:00:42] (03PS2) 10MVernon: conftool-data: services entry for apus [puppet] - 10https://gerrit.wikimedia.org/r/1048494 (https://phabricator.wikimedia.org/T279621) [16:00:43] (03PS2) 10MVernon: apus: service catalogue entry and lvs::realserver setup [puppet] - 
10https://gerrit.wikimedia.org/r/1048495 (https://phabricator.wikimedia.org/T279621) [16:00:54] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:01:10] (03CR) 10Ssingh: [C:03+1] conftool-data: updates for apus [puppet] - 10https://gerrit.wikimedia.org/r/1048493 (https://phabricator.wikimedia.org/T279621) (owner: 10MVernon) [16:02:33] (03PS3) 10MVernon: apus: service catalogue entry and lvs::realserver setup [puppet] - 10https://gerrit.wikimedia.org/r/1048495 (https://phabricator.wikimedia.org/T279621) [16:02:41] PROBLEM - Host mr1-eqsin.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [16:02:44] (03CR) 10MVernon: apus: service catalogue entry and lvs::realserver setup (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1048495 (https://phabricator.wikimedia.org/T279621) (owner: 10MVernon) [16:04:23] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 521 probes of 728 (alerts on 90) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [16:04:24] (03Abandoned) 10MVernon: conftool-data: add apus entries in codfw & eqiad; lvs::realserver to rgws [puppet] - 10https://gerrit.wikimedia.org/r/1047988 (https://phabricator.wikimedia.org/T279621) (owner: 10MVernon) [16:05:06] (03Abandoned) 10MVernon: Discovery setup for apus [puppet] - 10https://gerrit.wikimedia.org/r/1048005 (https://phabricator.wikimedia.org/T279621) (owner: 10MVernon) [16:06:57] (03CR) 10C. 
Scott Ananian: [C:03+1] Follow the defaults for Parsoid on MFE on officewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1048144 (https://phabricator.wikimedia.org/T363720) (owner: 10Arlolra) [16:06:57] (03CR) 10Ssingh: [C:03+1] apus: service catalogue entry and lvs::realserver setup [puppet] - 10https://gerrit.wikimedia.org/r/1048495 (https://phabricator.wikimedia.org/T279621) (owner: 10MVernon) [16:07:37] (03CR) 10C. Scott Ananian: [C:03+1] Remove unused Linter configs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1048138 (https://phabricator.wikimedia.org/T343292) (owner: 10Arlolra) [16:07:43] RECOVERY - Host mr1-eqsin.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 221.80 ms [16:09:41] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [16:14:21] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 72 probes of 728 (alerts on 90) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [16:27:34] 10ops-eqsin, 06SRE, 06DC-Ops, 06Traffic: Q4: install PCIe NVMe SSDs into eqsin text cp50(1[789]|2[01234] - https://phabricator.wikimedia.org/T365763#9913903 (10BCornwall) [16:41:23] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 102 probes of 729 (alerts on 90) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [16:46:31] FIRING: ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:51:21] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 57 probes of 729 (alerts on 90) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [16:51:31] RESOLVED: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:52:37] 10ops-codfw, 06SRE, 06DC-Ops: ManagementSSHDown - https://phabricator.wikimedia.org/T367648#9913972 (10Papaul) 05Open→03Resolved a:03Papaul Removed both power supplies for 3 minutes and put them back. IDRAC is back up. [17:06:59] 10SRE-swift-storage, 06Privacy Engineering, 06Security-Team, 13Patch-For-Review, and 3 others: Images of private wikis are publicly accessible if attacker knows the URL or the filename - https://phabricator.wikimedia.org/T340189#9914026 (10sbassett) [17:21:03] (03CR) 10Ottomata: "FWIW, that task I think has no direct connection to this." 
[puppet] - 10https://gerrit.wikimedia.org/r/1048385 (https://phabricator.wikimedia.org/T258511) (owner: 10Btullis) [17:23:01] 10ops-codfw, 06DC-Ops, 06serviceops, 06Traffic: lvs2011 Memory failure on slot B1 - https://phabricator.wikimedia.org/T368165 (10BCornwall) 03NEW [17:23:31] 06SRE, 10LDAP-Access-Requests: Offboard Lea WMDE (Lea Voget) from the WMF systems - https://phabricator.wikimedia.org/T368139#9914059 (10Dzahn) [17:23:39] 10ops-codfw, 06DC-Ops, 06serviceops, 06Traffic: lvs2011 Memory failure on slot B1 - https://phabricator.wikimedia.org/T368165#9914057 (10BCornwall) p:05Triage→03High [17:24:44] 10ops-codfw, 06DC-Ops, 06serviceops, 06Traffic: lvs2011 Memory failure on slot B1 - https://phabricator.wikimedia.org/T368165#9914063 (10BCornwall) [17:24:48] 06SRE, 10LDAP-Access-Requests: Offboard Lea WMDE (Lea Voget) from the WMF systems - https://phabricator.wikimedia.org/T368139#9914064 (10Dzahn) [17:25:42] 06SRE, 10LDAP-Access-Requests: Offboard Lea WMDE (Lea Voget) from the WMF systems - https://phabricator.wikimedia.org/T368139#9914066 (10Dzahn) - removed from the 2 Phabricator groups just now [17:26:35] 06SRE, 06collaboration-services, 10LDAP-Access-Requests, 10Phabricator: Offboard Lea WMDE (Lea Voget) from the WMF systems - https://phabricator.wikimedia.org/T368139#9914070 (10Dzahn) [17:27:30] 06SRE, 06collaboration-services, 10LDAP-Access-Requests, 10Phabricator: Offboard Lea WMDE (Lea Voget) from the WMF systems - https://phabricator.wikimedia.org/T368139#9914074 (10Dzahn) 05Open→03In progress p:05Triage→03High [17:28:51] (03PS5) 10Btullis: Enable the MariaDB binlog on the analytics mariadb replicas [puppet] - 10https://gerrit.wikimedia.org/r/1048385 (https://phabricator.wikimedia.org/T358373) [17:32:08] 06SRE, 10LDAP-Access-Requests: Grant Access to wmf for daphnesmit - https://phabricator.wikimedia.org/T368140#9914085 (10Dzahn) Welcome to WMF, Daphne! 
The things listed (web login to observability tools, gerrit +2 (though it's more complex than 'generally all repos')) sound like just the "wmf" group is enoug... [17:32:34] (03CR) 10Btullis: "Thanks both. I have now removed that first link. I had only intended it to be a useful cross-reference of context, but I appreciate that i" [puppet] - 10https://gerrit.wikimedia.org/r/1048385 (https://phabricator.wikimedia.org/T358373) (owner: 10Btullis) [17:44:39] 06SRE, 10LDAP-Access-Requests: Grant Access to wmf for daphnesmit - https://phabricator.wikimedia.org/T368140#9914100 (10DSmit-WMF) Hi! Thank you. I think your right. Yeah i closed it because i was managing ssh with 1password and that doesnt quite seem to work with all the ssh settings for hosts you might nee... [17:47:38] 06SRE, 06collaboration-services, 10LDAP-Access-Requests, 10Phabricator: Offboard Lea WMDE (Lea Voget) from the WMF systems - https://phabricator.wikimedia.org/T368139#9914104 (10Dzahn) I ran the offboard-user script on mwmaint which told me how what ldapmodify command to run and then I just did that. remo... [17:47:44] 06SRE, 06collaboration-services, 10LDAP-Access-Requests, 10Phabricator: Offboard Lea WMDE (Lea Voget) from the WMF systems - https://phabricator.wikimedia.org/T368139#9914105 (10Dzahn) [17:49:02] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06Data-Platform-SRE, and 3 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e5-eqiad - https://phabricator.wikimedia.org/T365986#9914102 (10Ottomata) [17:52:14] 06SRE, 06collaboration-services, 10LDAP-Access-Requests, 10Phabricator: Offboard Lea WMDE (Lea Voget) from the WMF systems - https://phabricator.wikimedia.org/T368139#9914111 (10Dzahn) Ran offboard-user with the -p flag for Phabricator, it checked all the remaining Phabricator groups but none of them are p... 
[17:52:21] (03PS3) 10Pppery: Add fallback languages for Phabricator [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1047593 [17:53:44] 06SRE, 06collaboration-services, 10LDAP-Access-Requests, 10Phabricator: Offboard Lea WMDE (Lea Voget) from the WMF systems - https://phabricator.wikimedia.org/T368139#9914113 (10Dzahn) ` Is not member of any LDAP group Is not a member in any privileged group lea-wmde does not exist in modules/admin/data/da... [17:53:59] (03PS4) 10Pppery: Add fallback languages for Phabricator [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1047593 [17:56:48] 06SRE, 10LDAP-Access-Requests: Grant Access to wmf for daphnesmit - https://phabricator.wikimedia.org/T368140#9914129 (10Dzahn) Thanks! sounds good! Keeping the 2 things separate requests and tickets is actually ideal, because one is an [[ https://phabricator.wikimedia.org/tag/ldap-access-requests/ | LDAP ac... [18:09:07] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:09:13] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:10:01] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 52198 bytes in 4.720 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:10:05] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8923 bytes in 1.368 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:15:48] FIRING: SystemdUnitFailed: mail-aliases.service on mx-in2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:17:22] (03PS1) 10Cwhite: opensearch: Enable configuration of watermark parameters 
[puppet] - 10https://gerrit.wikimedia.org/r/1048538 (https://phabricator.wikimedia.org/T368168) [18:24:49] (03PS75) 10AOkoth: prometheus: puppetise sql_exporter [puppet] - 10https://gerrit.wikimedia.org/r/945872 (https://phabricator.wikimedia.org/T310822) [18:40:28] 10ops-eqiad, 06SRE, 10Cloud-VPS, 06DC-Ops, 10cloud-services-team (FY2023/2024-Q3-Q4): cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643#9914204 (10CDanis) Apologies @dcaro but I had less time for this than I expected this week, was only able to do some... [18:41:38] (03PS1) 10Brennen Bearnes: WIP: gitlab: remove unused ldap_group_sync_user [puppet] - 10https://gerrit.wikimedia.org/r/1048544 (https://phabricator.wikimedia.org/T355097) [18:42:02] (03CR) 10CI reject: [V:04-1] WIP: gitlab: remove unused ldap_group_sync_user [puppet] - 10https://gerrit.wikimedia.org/r/1048544 (https://phabricator.wikimedia.org/T355097) (owner: 10Brennen Bearnes) [18:45:46] (03PS76) 10AOkoth: prometheus: puppetise sql_exporter [puppet] - 10https://gerrit.wikimedia.org/r/945872 (https://phabricator.wikimedia.org/T310822) [18:47:39] (03PS77) 10AOkoth: prometheus: puppetise sql_exporter [puppet] - 10https://gerrit.wikimedia.org/r/945872 (https://phabricator.wikimedia.org/T310822) [19:15:25] (03CR) 10AikoChou: [C:03+2] ml-services: use force_http in articlequality [deployment-charts] - 10https://gerrit.wikimedia.org/r/1048455 (https://phabricator.wikimedia.org/T360455) (owner: 10Ilias Sarantopoulos) [19:16:21] (03Merged) 10jenkins-bot: ml-services: use force_http in articlequality [deployment-charts] - 10https://gerrit.wikimedia.org/r/1048455 (https://phabricator.wikimedia.org/T360455) (owner: 10Ilias Sarantopoulos) [19:36:55] (03PS5) 10Scott French: mediawiki: add securityContext to all containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1042440 (https://phabricator.wikimedia.org/T362978) [19:36:55] (03PS3) 10Scott French: mediawiki: enable securityContext in all 
canaries [deployment-charts] - 10https://gerrit.wikimedia.org/r/1046692 (https://phabricator.wikimedia.org/T362978) [19:36:56] (03PS3) 10Scott French: mediawiki: enable securityContext everywhere [deployment-charts] - 10https://gerrit.wikimedia.org/r/1046693 (https://phabricator.wikimedia.org/T362978) [19:43:15] !log aikochou@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . [19:50:12] 06SRE, 10LDAP-Access-Requests: Grant Access to wmf for daphnesmit - https://phabricator.wikimedia.org/T368140#9914298 (10Dzahn) - I followed [[ https://wikitech.wikimedia.org/wiki/SRE/Clinic_Duty/Access_requests#WMF_Group | Access_requests#WMF_Group ]] and ran the [[ https://wikitech.wikimedia.org/wiki/SRE/Cli... [19:51:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2164 (T364069)', diff saved to https://phabricator.wikimedia.org/P65325 and previous config saved to /var/cache/conftool/dbconfig/20240621-195115-marostegui.json [19:51:21] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [19:52:46] (03PS1) 10Dwisehaupt: Remove old ingenico/globalcollect job checks [puppet] - 10https://gerrit.wikimedia.org/r/1048551 (https://phabricator.wikimedia.org/T368114) [19:54:45] (03PS1) 10Dzahn: admin: add Daphne Smit to ldap_only users (wmf) [puppet] - 10https://gerrit.wikimedia.org/r/1048552 (https://phabricator.wikimedia.org/T368140) [19:57:28] 06SRE, 10LDAP-Access-Requests, 13Patch-For-Review: Grant Access to wmf for daphnesmit - https://phabricator.wikimedia.org/T368140#9914312 (10Dzahn) 05Open→03In progress p:05Triage→03High [19:59:40] FIRING: SystemdUnitFailed: wmf_auto_restart_apache2.service on lists1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:06:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling 
after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P65326 and previous config saved to /var/cache/conftool/dbconfig/20240621-200622-marostegui.json [20:09:41] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [20:10:15] (03PS1) 10AikoChou: ml-services: update articlequality image and storage URI [deployment-charts] - 10https://gerrit.wikimedia.org/r/1048559 (https://phabricator.wikimedia.org/T360455) [20:10:35] (03CR) 10AOkoth: [V:03+2 C:03+2] ugprade: bump to version 6.5.x [software/otrs] - 10https://gerrit.wikimedia.org/r/1047997 (https://phabricator.wikimedia.org/T364958) (owner: 10AOkoth) [20:21:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P65327 and previous config saved to /var/cache/conftool/dbconfig/20240621-202129-marostegui.json [20:26:24] (03CR) 10AikoChou: "It should work. 
I tested it from a pod in experimental ns :p" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1048559 (https://phabricator.wikimedia.org/T360455) (owner: 10AikoChou) [20:36:01] (03PS2) 10Brennen Bearnes: WIP: gitlab: remove unused ldap_group_sync_user [puppet] - 10https://gerrit.wikimedia.org/r/1048544 (https://phabricator.wikimedia.org/T355097) [20:36:37] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2164 (T364069)', diff saved to https://phabricator.wikimedia.org/P65328 and previous config saved to /var/cache/conftool/dbconfig/20240621-203636-marostegui.json [20:36:39] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2166.codfw.wmnet with reason: Maintenance [20:36:42] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [20:36:52] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2166.codfw.wmnet with reason: Maintenance [20:36:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2166 (T364069)', diff saved to https://phabricator.wikimedia.org/P65329 and previous config saved to /var/cache/conftool/dbconfig/20240621-203659-marostegui.json [20:46:44] 06SRE, 06collaboration-services, 10LDAP-Access-Requests, 10Phabricator: Offboard Lea WMDE (Lea Voget) from the WMF systems - https://phabricator.wikimedia.org/T368139#9914425 (10Dzahn) @SLyngshede-WMF Wanna take a look if I forgot anything? 
[20:46:47] (03PS3) 10Brennen Bearnes: gitlab: remove unused ldap_group_sync_user [puppet] - 10https://gerrit.wikimedia.org/r/1048544 (https://phabricator.wikimedia.org/T355097) [20:49:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2217 (T367856)', diff saved to https://phabricator.wikimedia.org/P65330 and previous config saved to /var/cache/conftool/dbconfig/20240621-204941-marostegui.json [20:49:47] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [20:51:57] (03PS2) 10Dwisehaupt: frack: Remove old ingenico/globalcollect job checks [puppet] - 10https://gerrit.wikimedia.org/r/1048551 (https://phabricator.wikimedia.org/T368114) [20:54:26] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [21:04:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2217', diff saved to https://phabricator.wikimedia.org/P65331 and previous config saved to /var/cache/conftool/dbconfig/20240621-210448-marostegui.json [21:12:29] RECOVERY - MariaDB Replica Lag: s4 on clouddb1019 is OK: OK slave_sql_lag Replication lag: 0.04 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [21:19:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2217', diff saved to https://phabricator.wikimedia.org/P65332 and previous config saved to /var/cache/conftool/dbconfig/20240621-211956-marostegui.json [21:28:28] Gerrit is super slow, anyone around? 
[21:29:24] brett: ^ [21:29:31] FIRING: ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:31:05] PROBLEM - HTTPS on gerrit1003 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://phabricator.wikimedia.org/project/view/330/ [21:31:39] !log restart apache2 on gerrit1003 [21:31:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:32:05] RECOVERY - HTTPS on gerrit1003 is OK: SSL OK - Certificate gerrit.wikimedia.org valid until 2024-08-03 19:50:23 +0000 (expires in 42 days) https://phabricator.wikimedia.org/project/view/330/ [21:34:31] RESOLVED: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:34:58] cwhite: thank you :) [21:35:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2217 (T367856)', diff saved to https://phabricator.wikimedia.org/P65333 and previous config saved to /var/cache/conftool/dbconfig/20240621-213503-marostegui.json [21:35:09] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [21:35:10] :) [21:35:36] cwhite: thanks <3 [21:37:02] cwhite: it's going slow again [21:37:11] Cc brennen thcipriani [21:38:31] FIRING: ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:43:31] FIRING: 
[2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:43:48] I'm going to make a task [21:45:08] RhinosF1: I think we've got the root cause, should be back shortly. I think there's an open task about this...digging around now. [21:45:48] FIRING: [4x] JobUnavailable: Reduced availability for job gerrit in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [21:46:17] * denisse Walking the dogs, BRB. [21:46:34] thcipriani: if not I'll press create on mine [21:46:59] gerrit should be back [21:47:08] I did eh.. something [21:48:31] RESOLVED: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:48:47] RESOLVED: [4x] JobUnavailable: Reduced availability for job gerrit in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [21:49:46] (03PS1) 10Dzahn: gerrit: block an IP with bad_browser [puppet] - 10https://gerrit.wikimedia.org/r/1048574 [21:52:25] (03CR) 10Brennen Bearnes: [C:03+1] gerrit: block an IP with bad_browser [puppet] - 10https://gerrit.wikimedia.org/r/1048574 (owner: 10Dzahn) [21:53:10] (03CR) 10Dzahn: [C:03+2] gerrit: block an IP with bad_browser [puppet] - 10https://gerrit.wikimedia.org/r/1048574 (owner: 10Dzahn) [22:00:27] 06SRE, 06collaboration-services, 
10Wikimedia-Mailing-lists: Mailing list Delivery Mode set to None - https://phabricator.wikimedia.org/T368134#9914549 (10Peachey88) [22:15:48] FIRING: SystemdUnitFailed: mail-aliases.service on mx-in2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:16:31] FIRING: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:21:31] RESOLVED: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:38:36] PROBLEM - MariaDB Replica Lag: s1 #page on db1206 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 346.30 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [22:41:47] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e6-eqiad - https://phabricator.wikimedia.org/T365987#9914610 (10cmooney) 05Open→03Resolved [22:42:31] !incidents [22:42:31] 4769 (ACKED) db1206 (paged)/MariaDB Replica Lag: s1 (paged) [22:42:31] 4768 (RESOLVED) HaproxyUnavailable cache_upload global sre (thanos-rule) [22:42:32] 4767 (RESOLVED) VarnishUnavailable global sre (varnish-upload thanos-rule) [22:42:32] 4766 (RESOLVED) ProbeDown sre (10.2.2.24 ip4 thumbor:8800 probes/service http_thumbor_ip4 eqiad) [22:42:32] 4765 (RESOLVED) HaproxyUnavailable cache_text global sre (thanos-rule) [22:42:37] !ack 4769 [22:42:37] 4769 (ACKED) db1206 
(paged)/MariaDB Replica Lag: s1 (paged) [22:43:22] * denisse back [22:43:57] 06SRE, 06DBA, 10Dumps-Generation: db1206 - replica lag - page - 20240620 - https://phabricator.wikimedia.org/T368098#9914615 (10Dzahn) SRE got paged again. ` 22:38 <+icinga-wm> PROBLEM - MariaDB Replica Lag: s1 # page on db1206 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 346.30 seconds... [22:44:18] I'm taking a look at 4769. Cc brett [22:44:44] denisse: See https://phabricator.wikimedia.org/T368098#9914615 [22:45:10] brett: thanks, taking a look. [22:51:31] FIRING: ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:56:31] RESOLVED: ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:56:36] PROBLEM - MariaDB Replica Lag: s1 #page on db1206 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 326.35 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [23:01:53] (03CR) 10Dzahn: "ah, maybe I misunderstood.. my answer was for "change the UID of an existing user in LDAP". If you meant "change the UID in admin.yaml and" [puppet] - 10https://gerrit.wikimedia.org/r/1047473 (https://phabricator.wikimedia.org/T367681) (owner: 10Kamila Součková) [23:07:18] 06SRE, 06DBA, 10Dumps-Generation: db1206 - replica lag - page - 20240620 - https://phabricator.wikimedia.org/T368098#9914641 (10andrea.denisse) The task is marked as in progress but has no person assigned to it. 
[23:14:26] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [23:14:28] 06SRE, 06Traffic, 13Patch-For-Review: Anycast NTP and update the list of timeservers for P:systemd::timesyncd - https://phabricator.wikimedia.org/T366360#9914657 (10Dwisehaupt) Frack config has been updated to use the new ntp-[abc].anycast.wmnet servers. The previous dnsXXXX and ntp.anycast.wmnet entries hav... [23:17:31] 06SRE, 06DBA, 10Dumps-Generation: db1206 - replica lag - page - 20240620 - https://phabricator.wikimedia.org/T368098#9914664 (10JJMC89) This is impacting the wikis - bots are receiving `maxlag` errors. [23:17:31] FIRING: ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:19:33] 06SRE, 06DBA, 10Dumps-Generation: db1206 - replica lag - page - 20240620 - https://phabricator.wikimedia.org/T368098#9914670 (10Scott_French) If we expect that this particularly expensive dumps run is going to take a while, and as a result will cause db1206 to lag behind significantly, would it be possible /... 
[23:22:31] RESOLVED: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:24:21] 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists, 13Patch-For-Review: Migrate Mailman/lists to Bullseye/Bookworm - https://phabricator.wikimedia.org/T331706#9914684 (10Dzahn) There is an alert in Icinga that says there are too many runners. "PROCS CRITICAL: 15 processes with UID = 38 (list), reg... [23:28:21] !log # dbctl instance db1206 set-weight 10 --section s1 [23:28:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:28:59] 06SRE, 06DBA, 10Dumps-Generation: db1206 - replica lag - page - 20240620 - https://phabricator.wikimedia.org/T368098#9914688 (10Dzahn) 23:28 < brett> !log # dbctl instance db1206 set-weight 10 --section s1 [23:38:41] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1048591 [23:38:41] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1048591 (owner: 10TrainBranchBot) [23:40:35] PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2002 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [23:41:15] brett: that doesn't seem to have done anything as far as I can tell [23:41:32] Yeah, it was already on a really low weight [23:41:58] I had only checked one other db before changing it but they're all much higher weights than 50 [23:42:23] PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1002 is CRITICAL: CRITICAL - Uncommitted dbctl 
configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [23:43:28] !log brett@puppetmaster1001 dbctl commit (dc=all): 'set db1206 s1 weight to 1 - T368098', diff saved to https://phabricator.wikimedia.org/P65334 and previous config saved to /var/cache/conftool/dbconfig/20240621-234328-brett.json [23:43:33] T368098: db1206 - replica lag - page - 20240620 - https://phabricator.wikimedia.org/T368098 [23:43:57] also didn't commit :) [23:45:35] RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2002 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [23:46:42] (03PS1) 10Cwhite: logstash: move thumbor logs to logstash-thumbor partition [puppet] - 10https://gerrit.wikimedia.org/r/1048592 (https://phabricator.wikimedia.org/T368180) [23:47:23] RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1002 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [23:49:24] 06SRE, 06DBA, 10Dumps-Generation: db1206 - replica lag - page - 20240620 - https://phabricator.wikimedia.org/T368098#9914743 (10Dzahn) lag went down from over 8 minutes to 4 min shortly after the commit [23:53:33] even with a weight of 1, I'm still hitting db1206 [23:54:02] !log delete remaining 2024.03 log indexes to make room on logstash eqiad and codfw T368180 [23:54:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:54:07] T368180: Thumbor high log volume and unstructured logging - https://phabricator.wikimedia.org/T368180 [23:54:20] JJMC89: the lag is down to 2 minutes and falling. 
CRIT is only over 3 minutes [23:54:35] RECOVERY - MariaDB Replica Lag: s1 #page on db1206 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [23:54:45] recovery :) [23:54:50] \o/ [23:55:30] yes, just find it odd that I was still hitting the db with that low of a weight [23:55:32] at 2 min it suddenly dropped to 30 sec [23:55:54] JJMC89: there were a few minutes in between where the weight change was made but not committed [23:56:22] yes - it was after the commit [23:57:12] 06SRE, 06DBA, 10Dumps-Generation: db1206 - replica lag - page - 20240620 - https://phabricator.wikimedia.org/T368098#9914757 (10Dzahn) ` 23:54 <+icinga-wm> RECOVERY - MariaDB Replica Lag: s1 # page on db1206 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/tro... [23:58:00] I don't have a good answer. just hoping it's gone now for sure [23:59:40] FIRING: SystemdUnitFailed: wmf_auto_restart_apache2.service on lists1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed