[16:05:37] (CR) Ottomata: [C: +1] mw-page-content-change-enrich: Switch to mw-api-int-async [deployment-charts] - https://gerrit.wikimedia.org/r/1004156 (https://phabricator.wikimedia.org/T357785) (owner: Clément Goubert)
[17:05:19] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2137:3314 (T352010)', diff saved to https://phabricator.wikimedia.org/P56921 and previous config saved to /var/cache/conftool/dbconfig/20240217-170518-ladsgroup.json
[17:05:26] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[17:16:17] SRE, Data-Engineering-Radar, Traffic: Lock-in Varnish and VarnishKafka versions - https://phabricator.wikimedia.org/T304617#9553038 (SDeckelmann-WMF) Given the significant impact of this issue on data collection (which caused us to not be able to use about a year's worth of unique devices data -- and...
[17:20:25] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2137:3314', diff saved to https://phabricator.wikimedia.org/P56922 and previous config saved to /var/cache/conftool/dbconfig/20240217-172024-ladsgroup.json
[17:35:32] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2137:3314', diff saved to https://phabricator.wikimedia.org/P56923 and previous config saved to /var/cache/conftool/dbconfig/20240217-173531-ladsgroup.json
[17:50:38] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2137:3314 (T352010)', diff saved to https://phabricator.wikimedia.org/P56924 and previous config saved to /var/cache/conftool/dbconfig/20240217-175038-ladsgroup.json
[17:50:40] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2138.codfw.wmnet with reason: Maintenance
[17:50:50] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[17:50:54] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2138.codfw.wmnet with reason: Maintenance
[17:51:00] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2138:3314 (T352010)', diff saved to https://phabricator.wikimedia.org/P56925 and previous config saved to /var/cache/conftool/dbconfig/20240217-175100-ladsgroup.json
[17:53:35] (SystemdUnitFailed) firing: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:08:35] (PuppetZeroResources) firing: Puppet has failed generate resources on ncmonitor1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[19:48:35] (SystemdUnitFailed) resolved: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:19:22] PROBLEM - BGP status on cr1-esams is CRITICAL: BGP CRITICAL - AS6939/IPv6: Connect - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[20:26:30] PROBLEM - Host mr1-esams.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[20:28:15] (MediaWikiLatencyExceeded) firing: p75 latency high: eqiad mw-api-int (k8s) 1.387s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[20:30:54] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 187 probes of 730 (alerts on 90) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[20:33:15] (MediaWikiLatencyExceeded) resolved: p75 latency high: eqiad mw-api-int (k8s) 1.387s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[20:45:56] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 81 probes of 730 (alerts on 90) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[20:47:08] RECOVERY - Host mr1-esams.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 86.07 ms
[22:08:35] (PuppetZeroResources) firing: Puppet has failed generate resources on ncmonitor1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[22:56:57] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2138:3314 (T352010)', diff saved to https://phabricator.wikimedia.org/P56926 and previous config saved to /var/cache/conftool/dbconfig/20240217-225656-ladsgroup.json
[22:57:02] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[23:12:03] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2138:3314', diff saved to https://phabricator.wikimedia.org/P56927 and previous config saved to /var/cache/conftool/dbconfig/20240217-231203-ladsgroup.json
[23:27:10] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2138:3314', diff saved to https://phabricator.wikimedia.org/P56928 and previous config saved to /var/cache/conftool/dbconfig/20240217-232709-ladsgroup.json
[23:42:16] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2138:3314 (T352010)', diff saved to https://phabricator.wikimedia.org/P56929 and previous config saved to /var/cache/conftool/dbconfig/20240217-234216-ladsgroup.json
[23:42:18] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2139.codfw.wmnet with reason: Maintenance
[23:42:27] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[23:42:32] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2139.codfw.wmnet with reason: Maintenance