[00:01:18] FIRING: NELNotReported: NEL metrics not reported - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELNotReported [00:01:18] FIRING: NELByCountryNotReported: NEL metrics by country not reported - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELByCountryNotReported [00:03:01] PROBLEM - SSH on puppetserver1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [00:06:18] RESOLVED: NELNotReported: NEL metrics not reported - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELNotReported [00:06:18] RESOLVED: NELByCountryNotReported: NEL metrics by country not reported - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELByCountryNotReported [00:06:26] FIRING: SystemdUnitFailed: rsyslog-imfile-remedy.service on mw1473:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:08:51] RECOVERY - SSH on puppetserver1001 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [00:13:17] PROBLEM - Check unit status of sync-puppet-volatile on puppetserver1002 is CRITICAL: CRITICAL: Status of the systemd unit sync-puppet-volatile https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [00:13:17] PROBLEM - Check unit status of sync-puppet-volatile on puppetserver1003 is 
CRITICAL: CRITICAL: Status of the systemd unit sync-puppet-volatile https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [00:14:16] FIRING: [2x] SystemdUnitFailed: sync-puppet-volatile.service on puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:15:30] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1052191 (owner: TrainBranchBot) [00:23:17] RECOVERY - Check unit status of sync-puppet-volatile on puppetserver1002 is OK: OK: Status of the systemd unit sync-puppet-volatile https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [00:23:17] RECOVERY - Check unit status of sync-puppet-volatile on puppetserver1003 is OK: OK: Status of the systemd unit sync-puppet-volatile https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [00:24:16] FIRING: [2x] SystemdUnitFailed: sync-puppet-volatile.service on puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:25:34] RESOLVED: [2x] SystemdUnitFailed: sync-puppet-volatile.service on puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:36:18] FIRING: NELNotReported: NEL metrics not reported - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELNotReported [00:36:18] FIRING: NELByCountryNotReported: NEL metrics by country not reported - 
https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELByCountryNotReported [00:41:18] RESOLVED: NELNotReported: NEL metrics not reported - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELNotReported [00:41:18] RESOLVED: NELByCountryNotReported: NEL metrics by country not reported - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELByCountryNotReported [00:41:26] FIRING: [2x] SystemdUnitFailed: rsyslog-imfile-remedy.service on mw1473:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:46:33] FIRING: NELNotReported: NEL metrics not reported - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELNotReported [00:46:33] FIRING: NELByCountryNotReported: NEL metrics by country not reported - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELByCountryNotReported [00:47:08] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1234 (T364069)', diff saved to https://phabricator.wikimedia.org/P65841 and previous config saved to /var/cache/conftool/dbconfig/20240705-004707-marostegui.json [00:47:10] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [01:01:33] RESOLVED: NELNotReported: NEL metrics not reported - 
https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELNotReported [01:01:33] RESOLVED: NELByCountryNotReported: NEL metrics by country not reported - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELByCountryNotReported [01:02:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1234', diff saved to https://phabricator.wikimedia.org/P65842 and previous config saved to /var/cache/conftool/dbconfig/20240705-010214-marostegui.json [01:05:18] FIRING: NELNotReported: NEL metrics not reported - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELNotReported [01:10:18] RESOLVED: NELNotReported: NEL metrics not reported - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELNotReported [01:17:22] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1234', diff saved to https://phabricator.wikimedia.org/P65843 and previous config saved to /var/cache/conftool/dbconfig/20240705-011721-marostegui.json [01:20:27] FIRING: HelmReleaseBadStatus: Helm release knative-serving/knative-serving on k8s-mlstaging@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s-mlstaging&var-namespace=knative-serving - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [01:32:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1234 
(T364069)', diff saved to https://phabricator.wikimedia.org/P65844 and previous config saved to /var/cache/conftool/dbconfig/20240705-013229-marostegui.json [01:32:31] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1235.eqiad.wmnet with reason: Maintenance [01:32:32] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [01:32:44] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1235.eqiad.wmnet with reason: Maintenance [01:32:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1235 (T364069)', diff saved to https://phabricator.wikimedia.org/P65845 and previous config saved to /var/cache/conftool/dbconfig/20240705-013250-marostegui.json [01:43:45] PROBLEM - Swift https frontend on ms-fe1014 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 244 bytes in 0.073 second response time https://wikitech.wikimedia.org/wiki/Swift [01:44:45] RECOVERY - Swift https frontend on ms-fe1014 is OK: HTTP OK: HTTP/1.1 200 OK - 295 bytes in 0.065 second response time https://wikitech.wikimedia.org/wiki/Swift [01:53:25] PROBLEM - Swift https backend on ms-fe1014 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 244 bytes in 1.055 second response time https://wikitech.wikimedia.org/wiki/Swift [01:54:18] FIRING: NELNotReported: NEL metrics not reported - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELNotReported [01:54:18] FIRING: NELByCountryNotReported: NEL metrics by country not reported - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELByCountryNotReported [01:56:23] RECOVERY - Swift https backend on ms-fe1014 is OK: HTTP OK: HTTP/1.1 200 OK - 501 bytes in 
0.061 second response time https://wikitech.wikimedia.org/wiki/Swift [01:59:18] RESOLVED: NELNotReported: NEL metrics not reported - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELNotReported [01:59:18] RESOLVED: NELByCountryNotReported: NEL metrics by country not reported - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELByCountryNotReported [02:16:25] PROBLEM - Swift https backend on ms-fe1014 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 244 bytes in 1.054 second response time https://wikitech.wikimedia.org/wiki/Swift [02:17:25] RECOVERY - Swift https backend on ms-fe1014 is OK: HTTP OK: HTTP/1.1 200 OK - 501 bytes in 0.065 second response time https://wikitech.wikimedia.org/wiki/Swift [02:21:18] FIRING: NELByCountryNotReported: NEL metrics by country not reported - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELByCountryNotReported [02:26:18] RESOLVED: NELByCountryNotReported: NEL metrics by country not reported - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELByCountryNotReported [02:39:16] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:53:49] PROBLEM - Swift https frontend on ms-fe1014 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 244 bytes in 1.053 
second response time https://wikitech.wikimedia.org/wiki/Swift [02:54:18] FIRING: NELNotReported: NEL metrics not reported - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELNotReported [02:54:18] FIRING: NELByCountryNotReported: NEL metrics by country not reported - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELByCountryNotReported [02:55:47] RECOVERY - Swift https frontend on ms-fe1014 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.052 second response time https://wikitech.wikimedia.org/wiki/Swift [02:59:16] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:59:18] RESOLVED: NELNotReported: NEL metrics not reported - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELNotReported [02:59:18] RESOLVED: NELByCountryNotReported: NEL metrics by country not reported - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELByCountryNotReported [03:06:26] FIRING: [2x] SystemdUnitFailed: rsyslog-imfile-remedy.service on mw1473:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:37:11] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - 
https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [03:47:18] FIRING: NELNotReported: NEL metrics not reported - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELNotReported [03:47:18] FIRING: NELByCountryNotReported: NEL metrics by country not reported - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELByCountryNotReported [03:52:18] RESOLVED: NELNotReported: NEL metrics not reported - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELNotReported [03:52:18] RESOLVED: NELByCountryNotReported: NEL metrics by country not reported - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELByCountryNotReported [03:54:18] FIRING: NELNotReported: NEL metrics not reported - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELNotReported [03:54:18] FIRING: NELByCountryNotReported: NEL metrics by country not reported - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELByCountryNotReported [03:59:18] RESOLVED: NELNotReported: NEL metrics not reported - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - 
https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELNotReported [03:59:18] RESOLVED: NELByCountryNotReported: NEL metrics by country not reported - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELByCountryNotReported [04:00:33] PROBLEM - Swift https backend on ms-fe1014 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 244 bytes in 1.054 second response time https://wikitech.wikimedia.org/wiki/Swift [04:02:31] RECOVERY - Swift https backend on ms-fe1014 is OK: HTTP OK: HTTP/1.1 200 OK - 501 bytes in 0.102 second response time https://wikitech.wikimedia.org/wiki/Swift [04:07:18] FIRING: NELNotReported: NEL metrics not reported - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELNotReported [04:10:48] FIRING: NELByCountryNotReported: NEL metrics by country not reported - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELByCountryNotReported [04:12:18] RESOLVED: NELNotReported: NEL metrics not reported - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELNotReported [04:15:48] RESOLVED: NELByCountryNotReported: NEL metrics by country not reported - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELByCountryNotReported [04:18:48] FIRING: NELNotReported: NEL metrics not reported - 
https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELNotReported [04:20:33] PROBLEM - Swift https backend on ms-fe1014 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 244 bytes in 1.050 second response time https://wikitech.wikimedia.org/wiki/Swift [04:21:31] RECOVERY - Swift https backend on ms-fe1014 is OK: HTTP OK: HTTP/1.1 200 OK - 501 bytes in 0.092 second response time https://wikitech.wikimedia.org/wiki/Swift [04:21:55] PROBLEM - Swift https frontend on ms-fe1014 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 244 bytes in 1.054 second response time https://wikitech.wikimedia.org/wiki/Swift [04:23:48] RESOLVED: NELNotReported: NEL metrics not reported - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELNotReported [04:26:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1235 (T364069)', diff saved to https://phabricator.wikimedia.org/P65846 and previous config saved to /var/cache/conftool/dbconfig/20240705-042641-marostegui.json [04:26:45] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [04:26:53] RECOVERY - Swift https frontend on ms-fe1014 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.050 second response time https://wikitech.wikimedia.org/wiki/Swift [04:36:18] FIRING: NELNotReported: NEL metrics not reported - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELNotReported [04:41:18] RESOLVED: NELNotReported: NEL metrics not reported - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - 
https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELNotReported [04:41:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1235', diff saved to https://phabricator.wikimedia.org/P65847 and previous config saved to /var/cache/conftool/dbconfig/20240705-044148-marostegui.json [04:42:31] RECOVERY - MariaDB Replica Lag: s4 on clouddb1019 is OK: OK slave_sql_lag Replication lag: 0.10 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [04:42:46] (Abandoned) Abijeet Patro: Localisation updates from https://translatewiki.net. [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/1052096 (owner: L10n-bot) [04:49:03] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1246.eqiad.wmnet with reason: Maintenance [04:49:06] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1246.eqiad.wmnet with reason: Maintenance [04:49:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1246 (T367856)', diff saved to https://phabricator.wikimedia.org/P65848 and previous config saved to /var/cache/conftool/dbconfig/20240705-044912-marostegui.json [04:49:15] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [04:49:33] PROBLEM - Swift https backend on ms-fe1014 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 244 bytes in 1.055 second response time https://wikitech.wikimedia.org/wiki/Swift [04:49:57] PROBLEM - Swift https frontend on ms-fe1014 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 244 bytes in 1.062 second response time https://wikitech.wikimedia.org/wiki/Swift [04:50:33] RECOVERY - Swift https backend on ms-fe1014 is OK: HTTP OK: HTTP/1.1 200 OK - 501 bytes in 0.067 second response time https://wikitech.wikimedia.org/wiki/Swift [04:51:25] !log 
marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2123.codfw.wmnet with reason: Maintenance [04:51:38] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2123.codfw.wmnet with reason: Maintenance [04:51:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2123 (T367856)', diff saved to https://phabricator.wikimedia.org/P65849 and previous config saved to /var/cache/conftool/dbconfig/20240705-045145-marostegui.json [04:51:57] RECOVERY - Swift https frontend on ms-fe1014 is OK: HTTP OK: HTTP/1.1 200 OK - 296 bytes in 0.741 second response time https://wikitech.wikimedia.org/wiki/Swift [04:55:33] PROBLEM - Swift https backend on ms-fe1014 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 244 bytes in 0.551 second response time https://wikitech.wikimedia.org/wiki/Swift [04:55:34] (PS1) Marostegui: install_server: Allow reimage db22[21-40] [puppet] - https://gerrit.wikimedia.org/r/1052197 (https://phabricator.wikimedia.org/T368922) [04:56:33] RECOVERY - Swift https backend on ms-fe1014 is OK: HTTP OK: HTTP/1.1 200 OK - 501 bytes in 0.076 second response time https://wikitech.wikimedia.org/wiki/Swift [04:56:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1235', diff saved to https://phabricator.wikimedia.org/P65850 and previous config saved to /var/cache/conftool/dbconfig/20240705-045655-marostegui.json [04:57:49] (CR) Marostegui: "recheck" [puppet] - https://gerrit.wikimedia.org/r/1052197 (https://phabricator.wikimedia.org/T368922) (owner: Marostegui) [04:59:18] FIRING: NELNotReported: NEL metrics not reported - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELNotReported [04:59:18] FIRING: NELByCountryNotReported: NEL metrics by country not reported - 
https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELByCountryNotReported [04:59:56] (CR) Marostegui: [C:+2] install_server: Allow reimage db22[21-40] [puppet] - https://gerrit.wikimedia.org/r/1052197 (https://phabricator.wikimedia.org/T368922) (owner: Marostegui) [05:00:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2136', diff saved to https://phabricator.wikimedia.org/P65851 and previous config saved to /var/cache/conftool/dbconfig/20240705-050028-root.json [05:04:13] SRE, SRE-swift-storage, Data-Persistence, DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 -lsw1-f3-eqiad - https://phabricator.wikimedia.org/T365998#9955625 (Marostegui) [05:04:18] RESOLVED: NELNotReported: NEL metrics not reported - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELNotReported [05:04:18] RESOLVED: NELByCountryNotReported: NEL metrics by country not reported - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELByCountryNotReported [05:05:18] FIRING: NELNotReported: NEL metrics not reported - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELNotReported [05:05:18] FIRING: NELByCountryNotReported: NEL metrics by country not reported - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELByCountryNotReported [05:09:33] RESOLVED: NELNotReported: NEL metrics not reported 
- https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELNotReported [05:09:33] RESOLVED: NELByCountryNotReported: NEL metrics by country not reported - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELByCountryNotReported [05:12:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1235 (T364069)', diff saved to https://phabricator.wikimedia.org/P65852 and previous config saved to /var/cache/conftool/dbconfig/20240705-051202-marostegui.json [05:12:05] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1239.eqiad.wmnet with reason: Maintenance [05:12:06] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [05:12:18] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1239.eqiad.wmnet with reason: Maintenance [05:12:23] (PS1) Marostegui: pc1017: Add host [puppet] - https://gerrit.wikimedia.org/r/1052198 (https://phabricator.wikimedia.org/T368920) [05:13:07] (CR) Marostegui: [C:+2] pc1017: Add host [puppet] - https://gerrit.wikimedia.org/r/1052198 (https://phabricator.wikimedia.org/T368920) (owner: Marostegui) [05:13:35] PROBLEM - Swift https backend on ms-fe1014 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 244 bytes in 0.853 second response time https://wikitech.wikimedia.org/wiki/Swift [05:14:35] RECOVERY - Swift https backend on ms-fe1014 is OK: HTTP OK: HTTP/1.1 200 OK - 501 bytes in 0.069 second response time https://wikitech.wikimedia.org/wiki/Swift [05:17:50] (PS1) Marostegui: db22[21-40]: Fix typo [puppet] - https://gerrit.wikimedia.org/r/1052199 [05:18:15] (CR) Marostegui: [C:+2] db22[21-40]: Fix typo 
[puppet] - https://gerrit.wikimedia.org/r/1052199 (owner: Marostegui) [05:20:27] FIRING: HelmReleaseBadStatus: Helm release knative-serving/knative-serving on k8s-mlstaging@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s-mlstaging&var-namespace=knative-serving - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [05:27:28] (PS1) Gerrit maintenance bot: mariadb: Promote db1222 to s2 master [puppet] - https://gerrit.wikimedia.org/r/1052200 (https://phabricator.wikimedia.org/T369339) [05:27:33] (PS1) Gerrit maintenance bot: wmnet: Update s2-master alias [dns] - https://gerrit.wikimedia.org/r/1052201 (https://phabricator.wikimedia.org/T369339) [05:37:31] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS6939/IPv6: Connect - HE, AS6939/IPv4: Connect - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [05:38:31] RECOVERY - BGP status on cr4-ulsfo is OK: BGP OK - up: 143, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [06:00:04] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240705T0600) [06:04:21] FIRING: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [06:09:20] SRE, LDAP-Access-Requests, Patch-For-Review: Update terms and timeline of access already granted for AndyRussG - https://phabricator.wikimedia.org/T367681#9955660 (Volans) @AndyRussG thanks, I've got your updated email on LDAP and added your user to the `nda` group. I think you should have again acce... 
[06:09:21] RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [06:10:07] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [06:10:59] (CR) Ilias Sarantopoulos: [C:+1] "Wow weird bug, thanks for fixing this." [deployment-charts] - https://gerrit.wikimedia.org/r/1052141 (https://phabricator.wikimedia.org/T368359) (owner: Elukey) [06:14:18] FIRING: NELNotReported: NEL metrics not reported - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELNotReported [06:14:18] FIRING: NELByCountryNotReported: NEL metrics by country not reported - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELByCountryNotReported [06:16:09] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [06:19:18] RESOLVED: NELNotReported: NEL metrics not reported - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELNotReported [06:19:18] RESOLVED: NELByCountryNotReported: NEL metrics by country not reported - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - 
https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELByCountryNotReported [06:34:46] SRE, SRE-Access-Requests, Data-Engineering, Patch-For-Review: add approvers to analytics-research-admins - https://phabricator.wikimedia.org/T368435#9955677 (Volans) @Miriam I'm not sure if we have ever written them somewhere explicitly. After a quick look as I can't find a clear reference. I'll... [06:35:32] SRE, SRE-Access-Requests: Requesting access to stewards-users for JJMC89 - https://phabricator.wikimedia.org/T369314#9955679 (Volans) [06:38:18] FIRING: NELNotReported: NEL metrics not reported - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELNotReported [06:38:18] FIRING: NELByCountryNotReported: NEL metrics by country not reported - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELByCountryNotReported [06:42:14] SRE, SRE-Access-Requests: Requesting access to stewards-users for JJMC89 - https://phabricator.wikimedia.org/T369314#9955682 (Volans) As I don't see an NDA on file for @JJMC89, adding @KFrancis for preparing one or confirming it already exists if I missed it. 
[06:43:16] (03CR) 10Alexandros Kosiaris: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1052127 (owner: 10Giuseppe Lavagetto) [06:43:18] RESOLVED: NELNotReported: NEL metrics not reported - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELNotReported [06:43:18] RESOLVED: NELByCountryNotReported: NEL metrics by country not reported - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELByCountryNotReported [06:43:40] volans: I don't have one yet [06:44:19] ack, thanks for confirming :) [06:48:39] PROBLEM - Swift https backend on ms-fe1014 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 244 bytes in 1.053 second response time https://wikitech.wikimedia.org/wiki/Swift [06:49:37] RECOVERY - Swift https backend on ms-fe1014 is OK: HTTP OK: HTTP/1.1 200 OK - 501 bytes in 0.066 second response time https://wikitech.wikimedia.org/wiki/Swift [06:50:13] 06SRE, 10LDAP-Access-Requests, 13Patch-For-Review: Update terms and timeline of access already granted for AndyRussG - https://phabricator.wikimedia.org/T367681#9955686 (10AndyRussG) Thanks, @Volans. 
I can confirm that I have access to JupyterHub again, yaaaayyyy ;) [06:51:18] FIRING: NELNotReported: NEL metrics not reported - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELNotReported [06:51:18] FIRING: NELByCountryNotReported: NEL metrics by country not reported - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELByCountryNotReported [06:53:59] PROBLEM - Swift https frontend on ms-fe1014 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 244 bytes in 1.052 second response time https://wikitech.wikimedia.org/wiki/Swift [06:55:57] RECOVERY - Swift https frontend on ms-fe1014 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.057 second response time https://wikitech.wikimedia.org/wiki/Swift [06:56:18] RESOLVED: NELNotReported: NEL metrics not reported - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELNotReported [06:56:18] RESOLVED: NELByCountryNotReported: NEL metrics by country not reported - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELByCountryNotReported [06:59:59] PROBLEM - Swift https frontend on ms-fe1014 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 244 bytes in 1.054 second response time https://wikitech.wikimedia.org/wiki/Swift [07:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240705T0700) [07:00:59] RECOVERY - Swift https frontend on ms-fe1014 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.053 second response time https://wikitech.wikimedia.org/wiki/Swift [07:06:41] FIRING: SystemdUnitFailed: wmf_auto_restart_rsyslog.service on mw2383:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:11:11] (03CR) 10Volans: "much better, thanks! Left couple of comments inline." [cookbooks] - 10https://gerrit.wikimedia.org/r/1042179 (https://phabricator.wikimedia.org/T366078) (owner: 10AOkoth) [07:21:48] (03PS1) 10Marostegui: site.pp: New parsercache host [puppet] - 10https://gerrit.wikimedia.org/r/1052205 (https://phabricator.wikimedia.org/T368920) [07:25:52] (03CR) 10Marostegui: [C:03+2] site.pp: New parsercache host [puppet] - 10https://gerrit.wikimedia.org/r/1052205 (https://phabricator.wikimedia.org/T368920) (owner: 10Marostegui) [07:37:11] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [07:39:16] (03CR) 10Elukey: [C:03+2] knative-serving: remove _example settings shipped with upstream yamls [deployment-charts] - 10https://gerrit.wikimedia.org/r/1052141 (https://phabricator.wikimedia.org/T368359) (owner: 10Elukey) [07:41:24] (03Abandoned) 10Elukey: role::puppetserver: skip puppet-merge [puppet] - 10https://gerrit.wikimedia.org/r/1050607 (https://phabricator.wikimedia.org/T368023) (owner: 10Elukey) [07:44:23] !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [07:44:56] !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. 
[07:47:24] !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [07:49:12] !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [07:50:12] RESOLVED: HelmReleaseBadStatus: Helm release knative-serving/knative-serving on k8s-mlstaging@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s-mlstaging&var-namespace=knative-serving - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [07:50:30] !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [07:50:51] !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [07:56:02] (03PS1) 10Marostegui: installserver: Allow reimage pc1017 [puppet] - 10https://gerrit.wikimedia.org/r/1052255 (https://phabricator.wikimedia.org/T368920) [08:00:13] (03CR) 10Marostegui: [C:03+2] installserver: Allow reimage pc1017 [puppet] - 10https://gerrit.wikimedia.org/r/1052255 (https://phabricator.wikimedia.org/T368920) (owner: 10Marostegui) [08:04:25] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1240.eqiad.wmnet with reason: Maintenance [08:04:38] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1240.eqiad.wmnet with reason: Maintenance [08:08:08] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2123 (T367856)', diff saved to https://phabricator.wikimedia.org/P65854 and previous config saved to /var/cache/conftool/dbconfig/20240705-080807-marostegui.json [08:08:08] (03PS1) 10Elukey: merge_cli: fix a puppet-merge.sh comment [puppet] - 10https://gerrit.wikimedia.org/r/1052260 (https://phabricator.wikimedia.org/T366355) [08:08:10] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [08:10:48] (03CR) 10Filippo Giunchedi: mariadb: recording 
rules to monitor (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1050376 (https://phabricator.wikimedia.org/T367283) (owner: 10Arnaudb) [08:11:49] (03CR) 10Filippo Giunchedi: [C:03+1] mariadb: add monitoring on io pressure for mariadb hosts [alerts] - 10https://gerrit.wikimedia.org/r/1049196 (https://phabricator.wikimedia.org/T367281) (owner: 10Arnaudb) [08:13:19] (03CR) 10Filippo Giunchedi: [C:03+2] pontoon: Remove more puppet 5 leftovers [puppet] - 10https://gerrit.wikimedia.org/r/1047502 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff) [08:20:18] FIRING: NELNotReported: NEL metrics not reported - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELNotReported [08:20:18] FIRING: NELByCountryNotReported: NEL metrics by country not reported - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELByCountryNotReported [08:23:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2123', diff saved to https://phabricator.wikimedia.org/P65855 and previous config saved to /var/cache/conftool/dbconfig/20240705-082314-marostegui.json [08:25:18] RESOLVED: NELNotReported: NEL metrics not reported - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELNotReported [08:25:18] RESOLVED: NELByCountryNotReported: NEL metrics by country not reported - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELByCountryNotReported [08:26:19] (03PS1) 10Elukey: puppetmaster::gitclone: disarm pre-commit and 
post-commit hooks [puppet] - 10https://gerrit.wikimedia.org/r/1052261 (https://phabricator.wikimedia.org/T368023) [08:27:31] PROBLEM - MariaDB Replica Lag: s4 on clouddb1019 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 461.36 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [08:30:01] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (NOOP 3 CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1052261 (https://phabricator.wikimedia.org/T368023) (owner: 10Elukey) [08:36:20] I'll take a look at the NEL not reported alerts [08:38:22] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2123', diff saved to https://phabricator.wikimedia.org/P65856 and previous config saved to /var/cache/conftool/dbconfig/20240705-083821-marostegui.json [08:42:05] (03PS1) 10Elukey: services: lower mesh's envoy concurrency to 8 for Wikifeeds [deployment-charts] - 10https://gerrit.wikimedia.org/r/1052262 (https://phabricator.wikimedia.org/T368238) [08:45:14] (03CR) 10Clément Goubert: [C:03+1] services: lower mesh's envoy concurrency to 8 for Wikifeeds [deployment-charts] - 10https://gerrit.wikimedia.org/r/1052262 (https://phabricator.wikimedia.org/T368238) (owner: 10Elukey) [08:45:15] (03PS1) 10Elukey: services: upgrade mesh's envoy Docker version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1052263 (https://phabricator.wikimedia.org/T368366) [08:47:18] FIRING: NELNotReported: NEL metrics not reported - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELNotReported [08:47:18] FIRING: NELByCountryNotReported: NEL metrics by country not reported - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - 
https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELByCountryNotReported [08:51:37] 06SRE: NEL almost not reported anymore / very infrequently - https://phabricator.wikimedia.org/T369345 (10fgiunchedi) 03NEW [08:51:38] I'll ack with T369345 [08:51:39] T369345: NEL almost not reported anymore / very infrequently - https://phabricator.wikimedia.org/T369345 [08:52:18] RESOLVED: NELNotReported: NEL metrics not reported - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELNotReported [08:52:18] RESOLVED: NELByCountryNotReported: NEL metrics by country not reported - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELByCountryNotReported [08:53:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2123 (T367856)', diff saved to https://phabricator.wikimedia.org/P65857 and previous config saved to /var/cache/conftool/dbconfig/20240705-085329-marostegui.json [08:53:32] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2128.codfw.wmnet with reason: Maintenance [08:53:32] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [08:53:45] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2128.codfw.wmnet with reason: Maintenance [08:53:46] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance [08:54:00] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance [08:54:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2128 
(T367856)', diff saved to https://phabricator.wikimedia.org/P65858 and previous config saved to /var/cache/conftool/dbconfig/20240705-085406-marostegui.json [08:55:26] !log silence NELNotReported NELByCountryNotReported until Tues - T369345 [08:55:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:59:49] (03Abandoned) 10Fabfur: hiera: use benthos on cp3073 (first esams host) [puppet] - 10https://gerrit.wikimedia.org/r/1036190 (https://phabricator.wikimedia.org/T358109) (owner: 10Fabfur) [09:05:38] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: add approvers to analytics-research-admins - https://phabricator.wikimedia.org/T368435#9955984 (10Miriam) All clear, thank you @Volans! @dzahn may I suggest we also add @XiaoXiao-WMF as approver (the other Research manager) in case I am... [09:06:32] (03PS1) 10Ayounsi: Point netbox-next to netbox-dev2003 [dns] - 10https://gerrit.wikimedia.org/r/1052266 (https://phabricator.wikimedia.org/T336275) [09:06:42] 06SRE: NEL almost not reported anymore / very infrequently - https://phabricator.wikimedia.org/T369345#9955987 (10fgiunchedi) [09:06:48] (03PS2) 10Ayounsi: Point netbox-next to netbox-dev2003 [dns] - 10https://gerrit.wikimedia.org/r/1052266 (https://phabricator.wikimedia.org/T336275) [09:09:41] (03CR) 10Elukey: [C:03+1] Point netbox-next to netbox-dev2003 [dns] - 10https://gerrit.wikimedia.org/r/1052266 (https://phabricator.wikimedia.org/T336275) (owner: 10Ayounsi) [09:15:06] (03PS1) 10Ayounsi: netbox-dev2003: move from netbox-dev to netbox-next [puppet] - 10https://gerrit.wikimedia.org/r/1052267 (https://phabricator.wikimedia.org/T336275) [09:15:20] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1052267 (https://phabricator.wikimedia.org/T336275) (owner: 10Ayounsi) [09:16:42] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1052267 (https://phabricator.wikimedia.org/T336275) (owner: 10Ayounsi) 
[09:21:33] (03PS1) 10Marostegui: db2158: Remove package declaration [puppet] - 10https://gerrit.wikimedia.org/r/1052268 [09:23:31] (03CR) 10Marostegui: [C:03+2] db2158: Remove package declaration [puppet] - 10https://gerrit.wikimedia.org/r/1052268 (owner: 10Marostegui) [09:25:50] (03PS1) 10Lucas Werkmeister (WMDE): Define custom search-index-data-formatter-callback [extensions/EntitySchema] (wmf/1.43.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1052269 (https://phabricator.wikimedia.org/T369149) [09:25:56] (03CR) 10Elukey: [C:03+1] netbox-dev2003: move from netbox-dev to netbox-next [puppet] - 10https://gerrit.wikimedia.org/r/1052267 (https://phabricator.wikimedia.org/T336275) (owner: 10Ayounsi) [09:26:05] (03PS1) 10Lucas Werkmeister (WMDE): Try looking up search index data formatters by data type [extensions/WikibaseCirrusSearch] (wmf/1.43.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1052270 (https://phabricator.wikimedia.org/T369149) [09:26:31] PROBLEM - MariaDB Replica Lag: s4 on clouddb1019 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 330.05 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [09:26:33] (03CR) 10Lucas Werkmeister (WMDE): "Note: there’s no need to backport the Wikibase change or follow-up WikibaseCirrusSearch change mentioned in the commit message." 
[extensions/WikibaseCirrusSearch] (wmf/1.43.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1052270 (https://phabricator.wikimedia.org/T369149) (owner: 10Lucas Werkmeister (WMDE)) [09:26:40] !log netbox-dev2003: move from netbox-dev to netbox-next - T336275 [09:26:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:26:43] T336275: Upgrade Netbox to 4.x - https://phabricator.wikimedia.org/T336275 [09:26:50] (03CR) 10Ayounsi: [C:03+2] netbox-dev2003: move from netbox-dev to netbox-next [puppet] - 10https://gerrit.wikimedia.org/r/1052267 (https://phabricator.wikimedia.org/T336275) (owner: 10Ayounsi) [09:28:57] thcipriani, hashar: help! I’d like to do an emergency deploy for https://gerrit.wikimedia.org/r/1052269 and https://gerrit.wikimedia.org/r/1052270 to fix T369149 [09:28:57] T369149: Search has outdated label for P12861 (“Shape Expression for class” rather than “EntitySchema for class”) - https://phabricator.wikimedia.org/T369149 [09:29:42] currently, search updates for some 200 Wikidata items, including important and widely-used classes like “human” or “album”, are blocked by this bug (i.e. 
the search sees old versions of the data) [09:30:16] (03PS1) 10Vgutierrez: hiera: Don't delete X-Forwarded-Proto request header [puppet] - 10https://gerrit.wikimedia.org/r/1052271 (https://phabricator.wikimedia.org/T369345) [09:30:21] reverting is not an option because the issue is caused by new on-wiki data which was enabled by a feature flag we turned on (even if we turned off the feature flag again – which we really don’t want to do – it wouldn’t help with the issue at all) [09:30:56] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1052271 (https://phabricator.wikimedia.org/T369345) (owner: 10Vgutierrez) [09:31:08] (03CR) 10Elukey: [C:03+2] Point netbox-next to netbox-dev2003 [dns] - 10https://gerrit.wikimedia.org/r/1052266 (https://phabricator.wikimedia.org/T336275) (owner: 10Ayounsi) [09:31:18] * Lucas_WMDE hopes some WMFers are around on the friday after july 4th… [09:32:01] (03CR) 10Fabfur: [C:03+1] hiera: Don't delete X-Forwarded-Proto request header [puppet] - 10https://gerrit.wikimedia.org/r/1052271 (https://phabricator.wikimedia.org/T369345) (owner: 10Vgutierrez) [09:32:35] (03CR) 10Filippo Giunchedi: [C:03+1] hiera: Don't delete X-Forwarded-Proto request header [puppet] - 10https://gerrit.wikimedia.org/r/1052271 (https://phabricator.wikimedia.org/T369345) (owner: 10Vgutierrez) [09:33:14] (03CR) 10Vgutierrez: [C:03+2] hiera: Don't delete X-Forwarded-Proto request header [puppet] - 10https://gerrit.wikimedia.org/r/1052271 (https://phabricator.wikimedia.org/T369345) (owner: 10Vgutierrez) [09:34:35] (03CR) 10Cathal Mooney: [C:03+1] "Nice to see it happening!" 
[dns] - 10https://gerrit.wikimedia.org/r/1052144 (https://phabricator.wikimedia.org/T359054) (owner: 10Ssingh) [09:35:26] (03PS1) 10Btullis: cephcsi: bump_image_version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1052275 (https://phabricator.wikimedia.org/T327259) [09:35:29] !log running puppet on A:cp to apply https://gerrit.wikimedia.org/r/c/operations/puppet/+/1052271 (T369345) [09:35:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:35:32] T369345: NEL almost not reported anymore / very infrequently - https://phabricator.wikimedia.org/T369345 [09:37:44] (03PS6) 10Jgiannelos: Remove page html endpoints from changeprop [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051361 (https://phabricator.wikimedia.org/T367418) [09:42:33] (03CR) 10Btullis: [C:03+2] cephcsi: bump_image_version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1052275 (https://phabricator.wikimedia.org/T327259) (owner: 10Btullis) [09:45:33] (03Merged) 10jenkins-bot: cephcsi: bump_image_version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1052275 (https://phabricator.wikimedia.org/T327259) (owner: 10Btullis) [09:51:54] Lucas_WMDE: I guess go ahead since you are literally the expert on those [09:52:17] if you rather have someone to pair with, I have to prepare lunch for kid and I am going away for the next hour~ [09:52:48] hashar: thanks! 
I could also deploy later if that’s more convenient [09:52:55] (03CR) 10Hashar: [C:03+1] Define custom search-index-data-formatter-callback [extensions/EntitySchema] (wmf/1.43.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1052269 (https://phabricator.wikimedia.org/T369149) (owner: 10Lucas Werkmeister (WMDE)) [09:52:56] but I think I’m fine doing the deployment itself on my own [09:53:00] (03CR) 10Hashar: [C:03+1] Try looking up search index data formatters by data type [extensions/WikibaseCirrusSearch] (wmf/1.43.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1052270 (https://phabricator.wikimedia.org/T369149) (owner: 10Lucas Werkmeister (WMDE)) [09:53:01] just don’t want to break the policy ^^ [09:53:13] well I don't have much value besides monitoring the logs :D [09:53:28] but at least now I am aware! [09:53:30] I guess I’ll just do it now then [09:53:32] so feel free to deploy now [09:53:35] great [09:53:36] \o/ [09:53:38] gives me more time to check the logs before I go for lunch myself w^ [09:53:41] * ^^ [09:53:52] I have +1ed both changes [09:53:56] thanks! [09:54:02] I should be able to test the changes on mwdebug too, at least [09:54:20] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [extensions/EntitySchema] (wmf/1.43.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1052269 (https://phabricator.wikimedia.org/T369149) (owner: 10Lucas Werkmeister (WMDE)) [09:54:20] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [extensions/WikibaseCirrusSearch] (wmf/1.43.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1052270 (https://phabricator.wikimedia.org/T369149) (owner: 10Lucas Werkmeister (WMDE)) [09:54:49] (if anyone objects to the deployment, you probably have at least 10 minutes to tell me before CI is done ^^) [09:56:14] :D [09:56:16] I am off [09:56:22] alright, see you later! 
[09:56:31] oh and if you look at the mediawiki-new-errors dashboard at https://logstash.wikimedia.org/app/dashboards#/view/c7013c90-a487-11ec-be91-b3435f0c0c49 [09:56:37] there are some log entries which I have not filtered out [09:56:51] slyngs, fabfur: (pinging oncallers) FYI, I’m doing an emergency deployment ^ [09:56:53] they are rather minor as far as I can tell and I filed most of them in Phabricator [09:57:05] * Lucas_WMDE looks [09:57:18] so not much to worry about, it was a rather quiet train [09:57:21] I am off to prepare lunch [09:57:39] Lucas_WMDE: Noted, thank you [09:58:20] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q4:rack/setup/install sretest2001 - https://phabricator.wikimedia.org/T365167#9956200 (10elukey) I confirm that using the `ADMIN` (uppercase) user everything works fine, I was able to use Redfish today! [10:06:10] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q4:rack/setup/install sretest2001 - https://phabricator.wikimedia.org/T365167#9956217 (10elukey) @Papaul I compared the BIOS settings between sretest2001 and kubernetes2054, these are the differences: sretest2001 has `ConsoleRedirection`=False m... [10:10:24] !log btullis@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [10:10:24] (03Merged) 10jenkins-bot: Define custom search-index-data-formatter-callback [extensions/EntitySchema] (wmf/1.43.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1052269 (https://phabricator.wikimedia.org/T369149) (owner: 10Lucas Werkmeister (WMDE)) [10:11:13] !log btullis@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[10:16:02] (03CR) 10Alexandros Kosiaris: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1052129 (owner: 10Giuseppe Lavagetto) [10:19:45] (03Merged) 10jenkins-bot: Try looking up search index data formatters by data type [extensions/WikibaseCirrusSearch] (wmf/1.43.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1052270 (https://phabricator.wikimedia.org/T369149) (owner: 10Lucas Werkmeister (WMDE)) [10:20:01] wheee [10:20:17] !log lucaswerkmeister-wmde@deploy1002 Started scap sync-world: Backport for [[gerrit:1052269|Define custom search-index-data-formatter-callback (T369149)]], [[gerrit:1052270|Try looking up search index data formatters by data type (T369149)]] [10:20:20] T369149: Search has outdated label for P12861 (“Shape Expression for class” rather than “EntitySchema for class”) - https://phabricator.wikimedia.org/T369149 [10:22:46] !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde: Backport for [[gerrit:1052269|Define custom search-index-data-formatter-callback (T369149)]], [[gerrit:1052270|Try looking up search index data formatters by data type (T369149)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [10:23:09] testing… [10:23:21] yay, https://www.wikidata.org/w/api.php?action=query&cbbuilders=content|links&format=json&format=json&formatversion=2&pageids=120965176&prop=cirrusbuilddoc no longer errors [10:24:02] o_O but it no longer errors even when I turn off WikimediaDebug? [10:24:05] but I was able to see the error earlier… [10:24:26] that’s… concerning [10:24:45] did it accidentally deploy everywhere already somehow? 
[10:25:11] I’ll take a bit more time to look into this if that’s okay with everyone, the fix isn’t urgent and this seems worrying [10:25:17] * Lucas_WMDE looks at k8s deployments [10:27:10] according to the server response header, the non-error response comes from mw-api-ext.eqiad.main-6c9d8796d6-xvr5m, and the mw-api-ext.eqiad.main-6c9d8796d6 replicaset (in kube_env mw-api-ext eqiad) is 14h old [10:27:25] and its image has the version 2024-07-04-200554-webserver, i.e. yesterday [10:27:30] ohhhhh [10:27:35] I bet it’s cached [10:27:55] and the old code gets the search data from where the new code (on mwdebug) left it in the cache [10:28:06] I saw a cache *somewhere* in the call stack when I was working on this locally [10:30:41] hm, but when I try some other page IDs (e.g. 133 or 24533) they also succeed [10:30:44] Lucas_WMDE: yeah it's cache [10:30:49] without me ever loading the search data for those on mwdebug [10:30:53] curl --connect-to www.wikidata.org:443:mw-api-ext.svc.eqiad.wmnet:4447 'https://www.wikidata.org/w/api.php?action=query&cbbuilders=content|links&format=json&format=json&formatversion=2&pageids=120965176&prop=cirrusbuilddoc' | jq [10:30:55] so I wouldn’t expect those to be in the cache… [10:31:13] hmm no I'm wrong, that's not an error [10:31:49] actually [10:32:10] what is it supposed to return? [10:32:28] claime: with the fix it’s supposed to return what you see there, I think [10:32:31] without the fix it would be an internal error [10:32:34] (the cache I encountered locally was in ParserOutputPageProperties::finalizeReal() btw) [10:32:48] you can see the error at https://phabricator.wikimedia.org/T369149#9949518, though the API response wouldn’t include the stack trace [10:32:49] oh, parsercache, not edge cache [10:33:01] object cache but yeah [10:33:50] hmm, I wonder [10:33:57] maybe other items are not expected to be affected after all? 
[10:33:58] let me test something [10:35:31] (03PS1) 10Btullis: Add the analytics-wmde keytab to the stat boxes [puppet] - 10https://gerrit.wikimedia.org/r/1052280 (https://phabricator.wikimedia.org/T340648) [10:35:35] okay, I can reproduce it at https://www.wikidata.org/w/api.php?action=query&cbbuilders=content|links&format=json&format=json&formatversion=2&pageids=4246474&prop=cirrusbuilddoc [10:35:41] it doesn’t affect all items with P12861 statements, as I had thought [10:35:48] it needs to be in a qualifier of a normal item statement, apparently [10:35:52] hence all the other items not being affected [10:36:04] but https://www.wikidata.org/w/api.php?action=query&cbbuilders=content|links&format=json&format=json&formatversion=2&pageids=4246474&prop=cirrusbuilddoc shows the error before the fix, and now I’m going to enable WikimediaDebug [10:36:07] (03PS2) 10Btullis: Add the analytics-wmde keytab to the stat boxes [puppet] - 10https://gerrit.wikimedia.org/r/1052280 (https://phabricator.wikimedia.org/T340648) [10:36:13] ("[2ab276ab-c3c6-491f-a185-ac87f7119231] Caught exception of type TypeError" ftr) [10:36:21] and with WikimediaDebug it works [10:36:33] and now it works again without WikimediaDebug as well, because object cache [10:36:38] okay, mystery solved [10:36:42] !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde: Continuing with sync [10:36:52] 06SRE, 06Infrastructure-Foundations, 10netops: Model GRE tunnels in Netbox - https://phabricator.wikimedia.org/T369351 (10cmooney) 03NEW p:05Triage→03Low [10:36:53] :D [10:37:19] so my claim for the emergency deploy was incorrect, it turns out – most of those 200 items won’t have been affected by the fix after all [10:37:29] but I’m still happy to have it deployed, even if it just fixes the one property ^^ [10:37:40] (since it was a fairly annoying issue there – you could only find the property by its old label) [10:37:50] 06SRE, 10LDAP-Access-Requests: Grant access to wmf to lferreira - 
https://phabricator.wikimedia.org/T369348#9956301 (10Nemoralis) [10:40:47] (03PS1) 10Marostegui: pc2017: Add host [puppet] - 10https://gerrit.wikimedia.org/r/1052281 (https://phabricator.wikimedia.org/T368919) [10:41:39] !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:1052269|Define custom search-index-data-formatter-callback (T369149)]], [[gerrit:1052270|Try looking up search index data formatters by data type (T369149)]] (duration: 21m 22s) [10:41:42] T369149: Search has outdated label for P12861 (“Shape Expression for class” rather than “EntitySchema for class”) - https://phabricator.wikimedia.org/T369149 [10:41:56] (03PS2) 10Marostegui: pc2017: Add host [puppet] - 10https://gerrit.wikimedia.org/r/1052281 (https://phabricator.wikimedia.org/T368919) [10:42:43] (03PS10) 10Elukey: WIP: sre.hosts.provison: add BIOS/Mgmt-console support for Supermicro [cookbooks] - 10https://gerrit.wikimedia.org/r/1037806 (https://phabricator.wikimedia.org/T365372) [10:43:16] (03CR) 10Elukey: "Added the first BIOS settings after comparing sretest2001 with kubernetes2054" [cookbooks] - 10https://gerrit.wikimedia.org/r/1037806 (https://phabricator.wikimedia.org/T365372) (owner: 10Elukey) [10:43:38] (03CR) 10Marostegui: [C:03+2] pc2017: Add host [puppet] - 10https://gerrit.wikimedia.org/r/1052281 (https://phabricator.wikimedia.org/T368919) (owner: 10Marostegui) [10:44:04] * Lucas_WMDE done deploying (fyi slyngst, fabfur, hashar) [10:47:01] (03PS1) 10Btullis: Add dummy keytabs for analytics-wmde on stat servers. [labs/private] - 10https://gerrit.wikimedia.org/r/1052282 (https://phabricator.wikimedia.org/T340648) [10:47:24] (03CR) 10Btullis: [V:03+2 C:03+2] Add dummy keytabs for analytics-wmde on stat servers. 
[labs/private] - 10https://gerrit.wikimedia.org/r/1052282 (https://phabricator.wikimedia.org/T340648) (owner: 10Btullis) [10:48:52] (03PS1) 10Clément Goubert: check_ferm: Add -w 5 to iptables check [puppet] - 10https://gerrit.wikimedia.org/r/1052283 (https://phabricator.wikimedia.org/T354855) [10:49:34] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3170/co" [puppet] - 10https://gerrit.wikimedia.org/r/1052280 (https://phabricator.wikimedia.org/T340648) (owner: 10Btullis) [10:52:28] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance [10:52:52] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance [10:59:41] PROBLEM - Swift https backend on ms-fe1014 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 244 bytes in 1.057 second response time https://wikitech.wikimedia.org/wiki/Swift [11:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240705T0700) [11:00:05] eoghan, jelto, arnoldokoth, and mutante: Your horoscope predicts another GitLab version upgrades deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240705T1100). 
[11:00:41] RECOVERY - Swift https backend on ms-fe1014 is OK: HTTP OK: HTTP/1.1 200 OK - 501 bytes in 0.116 second response time https://wikitech.wikimedia.org/wiki/Swift [11:01:33] (03PS1) 10NMW03: Enable VisualEditor by default on Italian Wikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1052285 (https://phabricator.wikimedia.org/T369342) [11:06:41] FIRING: SystemdUnitFailed: wmf_auto_restart_rsyslog.service on mw2383:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:09:37] (03CR) 10Btullis: [V:03+1 C:03+2] Add the analytics-wmde keytab to the stat boxes [puppet] - 10https://gerrit.wikimedia.org/r/1052280 (https://phabricator.wikimedia.org/T340648) (owner: 10Btullis) [11:11:47] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db1231 (re)pooling @ 10%: Maint over', diff saved to https://phabricator.wikimedia.org/P65860 and previous config saved to /var/cache/conftool/dbconfig/20240705-111146-ladsgroup.json [11:13:02] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2207.codfw.wmnet with reason: Maintenance [11:13:15] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2207.codfw.wmnet with reason: Maintenance [11:13:22] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2207 (T352010)', diff saved to https://phabricator.wikimedia.org/P65861 and previous config saved to /var/cache/conftool/dbconfig/20240705-111322-ladsgroup.json [11:13:25] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [11:16:26] RESOLVED: SystemdUnitFailed: wmf_auto_restart_rsyslog.service on mw2383:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed 
[11:22:08] (03PS1) 10Jelto: aptrepo: import gitlab-runner package for bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1052287 (https://phabricator.wikimedia.org/T367717) [11:23:42] (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3171/console" [puppet] - 10https://gerrit.wikimedia.org/r/1052287 (https://phabricator.wikimedia.org/T367717) (owner: 10Jelto) [11:26:53] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db1231 (re)pooling @ 25%: Maint over', diff saved to https://phabricator.wikimedia.org/P65862 and previous config saved to /var/cache/conftool/dbconfig/20240705-112652-ladsgroup.json [11:29:47] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 4 days, 0:00:00 on kubernetes1051.eqiad.wmnet with reason: Hardware issue [11:30:01] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4 days, 0:00:00 on kubernetes1051.eqiad.wmnet with reason: Hardware issue [11:30:08] 10ops-eqiad, 06SRE, 06DC-Ops, 10Prod-Kubernetes, 06serviceops: kubernetes1051.eqiad.wmnet failed to pull mediawiki images - https://phabricator.wikimedia.org/T369011#9956418 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=53cf057a-4641-401a-ab84-392d5d8f2444) set by cgoubert@cumin1002... 
[11:31:59] (03CR) 10Slyngshede: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1052287 (https://phabricator.wikimedia.org/T367717) (owner: 10Jelto) [11:37:11] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [11:39:03] (03CR) 10Jelto: [V:03+1 C:03+2] aptrepo: import gitlab-runner package for bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1052287 (https://phabricator.wikimedia.org/T367717) (owner: 10Jelto) [11:39:35] 06SRE, 06Traffic, 13Patch-For-Review: Disable acceptance of IPv6 router-advertisement on non-default LVS interface - https://phabricator.wikimedia.org/T358260#9956433 (10cmooney) Bit of an update on this one. We had a problem recently after lvs2011 was rebooted which is related, which we need to address. *... [11:41:58] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db1231 (re)pooling @ 75%: Maint over', diff saved to https://phabricator.wikimedia.org/P65863 and previous config saved to /var/cache/conftool/dbconfig/20240705-114157-ladsgroup.json [11:46:31] (03CR) 10Alexandros Kosiaris: [C:03+1] check_ferm: Add -w 5 to iptables check [puppet] - 10https://gerrit.wikimedia.org/r/1052283 (https://phabricator.wikimedia.org/T354855) (owner: 10Clément Goubert) [11:53:27] !log T369149: re-indexed wikidata P12861 (cirrus_rerender.rerender --wiki wikidatawiki allpages --namespace 120 --from-title P12861 --to-title P12861) [11:53:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:53:29] T369149: Search has outdated label for P12861 (“Shape Expression for class” rather than “EntitySchema for class”) - https://phabricator.wikimedia.org/T369149 [11:57:04] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db1231 (re)pooling @ 100%: Maint over', diff saved to https://phabricator.wikimedia.org/P65864 and previous config 
saved to /var/cache/conftool/dbconfig/20240705-115703-ladsgroup.json [11:57:09] (03PS1) 10Btullis: cephcsi: Enable the prometheus-liveness container [deployment-charts] - 10https://gerrit.wikimedia.org/r/1052292 (https://phabricator.wikimedia.org/T327259) [12:00:15] (03PS2) 10Alexandros Kosiaris: mediawiki-image-download: Drop to 80% [puppet] - 10https://gerrit.wikimedia.org/r/1039620 (https://phabricator.wikimedia.org/T366778) [12:00:23] (03CR) 10Alexandros Kosiaris: [V:03+2 C:03+2] mediawiki-image-download: Drop to 80% [puppet] - 10https://gerrit.wikimedia.org/r/1039620 (https://phabricator.wikimedia.org/T366778) (owner: 10Alexandros Kosiaris) [12:00:33] (03CR) 10Clément Goubert: [C:03+1] mediawiki-image-download: Drop to 80% [puppet] - 10https://gerrit.wikimedia.org/r/1039620 (https://phabricator.wikimedia.org/T366778) (owner: 10Alexandros Kosiaris) [12:01:35] (03CR) 10Clément Goubert: [C:03+2] check_ferm: Add -w 5 to iptables check [puppet] - 10https://gerrit.wikimedia.org/r/1052283 (https://phabricator.wikimedia.org/T354855) (owner: 10Clément Goubert) [12:06:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2128 (T367856)', diff saved to https://phabricator.wikimedia.org/P65865 and previous config saved to /var/cache/conftool/dbconfig/20240705-120608-marostegui.json [12:06:12] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [12:10:03] (03CR) 10Alexandros Kosiaris: [C:03+1] Remove conftool-data and service catalog for legacy appservers 3/3 [puppet] - 10https://gerrit.wikimedia.org/r/1050383 (https://phabricator.wikimedia.org/T367949) (owner: 10Clément Goubert) [12:10:25] (03CR) 10Alexandros Kosiaris: [C:03+1] Remove legacy appservers from profile::lvs::realserver::pools 2/3 [puppet] - 10https://gerrit.wikimedia.org/r/1050382 (https://phabricator.wikimedia.org/T367949) (owner: 10Clément Goubert) [12:11:33] (03CR) 10Alexandros Kosiaris: [C:03+1] "This might break some very rarely run internal 
worksloads, but it's arguably better they adapt than maintain these forever." [dns] - 10https://gerrit.wikimedia.org/r/1050304 (https://phabricator.wikimedia.org/T367949) (owner: 10Clément Goubert) [12:11:56] (03CR) 10Alexandros Kosiaris: [C:03+1] service.yaml: Switch api and appserver to lvs_setup 1/3 [puppet] - 10https://gerrit.wikimedia.org/r/1050381 (https://phabricator.wikimedia.org/T367949) (owner: 10Clément Goubert) [12:11:58] (03CR) 10Btullis: [C:03+2] cephcsi: Enable the prometheus-liveness container [deployment-charts] - 10https://gerrit.wikimedia.org/r/1052292 (https://phabricator.wikimedia.org/T327259) (owner: 10Btullis) [12:13:08] (03CR) 10Alexandros Kosiaris: [C:03+1] "+1, but we need to communicate this before doing it. wikitech-l@ and engineering-all in slack should suffice, I guess?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1051343 (https://phabricator.wikimedia.org/T367949) (owner: 10Jforrester) [12:13:49] (03CR) 10Alexandros Kosiaris: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1051345 (https://phabricator.wikimedia.org/T367949) (owner: 10Jforrester) [12:15:02] (03Merged) 10jenkins-bot: cephcsi: Enable the prometheus-liveness container [deployment-charts] - 10https://gerrit.wikimedia.org/r/1052292 (https://phabricator.wikimedia.org/T327259) (owner: 10Btullis) [12:17:30] 10ops-eqiad, 06SRE, 06DC-Ops: ManagementSSHDown - https://phabricator.wikimedia.org/T368766#9956539 (10phaultfinder) [12:19:23] !log btullis@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [12:19:49] !log btullis@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[12:21:16] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2128', diff saved to https://phabricator.wikimedia.org/P65866 and previous config saved to /var/cache/conftool/dbconfig/20240705-122115-marostegui.json [12:26:43] PROBLEM - Swift https backend on ms-fe1014 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 244 bytes in 0.609 second response time https://wikitech.wikimedia.org/wiki/Swift [12:27:15] (03Abandoned) 10Alexandros Kosiaris: WIP deployment::rsync: Temporarily disable stunnel [puppet] - 10https://gerrit.wikimedia.org/r/1051782 (https://phabricator.wikimedia.org/T364417) (owner: 10Alexandros Kosiaris) [12:27:43] RECOVERY - Swift https backend on ms-fe1014 is OK: HTTP OK: HTTP/1.1 200 OK - 501 bytes in 0.073 second response time https://wikitech.wikimedia.org/wiki/Swift [12:28:47] (03CR) 10Alexandros Kosiaris: [C:03+1] "+1 but is this still needed?" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1013035 (https://phabricator.wikimedia.org/T353876) (owner: 10Clément Goubert) [12:36:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2128', diff saved to https://phabricator.wikimedia.org/P65867 and previous config saved to /var/cache/conftool/dbconfig/20240705-123623-marostegui.json [12:36:36] (03CR) 10Clément Goubert: "We haven't seen this repeat since the patch was written, so I guess it would be better to pick https://phabricator.wikimedia.org/T361483 b" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1013035 (https://phabricator.wikimedia.org/T353876) (owner: 10Clément Goubert) [12:49:26] (03PS2) 10Alexandros Kosiaris: mwdebug: Change various uses to mw-on-k8s version [puppet] - 10https://gerrit.wikimedia.org/r/1051344 (owner: 10Jforrester) [12:49:26] (03PS2) 10Alexandros Kosiaris: mwdebug: Drop mwdebug\d{4} for bare metal servers [puppet] - 10https://gerrit.wikimedia.org/r/1051345 (https://phabricator.wikimedia.org/T367949) (owner: 10Jforrester) [12:50:34] 
FIRING: SystemdUnitFailed: generate_vrts_aliases.service on mx2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:51:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2128 (T367856)', diff saved to https://phabricator.wikimedia.org/P65868 and previous config saved to /var/cache/conftool/dbconfig/20240705-125130-marostegui.json [12:51:32] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2157.codfw.wmnet with reason: Maintenance [12:51:33] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [12:51:45] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2157.codfw.wmnet with reason: Maintenance [12:51:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2157 (T367856)', diff saved to https://phabricator.wikimedia.org/P65869 and previous config saved to /var/cache/conftool/dbconfig/20240705-125152-marostegui.json [12:54:01] (03Abandoned) 10Alexandros Kosiaris: changeprop: Exclude commons files with 100+ pages [deployment-charts] - 10https://gerrit.wikimedia.org/r/1013035 (https://phabricator.wikimedia.org/T353876) (owner: 10Clément Goubert) [12:54:03] (03CR) 10CI reject: [V:04-1] mwdebug: Drop mwdebug\d{4} for bare metal servers [puppet] - 10https://gerrit.wikimedia.org/r/1051345 (https://phabricator.wikimedia.org/T367949) (owner: 10Jforrester) [12:57:06] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1246.eqiad.wmnet with reason: Long schema change [12:57:08] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1246.eqiad.wmnet with reason: Long schema change [13:05:39] PROBLEM - Hadoop HistoryServer on an-master1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name 
java, args org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Mapreduce_Historyserver_process [13:10:39] RECOVERY - Hadoop HistoryServer on an-master1003 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Mapreduce_Historyserver_process [13:30:33] 06SRE, 06serviceops-radar, 13Patch-For-Review: Drop the `deploy-service` right, move three included users to `deployment` (or drop access)? - https://phabricator.wikimedia.org/T340165#9956684 (10akosiaris) 05Open→03Resolved a:03akosiaris My git grep above was wrongly also matching the `deploy-servi... [13:40:13] !log hashar@deploy1002 Started deploy [integration/docroot@18c8279]: Add AQS documentation to landing page - T368484 [13:40:16] T368484: Add AQS docs site to doc.wikimedia.org homepage - https://phabricator.wikimedia.org/T368484 [13:40:20] !log hashar@deploy1002 Finished deploy [integration/docroot@18c8279]: Add AQS documentation to landing page - T368484 (duration: 00m 06s) [13:46:24] (03PS1) 10Ssingh: conftool/cli: add option to log actions with a reason string [software/conftool] - 10https://gerrit.wikimedia.org/r/1052307 [13:49:16] RESOLVED: SystemdUnitFailed: generate_vrts_aliases.service on mx2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:50:04] (03PS3) 10Alexandros Kosiaris: mwdebug: Drop mwdebug\d{4} for bare metal servers [puppet] - 10https://gerrit.wikimedia.org/r/1051345 (https://phabricator.wikimedia.org/T367949) (owner: 10Jforrester) [13:56:44] (03CR) 10JMeybohm: [C:03+1] services: upgrade mesh's envoy Docker version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1052263 (https://phabricator.wikimedia.org/T368366) (owner: 
10Elukey) [13:56:57] (03CR) 10JMeybohm: [C:03+1] services: lower mesh's envoy concurrency to 8 for Wikifeeds [deployment-charts] - 10https://gerrit.wikimedia.org/r/1052262 (https://phabricator.wikimedia.org/T368238) (owner: 10Elukey) [13:57:05] (03CR) 10Alexandros Kosiaris: [C:03+1] conftool/cli: add option to log actions with a reason string [software/conftool] - 10https://gerrit.wikimedia.org/r/1052307 (owner: 10Ssingh) [14:00:34] (03CR) 10Alexandros Kosiaris: [C:04-2] "-2ing until we resolve https://phabricator.wikimedia.org/T324003." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1051343 (https://phabricator.wikimedia.org/T367949) (owner: 10Jforrester) [14:02:05] (03CR) 10Alexandros Kosiaris: [C:04-2] "I fixed the failing tests, but -2ing until we resolve https://phabricator.wikimedia.org/T324003." [puppet] - 10https://gerrit.wikimedia.org/r/1051345 (https://phabricator.wikimedia.org/T367949) (owner: 10Jforrester) [14:09:47] (03PS2) 10Arnaudb: mariadb: recording rules to monitor [puppet] - 10https://gerrit.wikimedia.org/r/1050376 (https://phabricator.wikimedia.org/T367283) [14:09:58] (03CR) 10Arnaudb: mariadb: recording rules to monitor (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1050376 (https://phabricator.wikimedia.org/T367283) (owner: 10Arnaudb) [14:14:25] (03CR) 10Alexandros Kosiaris: [C:04-1] "I don't think this will work. 
I see in https://puppet-compiler.wmflabs.org/output/1052129/3909/deploy1002.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1052129 (owner: 10Giuseppe Lavagetto) [14:15:27] (03CR) 10Alexandros Kosiaris: [C:03+1] "PCC looks ok to me" [puppet] - 10https://gerrit.wikimedia.org/r/1052127 (owner: 10Giuseppe Lavagetto) [14:15:37] (03CR) 10Alexandros Kosiaris: [C:03+1] mediawiki::sites: switch to use APACHE_RUN_PORT [puppet] - 10https://gerrit.wikimedia.org/r/1052128 (owner: 10Giuseppe Lavagetto) [14:19:16] FIRING: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:20:01] (03CR) 10AOkoth: vtrs: upgrade cookbook (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1042179 (https://phabricator.wikimedia.org/T366078) (owner: 10AOkoth) [14:23:51] (03CR) 10Alexandros Kosiaris: [V:03+2] "I don't think it matters much. The image isn't going to be deleted by merging this change, we 'll need to delete it manually. 
Which needs " [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1051407 (https://phabricator.wikimedia.org/T251812) (owner: 10Alexandros Kosiaris) [14:27:05] (03PS1) 10Elukey: redfish: add the add_account function [software/spicerack] - 10https://gerrit.wikimedia.org/r/1052311 (https://phabricator.wikimedia.org/T365372) [14:27:33] (03CR) 10Alexandros Kosiaris: [V:03+2 C:03+2] Revert "Resurrect fluent-bit image" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1051407 (https://phabricator.wikimedia.org/T251812) (owner: 10Alexandros Kosiaris) [14:33:22] (03CR) 10CI reject: [V:04-1] redfish: add the add_account function [software/spicerack] - 10https://gerrit.wikimedia.org/r/1052311 (https://phabricator.wikimedia.org/T365372) (owner: 10Elukey) [14:37:06] 10SRE-swift-storage, 10API Platform, 06Commons, 10MediaWiki-File-management, and 3 others: Commons: UploadChunkFileException: Error storing file: backend-fail-internal; local-swift-codfw - https://phabricator.wikimedia.org/T328872#9956803 (10GPSLeo) I am currently again experiencing lots of different error... 
[14:38:55] (03PS1) 10Alexandros Kosiaris: api-gateway: Remove eventgate logging support [deployment-charts] - 10https://gerrit.wikimedia.org/r/1052314 (https://phabricator.wikimedia.org/T251812) [14:39:16] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:39:39] (03CR) 10CI reject: [V:04-1] api-gateway: Remove eventgate logging support [deployment-charts] - 10https://gerrit.wikimedia.org/r/1052314 (https://phabricator.wikimedia.org/T251812) (owner: 10Alexandros Kosiaris) [14:45:34] RESOLVED: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:53:48] 06SRE, 06Traffic: Migrate DNS depooling of sites from operation/dns (git) to confctl - https://phabricator.wikimedia.org/T369366 (10ssingh) 03NEW [14:55:06] (03PS2) 10Ssingh: conftool/cli: add option to log actions with a reason string [software/conftool] - 10https://gerrit.wikimedia.org/r/1052307 (https://phabricator.wikimedia.org/T369366) [14:55:47] (03CR) 10Ssingh: "No code change since last review (thanks for that @akosiaris@wikimedia.org!). Commit message updated." 
[software/conftool] - 10https://gerrit.wikimedia.org/r/1052307 (https://phabricator.wikimedia.org/T369366) (owner: 10Ssingh) [14:56:18] 06SRE, 06Traffic, 13Patch-For-Review: Migrate DNS depooling of sites from operation/dns (git) to confctl - https://phabricator.wikimedia.org/T369366#9956889 (10ssingh) p:05Triage→03Medium [14:56:39] 06SRE, 06Traffic, 13Patch-For-Review: Migrate DNS depooling of sites from operations/dns (git) to confctl - https://phabricator.wikimedia.org/T369366#9956890 (10ssingh) [14:57:39] (03PS3) 10Filippo Giunchedi: mariadb: recording rules to monitor [puppet] - 10https://gerrit.wikimedia.org/r/1050376 (https://phabricator.wikimedia.org/T367283) (owner: 10Arnaudb) [14:58:03] (03CR) 10Filippo Giunchedi: [C:03+1] "I've tweaked the last PS slightly and pushed a new one, LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1050376 (https://phabricator.wikimedia.org/T367283) (owner: 10Arnaudb) [14:59:16] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:59:34] (03PS2) 10Alexandros Kosiaris: api-gateway: Remove eventgate logging support [deployment-charts] - 10https://gerrit.wikimedia.org/r/1052314 (https://phabricator.wikimedia.org/T251812) [15:00:11] (03CR) 10Hnowlan: [C:04-1] api-gateway: Remove eventgate logging support (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1052314 (https://phabricator.wikimedia.org/T251812) (owner: 10Alexandros Kosiaris) [15:00:21] (03CR) 10CI reject: [V:04-1] api-gateway: Remove eventgate logging support [deployment-charts] - 10https://gerrit.wikimedia.org/r/1052314 (https://phabricator.wikimedia.org/T251812) (owner: 10Alexandros Kosiaris) [15:01:04] (03CR) 10Alexandros Kosiaris: api-gateway: Remove eventgate logging support (031 comment) [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1052314 (https://phabricator.wikimedia.org/T251812) (owner: 10Alexandros Kosiaris) [15:02:19] 06SRE, 06Traffic, 13Patch-For-Review: Migrate DNS depooling of sites from operations/dns (git) to confctl - https://phabricator.wikimedia.org/T369366#9956911 (10ssingh) [15:03:00] (03PS3) 10Alexandros Kosiaris: api-gateway: Remove eventgate logging support [deployment-charts] - 10https://gerrit.wikimedia.org/r/1052314 (https://phabricator.wikimedia.org/T251812) [15:03:10] (03CR) 10Alexandros Kosiaris: api-gateway: Remove eventgate logging support (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1052314 (https://phabricator.wikimedia.org/T251812) (owner: 10Alexandros Kosiaris) [15:03:48] (03CR) 10CI reject: [V:04-1] api-gateway: Remove eventgate logging support [deployment-charts] - 10https://gerrit.wikimedia.org/r/1052314 (https://phabricator.wikimedia.org/T251812) (owner: 10Alexandros Kosiaris) [15:04:58] (03CR) 10Elukey: [C:04-1] "Still WIP with tests, it's Friday and I choose to avoid fighting with Pytests :D" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1052311 (https://phabricator.wikimedia.org/T365372) (owner: 10Elukey) [15:28:48] (03PS4) 10Alexandros Kosiaris: api-gateway: Remove eventgate logging support [deployment-charts] - 10https://gerrit.wikimedia.org/r/1052314 (https://phabricator.wikimedia.org/T251812) [15:29:46] (03CR) 10CI reject: [V:04-1] api-gateway: Remove eventgate logging support [deployment-charts] - 10https://gerrit.wikimedia.org/r/1052314 (https://phabricator.wikimedia.org/T251812) (owner: 10Alexandros Kosiaris) [15:34:58] (03PS5) 10Alexandros Kosiaris: api-gateway: Remove eventgate logging support [deployment-charts] - 10https://gerrit.wikimedia.org/r/1052314 (https://phabricator.wikimedia.org/T251812) [15:35:43] PROBLEM - Swift https backend on ms-fe1014 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 244 bytes in 0.421 second response time 
https://wikitech.wikimedia.org/wiki/Swift [15:36:47] RECOVERY - Swift https backend on ms-fe1014 is OK: HTTP OK: HTTP/1.1 200 OK - 501 bytes in 0.096 second response time https://wikitech.wikimedia.org/wiki/Swift [15:37:07] (03CR) 10Alexandros Kosiaris: [C:03+2] api-gateway: Remove eventgate logging support [deployment-charts] - 10https://gerrit.wikimedia.org/r/1052314 (https://phabricator.wikimedia.org/T251812) (owner: 10Alexandros Kosiaris) [15:37:11] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [15:38:08] (03Merged) 10jenkins-bot: api-gateway: Remove eventgate logging support [deployment-charts] - 10https://gerrit.wikimedia.org/r/1052314 (https://phabricator.wikimedia.org/T251812) (owner: 10Alexandros Kosiaris) [15:55:29] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 179 probes of 733 (alerts on 90) - https://atlas.ripe.net/measurements/32390541/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [15:56:24] (03PS1) 10Kamila Součková: Add $wgMaxShellWallClockTime setting for shellbox [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1052325 [15:57:05] (03CR) 10CI reject: [V:04-1] Add $wgMaxShellWallClockTime setting for shellbox [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1052325 (owner: 10Kamila Součková) [16:00:07] (03PS2) 10Kamila Součková: Add $wgMaxShellWallClockTime setting for shellbox [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1052325 (https://phabricator.wikimedia.org/T356241) [16:10:10] (03CR) 10Hnowlan: [C:04-1] Add $wgMaxShellWallClockTime setting for shellbox (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1052325 (https://phabricator.wikimedia.org/T356241) (owner: 10Kamila Součková) 
[16:10:23] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 90 probes of 733 (alerts on 90) - https://atlas.ripe.net/measurements/32390541/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [16:16:57] (03CR) 10Kamila Součková: Add $wgMaxShellWallClockTime setting for shellbox (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1052325 (https://phabricator.wikimedia.org/T356241) (owner: 10Kamila Součková) [16:17:21] (03PS3) 10Kamila Součková: Add $wgMaxShellWallClockTime setting for shellbox [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1052325 (https://phabricator.wikimedia.org/T356241) [16:28:47] PROBLEM - Swift https backend on ms-fe1014 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 244 bytes in 1.059 second response time https://wikitech.wikimedia.org/wiki/Swift [16:31:47] RECOVERY - Swift https backend on ms-fe1014 is OK: HTTP OK: HTTP/1.1 200 OK - 501 bytes in 0.073 second response time https://wikitech.wikimedia.org/wiki/Swift [16:39:15] 06SRE, 10SRE-Access-Requests: Requesting access to stewards-users for JJMC89 - https://phabricator.wikimedia.org/T369314#9957098 (10Urbanecm) Hi @Volans, I see the group approval field was checked, but the WMF sponsor one is not checked. Is it okay for me (the group approver) to also act as the sponsor? Or do... 
[16:40:00] (03CR) 10Hnowlan: [C:03+1] Add $wgMaxShellWallClockTime setting for shellbox [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1052325 (https://phabricator.wikimedia.org/T356241) (owner: 10Kamila Součková) [17:00:01] !log andrewtavis-wmde@deploy1002 Started deploy [airflow-dags/wmde@73c6618]: (no justification provided) [17:00:08] !log andrewtavis-wmde@deploy1002 Finished deploy [airflow-dags/wmde@73c6618]: (no justification provided) (duration: 00m 06s) [17:03:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1246 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P65871 and previous config saved to /var/cache/conftool/dbconfig/20240705-170356-root.json [17:11:32] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2207 (T352010)', diff saved to https://phabricator.wikimedia.org/P65872 and previous config saved to /var/cache/conftool/dbconfig/20240705-171131-ladsgroup.json [17:11:37] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [17:12:40] 06SRE, 06Infrastructure-Foundations, 10netops: Productionize gnmic network telemetry pipeline - https://phabricator.wikimedia.org/T369384 (10cmooney) 03NEW p:05Triage→03Medium [17:12:42] 06SRE, 06Infrastructure-Foundations, 10netops: Productionize gnmic network telemetry pipeline - https://phabricator.wikimedia.org/T369384#9957289 (10cmooney) [17:16:18] 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Add per-output queue monitoring for Juniper network devices - https://phabricator.wikimedia.org/T326322#9957290 (10cmooney) 05Open→03Resolved I'm going to close this task now, the current gnmic collection is providing what we need i... 
[17:17:59] 06SRE, 06Infrastructure-Foundations, 10netops, 10observability, 10Observability-Metrics: Investigate Junos Prometheus exporter - https://phabricator.wikimedia.org/T333210#9957316 (10cmooney) 05Open→03Resolved Seems like a great tool, but we are going to move forward with pulling these stats using... [17:19:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1246 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P65873 and previous config saved to /var/cache/conftool/dbconfig/20240705-171901-root.json [17:20:57] 06SRE, 06Infrastructure-Foundations, 10netops: Productionize gnmic network telemetry pipeline - https://phabricator.wikimedia.org/T369384#9957330 (10cmooney) [17:26:39] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2207', diff saved to https://phabricator.wikimedia.org/P65874 and previous config saved to /var/cache/conftool/dbconfig/20240705-172639-ladsgroup.json [17:34:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1246 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P65875 and previous config saved to /var/cache/conftool/dbconfig/20240705-173406-root.json [17:40:19] 10SRE-swift-storage, 10API Platform, 06Commons, 10MediaWiki-File-management, and 3 others: Commons: UploadChunkFileException: Error storing file: backend-fail-internal; local-swift-codfw - https://phabricator.wikimedia.org/T328872#9957358 (10Mike_Peel) Also getting errors while uploading: ` The MediaWiki... 
[17:41:47] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2207', diff saved to https://phabricator.wikimedia.org/P65876 and previous config saved to /var/cache/conftool/dbconfig/20240705-174146-ladsgroup.json [17:46:26] FIRING: [2x] SystemdUnitFailed: rsync-deployment_module.service on deploy1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:47:49] FIRING: PuppetFailure: Puppet has failed on deploy1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [17:49:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1246 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P65877 and previous config saved to /var/cache/conftool/dbconfig/20240705-174912-root.json [17:49:29] FIRING: KeyholderUnarmed: 19 unarmed Keyholder key(s) on deploy1003:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [17:56:54] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2207 (T352010)', diff saved to https://phabricator.wikimedia.org/P65878 and previous config saved to /var/cache/conftool/dbconfig/20240705-175653-ladsgroup.json [17:56:57] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [17:58:37] (03PS1) 10Btullis: cephcsi: Grant the provisioner access to the ceph userID secret [deployment-charts] - 10https://gerrit.wikimedia.org/r/1052341 (https://phabricator.wikimedia.org/T327259) [17:59:28] (03PS2) 10Btullis: cephcsi: Grant the provisioner access to the ceph userID secret [deployment-charts] - 10https://gerrit.wikimedia.org/r/1052341 (https://phabricator.wikimedia.org/T327259) [18:00:27] (03PS3) 10Btullis: cephcsi: 
Grant the provisioner access to the ceph userID secret [deployment-charts] - https://gerrit.wikimedia.org/r/1052341 (https://phabricator.wikimedia.org/T327259) [18:04:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1246 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P65879 and previous config saved to /var/cache/conftool/dbconfig/20240705-180417-root.json [18:10:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2157 (T367856)', diff saved to https://phabricator.wikimedia.org/P65880 and previous config saved to /var/cache/conftool/dbconfig/20240705-181020-marostegui.json [18:10:24] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [18:19:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1246 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P65881 and previous config saved to /var/cache/conftool/dbconfig/20240705-181923-root.json [18:19:33] relaying from -tech: issues with swift https://phabricator.wikimedia.org/T328872#9956803 increasingly more 503s over the last several days: https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?orgId=1&viewPanel=37&from=now-30d&to=now-1m [18:24:34] SRE-swift-storage, API Platform, Commons, MediaWiki-File-management, and 3 others: Commons: UploadChunkFileException: Error storing file: backend-fail-internal; local-swift-codfw - https://phabricator.wikimedia.org/T328872#9957425 (TheDJ) https://logstash.wikimedia.org/app/dashboards#/view/AXFV7J... 
[18:25:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2157', diff saved to https://phabricator.wikimedia.org/P65882 and previous config saved to /var/cache/conftool/dbconfig/20240705-182527-marostegui.json [18:28:46] SRE-swift-storage, MediaWiki-Uploading, serviceops: Upload errors due to swift failures, 503s - https://phabricator.wikimedia.org/T369388 (TheDJ) NEW [18:31:03] SRE-swift-storage, MediaWiki-Uploading, serviceops: Upload errors due to swift failures, 503s - https://phabricator.wikimedia.org/T369388#9957443 (TheDJ) p:Triage→Unbreak! [18:31:41] SRE-swift-storage, MediaWiki-Uploading, serviceops: Upload errors due to swift failures, 503s - https://phabricator.wikimedia.org/T369388#9957441 (TheDJ) [18:34:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1246 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P65883 and previous config saved to /var/cache/conftool/dbconfig/20240705-183428-root.json [18:39:14] SRE-swift-storage, MediaWiki-Uploading, serviceops: Upload errors due to swift failures, 503s - https://phabricator.wikimedia.org/T369388#9957445 (TheDJ) [18:40:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2157', diff saved to https://phabricator.wikimedia.org/P65884 and previous config saved to /var/cache/conftool/dbconfig/20240705-184034-marostegui.json [18:43:08] SRE-swift-storage, MediaWiki-Uploading, serviceops: Upload errors due to swift failures, 503s - https://phabricator.wikimedia.org/T369388#9957452 (andrea.denisse) Thanks for reporting the issue, I'm investigating it and I've shared it with fellow SRE's for advice. 
[18:51:51] PROBLEM - Swift https backend on ms-fe1014 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 244 bytes in 1.067 second response time https://wikitech.wikimedia.org/wiki/Swift [18:52:51] RECOVERY - Swift https backend on ms-fe1014 is OK: HTTP OK: HTTP/1.1 200 OK - 501 bytes in 0.062 second response time https://wikitech.wikimedia.org/wiki/Swift [18:55:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2157 (T367856)', diff saved to https://phabricator.wikimedia.org/P65885 and previous config saved to /var/cache/conftool/dbconfig/20240705-185542-marostegui.json [18:55:44] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2171.codfw.wmnet with reason: Maintenance [18:55:45] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [18:55:57] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2171.codfw.wmnet with reason: Maintenance [18:56:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2171 (T367856)', diff saved to https://phabricator.wikimedia.org/P65886 and previous config saved to /var/cache/conftool/dbconfig/20240705-185604-marostegui.json [19:00:38] SRE, serviceops, Data Products (Data Products Sprint 16), Patch-For-Review, Service-deployment-requests: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production - https://phabricator.wikimedia.org/T361835#9957464 (mforns) [19:37:11] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [19:41:53] (CR) Urbanecm: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1052185 (https://phabricator.wikimedia.org/T369322) (owner: Urbanecm) [20:04:51] !log akosiaris@deploy1003 helmfile [staging] START 
helmfile.d/services/api-gateway: apply [20:05:01] !log akosiaris@deploy1003 helmfile [staging] DONE helmfile.d/services/api-gateway: apply [21:20:10] PROBLEM - Swift https frontend on ms-fe1014 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 244 bytes in 1.056 second response time https://wikitech.wikimedia.org/wiki/Swift [21:21:07] RECOVERY - Swift https frontend on ms-fe1014 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.053 second response time https://wikitech.wikimedia.org/wiki/Swift [21:25:10] (PS10) AOkoth: vtrs: upgrade cookbook [cookbooks] - https://gerrit.wikimedia.org/r/1042179 (https://phabricator.wikimedia.org/T366078) [21:45:59] PROBLEM - Swift https backend on ms-fe1014 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 244 bytes in 1.056 second response time https://wikitech.wikimedia.org/wiki/Swift [21:46:41] FIRING: [2x] SystemdUnitFailed: rsync-deployment_module.service on deploy1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:46:59] RECOVERY - Swift https backend on ms-fe1014 is OK: HTTP OK: HTTP/1.1 200 OK - 502 bytes in 0.156 second response time https://wikitech.wikimedia.org/wiki/Swift [21:47:49] FIRING: PuppetFailure: Puppet has failed on deploy1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [21:49:29] FIRING: KeyholderUnarmed: 19 unarmed Keyholder key(s) on deploy1003:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [22:59:13] PROBLEM - Swift https frontend on ms-fe1014 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 244 bytes in 1.057 second response time https://wikitech.wikimedia.org/wiki/Swift [23:00:13] RECOVERY - Swift https 
frontend on ms-fe1014 is OK: HTTP OK: HTTP/1.1 200 OK - 294 bytes in 0.060 second response time https://wikitech.wikimedia.org/wiki/Swift [23:07:31] RECOVERY - MariaDB Replica Lag: s4 on clouddb1019 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [23:37:11] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [23:38:36] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1052357 [23:38:36] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1052357 (owner: TrainBranchBot)