[00:33:13] SRE, SRE-Access-Requests, Movement-Insights, Patch-For-Review: Requesting membership in airflow-analytics-product-admins for hghani - https://phabricator.wikimedia.org/T363360#9743173 (OSefu-WMF) Approved
[01:15:25] (SystemdUnitFailed) firing: (3) docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:10:25] (SystemdUnitFailed) firing: (2) wmf_auto_restart_redis-server.service on idm1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:18:52] (JobUnavailable) firing: Reduced availability for job ncredir in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:38:52] (JobUnavailable) firing: (2) Reduced availability for job ncredir in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:58:52] (JobUnavailable) firing: (2) Reduced availability for job ncredir in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:05:25] (SystemdUnitFailed) firing: (2) docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:09:50] SRE, SRE-Access-Requests,
Movement-Insights, Patch-For-Review: Requesting membership in airflow-analytics-product-admins for hghani - https://phabricator.wikimedia.org/T363360#9743417 (Dzahn) >>! In T363360#9742998, @Hghani wrote: > My contract expiry date is June 30th 2024. I believe the contac...
[04:10:49] SRE, SRE-Access-Requests, Movement-Insights, Patch-For-Review: Requesting membership in airflow-analytics-product-admins for hghani - https://phabricator.wikimedia.org/T363360#9743419 (Dzahn) a:OSefu-WMF→BCornwall
[04:12:56] (CR) Dzahn: [V:+1 C:+1] "has manager approval now" [puppet] - https://gerrit.wikimedia.org/r/1023965 (https://phabricator.wikimedia.org/T363360) (owner: BCornwall)
[04:24:40] (CR) Dzahn: [C:+1] "lgtm. should it be added one of the "misc" aliases for now or something?" [puppet] - https://gerrit.wikimedia.org/r/1023735 (https://phabricator.wikimedia.org/T346935) (owner: Muehlenhoff)
[04:26:27] (RoutinatorRsyncErrors) firing: (2) Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[04:50:24] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2154 (T352010)', diff saved to https://phabricator.wikimedia.org/P61185 and previous config saved to /var/cache/conftool/dbconfig/20240425-045023-ladsgroup.json
[04:50:40] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[05:05:31] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P61186 and previous config saved to /var/cache/conftool/dbconfig/20240425-050531-ladsgroup.json
[05:08:47] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2153 (T352010)', diff saved to https://phabricator.wikimedia.org/P61187 and previous config saved to
/var/cache/conftool/dbconfig/20240425-050845-ladsgroup.json
[05:09:04] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[05:10:25] (SystemdUnitFailed) firing: (2) docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:20:39] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P61188 and previous config saved to /var/cache/conftool/dbconfig/20240425-052038-ladsgroup.json
[05:23:54] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2153', diff saved to https://phabricator.wikimedia.org/P61189 and previous config saved to /var/cache/conftool/dbconfig/20240425-052354-ladsgroup.json
[05:35:46] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2154 (T352010)', diff saved to https://phabricator.wikimedia.org/P61190 and previous config saved to /var/cache/conftool/dbconfig/20240425-053545-ladsgroup.json
[05:35:48] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2161.codfw.wmnet with reason: Maintenance
[05:35:59] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[05:36:01] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2161.codfw.wmnet with reason: Maintenance
[05:36:10] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2161 (T352010)', diff saved to https://phabricator.wikimedia.org/P61191 and previous config saved to /var/cache/conftool/dbconfig/20240425-053608-ladsgroup.json
[05:39:02] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2153', diff saved to https://phabricator.wikimedia.org/P61192 and previous config saved to
/var/cache/conftool/dbconfig/20240425-053901-ladsgroup.json
[05:54:09] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2153 (T352010)', diff saved to https://phabricator.wikimedia.org/P61193 and previous config saved to /var/cache/conftool/dbconfig/20240425-055408-ladsgroup.json
[05:54:11] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2170.codfw.wmnet with reason: Maintenance
[05:54:14] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[05:54:25] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2170.codfw.wmnet with reason: Maintenance
[05:54:32] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2170 (T352010)', diff saved to https://phabricator.wikimedia.org/P61194 and previous config saved to /var/cache/conftool/dbconfig/20240425-055431-ladsgroup.json
[06:00:04] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240425T0600)
[06:00:04] kormat, marostegui, Amir1, and arnaudb: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Primary database switchover deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240425T0600).
[06:04:21] (PoolcounterFullQueues) firing: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:09:21] (PoolcounterFullQueues) resolved: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:10:25] (SystemdUnitFailed) firing: (2) wmf_auto_restart_redis-server.service on idm1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:22:53] (PS1) Muehlenhoff: idm::redis: Fix name for Redis auto restart [puppet] - https://gerrit.wikimedia.org/r/1024092
[06:28:43] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1023838 (https://phabricator.wikimedia.org/T135991) (owner: Muehlenhoff)
[06:29:48] (CR) Slyngshede: idm::redis: Fix name for Redis auto restart (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1024092 (owner: Muehlenhoff)
[06:32:22] (PS2) Muehlenhoff: idm::redis: Fix name for Redis auto restart [puppet] - https://gerrit.wikimedia.org/r/1024092
[06:32:41] (CR) Muehlenhoff: idm::redis: Fix name for Redis auto restart (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1024092 (owner: Muehlenhoff)
[06:34:53] !log uninstalling redis on netbox hosts, it uses the central Redis servers for a while now
[06:34:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:53:13] (PS1) Muehlenhoff:
Enable profile::auto_restarts::service for redis/arclamp [puppet] - https://gerrit.wikimedia.org/r/1024263 (https://phabricator.wikimedia.org/T135991)
[06:58:02] !log installing glibc security updates
[06:58:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:58:52] (JobUnavailable) firing: Reduced availability for job ncredir in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[07:00:04] Amir1 and Urbanecm: May I have your attention please! UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240425T0700)
[07:00:04] No Gerrit patches in the queue for this window AFAICS.
[07:04:22] (PS1) Muehlenhoff: Enable profile::auto_restarts::service for vopsbot [puppet] - https://gerrit.wikimedia.org/r/1024265 (https://phabricator.wikimedia.org/T135991)
[07:08:12] !log jelto@cumin1002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1004.wikimedia.org with reason: Upgrade GitLab Replica to new version
[07:09:24] (PS1) EoghanGaffney: gitlab: Fix rsync includes/excludes for data backup [puppet] - https://gerrit.wikimedia.org/r/1024287 (https://phabricator.wikimedia.org/T361219)
[07:09:50] (CR) Slyngshede: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1024092 (owner: Muehlenhoff)
[07:10:18] (PS1) Muehlenhoff: Enable profile::auto_restarts::service for alertmanager-webhook-logger [puppet] - https://gerrit.wikimedia.org/r/1024288 (https://phabricator.wikimedia.org/T135991)
[07:12:57] (CR) Muehlenhoff: [C:+2] idm::redis: Fix name for Redis auto restart [puppet] - https://gerrit.wikimedia.org/r/1024092 (owner: Muehlenhoff)
[07:14:43] ops-eqiad, SRE, cloud-services-team, Cloud-VPS, DC-Ops: hw troubleshooting: /dev/sdg disk not working properly in cloudcephosd1017.eqiad.wmnet -
https://phabricator.wikimedia.org/T359049#9743683 (dcaro) Open→Resolved The drive is back online and in the cluster 👍
[07:15:10] !log jelto@cumin1002 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1004.wikimedia.org with reason: Upgrade GitLab Replica to new version
[07:18:56] (CR) Muehlenhoff: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1023902 (https://phabricator.wikimedia.org/T362993) (owner: Btullis)
[07:21:15] (CR) Muehlenhoff: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1023965 (https://phabricator.wikimedia.org/T363360) (owner: BCornwall)
[07:27:42] (CR) Muehlenhoff: [C:+2] "No need I think. It's a bit of an outlier." [puppet] - https://gerrit.wikimedia.org/r/1023735 (https://phabricator.wikimedia.org/T346935) (owner: Muehlenhoff)
[07:33:53] !log jmm@cumin2002 START - Cookbook sre.swift.roll-restart-reboot-swift-thanos-proxies rolling restart_daemons on A:thanos-fe
[07:35:25] (SystemdUnitFailed) firing: (2) wmf_auto_restart_redis-server.service on idm1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:38:09] !log jmm@cumin2002 END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-thanos-proxies (exit_code=0) rolling restart_daemons on A:thanos-fe
[07:43:53] (CR) Hashar: logging: do not explicitly set blackhole handler (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1023441 (https://phabricator.wikimedia.org/T228838) (owner: Hashar)
[07:43:55] jouncebot: now
[07:43:55] For the next 0 hour(s) and 16 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240425T0700)
[07:44:13] (CR) Aklapper: [C:+2] Delete "AM" and "PM" translations breaking search [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/1023926
(https://phabricator.wikimedia.org/T363215) (owner: Pppery)
[07:44:16] (CR) TrainBranchBot: [C:+2] "Approved by hashar@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1023441 (https://phabricator.wikimedia.org/T228838) (owner: Hashar)
[07:44:19] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1241.eqiad.wmnet with reason: T362746
[07:44:32] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1241.eqiad.wmnet with reason: T362746
[07:44:35] T362746: Upgrade s4 to MariaDB 10.6 - https://phabricator.wikimedia.org/T362746
[07:45:06] (Merged) jenkins-bot: logging: do not explicitly set blackhole handler [mediawiki-config] - https://gerrit.wikimedia.org/r/1023441 (https://phabricator.wikimedia.org/T228838) (owner: Hashar)
[07:45:18] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depool db1241', diff saved to https://phabricator.wikimedia.org/P61195 and previous config saved to /var/cache/conftool/dbconfig/20240425-074516-arnaudb.json
[07:45:56] !log hashar@deploy1002 Started scap: Backport for [[gerrit:1023441|logging: do not explicitly set blackhole handler (T228838)]]
[07:46:06] T228838: Consider enabling all MW log channels by default for WMF - https://phabricator.wikimedia.org/T228838
[07:47:22] !log arnaudb@cumin1002 START - Cookbook sre.hosts.reimage for host db1241.eqiad.wmnet with OS bookworm
[07:48:43] !log hashar@deploy1002 hashar: Backport for [[gerrit:1023441|logging: do not explicitly set blackhole handler (T228838)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[07:50:55] !log hashar@deploy1002 hashar: Continuing with sync
[07:56:15] !log jmm@cumin2002 START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:swift-fe-codfw
[07:58:22] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1023426 (https://phabricator.wikimedia.org/T360439)
(owner: Btullis)
[07:59:36] !log jmm@cumin2002 END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on A:swift-fe-codfw
[07:59:56] (CR) JMeybohm: [C:+1] wikifeeds: Use mesh modules version enabling IPv6 [deployment-charts] - https://gerrit.wikimedia.org/r/1023824 (https://phabricator.wikimedia.org/T255568) (owner: Alexandros Kosiaris)
[08:01:46] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1241.eqiad.wmnet with reason: host reimage
[08:02:13] !log hashar@deploy1002 Finished scap: Backport for [[gerrit:1023441|logging: do not explicitly set blackhole handler (T228838)]] (duration: 16m 17s)
[08:02:40] T228838: Consider enabling all MW log channels by default for WMF - https://phabricator.wikimedia.org/T228838
[08:02:46] (CR) JMeybohm: [C:+2] modules: Add restrictedSecurityContext to all containers [deployment-charts] - https://gerrit.wikimedia.org/r/1022161 (https://phabricator.wikimedia.org/T362978) (owner: JMeybohm)
[08:02:58] (CR) JMeybohm: [C:+2] New module versions [deployment-charts] - https://gerrit.wikimedia.org/r/1021917 (https://phabricator.wikimedia.org/T362978) (owner: JMeybohm)
[08:03:09] (CR) JMeybohm: [C:+2] eventgate: Update mesh modules [deployment-charts] - https://gerrit.wikimedia.org/r/1019007 (https://phabricator.wikimedia.org/T359423) (owner: JMeybohm)
[08:03:32] (CR) JMeybohm: [C:+2] eventgate-*: Migrate to base.external-services-networkpolicy [deployment-charts] - https://gerrit.wikimedia.org/r/1019018 (https://phabricator.wikimedia.org/T359423) (owner: JMeybohm)
[08:03:37] (CR) JMeybohm: [C:+2] eventgate: Add securityContext for all containers [deployment-charts] - https://gerrit.wikimedia.org/r/1022164 (https://phabricator.wikimedia.org/T362978) (owner: JMeybohm)
[08:03:43] (CR) JMeybohm: [C:+2] _scaffold: Don't include tag in image_name preset responses [deployment-charts] -
https://gerrit.wikimedia.org/r/1021871 (owner: JMeybohm)
[08:04:36] (Merged) jenkins-bot: _scaffold: Don't include tag in image_name preset responses [deployment-charts] - https://gerrit.wikimedia.org/r/1021871 (owner: JMeybohm)
[08:04:38] (Merged) jenkins-bot: New module versions [deployment-charts] - https://gerrit.wikimedia.org/r/1021917 (https://phabricator.wikimedia.org/T362978) (owner: JMeybohm)
[08:04:40] (Merged) jenkins-bot: Fix mcrouter module to work out of the box from scaffold [deployment-charts] - https://gerrit.wikimedia.org/r/1021918 (https://phabricator.wikimedia.org/T355237) (owner: JMeybohm)
[08:04:42] (Merged) jenkins-bot: modules: Add restrictedSecurityContext to all containers [deployment-charts] - https://gerrit.wikimedia.org/r/1022161 (https://phabricator.wikimedia.org/T362978) (owner: JMeybohm)
[08:04:56] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1241.eqiad.wmnet with reason: host reimage
[08:05:49] looking at https://grafana.wikimedia.org/d/000000102/production-logging the logs look fine
[08:06:00] (Merged) jenkins-bot: eventgate: Update mesh modules [deployment-charts] - https://gerrit.wikimedia.org/r/1019007 (https://phabricator.wikimedia.org/T359423) (owner: JMeybohm)
[08:06:03] (Merged) jenkins-bot: eventgate-*: Migrate to base.external-services-networkpolicy [deployment-charts] - https://gerrit.wikimedia.org/r/1019018 (https://phabricator.wikimedia.org/T359423) (owner: JMeybohm)
[08:06:05] (Merged) jenkins-bot: eventgate: Add securityContext for all containers [deployment-charts] - https://gerrit.wikimedia.org/r/1022164 (https://phabricator.wikimedia.org/T362978) (owner: JMeybohm)
[08:11:14] !log jmm@cumin2002 START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:swift-fe-eqiad
[08:14:32] (PS1) Muehlenhoff: Enable profile::auto_restarts::service for kadmind [puppet] -
https://gerrit.wikimedia.org/r/1024333 (https://phabricator.wikimedia.org/T135991)
[08:14:58] (PS2) Muehlenhoff: Enable profile::auto_restarts::service for kadmind [puppet] - https://gerrit.wikimedia.org/r/1024333 (https://phabricator.wikimedia.org/T135991)
[08:15:00] !log jmm@cumin2002 END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on A:swift-fe-eqiad
[08:17:12] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1024333 (https://phabricator.wikimedia.org/T135991) (owner: Muehlenhoff)
[08:19:43] !log jayme@deploy1002 helmfile [staging] START helmfile.d/services/eventgate-main: apply
[08:20:49] !log jayme@deploy1002 helmfile [staging] DONE helmfile.d/services/eventgate-main: apply
[08:21:15] (CR) Jelto: [C:+1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/1024287 (https://phabricator.wikimedia.org/T361219) (owner: EoghanGaffney)
[08:21:43] !log jayme@deploy1002 helmfile [staging] START helmfile.d/services/eventgate-analytics: apply
[08:22:25] !log jayme@deploy1002 helmfile [staging] DONE helmfile.d/services/eventgate-analytics: apply
[08:22:43] !log jayme@deploy1002 helmfile [staging] START helmfile.d/services/eventgate-analytics-external: apply
[08:23:07] !log jayme@deploy1002 helmfile [staging] DONE helmfile.d/services/eventgate-analytics-external: apply
[08:23:21] !log jayme@deploy1002 helmfile [staging] START helmfile.d/services/eventgate-logging-external: apply
[08:23:44] !log jayme@deploy1002 helmfile [staging] DONE helmfile.d/services/eventgate-logging-external: apply
[08:26:03] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1241.eqiad.wmnet with OS bookworm
[08:26:27] (RoutinatorRsyncErrors) firing: (2) Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki -
https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[08:32:54] !log jayme@deploy1002 helmfile [codfw] START helmfile.d/services/eventgate-logging-external: apply
[08:34:14] !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventgate-logging-external: apply
[08:38:13] !log jelto@cumin1002 START - Cookbook sre.gitlab.reboot-runner rolling reboot on A:gitlab-runner
[08:38:33] (PS1) Muehlenhoff: releases: Enable profile::auto_restarts::service for docker/containerd [puppet] - https://gerrit.wikimedia.org/r/1024336 (https://phabricator.wikimedia.org/T135991)
[08:39:32] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1241 (re)pooling @ 5%: Post reimage', diff saved to https://phabricator.wikimedia.org/P61196 and previous config saved to /var/cache/conftool/dbconfig/20240425-083931-arnaudb.json
[08:39:56] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depool db1160', diff saved to https://phabricator.wikimedia.org/P61197 and previous config saved to /var/cache/conftool/dbconfig/20240425-083956-arnaudb.json
[08:40:13] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1160.eqiad.wmnet with reason: T362746
[08:40:26] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1160.eqiad.wmnet with reason: T362746
[08:40:32] T362746: Upgrade s4 to MariaDB 10.6 - https://phabricator.wikimedia.org/T362746
[08:42:21] !log arnaudb@cumin1002 START - Cookbook sre.hosts.reimage for host db1160.eqiad.wmnet with OS bookworm
[08:47:23] !log jmm@cumin2002 START - Cookbook sre.ldap.roll-restart-reboot-replica rolling restart_daemons on A:ldap-replicas-codfw
[08:48:24] !log jmm@cumin2002 END (PASS) - Cookbook sre.ldap.roll-restart-reboot-replica (exit_code=0) rolling restart_daemons on A:ldap-replicas-codfw
[08:50:42] !log jayme@deploy1002 helmfile [eqiad] START helmfile.d/services/eventgate-logging-external: apply
[08:51:37] !log jayme@deploy1002 helmfile [eqiad] DONE
helmfile.d/services/eventgate-logging-external: apply
[08:51:57] (PS1) Muehlenhoff: Enable profile::auto_restarts::service for karapace [puppet] - https://gerrit.wikimedia.org/r/1024338 (https://phabricator.wikimedia.org/T135991)
[08:54:01] !log jayme@deploy1002 helmfile [codfw] START helmfile.d/services/eventgate-main: apply
[08:54:36] !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventgate-main: apply
[08:54:36] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1160.eqiad.wmnet with reason: host reimage
[08:54:37] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1241 (re)pooling @ 10%: Post reimage', diff saved to https://phabricator.wikimedia.org/P61198 and previous config saved to /var/cache/conftool/dbconfig/20240425-085437-arnaudb.json
[08:57:24] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1160.eqiad.wmnet with reason: host reimage
[08:57:46] !log jayme@deploy1002 helmfile [eqiad] START helmfile.d/services/eventgate-main: apply
[08:58:23] !log jayme@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventgate-main: apply
[08:59:06] !log jayme@deploy1002 helmfile [codfw] START helmfile.d/services/eventgate-analytics-external: apply
[09:01:02] !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventgate-analytics-external: apply
[09:02:50] !log btullis@cumin1002 START - Cookbook sre.hosts.reimage for host cephosd1001.eqiad.wmnet with OS bullseye
[09:04:12] !log jayme@deploy1002 helmfile [eqiad] START helmfile.d/services/eventgate-analytics-external: apply
[09:04:45] !log jayme@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics-external: apply
[09:05:50] !log jayme@deploy1002 helmfile [codfw] START helmfile.d/services/eventgate-analytics: apply
[09:06:03] !log jmm@cumin2002 START - Cookbook sre.ldap.roll-restart-reboot-replica rolling restart_daemons on A:ldap-replicas-eqiad
[09:06:51] !log jayme@deploy1002 helmfile [codfw]
DONE helmfile.d/services/eventgate-analytics: apply
[09:07:06] !log jmm@cumin2002 END (PASS) - Cookbook sre.ldap.roll-restart-reboot-replica (exit_code=0) rolling restart_daemons on A:ldap-replicas-eqiad
[09:07:30] (PS2) Btullis: Add server aliases to the cirrus/cfssl proxy config [puppet] - https://gerrit.wikimedia.org/r/1023469 (https://phabricator.wikimedia.org/T360439)
[09:07:31] (PS4) Btullis: Switch relforge certificates from cergen to pki [puppet] - https://gerrit.wikimedia.org/r/1023426 (https://phabricator.wikimedia.org/T360439)
[09:07:31] (PS2) Btullis: Switch elasticsearch::cirrus tlsproxy to pki [puppet] - https://gerrit.wikimedia.org/r/1023813 (https://phabricator.wikimedia.org/T360439)
[09:09:43] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1241 (re)pooling @ 25%: Post reimage', diff saved to https://phabricator.wikimedia.org/P61199 and previous config saved to /var/cache/conftool/dbconfig/20240425-090942-arnaudb.json
[09:10:19] (CR) Btullis: [V:+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2128/co" [puppet] - https://gerrit.wikimedia.org/r/1023426 (https://phabricator.wikimedia.org/T360439) (owner: Btullis)
[09:10:25] (SystemdUnitFailed) firing: (2) docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:12:43] (CR) Hashar: [C:+1] "Yesterday I was wondering why the internal calls to `/rpc/RunSingleJob.php` ended up generating `xff` logs.
Looks like that will nicely sh" [mediawiki-config] - https://gerrit.wikimedia.org/r/1020277 (owner: Hnowlan)
[09:12:49] (CR) Btullis: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2129/co" [puppet] - https://gerrit.wikimedia.org/r/1023813 (https://phabricator.wikimedia.org/T360439) (owner: Btullis)
[09:13:30] !log jmm@cumin2002 START - Cookbook sre.elasticsearch.restart-nginx rolling restart_daemons on A:cloudelastic
[09:15:43] !log jelto@cumin1002 END (PASS) - Cookbook sre.gitlab.reboot-runner (exit_code=0) rolling reboot on A:gitlab-runner
[09:15:52] !log jayme@deploy1002 helmfile [eqiad] START helmfile.d/services/eventgate-analytics: apply
[09:16:32] !log jayme@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics: apply
[09:16:34] !log jelto@cumin1002 START - Cookbook sre.hosts.reboot-single for host gitlab2003.wikimedia.org
[09:17:02] !log jmm@cumin2002 END (PASS) - Cookbook sre.elasticsearch.restart-nginx (exit_code=0) rolling restart_daemons on A:cloudelastic
[09:18:00] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1160.eqiad.wmnet with OS bookworm
[09:19:02] (CR) Btullis: Add server aliases to the cirrus/cfssl proxy config (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1023469 (https://phabricator.wikimedia.org/T360439) (owner: Btullis)
[09:19:55] (CR) Btullis: [V:+1 C:+2] Switch the wcqs tlsproxy to use pki [puppet] - https://gerrit.wikimedia.org/r/1023815 (https://phabricator.wikimedia.org/T360439) (owner: Btullis)
[09:20:16] (CR) Marco Fossati: [C:+1] "Thanks for the heads up @btullis@wikimedia.org, much appreciated. The patch looks good to me."
[deployment-charts] - https://gerrit.wikimedia.org/r/1014660 (https://phabricator.wikimedia.org/T360531) (owner: Btullis)
[09:21:28] !log btullis@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cephosd1001.eqiad.wmnet with reason: host reimage
[09:22:30] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1160 (re)pooling @ 5%: Post reimage', diff saved to https://phabricator.wikimedia.org/P61200 and previous config saved to /var/cache/conftool/dbconfig/20240425-092229-arnaudb.json
[09:22:32] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host gitlab2003.wikimedia.org
[09:22:50] !log jelto@cumin1002 START - Cookbook sre.hosts.reboot-single for host gitlab1003.wikimedia.org
[09:23:01] (CR) Muehlenhoff: [C:+1] Add server aliases to the cirrus/cfssl proxy config (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1023469 (https://phabricator.wikimedia.org/T360439) (owner: Btullis)
[09:24:41] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cephosd1001.eqiad.wmnet with reason: host reimage
[09:24:49] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1241 (re)pooling @ 50%: Post reimage', diff saved to https://phabricator.wikimedia.org/P61201 and previous config saved to /var/cache/conftool/dbconfig/20240425-092448-arnaudb.json
[09:25:41] (CR) Gmodena: T354456: 23 April 2024 update of ruwiki redacted pages (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/1023465 (owner: Htriedman)
[09:29:34] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host gitlab1003.wikimedia.org
[09:29:52] !log jelto@cumin1002 START - Cookbook sre.hosts.reboot-single for host gitlab1004.wikimedia.org
[09:32:56] (CR) Effie Mouzeli: [C:+1] ClusterConfigTest: Add mw-on-k8s specific tests [mediawiki-config] - https://gerrit.wikimedia.org/r/1020280 (owner: Clément Goubert)
[09:36:27] !log jelto@cumin1002 END (PASS) -
Cookbook sre.hosts.reboot-single (exit_code=0) for host gitlab1004.wikimedia.org
[09:37:37] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1160 (re)pooling @ 10%: Post reimage', diff saved to https://phabricator.wikimedia.org/P61202 and previous config saved to /var/cache/conftool/dbconfig/20240425-093735-arnaudb.json
[09:38:21] SRE, Data-Platform-SRE (2024.04.15 - 2024.05.05), Patch-For-Review: Phase out cergen for Search Platform services - https://phabricator.wikimedia.org/T360439#9744050 (BTullis)
[09:38:46] (CR) Btullis: [V:+1 C:+2] Switch wdqs::internal tlsproxy from cergen to pki [puppet] - https://gerrit.wikimedia.org/r/1023825 (https://phabricator.wikimedia.org/T360439) (owner: Btullis)
[09:38:58] (CR) Btullis: [V:+1 C:+2] Switch wdqs::public tlsproxy from cergen to pki [puppet] - https://gerrit.wikimedia.org/r/1023819 (https://phabricator.wikimedia.org/T360439) (owner: Btullis)
[09:39:11] (PS2) Btullis: Switch wdqs::internal tlsproxy from cergen to pki [puppet] - https://gerrit.wikimedia.org/r/1023825 (https://phabricator.wikimedia.org/T360439)
[09:39:54] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1241 (re)pooling @ 75%: Post reimage', diff saved to https://phabricator.wikimedia.org/P61203 and previous config saved to /var/cache/conftool/dbconfig/20240425-093954-arnaudb.json
[09:45:25] (SystemdUnitFailed) firing: (3) docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:51:41] (PS1) WMDE-Fisch: Set conflicting gadget settings for the Cite extension [mediawiki-config] - https://gerrit.wikimedia.org/r/1024345 (https://phabricator.wikimedia.org/T362771)
[09:52:43] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1160 (re)pooling @ 25%: Post reimage', diff saved to https://phabricator.wikimedia.org/P61204 and
previous config saved to /var/cache/conftool/dbconfig/20240425-095242-arnaudb.json [09:54:20] (03PS2) 10WMDE-Fisch: Set conflicting gadget settings for the Cite extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1024345 (https://phabricator.wikimedia.org/T362771) [09:55:00] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1241 (re)pooling @ 100%: Post reimage', diff saved to https://phabricator.wikimedia.org/P61205 and previous config saved to /var/cache/conftool/dbconfig/20240425-095459-arnaudb.json [09:56:29] (03PS1) 10Muehlenhoff: cloudweb: Enable profile::auto_restarts::service for apache/envoy [puppet] - 10https://gerrit.wikimedia.org/r/1024347 (https://phabricator.wikimedia.org/T135991) [09:56:50] (03PS2) 10Muehlenhoff: cloudweb: Enable profile::auto_restarts::service for apache/envoy [puppet] - 10https://gerrit.wikimedia.org/r/1024347 (https://phabricator.wikimedia.org/T135991) [09:59:34] (03PS1) 10Aklapper: Automate quarterly Phabricator data for WMF QLS [puppet] - 10https://gerrit.wikimedia.org/r/1024348 (https://phabricator.wikimedia.org/T362804) [10:00:04] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240425T1000) [10:01:55] (03CR) 10EoghanGaffney: [C:03+1] apt_staging: Enable profile::auto_restarts::service for rsync/nginx/envoy [puppet] - 10https://gerrit.wikimedia.org/r/1023783 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [10:03:30] (03CR) 10Clément Goubert: [C:03+1] redis: use python3-redis to support bullseye [puppet] - 10https://gerrit.wikimedia.org/r/1023954 (https://phabricator.wikimedia.org/T363415) (owner: 10Dzahn) [10:04:11] (03CR) 10Clément Goubert: [C:03+1] Enable profile::auto_restarts::service for vopsbot [puppet] - 10https://gerrit.wikimedia.org/r/1024265 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [10:04:16] (03PS2) 10JMeybohm: Kubernetes: Move use_pki_certs from site to common [puppet] - 
10https://gerrit.wikimedia.org/r/1023856 [10:04:17] (03PS1) 10JMeybohm: etcd::v3: Remove unused template in wrong place [puppet] - 10https://gerrit.wikimedia.org/r/1024349 [10:07:50] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1160 (re)pooling @ 50%: Post reimage', diff saved to https://phabricator.wikimedia.org/P61206 and previous config saved to /var/cache/conftool/dbconfig/20240425-100748-arnaudb.json [10:08:19] (03CR) 10Btullis: [V:03+2 C:03+2] Switch wdqs::internal tlsproxy from cergen to pki [puppet] - 10https://gerrit.wikimedia.org/r/1023825 (https://phabricator.wikimedia.org/T360439) (owner: 10Btullis) [10:09:05] (03PS1) 10Muehlenhoff: builder: Enable profile::auto_restarts::service for rsync [puppet] - 10https://gerrit.wikimedia.org/r/1024350 (https://phabricator.wikimedia.org/T135991) [10:09:46] (03CR) 10Clément Goubert: [C:03+1] deployment_server: add bullseye support, python3 package names [puppet] - 10https://gerrit.wikimedia.org/r/1023955 (https://phabricator.wikimedia.org/T363415) (owner: 10Dzahn) [10:13:02] (03PS1) 10Muehlenhoff: idp-build: Enable profile::auto_restarts::service for rsync [puppet] - 10https://gerrit.wikimedia.org/r/1024351 (https://phabricator.wikimedia.org/T135991) [10:13:14] (03CR) 10Muehlenhoff: [C:03+2] Enable profile::auto_restarts::service for vopsbot [puppet] - 10https://gerrit.wikimedia.org/r/1024265 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [10:14:26] (03CR) 10Muehlenhoff: [C:03+2] apt_staging: Enable profile::auto_restarts::service for rsync/nginx/envoy [puppet] - 10https://gerrit.wikimedia.org/r/1023783 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [10:14:43] 06SRE, 10Data-Platform-SRE (2024.04.15 - 2024.05.05), 13Patch-For-Review: Phase out cergen for Search Platform services - https://phabricator.wikimedia.org/T360439#9744186 (10BTullis) [10:19:12] (03CR) 10Muehlenhoff: [C:04-1] "You can simply remove these, these were added over a decade ago and have no obvious use any 
more even on the current deployment hosts; all" [puppet] - 10https://gerrit.wikimedia.org/r/1023955 (https://phabricator.wikimedia.org/T363415) (owner: 10Dzahn) [10:22:57] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1160 (re)pooling @ 75%: Post reimage', diff saved to https://phabricator.wikimedia.org/P61207 and previous config saved to /var/cache/conftool/dbconfig/20240425-102255-arnaudb.json [10:26:16] (03PS1) 10Muehlenhoff: Enable profile::auto_restarts::service for rsync on stat hosts [puppet] - 10https://gerrit.wikimedia.org/r/1024353 (https://phabricator.wikimedia.org/T135991) [10:26:27] (RoutinatorRsyncErrors) firing: (2) Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [10:28:36] (03PS1) 10Muehlenhoff: an-web: Enable profile::auto_restarts::service for rsync [puppet] - 10https://gerrit.wikimedia.org/r/1024354 (https://phabricator.wikimedia.org/T135991) [10:33:52] (JobUnavailable) firing: (2) Reduced availability for job ncredir in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:34:06] 06SRE, 10Data-Platform-SRE (2024.04.15 - 2024.05.05), 13Patch-For-Review: Phase out cergen for Search Platform services - https://phabricator.wikimedia.org/T360439#9744245 (10BTullis) [10:35:06] (03CR) 10Btullis: [C:03+2] Add server aliases to the cirrus/cfssl proxy config [puppet] - 10https://gerrit.wikimedia.org/r/1023469 (https://phabricator.wikimedia.org/T360439) (owner: 10Btullis) [10:36:21] (03CR) 10Btullis: [C:03+1] "Looks good, thanks." 
[puppet] - 10https://gerrit.wikimedia.org/r/1024354 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [10:36:55] (03CR) 10Btullis: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1024353 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [10:38:02] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1160 (re)pooling @ 100%: Post reimage', diff saved to https://phabricator.wikimedia.org/P61208 and previous config saved to /var/cache/conftool/dbconfig/20240425-103802-arnaudb.json [10:40:25] (SystemdUnitFailed) firing: (3) docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:45:05] (03PS11) 10TChin: Add datasets-config helm chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1016471 (https://phabricator.wikimedia.org/T357434) [10:48:52] (JobUnavailable) firing: (2) Reduced availability for job ncredir in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:50:43] (03PS1) 10Muehlenhoff: netbox::standalone: Enable profile::auto_restarts::service for postgres [puppet] - 10https://gerrit.wikimedia.org/r/1024359 (https://phabricator.wikimedia.org/T135991) [10:53:11] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1024359 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [10:55:05] (03CR) 10Btullis: [C:03+1] "Looks good, thanks." 
[puppet] - 10https://gerrit.wikimedia.org/r/1024338 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [10:58:36] (03PS12) 10TChin: Add datasets-config helm chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1016471 (https://phabricator.wikimedia.org/T357434) [10:59:29] (03PS2) 10Muehlenhoff: netbox::standalone: Enable profile::auto_restarts::service for postgres [puppet] - 10https://gerrit.wikimedia.org/r/1024359 (https://phabricator.wikimedia.org/T135991) [10:59:49] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1024359 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [11:02:56] (03PS1) 10Clément Goubert: sidecar-controller: Bump memory on wikikube [deployment-charts] - 10https://gerrit.wikimedia.org/r/1024362 (https://phabricator.wikimedia.org/T348284) [11:03:37] (03CR) 10TChin: Add datasets-config helm chart (034 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1016471 (https://phabricator.wikimedia.org/T357434) (owner: 10TChin) [11:07:40] (03CR) 10Muehlenhoff: [C:03+2] Enable profile::auto_restarts::service for rsync on stat hosts [puppet] - 10https://gerrit.wikimedia.org/r/1024353 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [11:09:46] (03CR) 10Muehlenhoff: [C:03+2] Enable profile::auto_restarts::service for karapace [puppet] - 10https://gerrit.wikimedia.org/r/1024338 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [11:10:14] !log root@cumin1002 START - Cookbook sre.hosts.reimage for host backup1005.eqiad.wmnet with OS bookworm [11:10:22] 10ops-eqiad, 06SRE, 10Data-Persistence-Backup, 06DC-Ops, 10media-backups: backup1005 crashed - https://phabricator.wikimedia.org/T361087#9744362 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by root@cumin1002 for host backup1005.eqiad.wmnet with OS bookworm [11:11:21] (03CR) 10Muehlenhoff: [C:03+2] an-web: Enable profile::auto_restarts::service for rsync 
[puppet] - 10https://gerrit.wikimedia.org/r/1024354 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [11:12:34] (03CR) 10Hnowlan: [C:03+1] ClusterConfigTest: Add mw-on-k8s specific tests [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1020280 (owner: 10Clément Goubert) [11:15:12] 10ops-eqiad, 06SRE, 10Data-Persistence-Backup, 06DC-Ops, 10media-backups: backup1005 crashed - https://phabricator.wikimedia.org/T361087#9744384 (10jcrespo) Booting failed (PXE): ` PXELINUX 6.03 lwIP 20150819 Copyright (C) 1994-2014 H. Peter Anvin et al Debian 12 (bookworm) amd64 (Wikimedia edition)... [11:15:27] !log root@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host backup1005.eqiad.wmnet with OS bookworm [11:15:35] 10ops-eqiad, 06SRE, 10Data-Persistence-Backup, 06DC-Ops, 10media-backups: backup1005 crashed - https://phabricator.wikimedia.org/T361087#9744386 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by root@cumin1002 for host backup1005.eqiad.wmnet with OS bookworm executed with errors: - b... 
[11:16:27] (RoutinatorRsyncErrors) firing: (2) Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [11:17:28] !log root@cumin1002 START - Cookbook sre.hosts.reimage for host backup1005.eqiad.wmnet with OS bullseye [11:17:34] 10ops-eqiad, 06SRE, 10Data-Persistence-Backup, 06DC-Ops, 10media-backups: backup1005 crashed - https://phabricator.wikimedia.org/T361087#9744387 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by root@cumin1002 for host backup1005.eqiad.wmnet with OS bullseye [11:21:46] (03PS1) 10Stevemunene: datahub: create dse-k8s namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1024365 (https://phabricator.wikimedia.org/T363298) [11:29:13] (03CR) 10JMeybohm: [C:03+1] sidecar-controller: Bump memory on wikikube [deployment-charts] - 10https://gerrit.wikimedia.org/r/1024362 (https://phabricator.wikimedia.org/T348284) (owner: 10Clément Goubert) [11:29:50] (03CR) 10Clément Goubert: [C:03+2] sidecar-controller: Bump memory on wikikube [deployment-charts] - 10https://gerrit.wikimedia.org/r/1024362 (https://phabricator.wikimedia.org/T348284) (owner: 10Clément Goubert) [11:32:11] (03CR) 10JMeybohm: [V:03+1] "PCC SUCCESS (NOOP 8): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2131/console" [puppet] - 10https://gerrit.wikimedia.org/r/1024349 (owner: 10JMeybohm) [11:32:49] (03Merged) 10jenkins-bot: sidecar-controller: Bump memory on wikikube [deployment-charts] - 10https://gerrit.wikimedia.org/r/1024362 (https://phabricator.wikimedia.org/T348284) (owner: 10Clément Goubert) [11:33:18] (03CR) 10JMeybohm: [V:03+1] "PCC: All noop but one canceled (toolsbeta)" [puppet] - 10https://gerrit.wikimedia.org/r/1024349 (owner: 10JMeybohm) [11:35:25] (SystemdUnitFailed) firing: wmf_auto_restart_redis-server.service on 
idm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:36:44] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'. [11:36:59] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'. [11:37:07] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'. [11:37:20] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [11:38:54] (CR) Hnowlan: [C:+1] etcd::v3: Remove unused template in wrong place [puppet] - https://gerrit.wikimedia.org/r/1024349 (owner: JMeybohm) [11:40:08] ops-eqiad, SRE, Data-Persistence-Backup, DC-Ops, media-backups: backup1005 crashed - https://phabricator.wikimedia.org/T361087#9744428 (jcrespo) It booted into bullseye. Sadly data was not available, a partition had been created that didn't correspond with the original partitioning schema. [11:40:19] (CR) Clément Goubert: [C:+1] Kubernetes: Move use_pki_certs from site to common [puppet] - https://gerrit.wikimedia.org/r/1023856 (owner: JMeybohm) [11:41:01] (CR) JMeybohm: [V:+1 C:+2] etcd::v3: Remove unused template in wrong place [puppet] - https://gerrit.wikimedia.org/r/1024349 (owner: JMeybohm) [11:41:06] (CR) JMeybohm: [C:+2] Kubernetes: Move use_pki_certs from site to common [puppet] - https://gerrit.wikimedia.org/r/1023856 (owner: JMeybohm) [11:45:25] (SystemdUnitFailed) resolved: wmf_auto_restart_redis-server.service on idm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:45:44] (CR) Muehlenhoff: [C:+1] "Looks good. One remaining question inline." 
[puppet] - 10https://gerrit.wikimedia.org/r/1019131 (https://phabricator.wikimedia.org/T325398) (owner: 10JHathaway) [11:54:25] (03CR) 10TChin: Add datasets-config helm chart (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1016471 (https://phabricator.wikimedia.org/T357434) (owner: 10TChin) [12:00:04] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240425T1200) [12:00:29] (03CR) 10Hnowlan: [C:03+1] kubernetes: add usernames for commons-impact-analytics to deployment server [puppet] - 10https://gerrit.wikimedia.org/r/1023959 (https://phabricator.wikimedia.org/T361835) (owner: 10Scott French) [12:00:45] (03CR) 10Hnowlan: [C:03+1] admin_ng: add namespace for commons-impact-analytics [deployment-charts] - 10https://gerrit.wikimedia.org/r/1023956 (https://phabricator.wikimedia.org/T361835) (owner: 10Scott French) [12:02:32] !log root@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host backup1005.eqiad.wmnet with OS bullseye [12:02:41] 10ops-eqiad, 06SRE, 10Data-Persistence-Backup, 06DC-Ops, 10media-backups: backup1005 crashed - https://phabricator.wikimedia.org/T361087#9744471 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by root@cumin1002 for host backup1005.eqiad.wmnet with OS bullseye executed with errors: - b... 
[12:03:10] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 3:00:00 on db[2155,2187].codfw.wmnet with reason: T362746 [12:03:24] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on db[2155,2187].codfw.wmnet with reason: T362746 [12:03:44] T362746: Upgrade s4 to MariaDB 10.6 - https://phabricator.wikimedia.org/T362746 [12:04:10] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depool db2155', diff saved to https://phabricator.wikimedia.org/P61211 and previous config saved to /var/cache/conftool/dbconfig/20240425-120409-arnaudb.json [12:05:41] !log arnaudb@cumin1002 START - Cookbook sre.hosts.reimage for host db2155.codfw.wmnet with OS bookworm [12:10:14] (03CR) 10Vgutierrez: [C:03+1] benthos:haproxy_cache: pass root cas file path as envvar [puppet] - 10https://gerrit.wikimedia.org/r/1023879 (https://phabricator.wikimedia.org/T358109) (owner: 10Fabfur) [12:16:27] (03CR) 10Awight: [C:03+1] Set conflicting gadget settings for the Cite extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1024345 (https://phabricator.wikimedia.org/T362771) (owner: 10WMDE-Fisch) [12:20:05] (03PS2) 10Hnowlan: mw-videoscaler: helmfile scaffolding [deployment-charts] - 10https://gerrit.wikimedia.org/r/1020860 (https://phabricator.wikimedia.org/T355292) [12:25:48] (03PS7) 10Hnowlan: shellbox: add PHP + Apache timeout settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/1005139 (https://phabricator.wikimedia.org/T357309) (owner: 10Kamila Součková) [12:25:48] (03CR) 10Hnowlan: "Fixed" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1005139 (https://phabricator.wikimedia.org/T357309) (owner: 10Kamila Součková) [12:26:31] (03PS7) 10Kamila Součková: Create a shellbox deployment for videoscalers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1003446 (https://phabricator.wikimedia.org/T357309) [12:28:02] !log arnaudb@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host db2155.codfw.wmnet 
with OS bookworm [12:29:03] !log arnaudb@cumin1002 START - Cookbook sre.hosts.reimage for host db2155.codfw.wmnet with OS bookworm [12:36:37] 06SRE, 10SRE-tools, 06collaboration-services, 06Infrastructure-Foundations, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619#9744535 (10MoritzMuehlenhoff) [12:38:44] !log root@cumin1002 START - Cookbook sre.hosts.reimage for host backup1005.eqiad.wmnet with OS bullseye [12:38:51] 10ops-eqiad, 06SRE, 10Data-Persistence-Backup, 06DC-Ops, 10media-backups: backup1005 crashed - https://phabricator.wikimedia.org/T361087#9744547 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by root@cumin1002 for host backup1005.eqiad.wmnet with OS bullseye [12:42:59] 06SRE, 06Infrastructure-Foundations, 07Epic: Tracking task for Bullseye migrations in production - https://phabricator.wikimedia.org/T291916#9744568 (10MoritzMuehlenhoff) >>! In T291916#9742764, @Dzahn wrote: > @Muehlenhoff Where does deploy* (deployment_server role both prod and wmcs) fit in? Since we are s... 
[12:44:40] !log arnaudb@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host db2155.codfw.wmnet with OS bookworm [12:47:59] (03Abandoned) 10Urbanecm: Set wgTranslateGroupSynchronizationCache to false explicitly [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657337 (https://phabricator.wikimedia.org/T272428) (owner: 10Urbanecm) [12:50:06] (03PS1) 10JMeybohm: Disable boostrap mode on all k8s etcd clusters [puppet] - 10https://gerrit.wikimedia.org/r/1024395 [12:54:22] (03CR) 10JMeybohm: [V:03+1] "PCC SUCCESS (CORE_DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2132/co" [puppet] - 10https://gerrit.wikimedia.org/r/1024395 (owner: 10JMeybohm) [12:58:45] (03CR) 10Cathal Mooney: magru: add lvs700[1-3] and related configuration (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1023850 (https://phabricator.wikimedia.org/T346722) (owner: 10Ssingh) [13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: That opportune time for a UTC afternoon backport window deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240425T1300). [13:00:05] claime and WMDE-Fisch: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:18] here [13:00:23] here [13:01:17] I can self-deploy, it's a tests patch to mediawiki-config [13:02:01] claime: Could you do mine afterwards as well. 
It's kind of a noop [13:02:40] sure [13:02:45] thx [13:03:06] (CR) TrainBranchBot: [C:+2] "Approved by cgoubert@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1020280 (owner: Clément Goubert) [13:03:48] (Merged) jenkins-bot: ClusterConfigTest: Add mw-on-k8s specific tests [mediawiki-config] - https://gerrit.wikimedia.org/r/1020280 (owner: Clément Goubert) [13:04:07] o/ I'm just here to lurk and learn [13:04:19] !log cgoubert@deploy1002 Started scap: Backport for [[gerrit:1020280|ClusterConfigTest: Add mw-on-k8s specific tests]] [13:04:22] !log arnaudb@cumin1002 START - Cookbook sre.hosts.reimage for host db2155.codfw.wmnet with OS bullseye [13:06:42] I can’t deploy today, sorry [13:06:47] jouncebot: next [13:06:47] In 2 hour(s) and 53 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240425T1600) [13:07:06] !log cgoubert@deploy1002 cgoubert: Backport for [[gerrit:1020280|ClusterConfigTest: Add mw-on-k8s specific tests]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:07:07] (if there’s something very important, I could deploy in the break after the window, once I’m back from lunch :P) [13:07:11] Lucas_WMDE: It's all right, I'll deploy WMDE-Fisch's patch and I think it's all there is [13:07:19] sounds good, thanks! 
[13:07:33] !log cgoubert@deploy1002 cgoubert: Continuing with sync [13:08:52] !log arnaudb@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host db2155.codfw.wmnet with OS bullseye [13:09:13] !log arnaudb@cumin1002 START - Cookbook sre.hosts.reimage for host db2155.codfw.wmnet with OS bookworm [13:13:10] (03PS1) 10Alexandros Kosiaris: Add parsoidtest1001 preseed and site.pp [puppet] - 10https://gerrit.wikimedia.org/r/1024399 (https://phabricator.wikimedia.org/T363399) [13:15:25] (03PS1) 10Alexandros Kosiaris: Switch scandium references to parsoidtest1001 [puppet] - 10https://gerrit.wikimedia.org/r/1024400 (https://phabricator.wikimedia.org/T363399) [13:17:47] (03CR) 10Muehlenhoff: [C:03+2] Add an option to pass the Druid firewall settings compatible with nftables [puppet] - 10https://gerrit.wikimedia.org/r/1023402 (owner: 10Muehlenhoff) [13:18:17] (03PS2) 10Alexandros Kosiaris: Add parsoidtest1001 preseed and site.pp [puppet] - 10https://gerrit.wikimedia.org/r/1024399 (https://phabricator.wikimedia.org/T363399) [13:18:17] (03PS2) 10Alexandros Kosiaris: Switch scandium references to parsoidtest1001 [puppet] - 10https://gerrit.wikimedia.org/r/1024400 (https://phabricator.wikimedia.org/T363399) [13:18:17] (03PS1) 10Alexandros Kosiaris: fix [puppet] - 10https://gerrit.wikimedia.org/r/1024401 [13:18:18] (03PS1) 10Alexandros Kosiaris: Remove scandium from puppet [puppet] - 10https://gerrit.wikimedia.org/r/1024402 (https://phabricator.wikimedia.org/T363402) [13:19:05] (03PS3) 10Alexandros Kosiaris: Add parsoidtest1001 preseed and site.pp [puppet] - 10https://gerrit.wikimedia.org/r/1024399 (https://phabricator.wikimedia.org/T363399) [13:19:05] (03PS3) 10Alexandros Kosiaris: Switch scandium references to parsoidtest1001 [puppet] - 10https://gerrit.wikimedia.org/r/1024400 (https://phabricator.wikimedia.org/T363399) [13:19:05] (03PS2) 10Alexandros Kosiaris: Remove scandium from puppet [puppet] - 10https://gerrit.wikimedia.org/r/1024402 
(https://phabricator.wikimedia.org/T363402) [13:19:13] !log cgoubert@deploy1002 Finished scap: Backport for [[gerrit:1020280|ClusterConfigTest: Add mw-on-k8s specific tests]] (duration: 14m 54s) [13:19:29] WMDE-Fisch: I'll proceed with your patch, do you have something to test? [13:19:47] claime: Nope. Feel free to go on [13:19:53] ack [13:20:34] (03Abandoned) 10Alexandros Kosiaris: fix [puppet] - 10https://gerrit.wikimedia.org/r/1024401 (owner: 10Alexandros Kosiaris) [13:20:50] !log root@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on backup1005.eqiad.wmnet with reason: host reimage [13:21:43] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cgoubert@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1024345 (https://phabricator.wikimedia.org/T362771) (owner: 10WMDE-Fisch) [13:22:30] (03Merged) 10jenkins-bot: Set conflicting gadget settings for the Cite extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1024345 (https://phabricator.wikimedia.org/T362771) (owner: 10WMDE-Fisch) [13:22:59] !log cgoubert@deploy1002 Started scap: Backport for [[gerrit:1024345|Set conflicting gadget settings for the Cite extension (T362771)]] [13:23:17] T362771: Move ReferencePreviews related config flags to Cite's codebase - https://phabricator.wikimedia.org/T362771 [13:23:21] (03PS1) 10Muehlenhoff: druid::broker: Switch to nftables for test_analytics [puppet] - 10https://gerrit.wikimedia.org/r/1024403 [13:23:57] !log root@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on backup1005.eqiad.wmnet with reason: host reimage [13:25:14] WMDE-Fisch: I'm using the opportunity to test if serializing mw-on-k8s deployments is faster than parallel, so it may or may not take a little bit longer to deploy fyi [13:25:42] claime: No worries feel free [13:26:01] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2155.codfw.wmnet with reason: host reimage [13:26:07] !log cgoubert@deploy1002 
cgoubert and wmde-fisch: Backport for [[gerrit:1024345|Set conflicting gadget settings for the Cite extension (T362771)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:26:20] testservers look not broken [13:26:31] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1024403 (owner: 10Muehlenhoff) [13:26:48] continuing [13:26:53] !log cgoubert@deploy1002 cgoubert and wmde-fisch: Continuing with sync [13:28:56] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2155.codfw.wmnet with reason: host reimage [13:30:46] (03CR) 10Bking: [C:03+2] Switch relforge certificates from cergen to pki [puppet] - 10https://gerrit.wikimedia.org/r/1023426 (https://phabricator.wikimedia.org/T360439) (owner: 10Btullis) [13:30:53] (03PS2) 10Muehlenhoff: druid::broker: Switch to firewall::service for test_analytics [puppet] - 10https://gerrit.wikimedia.org/r/1024403 [13:34:36] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1024403 (owner: 10Muehlenhoff) [13:37:41] (03PS3) 10Muehlenhoff: druid::broker: Switch to firewall::service for test_analytics [puppet] - 10https://gerrit.wikimedia.org/r/1024403 [13:38:30] (03CR) 10Btullis: "Thanks Marco. 
The only action that I'd welcome from you or someone on your team is to check that it's still working after I have deployed " [deployment-charts] - 10https://gerrit.wikimedia.org/r/1014660 (https://phabricator.wikimedia.org/T360531) (owner: 10Btullis) [13:41:16] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1024403 (owner: 10Muehlenhoff) [13:43:39] (03PS1) 10JMeybohm: Kubernetes: Drop unused etcd_srv_name [puppet] - 10https://gerrit.wikimedia.org/r/1024406 (https://phabricator.wikimedia.org/T329826) [13:44:20] 06SRE, 10Data-Platform-SRE (2024.04.15 - 2024.05.05), 13Patch-For-Review: Phase out cergen for Search Platform services - https://phabricator.wikimedia.org/T360439#9744746 (10BTullis) [13:44:33] !log cgoubert@deploy1002 Finished scap: Backport for [[gerrit:1024345|Set conflicting gadget settings for the Cite extension (T362771)]] (duration: 21m 33s) [13:44:52] T362771: Move ReferencePreviews related config flags to Cite's codebase - https://phabricator.wikimedia.org/T362771 [13:45:25] Verdict, it takes 50% longer sequentially than sending all deployments at once and letting the k8s scheduler hit the wall and sort it out on its own [13:45:32] Good to know x) [13:45:53] !log root@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - root@cumin1002" [13:47:25] WMDE-Fisch: All done [13:47:41] (03PS1) 10C. Scott Ananian: Turn on ParserMigration extension everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1024407 [13:47:46] !log UTC afternoon backports window closed [13:47:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:48:13] claime: thanks! [13:48:39] 06SRE, 10Data-Platform-SRE (2024.04.15 - 2024.05.05), 13Patch-For-Review: Phase out cergen for Search Platform services - https://phabricator.wikimedia.org/T360439#9744788 (10BTullis) We rolled out the change to relforge. 
It works but the Icinga checks on certificate expiry triggered because they fire on the... [13:49:27] (03PS1) 10C. Scott Ananian: Quiet ParserMigration notice for 30 days after acknowledgement [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1024408 [13:49:43] (03PS4) 10Muehlenhoff: druid::broker: Switch to firewall::service for test_analytics [puppet] - 10https://gerrit.wikimedia.org/r/1024403 [13:50:25] (03CR) 10Ladsgroup: [C:03+1] Quiet ParserMigration notice for 30 days after acknowledgement [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1024408 (owner: 10C. Scott Ananian) [13:50:59] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1024403 (owner: 10Muehlenhoff) [13:54:30] (03CR) 10Subramanya Sastry: [C:03+1] Quiet ParserMigration notice for 30 days after acknowledgement [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1024408 (owner: 10C. Scott Ananian) [13:57:42] (03PS1) 10Muehlenhoff: druid::broker: Switch public workers to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1024409 [13:57:42] (03PS1) 10Muehlenhoff: druid::broker: Switch analytics workers to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1024410 [13:58:03] (03PS1) 10Peter Fischer: Shift writes to SUP, 1st batch [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1024411 (https://phabricator.wikimedia.org/T363475) [13:58:43] (03CR) 10CI reject: [V:04-1] Shift writes to SUP, 1st batch [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1024411 (https://phabricator.wikimedia.org/T363475) (owner: 10Peter Fischer) [14:00:03] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1024409 (owner: 10Muehlenhoff) [14:02:47] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1024410 (owner: 10Muehlenhoff) [14:03:06] (03CR) 10Btullis: [C:03+2] Add server aliases to the cirrus/cfssl proxy config (031 comment) [puppet] - 
10https://gerrit.wikimedia.org/r/1023469 (https://phabricator.wikimedia.org/T360439) (owner: 10Btullis) [14:09:02] 06SRE, 06Infrastructure-Foundations, 07Epic: Tracking task for Bullseye migrations in production - https://phabricator.wikimedia.org/T291916#9744868 (10Dzahn) Ok, fair enough about the tracking task. But don't we still need some kind of task that someone can take to do the actual upgrade work? So all the sub... [14:09:20] (03CR) 10JHathaway: [C:03+1] Enable profile::auto_restarts::service for kadmind [puppet] - 10https://gerrit.wikimedia.org/r/1024333 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [14:09:36] 06SRE, 10LDAP-Access-Requests, 06WMF-NDA-Requests: Grant Access to NDA for lina.farid - https://phabricator.wikimedia.org/T362959#9744869 (10KFrancis) Hi all, I am confirming the NDA is complete. Thanks! [14:10:21] !log root@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - root@cumin1002" [14:10:23] !log root@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host backup1005.eqiad.wmnet with OS bullseye [14:10:31] 10ops-eqiad, 06SRE, 10Data-Persistence-Backup, 06DC-Ops, 10media-backups: backup1005 crashed - https://phabricator.wikimedia.org/T361087#9744883 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by root@cumin1002 for host backup1005.eqiad.wmnet with OS bullseye completed: - backup1005 (... [14:15:06] 06SRE, 06Infrastructure-Foundations, 07Epic: Tracking task for Bullseye migrations in production - https://phabricator.wikimedia.org/T291916#9744889 (10MoritzMuehlenhoff) >>! In T291916#9744868, @Dzahn wrote: > Ok, fair enough about the tracking task. But don't we still need some kind of task that someone ca... 
[14:15:50] !log arnaudb@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host db2155.codfw.wmnet with OS bookworm
[14:16:50] (CR) Herron: Revert "Revert "prometheus: Ensure TLS certificates are provided by CFSSL"" (2 comments) [puppet] - https://gerrit.wikimedia.org/r/1023917 (owner: Andrea Denisse)
[14:21:28] !log arnaudb@cumin1002 START - Cookbook sre.hosts.reimage for host db2155.codfw.wmnet with OS bookworm
[14:25:20] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2170 (T352010)', diff saved to https://phabricator.wikimedia.org/P61212 and previous config saved to /var/cache/conftool/dbconfig/20240425-142520-ladsgroup.json
[14:25:26] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[14:27:41] ops-eqiad, SRE, Cassandra: Degraded RAID on aqs1013 - https://phabricator.wikimedia.org/T362033#9744922 (Eevans) Ok, the rebuild is complete. `lang=sh-session eevans@nyx:~$ ssh aqs1013.eqiad.wmnet -- sudo mdadm --detail /dev/md2 /dev/md2: Version : 1.2 Creation Time : Tue Mar 9 12:5...
[14:29:21] !log installing Java 11 security updates
[14:29:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:40:25] (SystemdUnitFailed) firing: (2) docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:40:28] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2170', diff saved to https://phabricator.wikimedia.org/P61213 and previous config saved to /var/cache/conftool/dbconfig/20240425-144027-ladsgroup.json
[14:40:39] (PS1) Brouberol: idp_test: change mpic-next scheme to http [puppet] - https://gerrit.wikimedia.org/r/1024412
[14:41:19] !log klausman@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching A:ml-cache-codfw: Java 11 security updates - klausman@cumin1002
[14:41:31] (CR) Santiago Faci: [C:+1] "Looks good!" [puppet] - https://gerrit.wikimedia.org/r/1024412 (owner: Brouberol)
[14:41:51] (CR) Brouberol: [C:+2] idp_test: change mpic-next scheme to http [puppet] - https://gerrit.wikimedia.org/r/1024412 (owner: Brouberol)
[14:43:11] (CR) Muehlenhoff: [C:+2] Enable profile::auto_restarts::service for kadmind [puppet] - https://gerrit.wikimedia.org/r/1024333 (https://phabricator.wikimedia.org/T135991) (owner: Muehlenhoff)
[14:44:11] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on db1234.eqiad.wmnet with reason: Host has hardware issues
[14:44:24] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on db1234.eqiad.wmnet with reason: Host has hardware issues
[14:48:52] (JobUnavailable) firing: Reduced availability for job ncredir in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:50:35] ops-eqiad, SRE, Data-Persistence-Backup, DC-Ops, media-backups: backup1005 crashed - https://phabricator.wikimedia.org/T361087#9744977 (cmooney) >>! In T361087#9744384, @jcrespo wrote: > Booting failed (PXE): > ` > PXELINUX 6.03 lwIP 20150819 Copyright (C) 1994-2014 H. Peter Anvin et al > >...
[14:51:30] (PS8) Hashar: wmf-build: always use upstream for git submodules [software/gerrit] (wmf/stable-3.8) - https://gerrit.wikimedia.org/r/1019678
[14:51:36] (CR) Hashar: [C:+2] wmf-build: always use upstream for git submodules [software/gerrit] (wmf/stable-3.8) - https://gerrit.wikimedia.org/r/1019678 (owner: Hashar)
[14:53:02] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 20:00:00 on db2187.codfw.wmnet with reason: Host has hardware issues
[14:53:04] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 20:00:00 on db2187.codfw.wmnet with reason: Host has hardware issues
[14:55:35] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2170', diff saved to https://phabricator.wikimedia.org/P61214 and previous config saved to /var/cache/conftool/dbconfig/20240425-145534-ladsgroup.json
[14:58:59] (Merged) jenkins-bot: wmf-build: always use upstream for git submodules [software/gerrit] (wmf/stable-3.8) - https://gerrit.wikimedia.org/r/1019678 (owner: Hashar)
[14:59:03] !log klausman@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:ml-cache-codfw: Java 11 security updates - klausman@cumin1002
[15:01:44] (CR) Cwhite: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1024400 (https://phabricator.wikimedia.org/T363399) (owner: Alexandros Kosiaris)
[15:03:58] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2155.codfw.wmnet with reason: host reimage
[15:07:01] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2155.codfw.wmnet with reason: host reimage
[15:07:34] !log klausman@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching A:ml-cache-eqiad: Java 11 security updates - klausman@cumin1002
[15:07:34] (CR) Dzahn: Revert "Revert "prometheus: Ensure TLS certificates are provided by CFSSL"" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1023917 (owner: Andrea Denisse)
[15:08:28] (CR) Slyngshede: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1024351 (https://phabricator.wikimedia.org/T135991) (owner: Muehlenhoff)
[15:09:01] (CR) Slyngshede: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1024350 (https://phabricator.wikimedia.org/T135991) (owner: Muehlenhoff)
[15:10:18] (CR) Dzahn: Revert "Revert "prometheus: Ensure TLS certificates are provided by CFSSL"" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1023917 (owner: Andrea Denisse)
[15:10:32] (CR) Muehlenhoff: [C:+2] idp-build: Enable profile::auto_restarts::service for rsync [puppet] - https://gerrit.wikimedia.org/r/1024351 (https://phabricator.wikimedia.org/T135991) (owner: Muehlenhoff)
[15:10:42] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2170 (T352010)', diff saved to https://phabricator.wikimedia.org/P61215 and previous config saved to /var/cache/conftool/dbconfig/20240425-151041-ladsgroup.json
[15:10:45] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2173.codfw.wmnet with reason: Maintenance
[15:10:58] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2173.codfw.wmnet with reason: Maintenance
[15:11:00] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance
[15:11:11] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[15:11:11] ops-eqiad, SRE, DC-Ops, serviceops, Patch-For-Review: Q4:rack/setup/install parsoidtest1001 - https://phabricator.wikimedia.org/T363399#9745051 (MoritzMuehlenhoff) Will parsoidtest1001 be installed with Bullseye? scandium is currently running buster, but all the mediawiki manifests are compat...
[15:11:13] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance
[15:11:23] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2173 (T352010)', diff saved to https://phabricator.wikimedia.org/P61216 and previous config saved to /var/cache/conftool/dbconfig/20240425-151120-ladsgroup.json
[15:12:05] (CR) Dzahn: Revert "Revert "prometheus: Ensure TLS certificates are provided by CFSSL"" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1023917 (owner: Andrea Denisse)
[15:12:13] (CR) Muehlenhoff: [C:+2] builder: Enable profile::auto_restarts::service for rsync [puppet] - https://gerrit.wikimedia.org/r/1024350 (https://phabricator.wikimedia.org/T135991) (owner: Muehlenhoff)
[15:14:08] (CR) Pppery: "This repo doesn't have jenkins-bot configured. If you want to merge this you'll need to manually v+2 and submit." [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/1023926 (https://phabricator.wikimedia.org/T363215) (owner: Pppery)
[15:16:27] (RoutinatorRsyncErrors) firing: (2) Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[15:21:01] (PS1) Muehlenhoff: Remove obsolete certs for wdqs/wcqs [puppet] - https://gerrit.wikimedia.org/r/1024420 (https://phabricator.wikimedia.org/T360439)
[15:21:52] (CR) Dzahn: "fwiw, I think" [puppet] - https://gerrit.wikimedia.org/r/1023917 (owner: Andrea Denisse)
[15:22:21] (PS1) Muehlenhoff: Remove obsolete dummy certs [labs/private] - https://gerrit.wikimedia.org/r/1024421 (https://phabricator.wikimedia.org/T360439)
[15:23:06] SRE, Infrastructure-Foundations, Puppet-Infrastructure, Patch-For-Review, Puppet (Puppet 7.0): Phase out cergen - https://phabricator.wikimedia.org/T357750#9745119 (MoritzMuehlenhoff)
[15:23:29] (CR) Dzahn: Revert "Revert "prometheus: Ensure TLS certificates are provided by CFSSL"" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1023917 (owner: Andrea Denisse)
[15:24:43] jouncebot nowandnext
[15:24:43] No deployments scheduled for the next 0 hour(s) and 35 minute(s)
[15:24:43] In 0 hour(s) and 35 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240425T1600)
[15:25:04] I'm going to run scap sync-world to test some new code.
[15:25:29] !log dancy@deploy1002 Started scap: Testing
[15:26:05] !log klausman@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:ml-cache-eqiad: Java 11 security updates - klausman@cumin1002
[15:27:03] !log dancy@deploy1002 sync-world aborted: Testing (duration: 01m 33s)
[15:27:07] (CR) BCornwall: [C:+2] admin: Move hghani to airflow-analytics-product-admins [puppet] - https://gerrit.wikimedia.org/r/1023965 (https://phabricator.wikimedia.org/T363360) (owner: BCornwall)
[15:29:17] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2155.codfw.wmnet with OS bookworm
[15:29:25] !log dancy@deploy1002 Started scap: Testing
[15:29:36] (CR) JHathaway: "thanks for the review Moritz!" [puppet] - https://gerrit.wikimedia.org/r/1019131 (https://phabricator.wikimedia.org/T325398) (owner: JHathaway)
[15:29:54] (PS11) JHathaway: Postfix profile [puppet] - https://gerrit.wikimedia.org/r/1019131 (https://phabricator.wikimedia.org/T325398)
[15:30:08] ops-eqiad, SRE, Data-Persistence-Backup, DC-Ops, media-backups: backup1005 crashed - https://phabricator.wikimedia.org/T361087#9745155 (jcrespo) >>! In T361087#9744977, @cmooney wrote: >>>! In T361087#9744384, @jcrespo wrote: >> Booting failed (PXE): >> ` >> PXELINUX 6.03 lwIP 20150819 Copyri...
[15:30:11] SRE, SRE-Access-Requests, Movement-Insights, Patch-For-Review: Requesting membership in airflow-analytics-product-admins for hghani - https://phabricator.wikimedia.org/T363360#9745152 (BCornwall) In progress→Resolved This has been merged - Please wait a few minutes for changes to prop...
[15:33:01] (CR) CI reject: [V:-1] Postfix profile [puppet] - https://gerrit.wikimedia.org/r/1019131 (https://phabricator.wikimedia.org/T325398) (owner: JHathaway)
[15:33:47] !log mforns@deploy1002 Started deploy [airflow-dags/analytics@b17acd0]: (no justification provided)
[15:34:00] (CR) Muehlenhoff: "Ship it" [puppet] - https://gerrit.wikimedia.org/r/1019131 (https://phabricator.wikimedia.org/T325398) (owner: JHathaway)
[15:34:14] !log mforns@deploy1002 Finished deploy [airflow-dags/analytics@b17acd0]: (no justification provided) (duration: 00m 27s)
[15:34:36] PROBLEM - Check whether ferm is active by checking the default input chain on mw1362 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[15:36:33] ops-eqiad, SRE, Data-Persistence-Backup, DC-Ops, media-backups: backup1005 crashed - https://phabricator.wikimedia.org/T361087#9745186 (MoritzMuehlenhoff) >>! In T361087#9745154, @jcrespo wrote: >>>! In T361087#9744977, @cmooney wrote: >>>>! In T361087#9744384, @jcrespo wrote: >>> Booting fai...
[15:38:10] !log dancy@deploy1002 Finished scap: Testing (duration: 08m 44s)
[15:38:51] ops-eqiad, SRE, Data-Persistence-Backup, DC-Ops, media-backups: backup1005 crashed - https://phabricator.wikimedia.org/T361087#9745225 (jcrespo) In any case, at this point I 'd prefer to do an in-place upgrade rather than a reimage, given how unreliable a reimage is and how impactful it can b...
[15:38:52] PROBLEM - Check whether ferm is active by checking the default input chain on mw1475 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[15:42:03] (CR) Jsn.sherman: "This looks entirely reasonable, but I'm not sure how to reproduce the original issue (importing the test files in the phabricator task err" [extensions/SecurePoll] (wmf/1.42.0-wmf.24) - https://gerrit.wikimedia.org/r/1014053 (https://phabricator.wikimedia.org/T291821) (owner: Driedmueller)
[15:55:04] (CR) EoghanGaffney: [apt-staging] Package puller updates (5 comments) [puppet] - https://gerrit.wikimedia.org/r/1021948 (owner: EoghanGaffney)
[15:55:22] (PS3) EoghanGaffney: [apt-staging] Package puller updates [puppet] - https://gerrit.wikimedia.org/r/1021948
[15:59:35] (CR) Andrew Bogott: [C:+2] eqiad1 openstack -> version 'bobcat' [puppet] - https://gerrit.wikimedia.org/r/1023480 (https://phabricator.wikimedia.org/T356287) (owner: Andrew Bogott)
[16:00:03] ops-eqiad, SRE, DC-Ops, serviceops, Patch-For-Review: Q4:rack/setup/install parsoidtest1001 - https://phabricator.wikimedia.org/T363399#9745329 (akosiaris)
[16:00:06] jhathaway and rzl: How many deployers does it take to do Puppet request window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240425T1600).
[16:00:06] No Gerrit patches in the queue for this window AFAICS.
[16:00:36] ops-eqiad, SRE, DC-Ops, serviceops, Patch-For-Review: Q4:rack/setup/install parsoidtest1001 - https://phabricator.wikimedia.org/T363399#9745333 (akosiaris)
[16:00:37] SRE, collaboration-services, Wikimedia-Mailing-lists: Migrate Mailman/lists to Bullseye/Bookworm - https://phabricator.wikimedia.org/T331706#9745318 (LSobanski) Updating the host ownership in the Puppet role should also be part of this task.
[16:02:01] ops-eqiad, SRE, DC-Ops, serviceops, Patch-For-Review: Q4:rack/setup/install parsoidtest1001 - https://phabricator.wikimedia.org/T363399#9745335 (akosiaris) >>! In T363399#9745051, @MoritzMuehlenhoff wrote: > Will parsoidtest1001 be installed with Bullseye? scandium is currently running buster...
[16:04:22] SRE-Access-Requests, Fundraising-Backlog: Can we please add our vendor to Google Postmaster Tools - https://phabricator.wikimedia.org/T360907#9745355 (BCornwall) Stalled→Resolved a:BCornwall Thanks!
[16:08:52] RECOVERY - Check whether ferm is active by checking the default input chain on mw1475 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[16:12:14] (CR) Dzahn: [C:+2] "https://puppet-compiler.wmflabs.org/output/1023954/2135/" [puppet] - https://gerrit.wikimedia.org/r/1023954 (https://phabricator.wikimedia.org/T363415) (owner: Dzahn)
[16:21:21] SRE, LDAP-Access-Requests, WMF-NDA-Requests: Grant Access to NDA for lina.farid - https://phabricator.wikimedia.org/T362959#9745404 (BCornwall) Hi, @Lina_Farid_WMDE, thanks for signing that. Could you share your email address so I can get a patch in?
[16:27:34] (PS13) TChin: Add datasets-config helm chart [deployment-charts] - https://gerrit.wikimedia.org/r/1016471 (https://phabricator.wikimedia.org/T357434)
[16:28:14] (CR) TChin: Add datasets-config helm chart (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1016471 (https://phabricator.wikimedia.org/T357434) (owner: TChin)
[16:29:32] (CR) Dzahn: [C:+2] releases: Enable profile::auto_restarts::service for docker/containerd [puppet] - https://gerrit.wikimedia.org/r/1024336 (https://phabricator.wikimedia.org/T135991) (owner: Muehlenhoff)
[16:33:06] (CR) Dzahn: [C:+2] "looks good, tested on release1003 to manually start wmf_auto_restart_docker.service and it restarted docker" [puppet] - https://gerrit.wikimedia.org/r/1024336 (https://phabricator.wikimedia.org/T135991) (owner: Muehlenhoff)
[16:34:15] !log releases1003 - docker and containerd restarted by manually starting wmf_auto_restart services
[16:34:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:34:36] RECOVERY - Check whether ferm is active by checking the default input chain on mw1362 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[16:37:24] (PS2) Dzahn: deployment_server: stop installing python-gitdb, python-git [puppet] - https://gerrit.wikimedia.org/r/1023955 (https://phabricator.wikimedia.org/T363415)
[16:38:31] (CR) Dzahn: "Ok, except this is going to make it harder to get merged since often there is some unobvious use. Just making it happen when the prod serv" [puppet] - https://gerrit.wikimedia.org/r/1023955 (https://phabricator.wikimedia.org/T363415) (owner: Dzahn)
[16:42:00] (CR) Muehlenhoff: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1023955 (https://phabricator.wikimedia.org/T363415) (owner: Dzahn)
[16:42:35] (PS1) Dzahn: deployment_server: stop including redis::client::python [puppet] - https://gerrit.wikimedia.org/r/1024447 (https://phabricator.wikimedia.org/T363415)
[16:57:15] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 39.22% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[16:59:05] (PS1) BCornwall: admin: add Linda Farid to LDAP_only (nda) [puppet] - https://gerrit.wikimedia.org/r/1024449 (https://phabricator.wikimedia.org/T362959)
[17:00:05] bd808: I, the Bot under the Fountain, call upon thee, The Deployer, to do Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240425T1700).
[17:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240425T1700)
[17:02:15] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 38.19% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[17:03:05] (CR) Scott French: "Thanks again for the reviews, Hugh." [dns] - https://gerrit.wikimedia.org/r/1023964 (https://phabricator.wikimedia.org/T361835) (owner: Scott French)
[17:12:18] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2161 (T352010)', diff saved to https://phabricator.wikimedia.org/P61218 and previous config saved to /var/cache/conftool/dbconfig/20240425-171218-ladsgroup.json
[17:12:52] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[17:13:09] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2147.codfw.wmnet with reason: Maintenance
[17:13:22] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2147.codfw.wmnet with reason: Maintenance
[17:13:29] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2147 (T352010)', diff saved to https://phabricator.wikimedia.org/P61219 and previous config saved to /var/cache/conftool/dbconfig/20240425-171329-ladsgroup.json
[17:48:17] (CR) Dzahn: [C:-1] "the UID is linafaridwmde" [puppet] - https://gerrit.wikimedia.org/r/1024449 (https://phabricator.wikimedia.org/T362959) (owner: BCornwall)
[17:50:40] SRE, LDAP-Access-Requests, WMF-NDA-Requests, Patch-For-Review: Grant Access to NDA for lina.farid - https://phabricator.wikimedia.org/T362959#9745845 (Dzahn) Thanks! I added Lina to WMF-NDA in Phabricator for access to private tickets.
[17:57:40] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2161 (T352010)', diff saved to https://phabricator.wikimedia.org/P61222 and previous config saved to /var/cache/conftool/dbconfig/20240425-175739-ladsgroup.json
[17:57:42] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2162.codfw.wmnet with reason: Maintenance
[17:57:55] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2162.codfw.wmnet with reason: Maintenance
[17:57:57] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[17:58:02] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2162 (T352010)', diff saved to https://phabricator.wikimedia.org/P61223 and previous config saved to /var/cache/conftool/dbconfig/20240425-175802-ladsgroup.json
[18:00:05] brennen and dancy: Your horoscope predicts another MediaWiki train - Utc-7 Version deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240425T1800).
[18:03:18] o/ [18:07:20] o/ [18:08:19] !log train 1.43.0-wmf.2 (T361396) status: no current blockers, rolling to group2 [18:08:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:08:43] T361396: 1.43.0-wmf.2 deployment blockers - https://phabricator.wikimedia.org/T361396 [18:08:48] (03PS1) 10TrainBranchBot: group2 wikis to 1.43.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1024464 (https://phabricator.wikimedia.org/T361396) [18:08:49] (03CR) 10TrainBranchBot: [C:03+2] group2 wikis to 1.43.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1024464 (https://phabricator.wikimedia.org/T361396) (owner: 10TrainBranchBot) [18:09:33] (03Merged) 10jenkins-bot: group2 wikis to 1.43.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1024464 (https://phabricator.wikimedia.org/T361396) (owner: 10TrainBranchBot) [18:11:50] (03CR) 10Bernard Wang: "maybe we can have both configs first? and then remove it later?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1023928 (https://phabricator.wikimedia.org/T362808) (owner: 10Bernard Wang) [18:14:14] (03PS3) 10Bernard Wang: Update wgVectorClientPrefs to wgVectorAppearance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1023928 (https://phabricator.wikimedia.org/T362808) [18:15:16] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes1041 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [18:17:38] PROBLEM - Check whether ferm is active by checking the default input chain on mw1467 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [18:18:34] PROBLEM - Check whether ferm is active by checking the default input chain on mw1464 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been 
started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [18:20:22] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes1028 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [18:22:38] !log nshahquinn-wmf@deploy1002 Started deploy [airflow-dags/analytics_product@0e9fd9a]: (no justification provided) [18:22:46] !log nshahquinn-wmf@deploy1002 Finished deploy [airflow-dags/analytics_product@0e9fd9a]: (no justification provided) (duration: 00m 07s) [18:23:46] !log brennen@deploy1002 rebuilt and synchronized wikiversions files: group2 wikis to 1.43.0-wmf.2 refs T361396 [18:24:11] T361396: 1.43.0-wmf.2 deployment blockers - https://phabricator.wikimedia.org/T361396 [18:36:30] (03CR) 10Herron: "I'd suggest updating the commit message to remove revert "revert" for clarity and also to include the related bug (not blocking)" [puppet] - 10https://gerrit.wikimedia.org/r/1023917 (owner: 10Andrea Denisse) [18:38:13] (03PS1) 10Andrew Bogott: cloud-vps network tests: update VM names [puppet] - 10https://gerrit.wikimedia.org/r/1024470 [18:38:38] (03PS3) 10BCornwall: admin: add Lina Farid to LDAP_only (nda) [puppet] - 10https://gerrit.wikimedia.org/r/1024449 (https://phabricator.wikimedia.org/T362959) [18:39:43] (03CR) 10Andrew Bogott: [C:03+2] cloud-vps network tests: update VM names [puppet] - 10https://gerrit.wikimedia.org/r/1024470 (owner: 10Andrew Bogott) [18:40:25] (SystemdUnitFailed) firing: (2) docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:45:16] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes1041 is OK: OK ferm input default policy is set 
https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [18:47:38] RECOVERY - Check whether ferm is active by checking the default input chain on mw1467 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [18:48:19] (03PS1) 10Andrew Bogott: cloud-vps network tests: Fix puppet sync command for puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1024472 [18:48:34] RECOVERY - Check whether ferm is active by checking the default input chain on mw1464 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [18:48:52] (JobUnavailable) firing: Reduced availability for job ncredir in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:49:16] (03PS3) 10Andrea Denisse: prometheus: Ensure TLS certificates are provided by CFSSL [puppet] - 10https://gerrit.wikimedia.org/r/1023917 (https://phabricator.wikimedia.org/T360414) [18:50:06] (03CR) 10Andrew Bogott: [C:03+2] cloud-vps network tests: Fix puppet sync command for puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1024472 (owner: 10Andrew Bogott) [18:50:22] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes1028 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [18:50:54] (03CR) 10Andrea Denisse: "Good idea, I've amended the commit and added info about the invalid envoy config bug, thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1023917 (https://phabricator.wikimedia.org/T360414) (owner: 10Andrea Denisse) [18:51:43] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for YLiou_WMF (no server access) - https://phabricator.wikimedia.org/T363514 (10Isaac) 03NEW [18:54:07] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for YLiou_WMF (no server access) - https://phabricator.wikimedia.org/T363514#9746033 (10Isaac) @YLiou_WMF here's the task -- please sign L3 @Miriam I put this together so Yu-Ming has access to Superset -- could you please approve... [18:59:38] (03CR) 10Hashar: [C:03+1] "That other change 1020958 would not work as it is and I think using a hostname would require a hard restart of the zuul merger due to the " [puppet] - 10https://gerrit.wikimedia.org/r/1020955 (https://phabricator.wikimedia.org/T334517) (owner: 10Dzahn) [19:00:45] (03CR) 10Hashar: [C:03+1] ci: switch contint manager_host from 2002 to 1002 [puppet] - 10https://gerrit.wikimedia.org/r/1020954 (https://phabricator.wikimedia.org/T334517) (owner: 10Dzahn) [19:01:01] (03CR) 10Herron: [C:03+1] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1023917 (https://phabricator.wikimedia.org/T360414) (owner: 10Andrea Denisse) [19:01:23] (03CR) 10Hashar: [C:03+1] switch contint.wikimedia.org from contint2002 to contint1002 [dns] - 10https://gerrit.wikimedia.org/r/1020951 (https://phabricator.wikimedia.org/T334517) (owner: 10Dzahn) [19:02:02] 06SRE: Many search suggestions missing when connecting to eqiad, but not when connecting to codfw - https://phabricator.wikimedia.org/T363516#9746077 (10matmarex) [19:03:14] jouncebot nowandnext [19:03:14] For the next 0 hour(s) and 56 minute(s): MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240425T1800) [19:03:14] In 0 hour(s) and 56 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240425T2000) [19:03:20] !log dancy@deploy1002 Started scap: Testing [19:03:48] (03CR) 10Hashar: [C:03+1] ci: disable zuul merger on contint2002 for migration [puppet] - 10https://gerrit.wikimedia.org/r/1020950 (https://phabricator.wikimedia.org/T334517) (owner: 10Dzahn) [19:04:36] (03CR) 10Brennen Bearnes: [V:03+2] Delete "AM" and "PM" translations breaking search [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1023926 (https://phabricator.wikimedia.org/T363215) (owner: 10Pppery) [19:08:56] PROBLEM - Check whether ferm is active by checking the default input chain on mw1431 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [19:10:30] PROBLEM - Check whether ferm is active by checking the default input chain on mw1383 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [19:10:44] PROBLEM - Check whether ferm is active by checking the default input chain on mw1376 is CRITICAL: ERROR ferm input drop default 
policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [19:11:57] (03CR) 10Hashar: [C:03+1] "If I remember properly that `docker_version` is set to ensure we don't have a sudden upgrade of Docker happening behind the hood. If I re" [puppet] - 10https://gerrit.wikimedia.org/r/1020344 (https://phabricator.wikimedia.org/T334517) (owner: 10Dzahn) [19:12:08] !log dancy@deploy1002 Finished scap: Testing (duration: 08m 47s) [19:12:14] PROBLEM - Check whether ferm is active by checking the default input chain on parse1015 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [19:12:41] (03PS1) 10Andrew Bogott: cloud-vps network tests: update VM name again [puppet] - 10https://gerrit.wikimedia.org/r/1024474 [19:13:04] (03CR) 10Andrew Bogott: [C:03+2] cloud-vps network tests: update VM name again [puppet] - 10https://gerrit.wikimedia.org/r/1024474 (owner: 10Andrew Bogott) [19:16:27] (RoutinatorRsyncErrors) firing: (2) Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [19:19:46] PROBLEM - Host parse1002 is DOWN: PING CRITICAL - Packet loss = 100% [19:24:32] (KubernetesCalicoDown) firing: parse1002.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s&var-instance=parse1002.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [19:25:12] (03CR) 10Hashar: [C:04-1] "Using the hostname is appealing but the ferm syntax will not work in the Zuul ini configuration. 
I think we went with an IP address becaus" [puppet] - https://gerrit.wikimedia.org/r/1020958 (https://phabricator.wikimedia.org/T334517) (owner: Dzahn)
[19:25:24] (PS2) Htriedman: T354456: 23 April 2024 update of ruwiki redacted pages [deployment-charts] - https://gerrit.wikimedia.org/r/1023465
[19:25:44] SRE: Many search suggestions missing when connecting to eqiad, but not when connecting to codfw - https://phabricator.wikimedia.org/T363516#9746152 (EBernhardson) hmm, i can confirm this is happening. The completion index is built new every day in each datacenter. Usually they are the same, but somehow the e...
[19:26:06] (PS3) Htriedman: eventstreams: 23 April 2024 update of ruwiki redacted pages [deployment-charts] - https://gerrit.wikimedia.org/r/1023465 (https://phabricator.wikimedia.org/T354456)
[19:26:34] (CR) Htriedman: "resolved @gmodena@wikimedia.org's comments" [deployment-charts] - https://gerrit.wikimedia.org/r/1023465 (https://phabricator.wikimedia.org/T354456) (owner: Htriedman)
[19:31:22] (PS1) Ebernhardson: cirrus: Shift autocomplete traffic to codfw [mediawiki-config] - https://gerrit.wikimedia.org/r/1024478 (https://phabricator.wikimedia.org/T363516)
[19:32:29] (PS1) Ahmon Dancy: deployment server: Run scap clean auto on a weekly basis [puppet] - https://gerrit.wikimedia.org/r/1024479 (https://phabricator.wikimedia.org/T363519)
[19:32:57] SRE, Discovery-Search (Current work), Patch-For-Review: Many search suggestions missing when connecting to eqiad, but not when connecting to codfw - https://phabricator.wikimedia.org/T363516#9746167 (Gehel)
[19:33:12] !log T363516 started manual rebuild of enwiki titlesuggest indices in eqiad
[19:33:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:33:37] T363516: Many search suggestions missing when connecting to eqiad, but not when connecting to codfw - https://phabricator.wikimedia.org/T363516
[19:35:40] (CR) Muehlenhoff:
"You could use the ipresolve() Puppet function to resolve the FQDN to an IP during Puppet catalogue compilation" [puppet] - https://gerrit.wikimedia.org/r/1020958 (https://phabricator.wikimedia.org/T334517) (owner: Dzahn)
[19:38:56] RECOVERY - Check whether ferm is active by checking the default input chain on mw1431 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[19:40:30] RECOVERY - Check whether ferm is active by checking the default input chain on mw1383 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[19:40:44] RECOVERY - Check whether ferm is active by checking the default input chain on mw1376 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[19:42:14] RECOVERY - Check whether ferm is active by checking the default input chain on parse1015 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[19:44:48] (PS1) Bking: elasticsearch: Configure alerts for short-lived certs [puppet] - https://gerrit.wikimedia.org/r/1024481 (https://phabricator.wikimedia.org/T360439)
[19:45:08] (CR) CI reject: [V:-1] elasticsearch: Configure alerts for short-lived certs [puppet] - https://gerrit.wikimedia.org/r/1024481 (https://phabricator.wikimedia.org/T360439) (owner: Bking)
[19:49:31] (PS2) Bking: elasticsearch: Configure alerts for short-lived certs [puppet] - https://gerrit.wikimedia.org/r/1024481 (https://phabricator.wikimedia.org/T360439)
[19:49:50] (CR) CI reject: [V:-1] elasticsearch: Configure alerts for short-lived certs [puppet] - https://gerrit.wikimedia.org/r/1024481 (https://phabricator.wikimedia.org/T360439) (owner: Bking)
[19:51:42] (PS3) Bking: elasticsearch: Configure alerts for short-lived certs [puppet] - https://gerrit.wikimedia.org/r/1024481 (https://phabricator.wikimedia.org/T360439)
[19:53:22] (PS4) Bking:
elasticsearch: Configure alerts for short-lived certs [puppet] - https://gerrit.wikimedia.org/r/1024481 (https://phabricator.wikimedia.org/T360439)
[19:57:09] (PS5) Bking: elasticsearch: Configure alerts for short-lived certs [puppet] - https://gerrit.wikimedia.org/r/1024481 (https://phabricator.wikimedia.org/T360439)
[19:58:05] (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1024481 (https://phabricator.wikimedia.org/T360439) (owner: Bking)
[20:00:04] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: It is that lovely time of the day again! You are hereby commanded to deploy UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240425T2000).
[20:00:05] No Gerrit patches in the queue for this window AFAICS.
[20:00:46] (CR) Andrea Denisse: [V:+1] "PCC SUCCESS (CORE_DIFF 4 NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - https://gerrit.wikimedia.org/r/1023917 (https://phabricator.wikimedia.org/T360414) (owner: Andrea Denisse)
[20:20:48] SRE, Discovery-Search (Current work), Patch-For-Review: Many search suggestions missing when connecting to eqiad, but not when connecting to codfw - https://phabricator.wikimedia.org/T363516#9746317 (EBernhardson) Decided against shuffling traffic, rebuild is almost compete already for enwiki. I can...
[20:28:29] (Abandoned) Ebernhardson: cirrus: Shift autocomplete traffic to codfw [mediawiki-config] - https://gerrit.wikimedia.org/r/1024478 (https://phabricator.wikimedia.org/T363516) (owner: Ebernhardson)
[20:43:33] ACKNOWLEDGEMENT - MD RAID on aqs1014 is CRITICAL: CRITICAL: State: degraded, Active: 11, Working: 12, Failed: 0, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T363522 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[20:43:37] ops-eqiad, SRE: Degraded RAID on aqs1014 - https://phabricator.wikimedia.org/T363522 (ops-monitoring-bot) NEW
[20:43:41] (PS4) Bernard Wang: Update wgVectorClientPrefs to wgVectorAppearance [mediawiki-config] - https://gerrit.wikimedia.org/r/1023928 (https://phabricator.wikimedia.org/T362808)
[20:51:36] PROBLEM - Router interfaces on mr1-eqsin is CRITICAL: CRITICAL: No response from remote host 103.102.166.128 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[20:54:40] RECOVERY - Router interfaces on mr1-eqsin is OK: OK: host 103.102.166.128, interfaces up: 37, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[21:05:56] PROBLEM - Disk space on prometheus1006 is CRITICAL: DISK CRITICAL - free space: /srv/prometheus/k8s 48933 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=prometheus1006&var-datasource=eqiad+prometheus/ops
[21:17:46] PROBLEM - Router interfaces on mr1-eqsin is CRITICAL: CRITICAL: No response from remote host 103.102.166.128 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[21:31:57] (CR) Thcipriani: [C:+1] deployment server: Run scap clean auto on a weekly basis [puppet] -
https://gerrit.wikimedia.org/r/1024479 (https://phabricator.wikimedia.org/T363519) (owner: Ahmon Dancy)
[21:34:58] RECOVERY - Router interfaces on mr1-eqsin is OK: OK: host 103.102.166.128, interfaces up: 37, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[21:43:01] (PS1) Ayounsi: magru: update edgeuno transit IP [homer/public] - https://gerrit.wikimedia.org/r/1024516 (https://phabricator.wikimedia.org/T362421)
[21:47:18] (PS1) Fabfur: Revert "hiera: buffer memory limit increase for cp4037" [puppet] - https://gerrit.wikimedia.org/r/1024488
[21:48:25] (SystemdUnitFailed) firing: wmf_auto_restart_prometheus-redis-exporter@6380.service on netbox2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:58:30] SRE, Discovery-Search (Current work), Patch-For-Review: Many search suggestions missing when connecting to eqiad, but not when connecting to codfw - https://phabricator.wikimedia.org/T363516#9746562 (matmarex) There's someone reporting that they're still not seeing the expected results for some queri...
[21:58:59] SRE, Discovery-Search (Current work), Patch-For-Review: Many search suggestions missing when connecting to eqiad, but not when connecting to codfw - https://phabricator.wikimedia.org/T363516#9746563 (matmarex) Never mind, they just said it's fixed :)
[22:05:56] PROBLEM - Disk space on prometheus1006 is CRITICAL: DISK CRITICAL - free space: /srv/prometheus/k8s 50580 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=prometheus1006&var-datasource=eqiad+prometheus/ops
[22:13:20] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[22:13:30] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[22:14:10] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8616 bytes in 0.267 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[22:14:22] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 51783 bytes in 0.115 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[22:15:50] Looks like prometheus1006 has been filling up rapidly since april 3
[22:20:11] Considering it seems to be going through ~30 GiB a day I'm going to increase the LV +60G to buy some more time
[22:21:48] Following the playbook, I'll be adding 60G to all in the site, so prometheus1005 and prometheus1006
[22:23:14] !log Extend prometheus1005 and prometheus1006 logical volume by an extra 60G due to disk filling up
[22:23:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:25:56] RECOVERY - Disk space on prometheus1006 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space
https://grafana.wikimedia.org/d/000000377/host-overview?var-server=prometheus1006&var-datasource=eqiad+prometheus/ops
[22:26:39] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2147 (T352010)', diff saved to https://phabricator.wikimedia.org/P61227 and previous config saved to /var/cache/conftool/dbconfig/20240425-222638-ladsgroup.json
[22:27:10] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[22:40:26] (SystemdUnitFailed) firing: (2) docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:41:47] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P61228 and previous config saved to /var/cache/conftool/dbconfig/20240425-224146-ladsgroup.json
[22:43:24] (CR) Dzahn: [C:+1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/1024449 (https://phabricator.wikimedia.org/T362959) (owner: BCornwall)
[22:48:52] (JobUnavailable) firing: Reduced availability for job ncredir in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[22:56:27] PROBLEM - Router interfaces on mr1-eqsin is CRITICAL: CRITICAL: No response from remote host 103.102.166.128 for 1.3.6.1.2.1.2.2.1.7 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[22:56:54] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P61229 and previous config saved to /var/cache/conftool/dbconfig/20240425-225654-ladsgroup.json
[22:58:47] (Abandoned) Zabe: Revert "Enable VE on new wikis"
[mediawiki-config] - https://gerrit.wikimedia.org/r/921565 (owner: Naif212)
[22:59:30] RECOVERY - Router interfaces on mr1-eqsin is OK: OK: host 103.102.166.128, interfaces up: 37, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[23:06:34] PROBLEM - Router interfaces on mr1-eqsin is CRITICAL: CRITICAL: No response from remote host 103.102.166.128 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[23:11:21] (PS1) Andrew Bogott: labtesthorizon: advance to 2024-04-25-225100-dev [puppet] - https://gerrit.wikimedia.org/r/1024523
[23:11:36] PROBLEM - Router interfaces on mr1-eqsin is CRITICAL: CRITICAL: No response from remote host 103.102.166.128 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[23:12:01] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2147 (T352010)', diff saved to https://phabricator.wikimedia.org/P61230 and previous config saved to /var/cache/conftool/dbconfig/20240425-231201-ladsgroup.json
[23:12:22] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[23:13:38] RECOVERY - Router interfaces on mr1-eqsin is OK: OK: host 103.102.166.128, interfaces up: 37, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[23:16:27] (RoutinatorRsyncErrors) firing: (2) Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[23:24:33] (KubernetesCalicoDown) firing: parse1002.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations -
https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s&var-instance=parse1002.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[23:26:38] PROBLEM - Router interfaces on mr1-eqsin is CRITICAL: CRITICAL: No response from remote host 103.102.166.128 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[23:28:44] RECOVERY - Router interfaces on mr1-eqsin is OK: OK: host 103.102.166.128, interfaces up: 37, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[23:38:38] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1023541
[23:38:38] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1023541 (owner: TrainBranchBot)
[23:52:58] (CR) Andrew Bogott: [C:+2] labtesthorizon: advance to 2024-04-25-225100-dev [puppet] - https://gerrit.wikimedia.org/r/1024523 (owner: Andrew Bogott)