[00:05:20] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1043309 (owner: TrainBranchBot)
[00:13:40] PROBLEM - Host an-worker1093 is DOWN: PING CRITICAL - Packet loss = 100%
[00:22:18] RECOVERY - Disk space on thanos-be1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=thanos-be1003&var-datasource=eqiad+prometheus/ops
[00:22:44] RECOVERY - Disk space on thanos-be1004 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=thanos-be1004&var-datasource=eqiad+prometheus/ops
[00:29:46] FIRING: [2x] Storage /var over 50%: Alert for device lsw1-f5-eqiad.mgmt.eqiad.wmnet - Storage /var over 50% - https://alerts.wikimedia.org/?q=alertname%3DStorage+%2Fvar+over+50%25
[00:36:24] RECOVERY - Disk space on thanos-be2002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=thanos-be2002&var-datasource=codfw+prometheus/ops
[00:41:07] (PS2) Scott French: mediawiki: add securityContext to all containers [deployment-charts] - https://gerrit.wikimedia.org/r/1042440 (https://phabricator.wikimedia.org/T362978)
[00:41:50] RECOVERY - Disk space on thanos-be2004 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=thanos-be2004&var-datasource=codfw+prometheus/ops
[00:47:18] PROBLEM - MariaDB Replica Lag: s1 on clouddb1017 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 305.20 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[00:49:43] (CR) Scott French: "Thanks, Janis!" [deployment-charts] - https://gerrit.wikimedia.org/r/1042440 (https://phabricator.wikimedia.org/T362978) (owner: Scott French)
[00:55:47] FIRING: JobUnavailable: Reduced availability for job ganeti in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[00:56:04] RECOVERY - Disk space on thanos-be1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=thanos-be1001&var-datasource=eqiad+prometheus/ops
[00:57:04] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:57:08] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:59:00] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8617 bytes in 5.817 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:59:00] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 52067 bytes in 3.404 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:05:18] RECOVERY - MariaDB Replica Lag: s1 on clouddb1017 is OK: OK slave_sql_lag Replication lag: 55.32 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[01:05:20] RECOVERY - Disk space on thanos-be2003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=thanos-be2003&var-datasource=codfw+prometheus/ops
[01:06:57] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1229.eqiad.wmnet with reason: Maintenance
[01:07:10] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1229.eqiad.wmnet with reason: Maintenance
[01:07:17] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1229 (T352010)', diff saved to https://phabricator.wikimedia.org/P64896 and previous config saved to /var/cache/conftool/dbconfig/20240614-010717-ladsgroup.json
[01:07:21] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[01:34:43] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[02:53:15] (PS2) Phedenskog: wmftest: Remove old performance team setup. [dns] - https://gerrit.wikimedia.org/r/1042919 (https://phabricator.wikimedia.org/T366669)
[02:53:48] (CR) Phedenskog: [C:+1] "This is good to go now from my end." [dns] - https://gerrit.wikimedia.org/r/1042919 (https://phabricator.wikimedia.org/T366669) (owner: Phedenskog)
[03:05:06] PROBLEM - BGP status on cr1-magru is CRITICAL: BGP CRITICAL - AS12956/IPv4: Connect - Telxius, AS12956/IPv6: Connect - Telxius https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[03:05:14] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[03:05:16] PROBLEM - BFD status on cr2-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[03:05:22] PROBLEM - OSPF status on cr1-magru is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[03:30:44] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 26283328 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[03:31:46] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 79040 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[03:36:50] RECOVERY - Host an-worker1093 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms
[03:39:44] !log cdobbins@cumin1002 END (PASS) - Cookbook sre.cdn.roll-reboot (exit_code=0) rolling reboot on A:cp-text_eqsin
[03:39:58] !log cdobbins@cumin1002 END (PASS) - Cookbook sre.cdn.roll-reboot (exit_code=0) rolling reboot on A:cp-upload_eqsin
[04:28:00] RECOVERY - OSPF status on cr1-magru is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[04:28:14] RECOVERY - BGP status on cr1-magru is OK: BGP OK - up: 13, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[04:28:16] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[04:28:16] RECOVERY - BFD status on cr2-eqiad is OK: UP: 25 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[04:29:46] FIRING: [2x] Storage /var over 50%: Alert for device lsw1-f5-eqiad.mgmt.eqiad.wmnet - Storage /var over 50% - https://alerts.wikimedia.org/?q=alertname%3DStorage+%2Fvar+over+50%25
[04:32:14] PROBLEM - BGP status on cr1-magru is CRITICAL: BGP CRITICAL - AS12956/IPv4: Connect - Telxius, AS12956/IPv6: Connect - Telxius https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[04:32:16] PROBLEM - BFD status on cr2-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[04:32:16] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[04:33:02] PROBLEM - OSPF status on cr1-magru is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[04:44:41] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1229 (T352010)', diff saved to https://phabricator.wikimedia.org/P64897 and previous config saved to /var/cache/conftool/dbconfig/20240614-044440-ladsgroup.json
[04:44:46] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[04:48:15] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1158.eqiad.wmnet with reason: Maintenance
[04:48:28] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1158.eqiad.wmnet with reason: Maintenance
[04:48:30] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[04:48:34] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[04:48:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1158 (T364069)', diff saved to https://phabricator.wikimedia.org/P64898 and previous config saved to /var/cache/conftool/dbconfig/20240614-044840-marostegui.json
[04:48:45] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069
[04:51:04] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1161.eqiad.wmnet with reason: Maintenance
[04:51:17] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1161.eqiad.wmnet with reason: Maintenance
[04:51:19] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[04:51:23] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[04:51:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1161 (T367261)', diff saved to https://phabricator.wikimedia.org/P64899 and previous config saved to /var/cache/conftool/dbconfig/20240614-045129-marostegui.json
[04:51:34] T367261: Rebuild recentchanges table everywhere - https://phabricator.wikimedia.org/T367261
[04:52:16] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[04:53:12] RECOVERY - OSPF status on cr1-magru is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[04:53:14] RECOVERY - BGP status on cr1-magru is OK: BGP OK - up: 13, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[04:53:16] RECOVERY - BFD status on cr2-eqiad is OK: UP: 25 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[04:54:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T367261)', diff saved to https://phabricator.wikimedia.org/P64900 and previous config saved to /var/cache/conftool/dbconfig/20240614-045458-marostegui.json
[04:55:47] FIRING: JobUnavailable: Reduced availability for job ganeti in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[04:58:16] PROBLEM - OSPF status on cr1-magru is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[04:58:16] PROBLEM - BGP status on cr1-magru is CRITICAL: BGP CRITICAL - AS12956/IPv6: Connect - Telxius, AS12956/IPv4: Connect - Telxius https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[04:58:18] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[04:58:18] PROBLEM - BFD status on cr2-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[04:59:48] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P64901 and previous config saved to /var/cache/conftool/dbconfig/20240614-045947-ladsgroup.json
[05:00:35] (PS1) Marostegui: db1200: Remove package declaration [puppet] - https://gerrit.wikimedia.org/r/1043482
[05:01:13] (CR) Marostegui: "This is a noop" [puppet] - https://gerrit.wikimedia.org/r/1043482 (owner: Marostegui)
[05:01:14] (CR) Marostegui: [C:+2] db1200: Remove package declaration [puppet] - https://gerrit.wikimedia.org/r/1043482 (owner: Marostegui)
[05:01:16] RECOVERY - BGP status on cr1-magru is OK: BGP OK - up: 13, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[05:01:18] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[05:01:18] RECOVERY - BFD status on cr2-eqiad is OK: UP: 25 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[05:01:24] RECOVERY - OSPF status on cr1-magru is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[05:04:46] FIRING: [2x] Storage /var over 50%: Device lsw1-f5-eqiad.mgmt.eqiad.wmnet recovered from Storage /var over 50% - https://alerts.wikimedia.org/?q=alertname%3DStorage+%2Fvar+over+50%25
[05:10:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P64902 and previous config saved to /var/cache/conftool/dbconfig/20240614-051005-marostegui.json
[05:14:55] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P64903 and previous config saved to /var/cache/conftool/dbconfig/20240614-051454-ladsgroup.json
[05:25:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P64904 and previous config saved to /var/cache/conftool/dbconfig/20240614-052512-marostegui.json
[05:30:02] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1229 (T352010)', diff saved to https://phabricator.wikimedia.org/P64905 and previous config saved to /var/cache/conftool/dbconfig/20240614-053001-ladsgroup.json
[05:30:04] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1233.eqiad.wmnet with reason: Maintenance
[05:30:07] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[05:30:17] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1233.eqiad.wmnet with reason: Maintenance
[05:30:24] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1233 (T352010)', diff saved to https://phabricator.wikimedia.org/P64906 and previous config saved to /var/cache/conftool/dbconfig/20240614-053023-ladsgroup.json
[05:34:43] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[05:35:48] (PS1) Muehlenhoff: thanos: Limit access to swift ring sync to Puppet 7 servers [puppet] - https://gerrit.wikimedia.org/r/1043513 (https://phabricator.wikimedia.org/T365798)
[05:38:46] (PS1) Muehlenhoff: swift::proxy: Limit access to swift ring sync to Puppet 7 servers [puppet] - https://gerrit.wikimedia.org/r/1043516 (https://phabricator.wikimedia.org/T365798)
[05:40:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T367261)', diff saved to https://phabricator.wikimedia.org/P64907 and previous config saved to /var/cache/conftool/dbconfig/20240614-054019-marostegui.json
[05:40:21] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1185.eqiad.wmnet with reason: Maintenance
[05:40:24] T367261: Rebuild recentchanges table everywhere - https://phabricator.wikimedia.org/T367261
[05:40:35] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1185.eqiad.wmnet with reason: Maintenance
[05:40:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1185 (T367261)', diff saved to https://phabricator.wikimedia.org/P64908 and previous config saved to /var/cache/conftool/dbconfig/20240614-054041-marostegui.json
[05:45:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1185 (T367261)', diff saved to https://phabricator.wikimedia.org/P64909 and previous config saved to /var/cache/conftool/dbconfig/20240614-054555-marostegui.json
[05:46:01] T367261: Rebuild recentchanges table everywhere - https://phabricator.wikimedia.org/T367261
[05:51:20] PROBLEM - MariaDB Replica Lag: s1 on clouddb1017 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 410.93 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[05:57:16] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 170 probes of 728 (alerts on 90) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[05:59:40] (PS1) Slyngshede: data.yaml: Extend rkan to 2025-06-28 [puppet] - https://gerrit.wikimedia.org/r/1043533
[06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240614T0600)
[06:01:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1185', diff saved to https://phabricator.wikimedia.org/P64910 and previous config saved to /var/cache/conftool/dbconfig/20240614-060102-marostegui.json
[06:02:16] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 38 probes of 728 (alerts on 90) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[06:04:21] FIRING: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:05:20] RECOVERY - MariaDB Replica Lag: s1 on clouddb1017 is OK: OK slave_sql_lag Replication lag: 45.82 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[06:09:21] RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:13:30] PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 39 probes of 792 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[06:16:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1185', diff saved to https://phabricator.wikimedia.org/P64911 and previous config saved to /var/cache/conftool/dbconfig/20240614-061609-marostegui.json
[06:18:30] RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 29 probes of 792 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[06:21:56] (CR) Muehlenhoff: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1043533 (owner: Slyngshede)
[06:22:34] (CR) Slyngshede: [C:+2] data.yaml: Extend rkan to 2025-06-28 [puppet] - https://gerrit.wikimedia.org/r/1043533 (owner: Slyngshede)
[06:31:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1185 (T367261)', diff saved to https://phabricator.wikimedia.org/P64912 and previous config saved to /var/cache/conftool/dbconfig/20240614-063116-marostegui.json
[06:31:19] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1200.eqiad.wmnet with reason: Maintenance
[06:31:21] T367261: Rebuild recentchanges table everywhere - https://phabricator.wikimedia.org/T367261
[06:31:32] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1200.eqiad.wmnet with reason: Maintenance
[06:31:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1200 (T367261)', diff saved to https://phabricator.wikimedia.org/P64913 and previous config saved to /var/cache/conftool/dbconfig/20240614-063138-marostegui.json
[06:34:46] !log rebalance ganeti/C in eqiad following reboots
[06:34:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:34:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1200 (T367261)', diff saved to https://phabricator.wikimedia.org/P64914 and previous config saved to /var/cache/conftool/dbconfig/20240614-063451-marostegui.json
[06:39:20] PROBLEM - MariaDB Replica Lag: s1 on clouddb1017 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 311.96 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[06:40:34] SRE, CAS-SSO, Infrastructure-Foundations: Update CAS to 7.0 - https://phabricator.wikimedia.org/T367487 (MoritzMuehlenhoff) NEW
[06:41:36] !log dbmaint codfw s1 deploy schema change T367261
[06:41:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:41:40] T367261: Rebuild recentchanges table everywhere - https://phabricator.wikimedia.org/T367261
[06:49:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1200', diff saved to https://phabricator.wikimedia.org/P64915 and previous config saved to /var/cache/conftool/dbconfig/20240614-064958-marostegui.json
[06:53:23] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts ping2003.codfw.wmnet
[06:58:14] !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[07:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240614T0700)
[07:05:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1200', diff saved to https://phabricator.wikimedia.org/P64916 and previous config saved to /var/cache/conftool/dbconfig/20240614-070505-marostegui.json
[07:07:11] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ping2003.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002"
[07:07:23] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on clouddb[1013,1017].eqiad.wmnet,db1154.eqiad.wmnet with reason: Long schema change
[07:07:37] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on clouddb[1013,1017].eqiad.wmnet,db1154.eqiad.wmnet with reason: Long schema change
[07:09:23] (CR) Brouberol: [C:+1] dse-k8s: harmonize airflow user/namespace/db names [puppet] - https://gerrit.wikimedia.org/r/1043277 (https://phabricator.wikimedia.org/T363001) (owner: Bking)
[07:09:55] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ping2003.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002"
[07:09:56] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[07:09:56] (CR) Brouberol: [C:+1] dse-k8s: harmonize airflow user/namespace/db names [deployment-charts] - https://gerrit.wikimedia.org/r/1043275 (https://phabricator.wikimedia.org/T363001) (owner: Bking)
[07:09:57] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ping2003.codfw.wmnet
[07:10:06] SRE, Infrastructure-Foundations, Patch-For-Review: Move the ping* servers to Bookworm - https://phabricator.wikimedia.org/T366695#9891546 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `ping2003.codfw.wmnet` - ping2003.codfw.wmnet (**PASS**) - Downtimed h...
[07:14:21] RECOVERY - MariaDB Replica Lag: s1 on clouddb1017 is OK: OK slave_sql_lag Replication lag: 38.79 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[07:14:59] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts ping1003.eqiad.wmnet
[07:15:45] (CR) Brouberol: dse-k8s-services: Add net-new chart for Airflow (4 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/1041759 (https://phabricator.wikimedia.org/T363001) (owner: Bking)
[07:17:11] (PS1) Muehlenhoff: Remove old ping hosts from site.pp [puppet] - https://gerrit.wikimedia.org/r/1043597 (https://phabricator.wikimedia.org/T366695)
[07:17:21] !log dbmaint eqiad s1 deploy schema change T367261
[07:17:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:17:25] T367261: Rebuild recentchanges table everywhere - https://phabricator.wikimedia.org/T367261
[07:18:29] (CR) Muehlenhoff: [C:+2] Remove old ping hosts from site.pp [puppet] - https://gerrit.wikimedia.org/r/1043597 (https://phabricator.wikimedia.org/T366695) (owner: Muehlenhoff)
[07:19:50] !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[07:20:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1200 (T367261)', diff saved to https://phabricator.wikimedia.org/P64917 and previous config saved to /var/cache/conftool/dbconfig/20240614-072012-marostegui.json
[07:20:15] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1210.eqiad.wmnet with reason: Maintenance
[07:20:28] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1210.eqiad.wmnet with reason: Maintenance
[07:20:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1210 (T367261)', diff saved to https://phabricator.wikimedia.org/P64918 and previous config saved to /var/cache/conftool/dbconfig/20240614-072034-marostegui.json
[07:23:15] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ping1003.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002"
[07:23:55] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1210 (T367261)', diff saved to https://phabricator.wikimedia.org/P64919 and previous config saved to /var/cache/conftool/dbconfig/20240614-072354-marostegui.json
[07:23:59] T367261: Rebuild recentchanges table everywhere - https://phabricator.wikimedia.org/T367261
[07:27:11] (PS1) JMeybohm: Release 4.0.1 [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/1043601
[07:31:07] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ping1003.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002"
[07:31:07] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[07:31:08] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ping1003.eqiad.wmnet
[07:31:22] SRE, Infrastructure-Foundations, Patch-For-Review: Move the ping* servers to Bookworm - https://phabricator.wikimedia.org/T366695#9891578 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `ping1003.eqiad.wmnet` - ping1003.eqiad.wmnet (**PASS**) - Downtimed h...
[07:31:44] (CR) Arnaudb: [C:+1] installer/cephadm: specify a very large maximum size [puppet] - https://gerrit.wikimedia.org/r/1043165 (https://phabricator.wikimedia.org/T279621) (owner: MVernon)
[07:32:11] (PS1) Giuseppe Lavagetto: mw-parsoid: enable statsd service for mw-parsoid [deployment-charts] - https://gerrit.wikimedia.org/r/1043602 (https://phabricator.wikimedia.org/T365265)
[07:32:12] (PS1) Giuseppe Lavagetto: mw-parsoid: send statsd stats to the statsd services [deployment-charts] - https://gerrit.wikimedia.org/r/1043603 (https://phabricator.wikimedia.org/T365265)
[07:33:45] (CR) MVernon: [C:+2] installer/cephadm: specify a very large maximum size [puppet] - https://gerrit.wikimedia.org/r/1043165 (https://phabricator.wikimedia.org/T279621) (owner: MVernon)
[07:37:26] (PS1) MVernon: installer/cephadm: increase priority of / [puppet] - https://gerrit.wikimedia.org/r/1043605 (https://phabricator.wikimedia.org/T279621)
[07:38:18] (PS1) Muehlenhoff: Deprecate system::role for remaining mariadb roles [puppet] - https://gerrit.wikimedia.org/r/1043606
[07:39:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1210', diff saved to https://phabricator.wikimedia.org/P64920 and previous config saved to /var/cache/conftool/dbconfig/20240614-073902-marostegui.json
[07:39:15] (CR) Arnaudb: [C:+1] installer/cephadm: increase priority of / [puppet] - https://gerrit.wikimedia.org/r/1043605 (https://phabricator.wikimedia.org/T279621) (owner: MVernon)
[07:39:40] (CR) JMeybohm: [C:+1] "Yeah, SGTM. When deploying, please do so in a infra deployment window to ensure you won't get raced by a regular deploy." [deployment-charts] - https://gerrit.wikimedia.org/r/1042440 (https://phabricator.wikimedia.org/T362978) (owner: Scott French)
[07:41:56] (PS2) Giuseppe Lavagetto: mw-parsoid: enable statsd service for mw-parsoid [deployment-charts] - https://gerrit.wikimedia.org/r/1043602 (https://phabricator.wikimedia.org/T365265)
[07:41:56] (PS2) Giuseppe Lavagetto: mw-parsoid: send statsd stats to the statsd services [deployment-charts] - https://gerrit.wikimedia.org/r/1043603 (https://phabricator.wikimedia.org/T365265)
[07:43:24] (PS1) Muehlenhoff: Deprecate system::role for Kafka roles [puppet] - https://gerrit.wikimedia.org/r/1043607
[07:44:30] (CR) Jelto: [C:+2] sre/gitlab: tweak expression for GitLabCiJobErrors [alerts] - https://gerrit.wikimedia.org/r/1043086 (https://phabricator.wikimedia.org/T367341) (owner: Jelto)
[07:46:04] (Merged) jenkins-bot: sre/gitlab: tweak expression for GitLabCiJobErrors [alerts] - https://gerrit.wikimedia.org/r/1043086 (https://phabricator.wikimedia.org/T367341) (owner: Jelto)
[07:51:07] SRE, Infrastructure-Foundations, LDAP: Split out ldap management from mwmaint - https://phabricator.wikimedia.org/T367490 (MoritzMuehlenhoff) NEW
[07:51:37] (CR) MVernon: [C:+2] installer/cephadm: increase priority of / [puppet] - https://gerrit.wikimedia.org/r/1043605 (https://phabricator.wikimedia.org/T279621) (owner: MVernon)
[07:51:39] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (gerrit1003), Fresh: 143 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[07:54:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1210', diff saved to https://phabricator.wikimedia.org/P64921 and previous config saved to /var/cache/conftool/dbconfig/20240614-075408-marostegui.json
[07:56:04] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host moss-fe2002.codfw.wmnet with OS bookworm
[07:56:17] SRE-swift-storage, Patch-For-Review: Set up Misc Object Storage Service (moss) - https://phabricator.wikimedia.org/T279621#9891647 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host moss-fe2002.codfw.wmnet with OS bookworm
[07:57:28] (PS1) Muehlenhoff: Add ldap-maint[12]001 to site.pp [puppet] - https://gerrit.wikimedia.org/r/1043645 (https://phabricator.wikimedia.org/T367490)
[07:59:53] SRE, Infrastructure-Foundations, Patch-For-Review: Move the ping* servers to Bookworm - https://phabricator.wikimedia.org/T366695#9891651 (MoritzMuehlenhoff) Open→Resolved The old ping servers have been decommed, closing.
[08:00:25] (CR) Muehlenhoff: [C:+2] Add ldap-maint[12]001 to site.pp [puppet] - https://gerrit.wikimedia.org/r/1043645 (https://phabricator.wikimedia.org/T367490) (owner: Muehlenhoff)
[08:00:32] (PS2) Muehlenhoff: Add ldap-maint[12]001 to site.pp [puppet] - https://gerrit.wikimedia.org/r/1043645 (https://phabricator.wikimedia.org/T367490)
[08:00:35] (PS2) JMeybohm: Allow multiple update files in one go [debs/debdeploy] - https://gerrit.wikimedia.org/r/643912
[08:01:26] (CR) DCausse: team-search-platform: Add kafka topic alerts for new search pipeline (2 comments) [alerts] - https://gerrit.wikimedia.org/r/1043198 (https://phabricator.wikimedia.org/T349772) (owner: Bking)
[08:01:27] (PS1) Muehlenhoff: Add partman globbing for ldap-maint [puppet] - https://gerrit.wikimedia.org/r/1043646 (https://phabricator.wikimedia.org/T367490)
[08:03:07] !log dbmaint codfw s8 deploy schema change T367261
[08:03:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:03:11] T367261: Rebuild recentchanges table everywhere - https://phabricator.wikimedia.org/T367261
[08:08:10] (CR) Muehlenhoff: [V:+2] Add ldap-maint[12]001 to site.pp [puppet] - https://gerrit.wikimedia.org/r/1043645 (https://phabricator.wikimedia.org/T367490) (owner: Muehlenhoff)
[08:09:16] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1210 (T367261)', diff saved to https://phabricator.wikimedia.org/P64922 and previous config saved to /var/cache/conftool/dbconfig/20240614-080915-marostegui.json
[08:09:18] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1213.eqiad.wmnet with reason: Maintenance
[08:09:21] T367261: Rebuild recentchanges table everywhere - https://phabricator.wikimedia.org/T367261
[08:09:31] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1213.eqiad.wmnet with reason: Maintenance
[08:09:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1213 (T367261)', diff saved to https://phabricator.wikimedia.org/P64923 and previous config saved to /var/cache/conftool/dbconfig/20240614-080938-marostegui.json
[08:11:19] (CR) Muehlenhoff: [C:+1] "LGTM" [debs/debdeploy] - https://gerrit.wikimedia.org/r/643912 (owner: JMeybohm)
[08:12:00] (CR) Muehlenhoff: [C:+2] Add partman globbing for ldap-maint [puppet] - https://gerrit.wikimedia.org/r/1043646 (https://phabricator.wikimedia.org/T367490) (owner: Muehlenhoff)
[08:12:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1213 (T367261)', diff saved to https://phabricator.wikimedia.org/P64924 and previous config saved to /var/cache/conftool/dbconfig/20240614-081255-marostegui.json
[08:14:07] !log pfischer@deploy1002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply
[08:14:16] !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on moss-fe2002.codfw.wmnet with reason: host reimage
[08:14:21] !log pfischer@deploy1002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply
[08:15:42] (CR) Marostegui: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1043606 (owner: Muehlenhoff)
[08:15:47] FIRING: SystemdUnitFailed: wmf_auto_restart_exim4.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:17:02] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on moss-fe2002.codfw.wmnet with reason: host reimage
[08:19:31] ops-eqiad, SRE-OnFire, DC-Ops, serviceops, Sustainability (Incident Followup): eqiad:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366204#9891677 (kamila) Open→Resolved New NICs seem to be happy including overnight network testing out of sheer paranoia,...
[08:21:16] !log jmm@cumin2002 START - Cookbook sre.ganeti.resource-report
[08:21:17] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.resource-report (exit_code=0)
[08:24:05] !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host ldap-maint2001.codfw.wmnet
[08:24:07] !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[08:26:24] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ldap-maint2001.codfw.wmnet - jmm@cumin2002"
[08:27:21] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ldap-maint2001.codfw.wmnet - jmm@cumin2002"
[08:27:22] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[08:27:22] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache ldap-maint2001.codfw.wmnet on all recursors
[08:27:25] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ldap-maint2001.codfw.wmnet on all recursors
[08:27:53] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox
hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ldap-maint2001.codfw.wmnet - jmm@cumin2002" [08:28:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1213', diff saved to https://phabricator.wikimedia.org/P64925 and previous config saved to /var/cache/conftool/dbconfig/20240614-082803-marostegui.json [08:28:50] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ldap-maint2001.codfw.wmnet - jmm@cumin2002" [08:30:15] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ldap-maint2001.codfw.wmnet with OS bookworm [08:31:51] (03CR) 10Jelto: [C:03+1] "lgtm, this file still has a /var/lib/mailman3 hardcoded and might need an update as well: https://gerrit.wikimedia.org/r/plugins/gitiles/o" [puppet] - 10https://gerrit.wikimedia.org/r/1043127 (owner: 10EoghanGaffney) [08:35:41] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host moss-fe2002.codfw.wmnet with OS bookworm [08:35:57] 10SRE-swift-storage, 13Patch-For-Review: Set up Misc Object Storage Service (moss) - https://phabricator.wikimedia.org/T279621#9891695 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host moss-fe2002.codfw.wmnet with OS bookworm completed: - moss-fe2002 (**PASS**)... [08:37:26] (03CR) 10Jelto: [C:03+1] "GitLab should not use CAS anymore, so this should be fine to clean up." 
[puppet] - 10https://gerrit.wikimedia.org/r/1043247 (https://phabricator.wikimedia.org/T320390) (owner: 10Dzahn) [08:38:07] (03CR) 10Marostegui: [C:03+1] Deprecate system::role for remaining mariadb roles [puppet] - 10https://gerrit.wikimedia.org/r/1043606 (owner: 10Muehlenhoff) [08:38:09] (03CR) 10Giuseppe Lavagetto: [C:03+1] Call the test with the image name including tag [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/1043155 (owner: 10JMeybohm) [08:39:12] 06SRE, 10LDAP-Access-Requests: Grant Access to wmf for Gonyeahialam - https://phabricator.wikimedia.org/T367053#9891697 (10Aklapper) 05Resolved→03Open Reopening as it seems the second step on https://wikitech.wikimedia.org/wiki/SRE/Clinic_Duty/Access_requests#WMF_Group was skipped [08:39:18] (03PS1) 10Muehlenhoff: profile::openldap::management: Add support for bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1043656 (https://phabricator.wikimedia.org/T367490) [08:39:39] (03CR) 10CI reject: [V:04-1] profile::openldap::management: Add support for bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1043656 (https://phabricator.wikimedia.org/T367490) (owner: 10Muehlenhoff) [08:43:02] (03PS2) 10Muehlenhoff: profile::openldap::management: Add support for bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1043656 (https://phabricator.wikimedia.org/T367490) [08:43:11] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1213', diff saved to https://phabricator.wikimedia.org/P64926 and previous config saved to /var/cache/conftool/dbconfig/20240614-084310-marostegui.json [08:44:01] (03CR) 10Muehlenhoff: [C:03+1] "Just merge, I'll fold this into the next builds" [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/643912 (owner: 10JMeybohm) [08:44:09] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Long schema change [08:44:13] !log marostegui@cumin1002 END (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) for 4:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Long schema change [08:44:42] !log dbmaint eqiad s8 deploy schema change T367261 [08:44:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:44:49] T367261: Rebuild recentchanges table everywhere - https://phabricator.wikimedia.org/T367261 [08:45:13] (03CR) 10Slyngshede: [C:03+1] "LGTM, I don't really like having the holes in the "id" series, but I don't think we have even considered how to handle that." [puppet] - 10https://gerrit.wikimedia.org/r/1043247 (https://phabricator.wikimedia.org/T320390) (owner: 10Dzahn) [08:45:41] (03CR) 10Giuseppe Lavagetto: [C:03+1] Release 4.0.1 [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/1043601 (owner: 10JMeybohm) [08:47:46] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ldap-maint2001.codfw.wmnet with reason: host reimage [08:49:33] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1043656 (https://phabricator.wikimedia.org/T367490) (owner: 10Muehlenhoff) [08:51:07] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ldap-maint2001.codfw.wmnet with reason: host reimage [08:51:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T364069)', diff saved to https://phabricator.wikimedia.org/P64927 and previous config saved to /var/cache/conftool/dbconfig/20240614-085113-marostegui.json [08:51:18] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [08:54:36] (03CR) 10FNegri: "The old value ("Infomaniak") is hardcoded at modules/apt/manifests/unattendedupgrades.pp, I think it should also be changed there, and I w" [puppet] - 10https://gerrit.wikimedia.org/r/1043203 (https://phabricator.wikimedia.org/T366028) (owner: 10Andrew Bogott) [08:55:16] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host 
moss-be2001.codfw.wmnet with OS bookworm [08:55:25] 10SRE-swift-storage, 13Patch-For-Review: Set up Misc Object Storage Service (moss) - https://phabricator.wikimedia.org/T279621#9891743 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host moss-be2001.codfw.wmnet with OS bookworm [08:55:47] FIRING: JobUnavailable: Reduced availability for job ganeti in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:57:08] (03CR) 10Majavah: "I don't think this will exactly work as expected, as that specific `apt-get update` exec is triggered only when adding a new osbpo reposit" [puppet] - 10https://gerrit.wikimedia.org/r/1043203 (https://phabricator.wikimedia.org/T366028) (owner: 10Andrew Bogott) [08:58:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1213 (T367261)', diff saved to https://phabricator.wikimedia.org/P64928 and previous config saved to /var/cache/conftool/dbconfig/20240614-085817-marostegui.json [08:58:19] (03CR) 10Arnaudb: [C:03+1] Deprecate system::role for remaining mariadb roles [puppet] - 10https://gerrit.wikimedia.org/r/1043606 (owner: 10Muehlenhoff) [08:58:19] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1216.eqiad.wmnet with reason: Maintenance [08:58:22] T367261: Rebuild recentchanges table everywhere - https://phabricator.wikimedia.org/T367261 [08:58:33] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1216.eqiad.wmnet with reason: Maintenance [08:59:19] (03CR) 10DCausse: "makes sense, I think that would be a bit safer to have a dedicated retry logic for 429 that ensures that we're OK retrying more rather tha" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1040211 (https://phabricator.wikimedia.org/T362310) (owner: 
10Peter Fischer) [09:00:22] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1230.eqiad.wmnet with reason: Maintenance [09:00:25] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1230.eqiad.wmnet with reason: Maintenance [09:01:16] (03CR) 10JMeybohm: [C:03+1] mw-parsoid: enable statsd service for mw-parsoid [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043602 (https://phabricator.wikimedia.org/T365265) (owner: 10Giuseppe Lavagetto) [09:01:35] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1245.eqiad.wmnet with reason: Maintenance [09:01:49] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1245.eqiad.wmnet with reason: Maintenance [09:03:17] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance [09:03:26] (03CR) 10Slyngshede: [C:03+2] profile::openldap::management: Add support for bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1043656 (https://phabricator.wikimedia.org/T367490) (owner: 10Muehlenhoff) [09:03:30] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance [09:03:37] (03CR) 10JMeybohm: [C:04-1] mw-parsoid: send statsd stats to the statsd services (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043603 (https://phabricator.wikimedia.org/T365265) (owner: 10Giuseppe Lavagetto) [09:03:38] (03CR) 10Slyngshede: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1043656 (https://phabricator.wikimedia.org/T367490) (owner: 10Muehlenhoff) [09:04:46] FIRING: Storage /var over 50%: Alert for device lsw1-f6-eqiad.mgmt.eqiad.wmnet - Storage /var over 50% - https://alerts.wikimedia.org/?q=alertname%3DStorage+%2Fvar+over+50%25 [09:04:48] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 
12:00:00 on db2123.codfw.wmnet with reason: Maintenance [09:04:50] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2123.codfw.wmnet with reason: Maintenance [09:04:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2123 (T367261)', diff saved to https://phabricator.wikimedia.org/P64929 and previous config saved to /var/cache/conftool/dbconfig/20240614-090457-marostegui.json [09:05:02] T367261: Rebuild recentchanges table everywhere - https://phabricator.wikimedia.org/T367261 [09:06:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P64930 and previous config saved to /var/cache/conftool/dbconfig/20240614-090620-marostegui.json [09:06:44] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ldap-maint2001.codfw.wmnet with OS bookworm [09:06:45] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host ldap-maint2001.codfw.wmnet [09:08:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2123 (T367261)', diff saved to https://phabricator.wikimedia.org/P64931 and previous config saved to /var/cache/conftool/dbconfig/20240614-090835-marostegui.json [09:10:14] !log ryankemper@cumin2002 END (ERROR) - Cookbook sre.hadoop.reboot-workers (exit_code=97) for Hadoop analytics cluster [09:11:07] (03PS3) 10Filippo Giunchedi: eventstreams: enable mesh tracing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043076 (https://phabricator.wikimedia.org/T320563) [09:11:24] (03CR) 10Filippo Giunchedi: [C:03+2] eventstreams: enable mesh tracing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043076 (https://phabricator.wikimedia.org/T320563) (owner: 10Filippo Giunchedi) [09:12:15] (03Merged) 10jenkins-bot: eventstreams: enable mesh tracing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043076 
(https://phabricator.wikimedia.org/T320563) (owner: 10Filippo Giunchedi) [09:13:20] !log filippo@deploy1002 helmfile [eqiad] START helmfile.d/services/eventstreams: apply [09:14:18] !log filippo@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventstreams: apply [09:17:31] PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 45 probes of 792 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [09:20:29] (03PS2) 10Filippo Giunchedi: apertium: enable mesh tracing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043090 (https://phabricator.wikimedia.org/T320563) [09:20:33] (03CR) 10Filippo Giunchedi: [C:03+2] apertium: enable mesh tracing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043090 (https://phabricator.wikimedia.org/T320563) (owner: 10Filippo Giunchedi) [09:21:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P64932 and previous config saved to /var/cache/conftool/dbconfig/20240614-092127-marostegui.json [09:21:41] (03Merged) 10jenkins-bot: apertium: enable mesh tracing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043090 (https://phabricator.wikimedia.org/T320563) (owner: 10Filippo Giunchedi) [09:21:56] (03PS3) 10Clément Goubert: mw-parsoid: send statsd stats to the statsd services [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043603 (https://phabricator.wikimedia.org/T365265) (owner: 10Giuseppe Lavagetto) [09:22:07] !log filippo@deploy1002 helmfile [codfw] START helmfile.d/services/eventstreams: apply [09:22:54] !log filippo@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventstreams: apply [09:23:06] !log filippo@deploy1002 helmfile [codfw] START helmfile.d/services/apertium: apply [09:23:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after 
maintenance db2123', diff saved to https://phabricator.wikimedia.org/P64933 and previous config saved to /var/cache/conftool/dbconfig/20240614-092342-marostegui.json [09:23:52] !log filippo@deploy1002 helmfile [codfw] DONE helmfile.d/services/apertium: apply [09:25:00] !log filippo@deploy1002 helmfile [eqiad] START helmfile.d/services/apertium: apply [09:25:33] !log filippo@deploy1002 helmfile [eqiad] DONE helmfile.d/services/apertium: apply [09:25:53] (03CR) 10Clément Goubert: mw-parsoid: send statsd stats to the statsd services (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043603 (https://phabricator.wikimedia.org/T365265) (owner: 10Giuseppe Lavagetto) [09:26:20] (03PS2) 10Filippo Giunchedi: zotero: enable mesh tracing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043089 (https://phabricator.wikimedia.org/T320563) [09:26:20] (03PS1) 10JMeybohm: prometheus::k8s: Keep envoy ratelimit metrics [puppet] - 10https://gerrit.wikimedia.org/r/1043667 (https://phabricator.wikimedia.org/T362310) [09:26:23] (03CR) 10Filippo Giunchedi: [C:03+2] zotero: enable mesh tracing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043089 (https://phabricator.wikimedia.org/T320563) (owner: 10Filippo Giunchedi) [09:27:06] (03PS2) 10JMeybohm: prometheus::k8s: Keep envoy ratelimit metrics [puppet] - 10https://gerrit.wikimedia.org/r/1043667 (https://phabricator.wikimedia.org/T362310) [09:27:10] (03Merged) 10jenkins-bot: zotero: enable mesh tracing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043089 (https://phabricator.wikimedia.org/T320563) (owner: 10Filippo Giunchedi) [09:27:25] (03CR) 10JMeybohm: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1043667 (https://phabricator.wikimedia.org/T362310) (owner: 10JMeybohm) [09:27:31] RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 34 probes of 792 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [09:29:32] !log filippo@deploy1002 helmfile [eqiad] START helmfile.d/services/zotero: apply [09:29:59] !log filippo@deploy1002 helmfile [eqiad] DONE helmfile.d/services/zotero: apply [09:30:34] !log jmm@cumin2002 START - Cookbook sre.ganeti.resource-report [09:30:35] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.resource-report (exit_code=0) [09:31:07] !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host ldap-maint1001.eqiad.wmnet [09:31:10] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [09:31:33] !log filippo@deploy1002 helmfile [codfw] START helmfile.d/services/zotero: apply [09:31:49] (03CR) 10Muehlenhoff: [C:03+2] profile::openldap::management: Add support for bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1043656 (https://phabricator.wikimedia.org/T367490) (owner: 10Muehlenhoff) [09:31:56] !log filippo@deploy1002 helmfile [codfw] DONE helmfile.d/services/zotero: apply [09:34:31] PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 40 probes of 792 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [09:34:36] (03CR) 10EoghanGaffney: [C:03+1] idp: drop gitlab-new.wikimedia.org service ID [puppet] - 10https://gerrit.wikimedia.org/r/1043181 (owner: 10Dzahn) [09:34:43] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [09:36:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T364069)', diff saved to https://phabricator.wikimedia.org/P64934 and previous config saved to 
/var/cache/conftool/dbconfig/20240614-093634-marostegui.json [09:36:37] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1170.eqiad.wmnet with reason: Maintenance [09:36:40] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [09:36:50] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1170.eqiad.wmnet with reason: Maintenance [09:36:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1170 (T364069)', diff saved to https://phabricator.wikimedia.org/P64935 and previous config saved to /var/cache/conftool/dbconfig/20240614-093657-marostegui.json [09:37:27] !log upgrade and restart dbprov[12]00[3456] [09:37:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:38:45] (03CR) 10Muehlenhoff: [C:03+2] Cleanup puppetmaster preseed config [puppet] - 10https://gerrit.wikimedia.org/r/1043114 (owner: 10Muehlenhoff) [09:38:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2123', diff saved to https://phabricator.wikimedia.org/P64936 and previous config saved to /var/cache/conftool/dbconfig/20240614-093849-marostegui.json [09:42:01] (03CR) 10Clément Goubert: [C:03+2] mw-parsoid: enable statsd service for mw-parsoid [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043602 (https://phabricator.wikimedia.org/T365265) (owner: 10Giuseppe Lavagetto) [09:42:53] (03Merged) 10jenkins-bot: mw-parsoid: enable statsd service for mw-parsoid [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043602 (https://phabricator.wikimedia.org/T365265) (owner: 10Giuseppe Lavagetto) [09:43:12] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ldap-maint1001.eqiad.wmnet - jmm@cumin2002" [09:43:35] !log jelto@cumin1002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab2002.wikimedia.org with 
reason: GitLab to new version [09:43:41] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-parsoid: apply [09:44:41] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-parsoid: apply [09:44:55] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-parsoid: apply [09:45:09] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ldap-maint1001.eqiad.wmnet - jmm@cumin2002" [09:45:10] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:45:10] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache ldap-maint1001.eqiad.wmnet on all recursors [09:45:11] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-parsoid: apply [09:45:13] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ldap-maint1001.eqiad.wmnet on all recursors [09:47:01] (03PS1) 10Muehlenhoff: New openldap::management role [puppet] - 10https://gerrit.wikimedia.org/r/1043680 (https://phabricator.wikimedia.org/T367490) [09:47:23] (03CR) 10CI reject: [V:04-1] New openldap::management role [puppet] - 10https://gerrit.wikimedia.org/r/1043680 (https://phabricator.wikimedia.org/T367490) (owner: 10Muehlenhoff) [09:51:35] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 144 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [09:52:00] (03PS2) 10Muehlenhoff: New openldap::management role [puppet] - 10https://gerrit.wikimedia.org/r/1043680 (https://phabricator.wikimedia.org/T367490) [09:53:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2123 (T367261)', diff saved to https://phabricator.wikimedia.org/P64937 and previous config saved to /var/cache/conftool/dbconfig/20240614-095356-marostegui.json [09:54:00] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2128.codfw.wmnet with reason: Maintenance [09:54:01] 
T367261: Rebuild recentchanges table everywhere - https://phabricator.wikimedia.org/T367261 [09:54:13] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2128.codfw.wmnet with reason: Maintenance [09:54:14] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance [09:54:27] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance [09:54:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2128 (T367261)', diff saved to https://phabricator.wikimedia.org/P64938 and previous config saved to /var/cache/conftool/dbconfig/20240614-095434-marostegui.json [09:58:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2128 (T367261)', diff saved to https://phabricator.wikimedia.org/P64939 and previous config saved to /var/cache/conftool/dbconfig/20240614-095809-marostegui.json [09:58:41] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - ASunknown/IPv4: Connect https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:59:11] (03CR) 10Hnowlan: [C:03+1] "lgtm, one nit" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1003446 (https://phabricator.wikimedia.org/T357309) (owner: 10Kamila Součková) [09:59:37] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ldap-maint1001.eqiad.wmnet - jmm@cumin2002" [10:00:34] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ldap-maint1001.eqiad.wmnet - jmm@cumin2002" [10:07:11] (03CR) 10Slyngshede: [C:03+1] "Looks good." 
[puppet] - 10https://gerrit.wikimedia.org/r/1043680 (https://phabricator.wikimedia.org/T367490) (owner: 10Muehlenhoff) [10:08:39] (03CR) 10Hnowlan: [C:03+1] Revert^2 "aqs-http-gateway: allow cross-DC Cassandra client connection / fix settings" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043195 (https://phabricator.wikimedia.org/T366851) (owner: 10Scott French) [10:13:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2128', diff saved to https://phabricator.wikimedia.org/P64940 and previous config saved to /var/cache/conftool/dbconfig/20240614-101316-marostegui.json [10:16:59] (03PS1) 10Muehlenhoff: Remove mcrouter CA setup [puppet] - 10https://gerrit.wikimedia.org/r/1043699 (https://phabricator.wikimedia.org/T365798) [10:17:41] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ldap-maint1001.eqiad.wmnet with OS bookworm [10:17:52] (03CR) 10Muehlenhoff: [C:03+2] New openldap::management role [puppet] - 10https://gerrit.wikimedia.org/r/1043680 (https://phabricator.wikimedia.org/T367490) (owner: 10Muehlenhoff) [10:20:22] (03PS3) 10EoghanGaffney: lists: Switch mailman_root for new hosts [puppet] - 10https://gerrit.wikimedia.org/r/1043127 [10:20:28] (03CR) 10CI reject: [V:04-1] Remove mcrouter CA setup [puppet] - 10https://gerrit.wikimedia.org/r/1043699 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff) [10:22:46] (03PS4) 10Clément Goubert: mw-parsoid: send statsd stats to the statsd services [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043603 (https://phabricator.wikimedia.org/T365265) (owner: 10Giuseppe Lavagetto) [10:22:46] (03PS1) 10Clément Goubert: mw-debug: Fix mcrouter address [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043703 [10:22:47] (03PS1) 10Clément Goubert: mw-on-k8s: Deploy statsd exporter [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043704 (https://phabricator.wikimedia.org/T365265) [10:22:48] (03PS1) 10Clément Goubert: mw-jobrunner: send statsd 
data to the exporter [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043705 (https://phabricator.wikimedia.org/T365265) [10:22:52] (03PS1) 10Clément Goubert: mw-api-ext: send statsd data to the exporter [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043706 (https://phabricator.wikimedia.org/T365265) [10:22:56] (03PS1) 10Clément Goubert: mw-api-int: send statsd data to the exporter [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043707 (https://phabricator.wikimedia.org/T365265) [10:23:00] (03PS1) 10Clément Goubert: mw-web: send statsd data to the exporter [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043708 (https://phabricator.wikimedia.org/T365265) [10:23:09] (03CR) 10Brouberol: Deprecate system::role for Kafka roles (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1043607 (owner: 10Muehlenhoff) [10:24:00] (03PS1) 10Muehlenhoff: Apply openldap::maintenance role to ldap-maint* hosts [puppet] - 10https://gerrit.wikimedia.org/r/1043709 (https://phabricator.wikimedia.org/T367490) [10:24:36] (03PS2) 10Muehlenhoff: Remove mcrouter CA setup [puppet] - 10https://gerrit.wikimedia.org/r/1043699 (https://phabricator.wikimedia.org/T365798) [10:25:06] !log mvernon@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host moss-be2001.codfw.wmnet with OS bookworm [10:26:05] (03CR) 10EoghanGaffney: [V:03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2928/co" [puppet] - 10https://gerrit.wikimedia.org/r/1043127 (owner: 10EoghanGaffney) [10:28:21] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host moss-fe2002.codfw.wmnet with OS bookworm [10:28:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2128', diff saved to https://phabricator.wikimedia.org/P64941 and previous config saved to /var/cache/conftool/dbconfig/20240614-102823-marostegui.json [10:28:31] 10SRE-swift-storage, 
13Patch-For-Review: Set up Misc Object Storage Service (moss) - https://phabricator.wikimedia.org/T279621#9891925 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host moss-fe2002.codfw.wmnet with OS bookworm [10:28:55] (03CR) 10Hnowlan: [C:04-1] [WIP] create a shellbox deployment for videoscalers (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1003446 (https://phabricator.wikimedia.org/T357309) (owner: 10Kamila Součková) [10:30:53] !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on moss-fe2002.codfw.wmnet with reason: host reimage [10:31:18] (03CR) 10Jelto: [C:03+1] "lgtm now!" [puppet] - 10https://gerrit.wikimedia.org/r/1043127 (owner: 10EoghanGaffney) [10:32:42] (03CR) 10EoghanGaffney: [V:03+1 C:03+2] lists: Switch mailman_root for new hosts [puppet] - 10https://gerrit.wikimedia.org/r/1043127 (owner: 10EoghanGaffney) [10:32:50] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1043699 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff) [10:33:27] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on moss-fe2002.codfw.wmnet with reason: host reimage [10:34:02] (03PS2) 10Muehlenhoff: Deprecate system::role for Kafka roles [puppet] - 10https://gerrit.wikimedia.org/r/1043607 [10:36:08] (03CR) 10Clément Goubert: [C:03+2] mw-parsoid: send statsd stats to the statsd services [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043603 (https://phabricator.wikimedia.org/T365265) (owner: 10Giuseppe Lavagetto) [10:36:51] (03Merged) 10jenkins-bot: mw-parsoid: send statsd stats to the statsd services [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043603 (https://phabricator.wikimedia.org/T365265) (owner: 10Giuseppe Lavagetto) [10:37:23] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-parsoid: apply [10:39:17] !log cgoubert@deploy1002 helmfile [codfw] DONE 
helmfile.d/services/mw-parsoid: apply [10:41:32] 06SRE, 10SRE-Access-Requests: Requesting access to cassandra-staging-devs for milimetric - https://phabricator.wikimedia.org/T365074#9891940 (10WDoranWMF) Approved [10:43:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2128 (T367261)', diff saved to https://phabricator.wikimedia.org/P64942 and previous config saved to /var/cache/conftool/dbconfig/20240614-104330-marostegui.json [10:43:32] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2157.codfw.wmnet with reason: Maintenance [10:43:34] T367261: Rebuild recentchanges table everywhere - https://phabricator.wikimedia.org/T367261 [10:43:45] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2157.codfw.wmnet with reason: Maintenance [10:43:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2157 (T367261)', diff saved to https://phabricator.wikimedia.org/P64943 and previous config saved to /var/cache/conftool/dbconfig/20240614-104352-marostegui.json [10:43:55] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host moss-fe2002.codfw.wmnet with OS bookworm [10:44:09] 10SRE-swift-storage, 13Patch-For-Review: Set up Misc Object Storage Service (moss) - https://phabricator.wikimedia.org/T279621#9891946 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host moss-fe2002.codfw.wmnet with OS bookworm completed: - moss-fe2002 (**PASS**)... 
[10:45:09] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host moss-be2001.codfw.wmnet with OS bookworm [10:45:23] 10SRE-swift-storage, 13Patch-For-Review: Set up Misc Object Storage Service (moss) - https://phabricator.wikimedia.org/T279621#9891948 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host moss-be2001.codfw.wmnet with OS bookworm [10:47:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2157 (T367261)', diff saved to https://phabricator.wikimedia.org/P64945 and previous config saved to /var/cache/conftool/dbconfig/20240614-104742-marostegui.json [10:48:19] (03PS1) 10EoghanGaffney: lists: Add mailman_root parameter to mailman3 class [puppet] - 10https://gerrit.wikimedia.org/r/1043714 [10:49:59] (03CR) 10EoghanGaffney: [V:03+1] "PCC SUCCESS (CORE_DIFF 2 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1043714 (owner: 10EoghanGaffney) [10:52:33] (03CR) 10Jelto: [C:03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/1043714 (owner: 10EoghanGaffney) [10:52:59] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-parsoid: sync [10:53:13] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-parsoid: sync [10:53:28] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-parsoid: sync [10:54:21] (03CR) 10EoghanGaffney: [V:03+1 C:03+2] lists: Add mailman_root parameter to mailman3 class [puppet] - 10https://gerrit.wikimedia.org/r/1043714 (owner: 10EoghanGaffney) [10:54:21] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-parsoid: sync [10:54:28] !log fnegri@cumin1002 conftool action : set/pooled=no; selector: name=clouddb1018.eqiad.wmnet,service=s2 [10:54:33] !log fnegri@cumin1002 conftool action : set/pooled=no; selector: name=clouddb1018.eqiad.wmnet,service=s7 [10:54:37] !log 
mvernon@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host moss-be2001.codfw.wmnet with OS bookworm [10:54:48] 10SRE-swift-storage, 13Patch-For-Review: Set up Misc Object Storage Service (moss) - https://phabricator.wikimedia.org/T279621#9891956 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host moss-be2001.codfw.wmnet with OS bookworm executed with errors: - moss-be2001 (... [10:54:53] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-parsoid: sync [10:54:54] (03CR) 10MVernon: [C:03+2] apus: setup for codfw apus cluster [puppet] - 10https://gerrit.wikimedia.org/r/1043115 (https://phabricator.wikimedia.org/T279621) (owner: 10MVernon) [10:55:24] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-parsoid: sync [10:55:36] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-parsoid: sync [10:55:36] !log fnegri@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on clouddb1018.eqiad.wmnet with reason: T366555 [10:55:39] !log fnegri@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1018.eqiad.wmnet with reason: T366555 [10:56:40] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-parsoid: sync [10:56:42] (03CR) 10JMeybohm: [C:03+1] mw-debug: Fix mcrouter address [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043703 (owner: 10Clément Goubert) [10:59:57] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host moss-be2001.codfw.wmnet with OS bookworm [11:00:13] 10SRE-swift-storage, 13Patch-For-Review: Set up Misc Object Storage Service (moss) - https://phabricator.wikimedia.org/T279621#9891965 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host moss-be2001.codfw.wmnet with OS bookworm [11:00:47] !log mvernon@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host moss-be2001.codfw.wmnet with OS 
bookworm [11:01:00] 10SRE-swift-storage, 13Patch-For-Review: Set up Misc Object Storage Service (moss) - https://phabricator.wikimedia.org/T279621#9891969 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host moss-be2001.codfw.wmnet with OS bookworm executed with errors: - moss-be2001 (... [11:02:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2157', diff saved to https://phabricator.wikimedia.org/P64946 and previous config saved to /var/cache/conftool/dbconfig/20240614-110249-marostegui.json [11:02:51] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host moss-be2001.codfw.wmnet with OS bookworm [11:02:54] !log restart backup* hosts [11:02:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:03:02] 10SRE-swift-storage, 13Patch-For-Review: Set up Misc Object Storage Service (moss) - https://phabricator.wikimedia.org/T279621#9891973 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host moss-be2001.codfw.wmnet with OS bookworm [11:04:55] (03CR) 10Clément Goubert: [C:03+2] mw-debug: Fix mcrouter address [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043703 (owner: 10Clément Goubert) [11:05:46] (03Merged) 10jenkins-bot: mw-debug: Fix mcrouter address [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043703 (owner: 10Clément Goubert) [11:06:23] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [11:06:59] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [11:10:47] FIRING: [2x] JobUnavailable: Reduced availability for job bacula in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:15:53] (03CR) 10Filippo Giunchedi: [C:03+1] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1043667 (https://phabricator.wikimedia.org/T362310) (owner: 10JMeybohm) [11:17:38] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ldap-maint1001.eqiad.wmnet with reason: host reimage [11:17:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2157', diff saved to https://phabricator.wikimedia.org/P64947 and previous config saved to /var/cache/conftool/dbconfig/20240614-111756-marostegui.json [11:18:19] !log fnegri@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on clouddb1018.eqiad.wmnet with reason: T366555 [11:18:21] !log fnegri@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on clouddb1018.eqiad.wmnet with reason: T366555 [11:20:57] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ldap-maint1001.eqiad.wmnet with reason: host reimage [11:21:17] !log fnegri@cumin1002 START - Cookbook sre.hosts.reboot-single for host clouddb1018.eqiad.wmnet [11:23:58] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1233 (T352010)', diff saved to https://phabricator.wikimedia.org/P64948 and previous config saved to /var/cache/conftool/dbconfig/20240614-112357-ladsgroup.json [11:24:03] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [11:25:42] (03PS1) 10Hnowlan: service: add basic config for shellbox-video [puppet] - 10https://gerrit.wikimedia.org/r/1043724 (https://phabricator.wikimedia.org/T357309) [11:26:26] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:27:18] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8616 bytes in 0.269 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:28:24] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on 
db1230.eqiad.wmnet with reason: Maintenance [11:28:26] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1230.eqiad.wmnet with reason: Maintenance [11:29:14] 06SRE, 10CAS-SSO, 06Infrastructure-Foundations: Update CAS to 7.0 - https://phabricator.wikimedia.org/T367487#9891993 (10SLyngshede-WMF) I've run a test build, Java 21 is a hard requirement, it cannot be older or newer. Otherwise the overlay upgrade contains only minor changes. I have not tested the function... [11:33:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2157 (T367261)', diff saved to https://phabricator.wikimedia.org/P64949 and previous config saved to /var/cache/conftool/dbconfig/20240614-113303-marostegui.json [11:33:06] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2171.codfw.wmnet with reason: Maintenance [11:33:08] T367261: Rebuild recentchanges table everywhere - https://phabricator.wikimedia.org/T367261 [11:33:19] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2171.codfw.wmnet with reason: Maintenance [11:33:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2171 (T367261)', diff saved to https://phabricator.wikimedia.org/P64950 and previous config saved to /var/cache/conftool/dbconfig/20240614-113325-marostegui.json [11:36:54] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db1230 (re)pooling @ 10%: Maint over', diff saved to https://phabricator.wikimedia.org/P64951 and previous config saved to /var/cache/conftool/dbconfig/20240614-113654-ladsgroup.json [11:37:06] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ldap-maint1001.eqiad.wmnet with OS bookworm [11:37:06] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host ldap-maint1001.eqiad.wmnet [11:37:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance 
db2171 (T367261)', diff saved to https://phabricator.wikimedia.org/P64952 and previous config saved to /var/cache/conftool/dbconfig/20240614-113712-marostegui.json [11:39:05] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to https://phabricator.wikimedia.org/P64953 and previous config saved to /var/cache/conftool/dbconfig/20240614-113904-ladsgroup.json [11:39:43] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2218.codfw.wmnet with reason: Maintenance [11:39:56] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2218.codfw.wmnet with reason: Maintenance [11:40:03] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2218 (T352010)', diff saved to https://phabricator.wikimedia.org/P64954 and previous config saved to /var/cache/conftool/dbconfig/20240614-114002-ladsgroup.json [11:40:07] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [11:42:28] (03PS1) 10Muehlenhoff: nginx: Drop workaround for history Puppet bug [puppet] - 10https://gerrit.wikimedia.org/r/1043735 [11:44:32] RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 23 probes of 793 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [11:49:20] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1043735 (owner: 10Muehlenhoff) [11:52:00] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db1230 (re)pooling @ 25%: Maint over', diff saved to https://phabricator.wikimedia.org/P64955 and previous config saved to /var/cache/conftool/dbconfig/20240614-115159-ladsgroup.json [11:52:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2171', diff saved to https://phabricator.wikimedia.org/P64956 and previous config 
saved to /var/cache/conftool/dbconfig/20240614-115220-marostegui.json [11:54:12] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to https://phabricator.wikimedia.org/P64957 and previous config saved to /var/cache/conftool/dbconfig/20240614-115411-ladsgroup.json [11:55:49] jelto@cumin1002 jelto: The backup on gitlab2002 is complete, ready to proceed with upgrade. [12:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240614T0700) [12:00:05] eoghan, jelto, arnoldokoth, and mutante: Time to do the GitLab version upgrades deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240614T1200). [12:00:37] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on aqs1013 - https://phabricator.wikimedia.org/T367457#9892051 (10VRiley-WMF) →14Duplicate dup:03T362033 [12:00:38] 10ops-eqiad, 06SRE, 10Cassandra, 06DC-Ops: Degraded RAID on aqs1013 - https://phabricator.wikimedia.org/T362033#9892053 (10VRiley-WMF) [12:01:22] !log jelto@cumin1002 END (FAIL) - Cookbook sre.gitlab.upgrade (exit_code=99) on GitLab host gitlab2002.wikimedia.org with reason: GitLab to new version [12:05:35] 10ops-eqiad, 06DC-Ops: hw troubleshooting: server fails to reboot for clouddb1018.eqiad.wmnet - https://phabricator.wikimedia.org/T367499 (10fnegri) 03NEW [12:06:05] (03PS1) 10Hashar: tox: pin style dependencies to avoid CI failures [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/1043748 [12:07:05] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db1230 (re)pooling @ 75%: Maint over', diff saved to https://phabricator.wikimedia.org/P64958 and previous config saved to /var/cache/conftool/dbconfig/20240614-120704-ladsgroup.json [12:07:27] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2171', diff saved to https://phabricator.wikimedia.org/P64959 and 
previous config saved to /var/cache/conftool/dbconfig/20240614-120727-marostegui.json [12:08:06] !log fnegri@cumin1002 END (ERROR) - Cookbook sre.hosts.reboot-single (exit_code=97) for host clouddb1018.eqiad.wmnet [12:08:36] (03CR) 10Hashar: Call the test with the image name including tag (031 comment) [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/1043155 (owner: 10JMeybohm) [12:08:57] !log fnegri@cumin1002 START - Cookbook sre.hosts.downtime for 5 days, 0:00:00 on clouddb1018.eqiad.wmnet with reason: hardware issues T367499 [12:09:00] !log fnegri@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5 days, 0:00:00 on clouddb1018.eqiad.wmnet with reason: hardware issues T367499 [12:09:02] T367499: hw troubleshooting: server fails to reboot for clouddb1018.eqiad.wmnet - https://phabricator.wikimedia.org/T367499 [12:09:19] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1233 (T352010)', diff saved to https://phabricator.wikimedia.org/P64960 and previous config saved to /var/cache/conftool/dbconfig/20240614-120918-ladsgroup.json [12:09:21] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1239.eqiad.wmnet with reason: Maintenance [12:09:23] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [12:09:34] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1239.eqiad.wmnet with reason: Maintenance [12:14:33] 06SRE, 10CAS-SSO, 06Infrastructure-Foundations: Update CAS to 7.0 - https://phabricator.wikimedia.org/T367487#9892110 (10MoritzMuehlenhoff) I'll look into a Java 21 backport for Bookworm. 
[12:14:38] (03PS1) 10Majavah: P:cephadm::controller: don't crash if single host location lookup fails [puppet] - 10https://gerrit.wikimedia.org/r/1043750 [12:15:47] FIRING: SystemdUnitFailed: wmf_auto_restart_exim4.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:16:15] (03PS1) 10MVernon: cephadm::controller - check if management_hostname findable in netbox data [puppet] - 10https://gerrit.wikimedia.org/r/1043752 (https://phabricator.wikimedia.org/T279621) [12:16:38] (03CR) 10CI reject: [V:04-1] cephadm::controller - check if management_hostname findable in netbox data [puppet] - 10https://gerrit.wikimedia.org/r/1043752 (https://phabricator.wikimedia.org/T279621) (owner: 10MVernon) [12:20:27] (03PS2) 10MVernon: cephadm::controller - check if management_hostname findable in netbox data [puppet] - 10https://gerrit.wikimedia.org/r/1043752 (https://phabricator.wikimedia.org/T279621) [12:20:47] (03CR) 10CI reject: [V:04-1] cephadm::controller - check if management_hostname findable in netbox data [puppet] - 10https://gerrit.wikimedia.org/r/1043752 (https://phabricator.wikimedia.org/T279621) (owner: 10MVernon) [12:21:30] (03PS3) 10MVernon: cephadm::controller - check if management_hostname findable in netbox data [puppet] - 10https://gerrit.wikimedia.org/r/1043752 (https://phabricator.wikimedia.org/T279621) [12:21:54] (03CR) 10CI reject: [V:04-1] cephadm::controller - check if management_hostname findable in netbox data [puppet] - 10https://gerrit.wikimedia.org/r/1043752 (https://phabricator.wikimedia.org/T279621) (owner: 10MVernon) [12:22:11] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db1230 (re)pooling @ 100%: Maint over', diff saved to https://phabricator.wikimedia.org/P64961 and previous config saved to /var/cache/conftool/dbconfig/20240614-122210-ladsgroup.json [12:22:34] !log 
marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2171 (T367261)', diff saved to https://phabricator.wikimedia.org/P64962 and previous config saved to /var/cache/conftool/dbconfig/20240614-122233-marostegui.json [12:22:36] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2178.codfw.wmnet with reason: Maintenance [12:22:38] T367261: Rebuild recentchanges table everywhere - https://phabricator.wikimedia.org/T367261 [12:22:49] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2178.codfw.wmnet with reason: Maintenance [12:22:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2178 (T367261)', diff saved to https://phabricator.wikimedia.org/P64963 and previous config saved to /var/cache/conftool/dbconfig/20240614-122255-marostegui.json [12:23:17] (03PS4) 10MVernon: cephadm::controller - check if management_hostname findable in netbox data [puppet] - 10https://gerrit.wikimedia.org/r/1043752 (https://phabricator.wikimedia.org/T279621) [12:23:54] !log mvernon@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host moss-be2001.codfw.wmnet with OS bookworm [12:24:53] (03PS1) 10Muehlenhoff: Remove Pontoon support for Puppet 5 puppet masters [puppet] - 10https://gerrit.wikimedia.org/r/1043757 (https://phabricator.wikimedia.org/T365798) [12:25:08] (03CR) 10Majavah: [C:03+1] cephadm::controller - check if management_hostname findable in netbox data [puppet] - 10https://gerrit.wikimedia.org/r/1043752 (https://phabricator.wikimedia.org/T279621) (owner: 10MVernon) [12:25:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2178 (T367261)', diff saved to https://phabricator.wikimedia.org/P64964 and previous config saved to /var/cache/conftool/dbconfig/20240614-122530-marostegui.json [12:27:10] (03CR) 10MVernon: [C:03+2] cephadm::controller - check if management_hostname findable in netbox data [puppet] - 
10https://gerrit.wikimedia.org/r/1043752 (https://phabricator.wikimedia.org/T279621) (owner: 10MVernon) [12:31:24] (03CR) 10Filippo Giunchedi: [C:03+1] "LGTM, this is replaced by code in sandbox/filippo/pontoon-puppetserver which I'll send out for review when I'm back from vacation" [puppet] - 10https://gerrit.wikimedia.org/r/1043757 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff) [12:33:09] (03CR) 10Alexandros Kosiaris: [C:03+1] "LoL, yes." [puppet] - 10https://gerrit.wikimedia.org/r/1043735 (owner: 10Muehlenhoff) [12:40:37] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2178', diff saved to https://phabricator.wikimedia.org/P64966 and previous config saved to /var/cache/conftool/dbconfig/20240614-124036-marostegui.json [12:44:34] (03PS1) 10MVernon: cephadm: handle host details being unavailable [puppet] - 10https://gerrit.wikimedia.org/r/1043762 (https://phabricator.wikimedia.org/T279621) [12:45:08] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 0:15:00 on gitlab2002.wikimedia.org with reason: GitLab upgrade [12:45:10] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on gitlab2002.wikimedia.org with reason: GitLab upgrade [12:47:35] PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 72 probes of 793 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [12:49:05] (03CR) 10Majavah: [C:03+1] cephadm: handle host details being unavailable [puppet] - 10https://gerrit.wikimedia.org/r/1043762 (https://phabricator.wikimedia.org/T279621) (owner: 10MVernon) [12:49:23] (03CR) 10MVernon: [C:03+2] cephadm: handle host details being unavailable [puppet] - 10https://gerrit.wikimedia.org/r/1043762 (https://phabricator.wikimedia.org/T279621) (owner: 10MVernon) [12:50:37] (03CR) 10Bking: [C:03+2] dse-k8s: 
harmonize airflow user/namespace/db names [puppet] - 10https://gerrit.wikimedia.org/r/1043277 (https://phabricator.wikimedia.org/T363001) (owner: 10Bking) [12:51:21] PROBLEM - Gitlab HTTPS healthcheck on gitlab.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 3498 bytes in 0.130 second response time https://wikitech.wikimedia.org/wiki/GitLab%23Monitoring [12:51:54] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host moss-be2001.codfw.wmnet with OS bookworm [12:52:10] 10SRE-swift-storage, 13Patch-For-Review: Set up Misc Object Storage Service (moss) - https://phabricator.wikimedia.org/T279621#9892292 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host moss-be2001.codfw.wmnet with OS bookworm [12:52:21] RECOVERY - Gitlab HTTPS healthcheck on gitlab.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 105367 bytes in 0.573 second response time https://wikitech.wikimedia.org/wiki/GitLab%23Monitoring [12:53:23] (03CR) 10JMeybohm: [C:03+2] prometheus::k8s: Keep envoy ratelimit metrics [puppet] - 10https://gerrit.wikimedia.org/r/1043667 (https://phabricator.wikimedia.org/T362310) (owner: 10JMeybohm) [12:53:35] (03CR) 10Bking: [C:03+2] dse-k8s: harmonize airflow user/namespace/db names [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043275 (https://phabricator.wikimedia.org/T363001) (owner: 10Bking) [12:54:46] !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on moss-be2001.codfw.wmnet with reason: host reimage [12:55:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2178', diff saved to https://phabricator.wikimedia.org/P64967 and previous config saved to /var/cache/conftool/dbconfig/20240614-125543-marostegui.json [12:56:30] (03Merged) 10jenkins-bot: dse-k8s: harmonize airflow user/namespace/db names [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043275 (https://phabricator.wikimedia.org/T363001) (owner: 10Bking) 
[12:57:47] (03PS1) 10Jelto: aptrepo: bump gitlab-runner and gitlab-ce to 17.0 [puppet] - 10https://gerrit.wikimedia.org/r/1043764 (https://phabricator.wikimedia.org/T365675) [12:58:22] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on moss-be2001.codfw.wmnet with reason: host reimage [12:58:35] !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [12:58:45] FIRING: [2x] SystemdUnitFailed: gitlab-package-puller.service on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:58:51] !log bking@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [12:59:56] (03CR) 10Muehlenhoff: [C:03+2] Apply openldap::maintenance role to ldap-maint* hosts [puppet] - 10https://gerrit.wikimedia.org/r/1043709 (https://phabricator.wikimedia.org/T367490) (owner: 10Muehlenhoff) [13:00:57] 10SRE-tools, 06DC-Ops, 06Infrastructure-Foundations, 10Spicerack, 13Patch-For-Review: Spicerack: expand Supermicro support in the Redfish module - https://phabricator.wikimedia.org/T365372#9892336 (10elukey) Note for me - this is an example of snippet generated by the provision cookbook to instruct the D... 
[13:02:35] RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 26 probes of 793 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [13:05:15] !log restart db1150, db1171 [13:05:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:06:30] 10SRE-tools, 06Infrastructure-Foundations, 10Observability-Alerting, 10Spicerack: sre.hosts.downtime, and any other maintenance processes, should use auto-extending silences - https://phabricator.wikimedia.org/T367466#9892352 (10MatthewVernon) I had not heard of that tool, but it does sound useful! [13:07:04] (03PS1) 10Slyngshede: P::idm::docker Docker installation of Bitu on cloudweb-dev [puppet] - 10https://gerrit.wikimedia.org/r/1043769 [13:08:43] (03PS1) 10Muehlenhoff: Add ldap-admins settings to new ldap-maint role [puppet] - 10https://gerrit.wikimedia.org/r/1043770 (https://phabricator.wikimedia.org/T367490) [13:10:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2178 (T367261)', diff saved to https://phabricator.wikimedia.org/P64968 and previous config saved to /var/cache/conftool/dbconfig/20240614-131051-marostegui.json [13:10:53] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2192.codfw.wmnet with reason: Maintenance [13:10:55] T367261: Rebuild recentchanges table everywhere - https://phabricator.wikimedia.org/T367261 [13:11:06] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2192.codfw.wmnet with reason: Maintenance [13:11:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2192 (T367261)', diff saved to https://phabricator.wikimedia.org/P64969 and previous config saved to /var/cache/conftool/dbconfig/20240614-131113-marostegui.json [13:13:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after 
maintenance db2192 (T367261)', diff saved to https://phabricator.wikimedia.org/P64970 and previous config saved to /var/cache/conftool/dbconfig/20240614-131339-marostegui.json [13:17:42] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host moss-be2001.codfw.wmnet with OS bookworm [13:17:52] 10SRE-swift-storage, 13Patch-For-Review: Set up Misc Object Storage Service (moss) - https://phabricator.wikimedia.org/T279621#9892367 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host moss-be2001.codfw.wmnet with OS bookworm completed: - moss-be2001 (**PASS**)... [13:18:26] (03Abandoned) 10Majavah: P:cephadm::controller: don't crash if single host location lookup fails [puppet] - 10https://gerrit.wikimedia.org/r/1043750 (owner: 10Majavah) [13:19:05] (03CR) 10Muehlenhoff: [C:03+2] Add ldap-admins settings to new ldap-maint role [puppet] - 10https://gerrit.wikimedia.org/r/1043770 (https://phabricator.wikimedia.org/T367490) (owner: 10Muehlenhoff) [13:19:35] PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 65 probes of 793 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [13:20:21] (03CR) 10Jelto: [C:03+1] "lgtm. 
If we have to migrate we can introduce -new name again if needed" [puppet] - 10https://gerrit.wikimedia.org/r/1043181 (owner: 10Dzahn) [13:21:58] (03PS1) 10Majavah: hieradata: Move cloudvirt1034 to OVS [puppet] - 10https://gerrit.wikimedia.org/r/1043775 (https://phabricator.wikimedia.org/T364457) [13:22:02] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Get test host connected to codfw row c/d lsw's - https://phabricator.wikimedia.org/T367512 (10cmooney) 03NEW p:05Triage→03Medium [13:22:12] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Codfw row C/D switch installation & configuration - https://phabricator.wikimedia.org/T364095#9892418 (10cmooney) [13:22:13] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Get test host connected to codfw row c/d lsw's - https://phabricator.wikimedia.org/T367512#9892417 (10cmooney) [13:22:15] !log taavi@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on cloudvirt1034.eqiad.wmnet with reason: reimage and move to OVS [13:22:28] !log taavi@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on cloudvirt1034.eqiad.wmnet with reason: reimage and move to OVS [13:23:05] !log taavi@cumin1002 START - Cookbook sre.hosts.reimage for host cloudvirt1034.eqiad.wmnet with OS bookworm [13:23:46] RESOLVED: SystemdUnitFailed: wmf_auto_restart_exim4.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:24:35] RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 32 probes of 793 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [13:24:43] !log restart db1216, db1225, db1240, db1245 [13:24:46] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:50] (03CR) 10Majavah: [C:03+2] hieradata: Move cloudvirt1034 to OVS [puppet] - 10https://gerrit.wikimedia.org/r/1043775 (https://phabricator.wikimedia.org/T364457) (owner: 10Majavah) [13:28:28] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host moss-be2002.codfw.wmnet with OS bookworm [13:28:38] 10SRE-swift-storage, 13Patch-For-Review: Set up Misc Object Storage Service (moss) - https://phabricator.wikimedia.org/T279621#9892445 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host moss-be2002.codfw.wmnet with OS bookworm [13:28:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2192', diff saved to https://phabricator.wikimedia.org/P64971 and previous config saved to /var/cache/conftool/dbconfig/20240614-132847-marostegui.json [13:30:39] 06SRE, 10Cassandra, 06Data-Persistence: Migrate Cassandra to Java 11 - https://phabricator.wikimedia.org/T350567#9892453 (10Eevans) [13:31:13] (03PS1) 10Elukey: cli: modify get_distro_name to return the version id [software/debmonitor-client] - 10https://gerrit.wikimedia.org/r/1043780 (https://phabricator.wikimedia.org/T240193) [13:34:43] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [13:35:58] (03PS14) 10Bking: team-search-platform: Add kafka topic alerts for new search pipeline [alerts] - 10https://gerrit.wikimedia.org/r/1043198 (https://phabricator.wikimedia.org/T349772) [13:36:09] (03CR) 10Slyngshede: cli: modify get_distro_name to return the version id (031 comment) [software/debmonitor-client] - 10https://gerrit.wikimedia.org/r/1043780 (https://phabricator.wikimedia.org/T240193) (owner: 10Elukey) [13:36:40] (03CR) 10Bking: team-search-platform: Add kafka topic alerts for 
new search pipeline (2 comments) [alerts] - https://gerrit.wikimedia.org/r/1043198 (https://phabricator.wikimedia.org/T349772) (owner: Bking)
[13:39:47] (CR) Elukey: cli: modify get_distro_name to return the version id (1 comment) [software/debmonitor-client] - https://gerrit.wikimedia.org/r/1043780 (https://phabricator.wikimedia.org/T240193) (owner: Elukey)
[13:40:48] (CR) Slyngshede: cli: modify get_distro_name to return the version id (1 comment) [software/debmonitor-client] - https://gerrit.wikimedia.org/r/1043780 (https://phabricator.wikimedia.org/T240193) (owner: Elukey)
[13:41:57] !log taavi@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1034.eqiad.wmnet with reason: host reimage
[13:42:07] (PS1) Muehlenhoff: Enable account check on new ldap-main host in eqiad [puppet] - https://gerrit.wikimedia.org/r/1043783 (https://phabricator.wikimedia.org/T367490)
[13:43:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2192', diff saved to https://phabricator.wikimedia.org/P64972 and previous config saved to /var/cache/conftool/dbconfig/20240614-134354-marostegui.json
[13:44:03] (PS2) Elukey: cli: modify get_distro_name to return the version id [software/debmonitor-client] - https://gerrit.wikimedia.org/r/1043780 (https://phabricator.wikimedia.org/T240193)
[13:44:26] (CR) Elukey: cli: modify get_distro_name to return the version id (1 comment) [software/debmonitor-client] - https://gerrit.wikimedia.org/r/1043780 (https://phabricator.wikimedia.org/T240193) (owner: Elukey)
[13:44:42] !log taavi@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1034.eqiad.wmnet with reason: host reimage
[13:47:05] (CR) Muehlenhoff: [C:+2] Enable account check on new ldap-main host in eqiad [puppet] - https://gerrit.wikimedia.org/r/1043783 (https://phabricator.wikimedia.org/T367490) (owner: Muehlenhoff)
[13:47:37] (PS1) Urbanecm:
Growth: Enable CommunityConfiguration on arwiki, eswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1043784 (https://phabricator.wikimedia.org/T364895)
[13:47:53] !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on moss-be2002.codfw.wmnet with reason: host reimage
[13:49:15] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "new ldap-maint hosts - jmm@cumin2002 - T367490"
[13:49:21] T367490: Split out ldap management from mwmaint - https://phabricator.wikimedia.org/T367490
[13:50:24] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on moss-be2002.codfw.wmnet with reason: host reimage
[13:52:22] !log restart db2139, db2141
[13:52:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:53:59] (CR) Elukey: "Thanks for fixing the CR Janis! I am adding Tobias to review it and test it, when I filed it I just checked the InferenceService CRD but I" [deployment-charts] - https://gerrit.wikimedia.org/r/1026954 (https://phabricator.wikimedia.org/T362978) (owner: Elukey)
[13:55:04] (Abandoned) Elukey: services: set up TLS validation experiment for sessionstore in staging [deployment-charts] - https://gerrit.wikimedia.org/r/1017320 (https://phabricator.wikimedia.org/T352647) (owner: Elukey)
[13:57:14] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-ctrl2003.mgmt.codfw.wmnet with reboot policy FORCED
[13:59:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2192 (T367261)', diff saved to https://phabricator.wikimedia.org/P64973 and previous config saved to /var/cache/conftool/dbconfig/20240614-135900-marostegui.json
[13:59:02] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2201.codfw.wmnet with reason: Maintenance
[13:59:05] T367261: Rebuild recentchanges table everywhere - https://phabricator.wikimedia.org/T367261
[13:59:15]
!log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2201.codfw.wmnet with reason: Maintenance
[13:59:51] RECOVERY - BGP status on lsw1-a3-codfw.mgmt is OK: BGP OK - up: 8, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:01:05] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2211.codfw.wmnet with reason: Maintenance
[14:01:19] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2211.codfw.wmnet with reason: Maintenance
[14:01:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2211 (T367261)', diff saved to https://phabricator.wikimedia.org/P64974 and previous config saved to /var/cache/conftool/dbconfig/20240614-140125-marostegui.json
[14:03:46] SRE, collaboration-services, Infrastructure-Foundations, Mail: Update SPF records as needed - https://phabricator.wikimedia.org/T366113#9892614 (jhathaway) Open→Resolved spf records updated
[14:04:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2211 (T367261)', diff saved to https://phabricator.wikimedia.org/P64975 and previous config saved to /var/cache/conftool/dbconfig/20240614-140404-marostegui.json
[14:04:09] T367261: Rebuild recentchanges table everywhere - https://phabricator.wikimedia.org/T367261
[14:04:46] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "new ldap-maint hosts - jmm@cumin2002 - T367490"
[14:05:02] T367490: Split out ldap management from mwmaint - https://phabricator.wikimedia.org/T367490
[14:05:28] (CR) Brouberol: [C:+1] "LGTM!"
[puppet] - https://gerrit.wikimedia.org/r/1043607 (owner: Muehlenhoff)
[14:06:04] SRE, LDAP-Access-Requests: Grant Access to wmf for Gonyeahialam - https://phabricator.wikimedia.org/T367053#9892624 (herron) Open→Resolved done
[14:08:25] (PS7) Arnaudb: mariadb: bugfixes mysql_legacy [software/spicerack] - https://gerrit.wikimedia.org/r/1043753 (https://phabricator.wikimedia.org/T367496)
[14:08:25] (CR) Arnaudb: "@marostegui@wikimedia.org there is a few helper methods that would benefit from your review, they are stored in InstanceBase and InstanceM" [software/spicerack] - https://gerrit.wikimedia.org/r/1043753 (https://phabricator.wikimedia.org/T367496) (owner: Arnaudb)
[14:08:41] PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 39 probes of 793 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[14:09:39] (CR) CI reject: [V:-1] mariadb: bugfixes mysql_legacy [software/spicerack] - https://gerrit.wikimedia.org/r/1043753 (https://phabricator.wikimedia.org/T367496) (owner: Arnaudb)
[14:10:21] !log mvernon@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin2002"
[14:10:25] (PS8) Arnaudb: mariadb: bugfixes mysql_legacy [software/spicerack] - https://gerrit.wikimedia.org/r/1043753 (https://phabricator.wikimedia.org/T367496)
[14:11:12] !log taavi@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1034.eqiad.wmnet with OS bookworm
[14:11:17] !log mvernon@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin2002"
[14:11:24] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host
moss-be2002.codfw.wmnet with OS bookworm
[14:11:34] SRE-swift-storage, Patch-For-Review: Set up Misc Object Storage Service (moss) - https://phabricator.wikimedia.org/T279621#9892638 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host moss-be2002.codfw.wmnet with OS bookworm completed: - moss-be2002 (**PASS**)...
[14:12:03] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host moss-be2003.codfw.wmnet with OS bookworm
[14:12:15] SRE-swift-storage, Patch-For-Review: Set up Misc Object Storage Service (moss) - https://phabricator.wikimedia.org/T279621#9892641 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host moss-be2003.codfw.wmnet with OS bookworm
[14:15:55] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-ctrl2003.mgmt.codfw.wmnet with reboot policy FORCED
[14:16:39] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-ctrl2003.mgmt.codfw.wmnet with reboot policy FORCED
[14:17:06] (CR) CI reject: [V:-1] mariadb: bugfixes mysql_legacy [software/spicerack] - https://gerrit.wikimedia.org/r/1043753 (https://phabricator.wikimedia.org/T367496) (owner: Arnaudb)
[14:18:37] RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 33 probes of 793 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[14:19:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2211', diff saved to https://phabricator.wikimedia.org/P64976 and previous config saved to /var/cache/conftool/dbconfig/20240614-141911-marostegui.json
[14:21:54] SRE, Infrastructure-Foundations, Mail: Postfix inbound rollout sequence, mx-in - https://phabricator.wikimedia.org/T367517 (jhathaway) NEW
[14:24:53] (CR) Dzahn: [C:+1]
aptrepo: bump gitlab-runner and gitlab-ce to 17.0 [puppet] - https://gerrit.wikimedia.org/r/1043764 (https://phabricator.wikimedia.org/T365675) (owner: Jelto)
[14:25:33] SRE, Infrastructure-Foundations, Mail: Postfix inbound rollout sequence, mx-in - https://phabricator.wikimedia.org/T367517#9892687 (jhathaway)
[14:30:15] SRE, collaboration-services, Wikimedia-Mailing-lists, Patch-For-Review: Migrate Mailman/lists to Bullseye/Bookworm - https://phabricator.wikimedia.org/T331706#9892697 (eoghan) >>! In T331706#9875114, @Ladsgroup wrote: > Overall looks good. Just noting that rebuilding index will take a very long t...
[14:31:44] !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on moss-be2003.codfw.wmnet with reason: host reimage
[14:34:13] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on moss-be2003.codfw.wmnet with reason: host reimage
[14:34:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2211', diff saved to https://phabricator.wikimedia.org/P64978 and previous config saved to /var/cache/conftool/dbconfig/20240614-143418-marostegui.json
[14:35:58] SRE, SRE-Access-Requests: Requesting access to cassandra-staging-devs for milimetric - https://phabricator.wikimedia.org/T365074#9892750 (Dzahn) Stalled→In progress
[14:36:07] (PS1) Dreamrimmer: [Wikitech] Remove namespace 666 [mediawiki-config] - https://gerrit.wikimedia.org/r/1043797 (https://phabricator.wikimedia.org/T367254)
[14:43:03] SRE, SRE-Access-Requests: Requesting access to cassandra-staging-devs for milimetric - https://phabricator.wikimedia.org/T365074#9892774 (KOfori) a:WDoranWMF→None Approved.
[14:44:29] SRE, collaboration-services, Wikimedia-Mailing-lists, User-notice: Mailman Downtime: Migrate mailman from lists1001 to lists1004 - https://phabricator.wikimedia.org/T367521 (eoghan) NEW
[14:49:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2211 (T367261)', diff saved to https://phabricator.wikimedia.org/P64979 and previous config saved to /var/cache/conftool/dbconfig/20240614-144925-marostegui.json
[14:49:26] SRE, collaboration-services, Wikimedia-Mailing-lists, Patch-For-Review: Migrate Mailman/lists to Bullseye/Bookworm - https://phabricator.wikimedia.org/T331706#9892838 (eoghan) I've created a sub-task for the migration itself so users and community members can follow the migration itself more easi...
[14:49:30] T367261: Rebuild recentchanges table everywhere - https://phabricator.wikimedia.org/T367261
[14:50:57] (PS1) Jcrespo: dbbackups: Upgrade db1245 to MariaDB 10.6 [puppet] - https://gerrit.wikimedia.org/r/1043802 (https://phabricator.wikimedia.org/T360751)
[14:51:46] (PS4) Jcrespo: dbbackups: Remove all production references to db2102 [puppet] - https://gerrit.wikimedia.org/r/1040117 (https://phabricator.wikimedia.org/T366892)
[14:51:54] (PS2) Jcrespo: dbbackups: Upgrade db1245 to MariaDB 10.6 [puppet] - https://gerrit.wikimedia.org/r/1043802 (https://phabricator.wikimedia.org/T360751)
[14:52:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1170 (T364069)', diff saved to https://phabricator.wikimedia.org/P64980 and previous config saved to /var/cache/conftool/dbconfig/20240614-145206-marostegui.json
[14:52:12] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069
[14:52:40] (PS1) Elukey: redfish: add property for storage manager URI [software/spicerack] - https://gerrit.wikimedia.org/r/1043804 (https://phabricator.wikimedia.org/T365372)
[14:52:44] (CR) Jcrespo: [C:+2] dbbackups: Remove all production
references to db2102 [puppet] - https://gerrit.wikimedia.org/r/1040117 (https://phabricator.wikimedia.org/T366892) (owner: Jcrespo)
[14:53:21] (PS3) Jcrespo: dbbackups: Upgrade db1245 to MariaDB 10.6 [puppet] - https://gerrit.wikimedia.org/r/1043802 (https://phabricator.wikimedia.org/T360751)
[14:54:34] !log upgrade db1245 to mariadb 10.6 T360751
[14:54:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:54:38] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host moss-be2003.codfw.wmnet with OS bookworm
[14:54:42] T360751: Upgrade backup sources to MariaDB 10.6 - https://phabricator.wikimedia.org/T360751
[14:54:49] SRE-swift-storage, Patch-For-Review: Set up Misc Object Storage Service (moss) - https://phabricator.wikimedia.org/T279621#9893290 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host moss-be2003.codfw.wmnet with OS bookworm completed: - moss-be2003 (**PASS**)...
[14:56:00] (PS8) Elukey: WIP: sre.hosts.provison: add BIOS/Mgmt-console support for Supermicro [cookbooks] - https://gerrit.wikimedia.org/r/1037806 (https://phabricator.wikimedia.org/T365372)
[14:56:38] SRE, collaboration-services, Wikimedia-Mailing-lists, User-notice: Mailman Downtime: Migrate mailman from lists1001 to lists1004 - https://phabricator.wikimedia.org/T367521#9893295 (eoghan)
[14:56:40] (PS9) Elukey: WIP: sre.hosts.provison: add BIOS/Mgmt-console support for Supermicro [cookbooks] - https://gerrit.wikimedia.org/r/1037806 (https://phabricator.wikimedia.org/T365372)
[14:57:32] (CR) Andrew Bogott: "I wouldn't expect thorough documentation about these repos, they're pretty much just Zigo and Arturo."
[puppet] - https://gerrit.wikimedia.org/r/1043203 (https://phabricator.wikimedia.org/T366028) (owner: Andrew Bogott)
[14:57:53] (CR) Jcrespo: [C:+2] dbbackups: Upgrade db1245 to MariaDB 10.6 [puppet] - https://gerrit.wikimedia.org/r/1043802 (https://phabricator.wikimedia.org/T360751) (owner: Jcrespo)
[14:59:12] (PS4) Andrew Bogott: Pass --allow-releaseinfo-change to apt-get for openstack client packages [puppet] - https://gerrit.wikimedia.org/r/1043203 (https://phabricator.wikimedia.org/T366028)
[14:59:12] (PS1) Andrew Bogott: unattended upgrades: change the Origin for openstack client libs [puppet] - https://gerrit.wikimedia.org/r/1043806 (https://phabricator.wikimedia.org/T366028)
[15:00:02] (CR) Majavah: [C:+1] unattended upgrades: change the Origin for openstack client libs [puppet] - https://gerrit.wikimedia.org/r/1043806 (https://phabricator.wikimedia.org/T366028) (owner: Andrew Bogott)
[15:02:14] (CR) Hnowlan: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1043724 (https://phabricator.wikimedia.org/T357309) (owner: Hnowlan)
[15:02:52] (CR) Majavah: [C:+1] "Andrew has convinced me this is necessary for this specific case and will work with a cloud-wide cumin he's planning to run."
[puppet] - https://gerrit.wikimedia.org/r/1043203 (https://phabricator.wikimedia.org/T366028) (owner: Andrew Bogott)
[15:03:32] (CR) Andrew Bogott: [C:+2] unattended upgrades: change the Origin for openstack client libs [puppet] - https://gerrit.wikimedia.org/r/1043806 (https://phabricator.wikimedia.org/T366028) (owner: Andrew Bogott)
[15:04:44] (CR) Andrew Bogott: [C:+2] Pass --allow-releaseinfo-change to apt-get for openstack client packages [puppet] - https://gerrit.wikimedia.org/r/1043203 (https://phabricator.wikimedia.org/T366028) (owner: Andrew Bogott)
[15:06:26] PROBLEM - MariaDB Replica Lag: s1 on clouddb1017 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 306.97 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[15:07:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1170', diff saved to https://phabricator.wikimedia.org/P64981 and previous config saved to /var/cache/conftool/dbconfig/20240614-150713-marostegui.json
[15:08:45] FIRING: [3x] JobUnavailable: Reduced availability for job bacula in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:10:25] (PS8) Andrew Bogott: add domain param to openstack backend [software/cumin] - https://gerrit.wikimedia.org/r/868814 (https://phabricator.wikimedia.org/T321349)
[15:10:35] SRE, collaboration-services, Wikimedia-Mailing-lists, User-notice: Mailman Downtime: Migrate mailman from lists1001 to lists1004 - https://phabricator.wikimedia.org/T367521#9893380 (Ladsgroup) Wrote this in tech news: >Mailing lists will be unavailable for roughly two hours on Tuesday 10:00 UTC t...
[15:11:18] checking the bacula monitoring, sometimes it gets a bit crazy
[15:11:30] SRE, Infrastructure-Foundations, Mail: Postfix inbound rollout sequence, mx-in - https://phabricator.wikimedia.org/T367517#9893386 (jhathaway)
[15:12:10] (PS6) Andrew Bogott: Openstack backend: make use of all_tenants nova api flag [software/cumin] - https://gerrit.wikimedia.org/r/869332 (https://phabricator.wikimedia.org/T325773)
[15:13:34] (PS13) Bking: dse-k8s-services: Add net-new chart for Airflow [deployment-charts] - https://gerrit.wikimedia.org/r/1041759 (https://phabricator.wikimedia.org/T363001)
[15:15:24] (CR) Bking: dse-k8s-services: Add net-new chart for Airflow (3 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/1041759 (https://phabricator.wikimedia.org/T363001) (owner: Bking)
[15:16:42] (CR) Brouberol: dse-k8s-services: Add net-new chart for Airflow (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1041759 (https://phabricator.wikimedia.org/T363001) (owner: Bking)
[15:19:57] (PS1) MVernon: cephadm: install lvm2 on all target nodes, not just osds [puppet] - https://gerrit.wikimedia.org/r/1043809 (https://phabricator.wikimedia.org/T279621)
[15:21:55] !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow: apply
[15:21:56] !log bking@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow: apply
[15:22:08] SRE, collaboration-services, Wikimedia-Mailing-lists, User-notice: Mailman Downtime: Migrate mailman from lists1001 to lists1004 - https://phabricator.wikimedia.org/T367521#9893445 (eoghan) p:Triage→High
[15:22:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1170', diff saved to https://phabricator.wikimedia.org/P64982 and previous config saved to /var/cache/conftool/dbconfig/20240614-152220-marostegui.json
[15:25:10] !log brett@puppetmaster1001 conftool action : set/pooled=no; selector:
name=cp4039.ulsfo.wmnet
[15:25:43] (PS14) Bking: dse-k8s-services: Add net-new chart for Airflow [deployment-charts] - https://gerrit.wikimedia.org/r/1041759 (https://phabricator.wikimedia.org/T363001)
[15:26:52] !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow: apply
[15:26:53] !log bking@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow: apply
[15:26:54] ops-codfw, SRE, DC-Ops, Infrastructure-Foundations: Q4:rack/setup/install sretest2001 - https://phabricator.wikimedia.org/T365167#9893454 (Jhancock.wm) @elukey I had to set the mgmt ip and to set the hostname. I have not changed any other settings yet. including the password. I will email that to...
[15:27:10] !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow: apply
[15:27:11] !log bking@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow: apply
[15:27:12] (PS1) Hnowlan: DNM: Add shellbox-video vars/config [mediawiki-config] - https://gerrit.wikimedia.org/r/1043812 (https://phabricator.wikimedia.org/T357309)
[15:27:13] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host moss-be1002.eqiad.wmnet with OS bookworm
[15:27:26] RECOVERY - MariaDB Replica Lag: s1 on clouddb1017 is OK: OK slave_sql_lag Replication lag: 0.02 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[15:27:26] SRE-swift-storage, Patch-For-Review: Set up Misc Object Storage Service (moss) - https://phabricator.wikimedia.org/T279621#9893456 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host moss-be1002.eqiad.wmnet with OS bookworm
[15:27:55] (CR) CI reject: [V:-1] DNM: Add shellbox-video vars/config [mediawiki-config] - https://gerrit.wikimedia.org/r/1043812 (https://phabricator.wikimedia.org/T357309) (owner: Hnowlan)
[15:29:07] (PS15) Bking: dse-k8s-services: Add net-new chart
for Airflow [deployment-charts] - https://gerrit.wikimedia.org/r/1041759 (https://phabricator.wikimedia.org/T363001)
[15:29:19] !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow: apply
[15:29:21] !log bking@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow: apply
[15:29:25] SRE, Infrastructure-Foundations, Mail: Postfix inbound rollout sequence, mx-in - https://phabricator.wikimedia.org/T367517#9893459 (jhathaway)
[15:29:38] (PS1) Hashar: Merge branch 'stable-3.10' into wmf/stable-3.10 [software/gerrit] (wmf/stable-3.10) - https://gerrit.wikimedia.org/r/1043813 (https://phabricator.wikimedia.org/T367419)
[15:30:55] ops-codfw, SRE, DC-Ops, Infrastructure-Foundations: Q4:rack/setup/install sretest2001 - https://phabricator.wikimedia.org/T365167#9893463 (elukey) >>! In T365167#9893454, @Jhancock.wm wrote: > @elukey I had to set the mgmt ip and to set the hostname. I have not changed any other settings yet. inc...
[15:31:14] (PS16) Bking: dse-k8s-services: Add net-new chart for Airflow [deployment-charts] - https://gerrit.wikimedia.org/r/1041759 (https://phabricator.wikimedia.org/T363001)
[15:31:33] !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow: apply
[15:31:34] !log bking@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow: apply
[15:32:10] (PS17) Bking: dse-k8s-services: Add net-new chart for Airflow [deployment-charts] - https://gerrit.wikimedia.org/r/1041759 (https://phabricator.wikimedia.org/T363001)
[15:32:18] !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow: apply
[15:32:19] !log bking@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow: apply
[15:33:45] FIRING: [3x] JobUnavailable: Reduced availability for job bacula in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:33:58] (PS1) Hnowlan: Add records for shellbox-video service [dns] - https://gerrit.wikimedia.org/r/1043815 (https://phabricator.wikimedia.org/T357309)
[15:34:00] (CR) Hashar: [C:-1] "That is merging ur `stable-3.10` which is ahead by 32 commits:" [software/gerrit] (wmf/stable-3.10) - https://gerrit.wikimedia.org/r/1043813 (https://phabricator.wikimedia.org/T367419) (owner: Hashar)
[15:35:36] (PS1) Hnowlan: Add shellbox-video discovery [dns] - https://gerrit.wikimedia.org/r/1043817 (https://phabricator.wikimedia.org/T357309)
[15:35:45] PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 36 probes of 793 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[15:37:13] !log mvernon@cumin2002 END (ERROR) - Cookbook
sre.hosts.reimage (exit_code=97) for host moss-be1002.eqiad.wmnet with OS bookworm
[15:37:21] SRE-swift-storage, Patch-For-Review: Set up Misc Object Storage Service (moss) - https://phabricator.wikimedia.org/T279621#9893475 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host moss-be1002.eqiad.wmnet with OS bookworm executed with errors: - moss-be1002 (...
[15:37:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1170 (T364069)', diff saved to https://phabricator.wikimedia.org/P64984 and previous config saved to /var/cache/conftool/dbconfig/20240614-153727-marostegui.json
[15:37:30] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance
[15:37:32] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069
[15:37:43] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance
[15:37:46] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host moss-be1002.eqiad.wmnet with OS bookworm
[15:38:03] SRE-swift-storage, Patch-For-Review: Set up Misc Object Storage Service (moss) - https://phabricator.wikimedia.org/T279621#9893478 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host moss-be1002.eqiad.wmnet with OS bookworm
[15:38:45] RESOLVED: JobUnavailable: Reduced availability for job mysql-test in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:40:07] (CR) Klausman: [C:+1] "Thanks for making this patch!"
[deployment-charts] - https://gerrit.wikimedia.org/r/1026954 (https://phabricator.wikimedia.org/T362978) (owner: Elukey)
[15:40:22] ops-ulsfo, DC-Ops, Traffic: Q4: install PCIe NVMe SSDs into ulsfo text cp40(3[789]|4[01234] - https://phabricator.wikimedia.org/T364891#9893483 (BCornwall)
[15:40:43] RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 27 probes of 793 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[15:41:14] ops-ulsfo, DC-Ops, Traffic: Q4: install PCIe NVMe SSDs into ulsfo text cp40(3[789]|4[01234] - https://phabricator.wikimedia.org/T364891#9893489 (BCornwall)
[15:41:35] (CR) CI reject: [V:-1] Merge branch 'stable-3.10' into wmf/stable-3.10 [software/gerrit] (wmf/stable-3.10) - https://gerrit.wikimedia.org/r/1043813 (https://phabricator.wikimedia.org/T367419) (owner: Hashar)
[15:41:37] (CR) Elukey: "Np! Could you take care of the testing and rollout??" [deployment-charts] - https://gerrit.wikimedia.org/r/1026954 (https://phabricator.wikimedia.org/T362978) (owner: Elukey)
[15:42:02] (CR) Klausman: [C:+1] "Will do!" [deployment-charts] - https://gerrit.wikimedia.org/r/1026954 (https://phabricator.wikimedia.org/T362978) (owner: Elukey)
[15:42:13] (PS1) BCornwall: Set cp4039 hieradata to use dual NVMe disks [puppet] - https://gerrit.wikimedia.org/r/1043819 (https://phabricator.wikimedia.org/T364891)
[15:44:12] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-ctrl2003.mgmt.codfw.wmnet with reboot policy FORCED
[15:44:17] ops-codfw, SRE, DC-Ops, Machine-Learning-Team: hw troubleshooting: memory errors during boot for ml-staging2001.codfw.wmnet - https://phabricator.wikimedia.org/T366670#9893496 (Jhancock.wm) when would be a good time to coordinate this one for maintenance?
I can upgrade the idrac version and resea...
[15:46:00] (CR) CDobbins: [C:+2] Set cp4039 hieradata to use dual NVMe disks [puppet] - https://gerrit.wikimedia.org/r/1043819 (https://phabricator.wikimedia.org/T364891) (owner: BCornwall)
[15:47:06] ops-eqiad, SRE, cloud-services-team, DC-Ops: hw troubleshooting: server fails to reboot for clouddb1018.eqiad.wmnet - https://phabricator.wikimedia.org/T367499#9893507 (fnegri)
[15:48:49] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp4039.ulsfo.wmnet with OS bullseye
[15:54:42] (CR) Hashar: [C:-1] "The lfs plugin fails due to PluginCommand. https://gerrit-review.googlesource.com/c/plugins/lfs/+/430179" [software/gerrit] (wmf/stable-3.10) - https://gerrit.wikimedia.org/r/1043813 (https://phabricator.wikimedia.org/T367419) (owner: Hashar)
[15:55:42] SRE, fundraising-tech-ops, Infrastructure-Foundations, Mail: Update fundraising mail / firewall settings to use new production mx-in hosts - https://phabricator.wikimedia.org/T367573 (Dwisehaupt) NEW
[15:55:43] !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on moss-be1002.eqiad.wmnet with reason: host reimage
[15:58:00] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on moss-be1002.eqiad.wmnet with reason: host reimage
[16:00:57] !log brett@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp4039.ulsfo.wmnet with OS bullseye
[16:00:58] (PS9) Hnowlan: shellbox-video: initial helmfile configuration [deployment-charts] - https://gerrit.wikimedia.org/r/1003446 (https://phabricator.wikimedia.org/T357309) (owner: Kamila Součková)
[16:01:04] ops-ulsfo, DC-Ops, Traffic, Patch-For-Review: Q4: install PCIe NVMe SSDs into ulsfo text cp40(3[789]|4[01234] - https://phabricator.wikimedia.org/T364891#9893585 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host cp4039.ulsfo.wmnet with OS
bullseye execu...
[16:01:17] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp4039.ulsfo.wmnet with OS bullseye
[16:01:25] ops-ulsfo, DC-Ops, Traffic, Patch-For-Review: Q4: install PCIe NVMe SSDs into ulsfo text cp40(3[789]|4[01234] - https://phabricator.wikimedia.org/T364891#9893590 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host cp4039.ulsfo.wmnet with OS bullseye
[16:02:30] FIRING: [2x] ProbeDown: Service wdqs1019:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1019:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:03:19] (PS2) Hnowlan: Add records for shellbox-video service [dns] - https://gerrit.wikimedia.org/r/1043815 (https://phabricator.wikimedia.org/T357309)
[16:03:31] (CR) Krinkle: [C:+1] noc: fail with a 404 when the selected wiki is nonexistent [mediawiki-config] - https://gerrit.wikimedia.org/r/1037587 (owner: DCausse)
[16:07:30] FIRING: [4x] ProbeDown: Service wdqs1019:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:11:30] PROBLEM - MariaDB Replica Lag: s1 on clouddb1017 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 418.98 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[16:12:30] RESOLVED: [4x] ProbeDown: Service wdqs1019:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:15:04] (PS1) JHathaway: vrts_aliases: generate file as the postfix user [puppet] -
10https://gerrit.wikimedia.org/r/1043825 (https://phabricator.wikimedia.org/T325406) [16:15:45] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1043825 (https://phabricator.wikimedia.org/T325406) (owner: 10JHathaway) [16:18:27] Is someone able to give permission for a Friday backport? Asking in #wikimedia-releng @dduvall @brennen [16:19:13] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host moss-be1002.eqiad.wmnet with OS bookworm [16:19:26] 10SRE-swift-storage, 13Patch-For-Review: Set up Misc Object Storage Service (moss) - https://phabricator.wikimedia.org/T279621#9893677 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host moss-be1002.eqiad.wmnet with OS bookworm completed: - moss-be1002 (**PASS**)... [16:20:07] Jdlrobson: ok by me. I’ll be at my keyboard in 10 [16:20:23] ^ jan_drewniak [16:22:23] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp4039.ulsfo.wmnet with reason: host reimage [16:24:00] Jdlrobson: ok great do we have a backport patch? 
[16:25:27] jan_drewniak: doing that now [16:25:29] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp4039.ulsfo.wmnet with reason: host reimage [16:25:53] (03PS1) 10Jdlrobson: For now scope hatnote and infobox styles [extensions/WikimediaMessages] (wmf/1.43.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1043827 (https://phabricator.wikimedia.org/T367462) [16:25:56] ^ JD|cloud [16:25:58] ^ jan_drewniak [16:26:08] (sorry for the accidental bing JD) [16:29:32] Ok, starting the backport [16:31:06] !log starting friday backport for T367462 https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikimediaMessages/+/1043827 [16:31:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:31:10] T367462: Responsive Vector uses hatnote.less and infobox.less at all resolutions - https://phabricator.wikimedia.org/T367462 [16:32:42] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jdrewniak@deploy1002 using scap backport" [extensions/WikimediaMessages] (wmf/1.43.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1043827 (https://phabricator.wikimedia.org/T367462) (owner: 10Jdlrobson) [16:36:31] RECOVERY - MariaDB Replica Lag: s1 on clouddb1017 is OK: OK slave_sql_lag Replication lag: 0.29 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [16:47:26] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp4039.ulsfo.wmnet with OS bullseye [16:47:28] 10ops-ulsfo, 06DC-Ops, 06Traffic: Q4: install PCIe NVMe SSDs into ulsfo text cp40(3[789]|4[01234] - https://phabricator.wikimedia.org/T364891#9893774 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host cp4039.ulsfo.wmnet with OS bullseye completed: - cp4039 (**PASS... 
[16:54:32] (03Merged) 10jenkins-bot: For now scope hatnote and infobox styles [extensions/WikimediaMessages] (wmf/1.43.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1043827 (https://phabricator.wikimedia.org/T367462) (owner: 10Jdlrobson) [16:54:51] !log brett@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp4039.ulsfo.wmnet [16:54:59] !log jdrewniak@deploy1002 Started scap: Backport for [[gerrit:1043827|For now scope hatnote and infobox styles (T367462)]] [16:55:06] (03PS2) 10Jdlrobson: Enable dark mode on more pages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1042431 (https://phabricator.wikimedia.org/T366378) [16:55:15] (03CR) 10CI reject: [V:04-1] Enable dark mode on more pages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1042431 (https://phabricator.wikimedia.org/T366378) (owner: 10Jdlrobson) [16:55:40] 10ops-ulsfo, 06DC-Ops, 06Traffic: Q4: install PCIe NVMe SSDs into ulsfo text cp40(3[789]|4[01234] - https://phabricator.wikimedia.org/T364891#9893849 (10BCornwall) [16:57:36] !log jdrewniak@deploy1002 jdlrobson, jdrewniak: Backport for [[gerrit:1043827|For now scope hatnote and infobox styles (T367462)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [16:57:59] Jdlrobson: ok the patch is finally on mwdebug [16:58:05] 06SRE, 06Infrastructure-Foundations, 10netops: Move asw-c-codfw and asw-d-codfw CR uplinks to Spine switches - https://phabricator.wikimedia.org/T366941#9893856 (10cmooney) [16:59:19] (03PS3) 10Jdlrobson: Enable dark mode on more pages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1042431 (https://phabricator.wikimedia.org/T366378) [16:59:28] (03CR) 10CI reject: [V:04-1] Enable dark mode on more pages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1042431 (https://phabricator.wikimedia.org/T366378) (owner: 10Jdlrobson) [17:00:55] before https://usercontent.irccloud-cdn.com/file/Gj8nMogH/Screenshot%202024-06-14%20at%2010.00.43%E2%80%AFAM.png [17:01:12] after 
https://usercontent.irccloud-cdn.com/file/HjWuF01V/Screenshot%202024-06-14%20at%2010.01.07%E2%80%AFAM.png [17:01:13] (03CR) 10Dzahn: [C:03+2] idp: drop gitlab-new.wikimedia.org service ID [puppet] - 10https://gerrit.wikimedia.org/r/1043181 (owner: 10Dzahn) [17:01:14] yep it looks like it is having the desired effect jan_drewniak [17:01:25] 🤪 [17:01:47] please sync! [17:01:58] !log jdrewniak@deploy1002 jdlrobson, jdrewniak: Continuing with sync [17:02:04] syncing! [17:02:49] (03CR) 10Dzahn: [C:03+2] admin: add Andrew Otto to approvers for analytics-privatedate-users [puppet] - 10https://gerrit.wikimedia.org/r/1041735 (owner: 10Dzahn) [17:03:17] (03CR) 10Dzahn: [C:03+2] "approver approved by existing approver" [puppet] - 10https://gerrit.wikimedia.org/r/1041735 (owner: 10Dzahn) [17:11:06] !log jdrewniak@deploy1002 Finished scap: Backport for [[gerrit:1043827|For now scope hatnote and infobox styles (T367462)]] (duration: 16m 06s) [17:11:11] T367462: Responsive Vector uses hatnote.less and infobox.less at all resolutions - https://phabricator.wikimedia.org/T367462 [17:13:30] Jdlrobson: alright we're done! [17:13:38] thanks jan_drewniak much appreciated! [17:15:41] (03PS1) 10Scott French: mediawiki-dev: add securityContext to all containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043846 (https://phabricator.wikimedia.org/T362978) [17:20:38] (03PS2) 10Scott French: mediawiki-dev: add securityContext to all containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043846 (https://phabricator.wikimedia.org/T362978) [17:21:19] thx for handling, jan_drewniak. [17:23:35] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host wikikube-ctrl2003.mgmt.codfw.wmnet with reboot policy FORCED [17:30:13] (03PS1) 10Majavah: P:wmcs::backy2: Drop wmde-dashboards configuration [puppet] - 10https://gerrit.wikimedia.org/r/1043851 [17:31:42] (03CR) 10Scott French: "Hey Janis - Let me know what you think about the way I've gone about this. 
Happy to reconsider the defaults for whether the restricted-com" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043846 (https://phabricator.wikimedia.org/T362978) (owner: 10Scott French) [17:34:43] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [17:34:44] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-ctrl2003.mgmt.codfw.wmnet with reboot policy FORCED [17:35:01] (03CR) 10Scott French: "I should also mention: I've not added support for SYS_PTRACE on the app container, as I see no evidence of it configuring slow logs. That " [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043846 (https://phabricator.wikimedia.org/T362978) (owner: 10Scott French) [17:41:39] PROBLEM - MariaDB Replica Lag: s4 on clouddb1019 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 433.97 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [17:52:31] PROBLEM - MariaDB Replica Lag: s1 on clouddb1017 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 453.95 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [17:52:45] FIRING: [2x] Outbound discards: Alert for device asw2-c-eqiad.mgmt.eqiad.wmnet - Outbound discards - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards [18:09:07] 10ops-codfw, 06SRE-OnFire, 06DC-Ops, 06serviceops, 10Sustainability (Incident Followup): codfw:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366205#9894098 (10Papaul) @kamila 2003 is ready for. 
[18:09:39] RECOVERY - MariaDB Replica Lag: s4 on clouddb1019 is OK: OK slave_sql_lag Replication lag: 0.15 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [18:09:54] (03CR) 10Majavah: [C:03+2] P:wmcs::backy2: Drop wmde-dashboards configuration [puppet] - 10https://gerrit.wikimedia.org/r/1043851 (owner: 10Majavah) [18:18:32] (03PS4) 10Jdlrobson: Enable dark mode on more pages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1042431 (https://phabricator.wikimedia.org/T366378) [18:22:43] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043864 [18:23:14] (03PS1) 10CDobbins: Set cp4040 hieradata to use dual NVMe disks [puppet] - 10https://gerrit.wikimedia.org/r/1043865 (https://phabricator.wikimedia.org/T364891) [18:24:31] RECOVERY - MariaDB Replica Lag: s1 on clouddb1017 is OK: OK slave_sql_lag Replication lag: 49.80 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [18:30:07] (03CR) 10BCornwall: [C:03+1] Set cp4040 hieradata to use dual NVMe disks [puppet] - 10https://gerrit.wikimedia.org/r/1043865 (https://phabricator.wikimedia.org/T364891) (owner: 10CDobbins) [18:41:35] (03PS1) 10JHathaway: vrts_aliases: uniq [puppet] - 10https://gerrit.wikimedia.org/r/1043868 (https://phabricator.wikimedia.org/T325406) [18:41:56] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1043868 (https://phabricator.wikimedia.org/T325406) (owner: 10JHathaway) [18:44:06] (03CR) 10CDobbins: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1043865 (https://phabricator.wikimedia.org/T364891) (owner: 10CDobbins) [18:46:49] (03CR) 10JHathaway: [C:03+2] vrts_aliases: generate file as the postfix user [puppet] - 10https://gerrit.wikimedia.org/r/1043825 (https://phabricator.wikimedia.org/T325406) (owner: 10JHathaway) [18:46:59] (03CR) 10JHathaway: [C:03+2] vrts_aliases: uniq [puppet] - 
10https://gerrit.wikimedia.org/r/1043868 (https://phabricator.wikimedia.org/T325406) (owner: 10JHathaway) [18:52:48] (03CR) 10CDobbins: [C:03+2] Set cp4040 hieradata to use dual NVMe disks [puppet] - 10https://gerrit.wikimedia.org/r/1043865 (https://phabricator.wikimedia.org/T364891) (owner: 10CDobbins) [18:54:45] !log cdobbins@cumin1002 conftool action : set/pooled=no; selector: name=4040.ulsfo.wmnet [18:59:30] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): hw troubleshooting: server fails to reboot for clouddb1018.eqiad.wmnet - https://phabricator.wikimedia.org/T367499#9894294 (10Marostegui) Sometimes it means the host is stuck at a memory check - should be visible onsite. [18:59:35] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:00:01] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:00:03] 06SRE, 06Infrastructure-Foundations, 10Mail: Postfix inbound rollout sequence, mx-in - https://phabricator.wikimedia.org/T367517#9894300 (10jhathaway) [19:00:25] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 52065 bytes in 0.068 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:00:28] !log cdobbins@cumin1002 START - Cookbook sre.hosts.reimage for host cp4040.ulsfo.wmnet with OS bullseye [19:00:38] 10ops-ulsfo, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4: install PCIe NVMe SSDs into ulsfo text cp40(3[789]|4[01234] - https://phabricator.wikimedia.org/T364891#9894303 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cdobbins@cumin1002 for host cp4040.ulsfo.wmnet with OS bullseye [19:00:53] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8616 bytes in 0.277 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:17:31] 
PROBLEM - MariaDB Replica Lag: s1 on clouddb1017 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 348.94 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [19:17:43] RECOVERY - Host rdb1014 is UP: PING WARNING - Packet loss = 60%, RTA = 33.50 ms [19:17:47] PROBLEM - SSH on rdb1014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [19:24:07] PROBLEM - Host rdb1014 is DOWN: PING CRITICAL - Packet loss = 100% [19:24:25] (03CR) 10Krinkle: [C:03+2] password: Document wmgPasswordSecretKey (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1034905 (https://phabricator.wikimedia.org/T150647) (owner: 10Krinkle) [19:26:23] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1246.eqiad.wmnet with reason: Maintenance [19:26:36] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1246.eqiad.wmnet with reason: Maintenance [19:26:44] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1246 (T352010)', diff saved to https://phabricator.wikimedia.org/P64987 and previous config saved to /var/cache/conftool/dbconfig/20240614-192643-ladsgroup.json [19:26:50] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [19:27:23] !log cdobbins@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp4040.ulsfo.wmnet with OS bullseye [19:27:34] 10ops-ulsfo, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4: install PCIe NVMe SSDs into ulsfo text cp40(3[789]|4[01234] - https://phabricator.wikimedia.org/T364891#9894360 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cdobbins@cumin1002 for host cp4040.ulsfo.wmnet with OS bullseye ex... 
[19:27:54] !log cdobbins@cumin1002 START - Cookbook sre.hosts.reimage for host cp4040.ulsfo.wmnet with OS bullseye [19:28:03] 10ops-ulsfo, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4: install PCIe NVMe SSDs into ulsfo text cp40(3[789]|4[01234] - https://phabricator.wikimedia.org/T364891#9894362 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cdobbins@cumin1002 for host cp4040.ulsfo.wmnet with OS bullseye [19:37:03] RECOVERY - MD RAID on aqs1013 is OK: OK: Active: 12, Working: 12, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [19:39:29] (03PS1) 10Jdlrobson: Cleanup: Remove wgNavigationTimingSurveyName [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1043880 (https://phabricator.wikimedia.org/T367128) [19:39:42] (03CR) 10Jdlrobson: Disable quick surveys using deprecated configuration (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1041748 (https://phabricator.wikimedia.org/T367128) (owner: 10Jdlrobson) [19:47:45] FIRING: [3x] Outbound discards: Alert for device asw2-b-eqiad.mgmt.eqiad.wmnet - Outbound discards - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards [19:49:11] !log cdobbins@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp4040.ulsfo.wmnet with reason: host reimage [19:52:34] !log cdobbins@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp4040.ulsfo.wmnet with reason: host reimage [19:55:54] 10ops-eqiad, 06SRE, 10Cassandra, 06DC-Ops: Degraded RAID on aqs1013 - https://phabricator.wikimedia.org/T362033#9894449 (10Eevans) The array rebuild is complete: ` eevans@aqs1013:~$ sudo mdadm --detail /dev/md2 /dev/md2: Version : 1.2 Creation Time : Thu May 9 14:23:21 2024 Raid... 
[20:10:32] 10ops-ulsfo, 06DC-Ops, 06Traffic: Q4: install PCIe NVMe SSDs into ulsfo text cp40(3[789]|4[01234] - https://phabricator.wikimedia.org/T364891#9894473 (10RobH) [20:11:37] (03PS3) 10NMW03: Enable local uploads for Gilaki Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1042430 (https://phabricator.wikimedia.org/T364673) [20:12:45] RESOLVED: [3x] Outbound discards: Device asw2-b-eqiad.mgmt.eqiad.wmnet recovered from Outbound discards - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards [20:14:32] !log cdobbins@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp4040.ulsfo.wmnet with OS bullseye [20:14:35] 10ops-ulsfo, 06DC-Ops, 06Traffic: Q4: install PCIe NVMe SSDs into ulsfo text cp40(3[789]|4[01234] - https://phabricator.wikimedia.org/T364891#9894482 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cdobbins@cumin1002 for host cp4040.ulsfo.wmnet with OS bullseye completed: - cp4040 (**P... [20:22:24] !log cdobbins@cumin1002 conftool action : set/pooled=yes; selector: name=4040.ulsfo.wmnet [20:25:02] 10ops-codfw, 06Data-Platform-SRE, 06DC-Ops: Low priority: Elastic2099 unresponsive - https://phabricator.wikimedia.org/T367598 (10bking) 03NEW [20:27:17] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2218 (T352010)', diff saved to https://phabricator.wikimedia.org/P64988 and previous config saved to /var/cache/conftool/dbconfig/20240614-202717-ladsgroup.json [20:27:22] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [20:40:17] 10ops-ulsfo, 06DC-Ops, 06Traffic: Q4: install PCIe NVMe SSDs into ulsfo text cp40(3[789]|4[01234] - https://phabricator.wikimedia.org/T364891#9894553 (10CDobbins) [20:42:24] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2218', diff saved to https://phabricator.wikimedia.org/P64989 and previous config saved to 
/var/cache/conftool/dbconfig/20240614-204224-ladsgroup.json [20:50:56] 10ops-codfw, 06SRE, 06Data-Platform-SRE, 06DC-Ops: Elastic2099 unresponsive - https://phabricator.wikimedia.org/T367598#9894567 (10bking) [20:57:33] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2218', diff saved to https://phabricator.wikimedia.org/P64990 and previous config saved to /var/cache/conftool/dbconfig/20240614-205731-ladsgroup.json [20:58:39] (03PS1) 10BCornwall: Set cp4041 hieradata to use dual NVMe disks [puppet] - 10https://gerrit.wikimedia.org/r/1043888 (https://phabricator.wikimedia.org/T364891) [20:58:55] 10ops-codfw, 06SRE, 06Data-Platform-SRE, 06DC-Ops: Elastic2099 unresponsive - https://phabricator.wikimedia.org/T367598#9894621 (10bking) a:03Papaul [21:07:23] (03CR) 10CDobbins: [C:03+2] Set cp4041 hieradata to use dual NVMe disks [puppet] - 10https://gerrit.wikimedia.org/r/1043888 (https://phabricator.wikimedia.org/T364891) (owner: 10BCornwall) [21:12:40] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2218 (T352010)', diff saved to https://phabricator.wikimedia.org/P64991 and previous config saved to /var/cache/conftool/dbconfig/20240614-211239-ladsgroup.json [21:12:45] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [21:31:45] !log brett@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp4041.ulsfo.wmnet [21:33:33] !log restart swift-proxy on ms-fe1010 T360913 [21:33:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:33:37] T360913: Swift proxy server misbehaviour (no longer calling `accept`?) 
- https://phabricator.wikimedia.org/T360913 [21:33:49] !log brett@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp4040.ulsfo.wmnet [21:34:43] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [21:41:07] (03CR) 10BCornwall: [C:03+2] Set cp4041 hieradata to use dual NVMe disks [puppet] - 10https://gerrit.wikimedia.org/r/1043888 (https://phabricator.wikimedia.org/T364891) (owner: 10BCornwall) [21:45:40] 10ops-codfw, 06SRE, 06Data-Platform-SRE, 06DC-Ops: Elastic2099 unresponsive - https://phabricator.wikimedia.org/T367598#9894715 (10Papaul) The System Configuration Check operation resulted in the following issue: Comm Error: Backplane is the error showing in the IDRAC logs and the server is stuck at the DE... [21:46:54] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp4041.ulsfo.wmnet with OS bullseye [21:47:05] 10ops-ulsfo, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4: install PCIe NVMe SSDs into ulsfo text cp40(3[789]|4[01234] - https://phabricator.wikimedia.org/T364891#9894722 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host cp4041.ulsfo.wmnet with OS bullseye [21:48:50] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1174.eqiad.wmnet with reason: Maintenance [21:49:03] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1174.eqiad.wmnet with reason: Maintenance [21:49:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1174 (T364069)', diff saved to https://phabricator.wikimedia.org/P64992 and previous config saved to /var/cache/conftool/dbconfig/20240614-214910-marostegui.json [21:49:14] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [21:55:49] 
(03PS1) 10Eevans: cassandra: drop (unused) aqs role [puppet] - 10https://gerrit.wikimedia.org/r/1043894 (https://phabricator.wikimedia.org/T313877) [21:58:46] FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:02:59] !log brett@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp4041.ulsfo.wmnet with OS bullseye [22:03:06] 10ops-ulsfo, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4: install PCIe NVMe SSDs into ulsfo text cp40(3[789]|4[01234] - https://phabricator.wikimedia.org/T364891#9894758 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host cp4041.ulsfo.wmnet with OS bullseye execu... [22:03:15] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp4041.ulsfo.wmnet with OS bullseye [22:03:22] 10ops-ulsfo, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4: install PCIe NVMe SSDs into ulsfo text cp40(3[789]|4[01234] - https://phabricator.wikimedia.org/T364891#9894762 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host cp4041.ulsfo.wmnet with OS bullseye [22:15:31] (03PS1) 10Dzahn: codesearch: add support for docker-ce on bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1043901 (https://phabricator.wikimedia.org/T367479) [22:15:44] (03CR) 10CI reject: [V:04-1] codesearch: add support for docker-ce on bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1043901 (https://phabricator.wikimedia.org/T367479) (owner: 10Dzahn) [22:17:15] (03PS2) 10Dzahn: codesearch: add support for docker-ce on bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1043901 (https://phabricator.wikimedia.org/T367479) [22:17:37] (03CR) 10CI reject: [V:04-1] codesearch: add support for docker-ce on bookworm [puppet] - 
10https://gerrit.wikimedia.org/r/1043901 (https://phabricator.wikimedia.org/T367479) (owner: 10Dzahn) [22:20:32] (03PS3) 10Dzahn: codesearch: add support for docker-ce on bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1043901 (https://phabricator.wikimedia.org/T367479) [22:20:53] (03CR) 10CI reject: [V:04-1] codesearch: add support for docker-ce on bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1043901 (https://phabricator.wikimedia.org/T367479) (owner: 10Dzahn) [22:22:47] (03PS4) 10Dzahn: codesearch: add support for docker-ce on bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1043901 (https://phabricator.wikimedia.org/T367479) [22:24:38] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp4041.ulsfo.wmnet with reason: host reimage [22:25:56] 10ops-eqsin, 06SRE, 06DC-Ops, 06Traffic: Q4: install PCIe NVMe SSDs into eqsin text cp50(1[789]|2[01234] - https://phabricator.wikimedia.org/T365763#9894944 (10RobH) [22:27:08] (03PS5) 10Dzahn: codesearch: add support for docker-ce on bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1043901 (https://phabricator.wikimedia.org/T367479) [22:27:39] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp4041.ulsfo.wmnet with reason: host reimage [22:33:09] !log mnz@deploy1002 Started deploy [airflow-dags/research@5e1cd80]: (no justification provided) [22:33:40] !log mnz@deploy1002 Finished deploy [airflow-dags/research@5e1cd80]: (no justification provided) (duration: 00m 31s) [22:35:40] (03PS6) 10Dzahn: codesearch: add support for docker-ce on bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1043901 (https://phabricator.wikimedia.org/T367479) [22:36:01] (03CR) 10CI reject: [V:04-1] codesearch: add support for docker-ce on bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1043901 (https://phabricator.wikimedia.org/T367479) (owner: 10Dzahn) [22:37:45] (03PS7) 10Dzahn: codesearch: add support for docker-ce on bookworm [puppet] - 
10https://gerrit.wikimedia.org/r/1043901 (https://phabricator.wikimedia.org/T367479) [22:41:11] (03CR) 10CI reject: [V:04-1] codesearch: add support for docker-ce on bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1043901 (https://phabricator.wikimedia.org/T367479) (owner: 10Dzahn) [22:46:46] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:49:38] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 52065 bytes in 0.116 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:50:26] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp4041.ulsfo.wmnet with OS bullseye [22:50:31] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Traffic: Q4: install PCIe NVMe SSDs into ulsfo text cp40(3[789]|4[01234] - https://phabricator.wikimedia.org/T364891#9894981 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host cp4041.ulsfo.wmnet with OS bullseye completed: - cp404... 
[22:53:06] (03PS8) 10Dzahn: codesearch: add support for docker-ce on bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1043901 (https://phabricator.wikimedia.org/T367479) [22:53:28] (03CR) 10CI reject: [V:04-1] codesearch: add support for docker-ce on bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1043901 (https://phabricator.wikimedia.org/T367479) (owner: 10Dzahn) [22:55:25] !log brett@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp4041.ulsfo.wmnet [22:55:54] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Traffic: Q4: install PCIe NVMe SSDs into ulsfo text cp40(3[789]|4[01234] - https://phabricator.wikimedia.org/T364891#9894983 (10BCornwall) [22:56:07] (03PS9) 10Dzahn: codesearch: add support for docker-ce on bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1043901 (https://phabricator.wikimedia.org/T367479) [23:06:50] PROBLEM - BGP status on cr1-esams is CRITICAL: BGP CRITICAL - AS1299/IPv6: Active - Telia, AS1299/IPv4: Active - Telia https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [23:09:09] !log mnz@deploy1002 Started deploy [airflow-dags/research@ee5a291]: (no justification provided) [23:09:39] !log mnz@deploy1002 Finished deploy [airflow-dags/research@ee5a291]: (no justification provided) (duration: 00m 30s) [23:38:24] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1043948 [23:38:24] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1043948 (owner: 10TrainBranchBot)