[00:39:13] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/991809
[00:39:19] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/991809 (owner: TrainBranchBot)
[00:46:59] PROBLEM - Check systemd state on logstash2026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:01:12] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/991809 (owner: TrainBranchBot)
[01:13:25] (CR) Cwhite: [C: +1] icinga: remove legacy check_nagios_paging [puppet] - https://gerrit.wikimedia.org/r/991801 (https://phabricator.wikimedia.org/T321808) (owner: Filippo Giunchedi)
[01:39:19] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:39:19] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:14:19] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:25:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1248 (T352010)', diff saved to https://phabricator.wikimedia.org/P55079 and previous config saved to /var/cache/conftool/dbconfig/20240120-032542-ladsgroup.json
[03:25:48] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[03:40:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1248', diff saved to https://phabricator.wikimedia.org/P55080 and previous config saved to /var/cache/conftool/dbconfig/20240120-034049-ladsgroup.json
[03:44:53] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[03:46:15] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.267 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[03:55:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1248', diff saved to https://phabricator.wikimedia.org/P55081 and previous config saved to /var/cache/conftool/dbconfig/20240120-035555-ladsgroup.json
[04:11:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1248 (T352010)', diff saved to https://phabricator.wikimedia.org/P55082 and previous config saved to /var/cache/conftool/dbconfig/20240120-041102-ladsgroup.json
[04:11:05] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1249.eqiad.wmnet with reason: Maintenance
[04:11:18] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1249.eqiad.wmnet with reason: Maintenance
[04:11:19] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[04:11:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1249 (T352010)', diff saved to https://phabricator.wikimedia.org/P55083 and previous config saved to /var/cache/conftool/dbconfig/20240120-041124-ladsgroup.json
[09:07:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1249 (T352010)', diff saved to https://phabricator.wikimedia.org/P55084 and previous config saved to /var/cache/conftool/dbconfig/20240120-090751-ladsgroup.json
[09:08:00] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[09:22:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1249', diff saved to https://phabricator.wikimedia.org/P55085 and previous config saved to /var/cache/conftool/dbconfig/20240120-092257-ladsgroup.json
[09:38:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1249', diff saved to https://phabricator.wikimedia.org/P55086 and previous config saved to /var/cache/conftool/dbconfig/20240120-093804-ladsgroup.json
[09:53:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1249 (T352010)', diff saved to https://phabricator.wikimedia.org/P55087 and previous config saved to /var/cache/conftool/dbconfig/20240120-095311-ladsgroup.json
[09:53:13] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
[09:53:16] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[09:53:27] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
[09:55:15] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[10:00:15] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[13:11:29] PROBLEM - Host mr1-eqsin IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[13:11:35] PROBLEM - Host asw1-eqsin is DOWN: PING CRITICAL - Packet loss = 100%
[13:13:29] PROBLEM - OSPF status on cr3-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[13:13:49] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[13:19:17] PROBLEM - Host mr1-eqsin.oob IPv6 is DOWN: CRITICAL - Destination Unreachable (2403:b100:3001:9::2)
[13:19:20] (JobUnavailable) firing: Reduced availability for job pdu_sentry4 in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:39:20] (JobUnavailable) firing: (2) Reduced availability for job pdu_sentry4 in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:44:38] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2099.codfw.wmnet with reason: Maintenance
[14:44:51] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2099.codfw.wmnet with reason: Maintenance
[14:59:20] (JobUnavailable) firing: (2) Reduced availability for job pdu_sentry4 in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:11:44] (SwiftTooManyMediaUploads) firing: Too many codfw mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[15:21:44] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[15:41:44] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[15:51:45] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[15:52:06] (PS1) Zabe: Make af_actor and afh_actor accessible in Wiki Replicas [puppet] - https://gerrit.wikimedia.org/r/991923 (https://phabricator.wikimedia.org/T337921)
[15:56:58] (PS2) Zabe: Make af_actor and afh_actor accessible in Wiki Replicas [puppet] - https://gerrit.wikimedia.org/r/991923 (https://phabricator.wikimedia.org/T337921)
[16:15:14] (CR) Matěj Suchánek: Make af_actor and afh_actor accessible in Wiki Replicas (1 comment) [puppet] - https://gerrit.wikimedia.org/r/991923 (https://phabricator.wikimedia.org/T337921) (owner: Zabe)
[16:17:08] (PS3) Zabe: Make af_actor and afh_actor accessible in Wiki Replicas [puppet] - https://gerrit.wikimedia.org/r/991923 (https://phabricator.wikimedia.org/T337921)
[16:17:16] (CR) Zabe: Make af_actor and afh_actor accessible in Wiki Replicas (1 comment) [puppet] - https://gerrit.wikimedia.org/r/991923 (https://phabricator.wikimedia.org/T337921) (owner: Zabe)
[16:48:23] (PS1) Majavah: maintain-dbusers: Fix passing parameters to delete API call [puppet] - https://gerrit.wikimedia.org/r/991924
[16:51:43] (PS2) Majavah: maintain-dbusers: Fix passing parameters to delete API call [puppet] - https://gerrit.wikimedia.org/r/991924
[16:55:45] (CR) CI reject: [V: -1] maintain-dbusers: Fix passing parameters to delete API call [puppet] - https://gerrit.wikimedia.org/r/991924 (owner: Majavah)
[16:59:37] (PS3) Majavah: maintain-dbusers: Fix passing parameters to delete API call [puppet] - https://gerrit.wikimedia.org/r/991924
[17:03:38] (CR) CI reject: [V: -1] maintain-dbusers: Fix passing parameters to delete API call [puppet] - https://gerrit.wikimedia.org/r/991924 (owner: Majavah)
[17:04:48] (PS4) Majavah: maintain-dbusers: Fix passing parameters to delete API call [puppet] - https://gerrit.wikimedia.org/r/991924
[17:20:37] RECOVERY - Host asw1-eqsin is UP: PING OK - Packet loss = 0%, RTA = 223.91 ms
[17:20:41] RECOVERY - OSPF status on cr3-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[17:20:57] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[17:22:33] RECOVERY - Host mr1-eqsin IPv6 is UP: PING OK - Packet loss = 0%, RTA = 224.70 ms
[17:24:20] (JobUnavailable) resolved: Reduced availability for job pdu_sentry4 in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[17:24:45] RECOVERY - Host mr1-eqsin.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 242.32 ms
[18:38:51] (SwaggerProbeHasFailures) firing: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://cxserver.svc.eqiad.wmnet:4002 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[18:43:51] (SwaggerProbeHasFailures) resolved: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://cxserver.svc.eqiad.wmnet:4002 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[19:13:46] (PS1) Zabe: Stop setting wgShowIPinHeader [mediawiki-config] - https://gerrit.wikimedia.org/r/991930 (https://phabricator.wikimedia.org/T355479)
[19:28:31] (PS1) Dzahn: phabricator: switch phab1004 to migration role for syncing data [puppet] - https://gerrit.wikimedia.org/r/991934 (https://phabricator.wikimedia.org/T334519)
[19:33:31] (CR) DannyS712: [C: +1] Stop setting wgShowIPinHeader [mediawiki-config] - https://gerrit.wikimedia.org/r/991930 (https://phabricator.wikimedia.org/T355479) (owner: Zabe)
[19:48:30] (PS1) Dzahn: phabricator: switch phab1004 back to production role [puppet] - https://gerrit.wikimedia.org/r/991937 (https://phabricator.wikimedia.org/T334519)
[19:52:50] (PS2) Dzahn: phabricator: use same db server regardless of DC of phab server [puppet] - https://gerrit.wikimedia.org/r/989537 (https://phabricator.wikimedia.org/T334519)
[19:53:02] (PS1) Dzahn: phabricator: revert changes to DB server settings [puppet] - https://gerrit.wikimedia.org/r/991939 (https://phabricator.wikimedia.org/T334519)
[19:54:54] (PS1) Dzahn: phabricator: switch active_server back to phab1004 [puppet] - https://gerrit.wikimedia.org/r/991940 (https://phabricator.wikimedia.org/T334519)
[19:56:45] (PS1) Dzahn: phabricator: switch phab server back to phab1004 [dns] - https://gerrit.wikimedia.org/r/991941 (https://phabricator.wikimedia.org/T334519)
[20:01:35] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2106.codfw.wmnet with reason: Maintenance
[20:01:48] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2106.codfw.wmnet with reason: Maintenance
[20:01:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2106 (T352010)', diff saved to https://phabricator.wikimedia.org/P55089 and previous config saved to /var/cache/conftool/dbconfig/20240120-200154-ladsgroup.json
[20:01:59] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[20:04:41] !log start of phab/phorge bullseye update window - T334519
[20:04:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:04:45] T334519: upgrade phab (phorge) hosts to bullseye - https://phabricator.wikimedia.org/T334519
[20:22:09] !log dzahn@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on phab1004.eqiad.wmnet with reason: OS upgrade
[20:22:22] !log dzahn@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on phab1004.eqiad.wmnet with reason: OS upgrade
[20:23:13] !log dzahn@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on phabricator.wikimedia.org with reason: OS upgrade
[20:23:14] !log dzahn@cumin1002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on phabricator.wikimedia.org with reason: OS upgrade
[20:24:36] !log dzahn@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on phab.wmfusercontent.org with reason: OS upgrade
[20:32:23] !log phabricator going down for maintenance
[20:32:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:33:08] Good luck
[20:34:22] (CR) Dzahn: [C: +2] phabricator: use same db server regardless of DC of phab server [puppet] - https://gerrit.wikimedia.org/r/989537 (https://phabricator.wikimedia.org/T334519) (owner: Dzahn)
[20:36:40] !log brennen@deploy2002 Started deploy [phabricator/deployment@24a2a2a]: deploy to phab2002 to pick up database changes
[20:37:13] PROBLEM - Check systemd state on phab2002 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-phabricator-repos.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:37:33] !log brennen@deploy2002 Finished deploy [phabricator/deployment@24a2a2a]: deploy to phab2002 to pick up database changes (duration: 00m 53s)
[20:38:26] !log dzahn@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on phab2002.codfw.wmnet with reason: maintenance
[20:38:40] !log dzahn@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on phab2002.codfw.wmnet with reason: maintenance
[20:40:15] RECOVERY - Check systemd state on phab2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:47:57] (PS1) Dzahn: Revert "phabricator: use same db server regardless of DC of phab server" [puppet] - https://gerrit.wikimedia.org/r/991849
[20:48:20] (CR) Dzahn: [C: +2] Revert "phabricator: use same db server regardless of DC of phab server" [puppet] - https://gerrit.wikimedia.org/r/991849 (owner: Dzahn)
[20:48:29] (CR) Dzahn: [V: +2 C: +2] Revert "phabricator: use same db server regardless of DC of phab server" [puppet] - https://gerrit.wikimedia.org/r/991849 (owner: Dzahn)
[20:51:40] (PS1) Dzahn: Revert "Revert "phabricator: use same db server regardless of DC of phab server"" [puppet] - https://gerrit.wikimedia.org/r/991850
[20:52:56] (CR) Dzahn: [C: +2] Revert "Revert "phabricator: use same db server regardless of DC of phab server"" [puppet] - https://gerrit.wikimedia.org/r/991850 (owner: Dzahn)
[20:58:47] (CR) Dzahn: [C: +2] phabricator: switch active server from eqiad to codfw [puppet] - https://gerrit.wikimedia.org/r/991649 (https://phabricator.wikimedia.org/T334519) (owner: Dzahn)
[21:02:11] !log brennen@deploy2002 Started deploy [phabricator/deployment@24a2a2a]: deploy to phab2002 to pick up db config changes (redux)
[21:03:46] !log brennen@deploy2002 Finished deploy [phabricator/deployment@24a2a2a]: deploy to phab2002 to pick up db config changes (redux) (duration: 01m 35s)
[21:08:01] (CR) Dzahn: [C: +2] switch phabricator server to codfw [dns] - https://gerrit.wikimedia.org/r/989535 (https://phabricator.wikimedia.org/T334519) (owner: Dzahn)
[21:08:07] (PS3) Dzahn: switch phabricator server to codfw [dns] - https://gerrit.wikimedia.org/r/989535 (https://phabricator.wikimedia.org/T334519)
[21:08:16] (CR) Dzahn: switch phabricator server to codfw [dns] - https://gerrit.wikimedia.org/r/989535 (https://phabricator.wikimedia.org/T334519) (owner: Dzahn)
[21:24:51] (CR) Dzahn: [C: +2] phabricator: switch phab1004 to migration role for syncing data [puppet] - https://gerrit.wikimedia.org/r/991934 (https://phabricator.wikimedia.org/T334519) (owner: Dzahn)
[21:27:14] !log dzahn@cumin1002 START - Cookbook sre.hosts.reimage for host phab1004.eqiad.wmnet with OS bullseye
[21:27:24] !log dzahn@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host phab1004.eqiad.wmnet with OS bullseye
[21:31:34] !log dzahn@cumin1002 START - Cookbook sre.hosts.reimage for host phab1004.eqiad.wmnet with OS bullseye
[21:33:29] !log dzahn@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on phab2002.codfw.wmnet with reason: deployment
[21:33:32] !log dzahn@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on phab2002.codfw.wmnet with reason: deployment
[21:43:54] !log dzahn@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on phab1004.eqiad.wmnet with reason: host reimage
[21:46:46] !log dzahn@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on phab1004.eqiad.wmnet with reason: host reimage
[22:01:59] !log dzahn@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on phab1004.eqiad.wmnet with reason: OS upgrade
[22:02:02] !log dzahn@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on phab1004.eqiad.wmnet with reason: OS upgrade
[22:02:15] !log dzahn@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host phab1004.eqiad.wmnet with OS bullseye
[22:02:20] !log dzahn@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on phab.wmfusercontent.org with reason: OS upgrade
[22:02:23] !log dzahn@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on phab.wmfusercontent.org with reason: OS upgrade
[22:02:30] !log dzahn@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on phabricator.wikimedia.org with reason: OS upgrade
[22:02:31] !log dzahn@cumin1002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on phabricator.wikimedia.org with reason: OS upgrade
[22:18:21] (CR) Dzahn: [C: +2] phabricator: revert changes to DB server settings [puppet] - https://gerrit.wikimedia.org/r/991939 (https://phabricator.wikimedia.org/T334519) (owner: Dzahn)
[22:18:36] (PS2) Dzahn: phabricator: revert changes to DB server settings [puppet] - https://gerrit.wikimedia.org/r/991939 (https://phabricator.wikimedia.org/T334519)
[22:19:26] (CR) Dzahn: [V: +2] phabricator: revert changes to DB server settings [puppet] - https://gerrit.wikimedia.org/r/991939 (https://phabricator.wikimedia.org/T334519) (owner: Dzahn)
[22:22:11] !log brennen@deploy2002 Started deploy [phabricator/deployment@24a2a2a]: deploy to phab2002 to pick up db config revert
[22:23:06] !log brennen@deploy2002 Finished deploy [phabricator/deployment@24a2a2a]: deploy to phab2002 to pick up db config revert (duration: 00m 55s)
[22:23:14] (PS2) Dzahn: phabricator: switch active_server back to phab1004 [puppet] - https://gerrit.wikimedia.org/r/991940 (https://phabricator.wikimedia.org/T334519)
[22:24:18] (CR) Dzahn: [C: +2] phabricator: switch active_server back to phab1004 [puppet] - https://gerrit.wikimedia.org/r/991940 (https://phabricator.wikimedia.org/T334519) (owner: Dzahn)
[22:24:23] (CR) Dzahn: [V: +2 C: +2] phabricator: switch active_server back to phab1004 [puppet] - https://gerrit.wikimedia.org/r/991940 (https://phabricator.wikimedia.org/T334519) (owner: Dzahn)
[22:27:05] !log brennen@deploy2002 Started deploy [phabricator/deployment@24a2a2a]: deploy to phab2002 to pick up db config revert (part 2)
[22:27:22] Hi OPS, Phab seems to be down (https://phabricator.wikimedia.org/) reporting error: Database host "m3-master.codfw.wmnet:3306" is configured as a master, but is replicating another host. This is dangerous and can mangle or destroy data. Only replicas should be replicating. Stop replication on the host or adjust configuration.
[22:28:00] !log brennen@deploy2002 Finished deploy [phabricator/deployment@24a2a2a]: deploy to phab2002 to pick up db config revert (part 2) (duration: 00m 54s)
[22:28:00] xaosflux: we're in an upgrade window, downtime is expected.
[22:28:05] xaosflux: thank you, we are upgrading it right now
[22:28:35] Thank you, messaging seemed urgent. Good luck with upgrade.
[22:29:06] (CR) Dzahn: [C: +2] phabricator: switch phab1004 back to production role [puppet] - https://gerrit.wikimedia.org/r/991937 (https://phabricator.wikimedia.org/T334519) (owner: Dzahn)
[22:29:07] thanks! :)
[22:33:38] Oh that is why I can't access phab xd
[22:33:39] PROBLEM - Check systemd state on phab2002 is CRITICAL: CRITICAL - degraded: The following units failed: phd.service,rsync-phabricator-home-dirs.service,rsync-phabricator-repos.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:34:01] !log dzahn@cumin1002 START - Cookbook sre.hosts.downtime for 0:30:00 on phab2002.codfw.wmnet with reason: deployment
[22:34:25] !log dzahn@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on phab2002.codfw.wmnet with reason: deployment
[22:39:31] !log brennen@deploy2002 Started deploy [phabricator/deployment@24a2a2a]: initial deploy to re-imaged phab1004
[22:39:42] !log brennen@deploy2002 Finished deploy [phabricator/deployment@24a2a2a]: initial deploy to re-imaged phab1004 (duration: 00m 10s)
[22:44:58] !log brennen@deploy2002 Started deploy [phabricator/deployment@24a2a2a]: initial deploy to re-imaged phab1004
[22:45:09] !log brennen@deploy2002 Finished deploy [phabricator/deployment@24a2a2a]: initial deploy to re-imaged phab1004 (duration: 00m 10s)
[23:10:30] !log brennen@deploy2002 Installing scap version "latest" for 1 hosts
[23:13:59] PROBLEM - PHD should be running on phab1004 is CRITICAL: PROCS CRITICAL: 0 processes with regex args php ./phd-daemon, UID = 920 (phd) https://wikitech.wikimedia.org/wiki/Phabricator
[23:35:56] (PS1) Dzahn: phabricator: temp add scap user creation in phab migration class [puppet] - https://gerrit.wikimedia.org/r/991945
[23:37:07] (CR) CI reject: [V: -1] phabricator: temp add scap user creation in phab migration class [puppet] - https://gerrit.wikimedia.org/r/991945 (owner: Dzahn)
[23:44:25] (PS1) Dzahn: phabricator: set scap_manage_user to true on role level [puppet] - https://gerrit.wikimedia.org/r/991946
[23:50:17] (PS2) Dzahn: phabricator: temp add scap user creation in phab migration class [puppet] - https://gerrit.wikimedia.org/r/991945
[23:51:06] (CR) Thcipriani: [C: +1] phabricator: temp add scap user creation in phab migration class [puppet] - https://gerrit.wikimedia.org/r/991945 (owner: Dzahn)
[23:51:18] (CR) Dzahn: [C: +2] phabricator: temp add scap user creation in phab migration class [puppet] - https://gerrit.wikimedia.org/r/991945 (owner: Dzahn)
[23:53:14] (PS1) Dzahn: Revert "phabricator: temp add scap user creation in phab migration class" [puppet] - https://gerrit.wikimedia.org/r/991851
[23:53:23] (CR) Dzahn: [V: +2 C: +2] Revert "phabricator: temp add scap user creation in phab migration class" [puppet] - https://gerrit.wikimedia.org/r/991851 (owner: Dzahn)
[23:58:18] !log phab1004 - chown -R scap:scap /var/lib/scap
[23:58:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log