[00:00:45] (JobUnavailable) firing: (5) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[00:01:41] SRE, ops-codfw, DBA, DC-Ops: Q4:(Need By: TBD) rack/setup/install db2175.codfw.wmnet - db2182.codfw.wmnet - https://phabricator.wikimedia.org/T306849 (RobH)
[00:01:49] SRE, ops-eqiad, DBA, DC-Ops: Q4:(Need By: TBD) rack/setup/install db1196.eqiad.wmnet - db1203.eqiad.wmnet - https://phabricator.wikimedia.org/T306848 (RobH)
[00:07:12] PROBLEM - mediawiki-installation DSH group on mw2412 is CRITICAL: Host mw2412 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[00:59:22] PROBLEM - SSH on analytics1061.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[01:00:04] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220426T0100)
[01:25:44] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:40:45] (JobUnavailable) firing: (5) Reduced availability for job gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:47:55] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[01:50:45] (JobUnavailable) firing: (5) Reduced availability for job gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:00:32] RECOVERY - SSH on analytics1061.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[02:06:34] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[02:06:37] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[02:06:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:06:38] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[02:06:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:06:42] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[02:06:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:06:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:07:28] (PS1) TrainBranchBot: Branch commit for wmf/1.39.0-wmf.9 [core] (wmf/1.39.0-wmf.9) - https://gerrit.wikimedia.org/r/785966
[02:07:32] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/1.39.0-wmf.9 [core] (wmf/1.39.0-wmf.9) - https://gerrit.wikimedia.org/r/785966 (owner: TrainBranchBot)
[02:09:16] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:19:07] (CR) jerkins-bot: [V: -1] Branch commit for wmf/1.39.0-wmf.9 [core] (wmf/1.39.0-wmf.9) - https://gerrit.wikimedia.org/r/785966 (owner: TrainBranchBot)
[02:30:24] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[02:36:21] SRE, WMF-JobQueue, serviceops, Sustainability (Incident Followup): Videoscalers fail health checks while CPU is maxed - https://phabricator.wikimedia.org/T306860 (RLazarus) p:Triage→High
[02:37:55] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[02:44:10] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[02:52:40] (NodeTextfileStale) resolved: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[03:01:55] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[03:14:33] SRE, ops-codfw, DC-Ops, cloud-services-team (Kanban): Neutron networking not working for cloudnet200[5,6]-dev.codfw.wmnet - https://phabricator.wikimedia.org/T306861 (Andrew)
[03:18:09] SRE, ops-codfw, DC-Ops, cloud-services-team (Kanban): Neutron networking not working for cloudnet200[5,6]-dev.codfw.wmnet - https://phabricator.wikimedia.org/T306861 (Andrew) @ayounsi I have barely investigated this but I'm guessing that there's some kind of switch binding that needs to be done f...
[03:20:49] SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q4:(Need By: TBD) rack/setup/installcloudcontrol100[6-7].wikimedia.org - https://phabricator.wikimedia.org/T306853 (Andrew)
[03:21:17] SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q4:(Need By: TBD) rack/setup/installcloudcontrol100[6-7].wikimedia.org - https://phabricator.wikimedia.org/T306853 (Andrew) You're right, these will need public IPs (but with luck we'll free up the old ones shortly after these go online)
[03:24:02] PROBLEM - WDQS SPARQL on wdqs1013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[03:24:40] PROBLEM - Query Service HTTP Port on wdqs1013 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 380 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[03:26:06] RECOVERY - WDQS SPARQL on wdqs1013 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.077 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[03:26:54] RECOVERY - Query Service HTTP Port on wdqs1013 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.020 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[03:34:34] PROBLEM - puppet last run on ml-staging-ctrl2001 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[03:50:39] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[04:15:14] SRE, ops-codfw, DC-Ops, cloud-services-team (Kanban): Neutron networking not working for cloudnet200[5,6]-dev.codfw.wmnet - https://phabricator.wikimedia.org/T306861 (Papaul) @Andrew the racking task for the cloudnet nodes said to setup "cloud-gw-transport and cloud-instance-transport" on the sec...
[04:15:57] (CR) Majavah: [C: +2] "retrying" [core] (wmf/1.39.0-wmf.9) - https://gerrit.wikimedia.org/r/785966 (owner: TrainBranchBot)
[04:27:40] PROBLEM - SSH on pki2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:32:29] (Merged) jenkins-bot: Branch commit for wmf/1.39.0-wmf.9 [core] (wmf/1.39.0-wmf.9) - https://gerrit.wikimedia.org/r/785966 (owner: TrainBranchBot)
[04:38:22] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[04:38:25] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[04:38:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:38:26] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[04:38:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:38:30] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[04:38:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:38:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:42:59] (CR) Ntubotu: "> Change has been successfully merged into the git repository." [core] (wmf/1.23wmf20) - https://gerrit.wikimedia.org/r/123454 (owner: MaxSem)
[04:53:17] (PS2) Marostegui: mariadb: Promote db1162 to s2 master [puppet] - https://gerrit.wikimedia.org/r/785602 (https://phabricator.wikimedia.org/T306417)
[04:53:22] (PS2) Marostegui: wmnet: Update s2-master alias [dns] - https://gerrit.wikimedia.org/r/785603 (https://phabricator.wikimedia.org/T306417)
[04:53:35] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 24 hosts with reason: Primary switchover s2 T306417
[04:53:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:53:41] T306417: Switchover s2 master (db1122 -> db1162) - https://phabricator.wikimedia.org/T306417
[04:53:51] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 24 hosts with reason: Primary switchover s2 T306417
[04:53:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:54:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Set db1162 with weight 0 T306417', diff saved to https://phabricator.wikimedia.org/P26498 and previous config saved to /var/cache/conftool/dbconfig/20220426-045406-root.json
[04:54:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:54:22] marostegui: do you want me to do parts of it?
[04:54:26] Amir1 nah, it is fine
[04:54:29] thanks though
[04:54:48] ^^
[04:56:14] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[05:12:06] (CR) Marostegui: [C: +2] mariadb: Promote db1162 to s2 master [puppet] - https://gerrit.wikimedia.org/r/785602 (https://phabricator.wikimedia.org/T306417) (owner: Marostegui)
[05:14:47] PROBLEM - Host cr2-eqord is DOWN: PING CRITICAL - Packet loss = 100%
[05:14:56] (PS1) Marostegui: db1122: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/786162 (https://phabricator.wikimedia.org/T306417)
[05:15:22] ^ looking
[05:15:32] PROBLEM - Host wikitech-static.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100%
[05:16:06] RECOVERY - Host wikitech-static.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 21.91 ms
[05:17:40] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 133, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:17:48] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 216 probes of 670 (alerts on 90) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[05:18:06] (CR) Marostegui: [C: +2] db1122: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/786162 (https://phabricator.wikimedia.org/T306417) (owner: Marostegui)
[05:18:18] PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 250 probes of 763 (alerts on 35) - https://atlas.ripe.net/measurements/32390538/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[05:18:20] PROBLEM - Host cr2-eqord IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[05:18:36] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:19:54] PROBLEM - IPv4 ping to eqiad on ripe-atlas-eqiad is CRITICAL: CRITICAL - failed 103 probes of 754 (alerts on 35) - https://atlas.ripe.net/measurements/1790945/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[05:19:56] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 498 probes of 679 (alerts on 90) - https://atlas.ripe.net/measurements/32390541/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[05:24:02] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 79 probes of 670 (alerts on 90) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[05:24:32] RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 9 probes of 763 (alerts on 35) - https://atlas.ripe.net/measurements/32390538/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[05:26:08] RECOVERY - IPv4 ping to eqiad on ripe-atlas-eqiad is OK: OK - failed 8 probes of 754 (alerts on 35) - https://atlas.ripe.net/measurements/1790945/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[05:26:10] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 62 probes of 679 (alerts on 90) - https://atlas.ripe.net/measurements/32390541/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[05:28:46] RECOVERY - SSH on pki2001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:41:27] (PS1) Marostegui: switchover-tmpl.py: Adjust timeout [software] - https://gerrit.wikimedia.org/r/786173
[05:41:46] (CR) Ladsgroup: [C: +2] switchover-tmpl.py: Adjust timeout [software] - https://gerrit.wikimedia.org/r/786173 (owner: Marostegui)
[05:42:14] (Merged) jenkins-bot: switchover-tmpl.py: Adjust timeout [software] - https://gerrit.wikimedia.org/r/786173 (owner: Marostegui)
[05:48:19] SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q4:(Need By: TBD) rack/setup/install cloudcontrol100[6-7].wikimedia.org - https://phabricator.wikimedia.org/T306853 (RhinosF1)
[05:51:00] (JobUnavailable) firing: (4) Reduced availability for job gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:00:04] kormat, marostegui, and Amir1: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Primary database switchover deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220426T0600).
[06:00:17] !log Starting s2 eqiad failover from db1122 to db1162 - T306417
[06:00:20] o/
[06:00:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:00:23] T306417: Switchover s2 master (db1122 -> db1162) - https://phabricator.wikimedia.org/T306417
[06:00:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Set s2 eqiad as read-only for maintenance - T306417', diff saved to https://phabricator.wikimedia.org/P26500 and previous config saved to /var/cache/conftool/dbconfig/20220426-060033-marostegui.json
[06:00:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:00:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Promote db1162 to s2 primary and set section read-write T306417', diff saved to https://phabricator.wikimedia.org/P26501 and previous config saved to /var/cache/conftool/dbconfig/20220426-060058-marostegui.json
[06:01:00] all done
[06:01:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:01:02] testing now
[06:01:18] rhttps://it.wikipedia.org/w/index.php?title=Utente:Ladsgroup/Sandbox&action=history
[06:01:20] https://it.wikipedia.org/w/index.php?title=Utente:Ladsgroup/Sandbox&action=history
[06:01:24] can edit
[06:01:32] same yeah
[06:02:31] (CR) Marostegui: [C: +2] wmnet: Update s2-master alias [dns] - https://gerrit.wikimedia.org/r/785603 (https://phabricator.wikimedia.org/T306417) (owner: Marostegui)
[06:03:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1122 T306417', diff saved to https://phabricator.wikimedia.org/P26502 and previous config saved to /var/cache/conftool/dbconfig/20220426-060344-root.json
[06:03:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:06:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1162 is current s2 master, should not be in API T306417', diff saved to https://phabricator.wikimedia.org/P26503 and previous config saved to /var/cache/conftool/dbconfig/20220426-060602-marostegui.json
[06:06:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:06:07] T306417: Switchover s2 master (db1122 -> db1162) - https://phabricator.wikimedia.org/T306417
[06:07:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove db1100, s5 master from API', diff saved to https://phabricator.wikimedia.org/P26504 and previous config saved to /var/cache/conftool/dbconfig/20220426-060734-marostegui.json
[06:07:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:11:13] (PS1) Marostegui: switchover-tmpl.py: Add depooling comment [software] - https://gerrit.wikimedia.org/r/786175
[06:11:44] (CR) Marostegui: [C: +2] switchover-tmpl.py: Add depooling comment [software] - https://gerrit.wikimedia.org/r/786175 (owner: Marostegui)
[06:12:18] (Merged) jenkins-bot: switchover-tmpl.py: Add depooling comment [software] - https://gerrit.wikimedia.org/r/786175 (owner: Marostegui)
[06:12:40] (PS1) Ladsgroup: db1109: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/786176 (https://phabricator.wikimedia.org/T302185)
[06:13:09] (CR) Ladsgroup: [V: +2 C: +2] db1109: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/786176 (https://phabricator.wikimedia.org/T302185) (owner: Ladsgroup)
[06:14:46] !log dbmaint s2@eqiad T298557
[06:14:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:14:50] T298557: Fix mismatching field type of page.page_touched on wmf wikis - https://phabricator.wikimedia.org/T298557
[06:15:12] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1109.eqiad.wmnet with reason: Maintenance
[06:15:14] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1109.eqiad.wmnet with reason: Maintenance
[06:15:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:15:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:15:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1109 (T302185)', diff saved to https://phabricator.wikimedia.org/P26505 and previous config saved to /var/cache/conftool/dbconfig/20220426-061519-ladsgroup.json
[06:15:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:15:23] T302185: Upgrade s8 to Bullseye - https://phabricator.wikimedia.org/T302185
[06:16:15] !log dbmaint s2@eqiad T300381
[06:16:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:16:20] T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381
[06:21:09] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.reimage for host db1109.eqiad.wmnet with OS bullseye
[06:21:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:22:32] PROBLEM - SSH on wtp1035.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:29:34] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1109.eqiad.wmnet with reason: host reimage
[06:29:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:30:25] (CR) Urbanecm: [C: +1] [beta] Reopen beta eswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/785925 (https://phabricator.wikimedia.org/T306833) (owner: Gergő Tisza)
[06:30:36] (CR) Urbanecm: [C: +1] "LGTM, should do the trick." [mediawiki-config] - https://gerrit.wikimedia.org/r/785925 (https://phabricator.wikimedia.org/T306833) (owner: Gergő Tisza)
[06:32:29] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1109.eqiad.wmnet with reason: host reimage
[06:32:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:33:07] RECOVERY - puppet last run on ml-staging-ctrl2001 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[06:37:55] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[06:45:30] !log imported scap 4.7.0 to stretch-/buster-/bullseye-wikimedia - T306827
[06:45:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:45:34] T306827: Deploy Scap version 4.7.0 - https://phabricator.wikimedia.org/T306827
[06:46:35] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1109.eqiad.wmnet with OS bullseye
[06:46:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:51:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1109 (T302185)', diff saved to https://phabricator.wikimedia.org/P26506 and previous config saved to /var/cache/conftool/dbconfig/20220426-065112-ladsgroup.json
[06:51:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:51:17] T302185: Upgrade s8 to Bullseye - https://phabricator.wikimedia.org/T302185
[06:51:43] !log jayme@deploy1002 Started deploy [restbase/deploy@0205f1d] (dev-cluster): (no justification provided)
[06:51:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:54:47] !log jayme@deploy1002 Finished deploy [restbase/deploy@0205f1d] (dev-cluster): (no justification provided) (duration: 03m 05s)
[06:54:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:00:04] Amir1, awight, Urbanecm, and taavi: #bothumor I � Unicode. All rise for UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220426T0700).
[07:00:04] No Gerrit patches in the queue for this window AFAICS.
[07:01:30] !log dbmaint s2@eqiad T298554
[07:01:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:01:35] T298554: Fix mismatching field type of archive.ar_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298554
[07:01:55] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[07:06:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1109', diff saved to https://phabricator.wikimedia.org/P26507 and previous config saved to /var/cache/conftool/dbconfig/20220426-070617-ladsgroup.json
[07:06:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:15:44] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:15:45] RECOVERY - Host cr2-eqord is UP: PING OK - Packet loss = 0%, RTA = 175.97 ms
[07:15:52] PROBLEM - OSPF status on cr2-eqord is CRITICAL: OSPFv2: 2/2 UP : OSPFv3: 1/1 UP : 2 v2 P2P interfaces vs. 1 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:16:28] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 134, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:17:20] RECOVERY - OSPF status on cr2-eqord is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:20:13] PROBLEM - Host cr2-eqord is DOWN: PING CRITICAL - Packet loss = 100%
[07:20:34] (PS1) Elukey: Add calico BGP peering settings for ml-serve100[5-8] [deployment-charts] - https://gerrit.wikimedia.org/r/786264 (https://phabricator.wikimedia.org/T306649)
[07:21:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1109', diff saved to https://phabricator.wikimedia.org/P26508 and previous config saved to /var/cache/conftool/dbconfig/20220426-072122-ladsgroup.json
[07:21:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:22:00] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:22:23] SRE, Infrastructure-Foundations, Prod-Kubernetes, netops, Patch-For-Review: Agree strategy for Kubernetes BGP peering to top-of-rack switches - https://phabricator.wikimedia.org/T306649 (JMeybohm)
[07:22:44] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 133, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:24:08] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:24:09] RECOVERY - Host cr2-eqord is UP: PING OK - Packet loss = 0%, RTA = 56.76 ms
[07:24:54] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 134, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:25:52] RECOVERY - Host cr2-eqord IPv6 is UP: PING OK - Packet loss = 0%, RTA = 57.29 ms
[07:26:10] PROBLEM - IPv4 ping to eqiad on ripe-atlas-eqiad is CRITICAL: CRITICAL - failed 99 probes of 754 (alerts on 35) - https://atlas.ripe.net/measurements/1790945/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[07:29:43] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host bast1003.wikimedia.org
[07:29:48] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host bast3005.wikimedia.org
[07:29:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:29:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:30:56] (PS1) Majavah: hieradata: swap remaining ldap-labs names to ldap-rw [puppet] - https://gerrit.wikimedia.org/r/786265 (https://phabricator.wikimedia.org/T295150)
[07:32:24] RECOVERY - IPv4 ping to eqiad on ripe-atlas-eqiad is OK: OK - failed 7 probes of 754 (alerts on 35) - https://atlas.ripe.net/measurements/1790945/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[07:36:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1109 (T302185)', diff saved to https://phabricator.wikimedia.org/P26509 and previous config saved to /var/cache/conftool/dbconfig/20220426-073627-ladsgroup.json
[07:36:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:36:33] T302185: Upgrade s8 to Bullseye - https://phabricator.wikimedia.org/T302185
[07:36:33] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast3005.wikimedia.org
[07:36:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:36:43] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast1003.wikimedia.org
[07:36:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:38:45] (PS1) JMeybohm: Fix permissions/ownership of helm directories [puppet] - https://gerrit.wikimedia.org/r/786269 (https://phabricator.wikimedia.org/T305729)
[07:38:48] (PS1) JMeybohm: Clean up helm2 specific code and environment variable [puppet] - https://gerrit.wikimedia.org/r/786270
[07:44:47] (CR) JMeybohm: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34957/console" [puppet] - https://gerrit.wikimedia.org/r/786270 (owner: JMeybohm)
[07:47:58] !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host testvm2004.codfw.wmnet
[07:48:00] !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[07:48:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:48:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:50:54] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[07:55:07] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[07:55:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:55:26] (PS2) Awight: [beta] Stash all logs for the Kartographer extension [mediawiki-config] - https://gerrit.wikimedia.org/r/785852 (https://phabricator.wikimedia.org/T304813)
[07:56:04] !log jelto@cumin1001 conftool action : set/pooled=yes; selector: name=mw2287.codfw.wmnet
[07:56:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:56:35] !log jelto@cumin1001 conftool action : set/pooled=yes; selector: name=mw2288.codfw.wmnet
[07:56:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:56:42] !log jelto@cumin1001 conftool action : set/pooled=yes; selector: name=mw2289.codfw.wmnet
[07:56:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:58:56] ^ above hosts were not pooled again because of hardware failure of mw2286 in same cookbook run and failed cookbook
[08:03:49] !log jelto@cumin1001 START - Cookbook sre.hosts.reboot-cluster
[08:03:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:04:40] (Abandoned) Hashar: Stop including backports in Stretch images [puppet] - https://gerrit.wikimedia.org/r/610050 (https://phabricator.wikimedia.org/T256877) (owner: Muehlenhoff)
[08:04:41] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host testvm2004.codfw.wmnet
[08:04:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:08:32] (PS1) Muehlenhoff: Add testvm2004 to DHCP [puppet] - https://gerrit.wikimedia.org/r/786271
[08:08:32] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host db1122.eqiad.wmnet with OS bullseye
[08:08:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:10:16] SRE, Patch-For-Review: Handle sunset of stretch-backports - https://phabricator.wikimedia.org/T256877 (MoritzMuehlenhoff) Open→Resolved Closing this task, the remaining bits will be cleaned out when Stretch is removed completely.
[08:10:35] SRE, Patch-For-Review: Handle sunset of stretch-backports - https://phabricator.wikimedia.org/T256877 (MoritzMuehlenhoff) [08:11:16] (CR) Muehlenhoff: [C: +2] Add testvm2004 to DHCP [puppet] - https://gerrit.wikimedia.org/r/786271 (owner: Muehlenhoff) [08:12:46] PROBLEM - SSH on mw2258.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [08:14:26] (PS3) Giuseppe Lavagetto: varnish: add new-version dynamic request filter template [puppet] - https://gerrit.wikimedia.org/r/778543 (https://phabricator.wikimedia.org/T305606) [08:15:35] can someone with access here set me as clinic duty in the topic? thank you! [08:15:56] tried asking ChanServ for op and I've been denied [08:16:04] RECOVERY - SSH on mw2258.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [08:16:55] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1122.eqiad.wmnet with reason: host reimage [08:16:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:19:49] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1122.eqiad.wmnet with reason: host reimage [08:19:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:21:48] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on db1113.eqiad.wmnet with reason: Rebooting for T303174 [08:21:50] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on db1113.eqiad.wmnet with reason: Rebooting for T303174 [08:21:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:21:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:21:55] !log kormat@cumin1001 dbctl commit (dc=all): 'db1113:3315 depooling: Rebooting for T303174', diff 
saved to https://phabricator.wikimedia.org/P26510 and previous config saved to /var/cache/conftool/dbconfig/20220426-082155-kormat.json [08:21:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:22:11] !log kormat@cumin1001 dbctl commit (dc=all): 'db1113:3316 depooling: Rebooting for T303174', diff saved to https://phabricator.wikimedia.org/P26511 and previous config saved to /var/cache/conftool/dbconfig/20220426-082210-kormat.json [08:22:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:25:28] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host deploy1002.eqiad.wmnet [08:25:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:26:00] !log kormat@cumin1001 dbctl commit (dc=all): 'db1113:3315 (re)pooling @ 25%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P26512 and previous config saved to /var/cache/conftool/dbconfig/20220426-082559-kormat.json [08:26:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:28:25] (CR) David Caro: [C: +1] "Let me know when/how you want to deploy this around, might cause a bit of downtime (for the cinder service, not so user-facing)" [puppet] - https://gerrit.wikimedia.org/r/785840 (https://phabricator.wikimedia.org/T297268) (owner: Majavah) [08:29:00] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [08:30:58] marostegui: when you have a second could you set me as on clinic duty in topic ? thanks! 
[08:31:25] godog: sure [08:31:36] !log jelto@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-cluster (exit_code=1) [08:31:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:31:45] I thought I had access with chanserv, turns out I don't [08:31:48] marostegui: thank you <3 [08:31:58] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host deploy1002.eqiad.wmnet [08:32:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:33:44] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1122.eqiad.wmnet with OS bullseye [08:33:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:34:09] !log installing testvm2004 T306499 [08:34:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:34:13] T306499: Upgrade ganeti-test to Bullseye - https://phabricator.wikimedia.org/T306499 [08:34:23] (PS1) David Caro: openstack: remove ussuri files [puppet] - https://gerrit.wikimedia.org/r/786274 (https://phabricator.wikimedia.org/T218426) [08:36:15] (CR) Filippo Giunchedi: cache::haproxy: Log emergency messages to disk (1 comment) [puppet] - https://gerrit.wikimedia.org/r/784256 (https://phabricator.wikimedia.org/T306236) (owner: Vgutierrez) [08:39:15] (CR) Marostegui: [C: +1] Add fix_img_major_mime_null_T306560.py [software/schema-changes] - https://gerrit.wikimedia.org/r/784762 (https://phabricator.wikimedia.org/T306560) (owner: Ladsgroup) [08:41:04] !log kormat@cumin1001 dbctl commit (dc=all): 'db1113:3315 (re)pooling @ 50%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P26513 and previous config saved to /var/cache/conftool/dbconfig/20220426-084103-kormat.json [08:41:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:42:16] (CR) Giuseppe Lavagetto: [V: +1] "PCC SUCCESS (DIFF 2): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34958/console" [puppet] - https://gerrit.wikimedia.org/r/778543 (https://phabricator.wikimedia.org/T305606) (owner: Giuseppe Lavagetto) [08:43:13] !log jelto@cumin1001 conftool action : set/pooled=yes; selector: name=mw229[7-9].codfw.wmnet [08:43:15] !log pool name=mw229[7-9].codfw.wmnet, manual icinga recheck green after reboot [08:43:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:43:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:43:37] (PS1) Marostegui: Revert "db1122: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/785931 [08:44:12] (CR) Marostegui: [C: +2] Revert "db1122: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/785931 (owner: Marostegui) [08:44:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1122 (re)pooling @ 1%: After reimage', diff saved to https://phabricator.wikimedia.org/P26514 and previous config saved to /var/cache/conftool/dbconfig/20220426-084437-root.json [08:44:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:47:21] !log jelto@cumin1001 START - Cookbook sre.hosts.reboot-cluster [08:47:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:50:01] SRE, SRE-tools, DNS, Infrastructure-Foundations, and 2 others: sre.dns.netbox cookbook dosn't support period terminated domains - https://phabricator.wikimedia.org/T306809 (Volans) The DNS Name field in Netbox is an FQDN, the same Netbox UI help message for the field is: `Hostname or FQDN (not ca... 
[08:54:33] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for Essex Igyan eigyan - https://phabricator.wikimedia.org/T305948 (fgiunchedi) Open→Resolved a:fgiunchedi Boldly resolving, though @eigyan please reopen if something is amiss and/or access is not working as expected [08:56:08] !log kormat@cumin1001 dbctl commit (dc=all): 'db1113:3315 (re)pooling @ 75%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P26515 and previous config saved to /var/cache/conftool/dbconfig/20220426-085607-kormat.json [08:56:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:56:27] (CR) WMDE-Fisch: [C: +1] [beta] Stash all logs for the Kartographer extension [mediawiki-config] - https://gerrit.wikimedia.org/r/785852 (https://phabricator.wikimedia.org/T304813) (owner: Awight) [08:58:09] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for drochford (superset access with no server access) - https://phabricator.wikimedia.org/T305634 (fgiunchedi) Boldly resolving, though @drochford please let us know and reopen is something is amiss! 
[08:59:16] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for drochford (superset access with no server access) - https://phabricator.wikimedia.org/T305634 (fgiunchedi) Open→Resolved [08:59:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1122 (re)pooling @ 10%: After reimage', diff saved to https://phabricator.wikimedia.org/P26516 and previous config saved to /var/cache/conftool/dbconfig/20220426-085941-root.json [08:59:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:03:48] (Abandoned) Hashar: Introduce lint command [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/731149 (https://phabricator.wikimedia.org/T283855) (owner: Hashar) [09:03:52] (Abandoned) Hashar: Be strict on undefined variables such as seed_image [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/747060 (https://phabricator.wikimedia.org/T297619) (owner: Hashar) [09:03:54] (CR) Vgutierrez: varnish: add new-version dynamic request filter template (1 comment) [puppet] - https://gerrit.wikimedia.org/r/778543 (https://phabricator.wikimedia.org/T305606) (owner: Giuseppe Lavagetto) [09:07:25] (PS1) Kevin Bazira: ml-services: add ukwiki & viwiki editquality isvcs [deployment-charts] - https://gerrit.wikimedia.org/r/786276 (https://phabricator.wikimedia.org/T301415) [09:08:38] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for Jim Maddock - https://phabricator.wikimedia.org/T249873 (fgiunchedi) Open→Resolved a:fgiunchedi Boldly resolving, @jmads please reopen if something is amiss and let us know! 
[09:10:09] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on db1146.eqiad.wmnet with reason: Rebooting for T303174 [09:10:10] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on db1146.eqiad.wmnet with reason: Rebooting for T303174 [09:10:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:10:16] !log kormat@cumin1001 dbctl commit (dc=all): 'db1146:3312 depooling: Rebooting for T303174', diff saved to https://phabricator.wikimedia.org/P26517 and previous config saved to /var/cache/conftool/dbconfig/20220426-091015-kormat.json [09:10:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:10:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:10:34] (PS3) Cathal Mooney: Move elements from CR BGP policy and group config to separate files [homer/public] - https://gerrit.wikimedia.org/r/785284 [09:11:11] !log kormat@cumin1001 dbctl commit (dc=all): 'db1113:3315 (re)pooling @ 100%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P26518 and previous config saved to /var/cache/conftool/dbconfig/20220426-091111-kormat.json [09:11:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:11:16] !log kormat@cumin1001 dbctl commit (dc=all): 'db1113:3316 (re)pooling @ 25%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P26519 and previous config saved to /var/cache/conftool/dbconfig/20220426-091115-kormat.json [09:11:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:13:08] SRE, SRE-tools, DNS, Infrastructure-Foundations, and 2 others: sre.dns.netbox cookbook dosn't support period terminated domains - https://phabricator.wikimedia.org/T306809 (jbond) Im not sure i understand this response. The value entered which caused an error was `ns-recursor0.openstack.codfw1... 
[09:13:30] (PS4) Cathal Mooney: Move elements from CR BGP policy and group config to separate files [homer/public] - https://gerrit.wikimedia.org/r/785284 [09:13:42] SRE, Infrastructure-Foundations, LDAP-Access-Requests: Grant Access to ldap/wmf for Nathillard - https://phabricator.wikimedia.org/T305978 (fgiunchedi) @NHillard-WMF hello, thank you for the extensive testing. Is your access working now ? Also as an additional data point, does https://alerts.wikimedi... [09:14:36] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-fe1012.eqiad.wmnet [09:14:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:14:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1122 (re)pooling @ 25%: After reimage', diff saved to https://phabricator.wikimedia.org/P26520 and previous config saved to /var/cache/conftool/dbconfig/20220426-091445-root.json [09:14:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:15:18] SRE: Allow Wikimedia Maps usage on a private project for an university. - https://phabricator.wikimedia.org/T306467 (fgiunchedi) p:Triage→Medium [09:15:37] SRE, ops-eqiad, DC-Ops, serviceops: Decommission mw13[07-48] - https://phabricator.wikimedia.org/T306162 (fgiunchedi) p:Triage→Medium [09:15:47] (CR) Elukey: [C: +2] ml-services: add ukwiki & viwiki editquality isvcs [deployment-charts] - https://gerrit.wikimedia.org/r/786276 (https://phabricator.wikimedia.org/T301415) (owner: Kevin Bazira) [09:15:54] Hi, related to T301600: Can we purge all articles from the cache for a specific mediawiki (in our case ka.wikipedia.org) ? 
[09:15:55] T301600: REST endpoints cannot handle requests from ka.wikipedia.org with Georgian titles - https://phabricator.wikimedia.org/T301600 [09:16:09] SRE, Traffic, Developer Productivity, Performance-Team (Radar): Let X-Analytics response header pass through with WikimediaDebug - https://phabricator.wikimedia.org/T305794 (fgiunchedi) p:Triage→Medium [09:16:17] (CR) Cathal Mooney: [C: +2] Move elements from CR BGP policy and group config to separate files [homer/public] - https://gerrit.wikimedia.org/r/785284 (owner: Cathal Mooney) [09:16:40] .7 [09:16:43] uff :) [09:17:19] (Merged) jenkins-bot: Move elements from CR BGP policy and group config to separate files [homer/public] - https://gerrit.wikimedia.org/r/785284 (owner: Cathal Mooney) [09:20:18] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-fe1012.eqiad.wmnet [09:20:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:22:55] (PS3) WMDE-Fisch: [beta] Stash all logs for the Kartographer extension [mediawiki-config] - https://gerrit.wikimedia.org/r/785852 (https://phabricator.wikimedia.org/T304813) (owner: Awight) [09:23:09] !log mvernon@cumin1001 conftool action : set/pooled=yes; selector: service=nginx,name=ms-fe1012.eqiad.wmnet [09:23:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:23:20] !log mvernon@cumin1001 conftool action : set/pooled=yes; selector: service=swift-fe,name=ms-fe1012.eqiad.wmnet [09:23:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:23:59] !log Reconfigure CR routers following bgp policy changes (no-op) CR785284 [09:24:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:24:30] (CR) Giuseppe Lavagetto: [V: +1] varnish: add new-version dynamic request filter template (1 comment) [puppet] - https://gerrit.wikimedia.org/r/778543 
(https://phabricator.wikimedia.org/T305606) (owner: Giuseppe Lavagetto) [09:24:53] SRE-tools, Infrastructure-Foundations: Cumin should group similar SSH errors - https://phabricator.wikimedia.org/T306490 (Majavah) Open→Declined Makes sense. Thanks! [09:25:01] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db2093.codfw.wmnet,dborch1001.wikimedia.org with reason: Rebooting db1115 T303174 [09:25:04] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2093.codfw.wmnet,dborch1001.wikimedia.org with reason: Rebooting db1115 T303174 [09:25:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:25:13] (PS4) Giuseppe Lavagetto: varnish: add new-version dynamic request filter template [puppet] - https://gerrit.wikimedia.org/r/778543 (https://phabricator.wikimedia.org/T305606) [09:25:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:25:32] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on db1115.eqiad.wmnet with reason: Rebooting for T303174 [09:25:34] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on db1115.eqiad.wmnet with reason: Rebooting for T303174 [09:25:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:25:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:26:20] !log kormat@cumin1001 dbctl commit (dc=all): 'db1113:3316 (re)pooling @ 50%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P26521 and previous config saved to /var/cache/conftool/dbconfig/20220426-092619-kormat.json [09:26:22] !log kevinbazira@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . 
[09:26:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:26:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:26:51] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host webperf2001.codfw.wmnet [09:26:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:27:13] !log kevinbazira@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . [09:27:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:28:12] jouncebot: now [09:28:12] No deployments scheduled for the next 3 hour(s) and 31 minute(s) [09:28:27] * WMDE-Fisch going to merge a beta cluster config change [09:28:29] (PS5) Jbond: P:mail: also exclude posfix aliases from vtr router [puppet] - https://gerrit.wikimedia.org/r/785870 [09:29:00] !log kevinbazira@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . [09:29:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:29:06] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:29:07] (CR) WMDE-Fisch: [C: +2] [beta] Stash all logs for the Kartographer extension [mediawiki-config] - https://gerrit.wikimedia.org/r/785852 (https://phabricator.wikimedia.org/T304813) (owner: Awight) [09:29:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1122 (re)pooling @ 50%: After reimage', diff saved to https://phabricator.wikimedia.org/P26522 and previous config saved to /var/cache/conftool/dbconfig/20220426-092949-root.json [09:29:51] !log kevinbazira@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
[09:29:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:29:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:30:03] (Merged) jenkins-bot: [beta] Stash all logs for the Kartographer extension [mediawiki-config] - https://gerrit.wikimedia.org/r/785852 (https://phabricator.wikimedia.org/T304813) (owner: Awight) [09:30:46] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host webperf2001.codfw.wmnet [09:30:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:30:50] * WMDE-Fisch done [09:31:50] SRE-swift-storage: swift wmf/rewrite.py middleware broken on bullseye (and its test suite doesn't work either) - https://phabricator.wikimedia.org/T305942 (MatthewVernon) New failures: ` Apr 26 09:25:15 ms-fe1012 proxy-server: Error: An error occurred: #012Traceback (most recent call last):#012 File "/usr/l... [09:31:54] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host webperf2002.codfw.wmnet [09:31:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:32:03] !log kevinbazira@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . [09:32:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:32:32] (CR) Vgutierrez: [C: +1] varnish: add new-version dynamic request filter template [puppet] - https://gerrit.wikimedia.org/r/778543 (https://phabricator.wikimedia.org/T305606) (owner: Giuseppe Lavagetto) [09:32:56] !log kevinbazira@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . 
[09:32:58] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on db1146.eqiad.wmnet with reason: Rebooting for T303174 [09:32:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:32:59] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on db1146.eqiad.wmnet with reason: Rebooting for T303174 [09:33:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:33:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:33:08] PROBLEM - k8s API server requests latencies on ml-serve-ctrl1002 is CRITICAL: instance=10.64.48.64 verb={CREATE,PATCH,UPDATE} https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [09:33:09] SRE, SRE-tools, DNS, Infrastructure-Foundations, and 2 others: sre.dns.netbox cookbook dosn't support period terminated domains - https://phabricator.wikimedia.org/T306809 (Volans) Sure, but they could cause various unwanted issues in different contexes, like not matching the fingerprint in the k... 
[09:33:15] !log kormat@cumin1001 dbctl commit (dc=all): 'db1146:3314 depooling: Rebooting for T303174', diff saved to https://phabricator.wikimedia.org/P26523 and previous config saved to /var/cache/conftool/dbconfig/20220426-093314-kormat.json [09:33:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:33:36] (PS6) Jbond: P:mail: also exclude posfix aliases from vtr router [puppet] - https://gerrit.wikimedia.org/r/785870 [09:33:56] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [09:33:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:33:59] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [09:34:00] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [09:34:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:34:04] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [09:34:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:34:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:34:22] PROBLEM - k8s API server requests latencies on ml-serve-ctrl2001 is CRITICAL: instance=10.192.32.33 verb={CREATE,PATCH} https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [09:34:32] PROBLEM - k8s API server requests latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 verb={CREATE,PATCH} https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [09:34:59] (CR) Giuseppe Lavagetto: [C: +2] varnish: add new-version dynamic request filter template [puppet] - https://gerrit.wikimedia.org/r/778543 (https://phabricator.wikimedia.org/T305606) (owner: Giuseppe Lavagetto) [09:35:06] jouncebot: nowandnext [09:35:06] No 
deployments scheduled for the next 3 hour(s) and 24 minute(s) [09:35:06] In 3 hour(s) and 24 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220426T1300) [09:35:15] (CR) Urbanecm: [C: +2] "merging, docs-only" [mediawiki-config] - https://gerrit.wikimedia.org/r/770502 (owner: Gergő Tisza) [09:35:56] (Merged) jenkins-bot: Add a note about tox requirements for changing logos [mediawiki-config] - https://gerrit.wikimedia.org/r/770502 (owner: Gergő Tisza) [09:36:03] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host webperf2002.codfw.wmnet [09:36:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:36:22] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host aqs1012.eqiad.wmnet [09:36:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:37:07] * urbanecm done [09:37:34] RECOVERY - k8s API server requests latencies on ml-serve-ctrl2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [09:37:40] RECOVERY - k8s API server requests latencies on ml-serve-ctrl1002 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [09:37:50] !log kormat@cumin1001 dbctl commit (dc=all): 'db1146:3312 (re)pooling @ 25%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P26524 and previous config saved to /var/cache/conftool/dbconfig/20220426-093750-kormat.json [09:37:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:38:35] (CR) Jbond: [C: +1] hieradata: swap remaining ldap-labs names to ldap-rw [puppet] - https://gerrit.wikimedia.org/r/786265 (https://phabricator.wikimedia.org/T295150) (owner: Majavah) [09:38:58] RECOVERY - k8s API server requests latencies on ml-serve-ctrl2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [09:39:18] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [09:39:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:39:21] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [09:39:23] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [09:39:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:39:26] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [09:39:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:39:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:39:42] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host webperf1001.eqiad.wmnet [09:39:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:41:18] (CR) Jbond: "thanks" [puppet] - https://gerrit.wikimedia.org/r/785870 (owner: Jbond) [09:41:23] !log kormat@cumin1001 dbctl commit (dc=all): 
'db1113:3316 (re)pooling @ 75%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P26525 and previous config saved to /var/cache/conftool/dbconfig/20220426-094123-kormat.json [09:41:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:42:02] <_joe_> some CP hosts will alert now [09:42:06] <_joe_> it's my fault, fixing now [09:42:36] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host webperf1001.eqiad.wmnet [09:42:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:44:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1122 (re)pooling @ 75%: After reimage', diff saved to https://phabricator.wikimedia.org/P26526 and previous config saved to /var/cache/conftool/dbconfig/20220426-094453-root.json [09:44:55] (CR) Jbond: [C: +2] P:mail: also exclude posfix aliases from vtr router [puppet] - https://gerrit.wikimedia.org/r/785870 (owner: Jbond) [09:44:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:45:18] PROBLEM - ganeti-confd running on ganeti-test2001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 112 (gnt-confd), command name ganeti-confd https://wikitech.wikimedia.org/wiki/Ganeti [09:45:48] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti-test2001.codfw.wmnet with OS bullseye [09:45:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:45:53] SRE: Upgrade ganeti-test to Bullseye - https://phabricator.wikimedia.org/T306499 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti-test2001.codfw.wmnet with OS bullseye [09:47:21] !log btullis@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host aqs1012.eqiad.wmnet [09:47:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:48:20] PROBLEM - cassandra-b CQL 10.64.32.145:9042 on aqs1012 is CRITICAL: connect 
to address 10.64.32.145 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [09:50:17] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host webperf1002.eqiad.wmnet [09:50:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:50:20] RECOVERY - cassandra-b CQL 10.64.32.145:9042 on aqs1012 is OK: TCP OK - 0.000 second response time on 10.64.32.145 port 9042 https://phabricator.wikimedia.org/T93886 [09:51:00] (JobUnavailable) firing: (4) Reduced availability for job gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:51:04] (PS1) Vgutierrez: Release 8.0.8-1wm6 [debs/trafficserver] - https://gerrit.wikimedia.org/r/786282 (https://phabricator.wikimedia.org/T304835) [09:51:24] !log nokafor@deploy1002 Started deploy [airflow-dags/analytics@9dbd5bc]: (no justification provided) [09:51:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:51:28] (CR) jerkins-bot: [V: -1] Release 8.0.8-1wm6 [debs/trafficserver] - https://gerrit.wikimedia.org/r/786282 (https://phabricator.wikimedia.org/T304835) (owner: Vgutierrez) [09:51:31] !log nokafor@deploy1002 Finished deploy [airflow-dags/analytics@9dbd5bc]: (no justification provided) (duration: 00m 07s) [09:51:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:51:39] that was fast :) [09:51:55] (PS1) Giuseppe Lavagetto: cache-frontend: fix confd template [puppet] - https://gerrit.wikimedia.org/r/786283 [09:52:23] <_joe_> vgutierrez: jerkins knows you [09:52:28] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host webperf1002.eqiad.wmnet [09:52:28] (PS2) Vgutierrez: Release 8.0.8-1wm6 [debs/trafficserver] - https://gerrit.wikimedia.org/r/786282 
(https://phabricator.wikimedia.org/T304835) [09:52:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:52:54] !log kormat@cumin1001 dbctl commit (dc=all): 'db1146:3312 (re)pooling @ 50%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P26528 and previous config saved to /var/cache/conftool/dbconfig/20220426-095254-kormat.json [09:52:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:56:01] PROBLEM - Confd template for /etc/varnish/requestctl-filters.inc.vcl on cp2031 is CRITICAL: File not found: /etc/varnish/requestctl-filters.inc.vcl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [09:56:15] (CR) Giuseppe Lavagetto: [C: +2] cache-frontend: fix confd template [puppet] - https://gerrit.wikimedia.org/r/786283 (owner: Giuseppe Lavagetto) [09:56:27] !log kormat@cumin1001 dbctl commit (dc=all): 'db1113:3316 (re)pooling @ 100%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P26530 and previous config saved to /var/cache/conftool/dbconfig/20220426-095627-kormat.json [09:56:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:58:56] SRE, SRE-tools, DNS, Infrastructure-Foundations, and 2 others: sre.dns.netbox cookbook dosn't support period terminated domains - https://phabricator.wikimedia.org/T306809 (jbond) Open→Stalled As per an offline conversation with @Volans. newer versions of netbox allow us to preform [[ ht... 
[09:59:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1122 (re)pooling @ 100%: After reimage', diff saved to https://phabricator.wikimedia.org/P26531 and previous config saved to /var/cache/conftool/dbconfig/20220426-095957-root.json [10:00:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:00:09] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti-test2001.codfw.wmnet with reason: host reimage [10:00:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:00:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Pool db1122 into API', diff saved to https://phabricator.wikimedia.org/P26532 and previous config saved to /var/cache/conftool/dbconfig/20220426-100031-marostegui.json [10:00:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:00:57] PROBLEM - Confd template for /etc/varnish/requestctl-filters.inc.vcl on cp4029 is CRITICAL: File not found: /etc/varnish/requestctl-filters.inc.vcl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [10:01:16] _joe_: ^^ [10:03:35] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti-test2001.codfw.wmnet with reason: host reimage [10:03:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:05:07] PROBLEM - Confd template for /etc/varnish/requestctl-filters.inc.vcl on cp5007 is CRITICAL: File not found: /etc/varnish/requestctl-filters.inc.vcl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [10:05:36] <_joe_> uhm [10:05:48] <_joe_> that's quite strange vgutierrez [10:05:59] <_joe_> but yes I still have puppet disabled [10:06:03] <_joe_> it will be fixed soon [10:06:20] ack [10:07:19] RECOVERY - Confd template for /etc/varnish/requestctl-filters.inc.vcl on cp2031 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [10:07:23] <_joe_> yep [10:07:33] <_joe_> that was me forcing a puppet run on that host 
:) [10:07:34] (03PS1) 10Jcrespo: mediabackup: Clone localy the mediawiki-config repo [puppet] - 10https://gerrit.wikimedia.org/r/786285 (https://phabricator.wikimedia.org/T305446) [10:07:59] !log kormat@cumin1001 dbctl commit (dc=all): 'db1146:3312 (re)pooling @ 75%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P26533 and previous config saved to /var/cache/conftool/dbconfig/20220426-100758-kormat.json [10:08:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:08:31] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:08:41] PROBLEM - Confd template for /etc/varnish/requestctl-filters.inc.vcl on cp4027 is CRITICAL: File not found: /etc/varnish/requestctl-filters.inc.vcl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [10:08:42] (03CR) 10Jcrespo: [C: 03+2] mediabackup: Clone localy the mediawiki-config repo [puppet] - 10https://gerrit.wikimedia.org/r/786285 (https://phabricator.wikimedia.org/T305446) (owner: 10Jcrespo) [10:10:05] RECOVERY - Confd template for /etc/varnish/requestctl-filters.inc.vcl on cp4027 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [10:13:27] RECOVERY - Confd template for /etc/varnish/requestctl-filters.inc.vcl on cp4029 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [10:13:55] RECOVERY - Confd template for /etc/varnish/requestctl-filters.inc.vcl on cp5007 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [10:14:19] PROBLEM - Check systemd state on ms-be1064 is CRITICAL: CRITICAL - degraded: The following units failed: swift-drive-audit.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:14:27] !log jynus@cumin2002 START - Cookbook sre.hosts.reimage for host ms-backup2002.codfw.wmnet with OS bullseye [10:14:30] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:15:09] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host aqs1013.eqiad.wmnet [10:15:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:22:32] PROBLEM - Check systemd state on doc1001 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-doc-doc1002.eqiad.wmnet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:23:04] !log kormat@cumin1001 dbctl commit (dc=all): 'db1146:3312 (re)pooling @ 100%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P26534 and previous config saved to /var/cache/conftool/dbconfig/20220426-102303-kormat.json [10:23:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:23:08] !log kormat@cumin1001 dbctl commit (dc=all): 'db1146:3314 (re)pooling @ 25%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P26535 and previous config saved to /var/cache/conftool/dbconfig/20220426-102307-kormat.json [10:23:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:24:36] PROBLEM - Confd template for /etc/varnish/requestctl-filters.inc.vcl on cp3060 is CRITICAL: File not found: /etc/varnish/requestctl-filters.inc.vcl https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [10:24:50] (03CR) 10Awight: "This change is ready for review." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/786289 (https://phabricator.wikimedia.org/T304813) (owner: 10Awight) [10:25:41] !log btullis@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host aqs1013.eqiad.wmnet [10:25:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:28:05] 10SRE: role_contacts (service owners) as a custom puppet fact - https://phabricator.wikimedia.org/T306830 (10jbond) > use cumin to ask "what is the kernel version of all machines owned by $subteam" or "which hosts owned by $subteam are still on buster" As we pass this value as a paramter to profile::contacts we... [10:28:42] !log jynus@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-backup2002.codfw.wmnet with reason: host reimage [10:28:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:32:12] !log jynus@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-backup2002.codfw.wmnet with reason: host reimage [10:32:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:33:04] (03PS1) 10MVernon: swift: wmf/rewrite.py py2->3 HTTPMessage changes [puppet] - 10https://gerrit.wikimedia.org/r/786290 (https://phabricator.wikimedia.org/T305942) [10:33:36] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-cluster (exit_code=0) [10:33:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:33:51] (03CR) 10jerkins-bot: [V: 04-1] swift: wmf/rewrite.py py2->3 HTTPMessage changes [puppet] - 10https://gerrit.wikimedia.org/r/786290 (https://phabricator.wikimedia.org/T305942) (owner: 10MVernon) [10:34:52] 10SRE: role_contacts (service owners) as a custom puppet fact - https://phabricator.wikimedia.org/T306830 (10MoritzMuehlenhoff) And if that syntax is too cumbersome in the day-to-day we could add a few Cumin aliases? like A:hosts-data-persistence and A:hosts-infrastructure-foundations or similar? 
[10:34:54] 10SRE, 10Infrastructure-Foundations, 10Prod-Kubernetes, 10netops, 10Patch-For-Review: Agree strategy for Kubernetes BGP peering to top-of-rack switches - https://phabricator.wikimedia.org/T306649 (10elukey) Created https://gerrit.wikimedia.org/r/786264 to kick off the discussion about the next steps, let... [10:36:07] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:37:55] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [10:38:00] 10SRE, 10DNS, 10Traffic, 10Wikimedia Enterprise: 301 redirect setup for wikimediaenterprise - https://phabricator.wikimedia.org/T302756 (10Protsack.stephan) [10:38:12] !log kormat@cumin1001 dbctl commit (dc=all): 'db1146:3314 (re)pooling @ 50%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P26536 and previous config saved to /var/cache/conftool/dbconfig/20220426-103811-kormat.json [10:38:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:38:21] (03PS2) 10MVernon: swift: wmf/rewrite.py py2->3 HTTPMessage changes [puppet] - 10https://gerrit.wikimedia.org/r/786290 (https://phabricator.wikimedia.org/T305942) [10:38:34] (03CR) 10WMDE-Fisch: [C: 03+1] Watch for mapdata cache misses in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/786289 (https://phabricator.wikimedia.org/T304813) (owner: 10Awight) [10:40:38] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti-test2001.codfw.wmnet with OS bullseye [10:40:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:40:43] 10SRE: Upgrade ganeti-test to Bullseye - https://phabricator.wikimedia.org/T306499 
(10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti-test2001.codfw.wmnet with OS bullseye completed: - ganeti-test2001 (**PASS**) - Downtimed on Icinga/Alertmanager - Disabled P... [10:42:44] (03CR) 10MVernon: "I've lightly tested this fix on ms-fe1012." [puppet] - 10https://gerrit.wikimedia.org/r/786290 (https://phabricator.wikimedia.org/T305942) (owner: 10MVernon) [10:43:43] !log jynus@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-backup2002.codfw.wmnet with OS bullseye [10:43:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:44:40] !log Reconfigre routing policy lsw1-f3-eqiad, rename policies to use lower-case [10:44:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:44:53] 10SRE-swift-storage, 10Patch-For-Review: swift wmf/rewrite.py middleware broken on bullseye (and its test suite doesn't work either) - https://phabricator.wikimedia.org/T305942 (10MatthewVernon) That patch fixes the above error; I've found another... [10:45:08] !log jynus@cumin1001 START - Cookbook sre.hosts.reimage for host ms-backup1002.eqiad.wmnet with OS bullseye [10:45:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:47:32] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/786290 (https://phabricator.wikimedia.org/T305942) (owner: 10MVernon) [10:48:42] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to me!" [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/786282 (https://phabricator.wikimedia.org/T304835) (owner: 10Vgutierrez) [10:48:54] (03CR) 10Vgutierrez: cache::haproxy: Log emergency messages to disk (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/784256 (https://phabricator.wikimedia.org/T306236) (owner: 10Vgutierrez) [10:50:08] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM!" 
[puppet] - https://gerrit.wikimedia.org/r/784256 (https://phabricator.wikimedia.org/T306236) (owner: Vgutierrez) [10:50:49] (CR) Filippo Giunchedi: [C: +1] "LGTM when the time comes" [puppet] - https://gerrit.wikimedia.org/r/785927 (owner: Cwhite) [10:51:59] (CR) Thiemo Kreuz (WMDE): [C: +1] Watch for mapdata cache misses in production [mediawiki-config] - https://gerrit.wikimedia.org/r/786289 (https://phabricator.wikimedia.org/T304813) (owner: Awight) [10:53:01] (CR) Filippo Giunchedi: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/785921 (https://phabricator.wikimedia.org/T294564) (owner: JHathaway) [10:53:03] (CR) MVernon: [C: +2] swift: wmf/rewrite.py py2->3 HTTPMessage changes [puppet] - https://gerrit.wikimedia.org/r/786290 (https://phabricator.wikimedia.org/T305942) (owner: MVernon) [10:53:14] (CR) Vgutierrez: [C: +2] cache::haproxy: Log emergency messages to disk [puppet] - https://gerrit.wikimedia.org/r/784256 (https://phabricator.wikimedia.org/T306236) (owner: Vgutierrez) [10:53:16] !log kormat@cumin1001 dbctl commit (dc=all): 'db1146:3314 (re)pooling @ 75%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P26537 and previous config saved to /var/cache/conftool/dbconfig/20220426-105315-kormat.json [10:53:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:53:59] !log jelto@cumin1001 START - Cookbook sre.hosts.reboot-cluster [10:54:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:57:05] !log jynus@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-backup1002.eqiad.wmnet with reason: host reimage [10:57:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:57:41] RECOVERY - Confd template for /etc/varnish/requestctl-filters.inc.vcl on cp3060 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [10:57:48] !log Reconfigre routing policy 
lsw1-e3-eqiad, rename policies to use lower-case [10:57:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:59:19] PROBLEM - SSH on wtp1045.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:00:31] !log jynus@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-backup1002.eqiad.wmnet with reason: host reimage [11:00:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:01:20] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host aqs1014.eqiad.wmnet [11:01:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:01:55] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [11:02:05] RECOVERY - Check systemd state on ms-be1064 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:05:07] !log Reconfigre routing policy lsw1-f2-eqiad, rename policies to use lower-case [11:05:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:08:19] !log kormat@cumin1001 dbctl commit (dc=all): 'db1146:3314 (re)pooling @ 100%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P26538 and previous config saved to /var/cache/conftool/dbconfig/20220426-110819-kormat.json [11:08:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:09:17] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host aqs1014.eqiad.wmnet [11:09:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:09:49] !log Reconfigre routing policy lsw1-e2-eqiad, rename 
policies to use lower-case [11:09:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:11:26] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti-test2001.codfw.wmnet [11:11:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:11:34] !log jynus@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-backup1002.eqiad.wmnet with OS bullseye [11:11:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:13:07] !log jynus@cumin2002 START - Cookbook sre.hosts.reimage for host ms-backup2001.codfw.wmnet with OS bullseye [11:13:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:16:56] !log Reconfigre routing policy lsw1-e1-eqiad, rename policies to use lower-case [11:16:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:17:05] (03CR) 10Jbond: "lgtm, some minor comments/questions inline" [software/spicerack] - 10https://gerrit.wikimedia.org/r/775904 (owner: 10Volans) [11:17:34] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on db1170.eqiad.wmnet with reason: Rebooting for T303174 [11:17:36] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on db1170.eqiad.wmnet with reason: Rebooting for T303174 [11:17:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:17:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:17:41] !log kormat@cumin1001 dbctl commit (dc=all): 'db1170:3312 depooling: Rebooting for T303174', diff saved to https://phabricator.wikimedia.org/P26539 and previous config saved to /var/cache/conftool/dbconfig/20220426-111741-kormat.json [11:17:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:17:52] !log kormat@cumin1001 dbctl commit (dc=all): 'db1170:3317 depooling: Rebooting for T303174', diff saved to 
https://phabricator.wikimedia.org/P26540 and previous config saved to /var/cache/conftool/dbconfig/20220426-111751-kormat.json [11:17:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:18:55] RECOVERY - Check systemd state on doc1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:22:16] !log kormat@cumin1001 dbctl commit (dc=all): 'db1170:3312 (re)pooling @ 25%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P26541 and previous config saved to /var/cache/conftool/dbconfig/20220426-112215-kormat.json [11:22:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:22:26] !log Reconfigre routing policy lsw1-f1-eqiad, rename policies to use lower-case [11:22:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:22:45] (KubernetesRsyslogDown) firing: (4) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [11:23:04] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host ganeti-test2001.codfw.wmnet [11:23:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:27:21] !log jynus@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-backup2001.codfw.wmnet with reason: host reimage [11:27:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:54] PROBLEM - configured eth on ganeti-test2001 is CRITICAL: public reporting no carrier. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [11:29:12] (03PS1) 10Cathal Mooney: Add automation templates for EVPN switch overlay BGP [homer/public] - 10https://gerrit.wikimedia.org/r/786296 (https://phabricator.wikimedia.org/T299758) [11:29:44] (03CR) 10jerkins-bot: [V: 04-1] Add automation templates for EVPN switch overlay BGP [homer/public] - 10https://gerrit.wikimedia.org/r/786296 (https://phabricator.wikimedia.org/T299758) (owner: 10Cathal Mooney) [11:30:48] !log jynus@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-backup2001.codfw.wmnet with reason: host reimage [11:30:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:32:31] (03PS2) 10Cathal Mooney: Add automation templates for EVPN switch overlay BGP [homer/public] - 10https://gerrit.wikimedia.org/r/786296 (https://phabricator.wikimedia.org/T299758) [11:33:50] (03CR) 10jerkins-bot: [V: 04-1] Add automation templates for EVPN switch overlay BGP [homer/public] - 10https://gerrit.wikimedia.org/r/786296 (https://phabricator.wikimedia.org/T299758) (owner: 10Cathal Mooney) [11:34:05] !log jynus@cumin1001 START - Cookbook sre.hosts.reimage for host ms-backup1001.eqiad.wmnet with OS bullseye [11:34:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:37:19] !log kormat@cumin1001 dbctl commit (dc=all): 'db1170:3312 (re)pooling @ 50%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P26542 and previous config saved to /var/cache/conftool/dbconfig/20220426-113719-kormat.json [11:37:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:40:24] (03PS3) 10Cathal Mooney: Add automation templates for EVPN switch overlay BGP [homer/public] - 10https://gerrit.wikimedia.org/r/786296 (https://phabricator.wikimedia.org/T299758) [11:40:56] (03CR) 10jerkins-bot: [V: 04-1] Add automation templates for EVPN switch overlay BGP [homer/public] - 
https://gerrit.wikimedia.org/r/786296 (https://phabricator.wikimedia.org/T299758) (owner: Cathal Mooney) [11:42:07] !log jynus@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-backup2001.codfw.wmnet with OS bullseye [11:42:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:42:18] (PS4) Cathal Mooney: Add automation templates for EVPN switch overlay BGP [homer/public] - https://gerrit.wikimedia.org/r/786296 (https://phabricator.wikimedia.org/T299758) [11:46:06] !log jynus@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-backup1001.eqiad.wmnet with reason: host reimage [11:46:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:46:19] (CR) Cathal Mooney: [C: +2] Add automation templates for EVPN switch overlay BGP [homer/public] - https://gerrit.wikimedia.org/r/786296 (https://phabricator.wikimedia.org/T299758) (owner: Cathal Mooney) [11:46:52] (Merged) jenkins-bot: Add automation templates for EVPN switch overlay BGP [homer/public] - https://gerrit.wikimedia.org/r/786296 (https://phabricator.wikimedia.org/T299758) (owner: Cathal Mooney) [11:47:06] (CR) David Caro: "Neat, almost there, some comments 😊" [puppet] - https://gerrit.wikimedia.org/r/777037 (https://phabricator.wikimedia.org/T304040) (owner: Raymond Ndibe) [11:49:19] !log jynus@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-backup1001.eqiad.wmnet with reason: host reimage [11:49:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:50:54] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [11:52:23] !log kormat@cumin1001 dbctl commit (dc=all): 'db1170:3312 (re)pooling @ 
75%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P26543 and previous config saved to /var/cache/conftool/dbconfig/20220426-115223-kormat.json [11:52:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:54:27] (03PS2) 10WMDE-Fisch: Watch for mapdata cache misses in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/786289 (https://phabricator.wikimedia.org/T304813) (owner: 10Awight) [11:59:27] RECOVERY - configured eth on ganeti-test2001 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [12:00:46] !log jynus@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-backup1001.eqiad.wmnet with OS bullseye [12:00:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:03:29] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti-test2001.codfw.wmnet to ganeti-test01.svc.codfw.wmnet [12:03:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:03:43] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti-test2001.codfw.wmnet to ganeti-test01.svc.codfw.wmnet [12:03:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:07:02] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/785921 (https://phabricator.wikimedia.org/T294564) (owner: 10JHathaway) [12:07:27] !log kormat@cumin1001 dbctl commit (dc=all): 'db1170:3312 (re)pooling @ 100%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P26544 and previous config saved to /var/cache/conftool/dbconfig/20220426-120727-kormat.json [12:07:28] (03CR) 10Filippo Giunchedi: [C: 03+2] thanos: aggregate exporter 'up' metrics [puppet] - 10https://gerrit.wikimedia.org/r/784635 (https://phabricator.wikimedia.org/T288726) (owner: 10Filippo Giunchedi) [12:07:30] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:07:31] !log kormat@cumin1001 dbctl commit (dc=all): 'db1170:3317 (re)pooling @ 25%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P26545 and previous config saved to /var/cache/conftool/dbconfig/20220426-120731-kormat.json [12:07:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:14:48] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host aqs1015.eqiad.wmnet [12:14:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:18:08] 10SRE, 10Infrastructure-Foundations, 10Prod-Kubernetes, 10netops, 10Patch-For-Review: Agree strategy for Kubernetes BGP peering to top-of-rack switches - https://phabricator.wikimedia.org/T306649 (10cmooney) @elukey thanks for the patch, certainly looks ok to me, if indeed it works in terms of the Calico... [12:22:35] !log kormat@cumin1001 dbctl commit (dc=all): 'db1170:3317 (re)pooling @ 50%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P26546 and previous config saved to /var/cache/conftool/dbconfig/20220426-122235-kormat.json [12:22:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:23:51] RECOVERY - SSH on wtp1035.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:24:42] !log btullis@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host aqs1015.eqiad.wmnet [12:24:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:29:04] (03PS3) 10Cathal Mooney: Add ml-serve100[5-8] to the ml-serve-eqiad k8s BGP neighbors [homer/public] - 10https://gerrit.wikimedia.org/r/784703 (https://phabricator.wikimedia.org/T306545) (owner: 10Elukey) [12:29:38] (03CR) 10jerkins-bot: [V: 04-1] Add ml-serve100[5-8] to the ml-serve-eqiad k8s BGP neighbors [homer/public] - 10https://gerrit.wikimedia.org/r/784703 
(https://phabricator.wikimedia.org/T306545) (owner: Elukey) [12:30:03] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:31:16] SRE, DBA, Security: Reboot pc1011 - https://phabricator.wikimedia.org/T306892 (Kormat) [12:31:19] PROBLEM - Host cp1089.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [12:32:10] SRE, DBA: Reboot pc1011 - https://phabricator.wikimedia.org/T306892 (Kormat) [12:32:13] PROBLEM - configured eth on ganeti-test2001 is CRITICAL: public reporting no carrier. https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [12:32:18] SRE, DBA: Reboot pc1011 - https://phabricator.wikimedia.org/T306892 (Kormat) p:Triage→Medium [12:32:37] SRE, DBA: Reboot pc1011 - https://phabricator.wikimedia.org/T306892 (Kormat) [12:33:12] (PS4) Cathal Mooney: Add ml-serve100[5-8] to the ml-serve-eqiad k8s BGP neighbors [homer/public] - https://gerrit.wikimedia.org/r/784703 (https://phabricator.wikimedia.org/T306545) (owner: Elukey) [12:33:23] PROBLEM - puppet last run on ml-staging-ctrl2001 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [12:33:46] (CR) jerkins-bot: [V: -1] Add ml-serve100[5-8] to the ml-serve-eqiad k8s BGP neighbors [homer/public] - https://gerrit.wikimedia.org/r/784703 (https://phabricator.wikimedia.org/T306545) (owner: Elukey) [12:34:04] SRE, DBA: Reboot pc1011 - https://phabricator.wikimedia.org/T306892 (Kormat) [12:35:39] (PS5) Cathal Mooney: Add ml-serve100[5-8] to the ml-serve-eqiad k8s BGP neighbors [homer/public] - https://gerrit.wikimedia.org/r/784703 (https://phabricator.wikimedia.org/T306545) (owner: Elukey) [12:36:59] RECOVERY - Host cp1089.mgmt is UP: PING WARNING - Packet loss = 60%, RTA = 0.82 ms [12:37:35] (CR) Cathal Mooney: [C: +2] Add ml-serve100[5-8] to the ml-serve-eqiad 
k8s BGP neighbors [homer/public] - 10https://gerrit.wikimedia.org/r/784703 (https://phabricator.wikimedia.org/T306545) (owner: 10Elukey) [12:37:40] !log kormat@cumin1001 dbctl commit (dc=all): 'db1170:3317 (re)pooling @ 75%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P26547 and previous config saved to /var/cache/conftool/dbconfig/20220426-123740-kormat.json [12:37:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:38:08] (03Merged) 10jenkins-bot: Add ml-serve100[5-8] to the ml-serve-eqiad k8s BGP neighbors [homer/public] - 10https://gerrit.wikimedia.org/r/784703 (https://phabricator.wikimedia.org/T306545) (owner: 10Elukey) [12:43:10] (03PS1) 10Kormat: ProductionServices: Promote pc1014 to primary of pc1. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/786300 (https://phabricator.wikimedia.org/T306892) [12:44:11] (03CR) 10Vgutierrez: [C: 03+2] Release 8.0.8-1wm6 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/786282 (https://phabricator.wikimedia.org/T304835) (owner: 10Vgutierrez) [12:44:23] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host archiva1002.wikimedia.org [12:44:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:45:25] 10SRE, 10DBA, 10Patch-For-Review: Reboot pc1011 - https://phabricator.wikimedia.org/T306892 (10Kormat) [12:46:32] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host archiva1002.wikimedia.org [12:46:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:46:57] PROBLEM - BGP status on lsw1-e2-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64606/IPv6: Connect - kubernetes-ml-eqiad, AS64606/IPv4: Connect - kubernetes-ml-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:47:27] PROBLEM - BGP status on lsw1-e3-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64606/IPv6: Connect - kubernetes-ml-eqiad, AS64606/IPv4: Connect - kubernetes-ml-eqiad 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:48:10] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host druid1004.eqiad.wmnet [12:48:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:48:21] jouncebot: nowandnext [12:48:22] No deployments scheduled for the next 0 hour(s) and 11 minute(s) [12:48:22] In 0 hour(s) and 11 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220426T1300) [12:48:35] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 81, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:50:21] PROBLEM - BGP status on lsw1-f2-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64606/IPv4: Connect - kubernetes-ml-eqiad, AS64606/IPv6: Connect - kubernetes-ml-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:50:45] PROBLEM - BGP status on lsw1-f3-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64606/IPv6: Connect - kubernetes-ml-eqiad, AS64606/IPv4: Connect - kubernetes-ml-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:52:44] !log kormat@cumin1001 dbctl commit (dc=all): 'db1170:3317 (re)pooling @ 100%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P26550 and previous config saved to /var/cache/conftool/dbconfig/20220426-125244-kormat.json [12:52:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:55:05] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host druid1004.eqiad.wmnet [12:55:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:55:21] (03CR) 10Marostegui: "The rack location aren't updated, I am fine with that if this will be for a short time. 
But in case pc1011 doesn't come back, we should up" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/786300 (https://phabricator.wikimedia.org/T306892) (owner: 10Kormat) [12:57:04] (03PS2) 10Kormat: ProductionServices: Promote pc1014 to primary of pc1. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/786300 (https://phabricator.wikimedia.org/T306892) [12:57:42] (03CR) 10Kormat: ProductionServices: Promote pc1014 to primary of pc1. (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/786300 (https://phabricator.wikimedia.org/T306892) (owner: 10Kormat) [12:59:03] (03CR) 10Marostegui: [C: 04-1] "I just realised you are depooling pc2011 not pc1011" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/786300 (https://phabricator.wikimedia.org/T306892) (owner: 10Kormat) [13:00:04] RoanKattouw, Lucas_WMDE, and Urbanecm: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220426T1300). [13:00:04] tgr: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:24] I’m in a meeting – I assume tgr / tgr_ can self-serve? :) [13:00:24] hey tgr! do you want to self-serve? [13:00:41] hey Lucas_WMDE! happy meeting :) [13:00:41] sure, i can [13:01:05] RECOVERY - SSH on wtp1045.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:01:05] (y) [13:01:11] PROBLEM - PHP7 jobrunner on mw1308 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [13:01:12] (03PS3) 10Kormat: ProductionServices: Promote pc1014 to primary of pc1. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/786300 (https://phabricator.wikimedia.org/T306892) [13:01:43] PROBLEM - PHP7 rendering on mw1308 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [13:02:06] (03CR) 10Kormat: ProductionServices: Promote pc1014 to primary of pc1. (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/786300 (https://phabricator.wikimedia.org/T306892) (owner: 10Kormat) [13:02:49] (03CR) 10Marostegui: [C: 03+1] ProductionServices: Promote pc1014 to primary of pc1. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/786300 (https://phabricator.wikimedia.org/T306892) (owner: 10Kormat) [13:03:29] RECOVERY - configured eth on ganeti-test2001 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [13:07:20] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host druid1005.eqiad.wmnet [13:07:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:07:58] (03PS1) 10Cathal Mooney: Modify Ganeti addnode.py script function when detecting bridge status [cookbooks] - 10https://gerrit.wikimedia.org/r/786304 [13:08:44] (03CR) 10Elukey: "Thanks Cathal!" 
[homer/public] - 10https://gerrit.wikimedia.org/r/784703 (https://phabricator.wikimedia.org/T306545) (owner: 10Elukey) [13:10:53] RECOVERY - PHP7 rendering on mw1308 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.032 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [13:11:12] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/786304 (owner: 10Cathal Mooney) [13:11:25] (03CR) 10Gergő Tisza: [C: 03+2] [beta] Reopen beta eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/785925 (https://phabricator.wikimedia.org/T306833) (owner: 10Gergő Tisza) [13:12:06] (03Merged) 10jenkins-bot: [beta] Reopen beta eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/785925 (https://phabricator.wikimedia.org/T306833) (owner: 10Gergő Tisza) [13:13:32] (03CR) 10Cathal Mooney: [C: 03+2] Modify Ganeti addnode.py script function when detecting bridge status [cookbooks] - 10https://gerrit.wikimedia.org/r/786304 (owner: 10Cathal Mooney) [13:13:40] (03CR) 10Gergő Tisza: [C: 03+2] Backport video landing page changes [extensions/GrowthExperiments] (wmf/1.39.0-wmf.8) - 10https://gerrit.wikimedia.org/r/785950 (https://phabricator.wikimedia.org/T303785) (owner: 10Gergő Tisza) [13:14:01] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host druid1005.eqiad.wmnet [13:14:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:13] tgr: can you ping me when you've finished deploying, please? i have something i want to deploy, and i don't want to step on your toes [13:15:12] kormat: if it's not in wmf.8, go ahead; the patch is going through CI, will take a while [13:15:39] tgr: it's a mediawiki-config change. 
i'm not in any rush, so i'd rather wait [13:16:27] as you wish, but it will take an hour at least [13:16:39] PROBLEM - PHP7 rendering on mw1308 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [13:16:48] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:16:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:16:51] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:16:52] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:16:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:16:57] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:16:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:16:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:17:08] (03CR) 10Muehlenhoff: [V: 03+2 C: 03+1] Modify Ganeti addnode.py script function when detecting bridge status [cookbooks] - 10https://gerrit.wikimedia.org/r/786304 (owner: 10Cathal Mooney) [13:17:11] (03Merged) 10jenkins-bot: Modify Ganeti addnode.py script function when detecting bridge status [cookbooks] - 10https://gerrit.wikimedia.org/r/786304 (owner: 10Cathal Mooney) [13:17:26] (03CR) 10Gergő Tisza: [C: 03+2] Add Link: Add 'excluded sections' task setting [extensions/GrowthExperiments] (wmf/1.39.0-wmf.8) - 10https://gerrit.wikimedia.org/r/785926 (https://phabricator.wikimedia.org/T304150) (owner: 10Gergő Tisza) [13:18:03] (03PS1) 10Andrew Bogott: wmcs-novastats-dnsleaks.py: make slightly better at handling codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/786307 [13:18:24] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti-test2001.codfw.wmnet to ganeti-test01.svc.codfw.wmnet [13:18:27] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:18:30] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti-test2001.codfw.wmnet to ganeti-test01.svc.codfw.wmnet [13:18:37] (03PS2) 10Elukey: Add calico BGP peering settings for ml-serve100[5-8] [deployment-charts] - 10https://gerrit.wikimedia.org/r/786264 (https://phabricator.wikimedia.org/T306649) [13:18:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:05] tgr: oh really.. ok. i'll do mine now then. wish me luck! ;) [13:19:20] (03CR) 10Kormat: [C: 03+2] ProductionServices: Promote pc1014 to primary of pc1. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/786300 (https://phabricator.wikimedia.org/T306892) (owner: 10Kormat) [13:20:08] (03Merged) 10jenkins-bot: ProductionServices: Promote pc1014 to primary of pc1. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/786300 (https://phabricator.wikimedia.org/T306892) (owner: 10Kormat) [13:21:19] 10SRE, 10DBA, 10Patch-For-Review: Reboot pc1011 - https://phabricator.wikimedia.org/T306892 (10Kormat) [13:21:28] !log kormat@deploy1002 Synchronized wmf-config/ProductionServices.php: Set pc1014 as pc1 primary T306892 (duration: 01m 07s) [13:21:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:44] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on pc[2011,2014].codfw.wmnet,pc[1011,1014].eqiad.wmnet with reason: Rebooting pc1011 T306892 [13:21:45] T306892: Reboot pc1011 - https://phabricator.wikimedia.org/T306892 [13:21:48] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc[2011,2014].codfw.wmnet,pc[1011,1014].eqiad.wmnet with reason: Rebooting pc1011 T306892 [13:21:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:22:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:30] !log kormat@cumin1001 START - Cookbook 
sre.hosts.downtime for 1:30:00 on pc1011.eqiad.wmnet with reason: Rebooting for T303174 [13:23:31] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on pc1011.eqiad.wmnet with reason: Rebooting for T303174 [13:23:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:19] 10SRE, 10DBA, 10Patch-For-Review: Reboot pc1011 - https://phabricator.wikimedia.org/T306892 (10Kormat) [13:24:24] 10SRE, 10DBA, 10Patch-For-Review: Reboot pc1011 - https://phabricator.wikimedia.org/T306892 (10Kormat) [13:24:44] (03CR) 10Andrew Bogott: [C: 03+2] wmcs-novastats-dnsleaks.py: make slightly better at handling codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/786307 (owner: 10Andrew Bogott) [13:25:49] RECOVERY - PHP7 rendering on mw1308 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.024 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [13:26:39] RECOVERY - PHP7 jobrunner on mw1308 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.022 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [13:26:41] (03CR) 10Klausman: [C: 03+1] Add calico BGP peering settings for ml-serve100[5-8] [deployment-charts] - 10https://gerrit.wikimedia.org/r/786264 (https://phabricator.wikimedia.org/T306649) (owner: 10Elukey) [13:27:00] (03CR) 10Elukey: [C: 03+2] Add calico BGP peering settings for ml-serve100[5-8] [deployment-charts] - 10https://gerrit.wikimedia.org/r/786264 (https://phabricator.wikimedia.org/T306649) (owner: 10Elukey) [13:27:01] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:27:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:27:04] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:27:05] !log mwdebug-deploy@deploy1002 helmfile [codfw] START 
helmfile.d/services/mwdebug: apply [13:27:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:27:09] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:27:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:27:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:28:14] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-cluster (exit_code=0) [13:28:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:29:30] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [13:29:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:29:42] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [13:29:43] RECOVERY - BGP status on lsw1-f2-eqiad.mgmt is OK: BGP OK - up: 4, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:29:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:09] RECOVERY - BGP status on lsw1-f3-eqiad.mgmt is OK: BGP OK - up: 4, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:30:49] RECOVERY - BGP status on lsw1-e2-eqiad.mgmt is OK: BGP OK - up: 4, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:31:13] (03PS1) 10Kormat: Revert "ProductionServices: Promote pc1014 to primary of pc1." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/785936 [13:31:19] RECOVERY - BGP status on lsw1-e3-eqiad.mgmt is OK: BGP OK - up: 4, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:32:23] PROBLEM - PHP7 jobrunner on mw1445 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [13:32:51] (03CR) 10Kormat: [C: 03+2] Revert "ProductionServices: Promote pc1014 to primary of pc1." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/785936 (owner: 10Kormat) [13:33:27] (03PS1) 10Jbond: rake_modules: add check for spdk licence header [puppet] - 10https://gerrit.wikimedia.org/r/786310 (https://phabricator.wikimedia.org/T67270) [13:33:43] PROBLEM - PHP7 rendering on mw1445 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [13:34:03] 10SRE, 10WMF-General-or-Unknown, 10WMF-Legal, 10Documentation, and 2 others: Default license for operations/puppet - https://phabricator.wikimedia.org/T67270 (10jbond) >>! In T67270#7832446, @Ladsgroup wrote: > @jbond In the meantime, maybe we can add a rule to lint -1ing any new puppet/or otherwise file t... [13:34:09] (03CR) 10jerkins-bot: [V: 04-1] rake_modules: add check for spdk licence header [puppet] - 10https://gerrit.wikimedia.org/r/786310 (https://phabricator.wikimedia.org/T67270) (owner: 10Jbond) [13:34:38] (03Merged) 10jenkins-bot: Revert "ProductionServices: Promote pc1014 to primary of pc1." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/785936 (owner: 10Kormat) [13:35:59] (03PS2) 10Jbond: rake_modules: add check for spdk licence header [puppet] - 10https://gerrit.wikimedia.org/r/786310 (https://phabricator.wikimedia.org/T67270) [13:36:33] !log kormat@deploy1002 Synchronized wmf-config/ProductionServices.php: Set pc1011 as pc1 primary T306892 (duration: 01m 37s) [13:36:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:39] T306892: Reboot pc1011 - https://phabricator.wikimedia.org/T306892 [13:36:51] RECOVERY - PHP7 jobrunner on mw1445 is OK: HTTP OK: HTTP/1.1 200 OK - 323 bytes in 0.005 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [13:36:56] (03Merged) 10jenkins-bot: Backport video landing page changes [extensions/GrowthExperiments] (wmf/1.39.0-wmf.8) - 10https://gerrit.wikimedia.org/r/785950 (https://phabricator.wikimedia.org/T303785) (owner: 10Gergő Tisza) [13:37:18] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:37:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:37:21] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:37:22] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:37:26] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:37:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:37:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:37:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:38:21] (03PS1) 10MVernon: swift: wmf/rewrite.py say 400 earlier if passed bad UTF-8 [puppet] - 10https://gerrit.wikimedia.org/r/786311 (https://phabricator.wikimedia.org/T305942) [13:38:46] (03Merged) 10jenkins-bot: Add Link: Add 'excluded sections' task setting [extensions/GrowthExperiments] 
(wmf/1.39.0-wmf.8) - 10https://gerrit.wikimedia.org/r/785926 (https://phabricator.wikimedia.org/T304150) (owner: 10Gergő Tisza) [13:38:50] (03PS1) 10Giuseppe Lavagetto: requestctl: preserve rules ordering in `requestctl commit` [software/conftool] - 10https://gerrit.wikimedia.org/r/786312 [13:38:52] (03PS1) 10Giuseppe Lavagetto: New version 2.1.3 [software/conftool] - 10https://gerrit.wikimedia.org/r/786313 [13:38:59] (03CR) 10jerkins-bot: [V: 04-1] swift: wmf/rewrite.py say 400 earlier if passed bad UTF-8 [puppet] - 10https://gerrit.wikimedia.org/r/786311 (https://phabricator.wikimedia.org/T305942) (owner: 10MVernon) [13:40:31] RECOVERY - PHP7 rendering on mw1445 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.022 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [13:41:11] PROBLEM - PHP7 jobrunner on mw1438 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [13:41:38] tgr: alright, i'm finished screwing with production. it's all yours again :) [13:41:50] thanks! 
[13:42:00] (03PS2) 10MVernon: swift: wmf/rewrite.py say 400 earlier if passed bad UTF-8 [puppet] - 10https://gerrit.wikimedia.org/r/786311 (https://phabricator.wikimedia.org/T305942) [13:42:27] 10SRE, 10DBA: Reboot pc1011 - https://phabricator.wikimedia.org/T306892 (10Kormat) 05Open→03Resolved [13:43:21] PROBLEM - PHP7 rendering on mw1438 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [13:45:27] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 45, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:45:35] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 236, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:45:42] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host druid1006.eqiad.wmnet [13:45:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:55] (03CR) 10Filippo Giunchedi: [C: 03+1] swift: wmf/rewrite.py say 400 earlier if passed bad UTF-8 [puppet] - 10https://gerrit.wikimedia.org/r/786311 (https://phabricator.wikimedia.org/T305942) (owner: 10MVernon) [13:49:00] (03PS3) 10Majavah: P:openstack::encapi: add tls for write endpoint [puppet] - 10https://gerrit.wikimedia.org/r/785110 (https://phabricator.wikimedia.org/T274666) [13:49:57] PROBLEM - PHP7 rendering on mw1308 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [13:50:05] PROBLEM - PHP7 jobrunner on mw1308 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [13:51:00] (JobUnavailable) firing: (4) Reduced availability for job gitlab in ops@codfw - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:51:40] 10SRE, 10Data-Catalog, 10Data-Engineering, 10serviceops, and 2 others: New Service Request: DataHub - https://phabricator.wikimedia.org/T303049 (10BTullis) [13:53:39] !log tgr@deploy1002 Started scap: (no justification provided) [13:53:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:50] (03PS4) 10Majavah: P:openstack::encapi: add tls for write endpoint [puppet] - 10https://gerrit.wikimedia.org/r/785110 (https://phabricator.wikimedia.org/T274666) [13:54:01] (03CR) 10MVernon: [C: 03+2] swift: wmf/rewrite.py say 400 earlier if passed bad UTF-8 [puppet] - 10https://gerrit.wikimedia.org/r/786311 (https://phabricator.wikimedia.org/T305942) (owner: 10MVernon) [13:54:35] (03PS1) 10Kormat: pc1014: Move to pc2. [puppet] - 10https://gerrit.wikimedia.org/r/786317 (https://phabricator.wikimedia.org/T303174) [13:56:17] !log tgr@deploy1002 Scap failed!: 8/9 canaries failed their endpoint checks(https://en.wikipedia.org). WARNING: canaries have not been rolled back. [13:56:17] !log tgr@deploy1002 scap failed: RuntimeError Scap failed!: 8/9 canaries failed their endpoint checks(https://en.wikipedia.org). WARNING: canaries have not been rolled back. 
(duration: 02m 37s) [13:56:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:31] (03PS5) 10Majavah: P:openstack::encapi: add tls for write endpoint [puppet] - 10https://gerrit.wikimedia.org/r/785110 (https://phabricator.wikimedia.org/T274666) [13:56:34] 10SRE, 10Data-Catalog, 10Data-Engineering, 10serviceops, and 2 others: New Service Request: DataHub - https://phabricator.wikimedia.org/T303049 (10BTullis) Should we call this done, or should we leave it open pending an outcome on {T305358}? Many thanks again for all your support with this request @JMeybohm. [13:56:53] !log btullis@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host druid1006.eqiad.wmnet [13:56:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:32] (03CR) 10Jbond: [C: 03+1] "LGTM all nits and comments are minor/picky or unrelated to your changes so feel free to ignore them" [puppet] - 10https://gerrit.wikimedia.org/r/779936 (https://phabricator.wikimedia.org/T305589) (owner: 10Ssingh) [13:57:44] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:57:46] 10SRE, 10Data-Catalog, 10Data-Engineering, 10serviceops, and 2 others: New Service Request: DataHub - https://phabricator.wikimedia.org/T303049 (10JMeybohm) Please keep this open as it is absolutely in a hacky state currently (DNS + service::catalog wise) [13:57:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:47] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:57:48] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:57:52] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:57:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:58] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:34] (03PS6) 10Majavah: P:openstack::encapi: add tls for write endpoint [puppet] - 10https://gerrit.wikimedia.org/r/785110 (https://phabricator.wikimedia.org/T274666) [14:02:35] apparently I don't know how to revert submodule commits [14:02:36] (03PS2) 10Majavah: P:openstack::encapi: add keystone token verification [puppet] - 10https://gerrit.wikimedia.org/r/785134 (https://phabricator.wikimedia.org/T274666) [14:03:18] the bacc command just gives "fatal: bad object" [14:03:18] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti-test2001.codfw.wmnet to ganeti-test01.svc.codfw.wmnet [14:03:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:23] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti-test2001.codfw.wmnet to ganeti-test01.svc.codfw.wmnet [14:03:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:47] manually I end up with "nothing to commit, working tree clean" [14:03:57] urbanecm: are you still around by any chance? 
[14:04:07] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti-test2001.codfw.wmnet to ganeti-test01.svc.codfw.wmnet [14:04:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:11] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti-test2001.codfw.wmnet to ganeti-test01.svc.codfw.wmnet [14:04:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:56] (03PS1) 10Klausman: Switch ML staging control plane to lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/786319 (https://phabricator.wikimedia.org/T302195) [14:06:40] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti-test2001.codfw.wmnet to ganeti-test01.svc.codfw.wmnet [14:06:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:45] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti-test2001.codfw.wmnet to ganeti-test01.svc.codfw.wmnet [14:07:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:07:52] (03CR) 10Herron: [C: 03+1] profile: re-enable grafana db sync post 8.x upgrade (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/785927 (owner: 10Cwhite) [14:07:54] (03PS2) 10Klausman: hiera: Switch ML staging control plane to lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/786319 (https://phabricator.wikimedia.org/T302195) [14:08:50] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34960/console" [puppet] - 10https://gerrit.wikimedia.org/r/785134 (https://phabricator.wikimedia.org/T274666) (owner: 10Majavah) [14:08:53] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34961/console" [puppet] - 10https://gerrit.wikimedia.org/r/785110 (https://phabricator.wikimedia.org/T274666) (owner: 10Majavah) 
[14:09:39] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti-test2001.codfw.wmnet to ganeti-test01.svc.codfw.wmnet [14:09:42] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti-test2001.codfw.wmnet to ganeti-test01.svc.codfw.wmnet [14:09:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:10] I guess I can just interactive-rebase away the commits that need to be reverted, in the short term [14:11:58] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti-test2001.codfw.wmnet to ganeti-test01.svc.codfw.wmnet [14:12:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:12:02] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti-test2001.codfw.wmnet to ganeti-test01.svc.codfw.wmnet [14:12:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:12:58] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [14:13:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:01] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [14:13:02] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [14:13:06] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [14:13:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:39] RECOVERY - PHP7 jobrunner on mw1308 is OK: HTTP OK: HTTP/1.1 200 OK - 323 bytes in 0.004 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [14:15:52] !ops anyone around with a 
good command of git? [14:16:08] tgr: what's up [14:16:18] (!_ops pings chanops, but i guess this works too) [14:16:46] I tried to deploy two GrowthExperiments patches together, the canary broke, and I can't figure out how to revert it [14:16:55] RECOVERY - PHP7 rendering on mw1308 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.023 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [14:17:08] tgr: so you want me to revert both patches you merged in the window? [14:17:49] I did a bunch of "git reset --hard @~" in the mediawiki dir, so that's on the last good commit now (having the two bad commits + a revert there would be ideal) [14:18:12] the submodule still contains those two commits [14:18:17] (03CR) 10Jbond: [C: 03+1] "LGTM" [software/conftool] - 10https://gerrit.wikimedia.org/r/786312 (owner: 10Giuseppe Lavagetto) [14:18:33] urbanecm: yeah, for now please revert both [14:18:36] (03CR) 10Jbond: [C: 03+1] "LGTM" [software/conftool] - 10https://gerrit.wikimedia.org/r/786313 (owner: 10Giuseppe Lavagetto) [14:18:53] tgr: fixed (git reset --hard in submodule itself) [14:18:58] syncing now [14:19:13] RECOVERY - PHP7 jobrunner on mw1438 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.022 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [14:19:21] ah. I thought git submodule update does that? 
[14:19:54] the scap that broke was a sync-world but I think a normal sync should be fine for restoring functionality [14:20:06] it was only needed due to i18n [14:20:10] ack ack [14:20:40] (03PS1) 10Urbanecm: Revert "Add Link: Add 'excluded sections' task setting" [extensions/GrowthExperiments] (wmf/1.39.0-wmf.8) - 10https://gerrit.wikimedia.org/r/785937 (https://phabricator.wikimedia.org/T304150) [14:20:46] (03CR) 10Urbanecm: [V: 03+2 C: 03+2] Revert "Add Link: Add 'excluded sections' task setting" [extensions/GrowthExperiments] (wmf/1.39.0-wmf.8) - 10https://gerrit.wikimedia.org/r/785937 (https://phabricator.wikimedia.org/T304150) (owner: 10Urbanecm) [14:20:57] (03PS1) 10Klausman: Add service IP for ML staging k8s ctrl plane [dns] - 10https://gerrit.wikimedia.org/r/786320 (https://phabricator.wikimedia.org/T302195) [14:21:01] (03PS1) 10Urbanecm: Revert "Backport video landing page changes" [extensions/GrowthExperiments] (wmf/1.39.0-wmf.8) - 10https://gerrit.wikimedia.org/r/785938 (https://phabricator.wikimedia.org/T303785) [14:21:07] (03CR) 10Urbanecm: [V: 03+2 C: 03+2] Revert "Backport video landing page changes" [extensions/GrowthExperiments] (wmf/1.39.0-wmf.8) - 10https://gerrit.wikimedia.org/r/785938 (https://phabricator.wikimedia.org/T303785) (owner: 10Urbanecm) [14:21:10] !log urbanecm@deploy1002 Synchronized php-1.39.0-wmf.8/extensions/GrowthExperiments/: REVERT: Failed backports (duration: 01m 40s) [14:21:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:00] tgr: hopefully should be reverted (in prod, gerrit and staging dir) [14:22:11] (03CR) 10Elukey: [C: 03+1] Add service IP for ML staging k8s ctrl plane [dns] - 10https://gerrit.wikimedia.org/r/786320 (https://phabricator.wikimedia.org/T302195) (owner: 10Klausman) [14:22:12] thanks urbanecm! 
[14:22:18] lemme sync a README now w/o --force to ensure canaries pass now [14:22:55] (03CR) 10Klausman: [C: 03+2] Add service IP for ML staging k8s ctrl plane [dns] - 10https://gerrit.wikimedia.org/r/786320 (https://phabricator.wikimedia.org/T302195) (owner: 10Klausman) [14:23:09] the bacc revert instructions for extensions are incorrect, right? https://deploy-commands.toolforge.org/bacc/785950 I think it's missing a step to actually change the submodule code? [14:23:24] tgr: no problem. tbh, i usually revert extension backports from gerrit (and bypassing CI with V+2), as it's...easier (and less error-prone) [14:23:37] in theory git submodule update should do the trick, not 100% sure why it doesn't [14:23:45] duh. didn't even think of that. [14:24:28] okay, canaries pass according to scap. i guess we're done? [14:24:39] !log urbanecm@deploy1002 Synchronized README: no op (duration: 02m 11s) [14:24:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:24:43] !log klausman@cumin1001 START - Cookbook sre.dns.netbox [14:25:00] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34962/console" [puppet] - 10https://gerrit.wikimedia.org/r/786319 (https://phabricator.wikimedia.org/T302195) (owner: 10Klausman) [14:25:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:25:29] (03CR) 10Vgutierrez: [V: 03+1 C: 03+1] hiera: Switch ML staging control plane to lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/786319 (https://phabricator.wikimedia.org/T302195) (owner: 10Klausman) [14:25:56] there is a single big error spike, looks like the canary prevented the code from getting any real traffic. Thanks, I think we are good for now. I'll debug later, gotta catch the meeting. 
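A minimal sketch of the submodule behaviour discussed between 14:17 and 14:23, assuming a plain git checkout rather than the scap staging directory on deploy1002. It shows that `git reset --hard` in a superproject moves the recorded submodule pointer but, by default, leaves the submodule's own checkout where it was, and that `git submodule update` then checks out the recorded commit. The repository names `ext-origin`, `super` and `ext` and the helper `g` are hypothetical and exist only for this illustration. It does not explain why `git submodule update` did not have this effect during the incident.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Throwaway scratch area, removed on exit.
work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT
cd "$work"

# Hypothetical helper: pass identity and transport settings per command so the
# sketch neither depends on nor modifies the caller's git configuration.
g() {
  git -c user.name=demo -c user.email=demo@example.org \
      -c protocol.file.allow=always -c submodule.recurse=false "$@"
}

# Stand-in for an extension repository with one known-good commit.
g init -q ext-origin
(cd ext-origin && echo v1 > file && g add file && g commit -q -m "good commit")

# Superproject that records the extension as a submodule at the good commit.
g init -q super
cd super
g submodule --quiet add "$work/ext-origin" ext >/dev/null 2>&1
g commit -q -m "add ext at good commit"
good=$(git -C ext rev-parse HEAD)

# A "bad backport" lands in the submodule and the superproject pointer is bumped.
(cd ext && echo v2 > file && g commit -q -am "bad backport")
bad=$(git -C ext rev-parse HEAD)
g add ext
g commit -q -m "bump ext to bad commit"

# Resetting the superproject rewinds the recorded pointer only.
g reset -q --hard HEAD~1
after_reset=$(git -C ext rev-parse HEAD)

# Bringing the submodule checkout back in line with the recorded pointer.
g submodule --quiet update
after_update=$(git -C ext rev-parse HEAD)

echo "good:         $good"
echo "bad:          $bad"
echo "after reset:  $after_reset"
echo "after update: $after_update"
```

In this sketch the checkout inside `ext` still points at the bad commit after the superproject reset, which matches the "the submodule still contains those two commits" observation. Running `git reset --hard <commit>` inside the submodule itself, as was done at 14:18, is the other way to move that checkout.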
[14:26:04] see you there :) [14:28:18] !log klausman@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:28:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:45] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:37:55] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [14:38:18] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [14:38:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:21] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [14:38:23] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [14:38:26] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [14:38:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:05] RECOVERY - PHP7 rendering on mw1438 is OK: HTTP OK: HTTP/1.1 200 OK - 323 bytes in 0.005 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [14:43:30] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [14:43:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:43:33] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [14:43:34] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply 
[14:43:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:43:38] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [14:43:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:43:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:02] !log upload trafficserver 8.0.8-1wm6 to apt.wm.o (buster) - T304835 [14:44:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:23] (03PS1) 10JMeybohm: Update miscweb relates records for use with k8s ingress [dns] - 10https://gerrit.wikimedia.org/r/786322 (https://phabricator.wikimedia.org/T305358) [14:49:07] !log upgrading trafficserver to 8.0.8-1wm6 on cp4026 - T304835 [14:49:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:18] (03PS1) 10JMeybohm: Remove miscweb discovery resources [puppet] - 10https://gerrit.wikimedia.org/r/786323 (https://phabricator.wikimedia.org/T305358) [14:52:22] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host druid1007.eqiad.wmnet [14:52:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:26] (03PS2) 10JMeybohm: Update miscweb relates records for use with k8s ingress [dns] - 10https://gerrit.wikimedia.org/r/786322 (https://phabricator.wikimedia.org/T305358) [14:56:22] !log upgrading trafficserver to 8.0.8-1wm6 on cp4032 - T304835 [14:56:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:20] (03PS1) 10Cathal Mooney: CHANGELOG: add changelogs for release v0.4.1 [software/homer] - 10https://gerrit.wikimedia.org/r/786325 [15:01:55] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [15:02:08] 
(03CR) 10Volans: [C: 03+1] "LGTM" [software/homer] - 10https://gerrit.wikimedia.org/r/786325 (owner: 10Cathal Mooney) [15:04:47] (03CR) 10Cathal Mooney: [C: 03+2] CHANGELOG: add changelogs for release v0.4.1 [software/homer] - 10https://gerrit.wikimedia.org/r/786325 (owner: 10Cathal Mooney) [15:08:08] (03Merged) 10jenkins-bot: CHANGELOG: add changelogs for release v0.4.1 [software/homer] - 10https://gerrit.wikimedia.org/r/786325 (owner: 10Cathal Mooney) [15:09:25] (03CR) 10Klausman: [C: 03+2] hiera: Switch ML staging control plane to lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/786319 (https://phabricator.wikimedia.org/T302195) (owner: 10Klausman) [15:10:05] (03CR) 10Ssingh: [V: 03+1] dnsrecursor: refactor module (see detailed commit message) (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/779936 (https://phabricator.wikimedia.org/T305589) (owner: 10Ssingh) [15:10:25] (03CR) 10Ladsgroup: [C: 03+2] Add fix_img_major_mime_null_T306560.py [software/schema-changes] - 10https://gerrit.wikimedia.org/r/784762 (https://phabricator.wikimedia.org/T306560) (owner: 10Ladsgroup) [15:11:09] (03Merged) 10jenkins-bot: Add fix_img_major_mime_null_T306560.py [software/schema-changes] - 10https://gerrit.wikimedia.org/r/784762 (https://phabricator.wikimedia.org/T306560) (owner: 10Ladsgroup) [15:11:48] (03CR) 10Ladsgroup: "it looks good to me and thank you so much for doing it but my knowledge of ruby is not good enough to properly review this." 
[puppet] - 10https://gerrit.wikimedia.org/r/786310 (https://phabricator.wikimedia.org/T67270) (owner: 10Jbond) [15:12:48] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host druid1007.eqiad.wmnet [15:12:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:14:24] !log Restarting pybal on lvs2010 to pick up change 786319 (ML staging k8s service setup) [15:14:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:18:27] PROBLEM - PyBal IPVS diff check on lvs2010 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.1.72:6443]) https://wikitech.wikimedia.org/wiki/PyBal [15:22:45] (KubernetesRsyslogDown) firing: (4) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [15:22:59] (03PS2) 10JMeybohm: Remove miscweb discovery resources [puppet] - 10https://gerrit.wikimedia.org/r/786323 (https://phabricator.wikimedia.org/T305358) [15:24:32] !log klausman@puppetmaster1001 conftool action : set/pooled=yes,weight=10; selector: name=ml-staging-ctrl2001 [15:24:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:24:40] !log klausman@puppetmaster1001 conftool action : set/pooled=yes,weight=10; selector: name=ml-staging-ctrl2002 [15:24:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:25:56] (03PS1) 10Cathal Mooney: Release v0.4.1 [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/786329 [15:26:26] (03CR) 10Volans: [C: 03+1] "LGTM" [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/786329 (owner: 10Cathal Mooney) [15:27:50] (03CR) 10Cathal Mooney: [C: 03+2] Release v0.4.1 [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/786329 (owner: 10Cathal Mooney) [15:27:54] (03CR) 10Cathal Mooney: [V: 03+2 C: 03+2] Release v0.4.1 
[software/homer/deploy] - 10https://gerrit.wikimedia.org/r/786329 (owner: 10Cathal Mooney) [15:27:58] !log klausman@puppetmaster1001 conftool action : set/pooled=yes:weight=10; selector: name=ml-staging-ctrl2001.codfw.wmnet [15:28:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:28:02] !log klausman@puppetmaster1001 conftool action : set/pooled=yes:weight=10; selector: name=ml-staging-ctrl2002.codfw.wmnet [15:28:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:29:24] (03CR) 10Cwhite: profile: re-enable grafana db sync post 8.x upgrade (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/785927 (owner: 10Cwhite) [15:30:33] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1113.eqiad.wmnet with reason: Maintenance [15:30:34] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1113.eqiad.wmnet with reason: Maintenance [15:30:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:30:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:30:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1113:3315 (T298556)', diff saved to https://phabricator.wikimedia.org/P26557 and previous config saved to /var/cache/conftool/dbconfig/20220426-153039-ladsgroup.json [15:30:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:30:44] T298556: Fix mismatching field type of oldimage.oi_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298556 [15:31:44] (03PS1) 10David Caro: wmcs.codfw1: use the correct memcached port for the exporter [puppet] - 10https://gerrit.wikimedia.org/r/786330 [15:31:55] RECOVERY - PyBal IPVS diff check on lvs2010 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [15:32:01] ^^ klausman [15:32:10] excellent [15:32:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 
'Repooling after maintenance db1113:3315 (T298556)', diff saved to https://phabricator.wikimedia.org/P26558 and previous config saved to /var/cache/conftool/dbconfig/20220426-153253-ladsgroup.json [15:32:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:34:01] (03CR) 10David Caro: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34963/console" [puppet] - 10https://gerrit.wikimedia.org/r/786330 (owner: 10David Caro) [15:34:10] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1139.eqiad.wmnet with reason: Maintenance [15:34:11] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1139.eqiad.wmnet with reason: Maintenance [15:34:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:34:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:34:32] !log Restarting pybal on lvs2009 to pick up change 786319 (ML staging k8s service setup) [15:34:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:34:43] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1169.eqiad.wmnet with reason: Maintenance [15:34:44] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1169.eqiad.wmnet with reason: Maintenance [15:34:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:34:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:34:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1169 (T306560)', diff saved to https://phabricator.wikimedia.org/P26559 and previous config saved to /var/cache/conftool/dbconfig/20220426-153449-ladsgroup.json [15:34:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:34:55] T306560: Fix nullability of img_major_mime and oi_major_mime - 
https://phabricator.wikimedia.org/T306560 [15:35:48] (03CR) 10Ahmon Dancy: [C: 03+1] Fix permissions/ownership of helm directories [puppet] - 10https://gerrit.wikimedia.org/r/786269 (https://phabricator.wikimedia.org/T305729) (owner: 10JMeybohm) [15:37:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T306560)', diff saved to https://phabricator.wikimedia.org/P26560 and previous config saved to /var/cache/conftool/dbconfig/20220426-153720-ladsgroup.json [15:37:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:40:35] !log cmooney@cumin1001 START - Cookbook sre.deploy.python-code homer to cumin2002.codfw.wmnet,cumin1001.eqiad.wmnet with reason: Release v0.4.1 - cmooney@cumin1001 [15:40:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:42:13] !log cmooney@cumin1001 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) homer to cumin2002.codfw.wmnet,cumin1001.eqiad.wmnet with reason: Release v0.4.1 - cmooney@cumin1001 [15:42:13] 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10MatthewVernon) [15:42:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:42:26] 10SRE-swift-storage, 10Patch-For-Review: swift wmf/rewrite.py middleware broken on bullseye (and its test suite doesn't work either) - https://phabricator.wikimedia.org/T305942 (10MatthewVernon) 05Open→03Resolved a:03MatthewVernon I think this is now all working satisfactorily (ms-fe1012 is now pooled in... 
[15:45:13] (03PS2) 10Klausman: labs: Add dummy token for istio-cni on ML staging k8s [labs/private] - 10https://gerrit.wikimedia.org/r/775823 [15:45:53] (03PS3) 10Klausman: labs: Add dummy token for istio-cni on ML staging k8s [labs/private] - 10https://gerrit.wikimedia.org/r/775823 [15:47:39] (03PS1) 10Ladsgroup: Set actor migration to read new for medium wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/786341 (https://phabricator.wikimedia.org/T275246) [15:47:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315', diff saved to https://phabricator.wikimedia.org/P26561 and previous config saved to /var/cache/conftool/dbconfig/20220426-154758-ladsgroup.json [15:48:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:48:50] (03PS1) 10BryanDavis: toolhub: Bump container version to 2022-04-21-215651-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/786342 (https://phabricator.wikimedia.org/T279713) [15:50:54] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [15:52:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P26562 and previous config saved to /var/cache/conftool/dbconfig/20220426-155226-ladsgroup.json [15:52:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:55:23] (03CR) 10BryanDavis: [C: 03+2] toolhub: Bump container version to 2022-04-21-215651-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/786342 (https://phabricator.wikimedia.org/T279713) (owner: 10BryanDavis) [15:58:08] !log dancy@deploy1002 Started deploy [restbase/deploy@0205f1d] (dev-cluster): (no justification provided) [15:58:10] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:59:59] (03Merged) 10jenkins-bot: toolhub: Bump container version to 2022-04-21-215651-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/786342 (https://phabricator.wikimedia.org/T279713) (owner: 10BryanDavis) [16:00:04] jbond and rzl: (Dis)respected human, time to deploy Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220426T1600). Please do the needful. [16:00:04] No Gerrit patches in the queue for this window AFAICS. [16:00:04] bd808: May I have your attention please! Toolhub. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220426T1600) [16:00:20] o/ [16:00:51] !log dancy@deploy1002 Finished deploy [restbase/deploy@0205f1d] (dev-cluster): (no justification provided) (duration: 02m 43s) [16:00:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:01:25] !log bd808@deploy1002 helmfile [staging] START helmfile.d/services/toolhub: apply [16:01:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:03:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315', diff saved to https://phabricator.wikimedia.org/P26563 and previous config saved to /var/cache/conftool/dbconfig/20220426-160303-ladsgroup.json [16:03:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:03:17] !log bd808@deploy1002 helmfile [staging] DONE helmfile.d/services/toolhub: apply [16:03:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:03:33] 10SRE, 10ops-codfw, 10DC-Ops, 10cloud-services-team (Kanban): Neutron networking not working for cloudnet200[5,6]-dev.codfw.wmnet - https://phabricator.wikimedia.org/T306861 (10Papaul) a:05Papaul→03Andrew [16:03:53] RECOVERY - puppet last run on ml-staging-ctrl2001 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures 
https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [16:04:28] !log bd808@deploy1002 helmfile [codfw] START helmfile.d/services/toolhub: apply [16:04:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:05:03] (03CR) 10David Caro: "Got a question, otherwise LGTM (if the answer is "it will not" or "yes, but we don't care" feel free to merge)." [puppet] - 10https://gerrit.wikimedia.org/r/786307 (owner: 10Andrew Bogott) [16:06:16] !log bd808@deploy1002 helmfile [codfw] DONE helmfile.d/services/toolhub: apply [16:06:25] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [16:06:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:06:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:07:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P26564 and previous config saved to /var/cache/conftool/dbconfig/20220426-160731-ladsgroup.json [16:07:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:08:47] RECOVERY - Disk space on ml-staging-ctrl2001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=ml-staging-ctrl2001&var-datasource=codfw+prometheus/ops [16:09:49] !log dancy@deploy1002 Started deploy [restbase/deploy@0205f1d] (dev-cluster): testing [16:09:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:10:06] !log dancy@deploy1002 Finished deploy [restbase/deploy@0205f1d] (dev-cluster): testing (duration: 00m 17s) [16:10:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:11:04] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:11:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:11:48] !log bd808@deploy1002 helmfile [eqiad] START 
helmfile.d/services/toolhub: apply [16:11:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:12:37] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [16:12:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:13:27] 10SRE, 10ops-codfw, 10DC-Ops, 10decommission-hardware, 10cloud-services-team (Kanban): Decom cloudcephmon200[2,3]-dev.codfw.wmnet - https://phabricator.wikimedia.org/T306840 (10Papaul) [16:13:41] !log bd808@deploy1002 helmfile [eqiad] DONE helmfile.d/services/toolhub: apply [16:13:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:14:41] RECOVERY - puppet last run on ml-staging-ctrl2002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [16:16:02] 10SRE, 10ops-codfw, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): Q3:(Need By: TBD) rack/setup/install 7 wmcs hosts - https://phabricator.wikimedia.org/T304881 (10Papaul) [16:16:11] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:16:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:16:38] 10SRE, 10ops-codfw, 10DC-Ops, 10decommission-hardware, 10cloud-services-team (Kanban): Decom cloudcephmon200[2,3]-dev.codfw.wmnet - https://phabricator.wikimedia.org/T306840 (10Papaul) 05Open→03Resolved complete [16:17:04] 10SRE, 10ops-codfw, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Kanban): Decom cloudservices200[2,3]-dev.wikimedia.org - https://phabricator.wikimedia.org/T306669 (10Papaul) [16:18:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315 (T298556)', diff saved to https://phabricator.wikimedia.org/P26566 and previous config saved to /var/cache/conftool/dbconfig/20220426-161808-ladsgroup.json [16:18:10] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on 
db1144.eqiad.wmnet with reason: Maintenance [16:18:11] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1144.eqiad.wmnet with reason: Maintenance [16:18:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:14] T298556: Fix mismatching field type of oldimage.oi_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298556 [16:18:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1144:3315 (T298556)', diff saved to https://phabricator.wikimedia.org/P26567 and previous config saved to /var/cache/conftool/dbconfig/20220426-161816-ladsgroup.json [16:18:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:45] 10SRE, 10ops-codfw, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Kanban): Decom cloudservices200[2,3]-dev.wikimedia.org - https://phabricator.wikimedia.org/T306669 (10Papaul) 05Open→03Resolved complete [16:18:52] 10SRE, 10ops-codfw, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): Q3:(Need By: TBD) rack/setup/install 7 wmcs hosts - https://phabricator.wikimedia.org/T304881 (10Papaul) [16:19:51] (03CR) 10Herron: [C: 03+1] profile: re-enable grafana db sync post 8.x upgrade (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/785927 (owner: 10Cwhite) [16:20:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315 (T298556)', diff saved to https://phabricator.wikimedia.org/P26568 and previous config saved to /var/cache/conftool/dbconfig/20220426-162029-ladsgroup.json [16:20:32] (03CR) 10Jelto: [C: 03+1] conftool-date: add mw2412 through mw2419 as new appservers [puppet] - 10https://gerrit.wikimedia.org/r/785918 (https://phabricator.wikimedia.org/T290192) (owner: 10Dzahn) [16:20:34] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:20:45] (JobUnavailable) resolved: (4) Reduced availability for job gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:22:08] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [16:22:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:22:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T306560)', diff saved to https://phabricator.wikimedia.org/P26569 and previous config saved to /var/cache/conftool/dbconfig/20220426-162236-ladsgroup.json [16:22:38] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1163.eqiad.wmnet with reason: Maintenance [16:22:39] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1163.eqiad.wmnet with reason: Maintenance [16:22:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:22:41] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [16:22:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:22:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1163 (T306560)', diff saved to https://phabricator.wikimedia.org/P26570 and previous config saved to /var/cache/conftool/dbconfig/20220426-162244-ladsgroup.json [16:22:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:22:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:22:56] !log Toolhub upgrade to 18d94d and post-deploy data migrations complete [16:22:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:23:38] (03CR) 10Dzahn: [C: 03+2] conftool-date: add mw2412 through mw2419 as new appservers [puppet] - 
10https://gerrit.wikimedia.org/r/785918 (https://phabricator.wikimedia.org/T290192) (owner: 10Dzahn) [16:24:28] (03CR) 10David Caro: "LGTM, one question though." [puppet] - 10https://gerrit.wikimedia.org/r/785110 (https://phabricator.wikimedia.org/T274666) (owner: 10Majavah) [16:25:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1163 (T306560)', diff saved to https://phabricator.wikimedia.org/P26571 and previous config saved to /var/cache/conftool/dbconfig/20220426-162517-ladsgroup.json [16:25:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:25:35] RECOVERY - Disk space on ml-staging-ctrl2002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=ml-staging-ctrl2002&var-datasource=codfw+prometheus/ops [16:26:35] 10SRE, 10ops-codfw, 10decommission-hardware, 10cloud-services-team (Kanban): decommission cloudweb2001-dev.wikimedia.org - https://phabricator.wikimedia.org/T306843 (10Papaul) [16:27:09] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:27:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:28:20] !log razzi@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on clouddb1021.eqiad.wmnet with reason: Upgrade to bullseye [16:28:22] !log razzi@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1021.eqiad.wmnet with reason: Upgrade to bullseye [16:28:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:28:24] jouncebot: nowandnext [16:28:24] For the next 0 hour(s) and 31 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220426T1600) [16:28:25] For the next 0 hour(s) and 31 minute(s): Toolhub (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220426T1600) [16:28:25] In 1 hour(s) and 31 minute(s): MediaWiki 
train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220426T1800) [16:28:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:28:51] (03CR) 10Ladsgroup: [C: 03+2] Set actor migration to read new for medium wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/786341 (https://phabricator.wikimedia.org/T275246) (owner: 10Ladsgroup) [16:29:14] Amir1: adding a few new appservers in codfw in conftool right now. this means "add to scap groups" [16:29:20] but not starting yet, right [16:29:28] sure, I'll be quick [16:29:35] 10SRE, 10ops-codfw, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): Q3:(Need By: TBD) rack/setup/install 7 wmcs hosts - https://phabricator.wikimedia.org/T304881 (10Papaul) [16:29:39] (03Merged) 10jenkins-bot: Set actor migration to read new for medium wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/786341 (https://phabricator.wikimedia.org/T275246) (owner: 10Ladsgroup) [16:29:45] 10SRE, 10ops-codfw, 10decommission-hardware, 10cloud-services-team (Kanban): decommission cloudweb2001-dev.wikimedia.org - https://phabricator.wikimedia.org/T306843 (10Papaul) 05Open→03Resolved complete [16:29:47] ah, ok [16:30:10] I need to merge but let me set them to "inactive" asap [16:30:11] !log klausman@cumin2002 START - Cookbook sre.hosts.reboot-single for host ml-staging-ctrl2002.codfw.wmnet [16:30:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:30:21] inactive = not in scap [16:32:09] !log ladsgroup@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:786341|Set actor migration to read new for medium wikis (T275246)]] (duration: 02m 01s) [16:32:11] Amir1, mutante: any reason to wait on starting train prep? 
[16:32:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:32:16] T275246: Populate rev_actor and rev_comment_id - https://phabricator.wikimedia.org/T275246 [16:32:21] I am done [16:33:16] brennen: no, the only thing that could happen right now is that during the actual sync you get timeouts from some mw2*. I am waiting for conftool-data to sync [16:33:45] RECOVERY - Check systemd state on ml-staging-ctrl2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:35:06] mutante: ack, thx. [16:35:19] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [16:35:22] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [16:35:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:35:23] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [16:35:26] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [16:35:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:35:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:35:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315', diff saved to https://phabricator.wikimedia.org/P26572 and previous config saved to /var/cache/conftool/dbconfig/20220426-163535-ladsgroup.json [16:35:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:35:36] !log razzi@cumin1001 START - Cookbook sre.hosts.reimage for host clouddb1021.eqiad.wmnet with OS bullseye [16:35:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:35:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:36:20] (03PS1) 10Ebernhardson: Add wbsearchentities profiles for testing [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/786347 (https://phabricator.wikimedia.org/T306644) [16:36:45] !log klausman@cumin2002 END (ERROR) - Cookbook sre.hosts.reboot-single (exit_code=97) for host ml-staging-ctrl2002.codfw.wmnet [16:36:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:37:28] (03CR) 10jerkins-bot: [V: 04-1] Add wbsearchentities profiles for testing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/786347 (https://phabricator.wikimedia.org/T306644) (owner: 10Ebernhardson) [16:40:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1163', diff saved to https://phabricator.wikimedia.org/P26573 and previous config saved to /var/cache/conftool/dbconfig/20220426-164022-ladsgroup.json [16:40:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:41:35] (03PS2) 10Ebernhardson: Add wbsearchentities profiles for testing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/786347 (https://phabricator.wikimedia.org/T306644) [16:43:03] (03PS1) 10Brennen Bearnes: testwikis wikis to 1.39.0-wmf.9 refs T305215 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/786349 [16:43:05] (03CR) 10Brennen Bearnes: [C: 03+2] testwikis wikis to 1.39.0-wmf.9 refs T305215 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/786349 (owner: 10Brennen Bearnes) [16:43:44] (03Merged) 10jenkins-bot: testwikis wikis to 1.39.0-wmf.9 refs T305215 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/786349 (owner: 10Brennen Bearnes) [16:44:41] !log brennen@deploy1002 Started scap: testwikis wikis to 1.39.0-wmf.9 refs T305215 [16:44:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:44:47] T305215: 1.39.0-wmf.9 deployment blockers - https://phabricator.wikimedia.org/T305215 [16:46:24] !log brennen@deploy1002 deploy-promote aborted: (duration: 03m 22s) [16:46:24] !log brennen@deploy1002 stage-train aborted: (duration: 06m 04s) [16:46:28] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:46:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:47:09] !log forgot SCAP=scap environment variable, re-running testwiki sync [16:47:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:48:20] !log brennen@deploy1002 Started scap: testwikis wikis to 1.39.0-wmf.9 refs T305215 [16:48:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:50:39] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [16:50:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315', diff saved to https://phabricator.wikimedia.org/P26574 and previous config saved to /var/cache/conftool/dbconfig/20220426-165040-ladsgroup.json [16:50:42] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [16:50:43] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [16:50:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:50:47] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [16:50:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:50:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:50:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:50:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:51:17] !log razzi@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on clouddb1021.eqiad.wmnet with reason: host reimage [16:51:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:53:01] PROBLEM - Improperly owned -0:0- files in /srv/mediawiki-staging on deploy2002 is CRITICAL: Improperly owned (0:0) files in /srv/mediawiki-staging 
https://wikitech.wikimedia.org/wiki/Monitoring/bad_directory_owner
[16:53:27] PROBLEM - SSH on druid1006.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[16:53:58] brennen: ^ root-owned files in staging.. wondering if we need to fix that
[16:54:13] but that is deploy2002
[16:54:34] maybe sync related
[16:54:59] !log razzi@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on clouddb1021.eqiad.wmnet with reason: host reimage
[16:55:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:55:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1163', diff saved to https://phabricator.wikimedia.org/P26575 and previous config saved to /var/cache/conftool/dbconfig/20220426-165526-ladsgroup.json
[16:55:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:56:53] yeah, sync-related says dancy, should be self-correcting.
[16:57:02] yep, sounds like it. ACK
[16:57:05] ty
[17:00:02] brennen: I don't see the conftool-data synced on config-master ..BUT .. the hosts are in conftool and by default are "inactive" which means "not in scap / "dsh" groups" so for a deployer like you..nothing should happen at all.
[17:00:38] (PS1) Ahmon Dancy: Merge remote-tracking branch 'origin/master' into train-dev [mediawiki-config] (train-dev) - https://gerrit.wikimedia.org/r/786351
[17:01:28] mutante: ack, thanks.
[17:03:14] RECOVERY - Improperly owned -0:0- files in /srv/mediawiki-staging on deploy2002 is OK: Files ownership is ok. https://wikitech.wikimedia.org/wiki/Monitoring/bad_directory_owner
[17:03:31] (CR) Ahmon Dancy: [C: +2] Merge remote-tracking branch 'origin/master' into train-dev [mediawiki-config] (train-dev) - https://gerrit.wikimedia.org/r/786351 (owner: Ahmon Dancy)
[17:04:16] (Merged) jenkins-bot: Merge remote-tracking branch 'origin/master' into train-dev [mediawiki-config] (train-dev) - https://gerrit.wikimedia.org/r/786351 (owner: Ahmon Dancy)
[17:04:26] PROBLEM - SSH on wtp1045.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[17:05:06] (PS3) Ebernhardson: Add wbsearchentities profiles for testing [mediawiki-config] - https://gerrit.wikimedia.org/r/786347 (https://phabricator.wikimedia.org/T306644)
[17:05:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315 (T298556)', diff saved to https://phabricator.wikimedia.org/P26576 and previous config saved to /var/cache/conftool/dbconfig/20220426-170545-ladsgroup.json
[17:05:47] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1110.eqiad.wmnet with reason: Maintenance
[17:05:48] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1110.eqiad.wmnet with reason: Maintenance
[17:05:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:05:52] T298556: Fix mismatching field type of oldimage.oi_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298556
[17:05:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1110 (T298556)', diff saved to https://phabricator.wikimedia.org/P26577 and previous config saved to /var/cache/conftool/dbconfig/20220426-170553-ladsgroup.json
[17:05:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:06:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:06:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:06:15] did just get some timeouts for codfw hosts.
[17:08:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110 (T298556)', diff saved to https://phabricator.wikimedia.org/P26578 and previous config saved to /var/cache/conftool/dbconfig/20220426-170807-ladsgroup.json
[17:08:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:08:26] brennen: narff.. were they all starting with mw24*
[17:08:46] 2412 - 2419, right
[17:08:58] it should not happen though
[17:09:17] !log razzi@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host clouddb1021.eqiad.wmnet with OS bullseye
[17:09:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:09:32] mutante: mw2258, mw2366, mw2253, mw2309, parse2012.codfw.wmnet, ...
[17:10:00] list continues for a bit - you can see it in deploy1002:~brennen/1.39.0-wmf.9.log
[17:10:03] brennen: oh.. that is NOT what I was doing..
[17:10:10] hrm
[17:10:20] 32 failures total on sync-apaches
[17:10:25] we have had reboots of codfw hosts though
[17:10:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1163 (T306560)', diff saved to https://phabricator.wikimedia.org/P26579 and previous config saved to /var/cache/conftool/dbconfig/20220426-171032-ladsgroup.json
[17:10:33] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
[17:10:35] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
[17:10:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:10:38] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560
[17:10:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:10:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:11:06] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1140.eqiad.wmnet with reason: Maintenance
[17:11:07] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1140.eqiad.wmnet with reason: Maintenance
[17:11:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:11:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:11:17] all codfw hosts i think, but no obvious pattern to it.
[17:11:28] oh wait - hrm: mw1362.eqiad.wmnet
[17:11:32] arg, so.. all of them have been rebooted
[17:11:38] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1105.eqiad.wmnet with reason: Maintenance
[17:11:40] but that's not related to what I was talking about earlier
[17:11:40] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1105.eqiad.wmnet with reason: Maintenance
[17:11:41] afaict
[17:11:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:11:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1105:3311 (T306560)', diff saved to https://phabricator.wikimedia.org/P26580 and previous config saved to /var/cache/conftool/dbconfig/20220426-171144-ladsgroup.json
[17:11:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:11:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:12:28] (CR) Ebernhardson: "can compare to previous ab test configured in I63e011610" [mediawiki-config] - https://gerrit.wikimedia.org/r/786347 (https://phabricator.wikimedia.org/T306644) (owner: Ebernhardson)
[17:13:24] looking at mw1362
[17:13:37] mutante: https://phabricator.wikimedia.org/P26581
[17:13:38] !log mw1362 - scap pull
[17:13:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:13:48] 17:13:39 Started scap-cdb-rebuild
[17:14:01] brennen: it's pooled.. and it's online and it can pull... hrmm
[17:14:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311 (T306560)', diff saved to https://phabricator.wikimedia.org/P26582 and previous config saved to /var/cache/conftool/dbconfig/20220426-171418-ladsgroup.json
[17:14:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:14:26] yeah, this is weird - bunch more eqiad hosts than i thought as well.
[17:14:44] testing one of the wtp hosts
[17:14:57] so far everything looks normal and working
[17:15:16] could we.. hmm.. just repeat it?
[17:15:43] once the sync finishes i don't think there's any reason i couldn't run sync-world again.
[17:15:44] !log wtp1046 - scap pull
[17:15:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:16:51] should be fast. i'll plan on that.
[17:16:56] also this random parsoid machine is in the conftool-data and pooled and can pull
[17:17:23] SRE, Infrastructure-Foundations: Integrate Buster 10.12 point update - https://phabricator.wikimedia.org/T304546 (MoritzMuehlenhoff)
[17:21:10] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[17:21:13] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[17:21:14] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[17:21:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:21:18] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[17:21:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:21:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:21:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:22:57] !log brennen@deploy1002 Finished scap: testwikis wikis to 1.39.0-wmf.9 refs T305215 (duration: 34m 37s)
[17:23:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:23:07] T305215: 1.39.0-wmf.9 deployment blockers - https://phabricator.wikimedia.org/T305215
[17:23:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110', diff saved to https://phabricator.wikimedia.org/P26583 and previous config saved to /var/cache/conftool/dbconfig/20220426-172312-ladsgroup.json
[17:23:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:23:31] !log mw2309 - scap pull
[17:23:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:26:20] !log brennen@deploy1002 Started scap: Re-running sync-world to see if timeouts recur for 32 hosts (T305215)
[17:26:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:28:04] !log brennen@deploy1002 Finished scap: Re-running sync-world to see if timeouts recur for 32 hosts (T305215) (duration: 01m 43s)
[17:28:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:28:10] T305215: 1.39.0-wmf.9 deployment blockers - https://phabricator.wikimedia.org/T305215
[17:28:18] mutante, dancy: that one ran cleanly
[17:28:29] 👍🏾
[17:28:51] (CR) Muehlenhoff: "That sounds great, but let's hold merging the patch until Legal has given the whole approach their blessing." [puppet] - https://gerrit.wikimedia.org/r/786310 (https://phabricator.wikimedia.org/T67270) (owner: Jbond)
[17:29:02] brennen: uff, glad to hear that
[17:29:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311', diff saved to https://phabricator.wikimedia.org/P26584 and previous config saved to /var/cache/conftool/dbconfig/20220426-172923-ladsgroup.json
[17:29:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:29:42] maybe just network weather, but i haven't generally encountered random timeouts like that in the past.
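(editor's note) The exchange above turns on whether the hosts that timed out share a pattern ("all codfw hosts i think", then "oh wait - hrm: mw1362.eqiad.wmnet"). A minimal sketch of that check follows. It is not a tool anyone in the channel used. It assumes the first digit of a host number marks the site (1 = eqiad, 2 = codfw), and it uses only the hostnames quoted in the conversation, since the full list of 32 is not in this log. The names `site_of` and `group_by_site` are hypothetical.

```python
import re
from collections import defaultdict

# Assumption: the first digit of the host number encodes the site.
# Only the two sites that appear in this log are mapped.
SITE_BY_DIGIT = {"1": "eqiad", "2": "codfw"}


def site_of(host: str) -> str:
    """Return the site for a short or fully qualified hostname."""
    short = host.split(".", 1)[0]
    # Capture the first digit of the trailing number, e.g. "2" in "mw2258".
    match = re.search(r"(\d)\d*$", short)
    if match is None:
        return "unknown"
    return SITE_BY_DIGIT.get(match.group(1), "unknown")


def group_by_site(hosts):
    """Group hostnames by site so a per-datacenter pattern stands out."""
    groups = defaultdict(list)
    for host in hosts:
        groups[site_of(host)].append(host)
    return dict(groups)


# Hostnames quoted in the conversation; the remaining failures are elided there.
failed = [
    "mw2258", "mw2366", "mw2253", "mw2309",
    "parse2012.codfw.wmnet", "mw1362.eqiad.wmnet",
]
by_site = group_by_site(failed)
for site, names in sorted(by_site.items()):
    print(site, len(names), names)
```

On the quoted sample this splits into five codfw hosts and one eqiad host, which matches the observation that the failures were mostly but not exclusively codfw.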
[17:30:36] !log brennen@deploy1002 Pruned MediaWiki: 1.39.0-wmf.7 (duration: 01m 29s)
[17:30:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:31:59] the only thing is that I had just merged that conftool-data change
[17:32:08] but it makes no sense that this group of hosts was affected
[17:32:12] (CR) Majavah: [V: +1] P:openstack::encapi: add tls for write endpoint (2 comments) [puppet] - https://gerrit.wikimedia.org/r/785110 (https://phabricator.wikimedia.org/T274666) (owner: Majavah)
[17:33:43] (CR) Muehlenhoff: [C: +2] Don't prompt for loading additional firmware in d-i [puppet] - https://gerrit.wikimedia.org/r/784259 (https://phabricator.wikimedia.org/T306148) (owner: Muehlenhoff)
[17:36:34] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[17:36:37] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[17:36:38] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[17:36:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:36:42] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[17:36:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:36:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:36:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:38:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110', diff saved to https://phabricator.wikimedia.org/P26585 and previous config saved to /var/cache/conftool/dbconfig/20220426-173817-ladsgroup.json
[17:38:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:38:37] SRE, ops-codfw: codfw: Dedicate Rack B1 for cloudX-dev servers - https://phabricator.wikimedia.org/T305469 (Papaul)
[17:41:12] SRE, ops-codfw: codfw: Dedicate
Rack B1 for cloudX-dev servers - https://phabricator.wikimedia.org/T305469 (10Papaul) [17:43:11] (03PS1) 10Muehlenhoff: sre.ganeti.addnode: Fix bridge detection [cookbooks] - 10https://gerrit.wikimedia.org/r/786356 [17:44:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311', diff saved to https://phabricator.wikimedia.org/P26586 and previous config saved to /var/cache/conftool/dbconfig/20220426-174428-ladsgroup.json [17:44:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:44:41] (03CR) 10JHathaway: [C: 03+2] icinga: remove SMART check [puppet] - 10https://gerrit.wikimedia.org/r/785921 (https://phabricator.wikimedia.org/T294564) (owner: 10JHathaway) [17:47:28] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/786356 (owner: 10Muehlenhoff) [17:53:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110 (T298556)', diff saved to https://phabricator.wikimedia.org/P26587 and previous config saved to /var/cache/conftool/dbconfig/20220426-175322-ladsgroup.json [17:53:23] RECOVERY - SSH on druid1006.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:53:24] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1150.eqiad.wmnet with reason: Maintenance [17:53:25] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1150.eqiad.wmnet with reason: Maintenance [17:53:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:53:29] T298556: Fix mismatching field type of oldimage.oi_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298556 [17:53:29] (03PS1) 10Jdlrobson: Enable table of contents a/b test on euwiki and hewiki, enable reading depth [mediawiki-config] - 10https://gerrit.wikimedia.org/r/786357 (https://phabricator.wikimedia.org/T306606) [17:53:32] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:53:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:53:38] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db2123.codfw.wmnet with reason: Maintenance [17:53:39] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db2123.codfw.wmnet with reason: Maintenance [17:53:40] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 20:00:00 on 8 hosts with reason: Maintenance [17:53:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:53:46] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 20:00:00 on 8 hosts with reason: Maintenance [17:53:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:53:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:53:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:54:07] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [17:54:09] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [17:54:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:54:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:54:18] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1130.eqiad.wmnet with reason: Maintenance [17:54:20] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1130.eqiad.wmnet with reason: Maintenance [17:54:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:54:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1130 (T298556)', diff saved to 
https://phabricator.wikimedia.org/P26588 and previous config saved to /var/cache/conftool/dbconfig/20220426-175424-ladsgroup.json [17:54:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:54:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:54:43] 10SRE, 10ops-codfw, 10Cassandra, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs2001-aqs2012 - https://phabricator.wikimedia.org/T305568 (10Papaul) @Eevans I received those nodes today so I will be racking them tomorrow. Here is my racking proposal for tomorrow. |Row| Rack| nodes| |A|A6|aqs2001,aqs2002,... [17:55:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1130 (T298556)', diff saved to https://phabricator.wikimedia.org/P26589 and previous config saved to /var/cache/conftool/dbconfig/20220426-175536-ladsgroup.json [17:55:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:55:46] (03CR) 10Gergő Tisza: "Caused:" [extensions/GrowthExperiments] (wmf/1.39.0-wmf.8) - 10https://gerrit.wikimedia.org/r/785950 (https://phabricator.wikimedia.org/T303785) (owner: 10Gergő Tisza) [17:57:35] 10SRE, 10Infrastructure-Foundations, 10Prod-Kubernetes, 10netops: Agree strategy for Kubernetes BGP peering to top-of-rack switches - https://phabricator.wikimedia.org/T306649 (10cmooney) The above patch is working, however I'm not 100% the resulting config is what we need. Looking, for instance, at ml-se... 
[17:58:18] (03PS1) 10Gergő Tisza: Re-apply "Backport video landing page changes" [extensions/GrowthExperiments] (wmf/1.39.0-wmf.8) - 10https://gerrit.wikimedia.org/r/785941 [17:59:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311 (T306560)', diff saved to https://phabricator.wikimedia.org/P26590 and previous config saved to /var/cache/conftool/dbconfig/20220426-175933-ladsgroup.json [17:59:35] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1132.eqiad.wmnet with reason: Maintenance [17:59:37] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1132.eqiad.wmnet with reason: Maintenance [17:59:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:59:40] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [17:59:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1132 (T306560)', diff saved to https://phabricator.wikimedia.org/P26591 and previous config saved to /var/cache/conftool/dbconfig/20220426-175941-ladsgroup.json [17:59:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:59:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:59:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:59:53] (03PS1) 10Jdlrobson: Expand max-width to login, create account, disable on Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/786358 (https://phabricator.wikimedia.org/T300182) [18:00:05] brennen and jeena: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for MediaWiki train - Utc-7 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220426T1800). [18:00:43] o/ - going to group0 shortly. 
[18:02:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1132 (T306560)', diff saved to https://phabricator.wikimedia.org/P26592 and previous config saved to /var/cache/conftool/dbconfig/20220426-180214-ladsgroup.json [18:02:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:02:21] (03PS1) 10Jdlrobson: [ToC] Increase threshold for ToC collapsing to 1000px [skins/Vector] (wmf/1.39.0-wmf.8) - 10https://gerrit.wikimedia.org/r/785942 (https://phabricator.wikimedia.org/T306904) [18:03:19] (03PS1) 10Brennen Bearnes: group0 wikis to 1.39.0-wmf.9 refs T305215 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/786359 [18:03:21] (03CR) 10Brennen Bearnes: [C: 03+2] group0 wikis to 1.39.0-wmf.9 refs T305215 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/786359 (owner: 10Brennen Bearnes) [18:03:36] (03PS1) 10Jdlrobson: [ToC] Increase threshold for ToC collapsing to 1000px [skins/Vector] (wmf/1.39.0-wmf.9) - 10https://gerrit.wikimedia.org/r/785943 (https://phabricator.wikimedia.org/T306904) [18:04:45] (03Merged) 10jenkins-bot: group0 wikis to 1.39.0-wmf.9 refs T305215 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/786359 (owner: 10Brennen Bearnes) [18:06:02] !log brennen@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.39.0-wmf.9 refs T305215 [18:06:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:06:08] T305215: 1.39.0-wmf.9 deployment blockers - https://phabricator.wikimedia.org/T305215 [18:07:04] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [18:07:07] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [18:07:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:07:08] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [18:07:12] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:07:12] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [18:07:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:07:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:09:51] (03CR) 10Gergő Tisza: "The canary error was caused by Idf35f67fb298914dad7c80a2ad135909fd344860. This patch looks safe to re-apply." [extensions/GrowthExperiments] (wmf/1.39.0-wmf.8) - 10https://gerrit.wikimedia.org/r/785926 (https://phabricator.wikimedia.org/T304150) (owner: 10Gergő Tisza) [18:11:32] (03PS1) 10Gergő Tisza: Re-apply "Add Link: Add 'excluded sections' task setting" [extensions/GrowthExperiments] (wmf/1.39.0-wmf.8) - 10https://gerrit.wikimedia.org/r/785944 [18:13:32] (03PS2) 10Gergő Tisza: Re-apply "Add Link: Add 'excluded sections' task setting" [extensions/GrowthExperiments] (wmf/1.39.0-wmf.8) - 10https://gerrit.wikimedia.org/r/785944 [18:14:04] (03PS2) 10Gergő Tisza: Re-apply "Backport video landing page changes" [extensions/GrowthExperiments] (wmf/1.39.0-wmf.8) - 10https://gerrit.wikimedia.org/r/785941 [18:17:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1132', diff saved to https://phabricator.wikimedia.org/P26593 and previous config saved to /var/cache/conftool/dbconfig/20220426-181719-ladsgroup.json [18:17:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:21:35] 10SRE, 10ops-codfw, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T306927 (10RobH) [18:21:56] 10SRE, 10ops-codfw, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db2153.codfw.wmnet - db2174.codfw.wmnet - https://phabricator.wikimedia.org/T306927 (10RobH) [18:25:20] (03PS1) 10Cathal Mooney: Correct wmf-netbox plugin failure with patch panel front ports [software/homer/deploy] - 
10https://gerrit.wikimedia.org/r/786361 [18:25:56] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db1185.eqiad.wmnet - db1195.eqiad.wmnet - https://phabricator.wikimedia.org/T306928 (10RobH) [18:26:18] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db1185.eqiad.wmnet - db1195.eqiad.wmnet - https://phabricator.wikimedia.org/T306928 (10RobH) [18:27:20] (03CR) 10Volans: [C: 03+1] "LGTM, optional nit inline" [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/786361 (owner: 10Cathal Mooney) [18:31:57] (03PS2) 10Cathal Mooney: Correct wmf-netbox plugin failure with patch panel front ports [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/786361 [18:32:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1132', diff saved to https://phabricator.wikimedia.org/P26594 and previous config saved to /var/cache/conftool/dbconfig/20220426-183224-ladsgroup.json [18:32:26] (03CR) 10Volans: [C: 03+1] "LGTM" [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/786361 (owner: 10Cathal Mooney) [18:32:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:32:45] (03CR) 10Cathal Mooney: [C: 03+2] Correct wmf-netbox plugin failure with patch panel front ports [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/786361 (owner: 10Cathal Mooney) [18:32:48] (03PS1) 10RobH: updating sku list [software] - 10https://gerrit.wikimedia.org/r/786363 [18:32:50] (03CR) 10Cathal Mooney: [V: 03+2 C: 03+2] Correct wmf-netbox plugin failure with patch panel front ports [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/786361 (owner: 10Cathal Mooney) [18:33:28] (03CR) 10RobH: [C: 03+2] updating sku list [software] - 10https://gerrit.wikimedia.org/r/786363 (owner: 10RobH) [18:34:40] (03CR) 10Cathal Mooney: [V: 03+2 C: 03+2] "Thanks volans." 
[software/homer/deploy] - 10https://gerrit.wikimedia.org/r/786361 (owner: 10Cathal Mooney) [18:35:07] (03PS1) 10Jbond: C:monitoring: Add define for creating http checks [puppet] - 10https://gerrit.wikimedia.org/r/786365 [18:37:38] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [18:37:41] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [18:37:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:37:43] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [18:37:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:37:46] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [18:37:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:37:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:37:55] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [18:38:35] (03CR) 10Jbond: "early review to discuss path forward" [puppet] - 10https://gerrit.wikimedia.org/r/786365 (owner: 10Jbond) [18:39:21] (03PS3) 10Gergő Tisza: Re-apply "Backport video landing page changes" [extensions/GrowthExperiments] (wmf/1.39.0-wmf.8) - 10https://gerrit.wikimedia.org/r/785941 [18:39:24] (03PS1) 10Gergő Tisza: Enable SkinAddFooterLinks hook [extensions/GrowthExperiments] (wmf/1.39.0-wmf.8) - 10https://gerrit.wikimedia.org/r/786366 [18:40:27] (03CR) 10Jbond: [C: 04-1] "-1 awaiting legal sign of" [puppet] - 10https://gerrit.wikimedia.org/r/786310 (https://phabricator.wikimedia.org/T67270) (owner: 10Jbond) [18:40:34] (03PS1) 10Ebernhardson: cirrus: Turn on retry_on_conflict quirk [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/786367 [18:40:52] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1096.eqiad.wmnet with reason: Maintenance [18:40:54] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1096.eqiad.wmnet with reason: Maintenance [18:40:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:40:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1096:3315 (T298556)', diff saved to https://phabricator.wikimedia.org/P26595 and previous config saved to /var/cache/conftool/dbconfig/20220426-184058-ladsgroup.json [18:41:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:41:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:41:07] T298556: Fix mismatching field type of oldimage.oi_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298556 [18:42:19] (03CR) 10Umherirrender: [C: 03+1] "[Cannot help on deploying this, normal +2 is not enough on this repo, needs to be listed on https://wikitech.wikimedia.org/wiki/Deployment" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/740304 (owner: 10Thiemo Kreuz (WMDE)) [18:43:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315 (T298556)', diff saved to https://phabricator.wikimedia.org/P26596 and previous config saved to /var/cache/conftool/dbconfig/20220426-184313-ladsgroup.json [18:43:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:47:16] (03PS1) 10Cathal Mooney: Release v0.4.1 [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/786369 [18:47:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1132 (T306560)', diff saved to https://phabricator.wikimedia.org/P26597 and previous config saved to /var/cache/conftool/dbconfig/20220426-184729-ladsgroup.json [18:47:31] !log ladsgroup@cumin1001 START - Cookbook 
sre.hosts.downtime for 6:00:00 on db1133.eqiad.wmnet with reason: Maintenance [18:47:33] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1133.eqiad.wmnet with reason: Maintenance [18:47:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:47:36] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [18:47:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:47:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:47:56] (03CR) 10Cathal Mooney: [C: 03+2] Release v0.4.1 [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/786369 (owner: 10Cathal Mooney) [18:48:00] (03CR) 10Cathal Mooney: [V: 03+2 C: 03+2] Release v0.4.1 [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/786369 (owner: 10Cathal Mooney) [18:48:04] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1106.eqiad.wmnet with reason: Maintenance [18:48:05] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1106.eqiad.wmnet with reason: Maintenance [18:48:07] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [18:48:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:48:10] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [18:48:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:48:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1106 (T306560)', diff saved to https://phabricator.wikimedia.org/P26598 and previous config saved to /var/cache/conftool/dbconfig/20220426-184815-ladsgroup.json [18:48:16] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:48:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:48:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:49:18] !log cmooney@cumin1001 START - Cookbook sre.deploy.python-code homer to cumin2002.codfw.wmnet,cumin1001.eqiad.wmnet with reason: Release v0.4.1a - cmooney@cumin1001 [18:49:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:50:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106 (T306560)', diff saved to https://phabricator.wikimedia.org/P26599 and previous config saved to /var/cache/conftool/dbconfig/20220426-185047-ladsgroup.json [18:50:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:50:54] !log cmooney@cumin1001 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) homer to cumin2002.codfw.wmnet,cumin1001.eqiad.wmnet with reason: Release v0.4.1a - cmooney@cumin1001 [18:50:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:52:00] (CR) Umherirrender: "[Cannot help on deploying this, normal +2 is not enough on this repo, needs to be listed on https://wikitech.wikimedia.org/wiki/Deployment" [mediawiki-config] - https://gerrit.wikimedia.org/r/737859 (owner: Thiemo Kreuz (WMDE)) [18:53:03] (CR) Raymond Ndibe: Create REST api service to manage toolforge replica.my.cnf (33 comments) [puppet] - https://gerrit.wikimedia.org/r/777037 (https://phabricator.wikimedia.org/T304040) (owner: Raymond Ndibe) [18:57:11] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:58:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315', diff saved to https://phabricator.wikimedia.org/P26601 and previous config saved to 
/var/cache/conftool/dbconfig/20220426-185818-ladsgroup.json [18:58:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:01:55] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [19:02:45] !log About to deploy analytics/refinery: Weekly deployment train + Artifacts to 0.1.27 [19:02:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:05:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106', diff saved to https://phabricator.wikimedia.org/P26602 and previous config saved to /var/cache/conftool/dbconfig/20220426-190552-ladsgroup.json [19:05:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:06:14] !log aqu@deploy1002 Started deploy [analytics/refinery@96a3934]: Regular analytics weekly train [analytics/refinery@96a3934] [19:06:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:07:02] (03PS1) 10Gergő Tisza: [beta] Restore eswiki Growth campaigns test config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/786375 (https://phabricator.wikimedia.org/T306833) [19:13:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315', diff saved to https://phabricator.wikimedia.org/P26603 and previous config saved to /var/cache/conftool/dbconfig/20220426-191323-ladsgroup.json [19:13:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:20:07] (03PS1) 10Bking: elastic: Add wmf-elasticsearch-search-plugins package for bullseye [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/786376 (https://phabricator.wikimedia.org/T306911) [19:20:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance 
db1106', diff saved to https://phabricator.wikimedia.org/P26604 and previous config saved to /var/cache/conftool/dbconfig/20220426-192057-ladsgroup.json [19:21:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:22:45] (KubernetesRsyslogDown) firing: (4) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [19:24:18] (03CR) 10Muehlenhoff: [C: 03+2] sre.ganeti.addnode: Fix bridge detection [cookbooks] - 10https://gerrit.wikimedia.org/r/786356 (owner: 10Muehlenhoff) [19:27:42] 10SRE, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: (Need By: TBD) rack/setup/install frdb1005, frdev1003 - https://phabricator.wikimedia.org/T306935 (10RobH) [19:27:59] 10SRE, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: (Need By: TBD) rack/setup/install frdb1005, frdev1003 - https://phabricator.wikimedia.org/T306935 (10RobH) [19:28:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315 (T298556)', diff saved to https://phabricator.wikimedia.org/P26605 and previous config saved to /var/cache/conftool/dbconfig/20220426-192828-ladsgroup.json [19:28:30] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1161.eqiad.wmnet with reason: Maintenance [19:28:32] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1161.eqiad.wmnet with reason: Maintenance [19:28:33] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 20:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [19:28:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:28:35] T298556: Fix mismatching field type of oldimage.oi_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298556 [19:28:37] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 20:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [19:28:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:28:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1161 (T298556)', diff saved to https://phabricator.wikimedia.org/P26606 and previous config saved to /var/cache/conftool/dbconfig/20220426-192841-ladsgroup.json [19:28:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:28:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:28:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:28:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:30:49] !log aqu@deploy1002 Finished deploy [analytics/refinery@96a3934]: Regular analytics weekly train [analytics/refinery@96a3934] (duration: 24m 35s) [19:30:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:30:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T298556)', diff saved to https://phabricator.wikimedia.org/P26607 and previous config saved to /var/cache/conftool/dbconfig/20220426-193055-ladsgroup.json [19:31:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:34:11] !log aqu@deploy1002 Started deploy [analytics/refinery@96a3934] (thin): Regular analytics weekly train THIN [analytics/refinery@96a3934] [19:34:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:34:18] !log aqu@deploy1002 Finished deploy [analytics/refinery@96a3934] (thin): Regular analytics weekly train THIN [analytics/refinery@96a3934] (duration: 00m 07s) [19:34:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:34:55] !log aqu@deploy1002 Started deploy [analytics/refinery@96a3934] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@96a3934] 
[19:34:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:36:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106 (T306560)', diff saved to https://phabricator.wikimedia.org/P26608 and previous config saved to /var/cache/conftool/dbconfig/20220426-193602-ladsgroup.json [19:36:04] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1164.eqiad.wmnet with reason: Maintenance [19:36:05] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1164.eqiad.wmnet with reason: Maintenance [19:36:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:36:08] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [19:36:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1164 (T306560)', diff saved to https://phabricator.wikimedia.org/P26609 and previous config saved to /var/cache/conftool/dbconfig/20220426-193610-ladsgroup.json [19:36:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:36:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:36:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:38:05] PROBLEM - SSH on pki2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:38:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1164 (T306560)', diff saved to https://phabricator.wikimedia.org/P26610 and previous config saved to /var/cache/conftool/dbconfig/20220426-193844-ladsgroup.json [19:38:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:42:13] !log aqu@deploy1002 Finished deploy [analytics/refinery@96a3934] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@96a3934] (duration: 07m 
19s) [19:42:17] (03PS2) 10Jbond: C:monitoring: Add define for creating http checks [puppet] - 10https://gerrit.wikimedia.org/r/786365 [19:42:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:43:07] jouncebot now [19:43:08] For the next 0 hour(s) and 16 minute(s): MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220426T1800) [19:45:32] !log dzahn@cumin2002 conftool action : set/pooled=no; selector: dc=codfw,name=mw2419.codfw.wmnet [19:45:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:46:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P26611 and previous config saved to /var/cache/conftool/dbconfig/20220426-194600-ladsgroup.json [19:46:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:46:30] !log dzahn@cumin2002 conftool action : set/weight=25; selector: dc=codfw,name=mw2419.codfw.wmnet [19:46:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:48:18] !log mw2419 - set weight to 25 in conftool, scap pull, first time in production, jobrunner/videoscaler T290192 [19:48:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:48:24] T290192: Q1:(Need By: TBD) rack/setup/install mw241[2-9].codfw.wmnet - https://phabricator.wikimedia.org/T290192 [19:50:54] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [19:53:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1164', diff saved to https://phabricator.wikimedia.org/P26612 and previous config saved to /var/cache/conftool/dbconfig/20220426-195349-ladsgroup.json 
[19:53:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:54:38] !log dzahn@cumin2002 conftool action : set/pooled=yes; selector: dc=codfw,name=mw2419.codfw.wmnet [19:54:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:58:51] SRE: Migrate role::bastionhost::general and role::bastionhost::pop to Buster - https://phabricator.wikimedia.org/T253779 (Dzahn) This looks resolved to me: ` [cumin2002:~] $ sudo cumin 'bast*' 'lsb_release -c' 8 hosts will be targeted: bast[1003,2002,3004-3005,4003,5001-5002,6001].wikimedia.org Ok to proce... [19:59:50] SRE, Epic: Migrate all of production metal and VMs to Buster or later - https://phabricator.wikimedia.org/T247045 (Dzahn) [20:00:05] RoanKattouw, Urbanecm, and cjming: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220426T2000). [20:00:05] jdrewniak, ebernhardson, and tgr: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:13] SRE: Migrate role::bastionhost::general and role::bastionhost::pop to Buster - https://phabricator.wikimedia.org/T253779 (Dzahn) Open→Resolved boldly setting to resolved, correct me if I'm wrong @Muehlenhoff [20:00:24] hey [20:00:26] i can deploy today [20:00:35] \o [20:00:46] ebernhardson: unless you (or others) wish to self-service? :) [20:01:05] jan_drewniak: hi, around? 
:) [20:01:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P26614 and previous config saved to /var/cache/conftool/dbconfig/20220426-200105-ladsgroup.json [20:01:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:01:40] SRE, Epic: Migrate all of production metal and VMs to Buster or later - https://phabricator.wikimedia.org/T247045 (Dzahn) [20:01:43] SRE, Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (Dzahn) [20:02:12] urbanecm: shrug, you can ship if you want :) [20:02:20] SRE, Epic: Migrate all of production metal and VMs to Buster or later - https://phabricator.wikimedia.org/T247045 (Dzahn) [20:02:23] SRE, Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (Dzahn) [20:02:34] (CR) Urbanecm: [C: +2] cirrus: Turn on retry_on_conflict quirk [mediawiki-config] - https://gerrit.wikimedia.org/r/786367 (owner: Ebernhardson) [20:02:52] the quirks one isn't really testable, it only takes effect on the job runners. Can monitor logstash to see if it fixes the thing it's supposed to [20:03:14] ebernhardson: okay, good to know. the calendar links only one patch (but twice). can you fix the second link please? [20:03:21] (Merged) jenkins-bot: cirrus: Turn on retry_on_conflict quirk [mediawiki-config] - https://gerrit.wikimedia.org/r/786367 (owner: Ebernhardson) [20:03:26] the AB test one should be safe as well but might as well test on mwdebug host, it's not turning on the test so there should be no visible change [20:03:28] urbanecm: sure, sec [20:04:12] urbanecm: nope it refers to two different patches, they are exactly 20 patches apart (47 vs 67) [20:04:28] hmm, must've opened it twice myself then. sorry! 
[20:04:34] (PS4) Urbanecm: Add wbsearchentities profiles for testing [mediawiki-config] - https://gerrit.wikimedia.org/r/786347 (https://phabricator.wikimedia.org/T306644) (owner: Ebernhardson) [20:05:04] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 080b8fc573d9d682038e09a7a7ad875bce478c00: cirrus: Turn on retry_on_conflict quirk (duration: 00m 53s) [20:05:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:05:11] first patch is live [20:05:22] (CR) Urbanecm: [C: +2] Add wbsearchentities profiles for testing [mediawiki-config] - https://gerrit.wikimedia.org/r/786347 (https://phabricator.wikimedia.org/T306644) (owner: Ebernhardson) [20:05:51] (PS2) Cwhite: profile: re-enable grafana db sync post 8.x upgrade [puppet] - https://gerrit.wikimedia.org/r/785927 [20:06:08] (Merged) jenkins-bot: Add wbsearchentities profiles for testing [mediawiki-config] - https://gerrit.wikimedia.org/r/786347 (https://phabricator.wikimedia.org/T306644) (owner: Ebernhardson) [20:06:18] cool, dropped 200 deprecation warnings/s, makes things a little quieter :) [20:06:24] err, /min [20:07:04] sounds like a good thing to have :) [20:07:24] ebernhardson: second patch is at mwdebug1001, can you test please? 
[20:07:39] (PS2) Urbanecm: [beta] Restore eswiki Growth campaigns test config [mediawiki-config] - https://gerrit.wikimedia.org/r/786375 (https://phabricator.wikimedia.org/T306833) (owner: Gergő Tisza) [20:07:42] (CR) Urbanecm: [C: +2] [beta] Restore eswiki Growth campaigns test config [mediawiki-config] - https://gerrit.wikimedia.org/r/786375 (https://phabricator.wikimedia.org/T306833) (owner: Gergő Tisza) [20:07:59] urbanecm: checking [20:08:23] (Merged) jenkins-bot: [beta] Restore eswiki Growth campaigns test config [mediawiki-config] - https://gerrit.wikimedia.org/r/786375 (https://phabricator.wikimedia.org/T306833) (owner: Gergő Tisza) [20:08:48] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:08:51] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:08:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:08:52] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:08:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1164', diff saved to https://phabricator.wikimedia.org/P26615 and previous config saved to /var/cache/conftool/dbconfig/20220426-200854-ladsgroup.json [20:08:56] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:08:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:09:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:09:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:09:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:09:55] urbanecm: looks to work as expected, good to go [20:10:02] syncing [20:10:06] (CR) Cwhite: [C: +2] profile: re-enable grafana db sync post 8.x upgrade (1 comment) [puppet] - https://gerrit.wikimedia.org/r/785927 (owner: Cwhite) 
[20:10:57] jan_drewniak: hello, around? [20:11:11] urbanecm: hey, sorry im late! [20:11:18] no problem [20:11:26] (CR) Urbanecm: [C: +2] [ToC] Increase threshold for ToC collapsing to 1000px [skins/Vector] (wmf/1.39.0-wmf.9) - https://gerrit.wikimedia.org/r/785943 (https://phabricator.wikimedia.org/T306904) (owner: Jdlrobson) [20:11:27] !log urbanecm@deploy1002 Synchronized wmf-config/: 9805e61f7006edf45199a3e22494945bffaaeb4d: Add wbsearchentities profiles for testing (T306644) (duration: 00m 53s) [20:11:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:11:33] T306644: re-run wbsearchentities optimization process - https://phabricator.wikimedia.org/T306644 [20:11:34] (CR) Urbanecm: [C: +2] [ToC] Increase threshold for ToC collapsing to 1000px [skins/Vector] (wmf/1.39.0-wmf.8) - https://gerrit.wikimedia.org/r/785942 (https://phabricator.wikimedia.org/T306904) (owner: Jdlrobson) [20:11:50] ebernhardson: it's live. anything else from you? [20:12:19] SRE, WMF-JobQueue, serviceops, Sustainability (Incident Followup): Videoscalers fail health checks while CPU is maxed - https://phabricator.wikimedia.org/T306860 (jhathaway) Another option would be to use cpu pinning via taskset(1), where ffmpeg is assigned to cpus 1-N and cpu 0 is left free to s... 
[20:13:59] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:14:02] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:14:03] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:14:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:14:07] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:14:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:14:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:14:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:15:55] (CR) Cwhite: [C: +2] logstash: populate target index format and add pipeline diagnostics [puppet] - https://gerrit.wikimedia.org/r/775375 (https://phabricator.wikimedia.org/T305090) (owner: Cwhite) [20:16:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T298556)', diff saved to https://phabricator.wikimedia.org/P26616 and previous config saved to /var/cache/conftool/dbconfig/20220426-201610-ladsgroup.json [20:16:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:16:17] T298556: Fix mismatching field type of oldimage.oi_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298556 [20:16:43] jan_drewniak: just noticed you've a config too. does it depend on the backport? [20:17:15] urbanecm: nope, two different things [20:17:21] great [20:17:44] jan_drewniak: and it's marked as depending on (unscheduled) https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/786357/1. do you want to do both? [20:17:46] or just the scheduled one? 
[20:18:57] urbanecm: both :) [20:19:10] okay [20:19:13] (PS2) Urbanecm: Enable table of contents a/b test on euwiki and hewiki, enable reading depth [mediawiki-config] - https://gerrit.wikimedia.org/r/786357 (https://phabricator.wikimedia.org/T306606) (owner: Jdlrobson) [20:19:24] (CR) Urbanecm: [C: +2] Enable table of contents a/b test on euwiki and hewiki, enable reading depth [mediawiki-config] - https://gerrit.wikimedia.org/r/786357 (https://phabricator.wikimedia.org/T306606) (owner: Jdlrobson) [20:19:53] (PS2) Urbanecm: Expand max-width to login, create account, disable on Wikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/786358 (https://phabricator.wikimedia.org/T300182) (owner: Jdlrobson) [20:20:10] (Merged) jenkins-bot: Enable table of contents a/b test on euwiki and hewiki, enable reading depth [mediawiki-config] - https://gerrit.wikimedia.org/r/786357 (https://phabricator.wikimedia.org/T306606) (owner: Jdlrobson) [20:21:46] jan_drewniak: first one pulled to mwdebug1001. can you test? 
[20:22:59] urbanecm: ok that one's good [20:23:03] syncing [20:23:14] (03CR) 10Urbanecm: [C: 03+2] Expand max-width to login, create account, disable on Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/786358 (https://phabricator.wikimedia.org/T300182) (owner: 10Jdlrobson) [20:24:00] (03Merged) 10jenkins-bot: Expand max-width to login, create account, disable on Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/786358 (https://phabricator.wikimedia.org/T300182) (owner: 10Jdlrobson) [20:24:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1164 (T306560)', diff saved to https://phabricator.wikimedia.org/P26617 and previous config saved to /var/cache/conftool/dbconfig/20220426-202359-ladsgroup.json [20:24:01] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1184.eqiad.wmnet with reason: Maintenance [20:24:03] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1184.eqiad.wmnet with reason: Maintenance [20:24:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:24:07] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [20:24:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1184 (T306560)', diff saved to https://phabricator.wikimedia.org/P26618 and previous config saved to /var/cache/conftool/dbconfig/20220426-202407-ladsgroup.json [20:24:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:24:12] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:24:14] (03PS1) 10Ebernhardson: Correct wbsearchentities profiles [mediawiki-config] - 10https://gerrit.wikimedia.org/r/786381 [20:24:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:24:16] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:24:17] !log 
mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:24:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:24:21] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:24:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:24:26] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: e3ce97b97c1d83dc4f538040da92a571895cb4d0: Enable table of contents a/b test on euwiki and hewiki, enable reading depth (T306606) (duration: 00m 52s) [20:24:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:24:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:24:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:24:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:24:41] jan_drewniak: second patch is at mwdebug1001, please check [20:24:41] T306606: Deploy ToC A/B test to euwiki, hewiki - https://phabricator.wikimedia.org/T306606 [20:26:01] turns out my AB test profiles patch doesn't work entirely (thats why we deploy it with the test turned off :) and it needs a config fix: https://gerrit.wikimedia.org/r/786381 [20:26:19] urbanecm: second patch is good too [20:26:23] (03PS2) 10Urbanecm: Correct wbsearchentities profiles [mediawiki-config] - 10https://gerrit.wikimedia.org/r/786381 (owner: 10Ebernhardson) [20:26:27] (03CR) 10Urbanecm: [C: 03+2] Correct wbsearchentities profiles [mediawiki-config] - 10https://gerrit.wikimedia.org/r/786381 (owner: 10Ebernhardson) [20:26:29] RECOVERY - mediawiki-installation DSH group on mw2419 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [20:26:35] ebernhardson: okay, let's see :) [20:26:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184 (T306560)', diff saved to https://phabricator.wikimedia.org/P26619 and 
previous config saved to /var/cache/conftool/dbconfig/20220426-202641-ladsgroup.json [20:26:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:27:12] (03Merged) 10jenkins-bot: Correct wbsearchentities profiles [mediawiki-config] - 10https://gerrit.wikimedia.org/r/786381 (owner: 10Ebernhardson) [20:27:51] !log aqu@deploy1002 Started deploy [airflow-dags/analytics@e177d87]: Bump jar dependency to 0.1.27 in mediarequest/hourly [airflow-dags/analytics@e177d87] [20:27:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:28:00] (03Merged) 10jenkins-bot: [ToC] Increase threshold for ToC collapsing to 1000px [skins/Vector] (wmf/1.39.0-wmf.9) - 10https://gerrit.wikimedia.org/r/785943 (https://phabricator.wikimedia.org/T306904) (owner: 10Jdlrobson) [20:28:06] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: fe0e119ef7c768373db4afed21537f85004a8ae2: Expand max-width to login, create account, disable on Wikidata (T300182, T306834; 1/2) (duration: 00m 56s) [20:28:08] (03Merged) 10jenkins-bot: [ToC] Increase threshold for ToC collapsing to 1000px [skins/Vector] (wmf/1.39.0-wmf.8) - 10https://gerrit.wikimedia.org/r/785942 (https://phabricator.wikimedia.org/T306904) (owner: 10Jdlrobson) [20:28:09] !log aqu@deploy1002 Finished deploy [airflow-dags/analytics@e177d87]: Bump jar dependency to 0.1.27 in mediarequest/hourly [airflow-dags/analytics@e177d87] (duration: 00m 17s) [20:28:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:28:13] T300182: Wikidata.org responsive behaviour conflicts with Vector Max width - https://phabricator.wikimedia.org/T300182 [20:28:13] T306834: Add max-width to Log-in & Create account pages - https://phabricator.wikimedia.org/T306834 [20:28:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:28:27] (03PS1) 10BryanDavis: wikireplicas: Improve log message for skipped views [puppet] - 
10https://gerrit.wikimedia.org/r/786382 [20:29:01] !log urbanecm@deploy1002 Synchronized wmf-config/CommonSettings.php: fe0e119ef7c768373db4afed21537f85004a8ae2: Expand max-width to login, create account, disable on Wikidata (T300182, T306834; 2/2) (duration: 00m 54s) [20:29:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:29:12] jan_drewniak: and live [20:29:23] ebernhardson: your fix is at mwdebug1001 [20:29:28] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:29:31] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:29:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:29:32] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:29:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:29:36] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:29:39] urbanecm: perfect, thanks! [20:29:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:29:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:29:44] np [20:30:26] (03CR) 10BryanDavis: "Code untested. Likely most easily testable via manual editing on single live wiki replica server. I don't know of any equivalent testing e" [puppet] - 10https://gerrit.wikimedia.org/r/786382 (owner: 10BryanDavis) [20:30:50] jan_drewniak: backports are at mwdebug1001. can you test? 
[20:33:40] urbanecm: alright we can go ahead with it [20:33:45] !log mw2412, mw2413, mw2414, mw2415 - scap pull, get into production the first time [20:33:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:34:26] jan_drewniak: thanks, syncing [20:34:31] mutante: just fyi i'm deploying atm [20:34:39] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:34:42] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:34:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:34:43] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:34:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:34:47] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:34:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:34:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:35:13] urbanecm: ok, thanks. I will just have to repeat it but since it's rsync.. will be quicker next time [20:35:57] sure, just wanted to make sure you're aware :) [20:36:23] ebernhardson: how is the fix testing going? 
[20:36:39] !log urbanecm@deploy1002 Synchronized php-1.39.0-wmf.8/skins/Vector/resources/skins.vector.styles/: 31ed884d6eda998f8625a88be0f4aa5fd67aef4b: [ToC] Increase threshold for ToC collapsing to 1000px (T306904) (duration: 00m 50s) [20:36:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:36:45] T306904: [ToC] Increase threshold for ToC collapsing to 1000px - https://phabricator.wikimedia.org/T306904 [20:37:13] urbanecm: I will wait before adding them to "dsh" so you should not see issues [20:37:23] ok [20:37:30] !log urbanecm@deploy1002 Synchronized php-1.39.0-wmf.9/skins/Vector/resources/skins.vector.styles/: 019a812176bb940383ddeb22f8a74b5d0f447bf1: [ToC] Increase threshold for ToC collapsing to 1000px (T306904) (duration: 00m 50s) [20:37:32] jan_drewniak: and live [20:37:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:37:35] anything else? [20:38:17] !log dzahn@cumin2002 conftool action : set/weight=30; selector: dc=codfw,name=mw2412.codfw.wmnet [20:38:20] urbanecm: that's all for today, thanks again! 
[20:38:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:38:27] np [20:39:40] !log dzahn@cumin2002 conftool action : set/weight=30; selector: dc=codfw,name=mw2413.codfw.wmnet [20:39:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:39:45] !log dzahn@cumin2002 conftool action : set/weight=30; selector: dc=codfw,name=mw2414.codfw.wmnet [20:39:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:39:49] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:39:52] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:39:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:39:53] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:39:54] (03PS1) 10Jbond: C:monitoring::check::http: move config to config ini file [puppet] - 10https://gerrit.wikimedia.org/r/786384 [20:39:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:39:57] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:40:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:40:02] !log dzahn@cumin2002 conftool action : set/weight=30; selector: dc=codfw,name=mw2415.codfw.wmnet [20:40:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:40:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:40:34] urbanecm: that's it? 
[20:40:35] PROBLEM - BGP status on cr3-eqsin is CRITICAL: BGP CRITICAL - AS6939/IPv6: Active - HE, AS6939/IPv4: Connect - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [20:40:45] (03CR) 10jerkins-bot: [V: 04-1] C:monitoring::check::http: move config to config ini file [puppet] - 10https://gerrit.wikimedia.org/r/786384 (owner: 10Jbond) [20:40:54] mutante: I'm waiting on ebernhardson's testing of a fix atm [20:40:57] PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - AS6939/IPv6: Connect - HE, AS6939/IPv4: Connect - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [20:41:06] ah, alright [20:41:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184', diff saved to https://phabricator.wikimedia.org/P26620 and previous config saved to /var/cache/conftool/dbconfig/20220426-204146-ladsgroup.json [20:41:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:41:50] (03CR) 10Jbond: C:monitoring: Add define for creating http checks (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/786365 (owner: 10Jbond) [20:43:40] ebernhardson: how is it going please? [20:44:49] 10SRE, 10serviceops, 10Patch-For-Review: eventgate chart should use common_templates - https://phabricator.wikimedia.org/T303543 (10Aklapper) @Ottomata: A #good_first_task is a self-contained, non-controversial task with a clear approach. It should be well-described with pointers to help a completely new con... 
[20:45:12] mutante: fyi, I believe tgr is going after per -releng [20:45:13] (03PS2) 10Jbond: C:monitoring::check::http: move config to config ini file [puppet] - 10https://gerrit.wikimedia.org/r/786384 [20:45:59] PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:46:05] and yes, a window's scheduled right after this one [20:46:38] (03CR) 10Urbanecm: [C: 03+2] Re-apply "Add Link: Add 'excluded sections' task setting" [extensions/GrowthExperiments] (wmf/1.39.0-wmf.8) - 10https://gerrit.wikimedia.org/r/785944 (owner: 10Gergő Tisza) [20:46:44] (03CR) 10Urbanecm: [C: 03+2] Re-apply "Backport video landing page changes" [extensions/GrowthExperiments] (wmf/1.39.0-wmf.8) - 10https://gerrit.wikimedia.org/r/785941 (owner: 10Gergő Tisza) [20:47:25] (03PS1) 10Urbanecm: Revert "Correct wbsearchentities profiles" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/786406 [20:47:31] (03CR) 10Urbanecm: [C: 03+2] Revert "Correct wbsearchentities profiles" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/786406 (owner: 10Urbanecm) [20:47:38] ebernhardson: reverting the fix. 
[20:47:40] urbanecm: doh, sorry i got distracted on another task [20:47:47] or perhaps not [20:47:55] urbanecm: the expected request works now (https://www.wikidata.org/w/api.php?action=wbsearchentities&search=e&format=json&errorformat=plaintext&language=en&uselang=en&type=item&cirrusWBProfile=wikibase_config_prefix_query-202203-en&cirrusRescoreProfile=wikibase_config_entity_weight-202203-en) [20:47:59] okay, great [20:48:01] so, syncing [20:48:15] (03Abandoned) 10Urbanecm: Revert "Correct wbsearchentities profiles" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/786406 (owner: 10Urbanecm) [20:49:41] !log urbanecm@deploy1002 Synchronized wmf-config/SearchSettingsForWikidata.php: f76bc806157a3f4c88d44cd467de347b4b471f4e: Correct wbsearchentities profiles (duration: 00m 57s) [20:49:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:49:47] ebernhardson: and, live [20:49:50] i believe that's all? [20:50:13] !log aqu@deploy1002 Started deploy [airflow-dags/analytics@e177d87]: Bump jar dependency to 0.1.27 in mediarequest/hourly [airflow-dags/analytics@e177d87] [20:50:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:50:18] urbanecm: looks like I might need to partially revert that last change by Jan [20:50:21] !log aqu@deploy1002 Finished deploy [airflow-dags/analytics@e177d87]: Bump jar dependency to 0.1.27 in mediarequest/hourly [airflow-dags/analytics@e177d87] (duration: 00m 07s) [20:50:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:50:28] Jdlrobson: what does that mean please? [20:51:08] wmgVectorMaxWidthOptionsNamespaces is not working [20:51:14] it's applying to pages it shouldn't be [20:51:35] Jdlrobson: so i should revert the config patch, right? 
[20:52:46] (03PS1) 10Jdlrobson: wmgVectorMaxWidthOptionsNamespaces not working [mediawiki-config] - 10https://gerrit.wikimedia.org/r/786388 [20:52:50] ^ urbanecm [20:52:59] PROBLEM - Disk space on grafana2001 is CRITICAL: DISK CRITICAL - free space: / 156 MB (1% inode=87%): /tmp 156 MB (1% inode=87%): /var/tmp 156 MB (1% inode=87%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=grafana2001&var-datasource=codfw+prometheus/ops [20:53:00] dunno what "applying to pages it shouldn't be" means exactly, it applies correctly at the very last (on those few wikis) https://www.irccloud.com/pastebin/wbZmfNsD/ [20:53:17] Mm on https://en.wikipedia.org/wiki/Special:Contributions/Jdlrobson I'm seeing the max width [20:53:19] urbanecm: thanks! [20:53:26] sorry bout the delay [20:53:37] it happens :) [20:53:38] urbanecm: is that for all wikis? [20:53:51] what is wikidatawiki? [20:53:56] wikidata.org [20:54:12] Jdlrobson: my paste? just randomly picked some wikipedias (and wikidatawiki) to spot-check the config applies [20:54:14] I'm seeing no max width on https://www.wikidata.org/wiki/Q1 [20:54:21] so somethings not right here [20:54:38] k, reverting [20:54:44] (03CR) 10Urbanecm: [C: 03+2] wmgVectorMaxWidthOptionsNamespaces not working [mediawiki-config] - 10https://gerrit.wikimedia.org/r/786388 (owner: 10Jdlrobson) [20:54:44] So in short: 1) The content on https://www.wikidata.org/wiki/Q1 should not be limited [20:54:53] 2) Or on https://en.wikipedia.org/wiki/Special:Contributions/Jdlrobson [20:55:08] oohhh [20:55:15] it should be $wgVectorMaxWidthOptions['exclude']['namespaces'] [20:55:20] that's what's going on here [20:55:21] oh [20:55:27] removed the +2 [20:55:33] Jdlrobson: wanna upload a followup? 
[20:56:07] (03PS2) 10Jdlrobson: wmgVectorMaxWidthOptionsNamespaces not working [mediawiki-config] - 10https://gerrit.wikimedia.org/r/786388 [20:56:11] I amended that patch [20:56:25] yep that will do it [20:56:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184', diff saved to https://phabricator.wikimedia.org/P26621 and previous config saved to /var/cache/conftool/dbconfig/20220426-205651-ladsgroup.json [20:56:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:57:05] (03CR) 10Urbanecm: [C: 03+2] wmgVectorMaxWidthOptionsNamespaces not working [mediawiki-config] - 10https://gerrit.wikimedia.org/r/786388 (owner: 10Jdlrobson) [20:57:07] let's hoper [20:57:09] *hope [20:57:09] (03PS3) 10Jbond: C:monitoring::check::http: move config to config ini file [puppet] - 10https://gerrit.wikimedia.org/r/786384 [20:57:47] (03Merged) 10jenkins-bot: wmgVectorMaxWidthOptionsNamespaces not working [mediawiki-config] - 10https://gerrit.wikimedia.org/r/786388 (owner: 10Jdlrobson) [20:58:19] urbanecm: sorry about that [20:58:26] Jdlrobson: pulled to mwdebug1001. can you have a look? [20:58:34] (and no problem at all, happens from time to time) [20:58:51] RhinosF1: ACK, thx. It can wait :) [20:59:05] urbanecm: testing now [20:59:41] urbanecm: that's working now [20:59:46] excellent [20:59:47] syncing [21:00:05] tgr: That opportune time is upon us again. Time for a Retry of UTC afternoon backport window deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220426T2100). [21:00:05] No Gerrit patches in the queue for this window AFAICS. 
[21:00:07] !log mw2416, mw2417, mw2418 - scap pull [21:00:10] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [21:00:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:00:13] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [21:00:14] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [21:00:15] tgr: please wait for a while [21:00:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:00:17] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [21:00:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:00:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:00:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:01:08] !log urbanecm@deploy1002 Synchronized wmf-config/CommonSettings.php: cab00628da0ba6226ff162cfc848bea35a35783a: fix wmgVectorMaxWidthOptionsNamespaces (T300182) (duration: 01m 00s) [21:01:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:01:14] Jdlrobson: and, live [21:01:14] T300182: Wikidata.org responsive behaviour conflicts with Vector Max width - https://phabricator.wikimedia.org/T300182 [21:01:22] tgr: floor is yours now [21:01:29] let me know if i can help in any way [21:01:43] thanks urbanecm [21:01:53] thanks urbanecm :) [21:01:55] np [21:02:08] mutante: do you want to do the scaps first? I'll take a while [21:02:16] !log dzahn@cumin2002 conftool action : set/weight=30; selector: dc=codfw,name=mw2416.codfw.wmnet [21:02:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:02:32] 10SRE, 10ops-eqiad, 10Cassandra, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs1016-aqs1021 - https://phabricator.wikimedia.org/T305570 (10Jclark-ctr) Hostname. Rack. 
U Cableid Port aqs1016 a3 u21 1877 port23 aqs1017 b5 u38 23000056... [21:02:50] tgr: nah, you can ignore my !log line for now. they are not in the scap groups yet. I am just preparing them. thanks for offering [21:03:22] !log dzahn@cumin2002 conftool action : set/weight=30; selector: dc=codfw,name=mw2417.codfw.wmnet [21:03:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:03:28] !log dzahn@cumin2002 conftool action : set/weight=30; selector: dc=codfw,name=mw2418.codfw.wmnet [21:03:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:03:34] stepping back, all yours [21:05:27] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [21:05:30] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [21:05:32] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [21:05:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:05:35] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [21:05:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:05:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:05:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:10:39] 10SRE, 10ops-eqiad, 10Cassandra, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs1016-aqs1021 - https://phabricator.wikimedia.org/T305570 (10Jclark-ctr) [21:10:46] ls [21:11:27] 10SRE, 10ops-eqiad, 10Cassandra, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs1016-aqs1021 - https://phabricator.wikimedia.org/T305570 (10Jclark-ctr) a:05Jclark-ctr→03Cmjohnson [21:11:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184 (T306560)', diff saved to https://phabricator.wikimedia.org/P26623 and previous config saved to /var/cache/conftool/dbconfig/20220426-211156-ladsgroup.json 
[21:11:58] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1134.eqiad.wmnet with reason: Maintenance [21:12:00] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1134.eqiad.wmnet with reason: Maintenance [21:12:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:12:03] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [21:12:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1134 (T306560)', diff saved to https://phabricator.wikimedia.org/P26624 and previous config saved to /var/cache/conftool/dbconfig/20220426-211204-ladsgroup.json [21:12:05] 10SRE, 10serviceops, 10Sustainability (Incident Followup): Set API server weights - https://phabricator.wikimedia.org/T304800 (10Dzahn) There is a new type of servers now: group D - mw2416, mw2417 and mw2418 - R440 - Xeon Silver 4210R 2.4G - (**40 processors, 128GB RAM**), that's only 40 processors vs 48 bu... [21:12:06] ebernhardson: ls: cannot open directory '.' 
[21:12:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:12:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:12:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:13:12] 10SRE, 10ops-eqiad, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install kafka-jumbo101[0-5] - https://phabricator.wikimedia.org/T306939 (10RobH) [21:14:37] 10SRE, 10ops-eqiad, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install kafka-jumbo101[0-5] - https://phabricator.wikimedia.org/T306939 (10RobH) [21:14:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134 (T306560)', diff saved to https://phabricator.wikimedia.org/P26625 and previous config saved to /var/cache/conftool/dbconfig/20220426-211437-ladsgroup.json [21:14:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:15:39] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 124 probes of 676 (alerts on 90) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [21:17:33] (03PS1) 10Dzahn: parsoid: move template for testing server to profile, remove old module [puppet] - 10https://gerrit.wikimedia.org/r/786391 (https://phabricator.wikimedia.org/T279059) [21:28:05] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 86 probes of 676 (alerts on 90) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [21:29:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134', diff saved to https://phabricator.wikimedia.org/P26626 and previous config saved to /var/cache/conftool/dbconfig/20220426-212943-ladsgroup.json [21:29:47] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:37:07] 10SRE, 10DC-Ops: Confirm support of PERC 750 raid controller - https://phabricator.wikimedia.org/T297913 (10RobH) [21:37:54] !log aqu@deploy1002 Started deploy [airflow-dags/analytics@e5fecc9]: Fix typo in mediarequest/hourly sensor [airflow-dags/analytics@e5fecc9] [21:37:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:38:01] !log aqu@deploy1002 Finished deploy [airflow-dags/analytics@e5fecc9]: Fix typo in mediarequest/hourly sensor [airflow-dags/analytics@e5fecc9] (duration: 00m 07s) [21:38:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:40:35] RECOVERY - SSH on pki2001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:42:25] 10SRE, 10ops-codfw, 10DC-Ops, 10GitLab (Infrastructure): Q3:(Need By: TBD) rack/setup/install gitlab200[2|3] and gitlab-runner200[2|3|4] - https://phabricator.wikimedia.org/T301183 (10Papaul) [21:44:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134', diff saved to https://phabricator.wikimedia.org/P26627 and previous config saved to /var/cache/conftool/dbconfig/20220426-214448-ladsgroup.json [21:44:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:48:27] (03CR) 10jerkins-bot: [V: 04-1] Re-apply "Add Link: Add 'excluded sections' task setting" [extensions/GrowthExperiments] (wmf/1.39.0-wmf.8) - 10https://gerrit.wikimedia.org/r/785944 (owner: 10Gergő Tisza) [21:50:44] (03CR) 10Dzahn: [C: 03+2] "fyi, Effie. I am trying to finish the clean-up. https://puppet-compiler.wmflabs.org/pcc-worker1002/34965/" [puppet] - 10https://gerrit.wikimedia.org/r/786391 (https://phabricator.wikimedia.org/T279059) (owner: 10Dzahn) [21:53:06] "Build timed out (after 60 minutes). Marking the build as failed." 
:/ [21:53:14] (03CR) 10Dzahn: "noop on scandium and testreduce1001, parsoid-test hosts" [puppet] - 10https://gerrit.wikimedia.org/r/786391 (https://phabricator.wikimedia.org/T279059) (owner: 10Dzahn) [21:53:34] I think I'll just force-merge that, all the nonselenium jobs passed [21:59:13] (03PS1) 10Volans: homer: suppress cryptography deprecation warning [puppet] - 10https://gerrit.wikimedia.org/r/786400 [21:59:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134 (T306560)', diff saved to https://phabricator.wikimedia.org/P26628 and previous config saved to /var/cache/conftool/dbconfig/20220426-215953-ladsgroup.json [21:59:55] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1135.eqiad.wmnet with reason: Maintenance [21:59:56] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1135.eqiad.wmnet with reason: Maintenance [21:59:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:00:00] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [22:00:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1135 (T306560)', diff saved to https://phabricator.wikimedia.org/P26629 and previous config saved to /var/cache/conftool/dbconfig/20220426-220001-ladsgroup.json [22:00:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:00:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:00:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:02:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135 (T306560)', diff saved to https://phabricator.wikimedia.org/P26630 and previous config saved to /var/cache/conftool/dbconfig/20220426-220234-ladsgroup.json [22:02:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:05:57] 10SRE, 
10ops-codfw, 10DC-Ops, 10GitLab (Infrastructure): Q3:(Need By: TBD) rack/setup/install gitlab200[2|3] and gitlab-runner200[2|3|4] - https://phabricator.wikimedia.org/T301183 (10Papaul) [22:17:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135', diff saved to https://phabricator.wikimedia.org/P26631 and previous config saved to /var/cache/conftool/dbconfig/20220426-221739-ladsgroup.json [22:17:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:18:10] (03Merged) 10jenkins-bot: Re-apply "Backport video landing page changes" [extensions/GrowthExperiments] (wmf/1.39.0-wmf.8) - 10https://gerrit.wikimedia.org/r/785941 (owner: 10Gergő Tisza) [22:18:57] (03CR) 10Gergő Tisza: "Error was "Build timed out (after 60 minutes). Marking the build as failed." for the selenium job." [extensions/GrowthExperiments] (wmf/1.39.0-wmf.8) - 10https://gerrit.wikimedia.org/r/785944 (owner: 10Gergő Tisza) [22:19:00] (03CR) 10Gergő Tisza: [V: 03+2] Re-apply "Add Link: Add 'excluded sections' task setting" [extensions/GrowthExperiments] (wmf/1.39.0-wmf.8) - 10https://gerrit.wikimedia.org/r/785944 (owner: 10Gergő Tisza) [22:25:10] !log tgr@deploy1002 Started scap: backport with i18n changes: [[gerrit:785944]], [[gerrit:785941]] [22:25:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:32:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135', diff saved to https://phabricator.wikimedia.org/P26632 and previous config saved to /var/cache/conftool/dbconfig/20220426-223244-ladsgroup.json [22:32:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:37:55] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale 
[22:46:50] !log tgr@deploy1002 Finished scap: backport with i18n changes: [[gerrit:785944]], [[gerrit:785941]] (duration: 21m 40s) [22:46:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:47:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135 (T306560)', diff saved to https://phabricator.wikimedia.org/P26633 and previous config saved to /var/cache/conftool/dbconfig/20220426-224749-ladsgroup.json [22:47:51] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1119.eqiad.wmnet with reason: Maintenance [22:47:52] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1119.eqiad.wmnet with reason: Maintenance [22:47:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:47:55] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560 [22:47:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1119 (T306560)', diff saved to https://phabricator.wikimedia.org/P26634 and previous config saved to /var/cache/conftool/dbconfig/20220426-224757-ladsgroup.json [22:47:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:48:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:48:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:48:25] 10SRE, 10Generated Data Platform, 10Image-Suggestions, 10serviceops, and 2 others: Blubber setup for Image Suggestions Service - https://phabricator.wikimedia.org/T305155 (10Dzahn) for updates here also see T304891#7869885 It seems you have already requested the Gerrit repo. 
[22:48:25] RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:48:43] (03CR) 10Gergő Tisza: [V: 03+2 C: 03+2] Enable SkinAddFooterLinks hook [extensions/GrowthExperiments] (wmf/1.39.0-wmf.8) - 10https://gerrit.wikimedia.org/r/786366 (owner: 10Gergő Tisza) [22:50:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119 (T306560)', diff saved to https://phabricator.wikimedia.org/P26635 and previous config saved to /var/cache/conftool/dbconfig/20220426-225030-ladsgroup.json [22:50:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:51:24] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [22:51:27] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [22:51:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:51:28] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [22:51:32] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [22:51:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:51:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:51:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:53:20] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1113.eqiad.wmnet with reason: Maintenance [22:53:21] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1113.eqiad.wmnet with reason: Maintenance [22:53:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:53:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1113:3316 (T298556)', diff saved to https://phabricator.wikimedia.org/P26636 and previous config saved to 
/var/cache/conftool/dbconfig/20220426-225326-ladsgroup.json [22:53:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:53:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:53:34] T298556: Fix mismatching field type of oldimage.oi_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298556 [22:54:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316 (T298556)', diff saved to https://phabricator.wikimedia.org/P26637 and previous config saved to /var/cache/conftool/dbconfig/20220426-225437-ladsgroup.json [22:54:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:56:34] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [22:56:37] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [22:56:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:56:39] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [22:56:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:56:42] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [22:56:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:56:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:57:21] !log tgr@deploy1002 Synchronized php-1.39.0-wmf.8/extensions/GrowthExperiments/extension.json: Backport: [[gerrit:786366|Enable SkinAddFooterLinks hook]] (duration: 00m 51s) [22:57:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:59:32] an hour behind schedule (90 minutes CI time for a patch must be a new record) but done [23:00:07] RECOVERY - Disk space on grafana2001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space 
https://grafana.wikimedia.org/d/000000377/host-overview?var-server=grafana2001&var-datasource=codfw+prometheus/ops
[23:01:46] (PS1) Dzahn: add image-suggestion.discovery.wmnet and point to ingress-wikikube [dns] - https://gerrit.wikimedia.org/r/786426 (https://phabricator.wikimedia.org/T304891)
[23:01:55] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[23:03:45] (PS2) Dzahn: add image-suggestion.discovery.wmnet and point to ingress-wikikube [dns] - https://gerrit.wikimedia.org/r/786426 (https://phabricator.wikimedia.org/T304891)
[23:05:31] (CR) Dzahn: "can step 5 be done before step 4 in https://wikitech.wikimedia.org/wiki/Kubernetes#Add_a_new_service? I have questions about step 4 and wh" [deployment-charts] - https://gerrit.wikimedia.org/r/775964 (https://phabricator.wikimedia.org/T304891) (owner: Dzahn)
[23:05:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119', diff saved to https://phabricator.wikimedia.org/P26638 and previous config saved to /var/cache/conftool/dbconfig/20220426-230535-ladsgroup.json
[23:05:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:09:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316', diff saved to https://phabricator.wikimedia.org/P26639 and previous config saved to /var/cache/conftool/dbconfig/20220426-230942-ladsgroup.json
[23:09:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:10:47] RECOVERY - SSH on wtp1045.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[23:20:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119', diff saved to https://phabricator.wikimedia.org/P26640 and previous config saved to /var/cache/conftool/dbconfig/20220426-232040-ladsgroup.json
[23:20:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:22:45] (KubernetesRsyslogDown) firing: (4) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[23:24:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316', diff saved to https://phabricator.wikimedia.org/P26641 and previous config saved to /var/cache/conftool/dbconfig/20220426-232447-ladsgroup.json
[23:24:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:24:56] SRE: role_contacts (service owners) as a custom puppet fact - https://phabricator.wikimedia.org/T306830 (Dzahn) >>! In T306830#7880182, @jbond wrote: >> use cumin to ask "what is the kernel version of all machines owned by $subteam" or "which hosts owned by $subteam are still on buster" > As we pass this val...
[23:29:14] SRE: role_contacts (service owners) as a custom puppet fact - https://phabricator.wikimedia.org/T306830 (Dzahn) >>! In T306830#7880221, @MoritzMuehlenhoff wrote: > And if that syntax is too cumbersome in the day-to-day we could add a few Cumin aliases? like A:hosts-data-persistence and A:hosts-infrastructure...
[23:35:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119 (T306560)', diff saved to https://phabricator.wikimedia.org/P26642 and previous config saved to /var/cache/conftool/dbconfig/20220426-233545-ladsgroup.json
[23:35:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:35:51] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2103.codfw.wmnet with reason: Maintenance
[23:35:52] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2103.codfw.wmnet with reason: Maintenance
[23:35:52] T306560: Fix nullability of img_major_mime and oi_major_mime - https://phabricator.wikimedia.org/T306560
[23:35:54] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 14 hosts with reason: Maintenance
[23:35:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:36:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:36:03] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 14 hosts with reason: Maintenance
[23:36:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:36:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:36:36] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1099.eqiad.wmnet with reason: Maintenance
[23:36:37] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1099.eqiad.wmnet with reason: Maintenance
[23:36:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:36:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1099:3311 (T306560)', diff saved to https://phabricator.wikimedia.org/P26643 and previous config saved to /var/cache/conftool/dbconfig/20220426-233642-ladsgroup.json
[23:36:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:36:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:39:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311 (T306560)', diff saved to https://phabricator.wikimedia.org/P26644 and previous config saved to /var/cache/conftool/dbconfig/20220426-233917-ladsgroup.json
[23:39:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:39:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316 (T298556)', diff saved to https://phabricator.wikimedia.org/P26645 and previous config saved to /var/cache/conftool/dbconfig/20220426-233953-ladsgroup.json
[23:39:54] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1180.eqiad.wmnet with reason: Maintenance
[23:39:56] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1180.eqiad.wmnet with reason: Maintenance
[23:39:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:39:59] T298556: Fix mismatching field type of oldimage.oi_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298556
[23:40:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1180 (T298556)', diff saved to https://phabricator.wikimedia.org/P26646 and previous config saved to /var/cache/conftool/dbconfig/20220426-234000-ladsgroup.json
[23:40:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:40:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:40:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:42:13] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1161.eqiad.wmnet with reason: Maintenance
[23:42:14] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1161.eqiad.wmnet with reason: Maintenance
[23:42:16] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[23:42:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:42:19] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[23:42:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:42:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1161 (T298554)', diff saved to https://phabricator.wikimedia.org/P26647 and previous config saved to /var/cache/conftool/dbconfig/20220426-234224-ladsgroup.json
[23:42:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:42:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:42:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:42:34] T298554: Fix mismatching field type of archive.ar_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298554
[23:42:48] (PS1) Dzahn: cumin: add "owner" aliases to get lists of host per SRE subteam [puppet] - https://gerrit.wikimedia.org/r/786430 (https://phabricator.wikimedia.org/T306830)
[23:50:02] jouncebot: test
[23:50:12] jouncebot: next
[23:50:12] In 7 hour(s) and 9 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220427T0700)
[23:50:54] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[23:54:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311', diff saved to https://phabricator.wikimedia.org/P26648 and previous config saved to /var/cache/conftool/dbconfig/20220426-235422-ladsgroup.json
[23:54:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log