[00:04:04] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1038347 (owner: TrainBranchBot)
[00:06:12] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P63968 and previous config saved to /var/cache/conftool/dbconfig/20240604-000612-ladsgroup.json
[00:21:20] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1167 (T352010)', diff saved to https://phabricator.wikimedia.org/P63969 and previous config saved to /var/cache/conftool/dbconfig/20240604-002119-ladsgroup.json
[00:21:22] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance
[00:21:23] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[00:21:35] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance
[00:43:21] (PS1) Bking: data-platform: add alert for WDQS MaxLag [alerts] - https://gerrit.wikimedia.org/r/1038454 (https://phabricator.wikimedia.org/T361114)
[01:07:59] (PS1) TrainBranchBot: Branch commit for wmf/1.43.0-wmf.8 [core] (wmf/1.43.0-wmf.8) - https://gerrit.wikimedia.org/r/1038348 (https://phabricator.wikimedia.org/T361402)
[01:08:02] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/1.43.0-wmf.8 [core] (wmf/1.43.0-wmf.8) - https://gerrit.wikimedia.org/r/1038348 (https://phabricator.wikimedia.org/T361402) (owner: TrainBranchBot)
[01:16:45] FIRING: [2x] CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate
[01:21:45] RESOLVED: [2x]
CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate
[01:30:06] (Merged) jenkins-bot: Branch commit for wmf/1.43.0-wmf.8 [core] (wmf/1.43.0-wmf.8) - https://gerrit.wikimedia.org/r/1038348 (https://phabricator.wikimedia.org/T361402) (owner: TrainBranchBot)
[01:45:45] FIRING: [2x] CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate
[01:55:45] RESOLVED: [2x] CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate
[02:00:04] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240604T0200)
[02:10:40] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:38:43] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:47:45] FIRING: CirrusConsumerRerenderFetchErrorRate:
cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): ...
[02:47:45] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate
[02:55:44] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:57:45] RESOLVED: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): ...
[02:57:45] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate
[03:00:04] Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240604T0300)
[03:01:43] (PS1) TrainBranchBot: testwikis wikis to 1.43.0-wmf.8 [mediawiki-config] - https://gerrit.wikimedia.org/r/1038461 (https://phabricator.wikimedia.org/T361402)
[03:01:45] (CR) TrainBranchBot: [C:+2] testwikis wikis to 1.43.0-wmf.8 [mediawiki-config] - https://gerrit.wikimedia.org/r/1038461 (https://phabricator.wikimedia.org/T361402) (owner: TrainBranchBot)
[03:02:26] (Merged) jenkins-bot: testwikis wikis to 1.43.0-wmf.8 [mediawiki-config] - https://gerrit.wikimedia.org/r/1038461
(https://phabricator.wikimedia.org/T361402) (owner: TrainBranchBot)
[03:03:00] !log mwpresync@deploy1002 Started scap: testwikis wikis to 1.43.0-wmf.8 refs T361402
[03:03:03] T361402: 1.43.0-wmf.8 deployment blockers - https://phabricator.wikimedia.org/T361402
[03:05:25] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:08:35] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2212.codfw.wmnet with reason: Maintenance
[03:08:59] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2212.codfw.wmnet with reason: Maintenance
[03:09:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2212 (T364299)', diff saved to https://phabricator.wikimedia.org/P63970 and previous config saved to /var/cache/conftool/dbconfig/20240604-030906-marostegui.json
[03:09:11] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299
[03:11:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2212 (T364299)', diff saved to https://phabricator.wikimedia.org/P63971 and previous config saved to /var/cache/conftool/dbconfig/20240604-031117-marostegui.json
[03:26:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2212', diff saved to https://phabricator.wikimedia.org/P63972 and previous config saved to /var/cache/conftool/dbconfig/20240604-032625-marostegui.json
[03:41:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2212', diff saved to https://phabricator.wikimedia.org/P63973 and previous config saved to /var/cache/conftool/dbconfig/20240604-034132-marostegui.json
[03:43:45] FIRING: [2x] CirrusConsumerRerenderFetchErrorRate:
cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate
[03:48:45] RESOLVED: [2x] CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate
[03:56:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2212 (T364299)', diff saved to https://phabricator.wikimedia.org/P63974 and previous config saved to /var/cache/conftool/dbconfig/20240604-035640-marostegui.json
[03:56:43] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2216.codfw.wmnet with reason: Maintenance
[03:56:44] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299
[03:56:47] !log mwpresync@deploy1002 Finished scap: testwikis wikis to 1.43.0-wmf.8 refs T361402 (duration: 53m 47s)
[03:56:50] T361402: 1.43.0-wmf.8 deployment blockers - https://phabricator.wikimedia.org/T361402
[03:56:56] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2216.codfw.wmnet with reason: Maintenance
[03:57:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2216 (T364299)', diff saved to https://phabricator.wikimedia.org/P63975 and previous config saved to /var/cache/conftool/dbconfig/20240604-035703-marostegui.json
[04:00:04] Deploy window Automatic removal of all obsolete MediaWiki versions from the deployment and bare metal servers (except the most-recent obsolete version) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240604T0400)
[04:01:01] !log mwpresync@deploy1002 Pruned MediaWiki: 1.43.0-wmf.5 (duration: 00m 57s)
[04:10:45] FIRING:
CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): ...
[04:10:45] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate
[04:15:45] RESOLVED: [2x] CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate
[04:17:19] (PS1) Stevemunene: Remove pod name from superset dashboard [alerts] - https://gerrit.wikimedia.org/r/1038349
[04:18:57] (CR) CI reject: [V:-1] Remove pod name from superset dashboard [alerts] - https://gerrit.wikimedia.org/r/1038349 (owner: Stevemunene)
[04:19:50] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1246.eqiad.wmnet with reason: Maintenance
[04:20:04] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1246.eqiad.wmnet with reason: Maintenance
[04:20:12] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1246 (T352010)', diff saved to https://phabricator.wikimedia.org/P63976 and previous config saved to /var/cache/conftool/dbconfig/20240604-042011-ladsgroup.json
[04:20:14] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[04:34:25] (PS1) BryanDavis: plugins: Add wm-schedule-deployment plugin [software/gerrit] (deploy/wmf/stable-3.8) - https://gerrit.wikimedia.org/r/1038465 (https://phabricator.wikimedia.org/T366512)
[04:34:51] (CR) CI reject: [V:-1] plugins: Add wm-schedule-deployment plugin
[software/gerrit] (deploy/wmf/stable-3.8) - https://gerrit.wikimedia.org/r/1038465 (https://phabricator.wikimedia.org/T366512) (owner: BryanDavis)
[04:40:28] (PS2) BryanDavis: plugins: Add wm-schedule-deployment plugin [software/gerrit] (deploy/wmf/stable-3.8) - https://gerrit.wikimedia.org/r/1038465 (https://phabricator.wikimedia.org/T366512)
[04:40:32] (PS1) Stevemunene: add datahub availability monitor [alerts] - https://gerrit.wikimedia.org/r/1038350 (https://phabricator.wikimedia.org/T363301)
[04:41:42] (CR) CI reject: [V:-1] add datahub availability monitor [alerts] - https://gerrit.wikimedia.org/r/1038350 (https://phabricator.wikimedia.org/T363301) (owner: Stevemunene)
[04:55:10] PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 43 probes of 786 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[05:07:32] (PS1) Marostegui: query_all_hosts.sh: Added to repo [software] - https://gerrit.wikimedia.org/r/1038466
[05:10:10] RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 32 probes of 786 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[05:15:25] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:27:06] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on 35 hosts with reason: Primary switchover s1 T366259
[05:27:09] T366259: Switchover s1 master (db1184 -> db1163) - https://phabricator.wikimedia.org/T366259
[05:27:36] !log arnaudb@cumin1002 END (PASS) -
Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 35 hosts with reason: Primary switchover s1 T366259
[05:28:04] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Set db1163 with weight 0 T366259', diff saved to https://phabricator.wikimedia.org/P63977 and previous config saved to /var/cache/conftool/dbconfig/20240604-052803-arnaudb.json
[05:32:10] PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 46 probes of 786 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[05:32:12] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:32:36] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:34:04] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8616 bytes in 0.243 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:34:26] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 52065 bytes in 0.075 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:49:42] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (gerrit1003), Fresh: 143 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[05:50:58] (PS1) Marostegui: db1168: Remove package declaration [puppet] - https://gerrit.wikimedia.org/r/1038471
[05:51:21] (CR) Marostegui: [C:+2] db1168: Remove package declaration [puppet] - https://gerrit.wikimedia.org/r/1038471 (owner: Marostegui)
[05:52:08] RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 35 probes of 786 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts
https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[05:59:08] PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 43 probes of 786 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[06:00:04] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240604T0600)
[06:00:05] kormat, marostegui, Amir1, and arnaudb: Time to snap out of that daydream and deploy Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240604T0600).
[06:00:27] (CR) Arnaudb: [C:+2] mariadb: Promote db1163 to s1 master [puppet] - https://gerrit.wikimedia.org/r/1036604 (https://phabricator.wikimedia.org/T366259) (owner: Gerrit maintenance bot)
[06:01:56] !log Starting s1 eqiad failover from db1184 to db1163 - T366259
[06:01:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:01:59] T366259: Switchover s1 master (db1184 -> db1163) - https://phabricator.wikimedia.org/T366259
[06:02:08] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Set s1 eqiad as read-only for maintenance - T366259', diff saved to https://phabricator.wikimedia.org/P63978 and previous config saved to /var/cache/conftool/dbconfig/20240604-060208-arnaudb.json
[06:03:25] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Promote db1163 to s1 primary and set section read-write T366259', diff saved to https://phabricator.wikimedia.org/P63979 and previous config saved to /var/cache/conftool/dbconfig/20240604-060324-arnaudb.json
[06:04:21] FIRING: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops -
https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:05:34] PROBLEM - IPv4 ping to esams on ripe-atlas-esams is CRITICAL: CRITICAL - failed 51 probes of 788 (alerts on 35) - https://atlas.ripe.net/measurements/59935536/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[06:06:07] (CR) Arnaudb: [C:+2] wmnet: Update s1-master alias [dns] - https://gerrit.wikimedia.org/r/1036605 (https://phabricator.wikimedia.org/T366259) (owner: Gerrit maintenance bot)
[06:06:13] (PS2) Gerrit maintenance bot: wmnet: Update s1-master alias [dns] - https://gerrit.wikimedia.org/r/1036605 (https://phabricator.wikimedia.org/T366259)
[06:06:14] (CR) Arnaudb: [V:+2 C:+2] wmnet: Update s1-master alias [dns] - https://gerrit.wikimedia.org/r/1036605 (https://phabricator.wikimedia.org/T366259) (owner: Gerrit maintenance bot)
[06:07:03] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depool db1184 T366259', diff saved to https://phabricator.wikimedia.org/P63980 and previous config saved to /var/cache/conftool/dbconfig/20240604-060703-arnaudb.json
[06:07:06] T366259: Switchover s1 master (db1184 -> db1163) - https://phabricator.wikimedia.org/T366259
[06:07:47] !log arnaudb@cumin1002 dbctl commit (dc=all): 'API db1163 T366259', diff saved to https://phabricator.wikimedia.org/P63981 and previous config saved to /var/cache/conftool/dbconfig/20240604-060747-arnaudb.json
[06:09:21] RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:09:25] !log arnaudb@cumin1002 dbctl commit (dc=all): ' fix api db1163 vs db1184 T366259', diff saved to https://phabricator.wikimedia.org/P63982 and
previous config saved to /var/cache/conftool/dbconfig/20240604-060925-arnaudb.json
[06:10:38] RECOVERY - IPv4 ping to esams on ripe-atlas-esams is OK: OK - failed 8 probes of 788 (alerts on 35) - https://atlas.ripe.net/measurements/59935536/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[06:14:47] !log Rename table flaggedpage_pending on db1185 (s5 eqiad dbmaint) - T365568
[06:14:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:14:50] T365568: Drop flaggedpage_pending from production - https://phabricator.wikimedia.org/T365568
[06:24:18] RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 34 probes of 786 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[06:26:05] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 3:00:00 on db1184.eqiad.wmnet with reason: reimage
[06:26:07] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on db1184.eqiad.wmnet with reason: reimage
[06:26:33] !log arnaudb@cumin1002 START - Cookbook sre.hosts.reimage for host db1184.eqiad.wmnet with OS bookworm
[06:31:25] PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 69 probes of 786 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[06:35:09] (PS1) Slyngshede: New menu [software/bitu] - https://gerrit.wikimedia.org/r/1038608
[06:40:05] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1184.eqiad.wmnet with reason: host reimage
[06:41:11] RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 35 probes of 786 (alerts on 35) -
https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[06:43:25] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1184.eqiad.wmnet with reason: host reimage
[06:44:36] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:48:00] (PS1) Marostegui: db1184: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1038610
[06:48:10] PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 52 probes of 786 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[06:48:35] (CR) Alexandros Kosiaris: [C:+1] Update user-agent string in citoid to be like Zot [deployment-charts] - https://gerrit.wikimedia.org/r/1034860 (https://phabricator.wikimedia.org/T366093) (owner: Mvolz)
[06:48:37] (CR) Marostegui: [C:+2] db1184: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1038610 (owner: Marostegui)
[06:48:46] (CR) Muehlenhoff: [C:+2] Remove Hiera entry [puppet] - https://gerrit.wikimedia.org/r/1038339 (owner: Muehlenhoff)
[06:49:40] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 144 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[06:50:20] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:50:47] (CR) Muehlenhoff: [V:+2 C:+2] Remove obsolete stub cert [labs/private] - https://gerrit.wikimedia.org/r/1038380 (owner: Muehlenhoff)
[06:51:28] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:51:29] (CR) Muehlenhoff: [C:+2] Remove obsolete Hiera entries [puppet] - https://gerrit.wikimedia.org/r/1038109 (owner: Muehlenhoff)
[06:53:18] RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 26 probes of 786 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[06:53:22] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Tue 13 Aug 2024 12:55:14 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:53:34] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 52067 bytes in 1.970 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:54:12] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8616 bytes in 0.276 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:00:25] PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 42 probes of 786 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[07:05:13] RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 35 probes of 786 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[07:05:20] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1184.eqiad.wmnet with OS bookworm
[07:06:00] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1184.eqiad.wmnet with reason: Maintenance
[07:06:02] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime
(exit_code=0) for 12:00:00 on db1184.eqiad.wmnet with reason: Maintenance
[07:09:17] (CR) DCausse: [C:+1] "lgtm but no idea if duplicating the same alerts for multiple teams is the right approach, I fear that over time the alerts might diverge" [alerts] - https://gerrit.wikimedia.org/r/1038454 (https://phabricator.wikimedia.org/T361114) (owner: Bking)
[07:10:46] !log dbmaint eqiad s1 deploy schema change on db1184 T355609
[07:10:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:10:49] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609
[07:15:10] !log installing intel-microcode updates on bullseye
[07:15:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:17:11] PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 43 probes of 786 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[07:23:59] (PS2) Stevemunene: add datahub availability monitor [alerts] - https://gerrit.wikimedia.org/r/1038350 (https://phabricator.wikimedia.org/T363301)
[07:24:39] (PS2) Stevemunene: Remove pod name from superset dashboard [alerts] - https://gerrit.wikimedia.org/r/1038349
[07:25:10] (CR) CI reject: [V:-1] add datahub availability monitor [alerts] - https://gerrit.wikimedia.org/r/1038350 (https://phabricator.wikimedia.org/T363301) (owner: Stevemunene)
[07:26:23] (CR) CI reject: [V:-1] Remove pod name from superset dashboard [alerts] - https://gerrit.wikimedia.org/r/1038349 (owner: Stevemunene)
[07:27:58] !log dbmaint eqiad s1 deploy schema change on db1184 T356166
[07:28:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:28:02] T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166
[07:28:53] SRE,
SRE-Access-Requests: Requesting access to analytics-privatedata-users for Ifrahkhanyaree_WMDE - https://phabricator.wikimedia.org/T366558 (Ifrahkhanyaree_WMDE) NEW
[07:29:10] (PS4) Hashar: Switch Gerrit to Java 17 [puppet] - https://gerrit.wikimedia.org/r/1038249 (https://phabricator.wikimedia.org/T364342) (owner: Muehlenhoff)
[07:29:43] (Abandoned) Ayounsi: Update Netbox to v2.10.9-wmf2 [software/netbox-deploy] - https://gerrit.wikimedia.org/r/679403 (owner: CRusnov)
[07:30:09] (Abandoned) Ayounsi: nbdeviceinfo.py: Add simple command-line host dump [software/netbox-deploy] - https://gerrit.wikimedia.org/r/525165 (owner: CRusnov)
[07:31:56] (PS3) Stevemunene: add datahub availability monitor [alerts] - https://gerrit.wikimedia.org/r/1038350 (https://phabricator.wikimedia.org/T363301)
[07:33:07] (CR) CI reject: [V:-1] add datahub availability monitor [alerts] - https://gerrit.wikimedia.org/r/1038350 (https://phabricator.wikimedia.org/T363301) (owner: Stevemunene)
[07:37:20] (CR) Muehlenhoff: [C:+2] Switch Gerrit to Java 17 [puppet] - https://gerrit.wikimedia.org/r/1038249 (https://phabricator.wikimedia.org/T364342) (owner: Muehlenhoff)
[07:38:31] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2216 (T364299)', diff saved to https://phabricator.wikimedia.org/P63983 and previous config saved to /var/cache/conftool/dbconfig/20240604-073830-marostegui.json
[07:38:35] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299
[07:42:08] (PS4) Stevemunene: add datahub availability monitor [alerts] - https://gerrit.wikimedia.org/r/1038350 (https://phabricator.wikimedia.org/T363301)
[07:42:13] !log jiji@cumin1002 START - Cookbook sre.hosts.reimage for host mc-wf1002.eqiad.wmnet with OS bookworm
[07:42:25] !log jiji@cumin2002 START - Cookbook sre.hosts.reimage for host mc-wf2002.codfw.wmnet with OS bookworm
[07:42:42] (CR) Arnaudb: [C:+1] cephadm: allow ssh from
all mgrs to all targets [puppet] - https://gerrit.wikimedia.org/r/1038391 (https://phabricator.wikimedia.org/T279621) (owner: MVernon)
[07:43:15] PROBLEM - gerrit process on gerrit2002 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/lib/jvm/java-17-openjdk-amd64/bin/java .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war daemon -d /var/lib/gerrit2/review_site https://wikitech.wikimedia.org/wiki/Gerrit
[07:43:19] (CR) CI reject: [V:-1] add datahub availability monitor [alerts] - https://gerrit.wikimedia.org/r/1038350 (https://phabricator.wikimedia.org/T363301) (owner: Stevemunene)
[07:43:27] I may add a patch to the window
[07:43:58] note: we are upgrading gerrit to java 17 :)
[07:45:15] RECOVERY - gerrit process on gerrit2002 is OK: PROCS OK: 1 process with regex args ^/usr/lib/jvm/java-17-openjdk-amd64/bin/java .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war daemon -d /var/lib/gerrit2/review_site https://wikitech.wikimedia.org/wiki/Gerrit
[07:46:57] !log hashar@deploy1002 Started deploy [gerrit/gerrit@6ba3f2e]: gerrit2002: switch to Java 17 version of plugins after having switched Java to 17- T364342
[07:47:01] T364342: Switch Gerrit from Java 11 to Java 17 - https://phabricator.wikimedia.org/T364342
[07:47:02] !log hashar@deploy1002 Finished deploy [gerrit/gerrit@6ba3f2e]: gerrit2002: switch to Java 17 version of plugins after having switched Java to 17- T364342 (duration: 00m 05s)
[07:48:02] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for Ifrahkhanyaree_WMDE - https://phabricator.wikimedia.org/T366558#9858183 (WMDE-leszek) I approve the request on WMDE's end, thank you
[07:48:20] (PS5) Stevemunene: add datahub availability monitor [alerts] - https://gerrit.wikimedia.org/r/1038350 (https://phabricator.wikimedia.org/T363301)
[07:49:09] hashar: ah, can I not go ahead with the backport now?
[07:49:49] (PS1) Kosta Harlan: IPReputationHooks: Bump schema version [extensions/WikimediaEvents] (wmf/1.43.0-wmf.7) - https://gerrit.wikimedia.org/r/1038633 (https://phabricator.wikimedia.org/T354597)
[07:50:02] (PS1) Kosta Harlan: IPReputationHooks: Bump schema version [extensions/WikimediaEvents] (wmf/1.43.0-wmf.8) - https://gerrit.wikimedia.org/r/1038634 (https://phabricator.wikimedia.org/T354597)
[07:50:57] (PS3) Stevemunene: Remove pod name from superset dashboard [alerts] - https://gerrit.wikimedia.org/r/1038349
[07:52:07] (CR) CI reject: [V:-1] Remove pod name from superset dashboard [alerts] - https://gerrit.wikimedia.org/r/1038349 (owner: Stevemunene)
[07:52:11] hashar: or, can I proceed right after you're done with the upgrade?
[07:52:23] yes yes
[07:52:36] we are about to restart the primary gerrit
[07:53:03] hashar: ok please let me know when you're done
[07:53:10] and good luck :)
[07:53:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2216', diff saved to https://phabricator.wikimedia.org/P63984 and previous config saved to /var/cache/conftool/dbconfig/20240604-075338-marostegui.json
[07:53:51] (CR) MVernon: [C:+2] cephadm: allow ssh from all mgrs to all targets [puppet] - https://gerrit.wikimedia.org/r/1038391 (https://phabricator.wikimedia.org/T279621) (owner: MVernon)
[07:54:25] (PS4) Stevemunene: Remove pod name from superset dashboard [alerts] - https://gerrit.wikimedia.org/r/1038349
[07:55:59] !log hashar@deploy1002 Started deploy [gerrit/gerrit@6ba3f2e]: gerrit1003: switch to Java 17 version of plugins after having switched Java to 17- T364342
[07:56:02] T364342: Switch Gerrit from Java 11 to Java 17 - https://phabricator.wikimedia.org/T364342
[07:56:03] !log hashar@deploy1002 Finished deploy [gerrit/gerrit@6ba3f2e]: gerrit1003: switch to Java 17 version of plugins after having switched Java to 17- T364342 (duration: 00m 03s)
[07:56:29] !log Restarting Gerrit for
Java 17 upgrade # T364342
[07:56:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:56:33] SRE, serviceops, Patch-For-Review: systemd-coredump can make a system unresponsive - https://phabricator.wikimedia.org/T236253#9858230 (jijiki) >>! In T236253#9856381, @Dzahn wrote: > I talked a bit about this in #systemd IRC channel. Mostly to ask if the config is irrelevant as long as the package i...
[07:57:04] !log jiji@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mc-wf1002.eqiad.wmnet with reason: host reimage
[07:59:53] kostajh: we have upgraded Gerrit to Java 17 :)
[08:00:31] SRE, SRE-swift-storage, Infrastructure-Foundations: moss-be1003 "Warning: The current total number of facts: 2830 exceeds the number of facts limit: 2048" - https://phabricator.wikimedia.org/T366563 (MatthewVernon) NEW
[08:00:31] nice thanks!
[08:00:33] \o/
[08:00:39] hashar: can I go ahead with the backports?
[08:00:42] SRE, SRE-swift-storage, Infrastructure-Foundations: moss-be1003 "Warning: The current total number of facts: 2830 exceeds the number of facts limit: 2048" - https://phabricator.wikimedia.org/T366563#9858260 (MatthewVernon) p:Triage→High
[08:00:46] yes :)
[08:01:11] !log jiji@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mc-wf2002.codfw.wmnet with reason: host reimage
[08:01:41] (CR) TrainBranchBot: [C:+2] "Approved by kharlan@deploy1002 using scap backport" [extensions/WikimediaEvents] (wmf/1.43.0-wmf.7) - https://gerrit.wikimedia.org/r/1038633 (https://phabricator.wikimedia.org/T354597) (owner: Kosta Harlan)
[08:02:40] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc-wf1002.eqiad.wmnet with reason: host reimage
[08:03:25] (PS1) Effie Mouzeli: mediawiki: rename cache.mcrouter.deployment [deployment-charts] - https://gerrit.wikimedia.org/r/1038687
[08:03:39] (Abandoned) Effie Mouzeli: mediawiki: rename cache.mcrouter.deployment
[deployment-charts] - https://gerrit.wikimedia.org/r/1000039 (owner: Effie Mouzeli)
[08:04:29] (Merged) jenkins-bot: IPReputationHooks: Bump schema version [extensions/WikimediaEvents] (wmf/1.43.0-wmf.7) - https://gerrit.wikimedia.org/r/1038633 (https://phabricator.wikimedia.org/T354597) (owner: Kosta Harlan)
[08:05:23] !log kharlan@deploy1002 Started scap: Backport for [[gerrit:1038633|IPReputationHooks: Bump schema version (T354597)]]
[08:05:27] T354597: Record IP reputation data for account creations and edits - https://phabricator.wikimedia.org/T354597
[08:06:06] !log jiji@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc-wf2002.codfw.wmnet with reason: host reimage
[08:06:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1221 (T364069)', diff saved to https://phabricator.wikimedia.org/P63985 and previous config saved to /var/cache/conftool/dbconfig/20240604-080617-marostegui.json
[08:06:21] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069
[08:07:02] (PS1) Daniel Kinzler: Set LinterParseOnDerivedDataUpdate to false [mediawiki-config] - https://gerrit.wikimedia.org/r/1038688 (https://phabricator.wikimedia.org/T361013)
[08:08:10] !log kharlan@deploy1002 kharlan: Backport for [[gerrit:1038633|IPReputationHooks: Bump schema version (T354597)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[08:08:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2216', diff saved to https://phabricator.wikimedia.org/P63986 and previous config saved to /var/cache/conftool/dbconfig/20240604-080846-marostegui.json
[08:09:23] (CR) Jelto: [C:+1] "I'm fine with both defaults, using staticttendril:main or use the security-landing-page:latest" [deployment-charts] - https://gerrit.wikimedia.org/r/1038251 (https://phabricator.wikimedia.org/T362518) (owner: Clément Goubert)
[08:10:59] !log kharlan@deploy1002
kharlan: Continuing with sync
[08:11:21] (PS1) Hashar: gerrit: move java_home hiera setting to role [puppet] - https://gerrit.wikimedia.org/r/1038690
[08:12:37] PROBLEM - statsv process on webperf2003 is CRITICAL: PROCS CRITICAL: 0 processes with command name python3, args statsv https://wikitech.wikimedia.org/wiki/Graphite%23statsv
[08:13:37] RECOVERY - statsv process on webperf2003 is OK: PROCS OK: 2 processes with command name python3, args statsv https://wikitech.wikimedia.org/wiki/Graphite%23statsv
[08:17:37] (PS1) Brouberol: datahub: enable internal registry for all releases [deployment-charts] - https://gerrit.wikimedia.org/r/1038691
[08:19:32] !log kharlan@deploy1002 Finished scap: Backport for [[gerrit:1038633|IPReputationHooks: Bump schema version (T354597)]] (duration: 14m 08s)
[08:19:35] T354597: Record IP reputation data for account creations and edits - https://phabricator.wikimedia.org/T354597
[08:19:58] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc-wf1002.eqiad.wmnet with OS bookworm
[08:20:11] (CR) TrainBranchBot: [C:+2] "Approved by kharlan@deploy1002 using scap backport" [extensions/WikimediaEvents] (wmf/1.43.0-wmf.8) - https://gerrit.wikimedia.org/r/1038634 (https://phabricator.wikimedia.org/T354597) (owner: Kosta Harlan)
[08:21:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1221', diff saved to https://phabricator.wikimedia.org/P63987 and previous config saved to /var/cache/conftool/dbconfig/20240604-082125-marostegui.json
[08:22:33] (Merged) jenkins-bot: IPReputationHooks: Bump schema version [extensions/WikimediaEvents] (wmf/1.43.0-wmf.8) - https://gerrit.wikimedia.org/r/1038634 (https://phabricator.wikimedia.org/T354597) (owner: Kosta Harlan)
[08:23:05] !log kharlan@deploy1002 Started scap: Backport for [[gerrit:1038634|IPReputationHooks: Bump schema version (T354597)]]
[08:23:55] !log marostegui@cumin1002 dbctl commit (dc=all):
'Repooling after maintenance db2216 (T364299)', diff saved to https://phabricator.wikimedia.org/P63988 and previous config saved to /var/cache/conftool/dbconfig/20240604-082354-marostegui.json
[08:23:56] !log jiji@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc-wf2002.codfw.wmnet with OS bookworm
[08:24:11] (PS1) Jcrespo: dbbackups: Change correct port for s1 backups [puppet] - https://gerrit.wikimedia.org/r/1038693 (https://phabricator.wikimedia.org/T362509)
[08:24:35] (CR) Brouberol: [C:+1] "Nicely done!" [alerts] - https://gerrit.wikimedia.org/r/1038350 (https://phabricator.wikimedia.org/T363301) (owner: Stevemunene)
[08:25:12] (CR) Brouberol: [C:+1] "Indeed, that could not have worked. Sorry it slipped past review in the past." [alerts] - https://gerrit.wikimedia.org/r/1038349 (owner: Stevemunene)
[08:25:33] !log kharlan@deploy1002 kharlan: Backport for [[gerrit:1038634|IPReputationHooks: Bump schema version (T354597)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[08:25:39] (CR) Brouberol: "I think this one can be abandoned."
[puppet] - https://gerrit.wikimedia.org/r/1032399 (https://phabricator.wikimedia.org/T363450) (owner: Stevemunene)
[08:27:55] (PS1) Ayounsi: Netbox deploy for 4.0.2 [software/netbox-deploy] (dev) - https://gerrit.wikimedia.org/r/1038694 (https://phabricator.wikimedia.org/T336275)
[08:28:23] (PS1) Fabfur: depool text@magru before enabling IPIP encapsulation [dns] - https://gerrit.wikimedia.org/r/1038695 (https://phabricator.wikimedia.org/T366466)
[08:30:17] !log jelto@cumin1002 START - Cookbook sre.gitlab.reboot-runner rolling reboot on A:gitlab-runner
[08:30:21] (CR) Ayounsi: Netbox deploy for 4.0.2 (1 comment) [software/netbox-deploy] (dev) - https://gerrit.wikimedia.org/r/1038694 (https://phabricator.wikimedia.org/T336275) (owner: Ayounsi)
[08:30:23] !log kharlan@deploy1002 kharlan: Continuing with sync
[08:31:31] (PS2) Ayounsi: Netbox deploy for 4.0.2 [software/netbox-deploy] (dev) - https://gerrit.wikimedia.org/r/1038694 (https://phabricator.wikimedia.org/T336275)
[08:32:29] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2099.codfw.wmnet with reason: Maintenance
[08:32:42] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2099.codfw.wmnet with reason: Maintenance
[08:33:19] (PS1) Effie Mouzeli: memcached: switch to memcache user (role) [puppet] - https://gerrit.wikimedia.org/r/1038697
[08:33:45] (PS2) Effie Mouzeli: memcached: switch to memcache user (role) [puppet] - https://gerrit.wikimedia.org/r/1038697
[08:34:09] (CR) Effie Mouzeli: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1038697 (owner: Effie Mouzeli)
[08:35:51] (CR) Stevemunene: [C:+2] Remove pod name from superset dashboard [alerts] - https://gerrit.wikimedia.org/r/1038349 (owner: Stevemunene)
[08:36:00] (PS3) Effie Mouzeli: memcached: switch to memcache user (role) [puppet] - https://gerrit.wikimedia.org/r/1038697
[08:36:25] (PS1)
Fabfur: hiera: enable IPIP encapsulation on text@magru [puppet] - https://gerrit.wikimedia.org/r/1038698 (https://phabricator.wikimedia.org/T366466)
[08:36:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1221', diff saved to https://phabricator.wikimedia.org/P63989 and previous config saved to /var/cache/conftool/dbconfig/20240604-083633-marostegui.json
[08:37:02] (Merged) jenkins-bot: Remove pod name from superset dashboard [alerts] - https://gerrit.wikimedia.org/r/1038349 (owner: Stevemunene)
[08:37:10] (CR) Effie Mouzeli: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1038697 (owner: Effie Mouzeli)
[08:37:18] (CR) Fabfur: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1038698 (https://phabricator.wikimedia.org/T366466) (owner: Fabfur)
[08:37:32] (CR) Marostegui: [C:+2] query_all_hosts.sh: Added to repo [software] - https://gerrit.wikimedia.org/r/1038466 (owner: Marostegui)
[08:37:42] (PS6) Stevemunene: add datahub availability monitor [alerts] - https://gerrit.wikimedia.org/r/1038350 (https://phabricator.wikimedia.org/T363301)
[08:38:00] (Merged) jenkins-bot: query_all_hosts.sh: Added to repo [software] - https://gerrit.wikimedia.org/r/1038466 (owner: Marostegui)
[08:38:51] !log kharlan@deploy1002 Finished scap: Backport for [[gerrit:1038634|IPReputationHooks: Bump schema version (T354597)]] (duration: 15m 45s)
[08:38:55] T354597: Record IP reputation data for account creations and edits - https://phabricator.wikimedia.org/T354597
[08:39:32] (CR) Stevemunene: [C:+2] add datahub availability monitor [alerts] - https://gerrit.wikimedia.org/r/1038350 (https://phabricator.wikimedia.org/T363301) (owner: Stevemunene)
[08:40:24] !log UTC morning deploys done
[08:40:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:40:34] (PS1) Urbanecm: Drop logging level for unsupported providers to DEBUG
[extensions/CommunityConfiguration] (wmf/1.43.0-wmf.8) - https://gerrit.wikimedia.org/r/1038714 (https://phabricator.wikimedia.org/T366519)
[08:40:34] (PS2) Brouberol: datahub: enable internal registry for all releases [deployment-charts] - https://gerrit.wikimedia.org/r/1038691 (https://phabricator.wikimedia.org/T363461)
[08:40:42] (Merged) jenkins-bot: add datahub availability monitor [alerts] - https://gerrit.wikimedia.org/r/1038350 (https://phabricator.wikimedia.org/T363301) (owner: Stevemunene)
[08:41:15] (PS2) Urbanecm: Drop logging level for unsupported providers to DEBUG [extensions/CommunityConfiguration] (wmf/1.43.0-wmf.8) - https://gerrit.wikimedia.org/r/1038714 (https://phabricator.wikimedia.org/T366519)
[08:44:07] SRE, Cloud-Services, serviceops: Modernise memcached systemd unit / sync, and make it presentable - https://phabricator.wikimedia.org/T273950#9858437 (jijiki) In progress→Open The #Cloud-Services project tag is not intended to have any tasks. Please check the list on https://phabricator.wikim...
[08:44:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1156', diff saved to https://phabricator.wikimedia.org/P63990 and previous config saved to /var/cache/conftool/dbconfig/20240604-084428-root.json
[08:45:09] (PS1) Marostegui: db1156: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1038700
[08:45:29] SRE, Cloud-Services, serviceops: Modernise memcached systemd unit / sync, and make it presentable - https://phabricator.wikimedia.org/T273950#9858443 (jijiki)
[08:45:41] (PS1) Urbanecm: testwiki: Enable CommunityConfiguration [mediawiki-config] - https://gerrit.wikimedia.org/r/1038701 (https://phabricator.wikimedia.org/T360954)
[08:46:36] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host idp-test1003.wikimedia.org
[08:47:46] (CR) Stevemunene: [C:+1] "LGTM!"
[deployment-charts] - https://gerrit.wikimedia.org/r/1038691 (https://phabricator.wikimedia.org/T363461) (owner: Brouberol)
[08:50:13] (CR) Brouberol: [C:+2] datahub: enable internal registry for all releases [deployment-charts] - https://gerrit.wikimedia.org/r/1038691 (https://phabricator.wikimedia.org/T363461) (owner: Brouberol)
[08:50:23] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host idp-test1003.wikimedia.org
[08:51:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1221 (T364069)', diff saved to https://phabricator.wikimedia.org/P63991 and previous config saved to /var/cache/conftool/dbconfig/20240604-085141-marostegui.json
[08:51:44] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1238.eqiad.wmnet with reason: Maintenance
[08:51:45] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069
[08:51:52] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host install7001.wikimedia.org
[08:51:57] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1238.eqiad.wmnet with reason: Maintenance
[08:52:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1238 (T364069)', diff saved to https://phabricator.wikimedia.org/P63992 and previous config saved to /var/cache/conftool/dbconfig/20240604-085205-marostegui.json
[08:52:11] RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 31 probes of 786 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[08:52:18] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s_services/services/datahub: apply on production
[08:52:35] PROBLEM - WDQS SPARQL on wdqs1020 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[08:53:09] (PS1) Hashar: gerrit: remove mac algos no more supported by Mina SSHD [puppet] - https://gerrit.wikimedia.org/r/1038703 (https://phabricator.wikimedia.org/T366565)
[08:53:30] FIRING: [2x] ProbeDown: Service wdqs1020:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1020:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:54:07] (CR) Michael Große: [Beta] Enable CommunityConfiguration extension in all wikis (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1035473 (https://phabricator.wikimedia.org/T364892) (owner: Sergio Gimeno)
[08:54:48] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s_services/services/datahub: sync on production
[08:55:20] (CR) Hashar: "I have looked at the source code and pasted my findings at T366565#9858442" [puppet] - https://gerrit.wikimedia.org/r/1038703 (https://phabricator.wikimedia.org/T366565) (owner: Hashar)
[08:55:21] (PS1) DCausse: cirrus: relax CirrusConsumerRerenderFetchErrorRate [alerts] - https://gerrit.wikimedia.org/r/1038705
[08:56:13] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host install7001.wikimedia.org
[08:56:27] (CR) Jcrespo: [C:+2] dbbackups: Change correct port for s1 backups [puppet] - https://gerrit.wikimedia.org/r/1038693 (https://phabricator.wikimedia.org/T362509) (owner: Jcrespo)
[08:57:07] (PS2) Hashar: gerrit: move java_home hiera setting to role [puppet] - https://gerrit.wikimedia.org/r/1038690
[08:57:18] (CR) Hashar: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1038690 (owner: Hashar)
[08:58:29] RECOVERY - WDQS SPARQL on wdqs1020 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 0.077 second response
time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[09:01:44] (CR) Hashar: "> 1 hosts noop No difference or change fixed compilation" [puppet] - https://gerrit.wikimedia.org/r/1038690 (owner: Hashar)
[09:01:51] !log imported python3-xapian-haystack 2.1.1-1+deb12u1 to bookworm-wikimedia (already lined up for the next Bookworm point release to address https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1066136 and needed for the update of the Mailman servers T331706
[09:01:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:01:54] T331706: Migrate Mailman/lists to Bullseye/Bookworm - https://phabricator.wikimedia.org/T331706
[09:03:30] RESOLVED: [2x] ProbeDown: Service wdqs1020:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1020:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:03:34] SRE, Cloud-Services, serviceops: Modernise memcached systemd unit / sync, and make it presentable - https://phabricator.wikimedia.org/T273950#9858514 (jijiki)
[09:05:48] SRE, Cloud-Services, serviceops: Modernise memcached systemd unit / sync, and make it presentable - https://phabricator.wikimedia.org/T273950#9858516 (jijiki)
[09:08:11] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host install6002.wikimedia.org
[09:08:17] !log jelto@cumin1002 END (PASS) - Cookbook sre.gitlab.reboot-runner (exit_code=0) rolling reboot on A:gitlab-runner
[09:08:58] !log jelto@cumin1002 START - Cookbook sre.hosts.reboot-single for host gitlab1004.wikimedia.org
[09:09:59] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host moss-fe1001.eqiad.wmnet
[09:12:04] (CR) Muehlenhoff: [C:+2] gerrit: move java_home hiera setting to role [puppet] - https://gerrit.wikimedia.org/r/1038690 (owner: Hashar)
[09:12:05] kostajh: have you managed to deploy your backport?
[09:13:25] (PS4) Effie Mouzeli: memcached: switch to memcache user (role) [puppet] - https://gerrit.wikimedia.org/r/1038697
[09:14:27] (CR) Effie Mouzeli: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1038697 (owner: Effie Mouzeli)
[09:14:48] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host install6002.wikimedia.org
[09:14:59] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host gitlab1004.wikimedia.org
[09:14:59] PROBLEM - Host arclamp2001 is DOWN: PING CRITICAL - Packet loss = 100%
[09:15:24] !log taavi@cumin1002 START - Cookbook sre.hosts.reboot-single for host cloudweb2002-dev.wikimedia.org
[09:15:36] !log jelto@cumin1002 START - Cookbook sre.hosts.reboot-single for host gitlab1003.wikimedia.org
[09:15:40] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:15:41] RECOVERY - Host arclamp2001 is UP: PING OK - Packet loss = 0%, RTA = 30.36 ms
[09:15:57] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host moss-fe1001.eqiad.wmnet
[09:17:04] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host moss-fe2001.codfw.wmnet
[09:18:15] SRE, Cloud-Services, serviceops: Modernise memcached systemd unit / sync, and make it presentable - https://phabricator.wikimedia.org/T273950#9858577 (jijiki)
[09:18:31] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host install5002.wikimedia.org
[09:18:49] PROBLEM - Host arclamp1001 is DOWN: PING CRITICAL - Packet loss = 100%
[09:20:11] RECOVERY - Host arclamp1001 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms
[09:21:02] !log taavi@cumin1002 START - Cookbook
sre.hosts.reboot-single for host testhost2001.codfw.wmnet
[09:21:31] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host gitlab1003.wikimedia.org
[09:21:33] PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 56 probes of 786 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[09:21:58] !log taavi@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudweb2002-dev.wikimedia.org
[09:22:09] !log jelto@cumin1002 START - Cookbook sre.hosts.reboot-single for host gitlab2003.wikimedia.org
[09:22:29] !log taavi@cumin1002 START - Cookbook sre.hosts.reboot-single for host cloudweb1003.wikimedia.org
[09:22:58] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host moss-fe2001.codfw.wmnet
[09:23:16] hashar: yes all done
[09:23:28] kostajh: great!!
[09:25:15] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host install5002.wikimedia.org
[09:26:03] (PS2) Fabfur: hiera: enable IPIP for high-traffic1@magru for text services [puppet] - https://gerrit.wikimedia.org/r/1038698 (https://phabricator.wikimedia.org/T366466)
[09:26:54] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host install4002.wikimedia.org
[09:27:10] !log taavi@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host testhost2001.codfw.wmnet
[09:27:14] !log mvernon@cumin2002 START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling reboot on P{ms-fe1*} and (A:swift-fe or A:swift-fe-canary or A:swift-fe-codfw or A:swift-fe-eqiad)
[09:27:17] !log taavi@cumin1002 START - Cookbook sre.hosts.reboot-single for host cloudlb2001-dev.codfw.wmnet
[09:27:31] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[09:27:36] !log filippo@cumin1002 START - Cookbook sre.hosts.reboot-single for host mwlog1002.eqiad.wmnet
[09:27:37] ops-eqiad, SRE-OnFire, DC-Ops, serviceops, Sustainability (Incident Followup): eqiad:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366204#9858591 (kamila) @VRiley-WMF Yes, that works, thank you! Since with moving racks it's going to take a while, could we please d...
[09:27:38] !log taavi@cumin1002 START - Cookbook sre.hosts.reboot-single for host cloudcontrol2007-dev.codfw.wmnet
[09:27:39] jouncebot: next
[09:27:39] In 0 hour(s) and 32 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240604T1000)
[09:27:51] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[09:28:47] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 52067 bytes in 5.364 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[09:29:16] !log taavi@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudweb1003.wikimedia.org
[09:29:19] (PS3) Clément Goubert: miscweb: Update various modules [deployment-charts] - https://gerrit.wikimedia.org/r/1032525 (https://phabricator.wikimedia.org/T362978)
[09:29:23] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8616 bytes in 0.265 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[09:29:24] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host gitlab2003.wikimedia.org
[09:29:33] PROBLEM - BFD status on cloudsw1-b1-codfw.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[09:29:47] PROBLEM - BGP status on cloudsw1-b1-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:30:19] (CR) Clément Goubert: miscweb: Update various modules (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1032525 (https://phabricator.wikimedia.org/T362978) (owner: Clément Goubert)
[09:30:27] !log taavi@cumin1002 START - Cookbook sre.hosts.reboot-single for host cloudweb1004.wikimedia.org
[09:31:16] SRE, SRE-tools: Provide an utility script to replace a failed
device in raid 0 array - https://phabricator.wikimedia.org/T350492#9858600 (Ladsgroup) Hi, clinic duty again. Can you tag it with a team? Wouldn't I/F be okay here?
[09:33:13] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host install4002.wikimedia.org
[09:33:35] RECOVERY - BFD status on cloudsw1-b1-codfw.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[09:33:43] !log filippo@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mwlog1002.eqiad.wmnet
[09:33:47] RECOVERY - BGP status on cloudsw1-b1-codfw.mgmt is OK: BGP OK - up: 11, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:34:04] !log filippo@cumin1002 START - Cookbook sre.hosts.reboot-single for host mwlog2002.codfw.wmnet
[09:34:06] !log taavi@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudcontrol2007-dev.codfw.wmnet
[09:34:30] !log taavi@cumin1002 START - Cookbook sre.hosts.reboot-single for host cloudcontrol2008-dev.codfw.wmnet
[09:36:06] !log taavi@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2001-dev.codfw.wmnet
[09:36:13] RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 33 probes of 787 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[09:36:18] !log taavi@cumin1002 START - Cookbook sre.hosts.reboot-single for host cloudlb2002-dev.codfw.wmnet
[09:37:15] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host install3003.wikimedia.org
[09:37:20] (PS5) Pmiazga: beta: Introduce new test2wiki on test2.wikipedia.beta.wmcloud.org [mediawiki-config] - https://gerrit.wikimedia.org/r/1035749 (https://phabricator.wikimedia.org/T355281)
[09:37:20] (CR) Pmiazga: beta: Introduce new test2wiki
on test2.wikipedia.beta.wmcloud.org (5 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/1035749 (https://phabricator.wikimedia.org/T355281) (owner: Pmiazga)
[09:37:29] !log taavi@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudweb1004.wikimedia.org
[09:37:40] !log filippo@cumin1002 START - Cookbook sre.hosts.reboot-single for host graphite2004.codfw.wmnet
[09:38:02] SRE, SRE-tools, collaboration-services, Infrastructure-Foundations, and 4 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619#9858611 (MoritzMuehlenhoff)
[09:38:08] (CR) Gergő Tisza: [C:+1] beta: Introduce new test2wiki on test2.wikipedia.beta.wmcloud.org [mediawiki-config] - https://gerrit.wikimedia.org/r/1035749 (https://phabricator.wikimedia.org/T355281) (owner: Pmiazga)
[09:38:49] PROBLEM - BGP status on cloudsw1-b1-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:38:53] !log taavi@cumin1002 START - Cookbook sre.hosts.reboot-single for host cloudgw2002-dev.codfw.wmnet
[09:39:11] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: dumps::generation::worker::dumper_monitor
[09:39:35] PROBLEM - BFD status on cloudsw1-b1-codfw.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[09:40:11] !log filippo@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mwlog2002.codfw.wmnet
[09:40:47] !log taavi@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudcontrol2008-dev.codfw.wmnet
[09:41:38] !log taavi@cumin1002 START - Cookbook sre.hosts.reboot-single for host cloudcumin2001.codfw.wmnet
[09:41:44] (PS1) Muehlenhoff: Switch dumps::generation::worker::dumper_monitor to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/1038729 (https://phabricator.wikimedia.org/T349619)
[09:42:05] !log
filippo@cumin1002 START - Cookbook sre.hosts.reboot-single for host centrallog2002.codfw.wmnet
[09:42:11] RECOVERY - Disk space on karapace1002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=karapace1002&var-datasource=eqiad+prometheus/ops
[09:42:35] RECOVERY - BFD status on cloudsw1-b1-codfw.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[09:42:49] RECOVERY - BGP status on cloudsw1-b1-codfw.mgmt is OK: BGP OK - up: 11, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:43:44] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host install3003.wikimedia.org
[09:44:29] PROBLEM - BFD status on cr2-codfw is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[09:44:33] PROBLEM - BFD status on cr1-codfw is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[09:44:36] !log taavi@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudgw2002-dev.codfw.wmnet
[09:44:41] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:44:43] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:45:04] !log taavi@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2002-dev.codfw.wmnet
[09:45:11] (CR) Muehlenhoff: [C:+2] Switch dumps::generation::worker::dumper_monitor to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/1038729 (https://phabricator.wikimedia.org/T349619) (owner: Muehlenhoff)
[09:45:17] !log filippo@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host
graphite2004.codfw.wmnet
[09:45:23] !log taavi@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudcumin2001.codfw.wmnet
[09:45:28] FIRING: KeyholderUnarmed: 2 unarmed Keyholder key(s) on cloudcumin2001:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[09:47:21] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host cuminunpriv1001.eqiad.wmnet
[09:47:29] RECOVERY - BFD status on cr2-codfw is OK: UP: 20 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[09:47:33] RECOVERY - BFD status on cr1-codfw is OK: UP: 22 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[09:47:41] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 443, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:47:43] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 521, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:48:14] !log taavi@cumin1002 START - Cookbook sre.hosts.reboot-single for host cloudgw2003-dev.codfw.wmnet
[09:48:18] !log taavi@cumin1002 START - Cookbook sre.hosts.reboot-single for host cloudlb2003-dev.codfw.wmnet
[09:48:36] !log filippo@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host centrallog2002.codfw.wmnet
[09:49:41] (PS1) Ayounsi: Netbox-extra: Add bandit and prospector to CI [software/netbox-extras] (dev) - https://gerrit.wikimedia.org/r/1038730
[09:49:54] (CR) CI reject: [V:-1] Netbox-extra: Add bandit and prospector to CI [software/netbox-extras] (dev) - https://gerrit.wikimedia.org/r/1038730 (owner: Ayounsi)
[09:50:28] RESOLVED: KeyholderUnarmed: 2 unarmed Keyholder key(s) on cloudcumin2001:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[09:50:37] PROBLEM - BFD
status on cloudsw1-b1-codfw.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [09:50:49] PROBLEM - BGP status on cloudsw1-b1-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:51:01] (03PS2) 10Pmiazga: [beta] Add test2.wikimedia.beta.wmcloud.org to beta_sites [puppet] - 10https://gerrit.wikimedia.org/r/1035752 (https://phabricator.wikimedia.org/T355281) [09:51:29] (03CR) 10Gergő Tisza: [C:03+1] [beta] Add test2.wikimedia.beta.wmcloud.org to beta_sites [puppet] - 10https://gerrit.wikimedia.org/r/1035752 (https://phabricator.wikimedia.org/T355281) (owner: 10Pmiazga) [09:53:12] !log taavi@cumin1002 START - Cookbook sre.hosts.reboot-single for host cloudcumin1001.eqiad.wmnet [09:53:13] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cuminunpriv1001.eqiad.wmnet [09:54:37] RECOVERY - BFD status on cloudsw1-b1-codfw.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [09:54:49] !log taavi@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudgw2003-dev.codfw.wmnet [09:54:49] RECOVERY - BGP status on cloudsw1-b1-codfw.mgmt is OK: BGP OK - up: 11, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:55:58] (03CR) 10Pmiazga: [C:03+1] [POC][beta] Add rewrite rule for sso.wikimedia.beta.wmflabs.org [puppet] - 10https://gerrit.wikimedia.org/r/1036230 (https://phabricator.wikimedia.org/T365162) (owner: 10Gergő Tisza) [09:56:29] (03CR) 10Marostegui: [C:03+2] db1156: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1038700 (owner: 10Marostegui) [09:56:56] !log taavi@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudcumin1001.eqiad.wmnet [09:57:05] !log taavi@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host 
cloudlb2003-dev.codfw.wmnet [09:58:23] !log taavi@cumin1002 START - Cookbook sre.hosts.reboot-single for host cloudlb1001.eqiad.wmnet [09:58:53] !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host db1156.eqiad.wmnet with OS bookworm [09:58:58] FIRING: [2x] KeyholderUnarmed: 2 unarmed Keyholder key(s) on cloudcumin1001:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [09:59:13] RESOLVED: [2x] KeyholderUnarmed: 2 unarmed Keyholder key(s) on cloudcumin1001:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [09:59:46] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on 35 hosts with reason: Primary switchover s1 T366552 [09:59:49] T366552: Switchover s1 master (db2203 -> db2212) - https://phabricator.wikimedia.org/T366552 [10:00:04] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240604T1000) [10:00:17] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 35 hosts with reason: Primary switchover s1 T366552 [10:00:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'Set db2212 with weight 0 T366552', diff saved to https://phabricator.wikimedia.org/P63993 and previous config saved to /var/cache/conftool/dbconfig/20240604-100024-root.json [10:00:50] (03CR) 10Marostegui: [C:03+2] mariadb: Promote db2212 to s1 master [puppet] - 10https://gerrit.wikimedia.org/r/1038351 (https://phabricator.wikimedia.org/T366552) (owner: 10Gerrit maintenance bot) [10:01:11] PROBLEM - BFD status on cloudsw1-c8-eqiad.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [10:01:37] PROBLEM - BGP status on cloudsw1-c8-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status 
[10:03:37] RECOVERY - BGP status on cloudsw1-c8-eqiad.mgmt is OK: BGP OK - up: 14, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:04:00] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling reboot on P{ms-fe1*} and (A:swift-fe or A:swift-fe-canary or A:swift-fe-codfw or A:swift-fe-eqiad) [10:04:03] !log taavi@cumin1002 START - Cookbook sre.hosts.reboot-single for host clouddb2002-dev.codfw.wmnet [10:04:11] RECOVERY - BFD status on cloudsw1-c8-eqiad.mgmt is OK: UP: 11 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [10:05:17] (03CR) 10Pmiazga: [C:03+1] [POC] Handle sso.wikimedia.org domain [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1036245 (https://phabricator.wikimedia.org/T365162) (owner: 10Gergő Tisza) [10:05:54] 06SRE, 10MoveComms-Support, 10MW-on-K8s, 06serviceops, and 2 others: Move 100% of external traffic to Kubernetes (excluding Votewiki) - https://phabricator.wikimedia.org/T362323#9858668 (10Clement_Goubert) [10:06:47] !log taavi@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb1001.eqiad.wmnet [10:07:03] !log taavi@cumin1002 START - Cookbook sre.hosts.reboot-single for host cloudlb1002.eqiad.wmnet [10:07:15] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: dumps::generation::worker::dumper_monitor [10:08:15] !log dbmaint eqiad s1 deploy schema change on db1184 T364299 [10:08:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:08:18] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [10:09:29] PROBLEM - BGP status on cloudsw1-d5-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:09:38] !log mvernon@cumin2002 START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling 
reboot on P{ms-fe2*} and (A:swift-fe or A:swift-fe-canary or A:swift-fe-codfw or A:swift-fe-eqiad) [10:09:56] (03PS1) 10Clément Goubert: mw-web, mw-api-ext: Raise replicas for 90% traffic [deployment-charts] - 10https://gerrit.wikimedia.org/r/1038732 (https://phabricator.wikimedia.org/T362323) [10:10:09] PROBLEM - BFD status on cloudsw1-d5-eqiad.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [10:10:38] !log taavi@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host clouddb2002-dev.codfw.wmnet [10:11:03] (03PS2) 10Hnowlan: trafficserver: move k8s traffic shift to 90% [puppet] - 10https://gerrit.wikimedia.org/r/1028844 (https://phabricator.wikimedia.org/T362323) [10:12:09] RECOVERY - BFD status on cloudsw1-d5-eqiad.mgmt is OK: UP: 11 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [10:12:29] RECOVERY - BGP status on cloudsw1-d5-eqiad.mgmt is OK: BGP OK - up: 14, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:12:49] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1156.eqiad.wmnet with reason: host reimage [10:15:11] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1156.eqiad.wmnet with reason: host reimage [10:15:26] !log taavi@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb1002.eqiad.wmnet [10:16:39] !log Upgrading CI Jenkins # T366008 [10:16:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:16:42] T366008: Upgrade Jenkins instances to 2.452.1 - https://phabricator.wikimedia.org/T366008 [10:16:42] (03PS1) 10Clément Goubert: trafficserver: Migrate votewiki to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1038735 (https://phabricator.wikimedia.org/T209892) [10:18:29] 06SRE, 10MoveComms-Support, 10MW-on-K8s, 06serviceops, and 2 others: Move 100% of external 
traffic to Kubernetes - https://phabricator.wikimedia.org/T362323#9858735 (10Ladsgroup) [10:18:44] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host failoid2002.codfw.wmnet [10:20:20] jouncebot: next [10:20:20] In 1 hour(s) and 39 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240604T1200) [10:20:33] !log filippo@cumin1002 START - Cookbook sre.hosts.reboot-single for host centrallog1002.eqiad.wmnet [10:21:03] 06SRE, 10SRE-tools, 06collaboration-services, 06Infrastructure-Foundations, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619#9858744 (10MoritzMuehlenhoff) [10:21:12] (03CR) 10Hnowlan: [C:03+1] "🎉🎉🎉" [puppet] - 10https://gerrit.wikimedia.org/r/1038735 (https://phabricator.wikimedia.org/T209892) (owner: 10Clément Goubert) [10:22:06] (03CR) 10Clément Goubert: [C:03+2] trafficserver: Migrate votewiki to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1038735 (https://phabricator.wikimedia.org/T209892) (owner: 10Clément Goubert) [10:22:20] (03PS1) 10AikoChou: ml-services: update RevertRisk LA/ML/Wikidata's images [deployment-charts] - 10https://gerrit.wikimedia.org/r/1038736 (https://phabricator.wikimedia.org/T358744) [10:22:29] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host failoid2002.codfw.wmnet [10:23:19] PROBLEM - BFD status on cr1-eqiad is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [10:23:21] PROBLEM - BFD status on cr2-eqiad is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [10:23:24] !log Migrating votewiki to mw-on-k8s - T362323 [10:23:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:23:27] T362323: Move 100% of external traffic to Kubernetes - https://phabricator.wikimedia.org/T362323 [10:24:16] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: 
dumps::generation::worker::dumper [10:27:34] !log hashar@deploy1002 Started deploy [releng/jenkins-deploy@5d3a06d] (releasing): (no justification provided) [10:27:54] !log Upgrading releases Jenkins instances # T366008 [10:27:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:27:59] T366008: Upgrade Jenkins instances to 2.452.1 - https://phabricator.wikimedia.org/T366008 [10:28:43] FIRING: JobUnavailable: Reduced availability for job rsyslog-receiver in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:28:47] !log hashar@deploy1002 Finished deploy [releng/jenkins-deploy@5d3a06d] (releasing): (no justification provided) (duration: 01m 12s) [10:29:03] (03PS1) 10Marostegui: Revert "db1156: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1038722 [10:30:13] (03PS1) 10Muehlenhoff: Switch dumps::generation::worker::dumper to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1038738 (https://phabricator.wikimedia.org/T349619) [10:30:17] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host failoid1002.eqiad.wmnet [10:32:30] (03PS6) 10Dreamy Jazz: [CheckUser] Stop writing old for event table migration on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1013386 (https://phabricator.wikimedia.org/T360686) [10:33:16] (03PS1) 10Majavah: hieradata: Remove unused role hiera [puppet] - 10https://gerrit.wikimedia.org/r/1038739 [10:34:05] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host failoid1002.eqiad.wmnet [10:34:09] (03CR) 10Muehlenhoff: [C:03+2] Switch dumps::generation::worker::dumper to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1038738 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [10:34:24] (03PS7) 10Dreamy Jazz: [CheckUser] Stop writing old for event table 
migration on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1013386 (https://phabricator.wikimedia.org/T360686) [10:34:41] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host build2001.codfw.wmnet [10:35:04] 06SRE, 06Traffic: Anycast ns1.wikimedia.org - https://phabricator.wikimedia.org/T366193#9858791 (10cmooney) >>! In T366193#9855670, @BBlack wrote: > IMHO, the A/B set solution with a pair of anycasts, is the most elegant and simple way to achieve the best balance of resiliency and perf for our authdns. I thin... [10:35:11] (03PS3) 10Clément Goubert: trafficserver: move k8s traffic shift to 90% [puppet] - 10https://gerrit.wikimedia.org/r/1028844 (https://phabricator.wikimedia.org/T362323) (owner: 10Hnowlan) [10:35:19] (03CR) 10Aklapper: [V:03+2 C:03+2] "Thanks! git apply worked cleanly locally on latest wmf/stable branch thus +2" [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1037902 (owner: 10Pppery) [10:36:41] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1156.eqiad.wmnet with OS bookworm [10:38:36] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: dumps::generation::worker::dumper [10:39:12] (03PS1) 10Dreamy Jazz: [CheckUser] Stop writing old for event tables migration on group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038740 (https://phabricator.wikimedia.org/T360685) [10:39:14] (03PS1) 10Dreamy Jazz: [CheckUser] Stop writing old for event tables migration on group1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038741 (https://phabricator.wikimedia.org/T360685) [10:39:16] (03PS1) 10Dreamy Jazz: [CheckUser] Stop writing old for event tables migration on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038742 (https://phabricator.wikimedia.org/T360685) [10:40:41] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host build2001.codfw.wmnet 
[10:40:53] PROBLEM - SSH on centrallog1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [10:40:55] RESOLVED: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:42:01] !log taavi@cumin1002 START - Cookbook sre.hosts.reboot-single for host cloudlb1001.eqiad.wmnet [10:42:17] !log Starting s1 codfw failover from db2203 to db2212 - T366552 [10:42:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:42:23] T366552: Switchover s1 master (db2203 -> db2212) - https://phabricator.wikimedia.org/T366552 [10:42:35] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for Ifrahkhanyaree_WMDE - https://phabricator.wikimedia.org/T366558#9858804 (Ladsgroup) Waiting for approval on data engineering side.
[10:42:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote db2212 to s1 primary T366552', diff saved to https://phabricator.wikimedia.org/P63994 and previous config saved to /var/cache/conftool/dbconfig/20240604-104241-root.json [10:43:37] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2203 T366552', diff saved to https://phabricator.wikimedia.org/P63995 and previous config saved to /var/cache/conftool/dbconfig/20240604-104337-root.json [10:44:37] PROBLEM - BGP status on cloudsw1-c8-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:45:16] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 10:00:00 on db2141.codfw.wmnet with reason: Long schema change [10:45:17] PROBLEM - BFD status on cloudsw1-c8-eqiad.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [10:45:18] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db2141.codfw.wmnet with reason: Long schema change [10:45:26] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 10:00:00 on db2203.codfw.wmnet with reason: Long schema change [10:45:28] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db2203.codfw.wmnet with reason: Long schema change [10:45:39] !log dbmaint codfw s1 deploy schema change on db2203 T364299 [10:45:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:45:42] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [10:46:10] (03PS1) 10Fabfur: cache:hiera: enable IPIP on text@magru [puppet] - 10https://gerrit.wikimedia.org/r/1038744 (https://phabricator.wikimedia.org/T366466) [10:46:27] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling reboot on P{ms-fe2*} and (A:swift-fe or A:swift-fe-canary or 
A:swift-fe-codfw or A:swift-fe-eqiad) [10:47:44] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1038744 (https://phabricator.wikimedia.org/T366466) (owner: 10Fabfur) [10:48:00] (03PS2) 10Clément Goubert: miscweb: Use a random miscweb image for default value [deployment-charts] - 10https://gerrit.wikimedia.org/r/1038251 (https://phabricator.wikimedia.org/T362518) [10:48:03] !log mvernon@cumin2002 START - Cookbook sre.swift.roll-restart-reboot-swift-thanos-proxies rolling reboot on A:thanos-fe [10:48:17] RECOVERY - BFD status on cloudsw1-c8-eqiad.mgmt is OK: UP: 11 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [10:48:25] (03CR) 10Kevin Bazira: [C:03+1] ml-services: update RevertRisk LA/ML/Wikidata's images [deployment-charts] - 10https://gerrit.wikimedia.org/r/1038736 (https://phabricator.wikimedia.org/T358744) (owner: 10AikoChou) [10:48:37] RECOVERY - BGP status on cloudsw1-c8-eqiad.mgmt is OK: BGP OK - up: 14, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:49:07] !log taavi@cumin1002 START - Cookbook sre.hosts.reboot-single for host cloudlb2001-dev.codfw.wmnet [10:49:48] (03CR) 10Btullis: [C:03+1] an-test-druid: Switch to use nftables instead of iptables [puppet] - 10https://gerrit.wikimedia.org/r/1032632 (owner: 10Muehlenhoff) [10:50:38] !log taavi@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb1001.eqiad.wmnet [10:50:50] !log taavi@cumin1002 START - Cookbook sre.hosts.reboot-single for host cloudlb1002.eqiad.wmnet [10:50:56] (03CR) 10Klausman: [C:03+1] ml-services: update RevertRisk LA/ML/Wikidata's images [deployment-charts] - 10https://gerrit.wikimedia.org/r/1038736 (https://phabricator.wikimedia.org/T358744) (owner: 10AikoChou) [10:51:39] PROBLEM - BFD status on cloudsw1-b1-codfw.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [10:51:51] PROBLEM - BGP 
status on cloudsw1-b1-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:52:45] PROBLEM - MD RAID on centrallog1002 is CRITICAL: CRITICAL: State: degraded, Active: 7, Working: 7, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [10:52:46] ACKNOWLEDGEMENT - MD RAID on centrallog1002 is CRITICAL: CRITICAL: State: degraded, Active: 7, Working: 7, Failed: 0, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T366580 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [10:52:51] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on centrallog1002 - https://phabricator.wikimedia.org/T366580 (10ops-monitoring-bot) 03NEW [10:53:01] PROBLEM - Check if anycast-healthchecker and all configured threads are running on centrallog1002 is CRITICAL: CRITICAL: anycast-healthchecker could be down as pid file /var/run/anycast-healthchecker/anycast-healthchecker.pid doesnt exist https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running [10:53:03] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on mw1358.eqiad.wmnet with reason: Waiting on iDrac update [10:53:05] (03CR) 10Btullis: [C:03+1] "Looks good. Thanks." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1032595 (https://phabricator.wikimedia.org/T364921) (owner: 10Scott French) [10:53:17] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on mw1358.eqiad.wmnet with reason: Waiting on iDrac update [10:53:17] PROBLEM - BFD status on cloudsw1-d5-eqiad.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [10:53:29] PROBLEM - BGP status on cloudsw1-d5-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:53:33] PROBLEM - Bird Internet Routing Daemon on centrallog1002 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [10:53:43] RECOVERY - SSH on centrallog1002 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [10:54:33] RECOVERY - Bird Internet Routing Daemon on centrallog1002 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [10:54:33] RECOVERY - Check if anycast-healthchecker and all configured threads are running on centrallog1002 is OK: OK: UP (pid=3953) and all threads (1) are running https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running [10:54:44] !log filippo@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host centrallog1002.eqiad.wmnet [10:54:51] RECOVERY - BGP status on cloudsw1-b1-codfw.mgmt is OK: BGP OK - up: 11, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:55:21] RECOVERY - BFD status on cr2-eqiad is OK: UP: 22 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [10:55:21] RECOVERY - BFD status on cr1-eqiad is OK: UP: 24 AdminDown: 0 Down: 0 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [10:55:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1156 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P63996 and previous config saved to /var/cache/conftool/dbconfig/20240604-105525-root.json [10:55:28] (03CR) 10Marostegui: [C:03+2] Revert "db1156: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1038722 (owner: 10Marostegui) [10:55:34] (03PS1) 10Kosta Harlan: IPInfo: Remove $wgIPInfoGeoIP2EnterprisePath and use GeoLite2 data [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038723 (https://phabricator.wikimedia.org/T361884) [10:55:39] RECOVERY - BFD status on cloudsw1-b1-codfw.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [10:55:44] RESOLVED: JobUnavailable: Reduced availability for job rsyslog-receiver in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:56:19] RECOVERY - BFD status on cloudsw1-d5-eqiad.mgmt is OK: UP: 11 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [10:56:23] (03PS2) 10Kosta Harlan: IPInfo: Remove $wgIPInfoGeoIP2EnterprisePath and use GeoLite2 data [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038723 (https://phabricator.wikimedia.org/T361884) [10:56:29] RECOVERY - BGP status on cloudsw1-d5-eqiad.mgmt is OK: BGP OK - up: 14, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:57:31] (03PS3) 10Kosta Harlan: IPInfo: Remove $wgIPInfoGeoIP2EnterprisePath and use GeoLite2 data [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038723 (https://phabricator.wikimedia.org/T361884) [10:57:48] !log taavi@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host 
cloudlb2001-dev.codfw.wmnet [10:57:58] !log taavi@cumin1002 START - Cookbook sre.hosts.reboot-single for host cloudlb2002-dev.codfw.wmnet [10:58:17] (03PS4) 10Kosta Harlan: IPInfo: Remove $wgIPInfoGeoIP2EnterprisePath and use GeoLite2 data [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038723 (https://phabricator.wikimedia.org/T361884) [10:59:08] !log taavi@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb1002.eqiad.wmnet [10:59:50] !log cgoubert@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts mw1358.eqiad.wmnet [11:00:01] !log cgoubert@cumin1002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts mw1358.eqiad.wmnet [11:00:09] (03CR) 10Kosta Harlan: [geoip::data::maxmind::ipinfo]: Use GeoLite2 instead of Enterprise data (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1037528 (https://phabricator.wikimedia.org/T366272) (owner: 10Kosta Harlan) [11:00:39] PROBLEM - BFD status on cloudsw1-b1-codfw.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [11:00:53] PROBLEM - BGP status on cloudsw1-b1-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:01:46] (03PS6) 10Kosta Harlan: geoip: Download GeoLite2 ASN file [puppet] - 10https://gerrit.wikimedia.org/r/1037531 [11:01:54] (03PS8) 10Kosta Harlan: [geoip::data::maxmind::ipinfo]: Use GeoLite2 instead of Enterprise data [puppet] - 10https://gerrit.wikimedia.org/r/1037528 (https://phabricator.wikimedia.org/T366272) [11:02:25] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on centrallog1002 - https://phabricator.wikimedia.org/T366580#9858928 (10fgiunchedi) →14Duplicate dup:03T363660 [11:02:42] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on centrallog1002 - https://phabricator.wikimedia.org/T363660#9858930 (10fgiunchedi) [11:03:39] RECOVERY - BFD status on 
cloudsw1-b1-codfw.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [11:03:53] RECOVERY - BGP status on cloudsw1-b1-codfw.mgmt is OK: BGP OK - up: 11, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:04:16] 10ops-eqiad, 06DC-Ops, 06serviceops: hw troubleshooting: firmware upgrade for mw1358.eqiad.wmnet - https://phabricator.wikimedia.org/T366583 (10Clement_Goubert) 03NEW [11:04:38] 10ops-eqiad, 06DC-Ops, 06serviceops: hw troubleshooting: firmware upgrade for mw1358.eqiad.wmnet - https://phabricator.wikimedia.org/T366583#9858964 (10Clement_Goubert) [11:04:48] !log cgoubert@cumin1002 START - Cookbook sre.dns.netbox [11:04:57] (03CR) 10Muehlenhoff: [C:03+1] "Ah, yes. Thanks." [puppet] - 10https://gerrit.wikimedia.org/r/1038739 (owner: 10Majavah) [11:05:52] (03CR) 10Majavah: [C:03+2] hieradata: Remove unused role hiera [puppet] - 10https://gerrit.wikimedia.org/r/1038739 (owner: 10Majavah) [11:06:01] 10ops-eqiad, 06DC-Ops, 06Infrastructure-Foundations, 06serviceops: hw troubleshooting: firmware upgrade for mw1358.eqiad.wmnet - https://phabricator.wikimedia.org/T366583#9858966 (10Clement_Goubert) [11:06:08] 10ops-eqiad, 06DC-Ops, 06Infrastructure-Foundations, 06serviceops: hw troubleshooting: firmware upgrade for mw1358.eqiad.wmnet - https://phabricator.wikimedia.org/T366583#9858967 (10Clement_Goubert) [11:06:12] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:06:46] !log taavi@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2002-dev.codfw.wmnet [11:06:57] !log taavi@cumin1002 START - Cookbook sre.hosts.reboot-single for host cloudlb2003-dev.codfw.wmnet [11:07:44] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on centrallog1002 - https://phabricator.wikimedia.org/T363660#9858969 (10fgiunchedi) I'm not sure exactly what happened, though while working today on {T366555} centrallog1002 md1 raid wouldn't come 
up cleanly. I've assembled it with three disks and then put ba... [11:08:44] FIRING: SystemdUnitFailed: generate_vrts_aliases.service on mx2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:09:39] PROBLEM - BFD status on cloudsw1-b1-codfw.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [11:09:53] PROBLEM - BGP status on cloudsw1-b1-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:10:32] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1156 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P63998 and previous config saved to /var/cache/conftool/dbconfig/20240604-111031-root.json [11:12:41] RECOVERY - BFD status on cloudsw1-b1-codfw.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [11:12:41] (03PS5) 10Kosta Harlan: IPInfo: Remove $wgIPInfoGeoIP2EnterprisePath and use GeoLite2 data [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038723 (https://phabricator.wikimedia.org/T361884) [11:12:53] RECOVERY - BGP status on cloudsw1-b1-codfw.mgmt is OK: BGP OK - up: 11, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:14:26] (03PS6) 10Kosta Harlan: IPInfo: Remove $wgIPInfoGeoIP2EnterprisePath and use GeoLite2 data [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038723 (https://phabricator.wikimedia.org/T361884) [11:15:45] !log taavi@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2003-dev.codfw.wmnet [11:16:24] (03PS7) 10Kosta Harlan: IPInfo: Switch to using GeoLite2 data [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038723 (https://phabricator.wikimedia.org/T361884) [11:20:30] (03PS1) 
10Muehlenhoff: Remove obsolete Icinga stub secrets [labs/private] - 10https://gerrit.wikimedia.org/r/1038748 [11:21:01] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti-test2001.codfw.wmnet [11:21:27] (03CR) 10Esanders: [C:03+2] Update user-agent string in citoid to be like Zot [deployment-charts] - 10https://gerrit.wikimedia.org/r/1034860 (https://phabricator.wikimedia.org/T366093) (owner: 10Mvolz) [11:22:04] (03CR) 10Esanders: [C:03+1] Update user-agent string in citoid to be like Zot [deployment-charts] - 10https://gerrit.wikimedia.org/r/1034860 (https://phabricator.wikimedia.org/T366093) (owner: 10Mvolz) [11:25:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1156 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P63999 and previous config saved to /var/cache/conftool/dbconfig/20240604-112537-root.json [11:26:21] 10ops-eqiad, 06SRE-OnFire, 06DC-Ops, 06serviceops, 10Sustainability (Incident Followup): eqiad:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366204#9859001 (10VRiley-WMF) Sure thing! We'll do it one at a time. 
[11:27:08] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti-test2001.codfw.wmnet [11:27:13] (03CR) 10Gergő Tisza: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1036230 (https://phabricator.wikimedia.org/T365162) (owner: 10Gergő Tisza) [11:27:21] (03PS1) 10Majavah: wikitech: Replace OSM class in Gerrit blocking hook [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038749 (https://phabricator.wikimedia.org/T161553) [11:27:22] (03PS1) 10Majavah: wikitech: Stop loading OpenStackManager [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038750 (https://phabricator.wikimedia.org/T161553) [11:27:23] (03CR) 10Gergő Tisza: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1030591 (https://phabricator.wikimedia.org/T350094) (owner: 10Gergő Tisza) [11:29:22] (03PS2) 10Majavah: wikitech: Stop loading OpenStackManager [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038750 (https://phabricator.wikimedia.org/T161553) [11:29:48] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti-test2003.codfw.wmnet [11:36:00] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti-test2003.codfw.wmnet [11:36:18] (03CR) 10Tchanders: [C:03+1] IPInfo: Switch to using GeoLite2 data [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038723 (https://phabricator.wikimedia.org/T361884) (owner: 10Kosta Harlan) [11:39:05] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-thanos-proxies (exit_code=0) rolling reboot on A:thanos-fe [11:39:48] !log cgoubert@cumin1002 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:wikikube-worker-codfw [11:40:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1156 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P64000 and previous config saved to /var/cache/conftool/dbconfig/20240604-114043-root.json [11:41:47] !log marostegui@cumin1002 
START - Cookbook sre.hosts.downtime for 6:00:00 on db2136.codfw.wmnet with reason: Maintenance [11:41:49] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2136.codfw.wmnet with reason: Maintenance [11:41:52] 06SRE, 06serviceops, 10Data Products (Data Products Sprint 14), 13Patch-For-Review, 07Service-deployment-requests: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production - https://phabricator.wikimedia.org/T361835#9859048 (10SGupta-WMF) Hi @Scott_French We are almost done coding the services... [11:41:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2136 (T364299)', diff saved to https://phabricator.wikimedia.org/P64001 and previous config saved to /var/cache/conftool/dbconfig/20240604-114157-marostegui.json [11:42:00] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [11:44:09] !log klausman@cumin2002 START - Cookbook sre.cassandra.roll-reboot rolling reboot on A:ml-cache-codfw [11:47:22] FIRING: SessionStoreOnNonDedicatedHost: Sessionstore k8s pods are running on non-dedicated hosts - TODO - TODO - https://alerts.wikimedia.org/?q=alertname%3DSessionStoreOnNonDedicatedHost [11:47:47] ^side effect of reboots [11:48:10] I'll fix it once their dedicated hosts are done rebooting [11:48:43] FIRING: [2x] ProbeDown: Service ml-cache2001-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:50:31] !log klausman@cumin2002 START - Cookbook sre.cassandra.roll-reboot rolling reboot on A:ml-cache-eqiad [11:50:44] RESOLVED: [2x] ProbeDown: Service ml-cache2001-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:53:43] PROBLEM - Router interfaces on 
cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 225, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:53:43] FIRING: [3x] ProbeDown: Service ml-cache1001-a:9042 has failed probes (tcp_cassandra_a_cql_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:54:09] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 137, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:54:16] (03CR) 10Brouberol: [C:03+1] Add wikidata history-dumps import to hdfs job [puppet] - 10https://gerrit.wikimedia.org/r/1036614 (https://phabricator.wikimedia.org/T364045) (owner: 10Joal) [11:54:30] (03CR) 10Brouberol: [C:03+2] Add wikidata history-dumps import to hdfs job [puppet] - 10https://gerrit.wikimedia.org/r/1036614 (https://phabricator.wikimedia.org/T364045) (owner: 10Joal) [11:54:36] !log depooling 3 api appservers and 2 appservers in advance of reimaging [11:54:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:54:44] (03CR) 10Vgutierrez: [C:03+1] hiera: enable IPIP for high-traffic1@magru for text services [puppet] - 10https://gerrit.wikimedia.org/r/1038698 (https://phabricator.wikimedia.org/T366466) (owner: 10Fabfur) [11:54:49] (03CR) 10Vgutierrez: [C:03+1] cache:hiera: enable IPIP on text@magru [puppet] - 10https://gerrit.wikimedia.org/r/1038744 (https://phabricator.wikimedia.org/T366466) (owner: 10Fabfur) [11:55:44] FIRING: [4x] ProbeDown: Service ml-cache1001-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:55:50] !log marostegui@cumin1002 dbctl commit 
(dc=all): 'db1156 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P64002 and previous config saved to /var/cache/conftool/dbconfig/20240604-115549-root.json [11:56:45] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 226, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:57:09] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 138, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:57:54] (03CR) 10Vgutierrez: [C:03+1] depool text@magru before enabling IPIP encapsulation [dns] - 10https://gerrit.wikimedia.org/r/1038695 (https://phabricator.wikimedia.org/T366466) (owner: 10Fabfur) [11:58:43] FIRING: [6x] ProbeDown: Service ml-cache1001-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:59:07] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1246 (T352010)', diff saved to https://phabricator.wikimedia.org/P64003 and previous config saved to /var/cache/conftool/dbconfig/20240604-115907-ladsgroup.json [11:59:10] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [12:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240604T1200) [12:00:44] RESOLVED: [6x] ProbeDown: Service ml-cache1001-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:02:41] !log taavi@cumin1002 START - Cookbook sre.wikireplicas.add-wiki for database dtpwiki (T365229) [12:02:43] T365229: 
Prepare and check storage layer for dtpwiki - https://phabricator.wikimedia.org/T365229 [12:03:43] FIRING: [7x] ProbeDown: Service ml-cache1001-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:05:39] (03PS1) 10Hnowlan: kubernetes: rename and reimage 3 api appservers, 2 appservers [puppet] - 10https://gerrit.wikimedia.org/r/1038757 (https://phabricator.wikimedia.org/T362323) [12:05:44] FIRING: [8x] ProbeDown: Service ml-cache1001-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:05:44] RESOLVED: SystemdUnitFailed: generate_vrts_aliases.service on mx2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:07:22] FIRING: [2x] SessionStoreOnNonDedicatedHost: Sessionstore k8s pods are running on non-dedicated hosts - TODO - TODO - https://alerts.wikimedia.org/?q=alertname%3DSessionStoreOnNonDedicatedHost [12:08:33] !log klausman@cumin2002 END (PASS) - Cookbook sre.cassandra.roll-reboot (exit_code=0) rolling reboot on A:ml-cache-codfw [12:08:43] FIRING: [8x] ProbeDown: Service ml-cache1001-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:09:55] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - 
kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS6 [12:09:55] : Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:09:55] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:10:01] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS [12:10:01] 6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:10:45] RESOLVED: [8x] ProbeDown: Service ml-cache1001-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:10:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1156 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P64004 and previous config saved to /var/cache/conftool/dbconfig/20240604-121056-root.json [12:11:57] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 521, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:12:03] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 443, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:12:15] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/image-suggestion: apply [12:12:21] !log btullis@deploy1002 helmfile 
[staging] DONE helmfile.d/services/image-suggestion: apply [12:13:43] FIRING: [9x] ProbeDown: Service ml-cache1001-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:14:04] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/image-suggestion: apply [12:14:16] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1246', diff saved to https://phabricator.wikimedia.org/P64005 and previous config saved to /var/cache/conftool/dbconfig/20240604-121415-ladsgroup.json [12:14:21] !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/image-suggestion: apply [12:14:35] RECOVERY - MD RAID on centrallog1002 is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [12:15:24] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/image-suggestion: apply [12:15:40] !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/image-suggestion: apply [12:15:44] FIRING: [9x] ProbeDown: Service ml-cache1001-a:9042 has failed probes (tcp_cassandra_a_cql_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:15:57] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:17:10] !log klausman@cumin2002 END (PASS) - Cookbook sre.cassandra.roll-reboot (exit_code=0) rolling reboot on A:ml-cache-eqiad [12:17:30] (03PS1) 10Jelto: conftool-data: add gerrit and gerrit-replica [puppet] - 10https://gerrit.wikimedia.org/r/1038758 (https://phabricator.wikimedia.org/T365259) [12:18:43] RESOLVED: 
[7x] ProbeDown: Service ml-cache1002-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:21:44] (03CR) 10Filippo Giunchedi: [C:03+1] Remove obsolete Icinga stub secrets [labs/private] - 10https://gerrit.wikimedia.org/r/1038748 (owner: 10Muehlenhoff) [12:22:01] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS6 [12:22:01] : Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:22:05] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS [12:22:05] 4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:22:20] (03CR) 10Muehlenhoff: [V:03+2 C:03+2] Remove obsolete Icinga stub secrets [labs/private] - 10https://gerrit.wikimedia.org/r/1038748 (owner: 10Muehlenhoff) [12:22:37] jouncebot: next [12:22:37] In 0 hour(s) and 37 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240604T1300) [12:22:52] !log filippo@cumin1002 START - Cookbook sre.hosts.reboot-single for host centrallog1002.eqiad.wmnet 
[12:23:01] (03CR) 10Muehlenhoff: [C:03+2] an-test-druid: Switch to use nftables instead of iptables [puppet] - 10https://gerrit.wikimedia.org/r/1032632 (owner: 10Muehlenhoff) [12:24:05] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 521, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:24:07] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 443, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:25:45] PROBLEM - BFD status on cr2-eqiad is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [12:25:45] PROBLEM - BFD status on cr1-eqiad is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [12:25:46] (03PS1) 10Klausman: base functions: make sleep() output a bit friendlier [cookbooks] - 10https://gerrit.wikimedia.org/r/1038759 [12:26:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1156 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P64006 and previous config saved to /var/cache/conftool/dbconfig/20240604-122602-root.json [12:26:29] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host an-test-druid1001.eqiad.wmnet [12:28:00] !log taavi@cumin1002 END (PASS) - Cookbook sre.wikireplicas.add-wiki (exit_code=0) for database dtpwiki (T365229) [12:28:03] T365229: Prepare and check storage layer for dtpwiki - https://phabricator.wikimedia.org/T365229 [12:28:45] RECOVERY - BFD status on cr1-eqiad is OK: UP: 24 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [12:28:45] RECOVERY - BFD status on cr2-eqiad is OK: UP: 22 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [12:28:54] (03CR) 10CI reject: [V:04-1] base functions: make sleep() output a bit friendlier [cookbooks] - 10https://gerrit.wikimedia.org/r/1038759 (owner: 10Klausman) [12:29:09] !log filippo@cumin1002 END (PASS) - 
Cookbook sre.hosts.reboot-single (exit_code=0) for host centrallog1002.eqiad.wmnet [12:29:24] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1246', diff saved to https://phabricator.wikimedia.org/P64007 and previous config saved to /var/cache/conftool/dbconfig/20240604-122924-ladsgroup.json [12:30:58] (03PS2) 10Klausman: base functions: make sleep() output a bit friendlier [cookbooks] - 10https://gerrit.wikimedia.org/r/1038759 [12:32:02] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-test-druid1001.eqiad.wmnet [12:32:12] !log brouberol@cumin2002 START - Cookbook sre.wdqs.restart [12:32:13] !log brouberol@cumin2002 END (ERROR) - Cookbook sre.wdqs.restart (exit_code=97) [12:32:20] !log brouberol@cumin2002 START - Cookbook sre.wdqs.restart [12:34:05] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS [12:34:05] 6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:34:11] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, A [12:34:11] v4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:34:34] !log jmm@cumin2002 START - Cookbook 
sre.hosts.reboot-single for host urldownloader1003.wikimedia.org [12:34:56] (03CR) 10CI reject: [V:04-1] base functions: make sleep() output a bit friendlier [cookbooks] - 10https://gerrit.wikimedia.org/r/1038759 (owner: 10Klausman) [12:35:18] (03PS2) 10Ilias Sarantopoulos: ml-services: set command for hf image and remove nllb [deployment-charts] - 10https://gerrit.wikimedia.org/r/1036297 (https://phabricator.wikimedia.org/T365842) [12:36:05] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 521, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:36:11] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 443, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:38:07] 06SRE, 10SRE-tools, 06collaboration-services, 06Infrastructure-Foundations, and 4 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619#9859201 (10MoritzMuehlenhoff) [12:39:03] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host urldownloader1003.wikimedia.org [12:39:53] !log klausman@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM ml-etcd2001.codfw.wmnet [12:43:48] !log klausman@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-etcd2001.codfw.wmnet [12:44:32] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1246 (T352010)', diff saved to https://phabricator.wikimedia.org/P64008 and previous config saved to /var/cache/conftool/dbconfig/20240604-124432-ladsgroup.json [12:44:34] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [12:44:35] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [12:44:47] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [12:45:13] PROBLEM - BGP 
status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS [12:45:13] 6: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:45:36] (03CR) 10Ilias Sarantopoulos: [C:03+1] ml-services: update RevertRisk LA/ML/Wikidata's images [deployment-charts] - 10https://gerrit.wikimedia.org/r/1038736 (https://phabricator.wikimedia.org/T358744) (owner: 10AikoChou) [12:46:13] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS [12:46:13] 4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:47:10] The BGP errors are expected because of reboots [12:47:13] (03PS1) 10Ilias Sarantopoulos: ml-services: use multi-processing for viwiki in ml-staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1038765 (https://phabricator.wikimedia.org/T349274) [12:47:19] Sorry for the noise though [12:48:13] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 521, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:48:13] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 443, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:48:52] !log jmm@cumin2002 START - 
Cookbook sre.hosts.reboot-single for host urldownloader2004.wikimedia.org [12:48:54] !log klausman@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM ml-etcd2002.codfw.wmnet [12:49:41] (03PS3) 10Klausman: base functions: make sleep() output a bit friendlier [cookbooks] - 10https://gerrit.wikimedia.org/r/1038759 [12:51:43] (03CR) 10Muehlenhoff: [C:03+2] Switch maps/codfw to PKI [puppet] - 10https://gerrit.wikimedia.org/r/1038240 (https://phabricator.wikimedia.org/T360778) (owner: 10Muehlenhoff) [12:52:22] RESOLVED: [2x] SessionStoreOnNonDedicatedHost: Sessionstore k8s pods are running on non-dedicated hosts - TODO - TODO - https://alerts.wikimedia.org/?q=alertname%3DSessionStoreOnNonDedicatedHost [12:52:49] !log klausman@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-etcd2002.codfw.wmnet [12:53:01] !log klausman@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM ml-etcd2003.codfw.wmnet [12:53:19] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host urldownloader2004.wikimedia.org [12:53:35] !log brouberol@cumin2002 END (PASS) - Cookbook sre.wdqs.restart (exit_code=0) [12:56:57] !log klausman@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-etcd2003.codfw.wmnet [12:57:29] !log klausman@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM ml-etcd1001.eqiad.wmnet [12:58:35] (03PS1) 10Muehlenhoff: Remove obsolete swift stub secrets [labs/private] - 10https://gerrit.wikimedia.org/r/1038769 [12:59:09] PROBLEM - pybal on lvs7001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [12:59:09] PROBLEM - BGP status on asw1-b3-magru.mgmt is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:59:09] PROBLEM - PyBal backends health check on lvs7001 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to 
localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal [12:59:17] ^ expected [12:59:43] PROBLEM - PyBal connections to etcd on lvs7001 is CRITICAL: CRITICAL: 0 connections established with conf1009.eqiad.wmnet:4001 (min=12) https://wikitech.wikimedia.org/wiki/PyBal [12:59:52] !log klausman@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-etcd1001.eqiad.wmnet [13:00:04] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: Your horoscope predicts another UTC afternoon backport window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240604T1300). [13:00:04] Nemoralis and MatmaRex: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:21] hi [13:00:24] !log klausman@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM ml-etcd1002.eqiad.wmnet [13:02:15] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, A [13:02:15] v4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:02:15] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: 
Connect - kubernetes-codfw, A [13:02:15] v6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:02:50] !log klausman@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-etcd1002.eqiad.wmnet [13:03:05] !log klausman@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM ml-staging-etcd2001.codfw.wmnet [13:04:00] (03CR) 10MVernon: [C:03+1] "Thanks for doing the tidy-up, this looks good to me." [labs/private] - 10https://gerrit.wikimedia.org/r/1038769 (owner: 10Muehlenhoff) [13:04:55] any deployers around? [13:05:13] we've got just one real change and one beta-only change today [13:05:29] !log klausman@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-staging-etcd2001.codfw.wmnet [13:05:45] !log klausman@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM ml-staging-etcd2002.codfw.wmnet [13:06:30] (03PS1) 10Stevemunene: Clean up datahub from main cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/1038773 (https://phabricator.wikimedia.org/T366338) [13:08:10] !log klausman@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-staging-etcd2002.codfw.wmnet [13:08:25] !log klausman@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM ml-staging-etcd2003.codfw.wmnet [13:09:15] (03CR) 10Muehlenhoff: [V:03+2 C:03+2] Remove obsolete swift stub secrets [labs/private] - 10https://gerrit.wikimedia.org/r/1038769 (owner: 10Muehlenhoff) [13:09:39] (03PS1) 10Brouberol: analytics_test_cluster_coordinator: upgrade mariadb to version 10.6 [puppet] - 10https://gerrit.wikimedia.org/r/1038771 (https://phabricator.wikimedia.org/T365503) [13:10:21] (03CR) 10Muehlenhoff: [C:03+2] Remove obsolete thanos-swift.discovery.wmnet.crt certificate [puppet] - 10https://gerrit.wikimedia.org/r/1038368 (https://phabricator.wikimedia.org/T356412) (owner: 10Muehlenhoff) [13:10:48] !log klausman@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM 
ml-staging-etcd2003.codfw.wmnet [13:11:02] !log sukhe@cumin1002 START - Cookbook sre.hosts.reboot-single for host lvs7001.magru.wmnet [13:11:27] !log fabfur@cumin1002 START - Cookbook sre.cdn.roll-reboot rolling reboot on A:cp-text_magru [13:11:52] !log fabfur@cumin1002 START - Cookbook sre.cdn.roll-reboot rolling reboot on A:cp-upload_magru [13:12:48] (03PS1) 10Santiago Faci: MPIC chart: Added two new secrets [deployment-charts] - 10https://gerrit.wikimedia.org/r/1038775 (https://phabricator.wikimedia.org/T365182) [13:12:51] !log klausman@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM ml-staging-ctrl2001.codfw.wmnet [13:13:31] (03PS2) 10Ayounsi: Netbox-extra: Add bandit and prospector to CI [software/netbox-extras] (dev) - 10https://gerrit.wikimedia.org/r/1038730 [13:13:34] (03PS1) 10Marostegui: db1156: Remove package declaration [puppet] - 10https://gerrit.wikimedia.org/r/1038776 [13:14:29] (03CR) 10CI reject: [V:04-1] Netbox-extra: Add bandit and prospector to CI [software/netbox-extras] (dev) - 10https://gerrit.wikimedia.org/r/1038730 (owner: 10Ayounsi) [13:14:32] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs7001.magru.wmnet [13:14:41] PROBLEM - Host lvs7001 is DOWN: PING CRITICAL - Packet loss = 100% [13:14:53] (03PS1) 10Effie Mouzeli: mcrouter ds: use in mw-debug in codfw and not eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1038777 [13:15:15] RECOVERY - Host lvs7001 is UP: PING OK - Packet loss = 0%, RTA = 115.70 ms [13:15:42] (03CR) 10Marostegui: [C:03+2] db1156: Remove package declaration [puppet] - 10https://gerrit.wikimedia.org/r/1038776 (owner: 10Marostegui) [13:16:13] PROBLEM - pybal on lvs7001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [13:16:15] PROBLEM - PyBal backends health check on lvs7001 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection 
refused) https://wikitech.wikimedia.org/wiki/PyBal [13:16:26] ^ expected, resolving soon [13:17:09] RECOVERY - BGP status on asw1-b3-magru.mgmt is OK: BGP OK - up: 13, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:17:09] i'm still holding out for a deployer, if anyone would like to volunteer [13:17:10] !log klausman@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-staging-ctrl2001.codfw.wmnet [13:17:13] RECOVERY - pybal on lvs7001 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [13:17:15] RECOVERY - PyBal backends health check on lvs7001 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [13:17:17] (03PS2) 10Santiago Faci: MPIC chart: Added two new secrets [deployment-charts] - 10https://gerrit.wikimedia.org/r/1038775 (https://phabricator.wikimedia.org/T365182) [13:17:19] !log klausman@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM ml-staging-ctrl2002.codfw.wmnet [13:17:30] (03CR) 10Bking: "Understood. I share the same concerns, and we talked about changing from a team-based notification model to a service-based notification m" [alerts] - 10https://gerrit.wikimedia.org/r/1038454 (https://phabricator.wikimedia.org/T361114) (owner: 10Bking) [13:17:47] (03CR) 10Bking: [C:03+2] data-platform: add alert for WDQS MaxLag [alerts] - 10https://gerrit.wikimedia.org/r/1038454 (https://phabricator.wikimedia.org/T361114) (owner: 10Bking) [13:17:48] (03PS1) 10Slyngshede: Attempt to fix bug where SSH keys are imported incorrectly. 
[software/bitu] - 10https://gerrit.wikimedia.org/r/1038778 (https://phabricator.wikimedia.org/T366525) [13:18:10] (03CR) 10Effie Mouzeli: [C:03+2] mcrouter ds: use in mw-debug in codfw and not eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1038777 (owner: 10Effie Mouzeli) [13:18:22] !log btullis@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s_services/services/datahub: sync on production [13:18:59] (03Merged) 10jenkins-bot: data-platform: add alert for WDQS MaxLag [alerts] - 10https://gerrit.wikimedia.org/r/1038454 (https://phabricator.wikimedia.org/T361114) (owner: 10Bking) [13:19:07] !log jiji@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [13:19:19] jouncebot: now [13:19:19] For the next 0 hour(s) and 40 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240604T1300) [13:19:33] PROBLEM - Host kubernetes2033 is DOWN: PING CRITICAL - Packet loss = 100% [13:19:33] PROBLEM - Host kubernetes2030 is DOWN: PING CRITICAL - Packet loss = 100% [13:19:33] PROBLEM - Host kubernetes2035 is DOWN: PING CRITICAL - Packet loss = 100% [13:19:41] RECOVERY - PyBal connections to etcd on lvs7001 is OK: OK: 12 connections established with conf1009.eqiad.wmnet:4001 (min=12) https://wikitech.wikimedia.org/wiki/PyBal [13:19:41] (03PS10) 10Ayounsi: Netbox-extra: Add bandit and prospector to CI [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/905570 [13:19:57] (03Abandoned) 10Ayounsi: Netbox-extra: Add bandit and prospector to CI [software/netbox-extras] (dev) - 10https://gerrit.wikimedia.org/r/1038730 (owner: 10Ayounsi) [13:20:33] FIRING: [3x] KubernetesCalicoDown: kubernetes2030.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [13:20:48] !log jiji@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [13:20:57] !log jiji@deploy1002 helmfile 
[codfw] START helmfile.d/services/mw-debug: apply [13:21:08] !log btullis@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s_services/services/datahub: sync on production [13:21:36] (PS2) Slyngshede: Fix bug where SSH keys are imported incorrectly. [software/bitu] - https://gerrit.wikimedia.org/r/1038778 (https://phabricator.wikimedia.org/T366525) [13:22:22] !log jiji@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [13:23:10] !log klausman@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-staging-ctrl2002.codfw.wmnet [13:23:26] (PS1) Cwhite: logstash: drop messages from datahub-mce-consumer [puppet] - https://gerrit.wikimedia.org/r/1038786 (https://phabricator.wikimedia.org/T366596) [13:23:56] (PS11) Ayounsi: Netbox-extra: Add bandit and prospector to CI [software/netbox-extras] - https://gerrit.wikimedia.org/r/905570 [13:24:52] (CR) CI reject: [V:-1] Netbox-extra: Add bandit and prospector to CI [software/netbox-extras] - https://gerrit.wikimedia.org/r/905570 (owner: Ayounsi) [13:24:58] !log klausman@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM ml-serve-ctrl2001.codfw.wmnet [13:25:16] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2098.codfw.wmnet with reason: Maintenance [13:25:30] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2098.codfw.wmnet with reason: Maintenance [13:25:44] FIRING: SystemdUnitFailed: generate_vrts_aliases.service on mx1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:27:09] PROBLEM - BGP status on asw1-b4-magru.mgmt is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:27:13] PROBLEM - pybal on lvs7002 is CRITICAL: PROCS
CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [13:27:13] PROBLEM - PyBal backends health check on lvs7002 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal [13:27:16] ^ expected [13:27:22] (CR) Cwhite: [C:+2] logstash: drop messages from datahub-mce-consumer [puppet] - https://gerrit.wikimedia.org/r/1038786 (https://phabricator.wikimedia.org/T366596) (owner: Cwhite) [13:27:45] PROBLEM - PyBal connections to etcd on lvs7002 is CRITICAL: CRITICAL: 0 connections established with conf1009.eqiad.wmnet:4001 (min=4) https://wikitech.wikimedia.org/wiki/PyBal [13:29:13] (CR) Volans: "approach LGTM, some details inline" [software/netbox-deploy] (dev) - https://gerrit.wikimedia.org/r/1038694 (https://phabricator.wikimedia.org/T336275) (owner: Ayounsi) [13:29:28] !log klausman@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-serve-ctrl2001.codfw.wmnet [13:29:42] !log klausman@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM ml-serve-ctrl2002.codfw.wmnet [13:30:29] (CR) Brouberol: [C:+1] "praise: spot on!"
[deployment-charts] - https://gerrit.wikimedia.org/r/1038775 (https://phabricator.wikimedia.org/T365182) (owner: Santiago Faci) [13:32:42] (CR) Btullis: logstash: drop messages from datahub-mce-consumer (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1038786 (https://phabricator.wikimedia.org/T366596) (owner: Cwhite) [13:32:51] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Bumping db1194 weight', diff saved to https://phabricator.wikimedia.org/P64009 and previous config saved to /var/cache/conftool/dbconfig/20240604-133250-ladsgroup.json [13:35:23] !log klausman@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-serve-ctrl2002.codfw.wmnet [13:36:15] ops-eqiad, SRE, SRE-swift-storage, DC-Ops: Disk (sdl) failed in moss-be1002 - https://phabricator.wikimedia.org/T366153#9859578 (VRiley-WMF) a:VRiley-WMF [13:36:51] (PS1) Muehlenhoff: Remove thanos-fe-combined.discovery.wmnet.crt [puppet] - https://gerrit.wikimedia.org/r/1038781 [13:36:54] (PS1) Elukey: WIP - sre.k8s.reboot-nodes.py: rework alias and group parameters [cookbooks] - https://gerrit.wikimedia.org/r/1038782 [13:37:44] !log klausman@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM ml-serve-ctrl1001.eqiad.wmnet [13:38:17] (PS12) Ayounsi: Netbox-extra: Add bandit and prospector to CI [software/netbox-extras] - https://gerrit.wikimedia.org/r/905570 [13:38:46] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1038781 (owner: Muehlenhoff) [13:39:17] (CR) CI reject: [V:-1] Netbox-extra: Add bandit and prospector to CI [software/netbox-extras] - https://gerrit.wikimedia.org/r/905570 (owner: Ayounsi) [13:40:12] (PS2) Elukey: WIP - sre.k8s.reboot-nodes.py: rework alias and group parameters [cookbooks] - https://gerrit.wikimedia.org/r/1038782 [13:40:16] (PS1) Btullis: Update the logstash filters for datahub mae/mce consumer pods [puppet] -
https://gerrit.wikimedia.org/r/1038783 (https://phabricator.wikimedia.org/T366596) [13:40:54] (PS2) Btullis: Update the logstash filters for datahub mae/mce consumer pods [puppet] - https://gerrit.wikimedia.org/r/1038783 (https://phabricator.wikimedia.org/T366596) [13:42:07] !log klausman@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-serve-ctrl1001.eqiad.wmnet [13:42:11] (CR) Btullis: [V:+1] "PCC SUCCESS (NOOP 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2737/console" [puppet] - https://gerrit.wikimedia.org/r/1038783 (https://phabricator.wikimedia.org/T366596) (owner: Btullis) [13:42:30] !log klausman@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM ml-serve-ctrl1002.eqiad.wmnet [13:42:55] (CR) Effie Mouzeli: "PCC OK, all are false positives https://puppet-compiler.wmflabs.org/output/1038697/1077/" [puppet] - https://gerrit.wikimedia.org/r/1038697 (owner: Effie Mouzeli) [13:43:26] (PS5) Effie Mouzeli: memcached: switch to memcache user (role and profile) [puppet] - https://gerrit.wikimedia.org/r/1038697 [13:45:00] (PS1) Ladsgroup: rpc: Update function call in RunSingleJob [mediawiki-config] - https://gerrit.wikimedia.org/r/1038785 (https://phabricator.wikimedia.org/T363839) [13:45:14] (PS1) Cwhite: logstash: expand datahub drop filters to match all consumers [puppet] - https://gerrit.wikimedia.org/r/1038787 (https://phabricator.wikimedia.org/T363856) [13:45:45] (CR) Effie Mouzeli: [C:+2] memcached: switch to memcache user (role and profile) [puppet] - https://gerrit.wikimedia.org/r/1038697 (owner: Effie Mouzeli) [13:46:52] !log klausman@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-serve-ctrl1002.eqiad.wmnet [13:46:59] (PS6) Effie Mouzeli: memcached: switch to memcache user (role and profile) [puppet] - https://gerrit.wikimedia.org/r/1038697 [13:47:03] (CR) Brouberol:
[C:+1] "Thanks!" [puppet] - https://gerrit.wikimedia.org/r/1038783 (https://phabricator.wikimedia.org/T366596) (owner: Btullis) [13:48:29] (PS3) Elukey: WIP - sre.k8s.reboot-nodes.py: rework alias and group parameters [cookbooks] - https://gerrit.wikimedia.org/r/1038782 [13:49:59] (CR) Btullis: Update the logstash filters for datahub mae/mce consumer pods (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1038783 (https://phabricator.wikimedia.org/T366596) (owner: Btullis) [13:52:07] (CR) Btullis: "Ah, I will abandon the other similar change that I had started: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1038783" [puppet] - https://gerrit.wikimedia.org/r/1038787 (https://phabricator.wikimedia.org/T363856) (owner: Cwhite) [13:52:30] (Abandoned) Btullis: Update the logstash filters for datahub mae/mce consumer pods [puppet] - https://gerrit.wikimedia.org/r/1038783 (https://phabricator.wikimedia.org/T366596) (owner: Btullis) [13:56:57] SRE, SRE-tools: Provide an utility script to replace a failed device in raid 0 array - https://phabricator.wikimedia.org/T350492#9859741 (Volans) @Ladsgroup no, not really. It should be the one of the owners of the systems with raid0 that are interested in automating this step. So I guess `o11y` in this...
[13:58:30] (CR) Filippo Giunchedi: [C:+1] logstash: expand datahub drop filters to match all consumers [puppet] - https://gerrit.wikimedia.org/r/1038787 (https://phabricator.wikimedia.org/T363856) (owner: Cwhite) [13:59:17] !log kamila@cumin1002 conftool action : set/pooled=inactive; selector: name=wikikube-ctrl1001.eqiad.wmnet [13:59:19] !log sukhe@cumin1002 START - Cookbook sre.hosts.reboot-single for host lvs7002.magru.wmnet [13:59:22] (CR) Cwhite: [C:+2] logstash: expand datahub drop filters to match all consumers [puppet] - https://gerrit.wikimedia.org/r/1038787 (https://phabricator.wikimedia.org/T363856) (owner: Cwhite) [13:59:59] (PS3) Hashar: plugins: Add wm-schedule-deployment plugin [software/gerrit] (deploy/wmf/stable-3.8) - https://gerrit.wikimedia.org/r/1038465 (https://phabricator.wikimedia.org/T366512) (owner: BryanDavis) [13:59:59] (PS1) Hashar: Use a wildcard TypeScript include for plugins [software/gerrit] (deploy/wmf/stable-3.8) - https://gerrit.wikimedia.org/r/1038810 [14:00:33] !log kamila@cumin1002 START - Cookbook sre.hosts.decommission for hosts wikikube-ctrl1001.eqiad.wmnet [14:00:34] (CR) CI reject: [V:-1] plugins: Add wm-schedule-deployment plugin [software/gerrit] (deploy/wmf/stable-3.8) - https://gerrit.wikimedia.org/r/1038465 (https://phabricator.wikimedia.org/T366512) (owner: BryanDavis) [14:00:52] (PS4) Elukey: WIP - sre.k8s.reboot-nodes.py: rework alias and group parameters [cookbooks] - https://gerrit.wikimedia.org/r/1038782 [14:02:00] SRE, serviceops, API Platform (RESTBase Deprecation Roadmap): Migrate node-based services in production to node14 - https://phabricator.wikimedia.org/T306995#9859792 (Jdforrester-WMF) [14:02:17] SRE, ChangeProp, EventStreams, Image-Suggestion-API, and 4 others: Migrate node-based services in production to node12 - https://phabricator.wikimedia.org/T290750#9859793 (Jdforrester-WMF) [14:02:48] !log sukhe@cumin1002 END (PASS) - Cookbook
sre.hosts.reboot-single (exit_code=0) for host lvs7002.magru.wmnet [14:03:19] PROBLEM - pybal on lvs7002 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [14:03:21] PROBLEM - PyBal backends health check on lvs7002 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal [14:03:27] RECOVERY - Host kubernetes2030 is UP: PING OK - Packet loss = 0%, RTA = 30.36 ms [14:04:19] (CR) CI reject: [V:-1] WIP - sre.k8s.reboot-nodes.py: rework alias and group parameters [cookbooks] - https://gerrit.wikimedia.org/r/1038782 (owner: Elukey) [14:04:51] PROBLEM - SSH on kubernetes2030 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [14:05:11] RECOVERY - BGP status on asw1-b4-magru.mgmt is OK: BGP OK - up: 12, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:05:19] RECOVERY - pybal on lvs7002 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [14:05:21] RECOVERY - PyBal backends health check on lvs7002 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [14:06:04] (CR) Esanders: [C:+2] Update user-agent string in citoid to be like Zot [deployment-charts] - https://gerrit.wikimedia.org/r/1034860 (https://phabricator.wikimedia.org/T366093) (owner: Mvolz) [14:06:53] !log kamila@cumin1002 START - Cookbook sre.dns.netbox [14:06:58] (Merged) jenkins-bot: Update user-agent string in citoid to be like Zot [deployment-charts] - https://gerrit.wikimedia.org/r/1034860 (https://phabricator.wikimedia.org/T366093) (owner: Mvolz) [14:07:43] RECOVERY - PyBal connections to etcd on lvs7002 is OK: OK: 4 connections established with conf1009.eqiad.wmnet:4001 (min=4) https://wikitech.wikimedia.org/wiki/PyBal
[14:07:46] (PS1) Muehlenhoff: Switch maps/eqiad to PKI as well [puppet] - https://gerrit.wikimedia.org/r/1038815 (https://phabricator.wikimedia.org/T360778) [14:08:11] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1038815 (https://phabricator.wikimedia.org/T360778) (owner: Muehlenhoff) [14:09:51] PROBLEM - Host kubernetes2030 is DOWN: PING CRITICAL - Packet loss = 100% [14:10:32] !log kamila@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-ctrl1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - kamila@cumin1002" [14:14:12] !log kamila@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-ctrl1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - kamila@cumin1002" [14:14:12] !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:14:14] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts wikikube-ctrl1001.eqiad.wmnet [14:15:33] (CR) Filippo Giunchedi: [C:+1] Remove thanos-fe-combined.discovery.wmnet.crt [puppet] - https://gerrit.wikimedia.org/r/1038781 (owner: Muehlenhoff) [14:16:25] ops-eqiad, SRE-OnFire, DC-Ops, serviceops, Sustainability (Incident Followup): eqiad:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366204#9859840 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by kamila@cumin1002 for hosts: `wikikube-ctrl1001.eqiad....
[14:16:54] (CR) Muehlenhoff: [C:+2] Remove thanos-fe-combined.discovery.wmnet.crt [puppet] - https://gerrit.wikimedia.org/r/1038781 (owner: Muehlenhoff) [14:22:25] !log cgoubert@cumin1002 END (FAIL) - Cookbook sre.k8s.reboot-nodes (exit_code=1) rolling reboot on A:wikikube-worker-codfw [14:22:31] :( [14:22:52] (PS1) Muehlenhoff: Remove obsolete thanos-query.discovery.wmnet.crt [puppet] - https://gerrit.wikimedia.org/r/1038818 (https://phabricator.wikimedia.org/T360414) [14:23:44] RESOLVED: SystemdUnitFailed: generate_vrts_aliases.service on mx1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:24:12] (CR) Volans: [C:+1] "LGTM, but please test it for Dell hosts before merging it to be sure we're not breaking the current workflow. Feel free to use the sretest" [cookbooks] - https://gerrit.wikimedia.org/r/1037573 (https://phabricator.wikimedia.org/T365372) (owner: Elukey) [14:24:52] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1038818 (https://phabricator.wikimedia.org/T360414) (owner: Muehlenhoff) [14:27:04] !log sukhe@cumin1002 START - Cookbook sre.hosts.reboot-single for host lvs7003.magru.wmnet [14:28:00] (CR) Effie Mouzeli: [C:+2] memcached: switch to memcache user (role and profile) [puppet] - https://gerrit.wikimedia.org/r/1038697 (owner: Effie Mouzeli) [14:28:55] (PS5) Elukey: WIP - sre.k8s.reboot-nodes.py: rework alias and group parameters [cookbooks] - https://gerrit.wikimedia.org/r/1038782 [14:30:09] PROBLEM - BGP status on asw1-b3-magru.mgmt is CRITICAL: BGP CRITICAL - AS64600/IPv4: Connect - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:30:19] ^ expected, lvs7003 [14:31:25] (PS6) Elukey: WIP - sre.k8s.reboot-nodes.py: rework alias and group parameters [cookbooks] -
https://gerrit.wikimedia.org/r/1038782 [14:33:09] RECOVERY - BGP status on asw1-b3-magru.mgmt is OK: BGP OK - up: 13, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:33:43] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs7003.magru.wmnet [14:34:29] (PS1) Marostegui: Revert "db1184: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1038827 [14:34:54] (PS7) Elukey: WIP - sre.k8s.reboot-nodes.py: rework alias and group parameters [cookbooks] - https://gerrit.wikimedia.org/r/1038782 [14:36:55] SRE-tools, observability: Provide an utility script to replace a failed device in raid 0 array - https://phabricator.wikimedia.org/T350492#9859944 (Ladsgroup) Done. Thanks. [14:36:56] SRE-tools, observability: Provide an utility script to replace a failed device in raid 0 array - https://phabricator.wikimedia.org/T350492#9859947 (Ladsgroup) [14:37:48] ops-codfw, DC-Ops, serviceops: hw troubleshooting: reboot failure for kubernetes2030.codfw.wmnet kubernetes2033.codfw.wmnet kubernetes2035.codfw.wmnet - https://phabricator.wikimedia.org/T366609 (Clement_Goubert) NEW p:Triage→High [14:38:17] (PS8) Elukey: sre.k8s.reboot-nodes.py: rework alias and group parameters [cookbooks] - https://gerrit.wikimedia.org/r/1038782 [14:38:31] !log sukhe@cumin1002 START - Cookbook sre.cdn.roll-reboot rolling reboot on P{cp4045*} and A:cp [14:38:42] (PS2) Dr0ptp4kt: Bump XML dump schema to version 0.11 [mediawiki-config] - https://gerrit.wikimedia.org/r/1038392 (https://phabricator.wikimedia.org/T365155) [14:38:43] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:39:26] (PS2) EoghanGaffney: lists: Update the
quickdatacopy to use /var/lib/mailman3 [puppet] - https://gerrit.wikimedia.org/r/1036686 (https://phabricator.wikimedia.org/T331706) [14:43:00] !log swfrench@deploy1002 helmfile [staging] START helmfile.d/services/changeprop: apply [14:43:36] !log swfrench@deploy1002 helmfile [staging] DONE helmfile.d/services/changeprop: apply [14:46:24] !log cgoubert@cumin1002 conftool action : set/pooled=yes; selector: name=kubernetes203(1|4).codfw.wmnet,cluster=kubernetes,service=kubesvc [14:48:28] !log swfrench@deploy1002 helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply [14:48:38] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on kubernetes[2030,2033,2035].codfw.wmnet with reason: Hardware issue [14:48:49] !log sukhe@cumin1002 END (PASS) - Cookbook sre.cdn.roll-reboot (exit_code=0) rolling reboot on P{cp4045*} and A:cp [14:48:54] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on kubernetes[2030,2033,2035].codfw.wmnet with reason: Hardware issue [14:49:04] ops-codfw, DC-Ops, serviceops: hw troubleshooting: reboot failure for kubernetes2030.codfw.wmnet kubernetes2033.codfw.wmnet kubernetes2035.codfw.wmnet - https://phabricator.wikimedia.org/T366609#9859997 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=da38d2ec-3c5a-4c49-a0b8-5355aa47...
[14:49:12] !log swfrench@deploy1002 helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply [14:50:29] (CR) Gergő Tisza: [C:+2] Show experimental login popup links on the beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/1038389 (https://phabricator.wikimedia.org/T366486) (owner: Bartosz Dziewoński) [14:52:03] (PS2) Bartosz Dziewoński: Show experimental login popup links on the beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/1038389 (https://phabricator.wikimedia.org/T366486) [14:52:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2203 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P64012 and previous config saved to /var/cache/conftool/dbconfig/20240604-145203-root.json [14:52:33] !log swfrench@deploy1002 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply [14:53:23] (PS22) DCausse: wdqs.data-reload: support HDFS as a source [cookbooks] - https://gerrit.wikimedia.org/r/1031933 (https://phabricator.wikimedia.org/T349069) [14:53:26] !log swfrench@deploy1002 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply [14:53:29] (CR) Marostegui: [C:+2] Revert "db1184: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1038827 (owner: Marostegui) [14:55:09] !log sukhe@cumin1002 START - Cookbook sre.cdn.roll-reboot rolling reboot on P{cp3066*} and A:cp [14:55:11] RECOVERY - Host kubernetes2035 is UP: PING OK - Packet loss = 0%, RTA = 30.26 ms [14:55:27] !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-ctrl1001 [14:56:42] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1184.eqiad.wmnet with reason: Maintenance [14:56:44] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1184.eqiad.wmnet with reason: Maintenance [14:57:07] (CR) Klausman: [C:+1] ml-services: set command for
hf image and remove nllb [deployment-charts] - https://gerrit.wikimedia.org/r/1036297 (https://phabricator.wikimedia.org/T365842) (owner: Ilias Sarantopoulos) [14:57:21] !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-ctrl1001 [14:57:25] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:57:53] (CR) Gergő Tisza: [C:+2] Show experimental login popup links on the beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/1038389 (https://phabricator.wikimedia.org/T366486) (owner: Bartosz Dziewoński) [14:58:35] (Merged) jenkins-bot: Show experimental login popup links on the beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/1038389 (https://phabricator.wikimedia.org/T366486) (owner: Bartosz Dziewoński) [14:58:43] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:58:53] (CR) JHathaway: [V:+1] "mutante, this patch now longer generates a puppet diff in prod.. In cloud it will produce an empty array, which should match the current s" [puppet] - https://gerrit.wikimedia.org/r/1037621 (https://phabricator.wikimedia.org/T365395) (owner: JHathaway) [15:00:04] eoghan, jelto, arnoldokoth, and mutante: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for SRE Collaboration Services office hours. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240604T1500).
[15:00:57] (CR) Hashar: "I have rebase your change on top of https://gerrit.wikimedia.org/r/c/operations/software/gerrit/+/1038810/ to ensure TypeScript runs." [software/gerrit] (deploy/wmf/stable-3.8) - https://gerrit.wikimedia.org/r/1038465 (https://phabricator.wikimedia.org/T366512) (owner: BryanDavis) [15:02:08] (CR) Ilias Sarantopoulos: [C:+2] ml-services: set command for hf image and remove nllb [deployment-charts] - https://gerrit.wikimedia.org/r/1036297 (https://phabricator.wikimedia.org/T365842) (owner: Ilias Sarantopoulos) [15:02:09] !log aokoth@cumin1002 START - Cookbook sre.hosts.downtime for 0:30:00 on phab2002.codfw.wmnet with reason: Phorge Update [15:02:23] !log aokoth@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on phab2002.codfw.wmnet with reason: Phorge Update [15:02:53] !log brennen@deploy1002 Started deploy [phabricator/deployment@ef680d8]: deploy phab2002 for T366605 [15:02:56] T366605: Deploy Phabricator/Phorge 2024-06-04 - https://phabricator.wikimedia.org/T366605 [15:03:09] (Merged) jenkins-bot: ml-services: set command for hf image and remove nllb [deployment-charts] - https://gerrit.wikimedia.org/r/1036297 (https://phabricator.wikimedia.org/T365842) (owner: Ilias Sarantopoulos) [15:03:26] !log brennen@deploy1002 Finished deploy [phabricator/deployment@ef680d8]: deploy phab2002 for T366605 (duration: 00m 33s) [15:03:40] !log aokoth@cumin1002 START - Cookbook sre.hosts.downtime for 0:30:00 on phab1004.eqiad.wmnet with reason: Phorge Update [15:03:54] !log aokoth@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on phab1004.eqiad.wmnet with reason: Phorge Update [15:03:59] !log brennen@deploy1002 Started deploy [phabricator/deployment@ef680d8]: deploy phab1004 for T366605 [15:04:14] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host sretest1001.mgmt.eqiad.wmnet with reboot policy GRACEFUL [15:04:32] !log brennen@deploy1002 Finished deploy
[phabricator/deployment@ef680d8]: deploy phab1004 for T366605 (duration: 00m 32s) [15:04:48] (CR) Milimetric: [C:+1] Bump XML dump schema to version 0.11 [mediawiki-config] - https://gerrit.wikimedia.org/r/1038392 (https://phabricator.wikimedia.org/T365155) (owner: Dr0ptp4kt) [15:05:00] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sretest1001.mgmt.eqiad.wmnet with reboot policy GRACEFUL [15:05:17] RECOVERY - Host kubernetes2033 is UP: PING OK - Packet loss = 0%, RTA = 30.27 ms [15:05:41] (PS2) AikoChou: ml-services: update RevertRisk LA/ML/Wikidata's images [deployment-charts] - https://gerrit.wikimedia.org/r/1038736 (https://phabricator.wikimedia.org/T358744) [15:06:06] !log sukhe@cumin1002 END (PASS) - Cookbook sre.cdn.roll-reboot (exit_code=0) rolling reboot on P{cp3066*} and A:cp [15:06:18] (CR) AikoChou: [C:+2] ml-services: update RevertRisk LA/ML/Wikidata's images [deployment-charts] - https://gerrit.wikimedia.org/r/1038736 (https://phabricator.wikimedia.org/T358744) (owner: AikoChou) [15:06:52] (CR) Elukey: "Tested the following:" [cookbooks] - https://gerrit.wikimedia.org/r/1037573 (https://phabricator.wikimedia.org/T365372) (owner: Elukey) [15:07:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2203 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P64013 and previous config saved to /var/cache/conftool/dbconfig/20240604-150710-root.json [15:07:18] (Merged) jenkins-bot: ml-services: update RevertRisk LA/ML/Wikidata's images [deployment-charts] - https://gerrit.wikimedia.org/r/1038736 (https://phabricator.wikimedia.org/T358744) (owner: AikoChou) [15:08:14] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1172.eqiad.wmnet with reason: Maintenance [15:08:19] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host sretest1004.mgmt.eqiad.wmnet with reboot policy FORCED [15:08:21] !log
elukey@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest1004.mgmt.eqiad.wmnet with reboot policy FORCED [15:08:27] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1172.eqiad.wmnet with reason: Maintenance [15:08:35] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1172 (T352010)', diff saved to https://phabricator.wikimedia.org/P64014 and previous config saved to /var/cache/conftool/dbconfig/20240604-150835-ladsgroup.json [15:08:38] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [15:09:17] !log kamila@cumin1002 START - Cookbook sre.dns.netbox [15:10:16] (PS2) Majavah: openldap: cross-validate-accounts: Note shell users disabled in LDAP [puppet] - https://gerrit.wikimedia.org/r/999103 [15:11:12] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host sretest1001.mgmt.eqiad.wmnet with reboot policy FORCED [15:11:53] !log dancy@deploy1002 Installing scap version "4.85.0" for 294 hosts [15:11:53] !log fabfur@cumin1002 END (PASS) - Cookbook sre.cdn.roll-reboot (exit_code=0) rolling reboot on A:cp-upload_magru [15:11:55] !log elukey@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest1001.mgmt.eqiad.wmnet with reboot policy FORCED [15:11:58] !log kamila@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Moved wikikube-ctrl1001 to a new rack - kamila@cumin1002" [15:12:33] !log dancy@deploy1002 Installation of scap version "4.85.0" completed for 294 hosts [15:12:34] (PS1) FNegri: wikireplicas: Add conftool::scripts [puppet] - https://gerrit.wikimedia.org/r/1038847 [15:12:41] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host sretest1001.mgmt.eqiad.wmnet with reboot policy GRACEFUL [15:12:43] (CR) FNegri: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1038847 (owner:
FNegri) [15:13:09] SRE, Cloud-Services, serviceops: Modernise memcached systemd unit / sync, and make it presentable - https://phabricator.wikimedia.org/T273950#9860167 (jijiki) [15:13:14] SRE, Cloud-Services, serviceops: Modernise memcached systemd unit / sync, and make it presentable - https://phabricator.wikimedia.org/T273950#9860169 (jijiki) Open→In progress [15:13:26] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sretest1001.mgmt.eqiad.wmnet with reboot policy GRACEFUL [15:15:02] (PS1) Aklapper: Correct name of Herald option [phabricator/antivandalism] (wmf/stable) - https://gerrit.wikimedia.org/r/1038849 [15:15:09] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 443, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:15:09] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 521, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:15:10] (PS3) EoghanGaffney: lists: Update the quickdatacopy to use /var/lib/mailman3 [puppet] - https://gerrit.wikimedia.org/r/1036686 (https://phabricator.wikimedia.org/T331706) [15:15:23] RECOVERY - SSH on kubernetes2030 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [15:15:25] RECOVERY - Host kubernetes2030 is UP: PING OK - Packet loss = 0%, RTA = 30.40 ms [15:15:29] (CR) Aklapper: [V:+2 C:+2] Correct name of Herald option [phabricator/antivandalism] (wmf/stable) - https://gerrit.wikimedia.org/r/1038849 (owner: Aklapper) [15:15:35] !log kamila@cumin1002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Moved wikikube-ctrl1001 to a new rack - kamila@cumin1002" [15:15:35] !log kamila@cumin1002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [15:15:44] (CR) Urbanecm: [Beta] Enable
CommunityConfiguration extension in all wikis (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1035473 (https://phabricator.wikimedia.org/T364892) (owner: Sergio Gimeno) [15:16:31] (CR) Paladox: [C:+1] gerrit: remove mac algos no more supported by Mina SSHD [puppet] - https://gerrit.wikimedia.org/r/1038703 (https://phabricator.wikimedia.org/T366565) (owner: Hashar) [15:18:11] (CR) Andrea Denisse: [C:+1] "LGTM, thank you!" [puppet] - https://gerrit.wikimedia.org/r/1038818 (https://phabricator.wikimedia.org/T360414) (owner: Muehlenhoff) [15:18:15] !log kamila@cumin1002 START - Cookbook sre.dns.netbox [15:18:46] !log elukey@cumin1002 START - Cookbook sre.ganeti.reboot-vm for VM aux-k8s-ctrl1001.eqiad.wmnet [15:18:56] !log elukey@cumin1002 END (ERROR) - Cookbook sre.ganeti.reboot-vm (exit_code=97) for VM aux-k8s-ctrl1001.eqiad.wmnet [15:19:07] !log elukey@cumin1002 START - Cookbook sre.ganeti.reboot-vm for VM aux-k8s-ctrl1001.eqiad.wmnet [15:19:39] (CR) CDanis: [C:+1] sre.k8s.reboot-nodes.py: rework alias and group parameters [cookbooks] - https://gerrit.wikimedia.org/r/1038782 (owner: Elukey) [15:19:42] !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:19:55] (PS2) FNegri: wikireplicas: Add conftool::scripts [puppet] - https://gerrit.wikimedia.org/r/1038847 [15:20:03] (CR) FNegri: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1038847 (owner: FNegri) [15:20:19] (CR) Scott French: [C:+2] changeprop: add securityContext to all containers [deployment-charts] - https://gerrit.wikimedia.org/r/1030190 (https://phabricator.wikimedia.org/T362978) (owner: Scott French) [15:21:15] !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-ctrl1001 [15:21:15] (Merged) jenkins-bot: changeprop: add securityContext to all containers [deployment-charts] - https://gerrit.wikimedia.org/r/1030190
(https://phabricator.wikimedia.org/T362978) (owner: 10Scott French) [15:21:17] !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-ctrl1001 [15:22:13] !log dcaro@cumin1002 START - Cookbook sre.hosts.reboot-single for host cloudcephosd1001.eqiad.wmnet [15:22:16] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2203 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P64015 and previous config saved to /var/cache/conftool/dbconfig/20240604-152216-root.json [15:25:32] 10ops-codfw, 06DC-Ops, 06serviceops: hw troubleshooting: reboot failure for kubernetes2030.codfw.wmnet kubernetes2033.codfw.wmnet kubernetes2035.codfw.wmnet - https://phabricator.wikimedia.org/T366609#9860268 (10Jhancock.wm) when I put a faceplate on all three servers, I find the same error: The system Confi... [15:25:38] !log elukey@cumin1002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM aux-k8s-ctrl1001.eqiad.wmnet [15:25:45] (03CR) 10Clément Goubert: "Done" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1038251 (https://phabricator.wikimedia.org/T362518) (owner: 10Clément Goubert) [15:26:41] (03CR) 10Jelto: [C:03+1] "lgtm, last sentence in commit message is outdated but should be fine for the initial test" [puppet] - 10https://gerrit.wikimedia.org/r/1036686 (https://phabricator.wikimedia.org/T331706) (owner: 10EoghanGaffney) [15:26:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2136 (T364299)', diff saved to https://phabricator.wikimedia.org/P64017 and previous config saved to /var/cache/conftool/dbconfig/20240604-152644-marostegui.json [15:26:48] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [15:27:01] !log dcausse@deploy1002 Started deploy [airflow-dags/search@a279784]: search: bump to discolytics 0.24 and name n-triples dumps [15:27:12] !log swfrench@deploy1002 helmfile [staging] START helmfile.d/services/changeprop: 
apply [15:27:13] !log tchin@deploy1002 Started deploy [airflow-dags/analytics@a279784]: (no justification provided) [15:27:28] !log swfrench@deploy1002 helmfile [staging] DONE helmfile.d/services/changeprop: apply [15:27:28] !log dcausse@deploy1002 Finished deploy [airflow-dags/search@a279784]: search: bump to discolytics 0.24 and name n-triples dumps (duration: 00m 27s) [15:27:40] !log tchin@deploy1002 Finished deploy [airflow-dags/analytics@a279784]: (no justification provided) (duration: 00m 27s) [15:28:20] !log dcaro@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudcephosd1001.eqiad.wmnet [15:28:38] !log swfrench@deploy1002 helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply [15:29:10] !log swfrench@deploy1002 helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply [15:29:14] (03PS4) 10EoghanGaffney: lists: Update the quickdatacopy to use /var/lib/mailman3 [puppet] - 10https://gerrit.wikimedia.org/r/1036686 (https://phabricator.wikimedia.org/T331706) [15:29:24] !log tchin@deploy1002 Started deploy [airflow-dags/analytics_test@a279784]: (no justification provided) [15:29:26] (03PS5) 10EoghanGaffney: lists: Update the quickdatacopy to use /var/lib/mailman3 [puppet] - 10https://gerrit.wikimedia.org/r/1036686 (https://phabricator.wikimedia.org/T331706) [15:29:34] !log tchin@deploy1002 Finished deploy [airflow-dags/analytics_test@a279784]: (no justification provided) (duration: 00m 10s) [15:31:10] !log elukey@cumin1002 START - Cookbook sre.ganeti.reboot-vm for VM aux-k8s-ctrl1002.eqiad.wmnet [15:31:32] !log dcaro@cumin1002 START - Cookbook sre.hosts.reboot-single for host cloudcephosd1002.eqiad.wmnet [15:31:44] (03PS1) 10Urbanecm: [beta] arwiki: Enable CommunityConfiguration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038852 (https://phabricator.wikimedia.org/T364895) [15:32:39] (03PS2) 10Urbanecm: [beta] arwiki: Enable CommunityConfiguration [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1038852 (https://phabricator.wikimedia.org/T364892) [15:34:01] (03CR) 10EoghanGaffney: [C:03+2] lists: Update the quickdatacopy to use /var/lib/mailman3 [puppet] - 10https://gerrit.wikimedia.org/r/1036686 (https://phabricator.wikimedia.org/T331706) (owner: 10EoghanGaffney) [15:34:01] (03CR) 10Urbanecm: [C:03+2] [beta] arwiki: Enable CommunityConfiguration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038852 (https://phabricator.wikimedia.org/T364892) (owner: 10Urbanecm) [15:34:03] !log swfrench@deploy1002 helmfile [codfw] START helmfile.d/services/changeprop: apply [15:34:06] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: hw troubleshooting: reboot failure for kubernetes2030.codfw.wmnet kubernetes2033.codfw.wmnet kubernetes2035.codfw.wmnet - https://phabricator.wikimedia.org/T366609#9860331 (10Jhancock.wm) all servers are updated and are error free. if this happens again with any... [15:34:40] (03Merged) 10jenkins-bot: [beta] arwiki: Enable CommunityConfiguration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038852 (https://phabricator.wikimedia.org/T364892) (owner: 10Urbanecm) [15:34:53] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: hw troubleshooting: reboot failure for kubernetes2030.codfw.wmnet kubernetes2033.codfw.wmnet kubernetes2035.codfw.wmnet - https://phabricator.wikimedia.org/T366609#9860334 (10Clement_Goubert) Thanks so much @Jhancock.wm [15:35:56] (03PS1) 10Alexandros Kosiaris: sextant cache: Add new service major version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1038857 [15:35:56] (03PS1) 10Alexandros Kosiaris: sextant cache: Allow defining mcrouter's clusterIP [deployment-charts] - 10https://gerrit.wikimedia.org/r/1038858 [15:35:56] (03PS1) 10Alexandros Kosiaris: mcrouter: Bump chart modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/1038859 [15:35:57] (03PS1) 10Alexandros Kosiaris: mw-mcrouter: Switch helmfile.d to use the newer cache module [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1038860 [15:36:02] !log swfrench@deploy1002 helmfile [codfw] DONE helmfile.d/services/changeprop: apply [15:36:52] !log cgoubert@cumin1002 START - Cookbook sre.hosts.remove-downtime for kubernetes[2030,2033,2035].codfw.wmnet [15:36:53] !log dcaro@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudcephosd1002.eqiad.wmnet [15:36:54] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for kubernetes[2030,2033,2035].codfw.wmnet [15:37:06] (03CR) 10CI reject: [V:04-1] mcrouter: Bump chart modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/1038859 (owner: 10Alexandros Kosiaris) [15:37:15] (03CR) 10CI reject: [V:04-1] mw-mcrouter: Switch helmfile.d to use the newer cache module [deployment-charts] - 10https://gerrit.wikimedia.org/r/1038860 (owner: 10Alexandros Kosiaris) [15:37:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2203 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P64018 and previous config saved to /var/cache/conftool/dbconfig/20240604-153722-root.json [15:37:39] !log cgoubert@cumin1002 conftool action : set/pooled=yes; selector: name=kubernetes203(0|3|5).codfw.wmnet,cluster=kubernetes,service=kubesvc [15:37:41] !log elukey@cumin1002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM aux-k8s-ctrl1002.eqiad.wmnet [15:38:05] !log aokoth@cumin1002 START - Cookbook sre.hosts.reboot-single for host vrts2001.codfw.wmnet [15:39:20] (03PS2) 10Alexandros Kosiaris: mcrouter: Bump chart modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/1038859 [15:39:20] (03PS2) 10Alexandros Kosiaris: mw-mcrouter: Switch helmfile.d to use the newer cache module [deployment-charts] - 10https://gerrit.wikimedia.org/r/1038860 [15:40:41] !log fabfur@cumin1002 END (PASS) - Cookbook sre.cdn.roll-reboot (exit_code=0) rolling reboot on A:cp-text_magru [15:41:51] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: hw 
troubleshooting: reboot failure for kubernetes2030.codfw.wmnet kubernetes2033.codfw.wmnet kubernetes2035.codfw.wmnet - https://phabricator.wikimedia.org/T366609#9860373 (10Clement_Goubert) 05Open→03Resolved Hosts repooled, uncordoned and set back to ac... [15:41:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2136', diff saved to https://phabricator.wikimedia.org/P64019 and previous config saved to /var/cache/conftool/dbconfig/20240604-154153-marostegui.json [15:42:01] !log aokoth@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host vrts2001.codfw.wmnet [15:42:19] !log swfrench@deploy1002 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply [15:42:43] (03CR) 10Arturo Borrero Gonzalez: [C:03+1] "LGTM." [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/1012797 (https://phabricator.wikimedia.org/T360378) (owner: 10BryanDavis) [15:42:50] !log elukey@cumin1002 START - Cookbook sre.ganeti.reboot-vm for VM aux-k8s-etcd1001.eqiad.wmnet [15:43:03] !log elukey@cumin1002 END (FAIL) - Cookbook sre.ganeti.reboot-vm (exit_code=99) for VM aux-k8s-etcd1001.eqiad.wmnet [15:43:15] !log swfrench@deploy1002 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply [15:43:23] !log elukey@cumin1002 START - Cookbook sre.ganeti.reboot-vm for VM aux-k8s-etcd1001.eqiad.wmnet [15:44:06] !log aokoth@cumin1002 START - Cookbook sre.hosts.reboot-single for host miscweb2003.codfw.wmnet [15:45:15] PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 46 probes of 788 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [15:45:35] (03CR) 10FNegri: [C:04-1] "This would break https://wikitech.wikimedia.org/wiki/Puppet/Coding_and_style_guidelines#Roles so I need to find another way" [puppet] - 10https://gerrit.wikimedia.org/r/1038847 
(owner: 10FNegri) [15:46:52] !log mvernon@cumin1002 START - Cookbook sre.hosts.reboot-single for host ms-be1051.eqiad.wmnet [15:47:03] !log elukey@cumin1002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM aux-k8s-etcd1001.eqiad.wmnet [15:47:22] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2051.codfw.wmnet [15:47:49] !log dcaro@cumin1002 START - Cookbook sre.hosts.reboot-single for host cloudcephosd1003.eqiad.wmnet [15:47:52] !log elukey@cumin1002 START - Cookbook sre.ganeti.reboot-vm for VM aux-k8s-etcd1002.eqiad.wmnet [15:48:03] !log aokoth@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host miscweb2003.codfw.wmnet [15:50:25] FIRING: SystemdUnitFailed: ferm.service on kubernetes1054:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:51:32] !log fnegri@cumin1002 conftool action : set/pooled=no; selector: name=clouddb1013.eqiad.wmnet,service=s1 [15:51:45] !log elukey@cumin1002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM aux-k8s-etcd1002.eqiad.wmnet [15:52:01] !log fnegri@cumin1002 conftool action : set/pooled=no; selector: name=clouddb1013.eqiad.wmnet,service=s3 [15:52:10] !log dcaro@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudcephosd1003.eqiad.wmnet [15:52:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2203 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P64020 and previous config saved to /var/cache/conftool/dbconfig/20240604-155228-root.json [15:52:33] !log dcaro@cumin1002 START - Cookbook sre.hosts.reboot-single for host cloudcephosd1004.eqiad.wmnet [15:52:54] (03PS1) 10Mhorsey: Activate campaignEvents extension on Igbo wiki. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038862 (https://phabricator.wikimedia.org/T363199) [15:53:06] !log mvernon@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1051.eqiad.wmnet [15:53:23] !log mvernon@cumin1002 START - Cookbook sre.hosts.reboot-single for host ms-be1052.eqiad.wmnet [15:53:34] !log elukey@cumin1002 START - Cookbook sre.ganeti.reboot-vm for VM aux-k8s-etcd1003.eqiad.wmnet [15:54:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [15:55:39] PROBLEM - MariaDB Replica IO: s1 on clouddb1013 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:55:39] PROBLEM - MariaDB Replica SQL: s3 on clouddb1013 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:55:39] PROBLEM - MariaDB Replica IO: s3 on clouddb1013 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:55:42] !log fnegri@cumin1002 START - Cookbook sre.hosts.reboot-single for host clouddb1013.eqiad.wmnet [15:55:51] :-) [15:56:30] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Bumping db1194 weight', diff saved to https://phabricator.wikimedia.org/P64021 and previous config saved to /var/cache/conftool/dbconfig/20240604-155629-ladsgroup.json [15:57:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2136', diff saved to https://phabricator.wikimedia.org/P64022 and previous config saved to /var/cache/conftool/dbconfig/20240604-155701-marostegui.json 
[15:57:26] !log elukey@cumin1002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM aux-k8s-etcd1003.eqiad.wmnet [15:57:39] !log dcaro@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudcephosd1004.eqiad.wmnet [15:58:24] !log aokoth@cumin1002 START - Cookbook sre.hosts.reboot-single for host phab2002.codfw.wmnet [15:59:14] !log swfrench@deploy1002 helmfile [eqiad] START helmfile.d/services/changeprop: apply [15:59:15] RESOLVED: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [16:00:01] !log swfrench@deploy1002 helmfile [eqiad] DONE helmfile.d/services/changeprop: apply [16:00:05] jhathaway and rzl: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240604T1600). [16:00:05] pmiazga: A patch you scheduled for Puppet request window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [16:00:10] !log mvernon@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1052.eqiad.wmnet [16:01:13] o/ [16:01:59] dmed, pmiazga [16:02:10] (03CR) 10JHathaway: [C:03+2] [beta] Add test2.wikimedia.beta.wmcloud.org to beta_sites [puppet] - 10https://gerrit.wikimedia.org/r/1035752 (https://phabricator.wikimedia.org/T355281) (owner: 10Pmiazga) [16:02:25] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Disk (sdl) failed in moss-be1002 - https://phabricator.wikimedia.org/T366153#9860503 (10VRiley-WMF) Since the server is no longer under warranty, we have swapped the HDD with a HDD from a decommissioned server. 
[16:02:44] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Disk (sdl) failed in moss-be1002 - https://phabricator.wikimedia.org/T366153#9860506 (10VRiley-WMF) 05Open→03Resolved [16:02:59] !log dcaro@cumin1002 START - Cookbook sre.hosts.reboot-single for host cloudcephosd1005.eqiad.wmnet [16:04:15] !log btullis@deploy1002 helmfile [codfw] START helmfile.d/services/image-suggestion: apply [16:04:28] !log aokoth@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host phab2002.codfw.wmnet [16:04:41] !log btullis@deploy1002 helmfile [codfw] DONE helmfile.d/services/image-suggestion: apply [16:05:14] !log btullis@deploy1002 helmfile [eqiad] START helmfile.d/services/image-suggestion: apply [16:05:25] FIRING: [2x] SystemdUnitFailed: ferm.service on kubernetes1054:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:05:39] !log btullis@deploy1002 helmfile [eqiad] DONE helmfile.d/services/image-suggestion: apply [16:06:39] RECOVERY - MariaDB Replica IO: s1 on clouddb1013 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [16:07:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2203 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P64023 and previous config saved to /var/cache/conftool/dbconfig/20240604-160735-root.json [16:07:39] RECOVERY - MariaDB Replica SQL: s3 on clouddb1013 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [16:07:43] RECOVERY - MariaDB Replica IO: s3 on clouddb1013 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [16:08:47] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for 
host ms-be2051.codfw.wmnet [16:09:36] !log dcaro@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudcephosd1005.eqiad.wmnet [16:09:48] !log fnegri@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host clouddb1013.eqiad.wmnet [16:10:24] !log fnegri@cumin1002 conftool action : set/pooled=yes; selector: name=clouddb1013.eqiad.wmnet,service=s3 [16:10:25] FIRING: [3x] SystemdUnitFailed: ferm.service on kubernetes1054:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:10:36] !log fnegri@cumin1002 conftool action : set/pooled=yes; selector: name=clouddb1013.eqiad.wmnet,service=s1 [16:10:42] !log fabfur@cumin1002 START - Cookbook sre.hosts.reboot-single for host cp7001.magru.wmnet [16:11:15] !log fabfur@cumin1002 START - Cookbook sre.hosts.reboot-single for host cp7002.magru.wmnet [16:12:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2136 (T364299)', diff saved to https://phabricator.wikimedia.org/P64024 and previous config saved to /var/cache/conftool/dbconfig/20240604-161210-marostegui.json [16:12:12] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2137.codfw.wmnet with reason: Maintenance [16:12:13] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [16:12:25] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2137.codfw.wmnet with reason: Maintenance [16:12:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2137 (T364299)', diff saved to https://phabricator.wikimedia.org/P64025 and previous config saved to /var/cache/conftool/dbconfig/20240604-161233-marostegui.json [16:15:25] FIRING: [4x] SystemdUnitFailed: ferm.service on kubernetes1054:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:15:54] !log fabfur@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp7001.magru.wmnet [16:18:14] (03PS4) 10EoghanGaffney: lists: Add option to block incoming mail [puppet] - 10https://gerrit.wikimedia.org/r/1038772 [16:20:25] FIRING: [4x] SystemdUnitFailed: ferm.service on kubernetes1054:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:22:03] !log fabfur@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp7002.magru.wmnet [16:22:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2203 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P64028 and previous config saved to /var/cache/conftool/dbconfig/20240604-162241-root.json [16:26:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://citoid.svc.eqiad.wmnet:4003 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [16:29:42] !log dcaro@cumin1002 START - Cookbook sre.hosts.reboot-single for host cloudcephosd1010.eqiad.wmnet [16:31:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://citoid.svc.eqiad.wmnet:4003 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [16:31:53] !log delete 3 pods in eventgate-main on wikikube-eqiad to test if envoy on them is in a weird state [16:31:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:32:45] 
(03CR) 10Santiago Faci: [C:03+2] MPIC chart: Added two new secrets [deployment-charts] - 10https://gerrit.wikimedia.org/r/1038775 (https://phabricator.wikimedia.org/T365182) (owner: 10Santiago Faci) [16:34:16] (03Merged) 10jenkins-bot: MPIC chart: Added two new secrets [deployment-charts] - 10https://gerrit.wikimedia.org/r/1038775 (https://phabricator.wikimedia.org/T365182) (owner: 10Santiago Faci) [16:34:39] (03CR) 10Volans: [C:03+1] "Sure, why not, suggestion inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/1038759 (owner: 10Klausman) [16:35:25] FIRING: [3x] SystemdUnitFailed: ferm.service on mw1360:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:35:38] 10ops-codfw, 06cloud-services-team, 06DC-Ops, 10decommission-hardware: decommission cloudcontrol2001-dev.codfw.wmnet - https://phabricator.wikimedia.org/T364577#9860723 (10Andrew) [16:35:51] 06SRE, 10LDAP-Access-Requests: Grant Access to nda/logstash for Sohom Datta - https://phabricator.wikimedia.org/T366032#9860726 (10KFrancis) The NDA is complete. Thanks! 
[16:36:26] !log dcaro@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudcephosd1010.eqiad.wmnet [16:38:33] (03PS13) 10Ayounsi: Netbox-extra: Add bandit and prospector to CI [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/905570 [16:38:34] (03PS1) 10Ayounsi: Fix lots of CI errors [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1038869 [16:39:00] (03CR) 10CI reject: [V:04-1] Fix lots of CI errors [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1038869 (owner: 10Ayounsi) [16:39:48] (03CR) 10CI reject: [V:04-1] Netbox-extra: Add bandit and prospector to CI [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/905570 (owner: 10Ayounsi) [16:40:25] RESOLVED: [3x] SystemdUnitFailed: ferm.service on mw1360:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:41:54] !log delete other 2 pods in eventgate-main on wikikube-eqiad to test if envoy on them is in a weird state [16:41:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:44:01] (03PS2) 10Ayounsi: Fix lots of CI errors [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1038869 [16:44:01] (03PS14) 10Ayounsi: Netbox-extra: Add bandit and prospector to CI [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/905570 [16:44:29] (03CR) 10CI reject: [V:04-1] Fix lots of CI errors [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1038869 (owner: 10Ayounsi) [16:44:59] (03CR) 10CI reject: [V:04-1] Netbox-extra: Add bandit and prospector to CI [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/905570 (owner: 10Ayounsi) [16:47:36] (03PS5) 10Clément Goubert: sre.k8s.reboot-nodes: Add exclude option [cookbooks] - 10https://gerrit.wikimedia.org/r/1038865 [16:49:49] (03PS3) 10Ayounsi: Fix lots of CI errors [software/netbox-extras] - 
10https://gerrit.wikimedia.org/r/1038869 [16:49:50] (03PS15) 10Ayounsi: Netbox-extra: Add bandit and prospector to CI [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/905570 [16:50:15] (03CR) 10Elukey: "Let's coordinate if possible, I have filed https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1038782 that shouldn't clash with yours" [cookbooks] - 10https://gerrit.wikimedia.org/r/1038865 (owner: 10Clément Goubert) [16:50:53] (03CR) 10CI reject: [V:04-1] Netbox-extra: Add bandit and prospector to CI [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/905570 (owner: 10Ayounsi) [16:50:55] (03CR) 10Clément Goubert: "Yeah, I was in the process of doing that 😄" [cookbooks] - 10https://gerrit.wikimedia.org/r/1038865 (owner: 10Clément Goubert) [16:51:57] !log swfrench@deploy1002 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply [16:52:50] !log swfrench@deploy1002 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply [16:53:15] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp700[12].magru.wmnet,service=(cdn|ats-be) [16:53:55] (03PS7) 10Clément Goubert: sre.k8s.reboot-nodes: Add exclude option [cookbooks] - 10https://gerrit.wikimedia.org/r/1038865 [16:55:25] RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 23 probes of 788 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [16:57:44] (03PS2) 10Hnowlan: kubernetes: rename and reimage 3 api appservers, 2 appservers [puppet] - 10https://gerrit.wikimedia.org/r/1038757 (https://phabricator.wikimedia.org/T362323) [16:58:35] (03PS9) 10Effie Mouzeli: [geoip::data::maxmind::ipinfo]: Use GeoLite2 instead of Enterprise data [puppet] - 10https://gerrit.wikimedia.org/r/1037528 (https://phabricator.wikimedia.org/T366272) (owner: 10Kosta Harlan) [16:58:39] (03CR) 10Effie Mouzeli: "check 
experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1037528 (https://phabricator.wikimedia.org/T366272) (owner: 10Kosta Harlan) [17:00:04] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240604T1700) [17:00:13] (03CR) 10Clément Goubert: [C:03+1] kubernetes: rename and reimage 3 api appservers, 2 appservers [puppet] - 10https://gerrit.wikimedia.org/r/1038757 (https://phabricator.wikimedia.org/T362323) (owner: 10Hnowlan) [17:02:17] PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 37 probes of 788 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [17:03:13] (03CR) 10Tchanders: [C:03+1] [CheckUser] Stop writing old for event table migration on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1013386 (https://phabricator.wikimedia.org/T360686) (owner: 10Dreamy Jazz) [17:07:17] RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 27 probes of 788 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [17:08:52] (03PS4) 10Gergő Tisza: multiversion: Support beta for upload hostname check [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1037929 [17:08:52] (03PS4) 10Gergő Tisza: multiversion: Add tests for MWMultiVersion::getMediaWiki() [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1037930 [17:08:52] (03PS8) 10Gergő Tisza: [POC] Handle sso.wikimedia.org domain [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1036245 (https://phabricator.wikimedia.org/T365162) [17:09:35] (03CR) 10CI reject: [V:04-1] [POC] Handle sso.wikimedia.org domain [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1036245 (https://phabricator.wikimedia.org/T365162) (owner: 
10Gergő Tisza) [17:11:49] !log kamila@cumin1002 START - Cookbook sre.dns.netbox [17:13:51] (03PS1) 10Ssingh: hiera: add profile::cache::base::use_noflow_iface_preup for magru cp nodes [puppet] - 10https://gerrit.wikimedia.org/r/1038872 (https://phabricator.wikimedia.org/T366606) [17:14:27] !log kamila@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Moved wikikube-ctrl1001 to a new rack - kamila@cumin1002" [17:15:06] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2744/co" [puppet] - 10https://gerrit.wikimedia.org/r/1038872 (https://phabricator.wikimedia.org/T366606) (owner: 10Ssingh) [17:15:18] !log kamila@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Moved wikikube-ctrl1001 to a new rack - kamila@cumin1002" [17:15:18] !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:16:05] (03CR) 10Ssingh: [V:03+1 C:03+2] hiera: add profile::cache::base::use_noflow_iface_preup for magru cp nodes [puppet] - 10https://gerrit.wikimedia.org/r/1038872 (https://phabricator.wikimedia.org/T366606) (owner: 10Ssingh) [17:22:11] !log sudo cumin 'A:cp and A:magru' 'run-puppet-agent' [17:22:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:23:00] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-ctrl1001.eqiad.wmnet with OS bullseye [17:23:08] 10ops-eqiad, 06SRE-OnFire, 06DC-Ops, 06serviceops, 10Sustainability (Incident Followup): eqiad:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366204#9860880 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by kamila@cumin1002 for host wikikube-ctrl1001.eq... [17:27:30] (03CR) 10Scott French: [C:03+1] "LGTM. Thanks!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1032525 (https://phabricator.wikimedia.org/T362978) (owner: 10Clément Goubert) [17:29:57] (03CR) 10Stoyofuku-wmf: [C:03+1] "Confirmed this is no longer used" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031459 (https://phabricator.wikimedia.org/T301212) (owner: 10Jdlrobson) [17:30:53] (03CR) 10JMeybohm: "I think I failed to create a task last time (or I failed to find it)." [cookbooks] - 10https://gerrit.wikimedia.org/r/1038865 (owner: 10Clément Goubert) [17:32:00] (03PS2) 10Gergő Tisza: [POC][beta] Add rewrite rule for sso.wikimedia.beta.wmflabs.org [puppet] - 10https://gerrit.wikimedia.org/r/1036230 (https://phabricator.wikimedia.org/T365162) [17:33:03] (03PS1) 10Jforrester: Add wikilambda-edit-monolingual-text-placeholder message to extension.json [extensions/WikiLambda] (wmf/1.43.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1038828 (https://phabricator.wikimedia.org/T359782) [17:39:11] !log sukhe@cumin1002 START - Cookbook sre.cdn.roll-reboot rolling reboot on P{cp7002*} and A:cp [17:39:16] (03PS1) 10Stoyofuku-wmf: Disable font size options on specified pages for all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038876 (https://phabricator.wikimedia.org/T366625) [17:40:23] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Traffic: Q4: install PCIe NVMe SSDs into ulsfo text cp40(3[789]|4[01234] - https://phabricator.wikimedia.org/T364891#9860978 (10BCornwall) [17:40:43] (03PS3) 10Stoyofuku-wmf: Disable font size options on specified pages for most wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038444 (https://phabricator.wikimedia.org/T366334) [17:42:11] (03CR) 10Tchanders: [C:03+1] [CheckUser] Stop writing old for event table migration on testwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1013386 (https://phabricator.wikimedia.org/T360686) (owner: 10Dreamy Jazz) [17:49:00] !log sukhe@cumin1002 END (PASS) - Cookbook sre.cdn.roll-reboot (exit_code=0) rolling 
reboot on P{cp7002*} and A:cp [17:51:33] !log sudo cumin 'A:cp-text and A:magru' "sed -i '/\sup ethtool -A eno12399np0/d' /etc/network/interfaces" [17:51:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:53:53] !log sudo cumin 'A:cp-upload and A:magru' "sed -i '/\sup ethtool -A eno12399np0/d' /etc/network/interfaces" [17:53:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:54:28] !log sukhe@cumin1002 START - Cookbook sre.cdn.roll-reboot rolling reboot on P{cp7014*} and A:cp [17:54:41] (03CR) 10Jdlrobson: [C:03+1] Disable font size options on specified pages for all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038876 (https://phabricator.wikimedia.org/T366625) (owner: 10Stoyofuku-wmf) [17:54:42] (03PS8) 10Dreamy Jazz: [CheckUser] Stop writing old for event table migration on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1013386 (https://phabricator.wikimedia.org/T360686) [17:54:54] (03CR) 10Dreamy Jazz: [CheckUser] Stop writing old for event table migration on testwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1013386 (https://phabricator.wikimedia.org/T360686) (owner: 10Dreamy Jazz) [18:00:04] dduvall and dancy: OwO what's this, a deployment window?? MediaWiki train - Utc-7 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240604T1800). nyaa~ [18:00:20] Lurking. [18:04:27] !log sukhe@cumin1002 END (PASS) - Cookbook sre.cdn.roll-reboot (exit_code=0) rolling reboot on P{cp7014*} and A:cp [18:04:55] (03PS9) 10Gergő Tisza: [POC] Handle sso.wikimedia.org domain [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1036245 (https://phabricator.wikimedia.org/T365162) [18:05:48] (03CR) 10CI reject: [V:04-1] [POC] Handle sso.wikimedia.org domain [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1036245 (https://phabricator.wikimedia.org/T365162) (owner: 10Gergő Tisza) [18:05:56] (03CR) 10Dzahn: "alright! 
how about this: I disable puppet on prod phab, merge this, run it on cloud and if it breaks there I just revert, if not I enable " [puppet] - 10https://gerrit.wikimedia.org/r/1037621 (https://phabricator.wikimedia.org/T365395) (owner: 10JHathaway) [18:07:10] (03CR) 10JHathaway: [V:03+1] "sounds great" [puppet] - 10https://gerrit.wikimedia.org/r/1037621 (https://phabricator.wikimedia.org/T365395) (owner: 10JHathaway) [18:07:13] (03PS1) 10CDobbins: purged: set use_pki to true for all eqsin [puppet] - 10https://gerrit.wikimedia.org/r/1038881 (https://phabricator.wikimedia.org/T360506) [18:08:24] dancy: o/ [18:11:48] (03CR) 10Dzahn: "it fails in compiler like this: Error: Evaluation Error: Error while evaluating a Function Call, Failed to execute '/pdb/query/v4' on at l" [puppet] - 10https://gerrit.wikimedia.org/r/1037621 (https://phabricator.wikimedia.org/T365395) (owner: 10JHathaway) [18:12:37] (03CR) 10Dzahn: "Since this already happens on the compiler hosts I would expect the same on the devtools hosts." [puppet] - 10https://gerrit.wikimedia.org/r/1037621 (https://phabricator.wikimedia.org/T365395) (owner: 10JHathaway) [18:13:00] (03PS1) 10Urbanecm: Growth: Use `growthexperiments` DB list for enabling GrowthExperiments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038882 (https://phabricator.wikimedia.org/T364892) [18:14:25] (03CR) 10Dzahn: "can we lookup the list of host name in Hiera?" 
[puppet] - 10https://gerrit.wikimedia.org/r/1037621 (https://phabricator.wikimedia.org/T365395) (owner: 10JHathaway) [18:14:42] (03PS1) 10TrainBranchBot: group0 wikis to 1.43.0-wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038883 (https://phabricator.wikimedia.org/T361402) [18:14:44] (03CR) 10TrainBranchBot: [C:03+2] group0 wikis to 1.43.0-wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038883 (https://phabricator.wikimedia.org/T361402) (owner: 10TrainBranchBot) [18:15:37] !log kamila@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-ctrl1001.eqiad.wmnet with OS bullseye [18:15:44] (03CR) 10Dzahn: "keep in mind if you just rename the resource itself and don't absent it then puppet won't remove the timer/service and you'll end up with " [puppet] - 10https://gerrit.wikimedia.org/r/1036686 (https://phabricator.wikimedia.org/T331706) (owner: 10EoghanGaffney) [18:15:45] FIRING: SystemdUnitFailed: generate_vrts_aliases.service on mx2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:15:46] (03Merged) 10jenkins-bot: group0 wikis to 1.43.0-wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038883 (https://phabricator.wikimedia.org/T361402) (owner: 10TrainBranchBot) [18:15:48] 10ops-eqiad, 06SRE-OnFire, 06DC-Ops, 06serviceops, 10Sustainability (Incident Followup): eqiad:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366204#9861152 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by kamila@cumin1002 for host wikikube-ctrl1001.eqiad.... 
[18:16:44] (03CR) 10CDobbins: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2746/co" [puppet] - 10https://gerrit.wikimedia.org/r/1038881 (https://phabricator.wikimedia.org/T360506) (owner: 10CDobbins) [18:16:48] (03CR) 10Dzahn: "https://puppet-compiler.wmflabs.org/output/1037621/2745/phab1004.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1037621 (https://phabricator.wikimedia.org/T365395) (owner: 10JHathaway) [18:17:55] (03CR) 10JHathaway: [V:03+1] "There is logic in `modules/wmflib/functions/puppetdb_query.pp` to return an empty array, if a puppetdb server is not present in an environ" [puppet] - 10https://gerrit.wikimedia.org/r/1037621 (https://phabricator.wikimedia.org/T365395) (owner: 10JHathaway) [18:18:26] (03CR) 10Ssingh: [C:03+1] "Looks good, let's plan to merge this on Wed Jun 5!" [puppet] - 10https://gerrit.wikimedia.org/r/1038881 (https://phabricator.wikimedia.org/T360506) (owner: 10CDobbins) [18:19:16] (03CR) 10Dzahn: "even if it works in devtools this would still mean we can't compile changes anymore in the future" [puppet] - 10https://gerrit.wikimedia.org/r/1037621 (https://phabricator.wikimedia.org/T365395) (owner: 10JHathaway) [18:19:23] 06SRE, 06Infrastructure-Foundations, 10netops: Upgrade Eqiad row E-F Spines to JunOS 22.2R3 - https://phabricator.wikimedia.org/T366361#9861173 (10cmooney) [18:19:44] 06SRE, 06Infrastructure-Foundations, 10netops: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - https://phabricator.wikimedia.org/T348977#9861174 (10cmooney) [18:21:53] (03CR) 10Michael Große: [C:03+1] "Sounds good to me, but this feels really like something where I would like us to get explicit approval from RelEng (SRE?) 
about before dep" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038882 (https://phabricator.wikimedia.org/T364892) (owner: 10Urbanecm) [18:22:57] (03PS1) 10Ssingh: haproxy: update systemd template for network-online.target [puppet] - 10https://gerrit.wikimedia.org/r/1038884 [18:23:30] (03CR) 10Dzahn: "Or it needs a "if $realm = production" clause around lookup and something else in an else branch. Those realm checks are not recommended b" [puppet] - 10https://gerrit.wikimedia.org/r/1037621 (https://phabricator.wikimedia.org/T365395) (owner: 10JHathaway) [18:23:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1238 (T364069)', diff saved to https://phabricator.wikimedia.org/P64031 and previous config saved to /var/cache/conftool/dbconfig/20240604-182342-marostegui.json [18:23:47] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [18:24:13] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2747/console" [puppet] - 10https://gerrit.wikimedia.org/r/1038884 (owner: 10Ssingh) [18:25:10] (03CR) 10Urbanecm: [C:04-1] [Beta] Enable CommunityConfiguration extension in all wikis (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1035473 (https://phabricator.wikimedia.org/T364892) (owner: 10Sergio Gimeno) [18:26:57] !log dduvall@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.43.0-wmf.8 refs T361402 [18:27:00] T361402: 1.43.0-wmf.8 deployment blockers - https://phabricator.wikimedia.org/T361402 [18:28:17] !log dzahn@cumin1002 START - Cookbook sre.hosts.downtime for 0:10:00 on doc.wikimedia.org with reason: reboot T366555 [18:28:18] !log dzahn@cumin1002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 0:10:00 on doc.wikimedia.org with reason: reboot T366555 [18:28:37] !log dzahn@cumin1002 START - Cookbook sre.hosts.downtime for 0:10:00 on 
doc1003.eqiad.wmnet with reason: reboot T366555 [18:28:38] !log dzahn@cumin1002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 0:10:00 on doc1003.eqiad.wmnet with reason: reboot T366555 [18:29:00] 06SRE, 06Infrastructure-Foundations, 10netops: Upgrade Eqiad row E-F Spines to JunOS 22.2R3 - https://phabricator.wikimedia.org/T366361#9861186 (10cmooney) [18:30:19] !log doc.wikimedia.org - very short downtime for maintenance [18:30:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:30:48] (03PS2) 10Ssingh: P:cache::haproxy: update systemd template for network-online.target [puppet] - 10https://gerrit.wikimedia.org/r/1038884 [18:32:02] train looks good [18:32:15] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2748/co" [puppet] - 10https://gerrit.wikimedia.org/r/1038884 (owner: 10Ssingh) [18:35:45] !log aphlict - (phab realtime notifications) - reboots [18:35:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:36:51] (03CR) 10Scott French: "Looks good! Only one notable comment / question." [puppet] - 10https://gerrit.wikimedia.org/r/1037868 (https://phabricator.wikimedia.org/T341553) (owner: 10RLazarus) [18:38:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1238', diff saved to https://phabricator.wikimedia.org/P64032 and previous config saved to /var/cache/conftool/dbconfig/20240604-183850-marostegui.json [18:40:28] (03CR) 10BCornwall: [C:03+1] P:cache::haproxy: update systemd template for network-online.target [puppet] - 10https://gerrit.wikimedia.org/r/1038884 (owner: 10Ssingh) [18:41:19] !log [WDQS Deploy] Gearing up for deploy of wdqs `0.3.142`. 
Pre-deploy tests passing on canary `wdqs1016` [18:41:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:44:04] !log ryankemper@deploy1002 Started deploy [wdqs/wdqs@143ca33]: 0.3.142 [18:45:18] !log [WDQS Deploy] Tests passing following deploy of `0.3.142` on canary `wdqs1016`; proceeding to rest of fleet [18:45:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:46:07] !log ryankemper@deploy1002 Finished deploy [wdqs/wdqs@143ca33]: 0.3.142 (duration: 02m 02s) [18:46:20] 06SRE, 10SRE-swift-storage, 06Infrastructure-Foundations: moss-be1003 "Warning: The current total number of facts: 2830 exceeds the number of facts limit: 2048" - https://phabricator.wikimedia.org/T366563#9861253 (10CDanis) I discussed this with @Muehlenhoff in his evening/my morning. `lang=irc 09:12:36 06SRE, 10SRE-swift-storage, 06Infrastructure-Foundations: moss-be1003 "Warning: The current total number of facts: 2830 exceeds the number of facts limit: 2048" - https://phabricator.wikimedia.org/T366563#9861256 (10CDanis) p:05High→03Medium [18:47:50] !log ryankemper@deploy1002 Started deploy [wdqs/wdqs@43b966f]: 0.3.142 [18:48:15] (03PS1) 10Jsn.sherman: InitialiseSettings: Enable AutoModerator on trwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038886 (https://phabricator.wikimedia.org/T362622) [18:48:59] !log [WDQS Deploy] Forgot to run the command to set git hash to tip of origin/master so deploy was a partial no-op. Re-rolling... 
[18:49:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:50:57] (03PS1) 10JHathaway: devtools: update puppetmaster and pubkey [puppet] - 10https://gerrit.wikimedia.org/r/1038887 (https://phabricator.wikimedia.org/T365395) [18:51:46] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1038887 (https://phabricator.wikimedia.org/T365395) (owner: 10JHathaway) [18:53:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1238', diff saved to https://phabricator.wikimedia.org/P64033 and previous config saved to /var/cache/conftool/dbconfig/20240604-185358-marostegui.json [18:57:28] (03CR) 10Dzahn: [C:03+1] "confirmed the puppetmaster for devtools moved to puppetmaster-1003. haven't checked where you got the key from, but lgtm. it's a change to" [puppet] - 10https://gerrit.wikimedia.org/r/1038887 (https://phabricator.wikimedia.org/T365395) (owner: 10JHathaway) [18:57:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:57:53] (03CR) 10JHathaway: [C:03+2] devtools: update puppetmaster and pubkey [puppet] - 10https://gerrit.wikimedia.org/r/1038887 (https://phabricator.wikimedia.org/T365395) (owner: 10JHathaway) [19:00:43] !log ryankemper@deploy1002 Finished deploy [wdqs/wdqs@43b966f]: 0.3.142 (duration: 12m 53s) [19:03:14] (03CR) 10JHathaway: [C:03+2] "will do, is there a doc on doing that somewhere?" 
[puppet] - 10https://gerrit.wikimedia.org/r/1038887 (https://phabricator.wikimedia.org/T365395) (owner: 10JHathaway) [19:04:15] (03CR) 10Dzahn: [C:03+1] "well, my comment was because I don't know that" [puppet] - 10https://gerrit.wikimedia.org/r/1038887 (https://phabricator.wikimedia.org/T365395) (owner: 10JHathaway) [19:06:23] !log [WDQS Deploy] Restarted `wdqs-updater` across all hosts, 4 hosts at a time: `sudo -E cumin -b 4 'A:wdqs-all' 'systemctl restart wdqs-updater'` [19:06:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:06:35] !log [WDQS Deploy] Restarted `wdqs-categories` across all test hosts simultaneously: `sudo -E cumin 'A:wdqs-test' 'systemctl restart wdqs-categories'` [19:06:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:06:43] !log [WDQS Deploy] Restarting `wdqs-categories` across lvs-managed hosts, one node at a time: `sudo -E cumin -b 1 'A:wdqs-all and not A:wdqs-test' 'depool && sleep 45 && systemctl restart wdqs-categories && sleep 45 && pool'` [19:06:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:09:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1238 (T364069)', diff saved to https://phabricator.wikimedia.org/P64034 and previous config saved to /var/cache/conftool/dbconfig/20240604-190906-marostegui.json [19:09:09] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1241.eqiad.wmnet with reason: Maintenance [19:09:12] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [19:09:23] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1241.eqiad.wmnet with reason: Maintenance [19:09:32] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1241 (T364069)', diff saved to https://phabricator.wikimedia.org/P64035 and previous config saved to 
/var/cache/conftool/dbconfig/20240604-190931-marostegui.json [19:11:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 21.79% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [19:11:20] (03CR) 10Dzahn: [C:03+1] "but that would be the project puppet-diffs, not devtools, where it would have to be deployed I think" [puppet] - 10https://gerrit.wikimedia.org/r/1038887 (https://phabricator.wikimedia.org/T365395) (owner: 10JHathaway) [19:12:33] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1184.eqiad.wmnet with reason: Maintenance [19:12:35] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1184.eqiad.wmnet with reason: Maintenance [19:13:43] RESOLVED: SystemdUnitFailed: generate_vrts_aliases.service on mx2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:13:57] !log dzahn@cumin1002 START - Cookbook sre.hosts.downtime for 0:10:00 on releases1003.eqiad.wmnet with reason: reboot T366555 [19:14:10] !log dzahn@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on releases1003.eqiad.wmnet with reason: reboot T366555 [19:16:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 21.79% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [19:16:36] !log releases.wikimedia.org - short downtime for 
maintenance [19:16:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:18:53] (03CR) 10Eevans: [C:03+2] cassandra: create new commons_impact_analytics role [puppet] - 10https://gerrit.wikimedia.org/r/1038409 (https://phabricator.wikimedia.org/T361835) (owner: 10Eevans) [19:32:51] !log dzahn@cumin1002 START - Cookbook sre.hosts.downtime for 0:10:00 on contint2002.wikimedia.org with reason: reboot T366555 [19:33:05] !log dzahn@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on contint2002.wikimedia.org with reason: reboot T366555 [19:35:05] (03PS1) 10BCornwall: ncmonitor: Add SSH credentials support [puppet] - 10https://gerrit.wikimedia.org/r/1038890 (https://phabricator.wikimedia.org/T355189) [19:36:37] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-ctrl1001.eqiad.wmnet with OS bullseye [19:36:47] 10ops-eqiad, 06SRE-OnFire, 06DC-Ops, 06serviceops, 10Sustainability (Incident Followup): eqiad:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366204#9861507 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by kamila@cumin1002 for host wikikube-ctrl1001.eq... [19:36:48] (03PS2) 10BCornwall: ncmonitor: Add SSH credentials support [puppet] - 10https://gerrit.wikimedia.org/r/1038890 (https://phabricator.wikimedia.org/T355189) [19:37:34] !log dzahn@cumin1002 START - Cookbook sre.hosts.downtime for 0:10:00 on gerrit2002.wikimedia.org with reason: reboot T366555 [19:37:40] !log kamila@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host wikikube-ctrl1001.eqiad.wmnet with OS bullseye [19:37:41] 06SRE, 06serviceops, 10Data Products (Data Products Sprint 14), 13Patch-For-Review, 07Service-deployment-requests: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production - https://phabricator.wikimedia.org/T361835#9861511 (10Scott_French) Thanks for the update, @SGupta-WMF - that's great! T... 
[19:37:46] 10ops-eqiad, 06SRE-OnFire, 06DC-Ops, 06serviceops, 10Sustainability (Incident Followup): eqiad:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366204#9861512 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by kamila@cumin1002 for host wikikube-ctrl1001.eqiad.... [19:37:47] !log dzahn@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on gerrit2002.wikimedia.org with reason: reboot T366555 [19:37:55] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-ctrl1001.eqiad.wmnet with OS bullseye [19:38:00] 10ops-eqiad, 06SRE-OnFire, 06DC-Ops, 06serviceops, 10Sustainability (Incident Followup): eqiad:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366204#9861516 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by kamila@cumin1002 for host wikikube-ctrl1001.eq... [19:38:04] !log dzahn@cumin1002 START - Cookbook sre.hosts.downtime for 0:10:00 on gerrit-replica.wikimedia.org with reason: reboot T366555 [19:38:05] !log dzahn@cumin1002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 0:10:00 on gerrit-replica.wikimedia.org with reason: reboot T366555 [19:38:10] 06SRE, 06serviceops, 10Data Products (Data Products Sprint 14), 13Patch-For-Review, 07Service-deployment-requests: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production - https://phabricator.wikimedia.org/T361835#9861517 (10Scott_French) [19:38:28] !log https://gerrit-replica.wikimedia.org - short downtime for maintenance [19:38:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:40:13] jouncebot: nowandnext [19:40:13] For the next 0 hour(s) and 19 minute(s): MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240604T1800) [19:40:13] In 0 hour(s) and 19 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240604T2000) 
[19:40:31] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2137 (T364299)', diff saved to https://phabricator.wikimedia.org/P64036 and previous config saved to /var/cache/conftool/dbconfig/20240604-194031-marostegui.json [19:40:34] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [19:44:22] !log kamila@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host wikikube-ctrl1001.eqiad.wmnet with OS bullseye [19:44:33] 10ops-eqiad, 06SRE-OnFire, 06DC-Ops, 06serviceops, 10Sustainability (Incident Followup): eqiad:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366204#9861558 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by kamila@cumin1002 for host wikikube-ctrl1001.eqiad.... [19:46:07] (03PS3) 10Pppery: [pawiki] Enable wgMinervaEnableSiteNotice [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1037945 (https://phabricator.wikimedia.org/T366434) [19:47:48] !log kamila@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wikikube-ctrl1001'] [19:49:23] !log ecarg@deploy1002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [19:49:25] !log ecarg@deploy1002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [19:55:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2137', diff saved to https://phabricator.wikimedia.org/P64037 and previous config saved to /var/cache/conftool/dbconfig/20240604-195539-marostegui.json [19:59:28] !log kamila@cumin1002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['wikikube-ctrl1001'] [20:00:03] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-ctrl1001.eqiad.wmnet with OS bullseye [20:00:04] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: #bothumor I � Unicode. All rise for UTC late backport window deploy. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240604T2000). [20:00:04] pppery, pmiazga, tgr, and toyofuku: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:08] 10ops-eqiad, 06SRE-OnFire, 06DC-Ops, 06serviceops, 10Sustainability (Incident Followup): eqiad:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366204#9861609 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by kamila@cumin1002 for host wikikube-ctrl1001.eq... [20:00:10] Here [20:00:16] This is my first time doing this, though [20:00:30] o/ [20:00:40] Also here, and it's my second 🙃 [20:01:05] (03CR) 10Jforrester: "Eh, fine, you've convinced me." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1027150 (https://phabricator.wikimedia.org/T256190) (owner: 10Jforrester) [20:09:23] Hello? [20:10:32] Pppery: I'm not on window right now but I can deploy this sooon [20:10:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2137', diff saved to https://phabricator.wikimedia.org/P64038 and previous config saved to /var/cache/conftool/dbconfig/20240604-201047-marostegui.json [20:14:00] (03CR) 10Ladsgroup: [C:03+2] [pawiki] Enable wgMinervaEnableSiteNotice [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1037945 (https://phabricator.wikimedia.org/T366434) (owner: 10Pppery) [20:14:36] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1037945 (https://phabricator.wikimedia.org/T366434) (owner: 10Pppery) [20:14:39] (03Merged) 10jenkins-bot: [pawiki] Enable wgMinervaEnableSiteNotice [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1037945 (https://phabricator.wikimedia.org/T366434) (owner: 10Pppery) [20:15:08] !log ladsgroup@deploy1002 Started scap: Backport for 
[[gerrit:1037945|[pawiki] Enable wgMinervaEnableSiteNotice (T366434)]] [20:15:12] T366434: Enable SiteNotice in Mobile View on Punjabi Wikipedia - https://phabricator.wikimedia.org/T366434 [20:15:45] (03CR) 10Hashar: plugins: Add wm-schedule-deployment plugin (031 comment) [software/gerrit] (deploy/wmf/stable-3.8) - 10https://gerrit.wikimedia.org/r/1038465 (https://phabricator.wikimedia.org/T366512) (owner: 10BryanDavis) [20:17:45] !log ladsgroup@deploy1002 pppery and ladsgroup: Backport for [[gerrit:1037945|[pawiki] Enable wgMinervaEnableSiteNotice (T366434)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:19:12] Can confirm I see the sitenotice at pa.m.wikipedia.org with X-Wikimedia-Debug set up and don't when it isn't set up, so looks good [20:19:58] !log ladsgroup@deploy1002 pppery and ladsgroup: Continuing with sync [20:20:08] moving forward. thanks [20:21:39] !log jclark@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts mw1358.eqiad.wmnet [20:21:50] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts mw1358.eqiad.wmnet [20:22:06] !log jclark@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts mw1358.eqiad.wmnet [20:22:15] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts mw1358.eqiad.wmnet [20:22:54] (03CR) 10Urbanecm: [C:03+1] "lgtm. can we get it merged?" 
[puppet] - 10https://gerrit.wikimedia.org/r/1028855 (https://phabricator.wikimedia.org/T363825) (owner: 10Zabe) [20:23:28] (03PS6) 10Pmiazga: beta: Introduce new test2wiki on test2.wikipedia.beta.wmcloud.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1035749 (https://phabricator.wikimedia.org/T355281) [20:25:55] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2137 (T364299)', diff saved to https://phabricator.wikimedia.org/P64039 and previous config saved to /var/cache/conftool/dbconfig/20240604-202554-marostegui.json [20:25:57] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2139.codfw.wmnet with reason: Maintenance [20:25:59] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [20:26:05] (03CR) 10Pmiazga: beta: Introduce new test2wiki on test2.wikipedia.beta.wmcloud.org (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1035749 (https://phabricator.wikimedia.org/T355281) (owner: 10Pmiazga) [20:26:10] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2139.codfw.wmnet with reason: Maintenance [20:27:22] !log vacuuming pcc db [20:27:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:28:32] !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:1037945|[pawiki] Enable wgMinervaEnableSiteNotice (T366434)]] (duration: 13m 24s) [20:28:35] T366434: Enable SiteNotice in Mobile View on Punjabi Wikipedia - https://phabricator.wikimedia.org/T366434 [20:28:43] Pppery: deployed [20:29:44] I need to be afk for a bit, if someone else can take over, that'd be amazing [20:31:19] will do [20:31:29] ty ty [20:33:23] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1035749 (https://phabricator.wikimedia.org/T355281) (owner: 10Pmiazga) [20:33:59] (03Merged) 10jenkins-bot: beta: Introduce new test2wiki 
on test2.wikipedia.beta.wmcloud.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1035749 (https://phabricator.wikimedia.org/T355281) (owner: 10Pmiazga) [20:34:30] !log tgr@deploy1002 Started scap: Backport for [[gerrit:1035749|beta: Introduce new test2wiki on test2.wikipedia.beta.wmcloud.org (T355281)]] [20:34:34] T355281: Set up some beta cluster wikis with different registrable domain - https://phabricator.wikimedia.org/T355281 [20:37:54] !log tgr@deploy1002 tgr and pmiazga: Backport for [[gerrit:1035749|beta: Introduce new test2wiki on test2.wikipedia.beta.wmcloud.org (T355281)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:38:41] (03PS1) 10Pppery: [jawikinews] Set $wgArticleCountMethod to any [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038897 (https://phabricator.wikimedia.org/T364189) [20:39:10] !log tgr@deploy1002 tgr and pmiazga: Continuing with sync [20:42:26] (03CR) 10Ebernhardson: [C:03+1] "Generally looks good, one nit on an awkward comments. Could add more nits on some python bits, but they are generally irrelevant." [cookbooks] - 10https://gerrit.wikimedia.org/r/1031933 (https://phabricator.wikimedia.org/T349069) (owner: 10DCausse) [20:43:22] hi pmiazga :) [20:43:27] o/ [20:44:56] pmiazga: (recapping from -releng) you wanted me to deploy something. can do, once tgr|away is done with the patch / window, as appropriate. [20:47:06] cool. thank you. Mine is no-op for prod [20:47:43] !log tgr@deploy1002 Finished scap: Backport for [[gerrit:1035749|beta: Introduce new test2wiki on test2.wikipedia.beta.wmcloud.org (T355281)]] (duration: 13m 12s) [20:47:46] T355281: Set up some beta cluster wikis with different registrable domain - https://phabricator.wikimedia.org/T355281 [20:47:58] ^ I think that was the one [20:48:37] so if prod works, everything is good.Nice, thank you tgr|away! 
Looks like for last 40 mins I was looking into empty #wikimedia-releng channel and I was wondering why no one deploys now [20:48:49] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 06serviceops: hw troubleshooting: firmware upgrade for mw1358.eqiad.wmnet - https://phabricator.wikimedia.org/T366583#9861710 (10Jclark-ctr) 05Open→03Resolved manually updated firmware iDRAC Firmware Version 7.00.00.171 BIOS Version... [20:49:06] (03CR) 10Gergő Tisza: [C:03+2] [beta] Remove references to upload.beta.wmflabs.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038230 (https://phabricator.wikimedia.org/T366415) (owner: 10Gergő Tisza) [20:49:22] (03PS1) 10Pppery: [zghwiki] Add patroller and autopatrolled groups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038899 (https://phabricator.wikimedia.org/T357411) [20:49:42] (03PS3) 10Gergő Tisza: [beta] Remove references to upload.beta.wmflabs.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038230 (https://phabricator.wikimedia.org/T366415) [20:50:03] (03CR) 10CI reject: [V:04-1] [zghwiki] Add patroller and autopatrolled groups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038899 (https://phabricator.wikimedia.org/T357411) (owner: 10Pppery) [20:50:41] (03CR) 10Gergő Tisza: [C:03+2] [beta] Remove references to upload.beta.wmflabs.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038230 (https://phabricator.wikimedia.org/T366415) (owner: 10Gergő Tisza) [20:51:05] (03PS2) 10Pppery: [zghwiki] Add patroller and autopatrolled groups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038899 (https://phabricator.wikimedia.org/T357411) [20:51:21] (03Merged) 10jenkins-bot: [beta] Remove references to upload.beta.wmflabs.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038230 (https://phabricator.wikimedia.org/T366415) (owner: 10Gergő Tisza) [20:51:43] (03CR) 10CI reject: [V:04-1] [zghwiki] Add patroller and autopatrolled groups [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1038899 (https://phabricator.wikimedia.org/T357411) (owner: 10Pppery) [20:51:46] (03PS5) 10Gergő Tisza: multiversion: Support beta for upload hostname check [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1037929 [20:51:54] (03PS5) 10Gergő Tisza: multiversion: Add tests for MWMultiVersion::getMediaWiki() [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1037930 [20:52:09] (03PS10) 10Gergő Tisza: [POC] Handle sso.wikimedia.org domain [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1036245 (https://phabricator.wikimedia.org/T365162) [20:52:37] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1037929 (owner: 10Gergő Tisza) [20:52:38] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1037930 (owner: 10Gergő Tisza) [20:52:41] !log kamila@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-ctrl1001.eqiad.wmnet with OS bullseye [20:52:46] 10ops-eqiad, 06SRE-OnFire, 06DC-Ops, 06serviceops, 10Sustainability (Incident Followup): eqiad:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366204#9861730 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by kamila@cumin1002 for host wikikube-ctrl1001.eqiad.... 
[20:52:47] (03CR) 10CI reject: [V:04-1] [POC] Handle sso.wikimedia.org domain [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1036245 (https://phabricator.wikimedia.org/T365162) (owner: 10Gergő Tisza) [20:53:13] (03Merged) 10jenkins-bot: multiversion: Support beta for upload hostname check [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1037929 (owner: 10Gergő Tisza) [20:53:19] (03Merged) 10jenkins-bot: multiversion: Add tests for MWMultiVersion::getMediaWiki() [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1037930 (owner: 10Gergő Tisza) [20:53:50] !log tgr@deploy1002 Started scap: Backport for [[gerrit:1037929|multiversion: Support beta for upload hostname check]], [[gerrit:1037930|multiversion: Add tests for MWMultiVersion::getMediaWiki()]] [20:56:34] !log kamila@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wikikube-ctrl1001'] [20:58:33] !log tgr@deploy1002 tgr: Backport for [[gerrit:1037929|multiversion: Support beta for upload hostname check]], [[gerrit:1037930|multiversion: Add tests for MWMultiVersion::getMediaWiki()]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:59:08] (03PS1) 10Pppery: [ptwikinews] Set atom feed link [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038901 (https://phabricator.wikimedia.org/T356003) [20:59:17] 20:56:06 Check 'check_testservers_k8s' failed: Sending to mwdebug.discovery.wmnet... [20:59:20] https://techconduct.wikimedia.org/wiki/Main_Page (/srv/deployment/httpbb-tests/appserver/test_remnant.yaml:159) Status code: expected 200, got 503. [20:59:35] error went away on retry so fingers crossed... [21:01:19] tgr|away: can you please ping me once done? 
:) [21:01:57] !log tgr@deploy1002 tgr: Continuing with sync [21:05:50] (03PS4) 10Stoyofuku-wmf: Disable font size options on specified pages for most wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038444 (https://phabricator.wikimedia.org/T366334) [21:06:50] !log kamila@cumin1002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['wikikube-ctrl1001'] [21:07:46] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-ctrl1001.eqiad.wmnet with OS bullseye [21:08:00] (03PS11) 10Gergő Tisza: [POC] Handle sso.wikimedia.org domain [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1036245 (https://phabricator.wikimedia.org/T365162) [21:08:36] (03CR) 10CI reject: [V:04-1] [POC] Handle sso.wikimedia.org domain [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1036245 (https://phabricator.wikimedia.org/T365162) (owner: 10Gergő Tisza) [21:10:23] !log tgr@deploy1002 Finished scap: Backport for [[gerrit:1037929|multiversion: Support beta for upload hostname check]], [[gerrit:1037930|multiversion: Add tests for MWMultiVersion::getMediaWiki()]] (duration: 16m 33s) [21:10:46] 10ops-eqiad, 06SRE-OnFire, 06DC-Ops, 06serviceops, 10Sustainability (Incident Followup): eqiad:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366204#9861770 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by kamila@cumin1002 for host wikikube-ctrl1001.eq... [21:10:58] urbanecm: done [21:11:34] thanks tgr|away [21:12:17] i see an unmerged patch by toyofuku as well – did we decide to skip it, or should that be done as well? [21:12:27] also pmiazga what is it you wanted deployed? [21:12:48] I mean, I'm here if someone's willing to deploy it [21:13:13] As of right now I'm unqualified to do so myself 😭 I promise to pay it back when I'm trained up [21:13:16] i can do that :) [21:13:31] thank you!! 
[21:13:32] (deployment, although i can help with deployment advice too if needed) [21:13:45] oh sorry, don't know how I missed that [21:13:51] haha I think I'm good in that area - shadowing tomorrow [21:14:01] (all good!) [21:14:05] (03CR) 10Urbanecm: [C:03+2] Disable font size options on specified pages for most wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038444 (https://phabricator.wikimedia.org/T366334) (owner: 10Stoyofuku-wmf) [21:14:22] toyofuku: enjoy the shadowing then! [21:14:35] 💜 [21:14:43] (03Merged) 10jenkins-bot: Disable font size options on specified pages for most wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038444 (https://phabricator.wikimedia.org/T366334) (owner: 10Stoyofuku-wmf) [21:15:04] pmiazga: can you link your patch as well please? [21:15:44] urbanecm: I think that was https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1035749 ? [21:16:30] which is already deployed (and was when pmiazga pinged me asking for a deployment of "a couple of things") [21:16:33] so...probably not [21:16:48] (03PS12) 10Gergő Tisza: [POC] Handle sso.wikimedia.org domain [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1036245 (https://phabricator.wikimedia.org/T365162) [21:17:25] (03CR) 10CI reject: [V:04-1] [POC] Handle sso.wikimedia.org domain [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1036245 (https://phabricator.wikimedia.org/T365162) (owner: 10Gergő Tisza) [21:18:21] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:1038444|Disable font size options on specified pages for most wikis (T366334)]] [21:18:24] T366334: Enable different default font size on different pages for Vector 2022 in production - https://phabricator.wikimedia.org/T366334 [21:19:19] (03PS23) 10Ryan Kemper: wdqs.data-reload: support HDFS as a source [cookbooks] - 10https://gerrit.wikimedia.org/r/1031933 (https://phabricator.wikimedia.org/T349069) (owner: 10DCausse) [21:19:57] (03CR) 10Ryan Kemper: [C:03+2] wdqs: extract 
categories reload to its own cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1032544 (owner: 10DCausse) [21:20:12] (03CR) 10Ryan Kemper: [C:03+2] wdqs.data-reload: support HDFS as a source [cookbooks] - 10https://gerrit.wikimedia.org/r/1031933 (https://phabricator.wikimedia.org/T349069) (owner: 10DCausse) [21:21:23] !log urbanecm@deploy1002 toyofuku and urbanecm: Backport for [[gerrit:1038444|Disable font size options on specified pages for most wikis (T366334)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:21:34] toyofuku: can you test your patch at mwdebug, please? :) [21:21:52] Yep, doing so now! [21:23:16] Looks good - thank you so much! [21:23:42] (03Merged) 10jenkins-bot: wdqs: extract categories reload to its own cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1032544 (owner: 10DCausse) [21:24:11] (03Merged) 10jenkins-bot: wdqs.data-reload: support HDFS as a source [cookbooks] - 10https://gerrit.wikimedia.org/r/1031933 (https://phabricator.wikimedia.org/T349069) (owner: 10DCausse) [21:24:54] !log urbanecm@deploy1002 toyofuku and urbanecm: Continuing with sync [21:24:56] proceeding! [21:27:23] (03PS7) 10Ladsgroup: Change static footer icons to the new one [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1027150 (https://phabricator.wikimedia.org/T256190) (owner: 10Jforrester) [21:27:34] pmiazga: last ping...? 
[21:27:55] (03PS3) 10Pppery: [zghwiki] Add patroller and autopatrolled groups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038899 (https://phabricator.wikimedia.org/T357411) [21:28:10] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs2023.codfw.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/full/20240527/ using stat1009.eqiad.wmnet) [21:28:15] !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs2023.codfw.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/full/20240527/ using stat1009.eqiad.wmnet) [21:31:52] (03PS1) 10Ryan Kemper: wdqs.data-reload: fix regex escaping [cookbooks] - 10https://gerrit.wikimedia.org/r/1038904 (https://phabricator.wikimedia.org/T349069) [21:32:48] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs2023.codfw.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/full/20240527/ using stat1009.eqiad.wmnet) [21:32:53] !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs2023.codfw.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/full/20240527/ using stat1009.eqiad.wmnet) [21:33:31] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:1038444|Disable font size options on specified pages for most wikis (T366334)]] (duration: 15m 10s) [21:33:34] T366334: Enable different default font size on different pages for Vector 2022 in production - https://phabricator.wikimedia.org/T366334 [21:33:37] toyofuku: and done [21:33:42] (03PS2) 10Ryan Kemper: wdqs.data-reload: fix regex escaping [cookbooks] - 10https://gerrit.wikimedia.org/r/1038904 (https://phabricator.wikimedia.org/T349069) [21:33:44] anything else? 
[21:33:50] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs2023.codfw.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/full/20240527/ using stat1009.eqiad.wmnet) [21:33:54] !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs2023.codfw.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/full/20240527/ using stat1009.eqiad.wmnet) [21:34:45] (03PS3) 10Ryan Kemper: wdqs.data-reload: fix regex escaping [cookbooks] - 10https://gerrit.wikimedia.org/r/1038904 (https://phabricator.wikimedia.org/T349069) [21:34:54] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs2023.codfw.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/full/20240527/ using stat1009.eqiad.wmnet) [21:34:59] !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs2023.codfw.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/full/20240527/ using stat1009.eqiad.wmnet) [21:36:33] (03CR) 10BryanDavis: plugins: Add wm-schedule-deployment plugin (033 comments) [software/gerrit] (deploy/wmf/stable-3.8) - 10https://gerrit.wikimedia.org/r/1038465 (https://phabricator.wikimedia.org/T366512) (owner: 10BryanDavis) [21:36:47] (03PS4) 10BryanDavis: plugins: Add wm-schedule-deployment plugin [software/gerrit] (deploy/wmf/stable-3.8) - 10https://gerrit.wikimedia.org/r/1038465 (https://phabricator.wikimedia.org/T366512) [21:39:43] (03PS4) 10Ryan Kemper: wdqs.data-reload: fix regex escaping [cookbooks] - 10https://gerrit.wikimedia.org/r/1038904 (https://phabricator.wikimedia.org/T349069) [21:40:42] (03PS5) 10Ryan Kemper: wdqs.data-reload: fix regex escaping [cookbooks] - 10https://gerrit.wikimedia.org/r/1038904 (https://phabricator.wikimedia.org/T349069) 
[21:41:06] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs2023.codfw.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/full/20240527/ using stat1009.eqiad.wmnet) [21:41:12] !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs2023.codfw.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/full/20240527/ using stat1009.eqiad.wmnet) [21:42:15] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 10Thumbor, and 6 others: Change default image thumbnail size - https://phabricator.wikimedia.org/T355914#9861911 (10Pppery) [21:42:31] (03PS1) 10Andrew Bogott: wmfkeystonehooks: use project_id rather than project_name for auth [puppet] - 10https://gerrit.wikimedia.org/r/1038907 (https://phabricator.wikimedia.org/T343158) [21:43:10] (03CR) 10Andrew Bogott: [C:03+2] wmfkeystonehooks: use project_id rather than project_name for auth [puppet] - 10https://gerrit.wikimedia.org/r/1038907 (https://phabricator.wikimedia.org/T343158) (owner: 10Andrew Bogott) [21:57:22] (03CR) 10BryanDavis: [C:03+1] Use a wildcard TypeScript include for plugins [software/gerrit] (deploy/wmf/stable-3.8) - 10https://gerrit.wikimedia.org/r/1038810 (owner: 10Hashar) [21:58:58] (03PS6) 10Ryan Kemper: wdqs.data-reload: fix regex escaping [cookbooks] - 10https://gerrit.wikimedia.org/r/1038904 (https://phabricator.wikimedia.org/T349069) [21:59:24] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs2023.codfw.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/full/20240527/ using stat1009.eqiad.wmnet) [21:59:29] !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs2023.codfw.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/full/20240527/ using 
stat1009.eqiad.wmnet) [22:00:23] !log kamila@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-ctrl1001.eqiad.wmnet with OS bullseye [22:00:31] 10ops-eqiad, 06SRE-OnFire, 06DC-Ops, 06serviceops, 10Sustainability (Incident Followup): eqiad:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366204#9861978 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by kamila@cumin1002 for host wikikube-ctrl1001.eqiad.... [22:01:50] 06SRE, 06serviceops: Migrate MW appservers' base images to bullseye - https://phabricator.wikimedia.org/T356293#9861981 (10Jdforrester-WMF) [22:01:54] (03PS7) 10Ryan Kemper: wdqs.data-reload: fix regex escaping [cookbooks] - 10https://gerrit.wikimedia.org/r/1038904 (https://phabricator.wikimedia.org/T349069) [22:02:07] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs2023.codfw.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/full/20240527/ using stat1009.eqiad.wmnet) [22:02:15] !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs2023.codfw.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/full/20240527/ using stat1009.eqiad.wmnet) [22:07:25] (03PS8) 10Ryan Kemper: wdqs.data-reload: fix regex escaping [cookbooks] - 10https://gerrit.wikimedia.org/r/1038904 (https://phabricator.wikimedia.org/T349069) [22:08:12] (03PS9) 10Ryan Kemper: wdqs.data-reload: fix regex escaping [cookbooks] - 10https://gerrit.wikimedia.org/r/1038904 (https://phabricator.wikimedia.org/T349069) [22:08:21] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs2023.codfw.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/full/20240527/ using stat1009.eqiad.wmnet) [22:09:00] 10ops-codfw, 06SRE, 06cloud-services-team, 06DC-Ops, 
10decommission-hardware: decommission cloudcontrol2001-dev.codfw.wmnet - https://phabricator.wikimedia.org/T364577#9861989 (10Papaul) a:03Jhancock.wm @Jhancock.wm can you please proceed with this and resolve the task once done. Thanks [22:13:20] 06SRE, 10SRE-swift-storage, 06Commons, 10MediaWiki-File-management, and 2 others: PHP Warning "Unable to delete stat cache" from file uploads - https://phabricator.wikimedia.org/T205567#9862002 (10TheDJ) In the last 7 days there were 85 log entries for this warning. 48 of these were on labswiki, triggered... [22:16:57] !log removing three files for legal compliance [22:16:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:29:15] !log removing two files for legal compliance [22:29:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:34:52] jouncebot: nowandnext [22:34:52] No deployments scheduled for the next 7 hour(s) and 25 minute(s) [22:34:52] In 7 hour(s) and 25 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240605T0600) [22:35:29] 10SRE-swift-storage, 10MediaWiki-Uploading: FAILED: stashfailed: Could not read file "mwstore://local-swift-eqiad/local-temp/a/ac/15xi9btm14os.u9p1dr.1208681.webm.0". - https://phabricator.wikimedia.org/T200820#9862056 (10TheDJ) I think we can close this ticket ? I'm sure some incidental problems might still e... 
[22:35:32] !log dzahn@cumin1002 START - Cookbook sre.hosts.downtime for 0:10:00 on contint.wikimedia.org with reason: reboot T366555 [22:35:32] !log dzahn@cumin1002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 0:10:00 on contint.wikimedia.org with reason: reboot T366555 [22:36:03] !log CI - (integration.wikimedia.org) short downtime for maintenance [22:36:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:36:10] !log dzahn@cumin1002 START - Cookbook sre.hosts.downtime for 0:10:00 on contint1002.wikimedia.org with reason: reboot T366555 [22:36:11] !log dzahn@cumin1002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 0:10:00 on contint1002.wikimedia.org with reason: reboot T366555 [22:39:17] PROBLEM - SSH on contint1002 is CRITICAL: connect to address 208.80.154.132 and port 22: Connection refused https://wikitech.wikimedia.org/wiki/SSH/monitoring [22:40:10] FIRING: ProbeDown: Service contint1002:1443 has failed probes (http_integration_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#contint1002:1443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:46:48] !log dzahn@cumin1002 START - Cookbook sre.hosts.downtime for 0:10:00 on contint1002.wikimedia.org with reason: reboot T366555 [22:46:48] !log dzahn@cumin1002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 0:10:00 on contint1002.wikimedia.org with reason: reboot T366555 [22:47:12] !log removing one file for legal compliance [22:47:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:47:51] !log dzahn@cumin1002 START - Cookbook sre.hosts.downtime for 0:10:00 on contint.wikimedia.org with reason: reboot T366555 [22:47:52] !log dzahn@cumin1002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 0:10:00 on contint.wikimedia.org with reason: reboot T366555 [22:50:47] !log 
ryankemper@cumin2002 END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs2023.codfw.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/full/20240527/ using stat1009.eqiad.wmnet) [22:53:17] RECOVERY - SSH on contint1002 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [22:55:10] RESOLVED: ProbeDown: Service contint1002:1443 has failed probes (http_integration_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#contint1002:1443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:57:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:58:10] (03CR) 10Dzahn: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1035545 (https://phabricator.wikimedia.org/T364715) (owner: 10Dzahn) [22:58:22] (03CR) 10CI reject: [V:04-1] admin: convert mareikeheuer to analytics-privatedata with shell [puppet] - 10https://gerrit.wikimedia.org/r/1035545 (https://phabricator.wikimedia.org/T364715) (owner: 10Dzahn) [22:59:51] (03PS4) 10Dzahn: admin: convert mareikeheuer to analytics-privatedata with shell [puppet] - 10https://gerrit.wikimedia.org/r/1035545 (https://phabricator.wikimedia.org/T364715) [23:00:22] (03CR) 10CI reject: [V:04-1] admin: convert mareikeheuer to analytics-privatedata with shell [puppet] - 10https://gerrit.wikimedia.org/r/1035545 (https://phabricator.wikimedia.org/T364715) (owner: 10Dzahn) [23:01:30] 10SRE-swift-storage, 10MediaWiki-Uploading: FAILED: stashfailed: Could not read file "mwstore://local-swift-eqiad/local-temp/a/ac/15xi9btm14os.u9p1dr.1208681.webm.0". 
- https://phabricator.wikimedia.org/T200820#9862096 (10Bawolff) 05Open→03Resolved a:03Bawolff The biggest known issue at this point is... [23:06:36] (03PS5) 10Dzahn: admin: convert mareikeheuer to analytics-privatedata with shell [puppet] - 10https://gerrit.wikimedia.org/r/1035545 (https://phabricator.wikimedia.org/T364715) [23:09:39] !log dzahn@cumin1002 START - Cookbook sre.hosts.downtime for 0:10:00 on miscweb1003.eqiad.wmnet with reason: reboot T366555 [23:09:53] !log dzahn@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on miscweb1003.eqiad.wmnet with reason: reboot T366555 [23:15:28] !log removing one file for legal compliance [23:15:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:19:27] 06SRE, 10Wikimedia-Mailing-lists: Make Chqaz admin of Wikija-g mailing list - https://phabricator.wikimedia.org/T365933#9862118 (10Dzahn) Would it be helpful if you contact the original admins or we reset to the original admins from T340380? [23:31:12] 06SRE, 10Wikimedia-Mailing-lists: Create a mailing list for Bangla Wikimoitree - https://phabricator.wikimedia.org/T365915#9862139 (10Dzahn) We have an existing list "wikipedia-bn@lists.wikimedia.org" for Bengali Wikipedia. This new list seems to be across projects, so wikimedia, and based on language alone.... 
[23:38:21] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1038789 [23:38:21] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1038789 (owner: 10TrainBranchBot) [23:42:07] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2147.codfw.wmnet with reason: Maintenance [23:42:20] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2147.codfw.wmnet with reason: Maintenance [23:42:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2147 (T364299)', diff saved to https://phabricator.wikimedia.org/P64040 and previous config saved to /var/cache/conftool/dbconfig/20240604-234228-marostegui.json [23:42:31] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299