[00:04:04] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1038347 (owner: TrainBranchBot)
[00:06:12] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P63968 and previous config saved to /var/cache/conftool/dbconfig/20240604-000612-ladsgroup.json
[00:21:20] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1167 (T352010)', diff saved to https://phabricator.wikimedia.org/P63969 and previous config saved to /var/cache/conftool/dbconfig/20240604-002119-ladsgroup.json
[00:21:22] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance
[00:21:23] <stashbot>	 T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[00:21:35] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance
[00:43:21] <wikibugs>	 (PS1) Bking: data-platform: add alert for WDQS MaxLag [alerts] - https://gerrit.wikimedia.org/r/1038454 (https://phabricator.wikimedia.org/T361114)
[01:07:59] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/1.43.0-wmf.8 [core] (wmf/1.43.0-wmf.8) - https://gerrit.wikimedia.org/r/1038348 (https://phabricator.wikimedia.org/T361402)
[01:08:02] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/1.43.0-wmf.8 [core] (wmf/1.43.0-wmf.8) - https://gerrit.wikimedia.org/r/1038348 (https://phabricator.wikimedia.org/T361402) (owner: TrainBranchBot)
[01:16:45] <jinxer-wm>	 FIRING: [2x] CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater  - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate
[01:21:45] <jinxer-wm>	 RESOLVED: [2x] CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater  - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate
[01:30:06] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/1.43.0-wmf.8 [core] (wmf/1.43.0-wmf.8) - https://gerrit.wikimedia.org/r/1038348 (https://phabricator.wikimedia.org/T361402) (owner: TrainBranchBot)
[01:45:45] <jinxer-wm>	 FIRING: [2x] CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater  - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate
[01:55:45] <jinxer-wm>	 RESOLVED: [2x] CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater  - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate
[02:00:04] <jouncebot>	 Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240604T0200)
[02:10:40] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:38:43] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:47:45] <jinxer-wm>	 FIRING: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): ...
[02:47:45] <jinxer-wm>	 fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate
[02:55:44] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:57:45] <jinxer-wm>	 RESOLVED: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): ...
[02:57:45] <jinxer-wm>	 fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate
[03:00:04] <jouncebot>	 Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240604T0300)
[03:01:43] <wikibugs>	 (PS1) TrainBranchBot: testwikis wikis to 1.43.0-wmf.8 [mediawiki-config] - https://gerrit.wikimedia.org/r/1038461 (https://phabricator.wikimedia.org/T361402)
[03:01:45] <wikibugs>	 (CR) TrainBranchBot: [C:+2] testwikis wikis to 1.43.0-wmf.8 [mediawiki-config] - https://gerrit.wikimedia.org/r/1038461 (https://phabricator.wikimedia.org/T361402) (owner: TrainBranchBot)
[03:02:26] <wikibugs>	 (Merged) jenkins-bot: testwikis wikis to 1.43.0-wmf.8 [mediawiki-config] - https://gerrit.wikimedia.org/r/1038461 (https://phabricator.wikimedia.org/T361402) (owner: TrainBranchBot)
[03:03:00] <logmsgbot>	 !log mwpresync@deploy1002 Started scap: testwikis wikis to 1.43.0-wmf.8  refs T361402
[03:03:03] <stashbot>	 T361402: 1.43.0-wmf.8 deployment blockers - https://phabricator.wikimedia.org/T361402
[03:05:25] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:08:35] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2212.codfw.wmnet with reason: Maintenance
[03:08:59] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2212.codfw.wmnet with reason: Maintenance
[03:09:07] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2212 (T364299)', diff saved to https://phabricator.wikimedia.org/P63970 and previous config saved to /var/cache/conftool/dbconfig/20240604-030906-marostegui.json
[03:09:11] <stashbot>	 T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299
[03:11:17] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2212 (T364299)', diff saved to https://phabricator.wikimedia.org/P63971 and previous config saved to /var/cache/conftool/dbconfig/20240604-031117-marostegui.json
[03:26:25] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2212', diff saved to https://phabricator.wikimedia.org/P63972 and previous config saved to /var/cache/conftool/dbconfig/20240604-032625-marostegui.json
[03:41:33] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2212', diff saved to https://phabricator.wikimedia.org/P63973 and previous config saved to /var/cache/conftool/dbconfig/20240604-034132-marostegui.json
[03:43:45] <jinxer-wm>	 FIRING: [2x] CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater  - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate
[03:48:45] <jinxer-wm>	 RESOLVED: [2x] CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater  - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate
[03:56:41] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2212 (T364299)', diff saved to https://phabricator.wikimedia.org/P63974 and previous config saved to /var/cache/conftool/dbconfig/20240604-035640-marostegui.json
[03:56:43] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2216.codfw.wmnet with reason: Maintenance
[03:56:44] <stashbot>	 T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299
[03:56:47] <logmsgbot>	 !log mwpresync@deploy1002 Finished scap: testwikis wikis to 1.43.0-wmf.8  refs T361402 (duration: 53m 47s)
[03:56:50] <stashbot>	 T361402: 1.43.0-wmf.8 deployment blockers - https://phabricator.wikimedia.org/T361402
[03:56:56] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2216.codfw.wmnet with reason: Maintenance
[03:57:04] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2216 (T364299)', diff saved to https://phabricator.wikimedia.org/P63975 and previous config saved to /var/cache/conftool/dbconfig/20240604-035703-marostegui.json
[04:00:04] <jouncebot>	 Deploy window Automatic removal of all obsolete MediaWiki versions from the deployment and bare metal servers (except the most-recent obsolete version) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240604T0400)
[04:01:01] <logmsgbot>	 !log mwpresync@deploy1002 Pruned MediaWiki: 1.43.0-wmf.5 (duration: 00m 57s)
[04:10:45] <jinxer-wm>	 FIRING: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): ...
[04:10:45] <jinxer-wm>	 fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate
[04:15:45] <jinxer-wm>	 RESOLVED: [2x] CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater  - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate
[04:17:19] <wikibugs>	 (PS1) Stevemunene: Remove pod name from superset dashboard [alerts] - https://gerrit.wikimedia.org/r/1038349
[04:18:57] <wikibugs>	 (CR) CI reject: [V:-1] Remove pod name from superset dashboard [alerts] - https://gerrit.wikimedia.org/r/1038349 (owner: Stevemunene)
[04:19:50] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1246.eqiad.wmnet with reason: Maintenance
[04:20:04] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1246.eqiad.wmnet with reason: Maintenance
[04:20:12] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1246 (T352010)', diff saved to https://phabricator.wikimedia.org/P63976 and previous config saved to /var/cache/conftool/dbconfig/20240604-042011-ladsgroup.json
[04:20:14] <stashbot>	 T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[04:34:25] <wikibugs>	 (PS1) BryanDavis: plugins: Add wm-schedule-deployment plugin [software/gerrit] (deploy/wmf/stable-3.8) - https://gerrit.wikimedia.org/r/1038465 (https://phabricator.wikimedia.org/T366512)
[04:34:51] <wikibugs>	 (CR) CI reject: [V:-1] plugins: Add wm-schedule-deployment plugin [software/gerrit] (deploy/wmf/stable-3.8) - https://gerrit.wikimedia.org/r/1038465 (https://phabricator.wikimedia.org/T366512) (owner: BryanDavis)
[04:40:28] <wikibugs>	 (PS2) BryanDavis: plugins: Add wm-schedule-deployment plugin [software/gerrit] (deploy/wmf/stable-3.8) - https://gerrit.wikimedia.org/r/1038465 (https://phabricator.wikimedia.org/T366512)
[04:40:32] <wikibugs>	 (PS1) Stevemunene: add datahub availability monitor [alerts] - https://gerrit.wikimedia.org/r/1038350 (https://phabricator.wikimedia.org/T363301)
[04:41:42] <wikibugs>	 (CR) CI reject: [V:-1] add datahub availability monitor [alerts] - https://gerrit.wikimedia.org/r/1038350 (https://phabricator.wikimedia.org/T363301) (owner: Stevemunene)
[04:55:10] <icinga-wm_>	 PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 43 probes of 786 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[05:07:32] <wikibugs>	 (PS1) Marostegui: query_all_hosts.sh: Added to repo [software] - https://gerrit.wikimedia.org/r/1038466
[05:10:10] <icinga-wm_>	 RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 32 probes of 786 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[05:15:25] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:27:06] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on 35 hosts with reason: Primary switchover s1 T366259
[05:27:09] <stashbot>	 T366259: Switchover s1 master (db1184 -> db1163) - https://phabricator.wikimedia.org/T366259
[05:27:36] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 35 hosts with reason: Primary switchover s1 T366259
[05:28:04] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Set db1163 with weight 0 T366259', diff saved to https://phabricator.wikimedia.org/P63977 and previous config saved to /var/cache/conftool/dbconfig/20240604-052803-arnaudb.json
[05:32:10] <icinga-wm_>	 PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 46 probes of 786 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[05:32:12] <icinga-wm_>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:32:36] <icinga-wm_>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:34:04] <icinga-wm_>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8616 bytes in 0.243 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:34:26] <icinga-wm_>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 52065 bytes in 0.075 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:49:42] <icinga-wm_>	 PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (gerrit1003), Fresh: 143 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[05:50:58] <wikibugs>	 (PS1) Marostegui: db1168: Remove package declaration [puppet] - https://gerrit.wikimedia.org/r/1038471
[05:51:21] <wikibugs>	 (CR) Marostegui: [C:+2] db1168: Remove package declaration [puppet] - https://gerrit.wikimedia.org/r/1038471 (owner: Marostegui)
[05:52:08] <icinga-wm_>	 RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 35 probes of 786 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[05:59:08] <icinga-wm_>	 PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 43 probes of 786 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[06:00:04] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240604T0600)
[06:00:05] <jouncebot>	 kormat, marostegui, Amir1, and arnaudb: Time to snap out of that daydream and deploy Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240604T0600).
[06:00:27] <wikibugs>	 (CR) Arnaudb: [C:+2] mariadb: Promote db1163 to s1 master [puppet] - https://gerrit.wikimedia.org/r/1036604 (https://phabricator.wikimedia.org/T366259) (owner: Gerrit maintenance bot)
[06:01:56] <arnaudb>	 !log Starting s1 eqiad failover from db1184 to db1163 - T366259
[06:01:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:01:59] <stashbot>	 T366259: Switchover s1 master (db1184 -> db1163) - https://phabricator.wikimedia.org/T366259
[06:02:08] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Set s1 eqiad as read-only for maintenance - T366259', diff saved to https://phabricator.wikimedia.org/P63978 and previous config saved to /var/cache/conftool/dbconfig/20240604-060208-arnaudb.json
[06:03:25] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Promote db1163 to s1 primary and set section read-write T366259', diff saved to https://phabricator.wikimedia.org/P63979 and previous config saved to /var/cache/conftool/dbconfig/20240604-060324-arnaudb.json
[06:04:21] <jinxer-wm>	 FIRING: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:05:34] <icinga-wm_>	 PROBLEM - IPv4 ping to esams on ripe-atlas-esams is CRITICAL: CRITICAL - failed 51 probes of 788 (alerts on 35) - https://atlas.ripe.net/measurements/59935536/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[06:06:07] <wikibugs>	 (CR) Arnaudb: [C:+2] wmnet: Update s1-master alias [dns] - https://gerrit.wikimedia.org/r/1036605 (https://phabricator.wikimedia.org/T366259) (owner: Gerrit maintenance bot)
[06:06:13] <wikibugs>	 (PS2) Gerrit maintenance bot: wmnet: Update s1-master alias [dns] - https://gerrit.wikimedia.org/r/1036605 (https://phabricator.wikimedia.org/T366259)
[06:06:14] <wikibugs>	 (CR) Arnaudb: [V:+2 C:+2] wmnet: Update s1-master alias [dns] - https://gerrit.wikimedia.org/r/1036605 (https://phabricator.wikimedia.org/T366259) (owner: Gerrit maintenance bot)
[06:07:03] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depool db1184 T366259', diff saved to https://phabricator.wikimedia.org/P63980 and previous config saved to /var/cache/conftool/dbconfig/20240604-060703-arnaudb.json
[06:07:06] <stashbot>	 T366259: Switchover s1 master (db1184 -> db1163) - https://phabricator.wikimedia.org/T366259
[06:07:47] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'API db1163 T366259', diff saved to https://phabricator.wikimedia.org/P63981 and previous config saved to /var/cache/conftool/dbconfig/20240604-060747-arnaudb.json
[06:09:21] <jinxer-wm>	 RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:09:25] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): ' fix api db1163 vs db1184 T366259', diff saved to https://phabricator.wikimedia.org/P63982 and previous config saved to /var/cache/conftool/dbconfig/20240604-060925-arnaudb.json
[06:10:38] <icinga-wm_>	 RECOVERY - IPv4 ping to esams on ripe-atlas-esams is OK: OK - failed 8 probes of 788 (alerts on 35) - https://atlas.ripe.net/measurements/59935536/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[06:14:47] <marostegui>	 !log Rename table flaggedpage_pending on db1185 (s5 eqiad dbmaint) - T365568
[06:14:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:14:50] <stashbot>	 T365568: Drop flaggedpage_pending from production - https://phabricator.wikimedia.org/T365568
[06:24:18] <icinga-wm_>	 RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 34 probes of 786 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[06:26:05] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 3:00:00 on db1184.eqiad.wmnet with reason: reimage
[06:26:07] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on db1184.eqiad.wmnet with reason: reimage
[06:26:33] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.reimage for host db1184.eqiad.wmnet with OS bookworm
[06:31:25] <icinga-wm_>	 PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 69 probes of 786 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[06:35:09] <wikibugs>	 (PS1) Slyngshede: New menu [software/bitu] - https://gerrit.wikimedia.org/r/1038608
[06:40:05] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1184.eqiad.wmnet with reason: host reimage
[06:41:11] <icinga-wm_>	 RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 35 probes of 786 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[06:43:25] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1184.eqiad.wmnet with reason: host reimage
[06:44:36] <icinga-wm_>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:48:00] <wikibugs>	 (PS1) Marostegui: db1184: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1038610
[06:48:10] <icinga-wm_>	 PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 52 probes of 786 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[06:48:35] <wikibugs>	 (CR) Alexandros Kosiaris: [C:+1] Update user-agent string in citoid to be like Zot [deployment-charts] - https://gerrit.wikimedia.org/r/1034860 (https://phabricator.wikimedia.org/T366093) (owner: Mvolz)
[06:48:37] <wikibugs>	 (CR) Marostegui: [C:+2] db1184: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1038610 (owner: Marostegui)
[06:48:46] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Remove Hiera entry [puppet] - https://gerrit.wikimedia.org/r/1038339 (owner: Muehlenhoff)
[06:49:40] <icinga-wm_>	 RECOVERY - Backup freshness on backup1001 is OK: Fresh: 144 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[06:50:20] <icinga-wm_>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:50:47] <wikibugs>	 (CR) Muehlenhoff: [V:+2 C:+2] Remove obsolete stub cert [labs/private] - https://gerrit.wikimedia.org/r/1038380 (owner: Muehlenhoff)
[06:51:28] <icinga-wm_>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:51:29] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Remove obsolete Hiera entries [puppet] - https://gerrit.wikimedia.org/r/1038109 (owner: Muehlenhoff)
[06:53:18] <icinga-wm_>	 RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 26 probes of 786 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[06:53:22] <icinga-wm_>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Tue 13 Aug 2024 12:55:14 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:53:34] <icinga-wm_>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 52067 bytes in 1.970 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:54:12] <icinga-wm_>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8616 bytes in 0.276 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:00:25] <icinga-wm_>	 PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 42 probes of 786 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[07:05:13] <icinga-wm_>	 RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 35 probes of 786 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[07:05:20] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1184.eqiad.wmnet with OS bookworm
[07:06:00] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1184.eqiad.wmnet with reason: Maintenance
[07:06:02] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1184.eqiad.wmnet with reason: Maintenance
[07:09:17] <wikibugs>	 (CR) DCausse: [C:+1] "lgtm but no idea if duplicating the same alerts for multiple teams is the right approach, I fear that over time the alerts might diverge" [alerts] - https://gerrit.wikimedia.org/r/1038454 (https://phabricator.wikimedia.org/T361114) (owner: Bking)
[07:10:46] <marostegui>	 !log dbmaint eqiad s1 deploy schema change on db1184 T355609
[07:10:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:10:49] <stashbot>	 T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609
[07:15:10] <moritzm>	 !log installing intel-microcode updates on bullseye
[07:15:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:17:11] <icinga-wm_>	 PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 43 probes of 786 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[07:23:59] <wikibugs>	 (PS2) Stevemunene: add datahub availability monitor [alerts] - https://gerrit.wikimedia.org/r/1038350 (https://phabricator.wikimedia.org/T363301)
[07:24:39] <wikibugs>	 (PS2) Stevemunene: Remove pod name from superset dashboard [alerts] - https://gerrit.wikimedia.org/r/1038349
[07:25:10] <wikibugs>	 (CR) CI reject: [V:-1] add datahub availability monitor [alerts] - https://gerrit.wikimedia.org/r/1038350 (https://phabricator.wikimedia.org/T363301) (owner: Stevemunene)
[07:26:23] <wikibugs>	 (CR) CI reject: [V:-1] Remove pod name from superset dashboard [alerts] - https://gerrit.wikimedia.org/r/1038349 (owner: Stevemunene)
[07:27:58] <marostegui>	 !log dbmaint eqiad s1 deploy schema change on db1184 T356166
[07:28:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:28:02] <stashbot>	 T356166: Drop cl_collation_ext index from categorylinks in production - https://phabricator.wikimedia.org/T356166
[07:28:53] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for Ifrahkhanyaree_WMDE - https://phabricator.wikimedia.org/T366558 (Ifrahkhanyaree_WMDE) NEW
[07:29:10] <wikibugs>	 (PS4) Hashar: Switch Gerrit to Java 17 [puppet] - https://gerrit.wikimedia.org/r/1038249 (https://phabricator.wikimedia.org/T364342) (owner: Muehlenhoff)
[07:29:43] <wikibugs>	 (Abandoned) Ayounsi: Update Netbox to v2.10.9-wmf2 [software/netbox-deploy] - https://gerrit.wikimedia.org/r/679403 (owner: CRusnov)
[07:30:09] <wikibugs>	 (Abandoned) Ayounsi: nbdeviceinfo.py: Add simple command-line host dump [software/netbox-deploy] - https://gerrit.wikimedia.org/r/525165 (owner: CRusnov)
[07:31:56] <wikibugs>	 (PS3) Stevemunene: add datahub availability monitor [alerts] - https://gerrit.wikimedia.org/r/1038350 (https://phabricator.wikimedia.org/T363301)
[07:33:07] <wikibugs>	 (CR) CI reject: [V:-1] add datahub availability monitor [alerts] - https://gerrit.wikimedia.org/r/1038350 (https://phabricator.wikimedia.org/T363301) (owner: Stevemunene)
[07:37:20] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Switch Gerrit to Java 17 [puppet] - https://gerrit.wikimedia.org/r/1038249 (https://phabricator.wikimedia.org/T364342) (owner: Muehlenhoff)
[07:38:31] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2216 (T364299)', diff saved to https://phabricator.wikimedia.org/P63983 and previous config saved to /var/cache/conftool/dbconfig/20240604-073830-marostegui.json
[07:38:35] <stashbot>	 T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299
[07:42:08] <wikibugs>	 (PS4) Stevemunene: add datahub availability monitor [alerts] - https://gerrit.wikimedia.org/r/1038350 (https://phabricator.wikimedia.org/T363301)
[07:42:13] <logmsgbot>	 !log jiji@cumin1002 START - Cookbook sre.hosts.reimage for host mc-wf1002.eqiad.wmnet with OS bookworm
[07:42:25] <logmsgbot>	 !log jiji@cumin2002 START - Cookbook sre.hosts.reimage for host mc-wf2002.codfw.wmnet with OS bookworm
[07:42:42] <wikibugs>	 (CR) Arnaudb: [C:+1] cephadm: allow ssh from all mgrs to all targets [puppet] - https://gerrit.wikimedia.org/r/1038391 (https://phabricator.wikimedia.org/T279621) (owner: MVernon)
[07:43:15] <icinga-wm_>	 PROBLEM - gerrit process on gerrit2002 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/lib/jvm/java-17-openjdk-amd64/bin/java .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war daemon -d /var/lib/gerrit2/review_site https://wikitech.wikimedia.org/wiki/Gerrit
[07:43:19] <wikibugs>	 (03CR) 10CI reject: [V:04-1] add datahub availability monitor [alerts] - 10https://gerrit.wikimedia.org/r/1038350 (https://phabricator.wikimedia.org/T363301) (owner: 10Stevemunene)
[07:43:27] <kostajh>	 I may add a patch to the window
[07:43:58] <hashar>	 note: we are upgrading gerrit to java 17 :)
[07:45:15] <icinga-wm_>	 RECOVERY - gerrit process on gerrit2002 is OK: PROCS OK: 1 process with regex args ^/usr/lib/jvm/java-17-openjdk-amd64/bin/java .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war daemon -d /var/lib/gerrit2/review_site https://wikitech.wikimedia.org/wiki/Gerrit
[07:46:57] <logmsgbot>	 !log hashar@deploy1002 Started deploy [gerrit/gerrit@6ba3f2e]: gerrit2002: switch to Java 17 version of plugins after having switched Java to 17- T364342
[07:47:01] <stashbot>	 T364342: Switch Gerrit from Java 11 to Java 17 - https://phabricator.wikimedia.org/T364342
[07:47:02] <logmsgbot>	 !log hashar@deploy1002 Finished deploy [gerrit/gerrit@6ba3f2e]: gerrit2002: switch to Java 17 version of plugins after having switched Java to 17- T364342 (duration: 00m 05s)
[07:48:02] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Ifrahkhanyaree_WMDE - https://phabricator.wikimedia.org/T366558#9858183 (10WMDE-leszek) I approve the request on WMDE's end, thank you
[07:48:20] <wikibugs>	 (03PS5) 10Stevemunene: add datahub availability monitor [alerts] - 10https://gerrit.wikimedia.org/r/1038350 (https://phabricator.wikimedia.org/T363301)
[07:49:09] <kostajh>	 hashar: ah, can I not go ahead with the backport now?
[07:49:49] <wikibugs>	 (03PS1) 10Kosta Harlan: IPReputationHooks: Bump schema version [extensions/WikimediaEvents] (wmf/1.43.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1038633 (https://phabricator.wikimedia.org/T354597)
[07:50:02] <wikibugs>	 (03PS1) 10Kosta Harlan: IPReputationHooks: Bump schema version [extensions/WikimediaEvents] (wmf/1.43.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1038634 (https://phabricator.wikimedia.org/T354597)
[07:50:57] <wikibugs>	 (03PS3) 10Stevemunene: Remove pod name from superset dashboard [alerts] - 10https://gerrit.wikimedia.org/r/1038349
[07:52:07] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Remove pod name from superset dashboard [alerts] - 10https://gerrit.wikimedia.org/r/1038349 (owner: 10Stevemunene)
[07:52:11] <kostajh>	 hashar: or, can I proceed right after you're done with the upgrade?
[07:52:23] <hashar>	 yes yes
[07:52:36] <hashar>	 we are about to restart the primary gerrit
[07:53:03] <kostajh>	 hashar: ok please let me know when you're done
[07:53:10] <kostajh>	 and good luck :)
[07:53:39] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2216', diff saved to https://phabricator.wikimedia.org/P63984 and previous config saved to /var/cache/conftool/dbconfig/20240604-075338-marostegui.json
[07:53:51] <wikibugs>	 (03CR) 10MVernon: [C:03+2] cephadm: allow ssh from all mgrs to all targets [puppet] - 10https://gerrit.wikimedia.org/r/1038391 (https://phabricator.wikimedia.org/T279621) (owner: 10MVernon)
[07:54:25] <wikibugs>	 (03PS4) 10Stevemunene: Remove pod name from superset dashboard [alerts] - 10https://gerrit.wikimedia.org/r/1038349
[07:55:59] <logmsgbot>	 !log hashar@deploy1002 Started deploy [gerrit/gerrit@6ba3f2e]: gerrit1003: switch to Java 17 version of plugins after having switched Java to 17- T364342
[07:56:02] <stashbot>	 T364342: Switch Gerrit from Java 11 to Java 17 - https://phabricator.wikimedia.org/T364342
[07:56:03] <logmsgbot>	 !log hashar@deploy1002 Finished deploy [gerrit/gerrit@6ba3f2e]: gerrit1003: switch to Java 17 version of plugins after having switched Java to 17- T364342 (duration: 00m 03s)
[07:56:29] <hashar>	 !log Restarting Gerrit for Java 17 upgrade # T364342
[07:56:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:56:33] <wikibugs>	 06SRE, 06serviceops, 13Patch-For-Review: systemd-coredump can make a system unresponsive - https://phabricator.wikimedia.org/T236253#9858230 (10jijiki) >>! In T236253#9856381, @Dzahn wrote: > I talked a bit about this in #systemd IRC channel. Mostly to ask if the config is irrelevant as long as the package i...
[07:57:04] <logmsgbot>	 !log jiji@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mc-wf1002.eqiad.wmnet with reason: host reimage
[07:59:53] <hashar>	 kostajh: we have upgraded Gerrit to Java 17 :)
[08:00:31] <wikibugs>	 06SRE, 10SRE-swift-storage, 06Infrastructure-Foundations: moss-be1003 "Warning: The current total number of facts: 2830 exceeds the number of facts limit: 2048" - https://phabricator.wikimedia.org/T366563 (10MatthewVernon) 03NEW
[08:00:31] <jelto>	 nice thanks!
[08:00:33] <kostajh>	 \o/
[08:00:39] <kostajh>	 hashar: can I go ahead with the backports?
[08:00:42] <wikibugs>	 06SRE, 10SRE-swift-storage, 06Infrastructure-Foundations: moss-be1003 "Warning: The current total number of facts: 2830 exceeds the number of facts limit: 2048" - https://phabricator.wikimedia.org/T366563#9858260 (10MatthewVernon) p:05Triage→03High
[08:00:46] <hashar>	 yes :)
[08:01:11] <logmsgbot>	 !log jiji@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mc-wf2002.codfw.wmnet with reason: host reimage
[08:01:41] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy1002 using scap backport" [extensions/WikimediaEvents] (wmf/1.43.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1038633 (https://phabricator.wikimedia.org/T354597) (owner: 10Kosta Harlan)
[08:02:40] <logmsgbot>	 !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc-wf1002.eqiad.wmnet with reason: host reimage
[08:03:25] <wikibugs>	 (03PS1) 10Effie Mouzeli: mediawiki: rename cache.mcrouter.deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1038687
[08:03:39] <wikibugs>	 (03Abandoned) 10Effie Mouzeli: mediawiki: rename cache.mcrouter.deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1000039 (owner: 10Effie Mouzeli)
[08:04:29] <wikibugs>	 (03Merged) 10jenkins-bot: IPReputationHooks: Bump schema version [extensions/WikimediaEvents] (wmf/1.43.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1038633 (https://phabricator.wikimedia.org/T354597) (owner: 10Kosta Harlan)
[08:05:23] <logmsgbot>	 !log kharlan@deploy1002 Started scap: Backport for [[gerrit:1038633|IPReputationHooks: Bump schema version (T354597)]]
[08:05:27] <stashbot>	 T354597: Record IP reputation data for account creations and edits - https://phabricator.wikimedia.org/T354597
[08:06:06] <logmsgbot>	 !log jiji@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc-wf2002.codfw.wmnet with reason: host reimage
[08:06:18] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1221 (T364069)', diff saved to https://phabricator.wikimedia.org/P63985 and previous config saved to /var/cache/conftool/dbconfig/20240604-080617-marostegui.json
[08:06:21] <stashbot>	 T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069
[08:07:02] <wikibugs>	 (03PS1) 10Daniel Kinzler: Set LinterParseOnDerivedDataUpdate to false [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038688 (https://phabricator.wikimedia.org/T361013)
[08:08:10] <logmsgbot>	 !log kharlan@deploy1002 kharlan: Backport for [[gerrit:1038633|IPReputationHooks: Bump schema version (T354597)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[08:08:47] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2216', diff saved to https://phabricator.wikimedia.org/P63986 and previous config saved to /var/cache/conftool/dbconfig/20240604-080846-marostegui.json
[08:09:23] <wikibugs>	 (03CR) 10Jelto: [C:03+1] "I'm fine with both defaults, using staticttendril:main or use the security-landing-page:latest" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1038251 (https://phabricator.wikimedia.org/T362518) (owner: 10Clément Goubert)
[08:10:59] <logmsgbot>	 !log kharlan@deploy1002 kharlan: Continuing with sync
[08:11:21] <wikibugs>	 (03PS1) 10Hashar: gerrit: move java_home hiera setting to role [puppet] - 10https://gerrit.wikimedia.org/r/1038690
[08:12:37] <icinga-wm_>	 PROBLEM - statsv process on webperf2003 is CRITICAL: PROCS CRITICAL: 0 processes with command name python3, args statsv https://wikitech.wikimedia.org/wiki/Graphite%23statsv
[08:13:37] <icinga-wm_>	 RECOVERY - statsv process on webperf2003 is OK: PROCS OK: 2 processes with command name python3, args statsv https://wikitech.wikimedia.org/wiki/Graphite%23statsv
[08:17:37] <wikibugs>	 (03PS1) 10Brouberol: datahub: enable internal registry for all releases [deployment-charts] - 10https://gerrit.wikimedia.org/r/1038691
[08:19:32] <logmsgbot>	 !log kharlan@deploy1002 Finished scap: Backport for [[gerrit:1038633|IPReputationHooks: Bump schema version (T354597)]] (duration: 14m 08s)
[08:19:35] <stashbot>	 T354597: Record IP reputation data for account creations and edits - https://phabricator.wikimedia.org/T354597
[08:19:58] <logmsgbot>	 !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc-wf1002.eqiad.wmnet with OS bookworm
[08:20:11] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy1002 using scap backport" [extensions/WikimediaEvents] (wmf/1.43.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1038634 (https://phabricator.wikimedia.org/T354597) (owner: 10Kosta Harlan)
[08:21:26] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1221', diff saved to https://phabricator.wikimedia.org/P63987 and previous config saved to /var/cache/conftool/dbconfig/20240604-082125-marostegui.json
[08:22:33] <wikibugs>	 (03Merged) 10jenkins-bot: IPReputationHooks: Bump schema version [extensions/WikimediaEvents] (wmf/1.43.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1038634 (https://phabricator.wikimedia.org/T354597) (owner: 10Kosta Harlan)
[08:23:05] <logmsgbot>	 !log kharlan@deploy1002 Started scap: Backport for [[gerrit:1038634|IPReputationHooks: Bump schema version (T354597)]]
[08:23:55] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2216 (T364299)', diff saved to https://phabricator.wikimedia.org/P63988 and previous config saved to /var/cache/conftool/dbconfig/20240604-082354-marostegui.json
[08:23:56] <logmsgbot>	 !log jiji@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc-wf2002.codfw.wmnet with OS bookworm
[08:24:11] <wikibugs>	 (03PS1) 10Jcrespo: dbbackups: Change correct port for s1 backups [puppet] - 10https://gerrit.wikimedia.org/r/1038693 (https://phabricator.wikimedia.org/T362509)
[08:24:35] <wikibugs>	 (03CR) 10Brouberol: [C:03+1] "Nicely done!" [alerts] - 10https://gerrit.wikimedia.org/r/1038350 (https://phabricator.wikimedia.org/T363301) (owner: 10Stevemunene)
[08:25:12] <wikibugs>	 (03CR) 10Brouberol: [C:03+1] "Indeed, that could not have worked. Sorry it slipped past review in the past." [alerts] - 10https://gerrit.wikimedia.org/r/1038349 (owner: 10Stevemunene)
[08:25:33] <logmsgbot>	 !log kharlan@deploy1002 kharlan: Backport for [[gerrit:1038634|IPReputationHooks: Bump schema version (T354597)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[08:25:39] <wikibugs>	 (03CR) 10Brouberol: "I think this one can be abandoned." [puppet] - 10https://gerrit.wikimedia.org/r/1032399 (https://phabricator.wikimedia.org/T363450) (owner: 10Stevemunene)
[08:27:55] <wikibugs>	 (03PS1) 10Ayounsi: Netbox deploy for 4.0.2 [software/netbox-deploy] (dev) - 10https://gerrit.wikimedia.org/r/1038694 (https://phabricator.wikimedia.org/T336275)
[08:28:23] <wikibugs>	 (03PS1) 10Fabfur: depool text@magru before enabling IPIP encapsulation [dns] - 10https://gerrit.wikimedia.org/r/1038695 (https://phabricator.wikimedia.org/T366466)
[08:30:17] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.gitlab.reboot-runner rolling reboot on A:gitlab-runner
[08:30:21] <wikibugs>	 (03CR) 10Ayounsi: Netbox deploy for 4.0.2 (031 comment) [software/netbox-deploy] (dev) - 10https://gerrit.wikimedia.org/r/1038694 (https://phabricator.wikimedia.org/T336275) (owner: 10Ayounsi)
[08:30:23] <logmsgbot>	 !log kharlan@deploy1002 kharlan: Continuing with sync
[08:31:31] <wikibugs>	 (03PS2) 10Ayounsi: Netbox deploy for 4.0.2 [software/netbox-deploy] (dev) - 10https://gerrit.wikimedia.org/r/1038694 (https://phabricator.wikimedia.org/T336275)
[08:32:29] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2099.codfw.wmnet with reason: Maintenance
[08:32:42] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2099.codfw.wmnet with reason: Maintenance
[08:33:19] <wikibugs>	 (03PS1) 10Effie Mouzeli: memcached: switch to memcache user (role) [puppet] - 10https://gerrit.wikimedia.org/r/1038697
[08:33:45] <wikibugs>	 (03PS2) 10Effie Mouzeli: memcached: switch to memcache user (role) [puppet] - 10https://gerrit.wikimedia.org/r/1038697
[08:34:09] <wikibugs>	 (03CR) 10Effie Mouzeli: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1038697 (owner: 10Effie Mouzeli)
[08:35:51] <wikibugs>	 (03CR) 10Stevemunene: [C:03+2] Remove pod name from superset dashboard [alerts] - 10https://gerrit.wikimedia.org/r/1038349 (owner: 10Stevemunene)
[08:36:00] <wikibugs>	 (03PS3) 10Effie Mouzeli: memcached: switch to memcache user (role) [puppet] - 10https://gerrit.wikimedia.org/r/1038697
[08:36:25] <wikibugs>	 (03PS1) 10Fabfur: hiera: enable IPIP encapsulation on text@magru [puppet] - 10https://gerrit.wikimedia.org/r/1038698 (https://phabricator.wikimedia.org/T366466)
[08:36:34] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1221', diff saved to https://phabricator.wikimedia.org/P63989 and previous config saved to /var/cache/conftool/dbconfig/20240604-083633-marostegui.json
[08:37:02] <wikibugs>	 (03Merged) 10jenkins-bot: Remove pod name from superset dashboard [alerts] - 10https://gerrit.wikimedia.org/r/1038349 (owner: 10Stevemunene)
[08:37:10] <wikibugs>	 (03CR) 10Effie Mouzeli: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1038697 (owner: 10Effie Mouzeli)
[08:37:18] <wikibugs>	 (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1038698 (https://phabricator.wikimedia.org/T366466) (owner: 10Fabfur)
[08:37:32] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] query_all_hosts.sh: Added to repo [software] - 10https://gerrit.wikimedia.org/r/1038466 (owner: 10Marostegui)
[08:37:42] <wikibugs>	 (03PS6) 10Stevemunene: add datahub availability monitor [alerts] - 10https://gerrit.wikimedia.org/r/1038350 (https://phabricator.wikimedia.org/T363301)
[08:38:00] <wikibugs>	 (03Merged) 10jenkins-bot: query_all_hosts.sh: Added to repo [software] - 10https://gerrit.wikimedia.org/r/1038466 (owner: 10Marostegui)
[08:38:51] <logmsgbot>	 !log kharlan@deploy1002 Finished scap: Backport for [[gerrit:1038634|IPReputationHooks: Bump schema version (T354597)]] (duration: 15m 45s)
[08:38:55] <stashbot>	 T354597: Record IP reputation data for account creations and edits - https://phabricator.wikimedia.org/T354597
[08:39:32] <wikibugs>	 (03CR) 10Stevemunene: [C:03+2] add datahub availability monitor [alerts] - 10https://gerrit.wikimedia.org/r/1038350 (https://phabricator.wikimedia.org/T363301) (owner: 10Stevemunene)
[08:40:24] <kostajh>	 !log UTC morning deploys done
[08:40:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:40:34] <wikibugs>	 (03PS1) 10Urbanecm: Drop logging level for unsupported providers to DEBUG [extensions/CommunityConfiguration] (wmf/1.43.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1038714 (https://phabricator.wikimedia.org/T366519)
[08:40:34] <wikibugs>	 (03PS2) 10Brouberol: datahub: enable internal registry for all releases [deployment-charts] - 10https://gerrit.wikimedia.org/r/1038691 (https://phabricator.wikimedia.org/T363461)
[08:40:42] <wikibugs>	 (03Merged) 10jenkins-bot: add datahub availability monitor [alerts] - 10https://gerrit.wikimedia.org/r/1038350 (https://phabricator.wikimedia.org/T363301) (owner: 10Stevemunene)
[08:41:15] <wikibugs>	 (03PS2) 10Urbanecm: Drop logging level for unsupported providers to DEBUG [extensions/CommunityConfiguration] (wmf/1.43.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1038714 (https://phabricator.wikimedia.org/T366519)
[08:44:07] <wikibugs>	 06SRE, 10Cloud-Services, 06serviceops: Modernise memcached systemd unit / sync, and make it presentable - https://phabricator.wikimedia.org/T273950#9858437 (10jijiki) 05In progress→03Open The #Cloud-Services project tag is not intended to have any tasks. Please check the list on https://phabricator.wikim...
[08:44:29] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1156', diff saved to https://phabricator.wikimedia.org/P63990 and previous config saved to /var/cache/conftool/dbconfig/20240604-084428-root.json
[08:45:09] <wikibugs>	 (03PS1) 10Marostegui: db1156: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1038700
[08:45:29] <wikibugs>	 06SRE, 10Cloud-Services, 06serviceops: Modernise memcached systemd unit / sync, and make it presentable - https://phabricator.wikimedia.org/T273950#9858443 (10jijiki)
[08:45:41] <wikibugs>	 (03PS1) 10Urbanecm: testwiki: Enable CommunityConfiguration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038701 (https://phabricator.wikimedia.org/T360954)
[08:46:36] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host idp-test1003.wikimedia.org
[08:47:46] <wikibugs>	 (03CR) 10Stevemunene: [C:03+1] "LGTM!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1038691 (https://phabricator.wikimedia.org/T363461) (owner: 10Brouberol)
[08:50:13] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] datahub: enable internal registry for all releases [deployment-charts] - 10https://gerrit.wikimedia.org/r/1038691 (https://phabricator.wikimedia.org/T363461) (owner: 10Brouberol)
[08:50:23] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host idp-test1003.wikimedia.org
[08:51:42] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1221 (T364069)', diff saved to https://phabricator.wikimedia.org/P63991 and previous config saved to /var/cache/conftool/dbconfig/20240604-085141-marostegui.json
[08:51:44] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1238.eqiad.wmnet with reason: Maintenance
[08:51:45] <stashbot>	 T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069
[08:51:52] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host install7001.wikimedia.org
[08:51:57] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1238.eqiad.wmnet with reason: Maintenance
[08:52:05] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1238 (T364069)', diff saved to https://phabricator.wikimedia.org/P63992 and previous config saved to /var/cache/conftool/dbconfig/20240604-085205-marostegui.json
[08:52:11] <icinga-wm_>	 RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 31 probes of 786 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[08:52:18] <logmsgbot>	 !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s_services/services/datahub: apply on production
[08:52:35] <icinga-wm_>	 PROBLEM - WDQS SPARQL on wdqs1020 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[08:53:09] <wikibugs>	 (03PS1) 10Hashar: gerrit: remove mac algos no more supported by Mina SSHD [puppet] - 10https://gerrit.wikimedia.org/r/1038703 (https://phabricator.wikimedia.org/T366565)
[08:53:30] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service wdqs1020:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1020:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:54:07] <wikibugs>	 (03CR) 10Michael Große: [Beta] Enable CommunityConfiguration extension in all wikis (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1035473 (https://phabricator.wikimedia.org/T364892) (owner: 10Sergio Gimeno)
[08:54:48] <logmsgbot>	 !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s_services/services/datahub: sync on production
[08:55:20] <wikibugs>	 (03CR) 10Hashar: "I have looked at the source code and pasted my findings at T366565#9858442" [puppet] - 10https://gerrit.wikimedia.org/r/1038703 (https://phabricator.wikimedia.org/T366565) (owner: 10Hashar)
[08:55:21] <wikibugs>	 (03PS1) 10DCausse: cirrus: relax CirrusConsumerRerenderFetchErrorRate [alerts] - 10https://gerrit.wikimedia.org/r/1038705
[08:56:13] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host install7001.wikimedia.org
[08:56:27] <wikibugs>	 (03CR) 10Jcrespo: [C:03+2] dbbackups: Change correct port for s1 backups [puppet] - 10https://gerrit.wikimedia.org/r/1038693 (https://phabricator.wikimedia.org/T362509) (owner: 10Jcrespo)
[08:57:07] <wikibugs>	 (03PS2) 10Hashar: gerrit: move java_home hiera setting to role [puppet] - 10https://gerrit.wikimedia.org/r/1038690
[08:57:18] <wikibugs>	 (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1038690 (owner: 10Hashar)
[08:58:29] <icinga-wm_>	 RECOVERY - WDQS SPARQL on wdqs1020 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 0.077 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[09:01:44] <wikibugs>	 (03CR) 10Hashar: "> 1 hosts noop No difference or change fixed compilation" [puppet] - 10https://gerrit.wikimedia.org/r/1038690 (owner: 10Hashar)
[09:01:51] <moritzm>	 !log imported python3-xapian-haystack 2.1.1-1+deb12u1 to bookworm-wikimedia (already lined up for the next Bookworm point release to address https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1066136 and needed for the update of the Mailman servers T331706)
[09:01:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:01:54] <stashbot>	 T331706: Migrate Mailman/lists to Bullseye/Bookworm - https://phabricator.wikimedia.org/T331706
[09:03:30] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service wdqs1020:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1020:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:03:34] <wikibugs>	 06SRE, 10Cloud-Services, 06serviceops: Modernise memcached systemd unit / sync, and make it presentable - https://phabricator.wikimedia.org/T273950#9858514 (10jijiki)
[09:05:48] <wikibugs>	 06SRE, 10Cloud-Services, 06serviceops: Modernise memcached systemd unit / sync, and make it presentable - https://phabricator.wikimedia.org/T273950#9858516 (10jijiki)
[09:08:11] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host install6002.wikimedia.org
[09:08:17] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.gitlab.reboot-runner (exit_code=0) rolling reboot on A:gitlab-runner
[09:08:58] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.reboot-single for host gitlab1004.wikimedia.org
[09:09:59] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host moss-fe1001.eqiad.wmnet
[09:12:04] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] gerrit: move java_home hiera setting to role [puppet] - 10https://gerrit.wikimedia.org/r/1038690 (owner: 10Hashar)
[09:12:05] <hashar>	 kostajh: have you managed to deploy your backport?
[09:13:25] <wikibugs>	 (03PS4) 10Effie Mouzeli: memcached: switch to memcache user (role) [puppet] - 10https://gerrit.wikimedia.org/r/1038697
[09:14:27] <wikibugs>	 (03CR) 10Effie Mouzeli: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1038697 (owner: 10Effie Mouzeli)
[09:14:48] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host install6002.wikimedia.org
[09:14:59] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host gitlab1004.wikimedia.org
[09:14:59] <icinga-wm_>	 PROBLEM - Host arclamp2001 is DOWN: PING CRITICAL - Packet loss = 100%
[09:15:24] <logmsgbot>	 !log taavi@cumin1002 START - Cookbook sre.hosts.reboot-single for host cloudweb2002-dev.wikimedia.org
[09:15:36] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.reboot-single for host gitlab1003.wikimedia.org
[09:15:40] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:15:41] <icinga-wm_>	 RECOVERY - Host arclamp2001 is UP: PING OK - Packet loss = 0%, RTA = 30.36 ms
[09:15:57] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host moss-fe1001.eqiad.wmnet
[09:17:04] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host moss-fe2001.codfw.wmnet
[09:18:15] <wikibugs>	 06SRE, 10Cloud-Services, 06serviceops: Modernise memcached systemd unit / sync, and make it presentable - https://phabricator.wikimedia.org/T273950#9858577 (10jijiki)
[09:18:31] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host install5002.wikimedia.org
[09:18:49] <icinga-wm_>	 PROBLEM - Host arclamp1001 is DOWN: PING CRITICAL - Packet loss = 100%
[09:20:11] <icinga-wm_>	 RECOVERY - Host arclamp1001 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms
[09:21:02] <logmsgbot>	 !log taavi@cumin1002 START - Cookbook sre.hosts.reboot-single for host testhost2001.codfw.wmnet
[09:21:31] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host gitlab1003.wikimedia.org
[09:21:33] <icinga-wm_>	 PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 56 probes of 786 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[09:21:58] <logmsgbot>	 !log taavi@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudweb2002-dev.wikimedia.org
[09:22:09] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.reboot-single for host gitlab2003.wikimedia.org
[09:22:29] <logmsgbot>	 !log taavi@cumin1002 START - Cookbook sre.hosts.reboot-single for host cloudweb1003.wikimedia.org
[09:22:58] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host moss-fe2001.codfw.wmnet
[09:23:16] <kostajh>	 hashar: yes all done
[09:23:28] <hashar>	 kostajh: great!!
[09:25:15] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host install5002.wikimedia.org
[09:26:03] <wikibugs>	 (03PS2) 10Fabfur: hiera: enable IPIP for high-traffic1@magru for text services [puppet] - 10https://gerrit.wikimedia.org/r/1038698 (https://phabricator.wikimedia.org/T366466)
[09:26:54] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host install4002.wikimedia.org
[09:27:10] <logmsgbot>	 !log taavi@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host testhost2001.codfw.wmnet
[09:27:14] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling reboot on P{ms-fe1*} and (A:swift-fe or A:swift-fe-canary or A:swift-fe-codfw or A:swift-fe-eqiad)
[09:27:17] <logmsgbot>	 !log taavi@cumin1002 START - Cookbook sre.hosts.reboot-single for host cloudlb2001-dev.codfw.wmnet
[09:27:31] <icinga-wm_>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[09:27:36] <logmsgbot>	 !log filippo@cumin1002 START - Cookbook sre.hosts.reboot-single for host mwlog1002.eqiad.wmnet
[09:27:37] <wikibugs>	 10ops-eqiad, 06SRE-OnFire, 06DC-Ops, 06serviceops, 10Sustainability (Incident Followup): eqiad:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366204#9858591 (10kamila) @VRiley-WMF Yes, that works, thank you!  Since with moving racks it's going to take a while, could we please d...
[09:27:38] <logmsgbot>	 !log taavi@cumin1002 START - Cookbook sre.hosts.reboot-single for host cloudcontrol2007-dev.codfw.wmnet
[09:27:39] <godog>	 jouncebot: next
[09:27:39] <jouncebot>	 In 0 hour(s) and 32 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240604T1000)
[09:27:51] <icinga-wm_>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[09:28:47] <icinga-wm_>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 52067 bytes in 5.364 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[09:29:16] <logmsgbot>	 !log taavi@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudweb1003.wikimedia.org
[09:29:19] <wikibugs>	 (03PS3) 10Clément Goubert: miscweb: Update various modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/1032525 (https://phabricator.wikimedia.org/T362978)
[09:29:23] <icinga-wm_>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8616 bytes in 0.265 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[09:29:24] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host gitlab2003.wikimedia.org
[09:29:33] <icinga-wm_>	 PROBLEM - BFD status on cloudsw1-b1-codfw.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[09:29:47] <icinga-wm_>	 PROBLEM - BGP status on cloudsw1-b1-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:30:19] <wikibugs>	 (03CR) 10Clément Goubert: miscweb: Update various modules (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1032525 (https://phabricator.wikimedia.org/T362978) (owner: 10Clément Goubert)
[09:30:27] <logmsgbot>	 !log taavi@cumin1002 START - Cookbook sre.hosts.reboot-single for host cloudweb1004.wikimedia.org
[09:31:16] <wikibugs>	 06SRE, 10SRE-tools: Provide an utility script to replace a failed device in raid 0 array - https://phabricator.wikimedia.org/T350492#9858600 (10Ladsgroup) Hi, clinic duty again. Can you tag it with a team? Wouldn't I/F be okay here?
[09:33:13] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host install4002.wikimedia.org
[09:33:35] <icinga-wm_>	 RECOVERY - BFD status on cloudsw1-b1-codfw.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[09:33:43] <logmsgbot>	 !log filippo@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mwlog1002.eqiad.wmnet
[09:33:47] <icinga-wm_>	 RECOVERY - BGP status on cloudsw1-b1-codfw.mgmt is OK: BGP OK - up: 11, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:34:04] <logmsgbot>	 !log filippo@cumin1002 START - Cookbook sre.hosts.reboot-single for host mwlog2002.codfw.wmnet
[09:34:06] <logmsgbot>	 !log taavi@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudcontrol2007-dev.codfw.wmnet
[09:34:30] <logmsgbot>	 !log taavi@cumin1002 START - Cookbook sre.hosts.reboot-single for host cloudcontrol2008-dev.codfw.wmnet
[09:36:06] <logmsgbot>	 !log taavi@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2001-dev.codfw.wmnet
[09:36:13] <icinga-wm_>	 RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 33 probes of 787 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[09:36:18] <logmsgbot>	 !log taavi@cumin1002 START - Cookbook sre.hosts.reboot-single for host cloudlb2002-dev.codfw.wmnet
[09:37:15] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host install3003.wikimedia.org
[09:37:20] <wikibugs>	 (PS5) Pmiazga: beta: Introduce new test2wiki on test2.wikipedia.beta.wmcloud.org [mediawiki-config] - https://gerrit.wikimedia.org/r/1035749 (https://phabricator.wikimedia.org/T355281)
[09:37:20] <wikibugs>	 (CR) Pmiazga: beta: Introduce new test2wiki on test2.wikipedia.beta.wmcloud.org (5 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/1035749 (https://phabricator.wikimedia.org/T355281) (owner: Pmiazga)
[09:37:29] <logmsgbot>	 !log taavi@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudweb1004.wikimedia.org
[09:37:40] <logmsgbot>	 !log filippo@cumin1002 START - Cookbook sre.hosts.reboot-single for host graphite2004.codfw.wmnet
[09:38:02] <wikibugs>	 SRE, SRE-tools, collaboration-services, Infrastructure-Foundations, and 4 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619#9858611 (MoritzMuehlenhoff)
[09:38:08] <wikibugs>	 (CR) Gergő Tisza: [C:+1] beta: Introduce new test2wiki on test2.wikipedia.beta.wmcloud.org [mediawiki-config] - https://gerrit.wikimedia.org/r/1035749 (https://phabricator.wikimedia.org/T355281) (owner: Pmiazga)
[09:38:49] <icinga-wm_>	 PROBLEM - BGP status on cloudsw1-b1-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:38:53] <logmsgbot>	 !log taavi@cumin1002 START - Cookbook sre.hosts.reboot-single for host cloudgw2002-dev.codfw.wmnet
[09:39:11] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: dumps::generation::worker::dumper_monitor
[09:39:35] <icinga-wm_>	 PROBLEM - BFD status on cloudsw1-b1-codfw.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[09:40:11] <logmsgbot>	 !log filippo@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mwlog2002.codfw.wmnet
[09:40:47] <logmsgbot>	 !log taavi@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudcontrol2008-dev.codfw.wmnet
[09:41:38] <logmsgbot>	 !log taavi@cumin1002 START - Cookbook sre.hosts.reboot-single for host cloudcumin2001.codfw.wmnet
[09:41:44] <wikibugs>	 (PS1) Muehlenhoff: Switch dumps::generation::worker::dumper_monitor to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/1038729 (https://phabricator.wikimedia.org/T349619)
[09:42:05] <logmsgbot>	 !log filippo@cumin1002 START - Cookbook sre.hosts.reboot-single for host centrallog2002.codfw.wmnet
[09:42:11] <icinga-wm_>	 RECOVERY - Disk space on karapace1002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=karapace1002&var-datasource=eqiad+prometheus/ops
[09:42:35] <icinga-wm_>	 RECOVERY - BFD status on cloudsw1-b1-codfw.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[09:42:49] <icinga-wm_>	 RECOVERY - BGP status on cloudsw1-b1-codfw.mgmt is OK: BGP OK - up: 11, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:43:44] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host install3003.wikimedia.org
[09:44:29] <icinga-wm_>	 PROBLEM - BFD status on cr2-codfw is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[09:44:33] <icinga-wm_>	 PROBLEM - BFD status on cr1-codfw is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[09:44:36] <logmsgbot>	 !log taavi@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudgw2002-dev.codfw.wmnet
[09:44:41] <icinga-wm_>	 PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:44:43] <icinga-wm_>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:45:04] <logmsgbot>	 !log taavi@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2002-dev.codfw.wmnet
[09:45:11] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Switch dumps::generation::worker::dumper_monitor to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/1038729 (https://phabricator.wikimedia.org/T349619) (owner: Muehlenhoff)
[09:45:17] <logmsgbot>	 !log filippo@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host graphite2004.codfw.wmnet
[09:45:23] <logmsgbot>	 !log taavi@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudcumin2001.codfw.wmnet
[09:45:28] <jinxer-wm>	 FIRING: KeyholderUnarmed: 2 unarmed Keyholder key(s) on cloudcumin2001:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[09:47:21] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host cuminunpriv1001.eqiad.wmnet
[09:47:29] <icinga-wm_>	 RECOVERY - BFD status on cr2-codfw is OK: UP: 20 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[09:47:33] <icinga-wm_>	 RECOVERY - BFD status on cr1-codfw is OK: UP: 22 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[09:47:41] <icinga-wm_>	 RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 443, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:47:43] <icinga-wm_>	 RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 521, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:48:14] <logmsgbot>	 !log taavi@cumin1002 START - Cookbook sre.hosts.reboot-single for host cloudgw2003-dev.codfw.wmnet
[09:48:18] <logmsgbot>	 !log taavi@cumin1002 START - Cookbook sre.hosts.reboot-single for host cloudlb2003-dev.codfw.wmnet
[09:48:36] <logmsgbot>	 !log filippo@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host centrallog2002.codfw.wmnet
[09:49:41] <wikibugs>	 (PS1) Ayounsi: Netbox-extra: Add bandit and prospector to CI [software/netbox-extras] (dev) - https://gerrit.wikimedia.org/r/1038730
[09:49:54] <wikibugs>	 (CR) CI reject: [V:-1] Netbox-extra: Add bandit and prospector to CI [software/netbox-extras] (dev) - https://gerrit.wikimedia.org/r/1038730 (owner: Ayounsi)
[09:50:28] <jinxer-wm>	 RESOLVED: KeyholderUnarmed: 2 unarmed Keyholder key(s) on cloudcumin2001:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[09:50:37] <icinga-wm_>	 PROBLEM - BFD status on cloudsw1-b1-codfw.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[09:50:49] <icinga-wm_>	 PROBLEM - BGP status on cloudsw1-b1-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:51:01] <wikibugs>	 (PS2) Pmiazga: [beta] Add test2.wikimedia.beta.wmcloud.org to beta_sites [puppet] - https://gerrit.wikimedia.org/r/1035752 (https://phabricator.wikimedia.org/T355281)
[09:51:29] <wikibugs>	 (CR) Gergő Tisza: [C:+1] [beta] Add test2.wikimedia.beta.wmcloud.org to beta_sites [puppet] - https://gerrit.wikimedia.org/r/1035752 (https://phabricator.wikimedia.org/T355281) (owner: Pmiazga)
[09:53:12] <logmsgbot>	 !log taavi@cumin1002 START - Cookbook sre.hosts.reboot-single for host cloudcumin1001.eqiad.wmnet
[09:53:13] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cuminunpriv1001.eqiad.wmnet
[09:54:37] <icinga-wm_>	 RECOVERY - BFD status on cloudsw1-b1-codfw.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[09:54:49] <logmsgbot>	 !log taavi@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudgw2003-dev.codfw.wmnet
[09:54:49] <icinga-wm_>	 RECOVERY - BGP status on cloudsw1-b1-codfw.mgmt is OK: BGP OK - up: 11, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:55:58] <wikibugs>	 (CR) Pmiazga: [C:+1] [POC][beta] Add rewrite rule for sso.wikimedia.beta.wmflabs.org [puppet] - https://gerrit.wikimedia.org/r/1036230 (https://phabricator.wikimedia.org/T365162) (owner: Gergő Tisza)
[09:56:29] <wikibugs>	 (CR) Marostegui: [C:+2] db1156: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1038700 (owner: Marostegui)
[09:56:56] <logmsgbot>	 !log taavi@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudcumin1001.eqiad.wmnet
[09:57:05] <logmsgbot>	 !log taavi@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2003-dev.codfw.wmnet
[09:58:23] <logmsgbot>	 !log taavi@cumin1002 START - Cookbook sre.hosts.reboot-single for host cloudlb1001.eqiad.wmnet
[09:58:53] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host db1156.eqiad.wmnet with OS bookworm
[09:58:58] <jinxer-wm>	 FIRING: [2x] KeyholderUnarmed: 2 unarmed Keyholder key(s) on cloudcumin1001:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[09:59:13] <jinxer-wm>	 RESOLVED: [2x] KeyholderUnarmed: 2 unarmed Keyholder key(s) on cloudcumin1001:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[09:59:46] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on 35 hosts with reason: Primary switchover s1 T366552
[09:59:49] <stashbot>	 T366552: Switchover s1 master (db2203 -> db2212) - https://phabricator.wikimedia.org/T366552
[10:00:04] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240604T1000)
[10:00:17] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 35 hosts with reason: Primary switchover s1 T366552
[10:00:25] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Set db2212 with weight 0 T366552', diff saved to https://phabricator.wikimedia.org/P63993 and previous config saved to /var/cache/conftool/dbconfig/20240604-100024-root.json
[10:00:50] <wikibugs>	 (CR) Marostegui: [C:+2] mariadb: Promote db2212 to s1 master [puppet] - https://gerrit.wikimedia.org/r/1038351 (https://phabricator.wikimedia.org/T366552) (owner: Gerrit maintenance bot)
[10:01:11] <icinga-wm_>	 PROBLEM - BFD status on cloudsw1-c8-eqiad.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[10:01:37] <icinga-wm_>	 PROBLEM - BGP status on cloudsw1-c8-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:03:37] <icinga-wm_>	 RECOVERY - BGP status on cloudsw1-c8-eqiad.mgmt is OK: BGP OK - up: 14, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:04:00] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling reboot on P{ms-fe1*} and (A:swift-fe or A:swift-fe-canary or A:swift-fe-codfw or A:swift-fe-eqiad)
[10:04:03] <logmsgbot>	 !log taavi@cumin1002 START - Cookbook sre.hosts.reboot-single for host clouddb2002-dev.codfw.wmnet
[10:04:11] <icinga-wm_>	 RECOVERY - BFD status on cloudsw1-c8-eqiad.mgmt is OK: UP: 11 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[10:05:17] <wikibugs>	 (CR) Pmiazga: [C:+1] [POC] Handle sso.wikimedia.org domain [mediawiki-config] - https://gerrit.wikimedia.org/r/1036245 (https://phabricator.wikimedia.org/T365162) (owner: Gergő Tisza)
[10:05:54] <wikibugs>	 SRE, MoveComms-Support, MW-on-K8s, serviceops, and 2 others: Move 100% of external traffic to Kubernetes (excluding Votewiki) - https://phabricator.wikimedia.org/T362323#9858668 (Clement_Goubert)
[10:06:47] <logmsgbot>	 !log taavi@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb1001.eqiad.wmnet
[10:07:03] <logmsgbot>	 !log taavi@cumin1002 START - Cookbook sre.hosts.reboot-single for host cloudlb1002.eqiad.wmnet
[10:07:15] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: dumps::generation::worker::dumper_monitor
[10:08:15] <marostegui>	 !log dbmaint eqiad s1 deploy schema change on db1184 T364299
[10:08:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:08:18] <stashbot>	 T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299
[10:09:29] <icinga-wm_>	 PROBLEM - BGP status on cloudsw1-d5-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:09:38] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling reboot on P{ms-fe2*} and (A:swift-fe or A:swift-fe-canary or A:swift-fe-codfw or A:swift-fe-eqiad)
[10:09:56] <wikibugs>	 (PS1) Clément Goubert: mw-web, mw-api-ext: Raise replicas for 90% traffic [deployment-charts] - https://gerrit.wikimedia.org/r/1038732 (https://phabricator.wikimedia.org/T362323)
[10:10:09] <icinga-wm_>	 PROBLEM - BFD status on cloudsw1-d5-eqiad.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[10:10:38] <logmsgbot>	 !log taavi@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host clouddb2002-dev.codfw.wmnet
[10:11:03] <wikibugs>	 (PS2) Hnowlan: trafficserver: move k8s traffic shift to 90% [puppet] - https://gerrit.wikimedia.org/r/1028844 (https://phabricator.wikimedia.org/T362323)
[10:12:09] <icinga-wm_>	 RECOVERY - BFD status on cloudsw1-d5-eqiad.mgmt is OK: UP: 11 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[10:12:29] <icinga-wm_>	 RECOVERY - BGP status on cloudsw1-d5-eqiad.mgmt is OK: BGP OK - up: 14, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:12:49] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1156.eqiad.wmnet with reason: host reimage
[10:15:11] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1156.eqiad.wmnet with reason: host reimage
[10:15:26] <logmsgbot>	 !log taavi@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb1002.eqiad.wmnet
[10:16:39] <hashar>	 !log Upgrading CI Jenkins # T366008
[10:16:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:16:42] <stashbot>	 T366008: Upgrade Jenkins instances to 2.452.1 - https://phabricator.wikimedia.org/T366008
[10:16:42] <wikibugs>	 (PS1) Clément Goubert: trafficserver: Migrate votewiki to k8s [puppet] - https://gerrit.wikimedia.org/r/1038735 (https://phabricator.wikimedia.org/T209892)
[10:18:29] <wikibugs>	 SRE, MoveComms-Support, MW-on-K8s, serviceops, and 2 others: Move 100% of external traffic to Kubernetes - https://phabricator.wikimedia.org/T362323#9858735 (Ladsgroup)
[10:18:44] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host failoid2002.codfw.wmnet
[10:20:20] <godog>	 jouncebot: next
[10:20:20] <jouncebot>	 In 1 hour(s) and 39 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240604T1200)
[10:20:33] <logmsgbot>	 !log filippo@cumin1002 START - Cookbook sre.hosts.reboot-single for host centrallog1002.eqiad.wmnet
[10:21:03] <wikibugs>	 SRE, SRE-tools, collaboration-services, Infrastructure-Foundations, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619#9858744 (MoritzMuehlenhoff)
[10:21:12] <wikibugs>	 (CR) Hnowlan: [C:+1] "🎉🎉🎉" [puppet] - https://gerrit.wikimedia.org/r/1038735 (https://phabricator.wikimedia.org/T209892) (owner: Clément Goubert)
[10:22:06] <wikibugs>	 (CR) Clément Goubert: [C:+2] trafficserver: Migrate votewiki to k8s [puppet] - https://gerrit.wikimedia.org/r/1038735 (https://phabricator.wikimedia.org/T209892) (owner: Clément Goubert)
[10:22:20] <wikibugs>	 (PS1) AikoChou: ml-services: update RevertRisk LA/ML/Wikidata's images [deployment-charts] - https://gerrit.wikimedia.org/r/1038736 (https://phabricator.wikimedia.org/T358744)
[10:22:29] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host failoid2002.codfw.wmnet
[10:23:19] <icinga-wm_>	 PROBLEM - BFD status on cr1-eqiad is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[10:23:21] <icinga-wm_>	 PROBLEM - BFD status on cr2-eqiad is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[10:23:24] <claime>	 !log Migrating votewiki to mw-on-k8s - T362323
[10:23:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:23:27] <stashbot>	 T362323: Move 100% of external traffic to Kubernetes - https://phabricator.wikimedia.org/T362323
[10:24:16] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: dumps::generation::worker::dumper
[10:27:34] <logmsgbot>	 !log hashar@deploy1002 Started deploy [releng/jenkins-deploy@5d3a06d] (releasing): (no justification provided)
[10:27:54] <hashar>	 !log Upgrading releases Jenkins instances # T366008
[10:27:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:27:59] <stashbot>	 T366008: Upgrade Jenkins instances to 2.452.1 - https://phabricator.wikimedia.org/T366008
[10:28:43] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job rsyslog-receiver in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[10:28:47] <logmsgbot>	 !log hashar@deploy1002 Finished deploy [releng/jenkins-deploy@5d3a06d] (releasing): (no justification provided) (duration: 01m 12s)
[10:29:03] <wikibugs>	 (PS1) Marostegui: Revert "db1156: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1038722
[10:30:13] <wikibugs>	 (PS1) Muehlenhoff: Switch dumps::generation::worker::dumper to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/1038738 (https://phabricator.wikimedia.org/T349619)
[10:30:17] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host failoid1002.eqiad.wmnet
[10:32:30] <wikibugs>	 (PS6) Dreamy Jazz: [CheckUser] Stop writing old for event table migration on testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1013386 (https://phabricator.wikimedia.org/T360686)
[10:33:16] <wikibugs>	 (PS1) Majavah: hieradata: Remove unused role hiera [puppet] - https://gerrit.wikimedia.org/r/1038739
[10:34:05] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host failoid1002.eqiad.wmnet
[10:34:09] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Switch dumps::generation::worker::dumper to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/1038738 (https://phabricator.wikimedia.org/T349619) (owner: Muehlenhoff)
[10:34:24] <wikibugs>	 (PS7) Dreamy Jazz: [CheckUser] Stop writing old for event table migration on testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1013386 (https://phabricator.wikimedia.org/T360686)
[10:34:41] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host build2001.codfw.wmnet
[10:35:04] <wikibugs>	 SRE, Traffic: Anycast ns1.wikimedia.org - https://phabricator.wikimedia.org/T366193#9858791 (cmooney) >>! In T366193#9855670, @BBlack wrote: > IMHO, the A/B set solution with a pair of anycasts, is the most elegant and simple way to achieve the best balance of resiliency and perf for our authdns.  I thin...
[10:35:11] <wikibugs>	 (PS3) Clément Goubert: trafficserver: move k8s traffic shift to 90% [puppet] - https://gerrit.wikimedia.org/r/1028844 (https://phabricator.wikimedia.org/T362323) (owner: Hnowlan)
[10:35:19] <wikibugs>	 (CR) Aklapper: [V:+2 C:+2] "Thanks! git apply worked cleanly locally on latest wmf/stable branch thus +2" [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/1037902 (owner: Pppery)
[10:36:41] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1156.eqiad.wmnet with OS bookworm
[10:38:36] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: dumps::generation::worker::dumper
[10:39:12] <wikibugs>	 (PS1) Dreamy Jazz: [CheckUser] Stop writing old for event tables migration on group0 [mediawiki-config] - https://gerrit.wikimedia.org/r/1038740 (https://phabricator.wikimedia.org/T360685)
[10:39:14] <wikibugs>	 (PS1) Dreamy Jazz: [CheckUser] Stop writing old for event tables migration on group1 [mediawiki-config] - https://gerrit.wikimedia.org/r/1038741 (https://phabricator.wikimedia.org/T360685)
[10:39:16] <wikibugs>	 (PS1) Dreamy Jazz: [CheckUser] Stop writing old for event tables migration on all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1038742 (https://phabricator.wikimedia.org/T360685)
[10:40:41] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host build2001.codfw.wmnet
[10:40:53] <icinga-wm_>	 PROBLEM - SSH on centrallog1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[10:40:55] <jinxer-wm>	 RESOLVED: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:42:01] <logmsgbot>	 !log taavi@cumin1002 START - Cookbook sre.hosts.reboot-single for host cloudlb1001.eqiad.wmnet
[10:42:17] <marostegui>	 !log Starting s1 codfw failover from db2203 to db2212 - T366552
[10:42:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:42:23] <stashbot>	 T366552: Switchover s1 master (db2203 -> db2212) - https://phabricator.wikimedia.org/T366552
[10:42:35] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for Ifrahkhanyaree_WMDE - https://phabricator.wikimedia.org/T366558#9858804 (Ladsgroup) Waiting for approval on data engineering side.
[10:42:42] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote db2212 to s1 primary T366552', diff saved to https://phabricator.wikimedia.org/P63994 and previous config saved to /var/cache/conftool/dbconfig/20240604-104241-root.json
[10:43:37] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2203 T366552', diff saved to https://phabricator.wikimedia.org/P63995 and previous config saved to /var/cache/conftool/dbconfig/20240604-104337-root.json
[10:44:37] <icinga-wm_>	 PROBLEM - BGP status on cloudsw1-c8-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:45:16] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 10:00:00 on db2141.codfw.wmnet with reason: Long schema change
[10:45:17] <icinga-wm_>	 PROBLEM - BFD status on cloudsw1-c8-eqiad.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[10:45:18] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db2141.codfw.wmnet with reason: Long schema change
[10:45:26] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 10:00:00 on db2203.codfw.wmnet with reason: Long schema change
[10:45:28] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db2203.codfw.wmnet with reason: Long schema change
[10:45:39] <marostegui>	 !log dbmaint codfw s1 deploy schema change on db2203 T364299
[10:45:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:45:42] <stashbot>	 T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299
[10:46:10] <wikibugs>	 (PS1) Fabfur: cache:hiera: enable IPIP on text@magru [puppet] - https://gerrit.wikimedia.org/r/1038744 (https://phabricator.wikimedia.org/T366466)
[10:46:27] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling reboot on P{ms-fe2*} and (A:swift-fe or A:swift-fe-canary or A:swift-fe-codfw or A:swift-fe-eqiad)
[10:47:44] <wikibugs>	 (CR) Fabfur: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1038744 (https://phabricator.wikimedia.org/T366466) (owner: Fabfur)
[10:48:00] <wikibugs>	 (PS2) Clément Goubert: miscweb: Use a random miscweb image for default value [deployment-charts] - https://gerrit.wikimedia.org/r/1038251 (https://phabricator.wikimedia.org/T362518)
[10:48:03] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.swift.roll-restart-reboot-swift-thanos-proxies rolling reboot on A:thanos-fe
[10:48:17] <icinga-wm_>	 RECOVERY - BFD status on cloudsw1-c8-eqiad.mgmt is OK: UP: 11 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[10:48:25] <wikibugs>	 (CR) Kevin Bazira: [C:+1] ml-services: update RevertRisk LA/ML/Wikidata's images [deployment-charts] - https://gerrit.wikimedia.org/r/1038736 (https://phabricator.wikimedia.org/T358744) (owner: AikoChou)
[10:48:37] <icinga-wm_>	 RECOVERY - BGP status on cloudsw1-c8-eqiad.mgmt is OK: BGP OK - up: 14, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:49:07] <logmsgbot>	 !log taavi@cumin1002 START - Cookbook sre.hosts.reboot-single for host cloudlb2001-dev.codfw.wmnet
[10:49:48] <wikibugs>	 (CR) Btullis: [C:+1] an-test-druid: Switch to use nftables instead of iptables [puppet] - https://gerrit.wikimedia.org/r/1032632 (owner: Muehlenhoff)
[10:50:38] <logmsgbot>	 !log taavi@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb1001.eqiad.wmnet
[10:50:50] <logmsgbot>	 !log taavi@cumin1002 START - Cookbook sre.hosts.reboot-single for host cloudlb1002.eqiad.wmnet
[10:50:56] <wikibugs>	 (CR) Klausman: [C:+1] ml-services: update RevertRisk LA/ML/Wikidata's images [deployment-charts] - https://gerrit.wikimedia.org/r/1038736 (https://phabricator.wikimedia.org/T358744) (owner: AikoChou)
[10:51:39] <icinga-wm_>	 PROBLEM - BFD status on cloudsw1-b1-codfw.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[10:51:51] <icinga-wm_>	 PROBLEM - BGP status on cloudsw1-b1-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:52:45] <icinga-wm_>	 PROBLEM - MD RAID on centrallog1002 is CRITICAL: CRITICAL: State: degraded, Active: 7, Working: 7, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[10:52:46] <icinga-wm_>	 ACKNOWLEDGEMENT - MD RAID on centrallog1002 is CRITICAL: CRITICAL: State: degraded, Active: 7, Working: 7, Failed: 0, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T366580 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[10:52:51] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Degraded RAID on centrallog1002 - https://phabricator.wikimedia.org/T366580 (ops-monitoring-bot) NEW
[10:53:01] <icinga-wm_>	 PROBLEM - Check if anycast-healthchecker and all configured threads are running on centrallog1002 is CRITICAL: CRITICAL: anycast-healthchecker could be down as pid file /var/run/anycast-healthchecker/anycast-healthchecker.pid doesnt exist https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running
[10:53:03] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on mw1358.eqiad.wmnet with reason: Waiting on iDrac update
[10:53:05] <wikibugs>	 (CR) Btullis: [C:+1] "Looks good. Thanks." [deployment-charts] - https://gerrit.wikimedia.org/r/1032595 (https://phabricator.wikimedia.org/T364921) (owner: Scott French)
[10:53:17] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on mw1358.eqiad.wmnet with reason: Waiting on iDrac update
[10:53:17] <icinga-wm_>	 PROBLEM - BFD status on cloudsw1-d5-eqiad.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[10:53:29] <icinga-wm_>	 PROBLEM - BGP status on cloudsw1-d5-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:53:33] <icinga-wm_>	 PROBLEM - Bird Internet Routing Daemon on centrallog1002 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running
[10:53:43] <icinga-wm_>	 RECOVERY - SSH on centrallog1002 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[10:54:33] <icinga-wm_>	 RECOVERY - Bird Internet Routing Daemon on centrallog1002 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running
[10:54:33] <icinga-wm_>	 RECOVERY - Check if anycast-healthchecker and all configured threads are running on centrallog1002 is OK: OK: UP (pid=3953) and all threads (1) are running https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running
[10:54:44] <logmsgbot>	 !log filippo@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host centrallog1002.eqiad.wmnet
[10:54:51] <icinga-wm_>	 RECOVERY - BGP status on cloudsw1-b1-codfw.mgmt is OK: BGP OK - up: 11, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:55:21] <icinga-wm_>	 RECOVERY - BFD status on cr2-eqiad is OK: UP: 22 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[10:55:21] <icinga-wm_>	 RECOVERY - BFD status on cr1-eqiad is OK: UP: 24 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[10:55:26] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1156 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P63996 and previous config saved to /var/cache/conftool/dbconfig/20240604-105525-root.json
[10:55:28] <wikibugs>	 (CR) Marostegui: [C:+2] Revert "db1156: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1038722 (owner: Marostegui)
[10:55:34] <wikibugs>	 (PS1) Kosta Harlan: IPInfo: Remove $wgIPInfoGeoIP2EnterprisePath and use GeoLite2 data [mediawiki-config] - https://gerrit.wikimedia.org/r/1038723 (https://phabricator.wikimedia.org/T361884)
[10:55:39] <icinga-wm_>	 RECOVERY - BFD status on cloudsw1-b1-codfw.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[10:55:44] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job rsyslog-receiver in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[10:56:19] <icinga-wm_>	 RECOVERY - BFD status on cloudsw1-d5-eqiad.mgmt is OK: UP: 11 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[10:56:23] <wikibugs>	 (PS2) Kosta Harlan: IPInfo: Remove $wgIPInfoGeoIP2EnterprisePath and use GeoLite2 data [mediawiki-config] - https://gerrit.wikimedia.org/r/1038723 (https://phabricator.wikimedia.org/T361884)
[10:56:29] <icinga-wm_>	 RECOVERY - BGP status on cloudsw1-d5-eqiad.mgmt is OK: BGP OK - up: 14, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:57:31] <wikibugs>	 (PS3) Kosta Harlan: IPInfo: Remove $wgIPInfoGeoIP2EnterprisePath and use GeoLite2 data [mediawiki-config] - https://gerrit.wikimedia.org/r/1038723 (https://phabricator.wikimedia.org/T361884)
[10:57:48] <logmsgbot>	 !log taavi@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2001-dev.codfw.wmnet
[10:57:58] <logmsgbot>	 !log taavi@cumin1002 START - Cookbook sre.hosts.reboot-single for host cloudlb2002-dev.codfw.wmnet
[10:58:17] <wikibugs>	 (PS4) Kosta Harlan: IPInfo: Remove $wgIPInfoGeoIP2EnterprisePath and use GeoLite2 data [mediawiki-config] - https://gerrit.wikimedia.org/r/1038723 (https://phabricator.wikimedia.org/T361884)
[10:59:08] <logmsgbot>	 !log taavi@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb1002.eqiad.wmnet
[10:59:50] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts mw1358.eqiad.wmnet
[11:00:01] <logmsgbot>	 !log cgoubert@cumin1002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts mw1358.eqiad.wmnet
[11:00:09] <wikibugs>	 (CR) Kosta Harlan: [geoip::data::maxmind::ipinfo]: Use GeoLite2 instead of Enterprise data (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1037528 (https://phabricator.wikimedia.org/T366272) (owner: Kosta Harlan)
[11:00:39] <icinga-wm_>	 PROBLEM - BFD status on cloudsw1-b1-codfw.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[11:00:53] <icinga-wm_>	 PROBLEM - BGP status on cloudsw1-b1-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:01:46] <wikibugs>	 (PS6) Kosta Harlan: geoip: Download GeoLite2 ASN file [puppet] - https://gerrit.wikimedia.org/r/1037531
[11:01:54] <wikibugs>	 (PS8) Kosta Harlan: [geoip::data::maxmind::ipinfo]: Use GeoLite2 instead of Enterprise data [puppet] - https://gerrit.wikimedia.org/r/1037528 (https://phabricator.wikimedia.org/T366272)
[11:02:25] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Degraded RAID on centrallog1002 - https://phabricator.wikimedia.org/T366580#9858928 (fgiunchedi) →Duplicate dup:T363660
[11:02:42] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Degraded RAID on centrallog1002 - https://phabricator.wikimedia.org/T363660#9858930 (fgiunchedi)
[11:03:39] <icinga-wm_>	 RECOVERY - BFD status on cloudsw1-b1-codfw.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[11:03:53] <icinga-wm_>	 RECOVERY - BGP status on cloudsw1-b1-codfw.mgmt is OK: BGP OK - up: 11, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:04:16] <wikibugs>	 ops-eqiad, DC-Ops, serviceops: hw troubleshooting: firmware upgrade for mw1358.eqiad.wmnet - https://phabricator.wikimedia.org/T366583 (Clement_Goubert) NEW
[11:04:38] <wikibugs>	 ops-eqiad, DC-Ops, serviceops: hw troubleshooting: firmware upgrade for mw1358.eqiad.wmnet - https://phabricator.wikimedia.org/T366583#9858964 (Clement_Goubert)
[11:04:48] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.dns.netbox
[11:04:57] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "Ah, yes. Thanks." [puppet] - https://gerrit.wikimedia.org/r/1038739 (owner: Majavah)
[11:05:52] <wikibugs>	 (CR) Majavah: [C:+2] hieradata: Remove unused role hiera [puppet] - https://gerrit.wikimedia.org/r/1038739 (owner: Majavah)
[11:06:01] <wikibugs>	 ops-eqiad, DC-Ops, Infrastructure-Foundations, serviceops: hw troubleshooting: firmware upgrade for mw1358.eqiad.wmnet - https://phabricator.wikimedia.org/T366583#9858966 (Clement_Goubert)
[11:06:08] <wikibugs>	 ops-eqiad, DC-Ops, Infrastructure-Foundations, serviceops: hw troubleshooting: firmware upgrade for mw1358.eqiad.wmnet - https://phabricator.wikimedia.org/T366583#9858967 (Clement_Goubert)
[11:06:12] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[11:06:46] <logmsgbot>	 !log taavi@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2002-dev.codfw.wmnet
[11:06:57] <logmsgbot>	 !log taavi@cumin1002 START - Cookbook sre.hosts.reboot-single for host cloudlb2003-dev.codfw.wmnet
[11:07:44] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Degraded RAID on centrallog1002 - https://phabricator.wikimedia.org/T363660#9858969 (fgiunchedi) I'm not sure exactly what happened, though while working today on {T366555} centrallog1002 md1 raid wouldn't come up cleanly. I've assembled it with three disks and then put ba...
[11:08:44] <jinxer-wm>	 FIRING: SystemdUnitFailed: generate_vrts_aliases.service on mx2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:09:39] <icinga-wm_>	 PROBLEM - BFD status on cloudsw1-b1-codfw.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[11:09:53] <icinga-wm_>	 PROBLEM - BGP status on cloudsw1-b1-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:10:32] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1156 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P63998 and previous config saved to /var/cache/conftool/dbconfig/20240604-111031-root.json
[11:12:41] <icinga-wm_>	 RECOVERY - BFD status on cloudsw1-b1-codfw.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[11:12:41] <wikibugs>	 (PS5) Kosta Harlan: IPInfo: Remove $wgIPInfoGeoIP2EnterprisePath and use GeoLite2 data [mediawiki-config] - https://gerrit.wikimedia.org/r/1038723 (https://phabricator.wikimedia.org/T361884)
[11:12:53] <icinga-wm_>	 RECOVERY - BGP status on cloudsw1-b1-codfw.mgmt is OK: BGP OK - up: 11, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:14:26] <wikibugs>	 (PS6) Kosta Harlan: IPInfo: Remove $wgIPInfoGeoIP2EnterprisePath and use GeoLite2 data [mediawiki-config] - https://gerrit.wikimedia.org/r/1038723 (https://phabricator.wikimedia.org/T361884)
[11:15:45] <logmsgbot>	 !log taavi@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2003-dev.codfw.wmnet
[11:16:24] <wikibugs>	 (PS7) Kosta Harlan: IPInfo: Switch to using GeoLite2 data [mediawiki-config] - https://gerrit.wikimedia.org/r/1038723 (https://phabricator.wikimedia.org/T361884)
[11:20:30] <wikibugs>	 (PS1) Muehlenhoff: Remove obsolete Icinga stub secrets [labs/private] - https://gerrit.wikimedia.org/r/1038748
[11:21:01] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti-test2001.codfw.wmnet
[11:21:27] <wikibugs>	 (CR) Esanders: [C:+2] Update user-agent string in citoid to be like Zot [deployment-charts] - https://gerrit.wikimedia.org/r/1034860 (https://phabricator.wikimedia.org/T366093) (owner: Mvolz)
[11:22:04] <wikibugs>	 (CR) Esanders: [C:+1] Update user-agent string in citoid to be like Zot [deployment-charts] - https://gerrit.wikimedia.org/r/1034860 (https://phabricator.wikimedia.org/T366093) (owner: Mvolz)
[11:25:38] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1156 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P63999 and previous config saved to /var/cache/conftool/dbconfig/20240604-112537-root.json
[11:26:21] <wikibugs>	 ops-eqiad, SRE-OnFire, DC-Ops, serviceops, Sustainability (Incident Followup): eqiad:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366204#9859001 (VRiley-WMF) Sure thing! We'll do it one at a time.
[11:27:08] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti-test2001.codfw.wmnet
[11:27:13] <wikibugs>	 (CR) Gergő Tisza: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1036230 (https://phabricator.wikimedia.org/T365162) (owner: Gergő Tisza)
[11:27:21] <wikibugs>	 (PS1) Majavah: wikitech: Replace OSM class in Gerrit blocking hook [mediawiki-config] - https://gerrit.wikimedia.org/r/1038749 (https://phabricator.wikimedia.org/T161553)
[11:27:22] <wikibugs>	 (PS1) Majavah: wikitech: Stop loading OpenStackManager [mediawiki-config] - https://gerrit.wikimedia.org/r/1038750 (https://phabricator.wikimedia.org/T161553)
[11:27:23] <wikibugs>	 (CR) Gergő Tisza: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1030591 (https://phabricator.wikimedia.org/T350094) (owner: Gergő Tisza)
[11:29:22] <wikibugs>	 (PS2) Majavah: wikitech: Stop loading OpenStackManager [mediawiki-config] - https://gerrit.wikimedia.org/r/1038750 (https://phabricator.wikimedia.org/T161553)
[11:29:48] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti-test2003.codfw.wmnet
[11:36:00] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti-test2003.codfw.wmnet
[11:36:18] <wikibugs>	 (CR) Tchanders: [C:+1] IPInfo: Switch to using GeoLite2 data [mediawiki-config] - https://gerrit.wikimedia.org/r/1038723 (https://phabricator.wikimedia.org/T361884) (owner: Kosta Harlan)
[11:39:05] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-thanos-proxies (exit_code=0) rolling reboot on A:thanos-fe
[11:39:48] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:wikikube-worker-codfw
[11:40:44] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1156 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P64000 and previous config saved to /var/cache/conftool/dbconfig/20240604-114043-root.json
[11:41:47] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2136.codfw.wmnet with reason: Maintenance
[11:41:49] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2136.codfw.wmnet with reason: Maintenance
[11:41:52] <wikibugs>	 SRE, serviceops, Data Products (Data Products Sprint 14), Patch-For-Review, Service-deployment-requests: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production - https://phabricator.wikimedia.org/T361835#9859048 (SGupta-WMF) Hi @Scott_French We are almost done coding the services...
[11:41:57] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2136 (T364299)', diff saved to https://phabricator.wikimedia.org/P64001 and previous config saved to /var/cache/conftool/dbconfig/20240604-114157-marostegui.json
[11:42:00] <stashbot>	 T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299
[11:44:09] <logmsgbot>	 !log klausman@cumin2002 START - Cookbook sre.cassandra.roll-reboot rolling reboot on A:ml-cache-codfw
[11:47:22] <jinxer-wm>	 FIRING: SessionStoreOnNonDedicatedHost: Sessionstore k8s pods are running on non-dedicated hosts - TODO - TODO - https://alerts.wikimedia.org/?q=alertname%3DSessionStoreOnNonDedicatedHost
[11:47:47] <claime>	 ^side effect of reboots
[11:48:10] <claime>	 I'll fix it once their dedicated hosts are done rebooting
[11:48:43] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service ml-cache2001-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:50:31] <logmsgbot>	 !log klausman@cumin2002 START - Cookbook sre.cassandra.roll-reboot rolling reboot on A:ml-cache-eqiad
[11:50:44] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service ml-cache2001-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:53:43] <icinga-wm_>	 PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 225, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[11:53:43] <jinxer-wm>	 FIRING: [3x] ProbeDown: Service ml-cache1001-a:9042 has failed probes (tcp_cassandra_a_cql_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:54:09] <icinga-wm_>	 PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 137, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[11:54:16] <wikibugs>	 (CR) Brouberol: [C:+1] Add wikidata history-dumps import to hdfs job [puppet] - https://gerrit.wikimedia.org/r/1036614 (https://phabricator.wikimedia.org/T364045) (owner: Joal)
[11:54:30] <wikibugs>	 (CR) Brouberol: [C:+2] Add wikidata history-dumps import to hdfs job [puppet] - https://gerrit.wikimedia.org/r/1036614 (https://phabricator.wikimedia.org/T364045) (owner: Joal)
[11:54:36] <hnowlan>	 !log depooling 3 api appservers and 2 appservers in advance of reimaging 
[11:54:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:54:44] <wikibugs>	 (CR) Vgutierrez: [C:+1] hiera: enable IPIP for high-traffic1@magru for text services [puppet] - https://gerrit.wikimedia.org/r/1038698 (https://phabricator.wikimedia.org/T366466) (owner: Fabfur)
[11:54:49] <wikibugs>	 (CR) Vgutierrez: [C:+1] cache:hiera: enable IPIP on text@magru [puppet] - https://gerrit.wikimedia.org/r/1038744 (https://phabricator.wikimedia.org/T366466) (owner: Fabfur)
[11:55:44] <jinxer-wm>	 FIRING: [4x] ProbeDown: Service ml-cache1001-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:55:50] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1156 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P64002 and previous config saved to /var/cache/conftool/dbconfig/20240604-115549-root.json
[11:56:45] <icinga-wm_>	 RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 226, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[11:57:09] <icinga-wm_>	 RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 138, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[11:57:54] <wikibugs>	 (CR) Vgutierrez: [C:+1] depool text@magru before enabling IPIP encapsulation [dns] - https://gerrit.wikimedia.org/r/1038695 (https://phabricator.wikimedia.org/T366466) (owner: Fabfur)
[11:58:43] <jinxer-wm>	 FIRING: [6x] ProbeDown: Service ml-cache1001-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:59:07] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1246 (T352010)', diff saved to https://phabricator.wikimedia.org/P64003 and previous config saved to /var/cache/conftool/dbconfig/20240604-115907-ladsgroup.json
[11:59:10] <stashbot>	 T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[12:00:05] <jouncebot>	 Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240604T1200)
[12:00:44] <jinxer-wm>	 RESOLVED: [6x] ProbeDown: Service ml-cache1001-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:02:41] <logmsgbot>	 !log taavi@cumin1002 START - Cookbook sre.wikireplicas.add-wiki for database dtpwiki (T365229)
[12:02:43] <stashbot>	 T365229: Prepare and check storage layer for dtpwiki - https://phabricator.wikimedia.org/T365229
[12:03:43] <jinxer-wm>	 FIRING: [7x] ProbeDown: Service ml-cache1001-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:05:39] <wikibugs>	 (PS1) Hnowlan: kubernetes: rename and reimage 3 api appservers, 2 appservers [puppet] - https://gerrit.wikimedia.org/r/1038757 (https://phabricator.wikimedia.org/T362323)
[12:05:44] <jinxer-wm>	 FIRING: [8x] ProbeDown: Service ml-cache1001-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:05:44] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: generate_vrts_aliases.service on mx2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:07:22] <jinxer-wm>	 FIRING: [2x] SessionStoreOnNonDedicatedHost: Sessionstore k8s pods are running on non-dedicated hosts - TODO - TODO - https://alerts.wikimedia.org/?q=alertname%3DSessionStoreOnNonDedicatedHost
[12:08:33] <logmsgbot>	 !log klausman@cumin2002 END (PASS) - Cookbook sre.cassandra.roll-reboot (exit_code=0) rolling reboot on A:ml-cache-codfw
[12:08:43] <jinxer-wm>	 FIRING: [8x] ProbeDown: Service ml-cache1001-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:09:55] <icinga-wm_>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS6
[12:09:55] <icinga-wm_>	 : Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[12:09:55] <icinga-wm_>	 PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[12:10:01] <icinga-wm_>	 PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS
[12:10:01] <icinga-wm_>	 6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[12:10:45] <jinxer-wm>	 RESOLVED: [8x] ProbeDown: Service ml-cache1001-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:10:56] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1156 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P64004 and previous config saved to /var/cache/conftool/dbconfig/20240604-121056-root.json
[12:11:57] <icinga-wm_>	 RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 521, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[12:12:03] <icinga-wm_>	 RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 443, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[12:12:15] <logmsgbot>	 !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/image-suggestion: apply
[12:12:21] <logmsgbot>	 !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/image-suggestion: apply
[12:13:43] <jinxer-wm>	 FIRING: [9x] ProbeDown: Service ml-cache1001-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:14:04] <logmsgbot>	 !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/image-suggestion: apply
[12:14:16] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1246', diff saved to https://phabricator.wikimedia.org/P64005 and previous config saved to /var/cache/conftool/dbconfig/20240604-121415-ladsgroup.json
[12:14:21] <logmsgbot>	 !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/image-suggestion: apply
[12:14:35] <icinga-wm_>	 RECOVERY - MD RAID on centrallog1002 is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[12:15:24] <logmsgbot>	 !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/image-suggestion: apply
[12:15:40] <logmsgbot>	 !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/image-suggestion: apply
[12:15:44] <jinxer-wm>	 FIRING: [9x] ProbeDown: Service ml-cache1001-a:9042 has failed probes (tcp_cassandra_a_cql_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:15:57] <icinga-wm_>	 PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[12:17:10] <logmsgbot>	 !log klausman@cumin2002 END (PASS) - Cookbook sre.cassandra.roll-reboot (exit_code=0) rolling reboot on A:ml-cache-eqiad
[12:17:30] <wikibugs>	 (PS1) Jelto: conftool-data: add gerrit and gerrit-replica [puppet] - https://gerrit.wikimedia.org/r/1038758 (https://phabricator.wikimedia.org/T365259)
[12:18:43] <jinxer-wm>	 RESOLVED: [7x] ProbeDown: Service ml-cache1002-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:21:44] <wikibugs>	 (CR) Filippo Giunchedi: [C:+1] Remove obsolete Icinga stub secrets [labs/private] - https://gerrit.wikimedia.org/r/1038748 (owner: Muehlenhoff)
[12:22:01] <icinga-wm_>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS6
[12:22:01] <icinga-wm_>	 : Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[12:22:05] <icinga-wm_>	 PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS
[12:22:05] <icinga-wm_>	 4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[12:22:20] <wikibugs>	 (CR) Muehlenhoff: [V:+2 C:+2] Remove obsolete Icinga stub secrets [labs/private] - https://gerrit.wikimedia.org/r/1038748 (owner: Muehlenhoff)
[12:22:37] <godog>	 jouncebot: next
[12:22:37] <jouncebot>	 In 0 hour(s) and 37 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240604T1300)
[12:22:52] <logmsgbot>	 !log filippo@cumin1002 START - Cookbook sre.hosts.reboot-single for host centrallog1002.eqiad.wmnet
[12:23:01] <wikibugs>	 (CR) Muehlenhoff: [C:+2] an-test-druid: Switch to use nftables instead of iptables [puppet] - https://gerrit.wikimedia.org/r/1032632 (owner: Muehlenhoff)
[12:24:05] <icinga-wm_>	 RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 521, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[12:24:07] <icinga-wm_>	 RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 443, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[12:25:45] <icinga-wm_>	 PROBLEM - BFD status on cr2-eqiad is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[12:25:45] <icinga-wm_>	 PROBLEM - BFD status on cr1-eqiad is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[12:25:46] <wikibugs>	 (PS1) Klausman: base functions: make sleep() output a bit friendlier [cookbooks] - https://gerrit.wikimedia.org/r/1038759
[12:26:02] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1156 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P64006 and previous config saved to /var/cache/conftool/dbconfig/20240604-122602-root.json
[12:26:29] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host an-test-druid1001.eqiad.wmnet
[12:28:00] <logmsgbot>	 !log taavi@cumin1002 END (PASS) - Cookbook sre.wikireplicas.add-wiki (exit_code=0) for database dtpwiki (T365229)
[12:28:03] <stashbot>	 T365229: Prepare and check storage layer for dtpwiki - https://phabricator.wikimedia.org/T365229
[12:28:45] <icinga-wm_>	 RECOVERY - BFD status on cr1-eqiad is OK: UP: 24 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[12:28:45] <icinga-wm_>	 RECOVERY - BFD status on cr2-eqiad is OK: UP: 22 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[12:28:54] <wikibugs>	 (CR) CI reject: [V:-1] base functions: make sleep() output a bit friendlier [cookbooks] - https://gerrit.wikimedia.org/r/1038759 (owner: Klausman)
[12:29:09] <logmsgbot>	 !log filippo@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host centrallog1002.eqiad.wmnet
[12:29:24] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1246', diff saved to https://phabricator.wikimedia.org/P64007 and previous config saved to /var/cache/conftool/dbconfig/20240604-122924-ladsgroup.json
[12:30:58] <wikibugs>	 (PS2) Klausman: base functions: make sleep() output a bit friendlier [cookbooks] - https://gerrit.wikimedia.org/r/1038759
[12:32:02] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-test-druid1001.eqiad.wmnet
[12:32:12] <logmsgbot>	 !log brouberol@cumin2002 START - Cookbook sre.wdqs.restart
[12:32:13] <logmsgbot>	 !log brouberol@cumin2002 END (ERROR) - Cookbook sre.wdqs.restart (exit_code=97)
[12:32:20] <logmsgbot>	 !log brouberol@cumin2002 START - Cookbook sre.wdqs.restart
[12:34:05] <icinga-wm_>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS
[12:34:05] <icinga-wm_>	 6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[12:34:11] <icinga-wm_>	 PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, A
[12:34:11] <icinga-wm_>	 v4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[12:34:34] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host urldownloader1003.wikimedia.org
[12:34:56] <wikibugs>	 (CR) CI reject: [V:-1] base functions: make sleep() output a bit friendlier [cookbooks] - https://gerrit.wikimedia.org/r/1038759 (owner: Klausman)
[12:35:18] <wikibugs>	 (PS2) Ilias Sarantopoulos: ml-services: set command for hf image and remove nllb [deployment-charts] - https://gerrit.wikimedia.org/r/1036297 (https://phabricator.wikimedia.org/T365842)
[12:36:05] <icinga-wm_>	 RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 521, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[12:36:11] <icinga-wm_>	 RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 443, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[12:38:07] <wikibugs>	 06SRE, 10SRE-tools, 06collaboration-services, 06Infrastructure-Foundations, and 4 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619#9859201 (10MoritzMuehlenhoff)
[12:39:03] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host urldownloader1003.wikimedia.org
[12:39:53] <logmsgbot>	 !log klausman@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM ml-etcd2001.codfw.wmnet
[12:43:48] <logmsgbot>	 !log klausman@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-etcd2001.codfw.wmnet
[12:44:32] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1246 (T352010)', diff saved to https://phabricator.wikimedia.org/P64008 and previous config saved to /var/cache/conftool/dbconfig/20240604-124432-ladsgroup.json
[12:44:34] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
[12:44:35] <stashbot>	 T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[12:44:47] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
[12:45:13] <icinga-wm_>	 PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS
[12:45:13] <icinga-wm_>	 6: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[12:45:36] <wikibugs>	 (03CR) 10Ilias Sarantopoulos: [C:03+1] ml-services: update RevertRisk LA/ML/Wikidata's images [deployment-charts] - 10https://gerrit.wikimedia.org/r/1038736 (https://phabricator.wikimedia.org/T358744) (owner: 10AikoChou)
[12:46:13] <icinga-wm_>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS
[12:46:13] <icinga-wm_>	 4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[12:47:10] <claime>	 The BGP errors are expected because of reboots
[12:47:13] <wikibugs>	 (03PS1) 10Ilias Sarantopoulos: ml-services: use multi-processing for viwiki in ml-staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1038765 (https://phabricator.wikimedia.org/T349274)
[12:47:19] <claime>	 Sorry for the noise though
[12:48:13] <icinga-wm_>	 RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 521, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[12:48:13] <icinga-wm_>	 RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 443, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[12:48:52] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host urldownloader2004.wikimedia.org
[12:48:54] <logmsgbot>	 !log klausman@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM ml-etcd2002.codfw.wmnet
[12:49:41] <wikibugs>	 (03PS3) 10Klausman: base functions: make sleep() output a bit friendlier [cookbooks] - 10https://gerrit.wikimedia.org/r/1038759
[12:51:43] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Switch maps/codfw to PKI [puppet] - 10https://gerrit.wikimedia.org/r/1038240 (https://phabricator.wikimedia.org/T360778) (owner: 10Muehlenhoff)
[12:52:22] <jinxer-wm>	 RESOLVED: [2x] SessionStoreOnNonDedicatedHost: Sessionstore k8s pods are running on non-dedicated hosts - TODO - TODO - https://alerts.wikimedia.org/?q=alertname%3DSessionStoreOnNonDedicatedHost
[12:52:49] <logmsgbot>	 !log klausman@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-etcd2002.codfw.wmnet
[12:53:01] <logmsgbot>	 !log klausman@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM ml-etcd2003.codfw.wmnet
[12:53:19] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host urldownloader2004.wikimedia.org
[12:53:35] <logmsgbot>	 !log brouberol@cumin2002 END (PASS) - Cookbook sre.wdqs.restart (exit_code=0)
[12:56:57] <logmsgbot>	 !log klausman@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-etcd2003.codfw.wmnet
[12:57:29] <logmsgbot>	 !log klausman@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM ml-etcd1001.eqiad.wmnet
[12:58:35] <wikibugs>	 (03PS1) 10Muehlenhoff: Remove obsolete swift stub secrets [labs/private] - 10https://gerrit.wikimedia.org/r/1038769
[12:59:09] <icinga-wm_>	 PROBLEM - pybal on lvs7001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal
[12:59:09] <icinga-wm_>	 PROBLEM - BGP status on asw1-b3-magru.mgmt is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[12:59:09] <icinga-wm_>	 PROBLEM - PyBal backends health check on lvs7001 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal
[12:59:17] <sukhe>	 ^ expected
[12:59:43] <icinga-wm_>	 PROBLEM - PyBal connections to etcd on lvs7001 is CRITICAL: CRITICAL: 0 connections established with conf1009.eqiad.wmnet:4001 (min=12) https://wikitech.wikimedia.org/wiki/PyBal
[12:59:52] <logmsgbot>	 !log klausman@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-etcd1001.eqiad.wmnet
[13:00:04] <jouncebot>	 RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: Your horoscope predicts another UTC afternoon backport window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240604T1300).
[13:00:04] <jouncebot>	 Nemoralis and MatmaRex: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:21] <MatmaRex>	 hi
[13:00:24] <logmsgbot>	 !log klausman@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM ml-etcd1002.eqiad.wmnet
[13:02:15] <icinga-wm_>	 PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, A
[13:02:15] <icinga-wm_>	 v4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[13:02:15] <icinga-wm_>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, A
[13:02:15] <icinga-wm_>	 v6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[13:02:50] <logmsgbot>	 !log klausman@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-etcd1002.eqiad.wmnet
[13:03:05] <logmsgbot>	 !log klausman@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM ml-staging-etcd2001.codfw.wmnet
[13:04:00] <wikibugs>	 (03CR) 10MVernon: [C:03+1] "Thanks for doing the tidy-up, this looks good to me." [labs/private] - 10https://gerrit.wikimedia.org/r/1038769 (owner: 10Muehlenhoff)
[13:04:55] <MatmaRex>	 any deployers around?
[13:05:13] <MatmaRex>	 we've got just one real change and one beta-only change today
[13:05:29] <logmsgbot>	 !log klausman@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-staging-etcd2001.codfw.wmnet
[13:05:45] <logmsgbot>	 !log klausman@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM ml-staging-etcd2002.codfw.wmnet
[13:06:30] <wikibugs>	 (03PS1) 10Stevemunene: Clean up datahub from main cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/1038773 (https://phabricator.wikimedia.org/T366338)
[13:08:10] <logmsgbot>	 !log klausman@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-staging-etcd2002.codfw.wmnet
[13:08:25] <logmsgbot>	 !log klausman@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM ml-staging-etcd2003.codfw.wmnet
[13:09:15] <wikibugs>	 (03CR) 10Muehlenhoff: [V:03+2 C:03+2] Remove obsolete swift stub secrets [labs/private] - 10https://gerrit.wikimedia.org/r/1038769 (owner: 10Muehlenhoff)
[13:09:39] <wikibugs>	 (03PS1) 10Brouberol: analytics_test_cluster_coordinator: upgrade mariadb to version 10.6 [puppet] - 10https://gerrit.wikimedia.org/r/1038771 (https://phabricator.wikimedia.org/T365503)
[13:10:21] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Remove obsolete thanos-swift.discovery.wmnet.crt certificate [puppet] - 10https://gerrit.wikimedia.org/r/1038368 (https://phabricator.wikimedia.org/T356412) (owner: 10Muehlenhoff)
[13:10:48] <logmsgbot>	 !log klausman@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-staging-etcd2003.codfw.wmnet
[13:11:02] <logmsgbot>	 !log sukhe@cumin1002 START - Cookbook sre.hosts.reboot-single for host lvs7001.magru.wmnet
[13:11:27] <logmsgbot>	 !log fabfur@cumin1002 START - Cookbook sre.cdn.roll-reboot rolling reboot on A:cp-text_magru
[13:11:52] <logmsgbot>	 !log fabfur@cumin1002 START - Cookbook sre.cdn.roll-reboot rolling reboot on A:cp-upload_magru
[13:12:48] <wikibugs>	 (03PS1) 10Santiago Faci: MPIC chart: Added two new secrets [deployment-charts] - 10https://gerrit.wikimedia.org/r/1038775 (https://phabricator.wikimedia.org/T365182)
[13:12:51] <logmsgbot>	 !log klausman@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM ml-staging-ctrl2001.codfw.wmnet
[13:13:31] <wikibugs>	 (03PS2) 10Ayounsi: Netbox-extra: Add bandit and prospector to CI [software/netbox-extras] (dev) - 10https://gerrit.wikimedia.org/r/1038730
[13:13:34] <wikibugs>	 (03PS1) 10Marostegui: db1156: Remove package declaration [puppet] - 10https://gerrit.wikimedia.org/r/1038776
[13:14:29] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Netbox-extra: Add bandit and prospector to CI [software/netbox-extras] (dev) - 10https://gerrit.wikimedia.org/r/1038730 (owner: 10Ayounsi)
[13:14:32] <logmsgbot>	 !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs7001.magru.wmnet
[13:14:41] <icinga-wm_>	 PROBLEM - Host lvs7001 is DOWN: PING CRITICAL - Packet loss = 100%
[13:14:53] <wikibugs>	 (03PS1) 10Effie Mouzeli: mcrouter ds: use in mw-debug in codfw and not eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1038777
[13:15:15] <icinga-wm_>	 RECOVERY - Host lvs7001 is UP: PING OK - Packet loss = 0%, RTA = 115.70 ms
[13:15:42] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] db1156: Remove package declaration [puppet] - 10https://gerrit.wikimedia.org/r/1038776 (owner: 10Marostegui)
[13:16:13] <icinga-wm_>	 PROBLEM - pybal on lvs7001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal
[13:16:15] <icinga-wm_>	 PROBLEM - PyBal backends health check on lvs7001 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal
[13:16:26] <sukhe>	 ^ expected, resolving soon
[13:17:09] <icinga-wm_>	 RECOVERY - BGP status on asw1-b3-magru.mgmt is OK: BGP OK - up: 13, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[13:17:09] <MatmaRex>	 i'm still holding out for a deployer, if anyone would like to volunteer
[13:17:10] <logmsgbot>	 !log klausman@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-staging-ctrl2001.codfw.wmnet
[13:17:13] <icinga-wm_>	 RECOVERY - pybal on lvs7001 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal
[13:17:15] <icinga-wm_>	 RECOVERY - PyBal backends health check on lvs7001 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[13:17:17] <wikibugs>	 (03PS2) 10Santiago Faci: MPIC chart: Added two new secrets [deployment-charts] - 10https://gerrit.wikimedia.org/r/1038775 (https://phabricator.wikimedia.org/T365182)
[13:17:19] <logmsgbot>	 !log klausman@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM ml-staging-ctrl2002.codfw.wmnet
[13:17:30] <wikibugs>	 (03CR) 10Bking: "Understood. I share the same concerns, and we talked about changing from a team-based notification model to a service-based notification m" [alerts] - 10https://gerrit.wikimedia.org/r/1038454 (https://phabricator.wikimedia.org/T361114) (owner: 10Bking)
[13:17:47] <wikibugs>	 (03CR) 10Bking: [C:03+2] data-platform: add alert for WDQS MaxLag [alerts] - 10https://gerrit.wikimedia.org/r/1038454 (https://phabricator.wikimedia.org/T361114) (owner: 10Bking)
[13:17:48] <wikibugs>	 (03PS1) 10Slyngshede: Attempt to fix bug where SSH keys are imported incorrectly. [software/bitu] - 10https://gerrit.wikimedia.org/r/1038778 (https://phabricator.wikimedia.org/T366525)
[13:18:10] <wikibugs>	 (03CR) 10Effie Mouzeli: [C:03+2] mcrouter ds: use in mw-debug in codfw and not eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1038777 (owner: 10Effie Mouzeli)
[13:18:22] <logmsgbot>	 !log btullis@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s_services/services/datahub: sync on production
[13:18:59] <wikibugs>	 (03Merged) 10jenkins-bot: data-platform: add alert for WDQS MaxLag [alerts] - 10https://gerrit.wikimedia.org/r/1038454 (https://phabricator.wikimedia.org/T361114) (owner: 10Bking)
[13:19:07] <logmsgbot>	 !log jiji@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply
[13:19:19] <effie>	 jouncebot: now
[13:19:19] <jouncebot>	 For the next 0 hour(s) and 40 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240604T1300)
[13:19:33] <icinga-wm_>	 PROBLEM - Host kubernetes2033 is DOWN: PING CRITICAL - Packet loss = 100%
[13:19:33] <icinga-wm_>	 PROBLEM - Host kubernetes2030 is DOWN: PING CRITICAL - Packet loss = 100%
[13:19:33] <icinga-wm_>	 PROBLEM - Host kubernetes2035 is DOWN: PING CRITICAL - Packet loss = 100%
[13:19:41] <icinga-wm_>	 RECOVERY - PyBal connections to etcd on lvs7001 is OK: OK: 12 connections established with conf1009.eqiad.wmnet:4001 (min=12) https://wikitech.wikimedia.org/wiki/PyBal
[13:19:41] <wikibugs>	 (03PS10) 10Ayounsi: Netbox-extra: Add bandit and prospector to CI [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/905570
[13:19:57] <wikibugs>	 (03Abandoned) 10Ayounsi: Netbox-extra: Add bandit and prospector to CI [software/netbox-extras] (dev) - 10https://gerrit.wikimedia.org/r/1038730 (owner: 10Ayounsi)
[13:20:33] <jinxer-wm>	 FIRING: [3x] KubernetesCalicoDown: kubernetes2030.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[13:20:48] <logmsgbot>	 !log jiji@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply
[13:20:57] <logmsgbot>	 !log jiji@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply
[13:21:08] <logmsgbot>	 !log btullis@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s_services/services/datahub: sync on production
[13:21:36] <wikibugs>	 (03PS2) 10Slyngshede: Fix bug where SSH keys are imported incorrectly. [software/bitu] - 10https://gerrit.wikimedia.org/r/1038778 (https://phabricator.wikimedia.org/T366525)
[13:22:22] <logmsgbot>	 !log jiji@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply
[13:23:10] <logmsgbot>	 !log klausman@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-staging-ctrl2002.codfw.wmnet
[13:23:26] <wikibugs>	 (03PS1) 10Cwhite: logstash: drop messages from datahub-mce-consumer [puppet] - 10https://gerrit.wikimedia.org/r/1038786 (https://phabricator.wikimedia.org/T366596)
[13:23:56] <wikibugs>	 (03PS11) 10Ayounsi: Netbox-extra: Add bandit and prospector to CI [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/905570
[13:24:52] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Netbox-extra: Add bandit and prospector to CI [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/905570 (owner: 10Ayounsi)
[13:24:58] <logmsgbot>	 !log klausman@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM ml-serve-ctrl2001.codfw.wmnet
[13:25:16] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2098.codfw.wmnet with reason: Maintenance
[13:25:30] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2098.codfw.wmnet with reason: Maintenance
[13:25:44] <jinxer-wm>	 FIRING: SystemdUnitFailed: generate_vrts_aliases.service on mx1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:27:09] <icinga-wm_>	 PROBLEM - BGP status on asw1-b4-magru.mgmt is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[13:27:13] <icinga-wm_>	 PROBLEM - pybal on lvs7002 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal
[13:27:13] <icinga-wm_>	 PROBLEM - PyBal backends health check on lvs7002 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal
[13:27:16] <sukhe>	 ^ expected
[13:27:22] <wikibugs>	 (03CR) 10Cwhite: [C:03+2] logstash: drop messages from datahub-mce-consumer [puppet] - 10https://gerrit.wikimedia.org/r/1038786 (https://phabricator.wikimedia.org/T366596) (owner: 10Cwhite)
[13:27:45] <icinga-wm_>	 PROBLEM - PyBal connections to etcd on lvs7002 is CRITICAL: CRITICAL: 0 connections established with conf1009.eqiad.wmnet:4001 (min=4) https://wikitech.wikimedia.org/wiki/PyBal
[13:29:13] <wikibugs>	 (03CR) 10Volans: "approach LGTM, some details inline" [software/netbox-deploy] (dev) - 10https://gerrit.wikimedia.org/r/1038694 (https://phabricator.wikimedia.org/T336275) (owner: 10Ayounsi)
[13:29:28] <logmsgbot>	 !log klausman@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-serve-ctrl2001.codfw.wmnet
[13:29:42] <logmsgbot>	 !log klausman@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM ml-serve-ctrl2002.codfw.wmnet
[13:30:29] <wikibugs>	 (03CR) 10Brouberol: [C:03+1] "praise: spot on!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1038775 (https://phabricator.wikimedia.org/T365182) (owner: 10Santiago Faci)
[13:32:42] <wikibugs>	 (03CR) 10Btullis: logstash: drop messages from datahub-mce-consumer (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1038786 (https://phabricator.wikimedia.org/T366596) (owner: 10Cwhite)
[13:32:51] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Bumping db1194 weight', diff saved to https://phabricator.wikimedia.org/P64009 and previous config saved to /var/cache/conftool/dbconfig/20240604-133250-ladsgroup.json
[13:35:23] <logmsgbot>	 !log klausman@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-serve-ctrl2002.codfw.wmnet
[13:36:15] <wikibugs>	 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Disk (sdl) failed in moss-be1002 - https://phabricator.wikimedia.org/T366153#9859578 (10VRiley-WMF) a:03VRiley-WMF
[13:36:51] <wikibugs>	 (03PS1) 10Muehlenhoff: Remove thanos-fe-combined.discovery.wmnet.crt [puppet] - 10https://gerrit.wikimedia.org/r/1038781
[13:36:54] <wikibugs>	 (03PS1) 10Elukey: WIP - sre.k8s.reboot-nodes.py: rework alias and group parameters [cookbooks] - 10https://gerrit.wikimedia.org/r/1038782
[13:37:44] <logmsgbot>	 !log klausman@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM ml-serve-ctrl1001.eqiad.wmnet
[13:38:17] <wikibugs>	 (03PS12) 10Ayounsi: Netbox-extra: Add bandit and prospector to CI [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/905570
[13:38:46] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1038781 (owner: 10Muehlenhoff)
[13:39:17] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Netbox-extra: Add bandit and prospector to CI [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/905570 (owner: 10Ayounsi)
[13:40:12] <wikibugs>	 (03PS2) 10Elukey: WIP - sre.k8s.reboot-nodes.py: rework alias and group parameters [cookbooks] - 10https://gerrit.wikimedia.org/r/1038782
[13:40:16] <wikibugs>	 (03PS1) 10Btullis: Update the logstash filters for datahub mae/mce consumer pods [puppet] - 10https://gerrit.wikimedia.org/r/1038783 (https://phabricator.wikimedia.org/T366596)
[13:40:54] <wikibugs>	 (03PS2) 10Btullis: Update the logstash filters for datahub mae/mce consumer pods [puppet] - 10https://gerrit.wikimedia.org/r/1038783 (https://phabricator.wikimedia.org/T366596)
[13:42:07] <logmsgbot>	 !log klausman@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-serve-ctrl1001.eqiad.wmnet
[13:42:11] <wikibugs>	 (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (NOOP 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2737/console" [puppet] - 10https://gerrit.wikimedia.org/r/1038783 (https://phabricator.wikimedia.org/T366596) (owner: 10Btullis)
[13:42:30] <logmsgbot>	 !log klausman@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM ml-serve-ctrl1002.eqiad.wmnet
[13:42:55] <wikibugs>	 (03CR) 10Effie Mouzeli: "PCC OK, all are false positives https://puppet-compiler.wmflabs.org/output/1038697/1077/" [puppet] - 10https://gerrit.wikimedia.org/r/1038697 (owner: 10Effie Mouzeli)
[13:43:26] <wikibugs>	 (03PS5) 10Effie Mouzeli: memcached: switch to memcache user (role and profile) [puppet] - 10https://gerrit.wikimedia.org/r/1038697
[13:45:00] <wikibugs>	 (03PS1) 10Ladsgroup: rpc: Update function call in RunSingleJob [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038785 (https://phabricator.wikimedia.org/T363839)
[13:45:14] <wikibugs>	 (03PS1) 10Cwhite: logstash: expand datahub drop filters to match all consumers [puppet] - 10https://gerrit.wikimedia.org/r/1038787 (https://phabricator.wikimedia.org/T363856)
[13:45:45] <wikibugs>	 (03CR) 10Effie Mouzeli: [C:03+2] memcached: switch to memcache user (role and profile) [puppet] - 10https://gerrit.wikimedia.org/r/1038697 (owner: 10Effie Mouzeli)
[13:46:52] <logmsgbot>	 !log klausman@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-serve-ctrl1002.eqiad.wmnet
[13:46:59] <wikibugs>	 (03PS6) 10Effie Mouzeli: memcached: switch to memcache user (role and profile) [puppet] - 10https://gerrit.wikimedia.org/r/1038697
[13:47:03] <wikibugs>	 (03CR) 10Brouberol: [C:03+1] "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1038783 (https://phabricator.wikimedia.org/T366596) (owner: 10Btullis)
[13:48:29] <wikibugs>	 (03PS3) 10Elukey: WIP - sre.k8s.reboot-nodes.py: rework alias and group parameters [cookbooks] - 10https://gerrit.wikimedia.org/r/1038782
[13:49:59] <wikibugs>	 (03CR) 10Btullis: Update the logstash filters for datahub mae/mce consumer pods (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1038783 (https://phabricator.wikimedia.org/T366596) (owner: 10Btullis)
[13:52:07] <wikibugs>	 (03CR) 10Btullis: "Ah, I will abandon the other similar change that I had started: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1038783" [puppet] - 10https://gerrit.wikimedia.org/r/1038787 (https://phabricator.wikimedia.org/T363856) (owner: 10Cwhite)
[13:52:30] <wikibugs>	 (03Abandoned) 10Btullis: Update the logstash filters for datahub mae/mce consumer pods [puppet] - 10https://gerrit.wikimedia.org/r/1038783 (https://phabricator.wikimedia.org/T366596) (owner: 10Btullis)
[13:56:57] <wikibugs>	 06SRE, 10SRE-tools: Provide an utility script to replace a failed device in raid 0 array - https://phabricator.wikimedia.org/T350492#9859741 (10Volans) @Ladsgroup no, not really. It should be the one of the owners of the systems with raid0 that are interested in automating this step. So I guess `o11y` in this...
[13:58:30] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+1] logstash: expand datahub drop filters to match all consumers [puppet] - 10https://gerrit.wikimedia.org/r/1038787 (https://phabricator.wikimedia.org/T363856) (owner: 10Cwhite)
[13:59:17] <logmsgbot>	 !log kamila@cumin1002 conftool action : set/pooled=inactive; selector: name=wikikube-ctrl1001.eqiad.wmnet
[13:59:19] <logmsgbot>	 !log sukhe@cumin1002 START - Cookbook sre.hosts.reboot-single for host lvs7002.magru.wmnet
[13:59:22] <wikibugs>	 (03CR) 10Cwhite: [C:03+2] logstash: expand datahub drop filters to match all consumers [puppet] - 10https://gerrit.wikimedia.org/r/1038787 (https://phabricator.wikimedia.org/T363856) (owner: 10Cwhite)
[13:59:59] <wikibugs>	 (03PS3) 10Hashar: plugins: Add wm-schedule-deployment plugin [software/gerrit] (deploy/wmf/stable-3.8) - 10https://gerrit.wikimedia.org/r/1038465 (https://phabricator.wikimedia.org/T366512) (owner: 10BryanDavis)
[13:59:59] <wikibugs>	 (03PS1) 10Hashar: Use a wildcard TypeScript include for plugins [software/gerrit] (deploy/wmf/stable-3.8) - 10https://gerrit.wikimedia.org/r/1038810
[14:00:33] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.decommission for hosts wikikube-ctrl1001.eqiad.wmnet
[14:00:34] <wikibugs>	 (03CR) 10CI reject: [V:04-1] plugins: Add wm-schedule-deployment plugin [software/gerrit] (deploy/wmf/stable-3.8) - 10https://gerrit.wikimedia.org/r/1038465 (https://phabricator.wikimedia.org/T366512) (owner: 10BryanDavis)
[14:00:52] <wikibugs>	 (03PS4) 10Elukey: WIP - sre.k8s.reboot-nodes.py: rework alias and group parameters [cookbooks] - 10https://gerrit.wikimedia.org/r/1038782
[14:02:00] <wikibugs>	 06SRE, 06serviceops, 10API Platform (RESTBase Deprecation Roadmap): Migrate node-based services in production to node14 - https://phabricator.wikimedia.org/T306995#9859792 (10Jdforrester-WMF)
[14:02:17] <wikibugs>	 06SRE, 10ChangeProp, 10EventStreams, 10Image-Suggestion-API, and 4 others: Migrate node-based services in production to node12 - https://phabricator.wikimedia.org/T290750#9859793 (10Jdforrester-WMF)
[14:02:48] <logmsgbot>	 !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs7002.magru.wmnet
[14:03:19] <icinga-wm_>	 PROBLEM - pybal on lvs7002 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal
[14:03:21] <icinga-wm_>	 PROBLEM - PyBal backends health check on lvs7002 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal
[14:03:27] <icinga-wm_>	 RECOVERY - Host kubernetes2030 is UP: PING OK - Packet loss = 0%, RTA = 30.36 ms
[14:04:19] <wikibugs>	 (03CR) 10CI reject: [V:04-1] WIP - sre.k8s.reboot-nodes.py: rework alias and group parameters [cookbooks] - 10https://gerrit.wikimedia.org/r/1038782 (owner: 10Elukey)
[14:04:51] <icinga-wm_>	 PROBLEM - SSH on kubernetes2030 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[14:05:11] <icinga-wm_>	 RECOVERY - BGP status on asw1-b4-magru.mgmt is OK: BGP OK - up: 12, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:05:19] <icinga-wm_>	 RECOVERY - pybal on lvs7002 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal
[14:05:21] <icinga-wm_>	 RECOVERY - PyBal backends health check on lvs7002 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[14:06:04] <wikibugs>	 (03CR) 10Esanders: [C:03+2] Update user-agent string in citoid to be like Zot [deployment-charts] - 10https://gerrit.wikimedia.org/r/1034860 (https://phabricator.wikimedia.org/T366093) (owner: 10Mvolz)
[14:06:53] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.dns.netbox
[14:06:58] <wikibugs>	 (03Merged) 10jenkins-bot: Update user-agent string in citoid to be like Zot [deployment-charts] - 10https://gerrit.wikimedia.org/r/1034860 (https://phabricator.wikimedia.org/T366093) (owner: 10Mvolz)
[14:07:43] <icinga-wm_>	 RECOVERY - PyBal connections to etcd on lvs7002 is OK: OK: 4 connections established with conf1009.eqiad.wmnet:4001 (min=4) https://wikitech.wikimedia.org/wiki/PyBal
[14:07:46] <wikibugs>	 (03PS1) 10Muehlenhoff: Switch maps/eqiad to PKI as well [puppet] - 10https://gerrit.wikimedia.org/r/1038815 (https://phabricator.wikimedia.org/T360778)
[14:08:11] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1038815 (https://phabricator.wikimedia.org/T360778) (owner: 10Muehlenhoff)
[14:09:51] <icinga-wm_>	 PROBLEM - Host kubernetes2030 is DOWN: PING CRITICAL - Packet loss = 100%
[14:10:32] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-ctrl1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - kamila@cumin1002"
[14:14:12] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-ctrl1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - kamila@cumin1002"
[14:14:12] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:14:14] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts wikikube-ctrl1001.eqiad.wmnet
[14:15:33] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+1] Remove thanos-fe-combined.discovery.wmnet.crt [puppet] - 10https://gerrit.wikimedia.org/r/1038781 (owner: 10Muehlenhoff)
[14:16:25] <wikibugs>	 10ops-eqiad, 06SRE-OnFire, 06DC-Ops, 06serviceops, 10Sustainability (Incident Followup): eqiad:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366204#9859840 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by kamila@cumin1002 for hosts: `wikikube-ctrl1001.eqiad....
[14:16:54] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Remove thanos-fe-combined.discovery.wmnet.crt [puppet] - 10https://gerrit.wikimedia.org/r/1038781 (owner: 10Muehlenhoff)
[14:22:25] <logmsgbot>	 !log cgoubert@cumin1002 END (FAIL) - Cookbook sre.k8s.reboot-nodes (exit_code=1) rolling reboot on A:wikikube-worker-codfw
[14:22:31] <claime>	 :(
[14:22:52] <wikibugs>	 (03PS1) 10Muehlenhoff: Remove obsolete thanos-query.discovery.wmnet.crt [puppet] - 10https://gerrit.wikimedia.org/r/1038818 (https://phabricator.wikimedia.org/T360414)
[14:23:44] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: generate_vrts_aliases.service on mx1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:24:12] <wikibugs>	 (03CR) 10Volans: [C:03+1] "LGTM, but please test it for Dell hosts before merging it to be sure we're not breaking the current workflow. Feel free to use the sretest" [cookbooks] - 10https://gerrit.wikimedia.org/r/1037573 (https://phabricator.wikimedia.org/T365372) (owner: 10Elukey)
[14:24:52] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1038818 (https://phabricator.wikimedia.org/T360414) (owner: 10Muehlenhoff)
[14:27:04] <logmsgbot>	 !log sukhe@cumin1002 START - Cookbook sre.hosts.reboot-single for host lvs7003.magru.wmnet
[14:28:00] <wikibugs>	 (03CR) 10Effie Mouzeli: [C:03+2] memcached: switch to memcache user (role and profile) [puppet] - 10https://gerrit.wikimedia.org/r/1038697 (owner: 10Effie Mouzeli)
[14:28:55] <wikibugs>	 (03PS5) 10Elukey: WIP - sre.k8s.reboot-nodes.py: rework alias and group parameters [cookbooks] - 10https://gerrit.wikimedia.org/r/1038782
[14:30:09] <icinga-wm_>	 PROBLEM - BGP status on asw1-b3-magru.mgmt is CRITICAL: BGP CRITICAL - AS64600/IPv4: Connect - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:30:19] <sukhe>	 ^ expected, lvs7003 
[14:31:25] <wikibugs>	 (03PS6) 10Elukey: WIP - sre.k8s.reboot-nodes.py: rework alias and group parameters [cookbooks] - 10https://gerrit.wikimedia.org/r/1038782
[14:33:09] <icinga-wm_>	 RECOVERY - BGP status on asw1-b3-magru.mgmt is OK: BGP OK - up: 13, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:33:43] <logmsgbot>	 !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs7003.magru.wmnet
[14:34:29] <wikibugs>	 (03PS1) 10Marostegui: Revert "db1184: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1038827
[14:34:54] <wikibugs>	 (03PS7) 10Elukey: WIP - sre.k8s.reboot-nodes.py: rework alias and group parameters [cookbooks] - 10https://gerrit.wikimedia.org/r/1038782
[14:36:55] <wikibugs>	 10SRE-tools, 10observability: Provide an utility script to replace a failed device in raid 0 array - https://phabricator.wikimedia.org/T350492#9859944 (10Ladsgroup) Done. Thanks.
[14:36:56] <wikibugs>	 10SRE-tools, 10observability: Provide an utility script to replace a failed device in raid 0 array - https://phabricator.wikimedia.org/T350492#9859947 (10Ladsgroup)
[14:37:48] <wikibugs>	 10ops-codfw, 06DC-Ops, 06serviceops: hw troubleshooting: reboot failure for kubernetes2030.codfw.wmnet kubernetes2033.codfw.wmnet kubernetes2035.codfw.wmnet - https://phabricator.wikimedia.org/T366609 (10Clement_Goubert) 03NEW p:05Triage→03High
[14:38:17] <wikibugs>	 (03PS8) 10Elukey: sre.k8s.reboot-nodes.py: rework alias and group parameters [cookbooks] - 10https://gerrit.wikimedia.org/r/1038782
[14:38:31] <logmsgbot>	 !log sukhe@cumin1002 START - Cookbook sre.cdn.roll-reboot rolling reboot on P{cp4045*} and A:cp
[14:38:42] <wikibugs>	 (03PS2) 10Dr0ptp4kt: Bump XML dump schema to version 0.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038392 (https://phabricator.wikimedia.org/T365155)
[14:38:43] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:39:26] <wikibugs>	 (03PS2) 10EoghanGaffney: lists: Update the quickdatacopy to use /var/lib/mailman3 [puppet] - 10https://gerrit.wikimedia.org/r/1036686 (https://phabricator.wikimedia.org/T331706)
[14:43:00] <logmsgbot>	 !log swfrench@deploy1002 helmfile [staging] START helmfile.d/services/changeprop: apply
[14:43:36] <logmsgbot>	 !log swfrench@deploy1002 helmfile [staging] DONE helmfile.d/services/changeprop: apply
[14:46:24] <logmsgbot>	 !log cgoubert@cumin1002 conftool action : set/pooled=yes; selector: name=kubernetes203(1|4).codfw.wmnet,cluster=kubernetes,service=kubesvc
[14:48:28] <logmsgbot>	 !log swfrench@deploy1002 helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply
[14:48:38] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on kubernetes[2030,2033,2035].codfw.wmnet with reason: Hardware issue
[14:48:49] <logmsgbot>	 !log sukhe@cumin1002 END (PASS) - Cookbook sre.cdn.roll-reboot (exit_code=0) rolling reboot on P{cp4045*} and A:cp
[14:48:54] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on kubernetes[2030,2033,2035].codfw.wmnet with reason: Hardware issue
[14:49:04] <wikibugs>	 10ops-codfw, 06DC-Ops, 06serviceops: hw troubleshooting: reboot failure for kubernetes2030.codfw.wmnet kubernetes2033.codfw.wmnet kubernetes2035.codfw.wmnet - https://phabricator.wikimedia.org/T366609#9859997 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=da38d2ec-3c5a-4c49-a0b8-5355aa47...
[14:49:12] <logmsgbot>	 !log swfrench@deploy1002 helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply
[14:50:29] <wikibugs>	 (03CR) 10Gergő Tisza: [C:03+2] Show experimental login popup links on the beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038389 (https://phabricator.wikimedia.org/T366486) (owner: 10Bartosz Dziewoński)
[14:52:03] <wikibugs>	 (03PS2) 10Bartosz Dziewoński: Show experimental login popup links on the beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038389 (https://phabricator.wikimedia.org/T366486)
[14:52:04] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2203 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P64012 and previous config saved to /var/cache/conftool/dbconfig/20240604-145203-root.json
[14:52:33] <logmsgbot>	 !log swfrench@deploy1002 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply
[14:53:23] <wikibugs>	 (03PS22) 10DCausse: wdqs.data-reload: support HDFS as a source [cookbooks] - 10https://gerrit.wikimedia.org/r/1031933 (https://phabricator.wikimedia.org/T349069)
[14:53:26] <logmsgbot>	 !log swfrench@deploy1002 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply
[14:53:29] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] Revert "db1184: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1038827 (owner: 10Marostegui)
[14:55:09] <logmsgbot>	 !log sukhe@cumin1002 START - Cookbook sre.cdn.roll-reboot rolling reboot on P{cp3066*} and A:cp
[14:55:11] <icinga-wm_>	 RECOVERY - Host kubernetes2035 is UP: PING OK - Packet loss = 0%, RTA = 30.26 ms
[14:55:27] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-ctrl1001
[14:56:42] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1184.eqiad.wmnet with reason: Maintenance
[14:56:44] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1184.eqiad.wmnet with reason: Maintenance
[14:57:07] <wikibugs>	 (03CR) 10Klausman: [C:03+1] ml-services: set command for hf image and remove nllb [deployment-charts] - 10https://gerrit.wikimedia.org/r/1036297 (https://phabricator.wikimedia.org/T365842) (owner: 10Ilias Sarantopoulos)
[14:57:21] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-ctrl1001
[14:57:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:57:53] <wikibugs>	 (03CR) 10Gergő Tisza: [C:03+2] Show experimental login popup links on the beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038389 (https://phabricator.wikimedia.org/T366486) (owner: 10Bartosz Dziewoński)
[14:58:35] <wikibugs>	 (03Merged) 10jenkins-bot: Show experimental login popup links on the beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038389 (https://phabricator.wikimedia.org/T366486) (owner: 10Bartosz Dziewoński)
[14:58:43] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:58:53] <wikibugs>	 (03CR) 10JHathaway: [V:03+1] "mutante, this patch no longer generates a puppet diff in prod. In cloud it will produce an empty array, which should match the current s" [puppet] - 10https://gerrit.wikimedia.org/r/1037621 (https://phabricator.wikimedia.org/T365395) (owner: 10JHathaway)
[15:00:04] <jouncebot>	 eoghan, jelto, arnoldokoth, and mutante: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for SRE Collaboration Services office hours. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240604T1500).
[15:00:57] <wikibugs>	 (03CR) 10Hashar: "I have rebased your change on top of https://gerrit.wikimedia.org/r/c/operations/software/gerrit/+/1038810/ to ensure TypeScript runs." [software/gerrit] (deploy/wmf/stable-3.8) - 10https://gerrit.wikimedia.org/r/1038465 (https://phabricator.wikimedia.org/T366512) (owner: 10BryanDavis)
[15:02:08] <wikibugs>	 (03CR) 10Ilias Sarantopoulos: [C:03+2] ml-services: set command for hf image and remove nllb [deployment-charts] - 10https://gerrit.wikimedia.org/r/1036297 (https://phabricator.wikimedia.org/T365842) (owner: 10Ilias Sarantopoulos)
[15:02:09] <logmsgbot>	 !log aokoth@cumin1002 START - Cookbook sre.hosts.downtime for 0:30:00 on phab2002.codfw.wmnet with reason: Phorge Update
[15:02:23] <logmsgbot>	 !log aokoth@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on phab2002.codfw.wmnet with reason: Phorge Update
[15:02:53] <logmsgbot>	 !log brennen@deploy1002 Started deploy [phabricator/deployment@ef680d8]: deploy phab2002 for T366605
[15:02:56] <stashbot>	 T366605: Deploy Phabricator/Phorge 2024-06-04 - https://phabricator.wikimedia.org/T366605
[15:03:09] <wikibugs>	 (03Merged) 10jenkins-bot: ml-services: set command for hf image and remove nllb [deployment-charts] - 10https://gerrit.wikimedia.org/r/1036297 (https://phabricator.wikimedia.org/T365842) (owner: 10Ilias Sarantopoulos)
[15:03:26] <logmsgbot>	 !log brennen@deploy1002 Finished deploy [phabricator/deployment@ef680d8]: deploy phab2002 for T366605 (duration: 00m 33s)
[15:03:40] <logmsgbot>	 !log aokoth@cumin1002 START - Cookbook sre.hosts.downtime for 0:30:00 on phab1004.eqiad.wmnet with reason: Phorge Update
[15:03:54] <logmsgbot>	 !log aokoth@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on phab1004.eqiad.wmnet with reason: Phorge Update
[15:03:59] <logmsgbot>	 !log brennen@deploy1002 Started deploy [phabricator/deployment@ef680d8]: deploy phab1004 for T366605
[15:04:14] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host sretest1001.mgmt.eqiad.wmnet with reboot policy GRACEFUL
[15:04:32] <logmsgbot>	 !log brennen@deploy1002 Finished deploy [phabricator/deployment@ef680d8]: deploy phab1004 for T366605 (duration: 00m 32s)
[15:04:48] <wikibugs>	 (03CR) 10Milimetric: [C:03+1] Bump XML dump schema to version 0.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038392 (https://phabricator.wikimedia.org/T365155) (owner: 10Dr0ptp4kt)
[15:05:00] <logmsgbot>	 !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sretest1001.mgmt.eqiad.wmnet with reboot policy GRACEFUL
[15:05:17] <icinga-wm_>	 RECOVERY - Host kubernetes2033 is UP: PING OK - Packet loss = 0%, RTA = 30.27 ms
[15:05:41] <wikibugs>	 (03PS2) 10AikoChou: ml-services: update RevertRisk LA/ML/Wikidata's images [deployment-charts] - 10https://gerrit.wikimedia.org/r/1038736 (https://phabricator.wikimedia.org/T358744)
[15:06:06] <logmsgbot>	 !log sukhe@cumin1002 END (PASS) - Cookbook sre.cdn.roll-reboot (exit_code=0) rolling reboot on P{cp3066*} and A:cp
[15:06:18] <wikibugs>	 (03CR) 10AikoChou: [C:03+2] ml-services: update RevertRisk LA/ML/Wikidata's images [deployment-charts] - 10https://gerrit.wikimedia.org/r/1038736 (https://phabricator.wikimedia.org/T358744) (owner: 10AikoChou)
[15:06:52] <wikibugs>	 (03CR) 10Elukey: "Tested the following:" [cookbooks] - 10https://gerrit.wikimedia.org/r/1037573 (https://phabricator.wikimedia.org/T365372) (owner: 10Elukey)
[15:07:10] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2203 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P64013 and previous config saved to /var/cache/conftool/dbconfig/20240604-150710-root.json
[15:07:18] <wikibugs>	 (03Merged) 10jenkins-bot: ml-services: update RevertRisk LA/ML/Wikidata's images [deployment-charts] - 10https://gerrit.wikimedia.org/r/1038736 (https://phabricator.wikimedia.org/T358744) (owner: 10AikoChou)
[15:08:14] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1172.eqiad.wmnet with reason: Maintenance
[15:08:19] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host sretest1004.mgmt.eqiad.wmnet with reboot policy FORCED
[15:08:21] <logmsgbot>	 !log elukey@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest1004.mgmt.eqiad.wmnet with reboot policy FORCED
[15:08:27] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1172.eqiad.wmnet with reason: Maintenance
[15:08:35] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1172 (T352010)', diff saved to https://phabricator.wikimedia.org/P64014 and previous config saved to /var/cache/conftool/dbconfig/20240604-150835-ladsgroup.json
[15:08:38] <stashbot>	 T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[15:09:17] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.dns.netbox
[15:10:16] <wikibugs>	 (03PS2) 10Majavah: openldap: cross-validate-accounts: Note shell users disabled in LDAP [puppet] - 10https://gerrit.wikimedia.org/r/999103
[15:11:12] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host sretest1001.mgmt.eqiad.wmnet with reboot policy FORCED
[15:11:53] <logmsgbot>	 !log dancy@deploy1002 Installing scap version "4.85.0" for 294 hosts
[15:11:53] <logmsgbot>	 !log fabfur@cumin1002 END (PASS) - Cookbook sre.cdn.roll-reboot (exit_code=0) rolling reboot on A:cp-upload_magru
[15:11:55] <logmsgbot>	 !log elukey@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest1001.mgmt.eqiad.wmnet with reboot policy FORCED
[15:11:58] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Moved wikikube-ctrl1001 to a new rack - kamila@cumin1002"
[15:12:33] <logmsgbot>	 !log dancy@deploy1002 Installation of scap version "4.85.0" completed for 294 hosts
[15:12:34] <wikibugs>	 (03PS1) 10FNegri: wikireplicas: Add conftool::scripts [puppet] - 10https://gerrit.wikimedia.org/r/1038847
[15:12:41] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host sretest1001.mgmt.eqiad.wmnet with reboot policy GRACEFUL
[15:12:43] <wikibugs>	 (03CR) 10FNegri: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1038847 (owner: 10FNegri)
[15:13:09] <wikibugs>	 06SRE, 10Cloud-Services, 06serviceops: Modernise memcached systemd unit / sync, and make it presentable - https://phabricator.wikimedia.org/T273950#9860167 (10jijiki)
[15:13:14] <wikibugs>	 06SRE, 10Cloud-Services, 06serviceops: Modernise memcached systemd unit / sync, and make it presentable - https://phabricator.wikimedia.org/T273950#9860169 (10jijiki) 05Open→03In progress
[15:13:26] <logmsgbot>	 !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sretest1001.mgmt.eqiad.wmnet with reboot policy GRACEFUL
[15:15:02] <wikibugs>	 (03PS1) 10Aklapper: Correct name of Herald option [phabricator/antivandalism] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1038849
[15:15:09] <icinga-wm_>	 RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 443, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:15:09] <icinga-wm_>	 RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 521, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:15:10] <wikibugs>	 (03PS3) 10EoghanGaffney: lists: Update the quickdatacopy to use /var/lib/mailman3 [puppet] - 10https://gerrit.wikimedia.org/r/1036686 (https://phabricator.wikimedia.org/T331706)
[15:15:23] <icinga-wm_>	 RECOVERY - SSH on kubernetes2030 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[15:15:25] <icinga-wm_>	 RECOVERY - Host kubernetes2030 is UP: PING OK - Packet loss = 0%, RTA = 30.40 ms
[15:15:29] <wikibugs>	 (03CR) 10Aklapper: [V:03+2 C:03+2] Correct name of Herald option [phabricator/antivandalism] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1038849 (owner: 10Aklapper)
[15:15:35] <logmsgbot>	 !log kamila@cumin1002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Moved wikikube-ctrl1001 to a new rack - kamila@cumin1002"
[15:15:35] <logmsgbot>	 !log kamila@cumin1002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
[15:15:44] <wikibugs>	 (03CR) 10Urbanecm: [Beta] Enable CommunityConfiguration extension in all wikis (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1035473 (https://phabricator.wikimedia.org/T364892) (owner: 10Sergio Gimeno)
[15:16:31] <wikibugs>	 (03CR) 10Paladox: [C:03+1] gerrit: remove mac algos no more supported by Mina SSHD [puppet] - 10https://gerrit.wikimedia.org/r/1038703 (https://phabricator.wikimedia.org/T366565) (owner: 10Hashar)
[15:18:11] <wikibugs>	 (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/1038818 (https://phabricator.wikimedia.org/T360414) (owner: 10Muehlenhoff)
[15:18:15] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.dns.netbox
[15:18:46] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.ganeti.reboot-vm for VM aux-k8s-ctrl1001.eqiad.wmnet
[15:18:56] <logmsgbot>	 !log elukey@cumin1002 END (ERROR) - Cookbook sre.ganeti.reboot-vm (exit_code=97) for VM aux-k8s-ctrl1001.eqiad.wmnet
[15:19:07] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.ganeti.reboot-vm for VM aux-k8s-ctrl1001.eqiad.wmnet
[15:19:39] <wikibugs>	 (03CR) 10CDanis: [C:03+1] sre.k8s.reboot-nodes.py: rework alias and group parameters [cookbooks] - 10https://gerrit.wikimedia.org/r/1038782 (owner: 10Elukey)
[15:19:42] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:19:55] <wikibugs>	 (03PS2) 10FNegri: wikireplicas: Add conftool::scripts [puppet] - 10https://gerrit.wikimedia.org/r/1038847
[15:20:03] <wikibugs>	 (03CR) 10FNegri: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1038847 (owner: 10FNegri)
[15:20:19] <wikibugs>	 (03CR) 10Scott French: [C:03+2] changeprop: add securityContext to all containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1030190 (https://phabricator.wikimedia.org/T362978) (owner: 10Scott French)
[15:21:15] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-ctrl1001
[15:21:15] <wikibugs>	 (03Merged) 10jenkins-bot: changeprop: add securityContext to all containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1030190 (https://phabricator.wikimedia.org/T362978) (owner: 10Scott French)
[15:21:17] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-ctrl1001
[15:22:13] <logmsgbot>	 !log dcaro@cumin1002 START - Cookbook sre.hosts.reboot-single for host cloudcephosd1001.eqiad.wmnet
[15:22:16] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2203 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P64015 and previous config saved to /var/cache/conftool/dbconfig/20240604-152216-root.json
[15:25:32] <wikibugs>	 10ops-codfw, 06DC-Ops, 06serviceops: hw troubleshooting: reboot failure for kubernetes2030.codfw.wmnet kubernetes2033.codfw.wmnet kubernetes2035.codfw.wmnet - https://phabricator.wikimedia.org/T366609#9860268 (10Jhancock.wm) when I put a faceplate on all three servers, I find the same error: The system Confi...
[15:25:38] <logmsgbot>	 !log elukey@cumin1002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM aux-k8s-ctrl1001.eqiad.wmnet
[15:25:45] <wikibugs>	 (03CR) 10Clément Goubert: "Done" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1038251 (https://phabricator.wikimedia.org/T362518) (owner: 10Clément Goubert)
[15:26:41] <wikibugs>	 (03CR) 10Jelto: [C:03+1] "lgtm, last sentence in commit message is outdated but should be fine for the initial test" [puppet] - 10https://gerrit.wikimedia.org/r/1036686 (https://phabricator.wikimedia.org/T331706) (owner: 10EoghanGaffney)
[15:26:45] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2136 (T364299)', diff saved to https://phabricator.wikimedia.org/P64017 and previous config saved to /var/cache/conftool/dbconfig/20240604-152644-marostegui.json
[15:26:48] <stashbot>	 T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299
[15:27:01] <logmsgbot>	 !log dcausse@deploy1002 Started deploy [airflow-dags/search@a279784]: search: bump to discolytics 0.24 and name n-triples dumps
[15:27:12] <logmsgbot>	 !log swfrench@deploy1002 helmfile [staging] START helmfile.d/services/changeprop: apply
[15:27:13] <logmsgbot>	 !log tchin@deploy1002 Started deploy [airflow-dags/analytics@a279784]: (no justification provided)
[15:27:28] <logmsgbot>	 !log swfrench@deploy1002 helmfile [staging] DONE helmfile.d/services/changeprop: apply
[15:27:28] <logmsgbot>	 !log dcausse@deploy1002 Finished deploy [airflow-dags/search@a279784]: search: bump to discolytics 0.24 and name n-triples dumps (duration: 00m 27s)
[15:27:40] <logmsgbot>	 !log tchin@deploy1002 Finished deploy [airflow-dags/analytics@a279784]: (no justification provided) (duration: 00m 27s)
[15:28:20] <logmsgbot>	 !log dcaro@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudcephosd1001.eqiad.wmnet
[15:28:38] <logmsgbot>	 !log swfrench@deploy1002 helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply
[15:29:10] <logmsgbot>	 !log swfrench@deploy1002 helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply
[15:29:14] <wikibugs>	 (03PS4) 10EoghanGaffney: lists: Update the quickdatacopy to use /var/lib/mailman3 [puppet] - 10https://gerrit.wikimedia.org/r/1036686 (https://phabricator.wikimedia.org/T331706)
[15:29:24] <logmsgbot>	 !log tchin@deploy1002 Started deploy [airflow-dags/analytics_test@a279784]: (no justification provided)
[15:29:26] <wikibugs>	 (03PS5) 10EoghanGaffney: lists: Update the quickdatacopy to use /var/lib/mailman3 [puppet] - 10https://gerrit.wikimedia.org/r/1036686 (https://phabricator.wikimedia.org/T331706)
[15:29:34] <logmsgbot>	 !log tchin@deploy1002 Finished deploy [airflow-dags/analytics_test@a279784]: (no justification provided) (duration: 00m 10s)
[15:31:10] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.ganeti.reboot-vm for VM aux-k8s-ctrl1002.eqiad.wmnet
[15:31:32] <logmsgbot>	 !log dcaro@cumin1002 START - Cookbook sre.hosts.reboot-single for host cloudcephosd1002.eqiad.wmnet
[15:31:44] <wikibugs>	 (03PS1) 10Urbanecm: [beta] arwiki: Enable CommunityConfiguration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038852 (https://phabricator.wikimedia.org/T364895)
[15:32:39] <wikibugs>	 (03PS2) 10Urbanecm: [beta] arwiki: Enable CommunityConfiguration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038852 (https://phabricator.wikimedia.org/T364892)
[15:34:01] <wikibugs>	 (03CR) 10EoghanGaffney: [C:03+2] lists: Update the quickdatacopy to use /var/lib/mailman3 [puppet] - 10https://gerrit.wikimedia.org/r/1036686 (https://phabricator.wikimedia.org/T331706) (owner: 10EoghanGaffney)
[15:34:01] <wikibugs>	 (03CR) 10Urbanecm: [C:03+2] [beta] arwiki: Enable CommunityConfiguration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038852 (https://phabricator.wikimedia.org/T364892) (owner: 10Urbanecm)
[15:34:03] <logmsgbot>	 !log swfrench@deploy1002 helmfile [codfw] START helmfile.d/services/changeprop: apply
[15:34:06] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: hw troubleshooting: reboot failure for kubernetes2030.codfw.wmnet kubernetes2033.codfw.wmnet kubernetes2035.codfw.wmnet - https://phabricator.wikimedia.org/T366609#9860331 (10Jhancock.wm) all servers are updated and are error free. if this happens again with any...
[15:34:40] <wikibugs>	 (03Merged) 10jenkins-bot: [beta] arwiki: Enable CommunityConfiguration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038852 (https://phabricator.wikimedia.org/T364892) (owner: 10Urbanecm)
[15:34:53] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: hw troubleshooting: reboot failure for kubernetes2030.codfw.wmnet kubernetes2033.codfw.wmnet kubernetes2035.codfw.wmnet - https://phabricator.wikimedia.org/T366609#9860334 (10Clement_Goubert) Thanks so much @Jhancock.wm
[15:35:56] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: sextant cache: Add new service major version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1038857
[15:35:56] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: sextant cache: Allow defining mcrouter's clusterIP [deployment-charts] - 10https://gerrit.wikimedia.org/r/1038858
[15:35:56] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: mcrouter: Bump chart modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/1038859
[15:35:57] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: mw-mcrouter: Switch helmfile.d to use the newer cache module [deployment-charts] - 10https://gerrit.wikimedia.org/r/1038860
[15:36:02] <logmsgbot>	 !log swfrench@deploy1002 helmfile [codfw] DONE helmfile.d/services/changeprop: apply
[15:36:52] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.remove-downtime for kubernetes[2030,2033,2035].codfw.wmnet
[15:36:53] <logmsgbot>	 !log dcaro@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudcephosd1002.eqiad.wmnet
[15:36:54] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for kubernetes[2030,2033,2035].codfw.wmnet
[15:37:06] <wikibugs>	 (03CR) 10CI reject: [V:04-1] mcrouter: Bump chart modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/1038859 (owner: 10Alexandros Kosiaris)
[15:37:15] <wikibugs>	 (03CR) 10CI reject: [V:04-1] mw-mcrouter: Switch helmfile.d to use the newer cache module [deployment-charts] - 10https://gerrit.wikimedia.org/r/1038860 (owner: 10Alexandros Kosiaris)
[15:37:23] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2203 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P64018 and previous config saved to /var/cache/conftool/dbconfig/20240604-153722-root.json
[15:37:39] <logmsgbot>	 !log cgoubert@cumin1002 conftool action : set/pooled=yes; selector: name=kubernetes203(0|3|5).codfw.wmnet,cluster=kubernetes,service=kubesvc
[15:37:41] <logmsgbot>	 !log elukey@cumin1002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM aux-k8s-ctrl1002.eqiad.wmnet
[15:38:05] <logmsgbot>	 !log aokoth@cumin1002 START - Cookbook sre.hosts.reboot-single for host vrts2001.codfw.wmnet
[15:39:20] <wikibugs>	 (03PS2) 10Alexandros Kosiaris: mcrouter: Bump chart modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/1038859
[15:39:20] <wikibugs>	 (03PS2) 10Alexandros Kosiaris: mw-mcrouter: Switch helmfile.d to use the newer cache module [deployment-charts] - 10https://gerrit.wikimedia.org/r/1038860
[15:40:41] <logmsgbot>	 !log fabfur@cumin1002 END (PASS) - Cookbook sre.cdn.roll-reboot (exit_code=0) rolling reboot on A:cp-text_magru
[15:41:51] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: hw troubleshooting: reboot failure for kubernetes2030.codfw.wmnet kubernetes2033.codfw.wmnet kubernetes2035.codfw.wmnet - https://phabricator.wikimedia.org/T366609#9860373 (10Clement_Goubert) 05Open→03Resolved Hosts repooled, uncordoned and set back to ac...
[15:41:53] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2136', diff saved to https://phabricator.wikimedia.org/P64019 and previous config saved to /var/cache/conftool/dbconfig/20240604-154153-marostegui.json
[15:42:01] <logmsgbot>	 !log aokoth@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host vrts2001.codfw.wmnet
[15:42:19] <logmsgbot>	 !log swfrench@deploy1002 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply
[15:42:43] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C:03+1] "LGTM." [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/1012797 (https://phabricator.wikimedia.org/T360378) (owner: 10BryanDavis)
[15:42:50] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.ganeti.reboot-vm for VM aux-k8s-etcd1001.eqiad.wmnet
[15:43:03] <logmsgbot>	 !log elukey@cumin1002 END (FAIL) - Cookbook sre.ganeti.reboot-vm (exit_code=99) for VM aux-k8s-etcd1001.eqiad.wmnet
[15:43:15] <logmsgbot>	 !log swfrench@deploy1002 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply
[15:43:23] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.ganeti.reboot-vm for VM aux-k8s-etcd1001.eqiad.wmnet
[15:44:06] <logmsgbot>	 !log aokoth@cumin1002 START - Cookbook sre.hosts.reboot-single for host miscweb2003.codfw.wmnet
[15:45:15] <icinga-wm_>	 PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 46 probes of 788 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[15:45:35] <wikibugs>	 (03CR) 10FNegri: [C:04-1] "This would break https://wikitech.wikimedia.org/wiki/Puppet/Coding_and_style_guidelines#Roles so I need to find another way" [puppet] - 10https://gerrit.wikimedia.org/r/1038847 (owner: 10FNegri)
[15:46:52] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.hosts.reboot-single for host ms-be1051.eqiad.wmnet
[15:47:03] <logmsgbot>	 !log elukey@cumin1002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM aux-k8s-etcd1001.eqiad.wmnet
[15:47:22] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2051.codfw.wmnet
[15:47:49] <logmsgbot>	 !log dcaro@cumin1002 START - Cookbook sre.hosts.reboot-single for host cloudcephosd1003.eqiad.wmnet
[15:47:52] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.ganeti.reboot-vm for VM aux-k8s-etcd1002.eqiad.wmnet
[15:48:03] <logmsgbot>	 !log aokoth@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host miscweb2003.codfw.wmnet
[15:50:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: ferm.service on kubernetes1054:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:51:32] <logmsgbot>	 !log fnegri@cumin1002 conftool action : set/pooled=no; selector: name=clouddb1013.eqiad.wmnet,service=s1
[15:51:45] <logmsgbot>	 !log elukey@cumin1002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM aux-k8s-etcd1002.eqiad.wmnet
[15:52:01] <logmsgbot>	 !log fnegri@cumin1002 conftool action : set/pooled=no; selector: name=clouddb1013.eqiad.wmnet,service=s3
[15:52:10] <logmsgbot>	 !log dcaro@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudcephosd1003.eqiad.wmnet
[15:52:29] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2203 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P64020 and previous config saved to /var/cache/conftool/dbconfig/20240604-155228-root.json
[15:52:33] <logmsgbot>	 !log dcaro@cumin1002 START - Cookbook sre.hosts.reboot-single for host cloudcephosd1004.eqiad.wmnet
[15:52:54] <wikibugs>	 (03PS1) 10Mhorsey: Activate campaignEvents extension on Igbo wiki. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038862 (https://phabricator.wikimedia.org/T363199)
[15:53:06] <logmsgbot>	 !log mvernon@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1051.eqiad.wmnet
[15:53:23] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.hosts.reboot-single for host ms-be1052.eqiad.wmnet
[15:53:34] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.ganeti.reboot-vm for VM aux-k8s-etcd1003.eqiad.wmnet
[15:54:15] <jinxer-wm>	 FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[15:55:39] <icinga-wm_>	 PROBLEM - MariaDB Replica IO: s1 on clouddb1013 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[15:55:39] <icinga-wm_>	 PROBLEM - MariaDB Replica SQL: s3 on clouddb1013 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[15:55:39] <icinga-wm_>	 PROBLEM - MariaDB Replica IO: s3 on clouddb1013 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[15:55:42] <logmsgbot>	 !log fnegri@cumin1002 START - Cookbook sre.hosts.reboot-single for host clouddb1013.eqiad.wmnet
[15:55:51] <jynus>	 :-)
[15:56:30] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Bumping db1194 weight', diff saved to https://phabricator.wikimedia.org/P64021 and previous config saved to /var/cache/conftool/dbconfig/20240604-155629-ladsgroup.json
[15:57:02] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2136', diff saved to https://phabricator.wikimedia.org/P64022 and previous config saved to /var/cache/conftool/dbconfig/20240604-155701-marostegui.json
[15:57:26] <logmsgbot>	 !log elukey@cumin1002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM aux-k8s-etcd1003.eqiad.wmnet
[15:57:39] <logmsgbot>	 !log dcaro@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudcephosd1004.eqiad.wmnet
[15:58:24] <logmsgbot>	 !log aokoth@cumin1002 START - Cookbook sre.hosts.reboot-single for host phab2002.codfw.wmnet
[15:59:14] <logmsgbot>	 !log swfrench@deploy1002 helmfile [eqiad] START helmfile.d/services/changeprop: apply
[15:59:15] <jinxer-wm>	 RESOLVED: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-parsoid - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[16:00:01] <logmsgbot>	 !log swfrench@deploy1002 helmfile [eqiad] DONE helmfile.d/services/changeprop: apply
[16:00:05] <jouncebot>	 jhathaway and rzl: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240604T1600).
[16:00:05] <jouncebot>	 pmiazga: A patch you scheduled for Puppet request window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[16:00:10] <logmsgbot>	 !log mvernon@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1052.eqiad.wmnet
[16:01:13] <jhathaway>	 o/
[16:01:59] <jhathaway>	 dmed, pmiazga
[16:02:10] <wikibugs>	 (03CR) 10JHathaway: [C:03+2] [beta] Add test2.wikimedia.beta.wmcloud.org to beta_sites [puppet] - 10https://gerrit.wikimedia.org/r/1035752 (https://phabricator.wikimedia.org/T355281) (owner: 10Pmiazga)
[16:02:25] <wikibugs>	 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Disk (sdl) failed in moss-be1002 - https://phabricator.wikimedia.org/T366153#9860503 (10VRiley-WMF) Since the server is no longer under warranty, we have swapped the HDD with a HDD from a decommissioned server.
[16:02:44] <wikibugs>	 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Disk (sdl) failed in moss-be1002 - https://phabricator.wikimedia.org/T366153#9860506 (10VRiley-WMF) 05Open→03Resolved
[16:02:59] <logmsgbot>	 !log dcaro@cumin1002 START - Cookbook sre.hosts.reboot-single for host cloudcephosd1005.eqiad.wmnet
[16:04:15] <logmsgbot>	 !log btullis@deploy1002 helmfile [codfw] START helmfile.d/services/image-suggestion: apply
[16:04:28] <logmsgbot>	 !log aokoth@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host phab2002.codfw.wmnet
[16:04:41] <logmsgbot>	 !log btullis@deploy1002 helmfile [codfw] DONE helmfile.d/services/image-suggestion: apply
[16:05:14] <logmsgbot>	 !log btullis@deploy1002 helmfile [eqiad] START helmfile.d/services/image-suggestion: apply
[16:05:25] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: ferm.service on kubernetes1054:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:05:39] <logmsgbot>	 !log btullis@deploy1002 helmfile [eqiad] DONE helmfile.d/services/image-suggestion: apply
[16:06:39] <icinga-wm_>	 RECOVERY - MariaDB Replica IO: s1 on clouddb1013 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[16:07:35] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2203 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P64023 and previous config saved to /var/cache/conftool/dbconfig/20240604-160735-root.json
[16:07:39] <icinga-wm_>	 RECOVERY - MariaDB Replica SQL: s3 on clouddb1013 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[16:07:43] <icinga-wm_>	 RECOVERY - MariaDB Replica IO: s3 on clouddb1013 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[16:08:47] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2051.codfw.wmnet
[16:09:36] <logmsgbot>	 !log dcaro@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudcephosd1005.eqiad.wmnet
[16:09:48] <logmsgbot>	 !log fnegri@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host clouddb1013.eqiad.wmnet
[16:10:24] <logmsgbot>	 !log fnegri@cumin1002 conftool action : set/pooled=yes; selector: name=clouddb1013.eqiad.wmnet,service=s3
[16:10:25] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: ferm.service on kubernetes1054:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:10:36] <logmsgbot>	 !log fnegri@cumin1002 conftool action : set/pooled=yes; selector: name=clouddb1013.eqiad.wmnet,service=s1
[16:10:42] <logmsgbot>	 !log fabfur@cumin1002 START - Cookbook sre.hosts.reboot-single for host cp7001.magru.wmnet
[16:11:15] <logmsgbot>	 !log fabfur@cumin1002 START - Cookbook sre.hosts.reboot-single for host cp7002.magru.wmnet
[16:12:10] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2136 (T364299)', diff saved to https://phabricator.wikimedia.org/P64024 and previous config saved to /var/cache/conftool/dbconfig/20240604-161210-marostegui.json
[16:12:12] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2137.codfw.wmnet with reason: Maintenance
[16:12:13] <stashbot>	 T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299
[16:12:25] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2137.codfw.wmnet with reason: Maintenance
[16:12:33] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2137 (T364299)', diff saved to https://phabricator.wikimedia.org/P64025 and previous config saved to /var/cache/conftool/dbconfig/20240604-161233-marostegui.json
[16:15:25] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: ferm.service on kubernetes1054:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:15:54] <logmsgbot>	 !log fabfur@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp7001.magru.wmnet
[16:18:14] <wikibugs>	 (03PS4) 10EoghanGaffney: lists: Add option to block incoming mail [puppet] - 10https://gerrit.wikimedia.org/r/1038772
[16:20:25] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: ferm.service on kubernetes1054:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:22:03] <logmsgbot>	 !log fabfur@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp7002.magru.wmnet
[16:22:41] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2203 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P64028 and previous config saved to /var/cache/conftool/dbconfig/20240604-162241-root.json
[16:26:51] <jinxer-wm>	 FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://citoid.svc.eqiad.wmnet:4003 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[16:29:42] <logmsgbot>	 !log dcaro@cumin1002 START - Cookbook sre.hosts.reboot-single for host cloudcephosd1010.eqiad.wmnet
[16:31:51] <jinxer-wm>	 RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://citoid.svc.eqiad.wmnet:4003 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[16:31:53] <elukey>	 !log delete 3 pods in eventgate-main on wikikube-eqiad to test if envoy on them is in a weird state
[16:31:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:32:45] <wikibugs>	 (03CR) 10Santiago Faci: [C:03+2] MPIC chart: Added two new secrets [deployment-charts] - 10https://gerrit.wikimedia.org/r/1038775 (https://phabricator.wikimedia.org/T365182) (owner: 10Santiago Faci)
[16:34:16] <wikibugs>	 (03Merged) 10jenkins-bot: MPIC chart: Added two new secrets [deployment-charts] - 10https://gerrit.wikimedia.org/r/1038775 (https://phabricator.wikimedia.org/T365182) (owner: 10Santiago Faci)
[16:34:39] <wikibugs>	 (03CR) 10Volans: [C:03+1] "Sure, why not, suggestion inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/1038759 (owner: 10Klausman)
[16:35:25] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: ferm.service on mw1360:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:35:38] <wikibugs>	 10ops-codfw, 06cloud-services-team, 06DC-Ops, 10decommission-hardware: decommission cloudcontrol2001-dev.codfw.wmnet - https://phabricator.wikimedia.org/T364577#9860723 (10Andrew)
[16:35:51] <wikibugs>	 06SRE, 10LDAP-Access-Requests: Grant Access to nda/logstash for Sohom Datta - https://phabricator.wikimedia.org/T366032#9860726 (10KFrancis) The NDA is complete.  Thanks!
[16:36:26] <logmsgbot>	 !log dcaro@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudcephosd1010.eqiad.wmnet
[16:38:33] <wikibugs>	 (03PS13) 10Ayounsi: Netbox-extra: Add bandit and prospector to CI [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/905570
[16:38:34] <wikibugs>	 (03PS1) 10Ayounsi: Fix lots of CI errors [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1038869
[16:39:00] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Fix lots of CI errors [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1038869 (owner: 10Ayounsi)
[16:39:48] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Netbox-extra: Add bandit and prospector to CI [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/905570 (owner: 10Ayounsi)
[16:40:25] <jinxer-wm>	 RESOLVED: [3x] SystemdUnitFailed: ferm.service on mw1360:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:41:54] <elukey>	 !log delete other 2 pods in eventgate-main on wikikube-eqiad to test if envoy on them is in a weird state
[16:41:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:44:01] <wikibugs>	 (03PS2) 10Ayounsi: Fix lots of CI errors [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1038869
[16:44:01] <wikibugs>	 (03PS14) 10Ayounsi: Netbox-extra: Add bandit and prospector to CI [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/905570
[16:44:29] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Fix lots of CI errors [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1038869 (owner: 10Ayounsi)
[16:44:59] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Netbox-extra: Add bandit and prospector to CI [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/905570 (owner: 10Ayounsi)
[16:47:36] <wikibugs>	 (03PS5) 10Clément Goubert: sre.k8s.reboot-nodes: Add exclude option [cookbooks] - 10https://gerrit.wikimedia.org/r/1038865
[16:49:49] <wikibugs>	 (03PS3) 10Ayounsi: Fix lots of CI errors [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1038869
[16:49:50] <wikibugs>	 (03PS15) 10Ayounsi: Netbox-extra: Add bandit and prospector to CI [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/905570
[16:50:15] <wikibugs>	 (03CR) 10Elukey: "Let's coordinate if possible, I have filed https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1038782 that shouldn't clash with yours" [cookbooks] - 10https://gerrit.wikimedia.org/r/1038865 (owner: 10Clément Goubert)
[16:50:53] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Netbox-extra: Add bandit and prospector to CI [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/905570 (owner: 10Ayounsi)
[16:50:55] <wikibugs>	 (03CR) 10Clément Goubert: "Yeah, I was in the process of doing that 😄" [cookbooks] - 10https://gerrit.wikimedia.org/r/1038865 (owner: 10Clément Goubert)
[16:51:57] <logmsgbot>	 !log swfrench@deploy1002 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply
[16:52:50] <logmsgbot>	 !log swfrench@deploy1002 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply
[16:53:15] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp700[12].magru.wmnet,service=(cdn|ats-be)
[16:53:55] <wikibugs>	 (03PS7) 10Clément Goubert: sre.k8s.reboot-nodes: Add exclude option [cookbooks] - 10https://gerrit.wikimedia.org/r/1038865
[16:55:25] <icinga-wm_>	 RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 23 probes of 788 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[16:57:44] <wikibugs>	 (03PS2) 10Hnowlan: kubernetes: rename and reimage 3 api appservers, 2 appservers [puppet] - 10https://gerrit.wikimedia.org/r/1038757 (https://phabricator.wikimedia.org/T362323)
[16:58:35] <wikibugs>	 (03PS9) 10Effie Mouzeli: [geoip::data::maxmind::ipinfo]: Use GeoLite2 instead of Enterprise data [puppet] - 10https://gerrit.wikimedia.org/r/1037528 (https://phabricator.wikimedia.org/T366272) (owner: 10Kosta Harlan)
[16:58:39] <wikibugs>	 (03CR) 10Effie Mouzeli: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1037528 (https://phabricator.wikimedia.org/T366272) (owner: 10Kosta Harlan)
[17:00:04] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240604T1700)
[17:00:13] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+1] kubernetes: rename and reimage 3 api appservers, 2 appservers [puppet] - 10https://gerrit.wikimedia.org/r/1038757 (https://phabricator.wikimedia.org/T362323) (owner: 10Hnowlan)
[17:02:17] <icinga-wm_>	 PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 37 probes of 788 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[17:03:13] <wikibugs>	 (03CR) 10Tchanders: [C:03+1] [CheckUser] Stop writing old for event table migration on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1013386 (https://phabricator.wikimedia.org/T360686) (owner: 10Dreamy Jazz)
[17:07:17] <icinga-wm_>	 RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 27 probes of 788 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[17:08:52] <wikibugs>	 (03PS4) 10Gergő Tisza: multiversion: Support beta for upload hostname check [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1037929
[17:08:52] <wikibugs>	 (03PS4) 10Gergő Tisza: multiversion: Add tests for MWMultiVersion::getMediaWiki() [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1037930
[17:08:52] <wikibugs>	 (03PS8) 10Gergő Tisza: [POC] Handle sso.wikimedia.org domain [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1036245 (https://phabricator.wikimedia.org/T365162)
[17:09:35] <wikibugs>	 (03CR) 10CI reject: [V:04-1] [POC] Handle sso.wikimedia.org domain [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1036245 (https://phabricator.wikimedia.org/T365162) (owner: 10Gergő Tisza)
[17:11:49] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.dns.netbox
[17:13:51] <wikibugs>	 (03PS1) 10Ssingh: hiera: add profile::cache::base::use_noflow_iface_preup for magru cp nodes [puppet] - 10https://gerrit.wikimedia.org/r/1038872 (https://phabricator.wikimedia.org/T366606)
[17:14:27] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Moved wikikube-ctrl1001 to a new rack - kamila@cumin1002"
[17:15:06] <wikibugs>	 (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2744/co" [puppet] - 10https://gerrit.wikimedia.org/r/1038872 (https://phabricator.wikimedia.org/T366606) (owner: 10Ssingh)
[17:15:18] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Moved wikikube-ctrl1001 to a new rack - kamila@cumin1002"
[17:15:18] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[17:16:05] <wikibugs>	 (03CR) 10Ssingh: [V:03+1 C:03+2] hiera: add profile::cache::base::use_noflow_iface_preup for magru cp nodes [puppet] - 10https://gerrit.wikimedia.org/r/1038872 (https://phabricator.wikimedia.org/T366606) (owner: 10Ssingh)
[17:22:11] <sukhe>	 !log sudo cumin 'A:cp and A:magru' 'run-puppet-agent'
[17:22:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:23:00] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-ctrl1001.eqiad.wmnet with OS bullseye
[17:23:08] <wikibugs>	 10ops-eqiad, 06SRE-OnFire, 06DC-Ops, 06serviceops, 10Sustainability (Incident Followup): eqiad:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366204#9860880 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by kamila@cumin1002 for host wikikube-ctrl1001.eq...
[17:27:30] <wikibugs>	 (03CR) 10Scott French: [C:03+1] "LGTM. Thanks!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1032525 (https://phabricator.wikimedia.org/T362978) (owner: 10Clément Goubert)
[17:29:57] <wikibugs>	 (03CR) 10Stoyofuku-wmf: [C:03+1] "Confirmed this is no longer used" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031459 (https://phabricator.wikimedia.org/T301212) (owner: 10Jdlrobson)
[17:30:53] <wikibugs>	 (03CR) 10JMeybohm: "I think I failed to create a task last time (or I failed to find it)." [cookbooks] - 10https://gerrit.wikimedia.org/r/1038865 (owner: 10Clément Goubert)
[17:32:00] <wikibugs>	 (03PS2) 10Gergő Tisza: [POC][beta] Add rewrite rule for sso.wikimedia.beta.wmflabs.org [puppet] - 10https://gerrit.wikimedia.org/r/1036230 (https://phabricator.wikimedia.org/T365162)
[17:33:03] <wikibugs>	 (03PS1) 10Jforrester: Add wikilambda-edit-monolingual-text-placeholder message to extension.json [extensions/WikiLambda] (wmf/1.43.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1038828 (https://phabricator.wikimedia.org/T359782)
[17:39:11] <logmsgbot>	 !log sukhe@cumin1002 START - Cookbook sre.cdn.roll-reboot rolling reboot on P{cp7002*} and A:cp
[17:39:16] <wikibugs>	 (03PS1) 10Stoyofuku-wmf: Disable font size options on specified pages for all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038876 (https://phabricator.wikimedia.org/T366625)
[17:40:23] <wikibugs>	 10ops-ulsfo, 06SRE, 06DC-Ops, 06Traffic: Q4: install PCIe NVMe SSDs into ulsfo text cp40(3[789]|4[01234] - https://phabricator.wikimedia.org/T364891#9860978 (10BCornwall)
[17:40:43] <wikibugs>	 (03PS3) 10Stoyofuku-wmf: Disable font size options on specified pages for most wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038444 (https://phabricator.wikimedia.org/T366334)
[17:42:11] <wikibugs>	 (03CR) 10Tchanders: [C:03+1] [CheckUser] Stop writing old for event table migration on testwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1013386 (https://phabricator.wikimedia.org/T360686) (owner: 10Dreamy Jazz)
[17:49:00] <logmsgbot>	 !log sukhe@cumin1002 END (PASS) - Cookbook sre.cdn.roll-reboot (exit_code=0) rolling reboot on P{cp7002*} and A:cp
[17:51:33] <sukhe>	 !log sudo cumin 'A:cp-text and A:magru' "sed -i '/\sup ethtool -A eno12399np0/d' /etc/network/interfaces"
[17:51:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:53:53] <sukhe>	 !log sudo cumin 'A:cp-upload and A:magru' "sed -i '/\sup ethtool -A eno12399np0/d' /etc/network/interfaces"
[17:53:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:54:28] <logmsgbot>	 !log sukhe@cumin1002 START - Cookbook sre.cdn.roll-reboot rolling reboot on P{cp7014*} and A:cp
[17:54:41] <wikibugs>	 (03CR) 10Jdlrobson: [C:03+1] Disable font size options on specified pages for all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038876 (https://phabricator.wikimedia.org/T366625) (owner: 10Stoyofuku-wmf)
[17:54:42] <wikibugs>	 (03PS8) 10Dreamy Jazz: [CheckUser] Stop writing old for event table migration on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1013386 (https://phabricator.wikimedia.org/T360686)
[17:54:54] <wikibugs>	 (03CR) 10Dreamy Jazz: [CheckUser] Stop writing old for event table migration on testwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1013386 (https://phabricator.wikimedia.org/T360686) (owner: 10Dreamy Jazz)
[18:00:04] <jouncebot>	 dduvall and dancy: OwO what's this, a deployment window?? MediaWiki train - Utc-7 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240604T1800). nyaa~
[18:00:20] <dancy>	 Lurking.
[18:04:27] <logmsgbot>	 !log sukhe@cumin1002 END (PASS) - Cookbook sre.cdn.roll-reboot (exit_code=0) rolling reboot on P{cp7014*} and A:cp
[18:04:55] <wikibugs>	 (03PS9) 10Gergő Tisza: [POC] Handle sso.wikimedia.org domain [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1036245 (https://phabricator.wikimedia.org/T365162)
[18:05:48] <wikibugs>	 (03CR) 10CI reject: [V:04-1] [POC] Handle sso.wikimedia.org domain [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1036245 (https://phabricator.wikimedia.org/T365162) (owner: 10Gergő Tisza)
[18:05:56] <wikibugs>	 (03CR) 10Dzahn: "alright! how about this: I disable puppet on prod phab, merge this, run it on cloud and if it breaks there I just revert, if not I enable " [puppet] - 10https://gerrit.wikimedia.org/r/1037621 (https://phabricator.wikimedia.org/T365395) (owner: 10JHathaway)
[18:07:10] <wikibugs>	 (03CR) 10JHathaway: [V:03+1] "sounds great" [puppet] - 10https://gerrit.wikimedia.org/r/1037621 (https://phabricator.wikimedia.org/T365395) (owner: 10JHathaway)
[18:07:13] <wikibugs>	 (03PS1) 10CDobbins: purged: set use_pki to true for all eqsin [puppet] - 10https://gerrit.wikimedia.org/r/1038881 (https://phabricator.wikimedia.org/T360506)
[18:08:24] <dduvall>	 dancy: o/
[18:11:48] <wikibugs>	 (03CR) 10Dzahn: "it fails in compiler like this: Error: Evaluation Error: Error while evaluating a Function Call, Failed to execute '/pdb/query/v4' on at l" [puppet] - 10https://gerrit.wikimedia.org/r/1037621 (https://phabricator.wikimedia.org/T365395) (owner: 10JHathaway)
[18:12:37] <wikibugs>	 (03CR) 10Dzahn: "Since this already happens on the compiler hosts I would expect the same on the devtools hosts." [puppet] - 10https://gerrit.wikimedia.org/r/1037621 (https://phabricator.wikimedia.org/T365395) (owner: 10JHathaway)
[18:13:00] <wikibugs>	 (03PS1) 10Urbanecm: Growth: Use `growthexperiments` DB list for enabling GrowthExperiments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038882 (https://phabricator.wikimedia.org/T364892)
[18:14:25] <wikibugs>	 (03CR) 10Dzahn: "can we lookup the list of host name in Hiera?" [puppet] - 10https://gerrit.wikimedia.org/r/1037621 (https://phabricator.wikimedia.org/T365395) (owner: 10JHathaway)
[18:14:42] <wikibugs>	 (03PS1) 10TrainBranchBot: group0 wikis to 1.43.0-wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038883 (https://phabricator.wikimedia.org/T361402)
[18:14:44] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] group0 wikis to 1.43.0-wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038883 (https://phabricator.wikimedia.org/T361402) (owner: 10TrainBranchBot)
[18:15:37] <logmsgbot>	 !log kamila@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-ctrl1001.eqiad.wmnet with OS bullseye
[18:15:44] <wikibugs>	 (03CR) 10Dzahn: "keep in mind if you just rename the resource itself and don't absent it then puppet won't remove the timer/service and you'll end up with " [puppet] - 10https://gerrit.wikimedia.org/r/1036686 (https://phabricator.wikimedia.org/T331706) (owner: 10EoghanGaffney)
[18:15:45] <jinxer-wm>	 FIRING: SystemdUnitFailed: generate_vrts_aliases.service on mx2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:15:46] <wikibugs>	 (03Merged) 10jenkins-bot: group0 wikis to 1.43.0-wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038883 (https://phabricator.wikimedia.org/T361402) (owner: 10TrainBranchBot)
[18:15:48] <wikibugs>	 10ops-eqiad, 06SRE-OnFire, 06DC-Ops, 06serviceops, 10Sustainability (Incident Followup): eqiad:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366204#9861152 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by kamila@cumin1002 for host wikikube-ctrl1001.eqiad....
[18:16:44] <wikibugs>	 (03CR) 10CDobbins: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2746/co" [puppet] - 10https://gerrit.wikimedia.org/r/1038881 (https://phabricator.wikimedia.org/T360506) (owner: 10CDobbins)
[18:16:48] <wikibugs>	 (03CR) 10Dzahn: "https://puppet-compiler.wmflabs.org/output/1037621/2745/phab1004.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1037621 (https://phabricator.wikimedia.org/T365395) (owner: 10JHathaway)
[18:17:55] <wikibugs>	 (03CR) 10JHathaway: [V:03+1] "There is logic in `modules/wmflib/functions/puppetdb_query.pp` to return an empty array, if a puppetdb server is not present in an environ" [puppet] - 10https://gerrit.wikimedia.org/r/1037621 (https://phabricator.wikimedia.org/T365395) (owner: 10JHathaway)
[18:18:26] <wikibugs>	 (03CR) 10Ssingh: [C:03+1] "Looks good, let's plan to merge this on Wed Jun 5!" [puppet] - 10https://gerrit.wikimedia.org/r/1038881 (https://phabricator.wikimedia.org/T360506) (owner: 10CDobbins)
[18:19:16] <wikibugs>	 (03CR) 10Dzahn: "even if it works in devtools this would still mean we can't compile changes anymore in the future" [puppet] - 10https://gerrit.wikimedia.org/r/1037621 (https://phabricator.wikimedia.org/T365395) (owner: 10JHathaway)
[18:19:23] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops: Upgrade Eqiad row E-F Spines to JunOS 22.2R3 - https://phabricator.wikimedia.org/T366361#9861173 (10cmooney)
[18:19:44] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - https://phabricator.wikimedia.org/T348977#9861174 (10cmooney)
[18:21:53] <wikibugs>	 (03CR) 10Michael Große: [C:03+1] "Sounds good to me, but this feels really like something where I would like us to get explicit approval from RelEng (SRE?) about before dep" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038882 (https://phabricator.wikimedia.org/T364892) (owner: 10Urbanecm)
[18:22:57] <wikibugs>	 (03PS1) 10Ssingh: haproxy: update systemd template for network-online.target [puppet] - 10https://gerrit.wikimedia.org/r/1038884
[18:23:30] <wikibugs>	 (03CR) 10Dzahn: "Or it needs a "if $realm = production" clause around lookup and something else in an else branch. Those realm checks are not recommended b" [puppet] - 10https://gerrit.wikimedia.org/r/1037621 (https://phabricator.wikimedia.org/T365395) (owner: 10JHathaway)
[18:23:43] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1238 (T364069)', diff saved to https://phabricator.wikimedia.org/P64031 and previous config saved to /var/cache/conftool/dbconfig/20240604-182342-marostegui.json
[18:23:47] <stashbot>	 T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069
[18:24:13] <wikibugs>	 (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2747/console" [puppet] - 10https://gerrit.wikimedia.org/r/1038884 (owner: 10Ssingh)
[18:25:10] <wikibugs>	 (03CR) 10Urbanecm: [C:04-1] [Beta] Enable CommunityConfiguration extension in all wikis (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1035473 (https://phabricator.wikimedia.org/T364892) (owner: 10Sergio Gimeno)
[18:26:57] <logmsgbot>	 !log dduvall@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.43.0-wmf.8  refs T361402
[18:27:00] <stashbot>	 T361402: 1.43.0-wmf.8 deployment blockers - https://phabricator.wikimedia.org/T361402
[18:28:17] <logmsgbot>	 !log dzahn@cumin1002 START - Cookbook sre.hosts.downtime for 0:10:00 on doc.wikimedia.org with reason: reboot T366555
[18:28:18] <logmsgbot>	 !log dzahn@cumin1002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 0:10:00 on doc.wikimedia.org with reason: reboot T366555
[18:28:37] <logmsgbot>	 !log dzahn@cumin1002 START - Cookbook sre.hosts.downtime for 0:10:00 on doc1003.eqiad.wmnet with reason: reboot T366555
[18:28:38] <logmsgbot>	 !log dzahn@cumin1002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 0:10:00 on doc1003.eqiad.wmnet with reason: reboot T366555
[18:29:00] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops: Upgrade Eqiad row E-F Spines to JunOS 22.2R3 - https://phabricator.wikimedia.org/T366361#9861186 (10cmooney)
[18:30:19] <mutante>	 !log doc.wikimedia.org - very short downtime for maintenance
[18:30:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:30:48] <wikibugs>	 (03PS2) 10Ssingh: P:cache::haproxy: update systemd template for network-online.target [puppet] - 10https://gerrit.wikimedia.org/r/1038884
[18:32:02] <dduvall>	 train looks good
[18:32:15] <wikibugs>	 (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2748/co" [puppet] - 10https://gerrit.wikimedia.org/r/1038884 (owner: 10Ssingh)
[18:35:45] <mutante>	 !log aphlict - (phab realtime notifications) - reboots
[18:35:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:36:51] <wikibugs>	 (03CR) 10Scott French: "Looks good! Only one notable comment / question." [puppet] - 10https://gerrit.wikimedia.org/r/1037868 (https://phabricator.wikimedia.org/T341553) (owner: 10RLazarus)
[18:38:51] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1238', diff saved to https://phabricator.wikimedia.org/P64032 and previous config saved to /var/cache/conftool/dbconfig/20240604-183850-marostegui.json
[18:40:28] <wikibugs>	 (03CR) 10BCornwall: [C:03+1] P:cache::haproxy: update systemd template for network-online.target [puppet] - 10https://gerrit.wikimedia.org/r/1038884 (owner: 10Ssingh)
[18:41:19] <ryankemper>	 !log [WDQS Deploy] Gearing up for deploy of wdqs `0.3.142`. Pre-deploy tests passing on canary `wdqs1016`
[18:41:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:44:04] <logmsgbot>	 !log ryankemper@deploy1002 Started deploy [wdqs/wdqs@143ca33]: 0.3.142
[18:45:18] <ryankemper>	 !log [WDQS Deploy] Tests passing following deploy of `0.3.142` on canary `wdqs1016`; proceeding to rest of fleet
[18:45:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:46:07] <logmsgbot>	 !log ryankemper@deploy1002 Finished deploy [wdqs/wdqs@143ca33]: 0.3.142 (duration: 02m 02s)
[18:46:20] <wikibugs>	 06SRE, 10SRE-swift-storage, 06Infrastructure-Foundations: moss-be1003 "Warning: The current total number of facts: 2830 exceeds the number of facts limit: 2048" - https://phabricator.wikimedia.org/T366563#9861253 (10CDanis) I discussed this with @Muehlenhoff in his evening/my morning.  `lang=irc 09:12:36 <mo...
[18:46:35] <wikibugs>	 06SRE, 10SRE-swift-storage, 06Infrastructure-Foundations: moss-be1003 "Warning: The current total number of facts: 2830 exceeds the number of facts limit: 2048" - https://phabricator.wikimedia.org/T366563#9861256 (10CDanis) p:05High→03Medium
[18:47:50] <logmsgbot>	 !log ryankemper@deploy1002 Started deploy [wdqs/wdqs@43b966f]: 0.3.142
[18:48:15] <wikibugs>	 (03PS1) 10Jsn.sherman: InitialiseSettings: Enable AutoModerator on trwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038886 (https://phabricator.wikimedia.org/T362622)
[18:48:59] <ryankemper>	 !log [WDQS Deploy] Forgot to run the command to set git hash to tip of origin/master so deploy was a partial no-op. Re-rolling...
[18:49:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:50:57] <wikibugs>	 (03PS1) 10JHathaway: devtools: update puppetmaster and pubkey [puppet] - 10https://gerrit.wikimedia.org/r/1038887 (https://phabricator.wikimedia.org/T365395)
[18:51:46] <wikibugs>	 (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1038887 (https://phabricator.wikimedia.org/T365395) (owner: 10JHathaway)
[18:53:59] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1238', diff saved to https://phabricator.wikimedia.org/P64033 and previous config saved to /var/cache/conftool/dbconfig/20240604-185358-marostegui.json
[18:57:28] <wikibugs>	 (03CR) 10Dzahn: [C:03+1] "confirmed the puppetmaster for devtools moved to puppetmaster-1003. haven't checked where you got the key from, but lgtm. it's a change to" [puppet] - 10https://gerrit.wikimedia.org/r/1038887 (https://phabricator.wikimedia.org/T365395) (owner: 10JHathaway)
[18:57:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:57:53] <wikibugs>	 (03CR) 10JHathaway: [C:03+2] devtools: update puppetmaster and pubkey [puppet] - 10https://gerrit.wikimedia.org/r/1038887 (https://phabricator.wikimedia.org/T365395) (owner: 10JHathaway)
[19:00:43] <logmsgbot>	 !log ryankemper@deploy1002 Finished deploy [wdqs/wdqs@43b966f]: 0.3.142 (duration: 12m 53s)
[19:03:14] <wikibugs>	 (03CR) 10JHathaway: [C:03+2] "will do, is there a doc on doing that somewhere?" [puppet] - 10https://gerrit.wikimedia.org/r/1038887 (https://phabricator.wikimedia.org/T365395) (owner: 10JHathaway)
[19:04:15] <wikibugs>	 (03CR) 10Dzahn: [C:03+1] "well, my comment was because I don't know that" [puppet] - 10https://gerrit.wikimedia.org/r/1038887 (https://phabricator.wikimedia.org/T365395) (owner: 10JHathaway)
[19:06:23] <ryankemper>	 !log [WDQS Deploy] Restarted `wdqs-updater` across all hosts, 4 hosts at a time: `sudo -E cumin -b 4 'A:wdqs-all' 'systemctl restart wdqs-updater'`
[19:06:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:06:35] <ryankemper>	 !log [WDQS Deploy] Restarted `wdqs-categories` across all test hosts simultaneously: `sudo -E cumin 'A:wdqs-test' 'systemctl restart wdqs-categories'`
[19:06:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:06:43] <ryankemper>	 !log [WDQS Deploy] Restarting `wdqs-categories` across lvs-managed hosts, one node at a time: `sudo -E cumin -b 1 'A:wdqs-all and not A:wdqs-test' 'depool && sleep 45 && systemctl restart wdqs-categories && sleep 45 && pool'`
[19:06:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:09:07] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1238 (T364069)', diff saved to https://phabricator.wikimedia.org/P64034 and previous config saved to /var/cache/conftool/dbconfig/20240604-190906-marostegui.json
[19:09:09] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1241.eqiad.wmnet with reason: Maintenance
[19:09:12] <stashbot>	 T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069
[19:09:23] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1241.eqiad.wmnet with reason: Maintenance
[19:09:32] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1241 (T364069)', diff saved to https://phabricator.wikimedia.org/P64035 and previous config saved to /var/cache/conftool/dbconfig/20240604-190931-marostegui.json
[19:11:15] <jinxer-wm>	 FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 21.79% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[19:11:20] <wikibugs>	 (03CR) 10Dzahn: [C:03+1] "but that would be the project puppet-diffs, not devtools, where it would have to be deployed I think" [puppet] - 10https://gerrit.wikimedia.org/r/1038887 (https://phabricator.wikimedia.org/T365395) (owner: 10JHathaway)
[19:12:33] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1184.eqiad.wmnet with reason: Maintenance
[19:12:35] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1184.eqiad.wmnet with reason: Maintenance
[19:13:43] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: generate_vrts_aliases.service on mx2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:13:57] <logmsgbot>	 !log dzahn@cumin1002 START - Cookbook sre.hosts.downtime for 0:10:00 on releases1003.eqiad.wmnet with reason: reboot T366555
[19:14:10] <logmsgbot>	 !log dzahn@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on releases1003.eqiad.wmnet with reason: reboot T366555
[19:16:15] <jinxer-wm>	 RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 21.79% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[19:16:36] <mutante>	 !log releases.wikimedia.org - short downtime for maintenance
[19:16:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:18:53] <wikibugs>	 (03CR) 10Eevans: [C:03+2] cassandra: create new commons_impact_analytics role [puppet] - 10https://gerrit.wikimedia.org/r/1038409 (https://phabricator.wikimedia.org/T361835) (owner: 10Eevans)
[19:32:51] <logmsgbot>	 !log dzahn@cumin1002 START - Cookbook sre.hosts.downtime for 0:10:00 on contint2002.wikimedia.org with reason: reboot T366555
[19:33:05] <logmsgbot>	 !log dzahn@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on contint2002.wikimedia.org with reason: reboot T366555
[19:35:05] <wikibugs>	 (03PS1) 10BCornwall: ncmonitor: Add SSH credentials support [puppet] - 10https://gerrit.wikimedia.org/r/1038890 (https://phabricator.wikimedia.org/T355189)
[19:36:37] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-ctrl1001.eqiad.wmnet with OS bullseye
[19:36:47] <wikibugs>	 10ops-eqiad, 06SRE-OnFire, 06DC-Ops, 06serviceops, 10Sustainability (Incident Followup): eqiad:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366204#9861507 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by kamila@cumin1002 for host wikikube-ctrl1001.eq...
[19:36:48] <wikibugs>	 (03PS2) 10BCornwall: ncmonitor: Add SSH credentials support [puppet] - 10https://gerrit.wikimedia.org/r/1038890 (https://phabricator.wikimedia.org/T355189)
[19:37:34] <logmsgbot>	 !log dzahn@cumin1002 START - Cookbook sre.hosts.downtime for 0:10:00 on gerrit2002.wikimedia.org with reason: reboot T366555
[19:37:40] <logmsgbot>	 !log kamila@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host wikikube-ctrl1001.eqiad.wmnet with OS bullseye
[19:37:41] <wikibugs>	 06SRE, 06serviceops, 10Data Products (Data Products Sprint 14), 13Patch-For-Review, 07Service-deployment-requests: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production - https://phabricator.wikimedia.org/T361835#9861511 (10Scott_French) Thanks for the update, @SGupta-WMF - that's great!  T...
[19:37:46] <wikibugs>	 10ops-eqiad, 06SRE-OnFire, 06DC-Ops, 06serviceops, 10Sustainability (Incident Followup): eqiad:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366204#9861512 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by kamila@cumin1002 for host wikikube-ctrl1001.eqiad....
[19:37:47] <logmsgbot>	 !log dzahn@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on gerrit2002.wikimedia.org with reason: reboot T366555
[19:37:55] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-ctrl1001.eqiad.wmnet with OS bullseye
[19:38:00] <wikibugs>	 10ops-eqiad, 06SRE-OnFire, 06DC-Ops, 06serviceops, 10Sustainability (Incident Followup): eqiad:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366204#9861516 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by kamila@cumin1002 for host wikikube-ctrl1001.eq...
[19:38:04] <logmsgbot>	 !log dzahn@cumin1002 START - Cookbook sre.hosts.downtime for 0:10:00 on gerrit-replica.wikimedia.org with reason: reboot T366555
[19:38:05] <logmsgbot>	 !log dzahn@cumin1002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 0:10:00 on gerrit-replica.wikimedia.org with reason: reboot T366555
[19:38:10] <wikibugs>	 06SRE, 06serviceops, 10Data Products (Data Products Sprint 14), 13Patch-For-Review, 07Service-deployment-requests: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production - https://phabricator.wikimedia.org/T361835#9861517 (10Scott_French)
[19:38:28] <mutante>	 !log https://gerrit-replica.wikimedia.org - short downtime for maintenance
[19:38:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:40:13] <mutante>	 jouncebot: nowandnext
[19:40:13] <jouncebot>	 For the next 0 hour(s) and 19 minute(s): MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240604T1800)
[19:40:13] <jouncebot>	 In 0 hour(s) and 19 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240604T2000)
[19:40:31] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2137 (T364299)', diff saved to https://phabricator.wikimedia.org/P64036 and previous config saved to /var/cache/conftool/dbconfig/20240604-194031-marostegui.json
[19:40:34] <stashbot>	 T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299
[19:44:22] <logmsgbot>	 !log kamila@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host wikikube-ctrl1001.eqiad.wmnet with OS bullseye
[19:44:33] <wikibugs>	 10ops-eqiad, 06SRE-OnFire, 06DC-Ops, 06serviceops, 10Sustainability (Incident Followup): eqiad:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366204#9861558 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by kamila@cumin1002 for host wikikube-ctrl1001.eqiad....
[19:46:07] <wikibugs>	 (03PS3) 10Pppery: [pawiki] Enable wgMinervaEnableSiteNotice [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1037945 (https://phabricator.wikimedia.org/T366434)
[19:47:48] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wikikube-ctrl1001']
[19:49:23] <logmsgbot>	 !log ecarg@deploy1002 helmfile [staging] START helmfile.d/services/wikifunctions: apply
[19:49:25] <logmsgbot>	 !log ecarg@deploy1002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply
[19:55:39] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2137', diff saved to https://phabricator.wikimedia.org/P64037 and previous config saved to /var/cache/conftool/dbconfig/20240604-195539-marostegui.json
[19:59:28] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['wikikube-ctrl1001']
[20:00:03] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-ctrl1001.eqiad.wmnet with OS bullseye
[20:00:04] <jouncebot>	 RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: #bothumor I � Unicode. All rise for UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240604T2000).
[20:00:04] <jouncebot>	 pppery, pmiazga, tgr, and toyofuku: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:00:08] <wikibugs>	 10ops-eqiad, 06SRE-OnFire, 06DC-Ops, 06serviceops, 10Sustainability (Incident Followup): eqiad:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366204#9861609 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by kamila@cumin1002 for host wikikube-ctrl1001.eq...
[20:00:10] <Pppery>	 Here
[20:00:16] <Pppery>	 This is my first time doing this, though
[20:00:30] <tgr|away>	 o/
[20:00:40] <toyofuku>	 Also here, and it's my second 🙃
[20:01:05] <wikibugs>	 (03CR) 10Jforrester: "Eh, fine, you've convinced me." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1027150 (https://phabricator.wikimedia.org/T256190) (owner: 10Jforrester)
[20:09:23] <Pppery>	 Hello?
[20:10:32] <Amir1>	 Pppery: I'm not on window right now but I can deploy this soon
[20:10:47] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2137', diff saved to https://phabricator.wikimedia.org/P64038 and previous config saved to /var/cache/conftool/dbconfig/20240604-201047-marostegui.json
[20:14:00] <wikibugs>	 (03CR) 10Ladsgroup: [C:03+2] [pawiki] Enable wgMinervaEnableSiteNotice [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1037945 (https://phabricator.wikimedia.org/T366434) (owner: 10Pppery)
[20:14:36] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1037945 (https://phabricator.wikimedia.org/T366434) (owner: 10Pppery)
[20:14:39] <wikibugs>	 (03Merged) 10jenkins-bot: [pawiki] Enable wgMinervaEnableSiteNotice [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1037945 (https://phabricator.wikimedia.org/T366434) (owner: 10Pppery)
[20:15:08] <logmsgbot>	 !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:1037945|[pawiki] Enable wgMinervaEnableSiteNotice (T366434)]]
[20:15:12] <stashbot>	 T366434: Enable SiteNotice in Mobile View on Punjabi Wikipedia - https://phabricator.wikimedia.org/T366434
[20:15:45] <wikibugs>	 (03CR) 10Hashar: plugins: Add wm-schedule-deployment plugin (031 comment) [software/gerrit] (deploy/wmf/stable-3.8) - 10https://gerrit.wikimedia.org/r/1038465 (https://phabricator.wikimedia.org/T366512) (owner: 10BryanDavis)
[20:17:45] <logmsgbot>	 !log ladsgroup@deploy1002 pppery and ladsgroup: Backport for [[gerrit:1037945|[pawiki] Enable wgMinervaEnableSiteNotice (T366434)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[20:19:12] <Pppery>	 Can confirm I see the sitenotice at pa.m.wikipedia.org with X-Wikimedia-Debug set up and don't when it isn't set up, so looks good
[20:19:58] <logmsgbot>	 !log ladsgroup@deploy1002 pppery and ladsgroup: Continuing with sync
[20:20:08] <Amir1>	 moving forward. thanks
[20:21:39] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts mw1358.eqiad.wmnet
[20:21:50] <logmsgbot>	 !log jclark@cumin1002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts mw1358.eqiad.wmnet
[20:22:06] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts mw1358.eqiad.wmnet
[20:22:15] <logmsgbot>	 !log jclark@cumin1002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts mw1358.eqiad.wmnet
[20:22:54] <wikibugs>	 (03CR) 10Urbanecm: [C:03+1] "lgtm. can we get it merged?" [puppet] - 10https://gerrit.wikimedia.org/r/1028855 (https://phabricator.wikimedia.org/T363825) (owner: 10Zabe)
[20:23:28] <wikibugs>	 (03PS6) 10Pmiazga: beta: Introduce new test2wiki on test2.wikipedia.beta.wmcloud.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1035749 (https://phabricator.wikimedia.org/T355281)
[20:25:55] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2137 (T364299)', diff saved to https://phabricator.wikimedia.org/P64039 and previous config saved to /var/cache/conftool/dbconfig/20240604-202554-marostegui.json
[20:25:57] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2139.codfw.wmnet with reason: Maintenance
[20:25:59] <stashbot>	 T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299
[20:26:05] <wikibugs>	 (03CR) 10Pmiazga: beta: Introduce new test2wiki on test2.wikipedia.beta.wmcloud.org (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1035749 (https://phabricator.wikimedia.org/T355281) (owner: 10Pmiazga)
[20:26:10] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2139.codfw.wmnet with reason: Maintenance
[20:27:22] <jhathaway>	 !log vacuuming pcc db
[20:27:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:28:32] <logmsgbot>	 !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:1037945|[pawiki] Enable wgMinervaEnableSiteNotice (T366434)]] (duration: 13m 24s)
[20:28:35] <stashbot>	 T366434: Enable SiteNotice in Mobile View on Punjabi Wikipedia - https://phabricator.wikimedia.org/T366434
[20:28:43] <Amir1>	 Pppery: deployed
[20:29:44] <Amir1>	 I need to be afk for a bit, if someone else can take over, that'd be amazing
[20:31:19] <tgr|away>	 will do
[20:31:29] <toyofuku>	 ty ty 
[20:33:23] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1035749 (https://phabricator.wikimedia.org/T355281) (owner: 10Pmiazga)
[20:33:59] <wikibugs>	 (03Merged) 10jenkins-bot: beta: Introduce new test2wiki on test2.wikipedia.beta.wmcloud.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1035749 (https://phabricator.wikimedia.org/T355281) (owner: 10Pmiazga)
[20:34:30] <logmsgbot>	 !log tgr@deploy1002 Started scap: Backport for [[gerrit:1035749|beta: Introduce new test2wiki on test2.wikipedia.beta.wmcloud.org (T355281)]]
[20:34:34] <stashbot>	 T355281: Set up some beta cluster wikis with different registrable domain - https://phabricator.wikimedia.org/T355281
[20:37:54] <logmsgbot>	 !log tgr@deploy1002 tgr and pmiazga: Backport for [[gerrit:1035749|beta: Introduce new test2wiki on test2.wikipedia.beta.wmcloud.org (T355281)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[20:38:41] <wikibugs>	 (03PS1) 10Pppery: [jawikinews] Set $wgArticleCountMethod to any [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038897 (https://phabricator.wikimedia.org/T364189)
[20:39:10] <logmsgbot>	 !log tgr@deploy1002 tgr and pmiazga: Continuing with sync
[20:42:26] <wikibugs>	 (03CR) 10Ebernhardson: [C:03+1] "Generally looks good, one nit on an awkward comment. Could add more nits on some python bits, but they are generally irrelevant." [cookbooks] - 10https://gerrit.wikimedia.org/r/1031933 (https://phabricator.wikimedia.org/T349069) (owner: 10DCausse)
[20:43:22] <urbanecm>	 hi pmiazga :)
[20:43:27] <pmiazga>	 o/ 
[20:44:56] <urbanecm>	 pmiazga: (recapping from -releng) you wanted me to deploy something. can do, once tgr|away is done with the patch / window, as appropriate.
[20:47:06] <pmiazga>	 cool. thank you. Mine is no-op for prod
[20:47:43] <logmsgbot>	 !log tgr@deploy1002 Finished scap: Backport for [[gerrit:1035749|beta: Introduce new test2wiki on test2.wikipedia.beta.wmcloud.org (T355281)]] (duration: 13m 12s)
[20:47:46] <stashbot>	 T355281: Set up some beta cluster wikis with different registrable domain - https://phabricator.wikimedia.org/T355281
[20:47:58] <tgr|away>	 ^ I think that was the one
[20:48:37] <pmiazga>	 so if prod works, everything is good. Nice, thank you tgr|away! Looks like for the last 40 mins I was looking at an empty #wikimedia-releng channel and wondering why no one was deploying
[20:48:49] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 06serviceops: hw troubleshooting: firmware upgrade for mw1358.eqiad.wmnet - https://phabricator.wikimedia.org/T366583#9861710 (10Jclark-ctr) 05Open→03Resolved manually updated firmware  iDRAC Firmware Version  7.00.00.171  BIOS Version...
[20:49:06] <wikibugs>	 (03CR) 10Gergő Tisza: [C:03+2] [beta] Remove references to upload.beta.wmflabs.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038230 (https://phabricator.wikimedia.org/T366415) (owner: 10Gergő Tisza)
[20:49:22] <wikibugs>	 (03PS1) 10Pppery: [zghwiki] Add patroller and autopatrolled groups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038899 (https://phabricator.wikimedia.org/T357411)
[20:49:42] <wikibugs>	 (03PS3) 10Gergő Tisza: [beta] Remove references to upload.beta.wmflabs.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038230 (https://phabricator.wikimedia.org/T366415)
[20:50:03] <wikibugs>	 (03CR) 10CI reject: [V:04-1] [zghwiki] Add patroller and autopatrolled groups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038899 (https://phabricator.wikimedia.org/T357411) (owner: 10Pppery)
[20:50:41] <wikibugs>	 (03CR) 10Gergő Tisza: [C:03+2] [beta] Remove references to upload.beta.wmflabs.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038230 (https://phabricator.wikimedia.org/T366415) (owner: 10Gergő Tisza)
[20:51:05] <wikibugs>	 (03PS2) 10Pppery: [zghwiki] Add patroller and autopatrolled groups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038899 (https://phabricator.wikimedia.org/T357411)
[20:51:21] <wikibugs>	 (03Merged) 10jenkins-bot: [beta] Remove references to upload.beta.wmflabs.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038230 (https://phabricator.wikimedia.org/T366415) (owner: 10Gergő Tisza)
[20:51:43] <wikibugs>	 (03CR) 10CI reject: [V:04-1] [zghwiki] Add patroller and autopatrolled groups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038899 (https://phabricator.wikimedia.org/T357411) (owner: 10Pppery)
[20:51:46] <wikibugs>	 (03PS5) 10Gergő Tisza: multiversion: Support beta for upload hostname check [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1037929
[20:51:54] <wikibugs>	 (03PS5) 10Gergő Tisza: multiversion: Add tests for MWMultiVersion::getMediaWiki() [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1037930
[20:52:09] <wikibugs>	 (03PS10) 10Gergő Tisza: [POC] Handle sso.wikimedia.org domain [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1036245 (https://phabricator.wikimedia.org/T365162)
[20:52:37] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1037929 (owner: 10Gergő Tisza)
[20:52:38] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by tgr@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1037930 (owner: 10Gergő Tisza)
[20:52:41] <logmsgbot>	 !log kamila@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-ctrl1001.eqiad.wmnet with OS bullseye
[20:52:46] <wikibugs>	 10ops-eqiad, 06SRE-OnFire, 06DC-Ops, 06serviceops, 10Sustainability (Incident Followup): eqiad:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366204#9861730 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by kamila@cumin1002 for host wikikube-ctrl1001.eqiad....
[20:52:47] <wikibugs>	 (03CR) 10CI reject: [V:04-1] [POC] Handle sso.wikimedia.org domain [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1036245 (https://phabricator.wikimedia.org/T365162) (owner: 10Gergő Tisza)
[20:53:13] <wikibugs>	 (03Merged) 10jenkins-bot: multiversion: Support beta for upload hostname check [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1037929 (owner: 10Gergő Tisza)
[20:53:19] <wikibugs>	 (03Merged) 10jenkins-bot: multiversion: Add tests for MWMultiVersion::getMediaWiki() [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1037930 (owner: 10Gergő Tisza)
[20:53:50] <logmsgbot>	 !log tgr@deploy1002 Started scap: Backport for [[gerrit:1037929|multiversion: Support beta for upload hostname check]], [[gerrit:1037930|multiversion: Add tests for MWMultiVersion::getMediaWiki()]]
[20:56:34] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wikikube-ctrl1001']
[20:58:33] <logmsgbot>	 !log tgr@deploy1002 tgr: Backport for [[gerrit:1037929|multiversion: Support beta for upload hostname check]], [[gerrit:1037930|multiversion: Add tests for MWMultiVersion::getMediaWiki()]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[20:59:08] <wikibugs>	 (03PS1) 10Pppery: [ptwikinews] Set atom feed link [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038901 (https://phabricator.wikimedia.org/T356003)
[20:59:17] <tgr|away>	 20:56:06 Check 'check_testservers_k8s' failed: Sending to mwdebug.discovery.wmnet...
[20:59:20] <tgr|away>	 https://techconduct.wikimedia.org/wiki/Main_Page (/srv/deployment/httpbb-tests/appserver/test_remnant.yaml:159) Status code: expected 200, got 503.
[20:59:35] <tgr|away>	 error went away on retry so fingers crossed...
[21:01:19] <urbanecm>	 tgr|away: can you please ping me once done? :)
[21:01:57] <logmsgbot>	 !log tgr@deploy1002 tgr: Continuing with sync
[21:05:50] <wikibugs>	 (03PS4) 10Stoyofuku-wmf: Disable font size options on specified pages for most wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038444 (https://phabricator.wikimedia.org/T366334)
[21:06:50] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['wikikube-ctrl1001']
[21:07:46] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-ctrl1001.eqiad.wmnet with OS bullseye
[21:08:00] <wikibugs>	 (PS11) Gergő Tisza: [POC] Handle sso.wikimedia.org domain [mediawiki-config] - https://gerrit.wikimedia.org/r/1036245 (https://phabricator.wikimedia.org/T365162)
[21:08:36] <wikibugs>	 (CR) CI reject: [V:-1] [POC] Handle sso.wikimedia.org domain [mediawiki-config] - https://gerrit.wikimedia.org/r/1036245 (https://phabricator.wikimedia.org/T365162) (owner: Gergő Tisza)
[21:10:23] <logmsgbot>	 !log tgr@deploy1002 Finished scap: Backport for [[gerrit:1037929|multiversion: Support beta for upload hostname check]], [[gerrit:1037930|multiversion: Add tests for MWMultiVersion::getMediaWiki()]] (duration: 16m 33s)
[21:10:46] <wikibugs>	 ops-eqiad, SRE-OnFire, DC-Ops, serviceops, Sustainability (Incident Followup): eqiad:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366204#9861770 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by kamila@cumin1002 for host wikikube-ctrl1001.eq...
[21:10:58] <tgr|away>	 urbanecm: done
[21:11:34] <urbanecm>	 thanks tgr|away 
[21:12:17] <urbanecm>	 i see an unmerged patch by toyofuku as well – did we decide to skip it, or should that be done as well?
[21:12:27] <urbanecm>	 also pmiazga what is it you wanted deployed?
[21:12:48] <toyofuku>	 I mean, I'm here if someone's willing to deploy it
[21:13:13] <toyofuku>	 As of right now I'm unqualified to do so myself 😭 I promise to pay it back when I'm trained up
[21:13:16] <urbanecm>	 i can do that :)
[21:13:31] <toyofuku>	 thank you!!
[21:13:32] <urbanecm>	 (deployment, although i can help with deployment advice too if needed)
[21:13:45] <tgr|away>	 oh sorry, don't know how I missed that
[21:13:51] <toyofuku>	 haha I think I'm good in that area - shadowing tomorrow
[21:14:01] <toyofuku>	 (all good!)
[21:14:05] <wikibugs>	 (CR) Urbanecm: [C:+2] Disable font size options on specified pages for most wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1038444 (https://phabricator.wikimedia.org/T366334) (owner: Stoyofuku-wmf)
[21:14:22] <urbanecm>	 toyofuku: enjoy the shadowing then!
[21:14:35] <toyofuku>	 💜
[21:14:43] <wikibugs>	 (Merged) jenkins-bot: Disable font size options on specified pages for most wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1038444 (https://phabricator.wikimedia.org/T366334) (owner: Stoyofuku-wmf)
[21:15:04] <urbanecm>	 pmiazga: can you link your patch as well please?
[21:15:44] <tgr|away>	 urbanecm: I think that was https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1035749 ?
[21:16:30] <urbanecm>	 which is already deployed (and was when pmiazga pinged me asking for a deployment of "a couple of things")
[21:16:33] <urbanecm>	 so...probably not
[21:16:48] <wikibugs>	 (PS12) Gergő Tisza: [POC] Handle sso.wikimedia.org domain [mediawiki-config] - https://gerrit.wikimedia.org/r/1036245 (https://phabricator.wikimedia.org/T365162)
[21:17:25] <wikibugs>	 (CR) CI reject: [V:-1] [POC] Handle sso.wikimedia.org domain [mediawiki-config] - https://gerrit.wikimedia.org/r/1036245 (https://phabricator.wikimedia.org/T365162) (owner: Gergő Tisza)
[21:18:21] <logmsgbot>	 !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:1038444|Disable font size options on specified pages for most wikis (T366334)]]
[21:18:24] <stashbot>	 T366334: Enable different default font size on different pages for Vector 2022 in production - https://phabricator.wikimedia.org/T366334
[21:19:19] <wikibugs>	 (PS23) Ryan Kemper: wdqs.data-reload: support HDFS as a source [cookbooks] - https://gerrit.wikimedia.org/r/1031933 (https://phabricator.wikimedia.org/T349069) (owner: DCausse)
[21:19:57] <wikibugs>	 (CR) Ryan Kemper: [C:+2] wdqs: extract categories reload to its own cookbook [cookbooks] - https://gerrit.wikimedia.org/r/1032544 (owner: DCausse)
[21:20:12] <wikibugs>	 (CR) Ryan Kemper: [C:+2] wdqs.data-reload: support HDFS as a source [cookbooks] - https://gerrit.wikimedia.org/r/1031933 (https://phabricator.wikimedia.org/T349069) (owner: DCausse)
[21:21:23] <logmsgbot>	 !log urbanecm@deploy1002 toyofuku and urbanecm: Backport for [[gerrit:1038444|Disable font size options on specified pages for most wikis (T366334)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[21:21:34] <urbanecm>	 toyofuku: can you test your patch at mwdebug, please? :)
[21:21:52] <toyofuku>	 Yep, doing so now!
[21:23:16] <toyofuku>	 Looks good - thank you so much!
[21:23:42] <wikibugs>	 (Merged) jenkins-bot: wdqs: extract categories reload to its own cookbook [cookbooks] - https://gerrit.wikimedia.org/r/1032544 (owner: DCausse)
[21:24:11] <wikibugs>	 (Merged) jenkins-bot: wdqs.data-reload: support HDFS as a source [cookbooks] - https://gerrit.wikimedia.org/r/1031933 (https://phabricator.wikimedia.org/T349069) (owner: DCausse)
[21:24:54] <logmsgbot>	 !log urbanecm@deploy1002 toyofuku and urbanecm: Continuing with sync
[21:24:56] <urbanecm>	 proceeding!
[21:27:23] <wikibugs>	 (PS7) Ladsgroup: Change static footer icons to the new one [mediawiki-config] - https://gerrit.wikimedia.org/r/1027150 (https://phabricator.wikimedia.org/T256190) (owner: Jforrester)
[21:27:34] <urbanecm>	 pmiazga: last ping...?
[21:27:55] <wikibugs>	 (PS3) Pppery: [zghwiki] Add patroller and autopatrolled groups [mediawiki-config] - https://gerrit.wikimedia.org/r/1038899 (https://phabricator.wikimedia.org/T357411)
[21:28:10] <logmsgbot>	 !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs2023.codfw.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/full/20240527/ using stat1009.eqiad.wmnet)
[21:28:15] <logmsgbot>	 !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs2023.codfw.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/full/20240527/ using stat1009.eqiad.wmnet)
[21:31:52] <wikibugs>	 (PS1) Ryan Kemper: wdqs.data-reload: fix regex escaping [cookbooks] - https://gerrit.wikimedia.org/r/1038904 (https://phabricator.wikimedia.org/T349069)
[21:32:48] <logmsgbot>	 !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs2023.codfw.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/full/20240527/ using stat1009.eqiad.wmnet)
[21:32:53] <logmsgbot>	 !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs2023.codfw.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/full/20240527/ using stat1009.eqiad.wmnet)
[21:33:31] <logmsgbot>	 !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:1038444|Disable font size options on specified pages for most wikis (T366334)]] (duration: 15m 10s)
[21:33:34] <stashbot>	 T366334: Enable different default font size on different pages for Vector 2022 in production - https://phabricator.wikimedia.org/T366334
[21:33:37] <urbanecm>	 toyofuku: and done
[21:33:42] <wikibugs>	 (PS2) Ryan Kemper: wdqs.data-reload: fix regex escaping [cookbooks] - https://gerrit.wikimedia.org/r/1038904 (https://phabricator.wikimedia.org/T349069)
[21:33:44] <urbanecm>	 anything else?
[21:33:50] <logmsgbot>	 !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs2023.codfw.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/full/20240527/ using stat1009.eqiad.wmnet)
[21:33:54] <logmsgbot>	 !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs2023.codfw.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/full/20240527/ using stat1009.eqiad.wmnet)
[21:34:45] <wikibugs>	 (PS3) Ryan Kemper: wdqs.data-reload: fix regex escaping [cookbooks] - https://gerrit.wikimedia.org/r/1038904 (https://phabricator.wikimedia.org/T349069)
[21:34:54] <logmsgbot>	 !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs2023.codfw.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/full/20240527/ using stat1009.eqiad.wmnet)
[21:34:59] <logmsgbot>	 !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs2023.codfw.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/full/20240527/ using stat1009.eqiad.wmnet)
[21:36:33] <wikibugs>	 (CR) BryanDavis: plugins: Add wm-schedule-deployment plugin (3 comments) [software/gerrit] (deploy/wmf/stable-3.8) - https://gerrit.wikimedia.org/r/1038465 (https://phabricator.wikimedia.org/T366512) (owner: BryanDavis)
[21:36:47] <wikibugs>	 (PS4) BryanDavis: plugins: Add wm-schedule-deployment plugin [software/gerrit] (deploy/wmf/stable-3.8) - https://gerrit.wikimedia.org/r/1038465 (https://phabricator.wikimedia.org/T366512)
[21:39:43] <wikibugs>	 (PS4) Ryan Kemper: wdqs.data-reload: fix regex escaping [cookbooks] - https://gerrit.wikimedia.org/r/1038904 (https://phabricator.wikimedia.org/T349069)
[21:40:42] <wikibugs>	 (PS5) Ryan Kemper: wdqs.data-reload: fix regex escaping [cookbooks] - https://gerrit.wikimedia.org/r/1038904 (https://phabricator.wikimedia.org/T349069)
[21:41:06] <logmsgbot>	 !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs2023.codfw.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/full/20240527/ using stat1009.eqiad.wmnet)
[21:41:12] <logmsgbot>	 !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs2023.codfw.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/full/20240527/ using stat1009.eqiad.wmnet)
[21:42:15] <wikibugs>	 SRE, SRE-swift-storage, Data-Persistence, Thumbor, and 6 others: Change default image thumbnail size - https://phabricator.wikimedia.org/T355914#9861911 (Pppery)
[21:42:31] <wikibugs>	 (PS1) Andrew Bogott: wmfkeystonehooks: use project_id rather than project_name for auth [puppet] - https://gerrit.wikimedia.org/r/1038907 (https://phabricator.wikimedia.org/T343158)
[21:43:10] <wikibugs>	 (CR) Andrew Bogott: [C:+2] wmfkeystonehooks: use project_id rather than project_name for auth [puppet] - https://gerrit.wikimedia.org/r/1038907 (https://phabricator.wikimedia.org/T343158) (owner: Andrew Bogott)
[21:57:22] <wikibugs>	 (CR) BryanDavis: [C:+1] Use a wildcard TypeScript include for plugins [software/gerrit] (deploy/wmf/stable-3.8) - https://gerrit.wikimedia.org/r/1038810 (owner: Hashar)
[21:58:58] <wikibugs>	 (PS6) Ryan Kemper: wdqs.data-reload: fix regex escaping [cookbooks] - https://gerrit.wikimedia.org/r/1038904 (https://phabricator.wikimedia.org/T349069)
[21:59:24] <logmsgbot>	 !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs2023.codfw.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/full/20240527/ using stat1009.eqiad.wmnet)
[21:59:29] <logmsgbot>	 !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs2023.codfw.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/full/20240527/ using stat1009.eqiad.wmnet)
[22:00:23] <logmsgbot>	 !log kamila@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-ctrl1001.eqiad.wmnet with OS bullseye
[22:00:31] <wikibugs>	 ops-eqiad, SRE-OnFire, DC-Ops, serviceops, Sustainability (Incident Followup): eqiad:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366204#9861978 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by kamila@cumin1002 for host wikikube-ctrl1001.eqiad....
[22:01:50] <wikibugs>	 SRE, serviceops: Migrate MW appservers' base images to bullseye - https://phabricator.wikimedia.org/T356293#9861981 (Jdforrester-WMF)
[22:01:54] <wikibugs>	 (PS7) Ryan Kemper: wdqs.data-reload: fix regex escaping [cookbooks] - https://gerrit.wikimedia.org/r/1038904 (https://phabricator.wikimedia.org/T349069)
[22:02:07] <logmsgbot>	 !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs2023.codfw.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/full/20240527/ using stat1009.eqiad.wmnet)
[22:02:15] <logmsgbot>	 !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs2023.codfw.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/full/20240527/ using stat1009.eqiad.wmnet)
[22:07:25] <wikibugs>	 (PS8) Ryan Kemper: wdqs.data-reload: fix regex escaping [cookbooks] - https://gerrit.wikimedia.org/r/1038904 (https://phabricator.wikimedia.org/T349069)
[22:08:12] <wikibugs>	 (PS9) Ryan Kemper: wdqs.data-reload: fix regex escaping [cookbooks] - https://gerrit.wikimedia.org/r/1038904 (https://phabricator.wikimedia.org/T349069)
[22:08:21] <logmsgbot>	 !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs2023.codfw.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/full/20240527/ using stat1009.eqiad.wmnet)
[22:09:00] <wikibugs>	 ops-codfw, SRE, cloud-services-team, DC-Ops, decommission-hardware: decommission cloudcontrol2001-dev.codfw.wmnet - https://phabricator.wikimedia.org/T364577#9861989 (Papaul) a:Jhancock.wm @Jhancock.wm can you please proceed with this and resolve the task once done.  Thanks
[22:13:20] <wikibugs>	 SRE, SRE-swift-storage, Commons, MediaWiki-File-management, and 2 others: PHP Warning "Unable to delete stat cache" from file uploads - https://phabricator.wikimedia.org/T205567#9862002 (TheDJ) In the last 7 days there were 85 log entries for this warning.  48 of these were on labswiki, triggered...
[22:16:57] <tzatziki>	 !log removing three files for legal compliance
[22:16:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:29:15] <tzatziki>	 !log removing two files for legal compliance
[22:29:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:34:52] <mutante>	 jouncebot: nowandnext
[22:34:52] <jouncebot>	 No deployments scheduled for the next 7 hour(s) and 25 minute(s)
[22:34:52] <jouncebot>	 In 7 hour(s) and 25 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240605T0600)
[22:35:29] <wikibugs>	 SRE-swift-storage, MediaWiki-Uploading: FAILED: stashfailed: Could not read file "mwstore://local-swift-eqiad/local-temp/a/ac/15xi9btm14os.u9p1dr.1208681.webm.0". - https://phabricator.wikimedia.org/T200820#9862056 (TheDJ) I think we can close this ticket ? I'm sure some incidental problems might still e...
[22:35:32] <logmsgbot>	 !log dzahn@cumin1002 START - Cookbook sre.hosts.downtime for 0:10:00 on contint.wikimedia.org with reason: reboot T366555
[22:35:32] <logmsgbot>	 !log dzahn@cumin1002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 0:10:00 on contint.wikimedia.org with reason: reboot T366555
[22:36:03] <mutante>	 !log CI - (integration.wikimedia.org)  short downtime for maintenance
[22:36:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:36:10] <logmsgbot>	 !log dzahn@cumin1002 START - Cookbook sre.hosts.downtime for 0:10:00 on contint1002.wikimedia.org with reason: reboot T366555
[22:36:11] <logmsgbot>	 !log dzahn@cumin1002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 0:10:00 on contint1002.wikimedia.org with reason: reboot T366555
[22:39:17] <icinga-wm_>	 PROBLEM - SSH on contint1002 is CRITICAL: connect to address 208.80.154.132 and port 22: Connection refused https://wikitech.wikimedia.org/wiki/SSH/monitoring
[22:40:10] <jinxer-wm>	 FIRING: ProbeDown: Service contint1002:1443 has failed probes (http_integration_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#contint1002:1443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:46:48] <logmsgbot>	 !log dzahn@cumin1002 START - Cookbook sre.hosts.downtime for 0:10:00 on contint1002.wikimedia.org with reason: reboot T366555
[22:46:48] <logmsgbot>	 !log dzahn@cumin1002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 0:10:00 on contint1002.wikimedia.org with reason: reboot T366555
[22:47:12] <tzatziki>	 !log removing one file for legal compliance
[22:47:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:47:51] <logmsgbot>	 !log dzahn@cumin1002 START - Cookbook sre.hosts.downtime for 0:10:00 on contint.wikimedia.org with reason: reboot T366555
[22:47:52] <logmsgbot>	 !log dzahn@cumin1002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 0:10:00 on contint.wikimedia.org with reason: reboot T366555
[22:50:47] <logmsgbot>	 !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs2023.codfw.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/full/20240527/ using stat1009.eqiad.wmnet)
[22:53:17] <icinga-wm_>	 RECOVERY - SSH on contint1002 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[22:55:10] <jinxer-wm>	 RESOLVED: ProbeDown: Service contint1002:1443 has failed probes (http_integration_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#contint1002:1443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:57:40] <jinxer-wm>	 FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:58:10] <wikibugs>	 (CR) Dzahn: "recheck" [puppet] - https://gerrit.wikimedia.org/r/1035545 (https://phabricator.wikimedia.org/T364715) (owner: Dzahn)
[22:58:22] <wikibugs>	 (CR) CI reject: [V:-1] admin: convert mareikeheuer to analytics-privatedata with shell [puppet] - https://gerrit.wikimedia.org/r/1035545 (https://phabricator.wikimedia.org/T364715) (owner: Dzahn)
[22:59:51] <wikibugs>	 (PS4) Dzahn: admin: convert mareikeheuer to analytics-privatedata with shell [puppet] - https://gerrit.wikimedia.org/r/1035545 (https://phabricator.wikimedia.org/T364715)
[23:00:22] <wikibugs>	 (CR) CI reject: [V:-1] admin: convert mareikeheuer to analytics-privatedata with shell [puppet] - https://gerrit.wikimedia.org/r/1035545 (https://phabricator.wikimedia.org/T364715) (owner: Dzahn)
[23:01:30] <wikibugs>	 SRE-swift-storage, MediaWiki-Uploading: FAILED: stashfailed: Could not read file "mwstore://local-swift-eqiad/local-temp/a/ac/15xi9btm14os.u9p1dr.1208681.webm.0". - https://phabricator.wikimedia.org/T200820#9862096 (Bawolff) Open→Resolved a:Bawolff The biggest known issue at this point is...
[23:06:36] <wikibugs>	 (PS5) Dzahn: admin: convert mareikeheuer to analytics-privatedata with shell [puppet] - https://gerrit.wikimedia.org/r/1035545 (https://phabricator.wikimedia.org/T364715)
[23:09:39] <logmsgbot>	 !log dzahn@cumin1002 START - Cookbook sre.hosts.downtime for 0:10:00 on miscweb1003.eqiad.wmnet with reason: reboot T366555
[23:09:53] <logmsgbot>	 !log dzahn@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on miscweb1003.eqiad.wmnet with reason: reboot T366555
[23:15:28] <tzatziki>	 !log removing one file for legal compliance
[23:15:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:19:27] <wikibugs>	 SRE, Wikimedia-Mailing-lists: Make Chqaz admin of Wikija-g mailing list - https://phabricator.wikimedia.org/T365933#9862118 (Dzahn) Would it be helpful if you contact the original admins or we reset to the original admins from T340380?
[23:31:12] <wikibugs>	 SRE, Wikimedia-Mailing-lists: Create a mailing list for Bangla Wikimoitree - https://phabricator.wikimedia.org/T365915#9862139 (Dzahn) We have an existing list "wikipedia-bn@lists.wikimedia.org" for Bengali Wikipedia.  This new list seems to be across projects, so wikimedia, and based on language alone....
[23:38:21] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1038789
[23:38:21] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1038789 (owner: TrainBranchBot)
[23:42:07] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2147.codfw.wmnet with reason: Maintenance
[23:42:20] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2147.codfw.wmnet with reason: Maintenance
[23:42:28] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2147 (T364299)', diff saved to https://phabricator.wikimedia.org/P64040 and previous config saved to /var/cache/conftool/dbconfig/20240604-234228-marostegui.json
[23:42:31] <stashbot>	 T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299