[00:38:08] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1014647 [00:38:09] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1014647 (owner: 10TrainBranchBot) [00:53:11] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host dbprov2005.codfw.wmnet with OS bullseye [00:53:47] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops, 13Patch-For-Review: Q3:rack/setup/install dbprov200[56] - https://phabricator.wikimedia.org/T355355#9664152 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host dbprov2005.codfw.wmnet with OS bullseye [00:59:40] (SystemdUnitFailed) firing: (2) wdqs-blazegraph.service on wdqs1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:03:13] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1014647 (owner: 10TrainBranchBot) [01:06:19] !log [WDQS] Restarted `wdqs-blazegraph` and `wdqs-updater` on `wdqs1013` and depooled to catch up on lag [01:06:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:06:46] !log T358882 Updated remote cluster seeds for new master state [01:06:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:06:50] T358882: Decommission elastic2037-2054 - https://phabricator.wikimedia.org/T358882 [01:07:44] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on dbprov2005.codfw.wmnet with reason: host reimage [01:09:25] (SystemdUnitFailed) resolved: (2) wdqs-blazegraph.service on wdqs1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:10:10] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dbprov2005.codfw.wmnet with reason: host reimage [01:31:42] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dbprov2005.codfw.wmnet with OS bullseye [01:31:51] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops, 13Patch-For-Review: Q3:rack/setup/install dbprov200[56] - https://phabricator.wikimedia.org/T355355#9664201 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host dbprov2005.codfw.wmnet with OS bullseye completed:... [01:35:23] !log ryankemper@cumin2002 START - Cookbook sre.elasticsearch.ban Banning hosts: elastic2037*,elastic2038*,elastic2041*,elastic2042*,elastic2045*,elastic2046*,elastic2047*,elastic2050*,elastic2051*,elastic2052*,elastic2039*,elastic2040*,elastic2043*,elastic2044*,elastic2048*,elastic2053*,elastic2054* for prepare for decom of hosts - ryankemper@cumin2002 - T358882 [01:35:28] T358882: Decommission elastic2037-2054 - https://phabricator.wikimedia.org/T358882 [01:35:28] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: elastic2037*,elastic2038*,elastic2041*,elastic2042*,elastic2045*,elastic2046*,elastic2047*,elastic2050*,elastic2051*,elastic2052*,elastic2039*,elastic2040*,elastic2043*,elastic2044*,elastic2048*,elastic2053*,elastic2054* for prepare for decom of hosts - ryankemper@cumin2002 - T358882 [01:37:49] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops, 13Patch-For-Review: Q3:rack/setup/install dbprov200[56] - https://phabricator.wikimedia.org/T355355#9664207 (10Papaul) [01:46:45] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops, 13Patch-For-Review: 14Q3:rack/setup/install dbprov200[56] - 14https://phabricator.wikimedia.org/T355355#9664214 (10Papaul) 05Open→03Resolved 14With the 2 SSD's back 
in the server, same issue. Doing more troubleshooting, I found out that when the... [01:49:12] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to analytics-privatedata-users for bblack - https://phabricator.wikimedia.org/T361046#9664219 (10BBlack) [02:37:20] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:39:39] (CirrusSearchNodeIndexingNotIncreasing) firing: Elasticsearch instance elastic2044-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [02:44:39] (CirrusSearchNodeIndexingNotIncreasing) firing: (4) Elasticsearch instance elastic2044-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [02:49:39] (CirrusSearchNodeIndexingNotIncreasing) firing: (7) Elasticsearch instance elastic2038-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [02:51:45] (SwiftTooManyMediaUploads) firing: Too many codfw mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - 
https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [02:54:39] (CirrusSearchNodeIndexingNotIncreasing) firing: (9) Elasticsearch instance elastic2038-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [02:59:39] (CirrusSearchNodeIndexingNotIncreasing) firing: (11) Elasticsearch instance elastic2038-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [03:02:20] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:04:39] (CirrusSearchNodeIndexingNotIncreasing) firing: (13) Elasticsearch instance elastic2038-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [03:17:16] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:19:39] 
(CirrusSearchNodeIndexingNotIncreasing) firing: (15) Elasticsearch instance elastic2038-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [03:20:42] (RoutinatorRsyncErrors) firing: (2) Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [03:21:45] (SwiftTooManyMediaUploads) resolved: Too many codfw mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [03:24:39] (CirrusSearchNodeIndexingNotIncreasing) firing: (16) Elasticsearch instance elastic2038-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [05:02:23] (03PS1) 10KartikMistry: Update MinT to 2024-03-26-120044-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1014729 (https://phabricator.wikimedia.org/T347930) [05:10:45] (SwiftTooManyMediaUploads) firing: Too many codfw mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [05:12:51] (SwaggerProbeHasFailures) firing: Not all openapi/swagger 
endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://proton.svc.eqiad.wmnet:4030 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [05:15:45] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [05:17:51] (SwaggerProbeHasFailures) resolved: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://proton.svc.eqiad.wmnet:4030 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [05:23:51] (SwaggerProbeHasFailures) firing: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://proton.svc.eqiad.wmnet:4030 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [05:28:51] (SwaggerProbeHasFailures) resolved: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://proton.svc.eqiad.wmnet:4030 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [05:55:00] (03CR) 10Fabfur: [C:03+2] depool esams for text cluster drive upgrade [dns] - 10https://gerrit.wikimedia.org/r/1014514 (https://phabricator.wikimedia.org/T360430) (owner: 10Ssingh) [05:57:46] !log running authdns-update on dns1004 to depool ESAMS (T360430) [05:57:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:57:50] T360430: esams text cp nvme upgrade - https://phabricator.wikimedia.org/T360430 [05:59:15] 
(MediaWikiLatencyExceeded) firing: p75 latency high: eqiad mw-parsoid (k8s) 804.4ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240327T0600) [06:00:52] 10ops-esams, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: esams text cp nvme upgrade - https://phabricator.wikimedia.org/T360430#9664389 (10Fabfur) ESAMS DC started depooling @05:58UTC [06:04:15] (MediaWikiLatencyExceeded) resolved: p75 latency high: eqiad mw-parsoid (k8s) 804.4ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [06:15:40] * kart_ updating MinT [06:17:02] I'll wait for some time though; is any deployment happening in the current window? 
[06:28:08] (03CR) 10KartikMistry: [C:03+2] Update MinT to 2024-03-26-120044-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1014729 (https://phabricator.wikimedia.org/T347930) (owner: 10KartikMistry) [06:30:47] (03Merged) 10jenkins-bot: Update MinT to 2024-03-26-120044-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1014729 (https://phabricator.wikimedia.org/T347930) (owner: 10KartikMistry) [06:32:52] !log kartik@deploy1002 helmfile [staging] START helmfile.d/services/machinetranslation: apply [06:38:32] !log kartik@deploy1002 helmfile [staging] DONE helmfile.d/services/machinetranslation: apply [06:48:55] !log kartik@deploy1002 helmfile [codfw] START helmfile.d/services/machinetranslation: apply [06:50:45] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [06:57:46] !log kartik@deploy1002 helmfile [codfw] DONE helmfile.d/services/machinetranslation: apply [07:00:06] !log kartik@deploy1002 helmfile [eqiad] START helmfile.d/services/machinetranslation: apply [07:01:15] (MediaWikiLatencyExceeded) firing: p75 latency high: eqiad mw-parsoid (k8s) 879.4ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [07:04:21] (PoolcounterFullQueues) firing: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [07:09:21] (PoolcounterFullQueues) 
resolved: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [07:09:24] !log kartik@deploy1002 helmfile [eqiad] DONE helmfile.d/services/machinetranslation: apply [07:11:02] !log Updated MinT to 2024-03-26-120044-production (T347930, T355304, T349487) [07:11:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:11:08] T347930: Odia Language Translation Number not translating - https://phabricator.wikimedia.org/T347930 [07:11:09] T355304: Enable Softcatalà models for more language pairs in MinT test instance - https://phabricator.wikimedia.org/T355304 [07:11:09] T349487: Improve MinT punctuation support for Japanese - https://phabricator.wikimedia.org/T349487 [07:14:08] !log fabfur@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on 8 hosts with reason: preparing for new disk [07:14:28] !log fabfur@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on 8 hosts with reason: preparing for new disk [07:14:52] 10ops-esams, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: esams text cp nvme upgrade - https://phabricator.wikimedia.org/T360430#9664436 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=e71791c7-a0fa-43b5-81ae-e92b275e5cc3) set by fabfur@cumin1002 for 1 day, 0:00:00 on 8 host(s) and... 
[07:17:16] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:17:20] (JobUnavailable) firing: Reduced availability for job pdu_sentry4 in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:19:25] (SystemdUnitFailed) firing: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:20:42] (RoutinatorRsyncErrors) firing: (2) Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [07:21:15] (MediaWikiLatencyExceeded) resolved: p75 latency high: eqiad mw-parsoid (k8s) 863.5ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [07:24:47] 10ops-esams, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: esams text cp nvme upgrade - https://phabricator.wikimedia.org/T360430#9664439 (10Fabfur) [07:24:54] (CirrusSearchNodeIndexingNotIncreasing) firing: (16) Elasticsearch instance elastic2038-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - 
https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [07:29:15] (MediaWikiLatencyExceeded) firing: p75 latency high: eqiad mw-parsoid (k8s) 893.8ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [07:41:51] (SwaggerProbeHasFailures) firing: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://proton.svc.eqiad.wmnet:4030 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [07:43:48] (03CR) 10Giuseppe Lavagetto: "Generally LGTM, I'm still on the fence if I'd prefer having something like /__replication/__ROOT__ as a special case for replicating the f" [software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/1007738 (https://phabricator.wikimedia.org/T358636) (owner: 10Scott French) [07:44:15] (MediaWikiLatencyExceeded) resolved: p75 latency high: eqiad mw-parsoid (k8s) 842.2ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [07:51:51] (SwaggerProbeHasFailures) resolved: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://proton.svc.eqiad.wmnet:4030 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - 
https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [08:00:05] Amir1 and Urbanecm: Time to do the UTC morning backport window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240327T0800). [08:00:05] gmodena: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [08:01:28] Amir1 urbanecm ping. I'll be around for the backport process [08:08:15] (MediaWikiLatencyExceeded) firing: p75 latency high: eqiad mw-parsoid (k8s) 821.2ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [08:08:51] (SwaggerProbeHasFailures) firing: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://proton.svc.eqiad.wmnet:4030 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [08:08:59] gmodena: I will deploy [08:09:05] hashar ack [08:09:22] hashar thanks! [08:09:43] the change https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/983905 is flagged as being in merge conflict [08:10:00] so I guess something else changed in wmf-config/ext-EventStreamConfig.php ? :) [08:10:54] mmm... I don't see a merge conflict warning in gerrit (ui) [08:11:03] hmm [08:11:45] I definitely had the big reddish {Merge Conflict} on the top left [08:12:36] hashar ack. 
I rebased on master (via gerrit UI), and got no conflict error [08:13:15] (MediaWikiLatencyExceeded) resolved: p75 latency high: eqiad mw-parsoid (k8s) 821.2ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [08:13:21] * 71c351425f - Update mediawiki.web_ui_actions stream config (19 hours ago) | [08:13:21] | wmf-config/ext-EventStreamConfig.php | 6 +----- [08:13:22] so yeah [08:13:28] definitely had another patch doing something [08:15:16] gmodena: I imagine that can't really be tested can it? [08:15:56] hashar the patch can be tested post merge, by calling a mw action api endpoint (from mwdebug) [08:16:47] !log hashar@deploy1002 Started scap: Backport for [[gerrit:983905|Add webrequest.frontend.rc0 stream (T314956 T351117)]] [08:16:53] T314956: [Event Platform] Declare webrequest as an Event Platform stream - https://phabricator.wikimedia.org/T314956 [08:16:53] T351117: Move analytics log from Varnish to HAProxy - https://phabricator.wikimedia.org/T351117 [08:17:11] wikibugs is lagged ;) [08:17:15] (MediaWikiLatencyExceeded) firing: p75 latency high: eqiad mw-parsoid (k8s) 839.8ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [08:18:36] (GatewayBackendErrorsHigh) firing: rest-gateway: elevated 5xx errors from proton_cluster in eqiad #page - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - 
https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=rest-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsHigh [08:19:02] <_joe_> sigh [08:19:02] gmodena: that is being pushed to the test servers [08:19:05] <_joe_> proton [08:19:20] <_joe_> hashar: what are you deploying rn? [08:19:36] jouncebot: now [08:19:36] For the next 0 hour(s) and 40 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240327T0800) [08:19:37] :) [08:19:46] <_joe_> hashar: no I mean what patch [08:19:53] <_joe_> there are pages firing [08:20:15] <_joe_> so I'd like to know what has just been deployed [08:20:17] it is not even live [08:20:20] <_joe_> ah the event platform one? [08:20:23] <_joe_> ok [08:20:24] the patch is https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/983905 [08:20:30] <_joe_> just wanted to make sure I could exclude it [08:20:32] <_joe_> !incidents [08:20:33] !log hashar@deploy1002 otto and hashar: Backport for [[gerrit:983905|Add webrequest.frontend.rc0 stream (T314956 T351117)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [08:20:33] 4549 (UNACKED) GatewayBackendErrorsHigh sre (proton_cluster rest-gateway eqiad) [08:20:40] (KubernetesAPINotScrapable) firing: (2) k8s-aux@eqiad is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [08:20:41] <_joe_> !ack 4549 [08:20:41] 4549 (ACKED) GatewayBackendErrorsHigh sre (proton_cluster rest-gateway eqiad) [08:20:42] doesn't mean that the scap deploy does not have a side effect on whatever is currently happening [08:21:01] TIL of strenbot... [08:21:12] that tracks pages? 
[08:21:25] gmodena: your patch is on the mwdebug servers if you wanna test it [08:21:45] (SwiftTooManyMediaUploads) firing: Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [08:22:15] (MediaWikiLatencyExceeded) resolved: p75 latency high: eqiad mw-parsoid (k8s) 839.8ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [08:23:15] (MediaWikiLatencyExceeded) firing: p75 latency high: eqiad mw-parsoid (k8s) 801.1ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [08:23:19] hashar tested on mwdebug1001. LGTM. [08:23:30] \o/ [08:25:45] !log hashar@deploy1002 otto and hashar: Continuing with sync [08:25:52] a I forgot to press y [08:25:55] or well I did [08:25:57] hmm [08:26:15] * hashar ignores the glitch [08:27:30] (MediaWikiLatencyExceeded) resolved: p75 latency high: eqiad mw-parsoid (k8s) 807.9ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [08:34:03] hashar I see the patch is live in prod. 
[08:34:08] hashar thanks for your help [08:36:46] it is still more or less deploying ;) [08:37:47] !log hashar@deploy1002 Finished scap: Backport for [[gerrit:983905|Add webrequest.frontend.rc0 stream (T314956 T351117)]] (duration: 20m 59s) [08:37:52] done! [08:37:52] T314956: [Event Platform] Declare webrequest as an Event Platform stream - https://phabricator.wikimedia.org/T314956 [08:37:52] T351117: Move analytics log from Varnish to HAProxy - https://phabricator.wikimedia.org/T351117 [08:38:09] !log UTC morning backport window completed [08:38:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:39:37] I am going to upgrade Jenkins [08:40:15] well [08:42:18] I can't right now so that will be for later :] [08:48:43] !log hashar@deploy1002 Started deploy [releng/jenkins-deploy@b3ccf85] (releasing): Upgrade Jenkins from 2.426.3 to 2.440.2 on release hosts # T360759 [08:48:46] (GatewayBackendErrorsHigh) resolved: rest-gateway: elevated 5xx errors from proton_cluster in eqiad #page - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=eqiad%20prometheus/k8s&var-instance=rest-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsHigh [08:48:47] T360759: Jenkins core security advisory - 2024-03-20 - https://phabricator.wikimedia.org/T360759 [08:48:56] (SwaggerProbeHasFailures) resolved: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://proton.svc.eqiad.wmnet:4030 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [08:48:56] <_joe_> I love well-executed plans [08:49:13] yeah it is messy [08:49:21] I can upgrade one but not the other :) [08:49:36] it is not urgent anyway [08:51:45] (SwiftTooManyMediaUploads) resolved: Too many eqiad 
mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [08:51:50] (ProbeDown) firing: Service releases1003:443 has failed probes (http_releases_jenkins_wikimedia_org_login_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#releases1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:54:34] !log hashar@deploy1002 Finished deploy [releng/jenkins-deploy@b3ccf85] (releasing): Upgrade Jenkins from 2.426.3 to 2.440.2 on release hosts # T360759 (duration: 05m 51s) [08:54:39] T360759: Jenkins core security advisory - 2024-03-20 - https://phabricator.wikimedia.org/T360759 [08:56:50] (ProbeDown) firing: (3) Service releases1003:443 has failed probes (http_releases_jenkins_wikimedia_org_login_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:05:19] ^ yes it is broken [09:07:07] !log Downgraded release Jenkins back to 2.426.3 [09:07:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:11:50] (ProbeDown) resolved: (3) Service releases1003:443 has failed probes (http_releases_jenkins_wikimedia_org_login_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:12:43] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2119.codfw.wmnet with reason: provisionning db2219.codfw.wmnet - T355422 [09:12:47] T355422: Productionize db2196-db2220 - https://phabricator.wikimedia.org/T355422 [09:12:57] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 1 day, 0:00:00 on db2119.codfw.wmnet with reason: provisionning db2219.codfw.wmnet - T355422 [09:13:00] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2219.codfw.wmnet with reason: provisionning db2219.codfw.wmnet - T355422 [09:13:14] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2219.codfw.wmnet with reason: provisionning db2219.codfw.wmnet - T355422 [09:14:45] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Cloning db2119 in db2219 for T355422', diff saved to https://phabricator.wikimedia.org/P58937 and previous config saved to /var/cache/conftool/dbconfig/20240327-091444-arnaudb.json [09:15:45] !log arnaudb@cumin1002 START - Cookbook sre.mysql.clone Will create a clone of db2119.codfw.wmnet onto db2219.codfw.wmnet [09:19:21] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2120.codfw.wmnet with reason: provisionning db2220.codfw.wmnet - T355422 [09:19:25] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2120.codfw.wmnet with reason: provisionning db2220.codfw.wmnet - T355422 [09:19:26] T355422: Productionize db2196-db2220 - https://phabricator.wikimedia.org/T355422 [09:19:28] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2220.codfw.wmnet with reason: provisionning db2220.codfw.wmnet - T355422 [09:19:31] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2220.codfw.wmnet with reason: provisionning db2220.codfw.wmnet - T355422 [09:20:31] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Cloning db2120 in db2220 for T355422', diff saved to https://phabricator.wikimedia.org/P58938 and previous config saved to /var/cache/conftool/dbconfig/20240327-092030-arnaudb.json [09:22:36] !log arnaudb@cumin1002 START - Cookbook sre.mysql.clone Will create a clone of db2120.codfw.wmnet onto 
db2220.codfw.wmnet [09:33:14] hashar: I just tried to deploy the Jenkins upgrade to the scap3-dev environment and it worked fine, I can also see the error has something to do with the CasC configuration, it doesn't seem related to matrix-auth as mentioned here AFAICS: https://phabricator.wikimedia.org/T361084 [09:33:29] how did you deploy the repo? did you use the `deploy.sh` script? [09:45:47] !log poweroff A:esams and A:cp-text [09:45:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:49:58] (ProbeDown) firing: (8) Service text-https:443 has failed probes (http_text-https_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:50:08] oh I see [09:50:16] <_joe_> sukhe: ahem [09:50:18] <_joe_> :D [09:50:19] sorry folks, we should downtime this [09:50:23] <_joe_> it's ok [09:50:24] no impact, right? [09:50:26] <_joe_> !incidents [09:50:26] 4550 (UNACKED) [8x] ProbeDown sre (probes/service esams) [09:50:26] 4549 (RESOLVED) GatewayBackendErrorsHigh sre (proton_cluster rest-gateway eqiad) [09:50:30] no, expected [09:50:31] <_joe_> !ack 4550 [09:50:32] 4550 (ACKED) [8x] ProbeDown sre (probes/service esams) [09:50:51] let me check if we are missing something else as well on the downtiming [09:52:20] (JobUnavailable) firing: (2) Reduced availability for job pdu_sentry4 in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:52:25] (SystemdUnitFailed) firing: ceph-0fee72ae-ec18-11ee-b973-bc97e1bb7c18@mgr.moss-be1002.tpqaym.service on moss-be1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:55:38] jouncebot: 
nowandnext [09:55:38] No deployments scheduled for the next 1 hour(s) and 4 minute(s) [09:55:38] In 1 hour(s) and 4 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240327T1100) [09:56:18] hashar: I can't see a reason why the Jenkins release deploy failed and I can't reproduce it in the dev environment, I'm going to try deploying again [09:56:53] I have made some comments in the subtask [09:56:56] but yeah that is a mystery :( [09:58:22] hashar: how did you deploy the upgrade? [09:58:44] went to the deployment server and I ran the shell helper script at the root [09:59:23] cd /srv/deployment/releng/jenkins-deploy/ [09:59:25] ./deploy.sh [10:00:06] ok, I'm going to try that same thing again, I have changed the logger so we should get more info if something explodes: https://gitlab.wikimedia.org/repos/releng/jenkins-deploy/-/merge_requests/55 [10:01:09] (03PS1) 10Slyngshede: Usability improvements for SSH key management. [software/bitu] - 10https://gerrit.wikimedia.org/r/1014992 (https://phabricator.wikimedia.org/T359536) [10:01:13] !log jnuche@deploy1002 Started deploy [releng/jenkins-deploy@661e531] (releasing): (no justification provided) [10:01:17] (03CR) 10Giuseppe Lavagetto: [C:03+1] Add support for an optional ignored-keys pattern [software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/1008944 (owner: 10Scott French) [10:01:21] (03PS12) 10Gmodena: Add webrequest.frontend.rc0 stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/983905 (https://phabricator.wikimedia.org/T314956) (owner: 10Ottomata) [10:01:33] (03CR) 10TrainBranchBot: [C:03+2] "Approved by hashar@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/983905 (https://phabricator.wikimedia.org/T314956) (owner: 10Ottomata) [10:01:43] (03PS1) 10NMW03: Restrict local uploads to uploader user group in azwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1014648 [10:01:53] (03Merged)
10jenkins-bot: Add webrequest.frontend.rc0 stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/983905 (https://phabricator.wikimedia.org/T314956) (owner: 10Ottomata) [10:02:06] ok, I can see stacktraces now [10:02:25] (SystemdUnitFailed) resolved: ceph-0fee72ae-ec18-11ee-b973-bc97e1bb7c18@mgr.moss-be1002.tpqaym.service on moss-be1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:02:35] !log jnuche@deploy1002 deploy aborted: (no justification provided) (duration: 01m 21s) [10:02:58] to roll back I amended the deploy script on the deploy1002 to comment out `apt install jenkins` [10:03:11] and installed the previous deb I have found in /var/cache/apt/archive [10:03:13] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: 14Requesting access to analytics-privatedata-users for mpham - 14https://phabricator.wikimedia.org/T360641#9664499 (10jcrespo) →14Duplicate dup:03T270438 [10:03:17] 06SRE, 10LDAP-Access-Requests: 14LDAP access to the wmf group for Mike Pham - 14https://phabricator.wikimedia.org/T270438#9664496 (10jcrespo) [10:03:49] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to analytics-privatedata-users for bblack - https://phabricator.wikimedia.org/T361046#9664505 (10jcrespo) Pending approval from #data-engineering 's list of people that can approve that access: @odimitrijevic @Milimetric @WDoranWMF or @Ahoelzl. 
[10:03:53] (03CR) 10Jcrespo: [C:03+1] "Ok for the patch, but requires approval from Data Engineering: https://phabricator.wikimedia.org/T361046#9664504" [puppet] - 10https://gerrit.wikimedia.org/r/1014590 (https://phabricator.wikimedia.org/T361046) (owner: 10BBlack) [10:04:05] (03PS4) 10DCausse: wdqs: add x-monitoring-query [puppet] - 10https://gerrit.wikimedia.org/r/1014551 (https://phabricator.wikimedia.org/T360993) [10:04:09] (03PS2) 10DCausse: updateQueryServiceLag: tune the min query rate on a pooled server [puppet] - 10https://gerrit.wikimedia.org/r/1014584 (https://phabricator.wikimedia.org/T360993) [10:04:22] (03PS1) 10Arnaudb: mariadb: toggle notifications for db2216 [puppet] - 10https://gerrit.wikimedia.org/r/1014649 (https://phabricator.wikimedia.org/T355422) [10:04:22] !log powercycling backup1005 [10:04:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:04:34] (ProbeDown) firing: (3) Service releases1003:443 has failed probes (http_releases_jenkins_wikimedia_org_login_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:05:14] hashar: ok, think I need to do that, a simple install to previous version didn't work [10:05:34] (03CR) 10ArielGlenn: MachineVision extension is being sunsetted, so stop doing dumps (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1013368 (https://phabricator.wikimedia.org/T347967) (owner: 10Cparle) [10:05:52] 06SRE, 10SRE-Access-Requests, 06Fundraising-Backlog: Can we please add our vendor to Google Postmaster Tools - https://phabricator.wikimedia.org/T360907#9664538 (10jcrespo) @DBu-WMF Hi, we are discussing how to proceed, as handling postmaster access is a new process for us. The first question is **what dom... 
[10:07:31] (03PS4) 10AOkoth: trafficserver: miscweb(security) failover to kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/1014044 (https://phabricator.wikimedia.org/T350796) [10:07:35] (03PS2) 10TheDJ: Restrict local uploads to uploader user group in azwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1014648 (owner: 10NMW03) [10:07:44] mmmh, can't find the previous package, damn [10:07:51] (03PS3) 10TheDJ: Restrict local uploads to uploader user group in azwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1014648 (https://phabricator.wikimedia.org/T360847) (owner: 10NMW03) [10:08:15] (03CR) 10AOkoth: [C:03+2] trafficserver: miscweb(security) failover to kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/1014044 (https://phabricator.wikimedia.org/T350796) (owner: 10AOkoth) [10:08:37] jnuche: sudo dpkg -i /var/cache/apt/archives/jenkins_2.426.3_all.deb [10:09:25] the thing is it logs Failed ConfigurationAsCode.init [10:09:34] but that does not give any further information :/ [10:09:35] plural archive... thanks [10:09:43] !log brouberol@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [10:09:59] yeah, that's why the change I posted above, so we get more log info [10:10:10] !log brouberol@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [10:10:24] ah I missed that in the log spam [10:10:58] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [10:11:12] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [10:12:40] !log jnuche@deploy1002 Started deploy [releng/jenkins-deploy@661e531] (releasing): (no justification provided) [10:13:12] (03CR) 10Stevemunene: [C:03+1] Decommission an-tool1010.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/1014432 (https://phabricator.wikimedia.org/T353782) (owner: 10Brouberol) [10:13:48] (03CR) 10Stevemunene: "lgtm!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1014533 (https://phabricator.wikimedia.org/T353774) (owner: 10Brouberol) [10:14:14] (03CR) 10Stevemunene: [C:03+1] ats: drop mapping rule redirecting to hue.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1014550 (https://phabricator.wikimedia.org/T341895) (owner: 10Brouberol) [10:14:18] (03CR) 10Volans: [C:03+1] "LGTM, thanks for the patch!" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1014099 (https://phabricator.wikimedia.org/T360932) (owner: 10Majavah) [10:14:26] !log jnuche@deploy1002 Finished deploy [releng/jenkins-deploy@661e531] (releasing): (no justification provided) (duration: 01m 46s) [10:14:55] (03PS1) 10Arnaudb: mariadb: add comment to ensure reimage of db2202 [puppet] - 10https://gerrit.wikimedia.org/r/1014650 (https://phabricator.wikimedia.org/T355422) [10:15:19] 10ops-esams, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: esams text cp nvme upgrade - https://phabricator.wikimedia.org/T360430#9664707 (10Fabfur) [10:15:31] bleh, I removed sudo apt-get install -y jenkins from the installer but it still upgraded the Jenkins version [10:15:36] let's see [10:16:03] (03CR) 10Klausman: [C:03+2] admin_ng: Add network policy to allow LW isvcs to access ML Cassandra [deployment-charts] - 10https://gerrit.wikimedia.org/r/1012668 (https://phabricator.wikimedia.org/T360428) (owner: 10Klausman) [10:16:07] (03CR) 10Phuedx: Update the WikiLambda instrumentation to use core interaction events (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/992223 (https://phabricator.wikimedia.org/T350497) (owner: 10Santiago Faci) [10:16:11] (03PS7) 10Phuedx: Update the WikiLambda instrumentation to use core interaction events [mediawiki-config] - 10https://gerrit.wikimedia.org/r/992223 (https://phabricator.wikimedia.org/T350497) (owner: 10Santiago Faci) [10:16:37] (03PS8) 10Majavah: alertmanager: Add support for multiple alertmanager instances [software/spicerack] - 
10https://gerrit.wikimedia.org/r/1014099 (https://phabricator.wikimedia.org/T360932) [10:16:50] (03Merged) 10jenkins-bot: admin_ng: Add network policy to allow LW isvcs to access ML Cassandra [deployment-charts] - 10https://gerrit.wikimedia.org/r/1012668 (https://phabricator.wikimedia.org/T360428) (owner: 10Klausman) [10:17:08] (03CR) 10Btullis: [C:03+1] Decommission an-tool1010.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/1014432 (https://phabricator.wikimedia.org/T353782) (owner: 10Brouberol) [10:18:06] 10ops-eqiad, 10Data-Persistence-Backup, 06DC-Ops, 10media-backups: backup1005 crashed - https://phabricator.wikimedia.org/T361087#9664751 (10jcrespo) It seems the RAID controller has gone haywire, as there is no bootable medium, and it is stuck in an endless network boot. The RAID controllers has been mapp... [10:18:12] 10ops-eqiad, 10Data-Persistence-Backup, 06DC-Ops, 10media-backups: backup1005 crashed - https://phabricator.wikimedia.org/T361087#9664753 (10jcrespo) [10:18:20] (03CR) 10Majavah: [C:03+2] "Thanks! I'll send a Puppet patch next to add the required config." [software/spicerack] - 10https://gerrit.wikimedia.org/r/1014099 (https://phabricator.wikimedia.org/T360932) (owner: 10Majavah) [10:18:31] azehaze [10:18:51] No signature of method: permissions() is applicable for argument types: [10:18:51] (java.util.ArrayList) values: [[USER:Job/Read:anonymous, USER:Job/Read:jenkinsrelapi, ...]] [10:19:00] which leads to conf/releasing/casc/jobs/docpub.groovy [10:19:01] !log brouberol@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [10:19:18] !log brouberol@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. 
[10:19:21] and some properties.authorizationMatrix() [10:19:25] (03PS1) 10Majavah: P:spicerack: add alertmanager instance config [puppet] - 10https://gerrit.wikimedia.org/r/1015000 (https://phabricator.wikimedia.org/T360932) [10:19:45] !log klausman@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. [10:19:58] (03CR) 10Jcrespo: [C:03+1] mariadb: add comment to ensure reimage of db2202 [puppet] - 10https://gerrit.wikimedia.org/r/1014650 (https://phabricator.wikimedia.org/T355422) (owner: 10Arnaudb) [10:20:31] !log klausman@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. [10:20:43] (03CR) 10Arnaudb: [C:03+2] mariadb: add comment to ensure reimage of db2202 [puppet] - 10https://gerrit.wikimedia.org/r/1014650 (https://phabricator.wikimedia.org/T355422) (owner: 10Arnaudb) [10:21:10] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1743/co" [puppet] - 10https://gerrit.wikimedia.org/r/1015000 (https://phabricator.wikimedia.org/T360932) (owner: 10Majavah) [10:23:15] (03Merged) 10jenkins-bot: alertmanager: Add support for multiple alertmanager instances [software/spicerack] - 10https://gerrit.wikimedia.org/r/1014099 (https://phabricator.wikimedia.org/T360932) (owner: 10Majavah) [10:23:17] !log brouberol@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'. [10:23:20] hashar: yeah, just saw that [10:23:33] !log brouberol@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. 
[10:23:38] (03CR) 10Volans: [C:03+1] "LGTM, can be merged anytime, so the config will be there for when spicerack will be released" [puppet] - 10https://gerrit.wikimedia.org/r/1015000 (https://phabricator.wikimedia.org/T360932) (owner: 10Majavah) [10:23:41] but I don't get why the revert didn't fix it, no plugin version has changed [10:24:24] (03CR) 10Btullis: Decommission an-coord100[12] (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1014533 (https://phabricator.wikimedia.org/T353774) (owner: 10Brouberol) [10:24:38] also, I commented out `sudo apt-get install -y jenkins` in `scap/scripts/update_jenkins.sh` but the deploy still upgraded Jenkins, I don't see where else we can be upgrading the package [10:24:44] (03CR) 10Btullis: [C:03+1] ats: drop mapping rule redirecting to hue.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1014550 (https://phabricator.wikimedia.org/T341895) (owner: 10Brouberol) [10:24:48] (03CR) 10Majavah: [V:03+1 C:03+2] P:spicerack: add alertmanager instance config [puppet] - 10https://gerrit.wikimedia.org/r/1015000 (https://phabricator.wikimedia.org/T360932) (owner: 10Majavah) [10:24:56] (03CR) 10Btullis: [C:03+1] cache: remove caching config for hue.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1014553 (https://phabricator.wikimedia.org/T341895) (owner: 10Brouberol) [10:25:05] !log brouberol@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'. [10:25:15] (03CR) 10Btullis: [C:03+1] cumin: remove hue alias [puppet] - 10https://gerrit.wikimedia.org/r/1014554 (https://phabricator.wikimedia.org/T341895) (owner: 10Brouberol) [10:25:28] (03CR) 10Btullis: [C:03+1] site: change an-tool1009 role back to insetup [puppet] - 10https://gerrit.wikimedia.org/r/1014555 (https://phabricator.wikimedia.org/T341895) (owner: 10Brouberol) [10:25:29] !log brouberol@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'.
[10:25:40] (03CR) 10Btullis: [C:03+1] idp: drop hue client configuration [puppet] - 10https://gerrit.wikimedia.org/r/1014556 (https://phabricator.wikimedia.org/T341895) (owner: 10Brouberol) [10:26:14] (03CR) 10JMeybohm: Migrate datahub to use external-services for CAS IDP (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1014065 (https://phabricator.wikimedia.org/T331894) (owner: 10Btullis) [10:26:35] (03CR) 10Brouberol: [C:03+2] Decommission an-tool1010.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/1014432 (https://phabricator.wikimedia.org/T353782) (owner: 10Brouberol) [10:27:01] (03PS6) 10Brouberol: Decommission an-coord100[12] [puppet] - 10https://gerrit.wikimedia.org/r/1014533 (https://phabricator.wikimedia.org/T353774) [10:27:10] (03PS1) 10Klausman: Revert "admin_ng: Add network policy to allow LW isvcs to access ML Cassandra" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1014710 [10:28:04] (03CR) 10Brouberol: [C:03+2] ats: drop mapping rule redirecting to hue.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1014550 (https://phabricator.wikimedia.org/T341895) (owner: 10Brouberol) [10:28:44] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on releases1003.eqiad.wmnet with reason: Troubleshooting jenkins update [10:28:58] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on releases1003.eqiad.wmnet with reason: Troubleshooting jenkins update [10:30:41] (03CR) 10Volans: [C:03+1] "Sorry spot an issue" [puppet] - 10https://gerrit.wikimedia.org/r/1015000 (https://phabricator.wikimedia.org/T360932) (owner: 10Majavah) [10:31:08] !log jnuche@deploy1002 Started deploy [releng/jenkins-deploy@661e531] (releasing): (no justification provided) [10:31:48] !log jnuche@deploy1002 Finished deploy [releng/jenkins-deploy@661e531] (releasing): (no justification provided) (duration: 00m 40s) [10:32:16] (03PS1) 10Majavah: hieradata: spicerack: use plaintext HTTP for alertmanager 
[puppet] - 10https://gerrit.wikimedia.org/r/1015002 (https://phabricator.wikimedia.org/T360932) [10:32:52] (03CR) 10Brouberol: [C:03+2] cache: remove caching config for hue.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1014553 (https://phabricator.wikimedia.org/T341895) (owner: 10Brouberol) [10:33:06] (03CR) 10Brouberol: [C:03+2] cumin: remove hue alias [puppet] - 10https://gerrit.wikimedia.org/r/1014554 (https://phabricator.wikimedia.org/T341895) (owner: 10Brouberol) [10:33:15] (03CR) 10Majavah: [V:03+1 C:03+2] P:spicerack: add alertmanager instance config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1015000 (https://phabricator.wikimedia.org/T360932) (owner: 10Majavah) [10:33:16] (03CR) 10Btullis: "nit: It says aqs in the commit message, instead of hue." [puppet] - 10https://gerrit.wikimedia.org/r/1014557 (https://phabricator.wikimedia.org/T341895) (owner: 10Brouberol) [10:33:18] ok, managed to get Jenkins back... my bad for not verifying that the revert was easy [10:33:23] thanks hashar for the tips [10:33:35] and I need to find out why the exception doesn't reproduce in the dev environment [10:33:48] (03CR) 10Volans: [C:03+1] "ship it :)" [puppet] - 10https://gerrit.wikimedia.org/r/1015002 (https://phabricator.wikimedia.org/T360932) (owner: 10Majavah) [10:34:03] 10ops-esams, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: esams text cp nvme upgrade - https://phabricator.wikimedia.org/T360430#9664786 (10RobH) ESAMS remote hands began hands on work at 11:10 CET and it is now ongoing. [10:34:20] (03PS2) 10Brouberol: hue: remove manifests and configuration [puppet] - 10https://gerrit.wikimedia.org/r/1014557 (https://phabricator.wikimedia.org/T341895) [10:34:38] jnuche: let's move to -releng 'cause this channel has way too many bots nowadays :) [10:35:09] (03CR) 10Brouberol: "Fixed, thank you!"
[puppet] - 10https://gerrit.wikimedia.org/r/1014557 (https://phabricator.wikimedia.org/T341895) (owner: 10Brouberol) [10:35:26] (03PS3) 10Brouberol: hue: remove manifests and configuration [puppet] - 10https://gerrit.wikimedia.org/r/1014557 (https://phabricator.wikimedia.org/T341895) [10:36:12] (03CR) 10Brouberol: [C:03+2] site: change an-tool1009 role back to insetup [puppet] - 10https://gerrit.wikimedia.org/r/1014555 (https://phabricator.wikimedia.org/T341895) (owner: 10Brouberol) [10:36:23] (03CR) 10Brouberol: [C:03+2] idp: drop hue client configuration [puppet] - 10https://gerrit.wikimedia.org/r/1014556 (https://phabricator.wikimedia.org/T341895) (owner: 10Brouberol) [10:36:37] (03PS2) 10Brouberol: idp: drop hue client configuration [puppet] - 10https://gerrit.wikimedia.org/r/1014556 (https://phabricator.wikimedia.org/T341895) [10:36:43] (03CR) 10JMeybohm: [C:03+1] role::docker_registry_ha::registry: set nginx's tmpfs size in codfw [puppet] - 10https://gerrit.wikimedia.org/r/1014534 (https://phabricator.wikimedia.org/T360637) (owner: 10Elukey) [10:37:33] (03CR) 10Klausman: [C:03+2] Revert "admin_ng: Add network policy to allow LW isvcs to access ML Cassandra" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1014710 (owner: 10Klausman) [10:40:49] (03Merged) 10jenkins-bot: Revert "admin_ng: Add network policy to allow LW isvcs to access ML Cassandra" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1014710 (owner: 10Klausman) [10:42:57] (03PS1) 10David Caro: karma: skip wmcloud from default silences [puppet] - 10https://gerrit.wikimedia.org/r/1015003 (https://phabricator.wikimedia.org/T320973) [10:49:20] (03CR) 10Filippo Giunchedi: [C:03+1] icinga: add cdobbins [puppet] - 10https://gerrit.wikimedia.org/r/1014589 (owner: 10CDobbins) [10:49:43] (03CR) 10Filippo Giunchedi: [C:03+1] alert: Update hiera entries for alert2001 to use Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1003527 (https://phabricator.wikimedia.org/T358506) (owner: 10Andrea 
Denisse) [10:49:50] (03CR) 10Filippo Giunchedi: [C:03+1] alert: Update hiera entries for alert1001 to use Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1003531 (https://phabricator.wikimedia.org/T358506) (owner: 10Andrea Denisse) [10:51:44] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) Will create a clone of db2120.codfw.wmnet onto db2220.codfw.wmnet [10:52:21] (03PS1) 10Btullis: Make datahub networkpolicy include/template consistent [deployment-charts] - 10https://gerrit.wikimedia.org/r/1014652 (https://phabricator.wikimedia.org/T359423) [10:52:37] (03CR) 10Majavah: [C:03+2] hieradata: spicerack: use plaintext HTTP for alertmanager [puppet] - 10https://gerrit.wikimedia.org/r/1015002 (https://phabricator.wikimedia.org/T360932) (owner: 10Majavah) [10:53:53] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2120 (re)pooling @ 25%: Post clone (src)', diff saved to https://phabricator.wikimedia.org/P58940 and previous config saved to /var/cache/conftool/dbconfig/20240327-105353-arnaudb.json [10:53:53] (03PS7) 10Jcrespo: mediabackups: Add newly setup storage host backup2011 [puppet] - 10https://gerrit.wikimedia.org/r/995189 (https://phabricator.wikimedia.org/T334069) [10:54:22] (03PS58) 10AOkoth: prometheus: puppetise sql_exporter [puppet] - 10https://gerrit.wikimedia.org/r/945872 (https://phabricator.wikimedia.org/T310822) [10:54:22] (03PS1) 10AOkoth: miscweb: remove profile::microsites::security [puppet] - 10https://gerrit.wikimedia.org/r/1015005 (https://phabricator.wikimedia.org/T350796) [10:54:23] (03CR) 10JMeybohm: "Needs a version bump" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1014652 (https://phabricator.wikimedia.org/T359423) (owner: 10Btullis) [10:54:36] (03CR) 10David Caro: [V:03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1745/console" [puppet] - 10https://gerrit.wikimedia.org/r/1015003 (https://phabricator.wikimedia.org/T320973) 
(owner: 10David Caro) [10:54:42] (03PS2) 10AOkoth: miscweb: remove profile::microsites::security [puppet] - 10https://gerrit.wikimedia.org/r/1015005 (https://phabricator.wikimedia.org/T350796) [10:54:49] (03CR) 10Jcrespo: [C:03+2] mediabackups: Add newly setup storage host backup2011 [puppet] - 10https://gerrit.wikimedia.org/r/995189 (https://phabricator.wikimedia.org/T334069) (owner: 10Jcrespo) [10:55:10] (03PS2) 10Btullis: Make datahub networkpolicy include/template consistent [deployment-charts] - 10https://gerrit.wikimedia.org/r/1014652 (https://phabricator.wikimedia.org/T359423) [10:57:09] (03CR) 10David Caro: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1746/co" [puppet] - 10https://gerrit.wikimedia.org/r/1015003 (https://phabricator.wikimedia.org/T320973) (owner: 10David Caro) [10:58:09] 10ops-eqiad, 06SRE: Degraded RAID on centrallog1002 - https://phabricator.wikimedia.org/T360862#9664844 (10fgiunchedi) Also cc @VRiley-WMF if you could help with this? thank you! [10:58:39] (03PS1) 10Klausman: deployment_server: Add external service block for Cassandra/ml-cache [puppet] - 10https://gerrit.wikimedia.org/r/1015006 (https://phabricator.wikimedia.org/T360428) [10:59:49] (03CR) 10Klausman: [C:03+1] role::docker_registry_ha::registry: set nginx's tmpfs size in codfw [puppet] - 10https://gerrit.wikimedia.org/r/1014534 (https://phabricator.wikimedia.org/T360637) (owner: 10Elukey) [11:00:04] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240327T1100) [11:00:05] elukey: A patch you scheduled for MediaWiki infrastructure (UTC mid-day) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[11:00:20] (03PS2) 10David Caro: karma: skip wmcloud from default silences [puppet] - 10https://gerrit.wikimedia.org/r/1015003 (https://phabricator.wikimedia.org/T320973) [11:00:46] (03PS1) 10Brouberol: external-services: enable in all ml k8s clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/1015008 (https://phabricator.wikimedia.org/T360428) [11:01:50] (03PS2) 10Klausman: deployment_server: Add external service block for Cassandra/ml-cache [puppet] - 10https://gerrit.wikimedia.org/r/1015006 (https://phabricator.wikimedia.org/T360428) [11:02:18] (03CR) 10Klausman: [C:03+1] external-services: enable in all ml k8s clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/1015008 (https://phabricator.wikimedia.org/T360428) (owner: 10Brouberol) [11:02:49] (03CR) 10Filippo Giunchedi: [C:03+1] karma: skip wmcloud from default silences [puppet] - 10https://gerrit.wikimedia.org/r/1015003 (https://phabricator.wikimedia.org/T320973) (owner: 10David Caro) [11:03:35] o/ [11:04:39] (03CR) 10Brouberol: [C:03+2] external-services: enable in all ml k8s clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/1015008 (https://phabricator.wikimedia.org/T360428) (owner: 10Brouberol) [11:06:26] (03PS3) 10Majavah: Fixes for Puppet certificate cleaning on Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1013240 [11:06:59] (03CR) 10Klausman: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1747/co" [puppet] - 10https://gerrit.wikimedia.org/r/1015006 (https://phabricator.wikimedia.org/T360428) (owner: 10Klausman) [11:07:51] (03CR) 10Brouberol: [C:03+1] deployment_server: Add external service block for Cassandra/ml-cache [puppet] - 10https://gerrit.wikimedia.org/r/1015006 (https://phabricator.wikimedia.org/T360428) (owner: 10Klausman) [11:08:18] (03CR) 10Klausman: [V:03+1 C:03+2] deployment_server: Add external service block for Cassandra/ml-cache [puppet] - 
10https://gerrit.wikimedia.org/r/1015006 (https://phabricator.wikimedia.org/T360428) (owner: 10Klausman) [11:08:46] (03PS1) 10Cparle: MachineVision being sunsetted - remove dumps scripts [puppet] - 10https://gerrit.wikimedia.org/r/1015009 (https://phabricator.wikimedia.org/T347967) [11:09:00] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2120 (re)pooling @ 50%: Post clone (src)', diff saved to https://phabricator.wikimedia.org/P58941 and previous config saved to /var/cache/conftool/dbconfig/20240327-110858-arnaudb.json [11:09:14] (03CR) 10CI reject: [V:04-1] Fixes for Puppet certificate cleaning on Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1013240 (owner: 10Majavah) [11:09:44] Hey folks! I am going to run a maintenance for the Docker Registry nodes [11:10:00] if you need to deploy Mediawiki or a K8s service please ping me :) [11:10:12] (03PS4) 10Majavah: Fixes for Puppet certificate cleaning on Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1013240 [11:10:55] !log elukey@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on registry2003.codfw.wmnet with reason: Increase tmpfs for nginx [11:11:09] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on registry2003.codfw.wmnet with reason: Increase tmpfs for nginx [11:11:19] !log elukey@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on registry2004.codfw.wmnet with reason: Increase tmpfs for nginx [11:11:33] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on registry2004.codfw.wmnet with reason: Increase tmpfs for nginx [11:12:53] !log elukey@puppetmaster1001 conftool action : set/pooled=no; selector: name=registry2003.codfw.wmnet [11:12:53] (03PS1) 10Cparle: MachineVision being sunsetted [puppet] - 10https://gerrit.wikimedia.org/r/1015010 [11:12:59] (03CR) 10CI reject: [V:04-1] Fixes for Puppet certificate cleaning on Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1013240 (owner: 10Majavah) 
[11:13:05] !log brouberol@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. [11:13:19] !log brouberol@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. [11:13:30] (03CR) 10CI reject: [V:04-1] MachineVision being sunsetted [puppet] - 10https://gerrit.wikimedia.org/r/1015010 (owner: 10Cparle) [11:13:31] brouberol: o/ [11:13:51] I am running maintenance on the Docker Registry nodes, can you wait a sec before proceeding? [11:14:09] (03PS5) 10Majavah: Fixes for Puppet certificate cleaning on Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1013240 [11:14:41] (03PS2) 10Cparle: MachineVision being sunsetted [puppet] - 10https://gerrit.wikimedia.org/r/1015010 [11:15:39] !log expand vram for registry200[3,4] from 4G to 6G - T360637 [11:15:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:15:43] T360637: Bump memory for registry[12]00[34] VMs - https://phabricator.wikimedia.org/T360637 [11:16:40] (03PS7) 10Jcrespo: mediabackups: Add newly setup storage host backup1011 [puppet] - 10https://gerrit.wikimedia.org/r/995188 (https://phabricator.wikimedia.org/T334069) [11:16:41] (03PS1) 10Jcrespo: Add rclone to the list of packages to install on backup workers [puppet] - 10https://gerrit.wikimedia.org/r/1015011 (https://phabricator.wikimedia.org/T334069) [11:16:53] !log elukey@cumin1002 START - Cookbook sre.ganeti.reboot-vm for VM registry2003.codfw.wmnet [11:17:16] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:17:23] (03CR) 10CI reject: [V:04-1] Add rclone to the list of packages to install on backup workers [puppet] - 10https://gerrit.wikimedia.org/r/1015011 (https://phabricator.wikimedia.org/T334069) (owner: 10Jcrespo) [11:17:55] Elukey: go ahead, sorry [11:17:57] îl [11:18:12] 
I’m out for lunch [11:18:13] (03CR) 10Majavah: [C:03+2] Fixes for Puppet certificate cleaning on Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1013240 (owner: 10Majavah) [11:18:38] (03CR) 10Elukey: [V:03+1 C:03+2] role::docker_registry_ha::registry: set nginx's tmpfs size in codfw [puppet] - 10https://gerrit.wikimedia.org/r/1014534 (https://phabricator.wikimedia.org/T360637) (owner: 10Elukey) [11:19:19] brouberol: <3 [11:19:40] (SystemdUnitFailed) firing: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:19:40] taavi: if you get my patch in your puppet-merge please go ahead :) [11:19:51] elukey: I did not [11:19:59] super [11:20:42] (RoutinatorRsyncErrors) firing: (2) Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [11:24:06] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2120 (re)pooling @ 75%: Post clone (src)', diff saved to https://phabricator.wikimedia.org/P58942 and previous config saved to /var/cache/conftool/dbconfig/20240327-112405-arnaudb.json [11:24:54] (CirrusSearchNodeIndexingNotIncreasing) firing: (16) Elasticsearch instance elastic2038-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [11:25:32] (03PS1) 10Majavah: hieradata: use alias not lookup [puppet] - 10https://gerrit.wikimedia.org/r/1015012 [11:26:19] (03CR) 10Majavah: [C:03+2] hieradata: use alias not lookup [puppet] - 
10https://gerrit.wikimedia.org/r/1015012 (owner: 10Majavah) [11:26:48] !log elukey@cumin1002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM registry2003.codfw.wmnet [11:29:26] !log elukey@puppetmaster1001 conftool action : set/pooled=yes; selector: name=registry2003.codfw.wmnet [11:29:48] first registry node done, proceeding in a bit with the second [11:31:43] (03PS1) 10Majavah: P:openstack: puppetserver: fix to use IPs not hostnames [puppet] - 10https://gerrit.wikimedia.org/r/1015014 [11:32:10] (03CR) 10CI reject: [V:04-1] P:openstack: puppetserver: fix to use IPs not hostnames [puppet] - 10https://gerrit.wikimedia.org/r/1015014 (owner: 10Majavah) [11:32:20] (03CR) 10JMeybohm: [C:03+1] Make datahub networkpolicy include/template consistent [deployment-charts] - 10https://gerrit.wikimedia.org/r/1014652 (https://phabricator.wikimedia.org/T359423) (owner: 10Btullis) [11:32:28] (03PS2) 10Majavah: P:openstack: puppetserver: fix to use IPs not hostnames [puppet] - 10https://gerrit.wikimedia.org/r/1015014 [11:32:34] !log elukey@puppetmaster1001 conftool action : set/pooled=no; selector: name=registry2004.codfw.wmnet [11:33:25] !log elukey@cumin1002 START - Cookbook sre.ganeti.reboot-vm for VM registry2004.codfw.wmnet [11:35:00] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) Will create a clone of db2119.codfw.wmnet onto db2219.codfw.wmnet [11:36:13] (03CR) 10Majavah: [C:03+2] P:openstack: puppetserver: fix to use IPs not hostnames [puppet] - 10https://gerrit.wikimedia.org/r/1015014 (owner: 10Majavah) [11:36:19] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2119 (re)pooling @ 25%: Post clone (src)', diff saved to https://phabricator.wikimedia.org/P58943 and previous config saved to /var/cache/conftool/dbconfig/20240327-113619-arnaudb.json [11:36:52] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on 12 hosts with reason: Maint T352010 [11:36:58] !log rename ens5 to ens13 in /etc/network/interfaces of 
registry2004 - T360637 [11:37:03] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on 12 hosts with reason: Maint T352010 [11:39:12] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2120 (re)pooling @ 100%: Post clone (src)', diff saved to https://phabricator.wikimedia.org/P58944 and previous config saved to /var/cache/conftool/dbconfig/20240327-113911-arnaudb.json [11:39:42] !log elukey@cumin1002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM registry2004.codfw.wmnet [11:39:56] (03PS2) 10Clément Goubert: restbase: Migrate backend traffic to mw-api-int [puppet] - 10https://gerrit.wikimedia.org/r/1014493 (https://phabricator.wikimedia.org/T358213) [11:39:56] (03PS1) 10Clément Goubert: restbase: Moving 50% of mwapi calls to mw-on-k8s [puppet] - 10https://gerrit.wikimedia.org/r/1015016 (https://phabricator.wikimedia.org/T358213) [11:41:09] !log run `apt-get clean` on registry2004 to free some space on the root partition [11:41:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:44:25] !log elukey@puppetmaster1001 conftool action : set/pooled=yes; selector: name=registry2004.codfw.wmnet [11:44:59] all done!! 
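Annotation on the registry maintenance logged above: each node was depooled, rebooted and repooled before the next one was touched. The ordering can be sketched roughly as below. This is only an illustration: `rolling_maintenance` and its callback parameters are hypothetical names, not part of conftool or the cookbooks, and the lambdas stand in for the real depool/reboot/repool actions.

```python
from typing import Callable, Iterable, List


def rolling_maintenance(
    hosts: Iterable[str],
    depool: Callable[[str], None],
    maintain: Callable[[str], None],
    repool: Callable[[str], None],
) -> List[str]:
    """Handle hosts one at a time so the remaining ones keep serving."""
    done = []
    for host in hosts:
        depool(host)    # take the host out of rotation first
        maintain(host)  # an exception here stops the loop; the host stays depooled
        repool(host)    # only reached when the maintenance step succeeded
        done.append(host)
    return done


if __name__ == "__main__":
    events = []
    rolling_maintenance(
        ["registry2003.codfw.wmnet", "registry2004.codfw.wmnet"],
        depool=lambda h: events.append(("depool", h)),
        maintain=lambda h: events.append(("reboot", h)),
        repool=lambda h: events.append(("repool", h)),
    )
    for step, host in events:
        print(step, host)
```

Stopping on the first failure, rather than repooling regardless, is a design choice of this sketch: a host that failed its reboot is left out of rotation for a human to inspect.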
[11:49:41] (03PS1) 10Majavah: openstack: puppetcertleaks: fix list command [puppet] - 10https://gerrit.wikimedia.org/r/1015017 [11:50:24] 10ops-esams, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: esams text cp nvme upgrade - https://phabricator.wikimedia.org/T360430#9664951 (10Fabfur) [11:51:25] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2119 (re)pooling @ 50%: Post clone (src)', diff saved to https://phabricator.wikimedia.org/P58945 and previous config saved to /var/cache/conftool/dbconfig/20240327-115125-arnaudb.json [11:53:51] (03PS3) 10AOkoth: miscweb: remove profile::microsites::security [puppet] - 10https://gerrit.wikimedia.org/r/1015005 (https://phabricator.wikimedia.org/T350796) [11:54:09] (03PS3) 10Elukey: Add the amd-pytorch base image for ML workloads [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1013335 (https://phabricator.wikimedia.org/T360638) [11:54:18] (03CR) 10Elukey: [V:03+2 C:03+2] Add the amd-pytorch base image for ML workloads [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1013335 (https://phabricator.wikimedia.org/T360638) (owner: 10Elukey) [11:54:55] 06SRE, 10MW-on-K8s, 10RESTBase, 06serviceops, 13Patch-For-Review: Migrate restbase from mwapi-async to mw-api-int - https://phabricator.wikimedia.org/T358213#9664964 (10Clement_Goubert) Things to keep an eye on: - Upstream error rate is higher on `mw-api-int` than bare-metal {F43515489} - Connection esta... 
[11:56:30] (03PS1) 10Ssingh: repool esams (text cluster maint completed) [dns] - 10https://gerrit.wikimedia.org/r/1015018 [12:00:54] 10ops-esams, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: esams text cp nvme upgrade - https://phabricator.wikimedia.org/T360430#9664975 (10Fabfur) [12:04:00] !log sukhe@cumin2002 START - Cookbook sre.hosts.remove-downtime for 8 hosts [12:04:06] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for 8 hosts [12:05:33] (03CR) 10Fabfur: [C:03+1] "ok for me" [dns] - 10https://gerrit.wikimedia.org/r/1015018 (owner: 10Ssingh) [12:06:31] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2119 (re)pooling @ 75%: Post clone (src)', diff saved to https://phabricator.wikimedia.org/P58946 and previous config saved to /var/cache/conftool/dbconfig/20240327-120630-arnaudb.json [12:07:52] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: dc=esams,cluster=cache_text [12:08:54] (03CR) 10Hnowlan: [C:03+1] restbase: Moving 50% of mwapi calls to mw-on-k8s [puppet] - 10https://gerrit.wikimedia.org/r/1015016 (https://phabricator.wikimedia.org/T358213) (owner: 10Clément Goubert) [12:09:58] (ProbeDown) resolved: (8) Service text-https:443 has failed probes (http_text-https_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:10:02] cool [12:12:42] 10ops-esams, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: esams text cp nvme upgrade - https://phabricator.wikimedia.org/T360430#9665004 (10Fabfur) [12:14:40] (03CR) 10Ssingh: [C:03+2] repool esams (text cluster maint completed) [dns] - 10https://gerrit.wikimedia.org/r/1015018 (owner: 10Ssingh) [12:15:03] !log running authdns-update to repool esams [12:15:05] (03CR) 10Jelto: [C:03+1] "lgtm. 
There is another comment in site.pp (https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/manife" [puppet] - 10https://gerrit.wikimedia.org/r/1015005 (https://phabricator.wikimedia.org/T350796) (owner: 10AOkoth) [12:15:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:15:18] (03CR) 10Ladsgroup: [C:03+1] "It's green in icinga" [puppet] - 10https://gerrit.wikimedia.org/r/1014649 (https://phabricator.wikimedia.org/T355422) (owner: 10Arnaudb) [12:15:35] (03CR) 10Arnaudb: [C:03+2] mariadb: toggle notifications for db2216 [puppet] - 10https://gerrit.wikimedia.org/r/1014649 (https://phabricator.wikimedia.org/T355422) (owner: 10Arnaudb) [12:17:53] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2216 (re)pooling @ 5%: Post clone (dst)', diff saved to https://phabricator.wikimedia.org/P58947 and previous config saved to /var/cache/conftool/dbconfig/20240327-121752-arnaudb.json [12:18:08] (03PS4) 10AOkoth: miscweb: remove profile::microsites::security [puppet] - 10https://gerrit.wikimedia.org/r/1015005 (https://phabricator.wikimedia.org/T350796) [12:18:30] (03CR) 10AOkoth: "Ack." 
[puppet] - 10https://gerrit.wikimedia.org/r/1015005 (https://phabricator.wikimedia.org/T350796) (owner: 10AOkoth) [12:18:54] (03PS5) 10AOkoth: miscweb: remove profile::microsites::security [puppet] - 10https://gerrit.wikimedia.org/r/1015005 (https://phabricator.wikimedia.org/T350796) [12:19:09] (03PS1) 10Brouberol: external-services: create namespace in aux/ml clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/1015021 [12:19:43] (03PS2) 10Brouberol: external-services: create namespace in aux/ml clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/1015021 (https://phabricator.wikimedia.org/T360428) [12:20:01] (03CR) 10Klausman: [C:03+1] external-services: create namespace in aux/ml clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/1015021 (https://phabricator.wikimedia.org/T360428) (owner: 10Brouberol) [12:20:21] (03CR) 10Majavah: [C:03+2] Remove old toolserver_legacy code [puppet] - 10https://gerrit.wikimedia.org/r/1014027 (owner: 10Majavah) [12:20:40] (KubernetesAPINotScrapable) firing: (2) k8s-aux@eqiad is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [12:21:37] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2119 (re)pooling @ 100%: Post clone (src)', diff saved to https://phabricator.wikimedia.org/P58948 and previous config saved to /var/cache/conftool/dbconfig/20240327-122136-arnaudb.json [12:22:20] (JobUnavailable) resolved: Reduced availability for job pdu_sentry4 in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:23:11] (03CR) 10Brouberol: [C:03+2] external-services: create namespace in aux/ml clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/1015021 (https://phabricator.wikimedia.org/T360428) (owner: 10Brouberol) [12:24:22] !log 
brouberol@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. [12:25:03] !log brouberol@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. [12:26:26] !log brouberol@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. [12:26:38] (03PS2) 10Jcrespo: Add rclone to the list of packages to install on backup workers [puppet] - 10https://gerrit.wikimedia.org/r/1015011 (https://phabricator.wikimedia.org/T334069) [12:26:49] !log brouberol@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. [12:27:33] !log brouberol@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. [12:27:58] (03PS3) 10Jcrespo: Add rclone to the list of packages to install on backup workers [puppet] - 10https://gerrit.wikimedia.org/r/1015011 (https://phabricator.wikimedia.org/T334069) [12:28:14] !log brouberol@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. [12:28:32] (03CR) 10Jcrespo: [C:03+2] Add rclone to the list of packages to install on backup workers [puppet] - 10https://gerrit.wikimedia.org/r/1015011 (https://phabricator.wikimedia.org/T334069) (owner: 10Jcrespo) [12:30:03] !log brouberol@deploy1002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. [12:30:18] !log brouberol@deploy1002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[12:30:40] (03PS1) 10Majavah: openstack: Restart designate-sink after changing hooks [puppet] - 10https://gerrit.wikimedia.org/r/1015023 [12:32:17] (03CR) 10Hnowlan: [C:03+1] changeprop: Update mesh modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/1014538 (https://phabricator.wikimedia.org/T359423) (owner: 10JMeybohm) [12:32:33] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (NOOP 1 CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1015023 (owner: 10Majavah) [12:32:49] (03PS8) 10Jcrespo: mediabackups: Add newly setup storage host backup1011 [puppet] - 10https://gerrit.wikimedia.org/r/995188 (https://phabricator.wikimedia.org/T334069) [12:32:53] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [12:32:58] (03CR) 10Majavah: [V:03+1 C:03+2] openstack: Restart designate-sink after changing hooks [puppet] - 10https://gerrit.wikimedia.org/r/1015023 (owner: 10Majavah) [12:33:00] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2216 (re)pooling @ 10%: Post clone (dst)', diff saved to https://phabricator.wikimedia.org/P58949 and previous config saved to /var/cache/conftool/dbconfig/20240327-123258-arnaudb.json [12:33:07] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[12:33:17] btw, I should have mentioned [12:33:50] !log redeploying external-services in all k8s clusters to account for the newly exposed ml-cassandra cluster - T360428 [12:33:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:33:58] T360428: Add Istio (and related) config to allow LW isvcs to talk to ML Cassandra machines - https://phabricator.wikimedia.org/T360428 [12:34:08] (03CR) 10Jcrespo: [C:03+2] mediabackups: Add newly setup storage host backup1011 [puppet] - 10https://gerrit.wikimedia.org/r/995188 (https://phabricator.wikimedia.org/T334069) (owner: 10Jcrespo) [12:34:48] !log brouberol@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [12:35:03] (03CR) 10Hnowlan: [C:03+1] changeprop: Add base.external-services-networkpolicy:1.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1014539 (https://phabricator.wikimedia.org/T359423) (owner: 10JMeybohm) [12:35:10] (03CR) 10Hnowlan: [C:03+1] changeprop: Migrate to base.external-services-networkpolicy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1014540 (https://phabricator.wikimedia.org/T359423) (owner: 10JMeybohm) [12:35:15] !log brouberol@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [12:35:43] !log brouberol@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [12:35:45] (03CR) 10Hnowlan: [C:03+1] "Thanks for doing all of these!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1014542 (https://phabricator.wikimedia.org/T359423) (owner: 10JMeybohm) [12:35:59] !log brouberol@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [12:36:29] !log brouberol@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'. [12:36:45] !log brouberol@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. 
[12:36:49] all this does is create a Service and its associated Endpoints resource [12:37:18] (03PS2) 10Cparle: Sunsetting MachineVision extension, so remove config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1013284 (https://phabricator.wikimedia.org/T352884) [12:37:19] !log brouberol@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'. [12:37:44] !log brouberol@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'. [12:37:47] all done [12:38:04] (03CR) 10Clément Goubert: [C:03+1] "I think it would be wise wait until after the codfw repool to merge, but otherwise LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/1009539 (https://phabricator.wikimedia.org/T347054) (owner: 10Volans) [12:39:54] (03PS2) 10Cparle: Removing MachineVision events, extension is being sunsetted [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1013101 (https://phabricator.wikimedia.org/T347970) [12:40:12] (03CR) 10Brouberol: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1014556 (https://phabricator.wikimedia.org/T341895) (owner: 10Brouberol) [12:40:22] (03PS4) 10Brouberol: hue: remove manifests and configuration [puppet] - 10https://gerrit.wikimedia.org/r/1014557 (https://phabricator.wikimedia.org/T341895) [12:42:18] (03PS3) 10Brouberol: hue: drop CNAME DNS record [dns] - 10https://gerrit.wikimedia.org/r/1014549 (https://phabricator.wikimedia.org/T341895) [12:46:16] (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1015003 (https://phabricator.wikimedia.org/T320973) (owner: 10David Caro) [12:48:05] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2216 (re)pooling @ 15%: Post clone (dst)', diff saved to https://phabricator.wikimedia.org/P58950 and previous config saved to /var/cache/conftool/dbconfig/20240327-124805-arnaudb.json [12:53:55] (03PS1) 10Majavah: P:puppetserver::git: do not mark directories as safe [puppet] - 10https://gerrit.wikimedia.org/r/1015032 [12:57:23] (03PS2) 10Majavah: P:puppetserver::git: do not mark directories as safe [puppet] - 10https://gerrit.wikimedia.org/r/1015032 [12:59:00] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1751/co" [puppet] - 10https://gerrit.wikimedia.org/r/1015032 (owner: 10Majavah) [13:00:04] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: How many deployers does it take to do UTC afternoon backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240327T1300). [13:00:04] cormacparle, Daimona, and Dreamy_Jazz: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:09] o/ [13:00:13] * Lucas_WMDE can’t deploy [13:00:17] * cormacparle waves [13:01:02] urbanecm: are you around? 
[13:02:05] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1170.eqiad.wmnet with reason: Maint T352010 [13:02:07] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1170.eqiad.wmnet with reason: Maint T352010 [13:02:20] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [13:02:57] \o [13:03:04] I can deploy [13:03:11] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2216 (re)pooling @ 20%: Post clone (dst)', diff saved to https://phabricator.wikimedia.org/P58951 and previous config saved to /var/cache/conftool/dbconfig/20240327-130310-arnaudb.json [13:05:37] cormacparle: Which of your patches needs to go first? [13:05:52] doesn't matter [13:06:25] I'll do both at the same time. [13:06:30] 👍 [13:07:07] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1013101 (https://phabricator.wikimedia.org/T347970) (owner: 10Cparle) [13:07:07] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1013284 (https://phabricator.wikimedia.org/T352884) (owner: 10Cparle) [13:09:00] !log dreamyjazz@deploy1002 Started scap: Backport for [[gerrit:1013101|Removing MachineVision events, extension is being sunsetted (T347970)]], [[gerrit:1013284|Sunsetting MachineVision extension, so remove config (T352884)]] [13:09:06] T347970: [L] MachineVision: archive and remove all events and event schemas - https://phabricator.wikimedia.org/T347970 [13:09:06] T352884: Undeploy and archive the MachineVision extension - https://phabricator.wikimedia.org/T352884 [13:11:11] (03PS4) 10Dreamy Jazz: Add setting to determine if CampaignEvents should use the global DB [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1014620 (https://phabricator.wikimedia.org/T348281) (owner: 10Daimona Eaytoy) 
[13:11:19] (03PS4) 10Dreamy Jazz: Add virtual domain mapping for CampaignEvents (prod) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1014621 (https://phabricator.wikimedia.org/T348281) (owner: 10Daimona Eaytoy) [13:11:24] (03PS4) 10Dreamy Jazz: Add virtual domain mapping for CampaignEvents (beta) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1014623 (https://phabricator.wikimedia.org/T348281) (owner: 10Daimona Eaytoy) [13:15:29] !log btullis@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1096.eqiad.wmnet with OS bullseye [13:16:27] (03CR) 10Majavah: [V:03+1] "I tested provisioning a new puppetserver with this applied and did not hit any Git permission errors. So not sure what the original issue " [puppet] - 10https://gerrit.wikimedia.org/r/1015032 (owner: 10Majavah) [13:16:50] K8s build-and-push-container-images is taking longer than usual, but I presume that this is because since the switch we have more on k8s now. [13:17:00] !log jnuche@deploy1002 Started deploy [releng/jenkins-deploy@9df0d43] (releasing): (no justification provided) [13:17:02] *the datacentre switch [13:17:21] !log jnuche@deploy1002 Finished deploy [releng/jenkins-deploy@9df0d43] (releasing): (no justification provided) (duration: 00m 20s) [13:17:55] no, it's because one of the commits you merged is rebuilding all of i18n [13:18:03] Ah. I see. [13:18:16] I had assumed that was contained in the rebuildLocalisationCache part of the deploy [13:18:17] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2216 (re)pooling @ 25%: Post clone (dst)', diff saved to https://phabricator.wikimedia.org/P58952 and previous config saved to /var/cache/conftool/dbconfig/20240327-131816-arnaudb.json [13:19:07] for bare metal hardware, yes. for k8s, i18n is included in the container images [13:19:15] 👍 [13:19:33] wmf-config/extension-list should generally be updated in a separate commit deployed separately, btw. 
I don't /think/ anything will break, but if something does, rolling back will also take this long [13:19:49] sorry :/ [13:21:44] (03PS7) 10Klausman: charts/kserve-inference: Wire up generated network policy for LW services [deployment-charts] - 10https://gerrit.wikimedia.org/r/1015029 (https://phabricator.wikimedia.org/T360428) [13:22:30] (03CR) 10Btullis: [C:03+2] Make datahub networkpolicy include/template consistent [deployment-charts] - 10https://gerrit.wikimedia.org/r/1014652 (https://phabricator.wikimedia.org/T359423) (owner: 10Btullis) [13:22:41] Moved to the next step on the backport. It took about 12 mins to do that step. [13:22:49] (03CR) 10Btullis: [C:03+2] Update the from_address for burrow notification emails [puppet] - 10https://gerrit.wikimedia.org/r/1014491 (https://phabricator.wikimedia.org/T358675) (owner: 10Btullis) [13:23:39] (03Merged) 10jenkins-bot: Make datahub networkpolicy include/template consistent [deployment-charts] - 10https://gerrit.wikimedia.org/r/1014652 (https://phabricator.wikimedia.org/T359423) (owner: 10Btullis) [13:27:22] (03CR) 10Brouberol: charts/kserve-inference: Wire up generated network policy for LW services (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1015029 (https://phabricator.wikimedia.org/T360428) (owner: 10Klausman) [13:28:15] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 38.47% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [13:28:35] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2115.codfw.wmnet with reason: provisionning db2215.codfw.wmnet - T355422 [13:28:39] T355422: Productionize db2196-db2220 - https://phabricator.wikimedia.org/T355422 [13:28:49] !log arnaudb@cumin1002 END (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2115.codfw.wmnet with reason: provisionning db2215.codfw.wmnet - T355422 [13:28:52] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2215.codfw.wmnet with reason: provisionning db2215.codfw.wmnet - T355422 [13:29:15] (03CR) 10David Caro: [C:03+2] karma: skip wmcloud from default silences [puppet] - 10https://gerrit.wikimedia.org/r/1015003 (https://phabricator.wikimedia.org/T320973) (owner: 10David Caro) [13:29:17] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2215.codfw.wmnet with reason: provisionning db2215.codfw.wmnet - T355422 [13:29:40] !log btullis@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1096.eqiad.wmnet with reason: host reimage [13:30:15] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Cloning db2115 in db2215 for T355422', diff saved to https://phabricator.wikimedia.org/P58954 and previous config saved to /var/cache/conftool/dbconfig/20240327-133015-arnaudb.json [13:30:33] (03PS8) 10Klausman: charts/kserve-inference: Wire up generated network policy for LW services [deployment-charts] - 10https://gerrit.wikimedia.org/r/1015029 (https://phabricator.wikimedia.org/T360428) [13:30:51] (03CR) 10Klausman: charts/kserve-inference: Wire up generated network policy for LW services (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1015029 (https://phabricator.wikimedia.org/T360428) (owner: 10Klausman) [13:31:26] !log arnaudb@cumin1002 START - Cookbook sre.mysql.clone Will create a clone of db2115.codfw.wmnet onto db2215.codfw.wmnet [13:32:16] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1096.eqiad.wmnet with reason: host reimage [13:33:15] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 38.47% idle - https://bit.ly/wmf-fpmsat - 
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [13:33:22] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2216 (re)pooling @ 50%: Post clone (dst)', diff saved to https://phabricator.wikimedia.org/P58955 and previous config saved to /var/cache/conftool/dbconfig/20240327-133322-arnaudb.json [13:35:47] (03CR) 10Majavah: [C:03+2] cloudnfs: Add missing dependency [puppet] - 10https://gerrit.wikimedia.org/r/1014523 (owner: 10Majavah) [13:37:30] (03CR) 10Majavah: [C:03+2] openstack: puppetcertleaks: fix list command [puppet] - 10https://gerrit.wikimedia.org/r/1015017 (owner: 10Majavah) [13:37:44] Dreamy_Jazz: is it still chugging along? [13:38:01] Yes, but you should be able to test on the testservers. [13:38:28] I was able to verify that Special:ApiSandbox no longer accepts the imagelabels label [13:38:50] But waiting for scap-cdb-rebuild before doing the final test [13:38:50] looks good on mwdebug1001 [13:38:53] Thanks [13:39:38] (03CR) 10Brouberol: [C:03+1] "Looks good!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1015029 (https://phabricator.wikimedia.org/T360428) (owner: 10Klausman) [13:41:20] !log dreamyjazz@deploy1002 dreamyjazz and cparle: Backport for [[gerrit:1013101|Removing MachineVision events, extension is being sunsetted (T347970)]], [[gerrit:1013284|Sunsetting MachineVision extension, so remove config (T352884)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:41:25] T347970: [L] MachineVision: archive and remove all events and event schemas - https://phabricator.wikimedia.org/T347970 [13:41:26] T352884: Undeploy and archive the MachineVision extension - https://phabricator.wikimedia.org/T352884 [13:41:30] !log dreamyjazz@deploy1002 dreamyjazz and cparle: Continuing with sync [13:41:37] Proceeding as test was already run [13:41:47] 👍 [13:42:51] Daimona: Do you have deployment rights or would you like me to deploy? [13:43:15] Hey! I don't have deployment rights [13:43:33] Sure. There should be some time after deploying the extension undeployment. [13:47:18] (03PS7) 10Dreamy Jazz: Add wgAutoCreateTempUser configuration for production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1014526 (https://phabricator.wikimedia.org/T349506) [13:48:29] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2216 (re)pooling @ 75%: Post clone (dst)', diff saved to https://phabricator.wikimedia.org/P58956 and previous config saved to /var/cache/conftool/dbconfig/20240327-134828-arnaudb.json [13:49:26] 10SRE-tools, 06Infrastructure-Foundations, 10Spicerack, 10cloud-services-team (FY2023/2024-Q3-Q4), 13Patch-For-Review: spicerack: tox fails to install PyYAML using python 3.11 on bookworm - https://phabricator.wikimedia.org/T345337#9665410 (10bking) Sure, I'm happy to create a new package. Curator itself... 
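Annotation on the dbctl entries above: the db2216 repool steps through 5, 10, 15, 20, 25, 50 and 75 percent at roughly 15-minute intervals (12:17:53, 12:33:00, 12:48:05 and so on). A minimal sketch of such a ramp follows, assuming a fixed interval. `repool_schedule` and `DST_STEPS` are hypothetical names for illustration only and are not part of dbctl or the sre.mysql cookbooks; the closing 100% step is likewise an assumption, since it does not appear in this part of the log.

```python
from datetime import datetime, timedelta

# Percentages seen in the log for db2216 ("Post clone (dst)").
# The final 100% step is assumed; it is not shown in this part of the log.
DST_STEPS = [5, 10, 15, 20, 25, 50, 75, 100]


def repool_schedule(start, steps, interval_minutes=15):
    """Return (time, percentage) pairs for a fixed-interval ramp."""
    interval = timedelta(minutes=interval_minutes)
    return [(start + i * interval, pct) for i, pct in enumerate(steps)]


if __name__ == "__main__":
    first = datetime(2024, 3, 27, 12, 17, 53)  # first db2216 entry in the log
    for when, pct in repool_schedule(first, DST_STEPS):
        print(f"{when:%H:%M:%S} pool at {pct}%")
```

The real timestamps drift by a few seconds per step, so the sketch should be read as an approximation of the cadence rather than a reproduction of it.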
[13:49:48] (03CR) 10Elukey: charts/kserve-inference: Wire up generated network policy for LW services (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1015029 (https://phabricator.wikimedia.org/T360428) (owner: 10Klausman) [13:50:45] (03PS9) 10Klausman: charts/kserve-inference: Wire up generated network policy for LW services [deployment-charts] - 10https://gerrit.wikimedia.org/r/1015029 (https://phabricator.wikimedia.org/T360428) [13:50:57] (03CR) 10Klausman: charts/kserve-inference: Wire up generated network policy for LW services (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1015029 (https://phabricator.wikimedia.org/T360428) (owner: 10Klausman) [13:51:02] Dreamy_Jazz: hi! thanks for taking the window. are you still deploying? [13:51:09] Yes [13:51:30] Still on the extension undeployment (first two patches) [13:51:49] I'm wondering whether we can stray into the Wikifunction Services UTC Afternoon window? [13:52:01] 10SRE-tools, 06Infrastructure-Foundations, 10Spicerack, 10cloud-services-team (FY2023/2024-Q3-Q4): create and deploy new Elastic Curator deb package - https://phabricator.wikimedia.org/T361105 (10bking) 03NEW [13:52:14] 10SRE-tools, 06Data-Platform-SRE, 06Infrastructure-Foundations, 10Spicerack, 10cloud-services-team (FY2023/2024-Q3-Q4): create and deploy new Elastic Curator deb package - https://phabricator.wikimedia.org/T361105#9665431 (10bking) [13:52:15] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 34.74% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [13:53:18] (03CR) 10Elukey: [C:04-1] "This may not work, from what I see in the CI's diff these are the selectors:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1015029 
(https://phabricator.wikimedia.org/T360428) (owner: 10Klausman) [13:55:24] Dreamy_Jazz, effie will be repooling codfw at 1400, you may want to coordinate if you're going to run over [13:55:48] It is likely we will run over. [13:55:54] Dreamy_Jazz: I can wait [13:55:58] how long do you need [13:55:59] ? [13:56:29] (this is just a precaution, activating multi-dc should not be an issue) [13:56:29] I'm expecting about 10-20 mins. Ideally want to do two more deployments. [13:56:39] To get the calendar done [13:56:56] ok, I can hold for 20' [13:57:01] Thanks! [13:57:15] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 34.74% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [13:57:27] Just that https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1014559 is a blocker for a temporary accounts testwiki release task. [13:57:39] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1096.eqiad.wmnet with OS bullseye [13:59:14] Dreamy_Jazz: for every minute after that 20' window, you will owe me one Japanese Kit Kat [13:59:22] and I always come to collect [13:59:25] !log bounce prometheus@k8s-aux in eqiad - T343529 [13:59:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:59:29] T343529: Prometheus doesn't reload or alert on expired client certificates - https://phabricator.wikimedia.org/T343529 [13:59:40] How are they different to normal ones? 
[13:59:50] !log dreamyjazz@deploy1002 Finished scap: Backport for [[gerrit:1013101|Removing MachineVision events, extension is being sunsetted (T347970)]], [[gerrit:1013284|Sunsetting MachineVision extension, so remove config (T352884)]] (duration: 50m 49s) [13:59:55] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1014559 (https://phabricator.wikimedia.org/T361021) (owner: 10Tchanders) [13:59:56] T347970: [L] MachineVision: archive and remove all events and event schemas - https://phabricator.wikimedia.org/T347970 [13:59:56] T352884: Undeploy and archive the MachineVision extension - https://phabricator.wikimedia.org/T352884 [14:00:05] Deploy window Wikifunction Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240327T1400) [14:00:08] (03PS3) 10Dreamy Jazz: Prevent new user names matching the temporary account pattern [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1014559 (https://phabricator.wikimedia.org/T361021) (owner: 10Tchanders) [14:00:10] Dreamy_Jazz: you have no idea https://www.bokksumarket.com/products/japanese-kit-kat-whole-wheat-biscuit [14:00:20] (03CR) 10TrainBranchBot: "Approved by dreamyjazz@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1014559 (https://phabricator.wikimedia.org/T361021) (owner: 10Tchanders) [14:00:21] (03CR) 10Xcollazo: "+Ben, for awareness." [puppet] - 10https://gerrit.wikimedia.org/r/1013368 (https://phabricator.wikimedia.org/T347967) (owner: 10Cparle) [14:00:40] Going to do my config patch first. [14:01:02] They look very nice :) [14:01:29] it looks like my deploy is finished Dreamy_Jazz ? [14:01:36] It is. [14:02:03] Special:ApiSandbox is no longer showing the imagelabels prop as a valid option, so it seemed to work. [14:02:07] great - thank you! 
[14:02:25] (03Merged) 10jenkins-bot: Prevent new user names matching the temporary account pattern [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1014559 (https://phabricator.wikimedia.org/T361021) (owner: 10Tchanders) [14:02:53] !log dreamyjazz@deploy1002 Started scap: Backport for [[gerrit:1014559|Prevent new user names matching the temporary account pattern (T361021 T349506)]] [14:02:55] (03PS1) 10Effie Mouzeli: traffic: Pool codfw for user traffic (switchover #8) [dns] - 10https://gerrit.wikimedia.org/r/1015037 (https://phabricator.wikimedia.org/T357547) [14:02:58] T361021: New accounts with names beginning with ~2 are created - https://phabricator.wikimedia.org/T361021 [14:02:58] T349506: Set temporary user pattern configuration on production ahead of testwiki deployment - https://phabricator.wikimedia.org/T349506 [14:03:14] (03CR) 10Bking: [C:03+1] wdqs: add x-monitoring-query [puppet] - 10https://gerrit.wikimedia.org/r/1014551 (https://phabricator.wikimedia.org/T360993) (owner: 10DCausse) [14:03:25] (03CR) 10Ssingh: [C:03+1] traffic: Pool codfw for user traffic (switchover #8) [dns] - 10https://gerrit.wikimedia.org/r/1015037 (https://phabricator.wikimedia.org/T357547) (owner: 10Effie Mouzeli) [14:03:35] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2216 (re)pooling @ 100%: Post clone (dst)', diff saved to https://phabricator.wikimedia.org/P58957 and previous config saved to /var/cache/conftool/dbconfig/20240327-140334-arnaudb.json [14:03:52] Daimona: Will you be around for the next 10-20 mins? [14:03:57] Sure! [14:04:23] Great. This change is being backported much faster, so there should be time. 
[14:04:42] Nice :) [14:05:40] (KubernetesAPINotScrapable) resolved: (2) k8s-aux@eqiad is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [14:05:56] !log dreamyjazz@deploy1002 dreamyjazz and tchanders: Backport for [[gerrit:1014559|Prevent new user names matching the temporary account pattern (T361021 T349506)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:06:13] !log dreamyjazz@deploy1002 dreamyjazz and tchanders: Continuing with sync [14:06:18] Test successful. [14:06:47] (03CR) 10Dreamy Jazz: [C:03+2] Add virtual domain mapping for CampaignEvents (beta) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1014623 (https://phabricator.wikimedia.org/T348281) (owner: 10Daimona Eaytoy) [14:06:48] (03CR) 10Dreamy Jazz: [C:03+2] Add virtual domain mapping for CampaignEvents (prod) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1014621 (https://phabricator.wikimedia.org/T348281) (owner: 10Daimona Eaytoy) [14:06:50] (03CR) 10Dreamy Jazz: [C:03+2] Add setting to determine if CampaignEvents should use the global DB [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1014620 (https://phabricator.wikimedia.org/T348281) (owner: 10Daimona Eaytoy) [14:07:45] (03Merged) 10jenkins-bot: Add setting to determine if CampaignEvents should use the global DB [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1014620 (https://phabricator.wikimedia.org/T348281) (owner: 10Daimona Eaytoy) [14:07:45] (03CR) 10Ssingh: icinga: add cdobbins (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1014589 (owner: 10CDobbins) [14:07:48] (03CR) 10CI reject: [V:04-1] Add virtual domain mapping for CampaignEvents (beta) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1014623 (https://phabricator.wikimedia.org/T348281) (owner: 10Daimona Eaytoy) [14:08:41] (03CR) 10Brouberol: "The selectors are based off our `base.meta.metadata` template, 
injecting the `app`, `chart`, `release`, and `heritage` on workloads, as we" [deployment-charts] - https://gerrit.wikimedia.org/r/1015029 (https://phabricator.wikimedia.org/T360428) (owner: Klausman) [14:09:01] (PS4) Dreamy Jazz: Remove old CampaignEvents DB config (beta) [mediawiki-config] - https://gerrit.wikimedia.org/r/1014625 (https://phabricator.wikimedia.org/T348281) (owner: Daimona Eaytoy) [14:09:09] (PS5) Dreamy Jazz: Add virtual domain mapping for CampaignEvents (prod) [mediawiki-config] - https://gerrit.wikimedia.org/r/1014621 (https://phabricator.wikimedia.org/T348281) (owner: Daimona Eaytoy) [14:09:14] (PS5) Dreamy Jazz: Add virtual domain mapping for CampaignEvents (beta) [mediawiki-config] - https://gerrit.wikimedia.org/r/1014623 (https://phabricator.wikimedia.org/T348281) (owner: Daimona Eaytoy) [14:09:17] (CR) Dreamy Jazz: Add virtual domain mapping for CampaignEvents (beta) [mediawiki-config] - https://gerrit.wikimedia.org/r/1014623 (https://phabricator.wikimedia.org/T348281) (owner: Daimona Eaytoy) [14:09:19] (CR) Dreamy Jazz: [C:+2] Add virtual domain mapping for CampaignEvents (beta) [mediawiki-config] - https://gerrit.wikimedia.org/r/1014623 (https://phabricator.wikimedia.org/T348281) (owner: Daimona Eaytoy) [14:09:21] (CR) CI reject: [V:-1] Remove old CampaignEvents DB config (beta) [mediawiki-config] - https://gerrit.wikimedia.org/r/1014625 (https://phabricator.wikimedia.org/T348281) (owner: Daimona Eaytoy) [14:09:22] (CR) Dreamy Jazz: Add virtual domain mapping for CampaignEvents (prod) [mediawiki-config] - https://gerrit.wikimedia.org/r/1014621 (https://phabricator.wikimedia.org/T348281) (owner: Daimona Eaytoy) [14:09:24] (CR) Dreamy Jazz: [C:+2] Add virtual domain mapping for CampaignEvents (prod) [mediawiki-config] - https://gerrit.wikimedia.org/r/1014621 (https://phabricator.wikimedia.org/T348281) (owner: Daimona Eaytoy) [14:09:44] !log jiji@cumin1002 START -
Cookbook sre.discovery.datacenter status all services in all: None - None [14:09:47] !log jiji@cumin1002 END (PASS) - Cookbook sre.discovery.datacenter (exit_code=0) status all services in all: None - None [14:10:39] (03Merged) 10jenkins-bot: Add virtual domain mapping for CampaignEvents (prod) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1014621 (https://phabricator.wikimedia.org/T348281) (owner: 10Daimona Eaytoy) [14:10:42] (03Merged) 10jenkins-bot: Add virtual domain mapping for CampaignEvents (beta) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1014623 (https://phabricator.wikimedia.org/T348281) (owner: 10Daimona Eaytoy) [14:10:57] (03PS1) 10JMeybohm: kubernetes::node restart rsyslog if too many fd's are blocked by inotify [puppet] - 10https://gerrit.wikimedia.org/r/1015039 (https://phabricator.wikimedia.org/T357616) [14:11:02] (03CR) 10Klausman: "One idea was to make a 1.0.1 of that template (in modules/) that also checks `app-wmf`, and then update this change along the same lines." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1015029 (https://phabricator.wikimedia.org/T360428) (owner: 10Klausman) [14:11:16] * Dreamy_Jazz still waiting for my change to deploy... [14:11:59] Daimona: On second thoughts there might not be enough time to get your backports in. [14:12:15] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 39.38% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [14:12:45] I pre-merged them to speed it up, but there isn't enough time to get them merged. [14:12:50] That's fine [14:13:00] I'll revert the merges and re-create the commits. 
[14:13:26] (PS1) Dreamy Jazz: Revert "Add virtual domain mapping for CampaignEvents (beta)" [mediawiki-config] - https://gerrit.wikimedia.org/r/1014724 [14:13:30] (CR) Dreamy Jazz: [C:+2] Revert "Add virtual domain mapping for CampaignEvents (beta)" [mediawiki-config] - https://gerrit.wikimedia.org/r/1014724 (owner: Dreamy Jazz) [14:14:14] (Merged) jenkins-bot: Revert "Add virtual domain mapping for CampaignEvents (beta)" [mediawiki-config] - https://gerrit.wikimedia.org/r/1014724 (owner: Dreamy Jazz) [14:14:19] (PS1) Dreamy Jazz: Revert "Add virtual domain mapping for CampaignEvents (prod)" [mediawiki-config] - https://gerrit.wikimedia.org/r/1014725 [14:14:23] (CR) Dreamy Jazz: [C:+2] Revert "Add virtual domain mapping for CampaignEvents (prod)" [mediawiki-config] - https://gerrit.wikimedia.org/r/1014725 (owner: Dreamy Jazz) [14:15:07] (PS1) Dreamy Jazz: Revert "Add setting to determine if CampaignEvents should use the global DB" [mediawiki-config] - https://gerrit.wikimedia.org/r/1015046 [14:15:10] (Merged) jenkins-bot: Revert "Add virtual domain mapping for CampaignEvents (prod)" [mediawiki-config] - https://gerrit.wikimedia.org/r/1014725 (owner: Dreamy Jazz) [14:15:18] (CR) JMeybohm: [V:+1] "PCC SUCCESS (CORE_DIFF 14): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1752/c" [puppet] - https://gerrit.wikimedia.org/r/1015039 (https://phabricator.wikimedia.org/T357616) (owner: JMeybohm) [14:16:03] (CR) Dreamy Jazz: [C:+2] Revert "Add setting to determine if CampaignEvents should use the global DB" [mediawiki-config] - https://gerrit.wikimedia.org/r/1015046 (owner: Dreamy Jazz) [14:16:19] (PS4) Majavah: P:wmcs::metricsinfra: add meta monitoring app skeleton [puppet] - https://gerrit.wikimedia.org/r/966804 (https://phabricator.wikimedia.org/T288053) [14:16:19] (PS4) Majavah: P:wmcs::metriscinfra: haproxy: add route for meta
monitor service [puppet] - 10https://gerrit.wikimedia.org/r/966805 (https://phabricator.wikimedia.org/T288053) [14:16:43] (03PS1) 10Dreamy Jazz: Add setting to determine if CampaignEvents should use the global DB [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1015042 (https://phabricator.wikimedia.org/T348281) [14:16:46] (03Merged) 10jenkins-bot: Revert "Add setting to determine if CampaignEvents should use the global DB" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1015046 (owner: 10Dreamy Jazz) [14:17:21] !log dreamyjazz@deploy1002 Finished scap: Backport for [[gerrit:1014559|Prevent new user names matching the temporary account pattern (T361021 T349506)]] (duration: 14m 28s) [14:17:27] T361021: New accounts with names beginning with ~2 are created - https://phabricator.wikimedia.org/T361021 [14:17:27] T349506: Set temporary user pattern configuration on production ahead of testwiki deployment - https://phabricator.wikimedia.org/T349506 [14:17:45] effie: Over to you [14:17:59] !log Afternoon UTC backport window done (extended by 17 mins) [14:18:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:03] so not kit-kats for me [14:18:07] :( [14:18:07] no* [14:18:21] good thing I have a small stock :) [14:18:26] :D [14:18:56] Thanks again for delaying the repool. 
[14:19:08] !log Day 8: Pool active/active services on codfw - T357547 [14:19:10] -3 kit kats for e.ffie [14:19:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:12] np [14:19:12] T357547: ☂️ Northward Datacentre Switchover (March 2024) - https://phabricator.wikimedia.org/T357547 [14:19:13] :p [14:19:21] claime: not giving them away lol [14:19:25] :( [14:20:20] (03PS1) 10Dreamy Jazz: Add virtual domain mapping for CampaignEvents (prod) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1015043 (https://phabricator.wikimedia.org/T348281) [14:20:29] I may make exceptions claime, do not lose hope [14:21:28] !log jiji@cumin1002 START - Cookbook sre.discovery.datacenter pool all active/active services in codfw: Pool active/active services on codfw - T357547 [14:21:32] (03PS1) 10Dreamy Jazz: Add virtual domain mapping for CampaignEvents (beta) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1015044 (https://phabricator.wikimedia.org/T348281) [14:21:55] (03PS5) 10Dreamy Jazz: Remove old CampaignEvents DB config (beta) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1014625 (https://phabricator.wikimedia.org/T348281) (owner: 10Daimona Eaytoy) [14:21:59] (03PS3) 10Dreamy Jazz: Remove old CampaignEvents DB config (prod) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1014626 (https://phabricator.wikimedia.org/T348281) (owner: 10Daimona Eaytoy) [14:22:08] (03CR) 10Dreamy Jazz: [C:03+1] Add setting to determine if CampaignEvents should use the global DB [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1015042 (https://phabricator.wikimedia.org/T348281) (owner: 10Dreamy Jazz) [14:22:12] (03CR) 10Dreamy Jazz: [C:03+1] Add virtual domain mapping for CampaignEvents (prod) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1015043 (https://phabricator.wikimedia.org/T348281) (owner: 10Dreamy Jazz) [14:22:15] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 39.28% idle - 
https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [14:22:15] (03CR) 10Dreamy Jazz: [C:03+1] Add virtual domain mapping for CampaignEvents (beta) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1015044 (https://phabricator.wikimedia.org/T348281) (owner: 10Dreamy Jazz) [14:24:17] Daimona: I've re-created your config patches at https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1015042 and given them a +1. [14:25:19] Thanks! I'll reschedule them [14:34:16] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to analytics-privatedata-users for bblack - https://phabricator.wikimedia.org/T361046#9665618 (10WDoranWMF) Approved! [14:34:48] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to analytics-privatedata-users for bblack - https://phabricator.wikimedia.org/T361046#9665620 (10BBlack) [14:37:20] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:38:58] (03CR) 10BBlack: "Done in https://phabricator.wikimedia.org/T361046#9665618" [puppet] - 10https://gerrit.wikimedia.org/r/1014590 (https://phabricator.wikimedia.org/T361046) (owner: 10BBlack) [14:39:15] (03PS4) 10BBlack: Add myself to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1014590 (https://phabricator.wikimedia.org/T361046) [14:40:23] (03CR) 10BBlack: [C:03+2] Add myself to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1014590 (https://phabricator.wikimedia.org/T361046) (owner: 10BBlack) [14:40:59] !log jiji@cumin1002 END (PASS) - Cookbook sre.discovery.datacenter (exit_code=0) pool 
all active/active services in codfw: Pool active/active services on codfw - T357547 [14:41:04] T357547: ☂️ Northward Datacentre Switchover (March 2024) - https://phabricator.wikimedia.org/T357547 [14:42:28] !log jiji@cumin1002 START - Cookbook sre.discovery.datacenter status all services in all: None - None [14:42:31] !log jiji@cumin1002 END (PASS) - Cookbook sre.discovery.datacenter (exit_code=0) status all services in all: None - None [14:45:20] (03PS2) 10Effie Mouzeli: traffic: Pool codfw for user traffic (switchover #8) [dns] - 10https://gerrit.wikimedia.org/r/1015037 (https://phabricator.wikimedia.org/T357547) [14:45:32] !log Day 8: Pool codfw for user traffic - T357547 [14:45:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:46:22] (03CR) 10Effie Mouzeli: [C:03+2] traffic: Pool codfw for user traffic (switchover #8) [dns] - 10https://gerrit.wikimedia.org/r/1015037 (https://phabricator.wikimedia.org/T357547) (owner: 10Effie Mouzeli) [14:48:38] (03CR) 10David Martin: Update the WikiLambda instrumentation to use core interaction events (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/992223 (https://phabricator.wikimedia.org/T350497) (owner: 10Santiago Faci) [14:48:45] (03PS5) 10DCausse: wdqs: add x-monitoring-query [puppet] - 10https://gerrit.wikimedia.org/r/1014551 (https://phabricator.wikimedia.org/T360993) [14:48:45] (03PS3) 10DCausse: updateQueryServiceLag: tune the min query rate on a pooled server [puppet] - 10https://gerrit.wikimedia.org/r/1014584 (https://phabricator.wikimedia.org/T360993) [14:49:17] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1014551 (https://phabricator.wikimedia.org/T360993) (owner: 10DCausse) [14:49:26] (03CR) 10MNeisler: Update the WikiLambda instrumentation to use core interaction events (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/992223 (https://phabricator.wikimedia.org/T350497) (owner: 10Santiago Faci) [14:49:51] 
(SwaggerProbeHasFailures) firing: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://proton.svc.codfw.wmnet:4030 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [14:50:27] (03PS1) 10Filippo Giunchedi: WIP: use oauth2-proxy for opensearch dashboards [puppet] - 10https://gerrit.wikimedia.org/r/1015045 [14:52:27] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/proton: apply [14:52:55] Kinda sorta forgot to bump replicas for proton in codfw as well [14:53:39] (03CR) 10CI reject: [V:04-1] WIP: use oauth2-proxy for opensearch dashboards [puppet] - 10https://gerrit.wikimedia.org/r/1015045 (owner: 10Filippo Giunchedi) [14:54:52] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/proton: apply [14:55:22] (03PS6) 10DCausse: wdqs: add x-monitoring-query [puppet] - 10https://gerrit.wikimedia.org/r/1014551 (https://phabricator.wikimedia.org/T360993) [14:55:23] (03PS4) 10DCausse: updateQueryServiceLag: tune the min query rate on a pooled server [puppet] - 10https://gerrit.wikimedia.org/r/1014584 (https://phabricator.wikimedia.org/T360993) [14:55:32] (03CR) 10Bking: [C:03+1] hue: drop CNAME DNS record [dns] - 10https://gerrit.wikimedia.org/r/1014549 (https://phabricator.wikimedia.org/T341895) (owner: 10Brouberol) [14:55:36] (GatewayBackendErrorsHigh) firing: rest-gateway: elevated 5xx errors from proton_cluster in codfw #page - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=codfw%20prometheus/k8s&var-instance=rest-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsHigh [14:55:41] yes yes [14:55:43] oh [14:55:46] ok :) [14:55:52] shush proton [14:55:59] again? 
[14:55:59] <_joe_> uhm [14:56:01] <_joe_> yep [14:56:02] ops-esams, SRE, DC-Ops, Traffic, Patch-For-Review: esams text cp nvme upgrade - https://phabricator.wikimedia.org/T360430#9665702 (Fabfur) esams has been repooled at 12:15UTC [14:56:06] this has been going on since this morning I reckon ? [14:56:07] <_joe_> this time in codfw [14:56:12] <_joe_> effie: no it hasn't [14:56:13] yeah so I had forgotten to bump the replicas to 12 in codfw [14:56:18] I had only done eqiad [14:56:18] <_joe_> I was about to ask [14:56:20] <_joe_> :) [14:56:26] repooling codfw switched all the traffic there [14:56:28] <_joe_> !incidents [14:56:28] 4551 (ACKED) GatewayBackendErrorsHigh sre (proton_cluster rest-gateway codfw) [14:56:28] 4550 (RESOLVED) [8x] ProbeDown sre (probes/service esams) [14:56:28] 4549 (RESOLVED) GatewayBackendErrorsHigh sre (proton_cluster rest-gateway eqiad) [14:56:29] boom [14:56:32] claime: no prob, now you actually owe me a kit-kat [14:56:33] <_joe_> !ack 4551 [14:56:34] 4551 (ACKED) GatewayBackendErrorsHigh sre (proton_cluster rest-gateway codfw) [14:57:01] ops-esams, SRE, DC-Ops, Traffic, Patch-For-Review: esams text cp nvme upgrade - https://phabricator.wikimedia.org/T360430#9665701 (Fabfur) [14:57:05] claime: shall I depool it from codfw until you deploy? [14:57:11] it's deployed [14:57:31] it should recover, if it doesn't...
means something else is wrong [14:57:37] alright alright [14:57:38] <_joe_> it's recovering [14:57:43] I am trying hard for a free meal here [14:58:18] (03CR) 10Sergio Gimeno: [C:03+1] Add CommunityConfiguration log channel [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1014634 (https://phabricator.wikimedia.org/T361072) (owner: 10Urbanecm) [14:58:45] (03PS4) 10Brouberol: hue: drop CNAME DNS record [dns] - 10https://gerrit.wikimedia.org/r/1014549 (https://phabricator.wikimedia.org/T341895) [14:59:51] (SwaggerProbeHasFailures) resolved: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://proton.svc.codfw.wmnet:4030 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [15:00:06] (03CR) 10Filippo Giunchedi: "Just a sketch to give an idea of what I had in mind, also re: https://phabricator.wikimedia.org/T337818" [puppet] - 10https://gerrit.wikimedia.org/r/1015045 (owner: 10Filippo Giunchedi) [15:00:36] (GatewayBackendErrorsHigh) resolved: rest-gateway: elevated 5xx errors from proton_cluster in codfw #page - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=codfw%20prometheus/k8s&var-instance=rest-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsHigh [15:00:44] (03CR) 10Brouberol: [C:03+2] hue: drop CNAME DNS record [dns] - 10https://gerrit.wikimedia.org/r/1014549 (https://phabricator.wikimedia.org/T341895) (owner: 10Brouberol) [15:00:45] excellent [15:00:49] claime: tx [15:01:17] (03CR) 10Clément Goubert: [C:03+2] proton: triple replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1013995 (owner: 10Clément Goubert) [15:02:06] (03CR) 10Bking: [C:03+2] wdqs: add x-monitoring-query [puppet] - 10https://gerrit.wikimedia.org/r/1014551 
(https://phabricator.wikimedia.org/T360993) (owner: 10DCausse) [15:02:20] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:05:47] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: 14Requesting access to analytics-privatedata-users for bblack - 14https://phabricator.wikimedia.org/T361046#9665789 (10BBlack) 05Open→03Resolved a:03BBlack [15:07:25] (SystemdUnitFailed) firing: git_pull_charts.service on deploy1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:08:47] !log Disabling puppet on P:restbase - T358213 [15:08:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:08:54] T358213: Migrate restbase from mwapi-async to mw-api-int - https://phabricator.wikimedia.org/T358213 [15:10:05] (03CR) 10Clément Goubert: [C:03+2] restbase: Moving 50% of mwapi calls to mw-on-k8s [puppet] - 10https://gerrit.wikimedia.org/r/1015016 (https://phabricator.wikimedia.org/T358213) (owner: 10Clément Goubert) [15:11:28] !log enabling and running puppet on restbase2021.codfw.wmnet - T358213 [15:11:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:24] !log jnuche@deploy1002 Started deploy [releng/jenkins-deploy@1a343bf] (releasing): testing fix for T361084 [15:12:27] T361084: Upgrade matrix-auth for Jenkins 2.440 - https://phabricator.wikimedia.org/T361084 [15:12:44] !log jnuche@deploy1002 Finished deploy [releng/jenkins-deploy@1a343bf] (releasing): testing fix for T361084 (duration: 00m 20s) [15:14:26] !log enabling and running puppet on restbase1035.eqiad.wmnet - T358213 [15:14:30] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:14:31] T358213: Migrate restbase from mwapi-async to mw-api-int - https://phabricator.wikimedia.org/T358213 [15:14:52] !log brouberol@cumin2002 START - Cookbook sre.hosts.decommission for hosts an-tool1009.eqiad.wmnet [15:17:08] Looks good, proceeding with the rest of RESTbase hosts [15:17:15] !log enabling and running puppet on P:restbase - T358213 [15:17:16] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:17:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:19:01] !log brouberol@cumin2002 START - Cookbook sre.dns.netbox [15:19:40] (SystemdUnitFailed) firing: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:20:42] (RoutinatorRsyncErrors) firing: (2) Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [15:21:45] !log brouberol@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: an-tool1009.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - brouberol@cumin2002" [15:23:19] !log andrewtavis-wmde@deploy1002 Started deploy [airflow-dags/wmde@36dee63]: (no justification provided) [15:23:27] !log andrewtavis-wmde@deploy1002 Finished deploy [airflow-dags/wmde@36dee63]: (no justification provided) (duration: 00m 08s) [15:23:33] !log brouberol@cumin2002 END (PASS) - Cookbook 
sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: an-tool1009.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - brouberol@cumin2002" [15:23:34] !log brouberol@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:23:35] !log brouberol@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts an-tool1009.eqiad.wmnet [15:23:38] 10SRE-tools, 06Data-Platform-SRE, 06Infrastructure-Foundations, 10Spicerack, 10cloud-services-team (FY2023/2024-Q3-Q4): create and deploy new Elastic Curator deb package - https://phabricator.wikimedia.org/T361105#9665882 (10Gehel) p:05Triage→03High [15:24:54] (CirrusSearchNodeIndexingNotIncreasing) firing: (16) Elasticsearch instance elastic2038-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [15:25:24] (03CR) 10Brouberol: Decommission an-coord100[12] (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1014533 (https://phabricator.wikimedia.org/T353774) (owner: 10Brouberol) [15:25:58] (03PS2) 10JMeybohm: kubernetes::node restart rsyslog if too many fd's are blocked by inotify [puppet] - 10https://gerrit.wikimedia.org/r/1015039 (https://phabricator.wikimedia.org/T357616) [15:30:05] 10ops-esams, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: esams text cp nvme upgrade - https://phabricator.wikimedia.org/T360430#9665926 (10RobH) a:05RobH→03Fabfur Reassigning from myself over to @fabfur for reimaging at #traffic's leisure. 
[15:30:46] (03CR) 10JMeybohm: [V:03+1] "PCC SUCCESS (CORE_DIFF 14): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1753/c" [puppet] - 10https://gerrit.wikimedia.org/r/1015039 (https://phabricator.wikimedia.org/T357616) (owner: 10JMeybohm) [15:33:10] !log jnuche@deploy1002 Started deploy [releng/jenkins-deploy@1a343bf] (releasing): deploying fix for T361084 to all targets [15:33:17] T361084: Upgrade matrix-auth for Jenkins 2.440 - https://phabricator.wikimedia.org/T361084 [15:33:29] !log jnuche@deploy1002 Finished deploy [releng/jenkins-deploy@1a343bf] (releasing): deploying fix for T361084 to all targets (duration: 00m 19s) [15:33:59] !log jnuche@deploy1002 Started deploy [releng/jenkins-deploy@1a343bf] (releasing): deploying fix for T361084 to all targets [15:34:41] (03CR) 10Andrew Bogott: [C:03+1] P:puppetserver::git: do not mark directories as safe [puppet] - 10https://gerrit.wikimedia.org/r/1015032 (owner: 10Majavah) [15:35:02] !log jnuche@deploy1002 Finished deploy [releng/jenkins-deploy@1a343bf] (releasing): deploying fix for T361084 to all targets (duration: 01m 03s) [15:36:54] (03CR) 10JMeybohm: [C:03+2] changeprop: Migrate to base.external-services-networkpolicy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1014540 (https://phabricator.wikimedia.org/T359423) (owner: 10JMeybohm) [15:36:57] (03CR) 10JMeybohm: [C:03+2] changeprop: Add base.external-services-networkpolicy:1.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1014539 (https://phabricator.wikimedia.org/T359423) (owner: 10JMeybohm) [15:36:59] (03CR) 10JMeybohm: [C:03+2] changeprop: Update mesh modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/1014538 (https://phabricator.wikimedia.org/T359423) (owner: 10JMeybohm) [15:37:06] (03PS10) 10Klausman: charts/kserve-inference: Wire up generated network policy for LW services [deployment-charts] - 10https://gerrit.wikimedia.org/r/1015029 
(https://phabricator.wikimedia.org/T360428) [15:37:25] (SystemdUnitFailed) resolved: git_pull_charts.service on deploy1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:37:53] (03CR) 10Majavah: [V:03+1 C:03+2] P:puppetserver::git: do not mark directories as safe [puppet] - 10https://gerrit.wikimedia.org/r/1015032 (owner: 10Majavah) [15:37:59] (03Merged) 10jenkins-bot: changeprop: Update mesh modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/1014538 (https://phabricator.wikimedia.org/T359423) (owner: 10JMeybohm) [15:38:01] (03Merged) 10jenkins-bot: changeprop: Add base.external-services-networkpolicy:1.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1014539 (https://phabricator.wikimedia.org/T359423) (owner: 10JMeybohm) [15:38:04] (03Merged) 10jenkins-bot: changeprop: Migrate to base.external-services-networkpolicy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1014540 (https://phabricator.wikimedia.org/T359423) (owner: 10JMeybohm) [15:40:42] (03CR) 10Scott French: "I am definitely open to switching gears and special-casing the root-prefix case." 
[software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/1007738 (https://phabricator.wikimedia.org/T358636) (owner: 10Scott French) [15:42:10] (03PS1) 10Clément Goubert: scap: Add retry_on_timeout to scap httpbb checks [puppet] - 10https://gerrit.wikimedia.org/r/1015071 (https://phabricator.wikimedia.org/T360867) [15:43:02] !log jayme@deploy1002 helmfile [staging] START helmfile.d/services/changeprop: apply [15:43:48] !log jayme@deploy1002 helmfile [staging] DONE helmfile.d/services/changeprop: apply [15:45:11] (03CR) 10Ahmon Dancy: [C:03+1] scap: Add retry_on_timeout to scap httpbb checks [puppet] - 10https://gerrit.wikimedia.org/r/1015071 (https://phabricator.wikimedia.org/T360867) (owner: 10Clément Goubert) [15:45:38] (03CR) 10Clément Goubert: [C:03+2] scap: Add retry_on_timeout to scap httpbb checks [puppet] - 10https://gerrit.wikimedia.org/r/1015071 (https://phabricator.wikimedia.org/T360867) (owner: 10Clément Goubert) [15:46:03] (03CR) 10Ahmon Dancy: "I notice that If6c7415ae9e5c has been abandoned. Should this change be abandoned too?" 
[puppet] - 10https://gerrit.wikimedia.org/r/810048 (owner: 10Giuseppe Lavagetto) [15:46:40] (03PS11) 10Klausman: charts/kserve-inference: Wire up generated network policy for LW services [deployment-charts] - 10https://gerrit.wikimedia.org/r/1015029 (https://phabricator.wikimedia.org/T360428) [15:46:54] 06SRE, 10MW-on-K8s, 10RESTBase, 06serviceops, 13Patch-For-Review: Migrate restbase from mwapi-async to mw-api-int - https://phabricator.wikimedia.org/T358213#9665984 (10Clement_Goubert) [15:47:19] 06SRE, 10MW-on-K8s, 10RESTBase, 06serviceops, 13Patch-For-Review: Migrate restbase from mwapi-async to mw-api-int - https://phabricator.wikimedia.org/T358213#9665983 (10Clement_Goubert) 50% {F43529353} [15:50:32] !log jayme@deploy1002 helmfile [eqiad] START helmfile.d/services/changeprop: apply [15:50:38] !log jayme@deploy1002 helmfile [eqiad] DONE helmfile.d/services/changeprop: apply [15:50:44] !log jayme@deploy1002 helmfile [eqiad] START helmfile.d/services/changeprop: apply [15:51:21] !log 50% of backend RESTbase traffic to mw-api-int - T358213 [15:51:23] !log jayme@deploy1002 helmfile [eqiad] DONE helmfile.d/services/changeprop: apply [15:51:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:51:25] T358213: Migrate restbase from mwapi-async to mw-api-int - https://phabricator.wikimedia.org/T358213 [15:51:35] !log arnaudb@cumin1002 END (ERROR) - Cookbook sre.mysql.clone (exit_code=97) Will create a clone of db2115.codfw.wmnet onto db2215.codfw.wmnet [15:51:40] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 12 days, 0:00:00 on elastic2038.codfw.wmnet with reason: T358882 [15:51:43] T358882: Decommission elastic2037-2054 - https://phabricator.wikimedia.org/T358882 [15:51:44] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12 days, 0:00:00 on elastic2038.codfw.wmnet with reason: T358882 [15:52:33] (03CR) 10Brouberol: [C:03+1] "`selector: "wmf-app == 'kserve-inference' && release == 'main'"`" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1015029 (https://phabricator.wikimedia.org/T360428) (owner: 10Klausman) [15:52:37] (03CR) 10Elukey: charts/kserve-inference: Wire up generated network policy for LW services (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1015029 (https://phabricator.wikimedia.org/T360428) (owner: 10Klausman) [15:52:45] (03PS1) 10Tchanders: Scope temp user reserved pattern to temp users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1015072 (https://phabricator.wikimedia.org/T361021) [15:52:59] (03CR) 10Elukey: "Ah no wait okok I see the file, please split it in two code changes, so others can review it separately :)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1015029 (https://phabricator.wikimedia.org/T360428) (owner: 10Klausman) [15:53:30] !log jayme@deploy1002 helmfile [codfw] START helmfile.d/services/changeprop: apply [15:53:47] (03CR) 10Brouberol: [C:03+1] charts/kserve-inference: Wire up generated network policy for LW services (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1015029 (https://phabricator.wikimedia.org/T360428) (owner: 10Klausman) [15:53:52] (03PS12) 10Klausman: charts/kserve-inference: Wire up generated network policy for LW services [deployment-charts] - 10https://gerrit.wikimedia.org/r/1015029 (https://phabricator.wikimedia.org/T360428) [15:53:58] !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/services/changeprop: apply [15:54:53] (03CR) 10Elukey: charts/kserve-inference: Wire up generated network policy for LW services (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1015029 (https://phabricator.wikimedia.org/T360428) (owner: 10Klausman) [15:55:11] !log bking@cumin2002 running puppet against A:wdqs-main to apply nginx changes T360993 [15:55:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:55:18] T360993: WDQS lag propagation to wikidata not working as intended - 
https://phabricator.wikimedia.org/T360993 [15:57:51] (03PS1) 10Klausman: modules: Add new version of external-services-networkpolicy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1015074 (https://phabricator.wikimedia.org/T360428) [15:58:14] claime: not sure if this is just a coincidence but there's a big spike in 400s underway - we've seen bigger in the past so hopefully it's just a matter of timing https://grafana.wikimedia.org/d/000000068/restbase?orgId=1&from=now-3h&to=now&viewPanel=14 [15:58:36] everything else looks fine though [15:59:04] (03PS2) 10Klausman: modules: Add new version of external-services-networkpolicy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1015074 (https://phabricator.wikimedia.org/T360428) [15:59:09] hnowlan: hmmm yeah, let's wait and see [15:59:18] and it's back down x) [15:59:20] and it's back down, heheh [15:59:47] (03CR) 10Klausman: modules: Add new version of external-services-networkpolicy (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1015074 (https://phabricator.wikimedia.org/T360428) (owner: 10Klausman) [15:59:58] https://www.youtube.com/watch?v=FMNJuSl91qY [16:02:21] (03CR) 10JMeybohm: [C:03+2] changeprop-jobqueue: Migrate to base.external-services-networkpolicy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1014542 (https://phabricator.wikimedia.org/T359423) (owner: 10JMeybohm) [16:03:25] (03Merged) 10jenkins-bot: changeprop-jobqueue: Migrate to base.external-services-networkpolicy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1014542 (https://phabricator.wikimedia.org/T359423) (owner: 10JMeybohm) [16:05:28] !log jayme@deploy1002 helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply [16:06:29] !log jayme@deploy1002 helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply [16:07:39] !log jayme@deploy1002 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply [16:07:48] (03CR) 10CDanis: [C:03+2] Revert "Block requests from 
"facebookexternalhit" UA" [puppet] - 10https://gerrit.wikimedia.org/r/1015051 (owner: 10CDanis) [16:07:58] 06SRE, 10SRE-Access-Requests, 06Fundraising-Backlog: Can we please add our vendor to Google Postmaster Tools - https://phabricator.wikimedia.org/T360907#9666103 (10Jgreen) >>! In T360907#9664537, @jcrespo wrote: > @DBu-WMF Hi, we are discussing how to proceed, as handling postmaster access is a new process f... [16:08:35] !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply [16:08:39] (03PS1) 10Klausman: modules: add v1.0.1 of external-services-networkpolicy in prep for 1015074 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1015077 (https://phabricator.wikimedia.org/T360428) [16:09:22] !log jayme@deploy1002 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply [16:10:07] !log jayme@deploy1002 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply [16:11:22] (03CR) 10Filippo Giunchedi: [C:03+1] "LGTM, thank you for the bandaid" [puppet] - 10https://gerrit.wikimedia.org/r/1015039 (https://phabricator.wikimedia.org/T357616) (owner: 10JMeybohm) [16:12:44] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 22:00:00 on db[2115,2215].codfw.wmnet with reason: Downtime for analysis [16:12:47] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 22:00:00 on db[2115,2215].codfw.wmnet with reason: Downtime for analysis [16:15:15] (03CR) 10Andrea Denisse: [C:03+2] alert: Update hiera entries for alert2001 to use Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1003527 (https://phabricator.wikimedia.org/T358506) (owner: 10Andrea Denisse) [16:15:50] (03PS13) 10Klausman: charts/kserve-inference: Wire up generated network policy for LW services [deployment-charts] - 10https://gerrit.wikimedia.org/r/1015029 (https://phabricator.wikimedia.org/T360428) [16:15:54] (03CR) 10Dreamy Jazz: [C:03+1] Scope temp user reserved pattern to temp users [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1015072 (https://phabricator.wikimedia.org/T361021) (owner: 10Tchanders) [16:16:27] (03PS4) 10Scott French: Improve support for mirroring the full keyspace [software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/1007738 (https://phabricator.wikimedia.org/T358636) [16:16:31] (03PS4) 10Klausman: modules: Change external-services-networkpolicy to allow specifying appname [deployment-charts] - 10https://gerrit.wikimedia.org/r/1015074 (https://phabricator.wikimedia.org/T360428) [16:16:43] (03CR) 10JMeybohm: [V:03+1 C:03+2] kubernetes::node restart rsyslog if too many fd's are blocked by inotify [puppet] - 10https://gerrit.wikimedia.org/r/1015039 (https://phabricator.wikimedia.org/T357616) (owner: 10JMeybohm) [16:17:01] (03CR) 10CI reject: [V:04-1] charts/kserve-inference: Wire up generated network policy for LW services [deployment-charts] - 10https://gerrit.wikimedia.org/r/1015029 (https://phabricator.wikimedia.org/T360428) (owner: 10Klausman) [16:17:26] 06SRE, 10SRE-Access-Requests, 06Fundraising-Backlog: Can we please add our vendor to Google Postmaster Tools - https://phabricator.wikimedia.org/T360907#9666137 (10DBu-WMF) Trilogy is not an unknown vendor to us. They have been assisting us with Banners and Emails for several years. They also manage our co... 
[16:18:20] (03PS1) 10Ilias Sarantopoulos: ml-services: remove redundant deployments from ml-staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1015080 (https://phabricator.wikimedia.org/T361117) [16:19:03] (03PS5) 10Scott French: Improve support for mirroring the full keyspace [software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/1007738 (https://phabricator.wikimedia.org/T358636) [16:21:04] !log denisse@cumin2002 START - Cookbook sre.puppet.migrate-host for host alert2001.wikimedia.org [16:21:23] !log denisse@cumin2002 END (FAIL) - Cookbook sre.puppet.migrate-host (exit_code=99) for host alert2001.wikimedia.org [16:22:15] !log denisse@cumin2002 START - Cookbook sre.puppet.migrate-host for host alert2001.wikimedia.org [16:22:35] (03CR) 10Scott French: "This is now done. I do feel better about this obviating the manual cleanup when switching to / from full-keyspace mirroring, so yeah, good" [software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/1007738 (https://phabricator.wikimedia.org/T358636) (owner: 10Scott French) [16:24:45] (03CR) 10Klausman: "@Janis: can you comment on the `app` vs `app-wmf` issue that is open? Thankyo" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1015029 (https://phabricator.wikimedia.org/T360428) (owner: 10Klausman) [16:25:47] 06SRE, 10SRE-Access-Requests, 06Fundraising-Backlog: Can we please add our vendor to Google Postmaster Tools - https://phabricator.wikimedia.org/T360907#9666211 (10bsisolak) We have access to Google Postmaster Tools, this is just with a new account that will allow us to send automated alerts based on a Spam... 
[16:28:05] !log denisse@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host alert2001.wikimedia.org [16:30:27] (03CR) 10JMeybohm: modules: Change external-services-networkpolicy to allow specifying appname (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1015074 (https://phabricator.wikimedia.org/T360428) (owner: 10Klausman) [16:31:56] !log depool and restart swift-proxy on moss-fe2001 then repool T360913 [16:32:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:32:01] T360913: Swift proxy server misbehaviour (no longer calling `accept`?) - https://phabricator.wikimedia.org/T360913 [16:32:17] (03CR) 10JMeybohm: "I'm not sure that you mean exactly. AIUI you already proposed a change that would allow charts to override the label used in the selector " [deployment-charts] - 10https://gerrit.wikimedia.org/r/1015029 (https://phabricator.wikimedia.org/T360428) (owner: 10Klausman) [16:32:30] (03CR) 10Klausman: modules: Change external-services-networkpolicy to allow specifying appname (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1015074 (https://phabricator.wikimedia.org/T360428) (owner: 10Klausman) [16:33:09] (03PS5) 10Klausman: modules: Change external-services-networkpolicy to allow specifying appname [deployment-charts] - 10https://gerrit.wikimedia.org/r/1015074 (https://phabricator.wikimedia.org/T360428) [16:34:48] !log restart swift-proxy on ms-fe2010 then repool T360913 [16:34:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:35:49] (03PS14) 10Klausman: charts/kserve-inference: Wire up generated network policy for LW services [deployment-charts] - 10https://gerrit.wikimedia.org/r/1015029 (https://phabricator.wikimedia.org/T360428) [16:36:37] (03CR) 10CI reject: [V:04-1] charts/kserve-inference: Wire up generated network policy for LW services [deployment-charts] - 10https://gerrit.wikimedia.org/r/1015029 
(https://phabricator.wikimedia.org/T360428) (owner: 10Klausman) [16:37:03] !log depool and restart swift-proxy on ms-fe2011 then repool T360913 [16:37:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:37:17] T360913: Swift proxy server misbehaviour (no longer calling `accept`?) - https://phabricator.wikimedia.org/T360913 [16:37:36] !log depool and restart swift-proxy on ms-fe2012 then repool T360913 [16:37:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:38:15] !log depool and restart swift-proxy on ms-fe2013 then repool T360913 [16:38:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:38:32] (03PS6) 10Klausman: modules: Change external-services-networkpolicy to allow specifying appname [deployment-charts] - 10https://gerrit.wikimedia.org/r/1015074 (https://phabricator.wikimedia.org/T360428) [16:40:17] (03PS7) 10Klausman: modules: Change external-services-networkpolicy to allow specifying appname [deployment-charts] - 10https://gerrit.wikimedia.org/r/1015074 (https://phabricator.wikimedia.org/T360428) [16:40:17] (03PS15) 10Klausman: charts/kserve-inference: Wire up generated network policy for LW services [deployment-charts] - 10https://gerrit.wikimedia.org/r/1015029 (https://phabricator.wikimedia.org/T360428) [16:41:15] (03CR) 10CI reject: [V:04-1] charts/kserve-inference: Wire up generated network policy for LW services [deployment-charts] - 10https://gerrit.wikimedia.org/r/1015029 (https://phabricator.wikimedia.org/T360428) (owner: 10Klausman) [16:41:20] (03CR) 10Klausman: "This comment/request was a draft before that and is outdated and can be disregarded." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1015029 (https://phabricator.wikimedia.org/T360428) (owner: 10Klausman) [16:46:45] (03PS16) 10Klausman: charts/kserve-inference: Wire up generated network policy for LW services [deployment-charts] - 10https://gerrit.wikimedia.org/r/1015029 (https://phabricator.wikimedia.org/T360428) [16:50:11] 10ops-codfw, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware), 13Patch-For-Review: Q#:rack/setup/install (2) cloudbackup hosts - https://phabricator.wikimedia.org/T356216#9666319 (10Papaul) @Jhancock.wm this is what 2003 is showing on console ` ┌───────────────────────┤ [!!] Partition disks ├──────────... [16:50:11] (03CR) 10Brouberol: modules: add v1.0.1 of external-services-networkpolicy in prep for 1015074 (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1015077 (https://phabricator.wikimedia.org/T360428) (owner: 10Klausman) [16:50:24] (03CR) 10JMeybohm: [C:03+1] modules: Change external-services-networkpolicy to allow specifying appname [deployment-charts] - 10https://gerrit.wikimedia.org/r/1015074 (https://phabricator.wikimedia.org/T360428) (owner: 10Klausman) [16:51:09] (03CR) 10JMeybohm: [C:03+1] modules: add v1.0.1 of external-services-networkpolicy in prep for 1015074 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1015077 (https://phabricator.wikimedia.org/T360428) (owner: 10Klausman) [16:51:26] (03CR) 10Klausman: modules: add v1.0.1 of external-services-networkpolicy in prep for 1015074 (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1015077 (https://phabricator.wikimedia.org/T360428) (owner: 10Klausman) [16:51:57] (03CR) 10JMeybohm: [C:03+1] modules: Change external-services-networkpolicy to allow specifying appname (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1015074 (https://phabricator.wikimedia.org/T360428) (owner: 10Klausman) [16:52:21] (03CR) 10Brouberol: [C:03+1] modules: add v1.0.1 of external-services-networkpolicy in prep for 
1015074 (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1015077 (https://phabricator.wikimedia.org/T360428) (owner: 10Klausman) [16:52:49] (03CR) 10Brouberol: [C:03+1] "Approved as you explained that you were using this method to highlight the real diff" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1015074 (https://phabricator.wikimedia.org/T360428) (owner: 10Klausman) [16:52:59] 06SRE, 06Infrastructure-Foundations, 13Patch-For-Review, 10SRE Observability (FY2023/2024-Q3): Fix the Alert hosts Puppet catalogue to be compatible with Puppet 7 - https://phabricator.wikimedia.org/T358506#9666325 (10andrea.denisse) 05In progress→03Stalled [16:53:04] (03CR) 10JMeybohm: [C:04-1] charts/kserve-inference: Wire up generated network policy for LW services (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1015029 (https://phabricator.wikimedia.org/T360428) (owner: 10Klausman) [16:53:12] (03CR) 10Brouberol: [C:03+1] charts/kserve-inference: Wire up generated network policy for LW services [deployment-charts] - 10https://gerrit.wikimedia.org/r/1015029 (https://phabricator.wikimedia.org/T360428) (owner: 10Klausman) [16:53:40] (03PS8) 10Klausman: modules: Change external-services-networkpolicy to allow specifying appname [deployment-charts] - 10https://gerrit.wikimedia.org/r/1015074 (https://phabricator.wikimedia.org/T360428) [16:53:46] (03CR) 10Klausman: modules: Change external-services-networkpolicy to allow specifying appname (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1015074 (https://phabricator.wikimedia.org/T360428) (owner: 10Klausman) [16:53:47] (03CR) 10JMeybohm: charts/kserve-inference: Wire up generated network policy for LW services (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1015029 (https://phabricator.wikimedia.org/T360428) (owner: 10Klausman) [16:53:54] 06SRE, 06Infrastructure-Foundations, 10MediaWiki-Email, 10observability: Consolidation and tracking of automated 
email alerts improvements across services - https://phabricator.wikimedia.org/T360902#9666337 (10andrea.denisse) [16:54:10] (03CR) 10Klausman: [C:03+2] modules: add v1.0.1 of external-services-networkpolicy in prep for 1015074 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1015077 (https://phabricator.wikimedia.org/T360428) (owner: 10Klausman) [16:54:36] 06SRE, 06Infrastructure-Foundations, 13Patch-For-Review, 10SRE Observability (FY2023/2024-Q3): Fix the Alert hosts Puppet catalogue to be compatible with Puppet 7 - https://phabricator.wikimedia.org/T358506#9666323 (10andrea.denisse) Hi Infrastructure Foundations Team, We're currently facing a challenge w... [16:55:22] (03Merged) 10jenkins-bot: modules: add v1.0.1 of external-services-networkpolicy in prep for 1015074 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1015077 (https://phabricator.wikimedia.org/T360428) (owner: 10Klausman) [16:55:55] (03PS9) 10Klausman: modules: Change external-services-networkpolicy to allow specifying appname [deployment-charts] - 10https://gerrit.wikimedia.org/r/1015074 (https://phabricator.wikimedia.org/T360428) [16:56:10] (03CR) 10JMeybohm: modules: Change external-services-networkpolicy to allow specifying appname (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1015074 (https://phabricator.wikimedia.org/T360428) (owner: 10Klausman) [16:57:13] (03CR) 10Klausman: [C:03+2] modules: Change external-services-networkpolicy to allow specifying appname [deployment-charts] - 10https://gerrit.wikimedia.org/r/1015074 (https://phabricator.wikimedia.org/T360428) (owner: 10Klausman) [16:58:08] (03Merged) 10jenkins-bot: modules: Change external-services-networkpolicy to allow specifying appname [deployment-charts] - 10https://gerrit.wikimedia.org/r/1015074 (https://phabricator.wikimedia.org/T360428) (owner: 10Klausman) [17:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240327T1700) 
[17:01:28] (03CR) 10Klausman: [C:03+1] ml-services: remove redundant deployments from ml-staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1015080 (https://phabricator.wikimedia.org/T361117) (owner: 10Ilias Sarantopoulos) [17:04:35] (03PS1) 10Papaul: Configure dbprov1005/1006 for Puppet 7 like for dbprov2005/2006 [puppet] - 10https://gerrit.wikimedia.org/r/1015082 (https://phabricator.wikimedia.org/T355353) [17:06:02] (03CR) 10Papaul: [C:03+2] Configure dbprov1005/1006 for Puppet 7 like for dbprov2005/2006 [puppet] - 10https://gerrit.wikimedia.org/r/1015082 (https://phabricator.wikimedia.org/T355353) (owner: 10Papaul) [17:07:24] (03PS1) 10Esanders: Enable wgVisualEditorAllowExternalLinkPaste at collabwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1015083 [17:08:37] 06SRE, 06Infrastructure-Foundations, 13Patch-For-Review, 10SRE Observability (FY2023/2024-Q3): Fix the Alert hosts Puppet catalogue to be compatible with Puppet 7 - https://phabricator.wikimedia.org/T358506#9666364 (10Volans) The premise seems to mix different things. PuppetDB is a totally separated servic... 
[17:12:47] !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host dbprov1006.eqiad.wmnet with OS bullseye [17:13:05] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops, 13Patch-For-Review: Q3:rack/setup/install dbprov100[56] - https://phabricator.wikimedia.org/T355353#9666369 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host dbprov1006.eqiad.wmnet with OS bullseye [17:14:58] (03Abandoned) 10Mabualruz: MW Config - Rename the skin night mode classes to more readable classes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1012989 (https://phabricator.wikimedia.org/T359983) (owner: 10Mabualruz) [17:18:06] (03PS1) 10Hnowlan: cassandra-http-gateway: use cassandra module [deployment-charts] - 10https://gerrit.wikimedia.org/r/1015085 [17:21:44] (03PS1) 10Esanders: Set wgMFFallbackEditor to visual for most VE wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1015086 (https://phabricator.wikimedia.org/T361134) [17:22:18] 06SRE, 10Wikimedia-Mailing-lists, 07Performance Issue, 07Upstream: https://lists.wikimedia.org/postorius is sloooow - https://phabricator.wikimedia.org/T353891#9666437 (10Reedy) It's nearly every time I use it. Similarly, clicking "Manage this list" on https://lists.wikimedia.org/hyperkitty/list/mediawiki... [17:30:24] (03PS9) 10Dreamy Jazz: Add wgAutoCreateTempUser configuration for production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1014526 (https://phabricator.wikimedia.org/T349506) [17:34:56] 06SRE, 10Wikimedia-Mailing-lists, 07Performance Issue, 07Upstream: https://lists.wikimedia.org/postorius is sloooow - https://phabricator.wikimedia.org/T353891#9666482 (10Reedy) Similar for https://lists.wikimedia.org/postorius/lists/mediawiki-announce.lists.wikimedia.org/ And it even failed with a HTTP 5... 
[17:37:41] (03PS2) 10Hnowlan: cassandra-http-gateway: use cassandra module [deployment-charts] - 10https://gerrit.wikimedia.org/r/1015085 [17:39:28] 10ops-eqiad, 06SRE, 10Data-Persistence-Backup, 06DC-Ops, 10media-backups: backup1005 crashed - https://phabricator.wikimedia.org/T361087#9666500 (10VRiley-WMF) a:03VRiley-WMF [17:40:20] jouncebot: nowandnext [17:40:21] For the next 0 hour(s) and 19 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240327T1700) [17:40:21] In 0 hour(s) and 19 minute(s): Train log triage with CPT (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240327T1800) [17:40:21] In 0 hour(s) and 19 minute(s): MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240327T1800) [17:41:26] (03PS1) 10Andrew Bogott: pcc-db1001.yaml: update key for cloudinfra-internal-puppetserver-1 [puppet] - 10https://gerrit.wikimedia.org/r/1015087 [17:42:06] (03CR) 10Andrew Bogott: [C:03+2] pcc-db1001.yaml: update key for cloudinfra-internal-puppetserver-1 [puppet] - 10https://gerrit.wikimedia.org/r/1015087 (owner: 10Andrew Bogott) [18:00:05] jeena and dancy: How many deployers does it take to do Train log triage with CPT deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240327T1800). [18:00:05] jeena and dancy: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) MediaWiki train - Utc-7 Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240327T1800). 
[18:11:14] (03PS1) 10TrainBranchBot: group1 wikis to 1.42.0-wmf.24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1015089 (https://phabricator.wikimedia.org/T360156) [18:11:15] (03CR) 10TrainBranchBot: [C:03+2] group1 wikis to 1.42.0-wmf.24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1015089 (https://phabricator.wikimedia.org/T360156) (owner: 10TrainBranchBot) [18:12:00] (03Merged) 10jenkins-bot: group1 wikis to 1.42.0-wmf.24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1015089 (https://phabricator.wikimedia.org/T360156) (owner: 10TrainBranchBot) [18:13:32] (03PS3) 10Hnowlan: cassandra-http-gateway: use cassandra module [deployment-charts] - 10https://gerrit.wikimedia.org/r/1015085 [18:14:15] 10ops-eqiad, 06SRE, 10Observability-Metrics: Memory upgrade request for prometheus100[56] - https://phabricator.wikimedia.org/T360687#9666633 (10herron) Ah excellent! I thought we would have to order new. Yes in that case lets go ahead with 32Gig DDR4 2666 please. Thank you! [18:24:10] (03PS4) 10Hnowlan: cassandra-http-gateway: use cassandra module [deployment-charts] - 10https://gerrit.wikimedia.org/r/1015085 [18:25:33] !log jhuneidi@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.42.0-wmf.24 refs T360156 [18:25:37] T360156: 1.42.0-wmf.24 deployment blockers - https://phabricator.wikimedia.org/T360156 [18:34:10] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dbprov1006.eqiad.wmnet with OS bullseye [18:34:24] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops, 13Patch-For-Review: Q3:rack/setup/install dbprov100[56] - https://phabricator.wikimedia.org/T355353#9666690 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host dbprov1006.eqiad.wmnet with OS bullseye executed w... 
[18:38:11] !log jhuneidi@deploy1002 Synchronized php: group1 wikis to 1.42.0-wmf.24 refs T360156 (duration: 12m 38s) [18:38:16] T360156: 1.42.0-wmf.24 deployment blockers - https://phabricator.wikimedia.org/T360156 [18:46:17] 06SRE, 10Wikimedia-Mailing-lists, 07Performance Issue, 07Upstream: https://lists.wikimedia.org/postorius is sloooow - https://phabricator.wikimedia.org/T353891#9666743 (10jcrespo) Thank you Reedy, I trust you, it was just that the title wasn't descriptive enough (exact url, logged in/logged out, etc.). The... [18:54:07] !log increasing volume size of backup2011 T334069 [18:54:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:54:11] T334069: Evaluate and decide the future of MinIO for media backups given the upgrade requirements and increase the available storage space - https://phabricator.wikimedia.org/T334069 [18:57:06] (03PS1) 10Kimberly Sarabia: Updates config to deploy vector 2022 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1015095 (https://phabricator.wikimedia.org/T360628) [19:07:17] (03PS2) 10Cwhite: WIP: use oauth2-proxy for opensearch dashboards [puppet] - 10https://gerrit.wikimedia.org/r/1015045 (owner: 10Filippo Giunchedi) [19:11:26] (03CR) 10Jdrewniak: [C:03+1] Updates config to deploy vector 2022 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1015095 (https://phabricator.wikimedia.org/T360628) (owner: 10Kimberly Sarabia) [19:12:18] (03CR) 10Dzahn: "once this gets merged, would be nice if you can also delete the apache config in sites-enabled/sites-available and restart apache to leave" [puppet] - 10https://gerrit.wikimedia.org/r/1015005 (https://phabricator.wikimedia.org/T350796) (owner: 10AOkoth) [19:13:35] 06SRE, 10Maps: Allow Wikimedia Maps usage on academic researches - https://phabricator.wikimedia.org/T361146 (10Klavomen) 03NEW [19:15:28] (03CR) 10Dzahn: [C:03+1] miscweb: remove profile::microsites::security [puppet] - 10https://gerrit.wikimedia.org/r/1015005 
(https://phabricator.wikimedia.org/T350796) (owner: 10AOkoth) [19:16:48] (03CR) 10Dzahn: [C:03+2] peopleweb: include ::profile::prometheus::apache_exporter [puppet] - 10https://gerrit.wikimedia.org/r/1014612 (owner: 10Dzahn) [19:16:53] (03PS2) 10Dzahn: peopleweb: include ::profile::prometheus::apache_exporter [puppet] - 10https://gerrit.wikimedia.org/r/1014612 [19:17:16] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:19:40] (SystemdUnitFailed) firing: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:20:42] (RoutinatorRsyncErrors) firing: (2) Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [19:23:39] (03CR) 10Dzahn: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1014612 (owner: 10Dzahn) [19:24:55] (03CR) 10Kosta Harlan: "Follow-up is in Ie98c7d9fcbdc812b5d8b4abfba6cb38497513c09" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1014559 (https://phabricator.wikimedia.org/T361021) (owner: 10Tchanders) [19:25:19] (03PS2) 10Kosta Harlan: Scope temp user reserved pattern to temp users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1015072 (https://phabricator.wikimedia.org/T361021) (owner: 10Tchanders) [19:28:58] (03CR) 10Kosta Harlan: [C:03+1] "Should work OK for the next 975 years or so" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1015072 (https://phabricator.wikimedia.org/T361021) (owner: 10Tchanders) [19:39:15] (03CR) 
10Dzahn: [C:03+2] vrts: switch envoy SSL provider to cfssl [puppet] - 10https://gerrit.wikimedia.org/r/1014606 (https://phabricator.wikimedia.org/T360413) (owner: 10Dzahn) [19:40:38] (03CR) 10Bking: [C:03+1] "I grepped thru the repo for "hue" and found some other files that might need cleaning up: https://phabricator.wikimedia.org/P58961 . If th" [puppet] - 10https://gerrit.wikimedia.org/r/1014557 (https://phabricator.wikimedia.org/T341895) (owner: 10Brouberol) [19:41:38] !log ticket.wikimedia.org - replacing envoy cert on backends [19:41:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:45:27] (RoutinatorRsyncErrors) firing: (2) Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [19:45:53] (03CR) 10TheDJ: [C:03+1] lists: Allow images from upload.wikimedia.org in CSP [puppet] - 10https://gerrit.wikimedia.org/r/987317 (https://phabricator.wikimedia.org/T353755) (owner: 10Legoktm) [19:46:26] (03CR) 10Dzahn: [C:03+2] "[vrts2001:/etc/ssl/localcerts] $ sudo openssl x509 -noout -ext subjectAltName -in /etc/envoy/ssl/discovery__ticket_discovery_wmnet_server." [puppet] - 10https://gerrit.wikimedia.org/r/1014606 (https://phabricator.wikimedia.org/T360413) (owner: 10Dzahn) [20:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240327T2000). [20:00:05] Tchanders and kimberly_sarabia: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[20:00:09] Hello [20:01:49] (03CR) 10Dzahn: [C:03+2] ssl: delete ticket.discovery.wmnet cert, migrated to cfssl [puppet] - 10https://gerrit.wikimedia.org/r/1014607 (https://phabricator.wikimedia.org/T360413) (owner: 10Dzahn) [20:02:02] Hi [20:04:03] Hi, I can run the backports today [20:04:10] tyty [20:04:49] * cjming thanks jeena [20:04:59] Tchanders: kimberly_sarabia Since you both have config patches would it be okay for them to go out at the same time? [20:05:16] jeena: Yes that's fine [20:05:17] That should be fine. [20:05:22] okay great [20:06:40] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jhuneidi@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1015072 (https://phabricator.wikimedia.org/T361021) (owner: 10Tchanders) [20:06:40] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jhuneidi@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1015095 (https://phabricator.wikimedia.org/T360628) (owner: 10Kimberly Sarabia) [20:07:27] (03Merged) 10jenkins-bot: Scope temp user reserved pattern to temp users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1015072 (https://phabricator.wikimedia.org/T361021) (owner: 10Tchanders) [20:07:30] (03Merged) 10jenkins-bot: Updates config to deploy vector 2022 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1015095 (https://phabricator.wikimedia.org/T360628) (owner: 10Kimberly Sarabia) [20:07:58] !log jhuneidi@deploy1002 Started scap: Backport for [[gerrit:1015072|Scope temp user reserved pattern to temp users (T361021 T349506)]], [[gerrit:1015095|Updates config to deploy vector 2022 (T360628)]] [20:08:06] T361021: New accounts with names beginning with ~2 are created - https://phabricator.wikimedia.org/T361021 [20:08:06] T349506: Set temporary user pattern configuration on production ahead of testwiki deployment - https://phabricator.wikimedia.org/T349506 [20:08:06] T360628: Deploy Vector 2022 skin to Wikisource wikis, internal wikis, and 
wikipedias - https://phabricator.wikimedia.org/T360628 [20:08:51] (SwaggerProbeHasFailures) firing: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://proton.svc.codfw.wmnet:4030 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [20:10:30] !log jhuneidi@deploy1002 ksarabia and jhuneidi and tchanders: Backport for [[gerrit:1015072|Scope temp user reserved pattern to temp users (T361021 T349506)]], [[gerrit:1015095|Updates config to deploy vector 2022 (T360628)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:11:10] kimberly_sarabia: Tchanders please let me know when it's okay to continue to full sync [20:11:51] Ok checking. Need a few minutes [20:11:55] jeena: Please go ahead - it's not something testable [20:15:17] LGTM [20:15:24] thanks! [20:15:28] !log jhuneidi@deploy1002 ksarabia and jhuneidi and tchanders: Continuing with sync [20:15:36] (GatewayBackendErrorsHigh) firing: rest-gateway: elevated 5xx errors from proton_cluster in codfw #page - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=codfw%20prometheus/k8s&var-instance=rest-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsHigh [20:17:34] ^ Taking a look. 
[20:19:08] denisse: we had similar issue this morning with proton [20:26:56] !log jhuneidi@deploy1002 Finished scap: Backport for [[gerrit:1015072|Scope temp user reserved pattern to temp users (T361021 T349506)]], [[gerrit:1015095|Updates config to deploy vector 2022 (T360628)]] (duration: 18m 57s) [20:27:08] T361021: New accounts with names beginning with ~2 are created - https://phabricator.wikimedia.org/T361021 [20:27:09] T349506: Set temporary user pattern configuration on production ahead of testwiki deployment - https://phabricator.wikimedia.org/T349506 [20:27:09] T360628: Deploy Vector 2022 skin to Wikisource wikis, internal wikis, and wikipedias - https://phabricator.wikimedia.org/T360628 [20:28:44] UTC late backport window completed [20:29:46] Thanks jeena! [20:30:06] you're welcome! [20:31:01] Thank you! [20:31:37] yw! [20:38:51] (SwaggerProbeHasFailures) resolved: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://proton.svc.codfw.wmnet:4030 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [20:38:57] ok [20:40:10] (03PS1) 10Dzahn: delete ticket.discovery.wmnet dummy key, migrated to cfssl [labs/private] - 10https://gerrit.wikimedia.org/r/1015111 (https://phabricator.wikimedia.org/T360413) [20:41:01] (03PS2) 10Dzahn: delete ticket.discovery.wmnet dummy key, migrated to cfssl [labs/private] - 10https://gerrit.wikimedia.org/r/1015111 (https://phabricator.wikimedia.org/T360413) [20:44:06] (GatewayBackendErrorsHigh) resolved: rest-gateway: elevated 5xx errors from proton_cluster in codfw #page - https://wikitech.wikimedia.org/wiki/API_Gateway#How_to_debug_it - https://grafana.wikimedia.org/d/UOH-5IDMz/api-and-rest-gateway?orgId=1&refresh=30s&viewPanel=57&var-datasource=codfw%20prometheus/k8s&var-instance=rest-gateway - https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsHigh 
[20:45:27] (RoutinatorRsyncErrors) firing: (2) Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [20:45:34] (03CR) 10Dzahn: [V:03+2 C:03+2] delete ticket.discovery.wmnet dummy key, migrated to cfssl [labs/private] - 10https://gerrit.wikimedia.org/r/1015111 (https://phabricator.wikimedia.org/T360413) (owner: 10Dzahn) [20:47:52] 06SRE, 06collaboration-services, 13Patch-For-Review: Phase out cergen for Collaboration Services services - https://phabricator.wikimedia.org/T360413#9667093 (10Dzahn) [20:48:00] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 13Patch-For-Review, 10Puppet (Puppet 7.0): Phase out cergen - https://phabricator.wikimedia.org/T357750#9667094 (10Dzahn) [21:00:04] Deploy window Wikifunction Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240327T2100) [21:23:33] (03CR) 10Andrea Denisse: [C:03+1] "Looks good to me, thank you!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1014048 (https://phabricator.wikimedia.org/T360950) (owner: 10Cwhite) [21:29:41] (03CR) 10Ladsgroup: "I'll deploy it soon" [puppet] - 10https://gerrit.wikimedia.org/r/987317 (https://phabricator.wikimedia.org/T353755) (owner: 10Legoktm) [21:35:19] (03PS1) 10Bking: elasticsearch: remove soon-to-be-decommed codfw hosts [puppet] - 10https://gerrit.wikimedia.org/r/1015119 (https://phabricator.wikimedia.org/T358882) [21:37:55] (03PS2) 10Bking: elasticsearch: remove soon-to-be-decommed codfw hosts [puppet] - 10https://gerrit.wikimedia.org/r/1015119 (https://phabricator.wikimedia.org/T358882) [21:38:19] (03CR) 10Ryan Kemper: [C:03+1] elasticsearch: remove soon-to-be-decommed codfw hosts [puppet] - 10https://gerrit.wikimedia.org/r/1015119 (https://phabricator.wikimedia.org/T358882) (owner: 10Bking) [21:41:20] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host elastic2088.codfw.wmnet with OS bullseye [21:41:46] (03CR) 10Bking: [C:03+2] elasticsearch: remove soon-to-be-decommed codfw hosts [puppet] - 10https://gerrit.wikimedia.org/r/1015119 (https://phabricator.wikimedia.org/T358882) (owner: 10Bking) [21:46:51] !log bking@cumin2002 START - Cookbook sre.hosts.decommission for hosts elastic[2038-2048,2050-2054].codfw.wmnet [21:48:35] (03PS1) 10Ryan Kemper: elastic: decom elastic20[37-54] [puppet] - 10https://gerrit.wikimedia.org/r/1015123 (https://phabricator.wikimedia.org/T358882) [21:49:05] (03CR) 10Bking: [C:03+1] elastic: decom elastic20[37-54] [puppet] - 10https://gerrit.wikimedia.org/r/1015123 (https://phabricator.wikimedia.org/T358882) (owner: 10Ryan Kemper) [21:49:20] (03CR) 10Bking: [C:03+2] elastic: decom elastic20[37-54] [puppet] - 10https://gerrit.wikimedia.org/r/1015123 (https://phabricator.wikimedia.org/T358882) (owner: 10Ryan Kemper) [22:00:27] (RoutinatorRsyncErrors) firing: (2) Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - 
https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [22:15:21] 10ops-codfw, 06SRE, 06Infrastructure-Foundations, 10netops: Decom asw-a-codfw switch stack - https://phabricator.wikimedia.org/T358244#9667344 (10Papaul) [22:16:57] !log T360993 [WDQS Deploy] Gearing up for deploy of wdqs `0.3.138`. Pre-deploy tests passing on canary `wdqs1003` [22:17:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:17:03] T360993: WDQS lag propagation to wikidata not working as intended - https://phabricator.wikimedia.org/T360993 [22:17:10] !log ryankemper@deploy1002 Started deploy [wdqs/wdqs@143ca33]: 0.3.138 [22:17:55] !log T360993 [WDQS Deploy] Tests passing following deploy of `0.3.138` on canary `wdqs1003`; proceeding to rest of fleet [22:17:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:21:39] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance elastic2106-production-search-omega-codfw is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [22:24:17] (03PS1) 10Dzahn: aptrepo: allow gitlab versions from 16.7 to 16.9 [puppet] - 10https://gerrit.wikimedia.org/r/1015136 (https://phabricator.wikimedia.org/T361165) [22:25:57] (03CR) 10EoghanGaffney: [C:03+1] aptrepo: allow gitlab versions from 16.7 to 16.9 [puppet] - 10https://gerrit.wikimedia.org/r/1015136 (https://phabricator.wikimedia.org/T361165) (owner: 10Dzahn) [22:27:42] (03CR) 10Dzahn: [C:03+2] aptrepo: allow gitlab versions from 16.7 to 16.9 [puppet] - 10https://gerrit.wikimedia.org/r/1015136 (https://phabricator.wikimedia.org/T361165) (owner: 10Dzahn) [22:28:35] !log ryankemper@deploy1002 Finished deploy [wdqs/wdqs@143ca33]: 0.3.138 (duration: 11m 24s) [22:30:19] !log T360993 [WDQS Deploy] Restarted 
`wdqs-updater` across all hosts, 4 hosts at a time: `sudo -E cumin -b 4 'A:wdqs-all' 'systemctl restart wdqs-updater'` [22:30:22] !log T360993 [WDQS Deploy] Restarted `wdqs-categories` across all test hosts simultaneously: `sudo -E cumin 'A:wdqs-test' 'systemctl restart wdqs-categories'` [22:30:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:30:23] T360993: WDQS lag propagation to wikidata not working as intended - https://phabricator.wikimedia.org/T360993 [22:30:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:30:28] !log T360993 [WDQS Deploy] Restarting `wdqs-categories` across lvs-managed hosts, one node at a time: `sudo -E cumin -b 1 'A:wdqs-all and not A:wdqs-test' 'depool && sleep 45 && systemctl restart wdqs-categories && sleep 45 && pool'` [22:30:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:35:27] (RoutinatorRsyncErrors) resolved: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [22:45:07] (03PS1) 10Tim Starling: Fix index usage when searching for page titles [extensions/Linter] (wmf/1.42.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1015054 (https://phabricator.wikimedia.org/T360865) [22:46:57] (03PS1) 10Tim Starling: Fix index usage when searching for page titles [extensions/Linter] (wmf/1.42.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1015055 (https://phabricator.wikimedia.org/T360865) [23:03:29] 10SRE-swift-storage, 10TimedMediaHandler-Transcode: 14Purge videos after move - 14https://phabricator.wikimedia.org/T156914#9667436 (10TheDJ) →14Duplicate dup:03T113191 [23:04:16] !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host elastic2088.codfw.wmnet with OS bullseye [23:10:24] (03PS1) 10Gergő Tisza: Enter deprecation trial for third-party cookie blocking 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1015145 (https://phabricator.wikimedia.org/T359957) [23:15:18] !log eevans@cumin1002 START - Cookbook sre.hosts.downtime for 30 days, 0:00:00 on restbase1026.eqiad.wmnet with reason: Decommissioning — T354561 [23:15:23] T354561: Decommission restbase10[19-27] - https://phabricator.wikimedia.org/T354561 [23:15:32] !log eevans@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 30 days, 0:00:00 on restbase1026.eqiad.wmnet with reason: Decommissioning — T354561 [23:16:50] (03PS1) 10Kimberly Sarabia: Revert donatewiki and thankyouwiki for fundraising [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1015146 (https://phabricator.wikimedia.org/T360628) [23:17:16] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:19:40] (SystemdUnitFailed) firing: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:21:39] (CirrusSearchHighOldGCFrequency) resolved: Elasticsearch instance elastic2106-production-search-omega-codfw is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [23:21:54] !log on releases1003: uploaded 80 missing old MediaWiki releases T190369 [23:21:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:21:58] T190369: Big holes in the MediaWiki release archive - https://phabricator.wikimedia.org/T190369 [23:23:49] jouncebot: 
nowandnext [23:23:49] No deployments scheduled for the next 6 hour(s) and 36 minute(s) [23:23:49] In 6 hour(s) and 36 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240328T0600) [23:23:49] In 6 hour(s) and 36 minute(s): Primary database switchover (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240328T0600) [23:26:09] !log dzahn@cumin1002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1004.wikimedia.org with reason: security release [23:39:03] (03PS1) 10Ebernhardson: cirrus: Move small wiki traffic to eqiad [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1015152 [23:40:24] We broke a cluster in codfw, shipping ^ to move traffic [23:41:17] (03PS2) 10Ebernhardson: cirrus: Move small wiki traffic to eqiad [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1015152 [23:41:23] (03CR) 10Ebernhardson: [C:03+2] cirrus: Move small wiki traffic to eqiad [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1015152 (owner: 10Ebernhardson) [23:42:32] (03Merged) 10jenkins-bot: cirrus: Move small wiki traffic to eqiad [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1015152 (owner: 10Ebernhardson) [23:43:15] (MediaWikiLatencyExceeded) firing: p75 latency high: eqiad mw-parsoid (k8s) 868.1ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [23:48:15] (MediaWikiLatencyExceeded) resolved: p75 latency high: eqiad mw-parsoid (k8s) 872.6ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - 
https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [23:49:40] !log ebernhardson@deploy1002 Started scap: Backport for [[gerrit:1015152|cirrus: Move small wiki traffic to eqiad]] [23:51:23] (03PS2) 10Gergő Tisza: Enter deprecation trial for third-party cookie blocking [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1015145 (https://phabricator.wikimedia.org/T359957) [23:52:09] !log ebernhardson@deploy1002 ebernhardson: Backport for [[gerrit:1015152|cirrus: Move small wiki traffic to eqiad]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [23:53:12] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1170.eqiad.wmnet with reason: Maintenance [23:53:14] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1170.eqiad.wmnet with reason: Maintenance [23:53:28] !log ebernhardson@deploy1002 ebernhardson: Continuing with sync [23:55:39] (03CR) 10Krinkle: [C:03+1] Enter deprecation trial for third-party cookie blocking [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1015145 (https://phabricator.wikimedia.org/T359957) (owner: 10Gergő Tisza) [23:58:24] !log T360993 [WDQS Deploy] Deploy complete. Successful test query placed on query.wikidata.org, there's no relevant criticals in Icinga, and Grafana looks good [23:58:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:58:30] T360993: WDQS lag propagation to wikidata not working as intended - https://phabricator.wikimedia.org/T360993 [23:59:10] (03PS1) 10Ebernhardson: cirrus: Move small wiki traffic to eqiad (take two) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1015157 [23:59:42] (03PS2) 10Ebernhardson: cirrus: Move small wiki traffic to eqiad (take two) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1015157