[00:04:16] (MediaWikiLatencyExceeded) resolved: Average latency high: codfw parsoid GET/200: 4.4954521379340635s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=codfw&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[00:16:57] (CR) Andrew Bogott: [C: +2] wmcs-backup: also remove expired backups via delete-expired [puppet] - https://gerrit.wikimedia.org/r/972915 (owner: Andrew Bogott)
[00:38:59] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/972516
[00:39:01] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/972516 (owner: TrainBranchBot)
[00:50:09] PROBLEM - Check systemd state on analytics1077 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:56:13] (CR) CI reject: [V: -1] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/972516 (owner: TrainBranchBot)
[00:58:12] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/972516 (owner: TrainBranchBot)
[01:46:33] RECOVERY - Check systemd state on analytics1077 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:49:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1022:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly -
https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[02:09:31] (PS1) Ssingh: test_dns: fix production IPv6 networks [software/knead-wikidough] - https://gerrit.wikimedia.org/r/972920
[02:10:41] (CR) Ssingh: [V: +2 C: +2] test_dns: fix production IPv6 networks [software/knead-wikidough] - https://gerrit.wikimedia.org/r/972920 (owner: Ssingh)
[02:11:42] (PS1) Marostegui: dbproxy102[2,4]: Promote db1119 to standby [puppet] - https://gerrit.wikimedia.org/r/972921 (https://phabricator.wikimedia.org/T350022)
[02:12:39] (CR) Marostegui: "Please review that the IPs and hostnames do match!" [puppet] - https://gerrit.wikimedia.org/r/972921 (https://phabricator.wikimedia.org/T350022) (owner: Marostegui)
[02:29:01] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs1022:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[02:38:13] (JobUnavailable) firing: (3) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:47:51] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_mw-api-ext_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:53:49] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[03:04:42] (JobUnavailable) firing: (3) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:27:33] (PS1) Andrea Denisse: icinga: Remove unnecessary python-phabricator Python2 dependency [puppet] - https://gerrit.wikimedia.org/r/972925 (https://phabricator.wikimedia.org/T333615)
[03:40:30] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1022:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[03:43:35] (CR) Andrea Denisse: "PCC results: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/358/" [puppet] - https://gerrit.wikimedia.org/r/972925 (https://phabricator.wikimedia.org/T333615) (owner: Andrea Denisse)
[03:44:39] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:45:53] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin1001 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[03:50:30] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs1022:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[03:53:13] (PuppetFailure) firing: Puppet has
failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[04:27:11] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event_sanitized_analytics_immediate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:35:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:40:03] (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:52:30] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1022:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[05:15:37] (PS1) Andrea Denisse: ircecho: Migrate IRC Echo from Python 2 to Python 3 [puppet] - https://gerrit.wikimedia.org/r/972929 (https://phabricator.wikimedia.org/T333615)
[05:22:30] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs1022:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly -
https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[05:52:49] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:54:05] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.264 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:09:15] (PS2) MPGuy2824: InitialiseSettings-labs: Remove values for renamed PageTriage variable [mediawiki-config] - https://gerrit.wikimedia.org/r/968397 (https://phabricator.wikimedia.org/T331595)
[06:12:38] (CR) MPGuy2824: "Related patch which was previously merged: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/965395/" [mediawiki-config] - https://gerrit.wikimedia.org/r/968397 (https://phabricator.wikimedia.org/T331595) (owner: MPGuy2824)
[06:22:00] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1022:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[06:26:07] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on 35 hosts with reason: Primary switchover s1 T350142
[06:26:11] T350142: Switchover s1 master (db2112 -> db2103) - https://phabricator.wikimedia.org/T350142
[06:26:36] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 35 hosts with reason: Primary switchover s1 T350142
[06:27:26] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Set db2103 with weight 0 T350142', diff saved to https://phabricator.wikimedia.org/P53174 and previous config saved to
/var/cache/conftool/dbconfig/20231109-062725-arnaudb.json
[06:47:00] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs1022:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[06:47:06] (CR) Stevemunene: [C: +2] switch druid host to run data_purge job [puppet] - https://gerrit.wikimedia.org/r/970272 (https://phabricator.wikimedia.org/T336042) (owner: Stevemunene)
[06:53:19] (CR) Arnaudb: [C: +2] mariadb: Promote db2103 to s1 master [puppet] - https://gerrit.wikimedia.org/r/969990 (https://phabricator.wikimedia.org/T350142) (owner: Gerrit maintenance bot)
[06:53:52] (PS2) Arnaudb: wmnet: Update s1-master alias [dns] - https://gerrit.wikimedia.org/r/969991 (https://phabricator.wikimedia.org/T350142) (owner: Gerrit maintenance bot)
[07:00:00] !log Starting s1 codfw failover from db2112 to db2103 - T350142
[07:00:05] Deploy window MediaWiki infrastucture (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231109T0700)
[07:00:05] kormat, marostegui, and Amir1: I, the Bot under the Fountain, call upon thee, The Deployer, to do Primary database switchover deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231109T0700).
[07:00:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:00:12] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Set s1 codfw as read-only for maintenance - T350142', diff saved to https://phabricator.wikimedia.org/P53175 and previous config saved to /var/cache/conftool/dbconfig/20231109-070012-arnaudb.json
[07:00:13] T350142: Switchover s1 master (db2112 -> db2103) - https://phabricator.wikimedia.org/T350142
[07:00:23] ready
[07:04:11] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Promote db2103 to s1 primary and set section read-write T350142', diff saved to https://phabricator.wikimedia.org/P53176 and previous config saved to /var/cache/conftool/dbconfig/20231109-070410-arnaudb.json
[07:05:16] (MediaWikiHighErrorRate) firing: (2) Elevated rate of MediaWiki errors - jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[07:06:46] (CR) Arnaudb: [C: +2] wmnet: Update s1-master alias [dns] - https://gerrit.wikimedia.org/r/969991 (https://phabricator.wikimedia.org/T350142) (owner: Gerrit maintenance bot)
[07:08:13] (JobUnavailable) firing: (2) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[07:09:37] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depool db2112 T350142', diff saved to https://phabricator.wikimedia.org/P53177 and previous config saved to /var/cache/conftool/dbconfig/20231109-070936-arnaudb.json
[07:09:41] T350142: Switchover s1 master (db2112 -> db2103) - https://phabricator.wikimedia.org/T350142
[07:10:16] (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[07:13:24] arnaudb: let me know if there is any issue with dbctl (not expecting them ;) )
[07:15:27] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2112.codfw.wmnet with reason: Maintenance
[07:15:29] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2112.codfw.wmnet with reason: Maintenance
[07:31:25] (CR) Volans: [C: -1] "Has this been tested? Running 2to3 is just a starting point for migrating a script to Python3. For example the shebang seems wrong to me." [puppet] - https://gerrit.wikimedia.org/r/972929 (https://phabricator.wikimedia.org/T333615) (owner: Andrea Denisse)
[07:31:27] (PS2) Muehlenhoff: Failover idp.w.o to idp1002 [dns] - https://gerrit.wikimedia.org/r/971952
[07:35:47] !log installing openjdk-8 security updates
[07:35:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:41:36] (Abandoned) Awight: admin: add wmde-fisch to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/739334 (https://phabricator.wikimedia.org/T295781) (owner: Awight)
[07:44:16] (PS1) Filippo Giunchedi: alertmanager: route criticals to o11y IRC [puppet] - https://gerrit.wikimedia.org/r/973073
[07:44:48] (CR) Slyngshede: sre.ganeti.*: customize lock arguments (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/967628 (https://phabricator.wikimedia.org/T341973) (owner: Volans)
[07:45:51] (CR) Filippo Giunchedi: [C: +2] alertmanager: route criticals to o11y IRC [puppet] - https://gerrit.wikimedia.org/r/973073 (owner: Filippo Giunchedi)
[07:48:04] (PS1) Ayounsi: Move pki policy to correct form/to zone [homer/public] - https://gerrit.wikimedia.org/r/973075
[07:49:13] (CR) Ayounsi: [C: +2] Move pki policy to correct form/to zone [homer/public] - https://gerrit.wikimedia.org/r/973075 (owner: Ayounsi)
[07:49:46] (Merged) jenkins-bot: Move pki policy to correct form/to zone
[homer/public] - https://gerrit.wikimedia.org/r/973075 (owner: Ayounsi)
[07:50:57] (CR) Arnaudb: [V: +1 C: +1] "```" [puppet] - https://gerrit.wikimedia.org/r/972921 (https://phabricator.wikimedia.org/T350022) (owner: Marostegui)
[07:51:00] SRE, Infrastructure-Foundations, fundraising-tech-ops, vm-requests: eqiad: 1 VM requested for community-crm - https://phabricator.wikimedia.org/T349402 (MoritzMuehlenhoff) As for the network, do you want a public IP or should this rather run on a private IP and then get served from our CDN? For m...
[07:53:13] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[07:55:40] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: etcd::v3::kubernetes::staging
[07:56:40] (PS1) Kosta Harlan: ipoid: Set MYSQL_HOST [deployment-charts] - https://gerrit.wikimedia.org/r/973076 (https://phabricator.wikimedia.org/T346861)
[08:00:04] Amir1, apergos, and jnuche: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC morning backport and config training. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231109T0800).
[08:00:20] (PS1) Kosta Harlan: ipoid: Bump to latest version [deployment-charts] - https://gerrit.wikimedia.org/r/973077
[08:00:36] morning! it's that time of the day, when I check the calendar for backport and config deployments and find not a one scheduled. and there are no trainees signed up to learn all the scap ins and outs either, luckily. which means... we'll see you here again next week, have a great day/evening!
[08:01:15] (PS1) Giuseppe Lavagetto: kube-state-metrics: add build-depends [docker-images/production-images] - https://gerrit.wikimedia.org/r/973078 (https://phabricator.wikimedia.org/T350366)
[08:02:01] (PS1) Muehlenhoff: Switch etcd::v3::kubernetes::staging to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/973079 (https://phabricator.wikimedia.org/T349619)
[08:04:00] <_joe_> jouncebot: next
[08:04:00] In 0 hour(s) and 55 minute(s): MediaWiki train - Utc-0+Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231109T0900)
[08:04:07] <_joe_> ok, so I can go now :)
[08:04:14] (CR) Giuseppe Lavagetto: [C: +2] mediawiki: add statsd exporter [deployment-charts] - https://gerrit.wikimedia.org/r/972343 (https://phabricator.wikimedia.org/T240685) (owner: Giuseppe Lavagetto)
[08:05:19] (Merged) jenkins-bot: mediawiki: add statsd exporter [deployment-charts] - https://gerrit.wikimedia.org/r/972343 (https://phabricator.wikimedia.org/T240685) (owner: Giuseppe Lavagetto)
[08:07:01] !log oblivian@deploy2002 helmfile [codfw] [main] START helmfile.d/services/mw-jobrunner : sync
[08:07:01] !log oblivian@deploy2002 helmfile [codfw] [canary] START helmfile.d/services/mw-jobrunner : sync
[08:07:13] !log oblivian@deploy2002 helmfile [codfw] [main] DONE helmfile.d/services/mw-jobrunner : sync
[08:07:13] !log oblivian@deploy2002 helmfile [codfw] [canary] DONE helmfile.d/services/mw-jobrunner : sync
[08:09:19] (CR) Slyngshede: [C: +2] Improve installation and setup procedures for running locally. [software/puppet-compiler] - https://gerrit.wikimedia.org/r/972834 (owner: Slyngshede)
[08:10:14] (CR) Slyngshede: [C: +2] Alert on degraded MD RAID devices.
[alerts] - https://gerrit.wikimedia.org/r/972692 (https://phabricator.wikimedia.org/T350694) (owner: Slyngshede)
[08:10:35] (CR) Muehlenhoff: [C: +2] Switch etcd::v3::kubernetes::staging to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/973079 (https://phabricator.wikimedia.org/T349619) (owner: Muehlenhoff)
[08:10:37] (CR) Kosta Harlan: [C: +2] ipoid: Bump to latest version [deployment-charts] - https://gerrit.wikimedia.org/r/973077 (owner: Kosta Harlan)
[08:10:40] (CR) Kosta Harlan: [C: +2] ipoid: Set MYSQL_HOST [deployment-charts] - https://gerrit.wikimedia.org/r/973076 (https://phabricator.wikimedia.org/T346861) (owner: Kosta Harlan)
[08:11:34] (Merged) jenkins-bot: ipoid: Set MYSQL_HOST [deployment-charts] - https://gerrit.wikimedia.org/r/973076 (https://phabricator.wikimedia.org/T346861) (owner: Kosta Harlan)
[08:11:36] (Merged) jenkins-bot: ipoid: Bump to latest version [deployment-charts] - https://gerrit.wikimedia.org/r/973077 (owner: Kosta Harlan)
[08:12:03] (Merged) jenkins-bot: Alert on degraded MD RAID devices. [alerts] - https://gerrit.wikimedia.org/r/972692 (https://phabricator.wikimedia.org/T350694) (owner: Slyngshede)
[08:12:10] (CR) Effie Mouzeli: [C: +1] ipoid: Set MYSQL_HOST [deployment-charts] - https://gerrit.wikimedia.org/r/973076 (https://phabricator.wikimedia.org/T346861) (owner: Kosta Harlan)
[08:12:21] PROBLEM - Checks that the airflow database for airflow wmde is working properly on an-airflow1007 is CRITICAL: CRITICAL: /usr/bin/env AIRFLOW_HOME=/srv/airflow-wmde /usr/lib/airflow/bin/airflow db check did not succeed https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow
[08:12:43] (Merged) jenkins-bot: Improve installation and setup procedures for running locally.
[software/puppet-compiler] - https://gerrit.wikimedia.org/r/972834 (owner: Slyngshede)
[08:13:13] PROBLEM - Checks that the local airflow scheduler for airflow @wmde is working properly on an-airflow1007 is CRITICAL: CRITICAL: /usr/bin/env AIRFLOW_HOME=/srv/airflow-wmde /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-airflow1007.eqiad.wmnet did not succeed https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow
[08:13:15] PROBLEM - Check systemd state on an-airflow1007 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-node-textfile-prometheus-check-certificate-expiry.service,wmf_auto_restart_airflow-scheduler@wmde.service,wmf_auto_restart_airflow-webserver@wmde.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:13:16] !log kharlan@deploy2002 helmfile [eqiad] START helmfile.d/services/ipoid: apply
[08:14:05] PROBLEM - SSH on wdqs1022 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:14:22] !log kharlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/ipoid: apply
[08:15:22] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: etcd::v3::kubernetes::staging
[08:16:19] (PS1) Kosta Harlan: ipoid: Make cronjob names match naming rules [deployment-charts] - https://gerrit.wikimedia.org/r/973081
[08:16:31] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host kubestagetcd2001.codfw.wmnet
[08:17:20] (CR) Brouberol: [C: +1] Switch datahub to use the new an-mariadb servers instead of an-coord (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/972823 (https://phabricator.wikimedia.org/T284150) (owner: Btullis)
[08:17:23] (CR) Kosta Harlan: [C: +2] ipoid: Make cronjob names match naming rules [deployment-charts] - https://gerrit.wikimedia.org/r/973081 (owner: Kosta Harlan)
[08:18:11] (Merged) jenkins-bot: ipoid: Make cronjob
names match naming rules [deployment-charts] - https://gerrit.wikimedia.org/r/973081 (owner: Kosta Harlan)
[08:19:24] !log kharlan@deploy2002 helmfile [eqiad] START helmfile.d/services/ipoid: apply
[08:19:45] (CR) Brouberol: [C: +2] Setup partman reuse recipe for an-druid hosts [puppet] - https://gerrit.wikimedia.org/r/972851 (https://phabricator.wikimedia.org/T332604) (owner: Brouberol)
[08:19:52] !log kharlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/ipoid: apply
[08:20:29] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kubestagetcd2001.codfw.wmnet
[08:21:09] (CR) Strainu: "Review request, if you have time. Thanks!" [mediawiki-config] - https://gerrit.wikimedia.org/r/972473 (https://phabricator.wikimedia.org/T350739) (owner: Strainu)
[08:21:57] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: etcd::v3::dse_k8s_etcd
[08:22:34] (CR) Strainu: [namespaces] Use correct diacritics in Romanian (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/972473 (https://phabricator.wikimedia.org/T350739) (owner: Strainu)
[08:23:06] SRE, MW-on-K8s, Traffic, serviceops, Release-Engineering-Team (Seen): Migrate mobileapps to k8s - https://phabricator.wikimedia.org/T350846 (Joe)
[08:23:24] SRE, MW-on-K8s, Traffic, serviceops, Release-Engineering-Team (Seen): Migrate mobileapps to k8s - https://phabricator.wikimedia.org/T350846 (Joe)
[08:23:36] (PS1) Muehlenhoff: Switch role: etcd::v3::dse_k8s_etcd to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/973082 (https://phabricator.wikimedia.org/T349619)
[08:24:00] SRE, MW-on-K8s, Traffic, serviceops, Release-Engineering-Team (Seen): Migrate mobileapps to k8s - https://phabricator.wikimedia.org/T350846 (Joe) a:Clement_Goubert→Joe
[08:25:35] !log stevemunene@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on an-airflow1007.eqiad.wmnet with reason: Downtime as we
setup the new WMDE Airflow instance
[08:25:49] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on an-airflow1007.eqiad.wmnet with reason: Downtime as we setup the new WMDE Airflow instance
[08:26:13] (CR) Jelto: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/971187 (https://phabricator.wikimedia.org/T347593) (owner: EoghanGaffney)
[08:27:27] (CR) Muehlenhoff: [C: +2] Switch role: etcd::v3::dse_k8s_etcd to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/973082 (https://phabricator.wikimedia.org/T349619) (owner: Muehlenhoff)
[08:35:45] !log restart vopsbot.service on alert1001
[08:35:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:36:58] SRE, SRE-tools, Infrastructure-Foundations, Puppet-Core, Puppet (Puppet 7.0): Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (MoritzMuehlenhoff)
[08:43:13] SRE, MW-on-K8s, Traffic, serviceops, Release-Engineering-Team (Seen): Migrate mobileapps to k8s - https://phabricator.wikimedia.org/T350846 (JMeybohm) Couldn't we just add another mobileapps deployment (like a canary) that connects to mw-api-int and scale that up slowly while scaling the exis...
[08:46:37] (CR) Filippo Giunchedi: [C: +1] "LGTM!" [puppet] - https://gerrit.wikimedia.org/r/970329 (https://phabricator.wikimedia.org/T347312) (owner: Slyngshede)
[08:47:05] !log add 50G to prometheus/ml-serve in codfw
[08:47:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:47:11] (CR) Kosta Harlan: "This seems to not work. I created a `test` table on ipoid DB and am unable to DROP it."
[puppet] - https://gerrit.wikimedia.org/r/972835 (https://phabricator.wikimedia.org/T305114) (owner: Marostegui)
[08:48:18] (CR) Filippo Giunchedi: [C: +2] alertmanager: add alerts-triage on /triage [puppet] - https://gerrit.wikimedia.org/r/972335 (https://phabricator.wikimedia.org/T350014) (owner: Filippo Giunchedi)
[08:50:31] (CR) Slyngshede: [V: +1 C: +2] P:base enable ethtool data collection [puppet] - https://gerrit.wikimedia.org/r/970329 (https://phabricator.wikimedia.org/T347312) (owner: Slyngshede)
[08:51:31] (CR) JMeybohm: [C: +1] staging-eqiad: raise rdf-streaming-updater quota (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/972483 (https://phabricator.wikimedia.org/T349095) (owner: Bking)
[08:57:50] PROBLEM - Check systemd state on wdqs1022 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:00:06] jnuche and dduvall: How many deployers does it take to do MediaWiki train - Utc-0+Utc-7 Version deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231109T0900).
[09:01:00] (PS1) Effie Mouzeli: admin_ng: increase ipoid quota [deployment-charts] - https://gerrit.wikimedia.org/r/973084
[09:01:41] morning, we have a train blocker (https://phabricator.wikimedia.org/T350836) with a possible workaround
[09:01:57] (CR) Filippo Giunchedi: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/972925 (https://phabricator.wikimedia.org/T333615) (owner: Andrea Denisse)
[09:01:57] I need to look into it so the train will rollout a bit later than usual
[09:02:30] (PS3) Arnaudb: mariadb: add db1238 and prepare db1138 retirement [puppet] - https://gerrit.wikimedia.org/r/972507 (https://phabricator.wikimedia.org/T344036)
[09:02:45] SRE-OnFire, Patch-For-Review, User-fgiunchedi: Deploy alerts-triage app to production - https://phabricator.wikimedia.org/T350014 (fgiunchedi)
[09:03:09] (CR) CI reject: [V: -1] admin_ng: increase ipoid quota [deployment-charts] - https://gerrit.wikimedia.org/r/973084 (owner: Effie Mouzeli)
[09:04:50] PROBLEM - Check systemd state on wdqs1022 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:06:58] (Abandoned) Jcrespo: Add logger functionality to recover-dump, add logger statements, added unit test to test initializing logging [software/wmfbackups] - https://gerrit.wikimedia.org/r/673714 (https://phabricator.wikimedia.org/T277162) (owner: H.krishna123)
[09:07:45] (Abandoned) Jcrespo: Improved: regex-validation in cli/recover-dump and added unit test file in test/unit [software/wmfbackups] - https://gerrit.wikimedia.org/r/673693 (https://phabricator.wikimedia.org/T277754) (owner: DharmrajRathod98)
[09:08:08] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: insetup::data_engineering
[09:09:42] (SystemdUnitFailed) firing: systemd-timedated.service Failed on wdqs1022:9100 -
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:10:16] (PS2) Effie Mouzeli: admin_ng: increase ipoid quota [deployment-charts] - https://gerrit.wikimedia.org/r/973084
[09:12:37] (PS1) Muehlenhoff: Switch insetup::data_engineering to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/973085 (https://phabricator.wikimedia.org/T349619)
[09:13:39] (CR) Muehlenhoff: [C: +2] Switch insetup::data_engineering to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/973085 (https://phabricator.wikimedia.org/T349619) (owner: Muehlenhoff)
[09:14:29] (CR) Kosta Harlan: admin_ng: increase ipoid quota (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/973084 (owner: Effie Mouzeli)
[09:15:23] SRE, SRE-tools, Infrastructure-Foundations, Puppet-Core, and 2 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (MoritzMuehlenhoff)
[09:16:02] RECOVERY - SSH on wdqs1022 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u1 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:18:52] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: insetup::data_engineering
[09:19:35] (PS3) Effie Mouzeli: admin_ng: increase ipoid quota [deployment-charts] - https://gerrit.wikimedia.org/r/973084
[09:20:27] (CR) Effie Mouzeli: admin_ng: increase ipoid quota (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/973084 (owner: Effie Mouzeli)
[09:24:18] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: insetup::infrastructure_foundations
[09:26:12] PROBLEM - Disk space on titan2001 is CRITICAL: DISK CRITICAL - free space: /srv 21169MiB (2% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space
https://grafana.wikimedia.org/d/000000377/host-overview?var-server=titan2001&var-datasource=codfw+prometheus/ops
[09:26:57] (PS1) Muehlenhoff: Switch insetup::infrastructure_foundations to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/973087 (https://phabricator.wikimedia.org/T349619)
[09:27:26] (CR) Kosta Harlan: [C: +1] admin_ng: increase ipoid quota [deployment-charts] - https://gerrit.wikimedia.org/r/973084 (owner: Effie Mouzeli)
[09:27:49] rolling out train in 5m
[09:28:38] RECOVERY - Check systemd state on wdqs1022 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:30:19] (CR) Muehlenhoff: [C: +2] Switch insetup::infrastructure_foundations to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/973087 (https://phabricator.wikimedia.org/T349619) (owner: Muehlenhoff)
[09:31:10] (PS7) Aqu: [WIP] Send metrics from Airflow analytics test [puppet] - https://gerrit.wikimedia.org/r/968285 (https://phabricator.wikimedia.org/T349532)
[09:31:31] (CR) CI reject: [V: -1] [WIP] Send metrics from Airflow analytics test [puppet] - https://gerrit.wikimedia.org/r/968285 (https://phabricator.wikimedia.org/T349532) (owner: Aqu)
[09:34:42] (SystemdUnitFailed) resolved: systemd-timedated.service Failed on wdqs1022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:34:44] (PS4) Effie Mouzeli: admin_ng: increase ipoid quota [deployment-charts] - https://gerrit.wikimedia.org/r/973084
[09:35:55] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: insetup::infrastructure_foundations
[09:39:15] SRE, SRE-tools, Infrastructure-Foundations, Puppet-Core, and 2 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (MoritzMuehlenhoff)
[09:41:38] !log
jnuche@deploy2002 rebuilt and synchronized wikiversions files: Deploy 1.42.0-wmf.4 to group2 (labswiki staying at 1.42.0-wmf.3 due to T350836) [09:41:42] T350836: OAuth login to wikitech fails when running MediaWiki 1.42.0-wmf.4 - https://phabricator.wikimedia.org/T350836 [09:41:55] !log btullis@deploy2002 helmfile [codfw] START helmfile.d/services/datahub: sync on main [09:43:03] (03PS5) 10Effie Mouzeli: admin_ng: increase ipoid quota [deployment-charts] - 10https://gerrit.wikimedia.org/r/973084 [09:46:24] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 2 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff) [09:46:36] RECOVERY - Disk space on titan2001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=titan2001&var-datasource=codfw+prometheus/ops [09:47:22] (03CR) 10Muehlenhoff: [C: 03+2] Failover idp.w.o to idp1002 [dns] - 10https://gerrit.wikimedia.org/r/971952 (owner: 10Muehlenhoff) [09:47:40] (03CR) 10EoghanGaffney: [C: 03+2] [gitlab] Add metrics for timing backups/restores [puppet] - 10https://gerrit.wikimedia.org/r/971187 (https://phabricator.wikimedia.org/T347593) (owner: 10EoghanGaffney) [09:47:52] (03PS1) 10Jaime Nuche: group2 wikis to 1.42.0-wmf.4 (labswiki staying at 1.42.0-wmf.3 due to T350836) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/973118 (https://phabricator.wikimedia.org/T350080) [09:47:55] (03CR) 10Jaime Nuche: [C: 03+2] group2 wikis to 1.42.0-wmf.4 (labswiki staying at 1.42.0-wmf.3 due to T350836) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/973118 (https://phabricator.wikimedia.org/T350080) (owner: 10Jaime Nuche) [09:48:39] (03Merged) 10jenkins-bot: group2 wikis to 1.42.0-wmf.4 (labswiki staying at 1.42.0-wmf.3 due to T350836) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/973118 (https://phabricator.wikimedia.org/T350080) (owner: 10Jaime Nuche) [09:49:37] train 
complete [09:49:50] PROBLEM - Check systemd state on an-worker1118 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:51:07] !log btullis@deploy2002 helmfile [codfw] DONE helmfile.d/services/datahub: sync on main [09:51:24] !log btullis@deploy2002 helmfile [eqiad] START helmfile.d/services/datahub: sync on main [09:52:24] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host urldownloader2003.wikimedia.org [09:55:08] !log btullis@deploy2002 helmfile [eqiad] DONE helmfile.d/services/datahub: sync on main [09:55:10] (03PS1) 10Muehlenhoff: Switch urldownloader2003 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/973119 (https://phabricator.wikimedia.org/T349619) [09:56:27] (03CR) 10JMeybohm: [C: 04-1] admin_ng: increase ipoid quota (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/973084 (owner: 10Effie Mouzeli) [09:57:22] (03CR) 10Muehlenhoff: [C: 03+2] Switch urldownloader2003 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/973119 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [09:58:00] (03CR) 10Kamila Součková: [C: 03+1] kube-state-metrics: add build-depends [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/973078 (https://phabricator.wikimedia.org/T350366) (owner: 10Giuseppe Lavagetto) [09:58:02] 10SRE, 10SRE-Access-Requests: Requesting access to `discovery.processed_external_sparql_query` for AndrewTavis_WMDE - https://phabricator.wikimedia.org/T350426 (10AndrewTavis_WMDE) I can confirm that I have access to `discovery.processed_external_sparql_query` now :) I'll resolve this, but please let me know i... 
[09:58:16] 10SRE, 10SRE-Access-Requests: Requesting access to `discovery.processed_external_sparql_query` for AndrewTavis_WMDE - https://phabricator.wikimedia.org/T350426 (10AndrewTavis_WMDE) 05Stalled→03Resolved [09:59:01] (03PS6) 10Effie Mouzeli: admin_ng: increase ipoid quota [deployment-charts] - 10https://gerrit.wikimedia.org/r/973084 [10:03:50] (03Abandoned) 10Hashar: Plugin to process Puppet Catalog Compiler results [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/972906 (owner: 10Hashar) [10:08:14] (03CR) 10MVernon: [C: 03+2] thanos: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/945752 (owner: 10Muehlenhoff) [10:09:17] (03Abandoned) 10Jcrespo: Improved filename regex in cli/recover-dump [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/675010 (https://phabricator.wikimedia.org/T277754) (owner: 10Sahilgrewalhere) [10:09:52] (03Abandoned) 10Jcrespo: Add new methods in recover-dump to measure execution time [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/672150 (https://phabricator.wikimedia.org/T277160) (owner: 10H.krishna123) [10:10:41] (03Abandoned) 10Jcrespo: Add logger to recover-dump::to indicate actions taken [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/671942 (https://phabricator.wikimedia.org/T277162) (owner: 10Rohitesh20) [10:13:50] (03PS7) 10Effie Mouzeli: admin_ng: increase ipoid quota [deployment-charts] - 10https://gerrit.wikimedia.org/r/973084 [10:16:17] (03Abandoned) 10Jcrespo: Improve logic and quality of life for local backups [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/820664 (owner: 10Jcrespo) [10:16:41] (03PS8) 10Effie Mouzeli: admin_ng: increase ipoid quota [deployment-charts] - 10https://gerrit.wikimedia.org/r/973084 [10:17:07] (03CR) 10Ayounsi: [C: 03+1] Change 'anycast_gw' var in int config to represent type of IRB needed (032 comments) [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/971937 
(https://phabricator.wikimedia.org/T350579) (owner: 10Cathal Mooney) [10:17:46] (03PS9) 10Effie Mouzeli: admin_ng: increase ipoid quota [deployment-charts] - 10https://gerrit.wikimedia.org/r/973084 [10:20:41] (03CR) 10JMeybohm: [C: 03+1] admin_ng: increase ipoid quota [deployment-charts] - 10https://gerrit.wikimedia.org/r/973084 (owner: 10Effie Mouzeli) [10:22:46] (03CR) 10Jcrespo: [C: 03+2] Tranferrer: Enable transfers other than misc, core or x1 sections [software/transferpy] - 10https://gerrit.wikimedia.org/r/972433 (https://phabricator.wikimedia.org/T284150) (owner: 10Jcrespo) [10:23:24] (03CR) 10Effie Mouzeli: [C: 03+2] admin_ng: increase ipoid quota [deployment-charts] - 10https://gerrit.wikimedia.org/r/973084 (owner: 10Effie Mouzeli) [10:24:27] (03CR) 10Hnowlan: [C: 03+1] admin_ng: increase ipoid quota [deployment-charts] - 10https://gerrit.wikimedia.org/r/973084 (owner: 10Effie Mouzeli) [10:25:59] (03Merged) 10jenkins-bot: admin_ng: increase ipoid quota [deployment-charts] - 10https://gerrit.wikimedia.org/r/973084 (owner: 10Effie Mouzeli) [10:27:15] !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/admin 'apply'. [10:28:26] !log jiji@deploy2002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [10:31:34] !log jiji@deploy2002 helmfile [codfw] START helmfile.d/admin 'apply'. [10:32:02] !log jiji@deploy2002 helmfile [codfw] DONE helmfile.d/admin 'apply'. 
[10:37:26] (03PS4) 10Jelto: gitlab_runner: Migrate to new runner registration scheme [puppet] - 10https://gerrit.wikimedia.org/r/968988 (https://phabricator.wikimedia.org/T344951) [10:39:10] RECOVERY - Checks that the airflow database for airflow wmde is working properly on an-airflow1007 is OK: OK: /usr/bin/env AIRFLOW_HOME=/srv/airflow-wmde /usr/lib/airflow/bin/airflow db check succeeded https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow [10:40:56] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/362/con" [puppet] - 10https://gerrit.wikimedia.org/r/968988 (https://phabricator.wikimedia.org/T344951) (owner: 10Jelto) [10:45:02] (03CR) 10Jelto: [V: 03+1 C: 03+2] gitlab_runner: Migrate to new runner registration scheme [puppet] - 10https://gerrit.wikimedia.org/r/968988 (https://phabricator.wikimedia.org/T344951) (owner: 10Jelto) [10:46:52] RECOVERY - Check systemd state on an-worker1118 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:51:38] (03PS1) 10Hnowlan: {druid,cassandra}-http-gateway: checksum config files [deployment-charts] - 10https://gerrit.wikimedia.org/r/973129 [10:53:03] (03PS9) 10Hashar: Plugin to process Puppet Catalog Compiler results [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/969981 [10:53:27] (03CR) 10CI reject: [V: 04-1] Plugin to process Puppet Catalog Compiler results [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/969981 (owner: 10Hashar) [10:53:56] (03CR) 10Effie Mouzeli: [C: 03+1] {druid,cassandra}-http-gateway: checksum config files [deployment-charts] - 10https://gerrit.wikimedia.org/r/973129 (owner: 10Hnowlan) [10:55:39] (03CR) 10Hnowlan: [C: 03+2] {druid,cassandra}-http-gateway: checksum config files [deployment-charts] - 10https://gerrit.wikimedia.org/r/973129 (owner: 10Hnowlan) 
[10:55:57] (03PS1) 10Mvolz: Merge branch 'master' of ssh://gerrit.wikimedia.org:29418/operations/deployment-charts [deployment-charts] - 10https://gerrit.wikimedia.org/r/973130 [10:55:59] (03PS1) 10Mvolz: Update Zotero to latest version [deployment-charts] - 10https://gerrit.wikimedia.org/r/973131 [10:57:14] (03Merged) 10jenkins-bot: {druid,cassandra}-http-gateway: checksum config files [deployment-charts] - 10https://gerrit.wikimedia.org/r/973129 (owner: 10Hnowlan) [10:57:53] (03PS2) 10Mvolz: Update Zotero to latest version [deployment-charts] - 10https://gerrit.wikimedia.org/r/973131 [10:58:38] (03Abandoned) 10Mvolz: Merge branch 'master' of ssh://gerrit.wikimedia.org:29418/operations/deployment-charts [deployment-charts] - 10https://gerrit.wikimedia.org/r/973130 (owner: 10Mvolz) [11:00:04] mvolz: May I have your attention please! Services – Citoid / Zotero. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231109T1100) [11:00:04] Deploy window MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231109T1100) [11:00:45] (03CR) 10Mvolz: [C: 03+2] Update Zotero to latest version [deployment-charts] - 10https://gerrit.wikimedia.org/r/973131 (owner: 10Mvolz) [11:01:33] (03Merged) 10jenkins-bot: Update Zotero to latest version [deployment-charts] - 10https://gerrit.wikimedia.org/r/973131 (owner: 10Mvolz) [11:02:12] RECOVERY - Checks that the local airflow scheduler for airflow @wmde is working properly on an-airflow1007 is OK: OK: /usr/bin/env AIRFLOW_HOME=/srv/airflow-wmde /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-airflow1007.eqiad.wmnet succeeded https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow [11:03:42] !log mvolz@deploy2002 helmfile [staging] START helmfile.d/services/zotero: apply [11:04:14] !log mvolz@deploy2002 helmfile [staging] DONE helmfile.d/services/zotero: apply [11:04:34] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - 
Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:04:44] !log mvolz@deploy2002 helmfile [codfw] START helmfile.d/services/zotero: apply [11:05:17] !log hnowlan@deploy2002 helmfile [staging] START helmfile.d/services/editor-analytics: apply [11:05:18] !log mvolz@deploy2002 helmfile [codfw] DONE helmfile.d/services/zotero: apply [11:05:29] !log hnowlan@deploy2002 helmfile [staging] DONE helmfile.d/services/editor-analytics: apply [11:05:36] !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/services/editor-analytics: apply [11:05:50] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.269 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:05:56] !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/editor-analytics: apply [11:06:20] !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/editor-analytics: apply [11:06:32] !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/editor-analytics: apply [11:06:42] !log mvolz@deploy2002 helmfile [eqiad] START helmfile.d/services/zotero: apply [11:07:16] !log mvolz@deploy2002 helmfile [eqiad] DONE helmfile.d/services/zotero: apply [11:07:55] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, 10Release-Engineering-Team (Seen): Migrate mobileapps to k8s - https://phabricator.wikimedia.org/T350846 (10Joe) >>! In T350846#9318457, @JMeybohm wrote: > Couldn't we just add another mobileapps release (like a canary) that connects to mw-api-int and scale th... 
[11:08:34] !log hnowlan@deploy2002 helmfile [staging] START helmfile.d/services/edit-analytics: apply [11:08:45] !log hnowlan@deploy2002 helmfile [staging] DONE helmfile.d/services/edit-analytics: apply [11:08:46] !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/services/edit-analytics: apply [11:08:52] (JobUnavailable) firing: (2) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:09:01] !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/edit-analytics: apply [11:09:03] !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/edit-analytics: apply [11:09:17] !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/edit-analytics: apply [11:13:38] (03CR) 10Btullis: [C: 03+2] Deploy multiple spark shufflers for yarn to production [puppet] - 10https://gerrit.wikimedia.org/r/964008 (https://phabricator.wikimedia.org/T344910) (owner: 10Btullis) [11:14:06] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host urldownloader2003.wikimedia.org [11:20:04] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: url_downloader [11:21:43] (03PS7) 10Jbond: mariadb - analytics: update the ssl-ca value used by mariadb [puppet] - 10https://gerrit.wikimedia.org/r/968666 (https://phabricator.wikimedia.org/T340741) [11:21:46] (03PS1) 10Muehlenhoff: Switch url_downloader to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/973132 (https://phabricator.wikimedia.org/T349619) [11:22:08] (03CR) 10Jbond: mariadb - analytics: update the ssl-ca value used by mariadb (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/968666 (https://phabricator.wikimedia.org/T340741) (owner: 10Jbond) [11:22:24] (03CR) 10Muehlenhoff: [C: 03+2] Switch url_downloader to Puppet 7 [puppet] - 
10https://gerrit.wikimedia.org/r/973132 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [11:26:17] (03PS1) 10Jelto: gitlab_runner: move profile::gitlab::runner::token to private/labs [puppet] - 10https://gerrit.wikimedia.org/r/973133 (https://phabricator.wikimedia.org/T344951) [11:26:56] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: url_downloader [11:27:26] (03CR) 10Majavah: [C: 03+2] P:openstack: keystone: sync fernet keys over cloud-private [puppet] - 10https://gerrit.wikimedia.org/r/972737 (owner: 10Majavah) [11:27:33] (03CR) 10Majavah: [V: 03+1 C: 03+2] P:openstack: trove: use cloud-private for memcached in eqiad1 [puppet] - 10https://gerrit.wikimedia.org/r/972738 (owner: 10Majavah) [11:27:40] (03PS2) 10Majavah: P:openstack: trove: use cloud-private for memcached in eqiad1 [puppet] - 10https://gerrit.wikimedia.org/r/972738 [11:28:05] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 2 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff) [11:28:20] (03CR) 10Jelto: [C: 03+2] gitlab_runner: move profile::gitlab::runner::token to private/labs [puppet] - 10https://gerrit.wikimedia.org/r/973133 (https://phabricator.wikimedia.org/T344951) (owner: 10Jelto) [11:35:23] (03CR) 10Jbond: "Change lgtm but see nit inline" [puppet] - 10https://gerrit.wikimedia.org/r/972911 (https://phabricator.wikimedia.org/T350834) (owner: 10Dzahn) [11:38:43] <_joe_> !log disabled requestctl cache-text/wikifeeds_featured T350645 T346657 [11:38:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:38:48] T350645: feed/featured endpoint is showing error on zhwiki - https://phabricator.wikimedia.org/T350645 [11:38:49] T346657: Requests originating from zhwiki wikifeeds caused parsoid outage - https://phabricator.wikimedia.org/T346657 [11:38:52] (03PS1) 10Kamila Součková: enable kube-state-metrics prototype in eqiad [deployment-charts] - 
10https://gerrit.wikimedia.org/r/973134 (https://phabricator.wikimedia.org/T264625) [11:39:00] (03PS1) 10Slyngshede: Ensure that build directories are cleaned up [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/973135 (https://phabricator.wikimedia.org/T348974) [11:41:01] (03PS1) 10Jbond: cloud.yaml: add profile::pki::multirootca::cfssl_httpd_cert: false [puppet] - 10https://gerrit.wikimedia.org/r/973136 [11:42:26] (03CR) 10CI reject: [V: 04-1] Ensure that build directories are cleaned up [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/973135 (https://phabricator.wikimedia.org/T348974) (owner: 10Slyngshede) [11:42:36] (03PS2) 10Jbond: cloud.yaml: add profile::pki::multirootca::cfssl_httpd_cert: false [puppet] - 10https://gerrit.wikimedia.org/r/973136 [11:43:01] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: dumps::generation::server::xmlfallback [11:43:22] (03CR) 10Jbond: [C: 03+2] cloud.yaml: add profile::pki::multirootca::cfssl_httpd_cert: false [puppet] - 10https://gerrit.wikimedia.org/r/973136 (owner: 10Jbond) [11:44:34] (03CR) 10Jbond: [C: 03+1] admin: add Martin Urbanec as group approver for stewards-users [puppet] - 10https://gerrit.wikimedia.org/r/972909 (https://phabricator.wikimedia.org/T350834) (owner: 10Dzahn) [11:44:41] (03PS2) 10Slyngshede: Ensure that build directories are cleaned up [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/973135 (https://phabricator.wikimedia.org/T348974) [11:45:10] PROBLEM - Check systemd state on ganeti2010 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:45:41] (03PS2) 10Jbond: puppet.puppet.get_puppet_ca_hostname: return hardcoded start [software/spicerack] - 10https://gerrit.wikimedia.org/r/971957 (https://phabricator.wikimedia.org/T349619) [11:45:54] (03PS1) 10Muehlenhoff: Switch dumps::generation::server::xmlfallback to Puppet 7 [puppet] - 
10https://gerrit.wikimedia.org/r/973137 (https://phabricator.wikimedia.org/T349619) [11:46:11] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 2 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10jbond) [11:48:01] (03CR) 10CI reject: [V: 04-1] Ensure that build directories are cleaned up [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/973135 (https://phabricator.wikimedia.org/T348974) (owner: 10Slyngshede) [11:49:19] (03CR) 10Muehlenhoff: [C: 03+2] Switch dumps::generation::server::xmlfallback to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/973137 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [11:52:26] (03CR) 10CI reject: [V: 04-1] puppet.puppet.get_puppet_ca_hostname: return hardcoded start [software/spicerack] - 10https://gerrit.wikimedia.org/r/971957 (https://phabricator.wikimedia.org/T349619) (owner: 10Jbond) [11:53:52] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [11:54:39] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: dumps::generation::server::xmlfallback [11:55:42] (03PS3) 10Slyngshede: Ensure that build directories are cleaned up [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/973135 (https://phabricator.wikimedia.org/T348974) [11:55:45] (03PS5) 10Cathal Mooney: Adjust reimage cookbook config for DHCP binding clear workaround [cookbooks] - 10https://gerrit.wikimedia.org/r/969175 (https://phabricator.wikimedia.org/T306421) [12:02:59] (03PS1) 10Dreamy Jazz: Revert "CheckUser: Set 'debug' log level" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/972722 (https://phabricator.wikimedia.org/T345591) [12:03:07] (03PS2) 10Dreamy Jazz: Revert "CheckUser: Set 'debug' log level" [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/972722 (https://phabricator.wikimedia.org/T345591) [12:08:25] !log installing python-reportlab security updates [12:08:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:09:02] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: dumps::generation::server::xmldumps [12:10:52] (03PS1) 10Muehlenhoff: Switch dumps::generation::server::xmldumps to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/973141 (https://phabricator.wikimedia.org/T349619) [12:12:22] (03PS1) 10Btullis: Update the datahub images to address CVE-2023-4911 [deployment-charts] - 10https://gerrit.wikimedia.org/r/973142 (https://phabricator.wikimedia.org/T348647) [12:12:24] (03CR) 10Muehlenhoff: [C: 03+2] Switch dumps::generation::server::xmldumps to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/973141 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [12:13:26] (03CR) 10Btullis: "The build pipeline for these images was triggered manually and is here: https://gitlab.wikimedia.org/repos/data-engineering/datahub/-/pipe" [deployment-charts] - 10https://gerrit.wikimedia.org/r/973142 (https://phabricator.wikimedia.org/T348647) (owner: 10Btullis) [12:15:17] (03CR) 10JMeybohm: [C: 04-1] etcd: update to use shared SSL CA (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/972370 (https://phabricator.wikimedia.org/T340741) (owner: 10Jbond) [12:15:56] (03PS4) 10Hnowlan: wmnet: add records for mw-jobrunner [dns] - 10https://gerrit.wikimedia.org/r/972394 (https://phabricator.wikimedia.org/T349796) [12:16:16] (03PS3) 10EoghanGaffney: [apt-staging] Add rsync endpoint for ci->apt pipeline [puppet] - 10https://gerrit.wikimedia.org/r/971996 [12:16:34] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: dumps::generation::server::xmldumps [12:17:05] (03CR) 10Cathal Mooney: Adjust reimage cookbook config for DHCP binding clear workaround (031 comment) [cookbooks] - 
10https://gerrit.wikimedia.org/r/969175 (https://phabricator.wikimedia.org/T306421) (owner: 10Cathal Mooney) [12:17:17] (03PS1) 10Hnowlan: editor-analytics: bump image [deployment-charts] - 10https://gerrit.wikimedia.org/r/973144 [12:17:52] (03CR) 10JMeybohm: [C: 04-1] "You'll need to add the networkpolicy as well" [deployment-charts] - 10https://gerrit.wikimedia.org/r/973134 (https://phabricator.wikimedia.org/T264625) (owner: 10Kamila Součková) [12:18:45] (03CR) 10Santiago Faci: [C: 03+1] "It look good! Thanks!!!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/973144 (owner: 10Hnowlan) [12:19:58] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 2 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff) [12:20:16] (03CR) 10Hnowlan: [C: 03+2] editor-analytics: bump image [deployment-charts] - 10https://gerrit.wikimedia.org/r/973144 (owner: 10Hnowlan) [12:21:01] (03Merged) 10jenkins-bot: editor-analytics: bump image [deployment-charts] - 10https://gerrit.wikimedia.org/r/973144 (owner: 10Hnowlan) [12:21:17] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: dumps::generation::server::misccrons [12:23:46] !log hnowlan@deploy2002 helmfile [staging] START helmfile.d/services/editor-analytics: apply [12:23:59] !log hnowlan@deploy2002 helmfile [staging] DONE helmfile.d/services/editor-analytics: apply [12:24:48] (03PS5) 10Hnowlan: wmnet: add records for mw-jobrunner [dns] - 10https://gerrit.wikimedia.org/r/972394 (https://phabricator.wikimedia.org/T349796) [12:27:12] (03PS1) 10Muehlenhoff: Switch dumps::generation::server::misccrons to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/973147 (https://phabricator.wikimedia.org/T349619) [12:28:44] (03CR) 10CI reject: [V: 04-1] Localisation updates from https://translatewiki.net. 
[software/mailman-templates] - 10https://gerrit.wikimedia.org/r/973149 (owner: 10L10n-bot) [12:28:51] (03CR) 10Muehlenhoff: [C: 03+2] Switch dumps::generation::server::misccrons to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/973147 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [12:30:03] (03PS2) 10Hnowlan: service, conftool: add mw-jobrunner config [puppet] - 10https://gerrit.wikimedia.org/r/972442 (https://phabricator.wikimedia.org/T349796) [12:31:34] (03CR) 10Ladsgroup: production-m5.sql: Add DROP (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/972835 (https://phabricator.wikimedia.org/T305114) (owner: 10Marostegui) [12:33:00] (03PS3) 10Hnowlan: service, conftool: add mw-jobrunner config [puppet] - 10https://gerrit.wikimedia.org/r/972442 (https://phabricator.wikimedia.org/T349796) [12:33:02] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: dumps::generation::server::misccrons [12:35:01] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, 10Puppet (Puppet 7.0): Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff) [12:35:24] 10SRE, 10Infrastructure-Foundations, 10netops: Netbox PuppetDB Import Script Failing for cloudnet1006 - https://phabricator.wikimedia.org/T350479 (10cmooney) [12:36:36] (03PS2) 10Kamila Součková: enable kube-state-metrics prototype in eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/973134 (https://phabricator.wikimedia.org/T264625) [12:38:13] (03CR) 10Kamila Součková: enable kube-state-metrics prototype in eqiad (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/973134 (https://phabricator.wikimedia.org/T264625) (owner: 10Kamila Součková) [12:38:28] !log installing qemu security updates [12:38:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:39:56] (03CR) 10JMeybohm: [C: 03+1] enable kube-state-metrics prototype in eqiad [deployment-charts] - 
10https://gerrit.wikimedia.org/r/973134 (https://phabricator.wikimedia.org/T264625) (owner: 10Kamila Součková) [12:40:14] (03CR) 10Kamila Součková: [C: 03+2] enable kube-state-metrics prototype in eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/973134 (https://phabricator.wikimedia.org/T264625) (owner: 10Kamila Součková) [12:42:43] (03Merged) 10jenkins-bot: enable kube-state-metrics prototype in eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/973134 (https://phabricator.wikimedia.org/T264625) (owner: 10Kamila Součková) [12:42:52] RECOVERY - Check systemd state on ganeti2010 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:42:54] !log kamila@deploy2002 helmfile [eqiad] START helmfile.d/admin 'apply'. [12:42:55] !log kamila@deploy2002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [12:43:17] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1165.eqiad.wmnet with reason: Maintenance [12:43:29] !log kamila@deploy2002 helmfile [eqiad] START helmfile.d/admin 'apply'. 
[12:43:42] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1165.eqiad.wmnet with reason: Maintenance [12:43:43] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1015,1019].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [12:43:58] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1015,1019].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [12:44:04] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db1165 (T348183)', diff saved to https://phabricator.wikimedia.org/P53178 and previous config saved to /var/cache/conftool/dbconfig/20231109-124404-arnaudb.json [12:44:08] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [12:44:15] !log kamila@deploy2002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [12:48:28] (03CR) 10Muehlenhoff: [apt-staging] Add rsync endpoint for ci->apt pipeline (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/971996 (owner: 10EoghanGaffney) [12:49:34] (03CR) 10Jbond: "Already a good improvement however See inline i don't think __exit__ is called on sigterm" [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/973135 (https://phabricator.wikimedia.org/T348974) (owner: 10Slyngshede) [12:53:23] (03PS3) 10Jbond: etcd: update to use shared SSL CA [puppet] - 10https://gerrit.wikimedia.org/r/972370 (https://phabricator.wikimedia.org/T340741) [12:54:19] (03CR) 10Jbond: "thanks updated" [puppet] - 10https://gerrit.wikimedia.org/r/972370 (https://phabricator.wikimedia.org/T340741) (owner: 10Jbond) [12:56:39] (03CR) 10Ladsgroup: production-m5.sql: Add DROP (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/972835 (https://phabricator.wikimedia.org/T305114) (owner: 10Marostegui) [12:59:52] (03CR) 10Ladsgroup: "It's not applied on dbproxy1021" [puppet] - 
10https://gerrit.wikimedia.org/r/972835 (https://phabricator.wikimedia.org/T305114) (owner: 10Marostegui) [13:00:04] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231109T1300) [13:02:26] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1165 (re)pooling @ 15%: Host warmup', diff saved to https://phabricator.wikimedia.org/P53179 and previous config saved to /var/cache/conftool/dbconfig/20231109-130225-arnaudb.json [13:02:27] (03CR) 10Ladsgroup: "Kostah: Can you try again?" [puppet] - 10https://gerrit.wikimedia.org/r/972835 (https://phabricator.wikimedia.org/T305114) (owner: 10Marostegui) [13:14:56] (03PS1) 10Dreamy Jazz: MediaModeration: Define virtual domains mapping config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/973157 (https://phabricator.wikimedia.org/T350321) [13:17:01] (ProbeDown) firing: (2) Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:17:20] hi [13:17:28] anyone working on phab1004? [13:17:30] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1165 (re)pooling @ 30%: Host warmup', diff saved to https://phabricator.wikimedia.org/P53180 and previous config saved to /var/cache/conftool/dbconfig/20231109-131730-arnaudb.json [13:17:31] Phab loads fine for me [13:17:33] phab is up for me [13:17:38] I just got the page [13:17:41] is that production or a replacement? 
[13:17:47] That's the production instance [13:17:51] phab1004 should be the prod one [13:17:52] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 6 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/972370 (https://phabricator.wikimedia.org/T340741) (owner: 10Jbond) [13:18:01] acked the alarm anyway [13:18:22] bblack: do you see any traffic-side issue? [13:18:59] not yet [13:19:00] jynus: please let eoghan and I try to understand what is going on since we are on call [13:19:02] I see there was a spike in latency [13:19:07] but I think there was some general network spike [13:19:07] yes, sorry [13:19:22] a lot of random things seem to have blipped a bit... [13:20:02] The metrics look fine: https://grafana.wikimedia.org/d/000000587/phabricator?orgId=1&from=now-1h&to=now [13:20:37] yeah so far my hypothesis is there was some wider network blip that happened, and this was the only service to happen to catch it for a page [13:20:38] (03PS4) 10Slyngshede: Ensure that build directories are cleaned up [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/973135 (https://phabricator.wikimedia.org/T348974) [13:21:01] alright, so we generally agree that we will stand down and monitor? [13:21:06] Yeah, I think so. [13:21:14] +1 [13:21:30] (03CR) 10Filippo Giunchedi: "LGTM overall, see inline and I'll give the next PS a test" [puppet] - 10https://gerrit.wikimedia.org/r/971946 (owner: 10Jbond) [13:21:31] ok, excellent [13:21:37] (03PS2) 10Dreamy Jazz: MediaModeration: Define virtual domains mapping config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/973157 (https://phabricator.wikimedia.org/T350321) [13:21:59] I see nothing obvious in phab's error logs, there's a brief uptick in the number of phab workers at the time but nothing that looks hugely significant. I think network blip sounds most likely. 
[13:22:01] (ProbeDown) resolved: (2) Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:22:11] Should I resolve the page or ... yeah, that works. [13:23:19] it would be nice to understand the root cause regardless, still digging a bit [13:23:49] I will have a look at NEL to see if that has some correlation over different services [13:23:52] (03CR) 10Slyngshede: Ensure that build directories are cleaned up (032 comments) [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/973135 (https://phabricator.wikimedia.org/T348974) (owner: 10Slyngshede) [13:24:36] I'm not sure it was even a public-traffic event. it looked more like an internal link flap or whatever, in terms of the effects. [13:28:19] (03PS1) 10Filippo Giunchedi: titan: add public_domain to tlsproxy [puppet] - 10https://gerrit.wikimedia.org/r/973158 [13:28:21] (03PS1) 10Filippo Giunchedi: hieradata: update service catalog roles [puppet] - 10https://gerrit.wikimedia.org/r/973159 [13:28:23] (03PS1) 10Filippo Giunchedi: pontoon: add pki to o11y [puppet] - 10https://gerrit.wikimedia.org/r/973160 [13:32:35] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1165 (re)pooling @ 45%: Host warmup', diff saved to https://phabricator.wikimedia.org/P53183 and previous config saved to /var/cache/conftool/dbconfig/20231109-133235-arnaudb.json [13:33:57] regarding the phab pa.ge, Prometheus probes reported: Error for HTTP request" err="Get \"https://10.64.16.101:443/\": context deadline exceeded" for a few seconds. [13:33:57] https://logstash.wikimedia.org/goto/c38d5b9a0e170c93f59f08bbc9ab298f [13:37:17] 10ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T342502 (10ayounsi) 05Open→03Resolved Yep, looks good, thanks! 
[13:41:15] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host stat1009.eqiad.wmnet [13:42:38] (03PS1) 10Muehlenhoff: Switch stat1009 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/973162 (https://phabricator.wikimedia.org/T349619) [13:43:14] 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.8 point update - https://phabricator.wikimedia.org/T348327 (10MoritzMuehlenhoff) [13:43:26] (03CR) 10Muehlenhoff: [C: 03+2] Switch stat1009 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/973162 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [13:46:01] (03CR) 10Brouberol: [C: 03+1] "Looks good, thanks for the pointer to the build pipeline!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/973142 (https://phabricator.wikimedia.org/T348647) (owner: 10Btullis) [13:46:53] 10ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T342502 (10ayounsi) 05Resolved→03Open Actually, the ssw interface is fixed, but the cr2-eqiad one didn't https://librenms.wikimedia.org/graphs/to=1699536900/id=11592/type=port_errors/from=1694180100/ phaultfinder updated the task descr... 
[13:47:40] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1165 (re)pooling @ 60%: Host warmup', diff saved to https://phabricator.wikimedia.org/P53184 and previous config saved to /var/cache/conftool/dbconfig/20231109-134740-arnaudb.json [13:49:51] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host stat1009.eqiad.wmnet [13:50:44] 10SRE-OnFire, 10User-fgiunchedi: Deploy alerts-triage app to production - https://phabricator.wikimedia.org/T350014 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi The scope of this task is done, we (ONFIRE) will followup with more context [13:51:01] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 2 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10Gehel) For Search Platform, the best servers to experiment with (lowest risk) are: * relforge*: used for testing relevancy, no user traffic, very limited i... [13:52:24] 10ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T342502 (10cmooney) >>! In T342502#9309447, @Jclark-ctr wrote: > @cmooney i have not seen any new faults on this ticket. are you ok closing this ticket? Thanks yeah as Arzhel said the link looks clean now. We should mark the optic that... [13:53:08] (03CR) 10JMeybohm: [C: 03+1] "Thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/972370 (https://phabricator.wikimedia.org/T340741) (owner: 10Jbond) [13:53:57] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 2 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff) [13:55:45] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: karapace [13:55:54] (03CR) 10Kosta Harlan: production-m5.sql: Add DROP (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/972835 (https://phabricator.wikimedia.org/T305114) (owner: 10Marostegui) [13:57:03] (03PS1) 10Muehlenhoff: Switch karapace to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/973163 (https://phabricator.wikimedia.org/T349619) [13:57:56] (03CR) 10Marostegui: [C: 03+2] production-m5.sql: Add DROP (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/972835 (https://phabricator.wikimedia.org/T305114) (owner: 10Marostegui) [13:58:14] (03PS5) 10Slyngshede: Ensure that build directories are cleaned up [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/973135 (https://phabricator.wikimedia.org/T348974) [14:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, TheresNoTime, and taavi: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231109T1400). [14:00:05] Dreamy_Jazz: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[14:00:38] \o [14:01:05] (03CR) 10Muehlenhoff: [C: 03+2] Switch karapace to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/973163 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [14:02:45] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1165 (re)pooling @ 75%: Host warmup', diff saved to https://phabricator.wikimedia.org/P53185 and previous config saved to /var/cache/conftool/dbconfig/20231109-140245-arnaudb.json [14:03:32] Anyone around to deploy? [14:05:11] (03PS4) 10Aqu: Enable support for statsd_exporters on non-ops instances [puppet] - 10https://gerrit.wikimedia.org/r/969143 (https://phabricator.wikimedia.org/T343232) (owner: 10Btullis) [14:05:13] (03PS8) 10Aqu: [WIP] Send metrics from Airflow analytics test [puppet] - 10https://gerrit.wikimedia.org/r/968285 (https://phabricator.wikimedia.org/T349532) [14:05:51] (03CR) 10CI reject: [V: 04-1] [WIP] Send metrics from Airflow analytics test [puppet] - 10https://gerrit.wikimedia.org/r/968285 (https://phabricator.wikimedia.org/T349532) (owner: 10Aqu) [14:05:53] (03CR) 10Jbond: [C: 03+2] mariadb - misc: update the ssl-ca value used by mariadb [puppet] - 10https://gerrit.wikimedia.org/r/968667 (https://phabricator.wikimedia.org/T340741) (owner: 10Jbond) [14:06:05] (03PS7) 10Jbond: mariadb - misc: update the ssl-ca value used by mariadb [puppet] - 10https://gerrit.wikimedia.org/r/968667 (https://phabricator.wikimedia.org/T340741) [14:06:10] (03CR) 10Jbond: [V: 03+2] mariadb - misc: update the ssl-ca value used by mariadb [puppet] - 10https://gerrit.wikimedia.org/r/968667 (https://phabricator.wikimedia.org/T340741) (owner: 10Jbond) [14:06:25] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: karapace [14:10:07] (03CR) 10Aqu: [V: 03+1] "PCC SUCCESS (CORE_DIFF 3 DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/968285 
(https://phabricator.wikimedia.org/T349532) (owner: 10Aqu) [14:10:36] I'll hang around for the remaining time for this backport window in case anyone is free to deploy. [14:10:39] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: analytics_cluster::turnilo [14:12:15] (03PS1) 10Muehlenhoff: Switch analytics_cluster::turnilo to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/973165 (https://phabricator.wikimedia.org/T349619) [14:13:22] (03CR) 10Muehlenhoff: [C: 03+2] Switch analytics_cluster::turnilo to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/973165 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [14:15:09] (03CR) 10Marostegui: [C: 03+1] mariadb - dedicated dbs: update the ssl-ca value used by mariadb [puppet] - 10https://gerrit.wikimedia.org/r/968668 (https://phabricator.wikimedia.org/T340741) (owner: 10Jbond) [14:15:11] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to stewards-users and group approver role for urbanecm - https://phabricator.wikimedia.org/T350834 (10Aklapper) I'm confused if this is a staff request (`@wikimedia.org` email address) or a volunteer request (@Urbanecm volunteer Phabricator... 
[14:15:43] (03CR) 10Jbond: [C: 03+2] mariadb - dedicated dbs: update the ssl-ca value used by mariadb [puppet] - 10https://gerrit.wikimedia.org/r/968668 (https://phabricator.wikimedia.org/T340741) (owner: 10Jbond) [14:15:52] (03PS7) 10Jbond: mariadb - dedicated dbs: update the ssl-ca value used by mariadb [puppet] - 10https://gerrit.wikimedia.org/r/968668 (https://phabricator.wikimedia.org/T340741) [14:15:58] (03CR) 10Jbond: [V: 03+2] mariadb - dedicated dbs: update the ssl-ca value used by mariadb [puppet] - 10https://gerrit.wikimedia.org/r/968668 (https://phabricator.wikimedia.org/T340741) (owner: 10Jbond) [14:17:50] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1165 (re)pooling @ 90%: Host warmup', diff saved to https://phabricator.wikimedia.org/P53187 and previous config saved to /var/cache/conftool/dbconfig/20231109-141749-arnaudb.json [14:17:53] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: analytics_cluster::turnilo [14:18:56] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2112.codfw.wmnet with reason: schema update via T343198 [14:18:59] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2112.codfw.wmnet with reason: schema update via T343198 [14:19:00] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198 [14:19:34] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 2 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff) [14:21:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:21:13] !log restarting turnilo on an-tool1007 [14:21:16] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1134 to test puppet changes', diff saved to https://phabricator.wikimedia.org/P53188 and previous config saved to /var/cache/conftool/dbconfig/20231109-142139-root.json [14:22:36] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: analytics_cluster::datahub::opensearch [14:24:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1134 (re)pooling @ 10%: Puppet changes', diff saved to https://phabricator.wikimedia.org/P53189 and previous config saved to /var/cache/conftool/dbconfig/20231109-142419-root.json [14:26:03] (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:26:11] (03PS1) 10Muehlenhoff: Switch analytics_cluster::datahub::opensearch to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/973166 (https://phabricator.wikimedia.org/T349619) [14:26:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1224 to test puppet changes', diff saved to https://phabricator.wikimedia.org/P53190 and previous config saved to /var/cache/conftool/dbconfig/20231109-142621-root.json [14:28:21] (03CR) 10Muehlenhoff: [C: 03+2] Switch analytics_cluster::datahub::opensearch to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/973166 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [14:28:25] (03PS4) 10Jbond: prometheus-puppet-agent-stats: this timer sometime fails [puppet] - 10https://gerrit.wikimedia.org/r/971946 [14:28:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1224 (re)pooling @ 10%: Puppet changes', diff saved to https://phabricator.wikimedia.org/P53191 and previous config saved to 
/var/cache/conftool/dbconfig/20231109-142837-root.json [14:28:58] (03CR) 10Marostegui: [C: 03+1] mariadb - core: update the ssl-ca value used by mariadb [puppet] - 10https://gerrit.wikimedia.org/r/968669 (https://phabricator.wikimedia.org/T340741) (owner: 10Jbond) [14:29:00] (03CR) 10Jbond: "updated" [puppet] - 10https://gerrit.wikimedia.org/r/971946 (owner: 10Jbond) [14:29:03] (03CR) 10CI reject: [V: 04-1] prometheus-puppet-agent-stats: this timer sometime fails [puppet] - 10https://gerrit.wikimedia.org/r/971946 (owner: 10Jbond) [14:29:36] (03PS7) 10Jbond: mariadb - core: update the ssl-ca value used by mariadb [puppet] - 10https://gerrit.wikimedia.org/r/968669 (https://phabricator.wikimedia.org/T340741) [14:29:41] (03CR) 10Jbond: [C: 03+2] mariadb - core: update the ssl-ca value used by mariadb [puppet] - 10https://gerrit.wikimedia.org/r/968669 (https://phabricator.wikimedia.org/T340741) (owner: 10Jbond) [14:29:47] (03CR) 10Jbond: [V: 03+2 C: 03+2] mariadb - core: update the ssl-ca value used by mariadb [puppet] - 10https://gerrit.wikimedia.org/r/968669 (https://phabricator.wikimedia.org/T340741) (owner: 10Jbond) [14:30:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2171:3315 to test puppet changes', diff saved to https://phabricator.wikimedia.org/P53193 and previous config saved to /var/cache/conftool/dbconfig/20231109-143051-root.json [14:31:22] Dreamy_Jazz: I'm here and can deploy [14:31:31] Thanks. 
[14:31:56] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by kharlan@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/972722 (https://phabricator.wikimedia.org/T345591) (owner: 10Dreamy Jazz) [14:32:05] (03PS1) 10Urbanecm: mediawiki: Run expireTemporaryAccounts.php daily [puppet] - 10https://gerrit.wikimedia.org/r/973167 (https://phabricator.wikimedia.org/T344695) [14:32:38] (03CR) 10CI reject: [V: 04-1] mediawiki: Run expireTemporaryAccounts.php daily [puppet] - 10https://gerrit.wikimedia.org/r/973167 (https://phabricator.wikimedia.org/T344695) (owner: 10Urbanecm) [14:32:50] (03Merged) 10jenkins-bot: Revert "CheckUser: Set 'debug' log level" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/972722 (https://phabricator.wikimedia.org/T345591) (owner: 10Dreamy Jazz) [14:32:55] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1165 (re)pooling @ 100%: Host warmup', diff saved to https://phabricator.wikimedia.org/P53194 and previous config saved to /var/cache/conftool/dbconfig/20231109-143254-arnaudb.json [14:32:56] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: analytics_cluster::datahub::opensearch [14:33:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2171:3315 (re)pooling @ 10%: Puppet changes', diff saved to https://phabricator.wikimedia.org/P53195 and previous config saved to /var/cache/conftool/dbconfig/20231109-143301-root.json [14:33:28] (03PS2) 10Urbanecm: mediawiki: Run expireTemporaryAccounts.php daily [puppet] - 10https://gerrit.wikimedia.org/r/973167 (https://phabricator.wikimedia.org/T344695) [14:33:35] Dreamy_Jazz: do we need to verify either of those? [14:33:40] *manually verify [14:33:58] We can't verify the virtual domains mapping change, except for it not breaking anything. 
[14:33:58] (03CR) 10CI reject: [V: 04-1] mediawiki: Run expireTemporaryAccounts.php daily [puppet] - 10https://gerrit.wikimedia.org/r/973167 (https://phabricator.wikimedia.org/T344695) (owner: 10Urbanecm) [14:34:08] (03CR) 10Jbond: [C: 03+2] orchestrator: update SSL certs to use combined CA [puppet] - 10https://gerrit.wikimedia.org/r/972367 (https://phabricator.wikimedia.org/T340741) (owner: 10Jbond) [14:34:09] !log kharlan@deploy2002 Started scap: Backport for [[gerrit:972722|Revert "CheckUser: Set 'debug' log level" (T345591)]] [14:34:13] T345591: Stop deletion of rows in the cu_useragent_clienthints table - https://phabricator.wikimedia.org/T345591 [14:34:27] The config change could be tested, but as it reverts to the status quo I don't think we would need to manually verify on mwdebug servers. [14:34:49] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1168.eqiad.wmnet with reason: Maintenance [14:34:52] (03CR) 10Urbanecm: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/973167 (https://phabricator.wikimedia.org/T344695) (owner: 10Urbanecm) [14:34:57] (03CR) 10Urbanecm: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/973167 (https://phabricator.wikimedia.org/T344695) (owner: 10Urbanecm) [14:35:02] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1168.eqiad.wmnet with reason: Maintenance [14:35:09] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db1168 (T348183)', diff saved to https://phabricator.wikimedia.org/P53196 and previous config saved to /var/cache/conftool/dbconfig/20231109-143508-arnaudb.json [14:35:14] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [14:35:41] !log kharlan@deploy2002 kharlan and dreamyjazz: Backport for [[gerrit:972722|Revert "CheckUser: Set 'debug' log level" (T345591)]] synced to the testservers 
(https://wikitech.wikimedia.org/wiki/Mwdebug) [14:36:23] Dreamy_Jazz: yeah, i'll just sync both [14:36:26] !log kharlan@deploy2002 kharlan and dreamyjazz: Continuing with sync [14:36:45] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: ceph::server [14:37:40] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T348183)', diff saved to https://phabricator.wikimedia.org/P53197 and previous config saved to /var/cache/conftool/dbconfig/20231109-143739-arnaudb.json [14:38:01] (03CR) 10DannyS712: mediawiki: Run expireTemporaryAccounts.php daily (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/973167 (https://phabricator.wikimedia.org/T344695) (owner: 10Urbanecm) [14:38:06] (03CR) 10Marostegui: [C: 03+2] dbproxy102[2,4]: Promote db1119 to standby [puppet] - 10https://gerrit.wikimedia.org/r/972921 (https://phabricator.wikimedia.org/T350022) (owner: 10Marostegui) [14:38:35] (03PS1) 10Muehlenhoff: Switch ceph::server to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/973169 (https://phabricator.wikimedia.org/T349619) [14:38:38] (03PS1) 10KartikMistry: testwiki: Enable the Unified Content Translation Dashboard [mediawiki-config] - 10https://gerrit.wikimedia.org/r/973170 (https://phabricator.wikimedia.org/T337915) [14:38:53] (JobUnavailable) firing: (3) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:38:54] PROBLEM - Check systemd state on datahubsearch1003 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:38:57] (03PS3) 10Urbanecm: mediawiki: Run expireTemporaryAccounts.php daily [puppet] - 10https://gerrit.wikimedia.org/r/973167 (https://phabricator.wikimedia.org/T344695) [14:39:00] (03CR) 
10Urbanecm: mediawiki: Run expireTemporaryAccounts.php daily (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/973167 (https://phabricator.wikimedia.org/T344695) (owner: 10Urbanecm) [14:39:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1134 (re)pooling @ 25%: Puppet changes', diff saved to https://phabricator.wikimedia.org/P53198 and previous config saved to /var/cache/conftool/dbconfig/20231109-143924-root.json [14:39:25] (03CR) 10Urbanecm: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/973167 (https://phabricator.wikimedia.org/T344695) (owner: 10Urbanecm) [14:39:44] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 2 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff) >>! In T349619#9319228, @Gehel wrote: > For Search Platform, the best servers to experiment with (lowest risk) are: > > * relforge*: use... [14:40:54] (03CR) 10Muehlenhoff: [C: 03+2] Switch ceph::server to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/973169 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [14:41:12] (03CR) 10Ottomata: [C: 03+2] test/refine - update refinery jar version for analytics test cluster refine job [puppet] - 10https://gerrit.wikimedia.org/r/972894 (https://phabricator.wikimedia.org/T321854) (owner: 10Ottomata) [14:41:53] !log kharlan@deploy2002 Finished scap: Backport for [[gerrit:972722|Revert "CheckUser: Set 'debug' log level" (T345591)]] (duration: 07m 43s) [14:41:57] T345591: Stop deletion of rows in the cu_useragent_clienthints table - https://phabricator.wikimedia.org/T345591 [14:42:21] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by kharlan@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/973157 (https://phabricator.wikimedia.org/T350321) (owner: 10Dreamy Jazz) [14:42:28] (03CR) 10Urbanecm: [C: 04-1] "do not merge before wmf.5 is deployed to all of production" [puppet] - 
10https://gerrit.wikimedia.org/r/973167 (https://phabricator.wikimedia.org/T344695) (owner: 10Urbanecm) [14:43:05] (03Merged) 10jenkins-bot: MediaModeration: Define virtual domains mapping config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/973157 (https://phabricator.wikimedia.org/T350321) (owner: 10Dreamy Jazz) [14:43:08] RECOVERY - Check systemd state on datahubsearch1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:43:28] !log kharlan@deploy2002 Started scap: Backport for [[gerrit:973157|MediaModeration: Define virtual domains mapping config (T350321)]] [14:43:29] (03PS1) 10Jbond: bird::anycast: move firewall rules to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/973171 [14:43:33] T350321: [M] Create database table to store status of scans - https://phabricator.wikimedia.org/T350321 [14:43:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1224 (re)pooling @ 25%: Puppet changes', diff saved to https://phabricator.wikimedia.org/P53199 and previous config saved to /var/cache/conftool/dbconfig/20231109-144342-root.json [14:44:09] 10SRE, 10Data-Platform-SRE, 10Infrastructure-Foundations, 10Product-Analytics, and 2 others: Maybe restrict domains accessible by webproxy - https://phabricator.wikimedia.org/T300977 (10ayounsi) [14:44:48] (03PS1) 10Urbanecm: IP Masking: Set expiryAfterDays to 10 days [mediawiki-config] - 10https://gerrit.wikimedia.org/r/973172 (https://phabricator.wikimedia.org/T344695) [14:44:51] !log kharlan@deploy2002 kharlan and dreamyjazz: Backport for [[gerrit:973157|MediaModeration: Define virtual domains mapping config (T350321)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:44:57] !log kharlan@deploy2002 kharlan and dreamyjazz: Continuing with sync [14:45:28] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 7): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/366/con" [puppet] - 10https://gerrit.wikimedia.org/r/973171 (owner: 10Jbond) [14:46:25] (03CR) 10Sergio Gimeno: [C: 03+1] IP Masking: Set expiryAfterDays to 10 days [mediawiki-config] - 10https://gerrit.wikimedia.org/r/973172 (https://phabricator.wikimedia.org/T344695) (owner: 10Urbanecm) [14:46:31] !log brouberol@cumin1001 START - Cookbook sre.hosts.reimage for host an-druid1005.eqiad.wmnet with OS bullseye [14:46:38] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: ceph::server [14:46:44] PROBLEM - Check unit status of httpbb_hourly_appserver on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:47:14] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:47:47] (03CR) 10Sergio Gimeno: [C: 03+1] "lgtm code wise." 
[puppet] - 10https://gerrit.wikimedia.org/r/973167 (https://phabricator.wikimedia.org/T344695) (owner: 10Urbanecm) [14:47:51] (03PS1) 10Marostegui: install_server: Do not reimage db1235 [puppet] - 10https://gerrit.wikimedia.org/r/973173 [14:48:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2171:3315 (re)pooling @ 25%: Puppet changes', diff saved to https://phabricator.wikimedia.org/P53200 and previous config saved to /var/cache/conftool/dbconfig/20231109-144806-root.json [14:48:27] (03CR) 10Marostegui: [C: 03+2] install_server: Do not reimage db1235 [puppet] - 10https://gerrit.wikimedia.org/r/973173 (owner: 10Marostegui) [14:50:35] !log kharlan@deploy2002 Finished scap: Backport for [[gerrit:973157|MediaModeration: Define virtual domains mapping config (T350321)]] (duration: 07m 07s) [14:50:40] T350321: [M] Create database table to store status of scans - https://phabricator.wikimedia.org/T350321 [14:51:16] (03PS1) 10Jbond: DO NOT MERGE: test pcc [puppet] - 10https://gerrit.wikimedia.org/r/973174 [14:51:38] Dreamy_Jazz: all done [14:51:44] Thanks! 
[14:52:00] !log UTC afternoon deploys done [14:52:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:32] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 2 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff) [14:52:46] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P53201 and previous config saved to /var/cache/conftool/dbconfig/20231109-145246-arnaudb.json [14:52:50] PROBLEM - Check systemd state on htmldumper1001 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:53:00] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 7): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/367/con" [puppet] - 10https://gerrit.wikimedia.org/r/973174 (owner: 10Jbond) [14:53:02] PROBLEM - Check systemd state on ganeti2031 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:53:29] (03Abandoned) 10Jbond: DO NOT MERGE: test pcc [puppet] - 10https://gerrit.wikimedia.org/r/973174 (owner: 10Jbond) [14:53:53] (JobUnavailable) firing: (3) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:54:21] (03PS9) 10Aqu: [WIP] Send metrics from Airflow analytics test [puppet] - 10https://gerrit.wikimedia.org/r/968285 (https://phabricator.wikimedia.org/T349532) [14:54:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1134 (re)pooling @ 50%: Puppet changes', diff saved to https://phabricator.wikimedia.org/P53202 and previous config saved to 
/var/cache/conftool/dbconfig/20231109-145428-root.json [14:54:32] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10Fabfur) Weight set to 1 for new hosts: ` $ sudo confctl select dc=eqiad,service=cdn,name='cp11.*' set/weight=1 The selector you chose has selected the following object... [14:56:54] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10Fabfur) Weight set to "100" for new hosts (ats-be): ` $ sudo confctl select dc=eqiad,service=ats-be,name='cp11.*' set/weight=100 The selector you chose has selected the... [14:57:09] (03PS1) 10Jbond: DO NOt MERGE: test pcc [puppet] - 10https://gerrit.wikimedia.org/r/973176 [14:57:15] (03PS1) 10Kosta Harlan: ipoid: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/973177 [14:57:19] (03PS10) 10Aqu: [WIP] Send metrics from Airflow analytics test [puppet] - 10https://gerrit.wikimedia.org/r/968285 (https://phabricator.wikimedia.org/T349532) [14:57:34] (03PS3) 10Volans: sre.discovery.datacenter: customize lock arguments [cookbooks] - 10https://gerrit.wikimedia.org/r/967165 (https://phabricator.wikimedia.org/T341973) [14:57:36] (03PS3) 10Volans: sre.discovery.service-route: customize lock args [cookbooks] - 10https://gerrit.wikimedia.org/r/967166 (https://phabricator.wikimedia.org/T341973) [14:57:38] (03PS2) 10Volans: sre.ganeti.*: customize lock arguments [cookbooks] - 10https://gerrit.wikimedia.org/r/967628 (https://phabricator.wikimedia.org/T341973) [14:58:20] (03PS2) 10Kosta Harlan: ipoid: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/973177 [14:58:20] 10SRE, 10ops-eqiad, 10DC-Ops: eqiad: Master Tracking Ticket for eqiad expansion cage - https://phabricator.wikimedia.org/T296966 (10Jclark-ctr) [14:58:24] (03CR) 10Kosta Harlan: [C: 03+2] ipoid: Bump version [deployment-charts] - 
10https://gerrit.wikimedia.org/r/973177 (owner: 10Kosta Harlan) [14:58:33] (03Abandoned) 10Jbond: DO NOt MERGE: test pcc [puppet] - 10https://gerrit.wikimedia.org/r/973176 (owner: 10Jbond) [14:58:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1224 (re)pooling @ 50%: Puppet changes', diff saved to https://phabricator.wikimedia.org/P53203 and previous config saved to /var/cache/conftool/dbconfig/20231109-145846-root.json [14:58:59] (03CR) 10Volans: "ready" [cookbooks] - 10https://gerrit.wikimedia.org/r/967165 (https://phabricator.wikimedia.org/T341973) (owner: 10Volans) [14:59:12] !log kharlan@deploy2002 helmfile [eqiad] START helmfile.d/services/ipoid: apply [14:59:15] !log kharlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/ipoid: apply [14:59:26] (03Merged) 10jenkins-bot: ipoid: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/973177 (owner: 10Kosta Harlan) [15:00:12] !log kharlan@deploy2002 helmfile [eqiad] START helmfile.d/services/ipoid: apply [15:00:30] !log kharlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/ipoid: apply [15:00:40] (03CR) 10Volans: "ready" [cookbooks] - 10https://gerrit.wikimedia.org/r/967166 (https://phabricator.wikimedia.org/T341973) (owner: 10Volans) [15:01:28] 10SRE, 10Infrastructure-Foundations, 10netops: Netbox PuppetDB Import Script Failing for cloudnet1006 - https://phabricator.wikimedia.org/T350479 (10cmooney) 05Open→03Resolved >>! In T350479#9312405, @Volans wrote: > The code is not checking if he autoselection of the parent is None or not. Indeed. Why... 
[15:02:04] (03CR) 10CI reject: [V: 04-1] sre.ganeti.*: customize lock arguments [cookbooks] - 10https://gerrit.wikimedia.org/r/967628 (https://phabricator.wikimedia.org/T341973) (owner: 10Volans) [15:03:04] (03PS1) 10Brouberol: Fix typo in the an-druit netboot partman case [puppet] - 10https://gerrit.wikimedia.org/r/973178 (https://phabricator.wikimedia.org/T332604) [15:03:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2171:3315 (re)pooling @ 50%: Puppet changes', diff saved to https://phabricator.wikimedia.org/P53204 and previous config saved to /var/cache/conftool/dbconfig/20231109-150311-root.json [15:05:01] 10SRE, 10Infrastructure-Foundations, 10netops: Netbox PuppetDB Import Script Failing for cloudnet1006 - https://phabricator.wikimedia.org/T350479 (10cmooney) 05Resolved→03Open [15:06:29] (03PS3) 10Volans: sre.ganeti.*: customize lock arguments [cookbooks] - 10https://gerrit.wikimedia.org/r/967628 (https://phabricator.wikimedia.org/T341973) [15:06:37] (03CR) 10Volans: "ready" [cookbooks] - 10https://gerrit.wikimedia.org/r/967628 (https://phabricator.wikimedia.org/T341973) (owner: 10Volans) [15:07:03] (03PS2) 10Brouberol: Fix typo in the an-druid netboot partman case [puppet] - 10https://gerrit.wikimedia.org/r/973178 (https://phabricator.wikimedia.org/T332604) [15:07:28] (03CR) 10Btullis: [C: 03+1] Fix typo in the an-druid netboot partman case [puppet] - 10https://gerrit.wikimedia.org/r/973178 (https://phabricator.wikimedia.org/T332604) (owner: 10Brouberol) [15:07:41] (03CR) 10Stevemunene: [C: 03+1] Fix typo in the an-druid netboot partman case [puppet] - 10https://gerrit.wikimedia.org/r/973178 (https://phabricator.wikimedia.org/T332604) (owner: 10Brouberol) [15:07:53] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P53205 and previous config saved to /var/cache/conftool/dbconfig/20231109-150752-arnaudb.json [15:08:50] 10SRE, 10Infrastructure-Foundations, 10netops: 
Netbox PuppetDB Import Script Failing for cloudnet1006 - https://phabricator.wikimedia.org/T350479 (10Volans) >>! In T350479#9319519, @cmooney wrote: > For now I'll update //customscripts/_common.py// so that it fails cleanly if this should occur. Not sure what... [15:09:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1134 (re)pooling @ 75%: Puppet changes', diff saved to https://phabricator.wikimedia.org/P53206 and previous config saved to /var/cache/conftool/dbconfig/20231109-150933-root.json [15:09:46] (03CR) 10Aqu: [V: 03+1] "PCC SUCCESS (CORE_DIFF 3 DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/968285 (https://phabricator.wikimedia.org/T349532) (owner: 10Aqu) [15:10:27] (03CR) 10CI reject: [V: 04-1] sre.ganeti.*: customize lock arguments [cookbooks] - 10https://gerrit.wikimedia.org/r/967628 (https://phabricator.wikimedia.org/T341973) (owner: 10Volans) [15:11:27] 10SRE-swift-storage, 10Commons, 10MediaWiki-File-management: Allow to store files between 4 and 5 GB - https://phabricator.wikimedia.org/T191804 (10jcrespo) Please loop me in in the progress, while this doesn't affect production, I may have assumed in some cases that files were always smaller than 4 GB for b... 
[15:11:30] (03CR) 10Brouberol: [C: 03+2] Fix typo in the an-druid netboot partman case [puppet] - 10https://gerrit.wikimedia.org/r/973178 (https://phabricator.wikimedia.org/T332604) (owner: 10Brouberol) [15:12:14] PROBLEM - MariaDB Replica Lag: s7 on dbstore1003 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 641.92 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:12:23] !log fnegri@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1025.eqiad.wmnet with OS bookworm [15:12:33] !log arnaudb@cumin1001 START - Cookbook sre.hosts.decommission for hosts dbproxy1017.eqiad.wmnet [15:13:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1224 (re)pooling @ 75%: Puppet changes', diff saved to https://phabricator.wikimedia.org/P53207 and previous config saved to /var/cache/conftool/dbconfig/20231109-151351-root.json [15:14:52] (03CR) 10Arnaudb: [C: 03+2] haproxy: remove dbproxy1017 from production (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/972509 (https://phabricator.wikimedia.org/T348956) (owner: 10Arnaudb) [15:17:23] !log arnaudb@cumin1001 START - Cookbook sre.dns.netbox [15:18:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2171:3315 (re)pooling @ 75%: Puppet changes', diff saved to https://phabricator.wikimedia.org/P53208 and previous config saved to /var/cache/conftool/dbconfig/20231109-151816-root.json [15:19:31] !log arnaudb@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: dbproxy1017.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - arnaudb@cumin1001" [15:20:34] (03PS1) 10Giuseppe Lavagetto: mobileapps: introduce canary release [deployment-charts] - 10https://gerrit.wikimedia.org/r/973179 (https://phabricator.wikimedia.org/T350846) [15:20:34] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by 
cookbooks.sre.dns.netbox: dbproxy1017.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - arnaudb@cumin1001" [15:20:34] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:20:35] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts dbproxy1017.eqiad.wmnet [15:20:36] (03PS1) 10Giuseppe Lavagetto: mobileapps: add egress networkpolicy for mesh [deployment-charts] - 10https://gerrit.wikimedia.org/r/973180 (https://phabricator.wikimedia.org/T350846) [15:20:38] (03PS1) 10Giuseppe Lavagetto: mobileapps: switch canary to mw-on-k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/973181 (https://phabricator.wikimedia.org/T350846) [15:20:40] (03PS1) 10Giuseppe Lavagetto: mobileapps: move traffic to mw on k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/973182 (https://phabricator.wikimedia.org/T350846) [15:20:42] (03PS1) 10Giuseppe Lavagetto: mw-api-int: double the number of replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/973183 (https://phabricator.wikimedia.org/T350846) [15:20:44] (03PS1) 10Giuseppe Lavagetto: mobileapps: move 20% of replicas to k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/973184 (https://phabricator.wikimedia.org/T350846) [15:21:58] !log removed cp1075 from HAProxy/Varnish pool (NOT ats-be pool) for T349244 [15:22:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:22:01] T349244: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 [15:22:59] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T348183)', diff saved to https://phabricator.wikimedia.org/P53209 and previous config saved to /var/cache/conftool/dbconfig/20231109-152259-arnaudb.json [15:23:01] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1180.eqiad.wmnet with reason: Maintenance [15:23:03] T348183: Apply schema change for 
changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [15:23:10] !log cp1100 inserted into cluster_text pool [15:23:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:23:14] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1180.eqiad.wmnet with reason: Maintenance [15:23:21] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db1180 (T348183)', diff saved to https://phabricator.wikimedia.org/P53210 and previous config saved to /var/cache/conftool/dbconfig/20231109-152320-arnaudb.json [15:23:32] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 2 others: Migrate mobileapps to k8s - https://phabricator.wikimedia.org/T350846 (10Joe) As it's clear from the patches, I chose to take the sage advice of @JMeybohm and go down the path of least resistance :) [15:24:20] (03CR) 10Nikerabbit: [V: 03+2] Localisation updates from https://translatewiki.net. [software/mailman-templates] - 10https://gerrit.wikimedia.org/r/973149 (owner: 10L10n-bot) [15:24:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1134 (re)pooling @ 100%: Puppet changes', diff saved to https://phabricator.wikimedia.org/P53211 and previous config saved to /var/cache/conftool/dbconfig/20231109-152438-root.json [15:25:46] !log fabfur@cumin1001 START - Cookbook sre.hosts.remove-downtime for cp1100.eqiad.wmnet [15:25:47] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for cp1100.eqiad.wmnet [15:25:54] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T348183)', diff saved to https://phabricator.wikimedia.org/P53212 and previous config saved to /var/cache/conftool/dbconfig/20231109-152553-arnaudb.json [15:28:11] (03PS1) 10Cathal Mooney: Fail when setting int relations if PuppetDB parent not found in Netbox [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/973185 
(https://phabricator.wikimedia.org/T350479) [15:28:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1224 (re)pooling @ 100%: Puppet changes', diff saved to https://phabricator.wikimedia.org/P53213 and previous config saved to /var/cache/conftool/dbconfig/20231109-152856-root.json [15:29:13] (03PS4) 10Volans: sre.ganeti.*: customize lock arguments [cookbooks] - 10https://gerrit.wikimedia.org/r/967628 (https://phabricator.wikimedia.org/T341973) [15:29:15] !log fnegri@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1025.eqiad.wmnet with reason: host reimage [15:29:45] !log eevans@cumin1001 START - Cookbook sre.cassandra.roll-restart for nodes matching A:aqs-eqiad: Applying JVM security upgrade - eevans@cumin1001 [15:31:17] (03PS11) 10Aqu: [WIP] Send metrics from Airflow analytics test [puppet] - 10https://gerrit.wikimedia.org/r/968285 (https://phabricator.wikimedia.org/T349532) [15:32:18] !log fnegri@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1025.eqiad.wmnet with reason: host reimage [15:33:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2171:3315 (re)pooling @ 100%: Puppet changes', diff saved to https://phabricator.wikimedia.org/P53214 and previous config saved to /var/cache/conftool/dbconfig/20231109-153321-root.json [15:33:48] (03PS1) 10Ottomata: schema.svc - add keeplive_timeout param, default to 1h [puppet] - 10https://gerrit.wikimedia.org/r/973186 (https://phabricator.wikimedia.org/T350713) [15:34:29] (03CR) 10Aqu: [V: 03+1] "PCC SUCCESS (DIFF 2 CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/968285 (https://phabricator.wikimedia.org/T349532) (owner: 10Aqu) [15:36:47] (03PS1) 10Marostegui: db1119: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/973187 (https://phabricator.wikimedia.org/T350022) [15:37:31] (03CR) 10Marostegui: [C: 03+2] db1119: 
Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/973187 (https://phabricator.wikimedia.org/T350022) (owner: 10Marostegui) [15:37:47] (03CR) 10Ottomata: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/371/con" [puppet] - 10https://gerrit.wikimedia.org/r/973186 (https://phabricator.wikimedia.org/T350713) (owner: 10Ottomata) [15:38:30] 10ops-eqiad, 10DBA, 10DC-Ops, 10decommission-hardware: decommission dbproxy1017.eqiad.wmnet - https://phabricator.wikimedia.org/T348956 (10ABran-WMF) [15:40:05] (03PS1) 10Ottomata: Set envoy schema proxy keepalive timeout to 10s [puppet] - 10https://gerrit.wikimedia.org/r/973188 (https://phabricator.wikimedia.org/T350713) [15:40:37] (03Abandoned) 10Ottomata: schema.svc - add keeplive_timeout param, default to 1h [puppet] - 10https://gerrit.wikimedia.org/r/973186 (https://phabricator.wikimedia.org/T350713) (owner: 10Ottomata) [15:41:00] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P53215 and previous config saved to /var/cache/conftool/dbconfig/20231109-154100-arnaudb.json [15:41:31] (03CR) 10Ottomata: [C: 03+2] Set envoy schema proxy keepalive timeout to 10s [puppet] - 10https://gerrit.wikimedia.org/r/973188 (https://phabricator.wikimedia.org/T350713) (owner: 10Ottomata) [15:44:31] (03PS2) 10Hnowlan: changeprop: add config support for migration to k8s jobrunners [deployment-charts] - 10https://gerrit.wikimedia.org/r/972358 (https://phabricator.wikimedia.org/T349796) [15:45:09] !log otto@deploy2002 helmfile [staging] START helmfile.d/services/eventstreams-internal: apply [15:45:38] !log otto@deploy2002 helmfile [staging] DONE helmfile.d/services/eventstreams-internal: apply [15:45:47] !log otto@deploy2002 helmfile [eqiad] START helmfile.d/services/eventstreams-internal: apply [15:45:50] (03PS12) 10Aqu: [WIP] Send metrics from Airflow analytics 
test [puppet] - 10https://gerrit.wikimedia.org/r/968285 (https://phabricator.wikimedia.org/T349532) [15:46:33] !log otto@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventstreams-internal: apply [15:46:44] !log otto@deploy2002 helmfile [codfw] START helmfile.d/services/eventstreams-internal: apply [15:47:30] !log otto@deploy2002 helmfile [codfw] DONE helmfile.d/services/eventstreams-internal: apply [15:47:34] RECOVERY - MariaDB Replica Lag: s7 on dbstore1003 is OK: OK slave_sql_lag Replication lag: 0.04 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:47:46] RECOVERY - Check unit status of httpbb_hourly_appserver on cumin1001 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [15:49:02] RECOVERY - Check systemd state on ganeti2031 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:49:16] RECOVERY - Check systemd state on htmldumper1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:49:49] (03CR) 10Aqu: [V: 03+1] "PCC SUCCESS (CORE_DIFF 3 DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/968285 (https://phabricator.wikimedia.org/T349532) (owner: 10Aqu) [15:53:53] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [15:56:07] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P53216 and previous config saved to /var/cache/conftool/dbconfig/20231109-155606-arnaudb.json [15:58:55] !log fnegri@cumin1001 END (PASS) - Cookbook 
sre.hosts.reimage (exit_code=0) for host cloudvirt1025.eqiad.wmnet with OS bookworm [15:59:16] RECOVERY - MariaDB Replica SQL: s1 on clouddb1021 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:59:44] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:04:45] 10SRE, 10SRE-Access-Requests, 10Structured-Data-Backlog, 10UploadWizard: Access request to deleted image files in the backup cluster - https://phabricator.wikimedia.org/T350020 (10Cparle) Not really sure on either of these, let us talk to people doing similar image-analysis work and get back to you ... [16:06:26] !log otto@deploy2002 helmfile [staging] START helmfile.d/services/eventgate-analytics-external: apply [16:06:46] !log otto@deploy2002 helmfile [staging] DONE helmfile.d/services/eventgate-analytics-external: apply [16:06:51] !log otto@deploy2002 helmfile [eqiad] START helmfile.d/services/eventgate-analytics-external: apply [16:07:08] 10SRE, 10SRE-Access-Requests, 10Structured-Data-Backlog, 10UploadWizard: Access request to deleted image files in the backup cluster - https://phabricator.wikimedia.org/T350020 (10jcrespo) Feel free to contact other SREs that can support you (can be those in data engineering, as they may know more about Ha... 
[16:07:41] !log otto@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics-external: apply [16:07:51] !log otto@deploy2002 helmfile [codfw] START helmfile.d/services/eventgate-analytics-external: apply [16:08:08] !log otto@deploy2002 helmfile [codfw] DONE helmfile.d/services/eventgate-analytics-external: apply [16:09:06] RECOVERY - MariaDB Replica IO: s3 on clouddb1021 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [16:09:24] RECOVERY - MariaDB Replica IO: s5 on clouddb1021 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [16:09:38] RECOVERY - MariaDB Replica SQL: s3 on clouddb1021 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [16:10:00] RECOVERY - MariaDB Replica IO: s4 on clouddb1021 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [16:10:30] RECOVERY - MariaDB Replica SQL: s2 on clouddb1021 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [16:11:13] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T348183)', diff saved to https://phabricator.wikimedia.org/P53217 and previous config saved to /var/cache/conftool/dbconfig/20231109-161112-arnaudb.json [16:11:14] RECOVERY - MariaDB Replica IO: s6 on clouddb1021 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [16:11:15] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1187.eqiad.wmnet with reason: Maintenance [16:11:16] RECOVERY - MariaDB Replica IO: s7 on clouddb1021 is OK: OK slave_io_state Slave_IO_Running: Yes 
https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [16:11:16] (03PS1) 10Kosta Harlan: ipoid: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/973192 [16:11:17] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [16:11:28] RECOVERY - MariaDB Replica SQL: s6 on clouddb1021 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [16:11:28] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1187.eqiad.wmnet with reason: Maintenance [16:11:32] (03CR) 10Kosta Harlan: [C: 03+2] ipoid: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/973192 (owner: 10Kosta Harlan) [16:11:35] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db1187 (T348183)', diff saved to https://phabricator.wikimedia.org/P53218 and previous config saved to /var/cache/conftool/dbconfig/20231109-161134-arnaudb.json [16:12:07] (03PS1) 10Ssingh: hiera: re-order new cp hosts in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/973193 (https://phabricator.wikimedia.org/T349244) [16:12:09] (03CR) 10Filippo Giunchedi: [C: 03+2] pontoon: add pki to o11y [puppet] - 10https://gerrit.wikimedia.org/r/973160 (owner: 10Filippo Giunchedi) [16:12:14] RECOVERY - MariaDB Replica IO: s8 on clouddb1021 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [16:12:14] (03PS2) 10Filippo Giunchedi: pontoon: add pki to o11y [puppet] - 10https://gerrit.wikimedia.org/r/973160 [16:12:33] (03Merged) 10jenkins-bot: ipoid: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/973192 (owner: 10Kosta Harlan) [16:13:59] !log kharlan@deploy2002 helmfile [eqiad] START helmfile.d/services/ipoid: apply [16:14:07] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling 
after maintenance db1187 (T348183)', diff saved to https://phabricator.wikimedia.org/P53219 and previous config saved to /var/cache/conftool/dbconfig/20231109-161406-arnaudb.json [16:14:18] !log kharlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/ipoid: apply [16:14:33] (03CR) 10Filippo Giunchedi: [V: 03+2] pontoon: add pki to o11y [puppet] - 10https://gerrit.wikimedia.org/r/973160 (owner: 10Filippo Giunchedi) [16:17:00] (03PS1) 10Volans: documentation: add example of locking [software/spicerack] - 10https://gerrit.wikimedia.org/r/973194 (https://phabricator.wikimedia.org/T341973) [16:20:40] !log eevans@cumin1001 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:aqs-eqiad: Applying JVM security upgrade - eevans@cumin1001 [16:21:52] (03PS1) 10Kosta Harlan: ipoid: Set SPUR_API_KEY variable [deployment-charts] - 10https://gerrit.wikimedia.org/r/973196 [16:22:50] (03PS2) 10Kosta Harlan: ipoid: Set SPUR_API_KEY variable [deployment-charts] - 10https://gerrit.wikimedia.org/r/973196 [16:23:09] !log eevans@cumin1001 START - Cookbook sre.cassandra.roll-restart for nodes matching A:aqs-codfw: Applying JVM security upgrade - eevans@cumin1001 [16:23:52] (03PS3) 10Kosta Harlan: ipoid: Set SPUR_API_KEY variable [deployment-charts] - 10https://gerrit.wikimedia.org/r/973196 [16:24:19] (03PS1) 10Btullis: Fix whitespace in netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/973197 [16:24:21] (03CR) 10Kosta Harlan: [C: 03+2] ipoid: Set SPUR_API_KEY variable [deployment-charts] - 10https://gerrit.wikimedia.org/r/973196 (owner: 10Kosta Harlan) [16:24:46] (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/973197 (owner: 10Btullis) [16:25:16] (03Merged) 10jenkins-bot: ipoid: Set SPUR_API_KEY variable [deployment-charts] - 10https://gerrit.wikimedia.org/r/973196 (owner: 10Kosta Harlan) [16:26:12] !log kharlan@deploy2002 helmfile [eqiad] START helmfile.d/services/ipoid: apply [16:26:25] !log kharlan@deploy2002 
helmfile [eqiad] DONE helmfile.d/services/ipoid: apply [16:27:21] (03PS1) 10Cathal Mooney: Adjust BGP_Customer_out policy to send default and local POP routes [homer/public] - 10https://gerrit.wikimedia.org/r/973198 (https://phabricator.wikimedia.org/T350740) [16:28:46] (03CR) 10Dzahn: admin: add urbanecm to stewards-users (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/972911 (https://phabricator.wikimedia.org/T350834) (owner: 10Dzahn) [16:28:59] (03PS2) 10Cathal Mooney: Adjust BGP_Customer_out policy to send default and local POP routes [homer/public] - 10https://gerrit.wikimedia.org/r/973198 (https://phabricator.wikimedia.org/T350740) [16:29:13] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1187', diff saved to https://phabricator.wikimedia.org/P53220 and previous config saved to /var/cache/conftool/dbconfig/20231109-162913-arnaudb.json [16:29:40] (03PS1) 10Kosta Harlan: ipoid: Adjust command invocation [deployment-charts] - 10https://gerrit.wikimedia.org/r/973199 [16:29:48] (03PS2) 10Dzahn: admin: add urbanecm to stewards-users [puppet] - 10https://gerrit.wikimedia.org/r/972911 (https://phabricator.wikimedia.org/T350834) [16:29:54] (03CR) 10Kosta Harlan: [C: 03+2] ipoid: Adjust command invocation [deployment-charts] - 10https://gerrit.wikimedia.org/r/973199 (owner: 10Kosta Harlan) [16:30:03] (03CR) 10Dzahn: admin: add urbanecm to stewards-users (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/972911 (https://phabricator.wikimedia.org/T350834) (owner: 10Dzahn) [16:30:42] (03Merged) 10jenkins-bot: ipoid: Adjust command invocation [deployment-charts] - 10https://gerrit.wikimedia.org/r/973199 (owner: 10Kosta Harlan) [16:31:15] !log kharlan@deploy2002 helmfile [eqiad] START helmfile.d/services/ipoid: apply [16:31:28] !log kharlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/ipoid: apply [16:31:46] (03PS3) 10Dzahn: admin: add urbanecm to stewards-users [puppet] - 10https://gerrit.wikimedia.org/r/972911 
(https://phabricator.wikimedia.org/T350834) [16:32:32] (03CR) 10Fabfur: [C: 03+1] "looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/973193 (https://phabricator.wikimedia.org/T349244) (owner: 10Ssingh) [16:35:46] (03CR) 10Ssingh: [C: 03+2] hiera: re-order new cp hosts in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/973193 (https://phabricator.wikimedia.org/T349244) (owner: 10Ssingh) [16:37:05] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10Fabfur) [16:37:37] (03PS1) 10Ladsgroup: Enable pagelinks write both on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/973200 (https://phabricator.wikimedia.org/T345732) [16:37:46] jouncebot: nowandnext [16:37:46] No deployments scheduled for the next 0 hour(s) and 22 minute(s) [16:37:46] In 0 hour(s) and 22 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231109T1700) [16:37:51] RECOVERY - MariaDB Replica Lag: s5 on clouddb1021 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [16:38:14] (03CR) 10Btullis: [C: 03+2] Fix whitespace in netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/973197 (owner: 10Btullis) [16:38:37] (03CR) 10Ladsgroup: [C: 03+2] Enable pagelinks write both on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/973200 (https://phabricator.wikimedia.org/T345732) (owner: 10Ladsgroup) [16:38:54] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by ladsgroup@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/973200 (https://phabricator.wikimedia.org/T345732) (owner: 10Ladsgroup) [16:38:57] (03PS1) 10Kosta Harlan: ipoid: Stop trying to re-assign env variables [deployment-charts] - 10https://gerrit.wikimedia.org/r/973201 [16:39:18] (03CR) 10Kosta Harlan: [C: 03+2] ipoid: Stop trying to re-assign env 
variables [deployment-charts] - 10https://gerrit.wikimedia.org/r/973201 (owner: 10Kosta Harlan) [16:39:36] (03Merged) 10jenkins-bot: Enable pagelinks write both on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/973200 (https://phabricator.wikimedia.org/T345732) (owner: 10Ladsgroup) [16:40:01] !log ladsgroup@deploy2002 Started scap: Backport for [[gerrit:973200|Enable pagelinks write both on enwiki (T345732)]] [16:40:06] (03Merged) 10jenkins-bot: ipoid: Stop trying to re-assign env variables [deployment-charts] - 10https://gerrit.wikimedia.org/r/973201 (owner: 10Kosta Harlan) [16:40:06] T345732: Turn on write both for beta and production - https://phabricator.wikimedia.org/T345732 [16:41:20] !log kharlan@deploy2002 helmfile [eqiad] START helmfile.d/services/ipoid: apply [16:41:26] !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:973200|Enable pagelinks write both on enwiki (T345732)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [16:41:29] !log kharlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/ipoid: apply [16:41:39] PROBLEM - Check systemd state on cephosd1001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:42:10] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10Fabfur) [16:42:37] RECOVERY - MariaDB Replica Lag: s7 on clouddb1021 is OK: OK slave_sql_lag Replication lag: 0.31 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [16:42:48] !log ladsgroup@deploy2002 ladsgroup: Continuing with sync [16:43:35] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp1101.eqiad.wmnet with OS bullseye [16:43:45] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q1:Install cp11[00-15] and rotate into production - 
https://phabricator.wikimedia.org/T349244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp1101.eqiad.wmnet with OS bullseye [16:44:20] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1187', diff saved to https://phabricator.wikimedia.org/P53221 and previous config saved to /var/cache/conftool/dbconfig/20231109-164419-arnaudb.json [16:47:47] RECOVERY - MariaDB Replica Lag: s3 on clouddb1021 is OK: OK slave_sql_lag Replication lag: 0.32 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [16:48:10] !log ladsgroup@deploy2002 Finished scap: Backport for [[gerrit:973200|Enable pagelinks write both on enwiki (T345732)]] (duration: 08m 09s) [16:48:15] T345732: Turn on write both for beta and production - https://phabricator.wikimedia.org/T345732 [16:49:31] (03CR) 10Marostegui: [C: 03+1] haproxy: remove dbproxy1017 from production (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/972509 (https://phabricator.wikimedia.org/T348956) (owner: 10Arnaudb) [16:50:02] (03CR) 10Marostegui: [C: 03+1] wiki-replicas: Update IP address for cloudcontrol1006 [puppet] - 10https://gerrit.wikimedia.org/r/964871 (https://phabricator.wikimedia.org/T347381) (owner: 10Majavah) [16:51:23] (03PS4) 10Marostegui: filtered_tables.txt: Update for CampaignEvents schema change [puppet] - 10https://gerrit.wikimedia.org/r/925921 (owner: 10Daimona Eaytoy) [16:51:47] !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp1101.eqiad.wmnet with OS bullseye [16:51:56] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp1101.eqiad.wmnet with OS bullseye executed with erro... 
[16:52:03] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp1101.eqiad.wmnet with OS bullseye [16:52:12] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp1101.eqiad.wmnet with OS bullseye [16:53:19] 10SRE-swift-storage, 10Epic: [Epic] Determine a strategy to store files between 5 and 100 GB - https://phabricator.wikimedia.org/T191802 (10Nosferattus) Personally, I think 5 GiB is plenty. Our purpose is education, not entertainment. We don't need 8K videos to explain how mitochondria work. 480p works fine. 1... [16:54:13] (03CR) 10Marostegui: [C: 03+2] filtered_tables.txt: Update for CampaignEvents schema change [puppet] - 10https://gerrit.wikimedia.org/r/925921 (owner: 10Daimona Eaytoy) [16:55:39] (03CR) 10Marostegui: [C: 03+1] tox.ini: whitelist_externals -> allowlist_externals [software] - 10https://gerrit.wikimedia.org/r/955880 (https://phabricator.wikimedia.org/T345695) (owner: 10Hashar) [16:55:40] !log eevans@cumin1001 END (FAIL) - Cookbook sre.cassandra.roll-restart (exit_code=99) for nodes matching A:aqs-codfw: Applying JVM security upgrade - eevans@cumin1001 [16:56:28] (03PS1) 10Btullis: Update the contact info for the wikireplica servers [puppet] - 10https://gerrit.wikimedia.org/r/973203 (https://phabricator.wikimedia.org/T345698) [16:56:40] (03PS1) 10Brouberol: Format both LVM volumes of an-druid1005 at next reimage [puppet] - 10https://gerrit.wikimedia.org/r/973204 (https://phabricator.wikimedia.org/T332604) [16:56:43] (03CR) 10Marostegui: [C: 03+2] dbprov2002.cnf.erb: Change db_inventory target [puppet] - 10https://gerrit.wikimedia.org/r/892948 (https://phabricator.wikimedia.org/T326596) (owner: 10Marostegui) [16:57:02] (03CR) 10Marostegui: [C: 03+2] "Sorry, removed the +2 by mistake. 
Added it back even if this is all a noop :)" [puppet] - 10https://gerrit.wikimedia.org/r/892948 (https://phabricator.wikimedia.org/T326596) (owner: 10Marostegui) [16:57:33] !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp1101.eqiad.wmnet with OS bullseye [16:57:53] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp1101.eqiad.wmnet with OS bullseye executed with erro... [16:57:57] (03PS2) 10Brouberol: Format both LVM volumes of an-druid1005 at next reimage [puppet] - 10https://gerrit.wikimedia.org/r/973204 (https://phabricator.wikimedia.org/T332604) [16:58:01] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp1101.eqiad.wmnet with OS bullseye [16:58:12] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp1101.eqiad.wmnet with OS bullseye [16:58:39] (03CR) 10Btullis: Format both LVM volumes of an-druid1005 at next reimage (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/973204 (https://phabricator.wikimedia.org/T332604) (owner: 10Brouberol) [16:59:26] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1187 (T348183)', diff saved to https://phabricator.wikimedia.org/P53222 and previous config saved to /var/cache/conftool/dbconfig/20231109-165925-arnaudb.json [16:59:28] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1201.eqiad.wmnet with reason: Maintenance [16:59:30] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [16:59:41] !log arnaudb@cumin1001 END 
(PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1201.eqiad.wmnet with reason: Maintenance [16:59:45] (03PS3) 10Brouberol: Format both LVM volumes of an-druid1005 at next reimage [puppet] - 10https://gerrit.wikimedia.org/r/973204 (https://phabricator.wikimedia.org/T332604) [16:59:47] (03CR) 10Brouberol: Format both LVM volumes of an-druid1005 at next reimage (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/973204 (https://phabricator.wikimedia.org/T332604) (owner: 10Brouberol) [16:59:48] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db1201 (T348183)', diff saved to https://phabricator.wikimedia.org/P53223 and previous config saved to /var/cache/conftool/dbconfig/20231109-165947-arnaudb.json [17:00:05] jbond and rzl: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Puppet request window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231109T1700). [17:00:05] No Gerrit patches in the queue for this window AFAICS. [17:00:13] (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/973203 (https://phabricator.wikimedia.org/T345698) (owner: 10Btullis) [17:02:21] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1201 (T348183)', diff saved to https://phabricator.wikimedia.org/P53224 and previous config saved to /var/cache/conftool/dbconfig/20231109-170220-arnaudb.json [17:02:22] 10ops-eqiad, 10DC-Ops: Reimage cookbook on new eqiad hosts stuck at PXE booting - https://phabricator.wikimedia.org/T350179 (10ssingh) As another data point if it helps debugging, we are reimaging cp1101 again as we are switching roles (see https://gerrit.wikimedia.org/r/c/operations/puppet/+/973193); it took... 
[17:04:01] RECOVERY - MariaDB Replica Lag: s8 on clouddb1021 is OK: OK slave_sql_lag Replication lag: 0.36 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [17:04:35] (03PS2) 10Btullis: Update the contact info for the wikireplica servers [puppet] - 10https://gerrit.wikimedia.org/r/973203 (https://phabricator.wikimedia.org/T345698) [17:06:34] (03CR) 10Btullis: Format both LVM volumes of an-druid1005 at next reimage (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/973204 (https://phabricator.wikimedia.org/T332604) (owner: 10Brouberol) [17:07:32] (03PS4) 10Brouberol: Format both LVM volumes of an-druid1005 at next reimage [puppet] - 10https://gerrit.wikimedia.org/r/973204 (https://phabricator.wikimedia.org/T332604) [17:07:42] (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/973203 (https://phabricator.wikimedia.org/T345698) (owner: 10Btullis) [17:08:05] (03CR) 10Btullis: [C: 03+1] "Looks good." [puppet] - 10https://gerrit.wikimedia.org/r/973204 (https://phabricator.wikimedia.org/T332604) (owner: 10Brouberol) [17:08:35] (03CR) 10Brouberol: Format both LVM volumes of an-druid1005 at next reimage (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/973204 (https://phabricator.wikimedia.org/T332604) (owner: 10Brouberol) [17:08:39] (03CR) 10Brouberol: [C: 03+2] Format both LVM volumes of an-druid1005 at next reimage [puppet] - 10https://gerrit.wikimedia.org/r/973204 (https://phabricator.wikimedia.org/T332604) (owner: 10Brouberol) [17:09:01] RECOVERY - MariaDB Replica Lag: s6 on clouddb1021 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [17:09:54] (03CR) 10Jbond: [C: 03+1] "lgtm" [software/spicerack] - 10https://gerrit.wikimedia.org/r/973194 (https://phabricator.wikimedia.org/T341973) (owner: 10Volans) [17:11:40] (03CR) 10Volans: [C: 03+2] documentation: add example of locking [software/spicerack] - 
10https://gerrit.wikimedia.org/r/973194 (https://phabricator.wikimedia.org/T341973) (owner: 10Volans) [17:13:18] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp1101.eqiad.wmnet with reason: host reimage [17:13:51] (03PS13) 10Aqu: [WIP] Send metrics from Airflow analytics test [puppet] - 10https://gerrit.wikimedia.org/r/968285 (https://phabricator.wikimedia.org/T349532) [17:14:12] (03CR) 10Hnowlan: [C: 03+1] "lgtm! Thanks for the work on this" [deployment-charts] - 10https://gerrit.wikimedia.org/r/972404 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm) [17:15:28] 10SRE-swift-storage, 10Epic: [Epic] Determine a strategy to store files between 5 and 100 GB - https://phabricator.wikimedia.org/T191802 (10Aklapper) Being able to show freely licensed educational content on big screens is "not entertainment". You're still free to screen 480p in your cinema if your audience en... [17:15:44] (03CR) 10Andrea Denisse: [C: 03+1] "LGTM, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/973158 (owner: 10Filippo Giunchedi) [17:15:58] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q2: install ram upgrades in an-master100[34] - https://phabricator.wikimedia.org/T349879 (10VRiley-WMF) @BTullis we have received the ram modules. Could you let us know a good time to power the servers down and install them? Let us know, thanks! [17:16:04] (03CR) 10Hnowlan: [C: 03+1] "lgtm, one query" [deployment-charts] - 10https://gerrit.wikimedia.org/r/972844 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm) [17:16:21] (03CR) 10Andrea Denisse: [C: 03+1] "LGTM, thank you!" 
[puppet] - 10https://gerrit.wikimedia.org/r/973159 (owner: 10Filippo Giunchedi) [17:16:22] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp1101.eqiad.wmnet with reason: host reimage [17:16:44] (03CR) 10Hnowlan: [C: 03+1] api-gateway,rest-gateway: Switch to cert-manager certificates (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/972844 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm) [17:16:56] !log jbond@cumin1001 START - Cookbook sre.puppet.migrate-role for role: elasticsearch::relforge [17:17:28] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1201', diff saved to https://phabricator.wikimedia.org/P53225 and previous config saved to /var/cache/conftool/dbconfig/20231109-171727-arnaudb.json [17:17:58] (03CR) 10Aqu: [V: 03+1] "PCC SUCCESS (DIFF 2 CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/968285 (https://phabricator.wikimedia.org/T349532) (owner: 10Aqu) [17:18:04] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q2: install ram upgrades in an-master100[34] - https://phabricator.wikimedia.org/T349879 (10BTullis) Hi @VRiley-WMF - Many thanks. These servers are not doing anything at the moment, so they can be powered them down any time you like. Would you like me to p... 
[17:18:10] (03Merged) 10jenkins-bot: documentation: add example of locking [software/spicerack] - 10https://gerrit.wikimedia.org/r/973194 (https://phabricator.wikimedia.org/T341973) (owner: 10Volans) [17:18:14] (03CR) 10Andrea Denisse: ircecho: Migrate IRC Echo from Python 2 to Python 3 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/972929 (https://phabricator.wikimedia.org/T333615) (owner: 10Andrea Denisse) [17:19:14] (03CR) 10Andrea Denisse: [C: 03+2] icinga: Remove unnecessary python-phabricator Python2 dependency [puppet] - 10https://gerrit.wikimedia.org/r/972925 (https://phabricator.wikimedia.org/T333615) (owner: 10Andrea Denisse) [17:19:47] (03PS1) 10Jbond: elasticsearch/relforge: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/973206 (https://phabricator.wikimedia.org/T349619) [17:20:10] (03CR) 10Jbond: [C: 03+2] elasticsearch/relforge: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/973206 (https://phabricator.wikimedia.org/T349619) (owner: 10Jbond) [17:22:54] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q2: install ram upgrades in an-master100[34] - https://phabricator.wikimedia.org/T349879 (10VRiley-WMF) @BTullis I'll go ahead and install them and update the ticket once it's done. Just wanted to verify it would be okay to proceed. [17:23:29] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q2: install ram upgrades in an-master100[34] - https://phabricator.wikimedia.org/T349879 (10BTullis) Awesome. Many thanks. 
[17:24:47] (Device rebooted) firing: Alert for device ps1-a4-codfw.mgmt.codfw.wmnet - Device rebooted - https://alerts.wikimedia.org/?q=alertname%3DDevice+rebooted [17:25:55] !log jbond@cumin1001 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: elasticsearch::relforge [17:27:21] RECOVERY - MariaDB Replica Lag: s2 on clouddb1021 is OK: OK slave_sql_lag Replication lag: 0.41 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [17:28:09] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q2: install ram upgrades in an-master100[34] - https://phabricator.wikimedia.org/T349879 (10JAllemandou) Yay! Thanks so much @VRiley-WMF and @BTullis :) [17:28:43] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 2 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10jbond) [17:29:35] (03PS3) 10Btullis: Update the contact info for the wikireplica servers [puppet] - 10https://gerrit.wikimedia.org/r/973203 (https://phabricator.wikimedia.org/T345698) [17:29:47] (Device rebooted) resolved: Device ps1-a4-codfw.mgmt.codfw.wmnet recovered from Device rebooted - https://alerts.wikimedia.org/?q=alertname%3DDevice+rebooted [17:32:34] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1201', diff saved to https://phabricator.wikimedia.org/P53226 and previous config saved to /var/cache/conftool/dbconfig/20231109-173233-arnaudb.json [17:32:51] PROBLEM - ensure kvm processes are running on cloudvirt1025 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [17:34:10] ACKNOWLEDGEMENT - ensure kvm processes are running on cloudvirt1025 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 Andrew Bogott upgrading to bookworm https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [17:34:29] (03CR) 10Vgutierrez: [C: 03+1] "it looks good, 
make sure to disable puppet and stop acme-chief.service on the current active instance (acmechief-test2001.codfw.wmnet) bef" [puppet] - 10https://gerrit.wikimedia.org/r/972886 (https://phabricator.wikimedia.org/T342154) (owner: 10BCornwall) [17:34:30] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp1101.eqiad.wmnet with OS bullseye [17:34:35] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp1101.eqiad.wmnet with OS bullseye completed: - cp1101 (**PASS**) - Remov... [17:36:12] (03CR) 10Vgutierrez: [C: 03+1] "oh you could also trigger acme-chief-certs-sync.service on acmechief-test2001 after disabling puppet and stopping acme-chief.service to en" [puppet] - 10https://gerrit.wikimedia.org/r/972886 (https://phabricator.wikimedia.org/T342154) (owner: 10BCornwall) [17:37:22] (03PS14) 10Aqu: Send metrics from Airflow analytics test [puppet] - 10https://gerrit.wikimedia.org/r/968285 (https://phabricator.wikimedia.org/T349532) [17:37:26] !log fabfur@cumin1001 START - Cookbook sre.hosts.remove-downtime for cp1101.eqiad.wmnet [17:37:27] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for cp1101.eqiad.wmnet [17:38:56] !log removed cp1076 from HAProxy/Varnish pool (NOT ats-be pool) for T349244 [17:38:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:39:00] T349244: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 [17:39:04] (03CR) 10Aqu: Send metrics from Airflow analytics test (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/968285 (https://phabricator.wikimedia.org/T349532) (owner: 10Aqu) [17:39:18] 10SRE-swift-storage, 10Epic: [Epic] Determine a strategy to store files between 5 and 100 GB - https://phabricator.wikimedia.org/T191802 
(10C.Suthorn) >>! In T191802#9320060, @Nosferattus wrote: > Personally, I think 5 GiB is plenty. Our purpose is education, not entertainment. We don't need 8K videos to expla... [17:41:45] !log jbond@cumin1001 START - Cookbook sre.puppet.migrate-role for role: wdqs::test [17:43:14] (03PS1) 10Jbond: wdqs::test: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/973210 (https://phabricator.wikimedia.org/T349619) [17:45:47] !log pooled cp1101 into upload cluster (both cdn and ats-be): T349244 [17:45:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:45:51] T349244: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 [17:46:19] (03CR) 10Jbond: [C: 03+2] wdqs::test: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/973210 (https://phabricator.wikimedia.org/T349619) (owner: 10Jbond) [17:47:38] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q2: install ram upgrades in an-master100[34] - https://phabricator.wikimedia.org/T349879 (10VRiley-WMF) [17:47:41] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1201 (T348183)', diff saved to https://phabricator.wikimedia.org/P53227 and previous config saved to /var/cache/conftool/dbconfig/20231109-174740-arnaudb.json [17:47:42] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1213.eqiad.wmnet with reason: Maintenance [17:47:45] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [17:47:56] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1213.eqiad.wmnet with reason: Maintenance [17:48:02] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db1213:3316 (T348183)', diff saved to https://phabricator.wikimedia.org/P53228 and previous config saved to /var/cache/conftool/dbconfig/20231109-174801-arnaudb.json [17:48:50] !log depooled 
service ats-be for cp1101 (T349244) [17:48:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:49:32] 10Puppet, 10iPoid-Service: Rename FEED_API_KEY - https://phabricator.wikimedia.org/T350903 (10kostajh) [17:50:27] RECOVERY - MariaDB Replica Lag: s4 on clouddb1021 is OK: OK slave_sql_lag Replication lag: 0.19 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [17:50:45] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1213:3316 (T348183)', diff saved to https://phabricator.wikimedia.org/P53229 and previous config saved to /var/cache/conftool/dbconfig/20231109-175044-arnaudb.json [17:51:08] !log jbond@cumin1001 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: wdqs::test [17:51:18] (03PS1) 10Dzahn: cloud/devtools: delete hiera hosts file for deleted hosts [puppet] - 10https://gerrit.wikimedia.org/r/973211 [17:51:59] RECOVERY - ensure kvm processes are running on cloudvirt1025 is OK: PROCS OK: 1 process with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [17:52:32] (03CR) 10Dzahn: "@AOkoth - since this file was needed for vrts-1001 but that instance is gone and there is now vrts-1002. 
Does this need to be copied to vr" [puppet] - 10https://gerrit.wikimedia.org/r/973211 (owner: 10Dzahn) [17:59:16] (03PS49) 10Bking: rdf-streaming-updater: update values for application mode [deployment-charts] - 10https://gerrit.wikimedia.org/r/967229 (https://phabricator.wikimedia.org/T349095) [18:00:06] Deploy window MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231109T1800) [18:00:24] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 2 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10jbond) [18:02:28] (03PS1) 10Dzahn: phabricator/httpd: add support for bullseye/bookworm PHP versions [puppet] - 10https://gerrit.wikimedia.org/r/973213 (https://phabricator.wikimedia.org/T327068) [18:02:55] (03CR) 10Bking: rdf-streaming-updater: update values for application mode (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/967229 (https://phabricator.wikimedia.org/T349095) (owner: 10Bking) [18:03:13] !log bking@deploy2002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [18:03:20] !log bking@deploy2002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [18:04:18] (03CR) 10BCornwall: [C: 03+2] acme_chief: Set acmechief-test1001 as active host [puppet] - 10https://gerrit.wikimedia.org/r/972886 (https://phabricator.wikimedia.org/T342154) (owner: 10BCornwall) [18:04:20] (03CR) 10Dzahn: "intended to be a noop on anything existing - but without it upgrade to bullseye or bookworm won't be possible" [puppet] - 10https://gerrit.wikimedia.org/r/973213 (https://phabricator.wikimedia.org/T327068) (owner: 10Dzahn) [18:05:51] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1213:3316', diff saved to https://phabricator.wikimedia.org/P53230 and previous config saved to /var/cache/conftool/dbconfig/20231109-180551-arnaudb.json [18:06:51] !log bking@deploy2002 helmfile [staging] START 
helmfile.d/services/rdf-streaming-updater: apply [18:06:58] !log bking@deploy2002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [18:10:12] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10Fabfur) [18:12:09] !log eevans@cumin1001 START - Cookbook sre.cassandra.roll-restart for nodes matching aqs20[09-12].codfw.wmnet: Applying JVM security upgrade - eevans@cumin1001 [18:14:47] (03CR) 10Jsn.sherman: [C: 03+1] "This looks good to me and can be merged at any time, IMO. It's a beta only change that's only removing an unused variable, so it should ha" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/968397 (https://phabricator.wikimedia.org/T331595) (owner: 10MPGuy2824) [18:15:19] (03PS1) 10Urbanecm: wikimaniawiki: Switch back to standard logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/973215 (https://phabricator.wikimedia.org/T350640) [18:15:20] !log bking@deploy2002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [18:15:26] !log bking@deploy2002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [18:15:43] !log bking@deploy2002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [18:15:49] !log bking@deploy2002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [18:18:14] !log fabfur@cumin1001 START - Cookbook sre.hosts.reimage for host cp1103.eqiad.wmnet with OS bullseye [18:18:20] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1001 for host cp1103.eqiad.wmnet with OS bullseye [18:18:51] RECOVERY - MariaDB Replica Lag: s1 on clouddb1021 is OK: OK slave_sql_lag Replication lag: 0.49 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica 
[18:20:58] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1213:3316', diff saved to https://phabricator.wikimedia.org/P53232 and previous config saved to /var/cache/conftool/dbconfig/20231109-182057-arnaudb.json [18:21:53] (03PS1) 10BryanDavis: developer-portal: Bump container to 2023-11-09-122837-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/973216 [18:21:55] (03PS1) 10BryanDavis: toolhub: Bump container to 2023-11-09-085934-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/973217 (https://phabricator.wikimedia.org/T338296) [18:22:25] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to stewards-users and group approver role for urbanecm - https://phabricator.wikimedia.org/T350834 (10Urbanecm) >>! In T350834#9319322, @Aklapper wrote: > I'm confused if this is a staff request (`@wikimedia.org` email address) or a voluntee... [18:23:10] !log bking@deploy2002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [18:23:16] !log bking@deploy2002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [18:23:39] !log eevans@cumin1001 END (FAIL) - Cookbook sre.cassandra.roll-restart (exit_code=99) for nodes matching aqs20[09-12].codfw.wmnet: Applying JVM security upgrade - eevans@cumin1001 [18:24:02] (03CR) 10BryanDavis: [C: 03+2] toolhub: Bump container to 2023-11-09-085934-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/973217 (https://phabricator.wikimedia.org/T338296) (owner: 10BryanDavis) [18:24:38] (03CR) 10BryanDavis: [C: 03+2] developer-portal: Bump container to 2023-11-09-122837-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/973216 (owner: 10BryanDavis) [18:24:46] !log fabfur@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp1103.eqiad.wmnet with OS bullseye [18:24:51] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:Install cp11[00-15] and rotate into production - 
https://phabricator.wikimedia.org/T349244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1001 for host cp1103.eqiad.wmnet with OS bullseye executed with errors: - cp1103 (**FAIL*... [18:24:59] !log fabfur@cumin1001 START - Cookbook sre.hosts.reimage for host cp1103.eqiad.wmnet with OS bullseye [18:25:07] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1001 for host cp1103.eqiad.wmnet with OS bullseye [18:25:52] (03Merged) 10jenkins-bot: developer-portal: Bump container to 2023-11-09-122837-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/973216 (owner: 10BryanDavis) [18:25:54] (03Merged) 10jenkins-bot: toolhub: Bump container to 2023-11-09-085934-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/973217 (https://phabricator.wikimedia.org/T338296) (owner: 10BryanDavis) [18:28:46] !log bking@deploy2002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [18:28:59] !log bking@deploy2002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [18:29:06] !log bd808@deploy2002 helmfile [staging] START helmfile.d/services/developer-portal: apply [18:29:23] !log bd808@deploy2002 helmfile [staging] DONE helmfile.d/services/developer-portal: apply [18:29:32] !log bd808@deploy2002 helmfile [codfw] START helmfile.d/services/developer-portal: apply [18:29:52] !log bd808@deploy2002 helmfile [codfw] DONE helmfile.d/services/developer-portal: apply [18:30:20] !log bd808@deploy2002 helmfile [eqiad] START helmfile.d/services/developer-portal: apply [18:30:33] !log bd808@deploy2002 helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply [18:30:35] !log fabfur@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp1103.eqiad.wmnet with OS bullseye [18:30:41] 10SRE, 10ops-eqiad, 
10DC-Ops, 10Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1001 for host cp1103.eqiad.wmnet with OS bullseye executed with errors: - cp1103 (**FAIL*... [18:30:45] !log fabfur@cumin1001 START - Cookbook sre.hosts.reimage for host cp1103.eqiad.wmnet with OS bullseye [18:30:51] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1001 for host cp1103.eqiad.wmnet with OS bullseye [18:31:31] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host acmechief-test2001.codfw.wmnet with OS bookworm [18:32:15] !log bd808@deploy2002 helmfile [staging] START helmfile.d/services/toolhub: apply [18:32:40] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp1105.eqiad.wmnet with OS bullseye [18:32:47] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp1105.eqiad.wmnet with OS bullseye [18:33:31] !log bd808@deploy2002 helmfile [staging] DONE helmfile.d/services/toolhub: apply [18:33:37] !log bd808@deploy2002 helmfile [codfw] START helmfile.d/services/toolhub: apply [18:34:24] !log bd808@deploy2002 helmfile [codfw] DONE helmfile.d/services/toolhub: apply [18:35:20] !log bd808@deploy2002 helmfile [eqiad] START helmfile.d/services/toolhub: apply [18:36:04] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1213:3316 (T348183)', diff saved to https://phabricator.wikimedia.org/P53233 and previous config saved to /var/cache/conftool/dbconfig/20231109-183603-arnaudb.json [18:36:06] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 
0:00:00 on db1224.eqiad.wmnet with reason: Maintenance [18:36:08] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [18:36:10] !log bd808@deploy2002 helmfile [eqiad] DONE helmfile.d/services/toolhub: apply [18:36:21] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1224.eqiad.wmnet with reason: Maintenance [18:36:27] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db1224 (T348183)', diff saved to https://phabricator.wikimedia.org/P53234 and previous config saved to /var/cache/conftool/dbconfig/20231109-183626-arnaudb.json [18:38:57] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1224 (T348183)', diff saved to https://phabricator.wikimedia.org/P53235 and previous config saved to /var/cache/conftool/dbconfig/20231109-183857-arnaudb.json [18:40:25] !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp1105.eqiad.wmnet with OS bullseye [18:40:41] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp1105.eqiad.wmnet with OS bullseye [18:41:05] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp1105.eqiad.wmnet with OS bullseye executed with errors: - cp1105 (**FAIL**... 
[18:41:09] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp1105.eqiad.wmnet with OS bullseye [18:42:23] PROBLEM - ensure kvm processes are running on cloudvirt1059 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:42:33] PROBLEM - ensure kvm processes are running on cloudvirt1027 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:43:13] PROBLEM - ensure kvm processes are running on cloudvirt1060 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:43:47] PROBLEM - ensure kvm processes are running on cloudvirt1051 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:43:54] 10SRE, 10Traffic, 10GitLab (Project Migration), 10Patch-For-Review: Migrate Traffic repositories from Gerrit to Gitlab - https://phabricator.wikimedia.org/T347623 (10BCornwall) [18:45:57] !log fabfur@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp1103.eqiad.wmnet with reason: host reimage [18:46:40] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on acmechief-test2001.codfw.wmnet with reason: host reimage [18:48:13] PROBLEM - ensure kvm processes are running on cloudvirt1057 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:48:54] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp1103.eqiad.wmnet with reason: host reimage 
[18:49:33] !log Adding anycast gw config to ssw*codfw for vlan sandbox1-a-codfw (T348159) [18:49:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:49:37] T348159: Migrate atlas-codfw from asw-a1-codfw to lsw1-a1-codfw - https://phabricator.wikimedia.org/T348159 [18:51:30] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on acmechief-test2001.codfw.wmnet with reason: host reimage [18:51:54] !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp1105.eqiad.wmnet with OS bullseye [18:52:00] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp1105.eqiad.wmnet with OS bullseye executed with errors: - cp1105 (**FAIL**... [18:52:05] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp1105.eqiad.wmnet with OS bullseye [18:52:10] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp1105.eqiad.wmnet with OS bullseye [18:52:43] !log renumber VRRP GW VIP on crX-codfw for sandbox1-a-codfw (T348159) [18:52:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:54:04] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1224', diff saved to https://phabricator.wikimedia.org/P53236 and previous config saved to /var/cache/conftool/dbconfig/20231109-185403-arnaudb.json [18:54:23] (JobUnavailable) firing: (2) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable 
[18:56:23] PROBLEM - ensure kvm processes are running on cloudvirt1026 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:56:29] !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp1105.eqiad.wmnet with OS bullseye [18:56:33] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp1105.eqiad.wmnet with OS bullseye executed with errors: - cp1105 (**FAIL**... [18:56:40] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp1105.eqiad.wmnet with OS bullseye [18:56:55] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp1105.eqiad.wmnet with OS bullseye [19:00:05] jnuche and dduvall: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231109T1900). 
[19:00:11] (03PS1) 10Ottomata: schema service proxy - Add retry on 5xx [puppet] - 10https://gerrit.wikimedia.org/r/973223 (https://phabricator.wikimedia.org/T350713) [19:01:52] (03CR) 10Ottomata: [C: 03+2] schema service proxy - Add retry on 5xx [puppet] - 10https://gerrit.wikimedia.org/r/973223 (https://phabricator.wikimedia.org/T350713) (owner: 10Ottomata) [19:02:06] !log fnegri@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1026.eqiad.wmnet with OS bookworm [19:03:11] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [19:04:07] !log fnegri@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1027.eqiad.wmnet with OS bookworm [19:04:19] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10ssingh) [19:04:23] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS6939/IPv4: Connect - HE, AS6939/IPv6: Connect - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:04:28] !log fnegri@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1057.eqiad.wmnet with OS bookworm [19:04:31] !log otto@deploy2002 helmfile [staging] START helmfile.d/services/eventstreams-internal: apply [19:04:40] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host acmechief-test2001.codfw.wmnet with OS bookworm [19:04:43] !log fnegri@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1051.eqiad.wmnet with OS bookworm [19:04:54] !log otto@deploy2002 helmfile [staging] DONE helmfile.d/services/eventstreams-internal: apply [19:04:57] !log fnegri@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1059.eqiad.wmnet with OS bookworm [19:05:01] !log fnegri@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1060.eqiad.wmnet with OS bookworm [19:05:02] !log otto@deploy2002 helmfile [eqiad] START helmfile.d/services/eventstreams-internal: apply [19:05:20] !log cmooney@cumin1001 START - 
Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update DNS entries for sandbox1-codfw IPs - cmooney@cumin1001" [19:05:44] !log otto@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventstreams-internal: apply [19:06:05] !log cmooney@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update DNS entries for sandbox1-codfw IPs - cmooney@cumin1001" [19:06:05] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:06:23] !log otto@deploy2002 helmfile [codfw] START helmfile.d/services/eventstreams-internal: apply [19:06:58] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp1103.eqiad.wmnet with OS bullseye [19:07:03] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1001 for host cp1103.eqiad.wmnet with OS bullseye completed: - cp1103 (**PASS**) - Remo... 
[19:07:36] !log sukhe@cumin1001 START - Cookbook sre.hosts.reimage for host cp1107.eqiad.wmnet with OS bullseye [19:07:42] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin1001 for host cp1107.eqiad.wmnet with OS bullseye [19:09:10] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1224', diff saved to https://phabricator.wikimedia.org/P53237 and previous config saved to /var/cache/conftool/dbconfig/20231109-190910-arnaudb.json [19:11:55] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp1105.eqiad.wmnet with reason: host reimage [19:12:37] !log otto@deploy2002 helmfile [codfw] DONE helmfile.d/services/eventstreams-internal: apply [19:15:00] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp1105.eqiad.wmnet with reason: host reimage [19:15:11] !log sukhe@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp1107.eqiad.wmnet with OS bullseye [19:15:17] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin1001 for host cp1107.eqiad.wmnet with OS bullseye executed with errors: - cp1107 (**FAIL**... 
[19:15:36] !log sukhe@cumin1001 START - Cookbook sre.hosts.reimage for host cp1107.eqiad.wmnet with OS bullseye [19:15:43] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin1001 for host cp1107.eqiad.wmnet with OS bullseye [19:16:49] !log fnegri@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1027.eqiad.wmnet with reason: host reimage [19:17:47] !log fnegri@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1026.eqiad.wmnet with reason: host reimage [19:18:09] !log fnegri@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1057.eqiad.wmnet with reason: host reimage [19:18:14] (03PS1) 10Ebernhardson: cirrus updater: Updater container image [deployment-charts] - 10https://gerrit.wikimedia.org/r/973227 [19:18:31] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10Fabfur) [19:18:43] !log fnegri@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1060.eqiad.wmnet with reason: host reimage [19:18:52] !log fnegri@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1059.eqiad.wmnet with reason: host reimage [19:19:53] !log fnegri@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1027.eqiad.wmnet with reason: host reimage [19:20:11] !log sukhe@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp1107.eqiad.wmnet with OS bullseye [19:20:17] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin1001 for host cp1107.eqiad.wmnet with OS bullseye executed with errors: - cp1107 (**FAIL**... 
[19:20:30] !log sukhe@cumin1001 START - Cookbook sre.hosts.reimage for host cp1107.eqiad.wmnet with OS bullseye [19:20:37] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin1001 for host cp1107.eqiad.wmnet with OS bullseye [19:20:43] !log fnegri@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1051.eqiad.wmnet with reason: host reimage [19:22:05] !log fnegri@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on cloudvirt1051.eqiad.wmnet with reason: host reimage [19:22:28] !log fnegri@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1026.eqiad.wmnet with reason: host reimage [19:24:17] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1224 (T348183)', diff saved to https://phabricator.wikimedia.org/P53238 and previous config saved to /var/cache/conftool/dbconfig/20231109-192416-arnaudb.json [19:24:18] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1225.eqiad.wmnet with reason: Maintenance [19:24:21] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [19:24:31] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1225.eqiad.wmnet with reason: Maintenance [19:24:44] !log fnegri@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1059.eqiad.wmnet with reason: host reimage [19:24:55] (03PS2) 10Urbanecm: wikimaniawiki: Switch back to standard logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/973215 (https://phabricator.wikimedia.org/T350640) [19:25:17] (03CR) 10Urbanecm: [C: 03+2] wikimaniawiki: Switch back to standard logo [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/973215 (https://phabricator.wikimedia.org/T350640) (owner: 10Urbanecm) [19:25:24] !log sukhe@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp1107.eqiad.wmnet with OS bullseye [19:25:30] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin1001 for host cp1107.eqiad.wmnet with OS bullseye executed with errors: - cp1107 (**FAIL**... [19:25:40] !log shutting down et-1/1/5.2201 (sandbox1-a-codfw) interfaces on crX-codfw (T348159) [19:25:42] !log sukhe@cumin1001 START - Cookbook sre.hosts.reimage for host cp1107.eqiad.wmnet with OS bullseye [19:25:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:25:43] T348159: Migrate atlas-codfw from asw-a1-codfw to lsw1-a1-codfw - https://phabricator.wikimedia.org/T348159 [19:25:48] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin1001 for host cp1107.eqiad.wmnet with OS bullseye [19:25:58] (03Merged) 10jenkins-bot: wikimaniawiki: Switch back to standard logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/973215 (https://phabricator.wikimedia.org/T350640) (owner: 10Urbanecm) [19:26:02] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1231.eqiad.wmnet with reason: Maintenance [19:26:15] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1231.eqiad.wmnet with reason: Maintenance [19:26:22] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db1231 (T348183)', diff saved to https://phabricator.wikimedia.org/P53239 and previous config saved to /var/cache/conftool/dbconfig/20231109-192621-arnaudb.json [19:26:45] !log 
urbanecm@deploy2002 Started scap: Backport for [[gerrit:973215|wikimaniawiki: Switch back to standard logo (T350640)]] [19:26:49] T350640: Restore the standard Wikimania logo on Wikimania wiki - https://phabricator.wikimedia.org/T350640 [19:27:03] !log fnegri@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1057.eqiad.wmnet with reason: host reimage [19:28:09] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 40% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [19:28:09] !log urbanecm@deploy2002 urbanecm: Backport for [[gerrit:973215|wikimaniawiki: Switch back to standard logo (T350640)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [19:28:13] !log urbanecm@deploy2002 urbanecm: Continuing with sync [19:28:44] !log fnegri@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on cloudvirt1060.eqiad.wmnet with reason: host reimage [19:28:51] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1231 (T348183)', diff saved to https://phabricator.wikimedia.org/P53240 and previous config saved to /var/cache/conftool/dbconfig/20231109-192850-arnaudb.json [19:28:56] (03CR) 10Ebernhardson: [C: 03+2] cirrus updater: Updater container image [deployment-charts] - 10https://gerrit.wikimedia.org/r/973227 (owner: 10Ebernhardson) [19:29:41] (03Merged) 10jenkins-bot: cirrus updater: Updater container image [deployment-charts] - 10https://gerrit.wikimedia.org/r/973227 (owner: 10Ebernhardson) [19:30:12] !log volans@cumin1001 START - Cookbook sre.hosts.provision for host cp1108.mgmt.eqiad.wmnet with reboot policy GRACEFUL [19:32:32] !log sukhe@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host 
cp1107.eqiad.wmnet with OS bullseye [19:32:37] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin1001 for host cp1107.eqiad.wmnet with OS bullseye executed with errors: - cp1107 (**FAIL**... [19:32:43] PROBLEM - Check the NTP synchronisation status of timesyncd on cloudvirt1060 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.149.12: Connection reset by peer https://wikitech.wikimedia.org/wiki/NTP [19:32:49] !log sukhe@cumin1001 START - Cookbook sre.hosts.reimage for host cp1107.eqiad.wmnet with OS bullseye [19:32:51] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp1105.eqiad.wmnet with OS bullseye [19:32:51] !log volans@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp1108.mgmt.eqiad.wmnet with reboot policy GRACEFUL [19:32:56] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin1001 for host cp1107.eqiad.wmnet with OS bullseye [19:33:00] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp1105.eqiad.wmnet with OS bullseye completed: - cp1105 (**PASS**) - Remov... 
[19:33:09] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 49.7% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [19:33:29] !log ebernhardson@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [19:33:37] !log ebernhardson@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [19:33:57] !log urbanecm@deploy2002 Finished scap: Backport for [[gerrit:973215|wikimaniawiki: Switch back to standard logo (T350640)]] (duration: 07m 11s) [19:34:02] T350640: Restore the standard Wikimania logo on Wikimania wiki - https://phabricator.wikimedia.org/T350640 [19:34:57] (03PS1) 10Urbanecm: wikimaniawiki: Revert wordmark and tagline back [mediawiki-config] - 10https://gerrit.wikimedia.org/r/973228 (https://phabricator.wikimedia.org/T350640) [19:36:03] (ProbeDown) firing: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:36:29] (03PS2) 10Urbanecm: wikimaniawiki: Revert wordmark and tagline back [mediawiki-config] - 10https://gerrit.wikimedia.org/r/973228 (https://phabricator.wikimedia.org/T350640) [19:36:34] (03CR) 10Urbanecm: [C: 03+2] wikimaniawiki: Revert wordmark and tagline back [mediawiki-config] - 10https://gerrit.wikimedia.org/r/973228 (https://phabricator.wikimedia.org/T350640) (owner: 10Urbanecm) [19:36:57] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10ssingh) [19:37:15] (03Merged) 10jenkins-bot: 
wikimaniawiki: Revert wordmark and tagline back [mediawiki-config] - 10https://gerrit.wikimedia.org/r/973228 (https://phabricator.wikimedia.org/T350640) (owner: 10Urbanecm) [19:38:00] !log urbanecm@deploy2002 Started scap: Backport for [[gerrit:973228|wikimaniawiki: Revert wordmark and tagline back (T350640)]] [19:38:04] !log volans@cumin1001 START - Cookbook sre.hosts.reimage for host cp1108.eqiad.wmnet with OS bullseye [19:38:09] 10ops-eqiad, 10DC-Ops: Reimage cookbook on new eqiad hosts stuck at PXE booting - https://phabricator.wikimedia.org/T350179 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by volans@cumin1001 for host cp1108.eqiad.wmnet with OS bullseye [19:38:21] ACKNOWLEDGEMENT - Check the NTP synchronisation status of timesyncd on cloudvirt1060 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.149.12: Connection reset by peer Andrew Bogott mid-reimage https://wikitech.wikimedia.org/wiki/NTP [19:40:55] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:41:03] (ProbeDown) resolved: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:41:03] !log ebernhardson@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [19:41:10] !log ebernhardson@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [19:43:39] !log fnegri@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1027.eqiad.wmnet with OS bookworm [19:43:57] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1231', 
diff saved to https://phabricator.wikimedia.org/P53241 and previous config saved to /var/cache/conftool/dbconfig/20231109-194357-arnaudb.json [19:44:06] !log volans@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp1108.eqiad.wmnet with OS bullseye [19:44:09] 10ops-eqiad, 10DC-Ops: Reimage cookbook on new eqiad hosts stuck at PXE booting - https://phabricator.wikimedia.org/T350179 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by volans@cumin1001 for host cp1108.eqiad.wmnet with OS bullseye executed with errors: - cp1108 (**FAIL**) - Downtime... [19:44:45] PROBLEM - ensure kvm processes are running on cloudvirt1051 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:45:22] !log urbanecm@deploy2002 Finished scap: Backport for [[gerrit:973228|wikimaniawiki: Revert wordmark and tagline back (T350640)]] (duration: 07m 22s) [19:45:26] T350640: Restore the standard Wikimania logo on Wikimania wiki - https://phabricator.wikimedia.org/T350640 [19:46:18] (ProbeDown) firing: (4) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:46:24] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to stewards-users and group approver role for urbanecm - https://phabricator.wikimedia.org/T350834 (10Aklapper) Ah, thanks! 
:) [19:46:51] PROBLEM - ensure kvm processes are running on cloudvirt1060 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:46:59] !log eevans@cumin1001 START - Cookbook sre.cassandra.roll-restart for nodes matching aqs2012.codfw.wmnet: Applying JVM security upgrade - eevans@cumin1001 [19:47:27] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:47:36] (03CR) 10BCornwall: "Might wanna add some reviewers 😊" [software/purged] - 10https://gerrit.wikimedia.org/r/962670 (https://phabricator.wikimedia.org/T347839) (owner: 10Fabfur) [19:47:53] !log sukhe@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp1107.eqiad.wmnet with reason: host reimage [19:48:24] (03CR) 10BCornwall: Add version print option (031 comment) [software/purged] - 10https://gerrit.wikimedia.org/r/962670 (https://phabricator.wikimedia.org/T347839) (owner: 10Fabfur) [19:48:49] !log fnegri@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1051.eqiad.wmnet with OS bookworm [19:49:25] RECOVERY - Check the NTP synchronisation status of timesyncd on cloudvirt1060 is OK: OK: synced at Thu 2023-11-09 19:49:24 UTC. 
https://wikitech.wikimedia.org/wiki/NTP [19:49:33] (ProbeDown) resolved: (4) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:50:25] !log fnegri@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1059.eqiad.wmnet with OS bookworm [19:50:43] !log fnegri@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1026.eqiad.wmnet with OS bookworm [19:50:51] !log sukhe@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp1107.eqiad.wmnet with reason: host reimage [19:51:07] !log eevans@cumin1001 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching aqs2012.codfw.wmnet: Applying JVM security upgrade - eevans@cumin1001 [19:52:41] !log fnegri@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1057.eqiad.wmnet with OS bookworm [19:53:53] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [19:54:44] !log fnegri@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1060.eqiad.wmnet with OS bookworm [19:55:05] !log volans@cumin1001 START - Cookbook sre.hosts.reimage for host cp1108.eqiad.wmnet with OS bullseye [19:55:09] 10ops-eqiad, 10DC-Ops: Reimage cookbook on new eqiad hosts stuck at PXE booting - https://phabricator.wikimedia.org/T350179 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by volans@cumin1001 for host cp1108.eqiad.wmnet with OS bullseye [19:56:27] (03CR) 10Jdlrobson: [C: 03+1] Enable Reference Previews on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/971882 
(https://phabricator.wikimedia.org/T282999) (owner: 10WMDE-Fisch) [19:59:04] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1231', diff saved to https://phabricator.wikimedia.org/P53242 and previous config saved to /var/cache/conftool/dbconfig/20231109-195903-arnaudb.json [19:59:10] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp1110.eqiad.wmnet with OS bullseye [19:59:16] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp1110.eqiad.wmnet with OS bullseye [20:06:21] !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp1110.eqiad.wmnet with OS bullseye [20:06:27] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp1110.eqiad.wmnet with OS bullseye executed with errors: - cp1110 (**FAIL**... [20:06:43] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp1110.eqiad.wmnet with OS bullseye [20:06:49] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp1110.eqiad.wmnet with OS bullseye [20:07:40] 10ops-eqiad, 10DC-Ops: Reimage cookbook on new eqiad hosts stuck at PXE booting - https://phabricator.wikimedia.org/T350179 (10Volans) I got from traffic `cp1108` to try, I run in parallel a tcpdump on the install host (following https://wikitech.wikimedia.org/wiki/Server_Lifecycle/Reimage#DHCP_issues ) and th... 
[20:08:38] !log sukhe@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp1107.eqiad.wmnet with OS bullseye [20:08:44] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin1001 for host cp1107.eqiad.wmnet with OS bullseye completed: - cp1107 (**PASS**) - Remov... [20:10:13] !log volans@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp1108.eqiad.wmnet with reason: host reimage [20:11:33] RECOVERY - ensure kvm processes are running on cloudvirt1051 is OK: PROCS OK: 1 process with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:11:39] 10ops-eqiad, 10DC-Ops: Reimage cookbook on new eqiad hosts stuck at PXE booting - https://phabricator.wikimedia.org/T350179 (10ssingh) Thanks for the comment and debugging @Volans! Adding some more points from the Traffic team: - on some of the hosts that were failing and took repeated attempts, we verified t... 
[20:11:59] RECOVERY - ensure kvm processes are running on cloudvirt1060 is OK: PROCS OK: 1 process with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:12:05] !log cmooney@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on asw-a-codfw,ssw1-a8-codfw,ssw1-a8-codfw.mgmt with reason: Adjust vlans trunked to asw-a-codfw from ssw1-a8-codfw T347191 [20:12:09] T347191: Bring codfw row A-B EVPN switches live and make them gateway for existing Vlans - https://phabricator.wikimedia.org/T347191 [20:12:19] !log cmooney@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on asw-a-codfw,ssw1-a8-codfw,ssw1-a8-codfw.mgmt with reason: Adjust vlans trunked to asw-a-codfw from ssw1-a8-codfw T347191 [20:12:25] !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp1110.eqiad.wmnet with OS bullseye [20:12:45] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp1110.eqiad.wmnet with OS bullseye executed with errors: - cp1110 (**FAIL**... [20:13:04] 10ops-eqiad, 10DC-Ops: Reimage cookbook on new eqiad hosts stuck at PXE booting - https://phabricator.wikimedia.org/T350179 (10ssingh) Adding @cmooney / @ayounsi to this task so they can check the switch side -- thanks folks! 
[20:13:11] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp1110.eqiad.wmnet with OS bullseye [20:13:17] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp1110.eqiad.wmnet with OS bullseye [20:13:21] !log volans@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp1108.eqiad.wmnet with reason: host reimage [20:14:10] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1231 (T348183)', diff saved to https://phabricator.wikimedia.org/P53243 and previous config saved to /var/cache/conftool/dbconfig/20231109-201409-arnaudb.json [20:14:12] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance [20:14:25] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance [20:14:27] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [20:15:42] !log resetting asw-a-codfw et-2/0/52 to shift traffic away from ssw1-a8-codfw (T347191) [20:15:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:16:05] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2097.codfw.wmnet with reason: Maintenance [20:16:18] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2097.codfw.wmnet with reason: Maintenance [20:16:49] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10ssingh) [20:17:24] !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp1110.eqiad.wmnet with OS 
bullseye [20:17:29] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp1110.eqiad.wmnet with OS bullseye executed with errors: - cp1110 (**FAIL**... [20:17:33] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp1110.eqiad.wmnet with OS bullseye [20:17:39] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp1110.eqiad.wmnet with OS bullseye [20:18:11] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2117.codfw.wmnet with reason: Maintenance [20:18:24] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2117.codfw.wmnet with reason: Maintenance [20:18:31] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2117 (T348183)', diff saved to https://phabricator.wikimedia.org/P53244 and previous config saved to /var/cache/conftool/dbconfig/20231109-201830-arnaudb.json [20:22:25] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2117 (T348183)', diff saved to https://phabricator.wikimedia.org/P53245 and previous config saved to /var/cache/conftool/dbconfig/20231109-202225-arnaudb.json [20:22:30] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [20:25:59] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [20:28:16] !log brouberol@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-druid1005.eqiad.wmnet with reason: host reimage [20:28:31] !log cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update 
DNS entries for ssw1-aX-codfw xlink IPs. - cmooney@cumin1001" [20:29:21] !log cmooney@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update DNS entries for ssw1-aX-codfw xlink IPs. - cmooney@cumin1001" [20:29:21] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:32:06] !log brouberol@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-druid1005.eqiad.wmnet with reason: host reimage [20:32:17] !log volans@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp1108.eqiad.wmnet with OS bullseye [20:32:21] 10ops-eqiad, 10DC-Ops: Reimage cookbook on new eqiad hosts stuck at PXE booting - https://phabricator.wikimedia.org/T350179 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by volans@cumin1001 for host cp1108.eqiad.wmnet with OS bullseye completed: - cp1108 (**PASS**) - Removed from Puppet... [20:32:42] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp1110.eqiad.wmnet with reason: host reimage [20:35:49] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp1110.eqiad.wmnet with reason: host reimage [20:37:32] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2117', diff saved to https://phabricator.wikimedia.org/P53246 and previous config saved to /var/cache/conftool/dbconfig/20231109-203731-arnaudb.json [20:39:34] 10SRE-swift-storage, 10Epic: [Epic] Determine a strategy to store files between 5 and 100 GB - https://phabricator.wikimedia.org/T191802 (10AlexisJazz) https://archive.org/details/Scotichronicon was extracted from 4K video. The text is readable but not particularly sharp. This is the //only// digital version o... 
[20:40:04] SRE, ops-eqiad, DC-Ops, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (Volans) cp1108 completed: see T350179#9321006 [20:41:10] !log change anycast gw type to single-IP on ssw1-aX-codfw for sandbox1-a-codfw vlan (T350579) [20:41:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:41:15] T350579: Support Anycast GW on EVPN switches without unique IP - https://phabricator.wikimedia.org/T350579 [20:43:17] (PS1) Ottomata: schema service proxy - set max_requests_per_connection: 1 to avoid upstream disconnect [puppet] - https://gerrit.wikimedia.org/r/973234 (https://phabricator.wikimedia.org/T350713) [20:43:44] (CR) CI reject: [V: -1] schema service proxy - set max_requests_per_connection: 1 to avoid upstream disconnect [puppet] - https://gerrit.wikimedia.org/r/973234 (https://phabricator.wikimedia.org/T350713) (owner: Ottomata) [20:44:33] (PS2) Ottomata: schema service proxy - set max_requests_per_connection: 1 [puppet] - https://gerrit.wikimedia.org/r/973234 (https://phabricator.wikimedia.org/T350713) [20:45:00] !log cmooney@cumin1001 START - Cookbook sre.dns.wipe-cache 2.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.2.3.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [20:45:04] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) 2.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.2.3.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [20:49:24] (PS3) Ottomata: schema service proxy - set keepalive to 10s [puppet] - https://gerrit.wikimedia.org/r/973234 (https://phabricator.wikimedia.org/T350713) [20:49:59] (CR) Ottomata: [C: +2] schema service proxy - set keepalive to 10s [puppet] - https://gerrit.wikimedia.org/r/973234 (https://phabricator.wikimedia.org/T350713) (owner: Ottomata) [20:52:38] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2117', diff saved to 
https://phabricator.wikimedia.org/P53247 and previous config saved to /var/cache/conftool/dbconfig/20231109-205238-arnaudb.json [20:54:30] !log otto@deploy2002 helmfile [staging] START helmfile.d/services/eventstreams-internal: apply [20:54:42] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp1110.eqiad.wmnet with OS bullseye [20:54:45] !log brouberol@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-druid1005.eqiad.wmnet with OS bullseye [20:54:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2146 T350916', diff saved to https://phabricator.wikimedia.org/P53248 and previous config saved to /var/cache/conftool/dbconfig/20231109-205445-root.json [20:54:47] SRE, ops-eqiad, DC-Ops, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp1110.eqiad.wmnet with OS bullseye completed: - cp1110 (**PASS**) - Remov... [20:54:51] T350916: db2146 memory warning - https://phabricator.wikimedia.org/T350916 [20:55:08] !log otto@deploy2002 helmfile [staging] DONE helmfile.d/services/eventstreams-internal: apply [20:55:12] !log otto@deploy2002 helmfile [codfw] START helmfile.d/services/eventstreams-internal: apply [20:55:34] !log otto@deploy2002 helmfile [codfw] DONE helmfile.d/services/eventstreams-internal: apply [20:55:40] !log otto@deploy2002 helmfile [eqiad] START helmfile.d/services/eventstreams-internal: apply [20:56:39] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host db2146.codfw.wmnet with OS bookworm [21:00:05] brennen and TheresNoTime: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC late backport and config training deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231109T2100). 
[21:01:18] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [21:01:44] !log otto@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventstreams-internal: apply [21:02:04] !log no pathces for utc late backport & config [21:02:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:02:18] (and i can't spell today, apparently.) [21:02:50] (PS1) Marostegui: db2146: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/973236 (https://phabricator.wikimedia.org/T350916) [21:03:29] (CR) Marostegui: [C: +2] db2146: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/973236 (https://phabricator.wikimedia.org/T350916) (owner: Marostegui) [21:03:57] !log cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove old crX-codfw sandbox int IPs - cmooney@cumin1001" [21:04:45] !log cmooney@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove old crX-codfw sandbox int IPs - cmooney@cumin1001" [21:04:45] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:07:45] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2117 (T348183)', diff saved to https://phabricator.wikimedia.org/P53249 and previous config saved to /var/cache/conftool/dbconfig/20231109-210744-arnaudb.json [21:07:47] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2124.codfw.wmnet with reason: Maintenance [21:07:50] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [21:08:00] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2124.codfw.wmnet with reason: Maintenance [21:08:07] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2124 (T348183)', 
diff saved to https://phabricator.wikimedia.org/P53250 and previous config saved to /var/cache/conftool/dbconfig/20231109-210806-arnaudb.json [21:10:07] SRE, SRE-Access-Requests: Requesting access to WMF for Grace - https://phabricator.wikimedia.org/T350918 (ecarg) [21:10:27] (CR) Ebernhardson: [C: +2] cirrus updater: Re-enable the .* route for mwapi [deployment-charts] - https://gerrit.wikimedia.org/r/969209 (owner: Ebernhardson) [21:11:12] (Merged) jenkins-bot: cirrus updater: Re-enable the .* route for mwapi [deployment-charts] - https://gerrit.wikimedia.org/r/969209 (owner: Ebernhardson) [21:12:01] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2124 (T348183)', diff saved to https://phabricator.wikimedia.org/P53251 and previous config saved to /var/cache/conftool/dbconfig/20231109-211200-arnaudb.json [21:13:43] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db2146.codfw.wmnet with reason: host reimage [21:16:17] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2146.codfw.wmnet with reason: host reimage [21:18:33] !log ebernhardson@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [21:19:03] !log ebernhardson@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [21:24:12] !log bking@deploy2002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [21:24:20] !log bking@deploy2002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [21:24:20] (Abandoned) Ebernhardson: rdf-streaming-updater: Defined allowed zk clusters for staging [deployment-charts] - https://gerrit.wikimedia.org/r/972014 (owner: Ebernhardson) [21:24:26] !log bking@deploy2002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [21:27:07] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2124', diff 
saved to https://phabricator.wikimedia.org/P53252 and previous config saved to /var/cache/conftool/dbconfig/20231109-212707-arnaudb.json [21:29:45] (CR) Ebernhardson: Add alert for CirrusSearch reported memory issues (1 comment) [puppet] - https://gerrit.wikimedia.org/r/830240 (https://phabricator.wikimedia.org/T316712) (owner: Ebernhardson) [21:32:30] SRE-swift-storage, Commons, MediaWiki-File-management: Allow to store files between 4 and 5 GB - https://phabricator.wikimedia.org/T191804 (AlexisJazz) https://en.wikipedia.org/wiki/Wikipedia:Village_pump_(proposals)#RfC:_Increasing_the_maximum_size_for_uploaded_files [21:33:35] (CR) Bking: [C: +2] staging-eqiad: raise rdf-streaming-updater quota [deployment-charts] - https://gerrit.wikimedia.org/r/972483 (https://phabricator.wikimedia.org/T349095) (owner: Bking) [21:33:44] (CR) CI reject: [V: -1] staging-eqiad: raise rdf-streaming-updater quota [deployment-charts] - https://gerrit.wikimedia.org/r/972483 (https://phabricator.wikimedia.org/T349095) (owner: Bking) [21:36:03] PROBLEM - nova-compute proc minimum on cloudvirt1028 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [21:36:50] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2146.codfw.wmnet with OS bookworm [21:42:14] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2124', diff saved to https://phabricator.wikimedia.org/P53253 and previous config saved to /var/cache/conftool/dbconfig/20231109-214213-arnaudb.json [21:50:16] (PS1) Cathal Mooney: Homer YAML changes to support move of sandbox-a-codfw vlan to ssw's [homer/public] - https://gerrit.wikimedia.org/r/973239 (https://phabricator.wikimedia.org/T347191) [21:54:39] (PS2) Cathal Mooney: Homer YAML changes to support move of sandbox-a-codfw vlan to ssw's [homer/public] 
- https://gerrit.wikimedia.org/r/973239 (https://phabricator.wikimedia.org/T347191) [21:55:01] RECOVERY - nova-compute proc minimum on cloudvirt1028 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [21:56:15] (CR) Cathal Mooney: [C: +2] Homer YAML changes to support move of sandbox-a-codfw vlan to ssw's [homer/public] - https://gerrit.wikimedia.org/r/973239 (https://phabricator.wikimedia.org/T347191) (owner: Cathal Mooney) [21:56:52] (Merged) jenkins-bot: Homer YAML changes to support move of sandbox-a-codfw vlan to ssw's [homer/public] - https://gerrit.wikimedia.org/r/973239 (https://phabricator.wikimedia.org/T347191) (owner: Cathal Mooney) [21:57:20] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2124 (T348183)', diff saved to https://phabricator.wikimedia.org/P53254 and previous config saved to /var/cache/conftool/dbconfig/20231109-215719-arnaudb.json [21:57:22] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2129.codfw.wmnet with reason: Maintenance [21:57:24] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [21:57:35] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2129.codfw.wmnet with reason: Maintenance [21:57:42] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2129 (T348183)', diff saved to https://phabricator.wikimedia.org/P53255 and previous config saved to /var/cache/conftool/dbconfig/20231109-215741-arnaudb.json [21:59:22] (CR) Bking: [C: +2] staging-eqiad: raise rdf-streaming-updater quota (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/972483 (https://phabricator.wikimedia.org/T349095) (owner: Bking) [21:59:31] (CR) CI reject: [V: -1] staging-eqiad: raise 
rdf-streaming-updater quota [deployment-charts] - https://gerrit.wikimedia.org/r/972483 (https://phabricator.wikimedia.org/T349095) (owner: Bking) [22:00:10] (CR) Bking: staging-eqiad: raise rdf-streaming-updater quota [deployment-charts] - https://gerrit.wikimedia.org/r/972483 (https://phabricator.wikimedia.org/T349095) (owner: Bking) [22:01:45] (PS50) Bking: rdf-streaming-updater: update values for application mode [deployment-charts] - https://gerrit.wikimedia.org/r/967229 (https://phabricator.wikimedia.org/T349095) [22:02:38] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2129 (T348183)', diff saved to https://phabricator.wikimedia.org/P53256 and previous config saved to /var/cache/conftool/dbconfig/20231109-220238-arnaudb.json [22:02:50] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [22:05:19] (PS51) Bking: rdf-streaming-updater: update values for application mode [deployment-charts] - https://gerrit.wikimedia.org/r/967229 (https://phabricator.wikimedia.org/T349095) [22:08:35] !log bking@deploy2002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [22:08:41] !log bking@deploy2002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [22:08:45] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [22:17:21] (PS3) Bking: staging-eqiad: raise rdf-streaming-updater quota [deployment-charts] - https://gerrit.wikimedia.org/r/972483 (https://phabricator.wikimedia.org/T349095) [22:17:44] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2129', diff saved to https://phabricator.wikimedia.org/P53257 and previous config saved to /var/cache/conftool/dbconfig/20231109-221744-arnaudb.json [22:21:18] (CR) Bking: [C: +2] staging-eqiad: raise 
rdf-streaming-updater quota [deployment-charts] - https://gerrit.wikimedia.org/r/972483 (https://phabricator.wikimedia.org/T349095) (owner: Bking) [22:22:51] (CR) EoghanGaffney: [C: +1] phabricator/httpd: add support for bullseye/bookworm PHP versions [puppet] - https://gerrit.wikimedia.org/r/973213 (https://phabricator.wikimedia.org/T327068) (owner: Dzahn) [22:24:02] (Merged) jenkins-bot: staging-eqiad: raise rdf-streaming-updater quota [deployment-charts] - https://gerrit.wikimedia.org/r/972483 (https://phabricator.wikimedia.org/T349095) (owner: Bking) [22:27:25] !log bking@deploy2002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [22:27:34] !log bking@deploy2002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [22:27:44] !log bking@deploy2002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [22:28:30] !log bking@deploy2002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [22:29:02] (PS1) Bking: Revert "staging-eqiad: raise rdf-streaming-updater quota" [deployment-charts] - https://gerrit.wikimedia.org/r/972725 [22:32:51] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2129', diff saved to https://phabricator.wikimedia.org/P53258 and previous config saved to /var/cache/conftool/dbconfig/20231109-223250-arnaudb.json [22:33:25] ^^ I aborted that deploy and rolled back changes due to missing config [22:37:33] (PS1) Bking: staging-eqiad: raise rdf-streaming-updater quota [deployment-charts] - https://gerrit.wikimedia.org/r/973242 (https://phabricator.wikimedia.org/T349095) [22:42:19] SRE, Traffic, Patch-For-Review: purged issues while kafka brokers are restarted - https://phabricator.wikimedia.org/T334078 (CodeReviewBot) brett opened https://gitlab.wikimedia.org/repos/sre/purged/-/merge_requests/1 Basic retry mechanism for specific kafka errors [22:46:51] (PS2) Bking: staging-eqiad: raise rdf-streaming-updater quota [deployment-charts] - 
https://gerrit.wikimedia.org/r/973242 (https://phabricator.wikimedia.org/T349095) [22:47:57] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2129 (T348183)', diff saved to https://phabricator.wikimedia.org/P53259 and previous config saved to /var/cache/conftool/dbconfig/20231109-224757-arnaudb.json [22:47:59] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2151.codfw.wmnet with reason: Maintenance [22:48:02] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [22:48:12] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2151.codfw.wmnet with reason: Maintenance [22:48:18] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2151 (T348183)', diff saved to https://phabricator.wikimedia.org/P53260 and previous config saved to /var/cache/conftool/dbconfig/20231109-224818-arnaudb.json [22:48:26] SRE, Traffic, Patch-For-Review: Add version flag to purged - https://phabricator.wikimedia.org/T347839 (CodeReviewBot) brett opened https://gitlab.wikimedia.org/repos/sre/purged/-/merge_requests/2 Add version print option [22:52:09] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2151 (T348183)', diff saved to https://phabricator.wikimedia.org/P53261 and previous config saved to /var/cache/conftool/dbconfig/20231109-225208-arnaudb.json [22:58:53] (JobUnavailable) firing: (2) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [23:01:56] (PS1) Marostegui: Revert "db2146: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/973246 [23:02:31] (CR) Marostegui: [C: +2] Revert "db2146: Disable notifications" [puppet] - 
https://gerrit.wikimedia.org/r/973246 (owner: Marostegui) [23:03:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2146 (re)pooling @ 10%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P53262 and previous config saved to /var/cache/conftool/dbconfig/20231109-230302-root.json [23:07:15] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2151', diff saved to https://phabricator.wikimedia.org/P53263 and previous config saved to /var/cache/conftool/dbconfig/20231109-230715-arnaudb.json [23:10:02] (PS1) Cathal Mooney: Adjust homer templates to support anycast gw with single IP [homer/public] - https://gerrit.wikimedia.org/r/973267 (https://phabricator.wikimedia.org/T350579) [23:16:17] (PS2) Cathal Mooney: Adjust homer templates to support anycast gw with single IP [homer/public] - https://gerrit.wikimedia.org/r/973267 (https://phabricator.wikimedia.org/T350579) [23:18:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2146 (re)pooling @ 25%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P53264 and previous config saved to /var/cache/conftool/dbconfig/20231109-231807-root.json [23:22:22] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2151', diff saved to https://phabricator.wikimedia.org/P53265 and previous config saved to /var/cache/conftool/dbconfig/20231109-232221-arnaudb.json [23:33:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2146 (re)pooling @ 50%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P53266 and previous config saved to /var/cache/conftool/dbconfig/20231109-233312-root.json [23:37:28] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2151 (T348183)', diff saved to https://phabricator.wikimedia.org/P53267 and previous config saved to /var/cache/conftool/dbconfig/20231109-233728-arnaudb.json [23:37:31] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 
0:00:00 on db2158.codfw.wmnet with reason: Maintenance [23:37:32] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [23:37:44] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2158.codfw.wmnet with reason: Maintenance [23:37:46] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2187.codfw.wmnet with reason: Maintenance [23:38:10] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2187.codfw.wmnet with reason: Maintenance [23:38:16] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2158 (T348183)', diff saved to https://phabricator.wikimedia.org/P53268 and previous config saved to /var/cache/conftool/dbconfig/20231109-233816-arnaudb.json [23:42:07] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2158 (T348183)', diff saved to https://phabricator.wikimedia.org/P53269 and previous config saved to /var/cache/conftool/dbconfig/20231109-234206-arnaudb.json [23:48:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2146 (re)pooling @ 75%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P53270 and previous config saved to /var/cache/conftool/dbconfig/20231109-234817-root.json [23:50:45] SRE, Traffic-Icebox, User-notice: Rate limit requests in violation of User-Agent policy more aggressively - https://phabricator.wikimedia.org/T224891 (Pppery) Anything left to do here? [23:53:07] SRE, Infrastructure-Foundations, netops, Patch-For-Review: Support Anycast GW on EVPN switches without unique IP - https://phabricator.wikimedia.org/T350579 (cmooney) In terms of the config when we have 2 IPs on the interface with the VGA setup, there is some behaviour we need to be careful of.... 
[23:53:39] (PS3) Cathal Mooney: Adjust homer templates to support anycast gw with single IP [homer/public] - https://gerrit.wikimedia.org/r/973267 (https://phabricator.wikimedia.org/T350579) [23:53:53] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [23:57:13] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2158', diff saved to https://phabricator.wikimedia.org/P53271 and previous config saved to /var/cache/conftool/dbconfig/20231109-235712-arnaudb.json