[00:09:19] (PS1) Dzahn: phabricator: use systemd::sysuser to create vcs user [puppet] - https://gerrit.wikimedia.org/r/865207
[00:11:24] (PS1) Dzahn: phabricator: rm code from before system user was created with systemd [puppet] - https://gerrit.wikimedia.org/r/865208
[00:16:49] (PS2) Dzahn: phabricator: rm code from before system user was created with systemd [puppet] - https://gerrit.wikimedia.org/r/865208
[00:18:51] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kubernetes1023.eqiad.wmnet with OS bullseye
[00:18:56] SRE, ops-eqiad, DC-Ops, serviceops: Q1:rack/setup/install kubernetes102[34] - https://phabricator.wikimedia.org/T313873 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host kubernetes1023.eqiad.wmnet with OS bullseye executed with errors: - kubernetes1...
[00:21:42] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kubernetes1024.eqiad.wmnet with OS bullseye
[00:21:47] SRE, ops-eqiad, DC-Ops, serviceops: Q1:rack/setup/install kubernetes102[34] - https://phabricator.wikimedia.org/T313873 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host kubernetes1024.eqiad.wmnet with OS bullseye executed with errors: - kubernetes1...
[00:45:03] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:47:01] (CR) Jberkel: Make "make" available in all images (1 comment) [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/864828 (https://phabricator.wikimedia.org/T320343) (owner: Jberkel)
[00:50:03] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:00:35] (PS1) Arlolra: Disable wgParserEnableLegacyMediaDOM on group1 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/865214 (https://phabricator.wikimedia.org/T314318)
[01:31:49] RECOVERY - Check systemd state on gitlab1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:37:34] (CR) Cwhite: [C: +1] Enable profile::auto_restarts::service for Burrow [puppet] - https://gerrit.wikimedia.org/r/865114 (https://phabricator.wikimedia.org/T135991) (owner: Muehlenhoff)
[01:39:34] (CR) Cwhite: [C: +1] "LGTM!"
[puppet] - https://gerrit.wikimedia.org/r/865106 (https://phabricator.wikimedia.org/T301762) (owner: Filippo Giunchedi)
[01:41:01] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance elastic1083-production-search-psi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[01:41:45] (JobUnavailable) firing: (9) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:50:51] PROBLEM - puppet last run on gitlab1003 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[01:51:45] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:56:45] (JobUnavailable) firing: (11) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:06:45] (JobUnavailable) firing: (11) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:11:45] (JobUnavailable) firing: (11) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable -
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:20:05] RECOVERY - puppet last run on gitlab1003 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[02:21:45] (JobUnavailable) firing: (11) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:32:33] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[02:33:45] !log cwhite@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['logstash1011']
[02:35:29] PROBLEM - OpenSearch health check for shards on 9200 on logstash1011 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(urllib3.connection.HTTPConnection object at 0x7f1c7c36c2e8: Failed to establish a new connection: [Errno 111] Connection refused)) https://wikitech.wi
[02:35:29] org/wiki/Search%23Administration
[02:36:01] (CirrusSearchHighOldGCFrequency) resolved: Elasticsearch instance elastic1083-production-search-psi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[02:39:36] !log cwhite@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts
['logstash1011']
[02:49:19] !log cwhite@cumin2002 START - Cookbook sre.hosts.reimage for host logstash1011.eqiad.wmnet with OS bullseye
[02:57:33] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (bad URL) is CRITICAL: Test bad URL returned the unexpected status 503 (expecting: 404) https://wikitech.wikimedia.org/wiki/Citoid
[03:12:54] (NodeTextfileStale) firing: (48) Stale textfile for cp1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[03:15:26] !log cwhite@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on logstash1011.eqiad.wmnet with reason: host reimage
[03:18:34] !log cwhite@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on logstash1011.eqiad.wmnet with reason: host reimage
[03:27:57] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[03:59:32] !log cwhite@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host logstash1011.eqiad.wmnet with OS bullseye
[04:16:01] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance elastic1083-production-search-psi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[04:26:33] RECOVERY - Check systemd state on mirror1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:42:29] SRE-swift-storage, Community-Tech, MediaWiki-extensions-Phonos, MW-1.40-notes (1.40.0-wmf.6; 2022-10-17): Establish Phonos production storage requirements - https://phabricator.wikimedia.org/T320675 (dmaza) >>!
In T320675#8368902, @Eevans wrote: > TL;DR I think it's OK if we fly by the seat of ou...
[04:43:57] (PS1) PipelineBot: citoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/865198
[05:29:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:34:03] (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:43:37] PROBLEM - Host an-worker1108 is DOWN: PING CRITICAL - Packet loss = 100%
[05:46:01] (CirrusSearchHighOldGCFrequency) resolved: Elasticsearch instance elastic1083-production-search-psi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[05:47:11] (PS1) Marostegui: db1206: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/865240
[05:48:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1206 (re)pooling @ 1%: Testing new RAID controller', diff saved to https://phabricator.wikimedia.org/P42433 and previous config saved to /var/cache/conftool/dbconfig/20221207-054759-root.json
[05:48:08] (CR) Marostegui: [C: +2] db1206: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/865240 (owner: Marostegui)
[05:49:16] SRE, ops-eqiad, DBA, Phabricator, and 2 others: decommission phab1001.eqiad.wmnet - https://phabricator.wikimedia.org/T323418
(Marostegui) That's ok Daniel, I will take care of it on this task.
[05:49:39] SRE, ops-eqiad, DBA, Phabricator, and 2 others: decommission phab1001.eqiad.wmnet - https://phabricator.wikimedia.org/T323418 (Marostegui) I will merge that change and then proceed and remove grants live
[05:49:54] (CR) Marostegui: [C: +2] mariadb: remove phab1001 from production-m3 grants [puppet] - https://gerrit.wikimedia.org/r/858419 (https://phabricator.wikimedia.org/T323418) (owner: Dzahn)
[05:52:12] (PS1) Marostegui: mariadb: remove phab1001 from production-m3 grants [puppet] - https://gerrit.wikimedia.org/r/865241 (https://phabricator.wikimedia.org/T323418)
[05:52:27] (CR) Marostegui: "Daniel this required manual rebasing, so it was faster just to send a new patch: https://gerrit.wikimedia.org/r/865241" [puppet] - https://gerrit.wikimedia.org/r/858419 (https://phabricator.wikimedia.org/T323418) (owner: Dzahn)
[05:55:49] (CR) Marostegui: [C: +2] mariadb: remove phab1001 from production-m3 grants [puppet] - https://gerrit.wikimedia.org/r/865241 (https://phabricator.wikimedia.org/T323418) (owner: Marostegui)
[05:57:30] SRE, ops-eqiad, DBA, Phabricator, and 3 others: decommission phab1001.eqiad.wmnet - https://phabricator.wikimedia.org/T323418 (Marostegui) ` root@db1159.eqiad.wmnet[(none)]> select user,host from mysql.user where host like '10.64.16.8'; +----------------+------------+ | User | Host...
[05:58:04] !log Drop phab1001 grants from m3 databases T323418
[05:58:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:58:07] T323418: decommission phab1001.eqiad.wmnet - https://phabricator.wikimedia.org/T323418
[06:00:24] SRE, ops-eqiad, DBA, Phabricator, and 3 others: decommission phab1001.eqiad.wmnet - https://phabricator.wikimedia.org/T323418 (Marostegui) All done from the DBA side.
[06:03:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1206 (re)pooling @ 5%: Testing new RAID controller', diff saved to https://phabricator.wikimedia.org/P42434 and previous config saved to /var/cache/conftool/dbconfig/20221207-060305-root.json
[06:12:16] SRE, Data-Persistence, MediaWiki-extensions-SecurePoll, MW-1.40-notes (1.40.0-wmf.12; 2022-11-28), and 2 others: vote.wikimedia.org's Special:Securepoll/list/1402 takes considerably longer in codfw than in eqiad, leading to timeouts - https://phabricator.wikimedia.org/T324556 (Marostegui)
[06:18:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1206 (re)pooling @ 10%: Testing new RAID controller', diff saved to https://phabricator.wikimedia.org/P42435 and previous config saved to /var/cache/conftool/dbconfig/20221207-061810-root.json
[06:22:00] (JobUnavailable) firing: Reduced availability for job trafficserver-text in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:23:27] SRE, ops-eqiad, Data-Persistence (work done), Phabricator, and 3 others: decommission phab1001.eqiad.wmnet - https://phabricator.wikimedia.org/T323418 (Marostegui)
[06:32:33] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[06:33:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1206 (re)pooling @ 25%: Testing new RAID controller', diff saved to https://phabricator.wikimedia.org/P42436 and previous config saved to /var/cache/conftool/dbconfig/20221207-063316-root.json
[06:43:05] (PS1) Marostegui: site.pp: Clarify db1206 isn't production ready [puppet] - https://gerrit.wikimedia.org/r/865387
[06:43:57] PROBLEM - Router
interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 44, down: 2, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:44:05] (CR) Marostegui: [C: +2] site.pp: Clarify db1206 isn't production ready [puppet] - https://gerrit.wikimedia.org/r/865387 (owner: Marostegui)
[06:44:33] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:48:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1206 (re)pooling @ 50%: Testing new RAID controller', diff saved to https://phabricator.wikimedia.org/P42437 and previous config saved to /var/cache/conftool/dbconfig/20221207-064821-root.json
[06:55:45] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 142, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:59:27] SRE, Data-Persistence, MediaWiki-extensions-SecurePoll, MW-1.40-notes (1.40.0-wmf.12; 2022-11-28), and 2 others: vote.wikimedia.org's Special:Securepoll/list/1402 takes considerably longer in codfw than in eqiad, leading to timeouts - https://phabricator.wikimedia.org/T324556 (Urbanecm) Thanks fo...
[07:03:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1206 (re)pooling @ 75%: Testing new RAID controller', diff saved to https://phabricator.wikimedia.org/P42438 and previous config saved to /var/cache/conftool/dbconfig/20221207-070326-root.json
[07:03:49] SRE, LDAP-Access-Requests, Security-Team, SecTeam-Processed: Add Kelton Hurd to wmf ldap group - https://phabricator.wikimedia.org/T323941 (Marostegui) @sbassett I am not sure KHurd is the right user name, from what I can see there are two users with KHurd, there is KHurd and KHurd1, both created...
[07:12:54] (NodeTextfileStale) firing: (48) Stale textfile for cp1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[07:18:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1206 (re)pooling @ 100%: Testing new RAID controller', diff saved to https://phabricator.wikimedia.org/P42439 and previous config saved to /var/cache/conftool/dbconfig/20221207-071831-root.json
[07:34:28] SRE, LDAP-Access-Requests, Security-Team, SecTeam-Processed: Add Kelton Hurd to wmf ldap group - https://phabricator.wikimedia.org/T323941 (KHurd-WMF) Hey Marostegui, I’d be somewhat happy to explain. The first one in November was created but I had issues logging in, as at times it would show...
[07:36:24] SRE, LDAP-Access-Requests, Security-Team, SecTeam-Processed: Add Kelton Hurd to wmf ldap group - https://phabricator.wikimedia.org/T323941 (Marostegui) @KHurd-WMF Thanks for the explanation. It is probably easier if you keep `KHurd1` then as it is associated to your wmf email account already. Cou...
[07:37:27] SRE, LDAP-Access-Requests, Security-Team, SecTeam-Processed: Add Kelton Hurd to wmf ldap group - https://phabricator.wikimedia.org/T323941 (KHurd-WMF)
[07:38:24] SRE, LDAP-Access-Requests, Security-Team, SecTeam-Processed: Add Kelton Hurd to wmf ldap group - https://phabricator.wikimedia.org/T323941 (KHurd-WMF) Done. Thanks teammate, Kelton Hurd Wikimedia Foundation - Security team khurd@wikimedia.org {F35844006}
[07:42:13] SRE, LDAP-Access-Requests, Security-Team, SecTeam-Processed: Add Kelton Hurd to wmf ldap group - https://phabricator.wikimedia.org/T323941 (Marostegui) So, `check_user` looks good and KHurd1 is associated to `khurd` WMF email account now.
[07:48:44] SRE, LDAP-Access-Requests, Security-Team, SecTeam-Processed: Add Kelton Hurd to wmf ldap group - https://phabricator.wikimedia.org/T323941 (KHurd-WMF) Thank you. I appreciate you work on this.
[07:49:33] SRE, LDAP-Access-Requests, Security-Team, SecTeam-Processed: Add Kelton Hurd to wmf ldap group - https://phabricator.wikimedia.org/T323941 (KHurd-WMF) Stalled→Resolved a:KHurd-WMF
[07:51:24] SRE, LDAP-Access-Requests, Security-Team, SecTeam-Processed: Add Kelton Hurd to wmf ldap group - https://phabricator.wikimedia.org/T323941 (Marostegui) a:KHurd-WMF→None
[07:51:28] SRE, LDAP-Access-Requests, Security-Team, SecTeam-Processed: Add Kelton Hurd to wmf ldap group - https://phabricator.wikimedia.org/T323941 (Marostegui) Resolved→Open @KHurd-WMF this is not yet done - I was just verifying it is now fine and also added you to the Phabricator group wmf-nda...
[08:00:05] Amir1 and Urbanecm: It is that lovely time of the day again! You are hereby commanded to deploy UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221207T0800).
[08:00:05] No Gerrit patches in the queue for this window AFAICS.
[08:00:29] indeed
[08:03:24] ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T321572 (Jclark-ctr) @ayounsi Are you available to look at this today?
[08:19:03] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:19:32] SRE, Cloud-Services, observability, Sustainability (Incident Followup), and 2 others: Switch rsyslog from gtls to ossl - https://phabricator.wikimedia.org/T324623 (MoritzMuehlenhoff)
[08:23:31] ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T321572 (ayounsi) I noticed that this interface is on FPC4 and {T304712} is about moving links away from FPC4, so better to move it while replacing the optic (and cleaning the patch). We can for example move it to xe-3/2/2. Ping me a bi...
[08:24:03] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:35:41] (CR) Muehlenhoff: [C: +1] "Looks good!" [puppet] - https://gerrit.wikimedia.org/r/865177 (https://phabricator.wikimedia.org/T324057) (owner: JHathaway)
[08:38:24] (CR) Giuseppe Lavagetto: [C: +1] C:vopsbot: Notify service on config change [puppet] - https://gerrit.wikimedia.org/r/860625 (owner: Clément Goubert)
[08:40:06] SRE, Infrastructure-Foundations, netops: cr2-esams:FPC0 Parity error - https://phabricator.wikimedia.org/T318783 (ayounsi) JTAC case 2022-1207-600204 opened asking for an RMA as it's the 2nd time the issue happens.
[08:40:42] SRE, Data-Engineering, Shared-Data-Infrastructure: geoip_update_main failure on puppetmaster1001 - https://phabricator.wikimedia.org/T324548 (Clement_Goubert) At your service o>
[08:44:02] (CR) Giuseppe Lavagetto: [C: +1] "The image seems ok to me - just remember to add the user mapping in puppet too before building/publishing the image." [docker-images/production-images] - https://gerrit.wikimedia.org/r/857672 (https://phabricator.wikimedia.org/T320552) (owner: Clément Goubert)
[08:44:35] (CR) Giuseppe Lavagetto: [C: +1] "sigh, puppet." [puppet] - https://gerrit.wikimedia.org/r/864662 (https://phabricator.wikimedia.org/T324437) (owner: Clément Goubert)
[08:48:46] (CR) Filippo Giunchedi: [C: +1] Enable profile::auto_restarts::service for Burrow [puppet] - https://gerrit.wikimedia.org/r/865114 (https://phabricator.wikimedia.org/T135991) (owner: Muehlenhoff)
[08:49:32] (CR) Muehlenhoff: [C: +2] Enable profile::auto_restarts::service for Burrow [puppet] - https://gerrit.wikimedia.org/r/865114 (https://phabricator.wikimedia.org/T135991) (owner: Muehlenhoff)
[08:50:20] (CR) Clément Goubert: [V: +1 C: +2] C:vopsbot: Notify service on config change [puppet] - https://gerrit.wikimedia.org/r/860625 (owner: Clément Goubert)
[08:51:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:51:44] (CR) Volans: [C: -1] "Thanks for the effort! It's a good start. I did a first pass and left few comments. Feel free to ping me if you have any questions."
[cookbooks] - https://gerrit.wikimedia.org/r/865057 (https://phabricator.wikimedia.org/T306661) (owner: Slyngshede)
[08:53:31] (CR) Clément Goubert: [V: +1 C: +2] C:systemd::syslog: Do not filebucket logfiles [puppet] - https://gerrit.wikimedia.org/r/864662 (https://phabricator.wikimedia.org/T324437) (owner: Clément Goubert)
[08:56:03] (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:56:14] SRE, Cloud-Services, observability, Sustainability (Incident Followup), and 2 others: Switch rsyslog from gtls to ossl - https://phabricator.wikimedia.org/T324623 (fgiunchedi) I'm in general favor of switching to openssl for rsyslog (and thank you for the deep dive investigation!), since in produ...
[08:59:47] (PS2) JMeybohm: KubernetesAPILatency: Remove special handling of LIST secret requests [alerts] - https://gerrit.wikimedia.org/r/864760 (https://phabricator.wikimedia.org/T323706)
[09:00:59] (CR) JMeybohm: [C: +2] helm-state-metrics: Update resources for v0.2.0 [deployment-charts] - https://gerrit.wikimedia.org/r/864759 (https://phabricator.wikimedia.org/T323706) (owner: JMeybohm)
[09:01:36] (PS1) Jgiannelos: beta-cluster: Fix restbase mathoid URI [puppet] - https://gerrit.wikimedia.org/r/865578 (https://phabricator.wikimedia.org/T208758)
[09:02:18] SRE, ops-eqsin, DC-Ops, Traffic: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (ayounsi) FYI, there are outstanding Homer diffs for asw1-eqsin: `lang=diff [edit interfaces] - ge-0/0/16 { - description DISABLED; - disable; - } [edit interfaces xe-0...
[09:02:25] (CR) CI reject: [V: -1] beta-cluster: Fix restbase mathoid URI [puppet] - https://gerrit.wikimedia.org/r/865578 (https://phabricator.wikimedia.org/T208758) (owner: Jgiannelos)
[09:03:07] (PS2) Jgiannelos: beta-cluster: Fix restbase mathoid URI [puppet] - https://gerrit.wikimedia.org/r/865578 (https://phabricator.wikimedia.org/T208758)
[09:05:40] (Merged) jenkins-bot: helm-state-metrics: Update resources for v0.2.0 [deployment-charts] - https://gerrit.wikimedia.org/r/864759 (https://phabricator.wikimedia.org/T323706) (owner: JMeybohm)
[09:10:23] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 42
[09:10:35] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-worker1108.eqiad.wmnet
[09:11:08] !log btullis@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host an-worker1108.eqiad.wmnet
[09:12:30] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 42
[09:13:28] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 397715
[09:14:17] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 397715
[09:14:51] (CR) Physikerwelt: [C: +1] "If you have shell access you can test it with a simple curl, before merging."
[puppet] - https://gerrit.wikimedia.org/r/865578 (https://phabricator.wikimedia.org/T208758) (owner: Jgiannelos)
[09:14:59] (PS3) Filippo Giunchedi: base: remove support for plaintext remote syslog [puppet] - https://gerrit.wikimedia.org/r/865106 (https://phabricator.wikimedia.org/T301762)
[09:17:08] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 395570
[09:17:16] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 395570
[09:18:59] (CR) JMeybohm: [C: +2] KubernetesAPILatency: Remove special handling of LIST secret requests [alerts] - https://gerrit.wikimedia.org/r/864760 (https://phabricator.wikimedia.org/T323706) (owner: JMeybohm)
[09:19:07] (CR) Filippo Giunchedi: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38610/console" [puppet] - https://gerrit.wikimedia.org/r/865106 (https://phabricator.wikimedia.org/T301762) (owner: Filippo Giunchedi)
[09:19:43] RECOVERY - Host an-worker1108 is UP: PING OK - Packet loss = 0%, RTA = 0.21 ms
[09:19:52] SRE, Data-Persistence, MediaWiki-extensions-SecurePoll, MW-1.40-notes (1.40.0-wmf.12; 2022-11-28), and 2 others: vote.wikimedia.org's Special:Securepoll/list/1402 takes considerably longer in codfw than in eqiad, leading to timeouts - https://phabricator.wikimedia.org/T324556 (Reedy) It's probabl...
[09:20:36] (Merged) jenkins-bot: KubernetesAPILatency: Remove special handling of LIST secret requests [alerts] - https://gerrit.wikimedia.org/r/864760 (https://phabricator.wikimedia.org/T323706) (owner: JMeybohm)
[09:22:37] (CR) Filippo Giunchedi: [V: +1 C: +2] base: remove support for plaintext remote syslog [puppet] - https://gerrit.wikimedia.org/r/865106 (https://phabricator.wikimedia.org/T301762) (owner: Filippo Giunchedi)
[09:23:43] !log jiji@deploy1002 backport aborted: (duration: 00m 18s)
[09:23:53] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 167, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:24:04] (CR) TrainBranchBot: [C: +2] "Approved by jiji@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/865117 (https://phabricator.wikimedia.org/T267581) (owner: Effie Mouzeli)
[09:24:48] (Merged) jenkins-bot: ProductionServices: Use redis_misc servers for LockManager (1/6) [mediawiki-config] - https://gerrit.wikimedia.org/r/865117 (https://phabricator.wikimedia.org/T267581) (owner: Effie Mouzeli)
[09:25:17] !log jiji@deploy1002 Started scap: Backport for [[gerrit:865117|ProductionServices: Use redis_misc servers for LockManager (1/6) (T267581)]]
[09:25:20] T267581: Phase out "redis_sessions" cluster and away from memcached cluster - https://phabricator.wikimedia.org/T267581
[09:26:10] (CR) Jgiannelos: beta-cluster: Fix restbase mathoid URI (1 comment) [puppet] - https://gerrit.wikimedia.org/r/865578 (https://phabricator.wikimedia.org/T208758) (owner: Jgiannelos)
[09:27:18] (PS3) Jgiannelos: beta-cluster: Fix restbase mathoid URI [puppet] - https://gerrit.wikimedia.org/r/865578 (https://phabricator.wikimedia.org/T208758)
[09:27:19] !log jiji@deploy1002 jiji and jiji: Backport for [[gerrit:865117|ProductionServices: Use redis_misc servers for LockManager (1/6) (T267581)]] synced to the testservers: mwdebug1002.eqiad.wmnet,
mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet
[09:31:54] (CR) Physikerwelt: beta-cluster: Fix restbase mathoid URI (1 comment) [puppet] - https://gerrit.wikimedia.org/r/865578 (https://phabricator.wikimedia.org/T208758) (owner: Jgiannelos)
[09:33:31] (PS3) Volans: cluster::cloud_management: create new role [puppet] - https://gerrit.wikimedia.org/r/865049 (https://phabricator.wikimedia.org/T319401)
[09:33:33] (PS2) Volans: cloudcumin: setup the 2 new VMs [puppet] - https://gerrit.wikimedia.org/r/865116 (https://phabricator.wikimedia.org/T319401)
[09:34:25] !log jiji@deploy1002 Finished scap: Backport for [[gerrit:865117|ProductionServices: Use redis_misc servers for LockManager (1/6) (T267581)]] (duration: 09m 08s)
[09:34:28] T267581: Phase out "redis_sessions" cluster and away from memcached cluster - https://phabricator.wikimedia.org/T267581
[09:34:47] PROBLEM - Host contint1001 is DOWN: PING CRITICAL - Packet loss = 100%
[09:35:29] !log jayme@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
[09:35:50] !log jayme@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
[09:35:57] !log jayme@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'.
[09:36:14] !log jayme@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'.
[09:36:21] !log jayme@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'.
[09:36:37] !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'.
[09:36:44] !log jayme@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'.
[09:37:02] !log jayme@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'.
[09:39:21] (CR) Clément Goubert: [V: +1] "This change is ready for review."
[puppet] - 10https://gerrit.wikimedia.org/r/865580 (https://phabricator.wikimedia.org/T324649) (owner: 10Clément Goubert) [09:40:27] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 7568 [09:41:10] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 7568 [09:41:29] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 45430 [09:41:49] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jiji@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/865118 (https://phabricator.wikimedia.org/T267581) (owner: 10Effie Mouzeli) [09:41:59] (03PS5) 10Effie Mouzeli: ProductionServices: Use redis_misc servers for LockManager (2/6) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/865118 (https://phabricator.wikimedia.org/T267581) [09:42:40] (03PS4) 10Effie Mouzeli: ProductionServices: Use redis_misc servers for LockManager (3/6) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/865119 (https://phabricator.wikimedia.org/T267581) [09:42:44] (03PS4) 10Effie Mouzeli: ProductionServices: Use redis_misc servers for LockManager (4/6) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/865121 (https://phabricator.wikimedia.org/T267581) [09:42:46] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 45430 [09:42:51] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 31800 [09:44:04] !log ayounsi@cumin1001 END (FAIL) - Cookbook sre.network.peering (exit_code=99) with action 'email' for AS: 31800 [09:44:34] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 31800 [09:45:03] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 31800 [09:46:03] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 32098 [09:46:23] 
(03PS5) 10Effie Mouzeli: ProductionServices: Use redis_misc servers for LockManager (4/6) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/865121 (https://phabricator.wikimedia.org/T267581) [09:47:23] (03CR) 10TrainBranchBot: "Approved by jiji@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/865118 (https://phabricator.wikimedia.org/T267581) (owner: 10Effie Mouzeli) [09:49:45] RECOVERY - BGP status on cr4-ulsfo is OK: BGP OK - up: 100, down: 2, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:50:23] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 32098 [09:51:01] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 16276 [09:51:08] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM, thank you for tacking this!" [puppet] - 10https://gerrit.wikimedia.org/r/865580 (https://phabricator.wikimedia.org/T324649) (owner: 10Clément Goubert) [09:52:14] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 16276 [09:52:23] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 138064 [09:53:21] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 138064 [09:54:17] (03PS1) 10KarlBeecken: mobileapps: bump image to latest [deployment-charts] - 10https://gerrit.wikimedia.org/r/865583 [09:55:44] (03PS4) 10Volans: cluster::cloud_management: create new role [puppet] - 10https://gerrit.wikimedia.org/r/865049 (https://phabricator.wikimedia.org/T319401) [09:59:04] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 13150 [09:59:37] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 13150 [09:59:43] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 8932 
[10:00:01] !log ayounsi@cumin1001 END (FAIL) - Cookbook sre.network.peering (exit_code=99) with action 'email' for AS: 8932 [10:00:10] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 35320 [10:00:53] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 35320 [10:02:33] RECOVERY - Host contint1001 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms [10:04:21] (03CR) 10CI reject: [V: 04-1] cloudcumin: setup the 2 new VMs [puppet] - 10https://gerrit.wikimedia.org/r/865116 (https://phabricator.wikimedia.org/T319401) (owner: 10Volans) [10:04:41] (03CR) 10Effie Mouzeli: [C: 03+2] ProductionServices: Use redis_misc servers for LockManager (2/6) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/865118 (https://phabricator.wikimedia.org/T267581) (owner: 10Effie Mouzeli) [10:04:56] (03CR) 10Btullis: [C: 03+1] "Thanks." [puppet] - 10https://gerrit.wikimedia.org/r/865043 (https://phabricator.wikimedia.org/T323783) (owner: 10Stevemunene) [10:05:10] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 16276 [10:05:13] (03CR) 10Volans: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/865049 (https://phabricator.wikimedia.org/T319401) (owner: 10Volans) [10:05:24] (03CR) 10CI reject: [V: 04-1] ProductionServices: Use redis_misc servers for LockManager (2/6) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/865118 (https://phabricator.wikimedia.org/T267581) (owner: 10Effie Mouzeli) [10:06:18] (03PS2) 10Stevemunene: Add an-presto1008-1015 to presto cluster [puppet] - 10https://gerrit.wikimedia.org/r/865043 (https://phabricator.wikimedia.org/T323783) [10:07:03] !log jiji@deploy1002 Started scap: Backport for [[gerrit:865118|ProductionServices: Use redis_misc servers for LockManager (2/6) (T267581)]] [10:07:06] T267581: Phase out "redis_sessions" cluster and away from memcached cluster - https://phabricator.wikimedia.org/T267581 [10:07:21] !log 
isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . [10:09:04] !log jiji@deploy1002 jiji and jiji: Backport for [[gerrit:865118|ProductionServices: Use redis_misc servers for LockManager (2/6) (T267581)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet [10:09:37] !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . [10:11:42] 10SRE-tools, 10Spicerack: Spicerack: don't IRC log start/stop of cookbook - https://phabricator.wikimedia.org/T324655 (10ayounsi) [10:12:18] 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack: Spicerack: don't IRC log start/stop of cookbook - https://phabricator.wikimedia.org/T324655 (10ayounsi) [10:12:40] !log ayounsi@cumin1001 END (FAIL) - Cookbook sre.network.peering (exit_code=99) with action 'email' for AS: 16276 [10:12:53] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 714 [10:13:18] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, one question inline" [puppet] - 10https://gerrit.wikimedia.org/r/865049 (https://phabricator.wikimedia.org/T319401) (owner: 10Volans) [10:14:18] (03CR) 10KarlBeecken: [C: 03+1] mobileapps: bump image to latest [deployment-charts] - 10https://gerrit.wikimedia.org/r/865583 (owner: 10KarlBeecken) [10:15:26] (03CR) 10Hnowlan: [C: 03+2] beta-cluster: Fix restbase mathoid URI [puppet] - 10https://gerrit.wikimedia.org/r/865578 (https://phabricator.wikimedia.org/T208758) (owner: 10Jgiannelos) [10:15:57] (03CR) 10Stevemunene: [C: 03+2] Add an-presto1008-1015 to presto cluster [puppet] - 10https://gerrit.wikimedia.org/r/865043 (https://phabricator.wikimedia.org/T323783) (owner: 10Stevemunene) [10:16:10] (03CR) 10Volans: "reply inline" [puppet] - 10https://gerrit.wikimedia.org/r/865049 (https://phabricator.wikimedia.org/T319401) (owner: 10Volans) 
[10:17:04] !log ayounsi@cumin1001 END (FAIL) - Cookbook sre.network.peering (exit_code=99) with action 'email' for AS: 714 [10:17:07] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 40217 [10:17:31] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 40217 [10:17:52] !log jiji@deploy1002 Finished scap: Backport for [[gerrit:865118|ProductionServices: Use redis_misc servers for LockManager (2/6) (T267581)]] (duration: 10m 48s) [10:17:55] T267581: Phase out "redis_sessions" cluster and away from memcached cluster - https://phabricator.wikimedia.org/T267581 [10:22:00] (JobUnavailable) firing: Reduced availability for job trafficserver-text in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:23:26] (03PS1) 10Hnowlan: restbase: fix deployment-prep services [puppet] - 10https://gerrit.wikimedia.org/r/865586 [10:23:40] (03PS1) 10Ilias Sarantopoulos: ml-services: decrease container memory limits to march constraints [deployment-charts] - 10https://gerrit.wikimedia.org/r/865587 (https://phabricator.wikimedia.org/T323624) [10:23:48] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jiji@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/865119 (https://phabricator.wikimedia.org/T267581) (owner: 10Effie Mouzeli) [10:24:05] (03CR) 10Muehlenhoff: [C: 03+1] cluster::cloud_management: create new role (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/865049 (https://phabricator.wikimedia.org/T319401) (owner: 10Volans) [10:24:35] (03CR) 10CI reject: [V: 04-1] restbase: fix deployment-prep services [puppet] - 10https://gerrit.wikimedia.org/r/865586 (owner: 10Hnowlan) [10:24:52] (03Merged) 10jenkins-bot: ProductionServices: Use redis_misc servers for LockManager (3/6) [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/865119 (https://phabricator.wikimedia.org/T267581) (owner: 10Effie Mouzeli) [10:25:15] !log jiji@deploy1002 Started scap: Backport for [[gerrit:865119|ProductionServices: Use redis_misc servers for LockManager (3/6) (T267581)]] [10:25:19] T267581: Phase out "redis_sessions" cluster and away from memcached cluster - https://phabricator.wikimedia.org/T267581 [10:26:01] !log rebooted contint1001.eqiad.wmnet [10:26:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:27:12] !log jiji@deploy1002 jiji and jiji: Backport for [[gerrit:865119|ProductionServices: Use redis_misc servers for LockManager (3/6) (T267581)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet [10:29:34] (03PS2) 10Hnowlan: restbase: fix deployment-prep services [puppet] - 10https://gerrit.wikimedia.org/r/865586 [10:30:15] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:32:21] (03PS5) 10Volans: cluster::cloud_management: create new role [puppet] - 10https://gerrit.wikimedia.org/r/865049 (https://phabricator.wikimedia.org/T319401) [10:32:23] (03PS3) 10Volans: cloudcumin: setup the 2 new VMs [puppet] - 10https://gerrit.wikimedia.org/r/865116 (https://phabricator.wikimedia.org/T319401) [10:32:33] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [10:32:42] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/865049 (https://phabricator.wikimedia.org/T319401) (owner: 10Volans) [10:33:21] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 
10https://gerrit.wikimedia.org/r/865116 (https://phabricator.wikimedia.org/T319401) (owner: 10Volans) [10:33:48] (03PS1) 10JMeybohm: kubertenes: Fix naming typo [labs/private] - 10https://gerrit.wikimedia.org/r/865588 [10:33:56] (03CR) 10CI reject: [V: 04-1] cloudcumin: setup the 2 new VMs [puppet] - 10https://gerrit.wikimedia.org/r/865116 (https://phabricator.wikimedia.org/T319401) (owner: 10Volans) [10:34:22] (03PS2) 10JMeybohm: kubertenes: Fix naming typo [labs/private] - 10https://gerrit.wikimedia.org/r/865588 [10:35:44] !log jiji@deploy1002 Finished scap: Backport for [[gerrit:865119|ProductionServices: Use redis_misc servers for LockManager (3/6) (T267581)]] (duration: 10m 29s) [10:35:48] T267581: Phase out "redis_sessions" cluster and away from memcached cluster - https://phabricator.wikimedia.org/T267581 [10:36:41] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:37:18] (03CR) 10JMeybohm: [V: 03+2 C: 03+2] kubertenes: Fix naming typo [labs/private] - 10https://gerrit.wikimedia.org/r/865588 (owner: 10JMeybohm) [10:37:25] (03PS1) 10Ilias Sarantopoulos: ml-services: increase limitrange for containers/pods [deployment-charts] - 10https://gerrit.wikimedia.org/r/865589 (https://phabricator.wikimedia.org/T323624) [10:38:26] (03PS2) 10Ilias Sarantopoulos: ml-services: increase limitrange for containers/pods [deployment-charts] - 10https://gerrit.wikimedia.org/r/865589 (https://phabricator.wikimedia.org/T323624) [10:38:37] (03Abandoned) 10Ilias Sarantopoulos: ml-services: decrease container memory limits to march constraints [deployment-charts] - 10https://gerrit.wikimedia.org/r/865587 (https://phabricator.wikimedia.org/T323624) (owner: 10Ilias Sarantopoulos) [10:39:33] 10SRE, 10Data-Persistence, 10MediaWiki-extensions-SecurePoll, 10MW-1.40-notes (1.40.0-wmf.12; 2022-11-28), and 2 others: 
vote.wikimedia.org's Special:Securepoll/list/1402 takes considerably longer in codfw than in eqiad, leading to timeouts - https://phabricator.wikimedia.org/T324556 (10Ladsgroup) yeah, it... [10:43:02] (03CR) 10Volans: [C: 03+2] cluster::cloud_management: create new role (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/865049 (https://phabricator.wikimedia.org/T319401) (owner: 10Volans) [10:43:06] (03CR) 10Btullis: [C: 03+1] yarn: Enable profile::auto_restarts::service for Envoy [puppet] - 10https://gerrit.wikimedia.org/r/862886 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [10:43:10] (03CR) 10Btullis: [C: 03+1] hue: Enable profile::auto_restarts::service for Envoy [puppet] - 10https://gerrit.wikimedia.org/r/862885 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [10:43:14] (03CR) 10Btullis: [C: 03+1] Enable profile::auto_restarts::service for Superset [puppet] - 10https://gerrit.wikimedia.org/r/862933 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [10:43:39] (03CR) 10Btullis: [C: 03+1] analytics::refinery: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/858604 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [10:46:09] 10SRE, 10Infrastructure-Foundations, 10vm-requests: eqiad: (1) VM request for cloudcumin1001 - https://phabricator.wikimedia.org/T323516 (10Volans) [10:46:16] 10SRE, 10Infrastructure-Foundations, 10vm-requests: codfw: (1) VM request for cumincloud2001 - https://phabricator.wikimedia.org/T323518 (10Volans) [10:46:30] 10SRE, 10Continuous-Integration-Infrastructure, 10serviceops-collab: contint2002 service implementation tracking - https://phabricator.wikimedia.org/T324659 (10hashar) [10:48:29] 10SRE, 10ops-codfw, 10Continuous-Integration-Infrastructure, 10serviceops-radar, 10Release-Engineering-Team (Radar): contint2001.mgmt disappeared from Icinga - https://phabricator.wikimedia.org/T298861 (10hashar) [10:49:15] 10SRE, 
10Continuous-Integration-Infrastructure, 10serviceops-collab: contint2002 service implementation tracking - https://phabricator.wikimedia.org/T324659 (10hashar) [10:49:17] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops-collab: Q1:rack/setup/install contint1002 - https://phabricator.wikimedia.org/T313830 (10hashar) [10:50:00] !log volans@cumin1001 START - Cookbook sre.ganeti.makevm for new host cloudcumin1001.eqiad.wmnet [10:50:01] !log volans@cumin1001 START - Cookbook sre.dns.netbox [10:50:35] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 143, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:50:59] 10SRE, 10Continuous-Integration-Infrastructure, 10serviceops-collab: contint1002 service implementation tracking - https://phabricator.wikimedia.org/T313832 (10hashar) [10:51:02] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops-collab: Q1:rack/setup/install contint1002 - https://phabricator.wikimedia.org/T313830 (10hashar) [10:51:19] 10SRE, 10Continuous-Integration-Infrastructure, 10serviceops-collab: contint1002 service implementation tracking - https://phabricator.wikimedia.org/T313832 (10hashar) [10:51:23] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops-collab: Q1:rack/setup/install contint1002 - https://phabricator.wikimedia.org/T313830 (10hashar) [10:51:42] sorry for the spam [10:51:45] (03PS1) 10JMeybohm: pki: Add intermediates for wikikube and wikikube staging [puppet] - 10https://gerrit.wikimedia.org/r/865591 [10:51:47] (03PS1) 10JMeybohm: k8s: Add support for PKI with k8s >= 1.23 [puppet] - 10https://gerrit.wikimedia.org/r/865592 (https://phabricator.wikimedia.org/T307943) [10:52:05] !log volans@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM cloudcumin1001.eqiad.wmnet - volans@cumin1001" [10:53:06] !log volans@cumin1001 END (PASS) - Cookbook 
sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM cloudcumin1001.eqiad.wmnet - volans@cumin1001" [10:53:06] !log volans@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:53:06] !log volans@cumin1001 START - Cookbook sre.dns.wipe-cache cloudcumin1001.eqiad.wmnet on all recursors [10:53:09] !log volans@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcumin1001.eqiad.wmnet on all recursors [10:53:58] 10SRE, 10Continuous-Integration-Infrastructure, 10serviceops-collab: contint2002 service implementation tracking - https://phabricator.wikimedia.org/T324659 (10hashar) [10:54:02] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops-collab: Q1:rack/setup/install contint1002 - https://phabricator.wikimedia.org/T313830 (10hashar) [10:55:07] (03CR) 10CI reject: [V: 04-1] k8s: Add support for PKI with k8s >= 1.23 [puppet] - 10https://gerrit.wikimedia.org/r/865592 (https://phabricator.wikimedia.org/T307943) (owner: 10JMeybohm) [10:56:09] (03PS1) 10Marostegui: change_echo_unread_wikis_T255174.py: New schema change [software/schema-changes] - 10https://gerrit.wikimedia.org/r/865593 (https://phabricator.wikimedia.org/T255174) [10:58:18] !log volans@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host cloudcumin1001.eqiad.wmnet [10:58:26] 10SRE, 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team, 10serviceops-collab: contint1002 service implementation tracking - https://phabricator.wikimedia.org/T313832 (10hashar) contint1001 keeps crashing due to a faulty memory stick. It happened on October 31st ( T294276#8357385 ) and ag... 
[10:58:48] (03CR) 10Ladsgroup: [C: 04-1] change_echo_unread_wikis_T255174.py: New schema change (031 comment) [software/schema-changes] - 10https://gerrit.wikimedia.org/r/865593 (https://phabricator.wikimedia.org/T255174) (owner: 10Marostegui) [10:59:28] 10SRE, 10Infrastructure-Foundations, 10vm-requests: eqiad: (1) VM request for cloudcumin1001 - https://phabricator.wikimedia.org/T323516 (10Volans) VM successfully created running: ` sudo cookbook sre.ganeti.makevm --cluster eqiad --group D cloudcumin1001 ` [10:59:42] (03PS2) 10Marostegui: change_echo_unread_wikis_T255174.py: New schema change [software/schema-changes] - 10https://gerrit.wikimedia.org/r/865593 (https://phabricator.wikimedia.org/T255174) [10:59:46] (03CR) 10Marostegui: change_echo_unread_wikis_T255174.py: New schema change (031 comment) [software/schema-changes] - 10https://gerrit.wikimedia.org/r/865593 (https://phabricator.wikimedia.org/T255174) (owner: 10Marostegui) [11:01:23] !log volans@cumin2002 START - Cookbook sre.ganeti.makevm for new host cloudcumin2001.codfw.wmnet [11:01:24] !log volans@cumin2002 START - Cookbook sre.dns.netbox [11:02:41] (03CR) 10Ladsgroup: [C: 03+1] change_echo_unread_wikis_T255174.py: New schema change (031 comment) [software/schema-changes] - 10https://gerrit.wikimedia.org/r/865593 (https://phabricator.wikimedia.org/T255174) (owner: 10Marostegui) [11:02:49] (03CR) 10Marostegui: [C: 03+2] change_echo_unread_wikis_T255174.py: New schema change [software/schema-changes] - 10https://gerrit.wikimedia.org/r/865593 (https://phabricator.wikimedia.org/T255174) (owner: 10Marostegui) [11:03:15] !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/thumbor: apply [11:03:41] !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/thumbor: apply [11:05:09] 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack: Spicerack: don't IRC log start/stop of cookbook - https://phabricator.wikimedia.org/T324655 (10Volans) For the latter part, you can move all the RO 
pre-requisite checks in the cookbook's __init__ that is run before the START !log, so that the failure wi... [11:05:16] !log volans@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM cloudcumin2001.codfw.wmnet - volans@cumin2002" [11:05:32] (03PS1) 10Hnowlan: thumbor: move replica definition to per-DC [deployment-charts] - 10https://gerrit.wikimedia.org/r/865595 [11:06:19] !log volans@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM cloudcumin2001.codfw.wmnet - volans@cumin2002" [11:06:19] !log volans@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:06:19] !log volans@cumin2002 START - Cookbook sre.dns.wipe-cache cloudcumin2001.codfw.wmnet on all recursors [11:06:22] !log volans@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcumin2001.codfw.wmnet on all recursors [11:07:00] (03PS3) 10Clément Goubert: P:mediawiki::php:monitoring: Longer opcache delay [puppet] - 10https://gerrit.wikimedia.org/r/865580 (https://phabricator.wikimedia.org/T324649) [11:07:41] (03CR) 10Jbond: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/863381 (owner: 10Volans) [11:08:40] (03PS2) 10Volans: sre.hosts.reimage: call the Hiera cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/863381 [11:09:10] (03CR) 10Clément Goubert: [V: 03+1] "PCC SUCCESS (NOOP 1 DIFF 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38613/console" [puppet] - 10https://gerrit.wikimedia.org/r/865580 (https://phabricator.wikimedia.org/T324649) (owner: 10Clément Goubert) [11:11:37] !log volans@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host cloudcumin2001.codfw.wmnet [11:11:55] (03CR) 10Klausman: [C: 03+1] Upgrade knative to 1.7.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/861395 
(https://phabricator.wikimedia.org/T323793) (owner: 10Elukey) [11:12:31] (03PS4) 10Volans: cloudcumin: setup the 2 new VMs [puppet] - 10https://gerrit.wikimedia.org/r/865116 (https://phabricator.wikimedia.org/T319401) [11:12:54] (NodeTextfileStale) firing: (48) Stale textfile for cp1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [11:13:37] 10SRE, 10Infrastructure-Foundations, 10vm-requests: codfw: (1) VM request for cumincloud2001 - https://phabricator.wikimedia.org/T323518 (10Volans) VM created with: ` sudo cookbook sre.ganeti.makevm --cluster codfw --group C cloudcumin2001 ` [11:13:57] (03CR) 10Volans: [C: 03+2] sre.hosts.reimage: call the Hiera cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/863381 (owner: 10Volans) [11:14:20] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10serviceops-collab, 10serviceops-radar: Fix UIDs for deployment server users - https://phabricator.wikimedia.org/T163667 (10jbond) >mwdeploy has uid/gid 499 in prod hosts Just wanted to note that this is not quote the case. On most hosts the uid is 499 how... 
[11:15:34] (03Merged) 10jenkins-bot: sre.hosts.reimage: call the Hiera cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/863381 (owner: 10Volans) [11:18:44] (03PS3) 10Ilias Sarantopoulos: ml-services: increase limitrange for containers/pods in ml-staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/865589 (https://phabricator.wikimedia.org/T323624) [11:19:42] (03PS4) 10Ilias Sarantopoulos: ml-services: increase limitrange for containers/pods in ml-staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/865589 (https://phabricator.wikimedia.org/T323624) [11:19:58] (03CR) 10Volans: [C: 03+2] cloudcumin: setup the 2 new VMs [puppet] - 10https://gerrit.wikimedia.org/r/865116 (https://phabricator.wikimedia.org/T319401) (owner: 10Volans) [11:22:05] (03CR) 10Jbond: [C: 04-1] "lgtm apart from minor typo, -1 is for the missing $" [puppet] - 10https://gerrit.wikimedia.org/r/864729 (owner: 10Volans) [11:24:08] (03CR) 10Jbond: [C: 03+1] spicerack: add module injection support [software/spicerack] - 10https://gerrit.wikimedia.org/r/863004 (https://phabricator.wikimedia.org/T319401) (owner: 10Volans) [11:24:49] (03CR) 10Volans: [C: 03+2] spicerack: add module injection support [software/spicerack] - 10https://gerrit.wikimedia.org/r/863004 (https://phabricator.wikimedia.org/T319401) (owner: 10Volans) [11:25:03] PROBLEM - Router interfaces on cr2-drmrs is CRITICAL: CRITICAL: host 185.15.58.129, interfaces up: 57, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:26:26] (03CR) 10Elukey: [C: 03+2] ml-services: increase limitrange for containers/pods in ml-staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/865589 (https://phabricator.wikimedia.org/T323624) (owner: 10Ilias Sarantopoulos) [11:28:49] (03Merged) 10jenkins-bot: spicerack: add module injection support [software/spicerack] - 10https://gerrit.wikimedia.org/r/863004 (https://phabricator.wikimedia.org/T319401) 
(owner: 10Volans) [11:29:56] !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [11:30:28] (03PS1) 10Volans: cloud-cumin: set the installer to use bullseye [puppet] - 10https://gerrit.wikimedia.org/r/865600 (https://phabricator.wikimedia.org/T319401) [11:30:51] !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [11:31:38] (03CR) 10Volans: [C: 03+2] "self-merging, wrong OS" [puppet] - 10https://gerrit.wikimedia.org/r/865600 (https://phabricator.wikimedia.org/T319401) (owner: 10Volans) [11:31:47] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/865061 (owner: 10Muehlenhoff) [11:33:09] !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . [11:33:50] (03PS3) 10Muehlenhoff: package_builder: Don't fail on cleanup jobs [puppet] - 10https://gerrit.wikimedia.org/r/865061 [11:35:22] (03PS2) 10Hnowlan: thumbor: enable mesh, move replicas to main values [deployment-charts] - 10https://gerrit.wikimedia.org/r/865595 [11:36:10] (03CR) 10Muehlenhoff: [C: 03+2] package_builder: Don't fail on cleanup jobs [puppet] - 10https://gerrit.wikimedia.org/r/865061 (owner: 10Muehlenhoff) [11:37:15] RECOVERY - Check systemd state on build2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:38:01] (03CR) 10Btullis: [C: 03+1] superset: Enable profile::auto_restarts::service for Envoy [puppet] - 10https://gerrit.wikimedia.org/r/862883 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [11:39:25] 10SRE, 10Cloud-Services, 10observability, 10Sustainability (Incident Followup), and 2 others: Switch rsyslog from gtls to ossl - https://phabricator.wikimedia.org/T324623 (10MoritzMuehlenhoff) p:05Triage→03Medium [11:40:58] !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . 
[11:41:46] (03CR) 10Filippo Giunchedi: [C: 03+1] "I can't meaningfully say on whether this will be an improvement over the (AFAICT) unattended warning, but open to try!" [puppet] - 10https://gerrit.wikimedia.org/r/865580 (https://phabricator.wikimedia.org/T324649) (owner: 10Clément Goubert) [11:42:50] 10SRE, 10ops-eqsin, 10DC-Ops, 10Traffic: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (10ssingh) >>! In T322048#8449909, @ayounsi wrote: > FYI, there are outstanding Homer diffs for asw1-eqsin: > `lang=diff > [edit interfaces] > - ge-0/0/16 { > - description DISAB... [11:42:56] (03PS1) 10Muehlenhoff: Add component/rsyslog-openssl for Buster [puppet] - 10https://gerrit.wikimedia.org/r/865602 (https://phabricator.wikimedia.org/T324623) [11:45:45] (03PS1) 10Giuseppe Lavagetto: [WiP] Add base.volume module [deployment-charts] - 10https://gerrit.wikimedia.org/r/865603 [11:46:38] (03CR) 10Filippo Giunchedi: [C: 03+1] Add component/rsyslog-openssl for Buster [puppet] - 10https://gerrit.wikimedia.org/r/865602 (https://phabricator.wikimedia.org/T324623) (owner: 10Muehlenhoff) [11:48:06] (03PS1) 10Ssingh: ntp/eqsin: move to dns5004 [dns] - 10https://gerrit.wikimedia.org/r/865605 (https://phabricator.wikimedia.org/T323830) [11:50:57] !log hashar@deploy1002 Started deploy [integration/docroot@2e0d44b]: Spelling, coobooks -> cookbooks [11:51:11] !log hashar@deploy1002 Finished deploy [integration/docroot@2e0d44b]: Spelling, coobooks -> cookbooks (duration: 00m 14s) [11:51:46] (03PS1) 10Volans: cloud_management: fix missing key in hiera [puppet] - 10https://gerrit.wikimedia.org/r/865608 [11:52:33] (03CR) 10Muehlenhoff: [C: 03+2] Add component/rsyslog-openssl for Buster [puppet] - 10https://gerrit.wikimedia.org/r/865602 (https://phabricator.wikimedia.org/T324623) (owner: 10Muehlenhoff) [11:54:31] (03CR) 10Ssingh: [C: 03+2] ntp/eqsin: move to dns5004 [dns] - 10https://gerrit.wikimedia.org/r/865605 (https://phabricator.wikimedia.org/T323830) (owner: 
10Ssingh) [11:54:48] !log oblivian@deploy1002 helmfile [staging] START helmfile.d/services/shellbox-constraints: apply [11:55:03] !log oblivian@deploy1002 helmfile [staging] DONE helmfile.d/services/shellbox-constraints: apply [11:55:15] !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/shellbox-constraints: apply [11:55:27] !log running authdns-update for Gerrit: 865605 [11:55:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:55:42] !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/shellbox-constraints: apply [11:56:02] (03PS1) 10Ssingh: hiera: decommission dns5002 [puppet] - 10https://gerrit.wikimedia.org/r/865610 (https://phabricator.wikimedia.org/T323830) [11:56:20] (03PS1) 10Ssingh: sites.yaml: remove dns5002 from anycast_neighbors [homer/public] - 10https://gerrit.wikimedia.org/r/865611 (https://phabricator.wikimedia.org/T323830) [11:57:07] (03CR) 10Volans: [C: 03+2] "fix puppet" [puppet] - 10https://gerrit.wikimedia.org/r/865608 (owner: 10Volans) [11:57:51] !log imported librelp 1.10.0-1~buster1 to component/rsyslog-openssl T324623 [11:57:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:57:56] T324623: Switch rsyslog from gtls to ossl - https://phabricator.wikimedia.org/T324623 [11:59:24] 10SRE-swift-storage: Swiftrepl doesn't work on bullseye (and swiftrepl.conf is deployed by hand) - https://phabricator.wikimedia.org/T299125 (10MatthewVernon) I've a package of rclone 1.60.1 that builds cleanly against unstable now; I'll be uploading it soon (tomorrow unless anyone on the go team objects). 
[11:59:28] !log imported rsyslog 8.2102.0-2+deb11u1~buster1 to component/rsyslog-openssl T324623 [11:59:30] !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/shellbox-constraints: apply [11:59:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:59:57] !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/shellbox-constraints: apply [12:02:34] 10SRE-swift-storage: Update Debian rclone package to 1.60.0 - https://phabricator.wikimedia.org/T322547 (10MatthewVernon) I have a package that's pretty much ready to go, hopefully upload tomorrow. [12:03:19] (03PS1) 10Ssingh: lvs5005: commission new LVS host (eqsin hardware refresh) [puppet] - 10https://gerrit.wikimedia.org/r/865613 (https://phabricator.wikimedia.org/T322048) [12:04:40] (03PS1) 10Volans: cloud_management: add missing wikimedia_clusters [puppet] - 10https://gerrit.wikimedia.org/r/865614 (https://phabricator.wikimedia.org/T319401) [12:05:22] (03PS1) 10Ssingh: sites.yaml: add lvs5005 (eqsin hardware refresh) [homer/public] - 10https://gerrit.wikimedia.org/r/865615 (https://phabricator.wikimedia.org/T322048) [12:05:40] 10SRE, 10Cloud-Services, 10observability, 10Patch-For-Review, and 3 others: Switch rsyslog from gtls to ossl - https://phabricator.wikimedia.org/T324623 (10MoritzMuehlenhoff) This turned out to a little more complicated than initially assumed. I've now built a backport of the version that is in Bullseye (w... [12:07:18] (03CR) 10Jbond: "-1 is for the incorrect param name in the doc string, but s other comments. 
i have also added kieth as observability own this infrastruct" [puppet] - 10https://gerrit.wikimedia.org/r/865184 (https://phabricator.wikimedia.org/T127717) (owner: 10Andrew Bogott) [12:07:31] (03CR) 10Volans: [C: 03+2] "fix puppet" [puppet] - 10https://gerrit.wikimedia.org/r/865614 (https://phabricator.wikimedia.org/T319401) (owner: 10Volans) [12:09:32] !log oblivian@deploy1002 helmfile [staging] START helmfile.d/services/shellbox-media: apply [12:09:55] !log oblivian@deploy1002 helmfile [staging] DONE helmfile.d/services/shellbox-media: apply [12:10:08] !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/shellbox-media: apply [12:10:40] !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/shellbox-media: apply [12:10:52] !log volans@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on cloudcumin2001.codfw.wmnet with reason: First installation [12:11:02] !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/shellbox-media: apply [12:11:31] !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/shellbox-media: apply [12:12:31] !log oblivian@deploy1002 helmfile [staging] START helmfile.d/services/shellbox-syntaxhighlight: apply [12:12:50] !log oblivian@deploy1002 helmfile [staging] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [12:13:40] !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/shellbox-syntaxhighlight: apply [12:14:09] !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [12:14:34] !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/shellbox-syntaxhighlight: apply [12:14:53] !log volans@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 4:00:00 on cloudcumin2001.codfw.wmnet with reason: First installation [12:15:08] !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [12:15:21] !log oblivian@deploy1002 helmfile [codfw] START 
helmfile.d/services/shellbox-syntaxhighlight: apply [12:15:24] !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [12:15:58] !log oblivian@deploy1002 helmfile [staging] START helmfile.d/services/shellbox-timeline: apply [12:16:25] !log oblivian@deploy1002 helmfile [staging] DONE helmfile.d/services/shellbox-timeline: apply [12:16:35] !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/shellbox-timeline: apply [12:17:22] !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/shellbox-timeline: apply [12:18:21] !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/shellbox-timeline: apply [12:18:52] (03CR) 10Muehlenhoff: [C: 03+2] superset: Enable profile::auto_restarts::service for Envoy [puppet] - 10https://gerrit.wikimedia.org/r/862883 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [12:19:15] !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/shellbox-timeline: apply [12:22:32] (03CR) 10Muehlenhoff: [C: 03+2] hue: Enable profile::auto_restarts::service for Envoy [puppet] - 10https://gerrit.wikimedia.org/r/862885 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [12:24:36] (03CR) 10Muehlenhoff: [C: 03+2] yarn: Enable profile::auto_restarts::service for Envoy [puppet] - 10https://gerrit.wikimedia.org/r/862886 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [12:25:28] (03PS2) 10Muehlenhoff: Enable profile::auto_restarts::service for Superset [puppet] - 10https://gerrit.wikimedia.org/r/862933 (https://phabricator.wikimedia.org/T135991) [12:27:33] (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on cloudcumin2001:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [12:27:52] that's me, WIP [12:28:47] !log volans@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on cloudcumin2001.codfw.wmnet with reason: First installation 
[12:28:48] !log volans@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 4:00:00 on cloudcumin2001.codfw.wmnet with reason: First installation [12:29:14] (03CR) 10Muehlenhoff: [C: 03+2] Enable profile::auto_restarts::service for Superset [puppet] - 10https://gerrit.wikimedia.org/r/862933 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [12:30:31] !log upgrading cloudweb to PHP 7.4.33 [12:30:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:32:33] (KeyholderUnarmed) resolved: 1 unarmed Keyholder key(s) on cloudcumin2001:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [12:32:45] !log volans@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on cloudcumin2001.codfw.wmnet with reason: First installation [12:32:59] !log volans@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on cloudcumin2001.codfw.wmnet with reason: First installation [12:33:45] !log upgrading deployment servers to PHP 7.4.33 [12:33:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:35:12] (03PS1) 10Volans: cloud_management: re-add datacenter-ops [puppet] - 10https://gerrit.wikimedia.org/r/865617 (https://phabricator.wikimedia.org/T319401) [12:36:14] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/865617 (https://phabricator.wikimedia.org/T319401) (owner: 10Volans) [12:37:24] !log upgrading mwmaint servers to PHP 7.4.33 [12:37:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:38:57] (03CR) 10Jbond: "some minor nits however it's probably worth touching base with o11y as i have also seen some tls related changes relating to rsyslog from the" [puppet] - 10https://gerrit.wikimedia.org/r/865174 (https://phabricator.wikimedia.org/T127717) (owner: 10Andrew Bogott) [12:40:58] (03CR) 10Volans: [C: 03+2] cloud_management: re-add 
datacenter-ops [puppet] - 10https://gerrit.wikimedia.org/r/865617 (https://phabricator.wikimedia.org/T319401) (owner: 10Volans) [12:42:42] 10SRE, 10Infrastructure-Foundations: Integrate Buster 10.13 point update - https://phabricator.wikimedia.org/T317413 (10MoritzMuehlenhoff) [12:48:23] (03CR) 10Muehlenhoff: [C: 03+2] analytics::refinery: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/858604 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [12:48:25] (03PS2) 10JMeybohm: pki: Add intermediates for wikikube and wikikube staging [puppet] - 10https://gerrit.wikimedia.org/r/865591 [12:48:27] (03PS2) 10JMeybohm: k8s: Add support for PKI with k8s >= 1.23 [puppet] - 10https://gerrit.wikimedia.org/r/865592 (https://phabricator.wikimedia.org/T307943) [12:48:29] (03PS1) 10JMeybohm: kubeadm: Declare /etc/kubernetes directory resource directly [puppet] - 10https://gerrit.wikimedia.org/r/865619 [12:49:29] (03CR) 10Ayounsi: [C: 03+1] sites.yaml: add lvs5005 (eqsin hardware refresh) [homer/public] - 10https://gerrit.wikimedia.org/r/865615 (https://phabricator.wikimedia.org/T322048) (owner: 10Ssingh) [12:51:25] (03CR) 10JMeybohm: "I'm going to (re)move that class in a follow up patch having it create another directory and hopefully it will be going away after the mig" [puppet] - 10https://gerrit.wikimedia.org/r/865619 (owner: 10JMeybohm) [12:51:56] 10SRE, 10ops-eqsin, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (10ayounsi) >>! In T322048#8450256, @ssingh wrote: >>>! In T322048#8449909, @ayounsi wrote: >> FYI, there are outstanding Homer diffs for asw1-eqsin: >> `lang=diff... 
[12:52:02] (03CR) 10Ayounsi: [C: 03+1] sites.yaml: remove dns5002 from anycast_neighbors [homer/public] - 10https://gerrit.wikimedia.org/r/865611 (https://phabricator.wikimedia.org/T323830) (owner: 10Ssingh) [12:59:09] 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack: Spicerack: don't IRC log start/stop of cookbook - https://phabricator.wikimedia.org/T324655 (10ayounsi) >>! In T324655#8450198, @Volans wrote: > For the latter part, you can move all the RO pre-requisite checks in the cookbook's __init__ that is run... [13:09:31] !log volans@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on cloudcumin1001.eqiad.wmnet with reason: First installation [13:09:44] !log volans@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on cloudcumin1001.eqiad.wmnet with reason: First installation [13:10:16] (03PS1) 10Clément Goubert: P:docker::builder: Add otelcol-contrib uid mapping [puppet] - 10https://gerrit.wikimedia.org/r/865623 [13:11:59] (03CR) 10Clément Goubert: Add a new production image for otelcol (032 comments) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/857672 (https://phabricator.wikimedia.org/T320552) (owner: 10Clément Goubert) [13:15:59] (03CR) 10Clément Goubert: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38616/console" [puppet] - 10https://gerrit.wikimedia.org/r/865623 (owner: 10Clément Goubert) [13:16:24] (03CR) 10Clément Goubert: P:docker::builder: Add otelcol-contrib uid mapping [puppet] - 10https://gerrit.wikimedia.org/r/865623 (owner: 10Clément Goubert) [13:18:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1206', diff saved to https://phabricator.wikimedia.org/P42443 and previous config saved to /var/cache/conftool/dbconfig/20221207-131858-marostegui.json [13:19:17] (03PS1) 10Marostegui: Revert "db1206: Enable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/865517 [13:20:24] (03CR) 10Marostegui: [C: 
03+2] Revert "db1206: Enable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/865517 (owner: 10Marostegui) [13:22:51] !log volans@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Created cloudcumin instances - volans@cumin1001" [13:25:56] !log volans@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Created cloudcumin instances - volans@cumin1001" [13:33:37] jouncebot: nowandnext [13:33:37] No deployments scheduled for the next 0 hour(s) and 26 minute(s) [13:33:37] In 0 hour(s) and 26 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221207T1400) [13:34:45] !log mwpresync@deploy1002 Started scap: testwikis wikis to 1.40.0-wmf.13 refs T320518 [13:34:49] T320518: 1.40.0-wmf.13 deployment blockers - https://phabricator.wikimedia.org/T320518 [13:35:08] RECOVERY - Router interfaces on cr2-drmrs is OK: OK: host 185.15.58.129, interfaces up: 61, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:38:47] (03PS1) 10Muehlenhoff: Remove misc-apache Cumin alias [puppet] - 10https://gerrit.wikimedia.org/r/865625 [13:40:01] (03PS4) 10Ottomata: flink-kubernetes-operator - modify for WMF [deployment-charts] - 10https://gerrit.wikimedia.org/r/865158 (https://phabricator.wikimedia.org/T324576) [13:41:11] (03CR) 10Muehlenhoff: [C: 03+2] Remove misc-apache Cumin alias [puppet] - 10https://gerrit.wikimedia.org/r/865625 (owner: 10Muehlenhoff) [13:42:30] !log mwpresync@deploy1002 Finished scap: testwikis wikis to 1.40.0-wmf.13 refs T320518 (duration: 07m 45s) [13:42:34] T320518: 1.40.0-wmf.13 deployment blockers - https://phabricator.wikimedia.org/T320518 [13:48:00] (03PS1) 10Muehlenhoff: doc: Enable profile::auto_restarts::service for Envoy [puppet] - 10https://gerrit.wikimedia.org/r/865646 (https://phabricator.wikimedia.org/T135991) [13:49:38] (03PS4) 
10Clément Goubert: P:mediawiki::php:monitoring: Longer opcache delay [puppet] - 10https://gerrit.wikimedia.org/r/865580 (https://phabricator.wikimedia.org/T324649) [13:54:56] 10SRE, 10ops-eqsin, 10DC-Ops, 10Traffic: ganeti500[567] implementation tracking for serviceops - https://phabricator.wikimedia.org/T324610 (10MoritzMuehlenhoff) Ack, decomming these by mid January sounds doable! [13:55:28] (03PS3) 10JMeybohm: k8s: Add support for PKI with k8s >= 1.23 [puppet] - 10https://gerrit.wikimedia.org/r/865592 (https://phabricator.wikimedia.org/T307943) [13:59:33] (03PS1) 10Muehlenhoff: prometheus: Enable profile::auto_restarts::service for Envoy [puppet] - 10https://gerrit.wikimedia.org/r/865648 (https://phabricator.wikimedia.org/T135991) [14:00:04] RoanKattouw, Lucas_WMDE, Urbanecm, and awight: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221207T1400). [14:00:04] No Gerrit patches in the queue for this window AFAICS. [14:01:13] o/ [14:01:24] yup, looks like nothing to do [14:02:53] !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 3:00:00 on an-tool1005.eqiad.wmnet with reason: redeploying an-tool1005 as bullseye [14:03:08] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on an-tool1005.eqiad.wmnet with reason: redeploying an-tool1005 as bullseye [14:05:55] 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack: Spicerack: don't IRC log start/stop of cookbook - https://phabricator.wikimedia.org/T324655 (10Volans) >>! In T324655#8450468, @ayounsi wrote: > > > > >>>! In T324655#8450198, @Volans wrote: >> For the latter part, you can move all the RO pre-requis... 
[14:11:18] (03CR) 10JMeybohm: "I would also argue not to remove things from the chart that can just stay disabled/unused to allow for easier merging of upstream changes " [deployment-charts] - 10https://gerrit.wikimedia.org/r/865158 (https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata) [14:12:41] (03PS1) 10Hashar: hiera: reorder contint1001 entries [puppet] - 10https://gerrit.wikimedia.org/r/865649 [14:14:00] (03CR) 10Ssingh: [C: 03+2] sites.yaml: remove dns5002 from anycast_neighbors [homer/public] - 10https://gerrit.wikimedia.org/r/865611 (https://phabricator.wikimedia.org/T323830) (owner: 10Ssingh) [14:14:43] (03Merged) 10jenkins-bot: sites.yaml: remove dns5002 from anycast_neighbors [homer/public] - 10https://gerrit.wikimedia.org/r/865611 (https://phabricator.wikimedia.org/T323830) (owner: 10Ssingh) [14:16:49] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dns5002.wikimedia.org with reason: downtimed, to be depooled [14:17:04] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dns5002.wikimedia.org with reason: downtimed, to be depooled [14:18:07] RECOVERY - BFD status on cr1-eqiad is OK: OK: UP: 23 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:18:08] (03CR) 10Ssingh: [C: 03+2] hiera: decommission dns5002 [puppet] - 10https://gerrit.wikimedia.org/r/865610 (https://phabricator.wikimedia.org/T323830) (owner: 10Ssingh) [14:18:23] (03CR) 10JMeybohm: flink-kubernetes-operator - modify for WMF (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/865158 (https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata) [14:18:29] (03PS2) 10Ssingh: hiera: decommission dns5002 [puppet] - 10https://gerrit.wikimedia.org/r/865610 (https://phabricator.wikimedia.org/T323830) [14:20:13] !log sukhe@cumin2002 START - Cookbook sre.hosts.decommission for hosts dns5002.wikimedia.org [14:22:00] (JobUnavailable) firing: Reduced 
availability for job trafficserver-text in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:24:36] !log sukhe@cumin2002 START - Cookbook sre.dns.netbox [14:25:01] (03PS1) 10Ayounsi: OSPF: update drmrs GTT interface name [homer/public] - 10https://gerrit.wikimedia.org/r/865652 (https://phabricator.wikimedia.org/T324047) [14:26:28] (03CR) 10Ayounsi: [C: 03+2] OSPF: update drmrs GTT interface name [homer/public] - 10https://gerrit.wikimedia.org/r/865652 (https://phabricator.wikimedia.org/T324047) (owner: 10Ayounsi) [14:26:44] !log sukhe@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: dns5002.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin2002" [14:26:45] (JobUnavailable) firing: (2) Reduced availability for job haproxy in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:26:53] (03PS9) 10Awight: kartotherian: add kartotherian chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/531699 (https://phabricator.wikimedia.org/T231006) (owner: 10Mathew.onipe) [14:27:00] (03Merged) 10jenkins-bot: OSPF: update drmrs GTT interface name [homer/public] - 10https://gerrit.wikimedia.org/r/865652 (https://phabricator.wikimedia.org/T324047) (owner: 10Ayounsi) [14:27:41] !log restarting ntpd [14:27:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:27:43] (03CR) 10CI reject: [V: 04-1] kartotherian: add kartotherian chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/531699 (https://phabricator.wikimedia.org/T231006) (owner: 10Mathew.onipe) [14:27:56] (03PS1) 10Ssingh: dns5003: add Puppet role and 
DNS/NTP configs [puppet] - 10https://gerrit.wikimedia.org/r/865657 (https://phabricator.wikimedia.org/T322048) [14:28:10] !log sukhe@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: dns5002.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin2002" [14:28:10] !log sukhe@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:28:11] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts dns5002.wikimedia.org [14:28:18] 10SRE, 10ops-eqsin, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q2:rack/setup/install/decom eqsin: unified decommission task - https://phabricator.wikimedia.org/T323830 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by sukhe@cumin2002 for hosts: `dns5002.wikimedia.org` - dns5002.wikimedia.... [14:28:58] 10SRE, 10ops-eqsin, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q2:rack/setup/install/decom eqsin: unified decommission task - https://phabricator.wikimedia.org/T323830 (10ssingh) [14:29:47] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:29:55] RECOVERY - BFD status on cr2-eqdfw is OK: OK: UP: 13 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:30:15] (03CR) 10Elukey: [C: 03+1] "LGTM, but I'll defer to John the final green light!" 
[puppet] - 10https://gerrit.wikimedia.org/r/865075 (owner: 10JMeybohm) [14:31:45] (JobUnavailable) firing: (3) Reduced availability for job haproxy in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:31:59] (03CR) 10Ssingh: [C: 03+2] dns5003: add Puppet role and DNS/NTP configs [puppet] - 10https://gerrit.wikimedia.org/r/865657 (https://phabricator.wikimedia.org/T322048) (owner: 10Ssingh) [14:32:33] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [14:32:56] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host dns5003.wikimedia.org with OS buster [14:33:06] 10SRE, 10ops-eqsin, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host dns5003.wikimedia.org with OS buster [14:33:14] (03CR) 10Elukey: "Do we have a pcc run to see the diffs?" 
[puppet] - 10https://gerrit.wikimedia.org/r/865591 (owner: 10JMeybohm) [14:35:00] (03PS1) 10Ssingh: sites.yaml: add dns5003 (eqsin hardware refresh) [homer/public] - 10https://gerrit.wikimedia.org/r/865660 (https://phabricator.wikimedia.org/T322048) [14:38:26] !log draining Arelion eqiad-codfw circuit for optic replacement [14:38:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:19] (03PS2) 10Ssingh: lvs5005: commission new LVS host (eqsin hardware refresh) [puppet] - 10https://gerrit.wikimedia.org/r/865613 (https://phabricator.wikimedia.org/T322048) [14:41:45] (JobUnavailable) firing: (3) Reduced availability for job haproxy in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:42:55] (03CR) 10Herron: [C: 03+1] prometheus: Enable profile::auto_restarts::service for Envoy [puppet] - 10https://gerrit.wikimedia.org/r/865648 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [14:43:03] (03CR) 10Ssingh: [C: 03+2] lvs5005: commission new LVS host (eqsin hardware refresh) [puppet] - 10https://gerrit.wikimedia.org/r/865613 (https://phabricator.wikimedia.org/T322048) (owner: 10Ssingh) [14:44:46] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host lvs5005.eqsin.wmnet with OS buster [14:44:57] 10SRE, 10ops-eqsin, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host lvs5005.eqsin.wmnet with OS buster [14:46:01] RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:47:48] (03PS5) 10Eevans: Promote Cassandra 3.11.13 to '3.x' (aka stable) [puppet] - 10https://gerrit.wikimedia.org/r/863026 
[14:49:05] 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10netops: eqiad: Move links to new MPC7E linecard - https://phabricator.wikimedia.org/T304712 (10ayounsi) [14:49:24] (03CR) 10Eevans: [C: 03+2] Promote Cassandra 3.11.13 to '3.x' (aka stable) [puppet] - 10https://gerrit.wikimedia.org/r/863026 (owner: 10Eevans) [14:50:25] (03CR) 10Eevans: [V: 03+2 C: 03+2] Promote Cassandra 3.11.13 to '3.x' (aka stable) [puppet] - 10https://gerrit.wikimedia.org/r/863026 (owner: 10Eevans) [14:52:21] (03PS1) 10Btullis: Upgrade an-tool1005 from buster to bullseye [puppet] - 10https://gerrit.wikimedia.org/r/865669 (https://phabricator.wikimedia.org/T323458) [14:55:43] (03CR) 10Filippo Giunchedi: [C: 03+1] prometheus: Enable profile::auto_restarts::service for Envoy [puppet] - 10https://gerrit.wikimedia.org/r/865648 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [14:56:14] !log krinkle@deploy1002 Started deploy [performance/navtiming@6caa033]: (no justification provided) [14:56:22] !log krinkle@deploy1002 Finished deploy [performance/navtiming@6caa033]: (no justification provided) (duration: 00m 07s) [14:56:59] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST nodes) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:01:53] (03PS1) 10Btullis: Update the mediawiki_history_snapshot in use by AQS [puppet] - 10https://gerrit.wikimedia.org/r/865671 [15:01:57] PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [400.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10 [15:01:59] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (LIST nodes) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - 
https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:03:04] (03CR) 10Milimetric: [C: 03+1] Update the mediawiki_history_snapshot in use by AQS [puppet] - 10https://gerrit.wikimedia.org/r/865671 (owner: 10Btullis) [15:03:36] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on dns5003.wikimedia.org with reason: host reimage [15:04:19] 10SRE-swift-storage, 10Community-Tech, 10MediaWiki-extensions-Phonos, 10MW-1.40-notes (1.40.0-wmf.6; 2022-10-17): Establish Phonos production storage requirements - https://phabricator.wikimedia.org/T320675 (10Eevans) >>! In T320675#8449574, @dmaza wrote: >>>! In T320675#8368902, @Eevans wrote: >> TL;DR I... [15:06:31] (03CR) 10Btullis: [C: 03+2] Upgrade an-tool1005 from buster to bullseye [puppet] - 10https://gerrit.wikimedia.org/r/865669 (https://phabricator.wikimedia.org/T323458) (owner: 10Btullis) [15:06:46] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dns5003.wikimedia.org with reason: host reimage [15:07:14] (KubernetesAPILatency) firing: (3) High Kubernetes API latency (PATCH events) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:07:30] (03CR) 10Btullis: [C: 03+2] Update the mediawiki_history_snapshot in use by AQS [puppet] - 10https://gerrit.wikimedia.org/r/865671 (owner: 10Btullis) [15:08:11] PROBLEM - jenkins_service_running on releases1002 is CRITICAL: PROCS CRITICAL: 0 processes with regex args .*/bin/java .*-jar /usr/share/java/jenkins.war https://wikitech.wikimedia.org/wiki/Jenkins [15:09:11] RECOVERY - jenkins_service_running on releases1002 is OK: PROCS OK: 1 process with regex args .*/bin/java .*-jar /usr/share/java/jenkins.war https://wikitech.wikimedia.org/wiki/Jenkins [15:09:42] releases1002 alarmed cause I was restarting Jenkins there [15:10:05] ack [15:11:12] !log btullis@cumin1001 START - Cookbook sre.aqs.roll-restart for AQS aqs cluster: Roll 
restart of all AQS's nodejs daemons. [15:11:21] PROBLEM - Recursive DNS on 103.102.166.10 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS [15:12:14] (KubernetesAPILatency) firing: (5) High Kubernetes API latency (PATCH events) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:12:24] ^ expected [15:12:43] ack thanks [15:12:55] (NodeTextfileStale) firing: (48) Stale textfile for cp1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [15:13:29] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on lvs5005.eqsin.wmnet with reason: host reimage [15:14:03] (03CR) 10David Caro: "LGTM, let me try to test it in toolsbeta" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/857588 (https://phabricator.wikimedia.org/T293645) (owner: 10Raymond Ndibe) [15:16:31] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs5005.eqsin.wmnet with reason: host reimage [15:16:45] (JobUnavailable) firing: (3) Reduced availability for job haproxy in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:17:14] (KubernetesAPILatency) resolved: (5) High Kubernetes API latency (PATCH events) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:17:48] (03CR) 10Btullis: [C: 03+2] search: drop search-drop-query-clicks systemd timer (2/2) [puppet] - 10https://gerrit.wikimedia.org/r/865073 (owner: 10DCausse) [15:17:49] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The 
following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:18:15] (03PS1) 10Hashar: contint: give access to contint1002 [puppet] - 10https://gerrit.wikimedia.org/r/865672 (https://phabricator.wikimedia.org/T313832) [15:19:13] PROBLEM - Recursive DNS on 2001:df2:e500:1:103:102:166:10 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS [15:19:32] ^ expected, should resolve soon [15:19:48] ack thanks [15:20:21] (03PS1) 10Muehlenhoff: Enable profile::auto_restarts::service for Apache on VRTS [puppet] - 10https://gerrit.wikimedia.org/r/865674 (https://phabricator.wikimedia.org/T135991) [15:23:01] RECOVERY - Recursive DNS on 103.102.166.10 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS [15:23:51] RECOVERY - Recursive DNS on 2001:df2:e500:1:103:102:166:10 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS [15:24:35] !log btullis@cumin1001 END (PASS) - Cookbook sre.aqs.roll-restart (exit_code=0) for AQS aqs cluster: Roll restart of all AQS's nodejs daemons. [15:26:06] 10SRE, 10Cloud-Services, 10observability, 10Patch-For-Review, and 3 others: Switch rsyslog from gtls to ossl - https://phabricator.wikimedia.org/T324623 (10Andrew) Wow, instant gratification! Thank you @MoritzMuehlenhoff, I will test. 
[15:26:45] (JobUnavailable) firing: (3) Reduced availability for job haproxy in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:31:31] RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [200.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10 [15:33:37] (03PS6) 10Effie Mouzeli: ProductionServices: Use redis_misc servers for LockManager (4/6) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/865121 (https://phabricator.wikimedia.org/T267581) [15:34:27] !log sukhe@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - sukhe@cumin2002" [15:36:14] !log sukhe@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - sukhe@cumin2002" [15:36:14] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dns5003.wikimedia.org with OS buster [15:36:27] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jiji@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/865121 (https://phabricator.wikimedia.org/T267581) (owner: 10Effie Mouzeli) [15:36:31] 10SRE, 10ops-eqsin, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host dns5003.wikimedia.org with OS buster completed: - dns5003 (**PASS**)... 
[15:37:32] (03Merged) 10jenkins-bot: ProductionServices: Use redis_misc servers for LockManager (4/6) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/865121 (https://phabricator.wikimedia.org/T267581) (owner: 10Effie Mouzeli) [15:37:55] !log jiji@deploy1002 Started scap: Backport for [[gerrit:865121|ProductionServices: Use redis_misc servers for LockManager (4/6) (T267581)]] [15:37:58] T267581: Phase out "redis_sessions" cluster and away from memcached cluster - https://phabricator.wikimedia.org/T267581 [15:37:59] RECOVERY - Check systemd state on maps1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:39:49] !log jiji@deploy1002 jiji and jiji: Backport for [[gerrit:865121|ProductionServices: Use redis_misc servers for LockManager (4/6) (T267581)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet [15:40:04] !log sukhe@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - sukhe@cumin2002" [15:41:03] RECOVERY - Check systemd state on maps2009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:41:25] !log sukhe@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - sukhe@cumin2002" [15:41:26] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host lvs5005.eqsin.wmnet with OS buster [15:41:34] (03CR) 10David Caro: "The webservice starts as expected:" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/857588 (https://phabricator.wikimedia.org/T293645) (owner: 10Raymond Ndibe) [15:41:38] 10SRE, 10ops-eqsin, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q2:rack/setup/install eqsin refresh - 
https://phabricator.wikimedia.org/T322048 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host lvs5005.eqsin.wmnet with OS buster completed: - lvs5005 (**PASS**)... [15:43:02] (03PS1) 10Herron: update role_contacts for thanos (front|back)end [puppet] - 10https://gerrit.wikimedia.org/r/865679 [15:44:29] (03PS2) 10Hashar: contint: give RelEng access to contint1002 [puppet] - 10https://gerrit.wikimedia.org/r/865672 (https://phabricator.wikimedia.org/T313832) [15:44:31] (03PS1) 10Hashar: contint: add ci::master to contint1002 [puppet] - 10https://gerrit.wikimedia.org/r/865680 (https://phabricator.wikimedia.org/T313832) [15:44:33] (03PS1) 10Hashar: contint: add contint1002 as a scap target [puppet] - 10https://gerrit.wikimedia.org/r/865681 (https://phabricator.wikimedia.org/T313832) [15:44:58] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/865680 (https://phabricator.wikimedia.org/T313832) (owner: 10Hashar) [15:45:41] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:46:25] !log jiji@deploy1002 Finished scap: Backport for [[gerrit:865121|ProductionServices: Use redis_misc servers for LockManager (4/6) (T267581)]] (duration: 08m 29s) [15:46:28] T267581: Phase out "redis_sessions" cluster and away from memcached cluster - https://phabricator.wikimedia.org/T267581 [15:48:09] (03CR) 10David Caro: [C: 03+1] webservice cli: allow for deployment of custom harbor images (031 comment) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/857588 (https://phabricator.wikimedia.org/T293645) (owner: 10Raymond Ndibe) [15:48:11] (03CR) 10Effie Mouzeli: [C: 03+1] "seems reasonable, thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/865066 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [15:48:53] (03PS4) 10Effie Mouzeli: ProductionServices: Use redis_misc servers for LockManager (5/6) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/865122 (https://phabricator.wikimedia.org/T267581) [15:49:00] (03PS2) 10Muehlenhoff: Enable profile::auto_restarts::service for Apache/FPM/Envoy on mwmaint/noc [puppet] - 10https://gerrit.wikimedia.org/r/865066 (https://phabricator.wikimedia.org/T135991) [15:49:29] (03PS3) 10Effie Mouzeli: ProductionServices: Use redis_misc servers for LockManager (6/6) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/865123 (https://phabricator.wikimedia.org/T267581) [15:50:38] (03CR) 10Muehlenhoff: [C: 03+2] Enable profile::auto_restarts::service for Apache/FPM/Envoy on mwmaint/noc [puppet] - 10https://gerrit.wikimedia.org/r/865066 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [15:50:47] PROBLEM - OpenSearch health check for shards on 9200 on logstash1026 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(urllib3.connection.HTTPConnection object at 0x7fc2371af320: Failed to establish a new connection: [Errno 111] Connection refused)) https://wikitech.wikimedia.org/wiki/Search%23Administration [15:50:47] (03CR) 10Effie Mouzeli: Redis sessions: Goodbye (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/864830 (https://phabricator.wikimedia.org/T267581) (owner: 10Effie Mouzeli) [15:51:04] (03PS1) 10Hnowlan: api-gateway: add restbase routing, enable in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/865683 (https://phabricator.wikimedia.org/T322152) [15:51:58] (03CR) 10Andrew Bogott: "thanks for the review!" 
[puppet] - 10https://gerrit.wikimedia.org/r/865184 (https://phabricator.wikimedia.org/T127717) (owner: 10Andrew Bogott) [15:52:09] (03Abandoned) 10Muehlenhoff: puppet: migrate from require_package to ensure_packages [puppet] - 10https://gerrit.wikimedia.org/r/640688 (https://phabricator.wikimedia.org/T266479) (owner: 10Jbond) [15:52:19] (03PS4) 10JMeybohm: k8s: Add support for PKI with k8s >= 1.23 [puppet] - 10https://gerrit.wikimedia.org/r/865592 (https://phabricator.wikimedia.org/T307943) [15:52:49] (03CR) 10Majavah: [C: 04-1] webservice cli: allow for deployment of custom harbor images (034 comments) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/857588 (https://phabricator.wikimedia.org/T293645) (owner: 10Raymond Ndibe) [15:56:21] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jiji@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/865122 (https://phabricator.wikimedia.org/T267581) (owner: 10Effie Mouzeli) [15:57:08] (03Merged) 10jenkins-bot: ProductionServices: Use redis_misc servers for LockManager (5/6) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/865122 (https://phabricator.wikimedia.org/T267581) (owner: 10Effie Mouzeli) [15:57:32] !log jiji@deploy1002 Started scap: Backport for [[gerrit:865122|ProductionServices: Use redis_misc servers for LockManager (5/6) (T267581)]] [15:57:36] (03CR) 10David Caro: [C: 03+1] webservice cli: allow for deployment of custom harbor images (032 comments) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/857588 (https://phabricator.wikimedia.org/T293645) (owner: 10Raymond Ndibe) [15:57:36] T267581: Phase out "redis_sessions" cluster and away from memcached cluster - https://phabricator.wikimedia.org/T267581 [15:57:41] (03CR) 10Filippo Giunchedi: [C: 03+1] update role_contacts for thanos (front|back)end [puppet] - 10https://gerrit.wikimedia.org/r/865679 (owner: 10Herron) [15:58:15] (03CR) 10Herron: [C: 03+2] update role_contacts for thanos (front|back)end 
[puppet] - 10https://gerrit.wikimedia.org/r/865679 (owner: 10Herron) [15:58:46] (03CR) 10Andrew Bogott: rsyslog: allow specifying a hiera-defined certfile (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/865174 (https://phabricator.wikimedia.org/T127717) (owner: 10Andrew Bogott) [15:58:51] (03CR) 10Majavah: [C: 04-1] webservice cli: allow for deployment of custom harbor images (031 comment) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/857588 (https://phabricator.wikimedia.org/T293645) (owner: 10Raymond Ndibe) [15:59:25] !log jiji@deploy1002 jiji and jiji: Backport for [[gerrit:865122|ProductionServices: Use redis_misc servers for LockManager (5/6) (T267581)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet [15:59:39] 10SRE, 10ops-eqiad, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install kafka-jumbo101[0-5] - https://phabricator.wikimedia.org/T306939 (10BTullis) a:03BTullis Is it OK if I have a crack at this @papaul? 
[15:59:43] (03PS12) 10Andrew Bogott: rsyslog: allow specifying a hiera-defined certfile [puppet] - 10https://gerrit.wikimedia.org/r/865174 (https://phabricator.wikimedia.org/T127717) [15:59:45] (03PS4) 10Andrew Bogott: remote syslog: allow hiera config of rsyslog TLS CA [puppet] - 10https://gerrit.wikimedia.org/r/865184 (https://phabricator.wikimedia.org/T127717) [16:00:55] (03CR) 10Alexandros Kosiaris: Update cxserver to 2022-12-06-121330-production (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/865063 (https://phabricator.wikimedia.org/T321781) (owner: 10KartikMistry) [16:02:13] (03CR) 10David Caro: [C: 04-1] "The manifest update needs fixing as Taavi pointed out ;)" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/857588 (https://phabricator.wikimedia.org/T293645) (owner: 10Raymond Ndibe) [16:02:28] (03CR) 10CI reject: [V: 04-1] remote syslog: allow hiera config of rsyslog TLS CA [puppet] - 10https://gerrit.wikimedia.org/r/865184 (https://phabricator.wikimedia.org/T127717) (owner: 10Andrew Bogott) [16:03:48] (03CR) 10Elukey: "Left a couple of nits, but overall it makes sense. 
Didn't get to review in detail all the changes in {master,node}.pp yet :(" [puppet] - 10https://gerrit.wikimedia.org/r/865592 (https://phabricator.wikimedia.org/T307943) (owner: 10JMeybohm) [16:05:16] (03CR) 10Ssingh: [C: 03+2] sites.yaml: add dns5003 (eqsin hardware refresh) [homer/public] - 10https://gerrit.wikimedia.org/r/865660 (https://phabricator.wikimedia.org/T322048) (owner: 10Ssingh) [16:06:51] !log run homer in cr*-eqsin for Gerrit: 865660 [16:06:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:07:02] (03CR) 10BBlack: [C: 03+2] eqsin cp: unify per-node hieradata [puppet] - 10https://gerrit.wikimedia.org/r/865120 (https://phabricator.wikimedia.org/T322048) (owner: 10BBlack) [16:08:03] (03CR) 10Giuseppe Lavagetto: [C: 03+1] Redis sessions: Goodbye [puppet] - 10https://gerrit.wikimedia.org/r/864830 (https://phabricator.wikimedia.org/T267581) (owner: 10Effie Mouzeli) [16:08:32] !log jiji@deploy1002 Finished scap: Backport for [[gerrit:865122|ProductionServices: Use redis_misc servers for LockManager (5/6) (T267581)]] (duration: 10m 59s) [16:08:35] T267581: Phase out "redis_sessions" cluster and away from memcached cluster - https://phabricator.wikimedia.org/T267581 [16:09:59] (KubernetesAPILatency) firing: High Kubernetes API latency (PUT secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:10:46] 10SRE, 10ops-eqsin, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (10ssingh) [16:14:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (PUT secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:18:20] (03PS1) 10Ssingh: 
lvs5002: set profile::pybal::bgp to no [puppet] - 10https://gerrit.wikimedia.org/r/865687 (https://phabricator.wikimedia.org/T323830) [16:19:16] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38619/console" [puppet] - 10https://gerrit.wikimedia.org/r/865687 (https://phabricator.wikimedia.org/T323830) (owner: 10Ssingh) [16:22:58] (03CR) 10Ssingh: [C: 03+2] sites.yaml: add lvs5005 (eqsin hardware refresh) [homer/public] - 10https://gerrit.wikimedia.org/r/865615 (https://phabricator.wikimedia.org/T322048) (owner: 10Ssingh) [16:24:35] !log run homer in cr*-eqsin for Gerrit: 865615 [16:24:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:25:13] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2117.codfw.wmnet with reason: Maintenance [16:25:15] !log Deploying analytics/refinery (HDFS usage scripts) [16:25:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:25:23] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1096.eqiad.wmnet with reason: Maintenance [16:25:26] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2117.codfw.wmnet with reason: Maintenance [16:25:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2117 (T322618)', diff saved to https://phabricator.wikimedia.org/P42446 and previous config saved to /var/cache/conftool/dbconfig/20221207-162533-ladsgroup.json [16:25:36] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [16:25:47] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1096.eqiad.wmnet with reason: Maintenance [16:25:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1096:3316 (T322618)', diff saved to https://phabricator.wikimedia.org/P42447 and previous config saved to 
/var/cache/conftool/dbconfig/20221207-162553-ladsgroup.json [16:27:19] !log aqu@deploy1002 Started deploy [analytics/refinery@349e1cc]: Deploy HDFS usage dataset generation scripts [analytics/refinery@349e1cc] [16:27:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2117 (T322618)', diff saved to https://phabricator.wikimedia.org/P42448 and previous config saved to /var/cache/conftool/dbconfig/20221207-162745-ladsgroup.json [16:28:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316 (T322618)', diff saved to https://phabricator.wikimedia.org/P42449 and previous config saved to /var/cache/conftool/dbconfig/20221207-162802-ladsgroup.json [16:29:26] !log cwhite@cumin2002 START - Cookbook sre.hosts.reimage for host logstash1026.eqiad.wmnet with OS bullseye [16:29:58] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2173.codfw.wmnet with reason: Maintenance [16:30:11] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2173.codfw.wmnet with reason: Maintenance [16:30:12] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2094.codfw.wmnet with reason: Maintenance [16:30:25] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2094.codfw.wmnet with reason: Maintenance [16:30:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2173 (T322618)', diff saved to https://phabricator.wikimedia.org/P42450 and previous config saved to /var/cache/conftool/dbconfig/20221207-163031-ladsgroup.json [16:32:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2173 (T322618)', diff saved to https://phabricator.wikimedia.org/P42451 and previous config saved to /var/cache/conftool/dbconfig/20221207-163242-ladsgroup.json [16:32:46] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - 
https://phabricator.wikimedia.org/T322618 [16:33:07] (03PS1) 10BBlack: cp: remove the last haproxy role refs from hiera [puppet] - 10https://gerrit.wikimedia.org/r/865691 [16:35:18] (03PS1) 10Cwhite: logstash: move alertmanager severity field to labels.check_severity [puppet] - 10https://gerrit.wikimedia.org/r/865631 (https://phabricator.wikimedia.org/T324684) [16:35:55] (03CR) 10BBlack: "NOP in PCC just for extra verification: https://puppet-compiler.wmflabs.org/output/865691/38620/" [puppet] - 10https://gerrit.wikimedia.org/r/865691 (owner: 10BBlack) [16:36:04] (03CR) 10BBlack: [C: 03+2] cp: remove the last haproxy role refs from hiera [puppet] - 10https://gerrit.wikimedia.org/r/865691 (owner: 10BBlack) [16:36:33] (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on cloudcumin2001:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [16:36:35] (03PS1) 10Andrea Denisse: netmon: Remove netmon2001 from the alertmanager rw api [puppet] - 10https://gerrit.wikimedia.org/r/865693 (https://phabricator.wikimedia.org/T322695) [16:36:45] (JobUnavailable) firing: (2) Reduced availability for job es_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:38:07] !log cr[23]-eqsin*: set routing-options static route 103.102.166.240/28 next-hop 10.132.0.6: T322048 [16:38:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:38:10] T322048: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 [16:38:54] (03CR) 10Andrea Denisse: [V: 03+1] "PCC SUCCESS (NOOP 7): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38621/console" [puppet] - 10https://gerrit.wikimedia.org/r/865693 (https://phabricator.wikimedia.org/T322695) (owner: 10Andrea Denisse) [16:40:17] (03CR) 
10Andrea Denisse: [V: 03+1] "PCC results:" [puppet] - 10https://gerrit.wikimedia.org/r/865693 (https://phabricator.wikimedia.org/T322695) (owner: 10Andrea Denisse) [16:40:42] (03CR) 10Ssingh: [V: 03+1 C: 03+2] lvs5002: set profile::pybal::bgp to no [puppet] - 10https://gerrit.wikimedia.org/r/865687 (https://phabricator.wikimedia.org/T323830) (owner: 10Ssingh) [16:40:54] (03PS5) 10JMeybohm: k8s: Add support for PKI with k8s >= 1.23 [puppet] - 10https://gerrit.wikimedia.org/r/865592 (https://phabricator.wikimedia.org/T307943) [16:41:30] (03PS2) 10Eevans: echostore: bring codfw hosts up to date [deployment-charts] - 10https://gerrit.wikimedia.org/r/862307 (https://phabricator.wikimedia.org/T253244) [16:42:27] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 4:00:00 on lvs5002.eqsin.wmnet with reason: downtimed, in the process of decom [16:42:30] !log restart pybal on lvs5002 [16:42:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:42:35] (03CR) 10JMeybohm: k8s: Add support for PKI with k8s >= 1.23 (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/865592 (https://phabricator.wikimedia.org/T307943) (owner: 10JMeybohm) [16:42:39] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1181.eqiad.wmnet with reason: Maintenance [16:42:42] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on lvs5002.eqsin.wmnet with reason: downtimed, in the process of decom [16:42:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2117', diff saved to https://phabricator.wikimedia.org/P42452 and previous config saved to /var/cache/conftool/dbconfig/20221207-164252-ladsgroup.json [16:42:53] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1181.eqiad.wmnet with reason: Maintenance [16:42:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1181 (T322618)', diff saved to 
https://phabricator.wikimedia.org/P42453 and previous config saved to /var/cache/conftool/dbconfig/20221207-164258-ladsgroup.json [16:43:02] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [16:43:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316', diff saved to https://phabricator.wikimedia.org/P42454 and previous config saved to /var/cache/conftool/dbconfig/20221207-164308-ladsgroup.json [16:43:13] * elukey bbiab [16:43:22] err wrong chan :) [16:45:31] (03PS1) 10Andrea Denisse: netmon: Remove the netmon2001 instance as passive node [puppet] - 10https://gerrit.wikimedia.org/r/865695 (https://phabricator.wikimedia.org/T322695) [16:45:38] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jiji@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/865123 (https://phabricator.wikimedia.org/T267581) (owner: 10Effie Mouzeli) [16:46:31] (03Merged) 10jenkins-bot: ProductionServices: Use redis_misc servers for LockManager (6/6) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/865123 (https://phabricator.wikimedia.org/T267581) (owner: 10Effie Mouzeli) [16:46:33] (03CR) 10Hnowlan: [C: 03+1] echostore: bring codfw hosts up to date [deployment-charts] - 10https://gerrit.wikimedia.org/r/862307 (https://phabricator.wikimedia.org/T253244) (owner: 10Eevans) [16:46:55] !log jiji@deploy1002 Started scap: Backport for [[gerrit:865123|ProductionServices: Use redis_misc servers for LockManager (6/6) (T267581)]] [16:46:59] T267581: Phase out "redis_sessions" cluster and away from memcached cluster - https://phabricator.wikimedia.org/T267581 [16:47:12] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38622/console" [puppet] - 10https://gerrit.wikimedia.org/r/865592 (https://phabricator.wikimedia.org/T307943) (owner: 10JMeybohm) [16:47:19] (03CR) 10Andrea Denisse: [V: 
03+1] "PCC SUCCESS (NOOP 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38623/console" [puppet] - 10https://gerrit.wikimedia.org/r/865693 (https://phabricator.wikimedia.org/T322695) (owner: 10Andrea Denisse) [16:47:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2173', diff saved to https://phabricator.wikimedia.org/P42455 and previous config saved to /var/cache/conftool/dbconfig/20221207-164748-ladsgroup.json [16:48:09] (03CR) 10Eevans: [C: 03+2] echostore: bring codfw hosts up to date [deployment-charts] - 10https://gerrit.wikimedia.org/r/862307 (https://phabricator.wikimedia.org/T253244) (owner: 10Eevans) [16:48:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181 (T322618)', diff saved to https://phabricator.wikimedia.org/P42456 and previous config saved to /var/cache/conftool/dbconfig/20221207-164809-ladsgroup.json [16:48:13] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [16:48:49] (03CR) 10Andrea Denisse: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38624/console" [puppet] - 10https://gerrit.wikimedia.org/r/865695 (https://phabricator.wikimedia.org/T322695) (owner: 10Andrea Denisse) [16:48:52] !log jiji@deploy1002 jiji and jiji: Backport for [[gerrit:865123|ProductionServices: Use redis_misc servers for LockManager (6/6) (T267581)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet [16:49:15] (03PS1) 10Cmjohnson: updateing site.pp for kubernetes servers to change role to insetup::serviceops [puppet] - 10https://gerrit.wikimedia.org/r/865632 (https://phabricator.wikimedia.org/T313873) [16:50:31] (03CR) 10Andrea Denisse: [V: 03+1] "PCC results: https://puppet-compiler.wmflabs.org/output/865695/38624/" [puppet] - 10https://gerrit.wikimedia.org/r/865695 
(https://phabricator.wikimedia.org/T322695) (owner: 10Andrea Denisse) [16:51:03] (03PS1) 10Ssingh: lvs5005: set as high-traffic2 primary LVS and remove lvs5002 (decomm) [puppet] - 10https://gerrit.wikimedia.org/r/865701 (https://phabricator.wikimedia.org/T323830) [16:51:21] (03CR) 10Cmjohnson: [C: 03+2] updateing site.pp for kubernetes servers to change role to insetup::serviceops [puppet] - 10https://gerrit.wikimedia.org/r/865632 (https://phabricator.wikimedia.org/T313873) (owner: 10Cmjohnson) [16:51:45] (JobUnavailable) firing: (2) Reduced availability for job es_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:53:17] (03PS3) 10Hnowlan: thumbor: move replicas to main values, use swift discovery [deployment-charts] - 10https://gerrit.wikimedia.org/r/865595 [16:53:37] (03Merged) 10jenkins-bot: echostore: bring codfw hosts up to date [deployment-charts] - 10https://gerrit.wikimedia.org/r/862307 (https://phabricator.wikimedia.org/T253244) (owner: 10Eevans) [16:54:43] (03PS1) 10Andrea Denisse: netmon: Add the netmon2002 as a LibreNMS scap deploy target [puppet] - 10https://gerrit.wikimedia.org/r/865705 (https://phabricator.wikimedia.org/T315523) [16:55:10] (03PS1) 10RobH: updating role [puppet] - 10https://gerrit.wikimedia.org/r/865706 (https://phabricator.wikimedia.org/T322048) [16:55:14] !log eevans@deploy1002 helmfile [staging] START helmfile.d/services/echostore: apply [16:55:16] !log eevans@deploy1002 helmfile [staging] DONE helmfile.d/services/echostore: apply [16:55:22] !log eevans@deploy1002 helmfile [codfw] START helmfile.d/services/echostore: apply [16:55:39] (03PS2) 10RobH: updating role [puppet] - 10https://gerrit.wikimedia.org/r/865706 (https://phabricator.wikimedia.org/T322048) [16:56:01] !log eevans@deploy1002 helmfile [codfw] DONE helmfile.d/services/echostore: apply [16:56:02] 
(03CR) 10Andrea Denisse: [V: 03+1] "PCC SUCCESS (NOOP 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38625/console" [puppet] - 10https://gerrit.wikimedia.org/r/865705 (https://phabricator.wikimedia.org/T315523) (owner: 10Andrea Denisse) [16:56:13] (03CR) 10RobH: [C: 03+2] updating role [puppet] - 10https://gerrit.wikimedia.org/r/865706 (https://phabricator.wikimedia.org/T322048) (owner: 10RobH) [16:56:56] (03CR) 10Andrea Denisse: [V: 03+1] "PCC results: https://puppet-compiler.wmflabs.org/output/865705/38625/" [puppet] - 10https://gerrit.wikimedia.org/r/865705 (https://phabricator.wikimedia.org/T315523) (owner: 10Andrea Denisse) [16:57:54] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1024.eqiad.wmnet with OS bullseye [16:57:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2117', diff saved to https://phabricator.wikimedia.org/P42457 and previous config saved to /var/cache/conftool/dbconfig/20221207-165758-ladsgroup.json [16:58:01] (03CR) 10JHathaway: [C: 03+2] Add Wenjun Fan to analytics_privatedata_users [puppet] - 10https://gerrit.wikimedia.org/r/865177 (https://phabricator.wikimedia.org/T324057) (owner: 10JHathaway) [16:58:04] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops, 10Patch-For-Review: Q1:rack/setup/install kubernetes102[34] - https://phabricator.wikimedia.org/T313873 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host kubernetes1024.eqiad.wmnet with OS bullseye [16:58:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316', diff saved to https://phabricator.wikimedia.org/P42458 and previous config saved to /var/cache/conftool/dbconfig/20221207-165815-ladsgroup.json [16:58:37] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1023.eqiad.wmnet with OS bullseye [16:58:46] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops, 10Patch-For-Review: 
Q1:rack/setup/install kubernetes102[34] - https://phabricator.wikimedia.org/T313873 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host kubernetes1023.eqiad.wmnet with OS bullseye [16:59:53] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to Turnilo for USER:wfan - https://phabricator.wikimedia.org/T324057 (10jhathaway) 05Open→03Resolved @AnnWF done! [17:00:40] (03PS1) 10Andrea Denisse: netmon: Add the netmon2002 instance as a ganeti rapi node. [puppet] - 10https://gerrit.wikimedia.org/r/865707 (https://phabricator.wikimedia.org/T315523) [17:01:42] !log jiji@deploy1002 Finished scap: Backport for [[gerrit:865123|ProductionServices: Use redis_misc servers for LockManager (6/6) (T267581)]] (duration: 14m 46s) [17:01:45] T267581: Phase out "redis_sessions" cluster and away from memcached cluster - https://phabricator.wikimedia.org/T267581 [17:01:52] (03CR) 10Andrea Denisse: [V: 03+1] "PCC SUCCESS (NOOP 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38626/console" [puppet] - 10https://gerrit.wikimedia.org/r/865707 (https://phabricator.wikimedia.org/T315523) (owner: 10Andrea Denisse) [17:01:57] 10SRE, 10ops-eqiad, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install kafka-jumbo101[0-5] - https://phabricator.wikimedia.org/T306939 (10Papaul) @BTullis feel free [17:02:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2173', diff saved to https://phabricator.wikimedia.org/P42459 and previous config saved to /var/cache/conftool/dbconfig/20221207-170256-ladsgroup.json [17:03:03] PROBLEM - Check systemd state on an-tool1005 is CRITICAL: CRITICAL - degraded: The following units failed: envoyproxy.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:03:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181', diff saved to https://phabricator.wikimedia.org/P42460 and previous config saved to 
/var/cache/conftool/dbconfig/20221207-170316-ladsgroup.json [17:04:48] (03PS1) 10Andrea Denisse: netmon: Remove rsync quickdatacopy failover restrictions [puppet] - 10https://gerrit.wikimedia.org/r/865708 (https://phabricator.wikimedia.org/T309074) [17:06:29] (03CR) 10Andrea Denisse: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38627/console" [puppet] - 10https://gerrit.wikimedia.org/r/865708 (https://phabricator.wikimedia.org/T309074) (owner: 10Andrea Denisse) [17:07:11] (03CR) 10Andrea Denisse: [V: 03+1] "PCC results: https://puppet-compiler.wmflabs.org/output/865707/38626/" [puppet] - 10https://gerrit.wikimedia.org/r/865707 (https://phabricator.wikimedia.org/T315523) (owner: 10Andrea Denisse) [17:07:33] (03CR) 10Andrea Denisse: [V: 03+1] "PCC results: https://puppet-compiler.wmflabs.org/output/865707/38626/" [puppet] - 10https://gerrit.wikimedia.org/r/865707 (https://phabricator.wikimedia.org/T315523) (owner: 10Andrea Denisse) [17:08:13] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db2103 to s1 master [puppet] - 10https://gerrit.wikimedia.org/r/865633 (https://phabricator.wikimedia.org/T324692) [17:08:15] (03CR) 10Andrea Denisse: [V: 03+1] "This constraint is no longer required." [puppet] - 10https://gerrit.wikimedia.org/r/865708 (https://phabricator.wikimedia.org/T309074) (owner: 10Andrea Denisse) [17:08:23] !log sukhe@cumin2002 START - Cookbook sre.hosts.decommission for hosts lvs5002.eqsin.wmnet [17:08:52] 10SRE, 10LDAP-Access-Requests: Grant Access to ciadmin for Dom Walden - https://phabricator.wikimedia.org/T323549 (10jhathaway) 05Open→03Resolved a:03jhathaway @dom_walden done! 
[17:10:03] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1024.eqiad.wmnet with reason: host reimage [17:10:41] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1023.eqiad.wmnet with reason: host reimage [17:11:25] (03PS3) 10JMeybohm: pki: Add intermediates for wikikube and wikikube staging [puppet] - 10https://gerrit.wikimedia.org/r/865591 [17:11:27] (03PS6) 10JMeybohm: k8s: Add support for PKI with k8s >= 1.23 [puppet] - 10https://gerrit.wikimedia.org/r/865592 (https://phabricator.wikimedia.org/T307943) [17:11:51] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 38 hosts with reason: Primary switchover s1 T324692 [17:11:55] T324692: Switchover s1 master (db2112 -> db2103) - https://phabricator.wikimedia.org/T324692 [17:12:17] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 38 hosts with reason: Primary switchover s1 T324692 [17:12:46] !log sukhe@cumin2002 START - Cookbook sre.dns.netbox [17:13:02] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1024.eqiad.wmnet with reason: host reimage [17:13:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2117 (T322618)', diff saved to https://phabricator.wikimedia.org/P42461 and previous config saved to /var/cache/conftool/dbconfig/20221207-171305-ladsgroup.json [17:13:07] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2124.codfw.wmnet with reason: Maintenance [17:13:08] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [17:13:20] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2124.codfw.wmnet with reason: Maintenance [17:13:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316 (T322618)', diff saved to 
https://phabricator.wikimedia.org/P42462 and previous config saved to /var/cache/conftool/dbconfig/20221207-171321-ladsgroup.json [17:13:23] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1098.eqiad.wmnet with reason: Maintenance [17:13:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2124 (T322618)', diff saved to https://phabricator.wikimedia.org/P42463 and previous config saved to /var/cache/conftool/dbconfig/20221207-171326-ladsgroup.json [17:13:36] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1098.eqiad.wmnet with reason: Maintenance [17:13:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1098:3316 (T322618)', diff saved to https://phabricator.wikimedia.org/P42464 and previous config saved to /var/cache/conftool/dbconfig/20221207-171342-ladsgroup.json [17:14:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Set db2103 with weight 0 T324692', diff saved to https://phabricator.wikimedia.org/P42465 and previous config saved to /var/cache/conftool/dbconfig/20221207-171416-ladsgroup.json [17:14:27] (03PS1) 10Andrea Denisse: netmon: Set netmon2002 the main instance in codfw [puppet] - 10https://gerrit.wikimedia.org/r/865711 (https://phabricator.wikimedia.org/T315523) [17:14:41] !log sukhe@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: lvs5002.eqsin.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin2002" [17:15:32] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1023.eqiad.wmnet with reason: host reimage [17:15:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2124 (T322618)', diff saved to https://phabricator.wikimedia.org/P42466 and previous config saved to /var/cache/conftool/dbconfig/20221207-171538-ladsgroup.json [17:15:44] (03CR) 10Andrea Denisse: [V: 
03+1] "PCC SUCCESS (NOOP 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38629/console" [puppet] - 10https://gerrit.wikimedia.org/r/865711 (https://phabricator.wikimedia.org/T315523) (owner: 10Andrea Denisse) [17:15:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316 (T322618)', diff saved to https://phabricator.wikimedia.org/P42467 and previous config saved to /var/cache/conftool/dbconfig/20221207-171551-ladsgroup.json [17:16:13] (03PS1) 10Ssingh: sites.yaml: remove decommissioned host lvs5002 [homer/public] - 10https://gerrit.wikimedia.org/r/865712 (https://phabricator.wikimedia.org/T323830) [17:16:57] (03CR) 10Andrea Denisse: [V: 03+1] "PCC results: https://puppet-compiler.wmflabs.org/output/865711/38629/" [puppet] - 10https://gerrit.wikimedia.org/r/865711 (https://phabricator.wikimedia.org/T315523) (owner: 10Andrea Denisse) [17:17:17] !log sukhe@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: lvs5002.eqsin.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin2002" [17:17:18] !log sukhe@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:17:18] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts lvs5002.eqsin.wmnet [17:17:26] 10SRE, 10ops-eqsin, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q2:rack/setup/install/decom eqsin: unified decommission task - https://phabricator.wikimedia.org/T323830 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by sukhe@cumin2002 for hosts: `lvs5002.eqsin.wmnet` - lvs5002.eqsin.wmnet... 
[17:18:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2173 (T322618)', diff saved to https://phabricator.wikimedia.org/P42468 and previous config saved to /var/cache/conftool/dbconfig/20221207-171803-ladsgroup.json [17:18:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181', diff saved to https://phabricator.wikimedia.org/P42469 and previous config saved to /var/cache/conftool/dbconfig/20221207-171822-ladsgroup.json [17:24:34] (03PS1) 10Papaul: Fix typo for sretest2002 node in site.pp [puppet] - 10https://gerrit.wikimedia.org/r/865715 (https://phabricator.wikimedia.org/T322578) [17:24:56] (03CR) 10Ssingh: [C: 03+2] sites.yaml: remove decommissioned host lvs5002 [homer/public] - 10https://gerrit.wikimedia.org/r/865712 (https://phabricator.wikimedia.org/T323830) (owner: 10Ssingh) [17:25:54] !log running homer for Gerrit: 865712 [17:25:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:26:09] !log cwhite@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host logstash1026.eqiad.wmnet with OS bullseye [17:26:34] !log cmjohnson@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - cmjohnson@cumin1001" [17:27:39] !log cmjohnson@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - cmjohnson@cumin1001" [17:29:18] (03PS2) 10Hashar: contint: add ci::master to contint1002 [puppet] - 10https://gerrit.wikimedia.org/r/865680 (https://phabricator.wikimedia.org/T313832) [17:29:37] (03PS3) 10BBlack: Add 'cdn' conftool service to all caches [puppet] - 10https://gerrit.wikimedia.org/r/863336 (https://phabricator.wikimedia.org/T324336) [17:29:39] (03PS3) 10BBlack: Switch pybal + scripts to 'cdn' service [puppet] - 10https://gerrit.wikimedia.org/r/863337 
(https://phabricator.wikimedia.org/T324336) [17:29:41] (03PS3) 10BBlack: Remove legacy varnish-fe + ats-tls conftool keys [puppet] - 10https://gerrit.wikimedia.org/r/863338 (https://phabricator.wikimedia.org/T324336) [17:29:43] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/865680 (https://phabricator.wikimedia.org/T313832) (owner: 10Hashar) [17:30:25] 10SRE, 10ops-eqsin, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q2:rack/setup/install/decom eqsin: unified decommission task - https://phabricator.wikimedia.org/T323830 (10ssingh) [17:30:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2124', diff saved to https://phabricator.wikimedia.org/P42470 and previous config saved to /var/cache/conftool/dbconfig/20221207-173045-ladsgroup.json [17:30:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316', diff saved to https://phabricator.wikimedia.org/P42471 and previous config saved to /var/cache/conftool/dbconfig/20221207-173057-ladsgroup.json [17:31:48] (03CR) 10BBlack: [C: 03+1] lvs5005: set as high-traffic2 primary LVS and remove lvs5002 (decomm) [puppet] - 10https://gerrit.wikimedia.org/r/865701 (https://phabricator.wikimedia.org/T323830) (owner: 10Ssingh) [17:32:08] (03CR) 10Papaul: [C: 03+2] Fix typo for sretest2002 node in site.pp [puppet] - 10https://gerrit.wikimedia.org/r/865715 (https://phabricator.wikimedia.org/T322578) (owner: 10Papaul) [17:32:10] (03CR) 10Ottomata: flink-kubernetes-operator - modify for WMF (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/865158 (https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata) [17:33:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181 (T322618)', diff saved to https://phabricator.wikimedia.org/P42472 and previous config saved to /var/cache/conftool/dbconfig/20221207-173329-ladsgroup.json [17:33:31] !log ladsgroup@cumin1001 START - Cookbook 
sre.hosts.downtime for 6:00:00 on db1194.eqiad.wmnet with reason: Maintenance [17:33:33] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [17:33:44] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1194.eqiad.wmnet with reason: Maintenance [17:33:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1194 (T322618)', diff saved to https://phabricator.wikimedia.org/P42473 and previous config saved to /var/cache/conftool/dbconfig/20221207-173350-ladsgroup.json [17:35:20] (03CR) 10Ssingh: [C: 03+2] lvs5005: set as high-traffic2 primary LVS and remove lvs5002 (decomm) [puppet] - 10https://gerrit.wikimedia.org/r/865701 (https://phabricator.wikimedia.org/T323830) (owner: 10Ssingh) [17:35:37] (03PS2) 10Ssingh: lvs5005: set as high-traffic2 primary LVS and remove lvs5002 (decomm) [puppet] - 10https://gerrit.wikimedia.org/r/865701 (https://phabricator.wikimedia.org/T323830) [17:36:13] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host sretest2002.codfw.wmnet with OS bullseye [17:36:21] 10SRE, 10ops-codfw, 10Patch-For-Review: codfw:test new Supermicro server - https://phabricator.wikimedia.org/T322578 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host sretest2002.codfw.wmnet with OS bullseye [17:36:30] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest2002.codfw.wmnet with OS bullseye [17:36:38] 10SRE, 10ops-codfw, 10Patch-For-Review: codfw:test new Supermicro server - https://phabricator.wikimedia.org/T322578 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host sretest2002.codfw.wmnet with OS bullseye executed with errors: - sretest2002 (**FAIL**) - **T... 
[17:36:46] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host sretest2002.codfw.wmnet with OS bullseye [17:36:54] 10SRE, 10ops-codfw, 10Patch-For-Review: codfw:test new Supermicro server - https://phabricator.wikimedia.org/T322578 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host sretest2002.codfw.wmnet with OS bullseye [17:37:36] (03PS1) 10JHathaway: Add Kelton Hurd to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/865716 (https://phabricator.wikimedia.org/T323941) [17:38:16] (03CR) 10SBassett: [C: 03+1] Add Kelton Hurd to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/865716 (https://phabricator.wikimedia.org/T323941) (owner: 10JHathaway) [17:38:24] (03CR) 10Ottomata: flink-kubernetes-operator - modify for WMF (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/865158 (https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata) [17:40:45] (03CR) 10Andrea Denisse: [V: 03+1 C: 03+2] netmon: Add the netmon2002 as a LibreNMS scap deploy target [puppet] - 10https://gerrit.wikimedia.org/r/865705 (https://phabricator.wikimedia.org/T315523) (owner: 10Andrea Denisse) [17:41:01] 10SRE, 10LDAP-Access-Requests, 10Security-Team, 10Patch-For-Review, 10SecTeam-Processed: Add Kelton Hurd to wmf ldap group - https://phabricator.wikimedia.org/T323941 (10jhathaway) 05Open→03Resolved a:03jhathaway @KHurd-WMF done! 
[17:41:38] (03PS2) 10Ladsgroup: mariadb: Promote db2103 to s1 master [puppet] - 10https://gerrit.wikimedia.org/r/865633 (https://phabricator.wikimedia.org/T324692) (owner: 10Gerrit maintenance bot) [17:41:42] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest2002.codfw.wmnet with reason: host reimage [17:41:42] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] mariadb: Promote db2103 to s1 master [puppet] - 10https://gerrit.wikimedia.org/r/865633 (https://phabricator.wikimedia.org/T324692) (owner: 10Gerrit maintenance bot) [17:42:32] (03CR) 10JHathaway: [C: 03+2] Add Kelton Hurd to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/865716 (https://phabricator.wikimedia.org/T323941) (owner: 10JHathaway) [17:42:55] !log restart pybal on lvs5005 to pick up bgp-med [17:42:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:43:20] 10SRE, 10ops-eqsin, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (10ssingh) [17:45:02] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest2002.codfw.wmnet with reason: host reimage [17:45:11] !log Starting s1 codfw failover from db2112 to db2103 - T324692 [17:45:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:45:13] T324692: Switchover s1 master (db2112 -> db2103) - https://phabricator.wikimedia.org/T324692 [17:45:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Promote db2103 to s1 primary T324692', diff saved to https://phabricator.wikimedia.org/P42474 and previous config saved to /var/cache/conftool/dbconfig/20221207-174540-ladsgroup.json [17:45:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2124', diff saved to https://phabricator.wikimedia.org/P42475 and previous config saved to /var/cache/conftool/dbconfig/20221207-174551-ladsgroup.json [17:46:05] !log ladsgroup@cumin1001 dbctl commit 
(dc=all): 'Repooling after maintenance db1098:3316', diff saved to https://phabricator.wikimedia.org/P42476 and previous config saved to /var/cache/conftool/dbconfig/20221207-174604-ladsgroup.json [17:46:32] !log aqu@deploy1002 Finished deploy [analytics/refinery@349e1cc]: Deploy HDFS usage dataset generation scripts [analytics/refinery@349e1cc] (duration: 79m 12s) [17:46:41] !log cwhite@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['logstash1026'] [17:48:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depool db2112 T324692', diff saved to https://phabricator.wikimedia.org/P42477 and previous config saved to /var/cache/conftool/dbconfig/20221207-174811-ladsgroup.json [17:48:21] (03PS1) 10JHathaway: Add Vaughn Walters to the wmf group [puppet] - 10https://gerrit.wikimedia.org/r/865718 (https://phabricator.wikimedia.org/T324515) [17:48:46] !log aqu@deploy1002 Started deploy [analytics/refinery@349e1cc] (thin): Deploy HDFS usage dataset generation scripts THIN [analytics/refinery@349e1cc] [17:48:53] !log aqu@deploy1002 Finished deploy [analytics/refinery@349e1cc] (thin): Deploy HDFS usage dataset generation scripts THIN [analytics/refinery@349e1cc] (duration: 00m 07s) [17:49:07] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2112.codfw.wmnet with reason: Maintenance [17:49:10] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2112.codfw.wmnet with reason: Maintenance [17:49:25] !log aqu@deploy1002 Started deploy [analytics/refinery@349e1cc] (hadoop-test): Deploy HDFS usage dataset generation scripts TEST [analytics/refinery@349e1cc] [17:49:49] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review, 10User-vaughnwalters, 10User-zeljkofilipin: Request for wmf group access for user: vwalters - https://phabricator.wikimedia.org/T324515 (10jhathaway) 05Open→03Resolved a:03jhathaway @vaughnwalters done! 
[17:50:41] !log aqu@deploy1002 Finished deploy [analytics/refinery@349e1cc] (hadoop-test): Deploy HDFS usage dataset generation scripts TEST [analytics/refinery@349e1cc] (duration: 01m 15s) [17:51:10] 10SRE, 10LDAP-Access-Requests, 10Security-Team, 10SecTeam-Processed: Add Kelton Hurd to wmf ldap group - https://phabricator.wikimedia.org/T323941 (10sbassett) [17:52:38] Hello, about to update varnishkafka certificates, which will entail: [17:52:38] Disabling puppet on all cp servers [17:52:38] Merging the changes made [17:52:38] verifying the keypair is updated [17:52:38] verifying that the varnishkafka instance has restarted; if not, performing a restart [17:52:39] re-enabling and running puppet on all varnishkafka instances [17:52:39] T323771 [17:52:39] T323771: Update varnishkafka client certificate for authenticating to kafka-jumbo - https://phabricator.wikimedia.org/T323771 [17:53:17] !log cwhite@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['logstash1026'] [17:54:15] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2112.codfw.wmnet with reason: Maintenance [17:54:18] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2112.codfw.wmnet with reason: Maintenance [17:54:38] 10SRE, 10LDAP-Access-Requests: Grant Access to wmde for Muhammad Jaziraly - https://phabricator.wikimedia.org/T324477 (10jhathaway) @KFrancis has @Muhammad_Yasser_Jazirahly_WMDE signed an NDA? 
[17:56:15] !log cwhite@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['logstash1026'] [17:56:38] (03PS1) 10Ssingh: hiera: lvs5003: bump bgp_med to 150 [puppet] - 10https://gerrit.wikimedia.org/r/865720 (https://phabricator.wikimedia.org/T323830) [17:57:41] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38632/console" [puppet] - 10https://gerrit.wikimedia.org/r/865720 (https://phabricator.wikimedia.org/T323830) (owner: 10Ssingh) [17:58:37] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [17:58:42] (03CR) 10Ssingh: [V: 03+1 C: 03+2] hiera: lvs5003: bump bgp_med to 150 [puppet] - 10https://gerrit.wikimedia.org/r/865720 (https://phabricator.wikimedia.org/T323830) (owner: 10Ssingh) [17:59:59] (03CR) 10JHathaway: [C: 03+2] Add Vaughn Walters to the wmf group [puppet] - 10https://gerrit.wikimedia.org/r/865718 (https://phabricator.wikimedia.org/T324515) (owner: 10JHathaway) [18:00:13] (03PS1) 10Ssingh: lvs5006: commission new LVS host (eqsin hardware refresh) [puppet] - 10https://gerrit.wikimedia.org/r/865722 (https://phabricator.wikimedia.org/T322048) [18:00:56] !log restart pybal on lvs5003 to pick up bgp-med change [18:00:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:00:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2124 (T322618)', diff saved to https://phabricator.wikimedia.org/P42478 and previous config saved to /var/cache/conftool/dbconfig/20221207-180058-ladsgroup.json [18:01:00] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2129.codfw.wmnet with reason: Maintenance [18:01:01] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [18:01:11] !log ladsgroup@cumin1001 
dbctl commit (dc=all): 'Repooling after maintenance db1098:3316 (T322618)', diff saved to https://phabricator.wikimedia.org/P42479 and previous config saved to /var/cache/conftool/dbconfig/20221207-180110-ladsgroup.json [18:01:12] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1113.eqiad.wmnet with reason: Maintenance [18:01:13] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2129.codfw.wmnet with reason: Maintenance [18:01:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2129 (T322618)', diff saved to https://phabricator.wikimedia.org/P42480 and previous config saved to /var/cache/conftool/dbconfig/20221207-180119-ladsgroup.json [18:01:26] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1113.eqiad.wmnet with reason: Maintenance [18:01:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1113:3316 (T322618)', diff saved to https://phabricator.wikimedia.org/P42481 and previous config saved to /var/cache/conftool/dbconfig/20221207-180132-ladsgroup.json [18:01:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316 (T322618)', diff saved to https://phabricator.wikimedia.org/P42482 and previous config saved to /var/cache/conftool/dbconfig/20221207-180140-ladsgroup.json [18:03:19] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [18:03:19] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sretest2002.codfw.wmnet with OS bullseye [18:03:27] 10SRE, 10ops-codfw, 10Patch-For-Review: codfw:test new Supermicro server - https://phabricator.wikimedia.org/T322578 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host sretest2002.codfw.wmnet with OS bullseye completed: - 
sretest2002 (**PASS**) - Downtimed on I... [18:03:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2129 (T322618)', diff saved to https://phabricator.wikimedia.org/P42483 and previous config saved to /var/cache/conftool/dbconfig/20221207-180331-ladsgroup.json [18:04:53] !log cwhite@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['logstash1026'] [18:05:48] 10SRE, 10ops-codfw, 10Patch-For-Review: codfw:test new Supermicro server - https://phabricator.wikimedia.org/T322578 (10Papaul) [18:05:58] (03PS7) 10Hnowlan: maps: remove tilerator and cassandra [puppet] - 10https://gerrit.wikimedia.org/r/760619 (https://phabricator.wikimedia.org/T298246) [18:06:55] !log cwhite@cumin2002 START - Cookbook sre.hosts.reimage for host logstash1026.eqiad.wmnet with OS bullseye [18:09:32] 10SRE, 10ops-codfw, 10Patch-For-Review: codfw:test new Supermicro server - https://phabricator.wikimedia.org/T322578 (10Papaul) [18:09:54] 10SRE, 10LDAP-Access-Requests: Logstash access for contractor Wangombe - https://phabricator.wikimedia.org/T318209 (10jhathaway) @Nikerabbit when does their contract expire, so I can document it in our user database? [18:10:46] (03CR) 10Ottomata: flink and flink-kubernetes-operator image (032 comments) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/858356 (https://phabricator.wikimedia.org/T316519) (owner: 10Ottomata) [18:12:19] (03PS8) 10Hnowlan: maps: remove tilerator and cassandra [puppet] - 10https://gerrit.wikimedia.org/r/760619 (https://phabricator.wikimedia.org/T298246) [18:13:49] 10SRE, 10LDAP-Access-Requests: Grant Access to ciadmin for Dom Walden - https://phabricator.wikimedia.org/T323549 (10dom_walden) >>! In T323549#8451455, @jhathaway wrote: > @dom_walden done! Thanks! 
[18:14:31] (03CR) 10Hnowlan: maps: remove tilerator and cassandra (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/760619 (https://phabricator.wikimedia.org/T298246) (owner: 10Hnowlan) [18:16:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316', diff saved to https://phabricator.wikimedia.org/P42484 and previous config saved to /var/cache/conftool/dbconfig/20221207-181647-ladsgroup.json [18:18:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2129', diff saved to https://phabricator.wikimedia.org/P42485 and previous config saved to /var/cache/conftool/dbconfig/20221207-181838-ladsgroup.json [18:19:28] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:19:36] (03CR) 10RLazarus: [C: 03+1] Add Vaughn Walters to the wmf group [puppet] - 10https://gerrit.wikimedia.org/r/865718 (https://phabricator.wikimedia.org/T324515) (owner: 10JHathaway) [18:23:59] !log cwhite@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on logstash1026.eqiad.wmnet with reason: host reimage [18:26:18] (03CR) 10BBlack: [C: 03+1] "LGTM, great work!" [puppet] - 10https://gerrit.wikimedia.org/r/857748 (https://phabricator.wikimedia.org/T315676) (owner: 10Vgutierrez) [18:27:02] !log cwhite@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on logstash1026.eqiad.wmnet with reason: host reimage [18:27:22] (03CR) 10BBlack: [C: 03+1] "In general this seems like it's on the right track. 
Given the complexity, I wouldn't be shocked if we find we need minor post-merge fixup" [puppet] - 10https://gerrit.wikimedia.org/r/824769 (https://phabricator.wikimedia.org/T315676) (owner: 10Isaac Johnson) [18:28:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1194 (T322618)', diff saved to https://phabricator.wikimedia.org/P42486 and previous config saved to /var/cache/conftool/dbconfig/20221207-182808-ladsgroup.json [18:28:12] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [18:31:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316', diff saved to https://phabricator.wikimedia.org/P42487 and previous config saved to /var/cache/conftool/dbconfig/20221207-183154-ladsgroup.json [18:32:33] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [18:32:36] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:32:52] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: hw troubleshooting: PSU failure for restbase1018.eqiad.wmnet - https://phabricator.wikimedia.org/T324572 (10wiki_willy) a:05Cmjohnson→03Jclark-ctr [18:33:30] (03CR) 10Krinkle: [C: 04-1] "I suspect this is no longer needed with the x2 replicas removed from db config. 
Please confirm and close or clarify accordingly :)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/828072 (https://phabricator.wikimedia.org/T312809) (owner: 10Aaron Schulz) [18:33:34] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering-Planning: hw troubleshooting: power supply for an-worker1184.eqiad.wmnet - https://phabricator.wikimedia.org/T324559 (10wiki_willy) a:05Cmjohnson→03Jclark-ctr [18:33:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2129', diff saved to https://phabricator.wikimedia.org/P42488 and previous config saved to /var/cache/conftool/dbconfig/20221207-183344-ladsgroup.json [18:41:42] 10SRE, 10ops-eqiad, 10Data-Persistence (work done), 10Phabricator, and 3 others: decommission phab1001.eqiad.wmnet - https://phabricator.wikimedia.org/T323418 (10Dzahn) Thank you @Marostegui , perfect :) [18:42:33] (03CR) 10Dzahn: [C: 03+1] mariadb: remove phab1001 from production-m3 grants (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/858419 (https://phabricator.wikimedia.org/T323418) (owner: 10Dzahn) [18:42:46] (03Abandoned) 10Dzahn: mariadb: remove phab1001 from production-m3 grants [puppet] - 10https://gerrit.wikimedia.org/r/858419 (https://phabricator.wikimedia.org/T323418) (owner: 10Dzahn) [18:43:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1194', diff saved to https://phabricator.wikimedia.org/P42489 and previous config saved to /var/cache/conftool/dbconfig/20221207-184315-ladsgroup.json [18:45:42] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - cmjohnson@cumin1001" [18:45:42] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1024.eqiad.wmnet with OS bullseye [18:45:47] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes102[34] - https://phabricator.wikimedia.org/T313873 
(10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host kubernetes1024.eqiad.wmnet with OS bullseye completed: - kubernetes1024 (**WARN... [18:45:48] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - cmjohnson@cumin1001" [18:45:48] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1023.eqiad.wmnet with OS bullseye [18:45:54] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes102[34] - https://phabricator.wikimedia.org/T313873 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host kubernetes1023.eqiad.wmnet with OS bullseye completed: - kubernetes1023 (**WARN... [18:47:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316 (T322618)', diff saved to https://phabricator.wikimedia.org/P42490 and previous config saved to /var/cache/conftool/dbconfig/20221207-184700-ladsgroup.json [18:47:02] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1131.eqiad.wmnet with reason: Maintenance [18:47:05] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [18:47:16] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1131.eqiad.wmnet with reason: Maintenance [18:47:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1131 (T322618)', diff saved to https://phabricator.wikimedia.org/P42491 and previous config saved to /var/cache/conftool/dbconfig/20221207-184722-ladsgroup.json [18:47:53] (03PS1) 10RobH: r650xs updates [software] - 10https://gerrit.wikimedia.org/r/865724 [18:48:18] (03CR) 10RobH: [C: 03+2] r650xs updates [software] - 10https://gerrit.wikimedia.org/r/865724 (owner: 10RobH) [18:48:31] !log 
ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131 (T322618)', diff saved to https://phabricator.wikimedia.org/P42492 and previous config saved to /var/cache/conftool/dbconfig/20221207-184830-ladsgroup.json [18:48:48] (03Merged) 10jenkins-bot: r650xs updates [software] - 10https://gerrit.wikimedia.org/r/865724 (owner: 10RobH) [18:48:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2129 (T322618)', diff saved to https://phabricator.wikimedia.org/P42493 and previous config saved to /var/cache/conftool/dbconfig/20221207-184851-ladsgroup.json [18:48:53] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2141.codfw.wmnet with reason: Maintenance [18:49:11] !log cwhite@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host logstash1026.eqiad.wmnet with OS bullseye [18:49:17] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2141.codfw.wmnet with reason: Maintenance [18:49:24] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2158.codfw.wmnet with reason: Maintenance [18:49:37] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2158.codfw.wmnet with reason: Maintenance [18:49:39] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2095.codfw.wmnet with reason: Maintenance [18:49:52] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2095.codfw.wmnet with reason: Maintenance [18:49:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2158 (T322618)', diff saved to https://phabricator.wikimedia.org/P42494 and previous config saved to /var/cache/conftool/dbconfig/20221207-184958-ladsgroup.json [18:52:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2158 (T322618)', diff saved to https://phabricator.wikimedia.org/P42495 and previous 
config saved to /var/cache/conftool/dbconfig/20221207-185210-ladsgroup.json [18:52:14] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [18:56:49] (03CR) 10Dzahn: "moving ahead with this. contint1001 has been breaking. and this is existing group on new host which will turn into the same role. it needs" [puppet] - 10https://gerrit.wikimedia.org/r/865672 (https://phabricator.wikimedia.org/T313832) (owner: 10Hashar) [18:56:53] (03CR) 10Dzahn: [C: 03+2] contint: give RelEng access to contint1002 [puppet] - 10https://gerrit.wikimedia.org/r/865672 (https://phabricator.wikimedia.org/T313832) (owner: 10Hashar) [18:56:59] (03PS3) 10Dzahn: contint: give RelEng access to contint1002 [puppet] - 10https://gerrit.wikimedia.org/r/865672 (https://phabricator.wikimedia.org/T313832) (owner: 10Hashar) [18:58:01] (03PS4) 10Dzahn: contint: give RelEng access to contint1002 [puppet] - 10https://gerrit.wikimedia.org/r/865672 (https://phabricator.wikimedia.org/T313832) (owner: 10Hashar) [18:58:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1194', diff saved to https://phabricator.wikimedia.org/P42496 and previous config saved to /var/cache/conftool/dbconfig/20221207-185821-ladsgroup.json [19:00:04] ^demon and dancy: #bothumor I � Unicode. All rise for Train log triage with CPT deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221207T1900). [19:00:04] ^demon and dancy: OwO what's this, a deployment window?? MediaWiki train - Utc-7 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221207T1900). nyaa~ [19:00:25] hmm [19:00:46] (03CR) 10Dzahn: "You have now shell access." 
[puppet] - 10https://gerrit.wikimedia.org/r/865672 (https://phabricator.wikimedia.org/T313832) (owner: 10Hashar) [19:02:23] (03PS2) 10Dzahn: contint: add contint1002 as a scap target [puppet] - 10https://gerrit.wikimedia.org/r/865681 (https://phabricator.wikimedia.org/T313832) (owner: 10Hashar) [19:03:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131', diff saved to https://phabricator.wikimedia.org/P42497 and previous config saved to /var/cache/conftool/dbconfig/20221207-190337-ladsgroup.json [19:05:41] (03PS4) 10Slyngshede: sre.ganeti.reimage: add new cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/865057 (https://phabricator.wikimedia.org/T306661) [19:06:13] (03CR) 10Slyngshede: sre.ganeti.reimage: add new cookbook (0311 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/865057 (https://phabricator.wikimedia.org/T306661) (owner: 10Slyngshede) [19:07:08] (03CR) 10Slyngshede: "Thanks, the comments helped a lot in clarifying the work needed to be done." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/865057 (https://phabricator.wikimedia.org/T306661) (owner: 10Slyngshede) [19:07:16] (03PS5) 10Slyngshede: sre.ganeti.reimage: add new cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/865057 (https://phabricator.wikimedia.org/T306661) [19:07:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2158', diff saved to https://phabricator.wikimedia.org/P42498 and previous config saved to /var/cache/conftool/dbconfig/20221207-190717-ladsgroup.json [19:07:18] (03CR) 10CI reject: [V: 04-1] sre.ganeti.reimage: add new cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/865057 (https://phabricator.wikimedia.org/T306661) (owner: 10Slyngshede) [19:08:06] 10SRE, 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team, 10serviceops-collab, 10Patch-For-Review: contint1002 service implementation tracking - https://phabricator.wikimedia.org/T313832 (10Dzahn) I merged your change https://gerrit.wikimedia.org/r/c/operations/puppet/+/865672/4 so now... 
[19:08:49] (03CR) 10CI reject: [V: 04-1] sre.ganeti.reimage: add new cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/865057 (https://phabricator.wikimedia.org/T306661) (owner: 10Slyngshede) [19:09:40] (03CR) 10Herron: [C: 03+1] netmon: Remove rsync quickdatacopy failover restrictions [puppet] - 10https://gerrit.wikimedia.org/r/865708 (https://phabricator.wikimedia.org/T309074) (owner: 10Andrea Denisse) [19:10:25] (03CR) 10Andrea Denisse: [V: 03+1 C: 03+2] netmon: Remove rsync quickdatacopy failover restrictions [puppet] - 10https://gerrit.wikimedia.org/r/865708 (https://phabricator.wikimedia.org/T309074) (owner: 10Andrea Denisse) [19:11:05] (03PS6) 10Slyngshede: sre.ganeti.reimage: add new cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/865057 (https://phabricator.wikimedia.org/T306661) [19:12:55] (NodeTextfileStale) firing: (48) Stale textfile for cp1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [19:12:56] (03CR) 10CI reject: [V: 04-1] sre.ganeti.reimage: add new cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/865057 (https://phabricator.wikimedia.org/T306661) (owner: 10Slyngshede) [19:13:21] (03Restored) 10Samtar: InitialiseSettings.php: Add oathauth-verify-user to default bureaucrat [mediawiki-config] - 10https://gerrit.wikimedia.org/r/835252 (https://phabricator.wikimedia.org/T265726) (owner: 10Samtar) [19:13:29] (03PS2) 10Samtar: InitialiseSettings.php: Add oathauth-verify-user to default bureaucrat [mediawiki-config] - 10https://gerrit.wikimedia.org/r/835252 (https://phabricator.wikimedia.org/T265726) [19:13:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1194 (T322618)', diff saved to https://phabricator.wikimedia.org/P42499 and previous config saved to /var/cache/conftool/dbconfig/20221207-191328-ladsgroup.json [19:13:32] 
T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [19:15:14] (03CR) 10Dzahn: [C: 03+2] hiera: reorder contint1001 entries [puppet] - 10https://gerrit.wikimedia.org/r/865649 (owner: 10Hashar) [19:16:57] (03CR) 10Dzahn: [C: 03+2] "noop confirmed" [puppet] - 10https://gerrit.wikimedia.org/r/865649 (owner: 10Hashar) [19:18:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131', diff saved to https://phabricator.wikimedia.org/P42500 and previous config saved to /var/cache/conftool/dbconfig/20221207-191843-ladsgroup.json [19:19:06] (03PS7) 10Slyngshede: sre.ganeti.reimage: add new cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/865057 (https://phabricator.wikimedia.org/T306661) [19:20:38] (03CR) 10CI reject: [V: 04-1] sre.ganeti.reimage: add new cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/865057 (https://phabricator.wikimedia.org/T306661) (owner: 10Slyngshede) [19:21:45] (03CR) 10Volans: "Addressed comments" [puppet] - 10https://gerrit.wikimedia.org/r/864729 (owner: 10Volans) [19:22:07] (03PS2) 10Volans: cumin: add an audit report for insetup servers [puppet] - 10https://gerrit.wikimedia.org/r/864729 [19:22:09] (03PS1) 10Volans: profile::cumin: use bool2str to simplify code [puppet] - 10https://gerrit.wikimedia.org/r/865728 [19:22:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2158', diff saved to https://phabricator.wikimedia.org/P42501 and previous config saved to /var/cache/conftool/dbconfig/20221207-192223-ladsgroup.json [19:22:44] (03PS8) 10Slyngshede: sre.ganeti.reimage: add new cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/865057 (https://phabricator.wikimedia.org/T306661) [19:24:37] (03CR) 10CI reject: [V: 04-1] sre.ganeti.reimage: add new cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/865057 (https://phabricator.wikimedia.org/T306661) (owner: 10Slyngshede) [19:25:21] (03PS9) 10Slyngshede: 
sre.ganeti.reimage: add new cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/865057 (https://phabricator.wikimedia.org/T306661) [19:28:00] 10SRE, 10LDAP-Access-Requests: Grant Access to wmde for Muhammad Jaziraly - https://phabricator.wikimedia.org/T324477 (10KFrancis) @jhathaway, I don't have one on file, but can process one. I'll need Muhammad Jaziraly's WMDE email address. Please send that to kfrancis@wikimedia.org. Thanks! [19:33:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131 (T322618)', diff saved to https://phabricator.wikimedia.org/P42502 and previous config saved to /var/cache/conftool/dbconfig/20221207-193350-ladsgroup.json [19:33:52] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1140.eqiad.wmnet with reason: Maintenance [19:33:55] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [19:34:05] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1140.eqiad.wmnet with reason: Maintenance [19:34:09] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1165.eqiad.wmnet with reason: Maintenance [19:34:22] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1165.eqiad.wmnet with reason: Maintenance [19:34:23] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [19:34:39] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [19:34:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1165 (T322618)', diff saved to https://phabricator.wikimedia.org/P42503 and previous config saved to /var/cache/conftool/dbconfig/20221207-193445-ladsgroup.json [19:35:53] !log 
ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T322618)', diff saved to https://phabricator.wikimedia.org/P42504 and previous config saved to /var/cache/conftool/dbconfig/20221207-193553-ladsgroup.json [19:36:45] (03CR) 10Herron: [C: 03+1] netmon: Add the netmon2002 instance as a ganeti rapi node. [puppet] - 10https://gerrit.wikimedia.org/r/865707 (https://phabricator.wikimedia.org/T315523) (owner: 10Andrea Denisse) [19:37:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2158 (T322618)', diff saved to https://phabricator.wikimedia.org/P42505 and previous config saved to /var/cache/conftool/dbconfig/20221207-193730-ladsgroup.json [19:37:32] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2169.codfw.wmnet with reason: Maintenance [19:37:45] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2169.codfw.wmnet with reason: Maintenance [19:37:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2169:3316 (T322618)', diff saved to https://phabricator.wikimedia.org/P42506 and previous config saved to /var/cache/conftool/dbconfig/20221207-193751-ladsgroup.json [19:40:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3316 (T322618)', diff saved to https://phabricator.wikimedia.org/P42507 and previous config saved to /var/cache/conftool/dbconfig/20221207-194003-ladsgroup.json [19:40:07] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [19:43:55] (03CR) 10Raymond Ndibe: webservice cli: allow for deployment of custom harbor images (031 comment) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/857588 (https://phabricator.wikimedia.org/T293645) (owner: 10Raymond Ndibe) [19:50:38] 10SRE, 10LDAP-Access-Requests: Grant Access to wmde for Muhammad Jaziraly - https://phabricator.wikimedia.org/T324477 (10jhathaway) 
@KFrancis, email sent, thanks! [19:50:43] (03CR) 10Dzahn: "https://puppet-compiler.wmflabs.org/output/865680/38635/" [puppet] - 10https://gerrit.wikimedia.org/r/865680 (https://phabricator.wikimedia.org/T313832) (owner: 10Hashar) [19:51:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P42508 and previous config saved to /var/cache/conftool/dbconfig/20221207-195100-ladsgroup.json [19:51:05] (03CR) 10Dzahn: "deploying first on registry hosts, then contint old, then contint new..wip" [puppet] - 10https://gerrit.wikimedia.org/r/865680 (https://phabricator.wikimedia.org/T313832) (owner: 10Hashar) [19:51:10] (03CR) 10Dzahn: [C: 03+2] contint: add ci::master to contint1002 [puppet] - 10https://gerrit.wikimedia.org/r/865680 (https://phabricator.wikimedia.org/T313832) (owner: 10Hashar) [19:51:21] (03PS3) 10Dzahn: contint: add ci::master to contint1002 [puppet] - 10https://gerrit.wikimedia.org/r/865680 (https://phabricator.wikimedia.org/T313832) (owner: 10Hashar) [19:53:00] (03PS1) 10Southparkfan: rsyslog: add support for openssl netstream driver [puppet] - 10https://gerrit.wikimedia.org/r/865731 [19:53:23] (03PS1) 10Ssingh: sites.yaml: add lvs5006 (eqsin hardware refresh) [homer/public] - 10https://gerrit.wikimedia.org/r/865732 (https://phabricator.wikimedia.org/T322048) [19:53:48] (03CR) 10CI reject: [V: 04-1] rsyslog: add support for openssl netstream driver [puppet] - 10https://gerrit.wikimedia.org/r/865731 (owner: 10Southparkfan) [19:53:55] !log registry* (docker registry HA) - adding contint1002 to allowed hosts gerrit:865680 T313832 [19:53:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:53:59] T313832: contint1002 service implementation tracking - https://phabricator.wikimedia.org/T313832 [19:55:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3316', diff saved to https://phabricator.wikimedia.org/P42509 and 
previous config saved to /var/cache/conftool/dbconfig/20221207-195510-ladsgroup.json [19:56:13] (03PS2) 10Southparkfan: rsyslog: add support for openssl netstream driver [puppet] - 10https://gerrit.wikimedia.org/r/865731 (https://phabricator.wikimedia.org/T324623) [19:56:21] 10SRE, 10ops-eqsin, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (10RobH) 05In progress→03Open a:05RobH→03None [19:56:42] (03CR) 10Dzahn: "deployed on registry*, deployed on contint2002 (noop)" [puppet] - 10https://gerrit.wikimedia.org/r/865680 (https://phabricator.wikimedia.org/T313832) (owner: 10Hashar) [19:57:01] (03CR) 10CI reject: [V: 04-1] rsyslog: add support for openssl netstream driver [puppet] - 10https://gerrit.wikimedia.org/r/865731 (https://phabricator.wikimedia.org/T324623) (owner: 10Southparkfan) [19:57:02] 10SRE, 10ops-eqsin, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (10RobH) a:03ssingh @ssingh, Once the final OS installations are completed please resolve this task. Thanks! 
[19:58:51] (03CR) 10Dzahn: [C: 03+2] "deployed on contint2001, contint1001 (firewall only changes)" [puppet] - 10https://gerrit.wikimedia.org/r/865680 (https://phabricator.wikimedia.org/T313832) (owner: 10Hashar) [19:59:02] (03PS3) 10Southparkfan: rsyslog: add support for openssl netstream driver [puppet] - 10https://gerrit.wikimedia.org/r/865731 (https://phabricator.wikimedia.org/T324623) [19:59:15] (03PS12) 10Ryan Kemper: add grizzly dashboard for WDQS uptime [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/862178 (https://phabricator.wikimedia.org/T323064) [19:59:54] (03CR) 10Ryan Kemper: add grizzly dashboard for WDQS uptime (033 comments) [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/862178 (https://phabricator.wikimedia.org/T323064) (owner: 10Ryan Kemper) [20:00:02] (03CR) 10Dzahn: [C: 03+2] "Antoine, it's fine with existing servers but for the new server it's missing something:" [puppet] - 10https://gerrit.wikimedia.org/r/865680 (https://phabricator.wikimedia.org/T313832) (owner: 10Hashar) [20:00:13] !log contint* - deploying firewall changes to add contint1002 - T313832 [20:00:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:00:17] T313832: contint1002 service implementation tracking - https://phabricator.wikimedia.org/T313832 [20:02:31] (03CR) 10Dzahn: [C: 03+2] "it's because these are done based on host names, best to avoid that if we can:" [puppet] - 10https://gerrit.wikimedia.org/r/865680 (https://phabricator.wikimedia.org/T313832) (owner: 10Hashar) [20:04:49] (03PS1) 10TrainBranchBot: group1 wikis to 1.40.0-wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/865733 (https://phabricator.wikimedia.org/T320518) [20:04:51] (03CR) 10TrainBranchBot: [C: 03+2] group1 wikis to 1.40.0-wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/865733 (https://phabricator.wikimedia.org/T320518) (owner: 10TrainBranchBot) [20:05:21] 10SRE, 10Cloud-Services, 10observability, 10Patch-For-Review, and 3 others: 
Switch rsyslog from gtls to ossl - https://phabricator.wikimedia.org/T324623 (10Southparkfan) I have tested https://gerrit.wikimedia.org/r/c/operations/puppet/+/865731 by using `rsyslog-openssl` on one syslog client and one syslog... [20:05:44] (03Merged) 10jenkins-bot: group1 wikis to 1.40.0-wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/865733 (https://phabricator.wikimedia.org/T320518) (owner: 10TrainBranchBot) [20:06:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P42510 and previous config saved to /var/cache/conftool/dbconfig/20221207-200606-ladsgroup.json [20:06:56] (03PS1) 10Dzahn: contint: add docker::settings for contint1002 [puppet] - 10https://gerrit.wikimedia.org/r/865734 (https://phabricator.wikimedia.org/T313832) [20:07:06] (03CR) 10Dzahn: [C: 03+2] "https://gerrit.wikimedia.org/r/c/operations/puppet/+/865734/" [puppet] - 10https://gerrit.wikimedia.org/r/865680 (https://phabricator.wikimedia.org/T313832) (owner: 10Hashar) [20:08:28] (03CR) 10Dzahn: [C: 03+2] contint: add docker::settings for contint1002 [puppet] - 10https://gerrit.wikimedia.org/r/865734 (https://phabricator.wikimedia.org/T313832) (owner: 10Dzahn) [20:09:49] !log dzahn@cumin2002 START - Cookbook sre.hosts.downtime for 14 days, 0:00:00 on contint1002.wikimedia.org with reason: new setup [20:10:04] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 14 days, 0:00:00 on contint1002.wikimedia.org with reason: new setup [20:10:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3316', diff saved to https://phabricator.wikimedia.org/P42511 and previous config saved to /var/cache/conftool/dbconfig/20221207-201016-ladsgroup.json [20:13:49] !log demon@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.40.0-wmf.13 refs T320518 [20:13:52] T320518: 1.40.0-wmf.13 deployment blockers - 
https://phabricator.wikimedia.org/T320518 [20:14:03] 10ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T321572 (10Jclark-ctr) 05Open→03Resolved replaced optic and moved to new port [20:16:01] (CirrusSearchJobQueueBacklogTooBig) firing: (4) CirrusSearch job topic eqiad.cpjobqueue.partitioned.mediawiki.job.cirrusSearchElasticaWrite is heavily backlogged with 6.211M messages - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJobQueueBacklogTooBig [20:18:38] RECOVERY - Check systemd state on an-tool1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:20:31] (03CR) 10Andrew Bogott: [C: 03+1] "confirmed no-op where it counts: https://puppet-compiler.wmflabs.org/output/865731/38636/centrallog1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/865731 (https://phabricator.wikimedia.org/T324623) (owner: 10Southparkfan) [20:20:53] !log demon@deploy1002 Synchronized php: group1 wikis to 1.40.0-wmf.13 refs T320518 (duration: 07m 03s) [20:20:58] T320518: 1.40.0-wmf.13 deployment blockers - https://phabricator.wikimedia.org/T320518 [20:21:01] (CirrusSearchJobQueueBacklogTooBig) resolved: (4) CirrusSearch job topic eqiad.cpjobqueue.partitioned.mediawiki.job.cirrusSearchElasticaWrite is heavily backlogged with 1.574M messages - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJobQueueBacklogTooBig [20:21:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T322618)', diff saved to https://phabricator.wikimedia.org/P42512 and previous config saved to /var/cache/conftool/dbconfig/20221207-202113-ladsgroup.json [20:21:15] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1168.eqiad.wmnet with reason: Maintenance [20:21:17] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [20:21:28] !log ladsgroup@cumin1001 
END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1168.eqiad.wmnet with reason: Maintenance [20:21:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1168 (T322618)', diff saved to https://phabricator.wikimedia.org/P42513 and previous config saved to /var/cache/conftool/dbconfig/20221207-202134-ladsgroup.json [20:22:30] ^^ anyone doing any maintenance that would explain those CirrusSearch job queue alerts? [20:23:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T322618)', diff saved to https://phabricator.wikimedia.org/P42514 and previous config saved to /var/cache/conftool/dbconfig/20221207-202343-ladsgroup.json [20:24:12] PROBLEM - Check systemd state on an-tool1005 is CRITICAL: CRITICAL - degraded: The following units failed: envoyproxy.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:24:22] PROBLEM - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is CRITICAL: 112 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:25:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3316 (T322618)', diff saved to https://phabricator.wikimedia.org/P42515 and previous config saved to /var/cache/conftool/dbconfig/20221207-202524-ladsgroup.json [20:25:26] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2171.codfw.wmnet with reason: Maintenance [20:25:39] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2171.codfw.wmnet with reason: Maintenance [20:25:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2171:3316 (T322618)', diff saved to https://phabricator.wikimedia.org/P42516 and previous config saved to /var/cache/conftool/dbconfig/20221207-202545-ladsgroup.json [20:25:59] inflatador: when 
did they start? (I'm not, but it might help track down a related SAL entry or something) [20:26:31] (03PS1) 10Dzahn: ci: move docker::settings to common, avoid host names [puppet] - 10https://gerrit.wikimedia.org/r/865735 (https://phabricator.wikimedia.org/T313832) [20:27:38] (03CR) 10Effie Mouzeli: [C: 03+1] "Commit message needs some minor mending, other than that LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/865580 (https://phabricator.wikimedia.org/T324649) (owner: 10Clément Goubert) [20:27:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3316 (T322618)', diff saved to https://phabricator.wikimedia.org/P42517 and previous config saved to /var/cache/conftool/dbconfig/20221207-202758-ladsgroup.json [20:28:02] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [20:30:09] (03PS5) 10Andrew Bogott: remote syslog: allow hiera config of rsyslog TLS CA [puppet] - 10https://gerrit.wikimedia.org/r/865184 (https://phabricator.wikimedia.org/T127717) [20:31:27] (03CR) 10Dzahn: "This does not work on a new contint master. When the ci::master role was applied now on contint1002 the contint-admins group is not create" [puppet] - 10https://gerrit.wikimedia.org/r/791565 (https://phabricator.wikimedia.org/T305729) (owner: 10Jbond) [20:31:56] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops-radar: Decommission mw13[07-48] - https://phabricator.wikimedia.org/T306162 (10jijiki) [20:33:16] (03CR) 10Dzahn: [C: 03+2] "most things worked after the one follow-up above. We do have some remaining issues though, or at least one which comes from:" [puppet] - 10https://gerrit.wikimedia.org/r/865680 (https://phabricator.wikimedia.org/T313832) (owner: 10Hashar) [20:34:03] RhinosF1 first alert popped around 20:16 UTC (~20m ago) [20:34:21] I see a DB maintenance, maybe that could explain it? 
[20:34:22] (03CR) 10Dzahn: [C: 03+2] "And no worries, I have confirmed jenkins, zuul and zuul-merger are dead and masked." [puppet] - 10https://gerrit.wikimedia.org/r/865680 (https://phabricator.wikimedia.org/T313832) (owner: 10Hashar) [20:36:06] inflatador: DB maintenance happens near 24/7 now [20:36:18] I was more thinking that the alert matched the train [20:36:23] ah, maybe a red herring then [20:36:39] (03PS1) 10Effie Mouzeli: site.pp Productionise mc20[39-55] [puppet] - 10https://gerrit.wikimedia.org/r/865736 (https://phabricator.wikimedia.org/T293012) [20:36:48] (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on cloudcumin2001:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [20:37:07] inflatador: the alert resolved, didn't it? So could it be something that was transient during sync? [20:37:40] Is there any other error mediawiki side to show why they might have backed up / failed / been generated more than normal? [20:38:01] (03CR) 10Ssingh: [C: 03+2] lvs5006: commission new LVS host (eqsin hardware refresh) [puppet] - 10https://gerrit.wikimedia.org/r/865722 (https://phabricator.wikimedia.org/T322048) (owner: 10Ssingh) [20:38:24] (03PS2) 10Ssingh: lvs5006: commission new LVS host (eqsin hardware refresh) [puppet] - 10https://gerrit.wikimedia.org/r/865722 (https://phabricator.wikimedia.org/T322048) [20:38:34] RhinosF1 yeah, that's what I'm curious about myself. 
I found a Kafka dashboard ( https://grafana-rw.wikimedia.org/d/000000234/kafka-by-topic?forceLogin&orgId=1&refresh=5m&var-datasource=eqiad%20prometheus%2Fops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=All&var-topic=eqiad.mediawiki.job.cirrusSearchElasticaWrite ) but I'm not sure it has any useful info [20:38:39] (03PS1) 10DDesouza: Remove Research Incentive survey from enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/865737 (https://phabricator.wikimedia.org/T321930) [20:38:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P42519 and previous config saved to /var/cache/conftool/dbconfig/20221207-203849-ladsgroup.json [20:39:15] (03CR) 10RLazarus: [C: 03+1] site.pp Productionise mc20[39-55] [puppet] - 10https://gerrit.wikimedia.org/r/865736 (https://phabricator.wikimedia.org/T293012) (owner: 10Effie Mouzeli) [20:40:00] (03CR) 10Herron: [C: 03+1] netmon: Set netmon2002 the main instance in codfw [puppet] - 10https://gerrit.wikimedia.org/r/865711 (https://phabricator.wikimedia.org/T315523) (owner: 10Andrea Denisse) [20:40:19] (03PS2) 10DDesouza: Remove Research Incentive survey from frwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/865737 (https://phabricator.wikimedia.org/T321930) [20:40:26] (03CR) 10Herron: [C: 03+1] netmon: Remove the netmon2001 instance as passive node [puppet] - 10https://gerrit.wikimedia.org/r/865695 (https://phabricator.wikimedia.org/T322695) (owner: 10Andrea Denisse) [20:40:48] inflatador: not really sure where to look either. Probably a serviceops question if it's concerning to you. 
[20:40:49] (03CR) 10Herron: [C: 03+1] netmon: Remove netmon2001 from the alertmanager rw api [puppet] - 10https://gerrit.wikimedia.org/r/865693 (https://phabricator.wikimedia.org/T322695) (owner: 10Andrea Denisse) [20:41:59] (03PS1) 10JHathaway: Add Jennifer Hancock to datacenter-ops [puppet] - 10https://gerrit.wikimedia.org/r/865738 (https://phabricator.wikimedia.org/T324585) [20:42:30] No worries, it's not urgent ATM [20:43:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3316', diff saved to https://phabricator.wikimedia.org/P42520 and previous config saved to /var/cache/conftool/dbconfig/20221207-204304-ladsgroup.json [20:43:15] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host lvs5006.eqsin.wmnet with OS buster [20:43:26] 10SRE, 10ops-eqsin, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host lvs5006.eqsin.wmnet with OS buster [20:43:43] (03PS1) 10Dzahn: ci::master: hack to bootstrap new server contint1002 [puppet] - 10https://gerrit.wikimedia.org/r/865739 (https://phabricator.wikimedia.org/T313832) [20:44:04] (03PS1) 10DDesouza: Remove Research Incentive survey from swwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/865740 (https://phabricator.wikimedia.org/T321252) [20:47:26] (03CR) 10Dzahn: [C: 03+2] ci::master: hack to bootstrap new server contint1002 [puppet] - 10https://gerrit.wikimedia.org/r/865739 (https://phabricator.wikimedia.org/T313832) (owner: 10Dzahn) [20:47:27] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes102[34] - https://phabricator.wikimedia.org/T313873 (10Cmjohnson) [20:47:45] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes102[34] - https://phabricator.wikimedia.org/T313873 (10Cmjohnson) 05Open→03Resolved completed [20:48:28] (03PS1) 10Ssingh: lvs5006: 
set as secondary LVS and remove lvs5003 (decomm) [puppet] - 10https://gerrit.wikimedia.org/r/865742 (https://phabricator.wikimedia.org/T323830) [20:49:31] (03CR) 10CI reject: [V: 04-1] lvs5006: set as secondary LVS and remove lvs5003 (decomm) [puppet] - 10https://gerrit.wikimedia.org/r/865742 (https://phabricator.wikimedia.org/T323830) (owner: 10Ssingh) [20:50:11] (03PS1) 10Dzahn: Revert "ci::master: hack to bootstrap new server contint1002" [puppet] - 10https://gerrit.wikimedia.org/r/865525 [20:50:42] (03PS1) 10DDesouza: Deploy Research Incentive survey on yowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/865744 (https://phabricator.wikimedia.org/T321249) [20:51:45] (JobUnavailable) firing: Reduced availability for job trafficserver-text in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [20:52:23] (03CR) 10Ssingh: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/865742 (https://phabricator.wikimedia.org/T323830) (owner: 10Ssingh) [20:53:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P42521 and previous config saved to /var/cache/conftool/dbconfig/20221207-205356-ladsgroup.json [20:55:53] (03PS5) 10Ottomata: flink-kubernetes-operator - modify for WMF [deployment-charts] - 10https://gerrit.wikimedia.org/r/865158 (https://phabricator.wikimedia.org/T324576) [20:58:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3316', diff saved to https://phabricator.wikimedia.org/P42522 and previous config saved to /var/cache/conftool/dbconfig/20221207-205811-ladsgroup.json [20:58:27] (03CR) 10Ottomata: "> I would also argue not to remove things from the chart that can just stay disabled/unused" [deployment-charts] - 10https://gerrit.wikimedia.org/r/865158 
(https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata) [20:58:37] (03PS6) 10Ottomata: flink-kubernetes-operator - modify for WMF and add an admin_ng helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/865158 (https://phabricator.wikimedia.org/T324576) [20:59:07] PROBLEM - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is CRITICAL: 137 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [21:00:04] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: gettimeofday() says it's time for UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221207T2100) [21:00:04] duesen and danisztls: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [21:00:10] o/ [21:01:36] o/ [21:01:42] (03CR) 10RLazarus: [C: 03+1] add grizzly dashboard for WDQS uptime [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/862178 (https://phabricator.wikimedia.org/T323064) (owner: 10Ryan Kemper) [21:02:10] !log contint1002 a2dismod mpm_event - https://phabricator.wikimedia.org/T208108 Bug: T313832 [21:02:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:02:15] T313832: contint1002 service implementation tracking - https://phabricator.wikimedia.org/T313832 [21:03:01] o/ I can deploy [21:03:39] TheresNoTime: awesome :) I can also self service, but tbh, it's late, and I have had a bit of a day... [21:03:59] duesen: no worries :D where were you wanting https://gerrit.wikimedia.org/r/c/mediawiki/extensions/DiscussionTools/+/864838/ deployed, assuming a backport..? [21:04:12] (03PS3) 10Samtar: hewiki: enable parser cache writes for parsoid's page/html endpoint. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/865070 (https://phabricator.wikimedia.org/T322672) (owner: 10Daniel Kinzler) [21:04:23] TheresNoTime: oh crud yes, i didn't cherry-pick. give me a sec [21:05:53] (03PS3) 10Samtar: Page 5% of calls to parsoid's page/html endpoint write to PC [mediawiki-config] - 10https://gerrit.wikimedia.org/r/865071 (https://phabricator.wikimedia.org/T322672) (owner: 10Daniel Kinzler) [21:05:57] TheresNoTime: hm, the cherry pick failed. can you do the config patches first? they can go in together, at the same time [21:06:07] duesen: sure, will do now [21:06:09] I'll figure out what's up with the DiscussionTools patch [21:06:50] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/865070 (https://phabricator.wikimedia.org/T322672) (owner: 10Daniel Kinzler) [21:06:52] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/865071 (https://phabricator.wikimedia.org/T322672) (owner: 10Daniel Kinzler) [21:07:37] (03Merged) 10jenkins-bot: hewiki: enable parser cache writes for parsoid's page/html endpoint. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/865070 (https://phabricator.wikimedia.org/T322672) (owner: 10Daniel Kinzler) [21:07:39] (03PS1) 10Dzahn: Revert "contint: give RelEng access to contint1002" [puppet] - 10https://gerrit.wikimedia.org/r/865746 [21:07:41] (03Merged) 10jenkins-bot: Page 5% of calls to parsoid's page/html endpoint write to PC [mediawiki-config] - 10https://gerrit.wikimedia.org/r/865071 (https://phabricator.wikimedia.org/T322672) (owner: 10Daniel Kinzler) [21:07:57] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on lvs5006.eqsin.wmnet with reason: host reimage [21:08:04] (03CR) 10CI reject: [V: 04-1] Revert "contint: give RelEng access to contint1002" [puppet] - 10https://gerrit.wikimedia.org/r/865746 (owner: 10Dzahn) [21:08:12] !log samtar@deploy1002 Started scap: Backport for [[gerrit:865070|hewiki: enable parser cache writes for parsoid's page/html endpoint. (T322672 T320534 T320529)]], [[gerrit:865071|Page 5% of calls to parsoid's page/html endpoint write to PC (T322672)]] [21:08:18] T320529: Configure VE backend to use Parsoid directly, instead of calling RESTbase - https://phabricator.wikimedia.org/T320529 [21:08:18] T320534: Put Parsoid output into the ParserCache on every edit - https://phabricator.wikimedia.org/T320534 [21:08:18] T322672: Make ParsoidHandler::wt2html write to parser cache - https://phabricator.wikimedia.org/T322672 [21:09:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T322618)', diff saved to https://phabricator.wikimedia.org/P42523 and previous config saved to /var/cache/conftool/dbconfig/20221207-210902-ladsgroup.json [21:09:04] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1180.eqiad.wmnet with reason: Maintenance [21:09:06] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [21:09:17] !log ladsgroup@cumin1001 END (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) for 6:00:00 on db1180.eqiad.wmnet with reason: Maintenance [21:09:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1180 (T322618)', diff saved to https://phabricator.wikimedia.org/P42524 and previous config saved to /var/cache/conftool/dbconfig/20221207-210923-ladsgroup.json [21:10:05] !log samtar@deploy1002 samtar and daniel: Backport for [[gerrit:865070|hewiki: enable parser cache writes for parsoid's page/html endpoint. (T322672 T320534 T320529)]], [[gerrit:865071|Page 5% of calls to parsoid's page/html endpoint write to PC (T322672)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet [21:10:46] (03PS2) 10Dzahn: Revert "contint: give RelEng access to contint1002" [puppet] - 10https://gerrit.wikimedia.org/r/865746 [21:10:52] duesen: those patches are live on mwdebug, but just FYI I'm looking at T324711, a lot of busy exception logs [21:10:52] T324711: UnexpectedValueException: Parsoid does not support content model proofread-index - https://phabricator.wikimedia.org/T324711 [21:11:07] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs5006.eqsin.wmnet with reason: host reimage [21:11:10] (unrelated to yours, just worrying :D) [21:11:11] (03CR) 10Dzahn: [C: 03+2] Revert "contint: give RelEng access to contint1002" [puppet] - 10https://gerrit.wikimedia.org/r/865746 (owner: 10Dzahn) [21:11:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T322618)', diff saved to https://phabricator.wikimedia.org/P42525 and previous config saved to /var/cache/conftool/dbconfig/20221207-211132-ladsgroup.json [21:11:47] TheresNoTime: nvm the DiscussionTools patch, it's already on the branch, it got merged before the branch cut [21:11:52] (03CR) 10CI reject: [V: 04-1] Revert "contint: give RelEng access to contint1002" [puppet] - 10https://gerrit.wikimedia.org/r/865746 (owner: 
10Dzahn) [21:12:00] PROBLEM - Disk space on mx1001 is CRITICAL: DISK CRITICAL - free space: / 706 MB (3% inode=91%): /tmp 706 MB (3% inode=91%): /var/tmp 706 MB (3% inode=91%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mx1001&var-datasource=eqiad+prometheus/ops [21:12:03] (03CR) 10Hashar: [C: 03+1] "Paired with Daniel. That was my mistake, I made a first change to get shell access which only added contint-root, the next change adding " [puppet] - 10https://gerrit.wikimedia.org/r/865746 (owner: 10Dzahn) [21:12:21] duesen: ack okay — can you test those config patches? [21:12:38] TheresNoTime: I'll try to test the config patch on debug, though I don't think I'll be able to see much. [21:12:39] (03PS7) 10Ottomata: flink-kubernetes-operator - modify for WMF and add an admin_ng helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/865158 (https://phabricator.wikimedia.org/T324576) [21:12:55] (03PS3) 10Dzahn: Revert "contint: give RelEng access to contint1002" [puppet] - 10https://gerrit.wikimedia.org/r/865746 [21:13:09] (03CR) 10Dzahn: [V: 03+2] Revert "contint: give RelEng access to contint1002" [puppet] - 10https://gerrit.wikimedia.org/r/865746 (owner: 10Dzahn) [21:13:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3316 (T322618)', diff saved to https://phabricator.wikimedia.org/P42526 and previous config saved to /var/cache/conftool/dbconfig/20221207-211317-ladsgroup.json [21:13:19] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2180.codfw.wmnet with reason: Maintenance [21:13:33] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2180.codfw.wmnet with reason: Maintenance [21:13:39] (03CR) 10Dzahn: [C: 03+2] Revert "ci::master: hack to bootstrap new server contint1002" [puppet] - 10https://gerrit.wikimedia.org/r/865525 (owner: 10Dzahn) [21:13:39] !log 
ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2180 (T322618)', diff saved to https://phabricator.wikimedia.org/P42527 and previous config saved to /var/cache/conftool/dbconfig/20221207-211338-ladsgroup.json [21:14:25] (03CR) 10Ottomata: flink-kubernetes-operator - modify for WMF and add an admin_ng helmfile (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/865158 (https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata) [21:14:50] (03CR) 10Dzahn: "You can ignore my comments here. We found the _actual_ cause of the issue and it wasn't this :)" [puppet] - 10https://gerrit.wikimedia.org/r/791565 (https://phabricator.wikimedia.org/T305729) (owner: 10Jbond) [21:15:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2180 (T322618)', diff saved to https://phabricator.wikimedia.org/P42528 and previous config saved to /var/cache/conftool/dbconfig/20221207-211551-ladsgroup.json [21:15:55] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [21:17:11] TheresNoTime: everything looks fine. Whether it actually is, we'll know once restbase starts hitting it. [21:17:49] TheresNoTime: if you merge them, i'll keep an eye on the metrics [21:18:01] 10SRE, 10ops-codfw: codfw:test new Supermicro server - https://phabricator.wikimedia.org/T322578 (10Papaul) First OS install was done with the first 2 ssd's in software raid 1 and was able to see the Nvme as well ` Disk /dev/nvme0n1: 5.82 TiB, 6401252745216 bytes, 1562805846 sectors Disk model: WUS4CB064D7P3E3... [21:18:27] duesen: okay - I'm a little concerned with T324711, not entirely sure if I should merge while we're seeing that many exceptions post-train [21:18:27] T324711: UnexpectedValueException: Parsoid does not support content model proofread-index - https://phabricator.wikimedia.org/T324711 [21:18:35] TheresNoTime: I will start to look at T324711 as well. 
May be related to my work (not the backport patches though). [21:20:59] duesen: should I merge the config patches, or would you prefer to address that first? [21:22:39] (03CR) 10Dzahn: [C: 03+2] gerrit: raise H2 compaction time [puppet] - 10https://gerrit.wikimedia.org/r/865023 (https://phabricator.wikimedia.org/T323754) (owner: 10Hashar) [21:22:43] TheresNoTime: please merge the config patches [21:22:51] ack [21:24:12] (03PS1) 10Stang: specieswiki: Install GeoData extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/865766 (https://phabricator.wikimedia.org/T324348) [21:25:16] 10SRE, 10SRE-Access-Requests, 10ops-codfw, 10Patch-For-Review: Access request for datacenter-ops group - https://phabricator.wikimedia.org/T324585 (10jhathaway) @wiki_willy the datacenter-ops group is a local group which grants access to a number of sudo commands needed for datacenter work. The [[ https://... [21:25:29] Hi TheresNoTime, would you mind taking care of one more patch ^^ [21:26:02] cirno: sure, there's one more ahead of you [21:26:33] (03PS3) 10Samtar: Remove Research Incentive survey from frwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/865737 (https://phabricator.wikimedia.org/T321930) (owner: 10DDesouza) [21:26:35] 10SRE, 10SRE-Access-Requests, 10ops-codfw, 10Patch-For-Review: Access request for datacenter-ops group - https://phabricator.wikimedia.org/T324585 (10wiki_willy) Yup, that's correct. Thanks @jhathaway! >>! In T324585#8452314, @jhathaway wrote: > @wiki_willy the datacenter-ops group is a local group which... [21:26:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P42529 and previous config saved to /var/cache/conftool/dbconfig/20221207-212638-ladsgroup.json [21:27:16] I have added this one to the board. TheresNoTime: this patch requires https://gerrit.wikimedia.org/r/863442/, could you please have a look and give a +2?
[21:27:46] (03PS2) 10JHathaway: Add Jennifer Hancock to the ops group [puppet] - 10https://gerrit.wikimedia.org/r/865738 (https://phabricator.wikimedia.org/T324585) [21:27:57] cirno: looking [21:28:00] (03CR) 10Hashar: [V: 04-1] "We can dig in the history, but I think the partition name is local to the host. That comes from when we migrated from /mnt to /srv Ic0c805" [puppet] - 10https://gerrit.wikimedia.org/r/865735 (https://phabricator.wikimedia.org/T313832) (owner: 10Dzahn) [21:28:47] !log samtar@deploy1002 Finished scap: Backport for [[gerrit:865070|hewiki: enable parser cache writes for parsoid's page/html endpoint. (T322672 T320534 T320529)]], [[gerrit:865071|Page 5% of calls to parsoid's page/html endpoint write to PC (T322672)]] (duration: 20m 35s) [21:28:53] T320529: Configure VE backend to use Parsoid directly, instead of calling RESTbase - https://phabricator.wikimedia.org/T320529 [21:28:54] T320534: Put Parsoid output into the ParserCache on every edit - https://phabricator.wikimedia.org/T320534 [21:28:54] T322672: Make ParsoidHandler::wt2html write to parser cache - https://phabricator.wikimedia.org/T322672 [21:29:06] duesen: those config patches should be live now [21:29:24] danisztls: doing 865737 now [21:29:30] TheresNoTime: Thanks! [21:29:40] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/865737 (https://phabricator.wikimedia.org/T321930) (owner: 10DDesouza) [21:30:28] (03Merged) 10jenkins-bot: Remove Research Incentive survey from frwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/865737 (https://phabricator.wikimedia.org/T321930) (owner: 10DDesouza) [21:30:52] TheresNoTime, I think I'm going to roll back the train when you're done. 
[21:30:55] !log samtar@deploy1002 Started scap: Backport for [[gerrit:865737|Remove Research Incentive survey from frwiki (T321930)]] [21:30:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P42530 and previous config saved to /var/cache/conftool/dbconfig/20221207-213057-ladsgroup.json [21:30:58] T321930: Deploy Research Incentive Survey targeting Sub-Saharan Africa on French Wikipedia - https://phabricator.wikimedia.org/T321930 [21:30:59] !log sukhe@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - sukhe@cumin2002" [21:31:10] dancy: ack [21:31:44] 10SRE, 10SRE-Access-Requests, 10ops-codfw, 10Patch-For-Review: Access request for datacenter-ops group - https://phabricator.wikimedia.org/T324585 (10jhathaway) 05Open→03Resolved a:03jhathaway done! [21:32:07] !log sukhe@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - sukhe@cumin2002" [21:32:08] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host lvs5006.eqsin.wmnet with OS buster [21:32:18] 10SRE, 10ops-eqsin, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host lvs5006.eqsin.wmnet with OS buster completed: - lvs5006 (**PASS**)... [21:32:47] !log samtar@deploy1002 samtar and dani: Backport for [[gerrit:865737|Remove Research Incentive survey from frwiki (T321930)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet [21:32:50] danisztls: that's live on mwdebug now, can you test? [21:32:59] TheresNoTime: yes [21:33:07] any mwdebug? 
[21:33:14] any :) [21:33:37] (03CR) 10Ssingh: [C: 03+2] sites.yaml: add lvs5006 (eqsin hardware refresh) [homer/public] - 10https://gerrit.wikimedia.org/r/865732 (https://phabricator.wikimedia.org/T322048) (owner: 10Ssingh) [21:33:48] TheresNoTime: it looks fine [21:33:48] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Wenjun Fan - https://phabricator.wikimedia.org/T319056 (10jhathaway) @AnnWF is this now a duplicate, since you were added to analytics_privatedata_users in https://phabricator.wikimedia.org/T324057? [21:33:56] syncing [21:34:51] !log homer "cr*-eqsin*" commit "running homer for Gerrit: 865742" [21:34:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:35:52] (03PS2) 10Samtar: specieswiki: Install GeoData extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/865766 (https://phabricator.wikimedia.org/T324348) (owner: 10Stang) [21:36:06] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 4:00:00 on lvs5003.eqsin.wmnet with reason: downtimed, in the process of decom [21:36:32] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on lvs5003.eqsin.wmnet with reason: downtimed, in the process of decom [21:36:55] !log sukhe@cumin2002 START - Cookbook sre.hosts.decommission for hosts lvs5003.eqsin.wmnet [21:38:34] (03PS1) 10Ssingh: sites.yaml: remove decommissioned host lvs5003 [homer/public] - 10https://gerrit.wikimedia.org/r/865773 (https://phabricator.wikimedia.org/T323830) [21:39:59] !log samtar@deploy1002 Finished scap: Backport for [[gerrit:865737|Remove Research Incentive survey from frwiki (T321930)]] (duration: 09m 04s) [21:40:02] danisztls: that should be live now :) [21:40:02] T321930: Deploy Research Incentive Survey targeting Sub-Saharan Africa on French Wikipedia - https://phabricator.wikimedia.org/T321930 [21:40:15] cirno: doing 865766 now [21:40:20] TheresNoTime: thanks [21:40:36] (03CR) 10TrainBranchBot: [C: 03+2] 
"Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/865766 (https://phabricator.wikimedia.org/T324348) (owner: 10Stang) [21:40:41] TheresNoTime: subbu and scott and I are looking into the bug. i have a good idea what it is. but not how to fix it, really [21:40:55] :(( [21:41:08] (03CR) 10Dzahn: [C: 03+1] "lgtm! Arnold, do you wanna merge it, watch what puppet adds and test starting that new systemd unit it creates? might be interesting for " [puppet] - 10https://gerrit.wikimedia.org/r/865674 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [21:41:20] (03Merged) 10jenkins-bot: specieswiki: Install GeoData extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/865766 (https://phabricator.wikimedia.org/T324348) (owner: 10Stang) [21:41:20] !log sukhe@cumin2002 START - Cookbook sre.dns.netbox [21:41:44] (03PS3) 10Dzahn: phabricator: rm code from before system user was created with systemd [puppet] - 10https://gerrit.wikimedia.org/r/865208 [21:41:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P42532 and previous config saved to /var/cache/conftool/dbconfig/20221207-214145-ladsgroup.json [21:41:46] !log samtar@deploy1002 Started scap: Backport for [[gerrit:865766|specieswiki: Install GeoData extension (T324348)]] [21:41:50] T324348: Add Extension:GeoData to Wikispecies wiki - https://phabricator.wikimedia.org/T324348 [21:42:01] TheresNoTime: have you run the script createExtensionTables to create tables? [21:42:50] cirno: nope.. will do now! 
[21:43:20] !log sukhe@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: lvs5003.eqsin.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin2002" [21:43:39] !log samtar@deploy1002 samtar and stang: Backport for [[gerrit:865766|specieswiki: Install GeoData extension (T324348)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet [21:43:51] (03PS1) 10Brion VIBBER: Use blubber via Docker tooling; no longer requires local binary [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/865779 [21:44:30] !log sukhe@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: lvs5003.eqsin.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin2002" [21:44:30] !log sukhe@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:44:31] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts lvs5003.eqsin.wmnet [21:44:39] 10SRE, 10ops-eqsin, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q2:rack/setup/install/decom eqsin: unified decommission task - https://phabricator.wikimedia.org/T323830 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by sukhe@cumin2002 for hosts: `lvs5003.eqsin.wmnet` - lvs5003.eqsin.wmnet... 
[21:44:42] 10SRE, 10ops-eqsin, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q2:rack/setup/install/decom eqsin: unified decommission task - https://phabricator.wikimedia.org/T323830 (10ssingh) [21:44:51] (03CR) 10Ssingh: [C: 03+2] lvs5006: set as secondary LVS and remove lvs5003 (decomm) [puppet] - 10https://gerrit.wikimedia.org/r/865742 (https://phabricator.wikimedia.org/T323830) (owner: 10Ssingh) [21:45:00] cirno: one moment [21:45:53] (03CR) 10Ssingh: [C: 03+2] sites.yaml: remove decommissioned host lvs5003 [homer/public] - 10https://gerrit.wikimedia.org/r/865773 (https://phabricator.wikimedia.org/T323830) (owner: 10Ssingh) [21:46:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P42533 and previous config saved to /var/cache/conftool/dbconfig/20221207-214603-ladsgroup.json [21:47:18] !log homer "cr*-eqsin*" commit "running homer for Gerrit: 865773" [21:47:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:47:46] cirno: I will need to backport https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikimediaMaintenance/+/863442, may be worth rescheduling your patch? [21:48:30] or will I..? [21:48:34] 10SRE, 10ops-eqsin, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (10ssingh) [21:48:38] it's ok, I could re-schedule the extension install patch [21:49:05] or should I do a backport of WikimediaMaintenance? [21:49:15] (I mean, by myself [21:49:29] !log samtar@deploy1002 Sync cancelled. [21:50:27] cirno: let's reschedule, given there's also T324711 going on and dancy wants to roll back the train. 
I'll revert that patch I merged [21:50:28] T324711: UnexpectedValueException: Parsoid does not support content model proofread-index - https://phabricator.wikimedia.org/T324711 [21:50:57] 10SRE, 10ops-eqsin, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (10ssingh) 05Open→03Resolved Thanks to @RobH, @Papaul, @Bblack, @cmooney, @MoritzMuehlenhoff, @Volans for all their help in the eqsin refresh. [21:51:17] !log samtar@deploy1002 backport aborted: (duration: 00m 15s) [21:51:55] TheresNoTime: got it and agree to postpone, what do you think is a good time for the next schedule? [21:52:03] (03PS1) 10Samtar: Revert "specieswiki: Install GeoData extension" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/865747 [21:54:52] (03CR) 10Samtar: [C: 03+2] Revert "specieswiki: Install GeoData extension" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/865747 (owner: 10Samtar) [21:55:37] cirno: as long as that WikimediaMaintenance change is available - next window maybe? [21:56:07] dancy: done, all yours [21:56:14] thx!
[21:56:20] !log UTC late backport window done [21:56:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:56:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T322618)', diff saved to https://phabricator.wikimedia.org/P42534 and previous config saved to /var/cache/conftool/dbconfig/20221207-215651-ladsgroup.json [21:56:53] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1187.eqiad.wmnet with reason: Maintenance [21:56:55] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [21:57:06] (03Merged) 10jenkins-bot: Revert "specieswiki: Install GeoData extension" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/865747 (owner: 10Samtar) [21:57:06] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1187.eqiad.wmnet with reason: Maintenance [21:57:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1187 (T322618)', diff saved to https://phabricator.wikimedia.org/P42535 and previous config saved to /var/cache/conftool/dbconfig/20221207-215712-ladsgroup.json [21:57:18] (03PS1) 10Stang: createExtensionTables: Add extension GeoData [extensions/WikimediaMaintenance] (wmf/1.40.0-wmf.13) - 10https://gerrit.wikimedia.org/r/865748 (https://phabricator.wikimedia.org/T324348) [21:59:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1187 (T322618)', diff saved to https://phabricator.wikimedia.org/P42536 and previous config saved to /var/cache/conftool/dbconfig/20221207-215921-ladsgroup.json [22:01:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2180 (T322618)', diff saved to https://phabricator.wikimedia.org/P42537 and previous config saved to /var/cache/conftool/dbconfig/20221207-220110-ladsgroup.json [22:09:11] PROBLEM - ensure kvm processes are running on cloudvirt1019 is CRITICAL: PROCS 
CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [22:11:41] RECOVERY - ensure kvm processes are running on cloudvirt1019 is OK: PROCS OK: 5 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [22:13:22] (03PS2) 10Ryan Kemper: wdqs: Bring wdqs20[09,10,11,12] online [puppet] - 10https://gerrit.wikimedia.org/r/862369 (https://phabricator.wikimedia.org/T301167) (owner: 10Bking) [22:14:06] (03CR) 10Ryan Kemper: [C: 03+1] wdqs: Bring wdqs20[09,10,11,12] online [puppet] - 10https://gerrit.wikimedia.org/r/862369 (https://phabricator.wikimedia.org/T301167) (owner: 10Bking) [22:14:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1187', diff saved to https://phabricator.wikimedia.org/P42538 and previous config saved to /var/cache/conftool/dbconfig/20221207-221427-ladsgroup.json [22:14:44] (03CR) 10Bking: [C: 03+2] wdqs: Bring wdqs20[09,10,11,12] online [puppet] - 10https://gerrit.wikimedia.org/r/862369 (https://phabricator.wikimedia.org/T301167) (owner: 10Bking) [22:23:41] !log bking@cumin2002 START - Cookbook sre.wdqs.data-transfer [22:25:01] !log bking@cumin2002 END (ERROR) - Cookbook sre.wdqs.data-transfer (exit_code=97) [22:25:18] !log bking@cumin2002 START - Cookbook sre.wdqs.data-transfer [22:25:37] TheresNoTime: https://gerrit.wikimedia.org/r/c/mediawiki/core/+/865785 is the fix we want, never mind the other one for now. [22:25:44] TheresNoTime: having both doesn't hurt. 
[22:26:33] !log bking@cumin2002 END (ERROR) - Cookbook sre.wdqs.data-transfer (exit_code=97) [22:26:48] ah :D [22:28:59] !log bking@cumin2002 START - Cookbook sre.wdqs.data-reload [22:29:00] !log bking@cumin2002 END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) [22:29:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1187', diff saved to https://phabricator.wikimedia.org/P42539 and previous config saved to /var/cache/conftool/dbconfig/20221207-222934-ladsgroup.json [22:29:54] !log bking@cumin2002 START - Cookbook sre.wdqs.data-reload [22:29:54] !log bking@cumin2002 END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) [22:30:01] !log bking@cumin2002 START - Cookbook sre.wdqs.data-reload [22:30:25] duesen: guessing you're going to want 865785 backported? [22:32:12] !log bking@cumin2002 END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) [22:32:33] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [22:33:34] TheresNoTime: yes, please. [22:34:16] * TheresNoTime is available to do that unless anyone else would prefer to? [22:34:46] PROBLEM - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is CRITICAL: 110 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [22:34:57] TheresNoTime: if you could do it, that would really help. I'm in zombie mode at this point. Need to sleep. I hope subbu can help in case something goes wrong. [22:35:20] sure :) [22:35:30] !log bking@cumin2002 START - Cookbook sre.wdqs.data-reload [22:35:34] worst case, we roll back train to group 0. 
[22:36:01] !log ryankemper@puppetmaster1001 conftool action : set/weight=10:pooled=no; selector: name=wdqs2010.*
[22:36:17] !log ryankemper@puppetmaster1001 conftool action : set/weight=10:pooled=no; selector: name=wdqs2009.*
[22:36:42] PROBLEM - WDQS SPARQL on wdqs2011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string http://www.w3.org/2001/XML... not found on https://query.wikidata.org:443/bigdata/namespace/wdq/sparql?query=SELECT%20*%20WHERE%20%7Bwikibase%3ADump%20schema%3AdateModified%20%3Fy%7D%20LIMIT%201 - 533 bytes in 1.225 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[22:37:43] !log bking@cumin2002 END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99)
[22:39:54] PROBLEM - Blazegraph Port for wdqs-categories on wdqs2012 is CRITICAL: connect to address 127.0.0.1 and port 9990: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[22:40:46] (PS1) Samtar: Make parsoid accept all content models. [core] (wmf/1.40.0-wmf.13) - https://gerrit.wikimedia.org/r/865749 (https://phabricator.wikimedia.org/T324711)
[22:41:15] (PS1) Bking: wdqs data-reload.py: fix usage comment [cookbooks] - https://gerrit.wikimedia.org/r/865788
[22:41:40] !log T301167 Downtimed `wdqs20[09-12]` for 7 days
[22:41:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:41:44] T301167: Service implementation for wdqs20[09,10,11,12] - https://phabricator.wikimedia.org/T301167
[22:42:53] (CR) Samtar: [C: +2] "Backporting" [core] (wmf/1.40.0-wmf.13) - https://gerrit.wikimedia.org/r/865749 (https://phabricator.wikimedia.org/T324711) (owner: Samtar)
[22:43:16] (CR) CI reject: [V: -1] wdqs data-reload.py: fix usage comment [cookbooks] - https://gerrit.wikimedia.org/r/865788 (owner: Bking)
[22:44:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1187 (T322618)', diff saved to https://phabricator.wikimedia.org/P42540 and previous config saved to /var/cache/conftool/dbconfig/20221207-224440-ladsgroup.json
[22:44:42] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1201.eqiad.wmnet with reason: Maintenance
[22:44:45] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618
[22:44:56] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1201.eqiad.wmnet with reason: Maintenance
[22:45:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1201 (T322618)', diff saved to https://phabricator.wikimedia.org/P42541 and previous config saved to /var/cache/conftool/dbconfig/20221207-224502-ladsgroup.json
[22:46:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1201 (T322618)', diff saved to https://phabricator.wikimedia.org/P42542 and previous config saved to /var/cache/conftool/dbconfig/20221207-224610-ladsgroup.json
[22:47:01] !log bking@cumin2002 START - Cookbook sre.wdqs.data-reload
[22:48:31] !log Going to backport [[gerrit:865749]] to wmf/1.40.0-wmf.13 for T324711
[22:48:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:48:35] T324711: UnexpectedValueException: Parsoid does not support content model proofread-index - https://phabricator.wikimedia.org/T324711
[22:48:40] !log bking@cumin2002 END (ERROR) - Cookbook sre.wdqs.data-reload (exit_code=97)
[22:49:33] !log bking@cumin1001 START - Cookbook sre.wdqs.data-reload
[22:49:35] !log bking@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99)
[22:49:54] !log bking@cumin1001 START - Cookbook sre.wdqs.data-reload
[22:49:55] !log bking@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99)
[22:49:56] RECOVERY - Blazegraph Port for wdqs-categories on wdqs2012 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9990 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[22:50:06] !log bking@cumin1001 START - Cookbook sre.wdqs.data-reload
[22:51:26] !log bking@cumin2002 START - Cookbook sre.wdqs.data-reload
[22:51:28] !log bking@cumin2002 END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99)
[22:51:46] !log bking@cumin2002 START - Cookbook sre.wdqs.data-reload
[22:54:39] (PS2) Bking: wdqs data-reload.py: fix usage comment [cookbooks] - https://gerrit.wikimedia.org/r/865788
[22:54:44] PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[22:55:15] (PS3) Bking: wdqs data-reload.py: fix usage comment [cookbooks] - https://gerrit.wikimedia.org/r/865788
[22:55:21] (CR) Dzahn: [C: +2] doc: Enable profile::auto_restarts::service for Envoy [puppet] - https://gerrit.wikimedia.org/r/865646 (https://phabricator.wikimedia.org/T135991) (owner: Muehlenhoff)
[22:55:52] PROBLEM - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is CRITICAL: 128 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[22:56:28] RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[22:56:55] (PS1) RLazarus: Refactor: Migrate from attrs to dataclasses [software/httpbb] - https://gerrit.wikimedia.org/r/865789
[22:56:57] (PS1) RLazarus: Refactor: Wrap verify_certs inside an Options type. [software/httpbb] - https://gerrit.wikimedia.org/r/865790 (https://phabricator.wikimedia.org/T323707)
[22:56:59] (PS1) RLazarus: Add an option, off by default, to retry once when a request times out. [software/httpbb] - https://gerrit.wikimedia.org/r/865791 (https://phabricator.wikimedia.org/T323707)
[22:58:14] (CR) Dzahn: "Arnold, if you are here tomorrow, maybe you can chat with Antoine (hashar) and merge this for him when he says it's ready to go?" [puppet] - https://gerrit.wikimedia.org/r/865681 (https://phabricator.wikimedia.org/T313832) (owner: Hashar)
[22:58:28] (CR) CI reject: [V: -1] Refactor: Wrap verify_certs inside an Options type. [software/httpbb] - https://gerrit.wikimedia.org/r/865790 (https://phabricator.wikimedia.org/T323707) (owner: RLazarus)
[22:58:30] (CR) CI reject: [V: -1] Add an option, off by default, to retry once when a request times out. [software/httpbb] - https://gerrit.wikimedia.org/r/865791 (https://phabricator.wikimedia.org/T323707) (owner: RLazarus)
[22:58:32] (CR) CI reject: [V: -1] Refactor: Migrate from attrs to dataclasses [software/httpbb] - https://gerrit.wikimedia.org/r/865789 (owner: RLazarus)
[22:58:45] (CR) Dzahn: [C: +2] phabricator: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/860905 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff)
[22:58:52] (PS3) Dzahn: phabricator: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/860905 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff)
[22:59:55] (Merged) jenkins-bot: Make parsoid accept all content models. [core] (wmf/1.40.0-wmf.13) - https://gerrit.wikimedia.org/r/865749 (https://phabricator.wikimedia.org/T324711) (owner: Samtar)
[23:00:07] (CR) TrainBranchBot: [C: +2] "Approved by samtar@deploy1002 using scap backport" [core] (wmf/1.40.0-wmf.13) - https://gerrit.wikimedia.org/r/865749 (https://phabricator.wikimedia.org/T324711) (owner: Samtar)
[23:00:48] !log samtar@deploy1002 Started scap: Backport for [[gerrit:865749|Make parsoid accept all content models. (T324711)]]
[23:00:52] T324711: UnexpectedValueException: Parsoid does not support content model proofread-index - https://phabricator.wikimedia.org/T324711
[23:01:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1201', diff saved to https://phabricator.wikimedia.org/P42543 and previous config saved to /var/cache/conftool/dbconfig/20221207-230116-ladsgroup.json
[23:01:18] PROBLEM - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is CRITICAL: 102 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[23:02:45] !log samtar@deploy1002 samtar and samtar: Backport for [[gerrit:865749|Make parsoid accept all content models. (T324711)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet
[23:02:58] * TheresNoTime is testing
[23:08:12] I still see the same errors: "/srv/mediawiki/php-1.40.0-wmf.13/includes/parser/Parsoid/ParsoidOutputAccess.php:196" ... but, on master, that isn't the right line anymore.
[23:08:26] oh, testservers only .. never mind. ignore me.
[23:09:29] TheresNoTime, parsoid requests go to parse200* cluster btw. not sure if mwdebug* will let you test this.
[23:09:51] s/cluster/servers
[23:09:55] ahhhh
[23:09:58] * TheresNoTime syncs
[23:10:21] the parsoid canary servers are parse2001/2002 and parse1001/1002
[23:10:28] PROBLEM - Check systemd state on snapshot1008 is CRITICAL: CRITICAL - degraded: The following units failed: cirrussearch-dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:10:35] mutante, ah .. good to know.
[23:12:30] PROBLEM - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is CRITICAL: 116 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[23:12:33] should be starting to see a drop off in exceptions now
[23:12:55] (NodeTextfileStale) firing: (48) Stale textfile for cp1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[23:14:22] RECOVERY - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is OK: (C)100 gt (W)50 gt 45 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[23:14:25] TheresNoTime, looks like it has.
[23:14:36] icinga was faster than me.
[23:14:46] !log samtar@deploy1002 Finished scap: Backport for [[gerrit:865749|Make parsoid accept all content models. (T324711)]] (duration: 13m 57s)
[23:14:48] phew
[23:14:50] T324711: UnexpectedValueException: Parsoid does not support content model proofread-index - https://phabricator.wikimedia.org/T324711
[23:15:02] (PS4) Dzahn: phabricator: rm code from before system user was created with systemd [puppet] - https://gerrit.wikimedia.org/r/865208 (https://phabricator.wikimedia.org/T280597)
[23:15:26] (CR) Dzahn: [V: +1 C: +2] "https://puppet-compiler.wmflabs.org/output/865208/38639/" [puppet] - https://gerrit.wikimedia.org/r/865208 (https://phabricator.wikimedia.org/T280597) (owner: Dzahn)
[23:15:26] PROBLEM - Disk space on mx1001 is CRITICAL: DISK CRITICAL - free space: / 720 MB (3% inode=91%): /tmp 720 MB (3% inode=91%): /var/tmp 720 MB (3% inode=91%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mx1001&var-datasource=eqiad+prometheus/ops
[23:16:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1201', diff saved to https://phabricator.wikimedia.org/P42544 and previous config saved to /var/cache/conftool/dbconfig/20221207-231623-ladsgroup.json
[23:23:20] !log mx1001 - apt-get clean, gzip /var/log/exim4/mainlog.1 find -mtime +31 -delete in /var/log/exim4 - deleting old logs to prevent mail server running out of disk - it was alerting in Icinga but same as conf* - monitoring works, alerting does not
[23:23:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:24:22] !log mx1001 about to run out of disk again - apt-get clean, gzip /var/log/exim4/mainlog.1 find -mtime +31 -delete in /var/log/exim4 - deleting old logs to prevent mail server running out of disk - it was alerting in Icinga but same as conf* - monitoring works, alerting does not T305567
[23:24:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:24:26] T305567: MX: increasing disk space - https://phabricator.wikimedia.org/T305567
[23:24:54] (CR) Jforrester: ci: move docker::settings to common, avoid host names (1 comment) [puppet] - https://gerrit.wikimedia.org/r/865735 (https://phabricator.wikimedia.org/T313832) (owner: Dzahn)
[23:25:59] (CR) Dzahn: ci: move docker::settings to common, avoid host names (1 comment) [puppet] - https://gerrit.wikimedia.org/r/865735 (https://phabricator.wikimedia.org/T313832) (owner: Dzahn)
[23:26:36] PROBLEM - Router interfaces on cr2-drmrs is CRITICAL: CRITICAL: host 185.15.58.129, interfaces up: 60, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[23:26:38] PROBLEM - Router interfaces on cr3-esams is CRITICAL: CRITICAL: host 91.198.174.245, interfaces up: 83, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[23:27:18] TheresNoTime, alright, I'm going to relocate and will be unavailable for a bit ... but it looks like all is well so far.
[23:27:47] subbu: I'll be around and will keep an eye for a bit, but looking okay
[23:27:56] perfect. thanks!
[23:31:04] SRE, Infrastructure-Foundations, Mail: MX: increasing disk space - https://phabricator.wikimedia.org/T305567 (Dzahn) I think the priority is surprisingly low for this being the main prod mail server and almost running out of disk multiple times.
[23:31:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1201 (T322618)', diff saved to https://phabricator.wikimedia.org/P42545 and previous config saved to /var/cache/conftool/dbconfig/20221207-233130-ladsgroup.json
[23:31:32] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance
[23:31:34] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618
[23:31:45] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance
[23:32:06] RECOVERY - Router interfaces on cr2-drmrs is OK: OK: host 185.15.58.129, interfaces up: 61, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[23:32:08] RECOVERY - Router interfaces on cr3-esams is OK: OK: host 91.198.174.245, interfaces up: 84, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[23:36:12] RECOVERY - Disk space on mx1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mx1001&var-datasource=eqiad+prometheus/ops
[23:38:25] !log cwhite@cumin2002 START - Cookbook sre.hosts.reimage for host logstash1012.eqiad.wmnet with OS bullseye
[23:43:56] (PS2) RLazarus: Refactor: Wrap verify_certs inside an Options type. [software/httpbb] - https://gerrit.wikimedia.org/r/865790 (https://phabricator.wikimedia.org/T323707)
[23:43:58] (PS2) RLazarus: Refactor: Migrate from attrs to dataclasses [software/httpbb] - https://gerrit.wikimedia.org/r/865789
[23:44:00] (PS2) RLazarus: Add an option, off by default, to retry once when a request times out. [software/httpbb] - https://gerrit.wikimedia.org/r/865791 (https://phabricator.wikimedia.org/T323707)
[23:44:02] (PS1) RLazarus: Typing cleanup, mostly associated with Python version upgrade [software/httpbb] - https://gerrit.wikimedia.org/r/865794
[23:45:38] (CR) CI reject: [V: -1] Add an option, off by default, to retry once when a request times out. [software/httpbb] - https://gerrit.wikimedia.org/r/865791 (https://phabricator.wikimedia.org/T323707) (owner: RLazarus)
[23:45:45] (CR) CI reject: [V: -1] Typing cleanup, mostly associated with Python version upgrade [software/httpbb] - https://gerrit.wikimedia.org/r/865794 (owner: RLazarus)
[23:45:52] (CR) CI reject: [V: -1] Refactor: Wrap verify_certs inside an Options type. [software/httpbb] - https://gerrit.wikimedia.org/r/865790 (https://phabricator.wikimedia.org/T323707) (owner: RLazarus)
[23:53:31] (PS3) RLazarus: Refactor: Migrate from attrs to dataclasses [software/httpbb] - https://gerrit.wikimedia.org/r/865789
[23:53:33] (PS3) RLazarus: Refactor: Wrap verify_certs inside an Options type. [software/httpbb] - https://gerrit.wikimedia.org/r/865790 (https://phabricator.wikimedia.org/T323707)
[23:53:35] (PS3) RLazarus: Add an option, off by default, to retry once when a request times out. [software/httpbb] - https://gerrit.wikimedia.org/r/865791 (https://phabricator.wikimedia.org/T323707)
[23:54:05] (Abandoned) RLazarus: Typing cleanup, mostly associated with Python version upgrade [software/httpbb] - https://gerrit.wikimedia.org/r/865794 (owner: RLazarus)
[23:57:27] (CR) jenkins-bot: Add an option, off by default, to retry once when a request times out. [software/httpbb] - https://gerrit.wikimedia.org/r/865791 (https://phabricator.wikimedia.org/T323707) (owner: RLazarus)
[23:59:50] (PS4) RLazarus: Add an option, off by default, to retry once when a request times out. [software/httpbb] - https://gerrit.wikimedia.org/r/865791 (https://phabricator.wikimedia.org/T323707)