[00:03:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315 (T322618)', diff saved to https://phabricator.wikimedia.org/P42354 and previous config saved to /var/cache/conftool/dbconfig/20221206-000329-ladsgroup.json [00:03:31] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1150.eqiad.wmnet with reason: Maintenance [00:03:35] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [00:03:44] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1150.eqiad.wmnet with reason: Maintenance [00:03:57] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1161.eqiad.wmnet with reason: Maintenance [00:04:10] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1161.eqiad.wmnet with reason: Maintenance [00:04:12] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [00:04:38] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [00:04:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1161 (T322618)', diff saved to https://phabricator.wikimedia.org/P42355 and previous config saved to /var/cache/conftool/dbconfig/20221206-000444-ladsgroup.json [00:06:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2157 (T322618)', diff saved to https://phabricator.wikimedia.org/P42356 and previous config saved to /var/cache/conftool/dbconfig/20221206-000633-ladsgroup.json [00:06:35] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2171.codfw.wmnet with reason: Maintenance [00:06:48] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 6:00:00 on db2171.codfw.wmnet with reason: Maintenance [00:06:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2171:3315 (T322618)', diff saved to https://phabricator.wikimedia.org/P42357 and previous config saved to /var/cache/conftool/dbconfig/20221206-000654-ladsgroup.json [00:07:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T322618)', diff saved to https://phabricator.wikimedia.org/P42358 and previous config saved to /var/cache/conftool/dbconfig/20221206-000703-ladsgroup.json [00:08:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3315 (T322618)', diff saved to https://phabricator.wikimedia.org/P42359 and previous config saved to /var/cache/conftool/dbconfig/20221206-000820-ladsgroup.json [00:14:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131', diff saved to https://phabricator.wikimedia.org/P42360 and previous config saved to /var/cache/conftool/dbconfig/20221206-001438-ladsgroup.json [00:20:30] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [00:20:38] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [00:22:03] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:22:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P42361 and previous config saved to /var/cache/conftool/dbconfig/20221206-002210-ladsgroup.json [00:23:27] !log ladsgroup@cumin1001 
dbctl commit (dc=all): 'Repooling after maintenance db2171:3315', diff saved to https://phabricator.wikimedia.org/P42362 and previous config saved to /var/cache/conftool/dbconfig/20221206-002326-ladsgroup.json [00:25:46] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.424 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [00:25:54] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49121 bytes in 0.064 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [00:27:03] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:28:23] (CR) Jberkel: Make "make" available in all images (1 comment) [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/864828 (https://phabricator.wikimedia.org/T320343) (owner: Jberkel) [00:29:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131 (T323907)', diff saved to https://phabricator.wikimedia.org/P42363 and previous config saved to /var/cache/conftool/dbconfig/20221206-002945-ladsgroup.json [00:29:47] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1140.eqiad.wmnet with reason: Maintenance [00:29:49] T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907 [00:30:00] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1140.eqiad.wmnet with reason: Maintenance [00:37:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P42364 and previous config saved to /var/cache/conftool/dbconfig/20221206-003716-ladsgroup.json 
[00:38:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3315', diff saved to https://phabricator.wikimedia.org/P42365 and previous config saved to /var/cache/conftool/dbconfig/20221206-003833-ladsgroup.json [00:42:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2158 (T323907)', diff saved to https://phabricator.wikimedia.org/P42366 and previous config saved to /var/cache/conftool/dbconfig/20221206-004231-ladsgroup.json [00:42:35] T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907 [00:52:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T322618)', diff saved to https://phabricator.wikimedia.org/P42367 and previous config saved to /var/cache/conftool/dbconfig/20221206-005223-ladsgroup.json [00:52:25] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1185.eqiad.wmnet with reason: Maintenance [00:52:26] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [00:52:38] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1185.eqiad.wmnet with reason: Maintenance [00:52:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1185 (T322618)', diff saved to https://phabricator.wikimedia.org/P42368 and previous config saved to /var/cache/conftool/dbconfig/20221206-005244-ladsgroup.json [00:53:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3315 (T322618)', diff saved to https://phabricator.wikimedia.org/P42369 and previous config saved to /var/cache/conftool/dbconfig/20221206-005339-ladsgroup.json [00:53:41] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2178.codfw.wmnet with reason: Maintenance [00:53:55] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2178.codfw.wmnet with 
reason: Maintenance [00:54:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2178 (T322618)', diff saved to https://phabricator.wikimedia.org/P42370 and previous config saved to /var/cache/conftool/dbconfig/20221206-005401-ladsgroup.json [00:54:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1185 (T322618)', diff saved to https://phabricator.wikimedia.org/P42371 and previous config saved to /var/cache/conftool/dbconfig/20221206-005457-ladsgroup.json [00:55:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2178 (T322618)', diff saved to https://phabricator.wikimedia.org/P42372 and previous config saved to /var/cache/conftool/dbconfig/20221206-005526-ladsgroup.json [00:57:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2158', diff saved to https://phabricator.wikimedia.org/P42373 and previous config saved to /var/cache/conftool/dbconfig/20221206-005737-ladsgroup.json [01:07:50] (CR) BryanDavis: Make "make" available in all images (1 comment) [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/864828 (https://phabricator.wikimedia.org/T320343) (owner: Jberkel) [01:08:54] SRE, ops-eqiad, Cassandra, DC-Ops, User-Eevans: Relocate hosts: aqs10[3-5] - https://phabricator.wikimedia.org/T307035 (Eevans) >>! In T307035#8436528, @Eevans wrote: >>>! In T307035#8078353, @Cmjohnson wrote: >> @Eevans take your time, I just want to make sure that we're not falling behind o... 
[01:10:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1185', diff saved to https://phabricator.wikimedia.org/P42374 and previous config saved to /var/cache/conftool/dbconfig/20221206-011003-ladsgroup.json [01:10:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2178', diff saved to https://phabricator.wikimedia.org/P42375 and previous config saved to /var/cache/conftool/dbconfig/20221206-011033-ladsgroup.json [01:10:52] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1165.eqiad.wmnet with reason: Maintenance [01:11:06] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1165.eqiad.wmnet with reason: Maintenance [01:11:07] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [01:11:22] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [01:11:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1165 (T323907)', diff saved to https://phabricator.wikimedia.org/P42376 and previous config saved to /var/cache/conftool/dbconfig/20221206-011128-ladsgroup.json [01:11:32] T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907 [01:12:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2158', diff saved to https://phabricator.wikimedia.org/P42377 and previous config saved to /var/cache/conftool/dbconfig/20221206-011244-ladsgroup.json [01:25:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1185', diff saved to https://phabricator.wikimedia.org/P42378 and previous config saved to /var/cache/conftool/dbconfig/20221206-012510-ladsgroup.json [01:25:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 
'Repooling after maintenance db2178', diff saved to https://phabricator.wikimedia.org/P42379 and previous config saved to /var/cache/conftool/dbconfig/20221206-012539-ladsgroup.json [01:27:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2158 (T323907)', diff saved to https://phabricator.wikimedia.org/P42380 and previous config saved to /var/cache/conftool/dbconfig/20221206-012750-ladsgroup.json [01:27:53] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2169.codfw.wmnet with reason: Maintenance [01:27:54] T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907 [01:28:06] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2169.codfw.wmnet with reason: Maintenance [01:28:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2169:3316 (T323907)', diff saved to https://phabricator.wikimedia.org/P42381 and previous config saved to /var/cache/conftool/dbconfig/20221206-012812-ladsgroup.json [01:30:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T323907)', diff saved to https://phabricator.wikimedia.org/P42382 and previous config saved to /var/cache/conftool/dbconfig/20221206-013057-ladsgroup.json [01:34:08] PROBLEM - Check systemd state on gitlab1003 is CRITICAL: CRITICAL - degraded: The following units failed: backup-restore.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:40:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1185 (T322618)', diff saved to https://phabricator.wikimedia.org/P42383 and previous config saved to /var/cache/conftool/dbconfig/20221206-014017-ladsgroup.json [01:40:19] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1200.eqiad.wmnet with reason: Maintenance [01:40:21] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - 
https://phabricator.wikimedia.org/T322618 [01:40:32] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1200.eqiad.wmnet with reason: Maintenance [01:40:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1200 (T322618)', diff saved to https://phabricator.wikimedia.org/P42384 and previous config saved to /var/cache/conftool/dbconfig/20221206-014038-ladsgroup.json [01:40:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2178 (T322618)', diff saved to https://phabricator.wikimedia.org/P42385 and previous config saved to /var/cache/conftool/dbconfig/20221206-014046-ladsgroup.json [01:41:45] (JobUnavailable) firing: (5) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:42:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1200 (T322618)', diff saved to https://phabricator.wikimedia.org/P42386 and previous config saved to /var/cache/conftool/dbconfig/20221206-014251-ladsgroup.json [01:46:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P42387 and previous config saved to /var/cache/conftool/dbconfig/20221206-014604-ladsgroup.json [01:51:45] (JobUnavailable) firing: (6) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:53:08] SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ssingh) [01:57:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1200', diff saved to 
https://phabricator.wikimedia.org/P42388 and previous config saved to /var/cache/conftool/dbconfig/20221206-015757-ladsgroup.json [02:01:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P42389 and previous config saved to /var/cache/conftool/dbconfig/20221206-020110-ladsgroup.json [02:06:45] (JobUnavailable) firing: (6) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:11:45] (JobUnavailable) firing: (6) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:13:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3316 (T323907)', diff saved to https://phabricator.wikimedia.org/P42390 and previous config saved to /var/cache/conftool/dbconfig/20221206-021301-ladsgroup.json [02:13:05] T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907 [02:13:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1200', diff saved to https://phabricator.wikimedia.org/P42391 and previous config saved to /var/cache/conftool/dbconfig/20221206-021310-ladsgroup.json [02:16:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T323907)', diff saved to https://phabricator.wikimedia.org/P42392 and previous config saved to /var/cache/conftool/dbconfig/20221206-021617-ladsgroup.json [02:16:19] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1168.eqiad.wmnet with reason: Maintenance [02:16:32] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 12:00:00 on db1168.eqiad.wmnet with reason: Maintenance [02:16:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1168 (T323907)', diff saved to https://phabricator.wikimedia.org/P42393 and previous config saved to /var/cache/conftool/dbconfig/20221206-021638-ladsgroup.json [02:28:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3316', diff saved to https://phabricator.wikimedia.org/P42394 and previous config saved to /var/cache/conftool/dbconfig/20221206-022808-ladsgroup.json [02:28:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1200 (T322618)', diff saved to https://phabricator.wikimedia.org/P42395 and previous config saved to /var/cache/conftool/dbconfig/20221206-022817-ladsgroup.json [02:28:19] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [02:28:21] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [02:28:32] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [02:31:03] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:32:32] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [02:36:03] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - 
https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:42:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T323907)', diff saved to https://phabricator.wikimedia.org/P42396 and previous config saved to /var/cache/conftool/dbconfig/20221206-024236-ladsgroup.json [02:42:40] T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907 [02:43:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3316', diff saved to https://phabricator.wikimedia.org/P42397 and previous config saved to /var/cache/conftool/dbconfig/20221206-024314-ladsgroup.json [02:57:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P42398 and previous config saved to /var/cache/conftool/dbconfig/20221206-025743-ladsgroup.json [02:58:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3316 (T323907)', diff saved to https://phabricator.wikimedia.org/P42399 and previous config saved to /var/cache/conftool/dbconfig/20221206-025821-ladsgroup.json [02:58:23] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2171.codfw.wmnet with reason: Maintenance [02:58:25] T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907 [02:58:25] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2171.codfw.wmnet with reason: Maintenance [02:58:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2171:3316 (T323907)', diff saved to https://phabricator.wikimedia.org/P42400 and previous config saved to /var/cache/conftool/dbconfig/20221206-025831-ladsgroup.json [03:00:05] Deploy window Automatic branching of MediaWiki, extensions, skins, and 
vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221206T0300) [03:07:49] (PS1) TrainBranchBot: Branch commit for wmf/1.40.0-wmf.13 [core] (wmf/1.40.0-wmf.13) - https://gerrit.wikimedia.org/r/864819 (https://phabricator.wikimedia.org/T320518) [03:07:53] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/1.40.0-wmf.13 [core] (wmf/1.40.0-wmf.13) - https://gerrit.wikimedia.org/r/864819 (https://phabricator.wikimedia.org/T320518) (owner: TrainBranchBot) [03:12:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P42401 and previous config saved to /var/cache/conftool/dbconfig/20221206-031250-ladsgroup.json [03:22:04] (Merged) jenkins-bot: Branch commit for wmf/1.40.0-wmf.13 [core] (wmf/1.40.0-wmf.13) - https://gerrit.wikimedia.org/r/864819 (https://phabricator.wikimedia.org/T320518) (owner: TrainBranchBot) [03:27:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T323907)', diff saved to https://phabricator.wikimedia.org/P42402 and previous config saved to /var/cache/conftool/dbconfig/20221206-032756-ladsgroup.json [03:27:58] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1180.eqiad.wmnet with reason: Maintenance [03:28:00] T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907 [03:28:12] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1180.eqiad.wmnet with reason: Maintenance [03:28:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1180 (T323907)', diff saved to https://phabricator.wikimedia.org/P42403 and previous config saved to /var/cache/conftool/dbconfig/20221206-032818-ladsgroup.json [03:29:50] PROBLEM - IPMI Sensor Status on dns5004 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy 
= Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [03:30:05] ^ robh [03:34:20] PROBLEM - Check unit status of geoip_update_main on puppetmaster1001 is CRITICAL: CRITICAL: Status of the systemd unit geoip_update_main https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [03:34:46] PROBLEM - Check systemd state on puppetmaster1001 is CRITICAL: CRITICAL - degraded: The following units failed: geoip_update_main.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:43:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3316 (T323907)', diff saved to https://phabricator.wikimedia.org/P42404 and previous config saved to /var/cache/conftool/dbconfig/20221206-034309-ladsgroup.json [03:43:14] T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907 [03:48:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T323907)', diff saved to https://phabricator.wikimedia.org/P42405 and previous config saved to /var/cache/conftool/dbconfig/20221206-034806-ladsgroup.json [03:58:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3316', diff saved to https://phabricator.wikimedia.org/P42406 and previous config saved to /var/cache/conftool/dbconfig/20221206-035815-ladsgroup.json [04:00:05] Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221206T0400) [04:00:38] RECOVERY - IPMI Sensor Status on dns5004 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [04:01:15] (PS1) TrainBranchBot: testwikis wikis to 1.40.0-wmf.13 [mediawiki-config] - https://gerrit.wikimedia.org/r/864888 
(https://phabricator.wikimedia.org/T320518) [04:01:17] (CR) TrainBranchBot: [C: +2] testwikis wikis to 1.40.0-wmf.13 [mediawiki-config] - https://gerrit.wikimedia.org/r/864888 (https://phabricator.wikimedia.org/T320518) (owner: TrainBranchBot) [04:01:58] (Merged) jenkins-bot: testwikis wikis to 1.40.0-wmf.13 [mediawiki-config] - https://gerrit.wikimedia.org/r/864888 (https://phabricator.wikimedia.org/T320518) (owner: TrainBranchBot) [04:02:28] !log mwpresync@deploy1002 Started scap: testwikis wikis to 1.40.0-wmf.13 refs T320518 [04:02:32] T320518: 1.40.0-wmf.13 deployment blockers - https://phabricator.wikimedia.org/T320518 [04:03:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P42407 and previous config saved to /var/cache/conftool/dbconfig/20221206-040313-ladsgroup.json [04:13:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3316', diff saved to https://phabricator.wikimedia.org/P42408 and previous config saved to /var/cache/conftool/dbconfig/20221206-041322-ladsgroup.json [04:18:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P42409 and previous config saved to /var/cache/conftool/dbconfig/20221206-041820-ladsgroup.json [04:28:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3316 (T323907)', diff saved to https://phabricator.wikimedia.org/P42410 and previous config saved to /var/cache/conftool/dbconfig/20221206-042828-ladsgroup.json [04:28:31] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2180.codfw.wmnet with reason: Maintenance [04:28:34] T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907 [04:28:44] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2180.codfw.wmnet with reason: 
Maintenance [04:28:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2180 (T323907)', diff saved to https://phabricator.wikimedia.org/P42411 and previous config saved to /var/cache/conftool/dbconfig/20221206-042850-ladsgroup.json [04:33:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T323907)', diff saved to https://phabricator.wikimedia.org/P42412 and previous config saved to /var/cache/conftool/dbconfig/20221206-043326-ladsgroup.json [04:33:28] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1187.eqiad.wmnet with reason: Maintenance [04:33:36] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: train-presync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:33:42] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1187.eqiad.wmnet with reason: Maintenance [04:33:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1187 (T323907)', diff saved to https://phabricator.wikimedia.org/P42413 and previous config saved to /var/cache/conftool/dbconfig/20221206-043348-ladsgroup.json [04:33:51] T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907 [04:53:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1187 (T323907)', diff saved to https://phabricator.wikimedia.org/P42414 and previous config saved to /var/cache/conftool/dbconfig/20221206-045330-ladsgroup.json [04:53:35] T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907 [04:55:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2180 (T323907)', diff saved to https://phabricator.wikimedia.org/P42415 and previous config saved to /var/cache/conftool/dbconfig/20221206-045510-ladsgroup.json [05:08:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1187', diff saved 
to https://phabricator.wikimedia.org/P42416 and previous config saved to /var/cache/conftool/dbconfig/20221206-050837-ladsgroup.json [05:10:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P42417 and previous config saved to /var/cache/conftool/dbconfig/20221206-051016-ladsgroup.json [05:23:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1187', diff saved to https://phabricator.wikimedia.org/P42418 and previous config saved to /var/cache/conftool/dbconfig/20221206-052343-ladsgroup.json [05:25:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P42419 and previous config saved to /var/cache/conftool/dbconfig/20221206-052523-ladsgroup.json [05:31:31] (CR) Abijeet Patro: "Use: I6fb68fdd10fa30bc6724c1bfee459e213509f060" [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/861837 (owner: L10n-bot) [05:31:45] (CR) Abijeet Patro: "Use: I6fb68fdd10fa30bc6724c1bfee459e213509f060" [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/862864 (owner: L10n-bot) [05:32:38] (CR) Abijeet Patro: Localisation updates from https://translatewiki.net. 
(1 comment) [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/862864 (owner: L10n-bot) [05:38:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1187 (T323907)', diff saved to https://phabricator.wikimedia.org/P42420 and previous config saved to /var/cache/conftool/dbconfig/20221206-053850-ladsgroup.json [05:38:52] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1201.eqiad.wmnet with reason: Maintenance [05:38:55] T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907 [05:39:05] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1201.eqiad.wmnet with reason: Maintenance [05:39:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1201 (T323907)', diff saved to https://phabricator.wikimedia.org/P42421 and previous config saved to /var/cache/conftool/dbconfig/20221206-053911-ladsgroup.json [05:40:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2180 (T323907)', diff saved to https://phabricator.wikimedia.org/P42422 and previous config saved to /var/cache/conftool/dbconfig/20221206-054030-ladsgroup.json [05:58:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1201 (T323907)', diff saved to https://phabricator.wikimedia.org/P42423 and previous config saved to /var/cache/conftool/dbconfig/20221206-055843-ladsgroup.json [05:58:47] T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907 [06:11:28] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - ASunknown/IPv6: Connect https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [06:12:00] (JobUnavailable) firing: Reduced availability for job trafficserver-text in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - 
https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:13:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1201', diff saved to https://phabricator.wikimedia.org/P42424 and previous config saved to /var/cache/conftool/dbconfig/20221206-061349-ladsgroup.json [06:28:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1201', diff saved to https://phabricator.wikimedia.org/P42425 and previous config saved to /var/cache/conftool/dbconfig/20221206-062856-ladsgroup.json [06:32:32] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [06:36:41] (03CR) 10Giuseppe Lavagetto: "I think we still use some instances of this redis for locking during uploads. That should be migrated before we dismiss them." [puppet] - 10https://gerrit.wikimedia.org/r/864830 (https://phabricator.wikimedia.org/T267581) (owner: 10Effie Mouzeli) [06:44:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1201 (T323907)', diff saved to https://phabricator.wikimedia.org/P42426 and previous config saved to /var/cache/conftool/dbconfig/20221206-064402-ladsgroup.json [06:44:05] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance [06:44:08] T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907 [06:44:19] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance [07:00:04] kormat, marostegui, and Amir1: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221206T0700). 
[07:04:00] (03PS1) 10Stevemunene: Add python-is-python3 package [puppet] - 10https://gerrit.wikimedia.org/r/864895 (https://phabricator.wikimedia.org/T323783) [07:04:20] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [07:07:54] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [07:09:35] (03CR) 10Stevemunene: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38589/console" [puppet] - 10https://gerrit.wikimedia.org/r/864895 (https://phabricator.wikimedia.org/T323783) (owner: 10Stevemunene) [07:35:58] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 120 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [07:38:52] 10SRE, 10Infrastructure-Foundations, 10netops: cr2-esams:FPC0 Parity error - https://phabricator.wikimedia.org/T318783 (10ayounsi) I tried to create a JTAC ticket for an RMA but am getting: > Our records show that the Service Contract has expired for the serial number or Software Support Reference Number (S... [07:43:20] PROBLEM - Check systemd state on analytics1077 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:58:20] PROBLEM - Host analytics1077 is DOWN: PING CRITICAL - Packet loss = 100% [08:00:04] Amir1 and Urbanecm: (Dis)respected human, time to deploy UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221206T0800). Please do the needful. [08:00:05] No Gerrit patches in the queue for this window AFAICS. [08:06:16] I may have a patch to backport. Dealing with some CI issues at the moment, though. 
[08:13:52] (03PS1) 10Hashar: gerrit: raise H2 compaction time [puppet] - 10https://gerrit.wikimedia.org/r/865023 (https://phabricator.wikimedia.org/T323754) [08:17:25] (03CR) 10Hashar: "This follows up on the Gerrit disk being filled up when doing the 3.4 > 3.5 upgrade. I have provided a summary of my rationale in the commit message" [puppet] - 10https://gerrit.wikimedia.org/r/865023 (https://phabricator.wikimedia.org/T323754) (owner: 10Hashar) [08:17:59] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10MoritzMuehlenhoff) One thing to keep in mind for the LVSes is that Bullseye only includes Python 2 as a build dependency (at the time of the release some crucial packages (most notably Chrom... [08:28:17] (03CR) 10Filippo Giunchedi: [C: 03+2] hieradata: add note re: thanos-web and scheduler: sh and SSO [puppet] - 10https://gerrit.wikimedia.org/r/864663 (https://phabricator.wikimedia.org/T323913) (owner: 10Filippo Giunchedi) [08:39:24] RECOVERY - Host analytics1077 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms [08:39:54] RECOVERY - Check systemd state on analytics1077 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:47:00] (03PS1) 10Kosta Harlan: User impact: Do not show impact module if user has no mainspace edits [extensions/GrowthExperiments] (wmf/1.40.0-wmf.12) - 10https://gerrit.wikimedia.org/r/864908 (https://phabricator.wikimedia.org/T324285) [08:47:21] (03PS1) 10Kosta Harlan: User impact: Do not show impact module if user has no mainspace edits [extensions/GrowthExperiments] (wmf/1.40.0-wmf.13) - 10https://gerrit.wikimedia.org/r/864909 (https://phabricator.wikimedia.org/T324285) [08:48:46] (03PS2) 10Kosta Harlan: User impact: Do not show impact module if user has no mainspace edits [extensions/GrowthExperiments] (wmf/1.40.0-wmf.12) - 10https://gerrit.wikimedia.org/r/864908 (https://phabricator.wikimedia.org/T324285) 
[08:49:17] (03PS2) 10Kosta Harlan: User impact: Do not show impact module if user has no mainspace edits [extensions/GrowthExperiments] (wmf/1.40.0-wmf.13) - 10https://gerrit.wikimedia.org/r/864909 (https://phabricator.wikimedia.org/T324285) [08:49:52] Alright going ahead with two backports, as there's nothing coming up [08:50:19] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by kharlan@deploy1002 using scap backport" [extensions/GrowthExperiments] (wmf/1.40.0-wmf.12) - 10https://gerrit.wikimedia.org/r/864908 (https://phabricator.wikimedia.org/T324285) (owner: 10Kosta Harlan) [08:59:50] (03PS1) 10Kosta Harlan: NewImpact: Show "999+" when we could not count edits/thanks [extensions/GrowthExperiments] (wmf/1.40.0-wmf.12) - 10https://gerrit.wikimedia.org/r/864910 (https://phabricator.wikimedia.org/T324286) [09:00:05] (03PS1) 10Kosta Harlan: NewImpact: Show "999+" when we could not count edits/thanks [extensions/GrowthExperiments] (wmf/1.40.0-wmf.13) - 10https://gerrit.wikimedia.org/r/864911 (https://phabricator.wikimedia.org/T324286) [09:03:20] (03CR) 10Hashar: gerrit: script to report on git gc durations (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/856601 (https://phabricator.wikimedia.org/T237807) (owner: 10Hashar) [09:03:58] make that 4 backports (2 to wmf.12, 2 to wmf.13), as long as no one objects. cc Amir1 urbanecm [09:07:17] https://gerrit.wikimedia.org/r/c/mediawiki/core/+/864907 unbreaks CI for several extensions, and will need a backport to wmf.13. 
[09:08:39] (03Merged) 10jenkins-bot: User impact: Do not show impact module if user has no mainspace edits [extensions/GrowthExperiments] (wmf/1.40.0-wmf.12) - 10https://gerrit.wikimedia.org/r/864908 (https://phabricator.wikimedia.org/T324285) (owner: 10Kosta Harlan) [09:09:27] !log kharlan@deploy1002 Started scap: Backport for [[gerrit:864908|User impact: Do not show impact module if user has no mainspace edits (T324285)]] [09:09:30] T324285: NewImpact: Null state for "Last edited" - https://phabricator.wikimedia.org/T324285 [09:11:41] !log kharlan@deploy1002 kharlan and kharlan: Backport for [[gerrit:864908|User impact: Do not show impact module if user has no mainspace edits (T324285)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet [09:13:44] syncing... [09:17:15] (03CR) 10CI reject: [V: 04-1] User impact: Do not show impact module if user has no mainspace edits [extensions/GrowthExperiments] (wmf/1.40.0-wmf.13) - 10https://gerrit.wikimedia.org/r/864909 (https://phabricator.wikimedia.org/T324285) (owner: 10Kosta Harlan) [09:18:10] (03PS1) 10Kosta Harlan: Revert "resourceloader: Modern ES6 code should be forced to target mobile" [core] (wmf/1.40.0-wmf.13) - 10https://gerrit.wikimedia.org/r/864912 (https://phabricator.wikimedia.org/T323542) [09:18:56] (03CR) 10Kosta Harlan: [C: 04-2] "Should get approval from ResourceLoader maintainers or reviewers/authors of reverted patch." 
[core] (wmf/1.40.0-wmf.13) - 10https://gerrit.wikimedia.org/r/864912 (https://phabricator.wikimedia.org/T323542) (owner: 10Kosta Harlan) [09:19:06] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [09:20:31] (03PS2) 10Kosta Harlan: NewImpact: Show "999+" when we could not count edits/thanks [extensions/GrowthExperiments] (wmf/1.40.0-wmf.13) - 10https://gerrit.wikimedia.org/r/864911 (https://phabricator.wikimedia.org/T324286) [09:20:51] (03PS3) 10Kosta Harlan: User impact: Do not show impact module if user has no mainspace edits [extensions/GrowthExperiments] (wmf/1.40.0-wmf.13) - 10https://gerrit.wikimedia.org/r/864909 (https://phabricator.wikimedia.org/T324285) [09:22:38] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [09:25:00] PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [400.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10 [09:27:12] the sync-proxies/sync-apaches/sync-canaries steps seem to take much longer than I remember. 
each one is 3-4 minutes [09:32:41] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/863380 (https://phabricator.wikimedia.org/T301944) (owner: 10Herron) [09:33:33] (03CR) 10Filippo Giunchedi: "can be abandoned in favor of I0da3b5e54d" [puppet] - 10https://gerrit.wikimedia.org/r/864776 (https://phabricator.wikimedia.org/T324466) (owner: 10Herron) [09:37:32] (03CR) 10Filippo Giunchedi: [C: 03+2] icinga: move mgmt_parents to icinga [puppet] - 10https://gerrit.wikimedia.org/r/860573 (https://phabricator.wikimedia.org/T310266) (owner: 10Filippo Giunchedi) [09:37:32] !log kharlan@deploy1002 Finished scap: Backport for [[gerrit:864908|User impact: Do not show impact module if user has no mainspace edits (T324285)]] (duration: 28m 05s) [09:37:36] T324285: NewImpact: Null state for "Last edited" - https://phabricator.wikimedia.org/T324285 [09:37:38] (03PS3) 10Filippo Giunchedi: icinga: move mgmt_parents to icinga [puppet] - 10https://gerrit.wikimedia.org/r/860573 (https://phabricator.wikimedia.org/T310266) [09:38:29] on to the next one... 
[09:38:32] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by kharlan@deploy1002 using scap backport" [extensions/GrowthExperiments] (wmf/1.40.0-wmf.12) - 10https://gerrit.wikimedia.org/r/864910 (https://phabricator.wikimedia.org/T324286) (owner: 10Kosta Harlan) [09:41:06] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (gerrit1001), Fresh: 119 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [09:41:15] (03CR) 10Filippo Giunchedi: [V: 03+2] icinga: move mgmt_parents to icinga [puppet] - 10https://gerrit.wikimedia.org/r/860573 (https://phabricator.wikimedia.org/T310266) (owner: 10Filippo Giunchedi) [09:44:43] (03CR) 10Filippo Giunchedi: [C: 03+2] hieradata: remove mgmt_contactgroups [puppet] - 10https://gerrit.wikimedia.org/r/860574 (https://phabricator.wikimedia.org/T310266) (owner: 10Filippo Giunchedi) [09:47:38] (03PS3) 10Filippo Giunchedi: hieradata: remove mgmt_contactgroups [puppet] - 10https://gerrit.wikimedia.org/r/860574 (https://phabricator.wikimedia.org/T310266) [09:49:05] (03CR) 10Filippo Giunchedi: [C: 03+2] hieradata: remove mgmt_contactgroups [puppet] - 10https://gerrit.wikimedia.org/r/860574 (https://phabricator.wikimedia.org/T310266) (owner: 10Filippo Giunchedi) [09:49:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:54:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:57:08] (03CR) 10CI reject: [V: 04-1] NewImpact: Show "999+" when we could not count edits/thanks [extensions/GrowthExperiments] (wmf/1.40.0-wmf.12) - 
10https://gerrit.wikimedia.org/r/864910 (https://phabricator.wikimedia.org/T324286) (owner: 10Kosta Harlan) [09:58:35] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by kharlan@deploy1002 using scap backport" [extensions/GrowthExperiments] (wmf/1.40.0-wmf.12) - 10https://gerrit.wikimedia.org/r/864910 (https://phabricator.wikimedia.org/T324286) (owner: 10Kosta Harlan) [09:59:10] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by kharlan@deploy1002 using scap backport" [extensions/GrowthExperiments] (wmf/1.40.0-wmf.12) - 10https://gerrit.wikimedia.org/r/864910 (https://phabricator.wikimedia.org/T324286) (owner: 10Kosta Harlan) [10:00:33] (03PS2) 10Stevemunene: Add python-is-python3 package [puppet] - 10https://gerrit.wikimedia.org/r/864895 (https://phabricator.wikimedia.org/T323783) [10:03:12] (03CR) 10Stevemunene: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38590/console" [puppet] - 10https://gerrit.wikimedia.org/r/864895 (https://phabricator.wikimedia.org/T323783) (owner: 10Stevemunene) [10:08:05] (03CR) 10Btullis: [C: 03+1] "Looks good to me." 
[puppet] - 10https://gerrit.wikimedia.org/r/864895 (https://phabricator.wikimedia.org/T323783) (owner: 10Stevemunene) [10:08:45] (03PS6) 10David Caro: wmcs: add cookbook to create a project [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/862229 (https://phabricator.wikimedia.org/T323952) [10:11:26] (03CR) 10Stevemunene: [V: 03+1 C: 03+2] Add python-is-python3 package [puppet] - 10https://gerrit.wikimedia.org/r/864895 (https://phabricator.wikimedia.org/T323783) (owner: 10Stevemunene) [10:12:00] (JobUnavailable) firing: Reduced availability for job trafficserver-text in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:16:24] (03Merged) 10jenkins-bot: NewImpact: Show "999+" when we could not count edits/thanks [extensions/GrowthExperiments] (wmf/1.40.0-wmf.12) - 10https://gerrit.wikimedia.org/r/864910 (https://phabricator.wikimedia.org/T324286) (owner: 10Kosta Harlan) [10:16:28] (03PS1) 10Stevemunene: Add an-presto1008-1015 to presto cluster [puppet] - 10https://gerrit.wikimedia.org/r/865043 (https://phabricator.wikimedia.org/T323783) [10:16:52] !log kharlan@deploy1002 Started scap: Backport for [[gerrit:864910|NewImpact: Show "999+" when we could not count edits/thanks (T324286)]] [10:16:57] T324286: NewImpact: edits and thanks are capped at 1000 - https://phabricator.wikimedia.org/T324286 [10:20:49] (03PS8) 10FNegri: cumin::target: Add support for cloudcumin hosts [puppet] - 10https://gerrit.wikimedia.org/r/861855 (https://phabricator.wikimedia.org/T323483) (owner: 10Jbond) [10:23:05] (03CR) 10CI reject: [V: 04-1] cumin::target: Add support for cloudcumin hosts [puppet] - 10https://gerrit.wikimedia.org/r/861855 (https://phabricator.wikimedia.org/T323483) (owner: 10Jbond) [10:23:39] (03CR) 10Stevemunene: [V: 03+1] "PCC SUCCESS (): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38591/console" [puppet] - 10https://gerrit.wikimedia.org/r/865043 (https://phabricator.wikimedia.org/T323783) (owner: 10Stevemunene) [10:24:13] waiting for k8s images build [10:24:54] (03CR) 10David Caro: wmcs: add cookbook to create a project [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/862229 (https://phabricator.wikimedia.org/T323952) (owner: 10David Caro) [10:24:57] (03CR) 10David Caro: [C: 03+2] wmcs: add cookbook to create a project [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/862229 (https://phabricator.wikimedia.org/T323952) (owner: 10David Caro) [10:25:26] (03PS3) 10Majavah: puppetdb: support using client certificates [software/cumin] - 10https://gerrit.wikimedia.org/r/863874 [10:25:56] (03CR) 10Majavah: puppetdb: support using client certificates (034 comments) [software/cumin] - 10https://gerrit.wikimedia.org/r/863874 (owner: 10Majavah) [10:27:25] (03PS1) 10Kosta Harlan: Instrumentation: Monitor navigation duration, transferSize, first paint [extensions/GrowthExperiments] (wmf/1.40.0-wmf.13) - 10https://gerrit.wikimedia.org/r/864915 (https://phabricator.wikimedia.org/T324198) [10:28:02] 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (Kanban): neutron: cloudnet nodes use VRRP over VXLAN to instrument HA and they require to be on the same subnet - https://phabricator.wikimedia.org/T319539 (10cmooney) @aborrero just going through some tasks. I think perhaps we can clos... 
[10:28:39] (03Merged) 10jenkins-bot: wmcs: add cookbook to create a project [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/862229 (https://phabricator.wikimedia.org/T323952) (owner: 10David Caro) [10:30:34] (03CR) 10Majavah: cumin::target: Add support for cloudcumin hosts (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/861855 (https://phabricator.wikimedia.org/T323483) (owner: 10Jbond) [10:31:38] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Consolidate Automation Templates for DC Switches - https://phabricator.wikimedia.org/T312635 (10cmooney) [10:31:44] 10SRE, 10Infrastructure-Foundations, 10netops: Move interface VRF assignment to Netbox - https://phabricator.wikimedia.org/T310715 (10cmooney) 05Open→03Resolved This has been completed and automation updated to support VRFs "generically" https://gerrit.wikimedia.org/r/c/operations/software/homer/deploy/... [10:32:32] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [10:36:25] !log kharlan@deploy1002 kharlan and kharlan: Backport for [[gerrit:864910|NewImpact: Show "999+" when we could not count edits/thanks (T324286)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet [10:36:28] T324286: NewImpact: edits and thanks are capped at 1000 - https://phabricator.wikimedia.org/T324286 [10:41:56] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 120 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [10:42:40] (03CR) 10Hnowlan: [C: 03+1] Promote Cassandra 3.11.13 to '3.x' (aka stable) (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/863026 (owner: 10Eevans) [10:48:17] !log kharlan@deploy1002 Finished scap: Backport for [[gerrit:864910|NewImpact: Show "999+" when we could 
not count edits/thanks (T324286)]] (duration: 31m 25s) [10:48:21] T324286: NewImpact: edits and thanks are capped at 1000 - https://phabricator.wikimedia.org/T324286 [10:48:54] CUSTOM - Host an-coord1001 is UP: PING OK - Packet loss = 0%, RTA = 0.66 ms [10:51:40] (03PS1) 10FNegri: Rename cloudcumin key to match production name [labs/private] - 10https://gerrit.wikimedia.org/r/865047 (https://phabricator.wikimedia.org/T323483) [10:52:20] (03CR) 10Volans: [C: 03+1] "LGTM" [labs/private] - 10https://gerrit.wikimedia.org/r/865047 (https://phabricator.wikimedia.org/T323483) (owner: 10FNegri) [10:53:56] (03CR) 10FNegri: [C: 03+2] Rename cloudcumin key to match production name [labs/private] - 10https://gerrit.wikimedia.org/r/865047 (https://phabricator.wikimedia.org/T323483) (owner: 10FNegri) [10:54:32] (03CR) 10FNegri: [V: 03+2 C: 03+2] Rename cloudcumin key to match production name [labs/private] - 10https://gerrit.wikimedia.org/r/865047 (https://phabricator.wikimedia.org/T323483) (owner: 10FNegri) [10:56:52] !log installing freetype security updates [10:56:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:56:57] (03PS9) 10FNegri: cumin::target: Add support for cloudcumin hosts [puppet] - 10https://gerrit.wikimedia.org/r/861855 (https://phabricator.wikimedia.org/T323483) (owner: 10Jbond) [10:57:22] (03PS10) 10FNegri: cumin::target: Add support for cloudcumin hosts [puppet] - 10https://gerrit.wikimedia.org/r/861855 (https://phabricator.wikimedia.org/T323483) (owner: 10Jbond) [10:59:08] RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [200.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10 [10:59:19] !log UTC morning deploys done [10:59:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:03:06] RECOVERY - NFS Share Volume Space /srv/tools on 
labstore1004 is OK: DISK OK - free space: /srv/tools 1826019 MB (23% inode=67%): https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Shared_storage%23NFS_volume_cleanup https://grafana.wikimedia.org/d/50z0i4XWz/tools-overall-nfs-storage-utilization?orgId=1 [11:04:21] 10SRE: geoip_update_main failure on puppetmaster1001 - https://phabricator.wikimedia.org/T324548 (10Clement_Goubert) [11:05:16] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:07:08] CUSTOM - Host an-coord1001 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms [11:08:22] (03PS2) 10Hnowlan: thumbor: increase memory limit, replicas. [deployment-charts] - 10https://gerrit.wikimedia.org/r/864773 (https://phabricator.wikimedia.org/T323936) [11:11:31] 10SRE, 10Data-Engineering, 10Shared-Data-Infrastructure: geoip_update_main failure on puppetmaster1001 - https://phabricator.wikimedia.org/T324548 (10BTullis) a:03BTullis Thanks @Clement_Goubert - I believe that @odimitrijevic is already working on getting an updated licence. I'll claim this ticket and ta... [11:12:46] RECOVERY - Check systemd state on puppetmaster1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:13:43] 10SRE, 10Data-Engineering, 10Shared-Data-Infrastructure: geoip_update_main failure on puppetmaster1001 - https://phabricator.wikimedia.org/T324548 (10Clement_Goubert) FYI, I still reset the failed state on puppetmaster1001 so we get alerted if another service fails, and left a persistent comment with this ta... 
[11:16:10] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:18:38] RECOVERY - Check unit status of geoip_update_main on puppetmaster1001 is OK: OK: Status of the systemd unit geoip_update_main https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:20:07] !log jmm@cumin2002 START - Cookbook sre.misc-clusters.roll-restart-reboot-docker-registry rolling restart_daemons on A:docker-registry [11:22:34] !log jmm@cumin2002 END (PASS) - Cookbook sre.misc-clusters.roll-restart-reboot-docker-registry (exit_code=0) rolling restart_daemons on A:docker-registry [11:24:51] (03PS1) 10Volans: cumin: remove hieradata of decommissioned host [puppet] - 10https://gerrit.wikimedia.org/r/865048 [11:24:53] (03PS1) 10Volans: cluster::cloud_management: create new role [puppet] - 10https://gerrit.wikimedia.org/r/865049 (https://phabricator.wikimedia.org/T319401) [11:29:12] (03CR) 10Muehlenhoff: [C: 03+1] cumin: remove hieradata of decommissioned host [puppet] - 10https://gerrit.wikimedia.org/r/865048 (owner: 10Volans) [11:29:46] (03CR) 10Volans: [C: 03+2] cumin: remove hieradata of decommissioned host [puppet] - 10https://gerrit.wikimedia.org/r/865048 (owner: 10Volans) [11:31:01] steve_munene: ok to puppet-merge your change too? Add python-is-python3 package (ed81cc4d99) [11:32:18] cc btullis I guess [11:40:07] volans: I can vouch for that change. Guess Steve forgot to merge. [11:40:18] ack, thanks [11:44:23] (03CR) 10Volans: "PCC: https://puppet-compiler.wmflabs.org/output/865049/38592/" [puppet] - 10https://gerrit.wikimedia.org/r/865049 (https://phabricator.wikimedia.org/T319401) (owner: 10Volans) [11:44:41] (03CR) 10Volans: "Of course we can't PCC it for the new hosts as they don't exist yet." 
[puppet] - 10https://gerrit.wikimedia.org/r/865049 (https://phabricator.wikimedia.org/T319401) (owner: 10Volans) [11:51:46] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:54:00] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [11:54:48] (03PS1) 10Hnowlan: maps: remove tilerator::regen [puppet] - 10https://gerrit.wikimedia.org/r/865053 (https://phabricator.wikimedia.org/T298246) [11:55:32] (03PS2) 10Hnowlan: maps: remove tilerator::regen [puppet] - 10https://gerrit.wikimedia.org/r/865053 (https://phabricator.wikimedia.org/T298246) [11:55:44] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [11:57:26] (03CR) 10Hnowlan: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38593/console" [puppet] - 10https://gerrit.wikimedia.org/r/865053 (https://phabricator.wikimedia.org/T298246) (owner: 10Hnowlan) [11:59:05] (03CR) 10Effie Mouzeli: [C: 03+1] maps: remove tilerator::regen [puppet] - 10https://gerrit.wikimedia.org/r/865053 (https://phabricator.wikimedia.org/T298246) (owner: 10Hnowlan) [12:04:49] (03PS1) 10Hnowlan: thumbor: change exposed port to 8800 [deployment-charts] - 10https://gerrit.wikimedia.org/r/865054 (https://phabricator.wikimedia.org/T233196) [12:05:24] (03CR) 10Hnowlan: [V: 03+1 C: 03+2] maps: remove tilerator::regen [puppet] - 10https://gerrit.wikimedia.org/r/865053 (https://phabricator.wikimedia.org/T298246) (owner: 10Hnowlan) [12:10:06] !log jmm@cumin2002 START - Cookbook sre.wdqs.restart-nginx rolling restart_daemons on A:wcqs-public [12:12:03] jouncebot: nowandnext [12:12:03] No deployments scheduled for the next 1 hour(s) and 47 minute(s) [12:12:04] In 1 hour(s) and 47 
minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221206T1400) [12:12:04] In 1 hour(s) and 47 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221206T1400) [12:14:54] !log jnuche@deploy1002 Started scap: testwikis wikis to 1.40.0-wmf.13 refs T320518 [12:14:58] T320518: 1.40.0-wmf.13 deployment blockers - https://phabricator.wikimedia.org/T320518 [12:15:41] (03CR) 10Clément Goubert: [C: 03+1] thumbor: increase memory limit, replicas. [deployment-charts] - 10https://gerrit.wikimedia.org/r/864773 (https://phabricator.wikimedia.org/T323936) (owner: 10Hnowlan) [12:17:02] (03CR) 10Clément Goubert: [C: 03+1] thumbor: change exposed port to 8800 [deployment-charts] - 10https://gerrit.wikimedia.org/r/865054 (https://phabricator.wikimedia.org/T233196) (owner: 10Hnowlan) [12:21:26] !log jnuche@deploy1002 Started scap: testwikis wikis to 1.40.0-wmf.13 refs T320518 [12:21:29] T320518: 1.40.0-wmf.13 deployment blockers - https://phabricator.wikimedia.org/T320518 [12:23:37] (03PS1) 10Hnowlan: maps: remove redis [puppet] - 10https://gerrit.wikimedia.org/r/865056 (https://phabricator.wikimedia.org/T298246) [12:23:55] (03PS1) 10Slyngshede: Ganeti: Add reimaging cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/865057 (https://phabricator.wikimedia.org/T306661) [12:24:54] (03CR) 10Krinkle: [C: 03+2] Revert "resourceloader: Modern ES6 code should be forced to target mobile" [core] (wmf/1.40.0-wmf.13) - 10https://gerrit.wikimedia.org/r/864912 (https://phabricator.wikimedia.org/T323542) (owner: 10Kosta Harlan) [12:25:08] (03CR) 10Krinkle: Revert "resourceloader: Modern ES6 code should be forced to target mobile" [core] (wmf/1.40.0-wmf.13) - 10https://gerrit.wikimedia.org/r/864912 (https://phabricator.wikimedia.org/T323542) (owner: 10Kosta Harlan) [12:25:13] (03CR) 10Krinkle: [C: 03+2] Revert "resourceloader: Modern ES6 code should be forced to target mobile" [core] 
(wmf/1.40.0-wmf.13) - 10https://gerrit.wikimedia.org/r/864912 (https://phabricator.wikimedia.org/T323542) (owner: 10Kosta Harlan) [12:26:15] (03CR) 10CI reject: [V: 04-1] Ganeti: Add reimaging cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/865057 (https://phabricator.wikimedia.org/T306661) (owner: 10Slyngshede) [12:26:58] RECOVERY - Check systemd state on mwmaint1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:27:18] !log jnuche@deploy1002 Finished scap: testwikis wikis to 1.40.0-wmf.13 refs T320518 (duration: 05m 52s) [12:27:20] !log jmm@cumin2002 END (FAIL) - Cookbook sre.wdqs.restart-nginx (exit_code=1) rolling restart_daemons on A:wcqs-public [12:27:22] !log jmm@cumin2002 START - Cookbook sre.wdqs.restart-nginx rolling restart_daemons on A:wcqs-public [12:27:22] T320518: 1.40.0-wmf.13 deployment blockers - https://phabricator.wikimedia.org/T320518 [12:28:57] (03PS2) 10Hnowlan: maps: remove redis [puppet] - 10https://gerrit.wikimedia.org/r/865056 (https://phabricator.wikimedia.org/T298246) [12:29:29] !log jnuche@deploy1002 Pruned MediaWiki: 1.40.0-wmf.10 (duration: 02m 09s) [12:35:16] (03CR) 10Hnowlan: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38596/console" [puppet] - 10https://gerrit.wikimedia.org/r/865056 (https://phabricator.wikimedia.org/T298246) (owner: 10Hnowlan) [12:36:50] (03PS1) 10Muehlenhoff: buster updates [puppet] - 10https://gerrit.wikimedia.org/r/865060 [12:39:07] (03CR) 10Kosta Harlan: "Thanks for the +2. Are you backporting this now as well?" 
[core] (wmf/1.40.0-wmf.13) - 10https://gerrit.wikimedia.org/r/864912 (https://phabricator.wikimedia.org/T323542) (owner: 10Kosta Harlan) [12:40:18] (03PS1) 10Muehlenhoff: package_builder: Don't fail on cleanup jobs [puppet] - 10https://gerrit.wikimedia.org/r/865061 [12:41:24] (03CR) 10CI reject: [V: 04-1] package_builder: Don't fail on cleanup jobs [puppet] - 10https://gerrit.wikimedia.org/r/865061 (owner: 10Muehlenhoff) [12:41:32] (03Merged) 10jenkins-bot: Revert "resourceloader: Modern ES6 code should be forced to target mobile" [core] (wmf/1.40.0-wmf.13) - 10https://gerrit.wikimedia.org/r/864912 (https://phabricator.wikimedia.org/T323542) (owner: 10Kosta Harlan) [12:41:44] (03PS2) 10Slyngshede: Ganeti: Add reimaging cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/865057 (https://phabricator.wikimedia.org/T306661) [12:42:26] (03CR) 10Muehlenhoff: [C: 03+2] buster updates [puppet] - 10https://gerrit.wikimedia.org/r/865060 (owner: 10Muehlenhoff) [12:42:29] (03CR) 10Hnowlan: [C: 03+2] thumbor: increase memory limit, replicas. [deployment-charts] - 10https://gerrit.wikimedia.org/r/864773 (https://phabricator.wikimedia.org/T323936) (owner: 10Hnowlan) [12:42:35] jnuche: https://gerrit.wikimedia.org/r/864912 was just +2'ed for wmf.13, as a heads up. Krinkle, are you syncing that now? [12:43:00] RECOVERY - Check systemd state on wcqs2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:43:07] !log jmm@cumin2002 END (PASS) - Cookbook sre.wdqs.restart-nginx (exit_code=0) rolling restart_daemons on A:wcqs-public [12:43:36] (03PS2) 10Muehlenhoff: package_builder: Don't fail on cleanup jobs [puppet] - 10https://gerrit.wikimedia.org/r/865061 [12:44:23] (03PS3) 10Slyngshede: Ganeti: Add reimaging cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/865057 (https://phabricator.wikimedia.org/T306661) [12:45:05] jnuche: I'm not. kostajh might be [12:45:24] Oh you're asking. 
[12:46:15] Sorry that was uncoordinated on my part. I should've +1ed. [12:46:41] I can roll it out if you like. I have to step out for a bit soon though [12:46:45] np. jnuche, maybe you could sync it as you're doing train things now? otherwise I don't mind doing it now. [12:47:44] kostajh: I was following up on the issue from last night/this morning, not really the conductor :) [12:47:58] (03Merged) 10jenkins-bot: thumbor: increase memory limit, replicas. [deployment-charts] - 10https://gerrit.wikimedia.org/r/864773 (https://phabricator.wikimedia.org/T323936) (owner: 10Hnowlan) [12:48:02] RECOVERY - Check systemd state on matomo1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:48:05] (03PS1) 10KartikMistry: Update cxserver to 2022-12-06-121330-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/865063 (https://phabricator.wikimedia.org/T321781) [12:48:41] jnuche: ok, should I sync it now? [12:49:02] although, I'm a bit lost without the `scap backport` command, and the patch has already been merged to wmf.13 [12:49:17] fetch, rebase, sync [12:49:43] !log installing glibc security updates on buster [12:49:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:51:25] kostajh: IIRC backport is clever enough to realize the patch has been merged and still go ahead [12:51:48] Correct [12:52:09] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by kharlan@deploy1002 using scap backport" [core] (wmf/1.40.0-wmf.13) - 10https://gerrit.wikimedia.org/r/864912 (https://phabricator.wikimedia.org/T323542) (owner: 10Kosta Harlan) [12:52:13] alright, let's see [12:52:32] !log kharlan@deploy1002 Started scap: Backport for [[gerrit:864912|Revert "resourceloader: Modern ES6 code should be forced to target mobile" (T323542)]] [12:52:35] T323542: Forbid new modern code from not targeting mobile - https://phabricator.wikimedia.org/T323542 [12:52:39] !log jmm@cumin2002 START
- Cookbook sre.wdqs.restart-nginx rolling restart_daemons on A:wdqs-all [12:54:20] !log kharlan@deploy1002 kharlan and kharlan: Backport for [[gerrit:864912|Revert "resourceloader: Modern ES6 code should be forced to target mobile" (T323542)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet [12:54:27] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/thumbor: apply [12:54:48] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/thumbor: apply [12:55:12] 10ops-drmrs: cr2-drmrs:xe-0/1/1 stuck optic - https://phabricator.wikimedia.org/T324555 (10ayounsi) p:05Triage→03Low [12:55:42] syncing [12:59:44] 10SRE: vote.wikimedia.org's Special:Securepoll/list/1402 takes considerably longer in codfw than in eqiad, leading to timeouts - https://phabricator.wikimedia.org/T324556 (10Urbanecm) [13:00:24] 10SRE, 10MediaWiki-extensions-SecurePoll, 10Performance Issue: vote.wikimedia.org's Special:Securepoll/list/1402 takes considerably longer in codfw than in eqiad, leading to timeouts - https://phabricator.wikimedia.org/T324556 (10Reedy) [13:00:29] !log kharlan@deploy1002 Finished scap: Backport for [[gerrit:864912|Revert "resourceloader: Modern ES6 code should be forced to target mobile" (T323542)]] (duration: 07m 57s) [13:00:33] T323542: Forbid new modern code from not targeting mobile - https://phabricator.wikimedia.org/T323542 [13:00:42] done [13:02:06] !log jmm@cumin2002 END (PASS) - Cookbook sre.wdqs.restart-nginx (exit_code=0) rolling restart_daemons on A:wdqs-all [13:03:56] !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/thumbor: apply [13:04:18] !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/thumbor: apply [13:05:39] (03PS5) 10Raymond Ndibe: webservice cli: allow for deployment of custom harbor images [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/857588 (https://phabricator.wikimedia.org/T293645) [13:05:57] 
10SRE, 10DBA, 10MediaWiki-extensions-SecurePoll, 10Performance Issue: vote.wikimedia.org's Special:Securepoll/list/1402 takes considerably longer in codfw than in eqiad, leading to timeouts - https://phabricator.wikimedia.org/T324556 (10Reedy) [13:09:41] Thanks volans [13:09:58] 10SRE, 10DBA, 10MediaWiki-extensions-SecurePoll, 10Performance Issue: vote.wikimedia.org's Special:Securepoll/list/1402 takes considerably longer in codfw than in eqiad, leading to timeouts - https://phabricator.wikimedia.org/T324556 (10Sotiale) I'll do scrutineering with VPN if needed (I've been advised t... [13:10:03] (03PS1) 10Muehlenhoff: Enable profile::auto_restarts::service for Apache/FPM/Envoy on mwmaint/noc [puppet] - 10https://gerrit.wikimedia.org/r/865066 (https://phabricator.wikimedia.org/T135991) [13:10:04] np, it happens steve_munene [13:14:14] (03CR) 10Slyngshede: [C: 03+2] Configuration: Add support for setting connection timeout. [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/859453 (owner: 10Slyngshede) [13:14:44] (03CR) 10Slyngshede: [V: 03+2 C: 03+2] Configuration: Add support for setting connection timeout. 
[software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/859453 (owner: 10Slyngshede) [13:16:07] !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/thumbor: apply [13:17:08] !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/thumbor: apply [13:17:28] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/thumbor: apply [13:17:55] (03CR) 10Raymond Ndibe: webservice cli: allow for deployment of custom harbor images (034 comments) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/857588 (https://phabricator.wikimedia.org/T293645) (owner: 10Raymond Ndibe) [13:18:50] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/thumbor: apply [13:25:32] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/860568 (owner: 10Slyngshede) [13:26:23] (03Abandoned) 10Muehlenhoff: Tools: Use LDAP for mail queries [puppet] - 10https://gerrit.wikimedia.org/r/237871 (owner: 10Tim Landscheidt) [13:36:57] (03PS1) 10Slyngshede: Version bump. Go to version 0.0.2. [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/865069 [13:39:14] (03PS1) 10Daniel Kinzler: hewiki: enable parser cache writes for parsoid's page/html endpoint. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/865070 (https://phabricator.wikimedia.org/T322672) [13:39:34] (03CR) 10CI reject: [V: 04-1] hewiki: enable parser cache writes for parsoid's page/html endpoint. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/865070 (https://phabricator.wikimedia.org/T322672) (owner: 10Daniel Kinzler) [13:40:28] 10ops-eqiad, 10DC-Ops, 10Data-Engineering-Planning: hw troubleshooting: power supply for an-worker1184.eqiad.wmnet - https://phabricator.wikimedia.org/T324559 (10BTullis) [13:40:41] (03PS2) 10Clément Goubert: P:mediawiki::deployment::server: set helm env [puppet] - 10https://gerrit.wikimedia.org/r/865068 (https://phabricator.wikimedia.org/T324553) [13:42:17] (03CR) 10Clément Goubert: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38598/console" [puppet] - 10https://gerrit.wikimedia.org/r/865068 (https://phabricator.wikimedia.org/T324553) (owner: 10Clément Goubert) [13:43:06] (03PS1) 10Daniel Kinzler: Page 5% of calls to parsoid's page/html endpoint write to PC [mediawiki-config] - 10https://gerrit.wikimedia.org/r/865071 (https://phabricator.wikimedia.org/T322672) [13:43:19] (03CR) 10CI reject: [V: 04-1] Page 5% of calls to parsoid's page/html endpoint write to PC [mediawiki-config] - 10https://gerrit.wikimedia.org/r/865071 (https://phabricator.wikimedia.org/T322672) (owner: 10Daniel Kinzler) [13:43:47] (03PS1) 10DCausse: search: drop search-drop-query-clicks systemd timer (1/2) [puppet] - 10https://gerrit.wikimedia.org/r/865072 [13:43:49] (03PS1) 10DCausse: search: drop search-drop-query-clicks systemd timer (2/2) [puppet] - 10https://gerrit.wikimedia.org/r/865073 [13:43:59] (03PS2) 10Daniel Kinzler: hewiki: enable parser cache writes for parsoid's page/html endpoint. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/865070 (https://phabricator.wikimedia.org/T322672) [13:44:17] (03PS2) 10Daniel Kinzler: Page 5% of calls to parsoid's page/html endpoint write to PC [mediawiki-config] - 10https://gerrit.wikimedia.org/r/865071 (https://phabricator.wikimedia.org/T322672) [13:44:33] (03CR) 10CI reject: [V: 04-1] search: drop search-drop-query-clicks systemd timer (1/2) [puppet] - 10https://gerrit.wikimedia.org/r/865072 (owner: 10DCausse) [13:45:02] (03PS1) 10JMeybohm: pki: Allow to override the default expiry per intermediate [puppet] - 10https://gerrit.wikimedia.org/r/865075 [13:47:58] (03PS1) 10Kosta Harlan: NewImpact: Adjust hasMainspaceEditsCache check [extensions/GrowthExperiments] (wmf/1.40.0-wmf.12) - 10https://gerrit.wikimedia.org/r/864919 (https://phabricator.wikimedia.org/T324285) [13:50:08] (03PS2) 10JMeybohm: pki: Allow to override the default expiry per intermediate [puppet] - 10https://gerrit.wikimedia.org/r/865075 [13:51:22] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38599/console" [puppet] - 10https://gerrit.wikimedia.org/r/865075 (owner: 10JMeybohm) [13:52:34] jouncebot: nowandnext [13:52:34] No deployments scheduled for the next 0 hour(s) and 7 minute(s) [13:52:34] In 0 hour(s) and 7 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221206T1400) [13:52:34] In 0 hour(s) and 7 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221206T1400) [13:53:02] getting started on backports [13:53:44] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by kharlan@deploy1002 using scap backport" [extensions/GrowthExperiments] (wmf/1.40.0-wmf.12) - 10https://gerrit.wikimedia.org/r/864919 (https://phabricator.wikimedia.org/T324285) (owner: 10Kosta Harlan) [13:55:44] (03PS2) 10DCausse: search: drop search-drop-query-clicks systemd timer (1/2) 
[puppet] - 10https://gerrit.wikimedia.org/r/865072 [13:55:46] (03PS2) 10DCausse: search: drop search-drop-query-clicks systemd timer (2/2) [puppet] - 10https://gerrit.wikimedia.org/r/865073 [13:56:23] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering-Planning: hw troubleshooting: power supply for an-worker1184.eqiad.wmnet - https://phabricator.wikimedia.org/T324559 (10BTullis) @Cmjohnson - I can shut down the machine at any time - or you can do it if it helps too. There's no depooling necessary, just dow... [13:58:48] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering-Planning: hw troubleshooting: power supply for an-worker1184.eqiad.wmnet - https://phabricator.wikimedia.org/T324559 (10BTullis) Set to failed in netbox. https://netbox.wikimedia.org/dcim/devices/3661/ {F35841349} [13:59:33] (03PS1) 10Klausman: APIGW/Liftwing: fix stray quote on outlink endpoint rewrite [deployment-charts] - 10https://gerrit.wikimedia.org/r/865076 (https://phabricator.wikimedia.org/T323916) [14:00:04] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering-Planning: hw troubleshooting: power supply for an-worker1184.eqiad.wmnet - https://phabricator.wikimedia.org/T324559 (10BTullis) [14:00:04] RoanKattouw, Lucas_WMDE, Urbanecm, and awight: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC afternoon backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221206T1400). [14:00:04] kostajh: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[14:00:04] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221206T1400) [14:00:10] o/ [14:00:13] o/ [14:00:31] kostajh: i take it that you're self-deploying :) [14:00:39] yep [14:01:06] (03CR) 10Hnowlan: [C: 03+1] "lgtm" [deployment-charts] - 10https://gerrit.wikimedia.org/r/865076 (https://phabricator.wikimedia.org/T323916) (owner: 10Klausman) [14:02:33] okay. I'm around if needed :) [14:02:48] (03CR) 10Klausman: [C: 03+2] APIGW/Liftwing: fix stray quote on outlink endpoint rewrite [deployment-charts] - 10https://gerrit.wikimedia.org/r/865076 (https://phabricator.wikimedia.org/T323916) (owner: 10Klausman) [14:04:18] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host stat1004.eqiad.wmnet [14:04:28] (03PS1) 10Urbanecm: Localisation updates from https://translatewiki.net. [extensions/GrowthExperiments] (wmf/1.40.0-wmf.12) - 10https://gerrit.wikimedia.org/r/865077 [14:04:35] kostajh: backported the i18n patch for you ^^ [14:04:53] urbanecm: thank you! 
[14:05:26] (03CR) 10CI reject: [V: 04-1] NewImpact: Adjust hasMainspaceEditsCache check [extensions/GrowthExperiments] (wmf/1.40.0-wmf.12) - 10https://gerrit.wikimedia.org/r/864919 (https://phabricator.wikimedia.org/T324285) (owner: 10Kosta Harlan) [14:06:07] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by kharlan@deploy1002 using scap backport" [extensions/GrowthExperiments] (wmf/1.40.0-wmf.12) - 10https://gerrit.wikimedia.org/r/864919 (https://phabricator.wikimedia.org/T324285) (owner: 10Kosta Harlan) [14:07:43] (03CR) 10David Caro: [C: 03+2] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/861924 (https://phabricator.wikimedia.org/T324051) (owner: 10Majavah) [14:07:51] (03CR) 10Klausman: [V: 03+2 C: 03+2] APIGW/Liftwing: fix stray quote on outlink endpoint rewrite [deployment-charts] - 10https://gerrit.wikimedia.org/r/865076 (https://phabricator.wikimedia.org/T323916) (owner: 10Klausman) [14:07:53] (03PS2) 10David Caro: exim: Disable IPv6 on mail hosts on cloud vms [puppet] - 10https://gerrit.wikimedia.org/r/861924 (https://phabricator.wikimedia.org/T324051) (owner: 10Majavah) [14:08:20] (03Merged) 10jenkins-bot: APIGW/Liftwing: fix stray quote on outlink endpoint rewrite [deployment-charts] - 10https://gerrit.wikimedia.org/r/865076 (https://phabricator.wikimedia.org/T323916) (owner: 10Klausman) [14:09:27] (03PS1) 10Elukey: admin_ng: allow more hosts in the Istio's VirtualServer configs [deployment-charts] - 10https://gerrit.wikimedia.org/r/865078 [14:09:29] (03PS1) 10Jgiannelos: mobileapps: Bump to latest image [deployment-charts] - 10https://gerrit.wikimedia.org/r/865079 [14:09:32] 10SRE, 10DBA, 10MediaWiki-extensions-SecurePoll, 10Performance Issue: vote.wikimedia.org's Special:Securepoll/list/1402 takes considerably longer in codfw than in eqiad, leading to timeouts - https://phabricator.wikimedia.org/T324556 (10Ladsgroup) It can be either of two things: - schema drift in codfw li... 
[14:10:37] (03CR) 10Klausman: [C: 03+1] admin_ng: allow more hosts in the Istio's VirtualServer configs [deployment-charts] - 10https://gerrit.wikimedia.org/r/865078 (owner: 10Elukey) [14:11:18] (03CR) 10MSantos: [C: 03+1] mobileapps: Bump to latest image [deployment-charts] - 10https://gerrit.wikimedia.org/r/865079 (owner: 10Jgiannelos) [14:11:26] (03Merged) 10jenkins-bot: NewImpact: Adjust hasMainspaceEditsCache check [extensions/GrowthExperiments] (wmf/1.40.0-wmf.12) - 10https://gerrit.wikimedia.org/r/864919 (https://phabricator.wikimedia.org/T324285) (owner: 10Kosta Harlan) [14:11:54] !log kharlan@deploy1002 Started scap: Backport for [[gerrit:864919|NewImpact: Adjust hasMainspaceEditsCache check (T324285)]] [14:12:00] (JobUnavailable) firing: Reduced availability for job trafficserver-text in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:12:02] T324285: NewImpact: Null state for "Last edited" - https://phabricator.wikimedia.org/T324285 [14:12:39] (03PS2) 10Elukey: admin_ng: allow more hosts in the Istio's VirtualServer configs [deployment-charts] - 10https://gerrit.wikimedia.org/r/865078 [14:13:49] !log kharlan@deploy1002 kharlan and kharlan: Backport for [[gerrit:864919|NewImpact: Adjust hasMainspaceEditsCache check (T324285)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet [14:14:09] (03PS8) 10Awight: kartotherian: add kartotherian chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/531699 (https://phabricator.wikimedia.org/T231006) (owner: 10Mathew.onipe) [14:14:40] RECOVERY - Disk space on stat1004 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=stat1004&var-datasource=eqiad+prometheus/ops [14:15:00] !log 
btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host stat1004.eqiad.wmnet [14:15:12] syncing [14:17:06] (03CR) 10Jgiannelos: [C: 03+2] mobileapps: Bump to latest image [deployment-charts] - 10https://gerrit.wikimedia.org/r/865079 (owner: 10Jgiannelos) [14:20:46] (03PS3) 10JMeybohm: pki: Allow to override the default expiry per intermediate [puppet] - 10https://gerrit.wikimedia.org/r/865075 [14:20:59] !log kharlan@deploy1002 Finished scap: Backport for [[gerrit:864919|NewImpact: Adjust hasMainspaceEditsCache check (T324285)]] (duration: 09m 04s) [14:21:02] T324285: NewImpact: Null state for "Last edited" - https://phabricator.wikimedia.org/T324285 [14:21:21] on to the next one [14:21:41] (03CR) 10Jgiannelos: [V: 03+2 C: 03+2] mobileapps: Bump to latest image [deployment-charts] - 10https://gerrit.wikimedia.org/r/865079 (owner: 10Jgiannelos) [14:21:55] (03Merged) 10jenkins-bot: mobileapps: Bump to latest image [deployment-charts] - 10https://gerrit.wikimedia.org/r/865079 (owner: 10Jgiannelos) [14:21:57] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38600/console" [puppet] - 10https://gerrit.wikimedia.org/r/865075 (owner: 10JMeybohm) [14:22:54] (03PS2) 10Kosta Harlan: Instrumentation: Monitor navigation duration, transferSize, first paint [extensions/GrowthExperiments] (wmf/1.40.0-wmf.13) - 10https://gerrit.wikimedia.org/r/864915 (https://phabricator.wikimedia.org/T324198) [14:23:04] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by kharlan@deploy1002 using scap backport" [extensions/GrowthExperiments] (wmf/1.40.0-wmf.13) - 10https://gerrit.wikimedia.org/r/864915 (https://phabricator.wikimedia.org/T324198) (owner: 10Kosta Harlan) [14:23:11] !log jgiannelos@deploy1002 helmfile [staging] START helmfile.d/services/mobileapps: apply [14:23:14] 10SRE, 10DBA, 10MediaWiki-extensions-SecurePoll, 10Performance Issue: vote.wikimedia.org's 
Special:Securepoll/list/1402 takes considerably longer in codfw than in eqiad, leading to timeouts - https://phabricator.wikimedia.org/T324556 (10Ladsgroup) Slightly unrelated but still. The special page makes 800 qu... [14:23:44] !log jgiannelos@deploy1002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [14:24:13] (03PS2) 10Urbanecm: Localisation updates from https://translatewiki.net. [extensions/GrowthExperiments] (wmf/1.40.0-wmf.12) - 10https://gerrit.wikimedia.org/r/865077 [14:24:49] !log jgiannelos@deploy1002 helmfile [codfw] START helmfile.d/services/mobileapps: apply [14:25:49] !log jgiannelos@deploy1002 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply [14:26:20] (03PS1) 10Urbanecm: Localisation updates from https://translatewiki.net. [extensions/GrowthExperiments] (wmf/1.40.0-wmf.13) - 10https://gerrit.wikimedia.org/r/865082 [14:27:11] (03Abandoned) 10Herron: vo-escalate: kill process if run time exceeds 10s [puppet] - 10https://gerrit.wikimedia.org/r/864776 (https://phabricator.wikimedia.org/T324466) (owner: 10Herron) [14:27:43] !log jgiannelos@deploy1002 helmfile [eqiad] START helmfile.d/services/mobileapps: apply [14:28:23] (03PS2) 10Urbanecm: Localisation updates from https://translatewiki.net. [extensions/GrowthExperiments] (wmf/1.40.0-wmf.13) - 10https://gerrit.wikimedia.org/r/865082 [14:28:27] !log jgiannelos@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply [14:29:08] (03CR) 10Elukey: [C: 03+2] admin_ng: allow more hosts in the Istio's VirtualServer configs [deployment-charts] - 10https://gerrit.wikimedia.org/r/865078 (owner: 10Elukey) [14:29:21] 10SRE, 10DBA, 10MediaWiki-extensions-SecurePoll, 10Performance Issue: vote.wikimedia.org's Special:Securepoll/list/1402 takes considerably longer in codfw than in eqiad, leading to timeouts - https://phabricator.wikimedia.org/T324556 (10Ladsgroup) The second one is likely the cause with combination of the... 
[14:29:26] (03PS3) 10Urbanecm: Localisation updates from https://translatewiki.net. [extensions/GrowthExperiments] (wmf/1.40.0-wmf.13) - 10https://gerrit.wikimedia.org/r/865082 [14:29:55] kostajh: both i18n patches are ready now [14:30:09] urbanecm: thank you! [14:30:11] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/api-gateway: apply [14:30:30] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/api-gateway: apply [14:30:32] np [14:31:18] !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/api-gateway: apply [14:31:38] !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [14:31:52] !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/api-gateway: apply [14:32:19] !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [14:32:33] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [14:32:36] (03CR) 10Stevemunene: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38601/console" [puppet] - 10https://gerrit.wikimedia.org/r/865043 (https://phabricator.wikimedia.org/T323783) (owner: 10Stevemunene) [14:32:51] !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [14:33:36] !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. [14:34:09] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/api-gateway: apply [14:34:13] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [14:34:36] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/api-gateway: apply [14:34:53] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. 
[14:38:56] (03PS4) 10Kosta Harlan: User impact: Do not show impact module if user has no mainspace edits [extensions/GrowthExperiments] (wmf/1.40.0-wmf.13) - 10https://gerrit.wikimedia.org/r/864909 (https://phabricator.wikimedia.org/T324285) [14:41:50] (03Merged) 10jenkins-bot: Instrumentation: Monitor navigation duration, transferSize, first paint [extensions/GrowthExperiments] (wmf/1.40.0-wmf.13) - 10https://gerrit.wikimedia.org/r/864915 (https://phabricator.wikimedia.org/T324198) (owner: 10Kosta Harlan) [14:42:18] !log kharlan@deploy1002 Started scap: Backport for [[gerrit:864915|Instrumentation: Monitor navigation duration, transferSize, first paint (T324198)]] [14:42:22] T324198: Special:Homepage: Add instrumentation for monitoring transfer size and firstPaint - https://phabricator.wikimedia.org/T324198 [14:44:12] !log kharlan@deploy1002 kharlan and kharlan: Backport for [[gerrit:864915|Instrumentation: Monitor navigation duration, transferSize, first paint (T324198)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet [14:44:58] (03CR) 10Krinkle: Boilerplate for QUnit testing (031 comment) [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/861486 (owner: 10Hashar) [14:47:19] (03PS3) 10Kosta Harlan: GrowthExperiments: Enable new impact module on pilot wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/862840 (https://phabricator.wikimedia.org/T323686) [14:47:24] (03PS8) 10Kosta Harlan: GrowthExperiments: Start oldimpact experiment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/860867 (https://phabricator.wikimedia.org/T323526) [14:52:26] !log kharlan@deploy1002 Finished scap: Backport for [[gerrit:864915|Instrumentation: Monitor navigation duration, transferSize, first paint (T324198)]] (duration: 10m 07s) [14:52:29] T324198: Special:Homepage: Add instrumentation for monitoring transfer size and firstPaint - 
https://phabricator.wikimedia.org/T324198 [14:52:47] on to the next one [14:53:22] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by kharlan@deploy1002 using scap backport" [extensions/GrowthExperiments] (wmf/1.40.0-wmf.12) - 10https://gerrit.wikimedia.org/r/865077 (owner: 10Urbanecm) [14:54:03] (03CR) 10Volans: [C: 03+2] "Perfect! Thanks a lot for the contribution." [software/cumin] - 10https://gerrit.wikimedia.org/r/863874 (owner: 10Majavah) [14:55:15] (03CR) 10JMeybohm: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/865068 (https://phabricator.wikimedia.org/T324553) (owner: 10Clément Goubert) [14:57:21] 10ops-eqiad, 10DC-Ops, 10serviceops: hw troubleshooting: PSU failure for restbase1018.eqiad.wmnet - https://phabricator.wikimedia.org/T324572 (10Clement_Goubert) p:05Triage→03Low [14:57:44] 10ops-eqiad, 10DC-Ops, 10serviceops: hw troubleshooting: PSU failure for restbase1018.eqiad.wmnet - https://phabricator.wikimedia.org/T324572 (10Clement_Goubert) ipmi-sel log: ` cgoubert@restbase1018:~$ sudo ipmi-sel ID | Date | Time | Name | Type | Event 1 | No... [14:58:03] 10ops-eqiad, 10DC-Ops, 10serviceops: hw troubleshooting: PSU failure for restbase1018.eqiad.wmnet - https://phabricator.wikimedia.org/T324572 (10Clement_Goubert) racadm getsel log: ` ------------------------------------------------------------------------------- Record: 16 Date/Time: 10/18/2022 14:55:... 
[15:01:35] (03Merged) 10jenkins-bot: puppetdb: support using client certificates [software/cumin] - 10https://gerrit.wikimedia.org/r/863874 (owner: 10Majavah) [15:01:44] (03CR) 10Clément Goubert: [V: 03+1 C: 03+2] P:mediawiki::deployment::server: set helm env [puppet] - 10https://gerrit.wikimedia.org/r/865068 (https://phabricator.wikimedia.org/T324553) (owner: 10Clément Goubert) [15:05:52] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:09:37] (03Merged) 10jenkins-bot: Localisation updates from https://translatewiki.net. [extensions/GrowthExperiments] (wmf/1.40.0-wmf.12) - 10https://gerrit.wikimedia.org/r/865077 (owner: 10Urbanecm) [15:10:05] !log kharlan@deploy1002 Started scap: Backport for [[gerrit:865077|Localisation updates from https://translatewiki.net.]] [15:13:37] !log kharlan@deploy1002 kharlan and urbanecm: Backport for [[gerrit:865077|Localisation updates from https://translatewiki.net.]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet [15:14:21] syncing [15:16:08] 10ops-eqsin: ManagementSSHDown - https://phabricator.wikimedia.org/T323970 (10phaultfinder) [15:16:50] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:18:52] (03CR) 10Hashar: Boilerplate for QUnit testing (031 comment) [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/861486 (owner: 10Hashar) [15:19:24] topranks: I see you'd ack'd the BFD status alarms until 12-01, it's expired, since the phab task isn't resolved I assume I can re-ack them? [15:20:06] claime: good spot, yeah I assume those are the ones relating to the GTT services to drmrs? 
[15:20:18] topranks: yeah [15:20:32] unfortunately it's dragging on, so yes best to ack them again, probably till next week the way it's going [15:20:49] no problem [15:20:50] (it's gone to the datacenter to check cables, lots of tedious back-and-forth) [15:20:53] !log kharlan@deploy1002 Finished scap: Backport for [[gerrit:865077|Localisation updates from https://translatewiki.net.]] (duration: 10m 48s) [15:20:59] Yeah I saw the emails [15:21:21] yeah not so much fun :( [15:21:30] thanks for ack'ing them :) [15:21:38] np ;) [15:22:12] (03PS1) 10Muehlenhoff: buster updates [puppet] - 10https://gerrit.wikimedia.org/r/865093 [15:22:16] (03PS1) 10Reedy: STVTallierTest: Skip testFinishTally on PHP >= 8.0 [extensions/SecurePoll] (wmf/1.40.0-wmf.12) - 10https://gerrit.wikimedia.org/r/864920 (https://phabricator.wikimedia.org/T323056) [15:22:30] topranks: Re-ack'd for a week [15:22:47] (03PS1) 10Reedy: ListPager: Only call Voter::newFromId() if return value is needed [extensions/SecurePoll] (wmf/1.40.0-wmf.12) - 10https://gerrit.wikimedia.org/r/864921 (https://phabricator.wikimedia.org/T324556) [15:22:56] (03CR) 10Reedy: [C: 03+2] ListPager: Only call Voter::newFromId() if return value is needed [extensions/SecurePoll] (wmf/1.40.0-wmf.12) - 10https://gerrit.wikimedia.org/r/864921 (https://phabricator.wikimedia.org/T324556) (owner: 10Reedy) [15:23:06] (03PS1) 10Reedy: ListPager: Only call Voter::newFromId() if return value is needed [extensions/SecurePoll] (wmf/1.40.0-wmf.13) - 10https://gerrit.wikimedia.org/r/864922 (https://phabricator.wikimedia.org/T324556) [15:23:20] (03CR) 10Reedy: [C: 03+2] ListPager: Only call Voter::newFromId() if return value is needed [extensions/SecurePoll] (wmf/1.40.0-wmf.13) - 10https://gerrit.wikimedia.org/r/864922 (https://phabricator.wikimedia.org/T324556) (owner: 10Reedy) [15:23:22] (03PS3) 10Kosta Harlan: NewImpact: Show "999+" when we could not count edits/thanks [extensions/GrowthExperiments] (wmf/1.40.0-wmf.13) - 
10https://gerrit.wikimedia.org/r/864911 (https://phabricator.wikimedia.org/T324286) [15:23:27] topranks: I only ack'd the BFD alarms, the OSPF don't seem to make their way to a.w.o [15:23:33] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by kharlan@deploy1002 using scap backport" [extensions/GrowthExperiments] (wmf/1.40.0-wmf.13) - 10https://gerrit.wikimedia.org/r/864909 (https://phabricator.wikimedia.org/T324285) (owner: 10Kosta Harlan) [15:23:35] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by kharlan@deploy1002 using scap backport" [extensions/GrowthExperiments] (wmf/1.40.0-wmf.13) - 10https://gerrit.wikimedia.org/r/865082 (owner: 10Urbanecm) [15:23:37] Ah, yeah, they do, my bad [15:23:38] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by kharlan@deploy1002 using scap backport" [extensions/GrowthExperiments] (wmf/1.40.0-wmf.13) - 10https://gerrit.wikimedia.org/r/864911 (https://phabricator.wikimedia.org/T324286) (owner: 10Kosta Harlan) [15:23:56] topranks: should I ack the OSPF alarms too? [15:24:02] (03CR) 10Reedy: "We don't run PHP 8 tests on wmf gate, so this shouldn't be needed.. I think?" 
[extensions/SecurePoll] (wmf/1.40.0-wmf.12) - 10https://gerrit.wikimedia.org/r/864920 (https://phabricator.wikimedia.org/T323056) (owner: 10Reedy) [15:24:27] claime: if you can yeah please do [15:24:52] (03Merged) 10jenkins-bot: ListPager: Only call Voter::newFromId() if return value is needed [extensions/SecurePoll] (wmf/1.40.0-wmf.12) - 10https://gerrit.wikimedia.org/r/864921 (https://phabricator.wikimedia.org/T324556) (owner: 10Reedy) [15:25:44] (03Merged) 10jenkins-bot: ListPager: Only call Voter::newFromId() if return value is needed [extensions/SecurePoll] (wmf/1.40.0-wmf.13) - 10https://gerrit.wikimedia.org/r/864922 (https://phabricator.wikimedia.org/T324556) (owner: 10Reedy) [15:27:00] (03Abandoned) 10Reedy: STVTallierTest: Skip testFinishTally on PHP >= 8.0 [extensions/SecurePoll] (wmf/1.40.0-wmf.12) - 10https://gerrit.wikimedia.org/r/864920 (https://phabricator.wikimedia.org/T323056) (owner: 10Reedy) [15:32:51] 10SRE, 10MediaWiki-Core-HTTP-Cache, 10MediaWiki-REST-API, 10RESTbase Sunsetting, and 3 others: Determine http cache control and active purging for REST endpoints serving parsoid output - https://phabricator.wikimedia.org/T308424 (10daniel) [15:33:57] !log reedy@deploy1002 Synchronized php-1.40.0-wmf.12/extensions/SecurePoll/includes/Pages/ListPager.php: T324556 (duration: 07m 13s) [15:34:01] T324556: vote.wikimedia.org's Special:Securepoll/list/1402 takes considerably longer in codfw than in eqiad, leading to timeouts - https://phabricator.wikimedia.org/T324556 [15:37:22] (03PS1) 10Krinkle: Turn off wgNavigationTimingOversampleFactor campaigns [mediawiki-config] - 10https://gerrit.wikimedia.org/r/865097 (https://phabricator.wikimedia.org/T286703) [15:39:17] (03PS2) 10Krinkle: Turn off wgNavigationTimingOversampleFactor campaigns [mediawiki-config] - 10https://gerrit.wikimedia.org/r/865097 (https://phabricator.wikimedia.org/T286703) [15:39:28] 10SRE, 10DBA, 10MediaWiki-extensions-SecurePoll, 10Patch-For-Review, 10Performance Issue: 
vote.wikimedia.org's Special:Securepoll/list/1402 takes considerably longer in codfw than in eqiad, leading to timeouts - https://phabricator.wikimedia.org/T324556 (10Reedy) ^ in theory, that should have a marked im... [15:39:42] (03PS3) 10Krinkle: Turn off wgNavigationTimingOversampleFactor campaigns [mediawiki-config] - 10https://gerrit.wikimedia.org/r/865097 (https://phabricator.wikimedia.org/T286703) [15:41:33] !log reedy@deploy1002 Synchronized php-1.40.0-wmf.13/extensions/SecurePoll/includes/Pages/ListPager.php: T324556 (duration: 07m 01s) [15:41:36] T324556: vote.wikimedia.org's Special:Securepoll/list/1402 takes considerably longer in codfw than in eqiad, leading to timeouts - https://phabricator.wikimedia.org/T324556 [15:42:15] (03CR) 10Muehlenhoff: [C: 03+2] buster updates [puppet] - 10https://gerrit.wikimedia.org/r/865093 (owner: 10Muehlenhoff) [15:42:57] (03Merged) 10jenkins-bot: User impact: Do not show impact module if user has no mainspace edits [extensions/GrowthExperiments] (wmf/1.40.0-wmf.13) - 10https://gerrit.wikimedia.org/r/864909 (https://phabricator.wikimedia.org/T324285) (owner: 10Kosta Harlan) [15:43:00] (03Merged) 10jenkins-bot: Localisation updates from https://translatewiki.net. 
[extensions/GrowthExperiments] (wmf/1.40.0-wmf.13) - 10https://gerrit.wikimedia.org/r/865082 (owner: 10Urbanecm) [15:43:06] (03Merged) 10jenkins-bot: NewImpact: Show "999+" when we could not count edits/thanks [extensions/GrowthExperiments] (wmf/1.40.0-wmf.13) - 10https://gerrit.wikimedia.org/r/864911 (https://phabricator.wikimedia.org/T324286) (owner: 10Kosta Harlan) [15:43:30] !log kharlan@deploy1002 Started scap: Backport for [[gerrit:864909|User impact: Do not show impact module if user has no mainspace edits (T324285)]], [[gerrit:865082|Localisation updates from https://translatewiki.net.]], [[gerrit:864911|NewImpact: Show "999+" when we could not count edits/thanks (T324286)]] [15:43:35] T324286: NewImpact: edits and thanks are capped at 1000 - https://phabricator.wikimedia.org/T324286 [15:43:36] T324285: NewImpact: Null state for "Last edited" - https://phabricator.wikimedia.org/T324285 [15:44:44] !log xcollazo@deploy1002 Started deploy [airflow-dags/platform_eng@6377d4c]: Deploying image_suggestions 0.5.0 on platform_eng Airflow instance [15:45:02] !log xcollazo@deploy1002 Finished deploy [airflow-dags/platform_eng@6377d4c]: Deploying image_suggestions 0.5.0 on platform_eng Airflow instance (duration: 00m 17s) [15:58:53] (03PS1) 10Ottomata: flink-kubernetes-operator - Initial commit of upstream helm chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/865100 (https://phabricator.wikimedia.org/T324576) [16:02:55] !log kharlan@deploy1002 kharlan and urbanecm and kharlan: Backport for [[gerrit:864909|User impact: Do not show impact module if user has no mainspace edits (T324285)]], [[gerrit:865082|Localisation updates from https://translatewiki.net.]], [[gerrit:864911|NewImpact: Show "999+" when we could not count edits/thanks (T324286)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad. 
[16:02:55] wmnet, mwdebug2001.codfw.wmnet [16:02:59] T324286: NewImpact: edits and thanks are capped at 1000 - https://phabricator.wikimedia.org/T324286 [16:02:59] T324285: NewImpact: Null state for "Last edited" - https://phabricator.wikimedia.org/T324285 [16:03:42] (03CR) 10Giuseppe Lavagetto: [C: 03+2] eventgate: convert to modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/860518 (owner: 10Giuseppe Lavagetto) [16:04:02] jouncebot: nowandnext [16:04:02] No deployments scheduled for the next 0 hour(s) and 55 minute(s) [16:04:02] In 0 hour(s) and 55 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221206T1700) [16:04:08] syncing [16:04:12] two config patches left to go after this [16:07:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:09:49] (03Merged) 10jenkins-bot: eventgate: convert to modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/860518 (owner: 10Giuseppe Lavagetto) [16:10:53] !log robh@cumin2002 START - Cookbook sre.dns.netbox [16:12:00] !log robh@cumin2002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [16:12:03] (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:13:14] !log kharlan@deploy1002 Finished scap: Backport for [[gerrit:864909|User impact: Do not show impact module if user has no mainspace edits (T324285)]], [[gerrit:865082|Localisation updates from https://translatewiki.net.]], [[gerrit:864911|NewImpact: Show 
"999+" when we could not count edits/thanks (T324286)]] (duration: 29m 43s) [16:13:18] T324286: NewImpact: edits and thanks are capped at 1000 - https://phabricator.wikimedia.org/T324286 [16:13:18] T324285: NewImpact: Null state for "Last edited" - https://phabricator.wikimedia.org/T324285 [16:15:04] !log robh@cumin2002 START - Cookbook sre.dns.netbox [16:15:42] 10ops-codfw: Access request for datacenter-ops group - https://phabricator.wikimedia.org/T324585 (10Jhancock.wm) [16:15:54] starting the config patches [16:15:57] (03CR) 10Btullis: flink-kubernetes-operator - Initial commit of upstream helm chart (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/865100 (https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata) [16:16:27] (03CR) 10Ottomata: flink-kubernetes-operator - Initial commit of upstream helm chart (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/865100 (https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata) [16:16:59] !log robh@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: eqsin new hosts - robh@cumin2002" [16:17:38] (03PS4) 10Kosta Harlan: GrowthExperiments: Enable new impact module on pilot wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/862840 (https://phabricator.wikimedia.org/T323686) [16:17:53] (03PS9) 10Kosta Harlan: GrowthExperiments: Start oldimpact experiment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/860867 (https://phabricator.wikimedia.org/T323526) [16:18:08] !log kharlan@deploy1002 backport aborted: (duration: 02m 53s) [16:18:14] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by kharlan@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/862840 (https://phabricator.wikimedia.org/T323686) (owner: 10Kosta Harlan) [16:19:29] (03Merged) 10jenkins-bot: GrowthExperiments: Enable new impact module on pilot wikis [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/862840 (https://phabricator.wikimedia.org/T323686) (owner: 10Kosta Harlan) [16:19:52] !log kharlan@deploy1002 Started scap: Backport for [[gerrit:862840|GrowthExperiments: Enable new impact module on pilot wikis (T323686)]] [16:19:56] T323686: End imagerecommendation experiment - https://phabricator.wikimedia.org/T323686 [16:21:00] !log robh@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: eqsin new hosts - robh@cumin2002" [16:21:00] !log robh@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:21:10] !log cwhite@cumin2002 START - Cookbook sre.hosts.reimage for host logstash1029.eqiad.wmnet with OS bullseye [16:21:21] !log robh@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host lvs5005 [16:21:45] !log robh@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host lvs5005 [16:21:50] !log kharlan@deploy1002 kharlan and kharlan: Backport for [[gerrit:862840|GrowthExperiments: Enable new impact module on pilot wikis (T323686)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet [16:22:19] (03CR) 10Giuseppe Lavagetto: [C: 03+2] calculator-service: convert to modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/862829 (owner: 10Giuseppe Lavagetto) [16:23:00] !log cwhite@cumin2002 START - Cookbook sre.hosts.reimage for host logstash1028.eqiad.wmnet with OS bullseye [16:26:04] PROBLEM - Check systemd state on mirror1001 is CRITICAL: CRITICAL - degraded: The following units failed: update-ubuntu-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:26:37] (03PS1) 10Ilias Sarantopoulos: ml-services: Increase allocated RAM in staging to see if it improves performance. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/865104 (https://phabricator.wikimedia.org/T323624) [16:27:20] syncing [16:27:32] (03Merged) 10jenkins-bot: calculator-service: convert to modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/862829 (owner: 10Giuseppe Lavagetto) [16:30:06] PROBLEM - Puppet CA expired certs on puppetmaster1001 is CRITICAL: CRITICAL: 1 puppet certs need to be renewed: https://wikitech.wikimedia.org/wiki/Puppet%23Renew_agent_certificate [16:30:07] !log kharlan@deploy1002 Finished scap: Backport for [[gerrit:862840|GrowthExperiments: Enable new impact module on pilot wikis (T323686)]] (duration: 10m 14s) [16:30:10] T323686: End imagerecommendation experiment - https://phabricator.wikimedia.org/T323686 [16:31:43] last patch [16:32:23] (03PS10) 10Kosta Harlan: GrowthExperiments: Start oldimpact experiment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/860867 (https://phabricator.wikimedia.org/T323526) [16:32:25] !log cwhite@cumin2002 START - Cookbook sre.hosts.reimage for host logstash1027.eqiad.wmnet with OS bullseye [16:32:25] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by kharlan@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/860867 (https://phabricator.wikimedia.org/T323526) (owner: 10Kosta Harlan) [16:33:24] (03Merged) 10jenkins-bot: GrowthExperiments: Start oldimpact experiment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/860867 (https://phabricator.wikimedia.org/T323526) (owner: 10Kosta Harlan) [16:33:47] !log kharlan@deploy1002 Started scap: Backport for [[gerrit:860867|GrowthExperiments: Start oldimpact experiment (T323526)]] [16:33:51] T323526: New Impact Module: Start experiment for the new Impact module on Growth Pilot wikis (ar, bn, cs, es) - https://phabricator.wikimedia.org/T323526 [16:34:27] (03PS1) 10Filippo Giunchedi: base: remove support for plaintext remote syslog [puppet] - 10https://gerrit.wikimedia.org/r/865106 
(https://phabricator.wikimedia.org/T301762) [16:35:31] (03CR) 10CI reject: [V: 04-1] base: remove support for plaintext remote syslog [puppet] - 10https://gerrit.wikimedia.org/r/865106 (https://phabricator.wikimedia.org/T301762) (owner: 10Filippo Giunchedi) [16:35:42] !log kharlan@deploy1002 kharlan and kharlan: Backport for [[gerrit:860867|GrowthExperiments: Start oldimpact experiment (T323526)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet [16:36:20] (03CR) 10Ebernhardson: [C: 03+1] "lgtm, this is handled by the airflow dag that does general cleanup of our data in hdfs." [puppet] - 10https://gerrit.wikimedia.org/r/865072 (owner: 10DCausse) [16:37:32] (03PS2) 10Filippo Giunchedi: base: remove support for plaintext remote syslog [puppet] - 10https://gerrit.wikimedia.org/r/865106 (https://phabricator.wikimedia.org/T301762) [16:39:34] syncing [16:39:44] (03PS1) 10Cathal Mooney: Default outbound DSCP marking possibility [puppet] - 10https://gerrit.wikimedia.org/r/865108 (https://phabricator.wikimedia.org/T316358) [16:40:09] 10SRE, 10SRE-Access-Requests, 10ops-codfw: Access request for datacenter-ops group - https://phabricator.wikimedia.org/T324585 (10Aklapper) For the records, the Phabricator account @Jhancock.wm is linked to a self-created SUL account and not to [a WMF ITS created account](https://meta.wikimedia.org/wiki/Spec... 
[16:40:38] (03CR) 10Hnowlan: [C: 03+2] thumbor: change exposed port to 8800 [deployment-charts] - 10https://gerrit.wikimedia.org/r/865054 (https://phabricator.wikimedia.org/T233196) (owner: 10Hnowlan) [16:40:58] (03CR) 10CI reject: [V: 04-1] Default outbound DSCP marking possibility [puppet] - 10https://gerrit.wikimedia.org/r/865108 (https://phabricator.wikimedia.org/T316358) (owner: 10Cathal Mooney) [16:41:19] (03PS2) 10Cathal Mooney: Default outbound DSCP marking possibility [puppet] - 10https://gerrit.wikimedia.org/r/865108 (https://phabricator.wikimedia.org/T316358) [16:41:41] (03CR) 10Hnowlan: [C: 03+1] Add ability to specify a DPI value for PDF [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/853402 (https://phabricator.wikimedia.org/T256959) (owner: 10Vlad.shapik) [16:42:31] (03CR) 10CI reject: [V: 04-1] Default outbound DSCP marking possibility [puppet] - 10https://gerrit.wikimedia.org/r/865108 (https://phabricator.wikimedia.org/T316358) (owner: 10Cathal Mooney) [16:44:26] (03PS3) 10Cathal Mooney: Default outbound DSCP marking possibility [puppet] - 10https://gerrit.wikimedia.org/r/865108 (https://phabricator.wikimedia.org/T316358) [16:44:42] !log kharlan@deploy1002 Finished scap: Backport for [[gerrit:860867|GrowthExperiments: Start oldimpact experiment (T323526)]] (duration: 10m 54s) [16:44:46] T323526: New Impact Module: Start experiment for the new Impact module on Growth Pilot wikis (ar, bn, cs, es) - https://phabricator.wikimedia.org/T323526 [16:45:31] (03Merged) 10jenkins-bot: thumbor: change exposed port to 8800 [deployment-charts] - 10https://gerrit.wikimedia.org/r/865054 (https://phabricator.wikimedia.org/T233196) (owner: 10Hnowlan) [16:45:35] (03CR) 10CI reject: [V: 04-1] Default outbound DSCP marking possibility [puppet] - 10https://gerrit.wikimedia.org/r/865108 (https://phabricator.wikimedia.org/T316358) (owner: 10Cathal Mooney) [16:46:00] !log UTC afternoon backports done [16:46:02] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:47:00] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/thumbor: apply [16:47:04] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/thumbor: apply [16:48:21] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/thumbor: apply [16:48:24] (03PS4) 10Cathal Mooney: Default outbound DSCP marking possibility [puppet] - 10https://gerrit.wikimedia.org/r/865108 (https://phabricator.wikimedia.org/T316358) [16:48:41] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/thumbor: apply [16:49:12] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38602/console" [puppet] - 10https://gerrit.wikimedia.org/r/865106 (https://phabricator.wikimedia.org/T301762) (owner: 10Filippo Giunchedi) [16:49:29] (03CR) 10CI reject: [V: 04-1] Default outbound DSCP marking possibility [puppet] - 10https://gerrit.wikimedia.org/r/865108 (https://phabricator.wikimedia.org/T316358) (owner: 10Cathal Mooney) [16:50:26] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/thumbor: apply [16:50:44] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/thumbor: apply [16:51:01] !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/thumbor: apply [16:51:36] (03CR) 10Elukey: [C: 03+2] ml-services: Increase allocated RAM in staging to see if it improves performance. [deployment-charts] - 10https://gerrit.wikimedia.org/r/865104 (https://phabricator.wikimedia.org/T323624) (owner: 10Ilias Sarantopoulos) [16:53:00] (03CR) 10Btullis: [C: 03+2] search: drop search-drop-query-clicks systemd timer (1/2) [puppet] - 10https://gerrit.wikimedia.org/r/865072 (owner: 10DCausse) [17:00:05] jbond and rzl: (Dis)respected human, time to deploy Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221206T1700). Please do the needful. 
[17:00:05] No Gerrit patches in the queue for this window AFAICS. [17:01:28] !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/thumbor: apply [17:02:25] (03CR) 10FNegri: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/865049 (https://phabricator.wikimedia.org/T319401) (owner: 10Volans) [17:03:00] !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/thumbor: apply [17:04:32] (03CR) 10Majavah: cluster::cloud_management: create new role (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/865049 (https://phabricator.wikimedia.org/T319401) (owner: 10Volans) [17:05:36] (03CR) 10Jberkel: Make "make" available in all images (031 comment) [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/864828 (https://phabricator.wikimedia.org/T320343) (owner: 10Jberkel) [17:08:28] (03PS1) 10Muehlenhoff: Enable profile::auto_restarts::service for Burrow [puppet] - 10https://gerrit.wikimedia.org/r/865114 (https://phabricator.wikimedia.org/T135991) [17:08:52] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/865114 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [17:11:49] (03CR) 10Papaul: [C: 03+2] Add new sretest codfw node to site.pp and netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/863009 (https://phabricator.wikimedia.org/T322578) (owner: 10Papaul) [17:11:57] (03PS2) 10Papaul: Add new sretest codfw node to site.pp and netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/863009 (https://phabricator.wikimedia.org/T322578) [17:12:18] !log cwhite@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host logstash1028.eqiad.wmnet with OS bullseye [17:13:24] !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/thumbor: apply [17:15:55] 10SRE, 10Data-Engineering, 10Shared-Data-Infrastructure: geoip_update_main failure on puppetmaster1001 - https://phabricator.wikimedia.org/T324548 (10BTullis) I checked with @odimitrijevic and she 
believes that it will take a few days to get the updated licence. She'd prefer that we do not disable the downlo... [17:16:16] (03PS3) 10Btullis: search: drop search-drop-query-clicks systemd timer (2/2) [puppet] - 10https://gerrit.wikimedia.org/r/865073 (owner: 10DCausse) [17:17:54] !log cwhite@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host logstash1029.eqiad.wmnet with OS bullseye [17:19:22] (03CR) 10Reedy: [C: 03+2] Cleanup [wikitech-static] - 10https://gerrit.wikimedia.org/r/865126 (https://phabricator.wikimedia.org/T324580) (owner: 10Reedy) [17:19:45] (03CR) 10Reedy: [V: 03+2 C: 03+2] Cleanup [wikitech-static] - 10https://gerrit.wikimedia.org/r/865126 (https://phabricator.wikimedia.org/T324580) (owner: 10Reedy) [17:20:00] (03CR) 10Reedy: [V: 03+2 C: 03+2] "Except I can't submit 😄" [wikitech-static] - 10https://gerrit.wikimedia.org/r/865126 (https://phabricator.wikimedia.org/T324580) (owner: 10Reedy) [17:21:03] (03PS3) 10Reedy: Update interwiki.php [wikitech-static] - 10https://gerrit.wikimedia.org/r/865127 [17:21:07] (03CR) 10Reedy: [V: 03+2 C: 03+2] Update interwiki.php [wikitech-static] - 10https://gerrit.wikimedia.org/r/865127 (owner: 10Reedy) [17:21:08] !log robh@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host lvs5006 [17:21:53] !log robh@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host lvs5006 [17:22:18] !log cwhite@cumin2002 START - Cookbook sre.hosts.reimage for host logstash1028.eqiad.wmnet with OS bullseye [17:22:24] (03CR) 10FNegri: [C: 03+1] cluster::cloud_management: create new role (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/865049 (https://phabricator.wikimedia.org/T319401) (owner: 10Volans) [17:23:34] !log robh@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ganeti5005 [17:24:00] !log robh@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ganeti5005 [17:24:04] 
!log robh@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ganeti5006 [17:24:30] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host sretest2002.codfw.wmnet with OS bullseye [17:24:30] (03CR) 10Volans: cluster::cloud_management: create new role (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/865049 (https://phabricator.wikimedia.org/T319401) (owner: 10Volans) [17:24:39] 10SRE, 10ops-codfw, 10Patch-For-Review: codfw:test new Supermicro server - https://phabricator.wikimedia.org/T322578 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host sretest2002.codfw.wmnet with OS bullseye [17:24:47] !log robh@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ganeti5006 [17:24:54] (03PS2) 10Volans: cluster::cloud_management: create new role [puppet] - 10https://gerrit.wikimedia.org/r/865049 (https://phabricator.wikimedia.org/T319401) [17:25:06] !log robh@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ganeti5007 [17:25:27] !log robh@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ganeti5007 [17:25:35] !log robh@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host dns5003 [17:25:56] !log robh@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host dns5003 [17:27:09] !log robh@cumin2002 START - Cookbook sre.hosts.provision for host lvs5005.mgmt.eqsin.wmnet with reboot policy FORCED [17:27:27] !log robh@cumin2002 START - Cookbook sre.hosts.provision for host lvs5006.mgmt.eqsin.wmnet with reboot policy FORCED [17:28:10] !log robh@cumin2002 START - Cookbook sre.hosts.provision for host dns5003.mgmt.eqsin.wmnet with reboot policy FORCED [17:29:07] !log cwhite@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host logstash1027.eqiad.wmnet with OS bullseye [17:29:15] !log robh@cumin2002 END 
(FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host lvs5005.mgmt.eqsin.wmnet with reboot policy FORCED [17:29:28] !log robh@cumin2002 START - Cookbook sre.hosts.provision for host ganeti5005.mgmt.eqsin.wmnet with reboot policy FORCED [17:31:43] (03CR) 10David Caro: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/865049 (https://phabricator.wikimedia.org/T319401) (owner: 10Volans) [17:33:08] (03PS1) 10Volans: cloudcumin: setup the 2 new VMs [puppet] - 10https://gerrit.wikimedia.org/r/865116 (https://phabricator.wikimedia.org/T319401) [17:34:10] (03CR) 10Volans: "The MAC address will be updated with the real one before merging, once the createvm cookbook has been run." [puppet] - 10https://gerrit.wikimedia.org/r/865116 (https://phabricator.wikimedia.org/T319401) (owner: 10Volans) [17:34:31] (03CR) 10CI reject: [V: 04-1] cloudcumin: setup the 2 new VMs [puppet] - 10https://gerrit.wikimedia.org/r/865116 (https://phabricator.wikimedia.org/T319401) (owner: 10Volans) [17:34:49] (03CR) 10Majavah: cluster::cloud_management: create new role (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/865049 (https://phabricator.wikimedia.org/T319401) (owner: 10Volans) [17:37:27] (03CR) 10Volans: "reply inline" [puppet] - 10https://gerrit.wikimedia.org/r/865049 (https://phabricator.wikimedia.org/T319401) (owner: 10Volans) [17:39:01] (03PS1) 10Jdlrobson: Avoid syntax error on hover in grade C browsers [extensions/Popups] (wmf/1.40.0-wmf.13) - 10https://gerrit.wikimedia.org/r/865128 (https://phabricator.wikimedia.org/T324514) [17:39:07] (03PS2) 10Giuseppe Lavagetto: Remove common_templates [deployment-charts] - 10https://gerrit.wikimedia.org/r/862842 [17:40:56] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] "I need to bypass CI here as we've removed a directory and added one compared to master and that makes CI fail." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/862842 (owner: 10Giuseppe Lavagetto) [17:47:02] !log robh@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti5005.mgmt.eqsin.wmnet with reboot policy FORCED [17:47:05] !log robh@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host lvs5006.mgmt.eqsin.wmnet with reboot policy FORCED [17:47:07] !log robh@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host dns5003.mgmt.eqsin.wmnet with reboot policy FORCED [17:47:45] !log robh@cumin2002 START - Cookbook sre.hosts.provision for host ganeti5006.mgmt.eqsin.wmnet with reboot policy FORCED [17:47:47] !log robh@cumin2002 START - Cookbook sre.hosts.provision for host ganeti5007.mgmt.eqsin.wmnet with reboot policy FORCED [17:48:50] (03CR) 10Muehlenhoff: cluster::cloud_management: create new role (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/865049 (https://phabricator.wikimedia.org/T319401) (owner: 10Volans) [17:49:02] (03PS1) 10Effie Mouzeli: ProductionServices: Replace use redis_misc servers for LockManager (1/6) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/865117 (https://phabricator.wikimedia.org/T267581) [17:49:08] (03CR) 10Muehlenhoff: "And please add a Cumin alias alongside :-)" [puppet] - 10https://gerrit.wikimedia.org/r/865049 (https://phabricator.wikimedia.org/T319401) (owner: 10Volans) [17:50:35] (03CR) 10CI reject: [V: 04-1] ProductionServices: Replace use redis_misc servers for LockManager (1/6) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/865117 (https://phabricator.wikimedia.org/T267581) (owner: 10Effie Mouzeli) [17:50:38] (03PS1) 10Effie Mouzeli: ProductionServices: Replace use redis_misc servers for LockManager (2/6) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/865118 (https://phabricator.wikimedia.org/T267581) [17:51:44] (03CR) 10Volans: "Ack, will do" [puppet] - 10https://gerrit.wikimedia.org/r/865049 
(https://phabricator.wikimedia.org/T319401) (owner: 10Volans) [17:51:52] (03CR) 10CI reject: [V: 04-1] ProductionServices: Replace use redis_misc servers for LockManager (2/6) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/865118 (https://phabricator.wikimedia.org/T267581) (owner: 10Effie Mouzeli) [17:52:01] (03PS2) 10Effie Mouzeli: ProductionServices: Replace use redis_misc servers for LockManager (1/6) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/865117 (https://phabricator.wikimedia.org/T267581) [17:52:25] (03PS2) 10Effie Mouzeli: ProductionServices: Replace use redis_misc servers for LockManager (2/6) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/865118 (https://phabricator.wikimedia.org/T267581) [17:55:52] (03CR) 10FNegri: "Looks good (once the MACs are added)" [puppet] - 10https://gerrit.wikimedia.org/r/865116 (https://phabricator.wikimedia.org/T319401) (owner: 10Volans) [17:56:18] !log cwhite@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['logstash1027'] [17:56:28] !log cwhite@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['logstash1027'] [17:58:10] Reedy: do you want to apply that wikitech-static change or shall I? [17:58:32] andrewbogott: Part of it is applied, but I can't merge the patches [17:58:45] And there was some interesting divergence (array() vs []) on disk [17:58:45] was there a second patch? I merged one of them [17:58:48] (03PS1) 10Effie Mouzeli: ProductionServices: Replace use redis_misc servers for LockManager (3/6) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/865119 (https://phabricator.wikimedia.org/T267581) [17:58:59] I did another ontop to update the interwiki list [17:59:29] Reedy: you should have submit rights for that repo.. 
let me see [17:59:36] (03CR) 10CI reject: [V: 04-1] ProductionServices: Replace use redis_misc servers for LockManager (3/6) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/865119 (https://phabricator.wikimedia.org/T267581) (owner: 10Effie Mouzeli) [17:59:43] Reedy: link? [17:59:53] https://gerrit.wikimedia.org/r/865127 [18:00:17] (03PS2) 10Effie Mouzeli: ProductionServices: Replace use redis_misc servers for LockManager (3/6) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/865119 (https://phabricator.wikimedia.org/T267581) [18:00:19] (03PS1) 10Majavah: Review access change [wikitech-static] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/865129 [18:01:14] (03CR) 10Andrew Bogott: [V: 03+2 C: 03+2] Review access change [wikitech-static] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/865129 (owner: 10Majavah) [18:01:29] Reedy: can you submit now? [18:02:14] !log cwhite@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['logstash1028'] [18:02:24] !log cwhite@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['logstash1028'] [18:02:30] !log robh@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti5006.mgmt.eqsin.wmnet with reboot policy FORCED [18:02:32] !log robh@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti5007.mgmt.eqsin.wmnet with reboot policy FORCED [18:03:39] yeah it works now :) [18:04:20] great! [18:04:34] And the login link is gone :) [18:05:20] security through obscurity [18:05:58] !log xcollazo@deploy1002 Started deploy [airflow-dags/platform_eng@4925134]: Revert Deploying image_suggestions 0.5.0 on platform_eng Airflow instance [18:06:03] !log robh@cumin2002 START - Cookbook sre.dns.netbox [18:06:07] (03CR) 10Dzahn: [C: 03+1] "I think it's fine. We have used those on other hosts without problems. 
But ultimately this should get a +1 from servicesops-core at least " [puppet] - 10https://gerrit.wikimedia.org/r/865066 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [18:06:07] !log xcollazo@deploy1002 Finished deploy [airflow-dags/platform_eng@4925134]: Revert Deploying image_suggestions 0.5.0 on platform_eng Airflow instance (duration: 00m 09s) [18:06:24] 10SRE, 10Infrastructure-Foundations: Broadcom BCM57412 10G NIC and Bullseye installer - https://phabricator.wikimedia.org/T286722 (10colewhite) The upgrade-firmware cookbook gets seems to get unexpected data from logstash102[78]: `sudo cookbook sre.hardware.upgrade-firmware logstash1028 -c nic --new` ` logsta... [18:07:11] !log robh@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:07:28] !log robh@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host lvs5006 [18:07:32] !log robh@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host lvs5006 [18:08:02] !log robh@cumin2002 START - Cookbook sre.hosts.provision for host lvs5005.mgmt.eqsin.wmnet with reboot policy FORCED [18:08:47] 10SRE, 10ops-eqiad, 10Phabricator, 10decommission-hardware, 10serviceops-collab: decommission phab1001.eqiad.wmnet - https://phabricator.wikimedia.org/T323418 (10Dzahn) @marostegui as part of "decom of host phab1001" we can remove any mysql GRANTS for users coming from its former IP 10.64.16.8. I made a... 
[18:10:11] (03PS1) 10BBlack: eqsin cp: unify per-node hieradata [puppet] - 10https://gerrit.wikimedia.org/r/865120 (https://phabricator.wikimedia.org/T322048) [18:11:25] (03PS1) 10Effie Mouzeli: ProductionServices: Replace use redis_misc servers for LockManager (4/6) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/865121 (https://phabricator.wikimedia.org/T267581) [18:11:27] (03PS1) 10Effie Mouzeli: ProductionServices: Replace use redis_misc servers for LockManager (5/6) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/865122 (https://phabricator.wikimedia.org/T267581) [18:12:00] (JobUnavailable) firing: Reduced availability for job trafficserver-text in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:13:52] (03PS1) 10Effie Mouzeli: ProductionServices: Use redis_misc servers for LockManager (6/6) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/865123 (https://phabricator.wikimedia.org/T267581) [18:14:26] (03CR) 10BBlack: [C: 03+1] "PCC says NOOP for 4 test hosts: one from each eqsin cluster, and one each from ulsfo + eqiad just to double-check." 
[puppet] - https://gerrit.wikimedia.org/r/865120 (https://phabricator.wikimedia.org/T322048) (owner: BBlack) [18:14:51] (PS3) Effie Mouzeli: ProductionServices: Use redis_misc servers for LockManager (1/6) [mediawiki-config] - https://gerrit.wikimedia.org/r/865117 (https://phabricator.wikimedia.org/T267581) [18:15:06] (PS3) Effie Mouzeli: ProductionServices: Use redis_misc servers for LockManager (2/6) [mediawiki-config] - https://gerrit.wikimedia.org/r/865118 (https://phabricator.wikimedia.org/T267581) [18:15:20] (PS3) Effie Mouzeli: ProductionServices: Use redis_misc servers for LockManager (3/6) [mediawiki-config] - https://gerrit.wikimedia.org/r/865119 (https://phabricator.wikimedia.org/T267581) [18:15:34] (PS2) Effie Mouzeli: ProductionServices: Use redis_misc servers for LockManager (4/6) [mediawiki-config] - https://gerrit.wikimedia.org/r/865121 (https://phabricator.wikimedia.org/T267581) [18:15:50] (PS2) Effie Mouzeli: ProductionServices: Use redis_misc servers for LockManager (5/6) [mediawiki-config] - https://gerrit.wikimedia.org/r/865122 (https://phabricator.wikimedia.org/T267581) [18:16:53] (PS1) RobH: adding eqsin ganeti [puppet] - https://gerrit.wikimedia.org/r/865124 (https://phabricator.wikimedia.org/T322048) [18:17:11] (CR) RobH: [C: +2] adding eqsin ganeti [puppet] - https://gerrit.wikimedia.org/r/865124 (https://phabricator.wikimedia.org/T322048) (owner: RobH) [18:17:34] (PS4) Effie Mouzeli: ProductionServices: Use redis_misc servers for LockManager (2/6) [mediawiki-config] - https://gerrit.wikimedia.org/r/865118 (https://phabricator.wikimedia.org/T267581) [18:17:47] (PS3) Effie Mouzeli: ProductionServices: Use redis_misc servers for LockManager (4/6) [mediawiki-config] - https://gerrit.wikimedia.org/r/865121 (https://phabricator.wikimedia.org/T267581) [18:17:57] (PS3) Effie Mouzeli: ProductionServices: Use redis_misc servers for LockManager (5/6) [mediawiki-config] -
https://gerrit.wikimedia.org/r/865122 (https://phabricator.wikimedia.org/T267581) [18:18:10] (PS2) Effie Mouzeli: ProductionServices: Use redis_misc servers for LockManager (6/6) [mediawiki-config] - https://gerrit.wikimedia.org/r/865123 (https://phabricator.wikimedia.org/T267581) [18:18:44] !log cwhite@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host logstash1028.eqiad.wmnet with OS bullseye [18:21:18] SRE, ops-eqsin, DC-Ops, Traffic: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (RobH) [18:23:14] jouncebot: nowandnext [18:23:14] No deployments scheduled for the next 0 hour(s) and 36 minute(s) [18:23:14] In 0 hour(s) and 36 minute(s): MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221206T1900) [18:23:28] (CR) Dzahn: "Hi Antoine, how about this. I just merge this and you can do the restarts, using your privileges as gerrit-root member, at a time that is " [puppet] - https://gerrit.wikimedia.org/r/865023 (https://phabricator.wikimedia.org/T323754) (owner: Hashar) [18:26:06] !log robh@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host lvs5005.mgmt.eqsin.wmnet with reboot policy FORCED [18:26:43] (CR) Krinkle: [C: +1] "Confirmed that redis_lock is only used with MW's RedisLockManager (via wmf-config filebackend.php), and that RedisLockManager in MW indeed" [mediawiki-config] - https://gerrit.wikimedia.org/r/865117 (https://phabricator.wikimedia.org/T267581) (owner: Effie Mouzeli) [18:27:06] !log cwhite@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['logstash1028'] [18:27:14] (CR) Krinkle: [C: +1] "Also confirmed via Puppet that these take the same auth credentials as the memc ones."
[mediawiki-config] - https://gerrit.wikimedia.org/r/865117 (https://phabricator.wikimedia.org/T267581) (owner: Effie Mouzeli) [18:27:22] !log cwhite@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['logstash1028'] [18:27:51] (CR) Herron: [C: +1] Enable profile::auto_restarts::service for Burrow [puppet] - https://gerrit.wikimedia.org/r/865114 (https://phabricator.wikimedia.org/T135991) (owner: Muehlenhoff) [18:28:21] !log robh@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['logstash1028'] [18:28:37] !log robh@cumin2002 END (ERROR) - Cookbook sre.hardware.upgrade-firmware (exit_code=97) upgrade firmware for hosts ['logstash1028'] [18:31:28] !log robh@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['lvs5005'] [18:31:36] !log robh@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['lvs5006'] [18:31:40] !log robh@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['dns5003'] [18:31:48] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['logstash1027'] [18:32:05] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['logstash1027'] [18:32:33] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [18:33:03] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['logstash1027'] [18:39:47] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest2002.codfw.wmnet with OS bullseye [18:39:51] SRE, ops-codfw: codfw:test new Supermicro server -
https://phabricator.wikimedia.org/T322578 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host sretest2002.codfw.wmnet with OS bullseye executed with errors: - sretest2002 (**FAIL**) - Removed from Puppet and P... [18:42:00] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['logstash1027'] [18:42:34] !log robh@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['lvs5005'] [18:42:37] !log robh@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['lvs5006'] [18:42:38] !log robh@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['dns5003'] [18:42:51] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['logstash1027'] [18:43:40] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['logstash1027'] [18:44:45] !log robh@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ganeti5005'] [18:45:30] !log robh@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ganeti5006'] [18:45:35] !log robh@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ganeti5007'] [18:45:37] !log cwhite@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['logstash1027'] [18:47:45] SRE, ops-eqsin, DC-Ops, Traffic: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (RobH) Open→In progress [18:55:00] !log cwhite@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['logstash1027'] [18:56:13] !log cwhite@cumin2002 START - Cookbook sre.hosts.reimage for host logstash1027.eqiad.wmnet with OS bullseye [18:57:07]
!log robh@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['ganeti5005'] [18:57:17] !log robh@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['ganeti5007'] [18:57:18] !log robh@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['ganeti5006'] [19:00:05] ^demon and dancy: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for MediaWiki train - Utc-7 Version . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221206T1900). [19:03:29] !log robh@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti5005.eqsin.wmnet with OS bullseye [19:03:33] !log robh@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti5006.eqsin.wmnet with OS bullseye [19:03:34] !log robh@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti5007.eqsin.wmnet with OS bullseye [19:03:35] SRE, ops-eqsin, DC-Ops, Traffic: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin2002 for host ganeti5005.eqsin.wmnet with OS bullseye [19:03:39] SRE, ops-eqsin, DC-Ops, Traffic: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin2002 for host ganeti5006.eqsin.wmnet with OS bullseye [19:03:42] SRE, ops-eqsin, DC-Ops, Traffic: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin2002 for host ganeti5007.eqsin.wmnet with OS bullseye [19:03:45] rise my servers, rissseeeee [19:04:23] SRE, ops-eqsin, DC-Ops, Traffic: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (RobH) [19:09:11] SRE,
DC-Ops, Patch-For-Review: Confirm support of PERC 750 raid controller - https://phabricator.wikimedia.org/T297913 (Jclark-ctr) [19:09:23] jouncebot: nowandnext [19:09:23] For the next 1 hour(s) and 50 minute(s): MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221206T1900) [19:09:24] In 1 hour(s) and 50 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221206T2100) [19:09:48] let me know if I can backport the blocker, I don't know who is the operator. ^demon ? [19:10:29] (CR) Ladsgroup: [C: +2] Avoid syntax error on hover in grade C browsers [extensions/Popups] (wmf/1.40.0-wmf.13) - https://gerrit.wikimedia.org/r/865128 (https://phabricator.wikimedia.org/T324514) (owner: Jdlrobson) [19:11:16] (PS1) TrainBranchBot: group0 wikis to 1.40.0-wmf.13 [mediawiki-config] - https://gerrit.wikimedia.org/r/865154 (https://phabricator.wikimedia.org/T320518) [19:11:18] (CR) TrainBranchBot: [C: +2] group0 wikis to 1.40.0-wmf.13 [mediawiki-config] - https://gerrit.wikimedia.org/r/865154 (https://phabricator.wikimedia.org/T320518) (owner: TrainBranchBot) [19:12:08] (Merged) jenkins-bot: group0 wikis to 1.40.0-wmf.13 [mediawiki-config] - https://gerrit.wikimedia.org/r/865154 (https://phabricator.wikimedia.org/T320518) (owner: TrainBranchBot) [19:12:40] (NodeTextfileStale) firing: (48) Stale textfile for cp1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [19:12:56] !log cwhite@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on logstash1027.eqiad.wmnet with reason: host reimage [19:13:55] SRE, ops-eqiad, DC-Ops, Shared-Data-Infrastructure: Multiple RAID battery failures on hadoop worker hosts - https://phabricator.wikimedia.org/T318659 (Jclark-ctr)
Open→Resolved [19:14:22] !log cwhite@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['logstash1028'] [19:15:42] (Merged) jenkins-bot: Avoid syntax error on hover in grade C browsers [extensions/Popups] (wmf/1.40.0-wmf.13) - https://gerrit.wikimedia.org/r/865128 (https://phabricator.wikimedia.org/T324514) (owner: Jdlrobson) [19:15:50] <^demon> Amir1: I am, but I might have to hand it off to my backup. I'm having issues getting into Logstash atm. [19:16:02] !log cwhite@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on logstash1027.eqiad.wmnet with reason: host reimage [19:16:07] sure, just give me a minute to push the backport [19:16:21] (CR) TrainBranchBot: [C: +2] "Approved by ladsgroup@deploy1002 using scap backport" [extensions/Popups] (wmf/1.40.0-wmf.13) - https://gerrit.wikimedia.org/r/865128 (https://phabricator.wikimedia.org/T324514) (owner: Jdlrobson) [19:19:35] !log demon@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.40.0-wmf.13 refs T320518 [19:19:39] T320518: 1.40.0-wmf.13 deployment blockers - https://phabricator.wikimedia.org/T320518 [19:19:54] !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:865128|Avoid syntax error on hover in grade C browsers (T324514)]] [19:19:57] T324514: ext.popups uses a CSS selector not recognized by old browsers - https://phabricator.wikimedia.org/T324514 [19:21:25] SRE, SRE-Access-Requests, ops-codfw: Access request for datacenter-ops group - https://phabricator.wikimedia.org/T324585 (Jhancock.wm) @Aklapper This is the same person.
[19:21:47] !log ladsgroup@deploy1002 ladsgroup and jdlrobson: Backport for [[gerrit:865128|Avoid syntax error on hover in grade C browsers (T324514)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet [19:21:52] !log cwhite@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['logstash1028'] [19:22:09] !log cwhite@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['logstash1028'] [19:29:38] SRE, LDAP-Access-Requests, Security-Team, SecTeam-Processed: Add Kelton Hurd to wmf ldap group - https://phabricator.wikimedia.org/T323941 (sbassett) [19:31:44] SRE, LDAP-Access-Requests, Security-Team, SecTeam-Processed: Add Kelton Hurd to wmf ldap group - https://phabricator.wikimedia.org/T323941 (sbassett) @Marostegui - Ok, @KHurd-WMF has a shell account now - **khurd**. Relevant wikitech account: https://wikitech.wikimedia.org/w/index.php?title=User...
[19:32:03] !log cwhite@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['logstash1028'] [19:32:08] !log robh@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti5006.eqsin.wmnet with reason: host reimage [19:32:09] !log robh@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti5005.eqsin.wmnet with reason: host reimage [19:32:15] !log robh@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti5007.eqsin.wmnet with reason: host reimage [19:32:37] !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:865128|Avoid syntax error on hover in grade C browsers (T324514)]] (duration: 12m 43s) [19:32:40] T324514: ext.popups uses a CSS selector not recognized by old browsers - https://phabricator.wikimedia.org/T324514 [19:35:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:35:12] !log robh@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti5006.eqsin.wmnet with reason: host reimage [19:37:10] !log robh@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti5007.eqsin.wmnet with reason: host reimage [19:38:01] !log cwhite@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host logstash1027.eqiad.wmnet with OS bullseye [19:39:35] !log robh@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti5005.eqsin.wmnet with reason: host reimage [19:40:03] (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:40:30] !log cwhite@cumin2002 START - Cookbook sre.hosts.reimage for host logstash1028.eqiad.wmnet with OS bullseye [19:41:08] (PS4) Ryan Kemper: snapshot: Remove absented cirrus dump job [puppet] - https://gerrit.wikimedia.org/r/856655 (owner: Ebernhardson) [19:41:19] (CR) Ebernhardson: "currently waiting on the run from nov 23 to complete before absenting this. Not sure if it would kill the in-progress unit or not" [puppet] - https://gerrit.wikimedia.org/r/856655 (owner: Ebernhardson) [19:41:35] (CR) CI reject: [V: -1] snapshot: Remove absented cirrus dump job [puppet] - https://gerrit.wikimedia.org/r/856655 (owner: Ebernhardson) [19:41:56] (PS5) Ryan Kemper: snapshot: Remove absented cirrus dump job [puppet] - https://gerrit.wikimedia.org/r/856655 (https://phabricator.wikimedia.org/T265056) (owner: Ebernhardson) [19:42:28] (PS6) Ryan Kemper: snapshot: Remove absented cirrus dump job [puppet] - https://gerrit.wikimedia.org/r/856655 (https://phabricator.wikimedia.org/T265056) (owner: Ebernhardson) [19:42:39] (PS7) Ryan Kemper: snapshot: Remove absented cirrus dump job [puppet] - https://gerrit.wikimedia.org/r/856655 (https://phabricator.wikimedia.org/T265056) (owner: Ebernhardson) [19:45:06] (Abandoned) Ryan Kemper: elastic: disable saneitizer for perf reasons [puppet] - https://gerrit.wikimedia.org/r/811374 (https://phabricator.wikimedia.org/T309648) (owner: Ryan Kemper) [19:47:50] (PS2) Ottomata: flink-kubernetes-operator - Initial commit of upstream helm chart [deployment-charts] - https://gerrit.wikimedia.org/r/865100 (https://phabricator.wikimedia.org/T324576) [19:47:52] (PS1) Ottomata: flink-kubernetes-operator - modify for WMF [deployment-charts] - https://gerrit.wikimedia.org/r/865158 (https://phabricator.wikimedia.org/T324576)
[19:51:38] SRE, SRE-Access-Requests, Product-Analytics, Search-Console-access-request: Search Console access for he.wikisource.org - https://phabricator.wikimedia.org/T238090 (jhathaway) @Fuzzy would you kindly email me your email address, jhathaway@wikimedia.org? [19:55:06] !log robh@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti5006.eqsin.wmnet with OS bullseye [19:55:12] SRE, ops-eqsin, DC-Ops, Traffic: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by robh@cumin2002 for host ganeti5006.eqsin.wmnet with OS bullseye completed: - ganeti5006 (**PASS**) - Removed from... [19:55:53] SRE, ops-eqsin, DC-Ops, Traffic: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (RobH) [19:56:54] !log robh@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti5007.eqsin.wmnet with OS bullseye [19:57:00] SRE, ops-eqsin, DC-Ops, Traffic: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by robh@cumin2002 for host ganeti5007.eqsin.wmnet with OS bullseye completed: - ganeti5007 (**PASS**) - Removed from...
[19:57:22] !log cwhite@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on logstash1028.eqiad.wmnet with reason: host reimage [19:57:48] SRE, ops-eqsin, DC-Ops, Traffic: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (RobH) [19:59:30] (PS2) Ottomata: flink-kubernetes-operator - modify for WMF [deployment-charts] - https://gerrit.wikimedia.org/r/865158 (https://phabricator.wikimedia.org/T324576) [20:00:01] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance elastic1083-production-search-psi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [20:00:32] !log robh@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti5005.eqsin.wmnet with OS bullseye [20:00:43] (CR) Ottomata: "I'm ready for a first pass review. I'm sure there's a bunch I'm missing, but as a copied and stripped down version of the upstream chart," [deployment-charts] - https://gerrit.wikimedia.org/r/865158 (https://phabricator.wikimedia.org/T324576) (owner: Ottomata) [20:00:47] SRE, ops-eqsin, DC-Ops, Traffic: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by robh@cumin2002 for host ganeti5005.eqsin.wmnet with OS bullseye completed: - ganeti5005 (**PASS**) - Removed from...
[20:01:04] SRE, ops-eqsin, DC-Ops, Traffic: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (RobH) [20:01:55] SRE, SRE-Access-Requests: Requesting access to Turnilo for USER:wfan - https://phabricator.wikimedia.org/T324057 (jhathaway) [20:01:55] !log cwhite@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on logstash1028.eqiad.wmnet with reason: host reimage [20:02:42] SRE, SRE-Access-Requests: Requesting access to Turnilo for USER:wfan - https://phabricator.wikimedia.org/T324057 (jhathaway) @XenoRyet please approve [20:02:49] SRE, SRE-Access-Requests: Requesting access to Turnilo for USER:wfan - https://phabricator.wikimedia.org/T324057 (jhathaway) a:jhathaway [20:03:13] SRE, SRE-Access-Requests: Requesting access to Turnilo for USER:wfan - https://phabricator.wikimedia.org/T324057 (XenoRyet) Hey, sorry for the delay. Approved. [20:03:28] SRE, ops-eqsin, DC-Ops, Traffic: ganeti500[567] implementation tracking for serviceops - https://phabricator.wikimedia.org/T324610 (RobH) [20:06:05] !log cwhite@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['logstash1029'] [20:09:23] SRE, SRE-Access-Requests: Requesting access to Turnilo for USER:wfan - https://phabricator.wikimedia.org/T324057 (jhathaway) [20:09:29] SRE, SRE-Access-Requests: Requesting access to Turnilo for USER:wfan - https://phabricator.wikimedia.org/T324057 (jhathaway) @Ottomata kindly approve when you have a moment [20:10:19] SRE, SRE-Access-Requests: Requesting access to Turnilo for USER:wfan - https://phabricator.wikimedia.org/T324057 (Ottomata) Approved.
[20:13:56] !log cwhite@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['logstash1029'] [20:15:56] SRE, Infrastructure-Foundations: Update DNS record to allow us to send emails from @wikimedia.org on Qualtrics - https://phabricator.wikimedia.org/T314815 (Dzahn) Open→Resolved In https://wikimedia.slack.com/archives/CTFK3B423/p1660308829761499 it has been confirmed that this ticket can be closed... [20:16:37] !log cwhite@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['logstash1029'] [20:22:25] !log cwhite@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['logstash1029'] [20:24:04] !log cwhite@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host logstash1028.eqiad.wmnet with OS bullseye [20:24:56] !log cwhite@cumin2002 START - Cookbook sre.hosts.reimage for host logstash1029.eqiad.wmnet with OS bullseye [20:25:05] !log cwhite@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host logstash1029.eqiad.wmnet with OS bullseye [20:25:34] !log cwhite@cumin2002 START - Cookbook sre.hosts.reimage for host logstash1029.eqiad.wmnet with OS bullseye [20:29:43] (PS4) Ryan Kemper: elastic: alert on per-node indexing not occurring [alerts] - https://gerrit.wikimedia.org/r/818214 (https://phabricator.wikimedia.org/T314078) [20:32:05] (CR) CI reject: [V: -1] elastic: alert on per-node indexing not occurring [alerts] - https://gerrit.wikimedia.org/r/818214 (https://phabricator.wikimedia.org/T314078) (owner: Ryan Kemper) [20:33:25] (PS1) Cmjohnson: Adding cephosd servers to site.pp insetup role [puppet] - https://gerrit.wikimedia.org/r/865186 (https://phabricator.wikimedia.org/T322760) [20:34:55] (CR) Cmjohnson: [C: +2] Adding cephosd servers to site.pp insetup role [puppet] - https://gerrit.wikimedia.org/r/865186 (https://phabricator.wikimedia.org/T322760)
(owner: Cmjohnson) [20:35:28] (PS5) Ryan Kemper: elastic: alert on per-node indexing not occurring [alerts] - https://gerrit.wikimedia.org/r/818214 (https://phabricator.wikimedia.org/T314078) [20:35:43] SRE, SRE-Access-Requests, Product-Analytics, Search-Console-access-request: Search Console access for he.wikisource.org - https://phabricator.wikimedia.org/T238090 (Fuzzy) >>! In T238090#8448413, @jhathaway wrote: > @Fuzzy would you kindly email me your email address, jhathaway@wikimedia.org? Oka... [20:37:08] (CR) CI reject: [V: -1] elastic: alert on per-node indexing not occurring [alerts] - https://gerrit.wikimedia.org/r/818214 (https://phabricator.wikimedia.org/T314078) (owner: Ryan Kemper) [20:38:02] (PS6) Ryan Kemper: elastic: alert on per-node indexing not occurring [alerts] - https://gerrit.wikimedia.org/r/818214 (https://phabricator.wikimedia.org/T314078) [20:39:57] (CR) CI reject: [V: -1] elastic: alert on per-node indexing not occurring [alerts] - https://gerrit.wikimedia.org/r/818214 (https://phabricator.wikimedia.org/T314078) (owner: Ryan Kemper) [20:41:04] (PS7) Ryan Kemper: elastic: alert on per-node indexing not occurring [alerts] - https://gerrit.wikimedia.org/r/818214 (https://phabricator.wikimedia.org/T314078) [20:42:19] !log cwhite@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on logstash1029.eqiad.wmnet with reason: host reimage [20:42:42] (CR) Ryan Kemper: "The underlying metric wasn't being generated.
We've fixed that in https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CirrusSearch/+/865" [puppet] - https://gerrit.wikimedia.org/r/830240 (https://phabricator.wikimedia.org/T316712) (owner: Ebernhardson) [20:44:57] (CR) Ryan Kemper: [C: +2] elastic: alert on per-node indexing not occurring [alerts] - https://gerrit.wikimedia.org/r/818214 (https://phabricator.wikimedia.org/T314078) (owner: Ryan Kemper) [20:45:29] !log cwhite@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on logstash1029.eqiad.wmnet with reason: host reimage [20:46:22] SRE, SRE-Access-Requests, Product-Analytics, Search-Console-access-request: Search Console access for he.wikisource.org - https://phabricator.wikimedia.org/T238090 (jhathaway) Permissions updated @Fuzzy, as to adding the http property, @SCherukuwada is that something you would be able to do? [20:48:01] (Merged) jenkins-bot: elastic: alert on per-node indexing not occurring [alerts] - https://gerrit.wikimedia.org/r/818214 (https://phabricator.wikimedia.org/T314078) (owner: Ryan Kemper) [20:48:14] SRE, SRE-Access-Requests: Requesting access to Turnilo for USER:wfan - https://phabricator.wikimedia.org/T324057 (jhathaway) [20:49:45] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host cephosd1002.mgmt.eqiad.wmnet with reboot policy FORCED [21:00:01] (CirrusSearchHighOldGCFrequency) resolved: Elasticsearch instance elastic1083-production-search-psi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [21:00:04] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: (Dis)respected human, time to deploy UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221206T2100). Please do the needful.
[21:00:04] No Gerrit patches in the queue for this window AFAICS. [21:00:25] yup, nothing in the queue [21:07:35] !log cwhite@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host logstash1029.eqiad.wmnet with OS bullseye [21:10:16] (PS3) Ottomata: flink-kubernetes-operator - modify for WMF [deployment-charts] - https://gerrit.wikimedia.org/r/865158 (https://phabricator.wikimedia.org/T324576) [21:12:13] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cephosd1002.mgmt.eqiad.wmnet with reboot policy FORCED [21:16:35] TheresNoTime: tgr_ and I might add something to the queue [21:16:58] kostajh: sure thing :) are you going to self-serve deploy or should I? [21:17:11] one of us could do it, thx [21:17:20] okay :) [21:23:16] SRE, Fundraising-Backlog, Traffic-Icebox, fr-donorservices, and 3 others: SSL cert for links.email.wikimedia.org - https://phabricator.wikimedia.org/T188561 (XenoRyet) [21:40:02] SRE, ops-eqiad, DC-Ops, Data-Engineering-Planning, Shared-Data-Infrastructure: Q2:rack/setup/install cephosd100[1-5] - https://phabricator.wikimedia.org/T322760 (Cmjohnson) [21:41:04] SRE, ops-eqiad, DC-Ops, Data-Engineering-Planning, Shared-Data-Infrastructure: Q2:rack/setup/install cephosd100[1-5] - https://phabricator.wikimedia.org/T322760 (Cmjohnson) Open→Resolved @BTullis these servers are ready for you to image. BIOS/Network and firmware have been updated. I... [21:42:53] ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T321572 (Cmjohnson) a:Cmjohnson→Jclark-ctr @Jclark-ctr Can you get with @ayounsi regarding this, it could be an optic that needs to be replaced.
[21:42:58] (PS1) Gergő Tisza: Fix UserDatabaseHelper::hasMainspaceEdits() [extensions/GrowthExperiments] (wmf/1.40.0-wmf.13) - https://gerrit.wikimedia.org/r/865130 (https://phabricator.wikimedia.org/T324285) [21:43:19] (PS1) Gergő Tisza: Fix UserDatabaseHelper::hasMainspaceEdits() [extensions/GrowthExperiments] (wmf/1.40.0-wmf.12) - https://gerrit.wikimedia.org/r/865131 (https://phabricator.wikimedia.org/T324285) [21:46:32] deploying [21:49:06] (CR) TrainBranchBot: [C: +2] "Approved by tgr@deploy1002 using scap backport" [extensions/GrowthExperiments] (wmf/1.40.0-wmf.12) - https://gerrit.wikimedia.org/r/865131 (https://phabricator.wikimedia.org/T324285) (owner: Gergő Tisza) [21:49:08] (CR) TrainBranchBot: [C: +2] "Approved by tgr@deploy1002 using scap backport" [extensions/GrowthExperiments] (wmf/1.40.0-wmf.13) - https://gerrit.wikimedia.org/r/865130 (https://phabricator.wikimedia.org/T324285) (owner: Gergő Tisza) [21:51:59] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1024.eqiad.wmnet with OS bullseye [21:52:06] SRE, ops-eqiad, DC-Ops, serviceops: Q1:rack/setup/install kubernetes102[34] - https://phabricator.wikimedia.org/T313873 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host kubernetes1024.eqiad.wmnet with OS bullseye [21:58:17] (PS1) Andrew Bogott: rsyslog: allow specifying a hiera-defined certfile [puppet] - https://gerrit.wikimedia.org/r/865174 (https://phabricator.wikimedia.org/T127717) [21:59:32] (CR) CI reject: [V: -1] rsyslog: allow specifying a hiera-defined certfile [puppet] - https://gerrit.wikimedia.org/r/865174 (https://phabricator.wikimedia.org/T127717) (owner: Andrew Bogott) [22:00:39] (PS2) Andrew Bogott: rsyslog: allow specifying a hiera-defined certfile [puppet] - https://gerrit.wikimedia.org/r/865174 (https://phabricator.wikimedia.org/T127717) [22:01:48] (CR) CI reject: [V: -1] rsyslog:
allow specifying a hiera-defined certfile [puppet] - 10https://gerrit.wikimedia.org/r/865174 (https://phabricator.wikimedia.org/T127717) (owner: 10Andrew Bogott) [22:01:54] (03PS3) 10Andrew Bogott: rsyslog: allow specifying a hiera-defined certfile [puppet] - 10https://gerrit.wikimedia.org/r/865174 (https://phabricator.wikimedia.org/T127717) [22:02:44] (03CR) 10CI reject: [V: 04-1] rsyslog: allow specifying a hiera-defined certfile [puppet] - 10https://gerrit.wikimedia.org/r/865174 (https://phabricator.wikimedia.org/T127717) (owner: 10Andrew Bogott) [22:05:23] (03PS4) 10Andrew Bogott: rsyslog: allow specifying a hiera-defined certfile [puppet] - 10https://gerrit.wikimedia.org/r/865174 (https://phabricator.wikimedia.org/T127717) [22:06:19] (03CR) 10CI reject: [V: 04-1] rsyslog: allow specifying a hiera-defined certfile [puppet] - 10https://gerrit.wikimedia.org/r/865174 (https://phabricator.wikimedia.org/T127717) (owner: 10Andrew Bogott) [22:06:23] (03Merged) 10jenkins-bot: Fix UserDatabaseHelper::hasMainspaceEdits() [extensions/GrowthExperiments] (wmf/1.40.0-wmf.12) - 10https://gerrit.wikimedia.org/r/865131 (https://phabricator.wikimedia.org/T324285) (owner: 10Gergő Tisza) [22:06:42] (03Merged) 10jenkins-bot: Fix UserDatabaseHelper::hasMainspaceEdits() [extensions/GrowthExperiments] (wmf/1.40.0-wmf.13) - 10https://gerrit.wikimedia.org/r/865130 (https://phabricator.wikimedia.org/T324285) (owner: 10Gergő Tisza) [22:07:11] !log tgr@deploy1002 Started scap: Backport for [[gerrit:865131|Fix UserDatabaseHelper::hasMainspaceEdits() (T324285)]], [[gerrit:865130|Fix UserDatabaseHelper::hasMainspaceEdits() (T324285)]] [22:07:14] T324285: NewImpact: Null state for "Last edited" - https://phabricator.wikimedia.org/T324285 [22:09:01] !log tgr@deploy1002 tgr and tgr: Backport for [[gerrit:865131|Fix UserDatabaseHelper::hasMainspaceEdits() (T324285)]], [[gerrit:865130|Fix UserDatabaseHelper::hasMainspaceEdits() (T324285)]] synced to the testservers: mwdebug1002.eqiad.wmnet, 
mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet [22:12:00] (JobUnavailable) firing: Reduced availability for job trafficserver-text in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [22:23:57] (03PS5) 10Andrew Bogott: rsyslog: allow specifying a hiera-defined certfile [puppet] - 10https://gerrit.wikimedia.org/r/865174 (https://phabricator.wikimedia.org/T127717) [22:24:13] RECOVERY - Check systemd state on mirror1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:24:45] (03CR) 10CI reject: [V: 04-1] rsyslog: allow specifying a hiera-defined certfile [puppet] - 10https://gerrit.wikimedia.org/r/865174 (https://phabricator.wikimedia.org/T127717) (owner: 10Andrew Bogott) [22:25:12] (03PS1) 10JHathaway: Add Wenjun Fan to analytics_privatedata_users [puppet] - 10https://gerrit.wikimedia.org/r/865177 (https://phabricator.wikimedia.org/T324057) [22:25:31] (03PS6) 10Andrew Bogott: rsyslog: allow specifying a hiera-defined certfile [puppet] - 10https://gerrit.wikimedia.org/r/865174 (https://phabricator.wikimedia.org/T127717) [22:26:09] !log tgr@deploy1002 Finished scap: Backport for [[gerrit:865131|Fix UserDatabaseHelper::hasMainspaceEdits() (T324285)]], [[gerrit:865130|Fix UserDatabaseHelper::hasMainspaceEdits() (T324285)]] (duration: 18m 58s) [22:26:13] T324285: NewImpact: Null state for "Last edited" - https://phabricator.wikimedia.org/T324285 [22:27:07] (03CR) 10CI reject: [V: 04-1] rsyslog: allow specifying a hiera-defined certfile [puppet] - 10https://gerrit.wikimedia.org/r/865174 (https://phabricator.wikimedia.org/T127717) (owner: 10Andrew Bogott) [22:30:34] (03PS7) 10Andrew Bogott: rsyslog: allow specifying a hiera-defined certfile [puppet] - 10https://gerrit.wikimedia.org/r/865174 
(https://phabricator.wikimedia.org/T127717) [22:30:36] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [22:31:06] (03CR) 10JHathaway: "kindly review" [puppet] - 10https://gerrit.wikimedia.org/r/865177 (https://phabricator.wikimedia.org/T324057) (owner: 10JHathaway) [22:31:19] (03PS8) 10Andrew Bogott: rsyslog: allow specifying a hiera-defined certfile [puppet] - 10https://gerrit.wikimedia.org/r/865174 (https://phabricator.wikimedia.org/T127717) [22:31:39] PROBLEM - Check systemd state on mirror1001 is CRITICAL: CRITICAL - degraded: The following units failed: update-ubuntu-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:32:18] (03CR) 10CI reject: [V: 04-1] rsyslog: allow specifying a hiera-defined certfile [puppet] - 10https://gerrit.wikimedia.org/r/865174 (https://phabricator.wikimedia.org/T127717) (owner: 10Andrew Bogott) [22:32:33] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [22:33:27] (03PS9) 10Andrew Bogott: rsyslog: allow specifying a hiera-defined certfile [puppet] - 10https://gerrit.wikimedia.org/r/865174 (https://phabricator.wikimedia.org/T127717) [22:34:24] (03CR) 10CI reject: [V: 04-1] rsyslog: allow specifying a hiera-defined certfile [puppet] - 10https://gerrit.wikimedia.org/r/865174 (https://phabricator.wikimedia.org/T127717) (owner: 10Andrew Bogott) [22:35:44] (03PS10) 10Andrew Bogott: rsyslog: allow specifying a hiera-defined certfile [puppet] - 10https://gerrit.wikimedia.org/r/865174 (https://phabricator.wikimedia.org/T127717) [22:36:52] !log cmjohnson@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: kubernetes1023 - cmjohnson@cumin1001" [22:37:09] !log UTC late backports done [22:37:10] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:39:12] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: kubernetes1023 - cmjohnson@cumin1001" [22:39:12] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [22:40:14] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host kubernetes1023.mgmt.eqiad.wmnet with reboot policy FORCED [22:43:15] (03PS1) 10Ebernhardson: Update ltr plugin to 7.10.2-wmf1 [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/865178 (https://phabricator.wikimedia.org/T324247) [22:46:01] 10SRE, 10SRE-Access-Requests, 10ops-codfw: Access request for datacenter-ops group - https://phabricator.wikimedia.org/T324585 (10jhathaway) @Aklapper are you okay with me proceeding with granting access to @Jhancock.wm or should we create a login linked to their ITS created account? [22:50:43] (03PS1) 10Cmjohnson: Adding kubernetes1023/1024 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/865189 (https://phabricator.wikimedia.org/T313873) [22:52:35] (03CR) 10Cmjohnson: [C: 03+2] Adding kubernetes1023/1024 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/865189 (https://phabricator.wikimedia.org/T313873) (owner: 10Cmjohnson) [22:52:56] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kubernetes1023.mgmt.eqiad.wmnet with reboot policy FORCED [22:54:39] (03PS1) 10Dzahn: phabricator/cloud: remove vcs related IP settings [puppet] - 10https://gerrit.wikimedia.org/r/865181 [23:02:11] (03PS1) 10Dzahn: phabricator: remove all vcs related code [puppet] - 10https://gerrit.wikimedia.org/r/865182 [23:04:51] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kubernetes1024.eqiad.wmnet with OS bullseye [23:05:00] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops, 10Patch-For-Review: Q1:rack/setup/install kubernetes102[34] - 
https://phabricator.wikimedia.org/T313873 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host kubernetes1024.eqiad.wmnet with OS bullseye executed with... [23:06:11] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1023.eqiad.wmnet with OS bullseye [23:06:20] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops, 10Patch-For-Review: Q1:rack/setup/install kubernetes102[34] - https://phabricator.wikimedia.org/T313873 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host kubernetes1023.eqiad.wmnet with OS bullseye [23:08:52] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1024.eqiad.wmnet with OS bullseye [23:09:01] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops, 10Patch-For-Review: Q1:rack/setup/install kubernetes102[34] - https://phabricator.wikimedia.org/T313873 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host kubernetes1024.eqiad.wmnet with OS bullseye [23:11:13] (03PS11) 10Andrew Bogott: rsyslog: allow specifying a hiera-defined certfile [puppet] - 10https://gerrit.wikimedia.org/r/865174 (https://phabricator.wikimedia.org/T127717) [23:11:15] (03PS1) 10Andrew Bogott: remote syslog: allow rsyslog client to use root CA [puppet] - 10https://gerrit.wikimedia.org/r/865184 (https://phabricator.wikimedia.org/T127717) [23:12:54] (NodeTextfileStale) firing: (48) Stale textfile for cp1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [23:14:11] (03CR) 10Dzahn: scap: move firewall rules out of the module (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/862378 (https://phabricator.wikimedia.org/T114209) (owner: 10Dzahn) [23:14:37] (03CR) 10CI reject: [V: 04-1] remote syslog: allow rsyslog client to 
use root CA [puppet] - 10https://gerrit.wikimedia.org/r/865184 (https://phabricator.wikimedia.org/T127717) (owner: 10Andrew Bogott) [23:19:15] (03PS2) 10Andrew Bogott: remote syslog: allow rsyslog client to use root CA [puppet] - 10https://gerrit.wikimedia.org/r/865184 (https://phabricator.wikimedia.org/T127717) [23:20:16] (03CR) 10CI reject: [V: 04-1] remote syslog: allow rsyslog client to use root CA [puppet] - 10https://gerrit.wikimedia.org/r/865184 (https://phabricator.wikimedia.org/T127717) (owner: 10Andrew Bogott) [23:26:29] (03CR) 10SBassett: [C: 04-1] "We wouldn't likely allow this for mediawiki's primary CSP config, as this really isn't a great idea from a security perspective. Is there" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/864327 (https://phabricator.wikimedia.org/T199055) (owner: 10AndyRussG) [23:27:42] 10SRE, 10SRE-Access-Requests, 10ops-codfw: Access request for datacenter-ops group - https://phabricator.wikimedia.org/T324585 (10Papaul) @Aklapper Jennifer is the new contractor that will be working in codfw. When her ldap account was created her personal email address was used and not her wiki email addr... [23:29:14] (03CR) 10SBassett: [C: 04-1] CentralNotice: Add wmflabs to banner preview CSP (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/864327 (https://phabricator.wikimedia.org/T199055) (owner: 10AndyRussG) [23:34:06] 10SRE, 10SRE-Access-Requests, 10ops-codfw: Access request for datacenter-ops group - https://phabricator.wikimedia.org/T324585 (10wiki_willy) Since the issue with ITS accidentally linking Jenn's personal email address has been fixed, and changed to her Wikimedia email address, this is all approved on my side... 
[23:45:53] (03PS3) 10Andrew Bogott: remote syslog: allow rsyslog client to use root CA [puppet] - 10https://gerrit.wikimedia.org/r/865184 (https://phabricator.wikimedia.org/T127717) [23:46:28] (03CR) 10Jforrester: [C: 03+1] "Can we write this to sniff for both so that we don't break all script runs during the train cut-over?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/863434 (https://phabricator.wikimedia.org/T184782) (owner: 10Zabe) [23:46:53] (03CR) 10CI reject: [V: 04-1] remote syslog: allow rsyslog client to use root CA [puppet] - 10https://gerrit.wikimedia.org/r/865184 (https://phabricator.wikimedia.org/T127717) (owner: 10Andrew Bogott)