[00:08:23] PROBLEM - dump of es5 in eqiad on backupmon1001 is CRITICAL: dump for es5 at eqiad (es1025) taken more than a week ago: Most recent backup 2022-09-06 00:00:04 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [00:15:27] PROBLEM - dump of es4 in codfw on backupmon1001 is CRITICAL: dump for es4 at codfw (es2022) taken more than a week ago: Most recent backup 2022-09-06 00:00:14 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [00:17:29] (03CR) 10Dzahn: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/831963 (owner: 10Hashar) [00:19:24] (03CR) 10Dzahn: gerrit: scap checks script to automatize deployment (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/831916 (https://phabricator.wikimedia.org/T317412) (owner: 10Hashar) [00:19:49] PROBLEM - dump of es4 in eqiad on backupmon1001 is CRITICAL: dump for es4 at eqiad (es1022) taken more than a week ago: Most recent backup 2022-09-06 00:00:04 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [00:20:29] RECOVERY - SSH on db1101.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [00:22:36] (03CR) 10Dzahn: gerrit: ignore lint error in role (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/831932 (owner: 10Hashar) [00:24:14] (03CR) 10Dzahn: gerrit: ignore lint error in role (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/831932 (owner: 10Hashar) [00:25:16] (03CR) 10Dzahn: "agree it should be a profile, plus 1 for intention. will check more closely tomorrow" [puppet] - 10https://gerrit.wikimedia.org/r/831933 (owner: 10Hashar) [00:26:43] (03CR) 10Dzahn: "just never 100% sure if it will need a hard service restart or not. 
but better to do it more often than needed than being surprised at an " [puppet] - 10https://gerrit.wikimedia.org/r/831913 (https://phabricator.wikimedia.org/T317412) (owner: 10Hashar) [00:27:19] PROBLEM - dump of es5 in codfw on backupmon1001 is CRITICAL: dump for es5 at codfw (es2025) taken more than a week ago: Most recent backup 2022-09-06 00:00:14 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [00:28:20] (03Abandoned) 10Dzahn: Revert "phabricator: Allow deploy user to keep scap3 environment variables with sudo" [puppet] - 10https://gerrit.wikimedia.org/r/831554 (owner: 10Dzahn) [00:29:15] (03CR) 10RLazarus: [C: 03+2] "Discussed with Joe on IRC, so going ahead and self-merging to clear the cumin1001 alert for httpbb on mw1418." [puppet] - 10https://gerrit.wikimedia.org/r/831997 (owner: 10RLazarus) [00:46:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T314041)', diff saved to https://phabricator.wikimedia.org/P34668 and previous config saved to /var/cache/conftool/dbconfig/20220914-004624-ladsgroup.json [00:46:28] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [01:01:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P34669 and previous config saved to /var/cache/conftool/dbconfig/20220914-010130-ladsgroup.json [01:07:53] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:11:03] RECOVERY - Check unit status of httpbb_hourly_appserver on cumin1001 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [01:16:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P34670 and 
previous config saved to /var/cache/conftool/dbconfig/20220914-011637-ladsgroup.json [01:18:54] !log bking@cumin1001 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.UPGRADE (3 nodes at a time) for ElasticSearch cluster search_eqiad: elastic 6.8 -> 7.10 - bking@cumin1001 - T317686 [01:18:58] T317686: Upgrade eqiad cluster to Elasticsearch 7.10.2 - https://phabricator.wikimedia.org/T317686 [01:31:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T314041)', diff saved to https://phabricator.wikimedia.org/P34671 and previous config saved to /var/cache/conftool/dbconfig/20220914-013143-ladsgroup.json [01:31:45] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1175.eqiad.wmnet with reason: Maintenance [01:31:49] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [01:31:58] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1175.eqiad.wmnet with reason: Maintenance [01:32:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1175 (T314041)', diff saved to https://phabricator.wikimedia.org/P34672 and previous config saved to /var/cache/conftool/dbconfig/20220914-013204-ladsgroup.json [01:36:45] (JobUnavailable) firing: (2) Reduced availability for job ganeti in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:41:45] (JobUnavailable) firing: (9) Reduced availability for job ganeti in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:46:45] (JobUnavailable) firing: (11) Reduced availability for job ganeti in ops@eqiad - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:51:45] (JobUnavailable) firing: (11) Reduced availability for job ganeti in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:53:31] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: (2) Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [01:55:15] PROBLEM - SSH on mw1316.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [02:04:47] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: refinery-import-siteinfo-dumps.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:06:45] (JobUnavailable) firing: (6) Reduced availability for job ganeti in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:09:51] RECOVERY - Check systemd state on ganeti5003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:10:20] (ProbeDown) firing: Service upload-https:443 has failed probes (http_upload-https_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#upload-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:11:20] (ProbeDown) firing: (2) 
Service upload-https:443 has failed probes (http_upload-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#upload-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:11:35] (FrontendUnavailable) firing: HAProxy (cache_upload) has reduced HTTP availability #page - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable [02:11:45] (JobUnavailable) firing: (6) Reduced availability for job ganeti in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:11:47] PROBLEM - BGP status on cr3-eqsin is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv6: OpenSent - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast, AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [02:12:22] here, looking [02:12:45] PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - AS64605/IPv4: OpenSent - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [02:13:31] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) resolved: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [02:13:45] (Primary outbound port utilisation over 80% #page) firing: Alert for device cr2-eqsin.wikimedia.org - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [02:13:45] (Primary outbound port 
utilisation over 80% #page) firing: Alert for device cr2-eqsin.wikimedia.org - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [02:14:01] RECOVERY - BGP status on cr3-eqsin is OK: BGP OK - up: 342, down: 7, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [02:14:13] (KubernetesRsyslogDown) firing: rsyslog on kubernetes2009:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [02:14:21] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 127 probes of 687 (alerts on 90) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [02:15:03] RECOVERY - BGP status on cr2-eqsin is OK: BGP OK - up: 102, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [02:15:20] (ProbeDown) resolved: (3) Service text-https:443 has failed probes (http_text-https_ip6) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:16:20] (ProbeDown) resolved: (2) Service upload-https:443 has failed probes (http_upload-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#upload-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:16:35] (FrontendUnavailable) resolved: HAProxy (cache_upload) has reduced HTTP availability #page - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable [02:20:39] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 
is OK: OK - failed 58 probes of 687 (alerts on 90) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [02:23:45] (Primary outbound port utilisation over 80% #page) resolved: Device cr2-eqsin.wikimedia.org recovered from Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [02:23:45] (Primary outbound port utilisation over 80% #page) resolved: Device cr2-eqsin.wikimedia.org recovered from Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [02:24:55] RECOVERY - dump of es4 in eqiad on backupmon1001 is OK: Last dump for es4 at eqiad (es1022) taken on 2022-09-13 00:00:09 (3508 GiB, +0.8 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [02:54:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2182 (T314041)', diff saved to https://phabricator.wikimedia.org/P34673 and previous config saved to /var/cache/conftool/dbconfig/20220914-025402-ladsgroup.json [02:54:07] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [02:56:25] RECOVERY - SSH on mw1316.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [03:08:05] (03CR) 10CDanis: [C: 04-1] "A fun discovery tonight!" 
[puppet] - 10https://gerrit.wikimedia.org/r/768723 (owner: 10Jbond) [03:09:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2182', diff saved to https://phabricator.wikimedia.org/P34674 and previous config saved to /var/cache/conftool/dbconfig/20220914-030908-ladsgroup.json [03:10:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T314041)', diff saved to https://phabricator.wikimedia.org/P34675 and previous config saved to /var/cache/conftool/dbconfig/20220914-031027-ladsgroup.json [03:10:31] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [03:20:19] (03CR) 10Subramanya Sastry: [C: 03+1] Disable wgParserEnableLegacyMediaDOM on enwikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/830707 (https://phabricator.wikimedia.org/T314318) (owner: 10Arlolra) [03:24:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2182', diff saved to https://phabricator.wikimedia.org/P34676 and previous config saved to /var/cache/conftool/dbconfig/20220914-032415-ladsgroup.json [03:24:31] PROBLEM - SSH on db1101.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [03:25:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P34677 and previous config saved to /var/cache/conftool/dbconfig/20220914-032533-ladsgroup.json [03:39:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2182 (T314041)', diff saved to https://phabricator.wikimedia.org/P34678 and previous config saved to /var/cache/conftool/dbconfig/20220914-033921-ladsgroup.json [03:39:26] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [03:40:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance 
db2149', diff saved to https://phabricator.wikimedia.org/P34679 and previous config saved to /var/cache/conftool/dbconfig/20220914-034040-ladsgroup.json [03:47:37] RECOVERY - dump of es5 in eqiad on backupmon1001 is OK: Last dump for es5 at eqiad (es1025) taken on 2022-09-13 00:00:09 (3487 GiB, +0.8 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [03:54:41] RECOVERY - dump of es4 in codfw on backupmon1001 is OK: Last dump for es4 at codfw (es2022) taken on 2022-09-13 00:00:03 (3508 GiB, +0.8 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [03:55:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T314041)', diff saved to https://phabricator.wikimedia.org/P34680 and previous config saved to /var/cache/conftool/dbconfig/20220914-035546-ladsgroup.json [03:55:49] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2156.codfw.wmnet with reason: Maintenance [03:55:50] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [03:56:02] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2156.codfw.wmnet with reason: Maintenance [03:56:03] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2094.codfw.wmnet with reason: Maintenance [03:56:17] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2094.codfw.wmnet with reason: Maintenance [03:56:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2156 (T314041)', diff saved to https://phabricator.wikimedia.org/P34681 and previous config saved to /var/cache/conftool/dbconfig/20220914-035624-ladsgroup.json [04:06:33] RECOVERY - dump of es5 in codfw on backupmon1001 is OK: Last dump for es5 at codfw (es2025) taken on 2022-09-13 00:00:03 (3487 GiB, +0.8 %) 
https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [04:18:15] PROBLEM - SSH on mw1313.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [04:20:55] PROBLEM - SSH on mw1314.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [04:39:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148 (T314041)', diff saved to https://phabricator.wikimedia.org/P34682 and previous config saved to /var/cache/conftool/dbconfig/20220914-043929-ladsgroup.json [04:39:34] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [04:40:33] (03PS2) 10Hashar: gerrit: change its templates to regular files [puppet] - 10https://gerrit.wikimedia.org/r/831963 [04:48:14] (03CR) 10Hashar: gerrit: ignore lint error in role (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/831932 (owner: 10Hashar) [04:49:53] (03CR) 10Hashar: gerrit: move proxy class to a profile (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/831933 (owner: 10Hashar) [04:52:44] (03CR) 10Hashar: gerrit: move proxy class to a profile (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/831933 (owner: 10Hashar) [04:53:05] (03PS4) 10Hashar: gerrit: move proxy class to a profile [puppet] - 10https://gerrit.wikimedia.org/r/831933 [04:54:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148', diff saved to https://phabricator.wikimedia.org/P34683 and previous config saved to /var/cache/conftool/dbconfig/20220914-045435-ladsgroup.json [05:04:17] (03PS1) 10Marostegui: mariadb: Misc codfw aren't critical [puppet] - 10https://gerrit.wikimedia.org/r/832014 [05:04:47] (03PS2) 10Marostegui: mariadb: Misc codfw aren't critical [puppet] - 10https://gerrit.wikimedia.org/r/832014 [05:05:26] (03CR) 10Marostegui: [C: 
03+2] mariadb: Misc codfw aren't critical [puppet] - 10https://gerrit.wikimedia.org/r/832014 (owner: 10Marostegui) [05:09:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148', diff saved to https://phabricator.wikimedia.org/P34684 and previous config saved to /var/cache/conftool/dbconfig/20220914-050942-ladsgroup.json [05:19:22] Good morning, extdist.wmflabs.org isn't working for me. [05:19:31] RECOVERY - SSH on mw1313.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [05:19:38] Is there any maintenance or something like that? [05:19:51] I get ERR_CONNECTION_TIMED_OUT when I want to open it, or download some extension via extension distributor. [05:23:50] Kizule: try #wikimedia-cloud [05:24:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148 (T314041)', diff saved to https://phabricator.wikimedia.org/P34685 and previous config saved to /var/cache/conftool/dbconfig/20220914-052448-ladsgroup.json [05:24:50] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1149.eqiad.wmnet with reason: Maintenance [05:24:53] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [05:24:59] Done, thank you RhinosF1. 
[05:25:04] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1149.eqiad.wmnet with reason: Maintenance [05:25:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1149 (T314041)', diff saved to https://phabricator.wikimedia.org/P34686 and previous config saved to /var/cache/conftool/dbconfig/20220914-052510-ladsgroup.json [05:33:53] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [05:35:25] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Zotero and citoid alive) is CRITICAL: Test Zotero and citoid alive returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Citoid [05:37:49] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [05:51:22] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 25 hosts with reason: Primary switchover s5 T317735 [05:51:25] T317735: Switchover s5 codfw master (db2123 -> db2113) - https://phabricator.wikimedia.org/T317735 [05:51:39] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 25 hosts with reason: Primary switchover s5 T317735 [05:51:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Set db2113 with weight 0 T317735', diff saved to https://phabricator.wikimedia.org/P34687 and previous config saved to /var/cache/conftool/dbconfig/20220914-055156-marostegui.json [05:56:02] (03PS1) 10Marostegui: mariadb: Promote db2113 to s5 codfw master [puppet] - 10https://gerrit.wikimedia.org/r/832143 (https://phabricator.wikimedia.org/T317735) [05:57:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T314041)', diff saved to https://phabricator.wikimedia.org/P34688 and previous config saved to 
/var/cache/conftool/dbconfig/20220914-055749-ladsgroup.json [05:57:54] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [05:58:03] PROBLEM - Check systemd state on ms-be2061 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:00:31] PROBLEM - SSH on mw1316.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [06:03:03] (03PS1) 10KartikMistry: Enable Content/Section translation on WPs with new MT support from Google [mediawiki-config] - 10https://gerrit.wikimedia.org/r/832145 (https://phabricator.wikimedia.org/T313296) [06:03:34] (03CR) 10CI reject: [V: 04-1] Enable Content/Section translation on WPs with new MT support from Google [mediawiki-config] - 10https://gerrit.wikimedia.org/r/832145 (https://phabricator.wikimedia.org/T313296) (owner: 10KartikMistry) [06:04:05] PROBLEM - Check whether ferm is active by checking the default input chain on ms-be2061 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [06:06:34] (03CR) 10Marostegui: [C: 03+2] mariadb: Promote db2113 to s5 codfw master [puppet] - 10https://gerrit.wikimedia.org/r/832143 (https://phabricator.wikimedia.org/T317735) (owner: 10Marostegui) [06:07:15] !log Starting s5 codfw failover from db2123 to db2113 - T317735 [06:07:16] T317735: Switchover s5 codfw master (db2123 -> db2113) - https://phabricator.wikimedia.org/T317735 [06:08:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Promote db2113 to s5 codfw primary T317735', diff saved to https://phabricator.wikimedia.org/P34689 and previous config saved to /var/cache/conftool/dbconfig/20220914-060807-marostegui.json [06:09:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2123 T317735', diff saved to 
https://phabricator.wikimedia.org/P34690 and previous config saved to /var/cache/conftool/dbconfig/20220914-060913-root.json [06:11:31] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db2123.codfw.wmnet with reason: down [06:11:34] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2123.codfw.wmnet with reason: down [06:12:00] (JobUnavailable) firing: Reduced availability for job ganeti in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:12:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P34691 and previous config saved to /var/cache/conftool/dbconfig/20220914-061256-ladsgroup.json [06:14:13] (KubernetesRsyslogDown) firing: rsyslog on kubernetes2009:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [06:15:05] RECOVERY - SSH on restbase2012.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [06:23:43] RECOVERY - Check systemd state on ms-be2061 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:27:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2123 (re)pooling @ 5%: Repooling for warm up after upgrade', diff saved to https://phabricator.wikimedia.org/P34692 and previous config saved to /var/cache/conftool/dbconfig/20220914-062723-root.json [06:28:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P34693 and previous config saved to 
/var/cache/conftool/dbconfig/20220914-062802-ladsgroup.json [06:28:21] RECOVERY - SSH on db1101.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [06:33:10] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 0:20:00 on kafka-logging2003.codfw.wmnet with reason: Kafka PKI upgrade [06:33:23] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:20:00 on kafka-logging2003.codfw.wmnet with reason: Kafka PKI upgrade [06:34:07] RECOVERY - Check whether ferm is active by checking the default input chain on ms-be2061 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [06:34:31] (03CR) 10Elukey: [V: 03+1 C: 03+2] role::kafka::logging: move kafka on all codfw nodes to PKI certificates [puppet] - 10https://gerrit.wikimedia.org/r/831831 (https://phabricator.wikimedia.org/T300130) (owner: 10Elukey) [06:38:18] !log restart kafka on kafka-logging2003 to pick up the new PKI TLS settings [06:38:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:42:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2123 (re)pooling @ 10%: Repooling for warm up after upgrade', diff saved to https://phabricator.wikimedia.org/P34694 and previous config saved to /var/cache/conftool/dbconfig/20220914-064228-root.json [06:43:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T314041)', diff saved to https://phabricator.wikimedia.org/P34695 and previous config saved to /var/cache/conftool/dbconfig/20220914-064309-ladsgroup.json [06:43:11] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1179.eqiad.wmnet with reason: Maintenance [06:43:12] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [06:43:24] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 
day, 0:00:00 on db1179.eqiad.wmnet with reason: Maintenance [06:43:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1179 (T314041)', diff saved to https://phabricator.wikimedia.org/P34696 and previous config saved to /var/cache/conftool/dbconfig/20220914-064330-ladsgroup.json [06:57:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2123 (re)pooling @ 25%: Repooling for warm up after upgrade', diff saved to https://phabricator.wikimedia.org/P34697 and previous config saved to /var/cache/conftool/dbconfig/20220914-065733-root.json [06:58:53] (03CR) 10Muehlenhoff: [C: 03+1] "Doh, good catch" [puppet] - 10https://gerrit.wikimedia.org/r/831987 (https://phabricator.wikimedia.org/T306654) (owner: 10Volans) [07:00:05] Amir1 and Urbanecm: My dear minions, it's time we take the moon! Just kidding. Time for UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220914T0700). [07:00:05] kart_: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [07:01:39] RECOVERY - SSH on mw1316.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [07:02:07] PROBLEM - SSH on mw1327.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [07:05:36] oh, I'm here. Had issue with login to IRC. [07:06:08] 10SRE, 10ops-eqiad, 10DBA: db1189 broken memory - https://phabricator.wikimedia.org/T317662 (10Marostegui) The data looks clean. [07:06:50] Any other patches for deployment? [07:07:54] (03CR) 10Muehlenhoff: [C: 03+2] wcqs/wdqs: New rolling restart nginx cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/831908 (owner: 10Muehlenhoff) [07:08:56] OK. Unstable network. I'll move patch to later today.. 
[07:11:28] (03Merged) 10jenkins-bot: wcqs/wdqs: New rolling restart nginx cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/831908 (owner: 10Muehlenhoff) [07:12:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2123 (re)pooling @ 50%: Repooling for warm up after upgrade', diff saved to https://phabricator.wikimedia.org/P34698 and previous config saved to /var/cache/conftool/dbconfig/20220914-071238-root.json [07:17:58] (KubernetesAPILatency) firing: High Kubernetes API latency (GET services) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana-rw.wikimedia.org/d/000000435/kubernetes-api?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [07:21:26] (03PS1) 10David Caro: novaproxy: Disable nchan module as it gives failures [puppet] - 10https://gerrit.wikimedia.org/r/832150 (https://phabricator.wikimedia.org/T316975) [07:22:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (GET services) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana-rw.wikimedia.org/d/000000435/kubernetes-api?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [07:23:53] (03CR) 10Muehlenhoff: [C: 03+2] ml-etcd: Also include staging hosts [puppet] - 10https://gerrit.wikimedia.org/r/831832 (owner: 10Muehlenhoff) [07:25:07] (03CR) 10CI reject: [V: 04-1] novaproxy: Disable nchan module as it gives failures [puppet] - 10https://gerrit.wikimedia.org/r/832150 (https://phabricator.wikimedia.org/T316975) (owner: 10David Caro) [07:25:28] (03CR) 10Muehlenhoff: [C: 03+2] scap: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/831039 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [07:26:23] (03CR) 10David Caro: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37250/console" [puppet] - 
https://gerrit.wikimedia.org/r/832150 (https://phabricator.wikimedia.org/T316975) (owner: David Caro) [07:26:54] (CR) Majavah: [C: +1] "works fine in codfw1dev" [puppet] - https://gerrit.wikimedia.org/r/832150 (https://phabricator.wikimedia.org/T316975) (owner: David Caro) [07:27:12] (PS2) David Caro: novaproxy: Disable nchan module as it gives failures [puppet] - https://gerrit.wikimedia.org/r/832150 (https://phabricator.wikimedia.org/T316975) [07:27:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2123 (re)pooling @ 75%: Repooling for warm up after upgrade', diff saved to https://phabricator.wikimedia.org/P34699 and previous config saved to /var/cache/conftool/dbconfig/20220914-072743-root.json [07:28:14] (PS1) Elukey: wikilabels::web: update wikilabels config repository [puppet] - https://gerrit.wikimedia.org/r/832151 (https://phabricator.wikimedia.org/T306110) [07:29:07] (CR) Elukey: [C: +2] wikilabels::web: update wikilabels config repository [puppet] - https://gerrit.wikimedia.org/r/832151 (https://phabricator.wikimedia.org/T306110) (owner: Elukey) [07:31:15] (CR) David Caro: [C: +2] novaproxy: Disable nchan module as it gives failures [puppet] - https://gerrit.wikimedia.org/r/832150 (https://phabricator.wikimedia.org/T316975) (owner: David Caro) [07:31:44] (CR) Muehlenhoff: [C: +1] "Looks good!" [debs/prometheus-ganeti-exporter] - https://gerrit.wikimedia.org/r/831871 (https://phabricator.wikimedia.org/T311288) (owner: Slyngshede) [07:32:11] (CR) Slyngshede: [C: +2] Downed VMs will report None as vCPU allocation. [debs/prometheus-ganeti-exporter] - https://gerrit.wikimedia.org/r/831871 (https://phabricator.wikimedia.org/T311288) (owner: Slyngshede) [07:32:15] (CR) Slyngshede: [V: +2 C: +2] Downed VMs will report None as vCPU allocation. 
[debs/prometheus-ganeti-exporter] - https://gerrit.wikimedia.org/r/831871 (https://phabricator.wikimedia.org/T311288) (owner: Slyngshede) [07:42:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2123 (re)pooling @ 100%: Repooling for warm up after upgrade', diff saved to https://phabricator.wikimedia.org/P34700 and previous config saved to /var/cache/conftool/dbconfig/20220914-074248-root.json [07:43:54] (PS1) Marostegui: db-production.php: Disable writes in es5 [mediawiki-config] - https://gerrit.wikimedia.org/r/832152 (https://phabricator.wikimedia.org/T317739) [07:44:17] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 6 hosts with reason: Primary switchover es5 T317739 [07:44:20] T317739: Switchover es5 master (es1024 -> es1023) - https://phabricator.wikimedia.org/T317739 [07:44:34] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 6 hosts with reason: Primary switchover es5 T317739 [07:44:40] (CR) Marostegui: [C: +2] db-production.php: Disable writes in es5 [mediawiki-config] - https://gerrit.wikimedia.org/r/832152 (https://phabricator.wikimedia.org/T317739) (owner: Marostegui) [07:45:02] (CR) Muehlenhoff: Allow cookbooks to handle restarts based on running one of more commands (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/826798 (owner: Muehlenhoff) [07:45:23] (Merged) jenkins-bot: db-production.php: Disable writes in es5 [mediawiki-config] - https://gerrit.wikimedia.org/r/832152 (https://phabricator.wikimedia.org/T317739) (owner: Marostegui) [07:46:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'Set es1023 with weight 0 T317739', diff saved to https://phabricator.wikimedia.org/P34701 and previous config saved to /var/cache/conftool/dbconfig/20220914-074617-marostegui.json [07:47:19] (PS1) Marostegui: mariadb: Promote es1023 to es5 master [puppet] - https://gerrit.wikimedia.org/r/832153 
(https://phabricator.wikimedia.org/T317739) [07:48:21] (PS1) Marostegui: wmnet: Update es5-master CNAME [dns] - https://gerrit.wikimedia.org/r/832154 (https://phabricator.wikimedia.org/T317739) [07:49:22] (CR) Marostegui: [C: +2] mariadb: Promote es1023 to es5 master [puppet] - https://gerrit.wikimedia.org/r/832153 (https://phabricator.wikimedia.org/T317739) (owner: Marostegui) [07:49:58] SRE, LDAP-Access-Requests: Request for changing LDAP (Wikitech) username - https://phabricator.wikimedia.org/T317623 (HasanAkgun_WMDE) @WMDE-leszek @Dzahn Actually the problem with creating new user process was kernel username, I couldn't remove that from the old user so I couldn't add it to the new one.... [07:50:02] !log marostegui@deploy1002 Synchronized wmf-config/db-production.php: Disable writes on es5 T317739 (duration: 04m 13s) [07:50:06] T317739: Switchover es5 master (es1024 -> es1023) - https://phabricator.wikimedia.org/T317739 [07:50:45] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [07:51:45] (JobUnavailable) resolved: Reduced availability for job ganeti in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:51:55] (PS1) Marostegui: Revert "db-production.php: Disable writes in es5" [mediawiki-config] - https://gerrit.wikimedia.org/r/831967 [07:54:52] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [07:54:53] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [07:55:02] !log Starting es5 eqiad failover from es1024 to es1023 T317739 [07:55:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:55:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Promote es1023 to es5 primary T317739', diff saved to 
https://phabricator.wikimedia.org/P34702 and previous config saved to /var/cache/conftool/dbconfig/20220914-075550-marostegui.json [07:55:53] T317739: Switchover es5 master (es1024 -> es1023) - https://phabricator.wikimedia.org/T317739 [07:56:16] (CR) Marostegui: [C: +2] wmnet: Update es5-master CNAME [dns] - https://gerrit.wikimedia.org/r/832154 (https://phabricator.wikimedia.org/T317739) (owner: Marostegui) [07:57:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool es1024 T317739', diff saved to https://phabricator.wikimedia.org/P34703 and previous config saved to /var/cache/conftool/dbconfig/20220914-075722-root.json [07:57:59] (CR) Marostegui: [C: +2] Revert "db-production.php: Disable writes in es5" [mediawiki-config] - https://gerrit.wikimedia.org/r/831967 (owner: Marostegui) [07:58:38] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [07:58:43] (Merged) jenkins-bot: Revert "db-production.php: Disable writes in es5" [mediawiki-config] - https://gerrit.wikimedia.org/r/831967 (owner: Marostegui) [08:01:57] RECOVERY - SSH on mw1327.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [08:02:53] !log marostegui@deploy1002 Synchronized wmf-config/db-production.php: Enable writes on es5 T317739 (duration: 03m 38s) [08:02:56] T317739: Switchover es5 master (es1024 -> es1023) - https://phabricator.wikimedia.org/T317739 [08:03:40] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on es1024.eqiad.wmnet with reason: down [08:03:42] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es1024.eqiad.wmnet with reason: down [08:03:43] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [08:07:46] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [08:07:47] !log 
mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [08:08:46] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [08:13:57] (PS2) KartikMistry: Enable Content/Section translation on WPs with new MT support from Google [mediawiki-config] - https://gerrit.wikimedia.org/r/832145 (https://phabricator.wikimedia.org/T313296) [08:14:31] (CR) Volans: [C: +2] admin: fix sudo permission for datacenter-ops [puppet] - https://gerrit.wikimedia.org/r/831987 (https://phabricator.wikimedia.org/T306654) (owner: Volans) [08:18:00] (PS1) Ladsgroup: Stop writing to the old templatelinks columns of enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/832157 (https://phabricator.wikimedia.org/T299417) [08:19:25] (PS2) Muehlenhoff: nutcracker: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/811227 (https://phabricator.wikimedia.org/T308013) [08:19:52] (PS2) Ladsgroup: Stop writing to the old templatelinks columns of enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/832157 (https://phabricator.wikimedia.org/T312865) [08:20:33] jouncebot: nowandnext [08:20:33] No deployments scheduled for the next 4 hour(s) and 39 minute(s) [08:20:33] In 4 hour(s) and 39 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220914T1300) [08:20:43] (CR) Ladsgroup: [C: +2] Stop writing to the old templatelinks columns of enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/832157 (https://phabricator.wikimedia.org/T312865) (owner: Ladsgroup) [08:21:11] (CR) Hashar: "The puppet compiler can not tells the difference since the files are no more present when using a directory with recurse => true ( https:/" [puppet] - https://gerrit.wikimedia.org/r/831963 (owner: Hashar) [08:21:40] (Merged) jenkins-bot: Stop writing to the old templatelinks columns of enwiki [mediawiki-config] - 
https://gerrit.wikimedia.org/r/832157 (https://phabricator.wikimedia.org/T312865) (owner: Ladsgroup) [08:22:02] SRE, SRE-Access-Requests, Infrastructure-Foundations (FY2021/2022-Q4), Patch-For-Review: Request sudo access for Jclark-ctr - https://phabricator.wikimedia.org/T306654 (Volans) @Jclark-ctr patch merged, could you please retry the `sudo secure-cookbook sre.dns.netbox "noop"` one? [08:24:39] (CR) Volans: "post-merge -1, there is a typo" [cookbooks] - https://gerrit.wikimedia.org/r/831908 (owner: Muehlenhoff) [08:24:59] (CR) TrainBranchBot: [C: +2] "Approved by ladsgroup@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/832157 (https://phabricator.wikimedia.org/T312865) (owner: Ladsgroup) [08:25:14] !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:832157|Stop writing to the old templatelinks columns of enwiki (T312865)]] [08:25:18] T312865: Turn off writing to the old columns of templatelinks in beta and production - https://phabricator.wikimedia.org/T312865 [08:25:35] !log ladsgroup@deploy1002 ladsgroup and ladsgroup: Backport for [[gerrit:832157|Stop writing to the old templatelinks columns of enwiki (T312865)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet [08:25:47] RECOVERY - SSH on mw1314.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [08:28:32] (CR) WMDE-Fisch: "This change is ready for review." 
[mediawiki-config] - https://gerrit.wikimedia.org/r/832158 (https://phabricator.wikimedia.org/T316676) (owner: WMDE-Fisch) [08:29:04] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [08:30:03] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [08:30:04] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [08:32:06] !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:832157|Stop writing to the old templatelinks columns of enwiki (T312865)]] (duration: 06m 51s) [08:32:09] T312865: Turn off writing to the old columns of templatelinks in beta and production - https://phabricator.wikimedia.org/T312865 [08:33:31] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2094.codfw.wmnet with reason: Maint needed [08:33:34] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2094.codfw.wmnet with reason: Maint needed [08:33:52] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [08:38:55] !log jmm@cumin2002 START - Cookbook sre.wdqs.restart-nginx rolling restart on A:wdqs-test [08:38:59] !log jmm@cumin2002 END (PASS) - Cookbook sre.wdqs.restart-nginx (exit_code=0) rolling restart on A:wdqs-test [08:39:11] (CR) Volans: "reply inline" [cookbooks] - https://gerrit.wikimedia.org/r/826798 (owner: Muehlenhoff) [08:40:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T314041)', diff saved to https://phabricator.wikimedia.org/P34704 and previous config saved to /var/cache/conftool/dbconfig/20220914-084039-ladsgroup.json [08:40:43] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [08:40:56] (PS1) Muehlenhoff: sre.wdqs.restart-nginx: Fix action and rename title [cookbooks] - https://gerrit.wikimedia.org/r/832201 [08:41:08] (CR) 
Muehlenhoff: [C: +2] wcqs/wdqs: New rolling restart nginx cookbook (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/831908 (owner: Muehlenhoff) [08:43:09] (CR) Volans: [C: +1] "LGTM" [cookbooks] - https://gerrit.wikimedia.org/r/832201 (owner: Muehlenhoff) [08:43:30] (CR) Volans: wcqs/wdqs: New rolling restart nginx cookbook (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/831908 (owner: Muehlenhoff) [08:44:32] (CR) Muehlenhoff: [C: +2] sre.wdqs.restart-nginx: Fix action and rename title [cookbooks] - https://gerrit.wikimedia.org/r/832201 (owner: Muehlenhoff) [08:49:09] !log jmm@cumin2002 START - Cookbook sre.wdqs.restart-nginx rolling restart_daemons on A:wdqs-test [08:49:53] !log jmm@cumin2002 END (PASS) - Cookbook sre.wdqs.restart-nginx (exit_code=0) rolling restart_daemons on A:wdqs-test [08:50:43] !log jmm@cumin2002 START - Cookbook sre.wdqs.restart-nginx rolling restart_daemons on A:wdqs-all [08:52:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1024 (re)pooling @ 1%: Repooling for warm up after upgrade', diff saved to https://phabricator.wikimedia.org/P34705 and previous config saved to /var/cache/conftool/dbconfig/20220914-085235-root.json [08:55:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P34706 and previous config saved to /var/cache/conftool/dbconfig/20220914-085545-ladsgroup.json [09:01:56] !log jmm@cumin2002 END (PASS) - Cookbook sre.wdqs.restart-nginx (exit_code=0) rolling restart_daemons on A:wdqs-all [09:01:59] (PS2) FNegri: Fix get_osd_tree to handle empty children list [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/831833 (https://phabricator.wikimedia.org/T317219) [09:03:15] (CR) FNegri: Fix get_osd_tree to handle empty children list (3 comments) [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/831833 (https://phabricator.wikimedia.org/T317219) (owner: FNegri) 
[09:04:22] ACKNOWLEDGEMENT - Check systemd state on cloudbackup2002 is CRITICAL: CRITICAL - degraded: The following units failed: block_sync-misc-project.service David Caro T317651 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:05:16] !log jmm@cumin2002 START - Cookbook sre.wdqs.restart-nginx rolling restart_daemons on A:wcqs-public [09:05:32] (CR) David Caro: bootstrap_and_add: added preflight checks (2 comments) [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/831122 (https://phabricator.wikimedia.org/T316021) (owner: David Caro) [09:05:34] (CR) CI reject: [V: -1] Fix get_osd_tree to handle empty children list [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/831833 (https://phabricator.wikimedia.org/T317219) (owner: FNegri) [09:06:49] SRE, Data-Persistence, Traffic: Revisit CDN<-->Swift communication - https://phabricator.wikimedia.org/T317616 (Vgutierrez) Open→Stalled > Does that answer your question sufficiently? Yes, we've discussed this during yesterday's Traffic team meeting and we will plan accordingly after the SRE... [09:07:26] !log jmm@cumin2002 END (PASS) - Cookbook sre.wdqs.restart-nginx (exit_code=0) rolling restart_daemons on A:wcqs-public [09:07:29] SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): hw troubleshooting: one disk not working properly in cloudcephosd1030.eqiad.wmnet - https://phabricator.wikimedia.org/T316673 (fnegri) We verified that the problem is with the drive and not with the controller, because @Jclark-ctr moved the... 
[09:07:38] SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): hw troubleshooting: one disk not working properly in cloudcephosd1034.eqiad.wmnet - https://phabricator.wikimedia.org/T316673 (fnegri) [09:07:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1024 (re)pooling @ 3%: Repooling for warm up after upgrade', diff saved to https://phabricator.wikimedia.org/P34707 and previous config saved to /var/cache/conftool/dbconfig/20220914-090740-root.json [09:09:59] SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): hw troubleshooting: one disk not working properly in cloudcephosd1034.eqiad.wmnet - https://phabricator.wikimedia.org/T316673 (fnegri) [09:10:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P34708 and previous config saved to /var/cache/conftool/dbconfig/20220914-091052-ladsgroup.json [09:12:16] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2094.codfw.wmnet with reason: Maintenance [09:12:19] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2094.codfw.wmnet with reason: Maintenance [09:13:05] (PS1) Elukey: wikilabels::web: use the master branch for git repos [puppet] - https://gerrit.wikimedia.org/r/832208 (https://phabricator.wikimedia.org/T306110) [09:15:36] (CR) Elukey: [C: +2] wikilabels::web: use the master branch for git repos [puppet] - https://gerrit.wikimedia.org/r/832208 (https://phabricator.wikimedia.org/T306110) (owner: Elukey) [09:15:48] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2094.codfw.wmnet with reason: Maintenance [09:15:50] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2094.codfw.wmnet with reason: Maintenance [09:17:41] (CR) David Caro: Fix get_osd_tree to handle empty children list (1 comment) 
[cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/831833 (https://phabricator.wikimedia.org/T317219) (owner: FNegri) [09:21:05] (PS3) FNegri: Fix get_osd_tree to handle empty children list [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/831833 (https://phabricator.wikimedia.org/T317219) [09:22:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1024 (re)pooling @ 5%: Repooling for warm up after upgrade', diff saved to https://phabricator.wikimedia.org/P34709 and previous config saved to /var/cache/conftool/dbconfig/20220914-092245-root.json [09:23:55] (CR) David Caro: Fix get_osd_tree to handle empty children list (1 comment) [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/831833 (https://phabricator.wikimedia.org/T317219) (owner: FNegri) [09:25:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T314041)', diff saved to https://phabricator.wikimedia.org/P34710 and previous config saved to /var/cache/conftool/dbconfig/20220914-092558-ladsgroup.json [09:26:00] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2177.codfw.wmnet with reason: Maintenance [09:26:02] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [09:26:14] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2177.codfw.wmnet with reason: Maintenance [09:26:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2177 (T314041)', diff saved to https://phabricator.wikimedia.org/P34711 and previous config saved to /var/cache/conftool/dbconfig/20220914-092620-ladsgroup.json [09:26:59] !log installing zlib/libxslt security updates on buster [09:27:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:27:27] (CR) FNegri: [C: -1] bootstrap_and_add: added preflight checks (2 comments) [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/831122 
(https://phabricator.wikimedia.org/T316021) (owner: David Caro) [09:31:04] (CR) FNegri: [C: -1] bootstrap_and_add: added preflight checks (1 comment) [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/831122 (https://phabricator.wikimedia.org/T316021) (owner: David Caro) [09:32:35] (PS1) Elukey: Revert "wikilabels::web: use the master branch for git repos" [puppet] - https://gerrit.wikimedia.org/r/831968 [09:33:30] (CR) Elukey: [C: +2] Revert "wikilabels::web: use the master branch for git repos" [puppet] - https://gerrit.wikimedia.org/r/831968 (owner: Elukey) [09:33:51] (CR) David Caro: bootstrap_and_add: added preflight checks (1 comment) [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/831122 (https://phabricator.wikimedia.org/T316021) (owner: David Caro) [09:37:27] (CR) FNegri: Fix get_osd_tree to handle empty children list (1 comment) [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/831833 (https://phabricator.wikimedia.org/T317219) (owner: FNegri) [09:37:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1024 (re)pooling @ 10%: Repooling for warm up after upgrade', diff saved to https://phabricator.wikimedia.org/P34712 and previous config saved to /var/cache/conftool/dbconfig/20220914-093750-root.json [09:38:13] (CR) David Caro: bootstrap_and_add: added preflight checks (1 comment) [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/831122 (https://phabricator.wikimedia.org/T316021) (owner: David Caro) [09:39:39] (PS4) FNegri: Fix get_osd_tree to handle empty children list [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/831833 (https://phabricator.wikimedia.org/T317219) [09:40:11] (CR) FNegri: Fix get_osd_tree to handle empty children list (3 comments) [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/831833 (https://phabricator.wikimedia.org/T317219) (owner: FNegri) [09:40:13] (PS3) David Caro: bootstrap_and_add: added preflight checks [cookbooks] (wmcs) - 
https://gerrit.wikimedia.org/r/831122 (https://phabricator.wikimedia.org/T316021) [09:40:27] (CR) David Caro: bootstrap_and_add: added preflight checks (1 comment) [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/831122 (https://phabricator.wikimedia.org/T316021) (owner: David Caro) [09:43:19] (PS1) Kosta Harlan: BlockMetrics: Update to new event schema version [extensions/WikimediaEvents] (wmf/1.40.0-wmf.1) - https://gerrit.wikimedia.org/r/831969 (https://phabricator.wikimedia.org/T306018) [09:45:16] since wmf.1 is stalled and not deployed anywhere yet (T314190), could we backport https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikimediaEvents/+/831969 outside the normal backport window time? [09:45:16] T314190: 1.40.0-wmf.1 deployment blockers - https://phabricator.wikimedia.org/T314190 [09:46:09] (PS1) Hashar: gerrit: move jetty class to init [puppet] - https://gerrit.wikimedia.org/r/832230 [09:46:40] (CR) Hashar: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/832230 (owner: Hashar) [09:47:40] (CR) Hashar: [C: +1] BlockMetrics: Update to new event schema version [extensions/WikimediaEvents] (wmf/1.40.0-wmf.1) - https://gerrit.wikimedia.org/r/831969 (https://phabricator.wikimedia.org/T306018) (owner: Kosta Harlan) [09:47:41] kostajh: yes! 
I have +1ed the change [09:48:01] I can't deploy right now though since it is lunch time but I guess you can scap deploy it [09:48:47] (CR) FNegri: [C: -1] "Still no luck unfortunately 😞" [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/831122 (https://phabricator.wikimedia.org/T316021) (owner: David Caro) [09:50:50] (CR) Hashar: "I will apply it to the Gerrit WMCS instance and see what happens ;)" [puppet] - https://gerrit.wikimedia.org/r/832230 (owner: Hashar) [09:52:05] (CR) FNegri: [C: -1] bootstrap_and_add: added preflight checks (1 comment) [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/831122 (https://phabricator.wikimedia.org/T316021) (owner: David Caro) [09:52:47] (CR) FNegri: [C: -1] bootstrap_and_add: added preflight checks (1 comment) [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/831122 (https://phabricator.wikimedia.org/T316021) (owner: David Caro) [09:52:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1024 (re)pooling @ 25%: Repooling for warm up after upgrade', diff saved to https://phabricator.wikimedia.org/P34713 and previous config saved to /var/cache/conftool/dbconfig/20220914-095255-root.json [09:53:23] !log ladsgroup@cumin1001 conftool action : set/pooled=yes; selector: cluster=wikireplicas-b,name=dbproxy1018.eqiad.wmnet [09:53:39] !log ladsgroup@cumin1001 conftool action : set/pooled=no; selector: cluster=wikireplicas-b,name=dbproxy1019.eqiad.wmnet [09:53:58] (PS1) Volans: homer: fix config override when using Netbox [software/homer] - https://gerrit.wikimedia.org/r/832233 [09:57:37] !log ladsgroup@cumin1001 conftool action : set/pooled=yes; selector: cluster=wikireplicas-b,name=dbproxy1019.eqiad.wmnet [09:57:57] !log ladsgroup@cumin1001 conftool action : set/pooled=no; selector: cluster=wikireplicas-b,name=dbproxy1018.eqiad.wmnet [09:58:04] !log ladsgroup@cumin1001 conftool action : set/pooled=inactive; selector: cluster=wikireplicas-b,name=dbproxy1018.eqiad.wmnet 
[09:59:37] !log ladsgroup@cumin1001 conftool action : set/pooled=yes; selector: cluster=wikireplicas-a,name=dbproxy1019.eqiad.wmnet [10:00:13] !log ladsgroup@cumin1001 conftool action : set/pooled=no; selector: cluster=wikireplicas-a,name=dbproxy1018.eqiad.wmnet [10:06:33] (PS1) Hnowlan: haproxy: use haproxy24 component [docker-images/production-images] - https://gerrit.wikimedia.org/r/832235 (https://phabricator.wikimedia.org/T233196) [10:06:41] (CR) David Caro: bootstrap_and_add: added preflight checks (1 comment) [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/831122 (https://phabricator.wikimedia.org/T316021) (owner: David Caro) [10:06:45] (CR) Vgutierrez: [C: -1] "Check inline comments, current version is broken" [puppet] - https://gerrit.wikimedia.org/r/826367 (owner: BCornwall) [10:08:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1024 (re)pooling @ 50%: Repooling for warm up after upgrade', diff saved to https://phabricator.wikimedia.org/P34714 and previous config saved to /var/cache/conftool/dbconfig/20220914-100800-root.json [10:12:08] hashar: thanks [10:14:13] (KubernetesRsyslogDown) firing: rsyslog on kubernetes2009:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [10:14:48] SRE, Data-Engineering, Traffic, Patch-For-Review: intake-analytics is responsible for up to a 85% of varnish backend fetch errors - https://phabricator.wikimedia.org/T306181 (Vgutierrez) This is highly related to T317051 and I think we can close this one and just blame on how Varnish reports some... 
[10:17:10] (CR) Kosta Harlan: [C: +2] "backport" [extensions/WikimediaEvents] (wmf/1.40.0-wmf.1) - https://gerrit.wikimedia.org/r/831969 (https://phabricator.wikimedia.org/T306018) (owner: Kosta Harlan) [10:18:34] !log import routinator 0.11.3-1bullseye to thirdparty/routinator [10:18:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:18:45] hashar: afaik, I don't need to run `scap sync-file`, is that right? [10:19:12] (Merged) jenkins-bot: BlockMetrics: Update to new event schema version [extensions/WikimediaEvents] (wmf/1.40.0-wmf.1) - https://gerrit.wikimedia.org/r/831969 (https://phabricator.wikimedia.org/T306018) (owner: Kosta Harlan) [10:19:24] or you can use the new `scap backport ` command :-) [10:20:30] oh [10:20:31] kostajh: or, sorry, misunderstood the question. no, you need to sync the change somehow even if wmf.1 is not active on any wikis [10:20:36] I'm following https://deploy-commands.toolforge.org/bacc/831969 [10:20:58] taavi: ack, running that now [10:22:06] and by 'somehow' I mean either scap sync-file or scap backport. 
the backport command is the new fancy thing that's supposed to replace sync and sync-file for the backporting workflow [10:23:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1024 (re)pooling @ 75%: Repooling for warm up after upgrade', diff saved to https://phabricator.wikimedia.org/P34715 and previous config saved to /var/cache/conftool/dbconfig/20220914-102305-root.json [10:24:39] !log kharlan@deploy1002 Synchronized php-1.40.0-wmf.1/extensions/WikimediaEvents/includes/BlockMetrics/BlockMetricsHooks.php: Backport: [[gerrit:831969|BlockMetrics: Update to new event schema version (T306018)]] (duration: 03m 48s) [10:24:43] T306018: Instrument blocked account registration - https://phabricator.wikimedia.org/T306018 [10:25:17] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [10:26:16] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [10:26:17] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [10:27:16] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [10:28:06] (CR) FNegri: [C: -1] bootstrap_and_add: added preflight checks (1 comment) [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/831122 (https://phabricator.wikimedia.org/T316021) (owner: David Caro) [10:31:46] (CR) Jbond: [C: +1] "lgtm" [software/homer] - https://gerrit.wikimedia.org/r/831951 (owner: Volans) [10:31:51] SRE, Traffic: ATS cache read p999 metrics shows up requests taking up to 1 second on cache read operations - https://phabricator.wikimedia.org/T317748 (Vgutierrez) [10:34:10] (CR) Cathal Mooney: [C: +1] "Looks good! 
Thanks for taking the time to track down where it was going wrong :)" [software/homer] - https://gerrit.wikimedia.org/r/832233 (owner: Volans) [10:35:10] (CR) Volans: [C: +2] cli: add --version option [software/homer] - https://gerrit.wikimedia.org/r/831951 (owner: Volans) [10:38:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'es1024 (re)pooling @ 100%: Repooling for warm up after upgrade', diff saved to https://phabricator.wikimedia.org/P34717 and previous config saved to /var/cache/conftool/dbconfig/20220914-103810-root.json [10:38:54] (PS1) Jbond: spec: drop stretch from default spec tests [puppet] - https://gerrit.wikimedia.org/r/832238 [10:39:46] (Merged) jenkins-bot: cli: add --version option [software/homer] - https://gerrit.wikimedia.org/r/831951 (owner: Volans) [10:40:48] (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37253/console" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/831933 (owner: Hashar) [10:41:29] (CR) Jbond: [V: +1 C: +1] "lgtm, ping me on irc if you want me to merge" [puppet] - https://gerrit.wikimedia.org/r/831933 (owner: Hashar) [10:41:57] (CR) Volans: [C: +2] homer: fix config override when using Netbox [software/homer] - https://gerrit.wikimedia.org/r/832233 (owner: Volans) [10:44:17] jbond: hey, you +2'd https://gerrit.wikimedia.org/r/c/operations/puppet/+/831500, did you forget to actually merge it? [10:52:01] (PS5) Aishik Rehman: Move namespace in the Bengali Wiktionary: উইকিসরাস → পরিশিষ্ট and set wgNamespaceAliases for newly created namespaces [mediawiki-config] - https://gerrit.wikimedia.org/r/831970 (https://phabricator.wikimedia.org/T317745) [10:58:21] (CR) CI reject: [V: -1] homer: fix config override when using Netbox [software/homer] - https://gerrit.wikimedia.org/r/832233 (owner: Volans) [10:58:57] dear jenkins, you were happy with the patch few minutes ago... 
why aren't you happy anymore? [10:59:39] meh python setup.py egg_info did not run successfully. [11:00:07] (03CR) 10Volans: [C: 03+2] homer: fix config override when using Netbox [software/homer] - 10https://gerrit.wikimedia.org/r/832233 (owner: 10Volans) [11:01:08] !log Prepping to upgrade JunOS on cr2-eqdfw. Adjusting OSPF costs to force traffic via alternate POPs. [11:01:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:02:26] !log cmooney@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cr2-eqdfw,cr2-eqdfw IPv6 with reason: router upgrade [11:02:40] !log cmooney@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cr2-eqdfw,cr2-eqdfw IPv6 with reason: router upgrade [11:04:29] (03Merged) 10jenkins-bot: homer: fix config override when using Netbox [software/homer] - 10https://gerrit.wikimedia.org/r/832233 (owner: 10Volans) [11:07:37] (03PS1) 10Ladsgroup: auto_schema: Fix sneaky bug on running schema change with replication [software] - 10https://gerrit.wikimedia.org/r/832239 [11:10:58] (03CR) 10Stang: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/831970 (https://phabricator.wikimedia.org/T317745) (owner: 10Aishik Rehman) [11:14:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179 (T314041)', diff saved to https://phabricator.wikimedia.org/P34719 and previous config saved to /var/cache/conftool/dbconfig/20220914-111400-ladsgroup.json [11:14:03] !log Shutting down internet transit and peering on cr2-eqdfw in advance of upgrade reboot [11:14:05] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [11:14:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:19:52] (03PS4) 10Jcrespo: bacula: Add production and db storage hosts to the backup cluster [puppet] - 10https://gerrit.wikimedia.org/r/829813 (https://phabricator.wikimedia.org/T313582) [11:21:31] (03CR) 10Marostegui: 
[C: 03+1] auto_schema: Fix sneaky bug on running schema change with replication [software] - 10https://gerrit.wikimedia.org/r/832239 (owner: 10Ladsgroup) [11:25:21] (03CR) 10Jcrespo: [C: 03+2] bacula: Add production and db storage hosts to the backup cluster [puppet] - 10https://gerrit.wikimedia.org/r/829813 (https://phabricator.wikimedia.org/T313582) (owner: 10Jcrespo) [11:28:18] (03PS1) 10Muehlenhoff: Add cookbook to restart/reboot the Docker registry [cookbooks] - 10https://gerrit.wikimedia.org/r/832241 [11:29:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179', diff saved to https://phabricator.wikimedia.org/P34721 and previous config saved to /var/cache/conftool/dbconfig/20220914-112907-ladsgroup.json [11:29:15] !log rebooting cr2-eqdfw to complete upgrade [11:29:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:31:24] (03CR) 10Jcrespo: [C: 03+2] "I had a syntax error- creating a new patch as it only affects the new definition, not the other aliases." 
[puppet] - 10https://gerrit.wikimedia.org/r/829813 (https://phabricator.wikimedia.org/T313582) (owner: 10Jcrespo) [11:33:37] (03CR) 10Ladsgroup: [C: 03+2] auto_schema: Fix sneaky bug on running schema change with replication [software] - 10https://gerrit.wikimedia.org/r/832239 (owner: 10Ladsgroup) [11:34:10] (03Merged) 10jenkins-bot: auto_schema: Fix sneaky bug on running schema change with replication [software] - 10https://gerrit.wikimedia.org/r/832239 (owner: 10Ladsgroup) [11:34:19] PROBLEM - OSPF status on cr3-knams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:34:40] (03PS1) 10Jcrespo: dbbackups: Increase backup freshness check from 8 to 9 days [puppet] - 10https://gerrit.wikimedia.org/r/832243 [11:34:42] (03PS1) 10Jcrespo: cumin: Fix backup alias, followup to e48d955 [puppet] - 10https://gerrit.wikimedia.org/r/832244 (https://phabricator.wikimedia.org/T313582) [11:34:46] ^^ that's the eqdfw upgrade [11:34:51] (cr3-knams ospf alert) [11:34:59] PROBLEM - OSPF status on cr2-codfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:34:59] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:34:59] PROBLEM - BFD status on cr3-knams is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [11:35:15] (03PS6) 10Aishik Rehman: Move namespace in the Bengali Wiktionary: উইকিসরাস → পরিশিষ্ট and set wgNamespaceAliases for newly created namespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/831970 (https://phabricator.wikimedia.org/T317745) [11:35:39] PROBLEM - BFD status on cr2-drmrs is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [11:35:39] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:35:43] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:36:05] PROBLEM - OSPF status on cr2-drmrs is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:36:33] thanks for the extra verbosity, it is appreciated [11:36:40] (on the heads up, I mean) [11:37:59] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:37:59] RECOVERY - BFD status on cr2-drmrs is OK: OK: UP: 9 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [11:38:27] RECOVERY - OSPF status on cr2-drmrs is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:39:03] RECOVERY - OSPF status on cr3-knams is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:39:43] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:39:43] RECOVERY - OSPF status on cr2-codfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:39:45] RECOVERY - BFD status on cr3-knams is OK: OK: UP: 8 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [11:40:16] (03CR) 10Jcrespo: [C: 03+2] cumin: Fix backup alias, followup to e48d955 [puppet] - 10https://gerrit.wikimedia.org/r/832244 (https://phabricator.wikimedia.org/T313582) (owner: 10Jcrespo) [11:40:25] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:40:26] (03CR) 10Jcrespo: [C: 03+2] dbbackups: Increase backup freshness check 
from 8 to 9 days [puppet] - 10https://gerrit.wikimedia.org/r/832243 (owner: 10Jcrespo) [11:43:19] (03PS3) 10KartikMistry: Enable Section Translation in Odia Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/831872 (https://phabricator.wikimedia.org/T313300) [11:44:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179', diff saved to https://phabricator.wikimedia.org/P34723 and previous config saved to /var/cache/conftool/dbconfig/20220914-114413-ladsgroup.json [11:44:42] (03CR) 10Muehlenhoff: [C: 03+1] "Good riddance :-)" [puppet] - 10https://gerrit.wikimedia.org/r/832238 (owner: 10Jbond) [11:45:32] (03PS6) 10Hnowlan: thumbor: new service chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/823143 (https://phabricator.wikimedia.org/T233196) [11:45:52] (03CR) 10Hnowlan: "Thanks a lot for all the feedback so far! I'll be adding fixtures in another patchset." [deployment-charts] - 10https://gerrit.wikimedia.org/r/823143 (https://phabricator.wikimedia.org/T233196) (owner: 10Hnowlan) [11:46:06] jbond: thanks for dropping `stretch` from `WMFConfig.test_on` ;-] [11:47:15] (03CR) 10CI reject: [V: 04-1] thumbor: new service chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/823143 (https://phabricator.wikimedia.org/T233196) (owner: 10Hnowlan) [11:48:05] (03CR) 10Hashar: "It is a noop on the WMCS instance:" [puppet] - 10https://gerrit.wikimedia.org/r/832230 (owner: 10Hashar) [11:48:10] (03PS1) 10Muehlenhoff: profile::ci::docker: Remove support for Stretch [puppet] - 10https://gerrit.wikimedia.org/r/832249 [11:49:18] !log cmooney@cumin1001 START - Cookbook sre.hosts.remove-downtime for cr2-eqdfw,cr2-eqdfw IPv6 [11:49:19] !log cmooney@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for cr2-eqdfw,cr2-eqdfw IPv6 [11:49:41] jynus: no problem, better everyone has visibility [11:49:59] that's me done now, no more CR upgrades until week after the summit [11:50:08] great work! 
[11:50:30] (03PS5) 10Hashar: gerrit: move proxy class to a profile [puppet] - 10https://gerrit.wikimedia.org/r/831933 [11:51:33] (03CR) 10Hashar: "I have removed the TODO comment to convert gerrit::proxy to a profile, changed test_on() to use the default (thanks)." [puppet] - 10https://gerrit.wikimedia.org/r/831933 (owner: 10Hashar) [11:59:08] (03PS1) 10Muehlenhoff: query_service: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/832251 (https://phabricator.wikimedia.org/T308013) [11:59:10] (03PS1) 10Muehlenhoff: docker: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/832252 (https://phabricator.wikimedia.org/T308013) [11:59:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179 (T314041)', diff saved to https://phabricator.wikimedia.org/P34725 and previous config saved to /var/cache/conftool/dbconfig/20220914-115920-ladsgroup.json [11:59:22] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1189.eqiad.wmnet with reason: Maintenance [11:59:24] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [11:59:24] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1189.eqiad.wmnet with reason: Maintenance [12:05:27] (03CR) 10Jbond: "As this is a copy from old code there are a few style nits but feel free to ignore. 
however please take a look specifically at the ldap_c" [puppet] - 10https://gerrit.wikimedia.org/r/832230 (owner: 10Hashar) [12:07:45] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/831932 (owner: 10Hashar) [12:08:58] (KubernetesRsyslogDown) resolved: rsyslog on kubernetes2009:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [12:09:33] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/831963 (owner: 10Hashar) [12:10:09] (03PS1) 10Muehlenhoff: doc: Enable profile::auto_restarts::service for FPM [puppet] - 10https://gerrit.wikimedia.org/r/832253 (https://phabricator.wikimedia.org/T135991) [12:12:40] jbond: I will get that series of Gerrit patches deployed with mutante since he has all the history about it and is usually our point of contact for Gerrit things. Your reviews definitely help build confidence to roll it and thanks for all the modernization suggestions [12:43:49] (03PS1) 10Marostegui: db2093: Host not critical [puppet] - 10https://gerrit.wikimedia.org/r/832257 [12:44:37] (03CR) 10Marostegui: [C: 03+2] db2093: Host not critical [puppet] - 10https://gerrit.wikimedia.org/r/832257 (owner: 10Marostegui) [12:44:41] (03CR) 10Jcrespo: [C: 03+1] db2093: Host not critical [puppet] - 10https://gerrit.wikimedia.org/r/832257 (owner: 10Marostegui) [12:46:06] (03CR) 10Kosta Harlan: [C: 03+1] Add growthexperiments_user_impact to $private_tables [puppet] - 10https://gerrit.wikimedia.org/r/831542 (https://phabricator.wikimedia.org/T317534) (owner: 10Gergő Tisza) [12:49:20] (03PS1) 10Muehlenhoff: Enable profile::auto_restarts::service for nginx on archiva/proxy [puppet] - 10https://gerrit.wikimedia.org/r/832258 (https://phabricator.wikimedia.org/T135991) [12:54:52] !log imported rsyslog 8.2208.0-1~bpo11+1 into bullseye-wikimedia component/rsyslog-k8s - T289766 [12:54:55] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:54:56] T289766: Kubernetes logs (container stderr,strout) do not show up in Elasticsearch/Kibana - https://phabricator.wikimedia.org/T289766 [12:58:40] (03PS1) 10Muehlenhoff: xmldumps: Enable profile::auto_restarts::service for nginx [puppet] - 10https://gerrit.wikimedia.org/r/832259 (https://phabricator.wikimedia.org/T135991) [13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, and awight: Your horoscope predicts another unfortunate UTC afternoon backport window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220914T1300). [13:00:05] kart_, kostajh, and Aishik: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:11] o/ [13:00:30] * kart_ is here.. [13:00:37] I can self deploy.. [13:00:42] ok! [13:01:24] I merged my patch earlier (wmf.1) [13:01:50] (03CR) 10KartikMistry: [C: 03+2] Enable Section Translation in Odia Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/831872 (https://phabricator.wikimedia.org/T313300) (owner: 10KartikMistry) [13:02:40] (03Merged) 10jenkins-bot: Enable Section Translation in Odia Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/831872 (https://phabricator.wikimedia.org/T313300) (owner: 10KartikMistry) [13:03:55] ok [13:05:20] Deploying first patch.. 
[13:09:03] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:09:05] !log kartik@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:831872|Enable Section Translation in Odia Wikipedia (T313300)]] (duration: 03m 55s) [13:09:08] T313300: Enable Section Translation on 9 more Wikipedias where Content Translation is available by default - https://phabricator.wikimedia.org/T313300 [13:09:54] (03PS3) 10KartikMistry: Enable Content/Section translation on WPs with new MT support from Google [mediawiki-config] - 10https://gerrit.wikimedia.org/r/832145 (https://phabricator.wikimedia.org/T313296) [13:10:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:10:04] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:10:05] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:10:41] (03CR) 10Herron: [C: 03+1] mail::mx: Modify the Received header (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/831625 (https://phabricator.wikimedia.org/T317574) (owner: 10JHathaway) [13:11:40] (03CR) 10Jbond: [C: 03+1] "lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/832241 (owner: 10Muehlenhoff) [13:12:09] (03CR) 10Jbond: [C: 03+2] spec: drop stretch from default spec tests [puppet] - 10https://gerrit.wikimedia.org/r/832238 (owner: 10Jbond) [13:12:58] (03CR) 10KartikMistry: [C: 03+2] Enable Content/Section translation on WPs with new MT support from Google [mediawiki-config] - 10https://gerrit.wikimedia.org/r/832145 (https://phabricator.wikimedia.org/T313296) (owner: 10KartikMistry) [13:13:44] (03Merged) 10jenkins-bot: Enable Content/Section translation on WPs with 
new MT support from Google [mediawiki-config] - 10https://gerrit.wikimedia.org/r/832145 (https://phabricator.wikimedia.org/T313296) (owner: 10KartikMistry) [13:13:48] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:15:03] (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:15:14] (03CR) 10Jbond: [C: 03+1] gerrit: move proxy class to a profile [puppet] - 10https://gerrit.wikimedia.org/r/831933 (owner: 10Hashar) [13:16:06] (03PS2) 10Hashar: gerrit: move jetty class to init [puppet] - 10https://gerrit.wikimedia.org/r/832230 [13:16:43] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/832251 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [13:16:48] Deploying second patch.. 
[13:17:00] (03CR) 10CI reject: [V: 04-1] gerrit: move jetty class to init [puppet] - 10https://gerrit.wikimedia.org/r/832230 (owner: 10Hashar) [13:18:51] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:19:46] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:19:47] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:20:33] of course I broke the arrow alignment plugin in vim bah [13:20:37] !log kartik@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:832145|Enable Content/Section translation on WPs with new MT support from Google (T313296)]] (duration: 03m 39s) [13:20:41] T313296: Enable Content and Section translation on wikipedias with new MT support from Google - https://phabricator.wikimedia.org/T313296 [13:20:42] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:23:03] (03CR) 10Jbond: [C: 03+1] "LGTM the changes from Zhuyifei1999 have been removed and the ones from juniorsys are only stylistic" [puppet] - 10https://gerrit.wikimedia.org/r/832251 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [13:23:27] Lucas_WMDE: I'm done with my patches. [13:23:35] ok, thanks! [13:25:01] Aishik: are you around for the bnwiktionary patch? [13:25:24] Yeap! I am here! [13:25:38] I’m looking at the patch now [13:25:55] shouldn’t it also add the old namespace name (উইকিসরাস) as an alias? [13:26:08] otherwise, what’s going to happen with all existing pages in that namespace, and links to them? 
[13:26:20] (I have to admit I haven’t deployed *that* many namespace changes before) [13:26:24] (03PS3) 10Hashar: gerrit: move jetty class to init [puppet] - 10https://gerrit.wikimedia.org/r/832230 [13:26:41] No need, We have nothing in this nameplate [13:26:47] *namespace [13:26:51] ah, ok [13:27:19] (03PS1) 10Hashar: gerrit: modernize spec [puppet] - 10https://gerrit.wikimedia.org/r/832260 [13:27:30] yup, looks like it according to Special:AllPages [13:27:50] (03PS7) 10Lucas Werkmeister (WMDE): Move namespace in the Bengali Wiktionary: উইকিসরাস → পরিশিষ্ট and set wgNamespaceAliases for newly created namespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/831970 (https://phabricator.wikimedia.org/T317745) (owner: 10Aishik Rehman) [13:27:53] (03CR) 10Hashar: gerrit: move jetty class to init (0319 comments) [puppet] - 10https://gerrit.wikimedia.org/r/832230 (owner: 10Hashar) [13:27:53] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde for Tanuja Doriya - https://phabricator.wikimedia.org/T317613 (10conny-kawohl_WMDE) Hi, I am Conny Kawohl, Engineering Manager of Fundraising Tech for Wikimedia Germany, and @Tanuja_Doriya is my new team member. I hereby approve the request for access to... [13:28:10] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] "LGTM; the old Appendix namespace name doesn’t need to be kept as an alias because there are currently no pages in it." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/831970 (https://phabricator.wikimedia.org/T317745) (owner: 10Aishik Rehman) [13:28:26] !log upgrading routinator on rpki2002 [13:28:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:29:05] (03Merged) 10jenkins-bot: Move namespace in the Bengali Wiktionary: উইকিসরাস → পরিশিষ্ট and set wgNamespaceAliases for newly created namespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/831970 (https://phabricator.wikimedia.org/T317745) (owner: 10Aishik Rehman) [13:30:16] Aishik: the change is on mwdebug1001, can you test it? [13:30:41] (03CR) 10Volans: "one typo in the docstring" [cookbooks] - 10https://gerrit.wikimedia.org/r/832241 (owner: 10Muehlenhoff) [13:31:05] PROBLEM - RPKI Validator RTR port on rpki2002 is CRITICAL: connect to address 10.192.0.103 and port 3323: Connection refused https://wikitech.wikimedia.org/wiki/RPKI%23RPKI_to_router_port [13:31:11] PROBLEM - Check systemd state on rpki2002 is CRITICAL: CRITICAL - degraded: The following units failed: routinator.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:31:40] Yes, it's working [13:31:45] (JobUnavailable) firing: Reduced availability for job routinator in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:31:46] \o/ [13:31:48] looks good on my end too [13:31:50] syncing [13:31:52] ^^ fyi I'm looking at rpki2002 [13:33:19] RECOVERY - RPKI Validator RTR port on rpki2002 is OK: TCP OK - 0.032 second response time on 10.192.0.103 port 3323 https://wikitech.wikimedia.org/wiki/RPKI%23RPKI_to_router_port [13:33:27] RECOVERY - Check systemd state on rpki2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:33:42] (03PS1) 10Filippo Giunchedi: Add missing 
dashboard/runbook annotations as TODOs [alerts] - 10https://gerrit.wikimedia.org/r/832261 [13:33:44] (03PS1) 10Filippo Giunchedi: Require dashboard and runbook annotations [alerts] - 10https://gerrit.wikimedia.org/r/832262 [13:33:51] (03CR) 10CI reject: [V: 04-1] Add missing dashboard/runbook annotations as TODOs [alerts] - 10https://gerrit.wikimedia.org/r/832261 (owner: 10Filippo Giunchedi) [13:34:01] (03CR) 10CI reject: [V: 04-1] Require dashboard and runbook annotations [alerts] - 10https://gerrit.wikimedia.org/r/832262 (owner: 10Filippo Giunchedi) [13:35:56] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:831970|Move namespace in the Bengali Wiktionary: উইকিসরাস → পরিশিষ্ট and set wgNamespaceAliases for newly created namespaces (T317745)]] (duration: 03m 41s) [13:35:58] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:36:01] T317745: Move namespace in the Bengali Wiktionary: উইকিসরাস → পরিশিষ্ট and set wgNamespaceAliases for newly created namespaces - https://phabricator.wikimedia.org/T317745 [13:36:57] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:36:58] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:36:59] PROBLEM - SSH on mw1325.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:37:18] (03PS2) 10Filippo Giunchedi: Add missing dashboard/runbook annotations as TODOs [alerts] - 10https://gerrit.wikimedia.org/r/832261 [13:37:19] !log lucaswerkmeister-wmde@mwmaint1002:~$ mwscript namespaceDupes.php bnwiktionary --fix # T317745 – dry run result: 6043 links to fix, 6043 were resolvable, 0 were deleted [13:37:20] (03PS2) 10Filippo Giunchedi: Require dashboard and runbook annotations [alerts] - 10https://gerrit.wikimedia.org/r/832262 [13:37:21] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:37:22] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/832260 (owner: 10Hashar) [13:37:44] (03PS1) 10Herron: install_server: add dhcp/netboot records for dispatch-be1001 [puppet] - 10https://gerrit.wikimedia.org/r/832263 (https://phabricator.wikimedia.org/T313229) [13:37:50] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:39:11] (03CR) 10CI reject: [V: 04-1] Require dashboard and runbook annotations [alerts] - 10https://gerrit.wikimedia.org/r/832262 (owner: 10Filippo Giunchedi) [13:40:11] (03CR) 10CI reject: [V: 04-1] Add missing dashboard/runbook annotations as TODOs [alerts] - 10https://gerrit.wikimedia.org/r/832261 (owner: 10Filippo Giunchedi) [13:40:17] !log UTC afternoon backport+config window done [13:40:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:37] (03PS1) 10Jbond: gerrit: add mock secrets [labs/private] - 10https://gerrit.wikimedia.org/r/832264 [13:41:46] (03CR) 10Herron: [C: 03+2] "standard vm provision, self-merging" [puppet] - 10https://gerrit.wikimedia.org/r/832263 (https://phabricator.wikimedia.org/T313229) (owner: 10Herron) [13:41:53] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Upgrade core routers to Junos 21+ - https://phabricator.wikimedia.org/T295690 (10cmooney) cr2-eqdfw upgrade completed successfully today. 
[13:42:24] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Upgrade core routers to Junos 21+ - https://phabricator.wikimedia.org/T295690 (10cmooney) [13:42:32] (03PS3) 10Filippo Giunchedi: Add missing dashboard/runbook annotations as TODOs [alerts] - 10https://gerrit.wikimedia.org/r/832261 [13:42:36] (03PS3) 10Filippo Giunchedi: Require dashboard and runbook annotations [alerts] - 10https://gerrit.wikimedia.org/r/832262 [13:42:40] (03PS1) 10Filippo Giunchedi: Don't check TODO runbooks for existence [alerts] - 10https://gerrit.wikimedia.org/r/832265 [13:44:43] (03PS2) 10Muehlenhoff: Add cookbook to restart/reboot the Docker registry [cookbooks] - 10https://gerrit.wikimedia.org/r/832241 [13:44:47] (03CR) 10Muehlenhoff: Add cookbook to restart/reboot the Docker registry (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/832241 (owner: 10Muehlenhoff) [13:46:23] (03CR) 10Andrea Denisse: [C: 03+1] "LGTM, thank you!" [alerts] - 10https://gerrit.wikimedia.org/r/832262 (owner: 10Filippo Giunchedi) [13:46:45] (JobUnavailable) firing: (2) Reduced availability for job routinator in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:47:30] (03CR) 10Andrea Denisse: [C: 03+1] "LGTM, thank you!" 
[alerts] - 10https://gerrit.wikimedia.org/r/832261 (owner: 10Filippo Giunchedi) [13:48:35] !log imported zlib 1:1.2.8.dfsg-5+deb9u1+wmf1 to apt.wikimedia.org [13:48:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:49:30] (03CR) 10Clément Goubert: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/832241 (owner: 10Muehlenhoff) [13:50:26] (03CR) 10Muehlenhoff: query_service: Add SPDX headers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/832251 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [13:51:03] !og installing zlib security updates on stretch hosts [13:51:35] (03CR) 10Clément Goubert: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/832249 (owner: 10Muehlenhoff) [13:51:45] (JobUnavailable) firing: (2) Reduced availability for job routinator in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:52:20] (03CR) 10Clément Goubert: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/832252 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [13:52:26] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/832230 (owner: 10Hashar) [13:54:49] (03CR) 10Jbond: [C: 03+1] query_service: Add SPDX headers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/832251 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [13:56:45] (JobUnavailable) resolved: (2) Reduced availability for job routinator in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:57:59] fun I found out puppet-strings in operations/puppet is still 1.0.0 released in November 2016. 
It is used to generate https://doc.wikimedia.org/puppet/ [13:59:08] !log jmm@cumin2002 START - Cookbook sre.misc-clusters.thumbor rolling restart_daemons on A:thumbor-codfw [14:01:12] !log jmm@cumin2002 END (PASS) - Cookbook sre.misc-clusters.thumbor (exit_code=0) rolling restart_daemons on A:thumbor-codfw [14:05:57] !log ladsgroup@cumin1001 conftool action : set/pooled=yes; selector: cluster=wikireplicas-a,name=dbproxy1018.eqiad.wmnet [14:06:18] !log ladsgroup@cumin1001 conftool action : set/pooled=inactive; selector: cluster=wikireplicas-a,name=dbproxy1019.eqiad.wmnet [14:06:31] (03CR) 10JHathaway: mail::mx: Modify the Received header (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/831625 (https://phabricator.wikimedia.org/T317574) (owner: 10JHathaway) [14:06:34] (03CR) 10JHathaway: [C: 03+2] mail::mx: Modify the Received header [puppet] - 10https://gerrit.wikimedia.org/r/831625 (https://phabricator.wikimedia.org/T317574) (owner: 10JHathaway) [14:09:51] (03PS2) 10Muehlenhoff: profile::ci::docker: Remove support for Stretch [puppet] - 10https://gerrit.wikimedia.org/r/832249 [14:10:40] (03PS25) 10Jbond: C:varnish: Rate limit hotlinking [puppet] - 10https://gerrit.wikimedia.org/r/768723 [14:10:42] (03PS1) 10Jbond: C:varnish: Add cluster_fe_hit and cluster_fe_ratelimit_hits subroutines [puppet] - 10https://gerrit.wikimedia.org/r/832268 [14:11:23] (03CR) 10Jbond: "t" [puppet] - 10https://gerrit.wikimedia.org/r/768723 (owner: 10Jbond) [14:11:48] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37254/console" [puppet] - 10https://gerrit.wikimedia.org/r/768723 (owner: 10Jbond) [14:12:44] (03PS1) 10Krinkle: session: Fix broken SessionTest case due to PHPUnit dependency change [core] (wmf/1.40.0-wmf.1) - 10https://gerrit.wikimedia.org/r/831984 (https://phabricator.wikimedia.org/T317750) [14:14:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177 
(T314041)', diff saved to https://phabricator.wikimedia.org/P34729 and previous config saved to /var/cache/conftool/dbconfig/20220914-141434-ladsgroup.json [14:14:39] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [14:15:13] PROBLEM - Check systemd state on stat1008 is CRITICAL: CRITICAL - degraded: The following units failed: jupyter-ntsako-singleuser.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:17:42] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Hide the client IP address in the SMTP Received header for authenticated relay clients - https://phabricator.wikimedia.org/T317574 (10TAndic) @jhathaway confirming it worked! Sent you a test email as well in case you're curious/want to have the record. [14:18:41] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Hide the client IP address in the SMTP Received header for authenticated relay clients - https://phabricator.wikimedia.org/T317574 (10jhathaway) >>! In T317574#8236259, @TAndic wrote: > @jhathaway confirming it worked! Sent you a test email as well in... [14:19:47] RECOVERY - Check systemd state on stat1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:28:36] (03PS2) 10Krinkle: session: Fix broken SessionTest case due to PHPUnit dependency change [core] (wmf/1.40.0-wmf.1) - 10https://gerrit.wikimedia.org/r/831984 (https://phabricator.wikimedia.org/T317750) [14:29:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P34730 and previous config saved to /var/cache/conftool/dbconfig/20220914-142941-ladsgroup.json [14:30:38] 10SRE, 10Observability-Logging, 10Patch-For-Review, 10SRE Observability (FY2022/2023-Q1): Move Kafka logging to the new intermediate PKI - https://phabricator.wikimedia.org/T300130 (10elukey) kafka logging codfw on PKI, all hosts moved! 
Next step: kafka-logging-eqiad :)
[14:38:11] RECOVERY - SSH on mw1325.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[14:39:05] (CR) Lucas Werkmeister (WMDE): [C: +1] "should be okay to backport" [core] (wmf/1.40.0-wmf.1) - https://gerrit.wikimedia.org/r/831984 (https://phabricator.wikimedia.org/T317750) (owner: Krinkle)
[14:40:32] (CR) Muehlenhoff: [C: +2] profile::ci::docker: Remove support for Stretch [puppet] - https://gerrit.wikimedia.org/r/832249 (owner: Muehlenhoff)
[14:43:55] (CR) Krinkle: [C: +2] session: Fix broken SessionTest case due to PHPUnit dependency change [core] (wmf/1.40.0-wmf.1) - https://gerrit.wikimedia.org/r/831984 (https://phabricator.wikimedia.org/T317750) (owner: Krinkle)
[14:44:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P34731 and previous config saved to /var/cache/conftool/dbconfig/20220914-144449-ladsgroup.json
[14:59:27] (Merged) jenkins-bot: session: Fix broken SessionTest case due to PHPUnit dependency change [core] (wmf/1.40.0-wmf.1) - https://gerrit.wikimedia.org/r/831984 (https://phabricator.wikimedia.org/T317750) (owner: Krinkle)
[14:59:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T314041)', diff saved to https://phabricator.wikimedia.org/P34732 and previous config saved to /var/cache/conftool/dbconfig/20220914-145956-ladsgroup.json
[15:00:00] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041
[15:03:55] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[15:04:57] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[15:04:58] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[15:05:57] !log mwdebug-deploy@deploy1002
helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[15:07:18] (ProbeDown) firing: Service api-https:443 has failed probes (http_api-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#api-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:09:17] (PS1) Hashar: doc: update puppet-strings 1.0.0..2.9.0 [puppet] - https://gerrit.wikimedia.org/r/832272
[15:09:19] (PS1) Hashar: doc: invoke strings:generate task with a hash of args [puppet] - https://gerrit.wikimedia.org/r/832273
[15:12:18] (ProbeDown) firing: (2) Service api-https:443 has failed probes (http_api-https_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:16:45] (CR) Jbond: [C: +2] "lgtm will merge (will need a follow up to the ci image)" [puppet] - https://gerrit.wikimedia.org/r/832272 (owner: Hashar)
[15:17:18] (ProbeDown) firing: (2) Service api-https:443 has failed probes (http_api-https_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:17:40] !log cwhite@cumin2002 START - Cookbook sre.dns.netbox
[15:17:47] (CR) Jbond: [C: +2] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/832273 (owner: Hashar)
[15:18:09] (CR) Muehlenhoff: [C: +2] Add cookbook to restart/reboot the Docker registry [cookbooks] - https://gerrit.wikimedia.org/r/832241 (owner: Muehlenhoff)
[15:22:18] (ProbeDown) resolved: Service api-https:443 has failed probes (http_api-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#api-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:22:26] !log
cwhite@cumin2002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
[15:22:32] !log cwhite@cumin2002 START - Cookbook sre.dns.netbox
[15:23:47] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:24:42] !log cwhite@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:24:50] (PS1) Muehlenhoff: Align includes with current practice [puppet] - https://gerrit.wikimedia.org/r/832278
[15:31:25] PROBLEM - SSH on mw1313.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[15:34:54] jbond: thanks! I was still digging in some options for the documentation generator but I gave up. I ran the job https://integration.wikimedia.org/ci/job/operations-puppet-doc/ and the new doc has been published
[15:34:55] thank you!
[15:35:21] I could not find out how to collect the various warnings so that maybe they get addressed
[15:35:27] anyway it is probably not important
[15:36:30] (PS4) David Caro: bootstrap_and_add: added preflight checks [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/831122 (https://phabricator.wikimedia.org/T316021)
[15:38:32] (PS1) Zabe: rdbms: Use plain array to store position data [core] (wmf/1.40.0-wmf.1) - https://gerrit.wikimedia.org/r/831985 (https://phabricator.wikimedia.org/T317606)
[15:38:55] (CR) Dduvall: [V: +2 C: +2] scap: Remove use of --preserve-env for sudo'd scripts [phabricator/deployment] (wmf/stable) - https://gerrit.wikimedia.org/r/831944 (https://phabricator.wikimedia.org/T313953) (owner: Dduvall)
[15:39:37] (PS1) Volans: CHANGELOG: add changelogs for release v0.6.1 [software/homer] - https://gerrit.wikimedia.org/r/832281
[15:39:56] (CR) Volans: [C: +2] CHANGELOG: add changelogs for release v0.6.1 [software/homer] - https://gerrit.wikimedia.org/r/832281 (owner: Volans)
[15:40:06] (PS1) Dduvall: scap: Target only phab2002.codfw.wmnet for now [phabricator/deployment] (wmf/stable) - https://gerrit.wikimedia.org/r/832282 (https://phabricator.wikimedia.org/T313954)
[15:40:16] (CR) CI reject: [V: -1] bootstrap_and_add: added preflight checks [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/831122 (https://phabricator.wikimedia.org/T316021) (owner: David Caro)
[15:43:13] (CR) Dduvall: [V: +2 C: +2] scap: Target only phab2002.codfw.wmnet for now [phabricator/deployment] (wmf/stable) - https://gerrit.wikimedia.org/r/832282 (https://phabricator.wikimedia.org/T313954) (owner: Dduvall)
[15:45:04] (Merged) jenkins-bot: CHANGELOG: add changelogs for release v0.6.1 [software/homer] - https://gerrit.wikimedia.org/r/832281 (owner: Volans)
[15:46:49] jouncebot: now
[15:46:49] No deployments scheduled for the next 2 hour(s) and 13 minute(s)
[15:48:10] !log testing phabricator deployment to phab2002. should have no production impact (not serving traffic, no access to r/w db)
[15:48:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:48:46] mutante: fyi ^
[15:49:31] !log dduvall@deploy1002 Started deploy [phabricator/deployment@3137c92]: testing phabricator deployment to phab2002
[15:50:10] !log dduvall@deploy1002 Finished deploy [phabricator/deployment@3137c92]: testing phabricator deployment to phab2002 (duration: 00m 39s)
[15:55:10] (CR) David Caro: [C: +2] "LGTM!"
[cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/831833 (https://phabricator.wikimedia.org/T317219) (owner: FNegri)
[15:55:27] !log cwhite@cumin2002 START - Cookbook sre.dns.wipe-cache logstash2001.codfw.wmnet on all recursors
[15:55:31] !log cwhite@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) logstash2001.codfw.wmnet on all recursors
[15:57:00] !log cwhite@cumin2002 START - Cookbook sre.dns.wipe-cache logstash1027.eqiad.wmnet on all recursors
[15:57:03] (PS12) BCornwall: varnish/tests: Remove extraneous test checks [puppet] - https://gerrit.wikimedia.org/r/826367
[15:57:04] !log cwhite@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) logstash1027.eqiad.wmnet on all recursors
[15:57:30] !log cwhite@cumin2002 START - Cookbook sre.dns.wipe-cache logstash1028.eqiad.wmnet on all recursors
[15:57:33] !log cwhite@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) logstash1028.eqiad.wmnet on all recursors
[15:57:42] !log cwhite@cumin2002 START - Cookbook sre.dns.wipe-cache logstash1029.eqiad.wmnet on all recursors
[15:57:45] !log cwhite@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) logstash1029.eqiad.wmnet on all recursors
[15:57:59] !log cwhite@cumin2002 START - Cookbook sre.dns.wipe-cache logstash1026.eqiad.wmnet on all recursors
[15:58:03] !log cwhite@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) logstash1026.eqiad.wmnet on all recursors
[15:58:16] !log cwhite@cumin2002 START - Cookbook sre.dns.wipe-cache logstash2002.codfw.wmnet on all recursors
[15:58:20] !log cwhite@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) logstash2002.codfw.wmnet on all recursors
[15:58:39] !log cwhite@cumin2002 START - Cookbook sre.dns.wipe-cache logstash2024.codfw.wmnet on all recursors
[15:58:42] !log cwhite@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) logstash2024.codfw.wmnet on all recursors
[15:58:43] (Merged) jenkins-bot: Fix get_osd_tree to
handle empty children list [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/831833 (https://phabricator.wikimedia.org/T317219) (owner: FNegri)
[15:58:45] (CR) BCornwall: "Rebased on production rather than the other CR. I'll get that other CR based upon this one. In the meantime, running `./docker_run.sh cp60" [puppet] - https://gerrit.wikimedia.org/r/826367 (owner: BCornwall)
[15:58:46] SRE, Infrastructure-Foundations: Add surveys@wikimedia.org as an additional SMTP login - https://phabricator.wikimedia.org/T317783 (jhathaway)
[16:01:02] (PS1) Volans: Release v0.6.1 [software/homer/deploy] - https://gerrit.wikimedia.org/r/832287
[16:01:31] (CR) Volans: [V: +2 C: +2] Release v0.6.1 [software/homer/deploy] - https://gerrit.wikimedia.org/r/832287 (owner: Volans)
[16:04:04] !log volans@cumin1001 START - Cookbook sre.deploy.python-code homer to cumin2002.codfw.wmnet,cumin1001.eqiad.wmnet with reason: Release v0.6.1 - volans@cumin1001
[16:05:42] !log volans@cumin1001 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) homer to cumin2002.codfw.wmnet,cumin1001.eqiad.wmnet with reason: Release v0.6.1 - volans@cumin1001
[16:06:31] (PS1) Vgutierrez: trafficserver: Enforce origin server TLS cert validation for ATS 8.x [puppet] - https://gerrit.wikimedia.org/r/832288 (https://phabricator.wikimedia.org/T317660)
[16:08:07] !log cwhite@cumin2002 START - Cookbook sre.dns.netbox
[16:08:41] SRE, Infrastructure-Foundations: Add surveys@wikimedia.org as an additional SMTP login - https://phabricator.wikimedia.org/T317783 (jhathaway) Open→Resolved This has been added to our secret hiera data and tested successfully by @TAndic
[16:08:44] SRE, Infrastructure-Foundations: Update DNS record to allow us to send emails from @wikimedia.org on Qualtrics - https://phabricator.wikimedia.org/T314815 (jhathaway)
[16:10:10] !log cwhite@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[16:12:14] SRE, ops-eqiad,
DC-Ops, cloud-services-team (Hardware): hw troubleshooting: one disk not working properly in cloudcephosd1034.eqiad.wmnet - https://phabricator.wikimedia.org/T316673 (Cmjohnson) I am submitting a ticket with Dell for a new disk today. The h/w logs do not show an error but we will...
[16:12:20] (PS4) Andrew Bogott: toolviews.py: Record unique IP page views along with total pageviews [puppet] - https://gerrit.wikimedia.org/r/832001 (https://phabricator.wikimedia.org/T317714)
[16:14:18] (CR) Andrew Bogott: "pcc output: https://puppet-compiler.wmflabs.org/pcc-worker1001/37258/tools-proxy-06.tools.eqiad.wmflabs/index.html" [puppet] - https://gerrit.wikimedia.org/r/832001 (https://phabricator.wikimedia.org/T317714) (owner: Andrew Bogott)
[16:17:26] !log jclark@cumin1001 START - Cookbook sre.dns.netbox
[16:17:38] ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T316121 (Cmjohnson) Open→Resolved a:Jclark-ctr→Cmjohnson The PDU has been balanced
[16:17:59] (PS1) Hnowlan: changeprop: add num_workers support for jobqueue [deployment-charts] - https://gerrit.wikimedia.org/r/832289 (https://phabricator.wikimedia.org/T233196)
[16:18:41] !log jclark@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[16:20:20] (CR) C. Scott Ananian: [C: +1] "C+1 (but also noting the double-negative in the commit subject, since what this is doing is *turning on* new-style Media output on enwikiv" [mediawiki-config] - https://gerrit.wikimedia.org/r/830707 (https://phabricator.wikimedia.org/T314318) (owner: Arlolra)
[16:20:55] (CR) Ssingh: [C: +1] "Matches what we are doing for ATS9 too."
[puppet] - https://gerrit.wikimedia.org/r/832288 (https://phabricator.wikimedia.org/T317660) (owner: Vgutierrez)
[16:21:19] SRE, SRE-Access-Requests, Infrastructure-Foundations (FY2021/2022-Q4): Request sudo access for Jclark-ctr - https://phabricator.wikimedia.org/T306654 (Jclark-ctr) Thanks @Volans ` jclark@cumin1001:~$ sudo secure-cookbook sre.dns.netbox "noop" START - Cookbook sre.dns.netbox Generating the DNS record...
[16:21:45] SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): hw troubleshooting: one disk not working properly in cloudcephosd1034.eqiad.wmnet - https://phabricator.wikimedia.org/T316673 (Cmjohnson) Requested a disk You have successfully submitted request SR151635668.
[16:24:09] (CR) David Caro: toolviews.py: Record unique IP page views along with total pageviews (1 comment) [puppet] - https://gerrit.wikimedia.org/r/832001 (https://phabricator.wikimedia.org/T317714) (owner: Andrew Bogott)
[16:24:14] (CR) David Caro: "LGTM" [puppet] - https://gerrit.wikimedia.org/r/832001 (https://phabricator.wikimedia.org/T317714) (owner: Andrew Bogott)
[16:24:20] (CR) David Caro: [C: +1] toolviews.py: Record unique IP page views along with total pageviews [puppet] - https://gerrit.wikimedia.org/r/832001 (https://phabricator.wikimedia.org/T317714) (owner: Andrew Bogott)
[16:25:04] SRE, ops-eqiad, DBA: db1189 broken memory - https://phabricator.wikimedia.org/T317662 (Jclark-ctr) a:Cmjohnson→Jclark-ctr
[16:27:16] (PS2) Vgutierrez: trafficserver: Enforce origin server TLS cert validation for ATS 8.x [puppet] - https://gerrit.wikimedia.org/r/832288 (https://phabricator.wikimedia.org/T317660)
[16:27:48] (CR) Ssingh: "LGTM" [puppet] - https://gerrit.wikimedia.org/r/832288 (https://phabricator.wikimedia.org/T317660) (owner: Vgutierrez)
[16:31:02] SRE, ops-eqiad, DBA: db1189 broken memory - https://phabricator.wikimedia.org/T317662 (Jclark-ctr) Submitted ticket with Dell Confirmed:
Service Request 151636326 was successfully submitted.
[16:31:08] (CR) Majavah: [C: -1] "I'm a bit concerned about storing the IP hashes since those don't seem to change - meaning that you could query the tools viewed by an ind" [puppet] - https://gerrit.wikimedia.org/r/832001 (https://phabricator.wikimedia.org/T317714) (owner: Andrew Bogott)
[16:32:44] (CR) Ssingh: [C: +1] "https://puppet-compiler.wmflabs.org/pcc-worker1001/37259/cp1085.eqiad.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/832288 (https://phabricator.wikimedia.org/T317660) (owner: Vgutierrez)
[16:35:20] (PS3) Hnowlan: helmfile.d: add thumbor configuration [deployment-charts] - https://gerrit.wikimedia.org/r/824519 (https://phabricator.wikimedia.org/T233196)
[16:35:28] (PS1) Btullis: Create new airflow package for version 2.3.2 [debs/airflow] (debian) - https://gerrit.wikimedia.org/r/832292 (https://phabricator.wikimedia.org/T317210)
[16:35:42] (CR) Hnowlan: helmfile.d: add thumbor configuration (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/824519 (https://phabricator.wikimedia.org/T233196) (owner: Hnowlan)
[16:36:25] (CR) David Caro: [C: +1] toolviews.py: Record unique IP page views along with total pageviews (1 comment) [puppet] - https://gerrit.wikimedia.org/r/832001 (https://phabricator.wikimedia.org/T317714) (owner: Andrew Bogott)
[16:36:46] (CR) Vgutierrez: [C: +2] trafficserver: Enforce origin server TLS cert validation for ATS 8.x [puppet] - https://gerrit.wikimedia.org/r/832288 (https://phabricator.wikimedia.org/T317660) (owner: Vgutierrez)
[16:37:50] (PS5) David Caro: bootstrap_and_add: added preflight checks [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/831122 (https://phabricator.wikimedia.org/T316021)
[16:38:49] (CR) David Caro: bootstrap_and_add: added preflight checks (1 comment) [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/831122
(https://phabricator.wikimedia.org/T316021) (owner: David Caro)
[16:40:16] (PS1) Vgutierrez: trafficserver: Remove trailing space [puppet] - https://gerrit.wikimedia.org/r/832293
[16:41:15] SRE-tools, Infrastructure-Foundations, Release-Engineering-Team: Investigate sharing releng common python code to pywmflib - https://phabricator.wikimedia.org/T316757 (thcipriani) Open→Declined For the time being, I don't think we have anything appropriate to upstream.
[16:41:28] (CR) Ssingh: [C: +1] "Sorry for not noticing this in the earlier review!" [puppet] - https://gerrit.wikimedia.org/r/832293 (owner: Vgutierrez)
[16:41:35] (CR) Vgutierrez: [C: +2] trafficserver: Remove trailing space [puppet] - https://gerrit.wikimedia.org/r/832293 (owner: Vgutierrez)
[16:42:23] (PS1) Btullis: Failback hive to the primary server [dns] - https://gerrit.wikimedia.org/r/832294
[16:43:50] (CR) David Caro: [C: +1] "LGTM" [alerts] - https://gerrit.wikimedia.org/r/832265 (owner: Filippo Giunchedi)
[16:43:56] SRE-OnFire, Beta-Cluster-Infrastructure, Sustainability (Incident Followup): Add basic alerting to the Beta Cluster - https://phabricator.wikimedia.org/T315695 (TheresNoTime)
[16:44:34] (CR) Andrew Bogott: toolviews.py: Record unique IP page views along with total pageviews (1 comment) [puppet] - https://gerrit.wikimedia.org/r/832001 (https://phabricator.wikimedia.org/T317714) (owner: Andrew Bogott)
[16:44:58] (PS5) Andrew Bogott: toolviews.py: Record unique IP page views along with total pageviews [puppet] - https://gerrit.wikimedia.org/r/832001 (https://phabricator.wikimedia.org/T317714)
[16:45:57] (CR) Majavah: toolviews.py: Record unique IP page views along with total pageviews (2 comments) [puppet] - https://gerrit.wikimedia.org/r/832001 (https://phabricator.wikimedia.org/T317714) (owner: Andrew Bogott)
[16:46:10] SRE, ops-eqiad, decommission-hardware: decommission
frauth1001.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T314517 (Cmjohnson) This still shows active in netbox. Please update to decommission when it's ready @Jgreen
[16:46:45] SRE, Data-Engineering, Traffic, Patch-For-Review: intake-analytics is responsible for up to a 85% of varnish backend fetch errors - https://phabricator.wikimedia.org/T306181 (BTullis) >>! In T306181#8235549, @Vgutierrez wrote: > This is highly related to T317051 and I think we can close this one...
[16:47:00] !log aikochou@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' .
[16:47:13] SRE, ops-eqiad, decommission-hardware: decommission frlog1001.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T315924 (Cmjohnson) a:Jclark-ctr→Cmjohnson Thanks @jgreen this also shows active in netbox. I want to confirm it's okay to decom.
[16:47:52] SRE, Data-Engineering, Traffic, Patch-For-Review: intake-analytics is responsible for up to a 85% of varnish backend fetch errors - https://phabricator.wikimedia.org/T306181 (Vgutierrez) Open→Resolved Sorry about the fuss and thanks for your thorough investigation @BTullis
[16:48:07] ops-eqiad: Port with no description on access switch - https://phabricator.wikimedia.org/T315352 (Cmjohnson) Open→Resolved these are new kafka-logging servers, I will update enter these today.
[16:48:57] SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): hw troubleshooting: one disk not working properly in cloudcephosd1034.eqiad.wmnet - https://phabricator.wikimedia.org/T316673 (Jclark-ctr) Drive was already ordered just replaced right now @Cmjohnson
[16:52:01] (CR) David Caro: wmcs: changes to api service to manage toolforge replica.my.cnf (1 comment) [puppet] - https://gerrit.wikimedia.org/r/810965 (https://phabricator.wikimedia.org/T304040) (owner: Raymond Ndibe)
[16:52:30] (CR) Andrew Bogott: toolviews.py: Record unique IP page views along with total pageviews (2 comments) [puppet] - https://gerrit.wikimedia.org/r/832001 (https://phabricator.wikimedia.org/T317714) (owner: Andrew Bogott)
[16:52:38] SRE, SRE-Access-Requests, Data-Engineering: Data Engineering Access for Hannah - https://phabricator.wikimedia.org/T317545 (BTullis) I have created the principal and sent the mail to the user, as per the instructions here: https://wikitech.wikimedia.org/wiki/Analytics/Systems/Kerberos#Create_a_princi...
[16:54:08] SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): hw troubleshooting: one disk not working properly in cloudcephosd1034.eqiad.wmnet - https://phabricator.wikimedia.org/T316673 (Jclark-ctr) Open→Resolved
[16:56:37] (PS1) Btullis: Enable the krb flag for hokwelum [puppet] - https://gerrit.wikimedia.org/r/832295 (https://phabricator.wikimedia.org/T317545)
[16:58:59] (CR) BCornwall: [C: +1] "Thanks for doing this!"
[puppet] - https://gerrit.wikimedia.org/r/832295 (https://phabricator.wikimedia.org/T317545) (owner: Btullis)
[17:05:21] (Abandoned) Hnowlan: changeprop: add num_workers support for jobqueue [deployment-charts] - https://gerrit.wikimedia.org/r/832289 (https://phabricator.wikimedia.org/T233196) (owner: Hnowlan)
[17:08:02] (CR) FNegri: [C: +2] bootstrap_and_add: added preflight checks (1 comment) [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/831122 (https://phabricator.wikimedia.org/T316021) (owner: David Caro)
[17:08:10] (PS6) FNegri: bootstrap_and_add: added preflight checks [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/831122 (https://phabricator.wikimedia.org/T316021) (owner: David Caro)
[17:22:03] (ProbeDown) firing: (2) Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1001:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:23:51] SRE-tools, Infrastructure-Foundations, Release-Engineering-Team: Investigate sharing releng common python code to pywmflib - https://phabricator.wikimedia.org/T316757 (Volans) Ok, no problem. Feel free to re-open if that changes.
[17:27:03] (ProbeDown) resolved: (2) Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1001:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:33:13] (PS6) Andrew Bogott: toolviews.py: Record unique IP page views along with total pageviews [puppet] - https://gerrit.wikimedia.org/r/832001 (https://phabricator.wikimedia.org/T317714)
[17:36:19] (CR) FNegri: bootstrap_and_add: added preflight checks [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/831122 (https://phabricator.wikimedia.org/T316021) (owner: David Caro)
[17:41:16] (Merged) jenkins-bot: bootstrap_and_add: added preflight checks [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/831122 (https://phabricator.wikimedia.org/T316021) (owner: David Caro)
[17:48:59] SRE, LDAP-Access-Requests: Grant Access to ldap/wmde for Hasan Akgün (WMDE) - https://phabricator.wikimedia.org/T317637 (BCornwall)
[17:49:30] (PS7) Andrew Bogott: toolviews.py: Record unique IP page views along with total pageviews [puppet] - https://gerrit.wikimedia.org/r/832001 (https://phabricator.wikimedia.org/T317714)
[17:50:06] SRE, LDAP-Access-Requests: Grant Access to ldap/wmde for Hasan Akgün (WMDE) - https://phabricator.wikimedia.org/T317637 (BCornwall) I switched "has shell access" to "No" because the user haas is not in data.yaml (they are in LDAP though).
[17:59:20] (CR) Andrew Bogott: toolviews.py: Record unique IP page views along with total pageviews (1 comment) [puppet] - https://gerrit.wikimedia.org/r/832001 (https://phabricator.wikimedia.org/T317714) (owner: Andrew Bogott)
[18:00:05] dancy and jeena: May I have your attention please! Train log triage with CPT.
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220914T1800)
[18:00:05] dancy and jeena: #bothumor My software never has bugs. It just develops random features. Rise for MediaWiki train - Utc-7 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220914T1800).
[18:00:43] Train is still blocked on https://phabricator.wikimedia.org/T317606
[18:00:43] SRE, conftool: requestctl can't act on cache hits - https://phabricator.wikimedia.org/T317794 (RLazarus)
[18:01:05] SRE, conftool: requestctl can't act on cache hits - https://phabricator.wikimedia.org/T317794 (RLazarus) p:Triage→Medium
[18:04:30] dancy, the patch for that has been merged, the backport just needs to be deployed
[18:04:56] Thanks zabe! I'll do that.
[18:05:26] https://gerrit.wikimedia.org/r/c/mediawiki/core/+/831985/ ?
[18:10:09] yes
[18:10:14] dancy, ^
[18:10:23] ack
[18:12:33] (CR) Ahmon Dancy: [C: +2] rdbms: Use plain array to store position data [core] (wmf/1.40.0-wmf.1) - https://gerrit.wikimedia.org/r/831985 (https://phabricator.wikimedia.org/T317606) (owner: Zabe)
[18:15:45] SRE, serviceops, Patch-For-Review: mediawiki::api: net.ipv4.local_port_range sysctl config does not exist - https://phabricator.wikimedia.org/T317454 (BCornwall) p:Triage→Medium
[18:27:47] SRE, conftool, Sustainability (Incident Followup): requestctl can't act on cache hits - https://phabricator.wikimedia.org/T317794 (RLazarus)
[18:28:41] (Merged) jenkins-bot: rdbms: Use plain array to store position data [core] (wmf/1.40.0-wmf.1) - https://gerrit.wikimedia.org/r/831985 (https://phabricator.wikimedia.org/T317606) (owner: Zabe)
[18:29:06] SRE, LDAP-Access-Requests: Grant Access to ldap/wmde for Hasan Akgün (WMDE) - https://phabricator.wikimedia.org/T317637 (BCornwall) Added @KFrancis as per https://wikitech.wikimedia.org/wiki/SRE/Clinic_Duty/Access_requests#wmde_access to ensure that the NDA has been signed.
[18:30:25] Let's do this thing
[18:30:26] SRE, SRE-Access-Requests, Infrastructure-Foundations (FY2021/2022-Q4): Request sudo access for Jclark-ctr - https://phabricator.wikimedia.org/T306654 (BCornwall) Open→Resolved Thanks for verifying! It looks like this ticket can be closed.
[18:30:36] SRE, Sustainability (Incident Followup): Rate limiting for hotlinked images - https://phabricator.wikimedia.org/T317799 (RLazarus) p:Triage→High
[18:30:45] (PS1) TrainBranchBot: testwikis wikis to 1.40.0-wmf.1 [mediawiki-config] - https://gerrit.wikimedia.org/r/832304 (https://phabricator.wikimedia.org/T314190)
[18:30:46] (CR) TrainBranchBot: [C: +2] testwikis wikis to 1.40.0-wmf.1 [mediawiki-config] - https://gerrit.wikimedia.org/r/832304 (https://phabricator.wikimedia.org/T314190) (owner: TrainBranchBot)
[18:31:36] (Merged) jenkins-bot: testwikis wikis to 1.40.0-wmf.1 [mediawiki-config] - https://gerrit.wikimedia.org/r/832304 (https://phabricator.wikimedia.org/T314190) (owner: TrainBranchBot)
[18:31:57] !log dancy@deploy1002 Started scap: testwikis wikis to 1.40.0-wmf.1 refs T314190
[18:32:01] T314190: 1.40.0-wmf.1 deployment blockers - https://phabricator.wikimedia.org/T314190
[18:33:18] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[18:33:56] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[18:33:57] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[18:34:00] SRE, LDAP-Access-Requests: Grant Access to ldap/wmde for Tanuja Doriya - https://phabricator.wikimedia.org/T317613 (BCornwall) Adding @KFrancis as per https://wikitech.wikimedia.org/wiki/SRE/Clinic_Duty/Access_requests#wmde_access to ensure that the NDA has been signed.
[18:34:48] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[18:36:38] !log dancy@deploy1002 Finished scap: testwikis wikis to 1.40.0-wmf.1 refs T314190 (duration: 04m 41s)
[18:36:41] SRE-OnFire, Beta-Cluster-Infrastructure, Sustainability (Incident Followup): Add basic alerting to the Beta Cluster - https://phabricator.wikimedia.org/T315695 (TheresNoTime) a:TheresNoTime
[18:36:54] I'm going to let it marinate for 15
[18:39:51] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[18:40:47] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[18:40:48] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[18:41:42] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[18:44:21] SRE, LDAP-Access-Requests: Grant Access to ldap/wmde for Hasan Akgün (WMDE) - https://phabricator.wikimedia.org/T317637 (KFrancis) @BCornwall @HasanAkgun_WMDE Please send me Hasan's WMDE email address. If you prefer not to post it here, please send it to kfrancis@wikimedia.org. Once I have that informa...
[18:47:21] (PS1) BCornwall: Prometheus: Remove ATS gauge periods [puppet] - https://gerrit.wikimedia.org/r/832327 (https://phabricator.wikimedia.org/T292815)
[18:48:01] !log ebernhardson@deploy1002 Started deploy [wikimedia/discovery/analytics@e358893]: drop-snapshots: tables are partitioned by wiki
[18:48:54] (CR) Hashar: "I have cherry picked it on puppetmaster-1001.devtools.eqiad1.wikimedia.cloud and ran puppet on gerrit-prod-1001.devtools.eqiad1.wikimedia." [puppet] - https://gerrit.wikimedia.org/r/831933 (owner: Hashar)
[18:50:04] SRE, LDAP-Access-Requests: Grant Access to ldap/wmde for Hasan Akgün (WMDE) - https://phabricator.wikimedia.org/T317637 (BCornwall) @KFrancis I sent you the email address. Thanks!
[18:50:06] !log ebernhardson@deploy1002 Finished deploy [wikimedia/discovery/analytics@e358893]: drop-snapshots: tables are partitioned by wiki (duration: 02m 05s)
[18:51:12] (CR) Hashar: "Cherry picked on puppetmaster-1001.devtools.eqiad1.wikimedia.cloud and ran puppet on gerrit-prod-1001.devtools.eqiad1.wikimedia.cloud. Th" [puppet] - https://gerrit.wikimedia.org/r/831963 (owner: Hashar)
[18:52:21] (PS4) Hashar: gerrit: move jetty class to init [puppet] - https://gerrit.wikimedia.org/r/832230
[18:52:48] (CR) Herron: "Interested in your thoughts about this. Eventually it should go a step further to update an article with the roster as well, but for now " [software/klaxon] - https://gerrit.wikimedia.org/r/827562 (https://phabricator.wikimedia.org/T309115) (owner: Herron)
[18:53:46] (PS1) TrainBranchBot: group0 wikis to 1.40.0-wmf.1 [mediawiki-config] - https://gerrit.wikimedia.org/r/832330 (https://phabricator.wikimedia.org/T314190)
[18:53:48] (CR) TrainBranchBot: [C: +2] group0 wikis to 1.40.0-wmf.1 [mediawiki-config] - https://gerrit.wikimedia.org/r/832330 (https://phabricator.wikimedia.org/T314190) (owner: TrainBranchBot)
[18:53:56] (CR) Hashar: "I have rebased the change to keep the series of patch up to date."
[puppet] - https://gerrit.wikimedia.org/r/832230 (owner: Hashar) [18:54:40] (CR) BCornwall: "This can be observed by manually changing the dots to underscores on a live instance:" [puppet] - https://gerrit.wikimedia.org/r/832327 (https://phabricator.wikimedia.org/T292815) (owner: BCornwall) [18:55:51] (Merged) jenkins-bot: group0 wikis to 1.40.0-wmf.1 [mediawiki-config] - https://gerrit.wikimedia.org/r/832330 (https://phabricator.wikimedia.org/T314190) (owner: TrainBranchBot) [18:58:37] (PS3) Hashar: gerrit: change its templates to regular files [puppet] - https://gerrit.wikimedia.org/r/831963 [18:59:19] (PS6) Herron: victorps.py: add print_weekly_schedule command [software/klaxon] - https://gerrit.wikimedia.org/r/827562 (https://phabricator.wikimedia.org/T309115) [18:59:40] (CR) Hashar: "I have further rebased it on top of the rest of the patches and cherry picked it again. It still doing what it is expected, the templates " [puppet] - https://gerrit.wikimedia.org/r/831963 (owner: Hashar) [18:59:53] !log dancy@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.40.0-wmf.1 refs T314190 [18:59:56] T314190: 1.40.0-wmf.1 deployment blockers - https://phabricator.wikimedia.org/T314190 [19:01:23] (PS2) Hashar: gerrit: modernize spec [puppet] - https://gerrit.wikimedia.org/r/832260 [19:01:37] (CR) Andrew Bogott: [C: +2] toolviews.py: Record unique IP page views along with total pageviews [puppet] - https://gerrit.wikimedia.org/r/832001 (https://phabricator.wikimedia.org/T317714) (owner: Andrew Bogott) [19:01:54] (CR) Hashar: "rebased on top of the rest of the chain."
[puppet] - https://gerrit.wikimedia.org/r/832260 (owner: Hashar) [19:01:58] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [19:02:57] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [19:02:58] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [19:02:59] SRE, Data-Persistence (Consultation), MediaWiki-extensions-Phonos, serviceops, Community-Tech (CommTech-Sprint-33): SRE/Data Persistence consultation — use of FSFileBackend for caching audio files - https://phabricator.wikimedia.org/T314789 (NRodriguez) Open→Resolved a:NRodriguez [19:03:53] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [19:08:36] SRE, LDAP-Access-Requests: Grant Access to ldap/wmde for Hasan Akgün (WMDE) - https://phabricator.wikimedia.org/T317637 (KFrancis) Hello all, the agreement is out for signatures. I'll update you when it's complete.
[19:08:56] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [19:15:15] !log ebernhardson@deploy1002 Started deploy [wikimedia/discovery/analytics@48e506e]: drop-snapshots: Remove directory handling [19:15:54] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [19:15:55] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [19:17:19] !log ebernhardson@deploy1002 Finished deploy [wikimedia/discovery/analytics@48e506e]: drop-snapshots: Remove directory handling (duration: 02m 03s) [19:22:14] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [19:22:47] (PS4) Ebernhardson: apifeatureusage: Drop mapping type from template [puppet] - https://gerrit.wikimedia.org/r/815784 (https://phabricator.wikimedia.org/T313434) [19:22:59] (CR) Ebernhardson: "this should be ready for merge/deploy now" [puppet] - https://gerrit.wikimedia.org/r/815784 (https://phabricator.wikimedia.org/T313434) (owner: Ebernhardson) [19:24:16] (CR) Raymond Ndibe: wmcs: changes to api service to manage toolforge replica.my.cnf (1 comment) [puppet] - https://gerrit.wikimedia.org/r/810965 (https://phabricator.wikimedia.org/T304040) (owner: Raymond Ndibe) [19:24:55] !log dancy@deploy1002 Started scap: testing [19:26:58] !log dancy@deploy1002 touch /var/lib/deploy-mwdebug/pause [19:27:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:35:19] !log dancy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [19:36:27] RECOVERY - SSH on mw1313.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:38:20] !log dancy@deploy1002 sync-world aborted: testing (duration: 13m 25s) [19:38:51] !log dancy@deploy1002 Started scap: testing [19:39:39] !log dancy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[19:39:39] !log dancy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [19:39:42] !log dancy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [19:46:06] !log dancy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [19:46:14] !log dancy@deploy1002 scap failed: CalledProcessError Command '['helmfile', '-e', 'eqiad', 'apply']' returned non-zero exit status 1. (duration: 07m 23s) [19:47:31] PROBLEM - Check systemd state on gitlab-runner1004 is CRITICAL: CRITICAL - degraded: The following units failed: buildkitd.service,docker-gc.service,docker-resource-monitor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:48:18] !log dancy@deploy1002 Started scap: testing [19:48:39] (HelmReleaseBadStatus) firing: Helm release mwdebug/pinkunicorn on k8s@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [19:49:05] !log dancy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [19:49:05] !log dancy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [19:49:08] !log dancy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [19:49:49] RECOVERY - Check systemd state on gitlab-runner1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:51:26] (PS1) TsepoThoabala: Enable action blocks on ptwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/832339 (https://phabricator.wikimedia.org/T317157) [19:51:41] !log dancy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [19:51:45] !log dancy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [19:55:21] !log dancy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [19:55:30] !log dancy@deploy1002 scap failed: CalledProcessError Command
'['helmfile', '-e', 'eqiad', 'apply']' returned non-zero exit status 1. (duration: 07m 12s) [19:57:38] !log dancy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: sync [19:57:39] !log dancy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync [20:00:05] RoanKattouw, Urbanecm, cjming, and TheresNoTime: How many deployers does it take to do UTC late backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220914T2000). [20:00:05] No Gerrit patches in the queue for this window AFAICS. [20:00:17] o7 [20:00:41] Indeed, nothing to deploy [20:00:46] Excellent. [20:00:59] I will continue testing on the deploy server.. I'll roll the train to group1 shortly [20:01:13] \o/ [20:01:27] !log Nothing to deploy in this UTC late backport window [20:01:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:02:03] !log dancy@deploy1002 Started scap: testing [20:02:51] !log dancy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:02:51] !log dancy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:03:39] (HelmReleaseBadStatus) resolved: Helm release mwdebug/pinkunicorn on k8s@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [20:06:30] !log dancy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:09:22] !log dancy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:09:30] !log dancy@deploy1002 dancy: testing synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet [20:09:43] !log dancy@deploy1002 Sync cancelled. 
[20:10:43] (PS1) TrainBranchBot: group1 wikis to 1.40.0-wmf.1 [mediawiki-config] - https://gerrit.wikimedia.org/r/832342 (https://phabricator.wikimedia.org/T314190) [20:10:45] (CR) TrainBranchBot: [C: +2] group1 wikis to 1.40.0-wmf.1 [mediawiki-config] - https://gerrit.wikimedia.org/r/832342 (https://phabricator.wikimedia.org/T314190) (owner: TrainBranchBot) [20:11:28] (Merged) jenkins-bot: group1 wikis to 1.40.0-wmf.1 [mediawiki-config] - https://gerrit.wikimedia.org/r/832342 (https://phabricator.wikimedia.org/T314190) (owner: TrainBranchBot) [20:12:20] !log dancy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:12:29] PROBLEM - SSH on mw1327.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:12:55] ops-eqiad: Port with no description on access switch - https://phabricator.wikimedia.org/T317804 (phaultfinder) [20:13:03] !log dancy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:13:21] !log dancy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:14:05] !log dancy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:18:04] !log dancy@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.40.0-wmf.1 refs T314190 [20:18:10] T314190: 1.40.0-wmf.1 deployment blockers - https://phabricator.wikimedia.org/T314190 [20:18:47] !log dancy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:19:29] !log dancy@deploy1002 sync-file aborted: group1 wikis to 1.40.0-wmf.1 refs T314190 (duration: 01m 24s) [20:19:29] !log dancy@deploy1002 deploy-promote aborted: (duration: 08m 52s) [20:21:10] !log dancy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:24:43] !log dancy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:24:51] !log dancy@deploy1002 helmfile [codfw] START
helmfile.d/services/mwdebug: apply [20:27:04] (PS1) Hashar: gerrit: gerrit-theme.html is long gone [puppet] - https://gerrit.wikimedia.org/r/832343 (https://phabricator.wikimedia.org/T299877) [20:27:06] (PS1) Hashar: gerrit: remove unused mysql-connector-java lib [puppet] - https://gerrit.wikimedia.org/r/832344 [20:27:08] (PS1) Hashar: gerrit: decouple scap and daemon users [puppet] - https://gerrit.wikimedia.org/r/832345 [20:27:42] (CR) Hashar: "gerrit-theme.html has been removed from the server, it is no more supported by Gerrit 3.4." [puppet] - https://gerrit.wikimedia.org/r/832343 (https://phabricator.wikimedia.org/T299877) (owner: Hashar) [20:28:26] !log dancy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:29:21] (CR) Hashar: "I have noticed we still have it on gerrit1001:" [puppet] - https://gerrit.wikimedia.org/r/832344 (owner: Hashar) [20:32:03] SRE, LDAP-Access-Requests: Request for changing LDAP (Wikitech) username - https://phabricator.wikimedia.org/T317623 (Dzahn) @HasanAkgun_WMDE Hi, Do you mean the shell username (haak)? Can you just use a different name there as well? Are you expecting to use shell access anyways? Maybe it doesn't even ma...
[20:32:39] !log dancy@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.40.0-wmf.1 refs T314190 [20:32:42] T314190: 1.40.0-wmf.1 deployment blockers - https://phabricator.wikimedia.org/T314190 [20:33:22] !log dancy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:34:04] !log dancy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:34:11] !log dancy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:34:56] !log dancy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:38:29] !log dancy@deploy1002 Synchronized php: group1 wikis to 1.40.0-wmf.1 refs T314190 (duration: 05m 49s) [20:38:32] T314190: 1.40.0-wmf.1 deployment blockers - https://phabricator.wikimedia.org/T314190 [20:39:34] !log dancy@deploy1002 Started scap: testing [20:40:27] !log dancy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:40:27] !log dancy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:44:07] !log dancy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:44:24] !log dancy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:44:32] !log dancy@deploy1002 dancy: testing synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet [20:44:49] !log dancy@deploy1002 Sync cancelled. [20:45:18] Done testing for now. 
[20:47:48] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:47:50] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:47:51] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:47:54] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:52:49] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [20:56:09] SRE-tools, Infrastructure-Foundations, Release-Engineering-Team: Investigate sharing releng common python code to pywmflib - https://phabricator.wikimedia.org/T316757 (thcipriani) >>! In T316757#8236998, @Volans wrote: > Ok, no problem. Feel free to re-open if that changes. Thank you for answering a... [20:57:27] (CR) Cwhite: [C: +2] smart: restore get_fact and deprecate get_raid_drivers [puppet] - https://gerrit.wikimedia.org/r/831114 (https://phabricator.wikimedia.org/T251293) (owner: Cwhite) [21:06:46] (PS2) Cwhite: logstash: use rsyslog-namespaced fields [puppet] - https://gerrit.wikimedia.org/r/824316 (https://phabricator.wikimedia.org/T315500) [21:11:05] (CR) Cwhite: [C: +2] logstash: use rsyslog-namespaced fields [puppet] - https://gerrit.wikimedia.org/r/824316 (https://phabricator.wikimedia.org/T315500) (owner: Cwhite) [21:12:58] RECOVERY - SSH on mw1327.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:12:58] (CR) BCornwall: [C: +2] Enable the krb flag for hokwelum [puppet] - https://gerrit.wikimedia.org/r/832295 (https://phabricator.wikimedia.org/T317545) (owner: Btullis) [21:15:17] SRE, SRE-Access-Requests, Data-Engineering, Patch-For-Review: Data Engineering Access for Hannah - https://phabricator.wikimedia.org/T317545 (BCornwall) [21:20:05]
SRE, SRE-Access-Requests, Data-Engineering, Patch-For-Review: Data Engineering Access for Hannah - https://phabricator.wikimedia.org/T317545 (BCornwall) Hi, @Hokwelum ! You should now have all the access you require. Could you test out your new superpowers to confirm? Perhaps @Milimetric could pr... [21:22:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149 (T314041)', diff saved to https://phabricator.wikimedia.org/P34735 and previous config saved to /var/cache/conftool/dbconfig/20220914-212225-ladsgroup.json [21:22:30] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [21:32:26] dancy: how is mw looking? would it be ok to do the scap deployment now? [21:33:01] Train seems settled for the day. It's all yours [21:33:06] ty! [21:33:08] jouncebot: now [21:33:08] No deployments scheduled for the next 8 hour(s) and 26 minute(s) [21:33:36] (PS1) Andrew Bogott: toolviews.py: add hourly and daily prometheus stats [puppet] - https://gerrit.wikimedia.org/r/832355 [21:34:34] Puppet, SRE, Infrastructure-Foundations: Facter is slow on a few hosts - https://phabricator.wikimedia.org/T251293 (colewhite) raid_mgmt_tools does not detect raid on `clouddb1021` ` cwhite@clouddb1021:~$ sudo /usr/bin/ruby /var/lib/puppet/lib/facter/raid.rb | jq . { "raid": [ "megaraid" ] }... [21:34:53] !log Deploying scap 4.19.1 (https://gerrit.wikimedia.org/r/c/mediawiki/tools/scap/+/832297/1/changelog) [21:34:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:35:19] !log dduvall@deploy1002 Installing scap version "4.19.1" for 561 hosts [21:35:37] !log dduvall@deploy1002 Installation of scap version "4.19.1" completed for 561 hosts [21:36:03] all done [21:36:42] !log testing phabricator deployment to phab2002.
should have no production impact (not serving traffic, no access to r/w db) [21:36:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:36:50] (PS3) Cwhite: logstash: add tcpircbot logging tests [puppet] - https://gerrit.wikimedia.org/r/824317 (https://phabricator.wikimedia.org/T257861) [21:37:03] !log dduvall@deploy1002 Started deploy [phabricator/deployment@3137c92]: testing phabricator deployment to phab2002 [21:37:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149', diff saved to https://phabricator.wikimedia.org/P34736 and previous config saved to /var/cache/conftool/dbconfig/20220914-213732-ladsgroup.json [21:37:52] SRE, Observability-Alerting, observability: en.wikibooks.org has changed legal footer - https://phabricator.wikimedia.org/T317169 (Dzahn) I would have fixed it if there were separate definitions for the valid text for each of the projects. But there isn't. The check_legal_html.py expects all project... [21:38:51] !log dduvall@deploy1002 Finished deploy [phabricator/deployment@3137c92]: testing phabricator deployment to phab2002 (duration: 01m 48s) [21:40:43] (CR) CI reject: [V: -1] logstash: add tcpircbot logging tests [puppet] - https://gerrit.wikimedia.org/r/824317 (https://phabricator.wikimedia.org/T257861) (owner: Cwhite) [21:42:53] SRE, Observability-Alerting, observability: en.wikibooks.org has changed legal footer - https://phabricator.wikimedia.org/T317169 (Dzahn) @Vahurzpu cc: to the above. I don't know what to do about it, I am just reporting it and would have made an easy fix but it's not an easy fix. We would have to imp...
[21:46:08] (PS5) BCornwall: varnish/tests: improve UX, refactor run.py [puppet] - https://gerrit.wikimedia.org/r/771863 (owner: Giuseppe Lavagetto) [21:47:32] (CR) BCornwall: "I rebased onto https://gerrit.wikimedia.org/r/c/operations/puppet/+/826367 but the questions I have still apply, particularly the one rega" [puppet] - https://gerrit.wikimedia.org/r/771863 (owner: Giuseppe Lavagetto) [21:50:53] (PS4) Cwhite: logstash: add tcpircbot logging tests [puppet] - https://gerrit.wikimedia.org/r/824317 (https://phabricator.wikimedia.org/T257861) [21:52:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149', diff saved to https://phabricator.wikimedia.org/P34737 and previous config saved to /var/cache/conftool/dbconfig/20220914-215238-ladsgroup.json [21:55:55] (CR) Cwhite: [C: +2] logstash: add tcpircbot logging tests [puppet] - https://gerrit.wikimedia.org/r/824317 (https://phabricator.wikimedia.org/T257861) (owner: Cwhite) [21:57:26] brett: I'll merge your changes [21:57:52] cwhite: D'oh, forgot to do that. Thank you very much [22:06:00] SRE, ops-eqiad, DBA: db1189 broken memory - https://phabricator.wikimedia.org/T317662 (Ladsgroup) >>! In T317662#8233007, @Marostegui wrote: > Started mysql for now. Will do a data check but will leave the host depooled. I think mysql went down again.
[22:07:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149 (T314041)', diff saved to https://phabricator.wikimedia.org/P34738 and previous config saved to /var/cache/conftool/dbconfig/20220914-220744-ladsgroup.json [22:07:46] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1150.eqiad.wmnet with reason: Maintenance [22:07:49] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [22:08:00] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1150.eqiad.wmnet with reason: Maintenance [22:08:03] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1190.eqiad.wmnet with reason: Maintenance [22:08:16] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1190.eqiad.wmnet with reason: Maintenance [22:08:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1190 (T314041)', diff saved to https://phabricator.wikimedia.org/P34739 and previous config saved to /var/cache/conftool/dbconfig/20220914-220822-ladsgroup.json [22:08:24] (PS1) Dzahn: phabricator: remove Icinga monitoring for phd supervising processes [puppet] - https://gerrit.wikimedia.org/r/832368 (https://phabricator.wikimedia.org/T315962) [22:08:48] PROBLEM - Check systemd state on mw1313 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:08:49] (PS2) Dzahn: phabricator: remove Icinga monitoring for phd supervising processes [puppet] - https://gerrit.wikimedia.org/r/832368 (https://phabricator.wikimedia.org/T315962) [22:08:53] (PS2) Andrew Bogott: toolviews.py: add hourly and daily prometheus stats [puppet] - https://gerrit.wikimedia.org/r/832355 [22:11:28] (PS3) Dzahn: phabricator: remove Icinga monitoring for phd supervising
processes [puppet] - https://gerrit.wikimedia.org/r/832368 (https://phabricator.wikimedia.org/T315962) [22:11:45] (PS4) Dzahn: phabricator: remove Icinga monitoring for phd supervising processes [puppet] - https://gerrit.wikimedia.org/r/832368 (https://phabricator.wikimedia.org/T315962) [22:11:53] (CR) Dzahn: "https://phabricator.wikimedia.org/T315962#8180510" [puppet] - https://gerrit.wikimedia.org/r/832368 (https://phabricator.wikimedia.org/T315962) (owner: Dzahn) [22:13:40] (CR) Andrew Bogott: "I've confirmed that this writes some things to a file. As to whether it's useful for prometheus... ???" [puppet] - https://gerrit.wikimedia.org/r/832355 (owner: Andrew Bogott) [22:18:34] (CR) Dzahn: [V: +1 C: +2] "https://puppet-compiler.wmflabs.org/pcc-worker1002/37261/phab1001.eqiad.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/832368 (https://phabricator.wikimedia.org/T315962) (owner: Dzahn) [22:24:04] (PS1) Dzahn: phabricator: remove absented code for icinga phd superiving procs [puppet] - https://gerrit.wikimedia.org/r/832371 (https://phabricator.wikimedia.org/T315962) [22:25:39] SRE, ops-eqiad, DBA: db1189 broken memory - https://phabricator.wikimedia.org/T317662 (Jclark-ctr) Parts should be in tomorrow or Friday [22:26:07] (CR) Dzahn: [C: +2] phabricator: remove absented code for icinga phd superiving procs [puppet] - https://gerrit.wikimedia.org/r/832371 (https://phabricator.wikimedia.org/T315962) (owner: Dzahn) [22:28:22] PROBLEM - SSH on mw1307.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:38:37] PROBLEM - SSH on mw1313.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:42:40] google calendar down (404) hmm [22:43:45] no, a deleted event [22:44:36] (CR) Dzahn: [C: +1] scap: Target only
phab2002.codfw.wmnet for now [phabricator/deployment] (wmf/stable) - https://gerrit.wikimedia.org/r/832282 (https://phabricator.wikimedia.org/T313954) (owner: Dduvall) [23:01:31] RECOVERY - Check systemd state on mw1313 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:39:53] RECOVERY - SSH on mw1313.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:55:09] (CR) Dzahn: [C: +2] gerrit: ignore lint error in role [puppet] - https://gerrit.wikimedia.org/r/831932 (owner: Hashar) [23:56:30] (CR) Dzahn: [C: +2] gerrit: ignore lint error in role (1 comment) [puppet] - https://gerrit.wikimedia.org/r/831932 (owner: Hashar)