[00:01:54] RECOVERY - Swift https frontend on ms-fe1012 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.008 second response time https://wikitech.wikimedia.org/wiki/Swift [00:04:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P32976 and previous config saved to /var/cache/conftool/dbconfig/20220825-000443-ladsgroup.json [00:05:12] PROBLEM - Check systemd state on netmon1002 is CRITICAL: CRITICAL - degraded: The following units failed: logrotate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:05:16] PROBLEM - Check systemd state on netmon2001 is CRITICAL: CRITICAL - degraded: The following units failed: logrotate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:06:06] PROBLEM - Check systemd state on netmon1003 is CRITICAL: CRITICAL - degraded: The following units failed: logrotate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:06:30] PROBLEM - Check systemd state on maps2009 is CRITICAL: CRITICAL - degraded: The following units failed: planet_sync_tile_generation-gis.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:19:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1172 (T314041)', diff saved to https://phabricator.wikimedia.org/P32977 and previous config saved to /var/cache/conftool/dbconfig/20220825-001949-ladsgroup.json [00:19:52] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2098.codfw.wmnet with reason: Maintenance [00:19:55] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [00:20:06] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2098.codfw.wmnet with reason: Maintenance [00:20:59] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on 
db2152.codfw.wmnet with reason: Maintenance [00:21:13] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2152.codfw.wmnet with reason: Maintenance [00:21:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2152 (T314041)', diff saved to https://phabricator.wikimedia.org/P32978 and previous config saved to /var/cache/conftool/dbconfig/20220825-002120-ladsgroup.json [00:23:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2152 (T314041)', diff saved to https://phabricator.wikimedia.org/P32979 and previous config saved to /var/cache/conftool/dbconfig/20220825-002306-ladsgroup.json [00:29:05] SRE, SRE-Access-Requests: Grant private data access to Amanda Bittaker - https://phabricator.wikimedia.org/T316140 (CDunn) Approved [00:32:38] PROBLEM - Swift https frontend on ms-fe1012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.265 second response time https://wikitech.wikimedia.org/wiki/Swift [00:34:52] RECOVERY - Swift https frontend on ms-fe1012 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.007 second response time https://wikitech.wikimedia.org/wiki/Swift [00:38:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P32980 and previous config saved to /var/cache/conftool/dbconfig/20220825-003812-ladsgroup.json [00:42:58] RECOVERY - Check systemd state on logstash1026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:43:08] RECOVERY - Check systemd state on logstash2026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:44:28] PROBLEM - Swift https frontend on ms-fe1012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.241 second response time https://wikitech.wikimedia.org/wiki/Swift [00:46:46]
RECOVERY - Swift https frontend on ms-fe1012 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.007 second response time https://wikitech.wikimedia.org/wiki/Swift [00:53:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P32981 and previous config saved to /var/cache/conftool/dbconfig/20220825-005318-ladsgroup.json [01:08:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2152 (T314041)', diff saved to https://phabricator.wikimedia.org/P32982 and previous config saved to /var/cache/conftool/dbconfig/20220825-010824-ladsgroup.json [01:08:26] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2181.codfw.wmnet with reason: Maintenance [01:08:30] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [01:08:40] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2181.codfw.wmnet with reason: Maintenance [01:08:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2181 (T314041)', diff saved to https://phabricator.wikimedia.org/P32983 and previous config saved to /var/cache/conftool/dbconfig/20220825-010845-ladsgroup.json [01:10:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2181 (T314041)', diff saved to https://phabricator.wikimedia.org/P32984 and previous config saved to /var/cache/conftool/dbconfig/20220825-011032-ladsgroup.json [01:25:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P32985 and previous config saved to /var/cache/conftool/dbconfig/20220825-012538-ladsgroup.json [01:27:22] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state 
[01:36:45] (JobUnavailable) firing: (3) Reduced availability for job redis_gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:40:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P32986 and previous config saved to /var/cache/conftool/dbconfig/20220825-014044-ladsgroup.json [01:41:45] (JobUnavailable) firing: (8) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:46:45] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:51:45] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:55:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2181 (T314041)', diff saved to https://phabricator.wikimedia.org/P32987 and previous config saved to /var/cache/conftool/dbconfig/20220825-015550-ladsgroup.json [01:55:53] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2167.codfw.wmnet with reason: Maintenance [01:55:56] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [01:56:06] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 
day, 0:00:00 on db2167.codfw.wmnet with reason: Maintenance [01:56:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2167:3318 (T314041)', diff saved to https://phabricator.wikimedia.org/P32988 and previous config saved to /var/cache/conftool/dbconfig/20220825-015612-ladsgroup.json [01:58:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3318 (T314041)', diff saved to https://phabricator.wikimedia.org/P32989 and previous config saved to /var/cache/conftool/dbconfig/20220825-015800-ladsgroup.json [02:00:02] RECOVERY - Check systemd state on build2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:06:45] (JobUnavailable) firing: (9) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:07:12] PROBLEM - Check systemd state on build2001 is CRITICAL: CRITICAL - degraded: The following units failed: package_builder_Clean_up_build_directory.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:11:45] (JobUnavailable) resolved: (5) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:13:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3318', diff saved to https://phabricator.wikimedia.org/P32990 and previous config saved to /var/cache/conftool/dbconfig/20220825-021306-ladsgroup.json [02:21:12] PROBLEM - Swift https backend on ms-fe1012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.207 second response time https://wikitech.wikimedia.org/wiki/Swift [02:25:50] RECOVERY - 
Swift https backend on ms-fe1012 is OK: HTTP OK: HTTP/1.1 200 OK - 451 bytes in 0.024 second response time https://wikitech.wikimedia.org/wiki/Swift [02:28:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3318', diff saved to https://phabricator.wikimedia.org/P32991 and previous config saved to /var/cache/conftool/dbconfig/20220825-022812-ladsgroup.json [02:43:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3318 (T314041)', diff saved to https://phabricator.wikimedia.org/P32992 and previous config saved to /var/cache/conftool/dbconfig/20220825-024318-ladsgroup.json [02:43:20] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2168.codfw.wmnet with reason: Maintenance [02:43:24] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [02:43:34] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2168.codfw.wmnet with reason: Maintenance [02:43:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2168:3318 (T314041)', diff saved to https://phabricator.wikimedia.org/P32993 and previous config saved to /var/cache/conftool/dbconfig/20220825-024339-ladsgroup.json [02:45:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3318 (T314041)', diff saved to https://phabricator.wikimedia.org/P32994 and previous config saved to /var/cache/conftool/dbconfig/20220825-024527-ladsgroup.json [02:56:54] PROBLEM - Swift https backend on ms-fe1012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.286 second response time https://wikitech.wikimedia.org/wiki/Swift [03:00:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3318', diff saved to https://phabricator.wikimedia.org/P32995 and previous config saved to /var/cache/conftool/dbconfig/20220825-030033-ladsgroup.json [03:01:20] 
RECOVERY - Swift https backend on ms-fe1012 is OK: HTTP OK: HTTP/1.1 200 OK - 451 bytes in 0.031 second response time https://wikitech.wikimedia.org/wiki/Swift [03:09:04] PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:15:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3318', diff saved to https://phabricator.wikimedia.org/P32996 and previous config saved to /var/cache/conftool/dbconfig/20220825-031539-ladsgroup.json [03:16:10] RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:23:10] PROBLEM - Check systemd state on ms-be2039 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:25:33] PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:27:24] PROBLEM - Check systemd state on webperf2004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:30:12] RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:30:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3318 (T314041)', diff saved to https://phabricator.wikimedia.org/P32997 and previous config saved to /var/cache/conftool/dbconfig/20220825-033045-ladsgroup.json [03:30:47] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2165.codfw.wmnet with 
reason: Maintenance [03:30:51] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [03:31:01] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2165.codfw.wmnet with reason: Maintenance [03:31:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2165 (T314041)', diff saved to https://phabricator.wikimedia.org/P32998 and previous config saved to /var/cache/conftool/dbconfig/20220825-033107-ladsgroup.json [03:32:10] RECOVERY - Check systemd state on webperf2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:32:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2165 (T314041)', diff saved to https://phabricator.wikimedia.org/P32999 and previous config saved to /var/cache/conftool/dbconfig/20220825-033253-ladsgroup.json [03:41:43] PROBLEM - Check systemd state on webperf2004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:42:10] PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:46:28] RECOVERY - Check systemd state on webperf2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:46:54] RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:48:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P33000 and previous config saved to /var/cache/conftool/dbconfig/20220825-034759-ladsgroup.json [04:03:06] !log 
ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P33001 and previous config saved to /var/cache/conftool/dbconfig/20220825-040306-ladsgroup.json [04:08:06] RECOVERY - Check systemd state on ms-be2039 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:15:16] PROBLEM - Check systemd state on ms-be2039 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:18:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2165 (T314041)', diff saved to https://phabricator.wikimedia.org/P33002 and previous config saved to /var/cache/conftool/dbconfig/20220825-041812-ladsgroup.json [04:18:14] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2166.codfw.wmnet with reason: Maintenance [04:18:17] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [04:18:28] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2166.codfw.wmnet with reason: Maintenance [04:18:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2166 (T314041)', diff saved to https://phabricator.wikimedia.org/P33003 and previous config saved to /var/cache/conftool/dbconfig/20220825-041833-ladsgroup.json [04:20:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2166 (T314041)', diff saved to https://phabricator.wikimedia.org/P33004 and previous config saved to /var/cache/conftool/dbconfig/20220825-042020-ladsgroup.json [04:23:55] SRE, ops-eqiad, DBA, DC-Ops: Q4:(Need By: TBD) rack/setup/install db1196.eqiad.wmnet - db1203.eqiad.wmnet - https://phabricator.wikimedia.org/T306848 (Marostegui) That's ok from my side [04:25:56] SRE, ops-eqiad,
DBA, DC-Ops: Q4:(Need By: TBD) rack/setup/install db1196.eqiad.wmnet - db1203.eqiad.wmnet - https://phabricator.wikimedia.org/T306848 (Marostegui) Please note that the last hostnames should be: db1201 db1202 db1203 [04:35:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P33005 and previous config saved to /var/cache/conftool/dbconfig/20220825-043527-ladsgroup.json [04:41:10] (CR) Andrea Denisse: [C: +1] "LGTM, thank you!" [puppet] - https://gerrit.wikimedia.org/r/826385 (https://phabricator.wikimedia.org/T304924) (owner: Cwhite) [04:50:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P33006 and previous config saved to /var/cache/conftool/dbconfig/20220825-045033-ladsgroup.json [05:05:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2166 (T314041)', diff saved to https://phabricator.wikimedia.org/P33007 and previous config saved to /var/cache/conftool/dbconfig/20220825-050539-ladsgroup.json [05:05:45] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [05:06:52] (PS1) Marostegui: db1186: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/826416 (https://phabricator.wikimedia.org/T313569) [05:07:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1130', diff saved to https://phabricator.wikimedia.org/P33008 and previous config saved to /var/cache/conftool/dbconfig/20220825-050713-root.json [05:08:18] (CR) Marostegui: [C: +2] db1186: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/826416 (https://phabricator.wikimedia.org/T313569) (owner: Marostegui) [05:09:50] (PS1) Marostegui: instances.yaml: Add db1186 to dbctl [puppet] - https://gerrit.wikimedia.org/r/826417 (https://phabricator.wikimedia.org/T313569) [05:10:30] (CR) Marostegui: [C:
+2] instances.yaml: Add db1186 to dbctl [puppet] - https://gerrit.wikimedia.org/r/826417 (https://phabricator.wikimedia.org/T313569) (owner: Marostegui) [05:11:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'Add db1186 to dbctl', diff saved to https://phabricator.wikimedia.org/P33010 and previous config saved to /var/cache/conftool/dbconfig/20220825-051130-marostegui.json [05:11:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'Pool db1186 with minimal weight in s1 T313569', diff saved to https://phabricator.wikimedia.org/P33011 and previous config saved to /var/cache/conftool/dbconfig/20220825-051155-root.json [05:12:00] T313569: Productionize db1185.eqiad.wmnet - db1195.eqiad.wmnet - https://phabricator.wikimedia.org/T313569 [05:13:52] (PS1) Marostegui: db1188: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/826418 (https://phabricator.wikimedia.org/T313569) [05:14:32] (CR) Marostegui: [C: +2] db1188: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/826418 (https://phabricator.wikimedia.org/T313569) (owner: Marostegui) [05:15:41] (PS1) Marostegui: instances.yaml: Add db1188 to dbctl [puppet] - https://gerrit.wikimedia.org/r/826419 (https://phabricator.wikimedia.org/T313569) [05:16:19] (CR) Marostegui: [C: +2] instances.yaml: Add db1188 to dbctl [puppet] - https://gerrit.wikimedia.org/r/826419 (https://phabricator.wikimedia.org/T313569) (owner: Marostegui) [05:17:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Add db1188 to dbctl T313569', diff saved to https://phabricator.wikimedia.org/P33012 and previous config saved to /var/cache/conftool/dbconfig/20220825-051737-marostegui.json [05:17:42] T313569: Productionize db1185.eqiad.wmnet - db1195.eqiad.wmnet - https://phabricator.wikimedia.org/T313569 [05:17:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Pool db1188 with minimal weight in s2 T313569', diff saved to https://phabricator.wikimedia.org/P33013 and previous config
saved to /var/cache/conftool/dbconfig/20220825-051754-root.json [05:18:43] (PS1) Marostegui: Revert "ProductionServices.php: Promote pc1014 to pc2 master" [mediawiki-config] - https://gerrit.wikimedia.org/r/826334 [05:18:52] (PS1) Marostegui: Revert "parsercache: Promote pc1014 to pc2 master" [puppet] - https://gerrit.wikimedia.org/r/826335 [05:19:08] (PS1) Marostegui: Revert "pc1014: Enable notifications" [puppet] - https://gerrit.wikimedia.org/r/826336 [05:19:17] (PS2) Marostegui: Revert "pc1014: Enable notifications" [puppet] - https://gerrit.wikimedia.org/r/826336 [05:22:32] (CR) Marostegui: [C: +2] Revert "pc1014: Enable notifications" [puppet] - https://gerrit.wikimedia.org/r/826336 (owner: Marostegui) [05:23:22] SRE, Thumbor, Wikimedia-SVG-rendering, Upstream: Update librsvg to > 2.44.10 - https://phabricator.wikimedia.org/T265549 (Aklapper) [05:23:43] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 33 hosts with reason: Primary switchover s4 T315419 [05:23:48] T315419: Switchover s4 master (db1138 -> db1160) - https://phabricator.wikimedia.org/T315419 [05:23:57] SRE, Thumbor, Wikimedia-SVG-rendering, Upstream: Update librsvg to > 2.44.10 - https://phabricator.wikimedia.org/T265549 (Aklapper) [05:23:59] SRE, Thumbor, Wikimedia-SVG-rendering, Upstream: Update librsvg to ≥2.42.3 (2.44.10) - https://phabricator.wikimedia.org/T193352 (Aklapper) [05:24:06] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 33 hosts with reason: Primary switchover s4 T315419 [05:24:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Set db1160 with weight 0 T315419', diff saved to https://phabricator.wikimedia.org/P33015 and previous config saved to /var/cache/conftool/dbconfig/20220825-052415-ladsgroup.json [05:25:26] (CR) Ladsgroup: [C: +2] Display page namespace with spaces instead of underscores when page doesn't exist [core]
(wmf/1.39.0-wmf.26) - https://gerrit.wikimedia.org/r/826332 (https://phabricator.wikimedia.org/T316092) (owner: Ladsgroup) [05:25:45] (CR) Ladsgroup: [C: +1] Revert "ProductionServices.php: Promote pc1014 to pc2 master" [mediawiki-config] - https://gerrit.wikimedia.org/r/826334 (owner: Marostegui) [05:26:38] (PS1) Marostegui: db1190: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/826420 [05:29:13] (CR) Andrea Denisse: [C: +1] "LGTM, thank you!" [puppet] - https://gerrit.wikimedia.org/r/824317 (https://phabricator.wikimedia.org/T257861) (owner: Cwhite) [05:29:19] (CR) Marostegui: [C: +2] db1190: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/826420 (owner: Marostegui) [05:30:45] (PS1) Marostegui: instances.yaml: Add db1190 to dbctl [puppet] - https://gerrit.wikimedia.org/r/826421 (https://phabricator.wikimedia.org/T313569) [05:32:05] (CR) Marostegui: [C: +2] instances.yaml: Add db1190 to dbctl [puppet] - https://gerrit.wikimedia.org/r/826421 (https://phabricator.wikimedia.org/T313569) (owner: Marostegui) [05:32:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'Add db1190 to dbctl T313569', diff saved to https://phabricator.wikimedia.org/P33016 and previous config saved to /var/cache/conftool/dbconfig/20220825-053253-marostegui.json [05:32:58] T313569: Productionize db1185.eqiad.wmnet - db1195.eqiad.wmnet - https://phabricator.wikimedia.org/T313569 [05:33:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Pool db1190 with minimal weight in s4 T313569', diff saved to https://phabricator.wikimedia.org/P33017 and previous config saved to /var/cache/conftool/dbconfig/20220825-053310-root.json [05:33:19] (CR) Andrea Denisse: "LGTM, thank you!"
[puppet] - https://gerrit.wikimedia.org/r/824316 (https://phabricator.wikimedia.org/T315500) (owner: Cwhite) [05:33:24] (CR) Andrea Denisse: [C: +1] logstash: use rsyslog-namespaced fields [puppet] - https://gerrit.wikimedia.org/r/824316 (https://phabricator.wikimedia.org/T315500) (owner: Cwhite) [05:34:03] (CR) Andrea Denisse: [C: +1] "LGTM, thank you!" [puppet] - https://gerrit.wikimedia.org/r/824315 (https://phabricator.wikimedia.org/T315500) (owner: Cwhite) [05:34:30] (CR) Andrea Denisse: [C: +1] "LGTM, thank you!" [puppet] - https://gerrit.wikimedia.org/r/826384 (https://phabricator.wikimedia.org/T305013) (owner: Cwhite) [05:35:59] (PS1) Marostegui: db1191: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/826422 (https://phabricator.wikimedia.org/T313569) [05:37:11] (CR) Marostegui: [C: +2] db1191: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/826422 (https://phabricator.wikimedia.org/T313569) (owner: Marostegui) [05:40:29] (PS1) Marostegui: mariadb: Productionize db1185 [puppet] - https://gerrit.wikimedia.org/r/826423 (https://phabricator.wikimedia.org/T313569) [05:41:18] (CR) Marostegui: [C: +2] mariadb: Productionize db1185 [puppet] - https://gerrit.wikimedia.org/r/826423 (https://phabricator.wikimedia.org/T313569) (owner: Marostegui) [05:41:36] (Merged) jenkins-bot: Display page namespace with spaces instead of underscores when page doesn't exist [core] (wmf/1.39.0-wmf.26) - https://gerrit.wikimedia.org/r/826332 (https://phabricator.wikimedia.org/T316092) (owner: Ladsgroup) [05:43:27] (PS1) Marostegui: site.pp: Remove insetup from db1185 [puppet] - https://gerrit.wikimedia.org/r/826424 (https://phabricator.wikimedia.org/T313569) [05:44:17] (CR) Marostegui: [C: +2] site.pp: Remove insetup from db1185 [puppet] - https://gerrit.wikimedia.org/r/826424 (https://phabricator.wikimedia.org/T313569) (owner: Marostegui) [05:45:29] !log
mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [05:46:08] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [05:46:08] !log ladsgroup@deploy1002 Synchronized php-1.39.0-wmf.26/includes/page/Article.php: Backport: [[gerrit:826332|Display page namespace with spaces instead of underscores when page doesn't exist (T316092)]] (duration: 03m 32s) [05:46:09] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [05:46:14] T316092: Underscore displayed in namespace prefix for non-existent pages (e.g. "User_talk") - https://phabricator.wikimedia.org/T316092 [05:46:49] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [05:48:10] (PS1) Marostegui: instances.yaml: Add db1191 to dbctl [puppet] - https://gerrit.wikimedia.org/r/826425 (https://phabricator.wikimedia.org/T313569) [05:49:46] (CR) Marostegui: [C: +2] instances.yaml: Add db1191 to dbctl [puppet] - https://gerrit.wikimedia.org/r/826425 (https://phabricator.wikimedia.org/T313569) (owner: Marostegui) [05:50:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Add db1191 to dbctl T313569', diff saved to https://phabricator.wikimedia.org/P33018 and previous config saved to /var/cache/conftool/dbconfig/20220825-055038-marostegui.json [05:50:43] T313569: Productionize db1185.eqiad.wmnet - db1195.eqiad.wmnet - https://phabricator.wikimedia.org/T313569 [05:50:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Pool db1191 with minimal weight in s7 T313569', diff saved to https://phabricator.wikimedia.org/P33019 and previous config saved to /var/cache/conftool/dbconfig/20220825-055057-root.json [05:58:43] sigh, I haven't moved anything and it's stuck on the 10.6 replica for a full half an hour now [05:59:15] what was the timeout?
[05:59:20] 25 [05:59:32] I'm fairly certain it passed 25 minutes [05:59:48] yeah, the problem is that that host is so stuck that even the kills aren't working [06:00:05] kormat, marostegui, and Amir1: My dear minions, it's time we take the moon! Just kidding. Time for Primary database switchover deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220825T0600). [06:00:08] ah it timed out now [06:00:08] so did it go through now? [06:00:10] yeah [06:00:11] I forced it [06:00:12] he [06:00:27] now we need to do the rest. Should I re-run it? [06:00:41] yes, but I wonder if it will attempt to go for db1143 again [06:00:43] (PS1) Andrea Denisse: librenms: Reserve UID/GID for the LibreNMS system user. [puppet] - https://gerrit.wikimedia.org/r/826427 (https://phabricator.wikimedia.org/T315388) [06:00:51] :( [06:01:01] let me try one thing [06:01:10] what's the new master, db1160? [06:01:13] yup [06:01:16] (PS2) Andrea Denisse: librenms: Reserve UID/GID for the LibreNMS system user. [puppet] - https://gerrit.wikimedia.org/r/826427 (https://phabricator.wikimedia.org/T315388) [06:02:34] Amir1: ok, so use db-move-replica with each instance, so you can leave db1143 aside. I just ran this: db-move-replica --timeout 25 db1141 db1160 and it worked [06:02:41] you can continue with all the other hosts [06:02:52] I see [06:02:53] ok [06:03:00] (PS3) Andrea Denisse: librenms: Reserve UID/GID for the LibreNMS system user.
[puppet] - https://gerrit.wikimedia.org/r/826427 (https://phabricator.wikimedia.org/T315388) [06:03:49] marostegui: otoh, db1141 is lagging behind (like db1147) [06:03:57] semi sync again? [06:04:26] yeah [06:04:28] just fixed it [06:04:34] thanks [06:04:39] I think db-switchover does disable it before every move [06:06:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1114', diff saved to https://phabricator.wikimedia.org/P33020 and previous config saved to /var/cache/conftool/dbconfig/20220825-060601-root.json [06:08:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1130 (re)pooling @ 5%: Repooling after cloning db1185', diff saved to https://phabricator.wikimedia.org/P33022 and previous config saved to /var/cache/conftool/dbconfig/20220825-060816-root.json [06:12:16] (PS1) Andrea Denisse: librenms: Reserve id for the LibreNMS user; Use systemd::sysuser instead of user. [puppet] - https://gerrit.wikimedia.org/r/826429 (https://phabricator.wikimedia.org/T315388) [06:13:12] (CR) CI reject: [V: -1] librenms: Reserve id for the LibreNMS user; Use systemd::sysuser instead of user. [puppet] - https://gerrit.wikimedia.org/r/826429 (https://phabricator.wikimedia.org/T315388) (owner: Andrea Denisse) [06:14:49] (PS4) Andrea Denisse: netmon: Reserve UID/GID for the LibreNMS system user. [puppet] - https://gerrit.wikimedia.org/r/826427 (https://phabricator.wikimedia.org/T315388) [06:16:21] (PS2) Andrea Denisse: netmon: Use systemd::sysuser and reserve id for the LibreNMS user.
[puppet] - 10https://gerrit.wikimedia.org/r/826429 (https://phabricator.wikimedia.org/T315388) [06:21:46] (03PS2) 10Ladsgroup: mariadb: Promote db1160 to s4 master [puppet] - 10https://gerrit.wikimedia.org/r/824147 (https://phabricator.wikimedia.org/T315419) (owner: 10Gerrit maintenance bot) [06:21:51] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] mariadb: Promote db1160 to s4 master [puppet] - 10https://gerrit.wikimedia.org/r/824147 (https://phabricator.wikimedia.org/T315419) (owner: 10Gerrit maintenance bot) [06:22:22] (03PS1) 10Andrea Denisse: netmon: Use systemd::sysuser and reserve id for the LibreNMS user. [puppet] - 10https://gerrit.wikimedia.org/r/826431 (https://phabricator.wikimedia.org/T315388) [06:22:37] (03Abandoned) 10Andrea Denisse: netmon: Use systemd::sysuser and reserve id for the LibreNMS user. [puppet] - 10https://gerrit.wikimedia.org/r/826429 (https://phabricator.wikimedia.org/T315388) (owner: 10Andrea Denisse) [06:22:39] !log Starting s4 eqiad failover from db1138 to db1160 - T315419 [06:22:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:22:44] T315419: Switchover s4 master (db1138 -> db1160) - https://phabricator.wikimedia.org/T315419 [06:23:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1130 (re)pooling @ 10%: Repooling after cloning db1185', diff saved to https://phabricator.wikimedia.org/P33023 and previous config saved to /var/cache/conftool/dbconfig/20220825-062321-root.json [06:23:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Set s4 eqiad as read-only for maintenance - T315419', diff saved to https://phabricator.wikimedia.org/P33024 and previous config saved to /var/cache/conftool/dbconfig/20220825-062353-ladsgroup.json [06:24:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Promote db1160 to s4 primary and set section read-write T315419', diff saved to https://phabricator.wikimedia.org/P33025 and previous config saved to /var/cache/conftool/dbconfig/20220825-062425-ladsgroup.json 
[06:26:00] (03CR) 10Muehlenhoff: "Did you capture the error, what was failing specifically?" [puppet] - 10https://gerrit.wikimedia.org/r/826331 (owner: 10Dzahn) [06:26:45] (03PS2) 10Ladsgroup: wmnet: Update s4-master alias [dns] - 10https://gerrit.wikimedia.org/r/824148 (https://phabricator.wikimedia.org/T315419) (owner: 10Gerrit maintenance bot) [06:26:50] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] wmnet: Update s4-master alias [dns] - 10https://gerrit.wikimedia.org/r/824148 (https://phabricator.wikimedia.org/T315419) (owner: 10Gerrit maintenance bot) [06:26:55] (03PS2) 10Andrea Denisse: netmon: Use systemd::sysuser and reserve id for the LibreNMS user. [puppet] - 10https://gerrit.wikimedia.org/r/826431 (https://phabricator.wikimedia.org/T315388) [06:28:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depool db1138 T315419', diff saved to https://phabricator.wikimedia.org/P33026 and previous config saved to /var/cache/conftool/dbconfig/20220825-062852-ladsgroup.json [06:28:57] T315419: Switchover s4 master (db1138 -> db1160) - https://phabricator.wikimedia.org/T315419 [06:29:16] (03CR) 10Andrea Denisse: "PCC results: https://puppet-compiler.wmflabs.org/pcc-worker1001/36966/" [puppet] - 10https://gerrit.wikimedia.org/r/826431 (https://phabricator.wikimedia.org/T315388) (owner: 10Andrea Denisse) [06:30:22] RECOVERY - Check systemd state on ms-be2039 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:32:14] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1138.eqiad.wmnet with reason: Maint on s4 old master [06:32:28] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1138.eqiad.wmnet with reason: Maint on s4 old master [06:34:10] (03PS1) 10Andrea Denisse: doc: Fix smalll typos in the systemd::sysuser documentation. 
[puppet] - 10https://gerrit.wikimedia.org/r/826490 [06:34:34] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1138.eqiad.wmnet with reason: Maintenance [06:34:36] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1138.eqiad.wmnet with reason: Maintenance [06:34:38] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db1143.eqiad.wmnet with reason: Maintenance [06:35:00] (03CR) 10Muehlenhoff: "Isn't profile::mediawiki::common a more logical choice? I think we also want this on the snapshot* hosts as well, having dumps complete fa" [puppet] - 10https://gerrit.wikimedia.org/r/826405 (https://phabricator.wikimedia.org/T315398) (owner: 10Tim Starling) [06:35:02] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db1143.eqiad.wmnet with reason: Maintenance [06:37:26] PROBLEM - Swift https frontend on ms-fe1012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.138 second response time https://wikitech.wikimedia.org/wiki/Swift [06:38:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1130 (re)pooling @ 25%: Repooling after cloning db1185', diff saved to https://phabricator.wikimedia.org/P33027 and previous config saved to /var/cache/conftool/dbconfig/20220825-063826-root.json [06:38:38] RECOVERY - Swift https frontend on ms-fe1012 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.008 second response time https://wikitech.wikimedia.org/wiki/Swift [06:39:50] PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:42:51] 10SRE, 10Infrastructure-Foundations, 10Observability-Metrics, 10netops, and 2 others: LibreNMS seemingly not collecting data for many ports after migration to netmon1003 - https://phabricator.wikimedia.org/T314972 
(10andrea.denisse) [06:43:35] 10SRE, 10Infrastructure-Foundations, 10Observability-Metrics, 10Patch-For-Review: Define a fleetwide uid and gid mappings for the Netmon instances containing LibreNMS and Rancid. - https://phabricator.wikimedia.org/T315388 (10andrea.denisse) 05Open→03In progress [06:46:10] RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:46:31] 10SRE, 10Infrastructure-Foundations, 10Observability-Metrics, 10Patch-For-Review: Define a fleetwide uid and gid mappings for the Netmon instances containing LibreNMS and Rancid. - https://phabricator.wikimedia.org/T315388 (10andrea.denisse) [06:48:52] (03PS2) 10Slyngshede: c:dynamicproxy clean up after cronjob removal. [puppet] - 10https://gerrit.wikimedia.org/r/826287 (https://phabricator.wikimedia.org/T273673) [06:49:14] (03CR) 10CI reject: [V: 04-1] c:dynamicproxy clean up after cronjob removal. [puppet] - 10https://gerrit.wikimedia.org/r/826287 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) [06:50:43] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2164.codfw.wmnet with reason: Maintenance [06:50:57] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2164.codfw.wmnet with reason: Maintenance [06:50:58] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2094.codfw.wmnet with reason: Maintenance [06:51:23] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2094.codfw.wmnet with reason: Maintenance [06:51:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2164 (T314041)', diff saved to https://phabricator.wikimedia.org/P33028 and previous config saved to /var/cache/conftool/dbconfig/20220825-065128-ladsgroup.json [06:51:33] T314041: Drop old templatelinks columns and indexes - 
https://phabricator.wikimedia.org/T314041 [06:51:42] PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:52:57] (03PS3) 10Slyngshede: c:dynamicproxy clean up after cronjob removal. [puppet] - 10https://gerrit.wikimedia.org/r/826287 (https://phabricator.wikimedia.org/T273673) [06:53:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2164 (T314041)', diff saved to https://phabricator.wikimedia.org/P33029 and previous config saved to /var/cache/conftool/dbconfig/20220825-065315-ladsgroup.json [06:53:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1130 (re)pooling @ 50%: Repooling after cloning db1185', diff saved to https://phabricator.wikimedia.org/P33030 and previous config saved to /var/cache/conftool/dbconfig/20220825-065331-root.json [06:56:20] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/826287 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) [06:56:47] (03CR) 10Slyngshede: c:dynamicproxy clean up after cronjob removal. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/826287 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) [06:59:34] (03CR) 10Slyngshede: [C: 03+2] c:dynamicproxy clean up after cronjob removal. [puppet] - 10https://gerrit.wikimedia.org/r/826287 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) [07:00:04] Amir1, apergos, jnuche, and TheresNoTime: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC morning backport and config training deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220825T0700). [07:00:14] good morning! there are no trainees signed up today and no patches scheduled in the window. 
[07:00:16] RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:00:48] PROBLEM - Swift https frontend on ms-fe1012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.163 second response time https://wikitech.wikimedia.org/wiki/Swift [07:01:19] that looks not great [07:01:51] could be related to the problem Amir raised? [07:03:00] RECOVERY - Swift https frontend on ms-fe1012 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.009 second response time https://wikitech.wikimedia.org/wiki/Swift [07:03:42] jynus: just seen a report on irc of someone getting "04:00:48 On commons, "Error deleting file: An unknown error occurred in storage backend "local-swift-eqiad". "" [07:03:53] That's 4 hours ago [07:06:45] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, one nit inline." [puppet] - 10https://gerrit.wikimedia.org/r/826427 (https://phabricator.wikimedia.org/T315388) (owner: 10Andrea Denisse) [07:08:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P33031 and previous config saved to /var/cache/conftool/dbconfig/20220825-070821-ladsgroup.json [07:08:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1130 (re)pooling @ 75%: Repooling after cloning db1185', diff saved to https://phabricator.wikimedia.org/P33032 and previous config saved to /var/cache/conftool/dbconfig/20220825-070835-root.json [07:11:58] PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:12:32] (03CR) 10Muehlenhoff: netmon: Use systemd::sysuser and reserve id for the LibreNMS user.
(0315 comments) [puppet] - 10https://gerrit.wikimedia.org/r/826431 (https://phabricator.wikimedia.org/T315388) (owner: 10Andrea Denisse) [07:12:53] there seems to be at times spikes of 504 from eqiad [07:13:01] *from codfw, not eqiad [07:13:20] could be some higher network latency or a proxy overload [07:16:26] RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:17:22] (03CR) 10Jcrespo: [C: 03+1] "This is ready- x1 snapshots on codfw failed twice, but can be retried after maintenance." [puppet] - 10https://gerrit.wikimedia.org/r/826223 (https://phabricator.wikimedia.org/T315864) (owner: 10Marostegui) [07:18:01] 10SRE, 10SRE-Access-Requests: Grant private data access to Amanda Bittaker - https://phabricator.wikimedia.org/T316140 (10Ladsgroup) [07:23:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P33033 and previous config saved to /var/cache/conftool/dbconfig/20220825-072327-ladsgroup.json [07:23:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1130 (re)pooling @ 100%: Repooling after cloning db1185', diff saved to https://phabricator.wikimedia.org/P33034 and previous config saved to /var/cache/conftool/dbconfig/20220825-072340-root.json [07:29:18] (03CR) 10Marostegui: [C: 03+2] "This was meant to be: Enable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/826420 (owner: 10Marostegui) [07:30:10] (03CR) 10Marostegui: [C: 03+2] Revert "ProductionServices.php: Promote pc1014 to pc2 master" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/826334 (owner: 10Marostegui) [07:30:53] (03Merged) 10jenkins-bot: Revert "ProductionServices.php: Promote pc1014 to pc2 master" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/826334 (owner: 10Marostegui) [07:31:41] (03PS2) 10Marostegui: Revert "parsercache: Promote pc1014 to pc2 master" [puppet] - 
10https://gerrit.wikimedia.org/r/826335 [07:32:55] (03CR) 10Marostegui: [C: 03+2] Revert "parsercache: Promote pc1014 to pc2 master" [puppet] - 10https://gerrit.wikimedia.org/r/826335 (owner: 10Marostegui) [07:34:40] !log Promote pc1012 back as pc2 master T315526 [07:34:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:34:44] T315526: Promote pc1014 to pc2 master - https://phabricator.wikimedia.org/T315526 [07:36:09] !log marostegui@deploy1002 Synchronized wmf-config/ProductionServices.php: Promote pc1012 to pc2 master T315526 (duration: 03m 39s) [07:38:17] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [07:38:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2164 (T314041)', diff saved to https://phabricator.wikimedia.org/P33035 and previous config saved to /var/cache/conftool/dbconfig/20220825-073834-ladsgroup.json [07:38:36] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2163.codfw.wmnet with reason: Maintenance [07:38:38] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [07:38:49] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2163.codfw.wmnet with reason: Maintenance [07:38:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2163 (T314041)', diff saved to https://phabricator.wikimedia.org/P33036 and previous config saved to /var/cache/conftool/dbconfig/20220825-073855-ladsgroup.json [07:39:18] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [07:39:19] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [07:40:20] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [07:40:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2163 (T314041)', diff saved to 
https://phabricator.wikimedia.org/P33037 and previous config saved to /var/cache/conftool/dbconfig/20220825-074041-ladsgroup.json [07:40:51] (03PS1) 10Marostegui: install_server: Do not reimage db1185, db1186 and db1188 [puppet] - 10https://gerrit.wikimedia.org/r/826494 [07:42:01] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1137.eqiad.wmnet with reason: Maintenance [07:42:07] (03CR) 10Marostegui: [C: 03+2] install_server: Do not reimage db1185, db1186 and db1188 [puppet] - 10https://gerrit.wikimedia.org/r/826494 (owner: 10Marostegui) [07:42:15] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1137.eqiad.wmnet with reason: Maintenance [07:42:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1137 (T312160)', diff saved to https://phabricator.wikimedia.org/P33038 and previous config saved to /var/cache/conftool/dbconfig/20220825-074220-ladsgroup.json [07:42:25] T312160: Adjust the field type of cx_corpora.cxc_timestamp to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T312160 [07:43:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1186 (re)pooling @ 2%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P33039 and previous config saved to /var/cache/conftool/dbconfig/20220825-074307-root.json [07:43:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1188 (re)pooling @ 2%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P33040 and previous config saved to /var/cache/conftool/dbconfig/20220825-074315-root.json [07:43:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1190 (re)pooling @ 2%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P33041 and previous config saved to /var/cache/conftool/dbconfig/20220825-074323-root.json [07:44:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1191 (re)pooling @ 2%: Repooling for the first time', diff saved 
to https://phabricator.wikimedia.org/P33042 and previous config saved to /var/cache/conftool/dbconfig/20220825-074400-root.json [07:44:38] (03PS1) 10Marostegui: mariadb: Productionize db1192 [puppet] - 10https://gerrit.wikimedia.org/r/826495 (https://phabricator.wikimedia.org/T313569) [07:45:39] (03CR) 10Marostegui: [C: 03+2] mariadb: Productionize db1192 [puppet] - 10https://gerrit.wikimedia.org/r/826495 (https://phabricator.wikimedia.org/T313569) (owner: 10Marostegui) [07:51:19] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db[2132,2160].codfw.wmnet,db[1117,1164,1195].eqiad.wmnet with reason: Switchover m1 T315864 [07:51:24] T315864: Switchover m1 master (db1164 -> db1195) - https://phabricator.wikimedia.org/T315864 [07:51:35] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db[2132,2160].codfw.wmnet,db[1117,1164,1195].eqiad.wmnet with reason: Switchover m1 T315864 [07:52:52] (03PS1) 10Slyngshede: c:spamassassin debug alerting on Spamassassin timer. 
[puppet] - 10https://gerrit.wikimedia.org/r/826496 [07:54:29] (03PS3) 10Marostegui: mariadb: Promote db1195 to m1 master [puppet] - 10https://gerrit.wikimedia.org/r/826222 (https://phabricator.wikimedia.org/T315864) [07:55:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P33044 and previous config saved to /var/cache/conftool/dbconfig/20220825-075547-ladsgroup.json [07:56:22] (03CR) 10Marostegui: [C: 03+2] mariadb: Promote db1195 to m1 master [puppet] - 10https://gerrit.wikimedia.org/r/826222 (https://phabricator.wikimedia.org/T315864) (owner: 10Marostegui) [07:58:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1186 (re)pooling @ 3%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P33045 and previous config saved to /var/cache/conftool/dbconfig/20220825-075811-root.json [07:58:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1188 (re)pooling @ 3%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P33046 and previous config saved to /var/cache/conftool/dbconfig/20220825-075820-root.json [07:58:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1190 (re)pooling @ 3%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P33047 and previous config saved to /var/cache/conftool/dbconfig/20220825-075828-root.json [07:59:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1191 (re)pooling @ 3%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P33048 and previous config saved to /var/cache/conftool/dbconfig/20220825-075905-root.json [07:59:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1114 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P33049 and previous config saved to /var/cache/conftool/dbconfig/20220825-075924-root.json [08:00:04] hashar and dduvall: #bothumor Q:How do functions break up? 
A:They stop calling each other. Rise for MediaWiki train - Utc-0+Utc-7 Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220825T0800). [08:01:17] (03PS2) 10Slyngshede: c:spamassassin debug alerting on Spamassassin timer. [puppet] - 10https://gerrit.wikimedia.org/r/826496 [08:03:13] (03PS1) 10Marostegui: mariadb: Productionize db1193 [puppet] - 10https://gerrit.wikimedia.org/r/826498 (https://phabricator.wikimedia.org/T313569) [08:03:57] PROBLEM - Swift https frontend on ms-fe1012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.190 second response time https://wikitech.wikimedia.org/wiki/Swift [08:04:53] RECOVERY - Swift https frontend on ms-fe1012 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.008 second response time https://wikitech.wikimedia.org/wiki/Swift [08:05:05] (03PS3) 10Slyngshede: c:spamassassin debug alerting on Spamassassin timer. [puppet] - 10https://gerrit.wikimedia.org/r/826496 [08:06:38] (03CR) 10Marostegui: [C: 03+2] mariadb: Productionize db1193 [puppet] - 10https://gerrit.wikimedia.org/r/826498 (https://phabricator.wikimedia.org/T313569) (owner: 10Marostegui) [08:06:46] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36970/console" [puppet] - 10https://gerrit.wikimedia.org/r/826496 (owner: 10Slyngshede) [08:07:01] PROBLEM - Check systemd state on webperf2004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:09:26] (03CR) 10Slyngshede: [V: 03+1 C: 03+2] c:spamassassin debug alerting on Spamassassin timer. 
[puppet] - 10https://gerrit.wikimedia.org/r/826496 (owner: 10Slyngshede) [08:09:58] !log Reboot db1195 for kernel upgrade T315864 [08:10:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:10:03] T315864: Switchover m1 master (db1164 -> db1195) - https://phabricator.wikimedia.org/T315864 [08:10:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P33050 and previous config saved to /var/cache/conftool/dbconfig/20220825-081053-ladsgroup.json [08:10:56] I am going to stop bacula for some time, please avoid accidental deleting of production data in the next hour or so [08:12:53] PROBLEM - haproxy failover on dbproxy1014 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [08:12:59] ^ me [08:13:01] !log stopping bacula services on backup1001 T315864 [08:13:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:13:07] jynus: Oh man there goes my morning task of dropping the prod databases :( [08:13:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1186 (re)pooling @ 4%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P33051 and previous config saved to /var/cache/conftool/dbconfig/20220825-081316-root.json [08:13:18] (sorry) [08:13:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1188 (re)pooling @ 4%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P33052 and previous config saved to /var/cache/conftool/dbconfig/20220825-081325-root.json [08:13:27] claime: please wait until maintenance is complete, apologies for disturbance [08:13:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1190 (re)pooling @ 4%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P33053 and previous config saved to /var/cache/conftool/dbconfig/20220825-081333-root.json [08:13:36] x) [08:13:44] it 
should be done in less than 1h [08:14:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1191 (re)pooling @ 4%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P33054 and previous config saved to /var/cache/conftool/dbconfig/20220825-081410-root.json [08:14:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1114 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P33055 and previous config saved to /var/cache/conftool/dbconfig/20220825-081429-root.json [08:14:37] PROBLEM - MediaWiki EtcdConfig up-to-date on mw2396 is CRITICAL: etcd last index (1119153) is outdated compared to the master one (1119159) https://wikitech.wikimedia.org/wiki/Etcd [08:15:15] RECOVERY - Check systemd state on webperf2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:15:53] RECOVERY - haproxy failover on dbproxy1014 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [08:16:05] RECOVERY - MediaWiki EtcdConfig up-to-date on mw2396 is OK: etcd last index (1119159) matches the master one (1119159) https://wikitech.wikimedia.org/wiki/Etcd [08:17:45] (JobUnavailable) firing: Reduced availability for job bacula in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:17:56] ^that is me [08:18:01] bacula is down at the moment [08:19:38] (03CR) 10Vgutierrez: Varnish: Stop sending analytics cookies to API (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/824793 (https://phabricator.wikimedia.org/T260943) (owner: 10BCornwall) [08:22:14] (03CR) 10Vgutierrez: [C: 03+2] Increase roll-out of query-sorting to 5% [puppet] - 10https://gerrit.wikimedia.org/r/826398 (https://phabricator.wikimedia.org/T314868) (owner: 10Ori) [08:22:51] Increase roll-out of 
query-sorting to 5% [08:23:10] !log Increase roll-out of query-sorting to 5% - T314868 [08:23:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:23:14] T314868: Roll out query parameter normalization - https://phabricator.wikimedia.org/T314868 [08:26:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2163 (T314041)', diff saved to https://phabricator.wikimedia.org/P33056 and previous config saved to /var/cache/conftool/dbconfig/20220825-082559-ladsgroup.json [08:26:01] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2162.codfw.wmnet with reason: Maintenance [08:26:04] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [08:26:15] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2162.codfw.wmnet with reason: Maintenance [08:26:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2162 (T314041)', diff saved to https://phabricator.wikimedia.org/P33057 and previous config saved to /var/cache/conftool/dbconfig/20220825-082621-ladsgroup.json [08:28:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2162 (T314041)', diff saved to https://phabricator.wikimedia.org/P33058 and previous config saved to /var/cache/conftool/dbconfig/20220825-082807-ladsgroup.json [08:28:11] (03CR) 10JMeybohm: [C: 04-1] Stop reporting releng images to debmonitor (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/826211 (owner: 10Muehlenhoff) [08:28:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1186 (re)pooling @ 5%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P33059 and previous config saved to /var/cache/conftool/dbconfig/20220825-082821-root.json [08:28:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1188 (re)pooling @ 5%: Repooling for the first time', diff saved to 
https://phabricator.wikimedia.org/P33060 and previous config saved to /var/cache/conftool/dbconfig/20220825-082830-root.json [08:28:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1190 (re)pooling @ 5%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P33061 and previous config saved to /var/cache/conftool/dbconfig/20220825-082837-root.json [08:29:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1191 (re)pooling @ 5%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P33062 and previous config saved to /var/cache/conftool/dbconfig/20220825-082915-root.json [08:29:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1114 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P33063 and previous config saved to /var/cache/conftool/dbconfig/20220825-082933-root.json [08:30:01] !log Failover m1 from db1164 to db1195 - T315864 [08:30:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:30:05] T315864: Switchover m1 master (db1164 -> db1195) - https://phabricator.wikimedia.org/T315864 [08:30:42] done [08:33:49] (03CR) 10Marostegui: [C: 03+2] dbbackups: Replace m1 master [puppet] - 10https://gerrit.wikimedia.org/r/826223 (https://phabricator.wikimedia.org/T315864) (owner: 10Marostegui) [08:39:40] !log restarting backupmon1001 [08:39:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:40:15] PROBLEM - Check systemd state on pki2001 is CRITICAL: CRITICAL - degraded: The following units failed: cfssl-ocsprefresh-Wikimedia_Internal_Root_CA.service,cfssl-ocsprefresh-cloud_wmnet_ca.service,cfssl-ocsprefresh-debmonitor.service,cfssl-ocsprefresh-discovery.service,cfssl-ocsprefresh-etcd.service,cfssl-ocsprefresh-kafka.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:40:29] PROBLEM - Check systemd state on pki1001 is CRITICAL: CRITICAL - degraded: The following units failed: 
cfssl-ocsprefresh-debmonitor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:40:31] you gotta love how fast vms reboot compared to its physical counterparts :-D [08:42:45] (JobUnavailable) resolved: Reduced availability for job bacula in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:43:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2162', diff saved to https://phabricator.wikimedia.org/P33064 and previous config saved to /var/cache/conftool/dbconfig/20220825-084313-ladsgroup.json [08:43:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1186 (re)pooling @ 10%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P33065 and previous config saved to /var/cache/conftool/dbconfig/20220825-084326-root.json [08:43:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1188 (re)pooling @ 10%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P33066 and previous config saved to /var/cache/conftool/dbconfig/20220825-084334-root.json [08:43:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1190 (re)pooling @ 10%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P33067 and previous config saved to /var/cache/conftool/dbconfig/20220825-084342-root.json [08:44:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1191 (re)pooling @ 10%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P33068 and previous config saved to /var/cache/conftool/dbconfig/20220825-084419-root.json [08:44:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1114 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P33069 and previous config saved to /var/cache/conftool/dbconfig/20220825-084438-root.json [08:50:15] 
!log installing gnutls28 security updates on bullseye [08:50:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:53:35] did we get the recovery for the bacula prometheus job? [08:54:00] good morning, I have overslept [08:54:37] !log installing curl security updates on bullseye [08:54:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:55:51] PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [08:56:06] (03PS1) 10Marostegui: dbproxy1012,dbproxy1014: Replace db1164 with db1117 [puppet] - 10https://gerrit.wikimedia.org/r/826506 [08:56:29] (03PS1) 10Hashar: Revert "group1 wikis to 1.39.0-wmf.26" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/826507 (https://phabricator.wikimedia.org/T316085) [08:56:29] Oh, I missed it above "(JobUnavailable) resolved:" [08:56:46] (03CR) 10CI reject: [V: 04-1] dbproxy1012,dbproxy1014: Replace db1164 with db1117 [puppet] - 10https://gerrit.wikimedia.org/r/826506 (owner: 10Marostegui) [08:57:12] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for @SiKo_WMDE - https://phabricator.wikimedia.org/T315878 (10Ladsgroup) [08:57:13] (03CR) 10Hashar: [C: 03+2] "Got applied yesterday manually but I forgot to push it to Gerrit." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/826507 (https://phabricator.wikimedia.org/T316085) (owner: 10Hashar) [08:57:29] (03PS1) 10TrainBranchBot: group1 wikis to 1.39.0-wmf.26 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/826508 (https://phabricator.wikimedia.org/T314187) [08:57:33] (03CR) 10TrainBranchBot: [C: 03+2] group1 wikis to 1.39.0-wmf.26 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/826508 (https://phabricator.wikimedia.org/T314187) (owner: 10TrainBranchBot) [08:57:55] hashar: Ha, whoops. 
[08:57:56] (03Merged) 10jenkins-bot: Revert "group1 wikis to 1.39.0-wmf.26" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/826507 (https://phabricator.wikimedia.org/T316085) (owner: 10Hashar) [08:58:05] Also tsk. ;-) [08:58:18] (03Merged) 10jenkins-bot: group1 wikis to 1.39.0-wmf.26 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/826508 (https://phabricator.wikimedia.org/T314187) (owner: 10TrainBranchBot) [08:58:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2162', diff saved to https://phabricator.wikimedia.org/P33070 and previous config saved to /var/cache/conftool/dbconfig/20220825-085819-ladsgroup.json [08:58:27] (03PS2) 10Marostegui: dbproxy1012,dbproxy1014: Replace db1164 with db1117 [puppet] - 10https://gerrit.wikimedia.org/r/826506 [08:58:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1186 (re)pooling @ 25%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P33071 and previous config saved to /var/cache/conftool/dbconfig/20220825-085831-root.json [08:58:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1188 (re)pooling @ 25%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P33072 and previous config saved to /var/cache/conftool/dbconfig/20220825-085839-root.json [08:58:46] (03CR) 10Jcrespo: [C: 03+1] "IPs and service on the port double-checked." 
[puppet] - 10https://gerrit.wikimedia.org/r/826506 (owner: 10Marostegui) [08:58:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1190 (re)pooling @ 25%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P33073 and previous config saved to /var/cache/conftool/dbconfig/20220825-085847-root.json [08:59:10] (03CR) 10Marostegui: [C: 03+2] dbproxy1012,dbproxy1014: Replace db1164 with db1117 [puppet] - 10https://gerrit.wikimedia.org/r/826506 (owner: 10Marostegui) [08:59:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1191 (re)pooling @ 25%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P33074 and previous config saved to /var/cache/conftool/dbconfig/20220825-085924-root.json [08:59:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1114 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P33075 and previous config saved to /var/cache/conftool/dbconfig/20220825-085943-root.json [09:01:12] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [09:02:11] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [09:02:12] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [09:02:26] !log hashar@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.39.0-wmf.26 refs T314187 [09:02:30] T314187: 1.39.0-wmf.26 deployment blockers - https://phabricator.wikimedia.org/T314187 [09:03:04] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [09:05:57] !log hashar@deploy1002 Synchronized php: group1 wikis to 1.39.0-wmf.26 refs T314187 (duration: 03m 30s) [09:07:50] (03PS1) 10Ladsgroup: admin: Add siko to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/826509 (https://phabricator.wikimedia.org/T315878) [09:08:15] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply 
[09:09:01] (03PS1) 10Marostegui: mariadb: Move db1164 to m2 [puppet] - 10https://gerrit.wikimedia.org/r/826510 (https://phabricator.wikimedia.org/T316187) [09:09:05] (03CR) 10CI reject: [V: 04-1] admin: Add siko to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/826509 (https://phabricator.wikimedia.org/T315878) (owner: 10Ladsgroup) [09:09:14] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [09:09:15] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [09:09:43] (03CR) 10Marostegui: [C: 03+2] mariadb: Move db1164 to m2 [puppet] - 10https://gerrit.wikimedia.org/r/826510 (https://phabricator.wikimedia.org/T316187) (owner: 10Marostegui) [09:10:09] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [09:11:36] PROBLEM - haproxy failover on dbproxy1013 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [09:12:56] PROBLEM - Check systemd state on webperf2004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:13:06] PROBLEM - haproxy failover on dbproxy1017 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [09:13:12] PROBLEM - haproxy failover on dbproxy1021 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [09:13:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2162 (T314041)', diff saved to https://phabricator.wikimedia.org/P33077 and previous config saved to /var/cache/conftool/dbconfig/20220825-091325-ladsgroup.json [09:13:27] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2100.codfw.wmnet with reason: Maintenance [09:13:30] T314041: Drop old templatelinks columns and indexes - 
https://phabricator.wikimedia.org/T314041 [09:13:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1186 (re)pooling @ 50%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P33078 and previous config saved to /var/cache/conftool/dbconfig/20220825-091336-root.json [09:13:41] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2100.codfw.wmnet with reason: Maintenance [09:13:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1188 (re)pooling @ 50%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P33079 and previous config saved to /var/cache/conftool/dbconfig/20220825-091344-root.json [09:13:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1190 (re)pooling @ 50%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P33080 and previous config saved to /var/cache/conftool/dbconfig/20220825-091351-root.json [09:14:22] PROBLEM - haproxy failover on dbproxy1015 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [09:14:28] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2154.codfw.wmnet with reason: Maintenance [09:14:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1191 (re)pooling @ 50%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P33081 and previous config saved to /var/cache/conftool/dbconfig/20220825-091428-root.json [09:14:42] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2154.codfw.wmnet with reason: Maintenance [09:14:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2154 (T314041)', diff saved to https://phabricator.wikimedia.org/P33082 and previous config saved to /var/cache/conftool/dbconfig/20220825-091447-ladsgroup.json [09:14:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1114 (re)pooling @ 100%: 
Repooling', diff saved to https://phabricator.wikimedia.org/P33083 and previous config saved to /var/cache/conftool/dbconfig/20220825-091448-root.json [09:15:32] (03CR) 10Btullis: [C: 03+1] "Looks good to me. I can +2 and merge if that helps." [puppet] - 10https://gerrit.wikimedia.org/r/817907 (owner: 10Bearloga) [09:16:14] RECOVERY - Check systemd state on webperf2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:16:24] RECOVERY - haproxy failover on dbproxy1017 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [09:16:30] RECOVERY - haproxy failover on dbproxy1021 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [09:16:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2154 (T314041)', diff saved to https://phabricator.wikimedia.org/P33084 and previous config saved to /var/cache/conftool/dbconfig/20220825-091633-ladsgroup.json [09:18:45] (03CR) 10Btullis: [C: 03+1] "This seems fine to me. I'm happy to +2 and merge if it helps." [puppet] - 10https://gerrit.wikimedia.org/r/817903 (owner: 10Bearloga) [09:19:03] (03PS2) 10Ladsgroup: admin: Add siko to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/826509 (https://phabricator.wikimedia.org/T315878) [09:19:47] oh nice uploads to commons looks broken [09:20:07] `/w/api.php` PHP Warning: fopen(): Filename cannot be empty [09:21:09] and Fancy captcha have some Swift related `Iterator page I/O error.` [09:21:58] PROBLEM - haproxy failover on dbproxy1017 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [09:22:05] (03CR) 10Vgutierrez: [C: 03+1] "Tests are currently happy. 
Even if we don't alter GeoIP behaviour in this CR I think it's ok to have it on the VTC code to ensure that api" [puppet] - 10https://gerrit.wikimedia.org/r/824793 (https://phabricator.wikimedia.org/T260943) (owner: 10BCornwall) [09:23:00] might have been transient [09:23:03] (03PS1) 10Slyngshede: c:spamassassin remove cronjob, and use systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/826513 [09:23:13] (03CR) 10Btullis: [V: 03+1 C: 03+1] "PCC SUCCESS (DIFF 10): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36973/console" [puppet] - 10https://gerrit.wikimedia.org/r/817907 (owner: 10Bearloga) [09:23:28] (03PS3) 10Ladsgroup: admin: Add siko to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/826509 (https://phabricator.wikimedia.org/T315878) [09:23:31] (03PS1) 10Marostegui: site.pp: Remove db1193 from insetup [puppet] - 10https://gerrit.wikimedia.org/r/826514 (https://phabricator.wikimedia.org/T313569) [09:23:33] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] admin: Add siko to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/826509 (https://phabricator.wikimedia.org/T315878) (owner: 10Ladsgroup) [09:23:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1137 (T312160)', diff saved to https://phabricator.wikimedia.org/P33085 and previous config saved to /var/cache/conftool/dbconfig/20220825-092356-ladsgroup.json [09:24:01] T312160: Adjust the field type of cx_corpora.cxc_timestamp to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T312160 [09:24:02] RECOVERY - haproxy failover on dbproxy1017 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [09:24:40] (03CR) 10Marostegui: [C: 03+2] site.pp: Remove db1193 from insetup [puppet] - 10https://gerrit.wikimedia.org/r/826514 (https://phabricator.wikimedia.org/T313569) (owner: 10Marostegui) [09:24:56] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access 
to analytics-privatedata-users for @SiKo_WMDE - https://phabricator.wikimedia.org/T315878 (10Ladsgroup) 05Open→03Resolved You should be able to have access in half an hour. [09:25:19] I will do the rest of the wikis after our itimezone lunch or in 2-3 hours from now [09:27:01] (03CR) 10Slyngshede: [C: 03+2] c:dynamicproxy move cronjob to systemd timer. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/819505 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) [09:28:19] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36974/console" [puppet] - 10https://gerrit.wikimedia.org/r/826513 (owner: 10Slyngshede) [09:28:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1186 (re)pooling @ 75%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P33086 and previous config saved to /var/cache/conftool/dbconfig/20220825-092840-root.json [09:28:46] (03PS1) 10Marostegui: install_server: Do not reimage db1192 and db1193 [puppet] - 10https://gerrit.wikimedia.org/r/826515 (https://phabricator.wikimedia.org/T313569) [09:28:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1188 (re)pooling @ 75%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P33087 and previous config saved to /var/cache/conftool/dbconfig/20220825-092848-root.json [09:28:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1190 (re)pooling @ 75%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P33088 and previous config saved to /var/cache/conftool/dbconfig/20220825-092856-root.json [09:29:30] (03PS1) 10Ladsgroup: Admin: Add Amanda Bittaker to analytics-private-data [puppet] - 10https://gerrit.wikimedia.org/r/826516 (https://phabricator.wikimedia.org/T316140) [09:29:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1191 (re)pooling @ 75%: Repooling for the first time', diff saved to 
https://phabricator.wikimedia.org/P33089 and previous config saved to /var/cache/conftool/dbconfig/20220825-092933-root.json [09:30:16] (03CR) 10Muehlenhoff: Stop reporting releng images to debmonitor (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/826211 (owner: 10Muehlenhoff) [09:30:32] (03CR) 10CI reject: [V: 04-1] Admin: Add Amanda Bittaker to analytics-private-data [puppet] - 10https://gerrit.wikimedia.org/r/826516 (https://phabricator.wikimedia.org/T316140) (owner: 10Ladsgroup) [09:31:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P33090 and previous config saved to /var/cache/conftool/dbconfig/20220825-093140-ladsgroup.json [09:32:07] (03CR) 10Marostegui: [C: 03+2] install_server: Do not reimage db1192 and db1193 [puppet] - 10https://gerrit.wikimedia.org/r/826515 (https://phabricator.wikimedia.org/T313569) (owner: 10Marostegui) [09:32:22] 10SRE, 10Traffic, 10Patch-For-Review: Don't set cookies for api.wikimedia.org at the caching layer - https://phabricator.wikimedia.org/T260943 (10Vgutierrez) https://gerrit.wikimedia.org/r/824793 submitted by @BCornwall removes `WMF-Last-Access` cookie from api.wikimedia.org, as he mentioned this also remove... 
[09:33:47] (03PS2) 10Ladsgroup: Admin: Add Amanda Bittaker to analytics-private-data [puppet] - 10https://gerrit.wikimedia.org/r/826516 (https://phabricator.wikimedia.org/T316140) [09:35:28] !log restart backup2001 [09:35:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:35:46] RECOVERY - Check systemd state on pki1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:36:08] RECOVERY - Check systemd state on pki2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:39:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1137', diff saved to https://phabricator.wikimedia.org/P33091 and previous config saved to /var/cache/conftool/dbconfig/20220825-093902-ladsgroup.json [09:39:09] !log Reboot stand by dbproxy hosts [09:39:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:39:34] PROBLEM - haproxy failover on dbproxy1021 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [09:43:29] ACKNOWLEDGEMENT - HP RAID on ms-be2035 is CRITICAL: CRITICAL: Slot 3: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2 - Controller: OK - Cache: Permanently Disabled - Battery count: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T316194 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gat [09:43:37] 10SRE, 10ops-codfw: Degraded RAID on ms-be2035 - https://phabricator.wikimedia.org/T316194 (10ops-monitoring-bot) [09:43:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1186 (re)pooling @ 100%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P33092 and previous config saved to 
/var/cache/conftool/dbconfig/20220825-094345-root.json [09:43:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1188 (re)pooling @ 100%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P33093 and previous config saved to /var/cache/conftool/dbconfig/20220825-094353-root.json [09:44:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1190 (re)pooling @ 100%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P33094 and previous config saved to /var/cache/conftool/dbconfig/20220825-094401-root.json [09:44:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1191 (re)pooling @ 100%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P33095 and previous config saved to /var/cache/conftool/dbconfig/20220825-094438-root.json [09:46:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P33096 and previous config saved to /var/cache/conftool/dbconfig/20220825-094646-ladsgroup.json [09:46:50] PROBLEM - haproxy failover on dbproxy1017 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [09:48:39] 10SRE, 10Security, 10cloud-services-team (Kanban): Reboot WMCS proxies - https://phabricator.wikimedia.org/T316195 (10Marostegui) [09:49:02] RECOVERY - haproxy failover on dbproxy1021 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [09:49:40] !log restart backup1002, backup2002 [09:49:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:50:22] RECOVERY - haproxy failover on dbproxy1017 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [09:50:26] !log cgoubert@cumin1001 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:wikikube-staging-worker-codfw [09:50:27] (03CR) 10Hnowlan: [C: 03+2] Revert setting expiry headers on thumbnails 
[software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/825788 (https://phabricator.wikimedia.org/T252719) (owner: 10Vlad.shapik) [09:50:53] (03PS1) 10Ladsgroup: auto_schema: More work on multidc support [software] - 10https://gerrit.wikimedia.org/r/826522 [09:51:05] !log installing libxslt security updates on bullseye [09:51:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:51:39] (03CR) 10CI reject: [V: 04-1] auto_schema: More work on multidc support [software] - 10https://gerrit.wikimedia.org/r/826522 (owner: 10Ladsgroup) [09:52:09] (03Merged) 10jenkins-bot: Revert setting expiry headers on thumbnails [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/825788 (https://phabricator.wikimedia.org/T252719) (owner: 10Vlad.shapik) [09:54:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1137', diff saved to https://phabricator.wikimedia.org/P33097 and previous config saved to /var/cache/conftool/dbconfig/20220825-095408-ladsgroup.json [09:56:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P33098 and previous config saved to /var/cache/conftool/dbconfig/20220825-095611-ladsgroup.json [09:56:52] RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:59:28] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1167.eqiad.wmnet with reason: Maintenance [09:59:31] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1167.eqiad.wmnet with reason: Maintenance [09:59:32] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [09:59:37] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) 
for 2 days, 0:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [09:59:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1167 (T314041)', diff saved to https://phabricator.wikimedia.org/P33099 and previous config saved to /var/cache/conftool/dbconfig/20220825-095942-ladsgroup.json [09:59:48] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [10:00:00] (03PS5) 10Hnowlan: install_server: add reimage role for sessionstore [puppet] - 10https://gerrit.wikimedia.org/r/770984 (https://phabricator.wikimedia.org/T303833) [10:00:05] mvolz: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Services – Citoid / Zotero deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220825T1000). [10:00:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1174', diff saved to https://phabricator.wikimedia.org/P33100 and previous config saved to /var/cache/conftool/dbconfig/20220825-100010-root.json [10:00:41] (03PS2) 10Ladsgroup: auto_schema: More work on multidc support [software] - 10https://gerrit.wikimedia.org/r/826522 [10:02:36] (03PS1) 10Marostegui: mariadb: Productionize db1194 [puppet] - 10https://gerrit.wikimedia.org/r/826524 (https://phabricator.wikimedia.org/T313569) [10:02:55] (03CR) 10Hnowlan: [C: 03+2] install_server: add reimage role for sessionstore [puppet] - 10https://gerrit.wikimedia.org/r/770984 (https://phabricator.wikimedia.org/T303833) (owner: 10Hnowlan) [10:03:01] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host build2001.codfw.wmnet [10:03:46] PROBLEM - haproxy failover on dbproxy1016 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [10:04:26] (03CR) 10Marostegui: [C: 03+2] mariadb: Productionize db1194 [puppet] - 10https://gerrit.wikimedia.org/r/826524 (https://phabricator.wikimedia.org/T313569) (owner: 10Marostegui) [10:04:28] 
RECOVERY - Check systemd state on build2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:05:54] (03PS1) 10Btullis: Add BGP neighbor data for the new dse-k8s cluster [homer/public] - 10https://gerrit.wikimedia.org/r/826525 (https://phabricator.wikimedia.org/T310174) [10:06:33] (03CR) 10CI reject: [V: 04-1] Add BGP neighbor data for the new dse-k8s cluster [homer/public] - 10https://gerrit.wikimedia.org/r/826525 (https://phabricator.wikimedia.org/T310174) (owner: 10Btullis) [10:08:59] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host build2001.codfw.wmnet [10:09:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1137 (T312160)', diff saved to https://phabricator.wikimedia.org/P33102 and previous config saved to /var/cache/conftool/dbconfig/20220825-100915-ladsgroup.json [10:09:17] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1102.eqiad.wmnet with reason: Maintenance [10:09:21] T312160: Adjust the field type of cx_corpora.cxc_timestamp to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T312160 [10:09:30] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1102.eqiad.wmnet with reason: Maintenance [10:09:36] PROBLEM - Check systemd state on webperf2004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:09:42] (03PS2) 10Muehlenhoff: Stop reporting releng images to debmonitor [puppet] - 10https://gerrit.wikimedia.org/r/826211 [10:10:16] PROBLEM - haproxy failover on dbproxy1017 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [10:13:16] RECOVERY - haproxy failover on dbproxy1016 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy 
[10:13:32] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.k8s.reboot-nodes (exit_code=0) rolling reboot on A:wikikube-staging-worker-codfw [10:15:10] PROBLEM - Swift https frontend on ms-fe1012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.283 second response time https://wikitech.wikimedia.org/wiki/Swift [10:15:35] (03CR) 10Clément Goubert: [C: 03+1] "LGTM, will amend 826245 once merged with the removal of `profile::docker::engine::force_default_docker_storage` if you don't want to do re" [puppet] - 10https://gerrit.wikimedia.org/r/826240 (https://phabricator.wikimedia.org/T310177) (owner: 10Btullis) [10:16:31] 10SRE-swift-storage, 10Commons, 10Wikimedia-production-error: server-glitch hampering deletions: backend-fail-internal - https://phabricator.wikimedia.org/T316188 (10jcrespo) This is in ongoing investigation. [10:16:44] RECOVERY - Check systemd state on webperf2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:17:22] 10SRE-swift-storage, 10Commons, 10Wikimedia-production-error: server-glitch hampering deletions: backend-fail-internal - https://phabricator.wikimedia.org/T316188 (10jcrespo) p:05Triage→03Unbreak! 
[10:17:24] RECOVERY - haproxy failover on dbproxy1017 is OK: OK check_failover servers up 1 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [10:17:26] RECOVERY - Swift https frontend on ms-fe1012 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.012 second response time https://wikitech.wikimedia.org/wiki/Swift [10:19:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1167 (T314041)', diff saved to https://phabricator.wikimedia.org/P33103 and previous config saved to /var/cache/conftool/dbconfig/20220825-101930-ladsgroup.json [10:19:36] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [10:22:37] 10SRE-swift-storage, 10Commons, 10Wikimedia-production-error: Ongoing media storage errors: backend-fail-internal on deletions and ~2000 read errors/s - https://phabricator.wikimedia.org/T316188 (10jcrespo) [10:22:46] PROBLEM - haproxy failover on dbproxy1016 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [10:23:06] !log cgoubert@cumin1001 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:wikikube-staging-worker-eqiad [10:24:48] (03PS1) 10Marostegui: site.pp: Remove insetup from db1194 [puppet] - 10https://gerrit.wikimedia.org/r/826526 (https://phabricator.wikimedia.org/T313569) [10:25:30] PROBLEM - Swift https frontend on ms-fe1010 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.241 second response time https://wikitech.wikimedia.org/wiki/Swift [10:25:57] (03CR) 10Marostegui: [C: 03+2] site.pp: Remove insetup from db1194 [puppet] - 10https://gerrit.wikimedia.org/r/826526 (https://phabricator.wikimedia.org/T313569) (owner: 10Marostegui) [10:27:30] RECOVERY - haproxy failover on dbproxy1016 is OK: OK check_failover servers up 1 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [10:28:17] (03PS1) 10Marostegui: site.pp: Fix db1194 location [puppet] - 10https://gerrit.wikimedia.org/r/826527 [10:28:37] (03CR) 
10Marostegui: [V: 03+2 C: 03+2] site.pp: Fix db1194 location [puppet] - 10https://gerrit.wikimedia.org/r/826527 (owner: 10Marostegui) [10:30:08] RECOVERY - Swift https frontend on ms-fe1010 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.010 second response time https://wikitech.wikimedia.org/wiki/Swift [10:32:40] (03PS1) 10Marostegui: wmnet: Failover m1-master [dns] - 10https://gerrit.wikimedia.org/r/826528 (https://phabricator.wikimedia.org/T316186) [10:33:12] (03CR) 10David Caro: Modify maintain-dbusers.py to call the rest-api service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/809921 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [10:33:40] (03PS2) 10Btullis: Add BGP neighbor data for the new dse-k8s cluster [homer/public] - 10https://gerrit.wikimedia.org/r/826525 (https://phabricator.wikimedia.org/T310174) [10:33:57] (03CR) 10Majavah: [C: 03+2] "I tested this (by copying the script to my home directory and manually editing a deployment) and it works fine." [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/738503 (https://phabricator.wikimedia.org/T293552) (owner: 10Legoktm) [10:34:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P33104 and previous config saved to /var/cache/conftool/dbconfig/20220825-103436-ladsgroup.json [10:34:45] (03Merged) 10jenkins-bot: python39: Use shell reimplementation of webservice-runner [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/738503 (https://phabricator.wikimedia.org/T293552) (owner: 10Legoktm) [10:37:03] (03CR) 10Slyngshede: [V: 03+1] "I've re-enabled the spamassassin update timer on otrs1001 and I'm unable to reproduce the error." 
[puppet] - 10https://gerrit.wikimedia.org/r/826513 (owner: 10Slyngshede) [10:40:40] (03PS5) 10Btullis: Enable the dse-k8s-worker nodes [puppet] - 10https://gerrit.wikimedia.org/r/826240 (https://phabricator.wikimedia.org/T310177) [10:40:56] (03CR) 10Btullis: Enable the dse-k8s-worker nodes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/826240 (https://phabricator.wikimedia.org/T310177) (owner: 10Btullis) [10:42:44] RECOVERY - haproxy failover on dbproxy1013 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [10:42:51] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.k8s.reboot-nodes (exit_code=0) rolling reboot on A:wikikube-staging-worker-eqiad [10:44:32] RECOVERY - haproxy failover on dbproxy1015 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [10:49:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P33105 and previous config saved to /var/cache/conftool/dbconfig/20220825-104942-ladsgroup.json [10:50:05] 10SRE, 10SRE-tools, 10DBA, 10Infrastructure-Foundations, and 2 others: Switchover m2 master db1159 -> db1164 - https://phabricator.wikimedia.org/T316202 (10Marostegui) I am planning to do this switchover on Monday 29th at 08:30 AM UTC. The expected impact would be around 15-30 seconds of RO time. Reads won... 
[10:50:57] 10SRE, 10SRE-tools, 10DBA, 10Infrastructure-Foundations, and 2 others: Switchover m2 master db1159 -> db1164 - https://phabricator.wikimedia.org/T316202 (10Marostegui) [10:58:46] (03PS1) 10Muehlenhoff: Mark two disks as failed on ms-be1071 [puppet] - 10https://gerrit.wikimedia.org/r/826532 [10:59:22] (03PS1) 10Vgutierrez: swift: Set sd[dz]1@ms-be1071 as failed [puppet] - 10https://gerrit.wikimedia.org/r/826533 (https://phabricator.wikimedia.org/T315437) [10:59:27] (03CR) 10Ladsgroup: [C: 03+1] Mark two disks as failed on ms-be1071 [puppet] - 10https://gerrit.wikimedia.org/r/826532 (owner: 10Muehlenhoff) [11:00:11] (03PS2) 10Hnowlan: api-gateway: custom host overrides in discovery services. [deployment-charts] - 10https://gerrit.wikimedia.org/r/825729 [11:01:02] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36975/console" [puppet] - 10https://gerrit.wikimedia.org/r/826532 (owner: 10Muehlenhoff) [11:01:06] (03Abandoned) 10Vgutierrez: swift: Set sd[dz]1@ms-be1071 as failed [puppet] - 10https://gerrit.wikimedia.org/r/826533 (https://phabricator.wikimedia.org/T315437) (owner: 10Vgutierrez) [11:04:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1167 (T314041)', diff saved to https://phabricator.wikimedia.org/P33106 and previous config saved to /var/cache/conftool/dbconfig/20220825-110448-ladsgroup.json [11:04:58] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [11:07:25] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1109.eqiad.wmnet with reason: Maintenance [11:07:38] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1109.eqiad.wmnet with reason: Maintenance [11:08:24] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2161.codfw.wmnet with reason: Maintenance [11:08:37] 
!log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2161.codfw.wmnet with reason: Maintenance [11:08:58] (03CR) 10Vgutierrez: [V: 03+1 C: 03+1] Mark two disks as failed on ms-be1071 [puppet] - 10https://gerrit.wikimedia.org/r/826532 (owner: 10Muehlenhoff) [11:10:28] 10SRE, 10Search-Console-access-request: [REQUEST] Access to GSC for Wikipedia for TomekSikora.Monsoon - https://phabricator.wikimedia.org/T316212 (10soworu) [11:11:38] (03PS1) 10Majavah: Remove support for overriding LDAP client stack [puppet] - 10https://gerrit.wikimedia.org/r/826536 [11:13:31] (03CR) 10Ladsgroup: [C: 03+1] wmnet: Failover m1-master [dns] - 10https://gerrit.wikimedia.org/r/826528 (https://phabricator.wikimedia.org/T316186) (owner: 10Marostegui) [11:14:06] (03CR) 10Marostegui: [C: 03+2] wmnet: Failover m1-master [dns] - 10https://gerrit.wikimedia.org/r/826528 (https://phabricator.wikimedia.org/T316186) (owner: 10Marostegui) [11:14:20] (03CR) 10CI reject: [V: 04-1] Remove support for overriding LDAP client stack [puppet] - 10https://gerrit.wikimedia.org/r/826536 (owner: 10Majavah) [11:16:31] (03PS8) 10Hnowlan: Set expiry headers on thumbnails [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/489022 (https://phabricator.wikimedia.org/T211661) (owner: 10Gilles) [11:17:13] (03CR) 10Clément Goubert: [C: 03+1] Enable the dse-k8s-worker nodes [puppet] - 10https://gerrit.wikimedia.org/r/826240 (https://phabricator.wikimedia.org/T310177) (owner: 10Btullis) [11:19:55] (03CR) 10Clément Goubert: [V: 03+1 C: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36977/console" [puppet] - 10https://gerrit.wikimedia.org/r/826240 (https://phabricator.wikimedia.org/T310177) (owner: 10Btullis) [11:24:07] (03PS2) 10Majavah: Remove support for overriding LDAP client stack [puppet] - 10https://gerrit.wikimedia.org/r/826536 [11:26:44] PROBLEM - Swift https frontend on ms-fe1012 is CRITICAL: HTTP 
CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.163 second response time https://wikitech.wikimedia.org/wiki/Swift [11:27:45] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 7): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36980/console" [puppet] - 10https://gerrit.wikimedia.org/r/826536 (owner: 10Majavah) [11:28:30] (03PS9) 10Hnowlan: Set expiry headers on thumbnails [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/489022 (https://phabricator.wikimedia.org/T211661) (owner: 10Gilles) [11:29:31] !log Failover m1-master [11:29:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:30:13] 10SRE, 10SRE-tools, 10DBA, 10Infrastructure-Foundations, and 2 others: Switchover m2 master db1159 -> db1164 - https://phabricator.wikimedia.org/T316202 (10Marostegui) [11:32:34] !log restart swift-proxy on ms-fe1010 [11:32:36] jynus: ^ [11:32:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:32:41] thanks [11:33:16] RECOVERY - Swift https frontend on ms-fe1012 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.010 second response time https://wikitech.wikimedia.org/wiki/Swift [11:40:20] !log depool ms-fe1012, leave swift-proxy alone for investigation [11:40:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:49:20] (03PS1) 10Filippo Giunchedi: Revert "swift: bump proxy memcache max connections" [puppet] - 10https://gerrit.wikimedia.org/r/826340 [11:49:57] (03CR) 10CI reject: [V: 04-1] Revert "swift: bump proxy memcache max connections" [puppet] - 10https://gerrit.wikimedia.org/r/826340 (owner: 10Filippo Giunchedi) [11:50:11] WAT [11:50:19] (03PS1) 10KartikMistry: CX3 Build 0.2.0+20220825 [extensions/ContentTranslation] (wmf/1.39.0-wmf.26) - 10https://gerrit.wikimedia.org/r/826341 (https://phabricator.wikimedia.org/T309986) [11:50:54] (03PS2) 10Filippo Giunchedi: Revert "swift: bump proxy memcache max connections" [puppet] - 
10https://gerrit.wikimedia.org/r/826340 [11:51:18] ok commit message reformatted, gods of CI appeased [11:51:20] (03Abandoned) 10Muehlenhoff: Mark two disks as failed on ms-be1071 [puppet] - 10https://gerrit.wikimedia.org/r/826532 (owner: 10Muehlenhoff) [11:52:23] (03CR) 10Filippo Giunchedi: [C: 03+2] Revert "swift: bump proxy memcache max connections" [puppet] - 10https://gerrit.wikimedia.org/r/826340 (owner: 10Filippo Giunchedi) [11:52:46] 10SRE-swift-storage, 10Commons, 10Wikimedia-production-error: Ongoing media storage errors: backend-fail-internal on deletions and ~2000 read errors/s - https://phabricator.wikimedia.org/T316188 (10jcrespo) We have focused on updating primarily the status page (https://www.wikimediastatus.net), but we believ... [11:53:36] 10SRE-swift-storage, 10Commons, 10Wikimedia-Incident, 10Wikimedia-production-error: Ongoing media storage errors: backend-fail-internal on deletions and ~2000 read errors/s - https://phabricator.wikimedia.org/T316188 (10taavi) [11:56:32] ACKNOWLEDGEMENT - Disk space on ms-be1071 is CRITICAL: DISK CRITICAL - /srv/swift-storage/sdz1 is not accessible: Input/output error Muehlenhoff T315437 https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=ms-be1071&var-datasource=eqiad+prometheus/ops [11:56:34] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1116.eqiad.wmnet with reason: Maintenance [11:56:37] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1116.eqiad.wmnet with reason: Maintenance [11:57:52] !log roll-restart swift-proxy on thanos-fe* and ms-fe* (not ms-fe1012) [11:57:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:58:11] jynus: FYI ^ [11:58:25] thanks for the ping, I had missed that [12:02:29] (03PS2) 10Hnowlan: restbase: add restbase103[123] [puppet] - 10https://gerrit.wikimedia.org/r/803520 [12:03:46] PROBLEM - Swift 
https frontend on ms-fe1012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.183 second response time https://wikitech.wikimedia.org/wiki/Swift [12:05:48] PROBLEM - Check systemd state on ms-fe2009 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:05:58] RECOVERY - Swift https frontend on ms-fe1012 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.010 second response time https://wikitech.wikimedia.org/wiki/Swift [12:06:40] !log filippo@cumin1001 START - Cookbook sre.hosts.downtime for 11 days, 0:00:00 on ms-fe1012.eqiad.wmnet with reason: known depooled, left for investigation [12:06:53] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 11 days, 0:00:00 on ms-fe1012.eqiad.wmnet with reason: known depooled, left for investigation [12:15:12] RECOVERY - Check systemd state on ms-fe2009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:16:27] (03CR) 10Ayounsi: [C: 04-1] "Nice! One change needed then lgtm." 
[homer/public] - 10https://gerrit.wikimedia.org/r/826525 (https://phabricator.wikimedia.org/T310174) (owner: 10Btullis) [12:16:29] (03Abandoned) 10Kosta Harlan: GrowthExperiments: Enable AddLink for next round of wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/723517 (https://phabricator.wikimedia.org/T290011) (owner: 10Kosta Harlan) [12:17:43] (03PS2) 10Kosta Harlan: GrowthExperiments: end mailing list campaign in beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/811756 (https://phabricator.wikimedia.org/T307985) (owner: 10Sergio Gimeno) [12:17:51] (03PS3) 10Kosta Harlan: GrowthExperiments: end mailing list campaign in beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/811756 (https://phabricator.wikimedia.org/T307985) (owner: 10Sergio Gimeno) [12:19:54] (03PS3) 10Btullis: Add BGP neighbor data for the new dse-k8s cluster [homer/public] - 10https://gerrit.wikimedia.org/r/826525 (https://phabricator.wikimedia.org/T310174) [12:19:56] (03PS3) 10Kosta Harlan: Declare mediawiki.createaccount_blocked_user schema [mediawiki-config] - 10https://gerrit.wikimedia.org/r/822686 (https://phabricator.wikimedia.org/T306018) (owner: 10Sergio Gimeno) [12:20:59] (03CR) 10Btullis: Add BGP neighbor data for the new dse-k8s cluster (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/826525 (https://phabricator.wikimedia.org/T310174) (owner: 10Btullis) [12:24:26] (03PS7) 10David Caro: ceph.bootstrap_and_add: add support to change the osd class type [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/824153 (https://phabricator.wikimedia.org/T314870) [12:24:35] (03CR) 10CI reject: [V: 04-1] ceph.bootstrap_and_add: add support to change the osd class type [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/824153 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro) [12:31:06] (03CR) 10Btullis: [C: 03+2] Enable the dse-k8s-worker nodes [puppet] - 10https://gerrit.wikimedia.org/r/826240 (https://phabricator.wikimedia.org/T310177) 
(owner: 10Btullis) [12:31:13] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2114.codfw.wmnet with reason: Testing a script [12:31:27] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2114.codfw.wmnet with reason: Testing a script [12:34:40] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2114.codfw.wmnet with reason: Maintenance [12:34:43] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2114.codfw.wmnet with reason: Maintenance [12:34:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2114 (T316186)', diff saved to https://phabricator.wikimedia.org/P33108 and previous config saved to /var/cache/conftool/dbconfig/20220825-123448-ladsgroup.json [12:35:33] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.reboot-single for host db2114.codfw.wmnet [12:38:24] PROBLEM - SSH on wtp1044.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:39:26] jouncebot: now [12:39:26] No deployments scheduled for the next 0 hour(s) and 20 minute(s) [12:39:45] I am going to promote the rest of the wikis to 1.39.0-wmf.26 [12:40:28] (03PS1) 10TrainBranchBot: all wikis to 1.39.0-wmf.26 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/826554 (https://phabricator.wikimedia.org/T314187) [12:40:30] (03CR) 10TrainBranchBot: [C: 03+2] all wikis to 1.39.0-wmf.26 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/826554 (https://phabricator.wikimedia.org/T314187) (owner: 10TrainBranchBot) [12:40:48] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host dse-k8s-worker1001.eqiad.wmnet [12:41:11] (03Merged) 10jenkins-bot: all wikis to 1.39.0-wmf.26 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/826554 (https://phabricator.wikimedia.org/T314187) (owner: 10TrainBranchBot) 
[12:44:20] hashar: Can I go ahead with my wmf.26 backport patch as scheduled in approx 15 min? [12:45:17] !log hashar@deploy1002 rebuilt and synchronized wikiversions files: all wikis to 1.39.0-wmf.26 refs T314187 [12:45:21] T314187: 1.39.0-wmf.26 deployment blockers - https://phabricator.wikimedia.org/T314187 [12:46:08] !log ladsgroup@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host db2114.codfw.wmnet [12:46:36] PROBLEM - mysqld processes on db2114 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [12:46:36] PROBLEM - MariaDB Replica SQL: s6 on db2114 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [12:46:36] PROBLEM - MariaDB Replica IO: s6 on db2114 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [12:46:38] PROBLEM - MariaDB read only s6 on db2114 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [12:46:48] uh? [12:46:58] (KubernetesRsyslogDown) firing: (3) rsyslog on dse-k8s-worker1003:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [12:46:59] that's the candidate master [12:47:24] Amir1: ^ [12:47:42] marostegui: rebooting it, it should come back online [12:47:49] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [12:47:49] I personally downtimed it for a day [12:48:06] Amir1: but the alert arrived? 
[12:48:31] ah, didn't see it failed [12:48:32] (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host db2114.codfw.wmnet [12:48:34] sigh [12:48:43] how downtime fails :/ [12:48:45] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [12:48:47] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [12:48:47] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dse-k8s-worker1001.eqiad.wmnet [12:49:06] previous ones passed tho (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2114.codfw.wmnet with reason: Maintenance [12:49:41] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host dse-k8s-worker1002.eqiad.wmnet [12:49:41] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [12:50:21] sigh, why uptime has different value [12:50:28] anyway, separate issue [12:51:04] Amir1: and mysql isn't up either [12:51:08] is that expected? 
[12:51:11] yeah, on it [12:51:15] cool np [12:51:40] (03CR) 10JMeybohm: [C: 03+1] sre: alert on appserver unavailability [alerts] - 10https://gerrit.wikimedia.org/r/825741 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [12:51:58] (KubernetesRsyslogDown) firing: (4) rsyslog on dse-k8s-worker1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [12:52:44] (03CR) 10JMeybohm: [C: 03+1] Stop reporting releng images to debmonitor (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/826211 (owner: 10Muehlenhoff) [12:53:42] RECOVERY - mysqld processes on db2114 is OK: PROCS OK: 1 process with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [12:53:42] RECOVERY - MariaDB Replica SQL: s6 on db2114 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [12:53:42] RECOVERY - MariaDB Replica IO: s6 on db2114 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [12:53:42] RECOVERY - MariaDB read only s6 on db2114 is OK: Version 10.4.25-MariaDB-log, Uptime 109s, read_only: True, event_scheduler: True, 1526.83 QPS, connection latency: 0.004922s, query latency: 0.000788s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [12:54:09] (03CR) 10JMeybohm: [C: 03+1] kubestage: cleanup profile::docker::storage [puppet] - 10https://gerrit.wikimedia.org/r/826229 (https://phabricator.wikimedia.org/T315977) (owner: 10Clément Goubert) [12:54:30] I think I know why it's erroring out, the cookbook removes downtime [12:54:51] (03CR) 10JMeybohm: [C: 03+1] kubernetes: cleanup profile::docker::storage [puppet] - 10https://gerrit.wikimedia.org/r/826236 (https://phabricator.wikimedia.org/T315977) (owner: 10Clément Goubert) 
[12:55:19] (03CR) 10JMeybohm: [C: 03+1] ml-staging: cleanup profile::docker::storage [puppet] - 10https://gerrit.wikimedia.org/r/826233 (https://phabricator.wikimedia.org/T315977) (owner: 10Clément Goubert) [12:55:58] 10SRE-swift-storage, 10Commons, 10Wikimedia-Incident, 10Wikimedia-production-error: Ongoing media storage errors: backend-fail-internal on deletions and ~2000 read errors/s - https://phabricator.wikimedia.org/T316188 (10jcrespo) p:05Unbreak!→03High We believe this is solved now- RFO seemed to be an iss... [12:56:21] (03CR) 10JMeybohm: [C: 03+1] ml-serve: cleanup profile::docker::storage [puppet] - 10https://gerrit.wikimedia.org/r/826238 (https://phabricator.wikimedia.org/T315977) (owner: 10Clément Goubert) [12:56:58] (KubernetesRsyslogDown) resolved: (4) rsyslog on dse-k8s-worker1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [12:57:30] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dse-k8s-worker1002.eqiad.wmnet [12:58:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2114', diff saved to https://phabricator.wikimedia.org/P33109 and previous config saved to /var/cache/conftool/dbconfig/20220825-125806-ladsgroup.json [12:58:07] marostegui: Amir1: has MediaWiki overloaded that s6 db2114 database or is that unrelated? 
[12:58:23] hashar: Unrelated [12:58:25] db2114 is codfw, not getting any traffic [12:58:28] (KubernetesRsyslogDown) firing: (2) rsyslog on dse-k8s-worker1004:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [12:58:35] great thank you for the confirmation [12:58:58] (KubernetesRsyslogDown) firing: (5) rsyslog on dse-k8s-worker1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [12:58:59] marostegui: let me try another thing for the next restart, is that fine with you? [13:00:05] Deploy window Mobileapps/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220825T1300) [13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, and awight: My dear minions, it's time we take the moon! Just kidding. Time for UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220825T1300). [13:00:05] kart_: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:19] * kart_ is here [13:00:39] hashar: Can I go ahead for backport deployment? 
[13:00:42] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2114.codfw.wmnet with reason: Maintenance [13:00:44] !log ladsgroup@cumin1001 END (ERROR) - Cookbook sre.hosts.downtime (exit_code=97) for 1 day, 0:00:00 on db2114.codfw.wmnet with reason: Maintenance [13:01:44] (03PS1) 10Ayounsi: Inital FHRP support [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/826559 (https://phabricator.wikimedia.org/T311218) [13:02:13] (KubernetesRsyslogDown) firing: (6) rsyslog on dse-k8s-worker1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [13:02:16] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2117.codfw.wmnet with reason: Maintenance [13:02:29] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2117.codfw.wmnet with reason: Maintenance [13:02:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2117 (T316186)', diff saved to https://phabricator.wikimedia.org/P33110 and previous config saved to /var/cache/conftool/dbconfig/20220825-130235-ladsgroup.json [13:02:52] hashar: ping ping :) [13:07:13] (KubernetesRsyslogDown) firing: (7) rsyslog on dse-k8s-worker1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [13:08:26] (03CR) 10JMeybohm: [C: 03+1] "Great, thanks for doing this!" 
[puppet] - 10https://gerrit.wikimedia.org/r/826245 (https://phabricator.wikimedia.org/T315977) (owner: 10Clément Goubert) [13:08:52] (03PS1) 10Ayounsi: Add FHRP group support to generate_dns_snippets [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/826560 (https://phabricator.wikimedia.org/T311218) [13:09:18] (03PS2) 10Vgutierrez: trafficserver: Disable origin coalescing [puppet] - 10https://gerrit.wikimedia.org/r/825727 (https://phabricator.wikimedia.org/T315911) [13:09:41] PROBLEM - Check systemd state on dse-k8s-worker1003 is CRITICAL: CRITICAL - degraded: The following units failed: kubelet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:09:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2117', diff saved to https://phabricator.wikimedia.org/P33111 and previous config saved to /var/cache/conftool/dbconfig/20220825-130950-ladsgroup.json [13:10:45] PROBLEM - Check systemd state on dse-k8s-worker1007 is CRITICAL: CRITICAL - degraded: The following units failed: kubelet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:10:47] 10SRE, 10SRE-swift-storage, 10Commons, 10Wikimedia-Incident, 10Wikimedia-production-error: Ongoing media storage errors: backend-fail-internal on deletions and ~2000 read errors/s - https://phabricator.wikimedia.org/T316188 (10jcrespo) [13:11:39] (03PS3) 10Vgutierrez: trafficserver: Disable origin coalescing [puppet] - 10https://gerrit.wikimedia.org/r/825727 (https://phabricator.wikimedia.org/T315911) [13:11:51] kart_: yeah sorry [13:12:10] was digging in grafana and logs [13:12:13] (KubernetesRsyslogDown) firing: (3) rsyslog on dse-k8s-worker1005:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [13:12:38] I am trying the `scap backport` command [13:12:45] (03CR) 10TrainBranchBot: [C: 03+2] "Approved via scap backport" 
[extensions/ContentTranslation] (wmf/1.39.0-wmf.26) - 10https://gerrit.wikimedia.org/r/826341 (https://phabricator.wikimedia.org/T309986) (owner: 10KartikMistry) [13:12:45] hashar: OK. Going ahead. Will take 15 min to merge anyway.. [13:12:52] ah [13:13:00] we I should have +2'ed it ahead of time [13:13:07] and really should speed up those CI jobs [13:13:27] I found a potential optimization to bring the selenium one from ~15 to 10 which would help [13:14:19] cool. I see patch is merged via scap backport? [13:14:24] being merged.. [13:14:26] (03CR) 10Ayounsi: [C: 03+1] Add BGP neighbor data for the new dse-k8s cluster [homer/public] - 10https://gerrit.wikimedia.org/r/826525 (https://phabricator.wikimedia.org/T310174) (owner: 10Btullis) [13:14:26] that `scap backport` is quite great. It found the patch from the Deployments page, found that the change is open and +2ed it [13:14:41] now it waits for the merge to happen [13:14:56] 13:12:45 Waiting for changes to be merged. This may take some time if there are long running tests. [13:14:56] Change 826341 status: NEW, mergeable: True [13:14:56] Change 826341 status: NEW, mergeable: True [13:15:12] and will it do all the magic? or do I need to do a normal scap run? 
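The wait-for-merge behaviour described above (the change is +2ed, then its status is polled until the merge happens) can be sketched roughly as follows. This is an illustrative sketch only, not scap's actual implementation: `wait_for_merge`, `fetch_status`, the polling interval, and the poll limit are hypothetical names and values chosen for the example. The status strings `NEW`, `MERGED`, and `ABANDONED` are the standard Gerrit change states, and the printed line mirrors the output pasted in the channel.

```python
import time
from typing import Callable, Dict


def wait_for_merge(
    change_id: int,
    fetch_status: Callable[[int], Dict[str, object]],  # hypothetical callback
    poll_seconds: float = 5.0,   # assumed interval, not taken from scap
    max_polls: int = 360,        # assumed upper bound on polling attempts
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll a change until it is merged; return True only if it merged."""
    for _ in range(max_polls):
        info = fetch_status(change_id)
        status = info["status"]
        # Same shape as the lines pasted in the channel.
        print(f"Change {change_id} status: {status}, mergeable: {info['mergeable']}")
        if status == "MERGED":
            return True
        if status == "ABANDONED":
            # An abandoned change will never merge, so stop waiting.
            return False
        sleep(poll_seconds)
    # Gave up: CI took longer than the allowed number of polls.
    return False
```

In practice `fetch_status` would wrap whatever query returns the change's current state; passing it in as a callback keeps the loop testable without network access.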
[13:16:01] btullis: kubelet not starting is probably a cgroup issue (with bullseye only mounting cgroup v2) [13:17:16] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2124.codfw.wmnet with reason: Maintenance [13:17:23] btullis: yep...there was a manual change needed (https://phabricator.wikimedia.org/T300744#7700797) [13:17:29] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2124.codfw.wmnet with reason: Maintenance [13:17:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2124 (T316186)', diff saved to https://phabricator.wikimedia.org/P33112 and previous config saved to /var/cache/conftool/dbconfig/20220825-131735-ladsgroup.json [13:18:07] (03CR) 10Ssingh: [C: 03+1] trafficserver: Disable origin coalescing [puppet] - 10https://gerrit.wikimedia.org/r/825727 (https://phabricator.wikimedia.org/T315911) (owner: 10Vgutierrez) [13:18:16] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (NOOP 2 DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36981/console" [puppet] - 10https://gerrit.wikimedia.org/r/825727 (https://phabricator.wikimedia.org/T315911) (owner: 10Vgutierrez) [13:18:46] (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] trafficserver: Disable origin coalescing [puppet] - 10https://gerrit.wikimedia.org/r/825727 (https://phabricator.wikimedia.org/T315911) (owner: 10Vgutierrez) [13:19:50] !log disable origin coalescing in ats-be globally - T315911 [13:19:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:54] T315911: ATS Read While Writer feature is wrongly configured - https://phabricator.wikimedia.org/T315911 [13:20:06] kart_: i think it does all the magic yes [13:20:21] the idea releng has is to make the deployment as automated as possible [13:20:34] so that in theory anyone can process the deployments with just a few lines of documentation [13:21:55] hashar: can you point me to scap backport 
document? [13:22:13] (KubernetesRsyslogDown) firing: (3) rsyslog on dse-k8s-worker1005:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [13:23:03] PROBLEM - Check systemd state on dse-k8s-worker1004 is CRITICAL: CRITICAL - degraded: The following units failed: kubelet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:23:16] kart_: I don't think it is documented yet [13:23:28] (KubernetesRsyslogDown) resolved: (3) rsyslog on dse-k8s-worker1005:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [13:23:30] ah. [13:23:40] https://doc.wikimedia.org/scap/search.html?q=backport gives nothing and the wiki doc at https://wikitech.wikimedia.org/wiki/Backport_windows/Deployers does not mention it yet [13:23:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2124 (T316186)', diff saved to https://phabricator.wikimedia.org/P33113 and previous config saved to /var/cache/conftool/dbconfig/20220825-132356-ladsgroup.json [13:24:29] PROBLEM - Check systemd state on dse-k8s-worker1008 is CRITICAL: CRITICAL - degraded: The following units failed: kubelet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:24:53] kart_: I have asked in our team channel. 
I am guessing it is not ready yet for wide spread adoption [13:25:13] I have reviewed a patch to it yesterday [13:26:18] hashar: I hope it won't break anything :D [13:28:58] (KubernetesRsyslogDown) firing: (8) rsyslog on dse-k8s-worker1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [13:30:37] (03CR) 10Hnowlan: [C: 03+2] jobqueue: increase num_workers to 4 [deployment-charts] - 10https://gerrit.wikimedia.org/r/820117 (https://phabricator.wikimedia.org/T300914) (owner: 10Hnowlan) [13:31:39] (03Merged) 10jenkins-bot: CX3 Build 0.2.0+20220825 [extensions/ContentTranslation] (wmf/1.39.0-wmf.26) - 10https://gerrit.wikimedia.org/r/826341 (https://phabricator.wikimedia.org/T309986) (owner: 10KartikMistry) [13:32:13] !log hashar@deploy1002 Started scap: Backport for [[gerrit:826341|CX3 Build 0.2.0+20220825 (T309986 T301222)]] [13:32:18] T309986: Persist selection of translation service across sessions - https://phabricator.wikimedia.org/T309986 [13:32:18] T301222: Instrumentation of new SX entrypoints - https://phabricator.wikimedia.org/T301222 [13:33:04] PROBLEM - Check systemd state on dse-k8s-worker1005 is CRITICAL: CRITICAL - degraded: The following units failed: kubelet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:33:16] hmm [13:33:19] kart_: looks like it works [13:33:29] (03PS1) 10Joal: Add linktarget to sqooped tables [puppet] - 10https://gerrit.wikimedia.org/r/826564 (https://phabricator.wikimedia.org/T314666) [13:34:55] (03Merged) 10jenkins-bot: jobqueue: increase num_workers to 4 [deployment-charts] - 10https://gerrit.wikimedia.org/r/820117 (https://phabricator.wikimedia.org/T300914) (owner: 10Hnowlan) [13:35:10] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:35:44] kart_: looks like the `scap backport` script runs a full sync directly bypassing the manual 
verification steps through `mwdebug*` hosts [13:35:53] ah. [13:36:11] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:36:12] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:36:24] hashar: That's fine. Patch is tested in master already. [13:36:46] But, would love to see mwdebug* deploy first. [13:37:00] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:38:13] (03CR) 10Btullis: [C: 03+2] Add BGP neighbor data for the new dse-k8s cluster [homer/public] - 10https://gerrit.wikimedia.org/r/826525 (https://phabricator.wikimedia.org/T310174) (owner: 10Btullis) [13:38:44] RECOVERY - SSH on wtp1044.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:39:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2124', diff saved to https://phabricator.wikimedia.org/P33114 and previous config saved to /var/cache/conftool/dbconfig/20220825-133902-ladsgroup.json [13:39:18] (03Merged) 10jenkins-bot: Add BGP neighbor data for the new dse-k8s cluster [homer/public] - 10https://gerrit.wikimedia.org/r/826525 (https://phabricator.wikimedia.org/T310174) (owner: 10Btullis) [13:39:38] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host dse-k8s-worker1003.eqiad.wmnet [13:40:16] hmm [13:40:25] Changes synced to: mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet. [13:40:25] Please do any necessary checks before continuing. [13:40:28] kart_: I was wrong ;) [13:41:25] so you can test on mwdebug hosts or I can `Y` to do the full deployment [13:41:41] (sorry I am learning about that command) [13:42:00] 10SRE, 10Analytics-Radar, 10Machine-Learning-Team: Using docker in WMF production network outside of kubernetes - https://phabricator.wikimedia.org/T275551 (10akosiaris) >>! 
In T275551#8176053, @fkaelin wrote: > Reviving this discussion, though I renamed the phab to "Running docker containers in a non-produc... [13:42:59] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1120.eqiad.wmnet with reason: Maintenance [13:43:13] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1120.eqiad.wmnet with reason: Maintenance [13:43:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1120 (T312160)', diff saved to https://phabricator.wikimedia.org/P33115 and previous config saved to /var/cache/conftool/dbconfig/20220825-134318-ladsgroup.json [13:43:23] T312160: Adjust the field type of cx_corpora.cxc_timestamp to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T312160 [13:43:45] 10SRE, 10Infrastructure-Foundations, 10vm-requests: Resize webperf1004/2004 VM for arc-lamp - https://phabricator.wikimedia.org/T316223 (10Krinkle) [13:43:49] 10SRE, 10Infrastructure-Foundations, 10netbox, 10netops, 10Patch-For-Review: Netbox: use FHRP Groups feature - https://phabricator.wikimedia.org/T311218 (10ayounsi) The two patches above should allow us to use the `FHRP group` feature in production, without leveraging additional fields like priority or... 
[13:44:00] 10SRE, 10Infrastructure-Foundations, 10vm-requests: Resize webperf1004/2004 VM for arc-lamp - https://phabricator.wikimedia.org/T316223 (10Krinkle) [13:44:23] kart_: I am syncing it [13:45:10] RECOVERY - Check systemd state on dse-k8s-worker1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:45:22] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dse-k8s-worker1003.eqiad.wmnet [13:47:54] PROBLEM - Check systemd state on dse-k8s-worker1006 is CRITICAL: CRITICAL - degraded: The following units failed: kubelet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:48:31] 10SRE, 10Infrastructure-Foundations, 10vm-requests: Resize webperf1004/2004 VM for arc-lamp - https://phabricator.wikimedia.org/T316223 (10MoritzMuehlenhoff) Instead let's move these to a baremetal host instead? We're hitting some limits of what makes sense with Ganeti for these, one other issue is high rate... [13:49:26] (03PS4) 10Vgutierrez: trafficserver: Fix open_write_fail_action values [puppet] - 10https://gerrit.wikimedia.org/r/825709 (https://phabricator.wikimedia.org/T315911) [13:49:55] 10SRE, 10Infrastructure-Foundations, 10vm-requests: Resize webperf1004/2004 VM for arc-lamp - https://phabricator.wikimedia.org/T316223 (10MoritzMuehlenhoff) That would also be a fine opportunity to move away from the confusing naming scheme, given that webperf1003 and 1004 are totally different services, so... 
[13:52:07] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36982/console" [puppet] - 10https://gerrit.wikimedia.org/r/825709 (https://phabricator.wikimedia.org/T315911) (owner: 10Vgutierrez) [13:54:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2124', diff saved to https://phabricator.wikimedia.org/P33116 and previous config saved to /var/cache/conftool/dbconfig/20220825-135408-ladsgroup.json [13:56:14] (03CR) 10Herron: [C: 03+1] "Thanks for the fixes!" [puppet] - 10https://gerrit.wikimedia.org/r/826490 (owner: 10Andrea Denisse) [13:57:10] !log hashar@deploy1002 Finished scap: Backport for [[gerrit:826341|CX3 Build 0.2.0+20220825 (T309986 T301222)]] (duration: 24m 56s) [13:57:12] (03PS1) 10Hnowlan: changeprop: make num_workers configurable for jobqueue [deployment-charts] - 10https://gerrit.wikimedia.org/r/826570 (https://phabricator.wikimedia.org/T233196) [13:57:19] T309986: Persist selection of translation service across sessions - https://phabricator.wikimedia.org/T309986 [13:57:19] T301222: Instrumentation of new SX entrypoints - https://phabricator.wikimedia.org/T301222 [13:57:51] hashar: Thanks! 
[13:58:21] (03CR) 10Herron: [C: 03+1] logstash: alerts to use yearly rotation [puppet] - 10https://gerrit.wikimedia.org/r/826385 (https://phabricator.wikimedia.org/T304924) (owner: 10Cwhite) [13:58:33] (03PS2) 10Hnowlan: changeprop: make num_workers configurable for jobqueue [deployment-charts] - 10https://gerrit.wikimedia.org/r/826570 (https://phabricator.wikimedia.org/T233196) [13:59:20] (03CR) 10Herron: [C: 03+1] logstash: set ecs routing only when the output is logstash [puppet] - 10https://gerrit.wikimedia.org/r/826384 (https://phabricator.wikimedia.org/T305013) (owner: 10Cwhite) [14:00:25] (03CR) 10Herron: [C: 03+1] logstash: add tcpircbot logging tests [puppet] - 10https://gerrit.wikimedia.org/r/824317 (https://phabricator.wikimedia.org/T257861) (owner: 10Cwhite) [14:01:57] (03PS5) 10Vgutierrez: trafficserver: Fix open_write_fail_action values [puppet] - 10https://gerrit.wikimedia.org/r/825709 (https://phabricator.wikimedia.org/T315911) [14:02:48] (03CR) 10Herron: [C: 03+1] logstash: use rsyslog-namespaced fields [puppet] - 10https://gerrit.wikimedia.org/r/824316 (https://phabricator.wikimedia.org/T315500) (owner: 10Cwhite) [14:02:54] (03PS1) 10Btullis: Add btullis to users to allow for router configuration [homer/public] - 10https://gerrit.wikimedia.org/r/826572 (https://phabricator.wikimedia.org/T310174) [14:04:54] (03PS1) 10Milimetric: Add datahub lineage plugin to the build [debs/airflow] (debian) - 10https://gerrit.wikimedia.org/r/826573 [14:05:41] (03CR) 10Herron: [C: 03+1] rsyslog: add rsyslog-namespaced fields to syslog_cee [puppet] - 10https://gerrit.wikimedia.org/r/824315 (https://phabricator.wikimedia.org/T315500) (owner: 10Cwhite) [14:06:12] (03CR) 10Milimetric: "Adding the latest version of this plugin. It should be forwards-compatible, so hopefully doesn't need lots of updating. 
But we may want " [debs/airflow] (debian) - 10https://gerrit.wikimedia.org/r/826573 (owner: 10Milimetric) [14:06:25] (03PS6) 10Vgutierrez: trafficserver: Fix open_write_fail_action values [puppet] - 10https://gerrit.wikimedia.org/r/825709 (https://phabricator.wikimedia.org/T315911) [14:06:31] (03CR) 10Ssingh: [C: 03+1] trafficserver: Fix open_write_fail_action values [puppet] - 10https://gerrit.wikimedia.org/r/825709 (https://phabricator.wikimedia.org/T315911) (owner: 10Vgutierrez) [14:07:33] !log cgoubert@cumin1001 START - Cookbook sre.hosts.reboot-single for host people2002.codfw.wmnet [14:07:51] (03CR) 10Ssingh: [C: 03+1] trafficserver: Fix open_write_fail_action values [puppet] - 10https://gerrit.wikimedia.org/r/825709 (https://phabricator.wikimedia.org/T315911) (owner: 10Vgutierrez) [14:08:07] kart_: you are welcome, and sorry for the delay [14:08:11] (03CR) 10Ayounsi: Add btullis to users to allow for router configuration (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/826572 (https://phabricator.wikimedia.org/T310174) (owner: 10Btullis) [14:09:12] (03CR) 10Vgutierrez: [C: 03+2] trafficserver: Fix open_write_fail_action values [puppet] - 10https://gerrit.wikimedia.org/r/825709 (https://phabricator.wikimedia.org/T315911) (owner: 10Vgutierrez) [14:09:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2124 (T316186)', diff saved to https://phabricator.wikimedia.org/P33117 and previous config saved to /var/cache/conftool/dbconfig/20220825-140915-ladsgroup.json [14:11:00] (03CR) 10Vgutierrez: [C: 03+1] Varnish: Stop sending analytics cookies to API (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/824793 (https://phabricator.wikimedia.org/T260943) (owner: 10BCornwall) [14:11:15] (03CR) 10Herron: [C: 03+1] logstash: add support for rsyslog-namespaced fields [puppet] - 10https://gerrit.wikimedia.org/r/824314 (https://phabricator.wikimedia.org/T315500) (owner: 10Cwhite) [14:11:28] !log cgoubert@cumin1001 END 
(PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host people2002.codfw.wmnet [14:11:33] (03CR) 10Btullis: Add btullis to users to allow for router configuration (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/826572 (https://phabricator.wikimedia.org/T310174) (owner: 10Btullis) [14:13:24] !log rebooting people1003 (people.wikimedia.org) [14:13:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:35] !log cgoubert@cumin1001 START - Cookbook sre.hosts.reboot-single for host people1003.eqiad.wmnet [14:15:58] !log finished rebooting people1003 (people.wikimedia.org) [14:16:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:28] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host people1003.eqiad.wmnet [14:20:08] (03PS1) 10Btullis: Revert "Add BGP neighbor data for the new dse-k8s cluster" [homer/public] - 10https://gerrit.wikimedia.org/r/826344 [14:20:47] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host dse-k8s-worker1004.eqiad.wmnet [14:21:09] (03CR) 10Btullis: [C: 03+2] Revert "Add BGP neighbor data for the new dse-k8s cluster" [homer/public] - 10https://gerrit.wikimedia.org/r/826344 (owner: 10Btullis) [14:21:52] (03Merged) 10jenkins-bot: Revert "Add BGP neighbor data for the new dse-k8s cluster" [homer/public] - 10https://gerrit.wikimedia.org/r/826344 (owner: 10Btullis) [14:23:02] (03PS1) 10Vgutierrez: trafficserver: Enable origin coalescing in cp600[78] [puppet] - 10https://gerrit.wikimedia.org/r/826576 (https://phabricator.wikimedia.org/T315911) [14:24:10] RECOVERY - Check systemd state on dse-k8s-worker1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:24:14] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36983/console" [puppet] - 
10https://gerrit.wikimedia.org/r/826576 (https://phabricator.wikimedia.org/T315911) (owner: 10Vgutierrez) [14:24:45] (03CR) 10Alexandros Kosiaris: [C: 03+1] Remove zookeeper_version [puppet] - 10https://gerrit.wikimedia.org/r/818863 (https://phabricator.wikimedia.org/T312539) (owner: 10Muehlenhoff) [14:24:49] (03PS1) 10FNegri: Add cloudcephosd1029 to the Ceph pool [puppet] - 10https://gerrit.wikimedia.org/r/826577 (https://phabricator.wikimedia.org/T314870) [14:28:36] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dse-k8s-worker1004.eqiad.wmnet [14:29:06] (03CR) 10Ssingh: [C: 03+1] trafficserver: Enable origin coalescing in cp600[78] [puppet] - 10https://gerrit.wikimedia.org/r/826576 (https://phabricator.wikimedia.org/T315911) (owner: 10Vgutierrez) [14:29:14] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2141.codfw.wmnet with reason: Maintenance [14:29:28] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2141.codfw.wmnet with reason: Maintenance [14:30:18] (03PS1) 10Muehlenhoff: Cookbooks to perform rolling restart/reboot of an LDAP replica cluster [cookbooks] - 10https://gerrit.wikimedia.org/r/826578 [14:30:51] (03PS2) 10Muehlenhoff: Cookbooks to perform rolling restart/reboot of an LDAP replica cluster [cookbooks] - 10https://gerrit.wikimedia.org/r/826578 [14:31:01] (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] trafficserver: Enable origin coalescing in cp600[78] [puppet] - 10https://gerrit.wikimedia.org/r/826576 (https://phabricator.wikimedia.org/T315911) (owner: 10Vgutierrez) [14:32:22] !log enable origin coalescing in ats-be@cp600[78] [expect crashes] - T315911 [14:32:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:27] T315911: ATS Read While Writer feature is wrongly configured - https://phabricator.wikimedia.org/T315911 [14:32:31] gotta love my optimism [14:34:00] (03CR) 10CI reject: [V: 
04-1] Cookbooks to perform rolling restart/reboot of an LDAP replica cluster [cookbooks] - 10https://gerrit.wikimedia.org/r/826578 (owner: 10Muehlenhoff) [14:34:14] :P [14:35:21] (03PS1) 10Btullis: Add BGP neighbor data for the new dse-k8s cluster [homer/public] - 10https://gerrit.wikimedia.org/r/826579 (https://phabricator.wikimedia.org/T310174) [14:35:40] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 5 days, 0:00:00 on ganeti2025.codfw.wmnet with reason: Remove node for eventual reimage, T311686 [14:35:45] T311686: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686 [14:35:55] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5 days, 0:00:00 on ganeti2025.codfw.wmnet with reason: Remove node for eventual reimage, T311686 [14:36:09] !log jhathaway@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on mirror1001.wikimedia.org with reason: New Kernel [14:36:22] !log jhathaway@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on mirror1001.wikimedia.org with reason: New Kernel [14:37:04] (03CR) 10David Caro: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/826577 (https://phabricator.wikimedia.org/T314870) (owner: 10FNegri) [14:42:16] (03CR) 10Eevans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/803520 (owner: 10Hnowlan) [14:42:26] !log jhathaway@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on mx2001.wikimedia.org with reason: New Kernel [14:42:39] !log jhathaway@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on mx2001.wikimedia.org with reason: New Kernel [14:43:54] !log jhathaway@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on mx1001.wikimedia.org with reason: New Kernel [14:44:08] !log jhathaway@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on mx1001.wikimedia.org with reason: New Kernel [14:44:55] (03PS3) 10Muehlenhoff: Cookbooks to perform rolling 
restart/reboot of an LDAP replica cluster [cookbooks] - 10https://gerrit.wikimedia.org/r/826578 [14:46:39] hi mutante: Andrew Otto suggested I reach out to you to see if you could help us get this patch merged https://gerrit.wikimedia.org/r/c/operations/puppet/+/811312 [14:47:36] (03CR) 10Alexandros Kosiaris: [C: 03+1] C:profile::docker::storage removal and cleanup [puppet] - 10https://gerrit.wikimedia.org/r/826245 (https://phabricator.wikimedia.org/T315977) (owner: 10Clément Goubert) [14:49:06] (03PS1) 10Hnowlan: Add blubber config file [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/826585 (https://phabricator.wikimedia.org/T312104) [14:49:39] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Degraded RAID on ms-be2035 - https://phabricator.wikimedia.org/T314509 (10Papaul) Dell technician will be on site today between 10am CT and 2pm. Is it possible to get this server offline for the back plane replacement? Thanks [14:51:40] (03CR) 10FNegri: Add cloudcephosd1029 to the Ceph pool (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/826577 (https://phabricator.wikimedia.org/T314870) (owner: 10FNegri) [14:51:50] (03CR) 10FNegri: [C: 03+2] Add cloudcephosd1029 to the Ceph pool [puppet] - 10https://gerrit.wikimedia.org/r/826577 (https://phabricator.wikimedia.org/T314870) (owner: 10FNegri) [14:52:36] 10SRE, 10Image-Suggestions, 10Patch-For-Review, 10Structured-Data-Backlog (Current Work): [M] Schedule image suggestions notifications - https://phabricator.wikimedia.org/T300024 (10CBogen) Tagging #sre in hopes that someone on clinic duty can help us get this patch merged, thanks! 
[14:53:13] (03CR) 10Hnowlan: [C: 03+2] Add blubber config file [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/826585 (https://phabricator.wikimedia.org/T312104) (owner: 10Hnowlan) [14:54:44] (03Merged) 10jenkins-bot: Add blubber config file [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/826585 (https://phabricator.wikimedia.org/T312104) (owner: 10Hnowlan) [14:56:45] (03PS1) 10Vgutierrez: trafficserver: Get rid of disable_coalescing() in Lua [puppet] - 10https://gerrit.wikimedia.org/r/826586 (https://phabricator.wikimedia.org/T315911) [14:57:32] (03PS8) 10BCornwall: varnish: Stop sending analytics cookies to API [puppet] - 10https://gerrit.wikimedia.org/r/824793 (https://phabricator.wikimedia.org/T260943) [14:59:54] (03PS1) 10DCausse: wcqs: enable proper URI schemes for commons [puppet] - 10https://gerrit.wikimedia.org/r/826589 (https://phabricator.wikimedia.org/T314703) [15:00:36] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Grant private data access to Amanda Bittaker - https://phabricator.wikimedia.org/T316140 (10Ottomata) Approved. 
[15:01:04] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Grant private data access to Amanda Bittaker - https://phabricator.wikimedia.org/T316140 (10Ladsgroup) [15:01:12] (03PS3) 10Ladsgroup: Admin: Add Amanda Bittaker to analytics-private-data [puppet] - 10https://gerrit.wikimedia.org/r/826516 (https://phabricator.wikimedia.org/T316140) [15:01:19] (03PS2) 10Btullis: Add btullis to users to allow for router configuration [homer/public] - 10https://gerrit.wikimedia.org/r/826572 (https://phabricator.wikimedia.org/T310174) [15:01:21] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] Admin: Add Amanda Bittaker to analytics-private-data [puppet] - 10https://gerrit.wikimedia.org/r/826516 (https://phabricator.wikimedia.org/T316140) (owner: 10Ladsgroup) [15:03:48] (03CR) 10CI reject: [V: 04-1] wcqs: enable proper URI schemes for commons [puppet] - 10https://gerrit.wikimedia.org/r/826589 (https://phabricator.wikimedia.org/T314703) (owner: 10DCausse) [15:03:54] (03PS2) 10Vgutierrez: trafficserver: Get rid of disable_coalescing() in Lua [puppet] - 10https://gerrit.wikimedia.org/r/826586 (https://phabricator.wikimedia.org/T315911) [15:04:15] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Grant private data access to Amanda Bittaker - https://phabricator.wikimedia.org/T316140 (10Ladsgroup) 05Open→03Resolved You should be able to access it in half an hour or so. If not, please reopen this ticket. Thank you for flying with Wikimedia SRE. [15:07:16] (03CR) 10Hashar: [C: 03+1] "Moritz and I talked about it this morning, then we had a Swift outage and I was dealing with the MediaWiki train. 
It is a bit late to get " [puppet] - 10https://gerrit.wikimedia.org/r/826237 (https://phabricator.wikimedia.org/T315942) (owner: 10Hashar) [15:09:23] (03PS8) 10David Caro: ceph.bootstrap_and_add: add support to change the osd class type [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/824153 (https://phabricator.wikimedia.org/T314870) [15:10:25] 10SRE, 10ops-codfw, 10Discovery-Search: elastic2054 is down with memory error - https://phabricator.wikimedia.org/T315989 (10Papaul) 05Open→03Resolved memory replaced, system is back online. [15:13:38] RECOVERY - ElasticSearch numbers of masters eligible - 9643 on search.svc.codfw.wmnet is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert [15:15:56] (03CR) 10CI reject: [V: 04-1] ceph.bootstrap_and_add: add support to change the osd class type [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/824153 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro) [15:16:17] (03CR) 10Btullis: [C: 03+1] C:profile::docker::storage removal and cleanup [puppet] - 10https://gerrit.wikimedia.org/r/826245 (https://phabricator.wikimedia.org/T315977) (owner: 10Clément Goubert) [15:17:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1120 (T312160)', diff saved to https://phabricator.wikimedia.org/P33118 and previous config saved to /var/cache/conftool/dbconfig/20220825-151731-ladsgroup.json [15:17:36] 10SRE, 10Infrastructure-Foundations, 10vm-requests: Resize webperf1004/2004 VM for arc-lamp - https://phabricator.wikimedia.org/T316223 (10Dzahn) +1 to not using the same names for the different webperf roles, thought the same before, should match more the puppet role [15:17:37] T312160: Adjust the field type of cx_corpora.cxc_timestamp to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T312160 [15:18:48] (03PS1) 10Bartosz Dziewoński: Update VE core submodule to master (d4c438548) [extensions/VisualEditor] (wmf/1.39.0-wmf.26) - 
10https://gerrit.wikimedia.org/r/826345 (https://phabricator.wikimedia.org/T316219) [15:18:58] 10SRE, 10Infrastructure-Foundations, 10vm-requests: Resize webperf1004/2004 VM for arc-lamp - https://phabricator.wikimedia.org/T316223 (10Dzahn) And yea, like the history says the discussion was to start from scratch once we get over the 16GB RAM limit. Hardware sounds the right way indeed. [15:19:47] (03PS9) 10David Caro: ceph.bootstrap_and_add: add support to change the osd class type [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/824153 (https://phabricator.wikimedia.org/T314870) [15:22:44] PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:22:44] (03Abandoned) 10BCornwall: admin: Add SSH key to mraish user [puppet] - 10https://gerrit.wikimedia.org/r/818397 (https://phabricator.wikimedia.org/T313429) (owner: 10Vgutierrez) [15:23:20] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Grant private data access to Amanda Bittaker - https://phabricator.wikimedia.org/T316140 (10Ottomata) (^ lol) [15:23:43] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2158.codfw.wmnet with reason: Maintenance [15:23:46] (03CR) 10BCornwall: varnish: Stop sending analytics cookies to API (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/824793 (https://phabricator.wikimedia.org/T260943) (owner: 10BCornwall) [15:23:56] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2158.codfw.wmnet with reason: Maintenance [15:23:58] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2095.codfw.wmnet with reason: Maintenance [15:24:11] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2095.codfw.wmnet with reason: Maintenance [15:24:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 
'Depooling db2158 (T316186)', diff saved to https://phabricator.wikimedia.org/P33119 and previous config saved to /var/cache/conftool/dbconfig/20220825-152417-ladsgroup.json [15:26:46] !log bmansurov@deploy1002 Started deploy [airflow-dags/research@5712187]: (no justification provided) [15:27:07] !log bmansurov@deploy1002 Finished deploy [airflow-dags/research@5712187]: (no justification provided) (duration: 00m 20s) [15:27:22] (03CR) 10CI reject: [V: 04-1] ceph.bootstrap_and_add: add support to change the osd class type [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/824153 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro) [15:27:50] (03CR) 10Dzahn: "yep, all sounds good to me. back to this on Monday" [puppet] - 10https://gerrit.wikimedia.org/r/826237 (https://phabricator.wikimedia.org/T315942) (owner: 10Hashar) [15:29:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2158 (T316186)', diff saved to https://phabricator.wikimedia.org/P33120 and previous config saved to /var/cache/conftool/dbconfig/20220825-152932-ladsgroup.json [15:30:41] 10SRE, 10Infrastructure-Foundations, 10cloud-services-team (Kanban): Auth extremely slow on clouddumps100[12] - https://phabricator.wikimedia.org/T316123 (10Andrew) + Moritz because I think he had a patch in the works. 
If not let me know and I can likely figure it out :) [15:31:37] !log bmansurov@deploy1002 Started deploy [airflow-dags/research@5712187]: (no justification provided) [15:31:47] !log bmansurov@deploy1002 Finished deploy [airflow-dags/research@5712187]: (no justification provided) (duration: 00m 09s) [15:31:47] PROBLEM - Host ores2009.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:32:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1120', diff saved to https://phabricator.wikimedia.org/P33121 and previous config saved to /var/cache/conftool/dbconfig/20220825-153237-ladsgroup.json [15:33:00] 10SRE, 10Cloud-Services, 10Infrastructure-Foundations, 10netops: Undocumented IP on WMCS network - https://phabricator.wikimedia.org/T315955 (10Andrew) a:05Andrew→03cmooney This additional range was set up by @cmooney -- Cathal, is this something you can document as needed? [15:39:23] (03PS2) 10DCausse: wcqs: enable proper URI schemes for commons [puppet] - 10https://gerrit.wikimedia.org/r/826589 (https://phabricator.wikimedia.org/T314703) [15:41:47] !log bmansurov@deploy1002 Started deploy [airflow-dags/research@5712187]: (no justification provided) [15:41:57] !log bmansurov@deploy1002 Finished deploy [airflow-dags/research@5712187]: (no justification provided) (duration: 00m 09s) [15:42:45] !log restart backup1002 (interrupted before), backup1003, backup2003 [15:42:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:44:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2158', diff saved to https://phabricator.wikimedia.org/P33122 and previous config saved to /var/cache/conftool/dbconfig/20220825-154438-ladsgroup.json [15:47:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1120', diff saved to https://phabricator.wikimedia.org/P33123 and previous config saved to /var/cache/conftool/dbconfig/20220825-154743-ladsgroup.json [15:50:14] !log 
bmansurov@deploy1002 Started deploy [airflow-dags/research@5712187]: (no justification provided) [15:50:23] !log bmansurov@deploy1002 Finished deploy [airflow-dags/research@5712187]: (no justification provided) (duration: 00m 09s) [15:52:07] !log bmansurov@deploy1002 Started deploy [airflow-dags/research@5712187]: (no justification provided) [15:52:16] !log bmansurov@deploy1002 Finished deploy [airflow-dags/research@5712187]: (no justification provided) (duration: 00m 09s) [15:54:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2158', diff saved to https://phabricator.wikimedia.org/P33124 and previous config saved to /var/cache/conftool/dbconfig/20220825-155401-ladsgroup.json [15:54:34] (03CR) 10Hnowlan: [C: 03+2] Set expiry headers on thumbnails [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/489022 (https://phabricator.wikimedia.org/T211661) (owner: 10Gilles) [15:54:35] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2169.codfw.wmnet with reason: Maintenance [15:55:00] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2169.codfw.wmnet with reason: Maintenance [15:55:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2169:3316 (T316186)', diff saved to https://phabricator.wikimedia.org/P33125 and previous config saved to /var/cache/conftool/dbconfig/20220825-155506-ladsgroup.json [15:55:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2169:3317 (T316186)', diff saved to https://phabricator.wikimedia.org/P33126 and previous config saved to /var/cache/conftool/dbconfig/20220825-155529-ladsgroup.json [15:57:07] (03CR) 10DCausse: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/826589 (https://phabricator.wikimedia.org/T314703) (owner: 10DCausse) [16:00:05] jbond and rzl: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Puppet request window. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220825T1600). [16:00:05] No Gerrit patches in the queue for this window AFAICS. [16:00:31] !log bmansurov@deploy1002 Started deploy [airflow-dags/research@5712187]: (no justification provided) [16:00:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3316 (T316186)', diff saved to https://phabricator.wikimedia.org/P33127 and previous config saved to /var/cache/conftool/dbconfig/20220825-160036-ladsgroup.json [16:00:40] !log bmansurov@deploy1002 Finished deploy [airflow-dags/research@5712187]: (no justification provided) (duration: 00m 09s) [16:00:52] (03Abandoned) 10Andrew Bogott: OpenStack nova.conf: set reclaim_instance_interval to half an hour [puppet] - 10https://gerrit.wikimedia.org/r/798772 (owner: 10Andrew Bogott) [16:01:13] (03PS1) 10Milimetric: airflow: disable lazy loading plugins [puppet] - 10https://gerrit.wikimedia.org/r/826600 [16:01:22] (03PS1) 10Ori: Increase roll-out of query-sorting to 15% [puppet] - 10https://gerrit.wikimedia.org/r/826601 (https://phabricator.wikimedia.org/T314868) [16:02:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1120 (T312160)', diff saved to https://phabricator.wikimedia.org/P33128 and previous config saved to /var/cache/conftool/dbconfig/20220825-160250-ladsgroup.json [16:02:52] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance [16:02:54] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance [16:02:55] T312160: Adjust the field type of cx_corpora.cxc_timestamp to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T312160 [16:04:27] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Degraded RAID on ms-be2035 - https://phabricator.wikimedia.org/T314509 (10Papaul) p:05Triage→03Medium [16:04:47] 10SRE, 
10SRE-swift-storage, 10ops-codfw, 10DC-Ops: thanos-be2002 sdj failed - https://phabricator.wikimedia.org/T314913 (10Papaul) p:05Triage→03Medium [16:07:22] (03Merged) 10jenkins-bot: Set expiry headers on thumbnails [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/489022 (https://phabricator.wikimedia.org/T211661) (owner: 10Gilles) [16:07:23] !log bmansurov@deploy1002 Started deploy [airflow-dags/research@5712187]: (no justification provided) [16:07:32] !log bmansurov@deploy1002 Finished deploy [airflow-dags/research@5712187]: (no justification provided) (duration: 00m 09s) [16:07:36] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops, 10Patch-For-Review: Degraded RAID on ms-be2067 - https://phabricator.wikimedia.org/T314049 (10Papaul) Good afternoon Papaul, I have submitted DPS 432866984 for the replacement backplane to ship out. Service is scheduled for Thursday 08/25/22. The tech w... [16:08:35] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops, 10Patch-For-Review: Degraded RAID on ms-be2067 - https://phabricator.wikimedia.org/T314049 (10Papaul) Dell technician will be on site today between 10am CT and 2pm. Is it possible to get this server offline for the back plane replacement? 
Thanks [16:14:10] (03PS1) 10Hashar: doc: document how to run httpbb tests [puppet] - 10https://gerrit.wikimedia.org/r/826604 [16:15:05] (03PS3) 10DCausse: wcqs: enable proper URI schemes for commons [puppet] - 10https://gerrit.wikimedia.org/r/826589 (https://phabricator.wikimedia.org/T314703) [16:15:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3316', diff saved to https://phabricator.wikimedia.org/P33129 and previous config saved to /var/cache/conftool/dbconfig/20220825-161544-ladsgroup.json [16:18:21] (03PS3) 10Hashar: doc: properly redirect back compat URLs [puppet] - 10https://gerrit.wikimedia.org/r/824542 (https://phabricator.wikimedia.org/T315541) [16:18:23] (03PS2) 10Hashar: doc: document how to run httpbb tests [puppet] - 10https://gerrit.wikimedia.org/r/826604 [16:19:28] (03CR) 10DCausse: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/826589 (https://phabricator.wikimedia.org/T314703) (owner: 10DCausse) [16:19:30] (03CR) 10Hashar: "Daniel, got the documentation from your change introducing httpbb tests for doc.wikimedia.org 415616c37394d300700a6810797760e53aa702b3" [puppet] - 10https://gerrit.wikimedia.org/r/826604 (owner: 10Hashar) [16:19:38] (03PS2) 10Milimetric: airflow: disable lazy plugins and add datahub conn [puppet] - 10https://gerrit.wikimedia.org/r/826600 [16:21:14] (03CR) 10Hashar: "The back compatibility Apache redirects got broken at some point in the past. This convert them to Rewrite rules which I have tested local" [puppet] - 10https://gerrit.wikimedia.org/r/824542 (https://phabricator.wikimedia.org/T315541) (owner: 10Hashar) [16:23:00] (03PS1) 10Dzahn: Revert "Revert "c:spamassassin move Spamassassin updates from crontab"" [puppet] - 10https://gerrit.wikimedia.org/r/826607 [16:23:56] (03CR) 10Dzahn: "@AOkoth Could you maybe take this and see if you can reproduce and catch the error we saw yesterday?" 
[puppet] - 10https://gerrit.wikimedia.org/r/826607 (owner: 10Dzahn) [16:24:03] RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:27:34] (03CR) 10Dzahn: "yep, confirmed it works that way. the only problem is of course the part that tests come after deployment." [puppet] - 10https://gerrit.wikimedia.org/r/826604 (owner: 10Hashar) [16:28:08] (03CR) 10Dzahn: [C: 03+2] doc: document how to run httpbb tests [puppet] - 10https://gerrit.wikimedia.org/r/826604 (owner: 10Hashar) [16:28:21] (03CR) 10Dzahn: [C: 03+1] doc: document how to run httpbb tests [puppet] - 10https://gerrit.wikimedia.org/r/826604 (owner: 10Hashar) [16:28:54] mutante: I wanted to explore how to provision an apache from puppet and run httpbb against that but gave up. It is a long tail of complexity :) [16:29:18] I guess one way is to deploy the httpbb tests on the deployment server and the target host then run the tests manually [16:29:23] (03CR) 10Dzahn: [C: 03+1] "one thing though. if the tests are not changed and succeed both before and after the redirect change.. 
then aren't they missing tests to t" [puppet] - 10https://gerrit.wikimedia.org/r/826604 (owner: 10Hashar) [16:29:46] 10SRE, 10Thumbor, 10Thumbor Migration, 10serviceops, and 2 others: Migrate thumbor to Kubernetes - https://phabricator.wikimedia.org/T233196 (10hnowlan) [16:30:13] (03PS1) 10Jdrewniak: Add clearfix to .mw-body-subheader [skins/Vector] (wmf/1.39.0-wmf.26) - 10https://gerrit.wikimedia.org/r/826608 (https://phabricator.wikimedia.org/T316134) [16:30:43] (03PS3) 10Ebernhardson: query_service: Avoid passing content body to internal auth endpoints [puppet] - 10https://gerrit.wikimedia.org/r/825925 (https://phabricator.wikimedia.org/T306899) [16:30:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3316', diff saved to https://phabricator.wikimedia.org/P33130 and previous config saved to /var/cache/conftool/dbconfig/20220825-163050-ladsgroup.json [16:32:36] (03PS3) 10Vgutierrez: trafficserver: Get rid of disable_coalescing() in Lua [puppet] - 10https://gerrit.wikimedia.org/r/826586 (https://phabricator.wikimedia.org/T315911) [16:32:44] (03CR) 10Bking: [C: 03+2] wcqs: enable proper URI schemes for commons [puppet] - 10https://gerrit.wikimedia.org/r/826589 (https://phabricator.wikimedia.org/T314703) (owner: 10DCausse) [16:33:38] hashar: for doc specifically, we had a test setup in devtools but gave up on it afair [16:34:04] hashar: for mw appservers the way we do it is to disable puppet on mw*, enable it only on mwdebug, run puppet, run tests.. if we like it..enable puppet on all [16:34:10] yeah I built that when I have split the published artifacts to their own dir (`/srv/doc` iirc) [16:34:15] of course we dont have docdebug [16:35:15] hashar: i think the realistic way to test is to disable puppet on doc1002, run puppet on doc2001, run test against doc2001, enable on both [16:35:43] hashar: but re: setup apache in cloud VPS, I made the role simplelamp2 for that, just apply and setups apache [16:35:49] possibly yes. 
I guess we will find out next time we have a big Apache configuration change to make [16:36:51] so you have a change that fixes something, but if the tests work already before the fix..maybe it is missing a test for something [16:37:09] !log eevans@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on ms-be2067.codfw.wmnet with reason: backplane replacement [16:37:23] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on ms-be2067.codfw.wmnet with reason: backplane replacement [16:37:30] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops, 10Patch-For-Review: Degraded RAID on ms-be2067 - https://phabricator.wikimedia.org/T314049 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=c4a39dbe-2fb0-4745-99c3-76e40de3820e) set by eevans@cumin1001 for 1 day, 0:00:00 on 1 host(s) a... [16:37:56] (03CR) 10Hashar: doc: document how to run httpbb tests (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/826604 (owner: 10Hashar) [16:38:14] mutante: yeah I get your point, then the redirects are currently broken ;) [16:38:30] I could theoretically write a test which shows they give a 404 [16:38:47] then amend the current changes which would replace the 404 tests with 302 ones [16:39:08] but I don't think that adds any value in this case [16:39:36] (03CR) 10Dzahn: [C: 03+1] "my point was just to add tests for the "compat URLs" because I notice you say they are broken but all the tests succeed" [puppet] - 10https://gerrit.wikimedia.org/r/826604 (owner: 10Hashar) [16:40:03] !log shutting down ms-be2067.codfw.wmnet for backplane replacement -- T314049 [16:40:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:40:09] T314049: Degraded RAID on ms-be2067 - https://phabricator.wikimedia.org/T314049 [16:40:48] (03PS10) 10David Caro: ceph.bootstrap_and_add: add support to change the osd class type [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/824153 
(https://phabricator.wikimedia.org/T314870) [16:41:11] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops, 10Patch-For-Review: Degraded RAID on ms-be2067 - https://phabricator.wikimedia.org/T314049 (10Eevans) @Papaul the host is shut down; Please let me know as soon as it's back up [16:42:21] (03PS3) 10Dzahn: doc: document how to run httpbb tests [puppet] - 10https://gerrit.wikimedia.org/r/826604 (owner: 10Hashar) [16:42:44] (03PS2) 10Bernard Wang: Add clearfix to .mw-body-subheader [skins/Vector] (wmf/1.39.0-wmf.26) - 10https://gerrit.wikimedia.org/r/826608 (https://phabricator.wikimedia.org/T316134) (owner: 10Jdrewniak) [16:42:54] (03CR) 10Dzahn: [C: 03+2] "rebased, merging, comments-only" [puppet] - 10https://gerrit.wikimedia.org/r/826604 (owner: 10Hashar) [16:44:20] I think I broke the existing `Redirect` when introducing the `RewriteRule` [16:44:29] apache is full of surprises [16:45:50] it's frankly amazing how hard it is to properly configure most http servers :) nginx is sadly almost as bad as apache ... [16:45:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3316 (T316186)', diff saved to https://phabricator.wikimedia.org/P33131 and previous config saved to /var/cache/conftool/dbconfig/20220825-164556-ladsgroup.json [16:46:22] the good news is that we have Apache hackers at the wmf :-] [16:47:31] (03CR) 10Dzahn: [C: 03+2] "I had it and still failed to save it, will reproduce it." 
[puppet] - 10https://gerrit.wikimedia.org/r/826331 (owner: 10Dzahn) [16:48:33] (03CR) 10Bking: [C: 03+2] query_service: Avoid passing content body to internal auth endpoints [puppet] - 10https://gerrit.wikimedia.org/r/825925 (https://phabricator.wikimedia.org/T306899) (owner: 10Ebernhardson) [16:49:03] (03CR) 10CI reject: [V: 04-1] ceph.bootstrap_and_add: add support to change the osd class type [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/824153 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro) [16:49:53] (03CR) 10Dzahn: "@cmooney there is a follow-up at https://gerrit.wikimedia.org/r/c/operations/puppet/+/824542" [puppet] - 10https://gerrit.wikimedia.org/r/824495 (https://phabricator.wikimedia.org/T315541) (owner: 10Hashar) [16:51:57] (03CR) 10Dzahn: "ok, thank you. I will comment here if the alert comes back ever." [puppet] - 10https://gerrit.wikimedia.org/r/826513 (owner: 10Slyngshede) [16:52:12] (03Abandoned) 10Dzahn: Revert "Revert "c:spamassassin move Spamassassin updates from crontab"" [puppet] - 10https://gerrit.wikimedia.org/r/826607 (owner: 10Dzahn) [16:52:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3317 (T316186)', diff saved to https://phabricator.wikimedia.org/P33132 and previous config saved to /var/cache/conftool/dbconfig/20220825-165213-ladsgroup.json [16:52:28] (03PS4) 10Hashar: doc: properly redirect back compat URLs [puppet] - 10https://gerrit.wikimedia.org/r/824542 (https://phabricator.wikimedia.org/T315541) [16:53:17] (03CR) 10Hashar: "rebased since the child change got cherry picked and merged and ended up causing a conflict." [puppet] - 10https://gerrit.wikimedia.org/r/824542 (https://phabricator.wikimedia.org/T315541) (owner: 10Hashar) [16:54:03] (03PS1) 10BryanDavis: developer-portal: Bump container to 2022-08-23-080429-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/826627 [17:00:04] bd808: May I have your attention please! 
Technical Engagement weekly deploy (Toolhub, Developer portal, Striker). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220825T1700) [17:03:59] 10SRE, 10Traffic, 10Patch-For-Review: ATS should alert if the number of total or active connections reached maximum - https://phabricator.wikimedia.org/T292815 (10BCornwall) Change of plans: Kwaku has expressed an interest in backwards-compatibility so ATS 8 support will be added. [17:04:01] mutante I rebased the apache redirect patch since it ended up conflicting ;) [17:04:16] (03CR) 10BryanDavis: [C: 03+2] developer-portal: Bump container to 2022-08-23-080429-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/826627 (owner: 10BryanDavis) [17:04:30] (03PS3) 10Ryan Kemper: opensearch: replace outdated config [puppet] - 10https://gerrit.wikimedia.org/r/826383 (https://phabricator.wikimedia.org/T308676) (owner: 10Bking) [17:07:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3317', diff saved to https://phabricator.wikimedia.org/P33133 and previous config saved to /var/cache/conftool/dbconfig/20220825-170719-ladsgroup.json [17:07:30] (03Merged) 10jenkins-bot: developer-portal: Bump container to 2022-08-23-080429-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/826627 (owner: 10BryanDavis) [17:08:50] !log bd808@deploy1002 helmfile [staging] START helmfile.d/services/developer-portal: apply [17:09:15] !log bd808@deploy1002 helmfile [staging] DONE helmfile.d/services/developer-portal: apply [17:09:24] !log bd808@deploy1002 helmfile [codfw] START helmfile.d/services/developer-portal: apply [17:10:03] !log bd808@deploy1002 helmfile [codfw] DONE helmfile.d/services/developer-portal: apply [17:10:11] !log bd808@deploy1002 helmfile [eqiad] START helmfile.d/services/developer-portal: apply [17:10:58] !log bd808@deploy1002 helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply [17:21:56] (03PS4) 10Vgutierrez: trafficserver: Get rid of 
disable_coalescing() in Lua [puppet] - 10https://gerrit.wikimedia.org/r/826586 (https://phabricator.wikimedia.org/T315911) [17:22:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3317', diff saved to https://phabricator.wikimedia.org/P33135 and previous config saved to /var/cache/conftool/dbconfig/20220825-172225-ladsgroup.json [17:29:11] (03CR) 10Vgutierrez: "Tested cookie hiding for caching purposes in our WMCS environment, works as expected:" [puppet] - 10https://gerrit.wikimedia.org/r/826586 (https://phabricator.wikimedia.org/T315911) (owner: 10Vgutierrez) [17:29:13] (KubernetesRsyslogDown) firing: (8) rsyslog on dse-k8s-worker1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [17:29:47] (03PS1) 10Bking: deployment-prep: remove defunct elastic hosts [puppet] - 10https://gerrit.wikimedia.org/r/826630 (https://phabricator.wikimedia.org/T316240) [17:36:31] (03CR) 10Bking: [C: 03+2] elastic: don't start es7 unit until we tell it [puppet] - 10https://gerrit.wikimedia.org/r/826396 (https://phabricator.wikimedia.org/T308676) (owner: 10Ryan Kemper) [17:37:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3317 (T316186)', diff saved to https://phabricator.wikimedia.org/P33136 and previous config saved to /var/cache/conftool/dbconfig/20220825-173731-ladsgroup.json [17:38:20] (03CR) 10Hashar: "We could surely use some monitoring for the releng images. Probably not by failing the unit, but some kind of weekly report by email or si" [puppet] - 10https://gerrit.wikimedia.org/r/826211 (owner: 10Muehlenhoff) [17:38:57] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1140.eqiad.wmnet with reason: Maintenance [17:39:04] mutante: would you merge the doc redirect fix up https://gerrit.wikimedia.org/r/c/operations/puppet/+/824542 ? 
the other comments-only change got merged so I thought you would deploy the fix as well :) [17:39:11] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1140.eqiad.wmnet with reason: Maintenance [17:41:24] hashar: no, I was not going to merge that right now based on the history with doc redirects and the tests thing, I added reviewers and the person who merged the last change though [17:42:05] I merged the other thing because it was comments-only and confirmed the docs [17:43:23] (03CR) 10Bking: [C: 03+2] elastic: don't start es 7 until ready [cookbooks] - 10https://gerrit.wikimedia.org/r/826397 (owner: 10Ryan Kemper) [17:44:02] well that previous change got blindly merged as part of clinic duty [17:44:07] (03CR) 10ArielGlenn: "Hannah and I looked at this, seems good to me, merge at will." [puppet] - 10https://gerrit.wikimedia.org/r/825685 (owner: 10Majavah) [17:44:09] but well guess that can wait ;) [17:44:12] maybe that was the issue then [17:44:27] (03CR) 10Majavah: [V: 03+1] P:dumps: remove ipv4/ipv6 separation from internal_rsync_clients (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/825685 (owner: 10Majavah) [17:44:35] clinic duty does not even include merging puppet changes [17:44:55] well that is how I get those puppet patches merged most of the time [17:45:12] anyway it is not an urgent patch [17:45:14] I would prefer if we could change that [17:45:27] ok, great [17:46:47] (03CR) 10Majavah: [C: 03+1] "looks fine to me" [puppet] - 10https://gerrit.wikimedia.org/r/819180 (https://phabricator.wikimedia.org/T310395) (owner: 10Ahmon Dancy) [17:47:21] I have some other things going on but it won't be forgotten, it's in the queue [17:47:30] 👍🏾 Thanks Daniel [17:47:55] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2115.codfw.wmnet with reason: Maintenance [17:48:09] oh.. wrong channel. thanks anyway. 
:-) [17:48:20] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2115.codfw.wmnet with reason: Maintenance [17:48:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2115 (T312160)', diff saved to https://phabricator.wikimedia.org/P33137 and previous config saved to /var/cache/conftool/dbconfig/20220825-174826-ladsgroup.json [17:48:32] T312160: Adjust the field type of cx_corpora.cxc_timestamp to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T312160 [17:49:27] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1131.eqiad.wmnet with reason: Maintenance [17:49:34] mutante: no worries :-] [17:49:40] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1131.eqiad.wmnet with reason: Maintenance [17:49:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1131 (T316186)', diff saved to https://phabricator.wikimedia.org/P33138 and previous config saved to /var/cache/conftool/dbconfig/20220825-174946-ladsgroup.json [17:54:55] PROBLEM - Check systemd state on ms-fe2009 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:57:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131 (T316186)', diff saved to https://phabricator.wikimedia.org/P33139 and previous config saved to /var/cache/conftool/dbconfig/20220825-175715-ladsgroup.json [18:00:04] hashar and dduvall: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220825T1800). [18:01:30] I am going to use the train window to deploy a new version of scap [18:04:03] unrelatedly, could anyone here review this short patch that i'd like to backport later today? 
https://gerrit.wikimedia.org/r/c/mediawiki/skins/Timeless/+/826633 [18:04:37] (03PS1) 10Stang: Define default value for "wmgSiteLogoVariants" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/826635 (https://phabricator.wikimedia.org/T308620) [18:05:29] PROBLEM - Host ms-be2067.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:06:47] RECOVERY - Check systemd state on ms-fe2009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:11:46] !log dancy@deploy1002 install-world aborted: (duration: 00m 02s) [18:11:51] !log dancy@deploy1002 Installing scap version "4.15.0" for 557 hosts [18:12:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131', diff saved to https://phabricator.wikimedia.org/P33140 and previous config saved to /var/cache/conftool/dbconfig/20220825-181221-ladsgroup.json [18:13:21] !log dancy@deploy1002 Installation of scap version "4.15.0" completed for 557 hosts [18:18:44] !log bmansurov@deploy1002 Started deploy [airflow-dags/research@5712187]: (no justification provided) [18:18:53] !log bmansurov@deploy1002 Finished deploy [airflow-dags/research@5712187]: (no justification provided) (duration: 00m 09s) [18:19:26] !log dancy@deploy1002 rebuilt and synchronized wikiversions files: (no justification provided) [18:20:40] (03PS1) 10Stang: Add language fallback support for wmgSiteLogoVariants [mediawiki-config] - 10https://gerrit.wikimedia.org/r/826636 (https://phabricator.wikimedia.org/T308620) [18:22:32] (03CR) 10Dzahn: [C: 03+2] Add systemd timer to run scap stage-train on Tuesday morning (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/819180 (https://phabricator.wikimedia.org/T310395) (owner: 10Ahmon Dancy) [18:22:42] (03PS1) 10Stang: zhwiki: Use wmgSiteLogoVariantFallback to reduce duplicated code [mediawiki-config] - 10https://gerrit.wikimedia.org/r/826637 (https://phabricator.wikimedia.org/T308620) [18:24:12] (03Abandoned) 
10Stang: Add wmgSiteLogoVariants support for Chinese Wikimedia projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/822197 (https://phabricator.wikimedia.org/T308620) (owner: 10Stang) [18:25:47] (03PS2) 10Stang: zhwiki: Use wmgSiteLogoVariantFallback to reduce duplicated code [mediawiki-config] - 10https://gerrit.wikimedia.org/r/826637 (https://phabricator.wikimedia.org/T308620) [18:27:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131', diff saved to https://phabricator.wikimedia.org/P33141 and previous config saved to /var/cache/conftool/dbconfig/20220825-182727-ladsgroup.json [18:27:37] (03PS3) 10Stang: zhwiki: Use wmgSiteLogoVariantFallback to reduce duplicated code [mediawiki-config] - 10https://gerrit.wikimedia.org/r/826637 (https://phabricator.wikimedia.org/T308620) [18:31:35] RECOVERY - Host ms-be2067.mgmt is UP: PING OK - Packet loss = 0%, RTA = 34.19 ms [18:33:33] !log rolling restart of eventgate-analytics-external to pick up retroactive schema change for android schemas in T316047 [18:33:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:33:37] T316047: Make provisions for geodata in all MEP schemas - https://phabricator.wikimedia.org/T316047 [18:33:45] !log otto@deploy1002 helmfile [staging] START helmfile.d/services/eventgate-analytics-external: sync [18:34:07] !log otto@deploy1002 helmfile [staging] DONE helmfile.d/services/eventgate-analytics-external: sync [18:34:18] !log otto@deploy1002 helmfile [codfw] START helmfile.d/services/eventgate-analytics-external: sync [18:35:01] !log otto@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventgate-analytics-external: sync [18:35:33] (03CR) 10Dzahn: [C: 03+2] "change has been deployed. on deploy1002 the timer and service has been created but of course it's just waiting now for next Tuesday. 
optio" [puppet] - 10https://gerrit.wikimedia.org/r/819180 (https://phabricator.wikimedia.org/T310395) (owner: 10Ahmon Dancy) [18:36:03] !log otto@deploy1002 helmfile [eqiad] START helmfile.d/services/eventgate-analytics-external: sync [18:36:37] !log otto@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics-external: sync [18:38:51] (03PS1) 10Bking: Revert "Revert "elastic: enable ES7 repo on cloudelastic"" [puppet] - 10https://gerrit.wikimedia.org/r/826609 [18:38:53] (03PS1) 10Stang: Add wmgSiteLogoVariants support for Chinese Wikimedia projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/826639 (https://phabricator.wikimedia.org/T308620) [18:39:42] (03CR) 10Gehel: [C: 03+1] Revert "Revert "elastic: enable ES7 repo on cloudelastic"" [puppet] - 10https://gerrit.wikimedia.org/r/826609 (owner: 10Bking) [18:40:06] (03CR) 10Bking: [C: 03+2] Revert "Revert "elastic: enable ES7 repo on cloudelastic"" [puppet] - 10https://gerrit.wikimedia.org/r/826609 (owner: 10Bking) [18:42:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131 (T316186)', diff saved to https://phabricator.wikimedia.org/P33142 and previous config saved to /var/cache/conftool/dbconfig/20220825-184233-ladsgroup.json [18:42:41] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1168.eqiad.wmnet with reason: Maintenance [18:42:55] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1168.eqiad.wmnet with reason: Maintenance [18:43:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1168 (T316186)', diff saved to https://phabricator.wikimedia.org/P33143 and previous config saved to /var/cache/conftool/dbconfig/20220825-184301-ladsgroup.json [18:45:52] !log ebernhardson@deploy1002 Started deploy [wikimedia/discovery/analytics@d00af45]: bump elasticsearch-hadoop to 7.10.2 [18:47:40] !log bking@cumin2002 START - Cookbook 
sre.elasticsearch.rolling-operation Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster cloudelastic: cloudelastic elasticsearch and plugin upgrade - bking@cumin2002 - T316159 [18:47:44] T316159: Upgrade cloudelastic cluster to 7.10.2 - https://phabricator.wikimedia.org/T316159 [18:48:00] !log ebernhardson@deploy1002 Finished deploy [wikimedia/discovery/analytics@d00af45]: bump elasticsearch-hadoop to 7.10.2 (duration: 02m 07s) [18:48:22] !log bking@cumin2002 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster cloudelastic: cloudelastic elasticsearch and plugin upgrade - bking@cumin2002 - T316159 [18:49:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T316186)', diff saved to https://phabricator.wikimedia.org/P33144 and previous config saved to /var/cache/conftool/dbconfig/20220825-184911-ladsgroup.json [18:54:07] (03PS1) 10Bking: elastic: fix string concatenation [cookbooks] - 10https://gerrit.wikimedia.org/r/826640 (https://phabricator.wikimedia.org/T308676) [18:54:56] (03PS2) 10Ryan Kemper: elastic: fix string concatenation [cookbooks] - 10https://gerrit.wikimedia.org/r/826640 (https://phabricator.wikimedia.org/T308676) (owner: 10Bking) [18:58:33] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops, 10Patch-For-Review: Degraded RAID on ms-be2067 - https://phabricator.wikimedia.org/T314049 (10Papaul) 05Open→03Resolved @Eevans thanks the host is back online. the back plane replacement fixed the issue . 
[18:58:39] (03CR) 10CI reject: [V: 04-1] elastic: fix string concatenation [cookbooks] - 10https://gerrit.wikimedia.org/r/826640 (https://phabricator.wikimedia.org/T308676) (owner: 10Bking) [19:03:20] (03PS1) 10Urbanecm: cswiki: Add extendedconfirmed group/protection level [mediawiki-config] - 10https://gerrit.wikimedia.org/r/826641 (https://phabricator.wikimedia.org/T316283) [19:04:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P33145 and previous config saved to /var/cache/conftool/dbconfig/20220825-190417-ladsgroup.json [19:07:55] (03PS3) 10Bking: elastic: fix string concatenation [cookbooks] - 10https://gerrit.wikimedia.org/r/826640 (https://phabricator.wikimedia.org/T308676) [19:10:11] (03PS4) 10Ryan Kemper: elastic: fix string concatenation [cookbooks] - 10https://gerrit.wikimedia.org/r/826640 (https://phabricator.wikimedia.org/T308676) (owner: 10Bking) [19:19:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P33146 and previous config saved to /var/cache/conftool/dbconfig/20220825-191924-ladsgroup.json [19:22:05] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): Replace cloudnet100[34] with cloudnet100[56] - https://phabricator.wikimedia.org/T316284 (10Andrew) [19:24:30] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): Q4: (Need By: TBD) rack/setup/install 6 wmcs hosts - https://phabricator.wikimedia.org/T304888 (10Andrew) [19:25:13] PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:25:34] (03PS1) 10Bartosz Dziewoński: Hide new 'associatedPages' navigation items [skins/Timeless] (wmf/1.39.0-wmf.26) - 10https://gerrit.wikimedia.org/r/826610 (https://phabricator.wikimedia.org/T316196) [19:27:42] (03PS1) 10Andrew 
Bogott: Remove refs to cloudservices1003 [puppet] - 10https://gerrit.wikimedia.org/r/826645 (https://phabricator.wikimedia.org/T316285) [19:29:02] !log andrew@cumin1001 START - Cookbook sre.hosts.decommission for hosts cloudservices1003 [19:29:29] (03PS1) 10Bartosz Dziewoński: Make DiscussionTools autotopicsub also opt-out on A/B test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/826646 (https://phabricator.wikimedia.org/T314693) [19:29:39] (03PS1) 10Ryan Kemper: elastic: no need to run puppet during es 7 upgrade [cookbooks] - 10https://gerrit.wikimedia.org/r/826647 (https://phabricator.wikimedia.org/T308676) [19:31:01] (03CR) 10Andrew Bogott: [C: 03+2] Remove refs to cloudservices1003 [puppet] - 10https://gerrit.wikimedia.org/r/826645 (https://phabricator.wikimedia.org/T316285) (owner: 10Andrew Bogott) [19:31:10] (03PS2) 10Andrew Bogott: Remove refs to cloudservices1003 [puppet] - 10https://gerrit.wikimedia.org/r/826645 (https://phabricator.wikimedia.org/T316285) [19:31:49] 10SRE, 10SRE-swift-storage, 10ops-codfw: Degraded RAID on ms-be2035 - https://phabricator.wikimedia.org/T316194 (10Ladsgroup) [19:32:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2115 (T312160)', diff saved to https://phabricator.wikimedia.org/P33147 and previous config saved to /var/cache/conftool/dbconfig/20220825-193238-ladsgroup.json [19:32:43] T312160: Adjust the field type of cx_corpora.cxc_timestamp to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T312160 [19:33:37] !log andrew@cumin1001 START - Cookbook sre.dns.netbox [19:33:48] (03CR) 10Bking: [C: 03+2] elastic: no need to run puppet during es 7 upgrade [cookbooks] - 10https://gerrit.wikimedia.org/r/826647 (https://phabricator.wikimedia.org/T308676) (owner: 10Ryan Kemper) [19:34:24] (03CR) 10Bking: [V: 03+2 C: 03+2] elastic: no need to run puppet during es 7 upgrade [cookbooks] - 10https://gerrit.wikimedia.org/r/826647 (https://phabricator.wikimedia.org/T308676) (owner: 
10Ryan Kemper) [19:34:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T316186)', diff saved to https://phabricator.wikimedia.org/P33148 and previous config saved to /var/cache/conftool/dbconfig/20220825-193430-ladsgroup.json [19:34:37] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1165.eqiad.wmnet with reason: Maintenance [19:34:51] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1165.eqiad.wmnet with reason: Maintenance [19:34:52] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [19:35:08] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [19:35:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1165 (T316186)', diff saved to https://phabricator.wikimedia.org/P33149 and previous config saved to /var/cache/conftool/dbconfig/20220825-193513-ladsgroup.json [19:36:57] !log rebooting ms-be2067 to "fix" disk enumeration(?) 
-- T314049 [19:37:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:37:01] T314049: Degraded RAID on ms-be2067 - https://phabricator.wikimedia.org/T314049 [19:37:26] !log andrew@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:37:27] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts cloudservices1003 [19:37:46] (03Merged) 10jenkins-bot: elastic: no need to run puppet during es 7 upgrade [cookbooks] - 10https://gerrit.wikimedia.org/r/826647 (https://phabricator.wikimedia.org/T308676) (owner: 10Ryan Kemper) [19:41:02] !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster cloudelastic: cloudelastic elasticsearch and plugin upgrade - bking@cumin2002 - T316159 [19:41:06] T316159: Upgrade cloudelastic cluster to 7.10.2 - https://phabricator.wikimedia.org/T316159 [19:41:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T316186)', diff saved to https://phabricator.wikimedia.org/P33150 and previous config saved to /var/cache/conftool/dbconfig/20220825-194129-ladsgroup.json [19:42:07] !log bking@cumin2002 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster cloudelastic: cloudelastic elasticsearch and plugin upgrade - bking@cumin2002 - T316159 [19:45:39] (03PS5) 10Ryan Kemper: elastic: fix string concatenation [cookbooks] - 10https://gerrit.wikimedia.org/r/826640 (https://phabricator.wikimedia.org/T308676) (owner: 10Bking) [19:45:55] 10SRE, 10Cloud-Services, 10Infrastructure-Foundations, 10netops: Undocumented IP on WMCS network - https://phabricator.wikimedia.org/T315955 (10cmooney) a:05cmooney→03Andrew @Andrew I indeed routed the subnet, which was already allocated to WMCS in codfw. It seems I failed to update the description fo... 
[19:47:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2115', diff saved to https://phabricator.wikimedia.org/P33151 and previous config saved to /var/cache/conftool/dbconfig/20220825-194744-ladsgroup.json [19:51:00] (03PS6) 10Ryan Kemper: elastic: fix string concatenation [cookbooks] - 10https://gerrit.wikimedia.org/r/826640 (https://phabricator.wikimedia.org/T308676) (owner: 10Bking) [19:55:57] (03CR) 10Ryan Kemper: [C: 03+2] elastic: fix string concatenation [cookbooks] - 10https://gerrit.wikimedia.org/r/826640 (https://phabricator.wikimedia.org/T308676) (owner: 10Bking) [19:56:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P33152 and previous config saved to /var/cache/conftool/dbconfig/20220825-195635-ladsgroup.json [20:00:05] brennen: How many deployers does it take to do UTC late backport and config training deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220825T2000). [20:00:05] jan_drewniak, koi, Urbanecm, and MatmaRex: A patch you scheduled for UTC late backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:20] o/ [20:00:25] o/ [20:00:28] o/ [20:00:28] (03Merged) 10jenkins-bot: elastic: fix string concatenation [cookbooks] - 10https://gerrit.wikimedia.org/r/826640 (https://phabricator.wikimedia.org/T308676) (owner: 10Bking) [20:01:11] hi [20:01:20] howdy all [20:01:39] looks like a full window :D [20:01:53] thcipriani: yup! i'm happy to deploy if you want me to, or i can leave it to you. 
[20:02:08] last backport window of the week :P [20:02:10] we can probably do all of the non-config patches in parallel [20:02:11] (03CR) 10Thcipriani: [C: 03+2] Add clearfix to .mw-body-subheader [skins/Vector] (wmf/1.39.0-wmf.26) - 10https://gerrit.wikimedia.org/r/826608 (https://phabricator.wikimedia.org/T316134) (owner: 10Jdrewniak) [20:02:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2115', diff saved to https://phabricator.wikimedia.org/P33153 and previous config saved to /var/cache/conftool/dbconfig/20220825-200250-ladsgroup.json [20:02:57] urbanecm: well. We've got no takers for backport training today. I'm happy to yield the deployment conch to you if you're up for it. [20:03:08] sure [20:03:21] (03CR) 10Urbanecm: [C: 03+2] Update VE core submodule to master (d4c438548) [extensions/VisualEditor] (wmf/1.39.0-wmf.26) - 10https://gerrit.wikimedia.org/r/826345 (https://phabricator.wikimedia.org/T316219) (owner: 10Bartosz Dziewoński) [20:03:34] (03CR) 10Urbanecm: [C: 03+2] Hide new 'associatedPages' navigation items [skins/Timeless] (wmf/1.39.0-wmf.26) - 10https://gerrit.wikimedia.org/r/826610 (https://phabricator.wikimedia.org/T316196) (owner: 10Bartosz Dziewoński) [20:03:35] <3 [20:05:49] (03CR) 10Urbanecm: [C: 03+2] Make DiscussionTools autotopicsub also opt-out on A/B test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/826646 (https://phabricator.wikimedia.org/T314693) (owner: 10Bartosz Dziewoński) [20:06:42] (03Merged) 10jenkins-bot: Make DiscussionTools autotopicsub also opt-out on A/B test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/826646 (https://phabricator.wikimedia.org/T314693) (owner: 10Bartosz Dziewoński) [20:07:01] (03CR) 10Urbanecm: [C: 03+2] Define default value for "wmgSiteLogoVariants" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/826635 (https://phabricator.wikimedia.org/T308620) (owner: 10Stang) [20:07:04] (03PS2) 10Urbanecm: Define default value for 
"wmgSiteLogoVariants" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/826635 (https://phabricator.wikimedia.org/T308620) (owner: 10Stang) [20:07:09] (03CR) 10Urbanecm: [C: 03+2] Define default value for "wmgSiteLogoVariants" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/826635 (https://phabricator.wikimedia.org/T308620) (owner: 10Stang) [20:07:15] !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster cloudelastic: cloudelastic elasticsearch and plugin upgrade - bking@cumin2002 - T316159 [20:07:16] (03PS2) 10Urbanecm: Add language fallback support for wmgSiteLogoVariants [mediawiki-config] - 10https://gerrit.wikimedia.org/r/826636 (https://phabricator.wikimedia.org/T308620) (owner: 10Stang) [20:07:19] T316159: Upgrade cloudelastic cluster to 7.10.2 - https://phabricator.wikimedia.org/T316159 [20:07:29] (03CR) 10Urbanecm: [C: 03+2] Add language fallback support for wmgSiteLogoVariants [mediawiki-config] - 10https://gerrit.wikimedia.org/r/826636 (https://phabricator.wikimedia.org/T308620) (owner: 10Stang) [20:07:33] (03PS4) 10Urbanecm: zhwiki: Use wmgSiteLogoVariantFallback to reduce duplicated code [mediawiki-config] - 10https://gerrit.wikimedia.org/r/826637 (https://phabricator.wikimedia.org/T308620) (owner: 10Stang) [20:07:36] (03CR) 10Urbanecm: [C: 03+2] zhwiki: Use wmgSiteLogoVariantFallback to reduce duplicated code [mediawiki-config] - 10https://gerrit.wikimedia.org/r/826637 (https://phabricator.wikimedia.org/T308620) (owner: 10Stang) [20:07:50] MatmaRex: your config patch is at mwdebug1001, can you have a look please? 
[20:07:57] (03Merged) 10jenkins-bot: Define default value for "wmgSiteLogoVariants" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/826635 (https://phabricator.wikimedia.org/T308620) (owner: 10Stang) [20:08:00] looking [20:08:04] (03Merged) 10jenkins-bot: Add language fallback support for wmgSiteLogoVariants [mediawiki-config] - 10https://gerrit.wikimedia.org/r/826636 (https://phabricator.wikimedia.org/T308620) (owner: 10Stang) [20:08:25] (03Merged) 10jenkins-bot: zhwiki: Use wmgSiteLogoVariantFallback to reduce duplicated code [mediawiki-config] - 10https://gerrit.wikimedia.org/r/826637 (https://phabricator.wikimedia.org/T308620) (owner: 10Stang) [20:08:44] koi: fyi, i'm going to do the first three patches, the last one separately, as it changes other wiki (and depends on whether the first three are w/o issues). [20:09:05] got it, thanks [20:09:13] urbanecm: seems good [20:09:16] thanks, syncing [20:11:17] php-fpm restart has a progress indicator now, great. [20:11:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P33154 and previous config saved to /var/cache/conftool/dbconfig/20220825-201141-ladsgroup.json [20:11:54] !log bking@cumin2002 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster cloudelastic: cloudelastic elasticsearch and plugin upgrade - bking@cumin2002 - T316159 [20:12:13] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:13:10] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:13:11] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:13:24] urbanecm: You're welcome. :-) [20:13:52] :) [20:14:05] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:14:30] !log re-rebooting ms-be2067 to "fix" disk enumeration(?) 
-- T314049 [20:14:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:14:34] T314049: Degraded RAID on ms-be2067 - https://phabricator.wikimedia.org/T314049 [20:15:51] hmm. my connection to deploy host terminated, scap process is not running apparently, but lock was not released [20:16:00] can someone help please? [20:16:31] it _looks_ like i can just remove `/var/lock/scap.operations_mediawiki-config.lock` and re-sync, but I'd like confirmation before doing that. [20:16:37] yes, you can do that. [20:16:37] dancy: maybe you can help? :) [20:16:42] okay. [20:16:47] doing [20:17:07] 10SRE, 10Infrastructure-Foundations, 10netbox, 10netops, 10Patch-For-Review: Netbox: use FHRP Groups feature - https://phabricator.wikimedia.org/T311218 (10cmooney) Nice work! Eventually all things considered it's probably best to control it from Netbox. But I agree the existing mechanism works well i... [20:17:21] !log [urbanecm@deploy1002 ~]$ rm /var/lock/scap.operations_mediawiki-config.lock # connection to deploy1002 handled, to let me re-sync [20:17:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:17:28] and syncing again [20:17:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2115 (T312160)', diff saved to https://phabricator.wikimedia.org/P33155 and previous config saved to /var/cache/conftool/dbconfig/20220825-201756-ladsgroup.json [20:17:59] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2101.codfw.wmnet with reason: Maintenance [20:18:01] T312160: Adjust the field type of cx_corpora.cxc_timestamp to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T312160 [20:18:12] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2101.codfw.wmnet with reason: Maintenance [20:18:46] (03PS3) 10AOkoth: gitlab: copy ssh host keys for failover [puppet] - 10https://gerrit.wikimedia.org/r/820163 
(https://phabricator.wikimedia.org/T296713) [20:18:48] (03PS1) 10AOkoth: vrts: add cloud hieradata for devtools project [puppet] - 10https://gerrit.wikimedia.org/r/826650 [20:19:05] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [20:19:28] (03PS2) 10AOkoth: vrts: add cloud hieradata for devtools project [puppet] - 10https://gerrit.wikimedia.org/r/826650 [20:21:46] (03PS1) 10Bking: elastic: use correct systemd command [cookbooks] - 10https://gerrit.wikimedia.org/r/826651 (https://phabricator.wikimedia.org/T308676) [20:22:19] (03Merged) 10jenkins-bot: Add clearfix to .mw-body-subheader [skins/Vector] (wmf/1.39.0-wmf.26) - 10https://gerrit.wikimedia.org/r/826608 (https://phabricator.wikimedia.org/T316134) (owner: 10Jdrewniak) [20:22:21] (03Merged) 10jenkins-bot: Update VE core submodule to master (d4c438548) [extensions/VisualEditor] (wmf/1.39.0-wmf.26) - 10https://gerrit.wikimedia.org/r/826345 (https://phabricator.wikimedia.org/T316219) (owner: 10Bartosz Dziewoński) [20:22:29] (03Merged) 10jenkins-bot: Hide new 'associatedPages' navigation items [skins/Timeless] (wmf/1.39.0-wmf.26) - 10https://gerrit.wikimedia.org/r/826610 (https://phabricator.wikimedia.org/T316196) (owner: 10Bartosz Dziewoński) [20:22:51] (03CR) 10Ryan Kemper: [C: 03+1] elastic: use correct systemd command [cookbooks] - 10https://gerrit.wikimedia.org/r/826651 (https://phabricator.wikimedia.org/T308676) (owner: 10Bking) [20:23:28] (03CR) 10Ryan Kemper: [C: 03+2] elastic: use correct systemd command [cookbooks] - 10https://gerrit.wikimedia.org/r/826651 (https://phabricator.wikimedia.org/T308676) (owner: 10Bking) 
[20:23:44] (03CR) 10Dzahn: "the 1 line for envoy needs to move to ./hosts/ but everything else should stay in common.yaml" [puppet] - 10https://gerrit.wikimedia.org/r/826650 (owner: 10AOkoth) [20:23:56] (03PS3) 10AOkoth: vrts: add cloud hieradata for devtools project [puppet] - 10https://gerrit.wikimedia.org/r/826650 [20:24:10] (03PS4) 10AOkoth: vrts: add cloud hieradata for devtools project [puppet] - 10https://gerrit.wikimedia.org/r/826650 [20:24:46] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: f37eff3f1607c898120c4f151b0af0d4b6bfdd19: Make DiscussionTools autotopicsub also opt-out on A/B test wikis (T314693) (duration: 03m 37s) [20:24:49] finally [20:24:51] (03CR) 10Dzahn: [C: 03+1] "yep, testing, fake values, lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/826650 (owner: 10AOkoth) [20:24:51] T314693: [Config Change] Make Topic Subscriptions available by default at A/B test wikis (desktop) - https://phabricator.wikimedia.org/T314693 [20:25:12] (03CR) 10AOkoth: [C: 03+2] vrts: add cloud hieradata for devtools project [puppet] - 10https://gerrit.wikimedia.org/r/826650 (owner: 10AOkoth) [20:26:02] MatmaRex: jan_drewniak: your backports are at mwdebug1001, please test [20:26:13] koi: your first three config patches are at mwdebug1001 too, please test [20:26:26] looking [20:26:32] thanks [20:26:38] (03Merged) 10jenkins-bot: elastic: use correct systemd command [cookbooks] - 10https://gerrit.wikimedia.org/r/826651 (https://phabricator.wikimedia.org/T308676) (owner: 10Bking) [20:26:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T316186)', diff saved to https://phabricator.wikimedia.org/P33156 and previous config saved to /var/cache/conftool/dbconfig/20220825-202647-ladsgroup.json [20:26:51] urbanecm: mine looks good [20:26:56] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1180.eqiad.wmnet with reason: Maintenance [20:26:59] thanks, syncing [20:27:10] !log 
ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1180.eqiad.wmnet with reason: Maintenance [20:27:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1180 (T316186)', diff saved to https://phabricator.wikimedia.org/P33157 and previous config saved to /var/cache/conftool/dbconfig/20220825-202716-ladsgroup.json [20:27:43] urbanecm: both look good [20:28:19] thanks, will sync too [20:29:11] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:29:49] RECOVERY - Check systemd state on cloudelastic1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:30:38] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:30:39] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:31:40] !log urbanecm@deploy1002 Synchronized php-1.39.0-wmf.26/skins/Vector/resources/skins.vector.styles/layouts/screen.less: fe3382ea74a7ca5c8954ed456f4cd100208ed1e6: Add clearfix to .mw-body-subheader (T316134, T316095) (duration: 03m 25s) [20:31:45] T316134: Page indicators are in line with content - https://phabricator.wikimedia.org/T316134 [20:31:46] T316095: PAGEBANNER is not displaying at euwiki with New Vector - https://phabricator.wikimedia.org/T316095 [20:32:07] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:32:10] jan_drewniak: your patch is live [20:32:12] urbanecm: unfortunately it does not work [20:32:21] okay, so i'll revert (and skip the fourth?) [20:32:36] urbanecm: as always, thanks! [20:32:42] happy to help! 
[20:33:07] (03PS1) 10Urbanecm: Revert "zhwiki: Use wmgSiteLogoVariantFallback to reduce duplicated code" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/826611 (https://phabricator.wikimedia.org/T308620) [20:33:15] (03CR) 10Urbanecm: [C: 03+2] Revert "zhwiki: Use wmgSiteLogoVariantFallback to reduce duplicated code" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/826611 (https://phabricator.wikimedia.org/T308620) (owner: 10Urbanecm) [20:33:20] (03CR) 10Urbanecm: [V: 03+2 C: 03+2] Revert "zhwiki: Use wmgSiteLogoVariantFallback to reduce duplicated code" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/826611 (https://phabricator.wikimedia.org/T308620) (owner: 10Urbanecm) [20:33:31] Is it ok to only revert the third one? I would like to figure out what to do later, and the previous two have no effect [20:33:39] (03PS1) 10Urbanecm: Revert "Add language fallback support for wmgSiteLogoVariants" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/826612 (https://phabricator.wikimedia.org/T308620) [20:33:53] !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster cloudelastic: cloudelastic elasticsearch and plugin upgrade - bking@cumin2002 - T316159 [20:33:58] T316159: Upgrade cloudelastic cluster to 7.10.2 - https://phabricator.wikimedia.org/T316159 [20:34:48] koi: what's the nature of "it does not work" please? if the bug is in the code you added to CS.php, wouldn't we need to rewrite it anyway (so revert is ok)? 
[20:35:10] I'm not really a fan of having variables in IS.php that are knowingly-broken [20:35:43] !log urbanecm@deploy1002 Synchronized php-1.39.0-wmf.26/skins/Timeless/: ba0e981890aa6eb61598e4df786f7122e17b3002: Hide new associatedPages navigation items (T316196) (duration: 03m 41s) [20:35:47] T316196: Timeless’ namespace tabs are duplicated - https://phabricator.wikimedia.org/T316196 [20:37:10] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:38:11] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [20:38:11] these three patches should make everything look the same as before them, but now the wrong logo is shown for some variants (cn/my/sg) [20:38:39] I'm fine with reverting them all, and nvm about the reason I gave before (keeping a broken thing inside CS.php) [20:38:39] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:38:40] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:39:19] RECOVERY - Check systemd state on cloudelastic1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:39:25] !log urbanecm@deploy1002 Synchronized php-1.39.0-wmf.26/extensions/VisualEditor/: 223e81f08e1f62b1ed78bcb2bdcc104e7fb60734: Update VE core submodule to master (d4c438548; T316219) (duration: 03m 42s) [20:39:30] T316219: Mention autocompletion doesn't work as expected with the reply tool - https://phabricator.wikimedia.org/T316219 [20:39:54] okay, i'll revert them all in that case [20:39:57] 
MatmaRex: your patches are live now [20:40:02] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:40:06] (03CR) 10Urbanecm: [V: 03+2 C: 03+2] Revert "Add language fallback support for wmgSiteLogoVariants" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/826612 (https://phabricator.wikimedia.org/T308620) (owner: 10Urbanecm) [20:40:08] thanks [20:40:19] (03PS1) 10Urbanecm: Revert "Define default value for "wmgSiteLogoVariants"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/826613 (https://phabricator.wikimedia.org/T308620) [20:40:26] (03PS2) 10Urbanecm: Revert "Define default value for "wmgSiteLogoVariants"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/826613 (https://phabricator.wikimedia.org/T308620) [20:40:29] (03CR) 10Urbanecm: [V: 03+2 C: 03+2] Revert "Define default value for "wmgSiteLogoVariants"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/826613 (https://phabricator.wikimedia.org/T308620) (owner: 10Urbanecm) [20:40:50] (03PS2) 10Urbanecm: cswiki: Add extendedconfirmed group/protection level [mediawiki-config] - 10https://gerrit.wikimedia.org/r/826641 (https://phabricator.wikimedia.org/T316283) [20:40:53] (03CR) 10Urbanecm: [C: 03+2] cswiki: Add extendedconfirmed group/protection level [mediawiki-config] - 10https://gerrit.wikimedia.org/r/826641 (https://phabricator.wikimedia.org/T316283) (owner: 10Urbanecm) [20:41:40] (03Merged) 10jenkins-bot: cswiki: Add extendedconfirmed group/protection level [mediawiki-config] - 10https://gerrit.wikimedia.org/r/826641 (https://phabricator.wikimedia.org/T316283) (owner: 10Urbanecm) [20:42:37] patch works, syncing [20:42:44] (03CR) 10Andrea Denisse: doc: Fix smalll typos in the systemd::sysuser documentation. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/826490 (owner: 10Andrea Denisse) [20:42:48] (03CR) 10Andrea Denisse: [C: 03+2] doc: Fix smalll typos in the systemd::sysuser documentation. 
[puppet] - 10https://gerrit.wikimedia.org/r/826490 (owner: 10Andrea Denisse) [20:42:59] RECOVERY - High average GET latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [20:45:05] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:45:47] !log eevans@cumin1001 START - Cookbook sre.hosts.remove-downtime for ms-be2067.codfw.wmnet [20:45:47] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for ms-be2067.codfw.wmnet [20:46:25] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:46:26] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:46:48] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 1aafdf0bd1d33929f2dd75ef4da9772d8832a31c: cswiki: Add extendedconfirmed group/protection level (T316283) (duration: 03m 42s) [20:46:52] T316283: Create `extendedconfirmed` at cswiki and make it possible to protect pages on that level - https://phabricator.wikimedia.org/T316283 [20:46:54] and, looks like we're done [20:47:07] !log UTC late B&C window done [20:47:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:47:17] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:47:47] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link 
https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [20:48:33] 10ops-eqiad, 10decommission-hardware, 10cloud-services-team (Kanban): decommission cloudservices1003.wikimedia..org - https://phabricator.wikimedia.org/T316285 (10Andrew) a:05Andrew→03Cmjohnson [20:49:06] 10ops-eqiad, 10decommission-hardware, 10cloud-services-team (Kanban): decommission cloudservices1003.wikimedia..org - https://phabricator.wikimedia.org/T316285 (10Andrew) @cmjohnson, this is another host that will need its drives wiped, as the cookbook seems to be bad at that lately. Thanks! [20:51:43] 10SRE-tools, 10Infrastructure-Foundations, 10cloud-services-team (Kanban): decom cookbook often fails to wipe drives in HP systems - https://phabricator.wikimedia.org/T316292 (10Andrew) [20:52:21] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:53:23] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:53:24] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:53:43] RECOVERY - Check systemd state on cloudelastic1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:54:25] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:56:33] !log bking@cumin2002 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster cloudelastic: cloudelastic elasticsearch and plugin upgrade - bking@cumin2002 - T316159 [20:56:38] T316159: Upgrade cloudelastic cluster to 7.10.2 - https://phabricator.wikimedia.org/T316159 [20:59:43] RECOVERY - High average GET latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET [20:59:54] 10SRE, 10Infrastructure-Foundations, 10netbox, 10netops, 10Patch-For-Review: Netbox: use FHRP Groups feature - https://phabricator.wikimedia.org/T311218 (10Reedy) [21:01:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T316186)', diff saved to https://phabricator.wikimedia.org/P33158 and previous config saved to /var/cache/conftool/dbconfig/20220825-210130-ladsgroup.json [21:02:04] !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster cloudelastic: cloudelastic elasticsearch and plugin upgrade - bking@cumin2002 - T316159 [21:02:08] T316159: Upgrade cloudelastic cluster to 7.10.2 - https://phabricator.wikimedia.org/T316159 [21:02:20] 10SRE, 10Cloud-Services, 10Infrastructure-Foundations, 10netops: Undocumented IP on WMCS network - https://phabricator.wikimedia.org/T315955 (10Andrew) The label should just be 'public floating IPs for cloud-vps codfw1dev' -- by their very nature the actual use of any particular IP will shift over time bas... 
[21:04:09] RECOVERY - Check systemd state on cloudelastic1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:09:27] PROBLEM - Check systemd state on snapshot1008 is CRITICAL: CRITICAL - degraded: The following units failed: cirrussearch-dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:12:03] !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster cloudelastic: cloudelastic elasticsearch and plugin upgrade - bking@cumin2002 - T316159 [21:12:09] T316159: Upgrade cloudelastic cluster to 7.10.2 - https://phabricator.wikimedia.org/T316159 [21:16:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P33159 and previous config saved to /var/cache/conftool/dbconfig/20220825-211637-ladsgroup.json [21:29:13] (KubernetesRsyslogDown) firing: (8) rsyslog on dse-k8s-worker1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [21:31:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P33160 and previous config saved to /var/cache/conftool/dbconfig/20220825-213143-ladsgroup.json [21:35:23] RECOVERY - Check systemd state on netmon1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:39:12] 10SRE: Sunset search.wikimedia.org service - https://phabricator.wikimedia.org/T316296 (10mpopov) [21:46:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T316186)', diff saved to https://phabricator.wikimedia.org/P33161 and previous config saved to 
/var/cache/conftool/dbconfig/20220825-214649-ladsgroup.json [21:47:03] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2180.codfw.wmnet with reason: Maintenance [21:47:16] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2180.codfw.wmnet with reason: Maintenance [21:47:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2180 (T316186)', diff saved to https://phabricator.wikimedia.org/P33162 and previous config saved to /var/cache/conftool/dbconfig/20220825-214722-ladsgroup.json [21:52:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2180 (T316186)', diff saved to https://phabricator.wikimedia.org/P33163 and previous config saved to /var/cache/conftool/dbconfig/20220825-215247-ladsgroup.json [22:07:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P33164 and previous config saved to /var/cache/conftool/dbconfig/20220825-220753-ladsgroup.json [22:09:18] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2131.codfw.wmnet with reason: Maintenance [22:09:31] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2131.codfw.wmnet with reason: Maintenance [22:09:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2131 (T312160)', diff saved to https://phabricator.wikimedia.org/P33165 and previous config saved to /var/cache/conftool/dbconfig/20220825-220937-ladsgroup.json [22:09:42] T312160: Adjust the field type of cx_corpora.cxc_timestamp to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T312160 [22:22:33] 10SRE, 10Cloud-Services, 10Infrastructure-Foundations, 10netops: Undocumented IP on WMCS network - https://phabricator.wikimedia.org/T315955 (10cmooney) Thanks Andrew, I've updated the description for the codfw range now. 
In terms of DNS I don't seem to get any PTR records back for the ranges in codfw: `... [22:23:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P33167 and previous config saved to /var/cache/conftool/dbconfig/20220825-222259-ladsgroup.json [22:30:38] (03PS1) 10Dduvall: docker_registry_ha: Authorize GitLab trusted runners using JWT [puppet] - 10https://gerrit.wikimedia.org/r/826674 (https://phabricator.wikimedia.org/T308501) [22:32:25] (03CR) 10Dduvall: "From the commit msg:" [puppet] - 10https://gerrit.wikimedia.org/r/826674 (https://phabricator.wikimedia.org/T308501) (owner: 10Dduvall) [22:34:47] PROBLEM - DNS on cloudservices1003.mgmt is CRITICAL: Domain cloudservices1003.mgmt.eqiad.wmnet was not found by the server https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:38:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2180 (T316186)', diff saved to https://phabricator.wikimedia.org/P33168 and previous config saved to /var/cache/conftool/dbconfig/20220825-223805-ladsgroup.json [22:38:40] (03CR) 10Ebernhardson: [C: 03+1] "confirm these hosts are all decom'd" [puppet] - 10https://gerrit.wikimedia.org/r/826630 (https://phabricator.wikimedia.org/T316240) (owner: 10Bking) [22:48:03] (03CR) 10Ahmon Dancy: [C: 03+1] docker_registry_ha: Authorize GitLab trusted runners using JWT [puppet] - 10https://gerrit.wikimedia.org/r/826674 (https://phabricator.wikimedia.org/T308501) (owner: 10Dduvall) [23:13:04] (03PS1) 10Stang: bewikisource: Adjust width-height ratio of logo to fix display issue [mediawiki-config] - 10https://gerrit.wikimedia.org/r/826677 (https://phabricator.wikimedia.org/T310961) [23:16:28] (03PS1) 10Stang: euwikisource: Adjust width-height ratio of logo to fix display issue [mediawiki-config] - 10https://gerrit.wikimedia.org/r/826678 (https://phabricator.wikimedia.org/T310961) [23:18:35] (03PS1) 10Stang: 
cswikisource: Adjust width-height ratio of logo to fix display issue [mediawiki-config] - 10https://gerrit.wikimedia.org/r/826679 (https://phabricator.wikimedia.org/T310961) [23:20:14] (03Abandoned) 10Stang: Add wmgSiteLogoVariants support for Chinese Wikimedia projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/826639 (https://phabricator.wikimedia.org/T308620) (owner: 10Stang) [23:20:45] (03PS1) 10Zabe: Use real transactions when managing the voter list for a poll [extensions/SecurePoll] (wmf/1.39.0-wmf.26) - 10https://gerrit.wikimedia.org/r/826614 (https://phabricator.wikimedia.org/T316150) [23:22:44] (03CR) 10CI reject: [V: 04-1] Use real transactions when managing the voter list for a poll [extensions/SecurePoll] (wmf/1.39.0-wmf.26) - 10https://gerrit.wikimedia.org/r/826614 (https://phabricator.wikimedia.org/T316150) (owner: 10Zabe) [23:23:30] (03PS1) 10Zabe: phan: Fix use of IMaintainableDatabase::tableExists [extensions/SecurePoll] (wmf/1.39.0-wmf.26) - 10https://gerrit.wikimedia.org/r/826615 [23:23:39] (03PS2) 10Zabe: Use real transactions when managing the voter list for a poll [extensions/SecurePoll] (wmf/1.39.0-wmf.26) - 10https://gerrit.wikimedia.org/r/826614 (https://phabricator.wikimedia.org/T316150) [23:30:39] RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:53:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2131 (T312160)', diff saved to https://phabricator.wikimedia.org/P33169 and previous config saved to /var/cache/conftool/dbconfig/20220825-235300-ladsgroup.json [23:53:07] T312160: Adjust the field type of cx_corpora.cxc_timestamp to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T312160